DTrace: Real-Time Problem Detection
1 Introduction
DTrace is a tracing framework designed by Sun Microsystems for diagnosing problems in real time at the kernel or application level. It has been available since November 2003 and was integrated into Solaris 10 in January 2005. DTrace was the first component of the OpenSolaris project whose code was released under the Common Development and Distribution License (CDDL).
DTrace provides information that lets users tune applications and the operating system itself. It is designed for use in production environments: the probe effect is minimal while tracing is under way, and inactive probes have no performance impact. This matters because a system contains tens of thousands of probes, many of which may be active.
Tracing programs (often called scripts) are written in the D programming language (not to be confused with the unrelated general-purpose language that is also named D). D is a subset of C with added built-in functions and variables specific to tracing. A program written in D resembles an AWK program in structure.
Since writing your own scripts from scratch every time is always a bit time-consuming, I have chosen to collect here all the ones I have used.
2 Print Utilization statistics per process
Brendan Gregg developed prustat to display the top processes sorted by CPU, Memory, Disk or Network utilization:
This script is super useful for getting a high-level understanding of what is happening on a Solaris server. Golden!
3 File System Flush Activity
On Solaris systems, the pagedaemon is responsible for scanning the page cache and adjusting the MMU reference bit of each dirty page it finds. When the fsflush daemon runs, it scans the page cache looking for pages with the MMU reference bit set, and schedules these pages to be written to disk. The fsflush.d D script provides a detailed breakdown of pages scanned, and the number of nanoseconds that were required to scan "SCANNED" pages:
Now you might be wondering why "SCANNED" is less than "EXAMINED". This is due to a bug in fsflush, and a bug report was filed to address this anomaly. Tight!
4 Seek Sizes
Prior to Solaris 10, determining whether an application accessed data in a sequential or random pattern required reviewing mounds of truss(1) and vxtrace(1m) data. With the introduction of DTrace and Brendan Gregg's seeksize.d D script, this question is trivial to answer:
This script measures the seek distance between consecutive reads and writes, and provides a histogram of those distances. For applications that use sequential access patterns (e.g., dd in this case), the distribution will be narrow. For applications that access data randomly (e.g., sched in this example), you will see a wide distribution. Shibby!
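The idea behind that histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the seeksize.d script itself: the traces are made-up (block offset, size) pairs, the helper names are hypothetical, and the power-of-two bucketing is modelled on the kind of output D's quantize() aggregation produces.

```python
from collections import Counter

def seek_distances(ios):
    """Yield the gap between the end of one I/O and the start of the next."""
    last_end = None
    for offset, size in ios:
        if last_end is not None:
            yield abs(offset - last_end)
        last_end = offset + size

def power_of_two_bucket(value):
    """Return the largest power of two <= value (0 stays in bucket 0)."""
    return 0 if value == 0 else 1 << (value.bit_length() - 1)

def histogram(distances):
    """Count how many distances fall into each power-of-two bucket."""
    return Counter(power_of_two_bucket(d) for d in distances)

# Hypothetical traces: (block offset, size in blocks) for each I/O.
sequential = [(0, 8), (8, 8), (16, 8), (24, 8)]
scattered = [(0, 8), (5000, 8), (120, 8), (90000, 8)]

print(sorted(histogram(seek_distances(sequential)).items()))  # [(0, 3)]
print(sorted(histogram(seek_distances(scattered)).items()))   # [(4096, 2), (65536, 1)]
```

The sequential trace lands entirely in the zero bucket, while the scattered one spreads across large buckets, which is the narrow versus wide distribution described above.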
5 Print Overall Paging Activity
Prior to the introduction of DTrace, it was difficult to extract data on which files and disk devices were active at a specific point in time. With the introduction of fspaging.d, you can get a detailed view of which files are being accessed:
This is a super useful script! Niiiiiiiiiiice!
6 Getting System Wide errno Information
When system calls fail, they usually return a value indicating the failure and set the global errno variable to a value describing what went wrong. To get a system-wide view of which system calls are failing, we can use Brendan Gregg's errinfo D script:
This will display the process, system call, and errno number and description from /usr/include/sys/errno.h! Jeah!
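To make the errno-to-description mapping concrete, here is a small Python sketch. It is not part of errinfo; it only uses the standard errno and os modules to turn a numeric errno into its symbolic name and message. The path is a made-up one that is assumed not to exist, and the message text depends on the platform.

```python
import errno
import os

def describe_errno(number):
    """Return (symbolic name, description) for an errno value."""
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)

# Provoke a failing system call: open a path assumed not to exist.
try:
    os.open("/nonexistent/path/for/errno/demo", os.O_RDONLY)
except OSError as exc:
    name, text = describe_errno(exc.errno)
    print(exc.errno, name, text)
```

A system-wide tool reports the same pair of values, but for every process and system call rather than for a single call in one program.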
7 I/O per process
Several Solaris utilities provide a summary of the time spent waiting for I/O (which is a meaningless metric), but fail to provide facilities to easily correlate I/O activity with a process. With the introduction of psio.pl, you can see exactly which processes are responsible for generating I/O:
Once you find I/O intensive processes, you can use fspaging, iosnoop, and rwsnoop to get additional information:
Smooooooooooth!
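As a rough illustration of what attributing I/O to processes means, the following Python sketch sums I/O time per process and ranks the result. It is an assumption-laden toy, not how psio.pl is implemented: the event tuples are invented, the function name is hypothetical, and overlapping I/Os from the same process are simply added together.

```python
from collections import defaultdict

def io_time_by_process(events, interval_ms):
    """Sum I/O durations per process and express them as a share of the interval."""
    busy = defaultdict(int)
    for process, start_ms, end_ms in events:
        busy[process] += end_ms - start_ms
    ranked = sorted(busy.items(), key=lambda item: item[1], reverse=True)
    return [(process, ms, 100.0 * ms / interval_ms) for process, ms in ranked]

# Hypothetical events: (process name, I/O start, I/O completion) in milliseconds.
events = [
    ("dd", 0, 400),
    ("sshd", 100, 110),
    ("dd", 450, 900),
    ("fsflush", 500, 540),
]

for process, ms, percent in io_time_by_process(events, interval_ms=1000):
    print(f"{process:<10} {ms:>5} ms {percent:5.1f}%")
```

Ranking by accumulated time puts the heaviest I/O generator first, which is the view you want before drilling down with the per-file and per-call tools.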
8 I/O Sizes Per Process
As Solaris administrators, we are often asked to identify application I/O sizes. This information can be acquired for a single process with truss(1), or system-wide with Brendan Gregg's bitesize.d D script:
If only Dorothy could see this!
9 TCP Top
The snoop(1m) and ethereal utilities are amazing, and provide a slew of options to filter data. When you don't have time to wade through snoop data or download and install ethereal, you can use tcptop to get an overview of TCP activity on a system:
Now this is some serious bling!
10 Who's paging and DTrace enhanced vmstat
With Solaris 9, the "-p" option was added to vmstat to break paging activity up into "executable," "anonymous" and "filesystem" page types:
This was super useful information, but unfortunately it doesn't identify the executable responsible for the paging activity. With the introduction of whospaging.d, you can get paging activity per process:
$ whospaging.d

Who's waiting for pagein (milliseconds):

Who's on cpu (milliseconds):
  svc.configd       0
  sendmail          0
  svc.startd        0
  sshd              0
  nscd              1
  dtrace            3
  fsflush          14
  dd             1581
  sched          3284
Once we get the name of the process that is responsible for the paging activity, we can use dvmstat to break down the types of pages the application is paging (similar to vmstat -p, but per process!):
Once we have an idea of which pages are being paged in or out, we can use iosnoop, rwsnoop and fspaging.d to find out which files or devices the application is writing to! Since these rockin' scripts go hand in hand, I am placing them together. Shizam!
And without further ado, number 1 goes to ... (*drum roll*)
11 I/O Top
After careful thought, I decided to make iotop and rwtop #1 on my top ten list. I have long dreamed of a utility that could tell me which applications were actively generating I/O to a given file, device or file system. With the introduction of iotop and rwtop, my wish came true:
12 References
http://brendangregg.com/
DTrace User Guide
Observing I/O Behavior with the DTraceToolkit
DTrace Toolkit
DTrace Topics
12.1 DTrace GUI
http://www.netbeans.org/kb/docs/ide/NetBeans_DTrace_GUI_Plugin_0_4.html