Make Linux performance analysis easier with oprofile

If you’ve had to do much performance tuning on a Linux system, you may have come across the oprofile system profiler. It may sound like a great piece of software, but it’s got a name for being difficult to use. In this article, I’m going to try to show that oprofile is decidedly non-scary, and is in fact quite easy to use. On most of my development systems, I run Fedora Core 6. What I’m going to describe is tailored to that environment, but shouldn’t be hard to adapt to other distributions. The following instructions work just fine for me on 32-bit x86, 64-bit x86, and 64-bit PowerPC. On Fedora, installing oprofile is simple:
$ yum -y install oprofile
To get a whole-system profile, where you can see what both userspace and the kernel are up to, you’ll need to install the kernel-debuginfo package, too. In Fedora, this is where the vmlinux file lives; this is the uncompressed kernel image that oprofile will read kernel symbols from. (Some Linux distributions don’t provide an equivalent of debuginfo packages, which is a shame; they’re enormously useful. That said, you can usually find vmlinux bundled in some package on your distro of choice.)
$ yum -y install kernel-debuginfo
After a few minutes of watching the package manager thrash about, you’ll have everything you need to get started. The oprofile package provides a handful of commands, but in most instances you’ll only use two of them: opcontrol and opreport. The opcontrol command configures oprofile and manages the kernel and userspace profiling components, while the opreport command generates reports. The first thing you’ll want to do is make sure that you can use oprofile at all. To do this, run the following command:
$ sudo opcontrol --init
Notice my use of the sudo command: you must run opcontrol with root privileges. This is an important limitation of oprofile; you can only use it on systems where you’re a trusted user. In typical Linux fashion, if everything goes well, opcontrol will print nothing at all.
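If that silence makes you nervous, you can reassure yourself that initialization worked. The --init step loads the oprofile kernel module and populates a small pseudo-filesystem under /dev/oprofile, so a quick look there (a habit of mine, not an official interface) tells you whether the module is in place:
$ ls /dev/oprofile
If the directory exists and isn’t empty, you’re in business. The next command to run starts the profiling daemon.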
$ sudo opcontrol --start-daemon
No vmlinux file specified. You must specify the correct vmlinux file, e.g.
opcontrol --vmlinux=/path/to/vmlinux
Wait; what’s that about “no vmlinux file specified”? It’s only a minor annoyance, as it turns out. The Fedora team has taught several debugging and performance tools how to find the information they need in debuginfo RPMs, but not oprofile. Don’t worry: you don’t need to go hunting for the vmlinux file that the kernel-debuginfo package installed, because you can simply ask the RPM database where it is.
$ rpm -ql kernel-debuginfo | grep vmlinux
/usr/lib/debug/lib/modules/2.6.18-1.2849.fc6/vmlinux
Armed with this file name, you can now start the oprofile daemon.
$ sudo opcontrol --start-daemon --vmlinux=/usr/lib/debug/lib/modules/2.6.18-1.2849.fc6/vmlinux
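Incidentally, rather than pasting that long path by hand, you can let the shell glue the two steps together. This is purely a convenience of mine, not something oprofile requires (and if you have more than one kernel-debuginfo package installed, make sure the path matches the kernel you’re running):
$ vmlinux=$(rpm -ql kernel-debuginfo | grep vmlinux)
$ sudo opcontrol --start-daemon --vmlinux=$vmlinux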
To keep things simple, I’m going to profile the system while it runs a popular memory performance microbenchmark called STREAM. You can download the C source file from the STREAM web site. Compiling and running it is easy:
$ gcc -g -O3 stream.c -o stream
$ ./stream
This benchmark runs quickly, but prints a lot of output; here are the interesting parts. (I bumped up the amount of memory used by editing the value of N on line 57 of the stream.c source file.)
STREAM version $Revision: 5.6 $
Total memory required = 114.4 MB.
Function:    Rate (MB/s)
Copy:        2604
Scale:       2400
Add:         2691
Triad:       2678
We’ll see what the individual benchmarks are in a few minutes; for now, it’s enough to know that they perform streaming operations on big vectors. Given that a 2GHz Opteron (my benchmarking machine) should have upwards of 5GB/sec of memory bandwidth, these numbers aren’t terribly impressive. (In fact, the gcc compiler is known to do a poor job on this benchmark.)
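To put the Copy number in perspective, a quick back-of-the-envelope calculation (using my assumed 5GB/sec peak, so treat the exact figure with suspicion) says the benchmark is reaching only about half of the machine’s bandwidth:
$ echo "2604 / 5000 * 100" | bc -l   # Copy rate as a percentage of assumed peak
52.08000000000000000000
So I want to find out what’s going on while the benchmark is running. To do this, I tell the daemon to start profiling; then I run the benchmark; then I tell the daemon to dump out the profile data.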
$ sudo opcontrol --start
$ ./stream
$ sudo opcontrol --dump
(Notice that I had to run opcontrol as root using the sudo command, but not the benchmark itself.)
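One habit worth adopting: if you profile several runs in a row, clear out the old samples before each run, so that stale data doesn’t muddy the new report. opcontrol’s --reset flag does exactly this:
$ sudo opcontrol --reset   # discard samples left over from earlier runs
Next, I’d like to see a report of what was happening while I was profiling. To get this, I use the opreport command.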
$ opreport
CPU: AMD64 processors, speed 2004.57 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state)
 with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
    33051 79.7313 stream
     5719 13.7963 vmlinux
      883  2.1301 libc-2.5.so
The output from opreport is ferociously ugly, but we can see a few things of interest in the header alone.
  • The profiler chose a default type of event to count: CPU cycles during which the CPU wasn’t idle.
  • It sampled the state of the system every time that event counter reached 100,000. (Both the event and the count can be changed; see the sketch after this list.)
  • It tells me what kind of system I’m working on, in case I didn’t know (not an unusual case if you’re working on a machine in a rack somewhere).
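By the way, the default event and count aren’t set in stone. If you want a different sampling setup, shut the daemon down and choose your own; the ophelp command lists the events your CPU supports. Here’s a sketch (the event name is the one from my Opteron’s report header, and 500,000 is an arbitrary count I picked for illustration):
$ ophelp | less                                  # browse this CPU's available events
$ sudo opcontrol --shutdown
$ sudo opcontrol --event=CPU_CLK_UNHALTED:500000
$ sudo opcontrol --start
With that aside out of the way, back to the report.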
Following the header are columns of information.
  1. The number of samples during which the CPU was executing in this piece of code.
  2. What percentage of all samples this represents.
  3. The identity of the piece of code in question.
The opreport command helpfully sorts its output by decreasing frequency, so we can see that the system spent most of its time executing the actual benchmark, a surprising amount (almost 14%!) in the kernel, and the remaining few percent in other bits of code (I’ve trimmed the rest of the rows). This isn’t a terribly helpful piece of information, really. It’s great that the system mostly ran the benchmark, but what was it doing inside the benchmark and the kernel? Let’s try adding an option to the opreport command line.
$ opreport --symbols
warning: /autofs4 could not be found.
warning: /ext3 could not be found.
warning: /ide_cd could not be found.
Hey! What are all those warnings? They look just like the names of kernel modules, and that’s what they are. Just as opcontrol didn’t know where to find vmlinux, opreport needs to know where the kernel modules live, so let’s give it a little hand-holding:
$ opreport --symbols --image-path /lib/modules/$(uname -r)/kernel
Notice that we don’t have to care what kernel we’re running; the embedded output from “uname -r” in the command line above tells opreport for us. (That command line is a mouthful; I’ll show a small shortcut for it in a moment.) Now that we have told oprofile where both the kernel and its modules live, opreport should be happier:
samples     %     image name   app name   symbol name
 132748   71.15   stream       stream     main
   6950    3.73   vmlinux      vmlinux    copy_page
   3383    1.81   stream       stream     checkSTREAMresults
That’s better! We now have some more information than the first time we ran opreport:
  1. The number of samples taken for a particular entry.
  2. The percentage of the total number of samples.
  3. The shared object file within which the samples were taken.
  4. The application that was executing the shared object.
  5. The name of the symbol (i.e. the function) that was executing.
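Before digging in further, here’s the shortcut I promised for that long opreport invocation: a shell alias (a habit of mine, not anything oprofile-specific). The single quotes defer the “uname -r” expansion until the alias is actually used:
$ alias opr='opreport --symbols --image-path /lib/modules/$(uname -r)/kernel'
From here on, typing opr produces the same symbol-level report.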
Now I know that I’m spending most of my time executing a routine called main, inside the stream executable. But that’s still not terribly specific; it turns out that main is 150 lines long. We can burrow in further, though, and get an annotated listing of the source code to see exactly which lines of source are our hot spots.
$ opannotate --source stream | less
The opannotate command prints an annotated copy of the program source, and feeds it into my pager. I can then scroll around and look at the hot spots. The STREAM benchmark has four. The first hot spot is the copy benchmark, accounting for 18.85% of samples. In the output below, there are two columns of numbers, followed by the source code. The first number is the sample count; the second is the percentage of the total sample count.
               :#ifdef TUNED
               :        tuned_STREAM_Copy();
               :#else
               :#pragma omp parallel for
   135  0.0916 :        for (j=0; j<N; j++)
 27651 18.7570 :            c[j] = a[j];
               :#endif
Second is the scale benchmark, with 20.12%.
               :#ifdef TUNED
               :        tuned_STREAM_Scale(scalar);
               :#else
               :#pragma omp parallel for
 17669 11.9857 :        for (j=0; j<N; j++)
               :            b[j] = scalar*c[j];
               :#endif
Next is add, with 26.54%.
               :#ifdef TUNED
               :        tuned_STREAM_Add();
               :#else
               :#pragma omp parallel for
 20638 13.9997 :        for (j=0; j<N; j++)
 18484 12.5386 :            c[j] = a[j]+b[j];
               :#endif
And finally, with 27.28%, is triad.
               :#ifdef TUNED
               :        tuned_STREAM_Triad(scalar);
               :#else
               :#pragma omp parallel for
  2178  1.4774 :        for (j=0; j<N; j++)
               :            a[j] = b[j]+scalar*c[j];
               :#endif
From looking at the body of each loop, we can see how the complexity of the operations involved affects the performance of each microbenchmark. One last piece of housekeeping before we wrap up: when you’ve finished profiling, it’s tidy to shut the profiler down and unload its kernel module.
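Both steps use opcontrol again (the --deinit step is optional, but it returns the system to its pre-profiling state):
$ sudo opcontrol --shutdown   # stop the daemon
$ sudo opcontrol --deinit     # unload the oprofile kernel module
Thus ends today's lesson. For more examples of how to make use of oprofile, see the official examples page. I'll be returning to the subjects of both oprofile and the STREAM benchmark soon, so check back.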