Friday 6 July 2007

Valgrind it

A student was running his tasks on the same machines as mine, and overnight one of them allocated almost all the available memory, causing an incredible amount of swapping and halting the progress of every other task (not to mention that he started the task by logging in to a computer directly rather than going through a queue). Here is a very important piece of advice, and if you follow it you can avoid upsetting many people in a shared computing environment: always check your program for memory management errors and memory leaks. It is pretty easy to do on a modern Linux system: just use a tool called valgrind.

I recommend starting the standard way and reading the valgrind manual page. Just type
$ man valgrind
in your terminal.

Going beyond referring users to a manual page, here is a little example of how this tool can be used. Imagine starting with the following C program:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    char* string = (char*) malloc(sizeof(char)*strlen(argv[0]));
    strcpy(string,argv[0]);
    printf("My name is %s\n",string);
    return 0;
}
Well done! You have won 5 points if you noticed all three problems. Let's imagine you haven't.
First, compile your program with debugging options:
$ gcc -ggdb ex1.c -o ex1
No compile-time errors, no warnings. If you run the program, the result depends on how lucky you are. I am lucky, and the result is exactly what I expected:
$ ./ex1
My name is ./ex1

But something may still be wrong with this program. Let's use valgrind to check. Start with the default checks and run:
$ valgrind ./ex1
The tool detects 2 errors and a memory leak of 3 bytes. The first error is:
==10523== Invalid write of size 1
==10523==    at 0x4006A2C: strcpy (mc_replace_strmem.c:272)
==10523==    by 0x8048406: main (ex1.c:7)
==10523==  Address 0x401F02B is 0 bytes after a block of size 3 alloc'd
==10523==    at 0x4005400: malloc (vg_replace_malloc.c:149)
==10523==    by 0x80483EF: main (ex1.c:6)
So, at line 7 of my program I am writing 1 byte beyond the end of the allocated memory block. Have you read the Kernighan and Ritchie book? Every string has to end with a zero (\0) character. This character is not counted when computing the length of a string, but it is copied when copying a string.
The second error is:
==10523== Invalid read of size 1
==10523==    at 0x4006283: strlen (mc_replace_strmem.c:246)
==10523==    by 0xB2A0C1: vfprintf (in /lib/libc-2.5.so)
==10523==    by 0xB2F602: printf (in /lib/libc-2.5.so)
==10523==    by 0x8048419: main (ex1.c:8)
==10523==  Address 0x401F02B is 0 bytes after a block of size 3 alloc'd
==10523==    at 0x4005400: malloc (vg_replace_malloc.c:149)
==10523==    by 0x80483EF: main (ex1.c:6)
This is directly caused by the previous one. The 1 byte written outside the memory block is now read back when the string is printed. So, I should fix my program by extending the size of the allocated memory block. I create a new program called ex2.c:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    char* string = (char*) malloc(sizeof(char)*(strlen(argv[0])+1));
    strcpy(string,argv[0]);
    printf("My name is %s\n",string);
    return 0;
}
See? I added room for one more character at the end of my string. Compile and run the program in exactly the same way.
Running with valgrind produces the following output:
==10812== ERROR SUMMARY: 0 errors from 0 contexts
Congratulations, we fixed both errors. But there is a little more to this program:
==10812== LEAK SUMMARY:
==10812==    definitely lost: 6 bytes in 1 blocks.
==10812==      possibly lost: 0 bytes in 0 blocks.
==10812==    still reachable: 0 bytes in 0 blocks.
==10812==         suppressed: 0 bytes in 0 blocks.
==10812== Use --leak-check=full to see details of leaked memory.
Let's follow the advice and run the full leak check:
$ valgrind --leak-check=full ./ex2
It produces the following report:
==10932== 6 bytes in 1 blocks are definitely lost in loss record 1 of 1
==10932==    at 0x4005400: malloc (vg_replace_malloc.c:149)
==10932==    by 0x80483F2: main (ex2.c:6)
This means that the 6 bytes allocated by malloc at line 6 were never released. Adding
free(string);
to the end of the program, just before the return statement (and naming it ex3.c), gives us an ideal result:
==11013== ERROR SUMMARY: 0 errors from 0 contexts
==11013== All heap blocks were freed -- no leaks are possible.
Now you can be more confident that your program is correct.

A couple of closing remarks.
  1. Remember, valgrind checks only the parts of your program that were actually executed. If you have several branches (if-then-else) or subroutines, make sure that you have a decent set of test scenarios to cover them all. You might want to search Google for test coverage methodologies (just type man gcov if you do not know how to use Google).
  2. Left unfixed, such dynamic memory management errors can cause segmentation faults. The nastiest ones appear when the problem is located inside a shared library called from a Java native method. The Java virtual machine will crash, leaving you frustrated and unable to catch an exception or otherwise recover your program on the fly.
  3. Even a plain memory leak in a shared library used by a long-running process will eventually eat up all your virtual memory, and the process (or the whole system) will need to be restarted. Guess why Windows servers have to be rebooted every month or so?
