Close

Work on PC, Memory leaks, headdesks, and Valgrind

A project log for Hardware-orientated mesh networks

Our attempts to create a mesh network, plug-'n'-play product.

connorwood71connorwood71 08/24/2014 at 20:470 Comments

Time for another long, rambly post about dhcpext! Yay! Since last time, the first thing I did was start PC: specifically, to start with, have RX offload all of its work to PC, over a stream created by CT, and sent to both. RX is now simply a network endpoint, to forward traffic on to PC (although initially, in the event I needed to roll back in a hurry, I left all the code in rx.c, commented out.) This has the unfortunate, though temporary, side effect that, until RP is written, PC will contain some network code.

This happened without incident, so I moved on to having TX transmit all of the heartbeats it sends over the network to PC as well. This exposed a nasty bug in the stream API: if messages are sent too quickly, they have a tendency to overwrite each other, resulting in a memory leak, and lost information. In this instance the impact was small (the heartbeat wouldn't be printed to stdout), however it could easily get more serious if left unchecked, so a rewrite of the stream.c internals ensued (although the API stayed largely the same), to use linked lists to allow arbitrary storage of messages.

These heartbeats are stored in a linked list, which will be used later on to correlate when heartbeats and replies are received, and track timeouts. This is important, and caused me a lot of pain, as I'll discuss in just a second.

The next thing I did was move from usleep() to gettimeofday() in TX (rather than sleeping, compare times to when the heartbeats were last sent). This would, so the theory went, be more accurate, however exposed a massive memory leak in my code somewhere, which caused dhcpext to lock up entirely, followed by the kernel OOM manager to kill off every process barring init, which caused much head scratching.

Needless to say, I quickly rolled back this change, however the problem persisted, which meant that during one of my tests, I neglected to scp the software to the router. D'oh.

With that done, I set about to find the memory leak. Initally, this consisted of my "standard" debugging technique of littering printfs around the place, looking into various important variables. After a few days of getting nowhere with this technique, it was clear I needed something better. After some research, I found mtrace, which reports various issues with malloc() and free(), among other things.

After getting a trace on dhcpext using mtrace (it had to be compiled for my desktop for this to work), I had some idea of where the errors were occurring (although it wasn't entirely accurate, due to its limited reporting). After looking at the places it pointed me to, and its callers and callees, some errors were fixed, but the majority remained. I needed more information still.

After more digging, and much time spent on Stack Overflow reading up on people with similar issues, I found Valgrind. This is, to give the understatement of the year, an absolute godsend. There were several small-fry issues with the source that Valgrind pointed out to me, most of which added up to another major rewrite of stream - this time, with a new API to follow. This time, stream_rcv(_nblock) handles the allocation of memory, rather than the caller through stream_wait_full or stream_size. The caller still frees the data, however, as it could contain global pointers if a data structure is sent, that stream isn't aware of.

This fixed most issues, but a few still remained. Firstly, I stopped sending void**s, and started sending void*s and freeing. This fixed a bunch more errors, leaving me with about 8 at a time. 3 actual errors produced all of these: firstly, the linked list of sent heartbeats, in PC (it has a companion linked list for received heartbeats, which I'm not sure why I included and needs deleting), was being re-initialized every itteration of PCs main loop. This meant that memory usage grew rapidly, and every time the pointer to the last one was lost.

Secondly, at no point was the linked list contents actually freed. This should have been done when the thread was shut down. Later on, the linked list will be emptied real time, as replies are received, or the entries have existed for >400ms. But, not yet.

Thirdly, a race condition. If PC shut down just before TX (async, remember), data could be sent through the stream between the two, and never actually received and freed. This was a reordering of instructions in main.c.

So, there you have it. I've learned a lot about debugging tools, and have a newfound respect for them, and the code was cleaned up dramatically. Next, I want to finish off PC as best I can without UCI integration, merge into the master branch (which is happening regularly right now, as I clean up the code), try again with gettimeofday, and clean up the code some more, with other tools like Coverity, and enabling compiler warnings (and no, not just -Wall). I think we could also do with a bug tracker at this point.

Discussions