Tracking Down an Annoying Bug

When I built my first breadboard prototype for my university class, I had a problem where only the first eight blocks would be recognised. When I checked the voltage on each module, I realised that after about eight blocks, the voltage had dropped too low for the Arduino Pro Mini clones to effectively operate. It was actually at around the 11th or 12th module that the Pro Minis would not boot, but I figured that the 9th module didn't have enough power to send or receive serial communications. Luckily my showcase program (the Fibonacci Sequence) only needed eight modules, so I waved the issue off as a power problem and stuck to programs of eight steps or fewer.

With the move to ATtiny841-based blocks I was able to reduce the power requirements considerably, so I ought to be able to have more than eight blocks in a program. Unfortunately my testing showed otherwise. The power was getting through okay, so something else had to be at fault.

I did some testing by tweaking the code. Instead of sending a broadcast message to all modules, I sent an "identify yourself" query to a small handful of select blocks. That showed that I could communicate with the 10th, 11th and 12th blocks with no problems. But if I sent a broadcast query, only the first eight blocks would respond.

Maybe forwarding the broadcast message was failing after eight hops? That didn't seem right, but I thought I would try to rule it out. My next test was to send queries to each block individually - i.e. not using a broadcast message. Here is where it started to get weird: sending 13 ID queries in order (from 0 through 12) still only returned eight results, but this time the fourth block's response didn't appear. This seemed to indicate that timing was an issue, and that the timing issue was most likely in the master control unit (MCU) rather than the nodes.

I decided to manually trace the code in the MCU and noticed that I was sending the messages and then clearing the screen in the setup() function and the loop() had no particularly intricate operations during the receipt of the messages. So I moved the code to clear the screen to immediately before the message(s) were sent and bingo! All 13 connected blocks responded and were displayed on screen.

So I can now create programs up to 13 blocks long - any more and I won't be able to display the full program on the tiny screen I am using! As an added bonus, switching to the '841 has reduced the power consumption such that I can power the whole system off a USB power bank. (The Pro Mini modules were too hungry for battery operation and I needed a wall wart to power even four or five modules, let alone eight or more.)

Now that I know what was causing the issue I will revisit the code when I have some spare time and try to make it even faster. I will also look into how I can display longer programs. I am quite relieved that I found this problem and that it was such a simple fix. My biggest fear was that there was a problem with the circuit board, or that there was a more fundamental error in my design.

8^)

Update (16 April 2019)

Today I was showing this project to my Honours supervisor while we discussed my Honours project for next semester. I want to do an Honours project based on the communication protocol I am using - i.e. are there any existing protocols that might be better, or am I on the right track with this protocol? I explained to him this issue and how I fixed it, but I still wasn't 100% certain why it was happening - just that it was a timing issue of some sort. As I said this to him it struck me what the underlying issue was...

When the master control unit (MCU) sends out the broadcast message and then starts the screen clear, the blocks receive the message and start replying immediately. The MCU's serial input buffer fills up with the incoming replies while the screen is still being cleared. By the time the screen clear has completed and the MCU gets into the main loop and starts checking for incoming data, the buffer is full and any further data is being dropped. As the MCU reads and processes messages the buffer empties, but every block has already replied, so no more data arrives to refill it - it is already too late. When the MCU reaches the end of the buffer it is part-way through the last message and waits for the rest of that message, which will never come.
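
To make the failure mode concrete, here is a minimal simulation in plain C++ (it runs on a desktop, not on the blocks). It is only a sketch: the `RxBuffer` class, the 64-byte buffer size and the 7-byte reply length are assumptions chosen for illustration, not values taken from the real firmware or protocol.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-capacity receive buffer that silently drops bytes once it is full,
// mimicking a serial RX buffer that nobody is reading from.
class RxBuffer {
public:
    explicit RxBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Called once per incoming byte; returns false if the byte was dropped.
    bool push(std::uint8_t byte) {
        if (data_.size() >= capacity_) {
            ++dropped_;
            return false;
        }
        data_.push_back(byte);
        return true;
    }

    std::size_t stored() const { return data_.size(); }
    std::size_t dropped() const { return dropped_; }

private:
    std::size_t capacity_;
    std::size_t dropped_ = 0;
    std::vector<std::uint8_t> data_;
};

struct Outcome {
    std::size_t completeMessages;  // whole replies that survived
    std::size_t partialBytes;      // bytes of a truncated reply at the end
    std::size_t droppedBytes;      // bytes lost because the buffer was full
};

// Every block replies back-to-back while the master is busy (for example,
// clearing the screen), so nothing is read until all replies have arrived.
// messageLength is a hypothetical fixed reply size and must be non-zero.
Outcome simulateBusyMaster(std::size_t bufferSize, std::size_t blocks,
                           std::size_t messageLength) {
    RxBuffer rx(bufferSize);
    for (std::size_t block = 0; block < blocks; ++block) {
        for (std::size_t i = 0; i < messageLength; ++i) {
            rx.push(static_cast<std::uint8_t>(block));
        }
    }
    return {rx.stored() / messageLength, rx.stored() % messageLength,
            rx.dropped()};
}
```

With these made-up numbers, `simulateBusyMaster(64, 13, 7)` leaves 9 complete replies in the buffer, 1 stray byte of a tenth reply, and 27 dropped bytes. The exact counts depend entirely on the assumed sizes, but the shape matches the symptom: a fixed number of replies get through, followed by a fragment that never completes.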

Moving the screen clear code avoids this problem (for now) as the MCU processes incoming messages fast enough that the serial buffer never fills up.

Several things need to be fixed:

Even though moving the screen clear code has alleviated this problem, I should still address it, just in case other long-running operations need to happen in the future.
I am reluctant to add too much overhead to the communications protocol, but maybe some form of ACK/NAK should be implemented? If a node sends a message but doesn't get an acknowledgement it might re-send?
I should definitely include some timeout code while waiting for incoming data - if I receive part of a message but never receive the end, I should send a "retransmit" message to the sending node.
Maybe I need to look at using interrupts?
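
The timeout idea from the list above could look something like the following sketch. It assumes fixed-length messages, which may not match the real protocol, and the `MessageReceiver` class, the `RxStatus` values and the millisecond timestamps are all hypothetical names introduced for illustration. It is written as plain C++ so it can be tried on a desktop; on the MCU the timestamps would presumably come from something like `millis()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class RxStatus { Idle, Receiving, Complete, TimedOut };

// Collects fixed-length messages and gives up on a partial message
// if no further byte arrives within the timeout.
class MessageReceiver {
public:
    MessageReceiver(std::size_t messageLength, std::uint32_t timeoutMs)
        : messageLength_(messageLength), timeoutMs_(timeoutMs) {}

    // Feed one received byte together with the current time in milliseconds.
    RxStatus onByte(std::uint8_t byte, std::uint32_t nowMs) {
        if (buffer_.size() == messageLength_) {
            buffer_.clear();  // previous message was complete; start a new one
        }
        buffer_.push_back(byte);
        lastByteMs_ = nowMs;
        return buffer_.size() == messageLength_ ? RxStatus::Complete
                                                : RxStatus::Receiving;
    }

    // Call from the main loop whenever no byte is waiting.
    RxStatus poll(std::uint32_t nowMs) {
        if (buffer_.empty()) return RxStatus::Idle;
        if (buffer_.size() == messageLength_) return RxStatus::Complete;
        // Unsigned subtraction stays correct if the millisecond counter wraps.
        if (nowMs - lastByteMs_ >= timeoutMs_) {
            buffer_.clear();  // discard the fragment
            return RxStatus::TimedOut;
        }
        return RxStatus::Receiving;
    }

    // Bytes collected so far (the whole message once status is Complete).
    const std::vector<std::uint8_t>& message() const { return buffer_; }

private:
    std::size_t messageLength_;
    std::uint32_t timeoutMs_;
    std::uint32_t lastByteMs_ = 0;
    std::vector<std::uint8_t> buffer_;
};
```

When `poll()` returns `RxStatus::TimedOut`, the caller would be the one to send the "retransmit" request to the node in question. Keeping that decision outside the receiver means the same class works whether the recovery ends up being a retransmit request, a fresh broadcast, or an ACK/NAK scheme.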

It is good to understand what was causing the issue - and hey, this might be good for my Honours project and give me something to do for that. 8^)

Update to the update (a few minutes later...)

I forgot to add:

This also explains why I had "random" blocks' messages being dropped when I sent individual queries to all the nodes. I didn't mention this in my initial log, but while hunting for this bug I replaced the single broadcast message with n queries being sent to individual blocks. I knew I was plugging in 13 blocks, so I wrote 13 queries and was surprised to find that I would receive responses from nodes 0 through 3, skip 4, then 5 through 8, skip 9, etc. I couldn't understand why some blocks were being skipped...

Now I understand what was happening there. While I was still sending the individual queries, blocks would start responding immediately and the buffer filled up. Once I had sent all the queries and the main loop started checking for data, the buffer would begin to drain, and the end of the buffer most likely held only half a message. As the buffer was emptied more data could be received, but with a full buffer the processing must not have been quite fast enough to stop it from filling up again.

There are some interesting implications to the partial message at the end of the buffer - when the buffer starts being emptied allowing more data to come in, the partial message will eventually be completed by the new data, causing a corrupted message. I definitely need error checking (CRCs?) to ensure that this does not happen.
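
As a rough sketch of what that error checking could look like, the example below appends a CRC-8 (polynomial 0x07, initial value 0x00) to each message and rejects frames whose checksum does not match. The choice of CRC, the frame layout (payload followed by one CRC byte) and the helper names `makeFrame` and `frameIsValid` are assumptions for illustration, not part of the existing protocol.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-8 with polynomial 0x07 and initial value 0x00.
std::uint8_t crc8(const std::uint8_t* data, std::size_t length) {
    std::uint8_t crc = 0x00;
    for (std::size_t i = 0; i < length; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 0x80) {
                crc = static_cast<std::uint8_t>((crc << 1) ^ 0x07);
            } else {
                crc = static_cast<std::uint8_t>(crc << 1);
            }
        }
    }
    return crc;
}

// Sender side: hypothetical frame layout of payload bytes plus one CRC byte.
std::vector<std::uint8_t> makeFrame(const std::vector<std::uint8_t>& payload) {
    std::vector<std::uint8_t> frame = payload;
    frame.push_back(crc8(payload.data(), payload.size()));
    return frame;
}

// Receiver side: recompute the CRC over the payload and compare it with
// the trailing byte. Frames too short to hold a payload are rejected.
bool frameIsValid(const std::vector<std::uint8_t>& frame) {
    if (frame.size() < 2) return false;
    return crc8(frame.data(), frame.size() - 1) == frame.back();
}
```

For example, if the first two bytes of `makeFrame({0x01, 0x10, 0x20})` are stitched onto the last two bytes of `makeFrame({0x02, 0x33, 0x44})`, which is the kind of splice described above, `frameIsValid` rejects the result. An 8-bit CRC will not catch every possible corruption, so a wider CRC may be worth the extra byte if messages get longer.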

Discussions

davedarko wrote 04/16/2019 at 14:49

yaaay, cool to hear you've fixed that communications issue!

Amos wrote 04/16/2019 at 14:53

Thanks Dave - it was an annoying bug, but I knew it had to be something in the software. Now that I fully understand what was happening I have some good material for next semester's Honours project at university. 8^)
