USB-CDC Serial port for PIC18F, in under 1KiB. Refactored down from a USB-DFU bootloader, hand written in assembler to be light and fast

Similar projects worth following
Based on a separate project to build an extensible USB bootloader for the PIC18F microcontroller series. This firmware is the result of combining that bootloader with a USB serial program (or sketch for the Arduino folks) and pruning/refactoring to make it fit in a kilobyte.

Supporting virtually any baudrate from 183 up to a theoretical maximum of 12 megabaud. However, at that rate, USB becomes the bottleneck having only the bandwidth for average 5/8ths duplex, not factoring in USB overhead. In tests it functions reliably at 2 megabaud, and produces any baud rate between 183 and 200,000 with less than 1% error. Above this range, any baud rate produced by dividing 12,000,000 by an integer is possible as long as it is below the tested 2 meg.

USB is a surprisingly involved protocol, so the DFU bootloader alone was 2KB. The CDC support was nearly an additional full KB. Making it fit into 1K was a fun exercise.

Alright, so what does this thing actually do?

It is a USB-CDC compliant serial port. That is, you plug it into USB, and you get old-school serial communications out the other side at 5V for 1, and 0V for 0. It also includes support for RTS/CTS & DSR/DTR signalling. Perhaps the best analogy, is this program does the same job as the widely used FTDI USB/Serial chips.

Perhaps the biggest difference between this chip and FTDI's chips however, is FTDI uses proprietary drivers while this project supports the USB standard CDC device. This allows it to work with the drivers available for Windows/Linux and Macintosh. In my testing, no "inf" files or drivers need to be manually installed. (Though, I've read that this may be different on systems running Windows 8 or newer)

So, is this program 1KB? It's slightly smaller in fact!

Reviewing the end of 1kcdc.lst we see the following:

"Program Memory Words Used: 509"

This means 509 16-bit words are used for the entire ROM image. Weighing in at a total 1018 bytes. Just 6 bytes to spare if we include the config registers at address 0x300000.

I guess the final question question remains. What is this project missing?

Currently it does not support the following:

  • BREAK signal detection/generation
  • Parity modes/parity failure detection
  • More than 1 stop bit
  • 5,6,7, or 16 bits of data
  • Ring signal
  • Framing error detection
  • Buffer overrun detection

This means, aside from BREAK support, that this module only supports serial format 8N1, at baud rates from 183 through 2,000,000.

The PIC serial interface only supports 8 or 9 bit serial. Supporting anything else would require serial transmit/receive in firmware. This would have serious performance implications and I'm not really interested in doing the work to support it.

However, with the 8 to 9 bit serial support, either parity or extra stop bits could be supported. The total number of bits including parity and stop bits would have to equal 8 or 9, but supporting parity is something I've explored as I've seen it used in the field.

And a final shortcoming of the setup, is the way I hacked the baud rate generator. It produces standard baud rates just fine, however some baud rates will not result in the optimal divider. Error in these cases will be quite low, in less than a 10th of a bit per byte sent, but it's a kludge I'd leave out if not for the space constraints.


Assembler listing file of the compiled output. This is the file used to calculate the total size of the assembled output.

lst - 119.88 kB - 01/03/2017 at 03:00



Compiled output, specific to PIC18F4450, with a 24MHz external oscillator.

x-hex - 2.78 kB - 01/03/2017 at 03:00



Tar-gzip of the source & makefile

x-compressed-tar - 9.39 kB - 01/03/2017 at 02:58



The 1KB USB CDC serial port!

asm - 27.13 kB - 01/03/2017 at 02:58



Makefile for building on *nix with gputils to compile, and pk2cmd for programming (on PICKIT2)

makefile - 891.00 bytes - 01/03/2017 at 02:58


View all 7 files

  • 1 × PIC18f4450 Any 18F series with USB should work with some header changes.
  • 2 × 220nF Capacitor One for on-chip USB voltage regulator, one for VCC decoupling
  • 1 × 24 MHz Crystal Oscillator Any value from 4Mhz to 24Mhz will work with header changes
  • 1 × USB type-B connector In my case, I used a usb cable cannibalized from an old keyboard. The type-b connector itself does not breadboard well without a breakout board.

  • Postmortem

    Jesse02/09/2017 at 22:58 0 comments

    I've been meaning to write an update here since the contest winners were announced.

    Before I do, congratulations to the winners! Some truly great entries were submitted, and from my earlier peeks at what my competitors were working on, I have to say that I feel the winners are representative of the projects I thought were the most worthy of taking the prize.

    In retrospect, I saw this challenge as an opportunity to take code I was already working with, and make something simplified with general utility for others. In this spirit, general utility does not imply a 1KB limit on the firmware size, and I'll be returning some of the functionality to the state it was in before trying to reduce the code to qualify for the challenge.

    While there's still work to do in achieving general utility, these are the features that can be expected in these future releases:

    1. Baud rate support down to 78 baud
    2. Baud rate dividers which produce bitrates as close as possible to the requested baud rate
    3. Serial transmit that is non-blocking (one of the compromises to favor code size was to send bytes inside the USB transfer interrupt handler, and this byte transmission would block the handler until all bytes were sent)
    4. Parity support for transmit and receive
    5. Buffer overrun detection and reporting
    6. BREAK generation/detection and reporting

    Admittedly, as this work does not have a hard deadline, it will be done in my leisure time where not spent on other projects. I encourage anyone interested in these features to post a comment as motivation, but bear in mind that this is a hobby project and I will not be able to commit to deadlines.

    Thanks for reading!

  • Honorable Mention

    Jesse01/05/2017 at 03:06 0 comments

    I'd like to talk a little about some of the changes that cut this project down to almost 1/3rd of it's original size. In earlier logs I already mentioned the USB Device Firmware Upgrade (DFU) support, extra USB descriptors, accurate baud-rate calculation, and bootloader/application separation as features that got the axe.

    Here follows the remaining bits on the editing room floor.

    As the bootloader was meant to handle USB core functionality while not interfering with user firmware, any registers modified by the bootloader had to be backed up prior and restored after use. This included WREG, BSR, STATUS, FSR1, FSR2, and TBLPTR. Some of which are two or even three bytes, and TBLPTR also needed to have bootloader specific values cached in ram for descriptors larger than 1 packet.

    Every backup and restore operation required 4 bytes, and removing this constraint reduced the flash size by 104 bytes of code.

    Originally, USB descriptors were referenced by a series of tables to simplify the application code required in user application firmware. This table setup supported multiple configurations, strings and any other type of descriptor supported by the USB specification. In the end, this table was replaced with code to specify the descriptor length, and address in flash of the remaining device and configuration descriptors. The tables themselves would have occupied at least 28 bytes of flash, and an additional 84 bytes for the code to scan the tables when a descriptor is requested.

    Instead, a SWITCH/CASE block reduced the total footprint of descriptor lookups by 86 bytes.

    Initializing the PIC for USB and even general operation typically involves loading default values into a number of registers that control operation of the device. After reviewing the datasheet to determine the reset and power on states of those registers, I found several either didn't need to be specified in code, or could be initialized by a single bit operation. In 16-bit PICs, loading an immediate value to a register is a 2 instruction operation, while single bit flips can be done in a single instruction. By re-arranging code to take advantage of common immediate values, changing immediate constants to single bit-flips, and removing initialization of registers that are already in the correct state at power on/reset I managed to save an additional 50 bytes.

    USB Stall and USB Error interrupts had empty handlers assigned that did nothing more than clear the interrupt flag and exit the interrupt handler. During development of the bootloader, they also served to send a notification on the serial port which I'll go into a little more detail shortly. However, in this project they served no purpose, and both these interrupts could be disabled.

    Removing the code for these interrupts freed up an additional 16 bytes.

    Handing over buffers of filled data to the USB interface engine on the PIC is something done repeatedly by this firmware. This was a very modest saving, but in each case 3 instructions could be replaced with 1 call instruction, and two separate functions were added to handle these cases. Two were required, as these instructions are responsible for setting the number of bytes to be transmitted and setting the USB data toggle bit while handing over ownership of the buffer. The USB data toggle bit tracks if an even or odd packet is being sent for a specific request. This is independent somewhat to the ping/pong even or odd buffering which also has to be managed.

    While scheming for ways to reduce the code size I had the idea, which I never made good on, to analyze the binary looking for common sequences of words in the compiled flash. The idea being, that if enough common sequences were found that could be condensed into function calls, then identifying many of these cases could be automated. While I never did get around to doing this, this specific change is exactly the kind of case I expect this analysis to identify. It would likely obfuscate the source code, but this kind of...

    Read more »

  • Submission

    Jesse01/03/2017 at 03:15 0 comments

    After spending a few hours reviewing the source code and commenting it so that it will hopefully make some sense, I'm happy to submit the finished source code.

    Included in the submitted tgz is a Makefile, and all the source code for assembly. The assembler used is gpasm, which is available in GPUTILS. It also has a line for programming my MCU via the PICKIT2 programmer, using the pk2cmd tool.

    To build, the ubiquitous "make all" should work as long as GPUTILS are installed.

    Making this code work on any USB capable PIC18F chip should be a simple matter of modifying the Makefile, and adding the necessary block to If you need any help with this, or if you've added a chip and would like to share your changes, please post your comments here on

    Stay tuned, I'll be sharing some of the work involved in getting to this point in the days ahead.

  • All the Stops

    Jesse01/01/2017 at 01:21 0 comments

    While reviewing the code changes to write about all the work required for this project, I found myself fighting to include RTS/CTS support in the final submission. It's been challenging, but as of about five minutes ago I've succeeded in including this feature into the final 1KB hex file.

    Accomplishing this took just about every trick I had up my sleeve, but it's done. The only features I would say are missing at this point are serial break generation/detection, parity bit support and framing error detection. While all are "nice to have" features, at this point I'm confident there isn't the remaining space available.

    The current code is 7 instructions short of 1KB. Added to this are the configuration register values which take up 14 bytes. It's not clear to me presently if these 14 bytes will be counted in the judges' tally, though I suppose it's better safe than sorry.

    If I do find out that the config registers are not counted by the judges, that opens up 7 instructions worth of available space. Probably not enough to add the remaining features, but maybe worth exploring should that situation arise.

    With that, I'm off to have a celebratory beer and get ready to ring in the new year. Cheers all!

  • Case Optimization

    Jesse12/31/2016 at 20:28 0 comments

    While reviewing the new code for identifying which descriptor is being requested by the USB host, I noticed that instead of using the Switch/Case macros I had instead manually implemented the same mechanism of using XOR and conditional branches. "Why not be consistent?" I thought to myself, and set about converting it into the Switch/Case pattern.

    While using the macros cleans up the code appearance, and gets rid of some labels... it also increases the code size by a few bytes. Reviewing the reason why, I found that the 'default' case at the end of the block imposes 2 extra branch instructions. And in this specific scenario, the code inside the default block in this example is a branch to the setup_error code. This means the default block is literally 3 branches, when ideally it could be 1.

    I looked at the other switch blocks in the code, and found 2 other cases of a default block that contain a single branch. Crafting a specialized case macro that includes this branch, instead of requiring a separate default block, will shave 4 bytes off of each case. It's a little ham-fisted for general use, but for the purposes of the 1KB challenge, seems like a win.

    So, instead of this:

        bra somewhere_else

    I'll wind up with this:

    CASEELSE    SOME_VALUE, somewhere_else

    And in the process, clean up the descriptor lookup code as well as save 4 bytes per altered statement.

  • Switch, Case and Hardware Call Stacks

    Jesse12/31/2016 at 03:13 0 comments

    So... a lot of code is going to be coming up in the posts ahead and it'll probably easier to digest if I explain a couple facets of what the heck it is I'm doing.

    First off, I'd like to dive into the DFU bootloader which has given rise to this project and hopefully many other projects to come. The basic idea, is to take what makes Arduino so great and do it better. That is, take a general purpose microcontroller and make it easy to prototype and deploy. While I started with Microchip's PIC before I'd ever heard the word Arduino, I'm among be the first to complain about their two tiered compiler offering. I chose to write my code in assembler, because it was much easier to write code that would work with both Microchip's and open source assemblers. I also had a fascination with USB, which offers to make devices that can really integrate with our existing computer systems, rather than COM ports requiring additional software.

    Ultimately, the goal was a firmware platform that could handle virtually any USB standard, support open specifications for firmware updates, and stay the hell out of the programmer's way.

    One of the quirks of PIC's 8-bit architecture is a hardware call stack. This means the devices can handle a depth of N function calls (in this particular case, 32) before the stack overruns and execution effectively halts. Earlier prototypes for the USB logic in the bootloader involved a few nested levels of function calls. This eats into the 32 call budget, and for the bootloader to stay out of the way a better means of segregating the logic was desirable.

    The cleanest way to structure the code was to replicate the switch/case model often used in higher languages, and replicating it in assembler wasn't too much trouble. Some creative liberties are taken (break statements are implied) and implementation is tuned to the architecture.

    The value being tested if first loaded into the working register, or WREG as it's called on PICS. The case statements flip bits in the working register using a bit-wise exclusive-or (XOR), testing for a zero. A conditional branch skips to the next case statement if the result doesn't match the current case statement, and the process repeats. However, using XOR makes it very intuitive. In a naive implementation, any changes made to WREG need to be undone before checking the next case statement. This would mean XORing the value we just tried to check, and then XORing the value we want to check for a match to. However, these two steps can be combined and these macros simplify the production of this mask by XORing two consecutive case values together. This produces the exact bit flips required to convert the previous case into the current case.

    Little moment to blather here... I love XOR. I think mathematically it is one of the neatest transformations around. AND and OR are both very useful and have specific jobs to do, but XOR is very versatile. Taking a value and XORing it with a max value, will perform the bit-wise NOT. Any XOR operation is reversible, by repeating the same operation. And, as a consequence of this last fact, XORing doesn't destroy any data. Let me share a snippet of PIC assembly that sums up the utility of this operation. Lets say, we have a value in the working register, and we want to swap this value with a value in some memory location called regX.

    xorwf	regX, F    ; WREG XOR regX => regX
    xorwf	regX, W    ; WREG XOR regX => WREG
    xorwf	regX, F    ; WREG XOR regX => regX

    Maybe I'm going a little overboard, but this is poetry. Like moving two decks of cards through each other, having them come apart in the same order they went in. To describe the code above step by step, the first instruction takes regX and turns it into a map of the differences between regX and the working register. It then uses this map to transform the value in the working register into the value that was in regX. It then uses this copy of what was in regX (now in the working register, stay with me) and uses this value to transform...

    Read more »

  • Stress Testing

    Jesse12/30/2016 at 07:58 0 comments

    Took the time today to hook up two chips running the firmware and do some bandwidth testing. I have to say, I'm pretty impressed.

    Testing consisted of sending a 3MB file of random data through 2 chips, and using WinMerge to compare the files. I've been testing on my machine using RealTerm to send and capture the data, but noticed an issue where even at higher bitrates the throughput wasn't increasing. Google to the rescue.

    Turns out, the version of RealTerm I was using had a known issue where transfers would max out near 15KB/s, and all I had to do was download an updated version.

    Transfers are very reliable up to 2 megabits per second with both chips on different USB endpoints. At 3 megabits however, I start to see what appears to be byte mangling. As the errors often appear to be bit errors localized to a few single bytes, this seems to indicate that the loss either due to noise on the serial line (about 12cm of breadboarding wire) or else a manifestation of the clocks of the two PICs being slightly out. I might be able to run both off of a single clock to test that theory, as well as pairing the serial connection with a twisted ground to try cut out noise.

    The clocks are suspect, as both chips are using PLLs to kick up the frequency to what's required for USB full speed operation. One chip has a 24MHz oscillator, and the other has a 20MHz crystal resonator. The PLL in particular might be making the chips more sensitive to clock problems, but running both chips of the single 24MHz oscillator should be simple enough as long as the wires aren't too long. Call me lazy, but I don't think I'll relocate one of the PICs to shorten the clock traces to test a theory.

    These speed tests have only been half duplex as well. At higher speeds, RealTerm seems to jam up on one side. That is, it seems like it's RealTerm jamming up based on the two instances I run becoming non-responsive after initiating the test. (It's not my code! I swear!) I might have to try sending between two PCs, as it seems to be related to running two instances of the program on the same PC.

    Also: while it puts the firmware over 1KB, I'm actually testing with RTS/CTS currently. With the deadline looming, I've yet to decide if I'll be dropping that feature or not. Maybe I can squeeze another 50 bytes out. I think before I open that can of worms, I'll write a couple more logs going over what I've removed to make the firmware fit. The source is littered with code commented out, and once I pay tribute to the fallen opcodes, I can proceed to remove them and publish the code.

    So, yeah! 2 megabits per second sustained transfer speed! not too shabby!

  • Flow Control

    Jesse12/29/2016 at 01:25 0 comments

    Next is exploring the possibility of adding RTS/CTS signalling to the demo.

    Doing the outbound RTS/DTR signals were pretty easy, and only add an additional 22 bytes to the code. (22 bytes, I just happened to have made available) Just set up an IO port for the signals (3 instructions, 6 bytes) and add an additional case to the class request handler (8 instructions, 16 bytes). However, putting out these signals is only somewhat useful if we can't detect these signals from a remote device using the CTS&DSR inputs.

    	read2w	wValueL
    	xorwf	LATB, W, ACCESS
    	andlw	0x03
    	xorwf	LATB, F, ACCESS     ;bit 0=DTR, 1=RTS
    	bra	setdone_xmitzero

    Adding support for CTS/DSR is where it gets ugly. I've considered using interrupts for this, however, noisy inputs could easily generate unwanted state updates to the host. The next best solution is to check if the input pins for CTS/DSR change when handling the USB Start Of Frame (SOF) interrupt. This works, and guarantees we see at most 1 update per millisecond, however sending this SERIAL_STATUS message to the USB host requires not only this change detection, but also sending a 10-byte message to the host.

    	lfsr	FSR1, 0x418
    	rcall	fsr2_to_fsr1
    	movlw	0xA1
    	movwf	POSTINC2, ACCESS	;bmRequestType
    	movlw	0x20
    	movwf	POSTINC2, ACCESS	;bRequest
    	clrf	POSTINC2, ACCESS	;wValue
    	clrf	POSTINC2, ACCESS
    	movlw	0x01
    	movwf	POSTINC2, ACCESS	;wIndex
    	clrf	POSTINC2, ACCESS
    	movlw	0x02
    	movwf	POSTINC2, ACCESS	;wLength
    	clrf	POSTINC2, ACCESS
    	rrncf	ctsdtr, W, BANKED
    	andlw	0x03
    	movwf	POSTINC2, ACCESS	;UART state
    	clrf	POSTINC2, ACCESS
    	movlw	0x0A
    	movwf	PREINC1, ACCESS	; respond with 10 bytes
    	decf	FSR1L, F, ACCESS
    	movlw	0x80 | USBPARITY
    	movwf	INDF1, ACCESS

    My naive attempts to do this add at most 37 instructions, totaling 74 bytes! Even under ideal circumstances, like more USB ram for data buffering will still require at least 62 additional bytes of code.

    Lets consider for a moment, instead of sending the SERIAL_STATE header inline as in the snippet above, what if I load the 8 byte-header into flash and send it from there? The 8 byte header is currently 12 instructions, that take 24 bytes of code space. Instead, it could be 8 bytes for the header itself, and the existing send_descriptor function used to send at a cost of 4 instructions for 8 bytes to setup and make the function call. That could reduce the size to a possible 54 additional bytes.

    At this point, I doubt that kind of memory is hiding in the code base to be scavenged. Though, perhaps it would be possible if I scrap the baud-rate divider, and hard-code support for 2 or 3 baudrates.

    Sorry flow control, looks like you won't make the cut.

  • Innocence lost

    Jesse12/26/2016 at 08:01 0 comments

    This is the story of how I hacked my beautiful baud rate divider down to a measly 74 bytes. It wasn't without compromise, but as with any good task of engineering, it's all about trade-offs. So lets begin.

    Start: 144 bytes

    To start off, the PIC18F USART supports different divider ranges selected using the BRGH bit in the TXSTA register. Having BRGH clear has almost no redeeming value, except for supporting baud rates below 183 when running at 48MHz. Setting BRGH however, increases the resolution of the divider by exactly 4. What this implies, is that any baud rate producible with BRGH clear, can also be done with BRGH set with the exception of rates below 183. So, removing support for low baudrates has no other drawback. This shaved off 42 bytes.

    Remaining: 102 bytes

    Next up, after the division is completed, the PIC specifies the divider to be one less than the result of the division. But, of course, the division doesn't always result in a nice round number, so the code I had to refactor down would use the remainder to determine whether the answer would round up or down, and include the implied subtraction in that rounding operation. It was tight. It felt like an elegant solution. Alas, fudging the numerator, and doing a subtraction still results in the best divider for all of the standard rates above 300 baud which I've checked. I bumped up the numerator from 12,000,000 to 12,001,024, and subtract 1 from the result. Net savings for this compromise came in at another 28 bytes.

    Final size: 74 bytes

    On to the next one...

  • Baud rates & Ping-pong

    Jesse12/26/2016 at 06:08 0 comments

    After much pruning, it looks like the baud rate division calculation can be saved. I did remove support for very low baud rates (under 180 baud) which shaved off a few instructions. But I'll admit I'm rather proud of that feature, and am eager to include it in the final submission.

    I also looked into removing ping-pong buffering which will reduce the code overhead slightly, however, supporting data from Serial to USB Host will require more code without ping-pong unless I allow for some occasional byte loss when the USB buffer is busy sending to the host. However, the MCU has a mode where all endpoints have ping-pong buffering enabled EXCEPT endpoint 0. This should reduce the code overhead for the more common control-transfer case, but still streamline serial data transfer. I'll explore this in the days to come, a quick look suggests that ping pong on EP0 is responsible for 38 bytes.

    Also, the division function for baud rates is not original art. It started from a copy of a 24-bit division routine found online by Tony Nixon. In a later post, I'll also document the process I took in refactoring it.

View all 11 project logs

  • 1
    Step 1

    This will get fleshed out more shortly.

    In the meantime, the setup consists of just a PIC18F4550, a crystal resonator, usb cable and a handful of capacitors on a breadboard.

    Initial firmware loading will require a PIC programmer. I have a PICKIT2 clone, but apparently Arduino and RaspberryPI can be used in place of one.

View all instructions

Enjoy this project?



jaromir.sukuba wrote 01/05/2017 at 07:08 point

Big thumbs up for this project.

Did you check 18F26J50 family, or 18F45K50 family MCUs? Is the firmware portable to those? Not sure if there are any substantial differences in USB engines.

  Are you sure? yes | no

Jesse wrote 01/05/2017 at 07:59 point

I haven't, but it wouldn't take much to check.  I did a little recon on the cheaper 18F chips and was dismayed to see the USB RAM page is different.  There's already some support for that, but the serial buffering code will need a little attention as page 4 is currently hard coded there.

I'll look into the two chips you've mentioned and get back to you.
Many thanks for your support 😀

  Are you sure? yes | no

Jesse wrote 01/05/2017 at 17:18 point

So I did a little digging on those chips, and they both should work without code changes.  Relevant entries will need to be added to for the config registers, but the USB buffer descriptors are in the same place in RAM as the chips I've tested with.  I also specify what chip is being built for in the Makefile, however if you're not using make then that can be skipped.

I'm actually surprised by the specs on the chips you linked.  The 4550 I'm using only has 256 bytes of RAM that's shared with the USB SIE, and 48 bytes of that is taken up by the buffer descriptors.  The 26J50 series actually has over 3KB of shared RAM which would would certainly be useful.  The CDC data buffer size is currently set to 32 bytes for each ping pong buffer for a total of 64 bytes of space.  For high baudrates, increasing this to 64 bytes each might improve throughput, but will require a little more work.

Let me know if you'd like any assistance with getting this running on those chips.


  Are you sure? yes | no

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates