Close

improved encoder reading

A project log for Whiteboard Drawbot

Putting together a drawbot for my whiteboard.

matthew-reevesMatthew Reeves 11/12/2018 at 18:430 Comments

The software I'm modifying does an encoder read as follows:

  int rot = ((digitalRead(BTN_EN1) == LOW) << BLEN_A)
          | ((digitalRead(BTN_EN2) == LOW) << BLEN_B);
  // potentiometer uses grey code.  Pattern is 0 3 
  switch (rot) {
    /** logic to interpret the value goes here ** /
  }

This routine is ok, but it could be made a little better.

First I cut and paste the routine in a separate test project so that I can gather timings.

To collect timing I'm collecting 1000 iterations of the encoder reading.

void consume_input() {
    // grab the time prior to mucking with serial output
    unsigned long now = micros();
    unsigned long delta_ms = now - delta_time_start;

    Serial.print("Turn: ");
    Serial.print(lcd_turn);
    Serial.print(" uS: ");
    Serial.println(delta_ms);
    lcd_turn = 0;
    delta_time_start = micros();
}

static int g_loopctr = 0;
void loop() {
  read_input_pot();
  ++g_loopctr;
  if(g_loopctr>1000)
  {
      g_loopctr = 0;
      consume_input();
  }
}

I'm running this code on an Arduino Mega 2650 and using Arduino 1.8.5 (which is using avr-gcc/5.4.0-atmel3.6.1)

Starting with the existing implementation:

Turn: 0 uS: 12028
Turn: 0 uS: 12028
Turn: 1 uS: 12108
Turn: 2 uS: 12232
Turn: 1 uS: 12040
Turn: 1 uS: 12148
Turn: 3 uS: 12224

There's a fairly consistent time of 12028 uS per 1000 iteration and an additional cost of 50 - 100uS when the pin detection code runs

Measuring 1000 loop iterations with the pin reading code disabled
 microseconds (uS): 3080
 
 12028 - 3080 = 8948
 

Measuring 1000 loop iterations and calling read_input_pot() twice:

Turn: 0 uS: 23108
Turn: 0 uS: 23108
Turn: 0 uS: 23108
Turn: 0 uS: 23108

23108 - 12028 = 11080

That's a bit higher than I expected, but within reason. It seems like the timing numbers being collected have some validity.

So, first easy win is to switch some variables over to 8 bit integers. There's basically 2 pins that are being used to read values, so there's no reason to use anything larger than 8 bits.

Turn: 0 uS: 11648
Turn: 1 uS: 11740
Turn: 2 uS: 11876
Turn: 1 uS: 11664

 That's a small but measurable improvement.

Looking at the assembly code it's easy to see why this was a win

 //rot is int
  switch (rot) {
 4ba: 21 30        cpi r18, 0x01 ; 1
 4bc: 31 05        cpc r19, r1
 4be: 61 f1        breq .+88      ; 0x518 <__LOCK_REGION_LENGTH__+0x118>
 4c0: 24 f4        brge .+8      ; 0x4ca <__LOCK_REGION_LENGTH__+0xca>
 4c2: 21 15        cp r18, r1
 4c4: 31 05        cpc r19, r1
 4c6: 41 f0        breq .+16      ; 0x4d8 <__LOCK_REGION_LENGTH__+0xd8>
 4c8: 2c c0        rjmp .+88      ; 0x522 <__LOCK_REGION_LENGTH__+0x122>
 4ca: 22 30        cpi r18, 0x02 ; 2
 4cc: 31 05        cpc r19, r1
 4ce: c9 f0        breq .+50      ; 0x502 <__LOCK_REGION_LENGTH__+0x102>
 4d0: 23 30        cpi r18, 0x03 ; 3
 4d2: 31 05        cpc r19, r1
 4d4: d9 f0        breq .+54      ; 0x50c <__LOCK_REGION_LENGTH__+0x10c>
 4d6: 25 c0        rjmp .+74      ; 0x522 <__LOCK_REGION_LENGTH__+0x122>
// rot is uint8_t
{
switch (rot) {
 4ae: 91 30        cpi r25, 0x01 ; 1
 4b0: 31 f1        breq .+76      ; 0x4fe <__LOCK_REGION_LENGTH__+0xfe>
 4b2: 28 f0        brcs .+10      ; 0x4be <__LOCK_REGION_LENGTH__+0xbe>
 4b4: 92 30        cpi r25, 0x02 ; 2
 4b6: c9 f0        breq .+50      ; 0x4ea <__LOCK_REGION_LENGTH__+0xea>
 4b8: 93 30        cpi r25, 0x03 ; 3
 4ba: e1 f0        breq .+56      ; 0x4f4 <__LOCK_REGION_LENGTH__+0xf4>
 4bc: 24 c0        rjmp .+72      ; 0x506 <__LOCK_REGION_LENGTH__+0x106>

Changing to uint8_t made a significant improvement in the generated instructions, but only resulted in a small improvement to overall performance. This suggests that the majority of time is being spent in something that the code is calling. In this case the time is coming from the calls to digitalRead.

digitalRead works and is safe, but is not the most efficient way to read from a pin. In this case the pins being read are 31 and 33. On an Arduino Mega these pins are on port 3 (see https://github.com/hazcat/QueryPinconfig ). That data plus the bitmasks to read the values can be cached off.

#define BTN_EN1 31 //[RAMPS14-SMART-ADAPTER]  
#define BTN_EN2 33 //[RAMPS14-SMART-ADAPTER]  

// cache of the port for the pins

volatile uint8_t * cacheport = nullptr;

// bitmasks for the two values

uint8_t bm_a = 0;
uint8_t bm_b = 0;

void setup() {

  /* ... */

  bm_a = digitalPinToBitMask(BTN_EN1);
  uint8_t port1 = digitalPinToPort(BTN_EN1);
  bm_b = digitalPinToBitMask(BTN_EN2);
  uint8_t port2 = digitalPinToPort(BTN_EN2);

}

void read_input_pot_direct_read_from_pins()
{
  // doing two reads. Grab the port value and then do the two reads
  uint8_t port = (*cacheport);
    
  uint8_t rot = ( (port&bm_a) == bm_a) ? B01 : 0;
  if((port&bm_b)==bm_b)
  {
    rot |= B10;
  }

Collecting timings:

Turn: 0 uS: 4596
Turn: 1 uS: 4596
Turn: 1 uS: 4588
Turn: 0 uS: 4588
Turn: 2 uS: 4592

That's quite a lot better.

The next thing I wanted to check was to see if there was any advantage to switching to a lookup table based approach as described here.

https://www.circuitsathome.com/mcu/reading-rotary-encoder-on-arduino/

The first attempt produced significantly worse results. The primary issue was that it's a massive optimization to detect that the encoder hasn't moved. The second issue was that the right shift operator produces some assembly that is suboptimal. 

  last_input = (last_input >> 2);
 a36:    90 e0           ldi    r25, 0x00    ; 0
 a38:    95 95           asr    r25
 a3a:    87 95           ror    r24
 a3c:    95 95           asr    r25
 a3e:    87 95           ror    r24
 a40:    80 93 3f 02     sts    0x023F, r24    ; 0x80023f <_ZL10last_input>

Switching over to dividing by 4 I get the two shift rights that I'd expect

    last_input = (last_input /4);
 a36:    86 95           lsr    r24
 a38:    86 95           lsr    r24
 a3a:    80 93 3f 02     sts    0x023F, r24    ; 0x80023f <_ZL10last_input>

In the end I came up with the following

constexpr int8_t kDIR_A = -1;
constexpr int8_t kDIR_B = 1;
static const int8_t enc_states[] PROGMEM = { 0, kDIR_A, kDIR_B, 0, kDIR_B, 0, 0, kDIR_A, kDIR_A, 0, 0, kDIR_B, 0, kDIR_B, kDIR_A, 0 };

void read_input_pot() {
  uint8_t port = (*cacheport);
  uint8_t rot = ((port & bm_a) == bm_a) ? B01 : 0;
  if ((port & bm_b) == bm_b) {
    rot |= B10;
  }

  if(rot!= lcd_rot_old) {
    uint8_t index = (rot*4)+lcd_rot_old;
    lcd_turn += pgm_read_byte(&enc_states[index]);
    lcd_rot_old = rot;
  }
}

The end product produces some pretty reasonable assembly and timings that are in line with the other routine.

Turn: 0 uS: 4592
Turn: -1 uS: 4596
Turn: 0 uS: 4600
Turn: -2 uS: 4592
Turn: 0 uS: 4596
Turn: 0 uS: 4584
Turn: -1 uS: 4588
Turn: 0 uS: 4600
  if(rot!= lcd_rot_old) {
 a6a:    e8 17           cp    r30, r24
 a6c:    79 f0           breq    .+30         ; 0xa8c <main+0x204>
 a6e:    98 2f           mov    r25, r24
 a70:    99 0f           add    r25, r25
      uint8_t index = (rot*4)+lcd_rot_old;
      lcd_turn += pgm_read_byte(&enc_states[index]);
 a72:    99 0f           add    r25, r25
 a74:    e9 0f           add    r30, r25
 a76:    f0 e0           ldi    r31, 0x00    ; 0
 a78:    ee 5d           subi    r30, 0xDE    ; 222
 a7a:    fe 4f           sbci    r31, 0xFE    ; 254
 a7c:    e4 91           lpm    r30, Z
 a7e:    90 91 30 02     lds    r25, 0x0230    ; 0x800230 <lcd_turn>
 a82:    e9 0f           add    r30, r25
 a84:    e0 93 30 02     sts    0x0230, r30    ; 0x800230 <lcd_turn>
 a88:    80 93 2b 02     sts    0x022B, r24    ; 0x80022b <_ZL11lcd_rot_old>
loop():

The next step in improving this routine is probably to read the pins off of an interrupt rather than polling for state change. But that'll have to wait until another time.

Code is here:

https://github.com/hazcat/RotEnc_speedtest

Discussions