This project uses code that allows between 4 and 256 rectangles to be written in a single instruction, or a single set of instructions, without needing command mode, window set, or pixel address commands for each pixel. This allows efficient SPI data transfer with minimal command overhead. For 64 rectangles at a time, look at the program links.

If you use the SPI-optimized driver files and an ST77xx display (I'm optimizing for ILI9341 by the end of October 2018), you will get incredible display update speeds. If you use a display other than the ones I have mentioned, I have created a wrapper that takes my display commands and translates them for the universal Adafruit drivers. You'll get the performance from the reduced command overhead (4-10x), but you won't see the 2x-3x boost that comes with SPI bursting and memory optimizations, as these need to be built into the display drivers.

The video below shows project details and is only 3-4 minutes long. This write-up is just an article format to discuss ways of speeding up LCD displays that use SPI. It explains why SPI is slow, and methods to increase performance by over 30 times! An example of use can be found here: https://www.thingiverse.com/thing:3050327 It is a complete thermal pointer that works at 64x64 resolution on Arduino Mega, 64x64 on Uno!

There is so much overhead in writing pixel information. Writing rectangles reduces this overhead, but now there is a newer method: writing several rectangles at once with minimal overhead. This alone leads to 16x performance on Arduino, where buffering the entire screen is not an option due to limited memory. SPI on Arduino also runs at double speed: SPI bursts are now twice as fast! (With everything combined it is able to do 64x64 at ~10fps!) Newer code for speed at 64x64 is in the beta section of my downloads.

Weird tech jargon below!

Yes, some people use buffers, but the SPI transfer itself is also not optimized. There is often a lot of overhead on the wire besides the pixel data.

Originally, code was written per pixel. Since the Arduino code didn't keep track of where the pixel was, each pixel needed an address: 2 bytes for x, 2 bytes for y, and, for 16-bit color, 2 bytes of color data. The window size was defined only once, at initialization or at the start of LCD use, and was set to the full display size.

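On Arduino, buffering the entire display is often not possible because the memory requirements are too high. For example, a 128x160 display at 16 bits per pixel needs 40,960 bytes, while an Uno has only 2 KB of SRAM.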

A really clever method then came out: writing rectangles. As long as the data has the same color, it is faster to write whole areas of the display. Send the set-window commands, which take 6 or more bytes, set the start address, and then write the same 2-byte color data again and again.

The display is usually set to a mode that automatically advances the pixel location after color data is sent. This seems faster, but the closer the detail of the image gets to the display resolution (that is, the smaller the rectangles), the slower this method becomes, because of the overhead of setting the window and starting position for each rectangle. At reasonable resolutions the display update rate drops to a few frames a second.
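To put rough numbers on this, here is a small sketch in plain C++ (meant to run on a PC, not on the Arduino). It assumes one window setup costs 11 bytes: a column-address command plus 4 data bytes, a row-address command plus 4 data bytes, and a memory-write command. That figure, and the names rectBytes and overheadPercent, are assumptions made for illustration; the real count depends on the driver.

```cpp
#include <cstdint>

// Assumed cost of one window setup (not taken from the driver in this article):
// column-address command + 4 bytes, row-address command + 4 bytes,
// memory-write command.
constexpr uint32_t kWindowSetupBytes = 11;
constexpr uint32_t kBytesPerPixel = 2;  // 16-bit color

// Total bytes sent to fill one w x h rectangle with a single color.
constexpr uint32_t rectBytes(uint32_t w, uint32_t h) {
  return kWindowSetupBytes + w * h * kBytesPerPixel;
}

// Share of those bytes that is setup overhead, in whole percent (rounded down).
constexpr uint32_t overheadPercent(uint32_t w, uint32_t h) {
  return 100u * kWindowSetupBytes / rectBytes(w, h);
}
```

With these assumptions, overheadPercent(16, 16) is 2, overheadPercent(4, 4) is 25 and overheadPercent(1, 1) is 84. The smaller the rectangles, the more of the SPI traffic is setup rather than color.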

What I have is a possible solution to the speed of SPI: using the rectangle command to draw several rectangles at once. It sends the set window command once, but lets the rectangle command write several smaller rectangles, each with its own color, inside the larger rectangle. It uses only up to an additional 32 bytes of RAM, but speeds up display writes by up to 16 times, because the overhead is paid only once every 4 or 16 rectangles instead of every time!
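A rough byte count shows where the saving comes from. This is a PC-side sketch, not driver code. The 11-byte setup cost and the names separateBytes, batchedBytes and savedBytes are assumptions for illustration.

```cpp
#include <cstdint>

constexpr uint32_t kSetupBytes = 11;    // assumed window-setup cost per call
constexpr uint32_t kBytesPerPixel = 2;  // 16-bit color

// Bytes sent when each of `tiles` rectangles gets its own window setup.
constexpr uint32_t separateBytes(uint32_t tiles, uint32_t tileW, uint32_t tileH) {
  return tiles * (kSetupBytes + tileW * tileH * kBytesPerPixel);
}

// Bytes sent when all tiles share a single window setup.
constexpr uint32_t batchedBytes(uint32_t tiles, uint32_t tileW, uint32_t tileH) {
  return kSetupBytes + tiles * tileW * tileH * kBytesPerPixel;
}

// Setup bytes that batching removes from the wire.
constexpr uint32_t savedBytes(uint32_t tiles, uint32_t tileW, uint32_t tileH) {
  return separateBytes(tiles, tileW, tileH) - batchedBytes(tiles, tileW, tileH);
}
```

For 16 tiles of 2x2 pixels this gives 304 bytes sent separately versus 139 batched. The model counts only bytes on the wire. Per-call costs that batching also spreads out, such as chip-select toggling and SPI transaction setup, are not included, so the measured speedup can differ from the byte ratio.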

Also, remove the SPI wait. The SPI status is usually polled to ensure that the 8-bit serial write is complete. Why have the processor stall and wait for this? Just ensure enough time has elapsed (16-20 cycles; at half the clock speed, one byte takes 16 CPU cycles to shift out) and do other things in between. We can decrement counters and loop back in this time, possibly also shift bits, and still be under this time frame. As long as we make sure enough time has passed, we can write without the several cycles it takes for the   while (!(SPSR & _BV(SPIF)))  loop to complete. This of course is only an advantage if SPI is set to at least 1/2 the microcontroller clock speed. This alone doubles SPI speed.
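The timing budget can be worked out with simple arithmetic. The sketch below is a PC-side model, not AVR code. The names spiByteCycles and paddingNops and the 2-cycle safety margin are assumptions for illustration, and real values should be checked on hardware.

```cpp
#include <cstdint>

// CPU cycles the SPI hardware needs to shift out one byte: 8 bits, each
// lasting `divider` CPU clocks (divider 2 is the fastest AVR master setting).
constexpr uint16_t spiByteCycles(uint16_t divider) {
  return 8u * divider;
}

// Safety margin before the next write; an assumed value, tune on real hardware.
constexpr uint16_t kMarginCycles = 2;

// NOPs still needed between two data-register writes when `usefulCycles`
// of real work (loop counter, branch, register loads) already sit between them.
constexpr uint16_t paddingNops(uint16_t divider, uint16_t usefulCycles) {
  return (usefulCycles >= spiByteCycles(divider) + kMarginCycles)
             ? 0
             : spiByteCycles(divider) + kMarginCycles - usefulCycles;
}
```

At divider 2 a byte takes 16 CPU cycles, which lines up with the 16-20 cycle window mentioned above. With an assumed 4 cycles of useful work, paddingNops(2, 4) is 14. At divider 4 the same call needs 30 NOPs, so the fixed wait saves less and less compared with polling as the SPI clock gets slower.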

The code below is an example of fast writing to an SPI display.

Here is the normal SPI driver for rectangle updates to the Adafruit ST77xx SPI display:

I am including the header from the top of the code file for licensing. My changes come after the original function.

This page wraps long lines. Each '#define' must stay on a single line to work correctly, or at least that has been my experience so far.

/***************************************************
  This is a library for the Adafruit 1.8" SPI display.

This library works with the Adafruit 1.8" TFT Breakout w/SD card
  ----> http://www.adafruit.com/products/358
The 1.8" TFT shield
  ----> https://www.adafruit.com/product/802
The 1.44" TFT breakout
  ----> https://www.adafruit.com/product/2088
as well as Adafruit raw 1.8" TFT display
  ----> http://www.adafruit.com/products/618

  Check out the links above for our tutorials and wiring diagrams
  These displays use SPI to communicate, 4 or 5 pins are required to
  interface (RST is optional)
  Adafruit invests time and resources providing this open source code,
  please support Adafruit and open-source hardware by purchasing
  products from Adafruit!

  Written by Limor Fried/Ladyada for Adafruit Industries.
  MIT license, all text above must be included in any redistribution
 ****************************************************/

// fill a rectangle
void Adafruit_ST77xx::fillRect(int16_t x, int16_t y, int16_t w, int16_t h,
  uint16_t color) {

  // rudimentary clipping (drawChar w/big text requires this)
  if((x >= _width) || (y >= _height)) return;
  if((x + w - 1) >= _width)  w = _width  - x;
  if((y + h - 1) >= _height) h = _height - y;

  setAddrWindow(x, y, x+w-1, y+h-1);

  uint8_t hi = color >> 8, lo = color;
    
  SPI_BEGIN_TRANSACTION();

  DC_HIGH();
  CS_LOW();
  for(y=h; y>0; y--) {
    for(x=w; x>0; x--) {
      spiwrite(hi);
      spiwrite(lo);
    }
  }
  CS_HIGH();
  SPI_END_TRANSACTION();
}

********************************************* here are 4 rectangles with 4 colors at a time *************

void Adafruit_ST77xx::fillRectFast4colors(int16_t x, int16_t y, int16_t w, int16_t h,
  uint16_t color0,uint16_t color1,uint16_t color2,uint16_t color3) {
//this is needed for text but not fills with solid color!
  // rudimentary clipping (drawChar w/big text requires this)
 // if((x >= _width) || (y >= _height)) return;
 // if((x + w - 1) >= _width)  w = _width  - x;
 // if((y + h - 1) >= _height) h = _height - y;

  setAddrWindow(x, y, x+w-1, y+h-1);
  //uses more memory, but i think it is more efficient
  uint8_t hi0 = color0 >> 8, lo0 = color0;//we set this in advance, and we do the changes in between write cycles of spi! so no time lost!
  uint8_t hi1 = color1 >> 8, lo1 = color1;//we set this in advance, and we do the changes in between write cycles of spi! so no time lost!
  uint8_t hi2 = color2 >> 8, lo2 = color2;//we set this in advance, and we do the changes in between write cycles of spi! so no time lost!
  uint8_t hi3 = color3 >> 8, lo3 = color3;//we set this in advance, and we do the changes in between write cycles of spi! so no time lost!  
  SPI_BEGIN_TRANSACTION();
  //created complex commands in #define so code can be a lot cleaner
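  //complexspi saves SPCR into SPCRbackup, loads mySPCR into SPCR, then names SPDR, so "complexspi = value;" sends one byte (SPCRbackup and mySPCR are assumed to be declared elsewhere in the driver)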
#define complexspi SPCRbackup = SPCR; SPCR = mySPCR; SPDR 
//this is small delay
#define shortSpiDelay __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t")
//no trailing ';' in these delay macros; it is added where the macro is used
#define longSpiDelay __asm__("nop\n\t"); __asm__("nop\n\t");__asm__("nop\n\t"); __asm__("nop\n\t");  __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t")  
  
digitalWriteFast(TFT_DC,HIGH);// old DC_HIGH();
digitalWriteFast(TFT_CS,LOW);// old CS_LOW();

//  y=h;
  uint16_t countx;//we use these to understand what to draw!
  uint16_t county;

 SPCR = SPCRbackup;  //we place at top for next loop iteration
county=h/2;//+1;
while (county !=0 ){//first rectangle part
         countx=w/2;
          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi0;longSpiDelay; complexspi =lo0; shortSpiDelay;   
           countx--;    
    }
       countx=w/2;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi1;longSpiDelay;complexspi=lo1;shortSpiDelay;
           countx--;     
    }
    county--;
}//end of first half of rectangle, [0][1] drawn!

county=h/2;//+1; 
while (county !=0){
     countx=w/2;//we need to reset countx here as well
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
      complexspi = hi2;longSpiDelay;complexspi =lo2;shortSpiDelay;
      countx--;     
    }
 
    countx=w/2;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
      complexspi = hi3;longSpiDelay; complexspi =lo3;shortSpiDelay;
           countx--;
    }
    county--;
}//end of second half of rectangle, [2][3] drawn!

         

    digitalWriteFast(TFT_DC,HIGH);// old DC_HIGH();
 
  digitalWriteFast(TFT_CS,HIGH);//old CS_HIGH();
  SPI_END_TRANSACTION();
}

********************************************* here are 16 rectangles with 16 colors at a time *************

Here is the version that draws 16 rectangles at a time. To simplify things I have created complex instructions with '#define'.

The method below is up to 16 times faster, as overhead is reduced and most of the work is just pushing SPI color data. Address and window size data are sent only once instead of 16 times.

void Adafruit_ST77xx::fillRectFast16colors(int16_t x, int16_t y, int16_t w, int16_t h,
             uint16_t color0, uint16_t color1, uint16_t color2, uint16_t color3,
             uint16_t color4, uint16_t color5, uint16_t color6, uint16_t color7,
             uint16_t color8, uint16_t color9, uint16_t color10, uint16_t color11,
             uint16_t color12, uint16_t color13, uint16_t color14, uint16_t color15){
//this is where squares are written together!
  setAddrWindow(x, y, x+w-1, y+h-1);
  //uses more memory, but i think it is more efficient
  uint8_t hi0 = color0 >> 8, lo0 = color0 ;
  uint8_t hi1 = color1 >> 8, lo1 = color1 ;
  uint8_t hi2 = color2 >> 8, lo2 = color2 ;
  uint8_t hi3 = color3 >> 8, lo3 = color3 ;
  uint8_t hi4 = color4 >> 8, lo4 = color4 ;
  uint8_t hi5 = color5 >> 8, lo5 = color5 ;
  uint8_t hi6 = color6 >> 8, lo6 = color6 ;
  uint8_t hi7 = color7 >> 8, lo7 = color7 ;
  uint8_t hi8 = color8 >> 8, lo8 = color8 ;
  uint8_t hi9 = color9 >> 8, lo9 = color9 ; 
  uint8_t hi10= color10>> 8, lo10= color10; 
  uint8_t hi11= color11>> 8, lo11= color11;
  uint8_t hi12= color12>> 8, lo12= color12;
  uint8_t hi13= color13>> 8, lo13= color13;
  uint8_t hi14= color14>> 8, lo14= color14;
  uint8_t hi15= color15>> 8, lo15= color15;  
  SPI_BEGIN_TRANSACTION();
  //created complex commands in #define so code can be a lot cleaner
  //complexspi saves SPCR into SPCRbackup, loads mySPCR into SPCR, then names SPDR, so "complexspi = value;" sends one byte
#define complexspi SPCRbackup = SPCR; SPCR = mySPCR; SPDR 
//this is small delay; defined exactly as in fillRectFast4colors (no trailing ';') so the repeated #define does not conflict
#define shortSpiDelay __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t")
//no trailing ';' here either; it is added where the macro is used
#define longSpiDelay __asm__("nop\n\t"); __asm__("nop\n\t");__asm__("nop\n\t"); __asm__("nop\n\t");  __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t"); __asm__("nop\n\t")
  //[0][1][2][3] 
  //[4][5][6][7] 
  //[8][9][A][B] //this is how command updates screen
  //[C][D][E][F] //a,b,c,d,e,f are 10,11,12,13,14,15
digitalWriteFast(TFT_DC,HIGH);// old DC_HIGH();
digitalWriteFast(TFT_CS,LOW);// old CS_LOW();

//  y=h;
  uint16_t countx;//we use these to understand what to draw!
  uint16_t county;

 SPCR = SPCRbackup;  //we place at top for next loop iteration
county=h/4;//+1;
while (county !=0 ){//first rectangle part
         countx=w/4;
          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi0;longSpiDelay; complexspi =lo0; shortSpiDelay;   
           countx--;    
                   }
       countx=w/4;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi1;longSpiDelay;complexspi=lo1;shortSpiDelay;
           countx--;     
                    }
         countx=w/4;          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi2;longSpiDelay; complexspi =lo2; shortSpiDelay;   
           countx--;    
                   }
       countx=w/4;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi3;longSpiDelay;complexspi=lo3;shortSpiDelay;
           countx--;     
                  }
    
    county--;
};//                     [0][1][2][3] drawn!
county=h/4;//+1;
while (county !=0 ){//second row of rectangles
         countx=w/4;
          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi4;longSpiDelay; complexspi =lo4; shortSpiDelay;   
           countx--;    
                   }
       countx=w/4;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi5;longSpiDelay;complexspi=lo5;shortSpiDelay;
           countx--;     
                    }
         countx=w/4;          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi6;longSpiDelay; complexspi =lo6; shortSpiDelay;   
           countx--;    
                   }
       countx=w/4;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi7;longSpiDelay;complexspi=lo7;shortSpiDelay;
           countx--;     
                  }
    
    county--;
};  //                        [4][5][6][7] drawn!
county=h/4;//+1;
while (county !=0 ){//third row of rectangles
         countx=w/4;
          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi8;longSpiDelay; complexspi =lo8; shortSpiDelay;   
           countx--;    
                   }
       countx=w/4;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi9;longSpiDelay;complexspi=lo9;shortSpiDelay;
           countx--;     
                    }
         countx=w/4;          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi10;longSpiDelay; complexspi =lo10; shortSpiDelay;   
           countx--;    
                   }
       countx=w/4;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi11;longSpiDelay;complexspi=lo11;shortSpiDelay;
           countx--;     
                  }
    
    county--;
}; //                         [8][9][A][B] drawn!
county=h/4;//+1;
while (county !=0 ){//fourth row of rectangles
         countx=w/4;
          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi12;longSpiDelay; complexspi =lo12; shortSpiDelay;   
           countx--;    
                   }
       countx=w/4;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi13;longSpiDelay;complexspi=lo13;shortSpiDelay;
           countx--;     
                    }
         countx=w/4;          
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.
    complexspi = hi14;longSpiDelay; complexspi =lo14; shortSpiDelay;   
           countx--;    
                   }
       countx=w/4;
    while (countx !=0) {//we need to unroll this loop and add a command that cycles thru to push address one byte at a time.     
     complexspi = hi15;longSpiDelay;complexspi=lo15;shortSpiDelay;
           countx--;     
                  }
    
    county--;
}; //                         [C][D][E][F] drawn!

    digitalWriteFast(TFT_DC,HIGH);// old DC_HIGH();
 
  digitalWriteFast(TFT_CS,HIGH);//old CS_HIGH();
  SPI_END_TRANSACTION();             
             }
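
The unrolled functions above can also be expressed as loops over an array of colors. The following is a sketch of that idea for checking the byte order on a PC, not a drop-in replacement for the driver. The names streamTiles and captureTiles and the sendByte callback are hypothetical, and the timing tricks (fixed NOP delays instead of polling) are left out.

```cpp
#include <cstdint>
#include <vector>

// Streams a w x h window as a cols x rows grid of solid tiles, in the same
// order the unrolled functions use: tile row by tile row, and within each
// pixel row, left tile to right tile. `sendByte` is a hypothetical callback
// standing in for the SPI write.
template <typename Sink>
void streamTiles(uint16_t w, uint16_t h, uint8_t cols, uint8_t rows,
                 const uint16_t* colors, Sink&& sendByte) {
  const uint16_t tileW = w / cols;
  const uint16_t tileH = h / rows;
  for (uint8_t r = 0; r < rows; ++r) {
    for (uint16_t y = 0; y < tileH; ++y) {
      for (uint8_t c = 0; c < cols; ++c) {
        const uint16_t color = colors[r * cols + c];
        const uint8_t hi = static_cast<uint8_t>(color >> 8);
        const uint8_t lo = static_cast<uint8_t>(color & 0xFF);
        for (uint16_t x = 0; x < tileW; ++x) {
          sendByte(hi);  // high byte first, as in the driver code
          sendByte(lo);
        }
      }
    }
  }
}

// Host-side helper: collects the byte stream into a vector for inspection.
std::vector<uint8_t> captureTiles(uint16_t w, uint16_t h, uint8_t cols,
                                  uint8_t rows, const uint16_t* colors) {
  std::vector<uint8_t> out;
  streamTiles(w, h, cols, rows, colors,
              [&out](uint8_t b) { out.push_back(b); });
  return out;
}
```

Calling captureTiles(4, 4, 2, 2, colors) with four colors yields 32 bytes: two pixels of colors[0], then two of colors[1], for each of the first two pixel rows, followed by colors[2] and colors[3] for the last two. As in the unrolled code, w and h are expected to be multiples of the grid size, and any remainder pixels are not sent.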