So Norman's code (featuring only one CALL and RET per copy loop) is certainly faster than my version!
Compare and contrast (for each pass through the Main Loop):
Primary to secondary pixel buffer = 128 raster lines
Primary to secondary attribute buffer = 16 character rows
So my method uses 128 + 16 = 144 CALLs and RETs.
Norman's method only requires 2 CALLs and RETs (for the pixel loop and for the attribute loop).
Unconditional CALL = 17 T-States
Unconditional RET = 10 T-States
So the difference in T-States is 142 x 27 = 3834 T-States (the amount by which Norman's method is faster than mine).
[There should be no difference in terms of copying the secondary buffers to the physical screen, because the Jagged Finger fix means that the data isn't copied contiguously (in terms of the way that it is stored in memory). So there are separate CALLs to the subroutine for each individual raster line, in both Norman's method and mine.]
However, that 3834 is only a modest difference when you compare it with the overall saving achieved by abandoning LDIR in favour of the 32-consecutive-LDI method. Norman worked out that copying the pixels (4096 bytes) between buffers is faster by 22528 T-States. For the 512 bytes of attributes across the 16 character rows of the playable screen, there is an additional saving of 2816 T-States.
So the total saving (per Main Loop pass) achieved is 25344 T-States before you account for the time taken to perform CALLs and RETs.
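For anyone who wants to check the sums, here is a rough Python sketch that recomputes the figures from the counts and timings quoted above. The variable and function names are my own invention for illustration; the LDI savings are simply Norman's numbers taken as given, not re-derived from instruction timings.

```python
# Z80 timings quoted in the post.
CALL_T_STATES = 17  # unconditional CALL
RET_T_STATES = 10   # unconditional RET

# Rows copied from the primary to the secondary buffers on each Main Loop pass.
PIXEL_LINES = 128     # raster lines in the pixel buffer
ATTRIBUTE_ROWS = 16   # character rows in the attribute buffer

# Norman's quoted savings from replacing LDIR with 32 consecutive LDIs
# (assumed correct here, not re-derived).
PIXEL_SAVING = 22528
ATTRIBUTE_SAVING = 2816


def call_ret_overhead(calls: int) -> int:
    """T-States spent on CALL/RET pairs for the given number of calls."""
    return calls * (CALL_T_STATES + RET_T_STATES)


per_line_calls = PIXEL_LINES + ATTRIBUTE_ROWS  # one CALL per row copied
per_loop_calls = 2                             # one CALL per copy loop

overhead_difference = (
    call_ret_overhead(per_line_calls) - call_ret_overhead(per_loop_calls)
)
ldi_saving = PIXEL_SAVING + ATTRIBUTE_SAVING

print("CALLs with one CALL per row:", per_line_calls)
print("CALL/RET overhead difference:", overhead_difference, "T-States")
print("LDIR -> LDI saving:", ldi_saving, "T-States")
print("Overhead difference as a share of the LDI saving:",
      round(100 * overhead_difference / ldi_saving, 1), "%")
```

The last line just puts the CALL/RET difference in proportion to the LDI saving, which is the comparison being made above.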
Edited by IRF, 17 July 2019 - 12:12 PM.