source code for JSW

IRF · July 11, 2019

For consistency and elegance, wouldn't it be best, in the case of operations which are self-modified by the code, to insert NOP command(s) (opcode #00) wherever they appear in the source code listing? That way the default value held at the pertinent address(es) would be zero, as is the case with the operands that are self-modified.

e.g. For your example of a direction label, list it in the source code as:

S_M_C_direction: NOP

And then use: LD A, #3C [for INC A] or LD A, #3D [for DEC A] or XOR A [to restore the default NOP]

followed by: LD (S_M_C_direction), A

for movement in whichever direction (or no direction).

Edited July 11, 2019 by IRF

Norman Sword · July 11, 2019

I would imagine you would end up with a large rule book.

example 1:-
S_M_C_counter1: equ $+1

ld a,12

inc a

and 7
or 8

ld (S_M_C_counter1),a

Here the value varies between 8 and 15. Your scheme fails in this case: the variable never holds a value of zero.

example 2:-

S_M_C_opcode: inc a

direction_switch equ $3c xor $3d ; this is ("inc a") xor ("dec a")

ld hl,S_M_C_opcode

ld a,(hl)

xor direction_switch

ld (hl),a

Here the opcode alternates between "inc a" and "dec a": the code is switching direction. Again, it is never zero.

-----------------------------------------------------

The circumstances can change from one instance to another.

The S_M_C_ prefix alerts you to code that is self-modifying.

The $-$ is making the statement that the value will be changed before the opcode is executed.

In a lot of instances we must have an opcode or an initial value. In those cases the value is inserted or the opcode written out.

I suppose it is similar to saying a block move is always in this format:-

ld hl,source

ld de,destination

ld bc, count

LDIR

when the reality says it is a lot of the time, but the variations are vast.

Edited July 11, 2019 by Norman Sword

IRF · July 12, 2019

Thanks Norman.

As it happens, in example 1, if the initial value held by S_M_C_counter1 was zero, then it would be overwritten by 9 on the first pass (0 + 1 = 1, then AND 7 and OR 8 give 9), after which it would follow the intended pattern of the operand incrementing during each pass through the code (looping back from 15 to 8).

(The initial value of zero might have an adverse impact though, depending on the context - especially if the variable is picked up by the program before it is first modified. e.g. an out-of-range guardian crashing into a wall at the edge of a room?)
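The cycle in example 1 can be checked with a small model. This is only a sketch in Python rather than Z80 code, and the names `next_counter` and `run` are made up for illustration; they simply mirror the INC A / AND 7 / OR 8 sequence and the write back to the operand.

```python
def next_counter(value: int) -> int:
    """Model one pass of example 1: INC A, AND 7, OR 8 on an 8-bit register."""
    a = (value + 1) & 0xFF   # INC A wraps at 8 bits
    a &= 7                   # AND 7 keeps only the low three bits
    a |= 8                   # OR 8 forces the result into the range 8..15
    return a


def run(start: int, passes: int) -> list:
    """Return the operand value stored after each of `passes` passes."""
    values = []
    value = start
    for _ in range(passes):
        value = next_counter(value)
        values.append(value)
    return values


print(run(12, 9))  # starting from the listed default of 12
print(run(0, 9))   # starting from a hypothetical default of zero
```

Whatever the starting value, every stored value in this model lies between 8 and 15, so a zero default joins the same cycle after a single pass.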

But I can see that in example 2, if you had a default value of zero stored at S_M_C_opcode, then execution of the code would never cause the labelled address to reach either of its intended operations (INC A or DEC A).

Instead, the address S_M_C_opcode would toggle between acting as a NOP (00), and the 01 opcode - which would have the unintended effect of picking up the next pair of bytes which follow on from S_M_C_opcode, and loading those values into the BC register-pair!
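The toggling in example 2 can be modelled the same way. Again this is a Python sketch, not Z80 code, and `toggle` and `states` are hypothetical helper names; the only facts relied on are the opcodes quoted above (#3C for INC A, #3D for DEC A, #00 for NOP, #01 for LD BC,nn).

```python
INC_A = 0x3C
DEC_A = 0x3D
NOP = 0x00
LD_BC_NN = 0x01

# Same value as direction_switch in the listing: ("inc a") xor ("dec a")
DIRECTION_SWITCH = INC_A ^ DEC_A


def toggle(opcode: int) -> int:
    """Model LD A,(HL) / XOR direction_switch / LD (HL),A."""
    return opcode ^ DIRECTION_SWITCH


def states(start: int, count: int) -> list:
    """Return the opcode held at the label after each of `count` toggles."""
    result = []
    opcode = start
    for _ in range(count):
        opcode = toggle(opcode)
        result.append(opcode)
    return result


print([hex(op) for op in states(INC_A, 4)])  # alternates DEC A / INC A
print([hex(op) for op in states(NOP, 4)])    # alternates LD BC,nn / NOP
```

Because the XOR mask is 1, the byte at the label only ever flips its lowest bit, which is why a zero default can never reach either intended opcode.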

Edited July 12, 2019 by IRF

IRF · July 16, 2019

A query [EDIT: Which I think I've answered myself in subsequent posts!]:

In the Main Loop of a recent project, I have this arrangement (repeated four times, for copying pixels twice and attributes twice - primary to secondary buffer and then buffer to physical screen):

LD HL, source

LD DE, destination

; No need to define BC; it's not used now, so LD BC, xxxx command has been deleted

LD A, #80 or LD A, #10 ; For copying the pixels (128 raster lines) or attributes (16 character rows) respectively
loop:
CALL subroutine
DEC A
JR NZ, loop
; Once A reaches zero, flow of execution continues through the Main Loop

The subroutine which is CALLed consists of 32 consecutive LDI commands, followed by a RET.

This was obviously based on one of Norman Sword's suggestions (duly credited in the readme file for the project in question). However, there is a slight difference - Norman's subroutine incorporates the DEC A and JR NZ commands (after the final LDI and before the RET), whereas in my version, those commands are located in the Main Loop.

In terms of memory, Norman's version is obviously more efficient (because I have to repeat the DEC A and JR NZ commands four times within the Main Loop, rather than just once in Norman's subroutine).

However - and here is my query - would my version be slightly faster? [I don't mean the game as a whole - Norman has done lots of other things to speed up the game - I mean purely in terms of comparing the two variants of the LDI method like-for-like.]

My thinking is that the number of T-States which it takes to perform a relative jump is proportional to the distance through the code which has to be jumped - 67 bytes in Norman's case, and only 5 bytes in mine.

?

****

N.B. My method may complicate things in cases where a chunk of code is being overwritten with a single value - where the first byte is overwritten directly, so that the number of bytes to which the same value is then copied in a loop is one fewer. e.g. for an attribute update with a single value (such as for a screen flash effect), use #01FF instead of #0200 to define the size of the loop.

Norman's code deals with such cases by CALLing a late entry point into his subroutine, coinciding with the second LDI command in the subroutine. (But the JR NZ at the end of the subroutine jumps back to the first LDI in the subroutine.)

In such cases, I think my method would unavoidably end up 'overshooting', and overwriting one more byte than it should. (But in the aforementioned project, I didn't actually use an LDI method for 'block fill' purposes, only for 'block move'.)

EDIT: For reference:
http://jswmm.co.uk/topic/375-a-total-rewrite-of-jsw-in-48k-using-matthews-core-code/page-4?do=findComment&comment=7745

Note also my comment/query here about a couple of presumed typos:
http://jswmm.co.uk/topic/375-a-total-rewrite-of-jsw-in-48k-using-matthews-core-code/page-6?do=findComment&comment=9047

Edited July 17, 2019 by IRF

IRF · July 17, 2019

My thinking is that the number of T-States which it takes to perform a relative jump is proportional to the distance through the code which has to be jumped - 67 bytes in Norman's case, and only 5 bytes in mine.

?

On reflection, my variant might not be faster after all - my subroutine is CALLed #10 or #80 times during every pass through each part of the Main Loop that performs a block copy operation.

The number of T-States for that many CALL/RET commands (versus just one CALL/RET in Norman's code) may well outweigh the saving in T-States achieved by shortening the length of the relative jump!

Further investigation is required...

Edited July 17, 2019 by IRF

IRF · July 17, 2019

Further investigation is required...

... And it seems I got completely the wrong end of the stick!

The number of T-States for a conditional relative jump loop is based on how many times the relative jump has to be executed (here determined by counting down the value of A, which doesn't change between Norman's method and mine), rather than the distance back through the code that each relative jump spans (the operand of the JR command), as I had previously understood to be the case.

So Norman's code (featuring only one CALL and RET per chunk of code copied) is certainly faster than my version!

Edited July 17, 2019 by IRF

IRF · July 17, 2019

So Norman's code (featuring only one CALL and RET) is certainly faster than my version!

Compare and contrast (for each pass through the Main Loop):

Primary to secondary pixel buffer = 128 raster lines

Primary to secondary attribute buffer = 16 character rows

So my method uses 144 CALLs and RETs.

Norman's method only requires 2 CALLs and RETs (for the pixel loop and for the attribute loop).

Unconditional CALL = 17 T-States

Unconditional RET = 10 T-States

So the difference in T-States is 142 x 27 = 3834 T-States (the amount by which Norman's method is faster than mine).

[There should be no difference in terms of the copying of the secondary buffers to the physical screen, because the Jagged Finger fix means that the data isn't copied contiguously (in terms of the way that it is stored in memory). So there are separate CALLs to the subroutine for each individual raster line, in both Norman's and my method.]

****

However, that 3834 is only a modest difference when you compare it with the overall saving achieved by abandoning LDIR in favour of the 32-consecutive-LDI method. Norman worked out that copying the pixels (4096 bytes) between buffers is faster by 22528 T-States. For the 512 bytes of attributes across 16 character rows of the playable screen, there is an additional saving of 2816 T-States.

So the total saving (per Main Loop pass) achieved is 25344 T-States before you account for the time taken to perform CALLs and RETs.
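For anyone wanting to recheck the CALL/RET arithmetic, here is a small Python sketch that recomputes it from the figures quoted in this post (17 T-States per CALL, 10 per RET, 128 raster lines and 16 character rows). The function name `call_ret_overhead` is made up for illustration.

```python
CALL_T = 17        # unconditional CALL, in T-States
RET_T = 10         # unconditional RET, in T-States
PIXEL_LINES = 128  # raster lines copied between the pixel buffers
ATTR_ROWS = 16     # character rows copied between the attribute buffers


def call_ret_overhead(pairs: int) -> int:
    """T-States spent on `pairs` CALL/RET pairs."""
    return pairs * (CALL_T + RET_T)


# One CALL/RET per 32-byte line or row (DEC A / JR NZ in the Main Loop)
per_line = call_ret_overhead(PIXEL_LINES + ATTR_ROWS)

# One CALL/RET per whole block (DEC A / JR NZ inside the subroutine)
per_block = call_ret_overhead(2)

difference = per_line - per_block

print(per_line, per_block, difference)
```

The DEC A and JR NZ commands run the same number of times in both variants, so under these assumptions the CALL/RET pairs are the only difference between them.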

Edited July 17, 2019 by IRF

Norman Sword · July 17, 2019

Re: branching, jumping and calling.

The program counter is loaded with data during one of the clock cycles. This is the same for CALLs, JPs and JRs; only the number of clock cycles needed to set the data up differs. Once the program counter is loaded, on the next clock cycle we move to the new address. What this means is that the speed is fixed no matter where the program counter is asked to move to. A relative jump of 0 bytes is executed at the same speed as a relative jump of 127 bytes.

Calls and jumps (and I will also include RETs) are similar: the program counter is loaded, and on the next clock cycle we execute the instruction pointed at by the (possibly) changed program counter. Each is acted on with no consideration of the amount of relative displacement from the old value.
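As a rough illustration of that point, the cost of a DEC A / JR NZ loop can be modelled with the distance as an input that is simply never used. This is a Python sketch under the standard Z80 timings (DEC A = 4 T-States, JR NZ = 12 when taken and 7 when not taken); the function name `loop_overhead` and its `distance` parameter are hypothetical.

```python
DEC_A_T = 4          # DEC A
JR_TAKEN_T = 12      # JR NZ when the jump is taken
JR_NOT_TAKEN_T = 7   # JR NZ when execution falls through


def loop_overhead(iterations: int, distance: int) -> int:
    """T-States spent on DEC A / JR NZ over a whole loop.

    `distance` is the size of the backward jump in bytes. It is accepted
    only to make the point that it plays no part in the timing.
    """
    if iterations < 1:
        raise ValueError("the loop body runs at least once")
    taken = iterations - 1   # every pass but the last jumps back
    return taken * (DEC_A_T + JR_TAKEN_T) + DEC_A_T + JR_NOT_TAKEN_T


print(loop_overhead(128, 67))  # pixel copy, long jump back
print(loop_overhead(128, 5))   # pixel copy, short jump back
print(loop_overhead(16, 67))   # attribute copy
```

Only the iteration count changes the result, which matches the description above of the program counter being loaded in a fixed number of clock cycles.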

----------------------------------------------------------------------------

I will re-read the posts above this one... and perhaps comment further.

Edited July 17, 2019 by Norman Sword

Norman Sword · July 17, 2019

Using a CALL and a RET to 32 consecutive LDIs: a variation on my last version would probably do what you want. This assumption is based on a quick scan of all the changes listed in the above posts.

;copy work and attribute screens

    ld hl,att_work
    ld de,ATT0
;;;; ld b,0 ; removed: this was set for usage in a different routine
    exx
    ld hl,ytable
    ld bc,128   ; must be a multiple of 32 ; this is 4*32 ;- that is 4 raster lines before the attributes are written in
;loop executed 128 times on each game loop
raster:
    ld e,(hl)
    inc l
    push hl
    ld h,(hl)
    ld l,e
    ld d,h
    res 5,d

    call BLOCKX_MOVE32 ;executed 128 times on each game loop
    jp pe,n_raster
    exx ; this code is executed 16 times on each game loop
;;;; ld c,32 ; removed: this was set for usage in a different routine
    call BLOCKX_MOVE32
    exx
    inc b
n_raster:
    pop hl
    inc l
    jr nz,raster

;Note the a register is not used in either routine

---------------------------------------------------------------------

BLOCKX_MOVE32:
    rept 32
    ldi
    endm
    ret

ADDENDUM: There are multiple references throughout these posts to BLOCK_MOVE32 or BLOCK_MOVE31. I will go through all the posts and change the conflicting labels. In this post the label is now called BLOCKX_MOVE32.

Edited July 18, 2019 by Norman Sword

IRF · July 17, 2019

Thanks Norman!

I believe that relies on the fact that the LDI command resets the Overflow Flag if (and only if) the value of BC reaches zero after the operation?
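That reading can be tried out in a small model of the loop's bookkeeping. The sketch below is Python, not Z80 code; `ldi_block` and `count_attribute_rows` are made-up names, and the model only tracks BC and the P/V flag (LDI decrements BC and leaves P/V set unless BC has just reached zero), ignoring the bytes actually copied.

```python
def ldi_block(bc: int, count: int = 32):
    """Model the BC and P/V side effects of `count` consecutive LDIs."""
    pv = False
    for _ in range(count):
        bc = (bc - 1) & 0xFFFF   # LDI decrements the 16-bit BC
        pv = bc != 0             # P/V is reset only when BC has reached zero
    return bc, pv


def count_attribute_rows(rasters: int = 128, start_bc: int = 128) -> int:
    """Count how often the attribute copy runs over `rasters` raster lines."""
    bc = start_bc
    rows = 0
    for _ in range(rasters):
        bc, pv = ldi_block(bc)            # call BLOCKX_MOVE32
        if not pv:                        # jp pe,n_raster is not taken
            rows += 1                     # attribute row is copied here
            b = ((bc >> 8) + 1) & 0xFF    # inc b
            bc = (b << 8) | (bc & 0xFF)   # BC becomes 256 = 8 * 32
    return rows


print(ldi_block(128))                      # BC and P/V after the first raster line
print(count_attribute_rows())              # attribute rows per game loop
print(count_attribute_rows(start_bc=100))  # a start value that is not a multiple of 32
```

In this model a start value that is not a multiple of 32 makes BC pass through zero in the middle of a block, so P/V is set again by the time the RET is reached and the attribute branch never runs, which fits the "must be a multiple of 32" comment in the listing.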
