Jump to content
Jet Set Willy & Manic Miner Community

[File] Jet Set Willy - Mark Woodmass's fast version


Recommended Posts

Jet Set Willy - Mark Woodmass's fast version


Back in 2002, Mark Woodmass created a version of "Jet Set Willy" which speeds up the frame rate by replacing the slow LDIR instructions in the JSW game engine with a stack method to copy the screen data to the Spectrum's video RAM. He described the game as being "marginally faster but not as much as he'd hoped for". A ZIP file with his notes and the game in TAP format can be downloaded from http://www.oocities.org/andrewbroad/spectrum/download/fast.zip.

 

The changes to the game engine provoke a critical problem with the ropes, because the stack-copying code overwrites the rope trajectory data at #8300. John Elliott has released a patch to fix this bug, which can be downloaded from http://www.seasip.info/Jsw/patches.html.

 

Once the ropes problem has been fixed, the game is still impossible to complete, because it suffers from the four critical bugs which make the original "JSW" incompletable.

 

The ZIP file of this download contains fixed, completable files of the game in TAP and TZX formats. The file was fixed by applying John Elliott's patch and the four official Software Projects POKEs (#EB47,00 - removes a Fire cell from "Conservatory Roof"; #E9FD,52 - fixes the 'Attic Bug' arrow; #A4C7,0B - moves the invisible item from "First Landing" to "The Hall"; #DE2C,04 - Changes an Earth cell in the Banyan Tree to a Water cell so the tree is climbable).

 

I did mean to make this ZIP available for download here ever since I created an individual page of this version of the game on JSW Central back in March 2018, but it slipped my mind - I've just remembered it was one of the things I planned to do. So here it is :).


 

Link to comment
Share on other sites

screen copies:-
 

Here I will talk about the main feature of this version, which is the usage of a stack copy routine.

Whilst the game is playing, the program has a multitude of screen data to copy. These copies are the main slowing down of the game. The amount of data that is copied consists of an attribute screen and a pixel or data screen. These are copied repeatedly. The order of copies might not be exactly as I write them out.

Each game loop:-
Starts by copying the Master Attribute area (512 bytes) to the working Attribute area of ram
It then copies the Master Pixel area (4096 bytes) to the working pixel area of ram

The body of game -- moves and draws the sprites

Then game loop continues and:-
Copies the  working pixel area of ram to the Viewable screen (4096 bytes)
Followed by the working attribute area of ram to the Viewable screen attribute area of ram (512 bytes)

Goes back and repeats each game loop


The above is moving for the copying of screens a total of 9216 bytes

 

This shows a lot of memory is copied on each game loop, and ways to speed this up, will have a dramatic effect on the games overall speed
So from here on I will partly discus the main methods for copying huge chunks of memory. The three most commonly used on a spectrum are
1) LDIR 2)LDI 3)stack copy. written in normal order of speed

 
The best way to look at a copying routine, is to look at how long each method takes to move one byte. A copy routine is after all just moving bytes of memory.
.

Here I will split the time into two parts, the first part is the time it takes to move the single byte in the idealised form. And then the overhead needed to perform the copy in a loop. The reason I split the timing is because the idealised part of the timing has to be performed. The idealised part will always be written and we have no option but to write and execute that part of the code. The overheads are the parts that can be changed by the programmer and the overheads are the programmers responsibility. 
.

The idealised speed of moving a single byte

The idealised times are the times that the basic op code(s) takes to move one byte  

Idealised times 

  LDIR      21 T states per byte

  LDI       16 T states per byte

Stack copy  - pop and push - 2 bytes  at 10.5 T states per byte
.
.

The overheads for the loop

 

With LDIR  their are no loop overheads. We set the up the data for the op code, then execute it. The LDIR executes its own internal loop

.
With LDI the main overheads are dictated by how much the LDI is rolled out.

Fastest possible loop, using inline code has only one op code to perform the loop.

With a stack copy we need lots of op codes to perform the loop. The amount used is up to the programmers imagination. Here the only code looked at, is the code used by Mark Woodmass.

Total time to move the byte adding the overheads

 

LDIR  -- this stays at 21 T-states, being the sum of the idealised op coded time and no overheads

.  

LDI -- this would normally be written as a loop macro- but for clarity I will write it out

I have picked a roll out of 32 LDI's, the more we have the faster it is. But having more and more rolled out brings decreasing gains. I have picked 32 as a good starting point. 

The copy loop using LDI

loop:

   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI

           jp  pe,loop     10 T-states
.

here we have added for every 32 bytes moved a "jp pe,loop" into the loop overheads
.
This takes 10 T-states. This overhead is divided by how many bytes where moved in the loop
.

so the overheads are 10 T-state/32 or  0.3125 T-states  or (0.3 T-state approx) , so negligible compared to the 16 T-states of each byte moved.

This is moving a byte in around 16.3 T-states
.

.

.
Stack copy  using Pop push.

now lets look at the stack copy as used by Mark Woodmass

the body of the loop is

 

 

;          overhead           T-state / Total T-states

 

loop:

           ld sp,ix              10T  10T
 pop hl
 pop de
 pop bc
 pop af
          ex af,af'              4T  14T
          exx                    4T  18T
 pop hl
 pop de
 pop bc
 pop af
          ld sp,iy               10T  28T
 push af
 push bc
 push de
 push hl
           exx                   4T    32T
           ex af,af'             4T    36T
 push af
 push bc
 push de
 push hl
            ld de,$10        10T      46T
            add ix,de        15T      61T
            add iy,de        15T      76T
            ld hl,(counter)  16T      82T   92T  << correction of addition. (subsequent text altered to correct mistake)
            dec (hl)           11T      93T   103T
            jr nz,loop        12T      105T  115T

We move 16 bytes , but we have an overhead to add onto the pop and push ideal of 10.5 T-states

The overhead per byte is 115/16 T-states or  7.1875 T-states (7.2 approx)

This routine moves every byte at an idealised speed of 10.5 T-states, to which we now add the overheads per byte of 7.2 T-states.
Giving an overall speed of 17.7 T-states for each byte moved.

-------------------

Summery of total times taken to move a single byte in T-states

                  Total time                 Ideal Time               overheads

LDIR            21 T-states             21 T-states              0

LDI              16.3 T-states          16 T-states               0.3 

Stack Copy 17.7 T-states           10.5 T-states            7.2 T-states                    (Mark Woodmass version)

Stack Copy 13.1 T-states          10.5 T-states            2.6 T-states                    (last version I wrote. See addendum)
-------------------------
 

The version of stack copy used in this version of the game has managed to be slower than a very very simple rolled out LDI loop

This is a good example of being told that the stack copy method is the fastest method, and then writing some code to show how the version you write, manages to be slower.



*****************************************************************************************************************************************************
Addendum:-

In order to make a stack copy routine fast, we need to reduce the overheads to a minimum. This is best achieved by rolling the code out in a similar way that LDI is rolled out to cut the overheads. With stack copy rolling out the code consumes a lot of ram. But if we want the benefit of the speed the routine can bring. Then we must roll the code out and sacrifice the ram. 

In the last version of a stack copy I wrote. The main loop was slightly under 300 bytes in size. but the overheads per byte was around 2.6 T-states (the figures quoted are from a quick look at some of my code. the exact figures are unimportant. I am just trying to show how far the overheads can be reduced)
*****************************************************************************************************************************************************


** NOTE**  The idealised times for stack copy can be influenced by using the of IX and IY registers as part of the push pop registers. The above is an attempt at simplicity. We look only at the overheads and ignore the registers used. (I am not trying to explain in great detail) 

**NOTE** In the original timing for "LDI".  I used 1/3 this has been changed from 1/3 to .3. to correct inconsistency in numeric notation. 
 


 

Edited by Norman Sword
Link to comment
Share on other sites

 


add iy,de        15T      76T
            ld hl,(counter)  16T      82T

 

76+16=92, but in any case doesn't the LD HL, (counter) require 20 T-States (source: http://z80-heaven.wikidot.com/instructions-set:ld )

 

Plugging that in yields about 8 T-States for the overhead per byte in the Stack Copy method, therefore a total of 18.5 T-States per byte.

i.e. the LDI method is even more efficient in comparison with the Stack Copy method than your rough calculations suggest(?)

Link to comment
Share on other sites

Checking the

 

ld hl,(counter)

Zilog claim 16T states. So that's good enough for me. 

This was not a clinical dissection, it was me wondering why the game ran so slow. With the heading a stack copy version of Jet Set Willy I was expecting a bit of zip in the movement.

The addition mistake and the subsequent error I will correct. 
=============================================

Just to be doubly sure I have dug out an old copy of Rodney Zaks. 

Page 334

16 T states

post-125-0-14263000-1559074075_thumb.jpg

post-125-0-88499500-1559074092_thumb.jpg

Edited by Norman Sword
Link to comment
Share on other sites

  • 1 month later...

Further to the above, the Z80 Heaven website which I quoted makes no distinction between the number of T-States for LD HL, (xx) or LD IX, (xx)

 

hl, (XX) 20 ix, (XX) 20 iy, (XX) 20 (XX), hl 20 (XX), rr 20 (XX), ix 20 (XX), iy 20

 

Clearly the source I quoted is wrong (and Norman's source was right), because the shift byte inevitably slows down the operation (by 4 T-States).

 

The Z80 Heaven site also fails to give a T-State value (correct or otherwise) for LD rr, (XX) operations e.g. LD BC, (XX).  Which presumably does take 20 T-States because again there's a shift opcode involved.  (EDIT: Another source confirms this: http://z80.info/z80flag.htm )

 

 

****

 

An unrelated query: I've often wondered? why the operand of relative jumps is sometimes given in the format &4546

 

e.g. at http://www.z80.info/z80oplist.txt you see these examples:

 

10 DJNZ &4546

18 JR &4546

etc

Link to comment
Share on other sites

An unrelated query: I've often wondered? why the operand of relative jumps is sometimes given in the format &4546

 

e.g. at http://www.z80.info/z80oplist.txt you see these examples:

 

10 DJNZ &4546

18 JR &4546

etc

 

Actually, I've just noticed that the author of the above link is none other than J.G. Harston, occasional visitor to this parish!  I wonder if Jonathan will read this, and if so, would he be able to explain the rationale behind the format of the operands which I referred to above?

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.