[File] Jet Set Willy - Mark Woodmass's fast version

jetsetdanny · May 22, 2019

Jet Set Willy - Mark Woodmass's fast version

Back in 2002, Mark Woodmass created a version of "Jet Set Willy" which speeds up the frame rate by replacing the slow LDIR instructions in the JSW game engine with a stack method to copy the screen data to the Spectrum's video RAM. He described the game as being "marginally faster but not as much as he'd hoped for". A ZIP file with his notes and the game in TAP format can be downloaded from http://www.oocities.org/andrewbroad/spectrum/download/fast.zip.

The changes to the game engine provoke a critical problem with the ropes, because the stack-copying code overwrites the rope trajectory data at #8300. John Elliott has released a patch to fix this bug, which can be downloaded from http://www.seasip.info/Jsw/patches.html.

Once the ropes problem has been fixed, the game is still impossible to complete, because it suffers from the four critical bugs which make the original "JSW" incompletable.

The ZIP file of this download contains fixed, completable files of the game in TAP and TZX formats. The file was fixed by applying John Elliott's patch and the four official Software Projects POKEs (#EB47,00 - removes a Fire cell from "Conservatory Roof"; #E9FD,52 - fixes the 'Attic Bug' arrow; #A4C7,0B - moves the invisible item from "First Landing" to "The Hall"; #DE2C,04 - Changes an Earth cell in the Banyan Tree to a Water cell so the tree is climbable).

I did mean to make this ZIP available for download here ever since I created an individual page of this version of the game on JSW Central back in March 2018, but it slipped my mind - I've just remembered it was one of the things I planned to do. So here it is :).

Submitter

jetsetdanny
Submitted

05/23/2019
Category

Jet Set Willy [Patched]

Norman Sword · May 27, 2019

screen copies:-

Here I will talk about the main feature of this version, which is the usage of a stack copy routine.

Whilst the game is playing, the program has a multitude of screen data to copy. These copies are the main slowing down of the game. The amount of data that is copied consists of an attribute screen and a pixel or data screen. These are copied repeatedly. The order of copies might not be exactly as I write them out.

Each game loop:-
Starts by copying the Master Attribute area (512 bytes) to the working Attribute area of ram
It then copies the Master Pixel area (4096 bytes) to the working pixel area of ram

The body of game -- moves and draws the sprites

Then game loop continues and:-
Copies the working pixel area of ram to the Viewable screen (4096 bytes)
Followed by the working attribute area of ram to the Viewable screen attribute area of ram (512 bytes)

Goes back and repeats each game loop

The above is moving for the copying of screens a total of 9216 bytes

This shows a lot of memory is copied on each game loop, and ways to speed this up, will have a dramatic effect on the games overall speed
So from here on I will partly discus the main methods for copying huge chunks of memory. The three most commonly used on a spectrum are
1) LDIR 2)LDI 3)stack copy. written in normal order of speed

The best way to look at a copying routine, is to look at how long each method takes to move one byte. A copy routine is after all just moving bytes of memory.
.

Here I will split the time into two parts, the first part is the time it takes to move the single byte in the idealised form. And then the overhead needed to perform the copy in a loop. The reason I split the timing is because the idealised part of the timing has to be performed. The idealised part will always be written and we have no option but to write and execute that part of the code. The overheads are the parts that can be changed by the programmer and the overheads are the programmers responsibility.
.

The idealised speed of moving a single byte
.
The idealised times are the times that the basic op code(s) takes to move one byte
.
Idealised times
.
LDIR 21 T states per byte
.
LDI 16 T states per byte
.
Stack copy - pop and push - 2 bytes at 10.5 T states per byte
.
.

The overheads for the loop

With LDIR their are no loop overheads. We set the up the data for the op code, then execute it. The LDIR executes its own internal loop

.
With LDI the main overheads are dictated by how much the LDI is rolled out.

Fastest possible loop, using inline code has only one op code to perform the loop.

With a stack copy we need lots of op codes to perform the loop. The amount used is up to the programmers imagination. Here the only code looked at, is the code used by Mark Woodmass.

Total time to move the byte adding the overheads

LDIR -- this stays at 21 T-states, being the sum of the idealised op coded time and no overheads

.

LDI -- this would normally be written as a loop macro- but for clarity I will write it out

I have picked a roll out of 32 LDI's, the more we have the faster it is. But having more and more rolled out brings decreasing gains. I have picked 32 as a good starting point.

The copy loop using LDI

loop:

LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI
   LDI

jp pe,loop 10 T-states
.

here we have added for every 32 bytes moved a "jp pe,loop" into the loop overheads
.
This takes 10 T-states. This overhead is divided by how many bytes where moved in the loop
.

so the overheads are 10 T-state/32 or 0.3125 T-states or (0.3 T-state approx) , so negligible compared to the 16 T-states of each byte moved.

This is moving a byte in around 16.3 T-states
.

.

.
Stack copy using Pop push.

now lets look at the stack copy as used by Mark Woodmass

the body of the loop is

; overhead T-state / Total T-states

loop:

ld sp,ix 10T 10T
pop hl
pop de
pop bc
pop af
ex af,af' 4T 14T
exx 4T 18T
pop hl
pop de
pop bc
pop af
ld sp,iy 10T 28T
push af
push bc
push de
push hl
exx 4T 32T
ex af,af' 4T 36T
push af
push bc
push de
push hl
ld de,$10 10T 46T
add ix,de 15T 61T
add iy,de 15T 76T
ld hl,(counter) 16T ~~82T~~ 92T << correction of addition. (subsequent text altered to correct mistake)
dec (hl) 11T ~~93T~~ 103T
jr nz,loop 12T ~~105T~~ 115T

We move 16 bytes , but we have an overhead to add onto the pop and push ideal of 10.5 T-states

The overhead per byte is 115/16 T-states or 7.1875 T-states (7.2 approx)

This routine moves every byte at an idealised speed of 10.5 T-states, to which we now add the overheads per byte of 7.2 T-states.
Giving an overall speed of 17.7 T-states for each byte moved.

-------------------

Summery of total times taken to move a single byte in T-states

Total time Ideal Time overheads

LDIR 21 T-states 21 T-states 0

LDI 16.3 T-states 16 T-states 0.3

Stack Copy 17.7 T-states 10.5 T-states 7.2 T-states (Mark Woodmass version)

Stack Copy 13.1 T-states 10.5 T-states 2.6 T-states (last version I wrote. See addendum)
-------------------------

The version of stack copy used in this version of the game has managed to be slower than a very very simple rolled out LDI loop

This is a good example of being told that the stack copy method is the fastest method, and then writing some code to show how the version you write, manages to be slower.

*****************************************************************************************************************************************************
Addendum:-

In order to make a stack copy routine fast, we need to reduce the overheads to a minimum. This is best achieved by rolling the code out in a similar way that LDI is rolled out to cut the overheads. With stack copy rolling out the code consumes a lot of ram. But if we want the benefit of the speed the routine can bring. Then we must roll the code out and sacrifice the ram.

In the last version of a stack copy I wrote. The main loop was slightly under 300 bytes in size. but the overheads per byte was around 2.6 T-states (the figures quoted are from a quick look at some of my code. the exact figures are unimportant. I am just trying to show how far the overheads can be reduced)
*****************************************************************************************************************************************************

** NOTE** The idealised times for stack copy can be influenced by using the of IX and IY registers as part of the push pop registers. The above is an attempt at simplicity. We look only at the overheads and ignore the registers used. (I am not trying to explain in great detail)

**NOTE** In the original timing for "LDI". I used 1/3 this has been changed from 1/3 to .3. to correct inconsistency in numeric notation.

Edited May 29, 2019 by Norman Sword

IRF · May 28, 2019

add iy,de 15T 76T
ld hl,(counter) 16T 82T

76+16=92, but in any case doesn't the LD HL, (counter) require 20 T-States (source: http://z80-heaven.wikidot.com/instructions-set:ld )

Plugging that in yields about 8 T-States for the overhead per byte in the Stack Copy method, therefore a total of 18.5 T-States per byte.

i.e. the LDI method is even more efficient in comparison with the Stack Copy method than your rough calculations suggest(?)

Norman Sword · May 28, 2019

Checking the

ld hl,(counter)

Zilog claim 16T states. So that's good enough for me.

This was not a clinical dissection, it was me wondering why the game ran so slow. With the heading a stack copy version of Jet Set Willy I was expecting a bit of zip in the movement.

The addition mistake and the subsequent error I will correct.
=============================================

Just to be doubly sure I have dug out an old copy of Rodney Zaks.

Page 334

16 T states

Edited May 28, 2019 by Norman Sword

IRF · July 23, 2019

Further to the above, the Z80 Heaven website which I quoted makes no distinction between the number of T-States for LD HL, (xx) or LD IX, (xx)

hl, (XX) 20 ix, (XX) 20 iy, (XX) 20 (XX), hl 20 (XX), rr 20 (XX), ix 20 (XX), iy 20

Clearly the source I quoted is wrong (and Norman's source was right), because the shift byte inevitably slows down the operation (by 4 T-States).

The Z80 Heaven site also fails to give a T-State value (correct or otherwise) for LD rr, (XX) operations e.g. LD BC, (XX). Which presumably does take 20 T-States because again there's a shift opcode involved. (EDIT: Another source confirms this: http://z80.info/z80flag.htm )

****

An unrelated query: I've often wondered? why the operand of relative jumps is sometimes given in the format &4546

e.g. at http://www.z80.info/z80oplist.txt you see these examples:

10 DJNZ &4546

18 JR &4546

etc

IRF · July 23, 2019

An unrelated query: I've often wondered? why the operand of relative jumps is sometimes given in the format &4546

e.g. at http://www.z80.info/z80oplist.txt you see these examples:

10 DJNZ &4546

18 JR &4546

etc

Actually, I've just noticed that the author of the above link is none other than J.G. Harston, occasional visitor to this parish! I wonder if Jonathan will read this, and if so, would he be able to explain the rationale behind the format of the operands which I referred to above?

Sign In

[File] Jet Set Willy - Mark Woodmass's fast version

Recommended Posts

jetsetdanny

Jet Set Willy - Mark Woodmass's fast version

Link to comment

Share on other sites

Norman Sword

Link to comment

Share on other sites

IRF

Link to comment

Share on other sites

Norman Sword

Link to comment

Share on other sites

IRF

Link to comment

Share on other sites

IRF

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members

Guides

Activity

Other

Important Information