jetsetdanny Posted May 22, 2019 Report Share Posted May 22, 2019 View File Jet Set Willy - Mark Woodmass's fast version Back in 2002, Mark Woodmass created a version of "Jet Set Willy" which speeds up the frame rate by replacing the slow LDIR instructions in the JSW game engine with a stack method to copy the screen data to the Spectrum's video RAM. He described the game as being "marginally faster but not as much as he'd hoped for". A ZIP file with his notes and the game in TAP format can be downloaded from http://www.oocities.org/andrewbroad/spectrum/download/fast.zip. The changes to the game engine provoke a critical problem with the ropes, because the stack-copying code overwrites the rope trajectory data at #8300. John Elliott has released a patch to fix this bug, which can be downloaded from http://www.seasip.info/Jsw/patches.html. Once the ropes problem has been fixed, the game is still impossible to complete, because it suffers from the four critical bugs which make the original "JSW" incompletable. The ZIP file of this download contains fixed, completable files of the game in TAP and TZX formats. The file was fixed by applying John Elliott's patch and the four official Software Projects POKEs (#EB47,00 - removes a Fire cell from "Conservatory Roof"; #E9FD,52 - fixes the 'Attic Bug' arrow; #A4C7,0B - moves the invisible item from "First Landing" to "The Hall"; #DE2C,04 - Changes an Earth cell in the Banyan Tree to a Water cell so the tree is climbable). I did mean to make this ZIP available for download here ever since I created an individual page of this version of the game on JSW Central back in March 2018, but it slipped my mind - I've just remembered it was one of the things I planned to do. So here it is :). Submitter jetsetdanny Submitted 05/23/2019 Category Jet Set Willy [Patched] Spider 1 Quote Link to comment Share on other sites More sharing options...
Norman Sword Posted May 27, 2019 Report Share Posted May 27, 2019 (edited) screen copies:- Here I will talk about the main feature of this version, which is the usage of a stack copy routine.Whilst the game is playing, the program has a multitude of screen data to copy. These copies are the main slowing down of the game. The amount of data that is copied consists of an attribute screen and a pixel or data screen. These are copied repeatedly. The order of copies might not be exactly as I write them out.Each game loop:-Starts by copying the Master Attribute area (512 bytes) to the working Attribute area of ramIt then copies the Master Pixel area (4096 bytes) to the working pixel area of ramThe body of game -- moves and draws the spritesThen game loop continues and:-Copies the working pixel area of ram to the Viewable screen (4096 bytes)Followed by the working attribute area of ram to the Viewable screen attribute area of ram (512 bytes)Goes back and repeats each game loopThe above is moving for the copying of screens a total of 9216 bytes This shows a lot of memory is copied on each game loop, and ways to speed this up, will have a dramatic effect on the games overall speedSo from here on I will partly discus the main methods for copying huge chunks of memory. The three most commonly used on a spectrum are1) LDIR 2)LDI 3)stack copy. written in normal order of speed The best way to look at a copying routine, is to look at how long each method takes to move one byte. A copy routine is after all just moving bytes of memory.. Here I will split the time into two parts, the first part is the time it takes to move the single byte in the idealised form. And then the overhead needed to perform the copy in a loop. The reason I split the timing is because the idealised part of the timing has to be performed. The idealised part will always be written and we have no option but to write and execute that part of the code. The overheads are the parts that can be changed by the programmer and the overheads are the programmers responsibility. . The idealised speed of moving a single byte. The idealised times are the times that the basic op code(s) takes to move one byte . Idealised times . LDIR 21 T states per byte. LDI 16 T states per byte. Stack copy - pop and push - 2 bytes at 10.5 T states per byte.. The overheads for the loop With LDIR their are no loop overheads. We set the up the data for the op code, then execute it. The LDIR executes its own internal loop .With LDI the main overheads are dictated by how much the LDI is rolled out. Fastest possible loop, using inline code has only one op code to perform the loop.With a stack copy we need lots of op codes to perform the loop. The amount used is up to the programmers imagination. Here the only code looked at, is the code used by Mark Woodmass.Total time to move the byte adding the overheads LDIR -- this stays at 21 T-states, being the sum of the idealised op coded time and no overheads . LDI -- this would normally be written as a loop macro- but for clarity I will write it out I have picked a roll out of 32 LDI's, the more we have the faster it is. But having more and more rolled out brings decreasing gains. I have picked 32 as a good starting point. The copy loop using LDIloop: LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI LDI jp pe,loop 10 T-states. here we have added for every 32 bytes moved a "jp pe,loop" into the loop overheads.This takes 10 T-states. This overhead is divided by how many bytes where moved in the loop. so the overheads are 10 T-state/32 or 0.3125 T-states or (0.3 T-state approx) , so negligible compared to the 16 T-states of each byte moved. This is moving a byte in around 16.3 T-states. . .Stack copy using Pop push. now lets look at the stack copy as used by Mark Woodmass the body of the loop is ; overhead T-state / Total T-states loop: ld sp,ix 10T 10T pop hl pop de pop bc pop af ex af,af' 4T 14T exx 4T 18T pop hl pop de pop bc pop af ld sp,iy 10T 28T push af push bc push de push hl exx 4T 32T ex af,af' 4T 36T push af push bc push de push hl ld de,$10 10T 46T add ix,de 15T 61T add iy,de 15T 76T ld hl,(counter) 16T 82T 92T << correction of addition. (subsequent text altered to correct mistake) dec (hl) 11T 93T 103T jr nz,loop 12T 105T 115TWe move 16 bytes , but we have an overhead to add onto the pop and push ideal of 10.5 T-states The overhead per byte is 115/16 T-states or 7.1875 T-states (7.2 approx) This routine moves every byte at an idealised speed of 10.5 T-states, to which we now add the overheads per byte of 7.2 T-states.Giving an overall speed of 17.7 T-states for each byte moved. -------------------Summery of total times taken to move a single byte in T-states Total time Ideal Time overheadsLDIR 21 T-states 21 T-states 0LDI 16.3 T-states 16 T-states 0.3 Stack Copy 17.7 T-states 10.5 T-states 7.2 T-states (Mark Woodmass version) Stack Copy 13.1 T-states 10.5 T-states 2.6 T-states (last version I wrote. See addendum)------------------------- The version of stack copy used in this version of the game has managed to be slower than a very very simple rolled out LDI loop This is a good example of being told that the stack copy method is the fastest method, and then writing some code to show how the version you write, manages to be slower.*****************************************************************************************************************************************************Addendum:-In order to make a stack copy routine fast, we need to reduce the overheads to a minimum. This is best achieved by rolling the code out in a similar way that LDI is rolled out to cut the overheads. With stack copy rolling out the code consumes a lot of ram. But if we want the benefit of the speed the routine can bring. Then we must roll the code out and sacrifice the ram. In the last version of a stack copy I wrote. The main loop was slightly under 300 bytes in size. but the overheads per byte was around 2.6 T-states (the figures quoted are from a quick look at some of my code. the exact figures are unimportant. I am just trying to show how far the overheads can be reduced)******************************************************************************************************************************************************* NOTE** The idealised times for stack copy can be influenced by using the of IX and IY registers as part of the push pop registers. The above is an attempt at simplicity. We look only at the overheads and ignore the registers used. (I am not trying to explain in great detail) **NOTE** In the original timing for "LDI". I used 1/3 this has been changed from 1/3 to .3. to correct inconsistency in numeric notation. Edited May 29, 2019 by Norman Sword IRF and Spider 2 Quote Link to comment Share on other sites More sharing options...
IRF Posted May 28, 2019 Report Share Posted May 28, 2019 add iy,de 15T 76T ld hl,(counter) 16T 82T 76+16=92, but in any case doesn't the LD HL, (counter) require 20 T-States (source: http://z80-heaven.wikidot.com/instructions-set:ld ) Plugging that in yields about 8 T-States for the overhead per byte in the Stack Copy method, therefore a total of 18.5 T-States per byte. i.e. the LDI method is even more efficient in comparison with the Stack Copy method than your rough calculations suggest(?) Spider 1 Quote Link to comment Share on other sites More sharing options...
Norman Sword Posted May 28, 2019 Report Share Posted May 28, 2019 (edited) Checking the ld hl,(counter)Zilog claim 16T states. So that's good enough for me. This was not a clinical dissection, it was me wondering why the game ran so slow. With the heading a stack copy version of Jet Set Willy I was expecting a bit of zip in the movement.The addition mistake and the subsequent error I will correct. =============================================Just to be doubly sure I have dug out an old copy of Rodney Zaks. Page 33416 T states Edited May 28, 2019 by Norman Sword IRF, Spider and jetsetdanny 3 Quote Link to comment Share on other sites More sharing options...
IRF Posted July 23, 2019 Report Share Posted July 23, 2019 Further to the above, the Z80 Heaven website which I quoted makes no distinction between the number of T-States for LD HL, (xx) or LD IX, (xx) hl, (XX) 20 ix, (XX) 20 iy, (XX) 20 (XX), hl 20 (XX), rr 20 (XX), ix 20 (XX), iy 20 Clearly the source I quoted is wrong (and Norman's source was right), because the shift byte inevitably slows down the operation (by 4 T-States). The Z80 Heaven site also fails to give a T-State value (correct or otherwise) for LD rr, (XX) operations e.g. LD BC, (XX). Which presumably does take 20 T-States because again there's a shift opcode involved. (EDIT: Another source confirms this: http://z80.info/z80flag.htm ) **** An unrelated query: I've often wondered? why the operand of relative jumps is sometimes given in the format &4546 e.g. at http://www.z80.info/z80oplist.txt you see these examples: 10 DJNZ &4546 18 JR &4546 etc Spider 1 Quote Link to comment Share on other sites More sharing options...
IRF Posted July 23, 2019 Report Share Posted July 23, 2019 An unrelated query: I've often wondered? why the operand of relative jumps is sometimes given in the format &4546 e.g. at http://www.z80.info/z80oplist.txt you see these examples: 10 DJNZ &4546 18 JR &4546 etc Actually, I've just noticed that the author of the above link is none other than J.G. Harston, occasional visitor to this parish! I wonder if Jonathan will read this, and if so, would he be able to explain the rationale behind the format of the operands which I referred to above? Spider 1 Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.