jetsetdanny Posted April 13, 2021 Report Share Posted April 13, 2021 That's strange. I *think* if something like this happens, there must be a reason for it. It can't be that one post gets archived/restored properly, and another one - doesn't, without some underlying conditions. Could it be that you don't have enough information, e.g. because the underlying condition is not related to the posts themselves (like their structure, etc.) but to the way the backup was made? I am just thinking aloud, trying to understand something that's rather puzzling... And yes, I also dread to think how many more posts might have been affected across the whole website. I almost prefer not to look to avoid despair 😒 andrewbroad and IRF 2 Quote Link to comment Share on other sites More sharing options...
crem Posted April 14, 2021 Report Share Posted April 14, 2021 Were all examples of truncated posts truncated on word boundary? If yes, I think maybe there are some Unicode characters that database doesn't understand? Like emojis with codepoint values larger than U+FFFF. Or maybe when copying/pasting around, some control characters (like byte order marks) were carried along, and database didn't like it? Do you know in which format the data is stored in database? Is it possible to look at database for raw posts? Another option could be that there's something weird at parsing level, not at database level (then in the database the posts are still intact). E.g. if there's some unmatched control codes (like [/bold] without [bold]), the post parser/formatter stops formatting. jetsetdanny and Spider 1 1 Quote Link to comment Share on other sites More sharing options...
Spider Posted April 14, 2021 Report Share Posted April 14, 2021 I did do a UTF conversion but that was after the 'missing' data, at least from what I can recall. I don't have a specific date for it unfortunately, not without examining several backup files, last years being compressed .gz's inside a folder for that month which as its last years is itself compressed into a huge single file. I've kept away from Cpanel's automated backup tool (invidivudla option rather than "take it all") as the last host I did notice there was a problem with truncated backups. I take them out via phpmyadmin manually and check they are complete, by this I mean the fact that the last table is present at the end of the file itself, I know this is not 100% however it is a reasonable guide compared to the truncated ones. There's nothing suitablte in WayBack machine, the collected data from that year iirc was before then and after in Sept it collected the 'we are offline' page twice, then things resumed. The backup I picked to restore the data to compare was the nearest "after the last post" I had. IRF and jetsetdanny 2 Quote Link to comment Share on other sites More sharing options...
Spider Posted April 14, 2021 Report Share Posted April 14, 2021 I mean to say utf8mb4 Pre conversion (some tables were already at this) , it only effected display. The parser would not be knocked unconscious by broken bbcode, I do know what you're referring to though. 🙂 I did get the data when I did the restore from the raw database via phpmyadmin directly 😉 Upon closer examination of a backup directly, the UTF8MB4 conversion appeared to have done this. Unfortunately had I delayed this, it would of happened anyway at the 3x to 4x upgrade. EDIT... Updated, read end of post info! Looking at post 11772 for instance here , I find the 'converted' (but pre-upgrade backup from last year) was truncated: 3x data (converted) 3x data (pre conversion) Unfortunately, as noted the conversion was not a recent thing, not even near the upgrade at least as I remember. 4x looks like this (just an empty post saying "test" really) I am able it seems to 'rescue' some posts, having found a pre-conversion example, despite having said otherwise previously. No guarantee's on this however i have just restored this post , with only a minor cleanup needed. IRF and jetsetdanny 2 Quote Link to comment Share on other sites More sharing options...
Spider Posted April 19, 2021 Report Share Posted April 19, 2021 Good / Neutral / Bad etc: https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=13393 🙂 IRF and jetsetdanny 2 Quote Link to comment Share on other sites More sharing options...
IRF Posted April 19, 2021 Author Report Share Posted April 19, 2021 (edited) On 4/13/2021 at 11:58 PM, jetsetdanny said: With your brilliant mind, do you see any pattern to the posts there are missing? Can you see any shared characteristics of the posts that are truncated (versus those that are OK)? Looking at the posts that are still truncated (after Andy managed to rescue some of the more substantial ones), a lot of them seem to involve the word "Roddelwald" getting truncated to "Rodd" - something which also afflicts the very title of the topic! - with all text after the "Rodd" being absent. But I can't think of anything regarding a double-d that would cause the problem to arise?!? Edited April 19, 2021 by IRF Spider 1 Quote Link to comment Share on other sites More sharing options...
Spider Posted April 19, 2021 Report Share Posted April 19, 2021 I thought about it. As best as I can tell (and I hate mysql charset concerns) , the issue was caused by mix encoding, that being that although most of the database was iirc ISO8859 , the checks appeared to fail during the conversion to UTF8MB4 as the tables themselves may of already "been there" , best way I can describe it sorry. Single (old) vs double(new) byte encoding per character in effect There's a few items about this kind of thing via Google, although the most relevant one seemed to be about a WordPress blog conversion, the user had a similar issue. The good news (as good as it can be) I would strongly suspect on this site that it is only going to effect posts that used what I'd call "non standard characters" so the actual 'missing content' is hopefully highly limited as not many members here use these. There are safety checks in the conversion routine however it seemed for reasons unknown to fail them. I can't post the code itself as its © , however I can post the comments in the relevant section that failed. Its quite likely this was updated in the core product some short time afterwards: Quote ... can have UTF8 Tables and Columns, but contain Latin1 data. We need to try and sniff this out and apply a different type of conversion, because in this instance you cannot just extract as BINARY and then insert into the source. Instead, you need to convert to latin1, then to binary, then to utf8. But you can't do that on actual UTF8 data because it will then truncate multibyte characters so we need to compare the conversion against what is stored ... I'd speculate here "nail/head" 🙂 The other good news really is as I'd performed the conversion well before the site's erm 'holiday' and then its restoration and upgrade on the newer host, there's no doubt as to when the problem happened. By this I mean there's no real question (as I can see it) of if it was an upgrade issue or a host issue or a conversion issue. It has to be the latter. Had I left the conversion until the upgrade it would of ran then anyway, potentially causing difficulty in (as we later found) discovering the likely root cause. So, summary all being well the damage -should- be relatively localised to said characters. 🙂 IRF 1 Quote Link to comment Share on other sites More sharing options...
IRF Posted April 19, 2021 Author Report Share Posted April 19, 2021 I think you hit the nail on the head when you referred to non-standard characters, Andy. 🙂 Because "Roddenwald" is actually spelt with an umlaut over the 'e', as in: "Roddënwald". There are still a few posts that remain truncated after "Rodd", namely these ones: https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/ (That's the first post in the topic, so the title of the topic is also affected.) https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11740 https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11742 https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11802 So it's very likely that the 'special e' (which comes just after the 'dd') was what caused the problem there. There are also a couple of posts where Danny seems to refer to Fabian (the author of the game) as 'Fabi'. Now, Danny isn't usually that informal, so I wonder whether the second 'a' in Fabian has an accent over it - i.e. Fabián? - and so those posts are also getting truncated because they involve another non-standard character? One of the posts potentially affected is this one, which ends with: "One can see that a lot of work has already gone into this project, and the results are impressive. I hope that you will continue it, Fabi" https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11773 However, I think that's one of the ones that you've already checked/fixed. Furthermore, Danny's final remark (in quotes above) seems to be made 'in conclusion', so I suspect the only thing still missing now from that post is the final two letters of Fabian's name. On the other hand, this post of Danny's is very short and ends with "Fabi", so there might be some missing stuff from here: https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11822 In conclusion, I think we've got to the bottom of the problem - which is a great relief! - and it probably isn't as widespread as we initially feared - which is even more of a relief!! Quote Link to comment Share on other sites More sharing options...
jetsetdanny Posted April 19, 2021 Report Share Posted April 19, 2021 33 minutes ago, IRF said: There are also a couple of posts where Danny seems to refer to Fabian (the author of the game) as 'Fabi'. Now, Danny isn't usually that informal, so I wonder whether the second 'a' in Fabian has an accent over it - i.e. Fabián? Yes, exactly, I did put an accent there as there should be one in the Spanish spelling of the name. Apparently, I shouldn't have bothered... 😯 Quote Link to comment Share on other sites More sharing options...
IRF Posted April 19, 2021 Author Report Share Posted April 19, 2021 I think it's safe to use such special characters now though - it's only a historic problem from before the site was moved over, as evidenced by the fact that I managed to post the special characters in my post earlier tonight. 👍 Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.