Jump to content
Jet Set Willy & Manic Miner Community

Truncated Posts


IRF

Recommended Posts

That's strange.

I *think* if something like this happens, there must be a reason for it. It can't be that one post gets archived/restored properly, and another one - doesn't, without some underlying conditions.

Could it be that you don't have enough information, e.g. because the underlying condition is not related to the posts themselves (like their structure, etc.) but to the way the backup was made?

I am just thinking aloud, trying to understand something that's rather puzzling...

And yes, I also dread to think how many more posts might have been affected across the whole website. I almost prefer not to look to avoid despair 😒

Link to comment
Share on other sites

Were all examples of truncated posts truncated on word boundary?

If yes, I think maybe there are some Unicode characters that database doesn't understand? Like emojis with codepoint values larger than U+FFFF. Or maybe when copying/pasting around, some control characters (like byte order marks) were carried along, and database didn't like it?

Do you know in which format the data is stored in database? Is it possible to look at database for raw posts?

Another option could be that there's something weird at parsing level, not at database level (then in the database the posts are still intact). E.g. if there's some unmatched control codes (like [/bold] without [bold]), the post parser/formatter stops formatting.

Link to comment
Share on other sites

I did do a UTF conversion but that was after the 'missing' data, at least from what I can recall. I don't have a specific date for it unfortunately, not without examining several backup files, last years being compressed .gz's inside a folder for that month which as its last years is itself compressed into a huge single file.

I've kept away from Cpanel's automated backup tool (invidivudla option rather than "take it all") as the last host I did notice there was a problem with truncated backups. I take them out via phpmyadmin manually and check they are complete, by this I mean the fact that the last table is present at the end of the file itself, I know this is not 100% however it is a reasonable guide compared to the truncated ones.

 

There's nothing suitablte in WayBack machine, the collected data from that year iirc was before then and after in Sept it collected the 'we are offline' page twice, then things resumed.

 

The backup I picked to restore the data to compare was the nearest "after the last post" I had.

Link to comment
Share on other sites

I mean to say utf8mb4

Pre conversion (some tables were already at this) , it only effected display. The parser would not be knocked unconscious by broken bbcode, I do know what you're referring to though. 🙂

I did get the data when I did the restore from the raw database via phpmyadmin directly 😉

Upon closer examination of a backup directly, the UTF8MB4 conversion appeared to have done this. Unfortunately had I delayed this, it would of happened anyway at the 3x to 4x upgrade.

EDIT... Updated, read end of post info!

Looking at post 11772 for instance here , I find the 'converted' (but pre-upgrade backup from last year) was truncated:

3x data (converted)

post_conversion_3x_data.png

3x data (pre conversion)

pre_conversion_3x2.png

Unfortunately, as noted the conversion was not a recent thing, not even near the upgrade at least as I remember.

4x looks like this (just an empty post saying "test" really)

4x.png

I am able it seems to 'rescue' some posts, having found a pre-conversion example, despite having said otherwise previously. No guarantee's on this however i have just restored this post , with only a minor cleanup needed.

Link to comment
Share on other sites

On 4/13/2021 at 11:58 PM, jetsetdanny said:

With your brilliant mind, do you see any pattern to the posts there are missing? Can you see any shared characteristics of the posts that are truncated (versus those that are OK)?

Looking at the posts that are still truncated (after Andy managed to rescue some of the more substantial ones), a lot of them seem to involve the word "Roddelwald" getting truncated to "Rodd" - something which also afflicts the very title of the topic! - with all text after the "Rodd" being absent.

But I can't think of anything regarding a double-d that would cause the problem to arise?!?

Edited by IRF
Link to comment
Share on other sites

I thought about it.

As best as I can tell (and I hate mysql charset concerns) , the issue was caused by mix encoding, that being that although most of the database was iirc ISO8859 , the checks  appeared to fail during the conversion to UTF8MB4 as the tables themselves may of already "been there" , best way I can describe it sorry. Single (old) vs double(new) byte encoding per character in effect

There's a few items about this kind of thing via Google, although the most relevant one seemed to be about a WordPress blog conversion, the user had a similar issue.

The good news (as good as it can be) I would strongly suspect on this site that it is only going to effect posts that used what I'd call "non standard characters" so the actual 'missing content' is hopefully highly limited as not many members here use these.

There are safety checks in the conversion routine however it seemed for reasons unknown to fail them. I can't post the code itself as its © , however I can post the comments in the relevant section that failed. Its quite likely this was updated in the core product some short time afterwards:

Quote

... can have UTF8 Tables and Columns, but contain Latin1 data.
We need to try and sniff this out and apply a different type of conversion, because in this instance you cannot just extract as BINARY and then insert into the source.
Instead, you need to convert to latin1, then to binary, then to utf8.
But you can't do that on actual UTF8 data because it will then truncate multibyte characters so we need to compare the conversion against what is stored ...

I'd speculate here "nail/head" 🙂

The other good news really is as I'd performed the conversion well before the site's erm 'holiday' and then its restoration and upgrade on the newer host, there's no doubt as to when the problem happened. By this I mean there's no real question (as I can see it) of if it was an upgrade issue or a host issue or a conversion issue. It has to be the latter. Had I left the conversion until the upgrade it would of ran then anyway, potentially causing difficulty in (as we later found) discovering the likely root cause.

So, summary all being well the damage -should- be relatively localised to said characters. 🙂

Link to comment
Share on other sites

I think you hit the nail on the head when you referred to non-standard characters, Andy. 🙂

Because "Roddenwald" is actually spelt with an umlaut over the 'e', as in: "Roddënwald".  There are still a few posts that remain truncated after "Rodd", namely these ones:

https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/
(That's the first post in the topic, so the title of the topic is also affected.)
https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11740
https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11742
https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11802

So it's very likely that the 'special e' (which comes just after the 'dd') was what caused the problem there.

There are also a couple of posts where Danny seems to refer to Fabian (the author of the game) as 'Fabi'.  Now, Danny isn't usually that informal, so I wonder whether the second 'a' in Fabian has an accent over it - i.e. Fabián? - and so those posts are also getting truncated because they involve another non-standard character?  One of the posts potentially affected is this one, which ends with: "One can see that a lot of work has already gone into this project, and the results are impressive. I hope that you will continue it, Fabi"
https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11773
However, I think that's one of the ones that you've already checked/fixed.  Furthermore, Danny's final remark (in quotes above) seems to be made 'in conclusion', so I suspect the only thing still missing now from that post is the final two letters of Fabian's name.

On the other hand, this post of Danny's is very short and ends with "Fabi", so there might be some missing stuff from here:
https://jswmm.co.uk/topic/532-jet-set-jason-in-rodd/?do=findComment&comment=11822

In conclusion, I think we've got to the bottom of the problem - which is a great relief! - and it probably isn't as widespread as we initially feared - which is even more of a relief!!

 

 

 

Link to comment
Share on other sites

33 minutes ago, IRF said:

There are also a couple of posts where Danny seems to refer to Fabian (the author of the game) as 'Fabi'.  Now, Danny isn't usually that informal, so I wonder whether the second 'a' in Fabian has an accent over it - i.e. Fabián?

Yes, exactly, I did put an accent there as there should be one in the Spanish spelling of the name. Apparently, I shouldn't have bothered... 😯

Link to comment
Share on other sites

I think it's safe to use such special characters now though - it's only a historic problem from before the site was moved over, as evidenced by the fact that I managed to post the special characters in my post earlier tonight. 👍

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.