An Embarrassingly Simple Way to Future Proof without Plain Text

When you say “plain”, I hope you mean that you only use English and, in fact, 7-bit ASCII.

2 Likes

Most long-time (United States?) programmers still think in 7-bit ASCII. So it is fortunate, and not surprising, that the most popular character encoding on the web is UTF-8, which for most practical purposes is the same as 7-bit ASCII. I wonder why they did it that way? :slightly_smiling_face:

1 Like

Ah, yes, but UTF-8 is only equivalent to 7-bit ASCII if you stick to 7-bit ASCII.
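
For example, a quick Python sketch of what “equivalent” means here: ASCII-only text produces exactly the same bytes whether you encode it as ASCII or as UTF-8, while anything outside the 7-bit range becomes a multi-byte sequence that plain ASCII simply cannot represent.

```python
# ASCII-only text: the UTF-8 bytes are byte-for-byte identical to the ASCII bytes.
ascii_text = "plain text"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Text outside the 7-bit range: UTF-8 switches to multi-byte sequences,
# so the two encodings are no longer interchangeable.
accented = "café"
print(accented.encode("utf-8"))        # b'caf\xc3\xa9'  (the é takes two bytes)
try:
    accented.encode("ascii")
except UnicodeEncodeError as err:
    print(err)                         # 'ascii' codec can't encode character '\xe9' ...
```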

UTF-8 was designed to support ancient, legacy systems, not to encourage the continued use of 7-bit ASCII. :laughing:

(Muttering darkly) You young whippersnappers! Always changing things (grumble grumble) No respect for your elders. :slightly_smiling_face:

3 Likes

:joy: It’s been a long time since I’ve been called that.

I’ve got my doubts about HTML. About 20 years ago I produced a book written by my father in HTML and distributed it on a CD. While modern browsers will read it, I’ve also put it on my website, and it gets loads of invalid-HTML hits from Google’s watchdog. It also really won’t display correctly on a mobile device. And modern HTML relies heavily on scripting languages, which could change, breaking the HTML pages.

For reference, here is a link to the HTML book William Almy and His Descendants. If you have a look at it, you’ll see that it benefits hugely from hyperlinks and that a printed version would have been more difficult to use. But it also looks crude compared to today’s HTML.

HTML has evolved some, but today’s websites wouldn’t look and feel modern either without CSS, JavaScript, and lots of graphics.

That’s not unexpected - but not any cause for concern, IMO.

The HTML standard has gone through several iterations, parts of its “vocabulary” and syntax have been deprecated, and an older document may not conform to the most recent spec or to what Google’s parser expects.

But that’s mostly a non-issue, or at most a minor one. User agents (web browsers) follow the robustness principle of being liberal in what they accept: they are designed to accept and render outdated or even invalid documents. HTML has been extended over the years and is extensible by design. When creating a document, you don’t even have to specify which version of HTML you’re following, and you could even make up your own HTML tags, insert them into the document, and it would still be rendered just fine. In this sense, the format is robust and fault-tolerant.
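
A quick sketch of that fault tolerance, using Python’s standard-library html.parser rather than a browser (the <family-tree> tag is invented for the example): a fragment containing a completely made-up tag is parsed without complaint, and its text content still comes through, roughly the way a browser treats unrecognised elements when it renders a page.

```python
from html.parser import HTMLParser

# A fragment mixing ordinary HTML with a made-up <family-tree> tag.
doc = "<p>Descendants of <family-tree>William Almy</family-tree> follow.</p>"

class TagLogger(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start tag:", tag)

    def handle_data(self, data):
        print("text run: ", repr(data))

# No exception and no validation failure: the unknown <family-tree> element
# is reported like any other tag, and the text inside it is still delivered.
TagLogger().feed(doc)
```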

Also, at its core it’s a well-documented, very human-readable text file (not a binary format) with open-source validators available - so even if something breaks rendering in a modern browser, you can very easily fix it with a few alterations in a text editor.

It does - but that could be improved with very minor edits (removing the IFRAME) and some CSS. HTML separates content from presentation - and any Markdown-exporting application should do a good job of that nowadays, which makes it easy.

To be clear, I’d definitely recommend staying away from relying on JavaScript or any other scripting language for the final rendering of HTML content. In fact, you may want to verify that by disabling JavaScript in your browser / document viewer.

It isn’t a good format for facsimiles (exact visual reproductions of a document), either. There are often subtle - and sometimes not-so-subtle - rendering differences among today’s commonly used user agents. If you want your archived file to look exactly as created and intended, PDF will be a much better format - maybe the only good one at that.

But I believe HTML is a good, robust, and fault-tolerant archival format for documents. It can preserve not only the structure of structured text but also embedded or linked media files (images, video, sound) in their original quality.

2 Likes