Saving, managing, and annotating readings/reference material in HTML (or: will we ever evolve past PDFs?)

ryanjamurphy · April 10, 2021, 2:23am

@anon85228692’s discovery of a Chrome extension to save webpages as single HTML files is very interesting.

I have thousands of PDFs… and I hate them all. The idea that so many key resources in the world are skeuomorphically archived—that everything needs to emulate paper—just seems like such a missed opportunity.

Sure, you can do some fancy things in PDFs. They can be a little interactive. You can add annotations. You can even sign forms with little vector signatures (whatever excites the lawyers, I guess!).

Ultimately, though, the format limits what’s possible. Annotations, for instance, are literal digital sticky notes or highlights, stuck to the document as if you’d put them in a piece of paper and stuffed them in a filing cabinet. The best options we have for interacting with PDF annotations either extract them and make them into something more robust for use (e.g., Skim, DEVONthink, Highlights) or add a database layer (e.g., MarginNote, LiquidText) to PDFs.

HTML, on the other hand, is incredibly flexible and powerful, made for the machines we’re using to view them. I’ve long wondered why there isn’t a better ecosystem out there taking advantage of HTML as an offline reference/resource format.

I’m hoping SingleFile is a sign of things to come. So I’m asking the MPU hive mind: what are the best resources out there to manage and annotate HTML-based resources? What would you hope to see in such an ecosystem? Will I ever escape PDFs?

(There are some other signs of an ecosystem emerging. Hypothes.is, Memex, and Readwise seem to be doing more with non-PDF references these days. Still, they mostly seem to be taking a database approach. I have yet to use these services, but I’m curious what others’ experiences with them have been.)

zkarj · April 10, 2021, 10:02pm

There’s another thread here (see below) where discussion of this very issue led me to accept Apple’s Web Archive format as a viable option, in the near to medium term at least, and on Apple platforms of course. Be sure to read the whole thread.

ryanjamurphy · April 11, 2021, 2:08am

Great thread! I must’ve overlooked it because it focused on app-based solutions (Pocket, Reeder, etc.) at first.

This quote of yours nails the problem:

Your rationale for sticking with .webarchive here certainly seems well-founded. I wonder if there’s someone at Apple we can write to ask about the status of the format. The VPs sometimes reply, after all… I’d be happy to send a message and report back if anyone can suggest who the best person to reach out to would be.

Here’s my new question, then. Whether you’re saving things in HTML or webarchive, what do you use to read/markup/annotate them? Outside of preserving semantics and adaptive presentation, have you been able to take advantage of the format further?

I am dreaming of the equivalent of a good PDF reader, but for webarchive/HTML format, with ways of adding richer comments and highlights (e.g., being able to link directly to an individual comment or highlight).

zkarj · April 11, 2021, 3:35am

I checked out the Firefox extension mentioned above and the first page I tried it on (a Wikipedia page) went well. The second page I tried it on (a forum page, not this forum) caused my MacBook Pro M1 to heat up!! I had to kill it in the end. I’m thinking it didn’t like all of the JavaScript ads.

OogieM · April 11, 2021, 6:29pm

I must have missed the thread on saving web page data. I use Rich text and am in the process of converting and cleaning up those with saved images and markdown files with links.

WebArchives are always way too cluttered for me. They never get all the images and graphics I want and often have lots of ones I don’t want. Print to a PDF file is an acceptable substitute some of the time but the resulting PDFs usually have lots of junk included. Save as Markdown excludes all images.

anon41602260 · April 11, 2021, 8:55pm

Sorry to be dense, but I’m missing something. “HTML” in my mind is the nuts and bolts markup language, scripts, CSS, and all the glop behind the scenes. In other words – it is, technically, a subset of this:

Surely, that’s not what you are talking about?

(Have you ever used Curio, by the way?)

zkarj · April 12, 2021, 11:39pm

HTML (with extras) is what lets you see the pages today on a 6K monitor and on a phone and anything in between. If done well, it is nearly infinitely adaptable. Once you make a PDF, that’s pretty much game over. Your content will have to be adapted to one size you choose now and that will be very inappropriate for either a 6K screen or a phone.

The trick is to encapsulate the HTML with the extras (CSS and images for the most part) so that you can keep an atomic copy of the resource that can still be adaptively rendered to whatever size is required in the future. This is what Web Archive and SingleFile strive to do.
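To make the encapsulation idea concrete, here is a rough Python sketch. It is not how Web Archive or the extension actually work; it only illustrates folding a page's stylesheets and images into the HTML itself, using `<style>` blocks and base64 `data:` URIs. The function name `inline_assets` and the `assets` mapping are made up for illustration, fetching is left out, and the regular expressions assume double-quoted attributes with `rel` before `href`, which a real tool would replace with a proper HTML parser.

```python
import base64
import mimetypes
import re


def inline_assets(html: str, assets: dict[str, bytes]) -> str:
    """Return html with known stylesheets and images embedded in place.

    `assets` (hypothetical) maps each URL exactly as written in the page
    to the bytes already downloaded for it.
    """

    def inline_css(match: re.Match) -> str:
        href = match.group(1)
        if href not in assets:
            return match.group(0)  # leave unknown references untouched
        return f"<style>{assets[href].decode('utf-8')}</style>"

    def inline_img(match: re.Match) -> str:
        src = match.group(2)
        if src not in assets:
            return match.group(0)
        mime = mimetypes.guess_type(src)[0] or "application/octet-stream"
        payload = base64.b64encode(assets[src]).decode("ascii")
        # Keep the rest of the tag (alt text, classes) exactly as it was.
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    # Simplification: only matches <link rel="stylesheet" href="...">.
    html = re.sub(
        r'<link[^>]*rel="stylesheet"[^>]*href="([^"]+)"[^>]*>', inline_css, html
    )
    html = re.sub(r'(<img[^>]*src=")([^"]+)("[^>]*>)', inline_img, html)
    return html
```

The result is a single file that a browser can still reflow for any screen size, because the markup and CSS survive rather than being flattened into fixed pages.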

I installed Ghostery and re-tried this page. It archived in seconds.

anon41602260 · April 13, 2021, 12:35am

I think you missed my point. I know what HTML is. One doesn’t easily create the sort of pages you mention without some software tool or tools to build that HTML. Surely, neither you nor Ryan is suggesting that people should manually “manage, create, annotate readings & reference material” by handcrafting HTML.

Pardon my denseness, but what’s the process and tool to build those HTML pages that you’re suggesting would replace PDFs?

zkarj · April 13, 2021, 1:26am

I think I did miss your point.

I read Ryan’s post as meaning “ways of archiving web content that may disappear from its origin” particularly because of the mention of a tool to do exactly that. So in this case, the content has already been created and people want to preserve it.

But if we go back to your point about how to expand on the content, then I see the problem. Annotation tools for PDFs are common, but I don’t think I’ve ever heard of annotating web pages directly. That is an interesting concept, though. I have ideas in my head about “child DOMs” that attach to the main DOM at a specified element.
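One possible reading of the “child DOM” idea, sketched in Python under some loud assumptions: the saved page is well-formed XHTML (the standard library's `ElementTree` is an XML parser and will reject typical tag-soup HTML), elements to annotate carry an `id`, and each note becomes an `<aside>` child of its target. The name `attach_notes` and the `annotation` class are invented for this sketch.

```python
import xml.etree.ElementTree as ET


def attach_notes(xhtml: str, notes: dict[str, str]) -> str:
    """Attach each note as an <aside> child of the element with that id.

    `notes` (hypothetical) maps an element id to the annotation text.
    """
    root = ET.fromstring(xhtml)
    # Collect targets first so the tree is not modified while iterating.
    targets = [el for el in root.iter() if el.get("id") in notes]
    for element in targets:
        target_id = element.get("id")
        aside = ET.SubElement(
            element, "aside", {"class": "annotation", "data-target": target_id}
        )
        aside.text = notes[target_id]
    return ET.tostring(root, encoding="unicode")
```

For example, `attach_notes('<article><div id="intro">Hello</div></article>', {"intro": "Check this claim"})` returns the same article with one extra `<aside>` inside the `intro` element, which a stylesheet could then show as a margin note or hide entirely.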

tomalmy · April 13, 2021, 1:54am

Well, SGML looks a lot like HTML, and was intended for documents. I wrote a program decades ago that would convert SGML (or 1st generation HTML) to plain text. The problem with HTML today is that it no longer just represents document structure but has so many dynamics that you really can’t save it away and expect to have all the resources to reproduce it in the future. In some ways I expect this is intentional. Use the Internet Wayback Machine to view a site from 20 years ago and it reconstructs much better than a site from a year ago.

I’ve relied on PDFs for over 20 years now and they are still robust. A great choice when plain text doesn’t hack it. I hate trying to extract from HTML (web pages).

ChristinWhite · April 13, 2021, 2:36am

I couldn’t agree more. PDF, with its paginated approach, is a terrible format for archiving screen-based content, and it’s frustrating that there aren’t better options. I use webarchives, and they are great in theory but poorly implemented, as are the tools for managing them. DEVONthink does a pretty good job, but even there they are second-class citizens to PDFs.

ryanjamurphy · April 13, 2021, 3:25am

This! This is what I’m hoping to find. A tool that can manage and annotate offline HTML docs (plus a page’s extras).

Like a good PDF reader, but for archived webpages.

I suppose such a thing doesn’t exist.

anon41602260 · April 13, 2021, 10:18am

Thank you @ryanjamurphy

I’ve used some so-called web page “markup” and “annotation” extensions in Chrome – unfortunately (or maybe fortunately) I do not remember their names because none of them were useful.

It seems to annotate web pages one needs to clone the page, then take steps to preserve the integrity of the original page while adding personal data (annotations) to it – and that path leads back to PDFs or WebArchives.
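For what it's worth, one alternative to writing notes into the clone is to leave the saved page untouched and keep the notes in a separate sidecar file. The Python sketch below is only an illustration of that idea, not any existing tool's format: the function names, the `page_sha256` key, and the `quote`/`comment` fields are all hypothetical. It stores a hash of the page so a changed file can be detected, and anchors each note by the passage it quotes.

```python
import hashlib
import json


def fingerprint(html: str) -> str:
    """Hash the archived page so later changes to it can be detected."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def save_notes(html: str, notes: list[dict]) -> str:
    """Serialize notes alongside a fingerprint of the untouched page."""
    return json.dumps({"page_sha256": fingerprint(html), "notes": notes}, indent=2)


def load_notes(html: str, sidecar: str) -> list[dict]:
    """Return the notes, refusing if the page no longer matches."""
    data = json.loads(sidecar)
    if data["page_sha256"] != fingerprint(html):
        raise ValueError("archived page differs from the one that was annotated")
    return data["notes"]


def locate(html: str, note: dict) -> int:
    """Find where a note's quoted passage starts, or -1 if it is gone."""
    return html.find(note["quote"])
```

Anchoring by quoted text is fragile when the same passage appears more than once; schemes such as the W3C Web Annotation model's text quote selector add surrounding prefix and suffix text to disambiguate.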

msteffens · April 13, 2021, 11:34am

At least on the Mac, there’s much better system framework support for annotation of PDFs, which may partly explain why PDF annotation is much more common among tools.

FWIW, I have long-standing plans to implement annotation of web archives. But the required tool chains aren’t as straightforward as with PDF annotation. And since the latter is still the current status quo, I have to focus on that first.

Both formats have their advantages & disadvantages, and PDFs have their use cases, so I don’t see them as mutually exclusive. That said, I am really looking forward to working with a text+markup-based format such as HTML, the ability to easily read & modify the DOM, to parse or embed additional metadata, and to facilitate better semantic relationships.
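As a small illustration of how approachable that metadata is in a text+markup format, here is a Python sketch using only the standard library's `html.parser`. It is not taken from any app discussed here; the `MetadataReader` class and `read_metadata` function are hypothetical names, and it only collects the `<title>` and `<meta name="..." content="...">` pairs.

```python
from html.parser import HTMLParser


class MetadataReader(HTMLParser):
    """Collect the page title and <meta name=... content=...> pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attributes.get("name")
            if name:  # skip charset/http-equiv style meta tags
                self.metadata[name] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def read_metadata(html: str) -> dict[str, str]:
    """Parse a saved page and return its title and named meta values."""
    reader = MetadataReader()
    reader.feed(html)
    reader.close()
    return reader.metadata
```

The same parser hooks could be extended to pick up other embedded data, which is the kind of access a paginated format does not offer as directly.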

ChristinWhite · April 13, 2021, 11:42pm

That sounds extremely interesting, I’ll definitely be following your progress, the app looks great! Interesting way to implement inline metadata, I would have expected YAML frontmatter.

msteffens · April 14, 2021, 7:18am

Thanks @ChristinWhite! Not to derail from this thread (I’ve been guilty of this before), I’ve replied to your comment here.

rms · April 14, 2021, 7:31am

For what it’s worth, Apple’s documentation says (or strongly implies) their Web Archive format is deprecated.

See Apple Developer Documentation

Also PDF is an “open standard” with ISO. What is a PDF? Portable Document Format | Adobe Acrobat DC

ChristinWhite · April 14, 2021, 2:52pm

Unless I’m reading it wrong, I think it explicitly says it is deprecated. I had no idea.

I wish Safari would let you save out the Reader view, 90% of the time it’s exactly what I’d love to save and it works so much better than any of the clippers do. I’ve tried saving out Instapaper, Outline, etc. but then you often get URL mismatches or other weirdness.

zkarj · April 16, 2021, 9:15pm

Folks, please see the thread I linked to in post 2 above. I don’t think the Web Archive format has been deprecated at all. The page in the Developer documentation is about the class developers could use to work with one in their apps, and while this does certainly imply the format is on the wane as well, I’ve suggested evidence in the other thread that goes against this notion.

rms · April 17, 2021, 5:42am

That’s my interpretation also … Apple’s being fuzzy about their intentions for this Apple-only technology. Someday it will all become clear.