There’s another thread here (see below) where discussion of this very issue led me to accept Apple’s Web Archive format as a viable option, in the near to medium term at least, and on Apple platforms of course. Be sure to read the whole thread.
Great thread! I must’ve overlooked it because it focused on app-based solutions (Pocket, Reeder, etc.) at first.
This quote of yours nails the problem:
Your rationale for sticking with .webarchive here certainly seems well-founded. I wonder if there’s someone at Apple we can write to ask about the status of the format. The VPs sometimes reply, after all… I’d be happy to send a message and report back if anyone can suggest who the best person to reach out to would be.
Here’s my new question, then. Whether you’re saving things in HTML or webarchive, what do you use to read/markup/annotate them? Outside of preserving semantics and adaptive presentation, have you been able to take advantage of the format further?
I am dreaming of the equivalent of a good PDF reader, but for webarchive/HTML files, with richer comments and highlights (e.g., ones you can link to directly).
I must have missed the thread on saving web page data. I use Rich text and am in the process of converting and cleaning up those with saved images and markdown files with links.
WebArchives are always way too cluttered for me. They never get all the images and graphics I want and often have lots of ones I don’t want. Print to a PDF file is an acceptable substitute some of the time but the resulting PDFs usually have lots of junk included. Save as Markdown excludes all images.
Sorry to be dense, but I’m missing something. “HTML” in my mind is the nuts and bolts markup language, scripts, CSS, and all the glop behind the scenes. In other words – it is, technically, a subset of this:
Surely, that’s not what you are talking about?
(Have you ever used Curio, by the way?)
HTML (with extras) is what lets you see the pages today on a 6K monitor and on a phone and anything in between. If done well, it is nearly infinitely adaptable. Once you make a PDF, that’s pretty much game over. Your content will have to be adapted to one size you choose now and that will be very inappropriate for either a 6K screen or a phone.
The trick is to encapsulate the HTML with the extras (CSS and images for the most part) so that you can keep an atomic copy of the resource that can still be adaptively rendered to whatever size is required in the future. This is what Web Archive and SinglePage strive to do.
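To make "encapsulate the HTML with the extras" concrete, here is a rough sketch of one way it can be done: rewriting image references as inline `data:` URIs so the page no longer depends on the network. This is not how Web Archive or any particular tool works internally; the `inline_images` function and the `resources` mapping are hypothetical names for illustration, and the regex only handles the simplest case.

```python
import base64
import re


def inline_images(html: str, resources: dict[str, tuple[str, bytes]]) -> str:
    """Replace each <img src="..."> whose URL is in `resources` with a data: URI.

    `resources` maps a URL to (mime_type, raw_bytes). It is a hypothetical
    stand-in for whatever files an archiver has already downloaded.
    """

    def replace(match: re.Match) -> str:
        url = match.group(2)
        if url not in resources:
            return match.group(0)  # leave unknown references untouched
        mime, data = resources[url]
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{encoded}{match.group(3)}"

    # Simplified: only handles double-quoted src attributes on <img> tags.
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', replace, html)


page = '<p>Hello</p><img alt="dot" src="https://example.com/dot.gif">'
resources = {"https://example.com/dot.gif": ("image/gif", b"GIF89a")}
single_file = inline_images(page, resources)
print(single_file)
```

The result is still ordinary HTML, so a browser can reflow it for any screen size. A real archiver would also have to deal with stylesheets, fonts, and scripts, which this sketch ignores.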
I installed Ghostery and re-tried this page. It archived in seconds.
I think you missed my point. I know what HTML is. One doesn’t easily create the sort of pages you mention without some software tool or tools to build that HTML. Surely, neither you nor Ryan are suggesting that people should manually “manage, create, annotate readings & reference material” by handcrafting HTML.
Pardon my denseness, but what’s the process and tool to build those HTML pages that you’re suggesting would replace PDFs?
I think I did miss your point.
I read Ryan’s post as meaning “ways of archiving web content that may disappear from its origin” particularly because of the mention of a tool to do exactly that. So in this case, the content has already been created and people want to preserve it.
But if we go back to your point about how to expand on the content, then I see the problem. Annotation tools for PDFs are common, but I don’t think I’ve ever heard of annotating web pages directly. That is an interesting concept, though. I have ideas in my head about “child DOMs” that attach to the main DOM at a specified element.
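For what it's worth, the "child DOM" idea can be sketched in a few lines. This is only a toy under several assumptions: the page is well-formed enough for an XML parser, target elements carry `id` attributes, and the annotation record format (`id`, `target`, `text`) is made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET


def attach_annotations(root, annotations):
    """Return a copy of `root` with a note appended inside each target element."""
    annotated = copy.deepcopy(root)  # the archived original stays untouched
    by_id = {el.get("id"): el for el in annotated.iter() if el.get("id")}
    for note in annotations:
        target = by_id.get(note["target"])
        if target is None:
            continue  # anchor no longer exists; skip rather than guess
        child = ET.SubElement(
            target, "span", {"class": "annotation", "id": note["id"]}
        )
        child.text = note["text"]
    return annotated


source = '<article><p id="p1">Web pages adapt.</p><p id="p2">PDFs do not.</p></article>'
original = ET.fromstring(source)
notes = [{"id": "note-1", "target": "p2", "text": "Check this claim."}]
annotated = attach_annotations(original, notes)
print(ET.tostring(annotated, encoding="unicode"))
```

Because each note gets its own `id`, it could be linked to directly with a URL fragment such as `#note-1`, and because the notes live in a separate list, the saved page itself never has to change.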
Well, SGML looks a lot like HTML, and was intended for documents. I wrote a program decades ago that would convert SGML (or first-generation HTML) to plain text. The problem with HTML today is that it no longer just represents document structure; it has so many dynamic parts that you really can’t save it away and expect to have all the resources to reproduce it in the future. In some ways I expect this is intentional. Use the Internet Archive’s Wayback Machine to view a site from 20 years ago and it reconstructs much better than a site from a year ago.
I’ve relied on PDFs for over 20 years now and they are still robust. A great choice when plain text doesn’t hack it. I hate trying to extract from HTML (web pages).
I couldn’t agree more. PDF, with its paginated approach, is a terrible format for archiving screen-based content, and it’s frustrating that there aren’t better options. I use webarchives; they are great in theory but poorly implemented, and so are the tools for managing them. DEVONthink does a pretty good job, but even there they are second-class citizens next to PDFs.
This! This is what I’m hoping to find. A tool that can manage and annotate offline HTML docs (plus a page’s extras).
Like a good PDF reader, but for archived webpages.
I suppose such a thing doesn’t exist.
Thank you @ryanjamurphy
I’ve used some so-called web page “markup” and “annotation” extensions in Chrome – unfortunately (or maybe fortunately) I do not remember their names because none of them were useful.
It seems to annotate web pages one needs to clone the page, then take steps to preserve the integrity of the original page while adding personal data (annotations) to it – and that path leads back to PDFs or WebArchives.
At least on the Mac, there’s much better system framework support for annotation of PDFs, which may partly explain why PDF annotation is much more common among tools.
FWIW, I have long-standing plans to implement annotation of web archives. But the required tool chains aren’t as straightforward as with PDF annotation. And since the latter is still the current status quo, I have to focus on that first.
Both formats have their advantages & disadvantages, and PDFs have their use cases, so I don’t see them as mutually exclusive. That said, I am really looking forward to working with a text+markup-based format such as HTML, the ability to easily read & modify the DOM, to parse or embed additional metadata, and to facilitate better semantic relationships.
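As a small illustration of how approachable the markup inside a web archive is, here is a sketch that pulls the main page's HTML out of one using only Python's standard library. It assumes the file is a property list with a `WebMainResource` dictionary holding `WebResourceData`, `WebResourceURL`, and `WebResourceTextEncodingName`, which is the layout commonly seen in Safari-saved files; nothing in this thread confirms those key names, so treat them as an assumption. The `main_html` helper is a made-up name.

```python
import plistlib


def main_html(archive_bytes: bytes) -> tuple[str, str]:
    """Return (url, html) for the main resource of a .webarchive payload.

    The key names below are assumed from commonly observed files.
    """
    archive = plistlib.loads(archive_bytes)  # detects XML or binary plists
    main = archive["WebMainResource"]
    encoding = main.get("WebResourceTextEncodingName", "utf-8")
    html = main["WebResourceData"].decode(encoding)
    return main["WebResourceURL"], html


# Build a tiny stand-in archive in memory so the sketch runs anywhere.
sample = plistlib.dumps(
    {
        "WebMainResource": {
            "WebResourceURL": "https://example.com/",
            "WebResourceMIMEType": "text/html",
            "WebResourceTextEncodingName": "UTF-8",
            "WebResourceData": b"<html><body><p>Saved page</p></body></html>",
        },
        "WebSubresources": [],
    },
    fmt=plistlib.FMT_BINARY,
)
url, html = main_html(sample)
print(url, len(html))
```

With a real file you would pass the file's bytes instead of `sample`, for example the contents of a hypothetical `page.webarchive`. Once the HTML is out as text, parsing it or embedding metadata is ordinary markup work.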
That sounds extremely interesting, I’ll definitely be following your progress, the app looks great! Interesting way to implement inline metadata, I would have expected YAML frontmatter.
For what it’s worth, Apple’s documentation says (or strongly implies) their Web Archive format is deprecated.
Also PDF is an “open standard” with ISO. What is a PDF? Portable Document Format | Adobe Acrobat DC
Unless I’m reading it wrong, I think it explicitly says it is deprecated. I had no idea.
I wish Safari would let you save out the Reader view, 90% of the time it’s exactly what I’d love to save and it works so much better than any of the clippers do. I’ve tried saving out Instapaper, Outline, etc. but then you often get URL mismatches or other weirdness.
Folks, please see the thread I linked to in post 2 above. I don’t think the Web Archive format has been deprecated at all. The page in the developer documentation is about the class developers could use to work with web archives in their apps, and while this certainly implies the format is on the wane as well, I’ve offered evidence in the other thread that goes against this notion.
That’s my interpretation also … Apple’s being fuzzy about their intentions for this Apple-only technology. Someday it will all become clear.
I have long agreed with OP, and have been trying for years to find a way to archive and mark up HTML. I know it’s possible because I used to do it, using an old Firefox extension called “Scrapbook”. It would save a good replica of the page without the ads, and you could highlight, add notes inline, etc. Scrapbook somehow relied on something insecure in the browser though - I didn’t understand the details - and it’s no longer available. I now run Scrapbook X in Pale Moon, which is basically a fork of old insecure versions of Firefox. I use Pale Moon just for this extension. There has to be a better way?!