I must have missed the thread on saving web page data. I use rich text and am in the process of converting and cleaning up those notes into saved images and markdown files with links.
WebArchives are always way too cluttered for me. They never get all the images and graphics I want and often include lots of ones I don’t want. Printing to a PDF file is an acceptable substitute some of the time, but the resulting PDFs usually have lots of junk included. Save as Markdown excludes all images.
Sorry to be dense, but I’m missing something. “HTML” in my mind is the nuts and bolts markup language, scripts, CSS, and all the glop behind the scenes. In other words – it is, technically, a subset of this:
HTML (with extras) is what lets you see the pages today on a 6K monitor and on a phone and anything in between. If done well, it is nearly infinitely adaptable. Once you make a PDF, that’s pretty much game over. Your content will have to be adapted to one size you choose now and that will be very inappropriate for either a 6K screen or a phone.
The trick is to encapsulate the HTML with the extras (CSS and images for the most part) so that you can keep an atomic copy of the resource that can still be adaptively rendered to whatever size is required in the future. This is what Web Archive and SinglePage strive to do.
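To make the idea of an “atomic copy” concrete: one way single-file archivers do this is by rewriting external resource references into base64 `data:` URIs, so the images travel inside the HTML itself. A minimal sketch in Python (the function name and the toy page are mine, not from any particular tool):

```python
import base64

def inline_image(html: str, placeholder_src: str, image_bytes: bytes,
                 mime: str = "image/png") -> str:
    """Replace an <img> src reference with a base64 data: URI so the
    page no longer depends on the external image file."""
    data_uri = f"data:{mime};base64,{base64.b64encode(image_bytes).decode('ascii')}"
    return html.replace(f'src="{placeholder_src}"', f'src="{data_uri}"')

# Toy page with one external image reference.
page = '<html><body><img src="logo.png"></body></html>'
fake_png = b"\x89PNG..."  # stand-in for real image bytes
archived = inline_image(page, "logo.png", fake_png)
# 'archived' now contains the image data inline; no logo.png needed.
```

Because the result is still ordinary HTML, it remains adaptively renderable at any screen size, which is exactly the advantage over a fixed-layout PDF.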
I installed Ghostery and re-tried this page. It archived in seconds.
I think you missed my point. I know what HTML is. One doesn’t easily create the sort of pages you mention without some software tool or tools to build that HTML. Surely, neither you nor Ryan is suggesting that people should manually “manage, create, annotate readings & reference material” by handcrafting HTML.
Pardon my denseness, but what’s the process and tool to build those HTML pages that you’re suggesting would replace PDFs?
I read Ryan’s post as meaning “ways of archiving web content that may disappear from its origin” particularly because of the mention of a tool to do exactly that. So in this case, the content has already been created and people want to preserve it.
But if we go back to your point about how to expand on the content, then I see the problem. Annotation tools for PDFs are common, but I don’t think I’ve ever heard of annotating web pages directly. That is an interesting concept, though. I have ideas in my head about “child DOMs” that attach to the main DOM at a specified element.
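One way to flesh out that “child DOM” idea is to leave the saved page untouched and store annotations separately, each keyed to an anchor element by a CSS selector. A hypothetical sketch (all names are mine, purely illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A note anchored to one element of the archived page."""
    selector: str        # CSS selector identifying the anchor element
    note: str            # the reader's comment
    highlight: str = ""  # optional quoted text being annotated

@dataclass
class AnnotatedPage:
    """Keeps the saved HTML untouched; annotations live alongside it."""
    html: str
    annotations: list = field(default_factory=list)

    def annotate(self, selector: str, note: str, highlight: str = "") -> None:
        self.annotations.append(Annotation(selector, note, highlight))

page = AnnotatedPage(html="<article><p id='p1'>Original text.</p></article>")
page.annotate("#p1", "Check this claim against the primary source.")
# page.html is unchanged; the note is attached at "#p1" in a side layer.
```

A renderer could then graft these notes into the live DOM at view time, preserving the integrity of the archived original.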
Well, SGML looks a lot like HTML, and was intended for documents. I wrote a program decades ago that would convert SGML (or first-generation HTML) to plain text. The problem with HTML today is that it no longer just represents document structure but has so many dynamics that you really can’t save it away and expect to have all the resources to reproduce it in the future. In some ways I expect this is intentional. Use the Internet Archive’s Wayback Machine to view a site from 20 years ago and it reconstructs much better than a site from a year ago.
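For a static, first-generation-HTML page, stripping tags really was most of what such a converter had to do. A small sketch of that kind of extraction using Python’s standard-library `html.parser` (not the original decades-old program, just an illustration of the approach):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the character data, discarding tags and any
    <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self._skip = 0      # depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(html_to_text("<h1>Title</h1><script>var x=1;</script><p>Body.</p>"))
# → Title Body.
```

This works precisely because everything meaningful is in the markup itself; once the content is assembled by scripts at runtime, no amount of tag stripping on the saved source recovers it, which is the archiving problem described above.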
I’ve relied on PDFs for over 20 years now and they are still robust. A great choice when plain text doesn’t hack it. I hate trying to extract from HTML (web pages).
I couldn’t agree more. PDF and the paginated approach make a terrible format for archiving screen-based content, and it’s frustrating that there aren’t better options. Web archives are great in theory but poorly implemented, as are the tools for managing them. DEVONthink does a pretty good job, but even there they’re a second-class citizen to PDFs.
I’ve used some so-called web page “markup” and “annotation” extensions in Chrome – unfortunately (or maybe fortunately) I do not remember their names because none of them were useful.
It seems that to annotate web pages, one needs to clone the page, then take steps to preserve the integrity of the original page while adding personal data (annotations) to it – and that path leads back to PDFs or WebArchives.
At least on the Mac, there’s much better system framework support for annotation of PDFs, which may partly explain why PDF annotation is much more common among tools.
FWIW, I have long-standing plans to implement annotation of web archives. But the required tool chains aren’t as straightforward as with PDF annotation. And since the latter is still the current status quo, I have to focus on that first.
Both formats have their advantages & disadvantages, and PDFs have their use cases, so I don’t see them as mutually exclusive. That said, I am really looking forward to working with a text+markup-based format such as HTML, the ability to easily read & modify the DOM, to parse or embed additional metadata, and to facilitate better semantic relationships.
Unless I’m reading it wrong, I think it explicitly says it is deprecated. I had no idea.
I wish Safari would let you save out the Reader view, 90% of the time it’s exactly what I’d love to save and it works so much better than any of the clippers do. I’ve tried saving out Instapaper, Outline, etc. but then you often get URL mismatches or other weirdness.
Folks, please see the thread I linked to in post 2 above. I don’t think the Web Archive format has been deprecated at all. The page in the Developer documentation is about the class developers could use to work with one in their apps, and while this does certainly imply the format is on the wane as well, I’ve suggested evidence in the other thread that goes against this notion.
I have long agreed with OP, and have been trying for years to be able to archive and mark up HTML. I know it’s possible because I used to do it, using an old Firefox extension called “Scrapbook”. It would save a good replica of the page without the ads, and you could highlight, add notes inline, etc. Scrapbook somehow relied on something insecure in the browser, though - I didn’t understand the details - and it’s no longer available. I now run Scrapbook X in Pale Moon, which is basically a fork of old, insecure versions of Firefox. I use Pale Moon just for this extension. There has to be a better way?!