Reading through the comments, one of which is from the original author, he admits that some of the hits seem to be coming as part of searches.
So is Federico mad that Perplexity scraped his site for training (and has that been confirmed)? Or is he mad that Perplexity hits his site as part of user-generated searches/requests?
I have more of a problem with the first than the second. I’m sure I’ll get lit up here and I’m also sure I don’t fully understand/appreciate the whole problem (willing to learn!), however as a Perplexity user, I absolutely want it to hit his site as part of a search.
When I search for something in Perplexity, I am hoping it acts as a WAY SMARTER version of a search engine like Google. I want it to cut through the fake websites and the SEO junk that’s out there now tailored to Google searches, and I want it to return actual information. I know it’s not true for everyone, but I often click each source footnote, which takes me to that material, just to sanity check the source. If I’m being honest, I wish Perplexity would make the sources it uses more obvious. Yes, I expand the quick drop down at the top to see the sources, so a site like MacStories would still get credit in my head, but it would be nice if that was a little more obvious…with site and author information provided.
So all this to say, I don’t want my search engine neutered because creators don’t like AI. I do get the concern about scraping beforehand and training models on people’s information without their consent…that seems icky. But also, it’s the open web and it’s kind of done now.
I hope Apple getting into the game raises the conversation (because people love to complain about Apple) in this space. I hope it draws attention to people being able to get credit for the work they do. I hope it makes search-type AI like Perplexity better, and makes generative AI more considerate.
He was also angry at Apple’s admission that it had scraped the web to train its Apple Intelligence models. I’ll be honest, I don’t get that. I’ve got thousands of words, a blog dating back to 2003. I don’t care if Apple used it to train its model.
As I understand it the point here is that Apple and others used the content of the web to train the model to understand language and expression in a very generic sort of way. I don’t see the problem in that specifically.
Note: This is NOT the same as chatbots like ChatGPT referencing original content and re-writing to present as a search result. Very different thing. And, as you’ve pointed out, using Perplexity as a replacement for Google to cut through the garbage so that you can then click on the referenced articles would be the same thing as using Google and clicking through.
Yeah that’s a good distinction and the way I understand it as well. I do think there have been times the models kick out exact passages from people’s work if prompted in just the perfect way. And that also doesn’t feel good. If you are pulling directly from someone’s work it should be cited.
The people who have a problem with AI search even with citations baffle me, because that’s how academia works. I cite a website in an academic paper…no one sees the website unless someone really wants to follow the citation. To me it’s no different as long as the AI cites the sources like Perplexity does.
Let’s call it what it is: people want ad impressions.
Edited: grammar/clarity
Apple allows websites to opt out of their training models.
But that opt-out is only after the act?
Probably, since Apple only announced on Monday that they will be releasing AI features this year. But since they aren’t even available in the Betas yet, I’d say they’re giving website owners fair warning now. Opt out now and your data won’t be available in the Fall when they roll out these features.
Again, that’s only on data they’ve not already ingested.
Craig Federighi said recently that the data they’d used to train their models so far included public data from the web. Not cool Apple, not cool.
This is more about the limitations of robots.txt than about Apple. Robots.txt was designed to encourage or discourage inclusion in search indexes. Now it’s used for both search indexes and LLM training, often with the same user agent. Apple has been a good robots.txt Internet citizen; anyone who blocked Applebot or blocked Googlebot (Apple respects this as well) isn’t in Apple Intelligence. But we need a way to list one user agent (Applebot) and give direction based on the purpose of the crawl (Siri search index vs. training).
Edit: this is only a comment about Applebot, not Perplexity.
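To make the robots.txt point concrete, here is a minimal sketch of how per-agent rules get evaluated, using Python’s standard urllib.robotparser. The rules are illustrative only: the Applebot-Extended token is an assumption on my part (as far as I know, Apple documents it as a separate opt-out for training use, but it isn’t mentioned above), and example.com, SomeOtherBot, and the allowed() helper are made up for the demo. It only models a crawler that chooses to honor robots.txt; it does nothing against one that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: keep the search crawler, opt out of the training token.
# "Applebot-Extended" is assumed here, not taken from the thread.
# The more specific token is listed first so it is matched before "Applebot".
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: Applebot
Allow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some-article"  # hypothetical page
    for agent in ("Applebot", "Applebot-Extended", "SomeOtherBot"):
        print(agent, allowed(agent, url))
```

The catch is the one already described: this works per user agent, not per purpose, so it only helps when the crawler operator publishes a separate token for training.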
It’s my understanding that what Apple will be doing with Apple Intelligence would never, ever involve the sharing of any website’s content. It’s not doing a ChatGPT type world knowledge chatbot that would result in sharing source text of any kind.
What Apple will be offering in terms of text is writing improvement suggestions. Apple’s models, its brain so to speak, will help you with your style, grammar etc. It’s more like Grammarly, nothing like ChatGPT.
Maybe I’m confused, I thought Apple was only using the scraping for LLM training? How is it using search indexes?
They’ve been crawling the web for years as Applebot for Siri and Spotlight search indexes.
Indexing a website for search is very different from taking someone else’s content and using it for your own purposes, though.
Totally agreed with you.
And there’s more: “Perplexity Is a Bullshit Machine” (WIRED).
Wired have verified the findings at MacStories.
Perplexity.ai can helpfully offer a summary of the WIRED article:
“The article raises questions about Perplexity’s transparency, respect for intellectual property rights, and the accuracy and ethics of its AI-powered search engine”
This could be about to get very interesting.
I can’t see how anyone would still be paying for this garbage.
Those are both fascinating and humorous examples - great demonstration.
Here’s a forward-compatible solution from Cloudflare. It doesn’t prevent stealth mode crawlers but it does mean webmasters don’t have to keep up with every startup building a model.