As is probably obvious from my previous involvement in AI threads, I have big concerns about the ethics of Gen AI that go beyond its practical effectiveness, and I wondered whether anyone has experience choosing a more ethical option.
Leaving aside the damage to the environment and the social cost to human workers for a moment, I want to concentrate on the issue of training the models on copyrighted data without recompense to the creators.
Which AI model is the most responsible in this regard? Do any of them guarantee that they have not used copyrighted data uncompensated? If not, which is the least bad?
Obviously, Meta, OpenAI, and Musk can't be trusted, but who are the relative "good guys" in this? Are there any?
The exact data used to train models isn't public, and may not even be recorded by the companies that train them, so the answer is essentially unknowable.
At most you can use your general trust of the companies in question as a proxy.
Web server admin here. Based on some recent analysis of logs, I can tell you that none of the major players (Claude, OpenAI, Perplexity, Meta, etc.) are asking for permission. Copyright is the default, but the companies are treating data collection as something you have to opt out of.
And their crawling behavior is positively abusive.
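If you want to run the same check on your own logs, here is a rough sketch of the kind of tally I did. It assumes the standard combined log format, and the user-agent substrings are just well-known examples (GPTBot, ClaudeBot, etc.), not an exhaustive list:

```python
# Rough tally of AI-crawler traffic in an access log (combined log format).
# The user-agent substrings below are examples, not an exhaustive list.
import re
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot",
           "Bytespider", "meta-externalagent"]

# In combined log format the user agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"$')

counts = Counter()
with open("access.log") as log:  # path is a placeholder
    for line in log:
        m = UA_RE.search(line.rstrip())
        if not m:
            continue
        ua = m.group(1)
        for bot in AI_BOTS:
            if bot in ua:
                counts[bot] += 1
                break

for bot, n in counts.most_common():
    print(f"{bot:20} {n}")
```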
I already knew from their public statements (or from emerging internal documents) that OpenAI and Meta cannot be trusted on this (and Musk on anything), but I was hoping that one of the players, perhaps one of the Europeans, had at least gestured towards being responsible. It seems not.
Probably the Allen Institute, or another group using the data they curate. It's not as good as 4o yet, so you don't hear much about it. You'd run their model locally. Most of those claiming not to scrape or pirate are distilling (stealing answers from) OpenAI.
At this point I worry less about how "ethical" they are and more about how to put guardrails between them and me. You can dive deep into any company and find something that would be considered unethical.
- OpenAI has a meaningless safety plan that they just ignore.
- Sam Altman lied to his board about so many things that he really was fired with cause. He only came back because the board didn't communicate its reasons for the firing well.
- Sam Altman continues to pour gasoline on various fires so he can command a higher valuation.
- …
There aren't any. Implementations of current commercial LLMs/AIs are deeply flawed, and the problems are going to increase.
In academia we used much more specialized LLMs, with tailored and carefully parsed corpora, for very specific kinds of textual analysis. Even so, we did not use their results without a great deal of testing and validation.
I want nothing to do with any of the current commercial LLMs. They are emblematic of GIGO.
The best use I have found for them is one where they remain insular. I am an English teacher with frighteningly little in the way of resources for assignments. I found that when I input the complete text and asked it to make assessment questions based on the standard we are covering, it did a remarkably good job.
I understand. I too was an English teacher, albeit at the college level, which is, I think, much easier. I spent a few years supporting faculty. One of the things we did was create a private shared online resource of paper prompts, text questions, etc.
I'm not sure this would work for K-12, where teachers have much less freedom regarding pedagogical content.
From an interest in the technical side of this: what are their bots doing beyond what a search engine (e.g. Bing or Google) might do? Or are search engines also more abusive than they once were?
Thank you to everybody who responded; it has been very illuminating and interesting, though a little depressing.
I am lucky in that I am not forced to adopt Gen AI by work or by fear of losing a competitive advantage. It must be a difficult moral dilemma for those who are, particularly writers and artists. It must feel like they're weaving the rope for the executioner to hang them with.
Search engines are more aggressive than they used to be. I have found that Bing tends to be more aggressive than Google, and to hit with a much larger percentage of completely worthless crawls, for example on GET parameters. I have one site that runs an event calendar, and Bing will query thousands of combinations of parameters that functionally return the same data.
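If you want to check whether the same thing is happening to your site, one quick way is to count distinct query strings per path per user agent. A rough sketch, again assuming combined log format and a placeholder log path:

```python
# Sketch: spot crawlers that fetch the same path under thousands of
# query-string variants (e.g. a calendar crawled parameter-by-parameter).
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Request line and trailing user-agent field from combined log format.
LINE_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]*".*"([^"]*)"$')

variants = defaultdict(set)  # (user_agent, path) -> set of query strings
with open("access.log") as log:  # path is a placeholder
    for line in log:
        m = LINE_RE.search(line.rstrip())
        if not m:
            continue
        url, ua = m.groups()
        parts = urlsplit(url)
        if parts.query:
            variants[(ua, parts.path)].add(parts.query)

# Show the paths with the most query-string variants per user agent.
top = sorted(variants.items(), key=lambda kv: len(kv[1]), reverse=True)[:10]
for (ua, path), queries in top:
    print(f"{len(queries):6} variants  {path}  [{ua[:40]}]")
```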
The worst ones are the Chinese search engines, though. PetalBot, for example, does not respect robots.txt, hammers the server with dozens of simultaneous page requests, and if one of its IPs gets blocked it just grabs a new IP from its pool and keeps hammering.
Most reputable AI tools are somewhere in the category of the more aggressive United States search engines. But whereas Bing and Google provide value for the website owner, the AI bots do not.
The less reputable AI tools are more in the category of PetalBot. I don't know that this is an intentionally nefarious design, but it is definitely a design that does not consider the impact it is having on website owners.
And since everybody is training their own LLM these days, the collective impact of AI crawlers - poorly-conceived or otherwise - is massive.
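On the defensive side: because these bots rotate addresses within a pool, per-IP blocking is nearly useless, so one guardrail is to rate-limit by network block rather than by single address. A minimal sketch, IPv4 only, with illustrative numbers rather than tuned recommendations:

```python
# Sliding-window rate limiter keyed on the /24 network, so a bot that
# rotates IPs within a pool does not reset its own counter. IPv4 only;
# WINDOW and LIMIT are illustrative placeholders.
import ipaddress
import time
from collections import defaultdict, deque

WINDOW = 60.0  # sliding window in seconds
LIMIT = 300    # max requests per /24 per window

hits = defaultdict(deque)  # network -> timestamps of recent requests

def allow(ip: str) -> bool:
    """Return False once a /24 network exceeds LIMIT requests in WINDOW seconds."""
    now = time.monotonic()
    net = str(ipaddress.ip_network(f"{ip}/24", strict=False))
    q = hits[net]
    while q and now - q[0] > WINDOW:  # drop timestamps outside the window
        q.popleft()
    if len(q) >= LIMIT:
        return False  # the caller would serve HTTP 429 here
    q.append(now)
    return True
```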
The LLM crawlers were consuming bandwidth on the writers' forum I ran to the point that bandwidth consumption tripled and interfered with users. Robots.txt was ignored. Some deliberately changed the user-agent string that identified them.
The forum was decades old, with millions of posts. A single LLM operation would have hundreds, sometimes thousands, of crawlers hitting the site at the same time.
I had to take some extraordinarily aggressive protective measures that were time-consuming and expensive for a free forum.
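One of those measures, for what it's worth: since the identifying strings can be faked, the user agent alone can't tell you whether a "Googlebot" hit is real. The verification Google and Bing document is a reverse DNS lookup followed by a confirming forward lookup; a sketch below, with the domain list reflecting my understanding of what those two publish:

```python
# Verify a claimed search-engine crawler: reverse-DNS the IP, check the
# hostname falls under the operator's published domain, then forward-resolve
# the hostname and confirm it maps back to the same IP.
import socket

BOT_DOMAINS = {
    "Googlebot": ("googlebot.com", "google.com"),
    "bingbot": ("search.msn.com",),
}

def is_genuine_bot(user_agent: str, ip: str) -> bool:
    """True if a UA claiming to be a known crawler resolves back to its owner."""
    for bot, domains in BOT_DOMAINS.items():
        if bot not in user_agent:
            continue
        try:
            host = socket.gethostbyaddr(ip)[0]  # reverse DNS: IP -> hostname
        except socket.herror:
            return False
        if not any(host == d or host.endswith("." + d) for d in domains):
            return False
        try:
            # Forward-confirm: the hostname must resolve back to the same IP.
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False
    return False  # UA doesn't claim to be a crawler we know how to verify
```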
This TechDirt article, "AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk", describes the same expensively abusive AI/LLM bot behavior I've seen.