As is probably obvious from my previous involvement in AI threads, I have big concerns about the ethics of Gen AI that go beyond its practical effectiveness, and I wondered whether anyone has experience choosing a more ethical option.
Leaving aside the damage to the environment and the social cost to human workers for a moment, I want to concentrate on the issue of training the models on copyrighted data without recompense to the creators.
Which AI model is the most responsible in this regard? Do any of them guarantee that they have not used copyrighted data uncompensated? If not, which is the least bad?
Obviously, Meta, OpenAI, and Musk can't be trusted, but who are the relative "good guys" in this? Are there any?
The exact data used to train models isn't public, and may not even be recorded by the companies that train them, so the answer is essentially unknowable.
At most you can use your general trust of the companies in question as a proxy.
Web server admin here. Based on some recent analysis of logs, I can tell you that none of the major players (Claude, OpenAI, Perplexity, Meta, etc.) are asking for permission. Copyright is the default, but the companies are treating data collection as something you have to opt out of.
And their crawling behavior is positively abusive.
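If you want to run the same check on your own logs, here is a rough sketch of the kind of tally I did. It assumes the standard combined log format, and the user-agent substrings are just well-known examples (GPTBot, ClaudeBot, etc.), not an exhaustive list:

```python
# Rough tally of AI-crawler traffic in an access log (combined log format).
# The user-agent substrings below are examples, not an exhaustive list.
import re
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot",
           "Bytespider", "meta-externalagent"]

# In combined log format the user agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"$')

counts = Counter()
with open("access.log") as log:  # path is a placeholder
    for line in log:
        m = UA_RE.search(line.rstrip())
        if not m:
            continue
        ua = m.group(1)
        for bot in AI_BOTS:
            if bot in ua:
                counts[bot] += 1
                break

for bot, n in counts.most_common():
    print(f"{bot:20} {n}")
```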
I already knew from their public statements (or from emerging internal documents) that OpenAI and Meta cannot be trusted on this (and Musk on anything), but I was hoping that one of the players, perhaps one of the Europeans, had at least gestured towards being responsible. It seems not.
Probably the Allen Institute, or another group using the data they curate. It's not as good as 4o yet, so you don't hear much about it. You'd run their model locally. Most of those claiming not to scrape or pirate are distilling (stealing answers from) OpenAI.
At this point I worry less about how "ethical" they are and more about how to put guardrails between them and me. You can dive deep into any company and find something that would be considered unethical.
- OpenAI has a meaningless safety plan that they just ignore.
- Sam Altman lied to his board about so many things that he really was fired with cause. He only came back because the board didn't communicate its reasons for the firing well.
- Sam Altman continues to pour gasoline on various fires so he can command a higher valuation.
- …
There aren't any. Implementations of current commercial LLMs/AIs are deeply flawed, and the problems are going to increase.
In academia we used much more specialized LLMs, with tailored and carefully parsed corpora, for very specific kinds of textual analysis. Even so, we did not use their results without a great deal of testing and validation.
I want nothing to do with any of the current commercial LLMs. They are emblematic of GIGO.
The best use I have found for them is one where they remain insular. I am an English teacher with frighteningly little in the way of resources for assignments. I found that when I input the complete text and asked it to make assessment questions based on the standard we are covering, it did a remarkably good job.
I understand. I too was an English teacher, albeit at the college level, which is, I think, much easier. I spent a few years supporting faculty. One of the things we did was create a private shared online resource of paper prompts, text questions, etc.
I'm not sure this would work for K-12, where teachers have much less freedom regarding pedagogical content.
From an interest in the technical side of this: what are their bots doing beyond what a search engine (e.g. Bing or Google) might do? Or are search engines also more abusive than they once were?
Thank you to everybody who responded; it has been very illuminating and interesting, though a little depressing.
I am lucky in that I am not forced to adopt Gen AI by work or by fear of losing a competitive advantage. It must be a difficult moral dilemma for those who are, particularly writers and artists. It must feel like they're weaving the rope for the executioner to hang them with.
Search engines are more aggressive than they used to be. I have found that Bing tends to be more aggressive than Google, and to hit with a much larger percentage of completely worthless crawls, for example on GET parameters. I have one site that runs an event calendar, and Bing will query thousands of combinations of parameters that functionally return the same data.
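If you want to check whether the same thing is happening to your site, one quick way is to count distinct query strings per path per user agent. A rough sketch, again assuming combined log format and a placeholder log path:

```python
# Sketch: spot crawlers that fetch the same path under thousands of
# query-string variants (e.g. a calendar crawled parameter-by-parameter).
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Request line and trailing user-agent field from combined log format.
LINE_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]*".*"([^"]*)"$')

variants = defaultdict(set)  # (user_agent, path) -> set of query strings
with open("access.log") as log:  # path is a placeholder
    for line in log:
        m = LINE_RE.search(line.rstrip())
        if not m:
            continue
        url, ua = m.groups()
        parts = urlsplit(url)
        if parts.query:
            variants[(ua, parts.path)].add(parts.query)

# Show the paths with the most query-string variants per user agent.
top = sorted(variants.items(), key=lambda kv: len(kv[1]), reverse=True)[:10]
for (ua, path), queries in top:
    print(f"{len(queries):6} variants  {path}  [{ua[:40]}]")
```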
The worst ones are the Chinese search engines, though. PetalBot, for example, does not respect robots.txt, hammers the server with dozens of simultaneous page requests, and if one of its IPs gets blocked it just grabs a new IP from its pool and keeps hammering.
Most reputable AI tools are somewhere in the category of the more aggressive United States search engines. But whereas Bing and Google provide value for the website owner, the AI bots do not.
The less reputable AI tools are more in the category of PetalBot. I don't know that this is an intentionally nefarious design, but it is definitely a design that does not consider the impact it is having on website owners.
And since everybody is training their own LLM these days, the collective impact of AI crawlers - poorly-conceived or otherwise - is massive.
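On the defensive side: because these bots rotate addresses within a pool, per-IP blocking is nearly useless, so one guardrail is to rate-limit by network block rather than by single address. A minimal sketch, IPv4 only, with illustrative numbers rather than tuned recommendations:

```python
# Sliding-window rate limiter keyed on the /24 network, so a bot that
# rotates IPs within a pool does not reset its own counter. IPv4 only;
# WINDOW and LIMIT are illustrative placeholders.
import ipaddress
import time
from collections import defaultdict, deque

WINDOW = 60.0  # sliding window in seconds
LIMIT = 300    # max requests per /24 per window

hits = defaultdict(deque)  # network -> timestamps of recent requests

def allow(ip: str) -> bool:
    """Return False once a /24 network exceeds LIMIT requests in WINDOW seconds."""
    now = time.monotonic()
    net = str(ipaddress.ip_network(f"{ip}/24", strict=False))
    q = hits[net]
    while q and now - q[0] > WINDOW:  # drop timestamps outside the window
        q.popleft()
    if len(q) >= LIMIT:
        return False  # the caller would serve HTTP 429 here
    q.append(now)
    return True
```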
The LLM crawlers were consuming bandwidth on the writers' forum I ran to the point that bandwidth consumption tripled and interfered with users. Robots.txt was ignored. Some deliberately changed the user-agent string that identified them.
The forum was decades old, with millions of posts. A single LLM operation would have hundreds, sometimes thousands, of crawlers hitting the site at the same time.
I had to take some extraordinarily aggressive protective measures that were time-consuming and expensive for a free forum.
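One of those measures, for what it's worth: since the identifying strings can be faked, the user agent alone can't tell you whether a "Googlebot" hit is real. The verification Google and Bing document is a reverse DNS lookup followed by a confirming forward lookup; a sketch below, with the domain list reflecting my understanding of what those two publish:

```python
# Verify a claimed search-engine crawler: reverse-DNS the IP, check the
# hostname falls under the operator's published domain, then forward-resolve
# the hostname and confirm it maps back to the same IP.
import socket

BOT_DOMAINS = {
    "Googlebot": ("googlebot.com", "google.com"),
    "bingbot": ("search.msn.com",),
}

def is_genuine_bot(user_agent: str, ip: str) -> bool:
    """True if a UA claiming to be a known crawler resolves back to its owner."""
    for bot, domains in BOT_DOMAINS.items():
        if bot not in user_agent:
            continue
        try:
            host = socket.gethostbyaddr(ip)[0]  # reverse DNS: IP -> hostname
        except socket.herror:
            return False
        if not any(host == d or host.endswith("." + d) for d in domains):
            return False
        try:
            # Forward-confirm: the hostname must resolve back to the same IP.
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False
    return False  # UA doesn't claim to be a crawler we know how to verify
```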
This TechDirt article, "AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk", describes the same expensively abusive AI/LLM bot behavior I've seen.