Writing in a time of robots

Information is power. Who gets to distribute information, and how, has shaped the course of history. After Johannes Gutenberg invented the printing press in 1439, he was sued by one of his investors over unpaid loans. The lawsuit broke Gutenberg’s monopoly on the machine, allowing the printing press to spread across Europe and giving rise to improved literacy rates and the scientific revolution.

Today, we are in the midst of another information revolution, and another lawsuit. In 2023, The New York Times sued OpenAI and Microsoft, alleging that the two companies infringed on the Times’ copyrighted material when training their AI models. If the Times wins, large language model (LLM) creators may have to compensate publishers before using their content. Beyond its financial implications, this case is an important reminder that LLMs, arguably the most popular AI tools, don’t operate in a vacuum. Rather, they rely on enormous volumes of text written by humans to provide answers.

Most people want AI to write their essays, but few understand how chatbots read their work. Writing is all about exchanging ideas in a manner an audience can understand. But what happens when that audience is no longer human? Does this fundamentally change the way we write? Or are LLMs merely an intermediary, a translator between content creators and their audience, albeit one plagued by a short attention span?

To answer these questions, we first need to understand how an LLM accesses and reads text. For an LLM to answer prompts accurately, it must first be trained, a process that exposes the model to massive amounts of text data. By analyzing millions of words, the model learns patterns in the text and how to predict the next word, allowing it to mimic human prose. However, the text needed to train these models must come from somewhere. AI developers have disclosed that they use Common Crawl, a massive dataset that archives much of the publicly available content on the internet. If your work is publicly available on the internet, it is likely somewhere in LLM training data.
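
To make that idea concrete, here is a toy sketch in Python (my own illustration, vastly simpler than any real model) of the core mechanic: tally which words tend to follow which in a pile of text, then use those tallies to predict the next word.

```python
from collections import Counter, defaultdict

# A toy "training corpus" standing in for the millions of pages
# a real model would see. (Illustrative text, not real training data.)
corpus = (
    "the printing press spread across europe . "
    "the printing press improved literacy . "
    "the press shaped history ."
)

# "Training": count how often each word follows each other word.
follows = defaultdict(Counter)
words = corpus.split()
for prev, nxt in zip(words, words[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Predict the most common next word seen during 'training'."""
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else "."

print(predict_next("printing"))  # -> 'press', learned purely from patterns
```

A production LLM replaces the counting table with a neural network holding billions of parameters, but the dependence on human-written text is the same: no corpus, no predictions.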

As the Times lawsuit suggests, many people are upset that AI developers are using their writing and want to put a stop to it. But those of us who don’t have the money or legal resources to sue OpenAI must live with the reality that AI is a member of our audience.

If AI is reading our work, then how should we respond?

Unlike human readers, who often jump around or skim an article, LLMs read text sequentially, processing one word, or token, at a time. This doesn’t mean, however, that the model understands your work, at least in the traditional sense. Though LLMs are effective at summarizing text and extracting key points, they struggle to decipher abstract concepts and to apply contextual information when reading a piece. LLMs see your work as raw material for generating better sentences, not as a set of challenges to grapple with.
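
That sequential reading can be sketched in a few lines of Python. In the hypothetical loop below, `score_next` and the `NEXT` table are stand-ins I invented for a real model’s next-token prediction; the point is simply that generation appends one token at a time, each step conditioned only on what came before.

```python
# A minimal sketch of autoregressive generation. `score_next` is a
# hypothetical stand-in for a real model's next-token prediction.
NEXT = {"the": "printing", "printing": "press",
        "press": "spread", "spread": "across", "across": "europe"}

def score_next(context: list[str]) -> str:
    """Return the 'most likely' next token given the tokens so far.
    (A real LLM would run a neural network over the whole context.)"""
    return NEXT.get(context[-1], ".")

tokens = ["the", "printing"]           # the prompt
for _ in range(4):                     # emit one token per step, in order
    tokens.append(score_next(tokens))
print(" ".join(tokens))                # -> the printing press spread across europe
```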

So, with this information in mind, is there a certain writing style that AI prefers?

To answer this question, I asked Gemini, Google’s LLM-based chatbot. In response, it gave me an example of a sentence that is good for AI to read and one that is bad. The good sentence: “The recent study found that regular exercise significantly improves cognitive function in older adults.” The bad sentence: “While the research indicated a potential link between exercise and cognitive health, further studies are needed to fully understand the complex relationship.”

These sentences, which read like summaries of a scientific paper’s results, expose the flaws of LLMs. The sentence Gemini labels “bad” is far more effective at conveying the study’s actual results and the uncertainty surrounding them. Demonstrating causality is a tough hill to climb, and Gemini appears to ignore the scientific rigor needed to reach the summit. AI admits that it hates complexity. That’s troubling, given that we live in a highly complex world.

You may be thinking: why should I care if AI reads my writing? After all, my work is only a tiny fraction of the millions of documents the model is exposed to.

In fact, LLMs are utterly reliant on human-generated content to function properly. The increasing prevalence of AI-generated content on the web has created an existential threat for LLMs. Researchers have found that when LLMs are trained on text produced by other AI models, their output degrades. Instead of producing the coherent, polished responses we expect from chatbots, an LLM trained on AI-generated prose responds with gibberish full of repeated phrases and stray characters. This phenomenon, known as model collapse, underscores the importance of keeping human-generated content on the web. Without humans, LLMs cannot observe trends in the real world and interpret them with reasonable accuracy.
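
As a toy caricature of that feedback loop (my own sketch, assuming the finding above; the actual research involved real neural models, not word counts), the snippet below repeatedly retrains a tiny next-word model on its own output. With each generation, rarer phrasings disappear and the text drifts toward repetition.

```python
import random
from collections import Counter, defaultdict

def train(text: str) -> dict:
    """Build a next-word frequency table from a body of text."""
    model = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def generate(model: dict, start: str, length: int = 25) -> str:
    """Sample text word by word from the frequency table."""
    out = [start]
    for _ in range(length):
        options = model.get(out[-1])
        if not options:
            break
        choices, weights = zip(*options.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

human_text = ("writing is about exchanging ideas with an audience and "
              "writing is about nuance and writing rewards original ideas")

text = human_text
for gen in range(5):
    model = train(text)
    text = generate(model, "writing")  # retrain on the model's own output
    print(f"generation {gen}: {text}")
```

Counting words is a crude stand-in for a transformer, but the direction of the effect, shrinking diversity when a model feeds on its own output, is the point.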

LLMs will not eliminate the need to write. In fact, the rise of AI will make writing even more important. As more people stop writing and opt to have LLMs do the work for them, those who continue to produce new content will exert an even larger influence over future training data, and over the models reliant on it. Though many argue that LLMs are an equalizer in information access, I disagree. Given that LLMs essentially regurgitate information they find on the web, the decline of human authorship will mean more robots paraphrasing fewer humans’ writing. And that select group of people who keep writing will influence more readers through the intermediary of AI.

Another takeaway is the importance of transparency. Given the inaccuracy of AI content detectors, it’s difficult to tell whether an article was written by a human or generated by a machine. This lack of traceability is deleterious for everyone involved, but especially for LLM developers themselves. To avoid model collapse, engineers must refrain from recycling AI-generated content into their own models. But that task becomes nearly impossible when they can’t tell where their data came from.

In sum, we shouldn’t change our writing style to accommodate AI. Even if LLMs are frequently reading our work, and even if they claim to prefer bland, straightforward text, they are nourished by the creativity and nuance infused into our prose. The threat of model collapse shows that LLMs are fundamentally dependent on real people, despite the hype around their superior capabilities. Original ideas still matter, even amid a tsunami of artificially generated noise.

Aaron Siegle is a Trinity junior. His pieces typically run on alternate Fridays.
