Large Language Models - A Street Full of Wrong-Way Drivers?

Posted by Tim Zöller on May 05, 2024 · 20 mins read

If you follow me on social media, you know I am deeply sceptical about the benefits of AI in most use cases. I am critical of the environmental impact, I am worried about the centralization on a few SaaS providers, and I am sad about the topic dominating every product roadmap, using up resources which could have been used to progress the products' existing features further. On the other hand, I talked about this topic with a lot of smart people this year – and many praised large language models and told me how they made their daily work easier. I am writing this post to reflect on my opinion and on how arrogant it would be to dismiss people's first-hand experience without second-guessing myself. If everybody around you seems to be a wrong-way driver, it is probably a good idea to check whether you are not the one going the wrong way yourself.

Why I won’t talk about ethics and ecological impact

While these two aspects of large language models are my biggest issues, they won't be part of the upcoming paragraphs. I took (probably useless) measures to ensure that my blog posts and code are not ingested by large language models, because I don't want to contribute free labor to multi-billion-dollar software companies. I also think that the current hype about training models will negatively impact the efforts in fighting climate change. In this post, however, I would like to ask the question "Are large language models useful?", not "Are they worth it?", to simplify things. I also will not view this topic from the perspective of "All models will get better in the future". Sure they will, but we don't know when and by how much. Addressing all shortcomings with "The future will fix it" helps nobody and is a bad way of discussing things.

The conversation triggering this post

Some weeks ago I participated in the "Digital Crafts Day" as a speaker. The crowd was a mix of the usual developer nerds, students and people working in different industries. As the event was happening at a university, the average participant was rather young, and AI was a very present topic. During a coffee break, I talked to two people working for the public sector, and they told me their AI success story: while they have been modernizing their software stack for some years now, they still have a lot of COBOL code on IBM host systems. For some months now, they have been using AI tools to translate this COBOL code into Java, and they are really happy with the results so far. My internal dialogue while listening to their experience was:

  • Oh god, no
  • This cannot work
  • The code quality must be horrible
  • How can you even translate COBOL code into OOP concepts with Java?
  • What are they gaining from this?
  • They will suffer from this in the future

At some points, my own thoughts felt a little embarrassing to me. There I was, talking to experienced professionals, software developers with at least as much experience as myself, and I rejected their experience based on my own preconceptions - without having shared their experience, or having tried similar things. Of course I did not say any of these things out loud. At two other conferences afterwards I shared their story with a few people whose skills I value very much, and asked for their opinion on this. They almost exactly mirrored my own thoughts listed above. This pretty much sums up the two opinions I see quite often: the people advocating for LLMs in writing code or even transforming code, and the people rejecting the very idea. I realized that the "contra" point of view was often backed by emotions and the fear that other people would be using those tools poorly - and it was a little painful to admit that this was the case for myself, too.

How I currently use large language models

Right now I have two use cases which benefit from large language models: translating texts and text snippets in a professional setting, and writing and explaining bash scripts.

My mother tongue is German, and while I consider my English to be quite good, I am sure that many expressions and metaphors in my English texts work better in German, and no native speaker would write them this way. When participating in CfPs for international conferences, I use ChatGPT to correct my pre-written English abstracts. My prompt is usually "Please correct the following text, written by a German native speaker in English, correct spelling errors and grammar and make it more idiomatic". If my abstracts are longer than the maximum character count allowed, I ask ChatGPT to shorten them. In my opinion, ChatGPT can work as an equalizer, giving non-native speakers the possibility to write idiomatic and error-free texts, maybe increasing their chances of getting talks, applications or proposals accepted. This text was not improved by any LLM; I don't usually do this with my blog posts.

My other use case, writing and explaining bash scripts, helps me because I really, really suck at bash. I am able to read and understand most scripts, but I write them so rarely that I forget most of the syntax. While writing a short script to automate a recurring task on my laptop, I tried asking ChatGPT to create the script for me. I could understand the generated script and validate that it would do the task while doing no harm to my laptop, and I have been doing this ever since. It helps me finish these chores much faster, as I don't have any intention of learning bash on a higher level.

Letting AI write my code

I experimented with letting AI write my code – both ChatGPT and IntelliJ's built-in assistant. To understand my main focus, it is important to know that I work as an IT consultant for software development and architecture. I usually work on enterprise codebases which pull data from several sources, display it to users, let users interact with it and write changes or work on processes with it. I rarely write complicated algorithms; my main task is making decisions about data:

  • Which system provides my data?
  • Which interfaces does this system provide?
  • What does the contract for this interface look like?
  • How can I transform and merge the data?
  • Can I find robust common abstractions for specific datasets?
  • Where are the transaction boundaries (if there are any)?
  • How can I encapsulate the business logic transforming the data?

Most of the time I spend thinking at work is not spent on the code I am writing, but on the concepts behind it. When writing code, I usually don't type that much. REST controllers, Kafka integrations and the like don't need much boilerplate code, as modern frameworks like Spring Boot already provide excellent abstractions over those technologies. Every time I tried generating code like this, my observation was that I spent as much time checking the code for edge cases and correctness as I would have spent writing the code in the first place. To be fair, the generated code was almost always correct, but only almost.
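To illustrate how little boilerplate is actually left to generate, here is a minimal sketch of a Spring Boot REST controller. All names are hypothetical, and the example is a simplification rather than code from a real project:

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Minimal sketch: Spring Boot handles routing, JSON serialization and
// dependency injection, so the controller itself contains almost no boilerplate.
@RestController
@RequestMapping("/customers")
class CustomerController {

    // Hypothetical DTO; in a real project this would come from the domain model.
    record Customer(long id, String name) {}

    @GetMapping("/{id}")
    Customer findById(@PathVariable long id) {
        // In a real application this would delegate to a service or repository.
        return new Customer(id, "Example Customer");
    }
}
```

The interesting work here is not typing this class, but deciding which data it exposes and how.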

The same was true when I tried generating unit tests. The tests were usually really good, but when I am writing unit tests, testing my code for correctness is only one of my goals. The other is to use the tests as a tool to think about my code a second time, from a different angle. This not only helps me think about the edge cases (which the AI-generated tests cover, too); quite often, while thinking about what to test, I realize that my initial assumptions while writing the functionality were flawed.

Right now I have 15 years of experience writing code, mostly in Java with Java EE, Spring and Spring Boot. By now I know the language and the frameworks really well, and I hardly ever have to look up documentation or use Google to find the correct features and classes. Considering that my code mostly orchestrates data and makes it available to other systems, I have concluded that large language models don't help me become more productive at my main job. When talking to other professional software developers who work in a similar setting, most of them agreed with this sentiment.

Use case 1: Assisting Junior Developers

At one point in the past, a junior developer created a pull request in our project dealing with monetary values. They used a floating point datatype to store the value while doing calculations on it. I commented on the PR, explaining why a BigDecimal datatype would be more fitting, how to deal with rounding, which types of rounding they could use in which case, and what was important about precision. Our junior developer learned something about Java and datatypes. Some time later I was curious and asked ChatGPT to generate the code for me, only stating the requirements. The generated code was way better than the first draft of our junior, but it set the precision of the datatype incorrectly, leading to rounding errors in some rare cases. A developer who did not know about the peculiarities of BigDecimal would probably have missed this detail; in the best case the rounding errors would have been caught in unit tests. This was not a general flaw of ChatGPT, but the decision when to apply the precision to an intermediate result very much depends on the type of calculation you do. One could argue that the requirement was not precise enough, but I believe it was sufficient for most developers.
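To show the category of error I mean, here is a hedged sketch of my own (not the actual code from that PR): the point in a calculation at which you apply scale and rounding can change the result.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustrative sketch (not the code from the PR): rounding an intermediate
// result too early produces a different total than rounding at the end.
public class RoundingExample {

    public static void main(String[] args) {
        BigDecimal unitPrice = new BigDecimal("0.105"); // hypothetical net price per item
        BigDecimal quantity = new BigDecimal("1000");

        // Rounding the unit price first loses half a cent per item ...
        BigDecimal roundedFirst = unitPrice.setScale(2, RoundingMode.HALF_UP)
                                           .multiply(quantity);

        // ... while rounding only the final amount keeps the precision.
        BigDecimal roundedLast = unitPrice.multiply(quantity)
                                          .setScale(2, RoundingMode.HALF_UP);

        System.out.println(roundedFirst); // 110.00
        System.out.println(roundedLast);  // 105.00
    }
}
```

Which of the two results is correct depends entirely on the business rules, which is exactly why a generated default can be silently wrong.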

Another aspect of this anecdote is the "learning" part. After making the mistake, our junior developer had a story in their head which they could connect to the features of BigDecimal. That does not necessarily mean they won't make a similar mistake again, but I'd argue that they learned about this category of error to some extent. To compare myself with more junior developers, I tried writing a Python web application with the help of ChatGPT. I do not know Python well and have only used it sporadically.

After two or three evenings I had a Python web app with Flask running, and it did what I wanted from it. My experience was mixed, as the recommended Python functions could always be overloaded, and my application's context was not always that well known to ChatGPT. In many cases I ended up looking up the documentation for the libraries or APIs and finding a better fit. Nevertheless, I would not have been able to write this application this fast without ChatGPT. My main issue is that I would not be able to replicate this without the support of an LLM. I could look up my existing code and copy a lot from it, but I have not learned as much as I did from similar hobby projects in languages I didn't know too well (I am also sure that experienced Python developers would shake their heads while reading my code).

To summarize this use case: while I do believe that junior developers and even mid-level developers could finish programming tasks faster with the support of an LLM, I am not too sure about the educational aspects. If we still want those junior developers to progress to senior developers with similar experience and a similar skillset as we have today, I'd argue that we need a new approach to education to counterbalance the effects of LLMs. This assumption is supported by an anecdote a speaker at the JAX conference told me this year: they asked a junior developer to extend an existing program, and the junior developer did the job well. In an in-person code review they asked about the code and the decision process behind it, and pointed out two or three things that were not quite right yet. The junior developer was not able to adapt the code, or even explain everything it did. They confessed to using AI support to write it. While the result was mostly very good, there was no learning experience, and maintaining the code would have become harder and harder for the junior developer.

Use case 2: Migrating large codebases to a new technology

Earlier I mentioned that the biggest part of my job is thinking about data and the concepts behind it. But what if the data model, the business logic and even the abstractions behind it are already established? This matches the conversation about migrating a COBOL codebase which I mentioned in the introduction. They migrated code which had already been in place for decades to a new stack, so that they can get rid of the expensive IBM host systems in the near future. Unfortunately, I don't know which tools they used for their specific project, but I can imagine this working quite well - depending on the goal of the migration.

I don't think that their goal was to embed the COBOL functionality into an existing Java code base, make it use object-oriented patterns and leverage the functionality of a framework like Spring Boot. Existing abstractions in the data would have to be redesigned and rearranged to match the different paradigms of an object-oriented language. If I planned such a task, I would make sure that the migrated COBOL code was encapsulated in its own application, with existing code accessing it via well-defined interfaces, as sketched below. The first impression might be that we have now moved the code from one silo which is not easy to maintain into another silo which is not easy to maintain, and this might be true – but this new silo can now be operated at a fraction of the cost of the old one. There is no need to purchase an expensive new IBM host system in the future, and no need to keep paying IBM for licenses and transactions.
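To make the encapsulation idea a bit more concrete (this is purely my own hypothetical sketch, not something I know about their project), the migrated code would sit behind a narrow, domain-level interface, so the rest of the system never depends on its internal structure:

```java
import java.math.BigDecimal;

// Hypothetical sketch of the interface boundary: callers talk to domain-level
// operations and never see the structure of the translated COBOL routines.
public interface ContractCalculationService {

    PremiumResult calculatePremium(ContractId contractId);

    record ContractId(String value) {}

    record PremiumResult(BigDecimal premium, String currency) {}
}
```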

This would still be the second-best approach. The best approach would be to migrate the old code into a new, idiomatic Java codebase which makes use of all the features the language and the framework provide. Allocating budget for this task, however, is not an easy thing to do in a big corporation. The developers are expected to provide new features to users, not to spend a year working on something which then works exactly as it did before. This way of thinking is short-sighted, but unfortunately present in too many companies (at least in Germany). Reducing the migration and testing effort with AI tooling could be a compromise.

Use case 3: Assisting with languages which are used rarely

I mentioned earlier in this text that I already use LLMs to write bash scripts. When talking to other developers, even AI-sceptical ones, this use case comes up a lot: people use ChatGPT to write SQL queries, SPARQL queries or regular expressions – languages which they do not use that often, which means they have to look up the documentation every single time. This fits a sentiment about LLM-generated texts I have heard quite often: they can be used in cases where errors don't matter that much, or where the result can be checked. For me, this seems to be the most interesting use case so far. It saves a lot of time, and the effect on learning language features is negligible, as these features and languages are only rarely used by the user of the LLM and the intervals are often too long for a learning effect to set in anyway. Unfortunately, this niche is not the one which makes for great marketing. GitHub promises huge productivity gains with Copilot. The marketing claim "Really helps you write this annoying code in that language you don't know that well, but only need every two months" does not have such a nice ring to it.
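The "result can be checked" part is the one that matters most to me, so here is a small, hypothetical illustration: before trusting an LLM-generated regular expression, run it against a handful of inputs whose expected outcome you already know.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical example: validating a generated regex (here: ISO-8601 dates,
// YYYY-MM-DD) against known-good and known-bad inputs before using it.
public class RegexCheck {

    public static void main(String[] args) {
        Pattern isoDate = Pattern.compile("^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$");

        List<String> shouldMatch = List.of("2024-05-05", "1999-12-31");
        List<String> shouldNotMatch = List.of("2024-13-01", "05/05/2024", "2024-5-5");

        shouldMatch.forEach(s ->
                System.out.println(s + " -> " + isoDate.matcher(s).matches())); // expect true
        shouldNotMatch.forEach(s ->
                System.out.println(s + " -> " + isoDate.matcher(s).matches())); // expect false
    }
}
```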

And what now?

The anecdote about the COBOL migration made me realize that I was judging people using LLMs for more complex tasks from a point of emotion. As written above, after thinking about it for some time I can see this working for some use cases, depending on the expectations. While I still see issues with using LLMs for our everyday programming tasks, it might be beneficial to listen to the real people who have been successful in applying them. Dismissing these stories and the experience of these people would be arrogant. They are not the people selling AI products to you, and they gain nothing from exaggerating the benefits.