LLMs: our future overlords are hungry and thirsty
Since early this year, the news around generative AI technologies, such as ChatGPT, has been never ending. Some have even suggested that humanity's existence is at stake. While there's a massive amount of hype, there's also a lot of potential, as shown by tools such as ChatGPT and Copilot. Consequently, I decided to explore generative AI from the perspective of an enterprise software architect. This article is the first in a series about generative AI - specifically Large Language Models (LLMs) - and the microservice architecture.
A large language model is a function…
Large Language Models (LLMs) are a generative AI technology for natural language processing.
Simply put, an LLM is a function that takes a sequence of words as input - the prompt
- and returns a sequence of words that’s the most likely completion of the prompt.
$ python3
Python 3.11.4 ..
>>> from langchain.llms import Ollama
>>> llm = Ollama(model="llama2")
llm("who is Oppenheimer")
' J. Robert Oppenheimer was an American theoretical physicist and professor who made significant contributions to...
Not particularly threatening, right?
…that is implemented by a neural network…
An LLM is implemented by a neural network. The details are quite complicated. But the basic idea is that the neural network is trained on a large amount of text to predict the next word (or more accurately a token, which is an encoded word fragment) given an input sequence of words. The entire completion is constructed by iteratively predicting the next token given the input sequence and the previous predicted tokens.
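Here's a minimal sketch of that iterative loop. It assumes a hypothetical predict_next_token() function standing in for the neural network's forward pass; real LLMs also sample from a probability distribution rather than always picking the single most likely token.

# A minimal sketch of autoregressive text generation.
# predict_next_token() is a hypothetical stand-in for the neural network,
# which returns the most likely next token given the tokens so far.
def complete(prompt_tokens, predict_next_token, eos_token, max_new_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # prompt + previously predicted tokens
        if next_token == eos_token:              # stop at the end-of-sequence token
            break
        tokens.append(next_token)
    return tokens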
… with numerous NLP use cases
LLMs have numerous use cases, including text generation, summarization, rewriting, classification, entity extraction, and semantic search. To learn more about LLM use cases see Large Language Models and Where to Use Them: Part 1 and Hugging Face Transformers.
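For instance, here's a rough sketch of the summarization use case, reusing the langchain/Ollama setup from the earlier example (it assumes a local Ollama server running the llama2 model; the prompt wording is just an illustration):

# A rough sketch of one use case - summarization - using the same
# langchain/Ollama setup as the earlier example.
from langchain.llms import Ollama

llm = Ollama(model="llama2")
document = "..."  # the text to be summarized
summary = llm(f"Summarize the following text in two sentences:\n\n{document}")
print(summary)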
LLMs are hungry
LLMs are rather large, with billions of parameters, which are the weights of the neural net's neurons and connections. For example, the GPT-3 LLM has 175 billion parameters. And Facebook's LLaMA is a collection of language models ranging from 7B to 65B parameters. Consequently, LLMs involve lots of math and are hungry for computational resources, specifically expensive GPUs.
The resource requirements depend on the phase of the LLM's lifecycle. The three phases of an LLM's lifecycle are:
- training - creating the LLM from scratch
- fine tuning - tailoring the LLM to specific tasks
- inferencing - performing tasks with the LLM
Training and inferencing both require a lot of compute resources. Let’s look at each of these in turn.
Training
Training an LLM from scratch is an extremely computationally intensive task. For example, training the GPT-3 LLM required 355 years of GPU time. The training cost was estimated at $4.6 million. Training is so costly because it requires lots of GPUs that have large amounts of memory, which are expensive. For example, an AWS EC2 p5.48xlarge instance, which has 8 GPUs, each with 80GB of memory, costs $98.32 per hour. Consequently, most organizations will use a 3rd party, pretrained LLM.
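To put those numbers in perspective, here's the back-of-the-envelope arithmetic implied by the figures above (illustrative only; newer GPUs are faster, so the comparison with current on-demand prices isn't exact):

# Rough arithmetic behind the GPT-3 training estimate quoted above.
gpu_years = 355
gpu_hours = gpu_years * 24 * 365                 # ~3.1 million GPU-hours
implied_cost_per_gpu_hour = 4.6e6 / gpu_hours    # the $4.6M estimate implies ~$1.48/GPU-hour
p5_cost_per_gpu_hour = 98.32 / 8                 # ~$12.29/GPU-hour at on-demand p5.48xlarge pricing
print(f"{gpu_hours:,.0f} GPU-hours, ~${implied_cost_per_gpu_hour:.2f} vs ~${p5_cost_per_gpu_hour:.2f} per GPU-hour")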
Fine-tuning
Fine tuning an LLM, which tailors an LLM to a specific task by adjusting the parameters, is much less computationally intensive than training. As a result, it's fairly inexpensive, but it still requires GPUs.
Inferencing
Inferencing with an LLM, which is using the LLM to perform tasks, is less computationally intensive than training, but typically still requires GPUs.
Moreover, each GPU must have sufficient memory to store all of the LLM’s billions of parameters.
By default, each parameter is 16 bits, so 2 bytes per parameter are required.
Sometimes, however, quantization can be applied to use fewer bits per parameter although there’s a trade-off between accuracy and memory usage.
Machines with GPUs that can run LLMs are more expensive than machines without GPUs.
For example, an AWS g5.48xlarge instance, which has 8 GPUs each with 24GB of memory, costs $16/hour.
A comparable non-GPU instance, such as an m7i.48xlarge, costs $9.6768/hour.
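As a rough guide, here's the weight-memory arithmetic described above, applied to the 7B and 65B parameter counts mentioned earlier (weights only; activations and the KV cache need additional memory):

# Back-of-the-envelope GPU memory needed just to hold an LLM's weights.
def weight_memory_gb(num_parameters, bits_per_parameter=16):
    return num_parameters * bits_per_parameter / 8 / 1024 ** 3

print(weight_memory_gb(7e9))      # ~13 GB for a 7B model at 16 bits
print(weight_memory_gb(7e9, 4))   # ~3.3 GB with 4-bit quantization
print(weight_memory_gb(65e9))     # ~121 GB for a 65B model - more than one GPU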
LLMs are also thirsty
Since LLMs are computationally expensive, they are also thirsty for water. Lots of computation requires a lot of electricity, which requires a lot of water to cool the data centers. For example, studies estimate that a ChatGPT conversation consumes as much as 500ml of water. So much for the environment!
LLMs and the microservice architecture
Let’s imagine that you want to add an LLM to your enterprise Java application. It’s quite likely that you will want to deploy the LLM inferencing code as a separate service for the following two reasons: efficient utilization of expensive GPU resources and the need to use a different technology stack. Let’s look at each reason in turn.
Efficient utilization of expensive GPU resources
There are two ways to run an LLM: self-hosted or SaaS. If you are self-hosting an LLM, then you are running software that has very distinctive resource requirements. LLMs must run on more expensive machines that have GPUs. Therefore, in order to utilize those resources efficiently, you will need to resolve the segregate by characteristics dark energy force by deploying your LLM as a separate service running on specialized infrastructure. You typically wouldn't want to run non-GPU services on the same machines as your LLM services, since that could result in over-provisioning of GPUs.
Separate technology stack
The second reason to run your LLM-related code as a separate service is that it will likely use Python instead of Java. While you might run the LLM using a Java technology stack, Python appears to have a much richer ecosystem. Furthermore, even if you are using a SaaS-based LLM, you will often write Python code to interact with the LLM. For example, the prompt tuning/engineering code that tailors an 'off the shelf' LLM to a specific task is often written in Python. Consequently, in order to resolve the multiple technology stacks dark energy force, you will need to deploy your Python-based LLM logic as a separate service.
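As an illustration, here's a rough sketch of that kind of prompt engineering code, again using langchain and the local Ollama setup from earlier; the classification task and template are made up for this example:

# A rough sketch of prompt engineering code that tailors an 'off the shelf'
# LLM to a specific task - the kind of Python logic that would live in a
# separate LLM service. The task and template are hypothetical.
from langchain.llms import Ollama
from langchain.prompts import PromptTemplate

llm = Ollama(model="llama2")

template = PromptTemplate.from_template(
    "Classify the sentiment of the following customer review as "
    "positive, negative or neutral:\n\n{review}"
)
print(llm(template.format(review="The order arrived late and damaged.")))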
What’s next?
In future articles, I'll explore the topic of LLMs and microservices in more detail.
Need help with accelerating software delivery?
I’m available to help your organization improve agility and competitiveness through better software architecture: training workshops, architecture reviews, etc.