Top 5 Application Challenges of Large Language Models (LLMs)

Generative AI has burst onto the scene with compelling capabilities, capturing the imagination of everyone from tech innovators to business leaders looking to leverage AI for problem-solving.  As these powerful tools redefine the boundaries of Machine Learning and Natural Language Processing, the rush to adopt and integrate Gen AI into various applications is evident.

However, the enthusiasm also brings to light a range of challenges, from ethical considerations to technical hurdles, that organizations must address to fully harness the potential of models. This article explores these challenges in-depth, providing a clear roadmap for navigating the complexities of Generative AI projects, using LLM (Large Language Model) as an example.

Additional layer increasing app complexity

Integrating a Large Language Model (LLM) like OpenAI ChatGPT, Anthropic Claude, Google Gemini, Microsoft Copilot etc. into an application does more than just add functionality; it introduces a significant layer of complexity.

For instance, LLMs demand vast amounts of computational resources, which can lead to scalability issues as user demands increase. Additionally, ensuring that the model processes data efficiently while maintaining accuracy and speed can challenge even the most experienced developers. 

This complexity often extends to the maintenance phase, where observability, continuous updates and optimizations are necessary to keep the model relevant and effective. Thus, incorporating an LLM into your application not only enhances its capabilities but also demands a robust framework to handle these complex challenges effectively.

Challenge 1 – Non-deterministic responses

One of the more perplexing challenges when working with Large Language Models (LLMs) is their non-deterministic nature of responses. Unlike traditional deterministic systems where the output is predictable and consistent, LLMs can generate different responses to the same input under similar conditions. 

This variability can pose significant challenges, particularly in applications requiring high levels of precision and consistency, such as in legal or medical advisory systems. 

Developers must therefore implement additional layers of validation and testing to ensure that the responses meet the necessary standards of reliability and accuracy. 

Example:
I was using GPT 3.5 Turbo to output JSON structure based on unstructured text input. This is a pretty standard flow for getting structured responses that can be easily used by the application code. Unfortunately from time to time, I was getting responses saying “I do not know what JSON is”.

It turned out that the model performed better when the flow was changed from 1-step (analyze text & output JSON) into 2 separate steps (first – analyze text, second – output JSON).

Challenge 2 – Observability

Most applications chain multiple (not only LLM) models also using external knowledge sources for giving extra context to the flow. Because of many possible points of failure, it is crucial to monitor every step in the flow. The main goal is to evaluate the quality of output (responses) based on a given input. 

For example, if one of the steps is to summarize a piece of text, the text would be our input, and the summary created by LLM would be the output. Having that data we should evaluate if the summarization has all the relevant information from the input and that there are no hallucinations, bias and how relevant the answer is. 

Fortunately, there are ready-to-use solutions like OpenAI Evals, langfuse or deepeval.

This topic is crucial for LLM projects to succeed, so make sure you have the right tools (and knowledge) for the job.

Challenge 3 – (Optimal) Token usage

Using LLMs costs. And cost depends on token usage. Let’s assume that in English 4 tokens = 3 words (1 token = ¾ word), so each word passed to (and returned from) LLM costs money. It is important to build an app in a way that uses tokens efficiently. 

For example, if we want LLM to remember our conversation – it works in a way that with every new message, we also need to pass all the previous messages, which can make the number of tokens increase with every new message. So what can we do to make it more efficient? We can compress the information, for example:

Original message (63 tokens):

“The dog (Canis familiaris or Canis lupus familiaris) is a domesticated descendant of the wolf. Also called the domestic dog, it is derived from extinct gray wolves, and the gray wolf is the dog’s closest living relative. The dog was the first species to be domesticated by humans.”

Compressed Message (19 tokens):

“The dog, a domesticated wolf descendant, was the first species domesticated by humans.”

Such compressed information can be then stored in our knowledge base, most likely a vector database (Pinecone, Chroma, Quadrant, etc.) and then every usage of this information will be significantly cheaper (70% for this information).

Challenge 4 – LLM API limitations

Custom applications use API for connectivity with most models. For the most popular ones (ChatGPT, Gemini, Claude, etc.) it is pretty easy to start with, even build PoC. But when we get closer to a production-grade solution we may encounter issues like downtime or rate limiting.

When it comes to downtime we can check historical data for OpenAI uptime. We can see that it is not available all the time, different outrages are happening from time to time. You should be prepared for such a situation and think of a fallback mechanism for such a situation that will work best for you (for example, ask the same question to a different model).

Rate limiting also is a thing when we use the APIs of these popular models, and rate limits often depend on spending. Different models within OpenAI, Claude, and Gemini have different rate limits, simply higher limits come with higher spending. 

When building an application it is crucial to be able to be aware of those limits and calculate them accordingly. Details regarding respective rate limits for APIs can be found here:

To address this problem, you either need to have a fallback mechanism (asking other models for example) or consider deploying Open Source model(s) on-premise (eg: Mistral, Grok).

Challenge 5 – Security

LLM adds an extra layer to the application which needs to be analyzed from a security perspective. Threats like prompt injection (causing risky actions), denial of service (resulting in high costs or service being down), or sensitive information disclosure must be addressed and tested. The OWASP Top 10 for Large Language Model Applications may be a good starting point in making your LLM application more secure. It contains a list of the most popular vulnerabilities and possible ways to tackle them

Summary

Integrating Large Language Models (LLMs) into applications introduces several significant challenges. Developers face numerous hurdles from the added layer of complexity and non-deterministic responses to issues of observability, optimal token usage, API limitations, and security concerns. 

Addressing these challenges requires a comprehensive approach, including efficient resource management, robust validation mechanisms, thorough monitoring, and a focus on security best practices. By understanding and preparing for these challenges, organizations can better harness the potential of LLMs to enhance their applications and drive innovation.