Behind the Scenes: A Guide to LLM Safety
By Will Carter, Responsible Generative AI, Accenture
Executive Summary
Large language models (LLMs) are a type of artificial intelligence (AI) system that can generate original content with a variety of applications and associated benefits. They are multi-purpose tools which have the potential to transform many areas, including content creation and discovery, research, customer service, and developer efficiency.
Often, the same LLM can be used in multiple applications. This flexibility is one of LLMs’ greatest strengths, as pretraining large models requires significant investments in compute power and expertise. To realize their potential, LLMs must be fit-for-purpose and deployed in well-designed products. It is also important that these systems are developed and deployed responsibly, in ways that address identifiable concerns like fairness, privacy and safety.
This paper provides a brief overview of how LLMs can be tailored to different applications responsibly, while managing risk. Given the range of products and applications powered by LLMs, there is no one-size-fits-all solution: industry standards, guidelines, and regulations must be flexible enough to let the AI community tailor mitigations to their specific applications, products, and use cases. A global approach will be critical to ensure that regulations are not fragmented or conflicting, while also enabling interoperability.
LLMs are multi-purpose tools
LLMs can be used in a wide range of applications, including question answering, translation, chatbots, customer service, summarization and creative writing, to name just a few. Often, the same LLM can be used in multiple applications, and some products are designed to perform multiple tasks for users. This flexibility is one of LLMs’ greatest strengths. Pretraining large models requires significant investments in computing power, data, and expertise, while adapting pre-trained models for a variety of downstream applications is far cheaper and more accessible to a wide range of deployers and users.
Depending on the application, products that use LLMs will often prioritize different attributes of the models and manage different risks. For example, a search engine that uses LLMs to answer questions based on information on the internet is expected to return accurate information and direct users to authoritative sources to learn more. If the system provides inaccurate answers or draws from misleading sources, it will not only be less useful, it could cause significant harm by spreading misinformation. It’s also important that these systems are prepared to handle queries on sensitive topics like contemporary politics or medical information, providing authoritative information in an appropriate way, which sometimes means declining to answer altogether.
In contrast, creative writing applications often prioritize creativity and originality over sourcing and grounding in fact. For example, a search engine might want to respond to a query about “how to identify the lizard people that control the world” by surfacing information debunking that story. However, it would be difficult for a creative writing system to help write a fictional short story about a dragon if the system prioritized telling the user that dragons do not exist.
Similarly, translation systems that have a greater understanding of context and nuance can provide more accurate translations than rote word-by-word methods. For customer service, grounding responses in relevant information from a single source, for example about the company or product the customer is calling about, is paramount. For summaries, making sure they are accurate, appropriately sourced, respect intellectual property rights, and enable users to seek additional information easily is key.
Flexible applications, multi-purpose tools:
One of the advantages of LLMs is that they can power a range of different tools, which can focus more on creativity or accuracy, depending on the use case. Here are just a few examples of these:
Creative writing: chat applications, for example, can be used for a variety of creative activities including writing a poem or short story – where it may be more important to focus on creativity than factuality.
Translation: understanding context and nuance is important for these types of use cases, as this can help provide translations that sound closer to natural language.
Question answering and summarization: these types of tools can help with identifying the most important information in a document (or group of documents) or answering questions on specific topics, which may require more fidelity to underlying sources.
These are just a few of many types of applications that address different user needs, ranging from customer service to research and analysis to code generation. Depending on the application, LLM-enabled products should be designed to emphasize the attributes of LLMs that fit users’ needs and reduce risk.
There are many ways to customize LLMs and manage risks
AI experts have developed multiple methods to tailor general-purpose LLMs to different applications – and to manage potential risks of AI-enabled products. Some changes can be made to the base model, some can be incorporated into fine-tuned versions of the model for specific applications, and others must be implemented at the final product level. Most developers use multiple methods to align their applications with their intended goals.
Dataset filtering
The first step is to evaluate training data and ensure that it is appropriate for the intended uses of the model. In some cases, this means filtering examples from the training data that could lead to inappropriate behavior, like removing abusive language that might be emulated by a chatbot. But LLMs are often pretrained on massive datasets that comprise much of the language on the internet, which makes it impractical to manually review all of the training data and remove every piece of potentially offensive language. The multi-purpose nature of LLMs also complicates filtering: while a chatbot should generally avoid inappropriate language, a model used for other applications, like detecting online abuse, may need to be exposed to inappropriate language in order to learn to detect it. And for many topics, scrubbing the subject from the training data entirely can cause more harm than leaving it in. For example, a model trained on a dataset with all references to Nazis or the Holocaust removed could mislead users about this period of history.
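To make this concrete, here is a minimal sketch of keyword-based dataset filtering in Python. The blocklist terms, helper names, and dataset are all hypothetical placeholders; real pipelines typically combine curated term lists with trained toxicity classifiers rather than relying on keywords alone.

```python
import re

# Placeholder terms standing in for a curated blocklist (hypothetical).
BLOCKLIST = ["example_slur", "example_threat"]
PATTERNS = [re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in BLOCKLIST]

def is_clean(example: str) -> bool:
    """Return True if a training example contains no blocked terms."""
    return not any(p.search(example) for p in PATTERNS)

def filter_dataset(examples):
    """Yield only the examples that pass the keyword filter."""
    return (ex for ex in examples if is_clean(ex))

raw = ["a friendly sentence", "a sentence containing example_slur"]
print(list(filter_dataset(raw)))  # -> ['a friendly sentence']
```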
Fine-tuning
The next step is careful testing and evaluation of the models themselves, using tools like Responsible AI with TensorFlow, to identify behaviors you want to change or risks that need to be managed. In some cases, developers can “fine-tune” the model, retraining it on smaller, curated datasets to change specific behaviors. For example, if a model is using gendered terms for certain professions, e.g. “he” for “doctor” and “she” for “nurse,” you can fine-tune the model on a dataset of more gender-balanced examples so that it gives more balanced responses. Models can also be fine-tuned on datasets with labeled examples of offensive language so that they are better able to identify and avoid inappropriate responses.
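As an illustration, the sketch below fine-tunes a small open model on a handful of gender-balanced sentences using the Hugging Face transformers library. The base model, output directory, and training examples are placeholder assumptions; a real fine-tuning run would use a much larger curated dataset and tuned hyperparameters.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder base model, not a recommendation
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 defines no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny stand-in for a curated, gender-balanced dataset.
curated_texts = [
    "The doctor finished her rounds before noon.",
    "The nurse updated his patient notes.",
]

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, padding=True)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: torch.tensor(v[i]) for k, v in self.enc.items()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-balanced",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=TextDataset(curated_texts),
    # mlm=False copies input tokens to labels for causal LM fine-tuning.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # fine-tuned weights are written under output_dir
```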
Sometimes these changes are made in the underlying general-purpose model, but in most cases it makes more sense to fine-tune application-specific versions of the model instead. For example, if the same general-purpose LLM powers both a customer support chatbot and a classifier that detects online abuse, the version behind the chatbot should be fine-tuned to minimize offensive language, while the version used to train the abuse classifier should be able to generate the right quantity and types of abusive language so that the classifier learns to detect real examples in the wild. Because the two applications require opposing behaviors, a separate version of the model can be fine-tuned for each without retraining the base model.
Grounding
Fine-tuning is also important for grounding LLMs in relevant information. An LLM used to answer a broad range of questions based on information from the internet should be generally knowledgeable about a wide range of topics. But if a company wants to use an LLM to help users navigate its website or answer customer service questions, the LLM should be grounded in information about the specific company and site, and regularly updated, so its responses are relevant to the user. Similarly, if an LLM is intended for a specific domain, for example responding to queries about health, the model should be fine-tuned on relevant medical information from authoritative sources.
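A simple retrieval step is one way to implement this kind of grounding. The sketch below is a toy example, assuming a hypothetical document store and a `generate` callable standing in for the LLM; production systems would typically retrieve by embedding similarity over a much larger corpus.

```python
# Hypothetical company knowledge base (placeholder content).
COMPANY_DOCS = {
    "returns": "Items can be returned within 30 days with a receipt.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}

def retrieve(query: str) -> str:
    """Naive keyword-overlap retrieval; real systems use embeddings."""
    words = query.lower().split()
    scores = {key: sum(w in doc.lower() for w in words)
              for key, doc in COMPANY_DOCS.items()}
    return COMPANY_DOCS[max(scores, key=scores.get)]

def grounded_answer(query: str, generate) -> str:
    """Prepend retrieved company information so responses stay on-source."""
    context = retrieve(query)
    prompt = (f"Answer using only this company information:\n{context}\n\n"
              f"Question: {query}\nAnswer:")
    return generate(prompt)
```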
Mitigating risks in products using classifiers
Not all behaviors or risks can be mitigated during model training. Some are best addressed with product features that interact with the LLM itself. Detecting inappropriate or abusive queries or responses can often be done most effectively using classifiers: AI systems that categorize data, e.g. labeling queries or responses from an LLM as safe or unsafe. These classifiers are part of the broader system and help shape how users interact with the language model.
Classifiers can be used to filter inputs, blocking the model from responding to inappropriate or dangerous prompts from the user (e.g. how to build a bomb), and to filter outputs, preventing the model from showing potentially harmful responses (e.g. abusive language from a chatbot). They can also be used for “rejection sampling,” in which the model produces multiple candidate outputs, each is scored on the level of potential harm it could cause, and only the most appropriate response is shown to the user.
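The sketch below shows how these patterns fit together: an input filter, rejection sampling over several candidates, and a final output filter. Here `generate` and `safety_score` are hypothetical stand-ins for the underlying LLM call and a classifier that returns a harm probability between 0 and 1.

```python
def answer(prompt: str, generate, safety_score,
           n_samples: int = 4, threshold: float = 0.5) -> str:
    # Input filter: refuse clearly inappropriate prompts outright.
    if safety_score(prompt) > threshold:
        return "Sorry, I can't help with that request."

    # Rejection sampling: draw several candidates and keep the least risky.
    candidates = [generate(prompt) for _ in range(n_samples)]
    best = min(candidates, key=safety_score)

    # Output filter: block even the best candidate if it is still unsafe.
    if safety_score(best) > threshold:
        return "Sorry, I can't provide a response to that."
    return best
```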
Classifiers can do more than limit inappropriate or unsafe responses from the model; they can also redirect users to help resources. For example, classifiers that detect prompts related to self-harm can route users to mental health resources.
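The same classifier pattern supports this kind of redirection. In this hypothetical sketch, `selfharm_score` is an assumed classifier over user prompts, and the threshold and message are illustrative only.

```python
CRISIS_MESSAGE = ("If you are struggling, help is available. Please consider "
                  "reaching out to a local crisis line or a trusted professional.")

def route(prompt: str, selfharm_score, answer) -> str:
    # Surface help resources instead of a normal model response.
    if selfharm_score(prompt) > 0.8:  # illustrative threshold
        return CRISIS_MESSAGE
    return answer(prompt)
```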
While classifiers are an important tool to safeguard content, they also come with tradeoffs. Classifiers can “overtrigger,” blocking legitimate queries or responses, which can lead to other kinds of harm, for example if a classifier intended to prevent homophobic language inadvertently blocks healthy conversations about LGBTQ+ experiences.
User empowerment
In addition to tailoring models and implementing guardrails in products, it’s also important that providers of LLM-enabled products provide clear guidance to users on how products work, what their limitations are, and how they are intended to be used. Not all risks can be mitigated technically, so clear product documentation, onboarding experiences, and UI features like labels for AI-generated content should highlight relevant information at the right time in ways that enable users to make informed decisions about how they interact with the system.
Other product features like embedded links to sources in summaries, or the “double-check” feature in Google Gemini (which allows users to check the output of an LLM against information from authoritative sources) also empower users to hold LLMs accountable and take steps to address inaccurate or misleading results.
Ongoing monitoring to understand patterns of abuse
Finally, it’s essential to monitor products on an ongoing basis to detect patterns of misuse and potential harm. Ongoing monitoring can identify users who repeatedly try to use models in inappropriate ways, such as prompting the system to make disparaging comments about protected groups. Using this information, the product can direct them to warnings or additional training on how to use the system appropriately, or even ban them from the service if needed. Monitoring can also help product teams understand patterns of inappropriate responses generated by the model so they can address them, and identify common ways users abuse the system so they can make product changes that limit harmful use.
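One simple monitoring primitive is a per-user count of policy-violating requests with escalation thresholds, sketched below. The thresholds and action names are hypothetical; real products tune these from observed abuse patterns and combine them with human review.

```python
from collections import Counter

WARN_AFTER, SUSPEND_AFTER = 3, 10  # illustrative thresholds
flag_counts = Counter()

def record_flag(user_id: str) -> str:
    """Record one policy-violating request and return the action to take."""
    flag_counts[user_id] += 1
    if flag_counts[user_id] >= SUSPEND_AFTER:
        return "suspend"   # repeated abuse: remove access
    if flag_counts[user_id] >= WARN_AFTER:
        return "warn"      # direct the user to usage guidance
    return "log"           # keep for aggregate pattern analysis
```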
There is no one-size-fits-all solution
While these techniques give developers and deployers a variety of tools to improve the performance of LLMs and mitigate risks, different products and applications require different approaches. Each LLM is different, and the diverse products built on LLMs require the models to perform in different ways.
AI technologies are also constantly changing, as the growth of LLMs over the last few years has demonstrated, like OpenAI’s GPT-3 being replaced by GPT-4 or the evolution of Google’s Bard into Gemini. In addition to chatbots, LLMs are increasingly being used in a variety of products, from search engines to word processors to invoice processing systems. As the models and products built on them are changing, so are the tools and techniques available to improve performance and manage risk.
Efforts to develop industry standards, guidelines, and regulations for how LLMs are used in products are important, but they must be flexible enough to allow companies using LLMs to tailor mitigations to their specific products and use cases. A global approach is critical to ensure that regulations are not fragmented or conflicting, and to enable interoperability. Any requirements should also leave companies free to use the most effective techniques to improve performance and manage risk, whether that is data filtering, fine-tuning, classifiers, user education, or ongoing monitoring, and to evolve their approaches over time as new techniques are developed.
Like every new technology, LLMs have tremendous potential to benefit users and society, and require a thoughtful and balanced approach that maximizes those benefits and supports innovation, while reinforcing high quality standards and managing risk.