In the rapidly evolving world of large language models (LLMs), ensuring the safety and robustness of these powerful systems is of utmost importance. A recent study by OpenAI introduces the concept of an instruction hierarchy, drawing inspiration from the protection rings found in modern operating systems. This groundbreaking approach aims to mitigate the risks associated with prompt injections and other attacks that can compromise the integrity of LLM-powered applications. In this blog post, we will go into the details of the instruction hierarchy and explore how it enhances the security and reliability of LLMs.

Understanding the instruction hierarchy

The instruction hierarchy is a framework that explicitly defines how LLMs should prioritize and handle instructions of varying privilege levels. It assigns different priorities to instructions based on their source and importance. The hierarchy consists of four main levels:

  1. Priority 0 (critical): System Message
  2. Priority 10 (high): User Messages
  3. Priority 20 (medium): Messages or Instructions in images or audio
  4. Priority 30 (low): Text from tools (e.g., web browsing, search, code, uploaded and retrieved documents)
example conversation with chatGPT

An example conversation with ChatGPT

The key idea behind the instruction hierarchy is that higher-priority instructions should take precedence over lower-priority ones. When instructions at different levels conflict, the LLM is trained to disregard the lower-level instruction in favor of the higher-level one. This approach ensures that the LLM remains aligned with the intended behavior and prevents unauthorized modifications.

Generating training data

To effectively teach LLMs the instruction hierarchy, the researchers propose two methods for creating training data: context synthesis and context ignorance.

Context synthesis

For aligned instructions, where lower-level instructions align with higher-level ones, the researchers generate examples using context synthesis. They create compositional requests (e.g., “write a 20 line poem in Spanish”) and decompose them into smaller pieces (e.g., “write a poem,” “use Spanish,” “use 20 lines”). These decomposed instructions are then placed at different levels of the hierarchy, and the LLM is trained to predict the original ground-truth response.

Context ignorance

For misaligned instructions, where lower-level instructions conflict with higher-level ones, the researchers employ context ignorance. They train the LLM to act as if it is completely unaware of the lower-level instructions. This is achieved by generating examples using red-teamer LLMs for various attacks (e.g., prompt injections, system prompt extractions) and combining them with generic instruction-following examples.

Evaluation and results

The researchers evaluate their approach using both open-source and novel benchmarks, including attacks that differ from those encountered during training. The results are impressive, with the instruction hierarchy dramatically improving robustness across all evaluations. For instance, defense against system prompt extraction is improved by 63%, and jailbreak robustness increases by over 30%. Notably, the LLM exhibits generalization to unseen attacks, suggesting that it has internalized the instruction hierarchy.

The model trained with the instruction hierarchy has substantiallyhigher robustness across a wide range of attacks

The model trained with the instruction hierarchy has substantially
higher robustness across a wide range of attacks

While there are some regressions in “over-refusals,” where the LLM occasionally ignores or refuses benign queries, the generic capabilities of the models remain largely unaffected. The researchers are confident that this can be addressed with further data collection.

Conclusion

The instruction hierarchy represents a significant step forward in enhancing the safety and robustness of LLMs. By prioritizing privileged instructions and training LLMs to handle aligned and misaligned instructions appropriately, this approach effectively mitigates the risks associated with prompt injections and other attacks. The impressive results and generalization capabilities demonstrate the potential of the instruction hierarchy in building more secure and reliable LLM-powered applications.
As LLMs continue to advance and find applications in various domains, ensuring their safety and trustworthiness becomes increasingly crucial. The instruction hierarchy provides a promising framework for addressing these challenges and paves the way for further research and development in this field. By combining the instruction hierarchy with other system-level safeguards and ongoing advancements in LLM training, we can look forward to a future where LLMs are not only powerful but also secure and dependable.

Categorized in:

Deep Learning,

Last Update: 25/04/2024