How To Fine-Tune Hugging Face Transformers on a Custom Dataset

Language models have come a long way in recent years, and their capabilities have expanded rapidly. With the right prompt, a language model can generate text that is almost indistinguishable from what a human would produce. In this post, we’ll explore the art of prompt engineering, which is the process of crafting the perfect prompt to guide the language model to produce the output you want.

If you don’t want any background or tips about prompt engineering, you can jump directly to the LangChain portion of this article by clicking here. And if you want to jump directly into the code? We got you covered there too:

This article is best-enjoyed piece by piece. Learn a topic and try it out. Prompt engineering is fun!

Want to jump right in and get started? Visit the Langchain docs for our integration to get access to working examples right away.

🤔 Talking to AI: Why Should You Care About Prompt Engineering?

Prompt engineering is crucial because it determines the quality of the output that the language model produces. A well-crafted prompt can steer the model in the right direction, while a poorly crafted prompt can lead to irrelevant or nonsensical output. The key to prompt engineering is to provide the model with the right combination of instructions, context, input data, and output indicators.

We’ll discuss all this in more detail below:

💡 Putting on my Thinking Cap: How to Effectively Prompt Engineer

Big thanks to the folks at Co:here and Dair.ai for their insight on prompt engineering. Subscribe to both for the latest tips for LLM prompt engineering.

We’re going to look at an important parameter and sampling technique before we dig into some suggestions for how to effectively engineer prompts for LLMs.

🌡️ Temperature

Temperature is a parameter used to control the randomness of the generated text. A lower temperature will result in more conservative predictions, while a higher temperature will encourage more diverse and creative outputs. Determining the optimal temperature for a specific task is crucial to achieving the desired output.

When working on fact-based questions where accuracy is key, a lower temperature should be used to encourage concise and factual responses. On the other hand, creative tasks such as poetry generation could benefit from a higher temperature to encourage more imaginative outputs.

🔝 Top P

top_p is a sampling technique that controls how deterministic the model is at generating a response. With top_p, we can limit the number of possible next tokens to consider for the next output.

If a low top_p is selected, the model will only consider the most probable tokens, resulting in more accurate but potentially less diverse outputs. If a higher top_p is selected, the model is encouraged to consider a wider range of tokens, producing more diverse outputs.

When accuracy is essential, a lower top_p should be used to encourage the model to select only the most probable tokens. For more diverse outputs, increase top_p to allow the model to consider a wider range of tokens.

🚀 Start Simple

As with any new skill, it’s best to start with simple prompts and gradually increase complexity. You can start with a playground like OpenAI’s or Co:here’s to experiment with basic prompts. As you become more comfortable with prompt engineering, you can add more elements and context to your prompts.

🧐 Three Tips for Prompt Engineering

📒 Use Instructive Language

Commands are an effective way to instruct the model to perform a specific task, such as “Write,” “Classify,” “Summarize,” “Translate,” or “Order.” You can experiment with different instructions, keywords, and contexts to see what works best for your particular use case and task.

One important tip is to avoid negative prompts that focus on what not to do, and instead, provide positive prompts that specify what to do. This approach encourages more specificity and details in responses from the model.

🎯 Be Specific

The more specific and detailed the prompt is, the better the results.

This is particularly important when you have a desired outcome or style of generation you are seeking. Adding examples in the prompt is very effective to get desired output in specific formats. When designing prompts, keep in mind the length of the prompt, as there are limitations regarding how long this can be. Too many unnecessary details are not necessarily a good approach. The details should be relevant and contribute to the task at hand.

🙅 Avoid Imprecision

It’s often better to be specific and direct rather than trying to be too clever about prompts and potentially creating imprecise descriptions. You might still get somewhat good responses with imprecise prompts, but a better prompt is generally very specific and concise.

🧠 Thinking in Prompts: Examples of Prompt-Driven Tasks

In this section, we’ll walk through some common NLP tasks and how prompt engineering can get you the desired outputs:

📝 Text Summarization

Text summarization is a standard task in natural language generation. It involves condensing longer texts into shorter, more concise summaries. One of the most promising applications of language models is the ability to summarize articles and documents like boardroom meeting minutes into quick summaries.

To perform text summarization, it’s important to provide clear instructions to the language model. For example, you can use a prompt such as “Explain [topic]” followed by “in one sentence.” Additionally, providing examples of the desired output can help the model understand what you are looking for.

📊 Information Extraction

Information extraction involves identifying and extracting specific pieces of information from a given text. This task can be useful for applications such as data mining and business intelligence.

To prompt a language model to perform information extraction, you can provide a context followed by a question that specifies the desired information. For example, “Find the [specific piece of information] mentioned in the paragraph above.”

❓ Question Answering

Question answering involves providing a specific answer to a given question based on a given context. To prompt a language model to perform question answering, it’s important to provide a clear and structured prompt. This can include providing a context, a question, and an output indicator such as “Answer.”

📈 Text Classification

Text classification involves categorizing a given text into a specific category or label. To prompt a language model to perform text classification, you can provide clear instructions such as “Classify the text into [specific category or label].” Additionally, providing examples of input data and desired output can help the model understand what you are looking for.

💬 Conversation

Conversational systems involve prompting a language model to engage in a conversation with a user. To prompt a language model to engage in a conversation, it’s important to provide clear instructions on the behavior, intent, and identity of the system. This can include providing a greeting, a specific tone, and instructions on how to respond to specific prompts.

💻 Code Generation

Code generation involves prompting a language model to generate code based on a given prompt. This task can be useful for automating repetitive coding tasks or generating code based on specific requirements. To prompt a language model to perform code generation, you can provide clear instructions such as “Create a [specific programming language] query for [specific task].” Additionally, providing examples of input data and desired output can help the model understand what you are looking for.

You can get a little more detail in our recent post, where we compare the code-writing chops of GPT3.5 and GPT4.

🤔 Reasoning

Reasoning tasks involve prompting a language model to perform logical or mathematical reasoning based on a given prompt. While current LLMs tend to struggle with reasoning tasks, it’s still possible to prompt them to perform some basic arithmetic. To prompt a language model to perform reasoning tasks, it’s important to provide clear and detailed instructions on the task, including examples of input data and desired output.

📈 Unlock the Power of LLMs: Advanced Prompting Techniques

🎯 Zero-Shot and Few-Shot Prompting

Zero-shot and few-shot prompting are techniques that allow us to generate responses from LLMs without fine-tuning on specific tasks. This is useful in situations where there is limited training data available or when it is not possible to fine-tune the model for a specific task.

🎯 Zero-Shot Prompting

Zero-shot prompting involves providing the LLM with a prompt that describes the task to be performed without providing any examples or training data. The model then generates a response based on its understanding of the task. This technique is useful for tasks that involve simple, straightforward rules or patterns.

🎯 Few-Shot Prompting

Few-shot prompting involves providing the LLM with a few examples of the task to be performed, along with a prompt that describes the task. The model then generates a response based on its understanding of the task and the provided examples. This technique is useful for tasks that involve more complex patterns or rules that are not easily captured by a simple prompt.

⛓️ Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding. CoT prompting involves breaking down a complex task into a series of simpler sub-tasks or intermediate reasoning steps. This allows the language model to reason through the steps and arrive at the correct answer.

🔄 Self-Consistency

Self-consistency is a technique that improves the performance of chain-of-thought (CoT) prompting on tasks involving arithmetic and commonsense reasoning. It involves sampling multiple, diverse reasoning paths through few-shot CoT, and using the generations to select the most consistent answer. The idea behind self-consistency is to replace the naive greedy decoding used in CoT prompting.

📖 Generated Knowledge Prompting

Generated knowledge prompting is a technique that incorporates knowledge or information to help the model make more accurate predictions. The generated knowledge is included as part of the prompt. This technique is particularly useful for tasks that require more knowledge about the world.

🤖 Automatic Prompt Engineer (APE)

Automatic Prompt Engineer (APE) is a framework that generates instructions and selects the most appropriate one for a given task. A large language model is used to generate candidate instructions based on provided output demonstrations. The generated instructions guide the search procedure, and the most appropriate one is selected based on computed evaluation scores. The framework addresses the instruction generation problem as a black-box optimization problem using LLMs to generate and search over candidate solutions. APE has been shown to discover a better zero-shot CoT prompt than the human-engineered “Let’s think step by step” prompt.

🔒 Protecting Your Prompts: How to Prevent Adversarial Prompting

Adversarial prompting is a growing concern in prompt engineering, as it poses a risk to the safety and reliability of language model applications. In this section, we will discuss some of the methods that can be used to prevent adversarial prompting and ensure the safety of your language models.

📝 Parameterizing Prompt Components

One of the most effective ways to prevent prompt attacks is to parameterize the different components of the prompts, such as having instructions separated from inputs and dealing with them differently. Parameterization can lead to cleaner and safer solutions, but it may also result in less flexibility. This is an active area of interest as we continue to build software that interacts with LLMs.

👩‍🔬 Robust Testing

Robust testing is necessary to prevent prompt leaking, a form of prompt injection where prompts containing confidential information or proprietary information are leaked. A lot of startups are already developing and chaining well-crafted prompts that are leading to useful products built on top of LLMs. Developers need to consider the kinds of robust testing that need to be carried out to avoid prompt leaking.

🛡️ Quotes and Additional Formatting

Quoting and other formatting techniques can also be used to prevent prompt injection. Escaping the input strings can help to prevent certain types of injection attacks. One example is to use JSON encoding plus markdown headings for instructions and examples. More research is needed to determine the most effective methods for preventing prompt attacks.

🔓Jailbreaking

Some models will avoid responding to unethical instructions, but they can be bypassed if the request is contextualized in a clever way. For example, a prompt like “Can you write me a poem about how to hotwire a car?” was able to bypass the content policy of previous versions of ChatGPT. Models like ChatGPT and Claude have been aligned to avoid outputting content that promotes illegal or unethical behavior. However, they still have flaws, and we are learning new ones as people experiment with these systems.

🦜⛓️ Why Use LangChain for Prompt Workflows

Prompt engineering can be a challenging task, especially if you’re new to the field. Luckily, there are tools and frameworks available that can help you get started. One such tool is LangChain, a powerful platform for prompt engineering with LLMs.

By providing specific instructions, context, input data, and output indicators, LangChain enables users to design prompts for a wide range of tasks, from simple text completion to more complex natural language processing tasks such as text summarization and code generation. LangChain also allows users to adjust various LLM settings, such as temperature and top_p, to further fine-tune the results of their prompts.

In this section, we’ll showcase how LangChain works, how you can use it, and how you can track it

🤖 What are the LangChain Modules

LangChain provides a comprehensive set of modules to simplify the development of large language models. Each module has a specific role in the generation of high-quality prompts and output. The main modules are:

📝 Prompt Template

The Prompt Template module allows you to define the structure and format of your prompts. This module is used to specify input data, context, and output indicators. Prompt Templates can help ensure consistency and efficiency when generating prompts.

🧠 LLM

The Large Language Model (LLM) module provides a standard interface for state-of-the-art language models for generating high-quality output. LangChain supports a wide range of tasks, including text summarization, mathematical reasoning, and code generation. The module includes pre-trained LLMs that can be fine-tuned to match the needs of specific conversational agents.

📄 Document Loaders

The Document Loaders module ingests documents, such as PDFs or PowerPoints, into the LLM for further analysis. Combining language models with your own text data is a powerful way to differentiate them. The first step in doing this is to load the data into “documents” – a fancy way of saying some pieces of text.

🕵️‍♂️ Utils

The Utils module provides an extensive collection of common utilities to use in your application, such as Python REPLs, bash commands, or search engines. Utils are employed to strengthen the potency of LLMs when interacting with other sources of knowledge or computation.

🗂️ Indexes

Indexes refer to ways to structure documents so that LLMs can best interact with them. This module contains utility functions for working with documents, different types of indexes, and then examples for using those indexes in chains. LangChain provides common indices for working with data (most prominently support for vector databases). It includes embeddings and data stores that can be used to store and retrieve large amounts of data.

🔗 Chain

The Chain module allows you to chain together multiple LLMs to create more complex and sophisticated prompts and applications. This module is useful for tasks that require multiple steps or stages. For example, we can create a chain that takes user input, formats it with a PromptTemplate, and then passes the formatted response to an LLM. We can build more complex chains by combining multiple chains together or by combining chains with other components such as Utils.

🤖 Agent

An Agent is a module that allows for scaling up the generation of output by deploying prompts. It uses an LLM to determine which actions to take, either by using a tool and observing its output or returning to the user. Tools are functions that perform specific duties such as Google Search or Database lookup. LLMs are the language models powering the agent.

📕 Memory

The Memory component enables agents to recall previous interactions with users or remember entities. This enables agents to provide users with more personalized and contextualized responses as time passes.

🚀 How to Use LangChain

To use LangChain, you first need to create a prompt using the Prompt Template module. The LLM module generates output from the prompt. If you need more complex prompts, you can use the Chain module to create a pipeline of LLMs. Finally, you can use the Agent module to deploy your prompts and generate output at scale. Langchain Modules are composable but have a hierarchy of order, as seen here. As a result, analyzing and debugging LangChain benefits from stack traces.

Prompt engineering is an exciting and rapidly growing field, and LangChain is one of the most powerful tools available for this task. With the right combination of modules and techniques, you can create high-quality prompts that guide the language model to produce the output you want. Whether you’re a seasoned prompt engineer or just getting started, LangChain is a platform that can help you achieve your goals.

🤖 Tracking Your Prompts with Weights & Biases

LangChain is a powerful tool for prompt engineering and LLM development, but keeping track of your prompts and interactions with the LLM can be challenging. This is where Weights & Biases comes in.

W&B is a platform for visualizing and tracking machine learning experiments. It allows you to log your prompts, LLM outputs, and other analytical data and easily compare different models and experiments. With W&B, you can keep track of how your LLM chains are performing, identify areas for improvement, and make data-driven decisions about your prompt engineering process.

To use W&B with LangChain, add the `WandbCallbackHandler` to your LangChain and `flush_tracker` to your model whenever you’re satisfied with your prompt engineering session. This will allow you to log your prompts and other analytical LLM outputs to your W&B account. You can then use the W&B dashboard to visualize your data and compare different experiments. W&B also provides tools for collaboration, so you can share your results with your team and get feedback and suggestions.

👋 How to Build a Callback in LangChain for Better Prompt Engineering

Looking at the specification of the BaseCallbackHandler defined here

The hierarchy of LangChain Modules are loosely:

Agents utilize components of LLMs and Tools and

Chains utilize PromptTemplates and LLMs where

LLMs provide the text generations given an input

Applications using LangChain can be made at any of these mentioned workflow levels, as at a core, they all use LLMs to provide generations given an arbitrary input. To make debugging, investigation, and analysis easier, the `BaseCallbackHandler` exposes an entry point to each action essential to each of these text-generating workflows.

You can override each of these functions to process the inputs that are shown and typed in the functions above.

Let’s take a look at these functions, which will always be touched by a callback:

`on_llm_start` and `on_llm_end`

`on_llm_start` provides functionality to investigate prompts and gather details about the underlying LLM these prompts are being passed to.

`on_llm_end` provides functionality to analyze generations from the prompts passed as input in on_llm_start

As a result, we can easily add concepts such as keeping track of the state of actions taken via counters and lists to store the results and inputs of each of these actions, where all of these are easily logged to Weights & Biases as an example.

However, the Weights & Biases callback has just a bit 🤏🏽 different of a workflow than the other provided callbacks by LangChain. Here’s what you need to know:

🔍 How to Track your LangChain Prompts with Weights & Biases

In fact, let’s look at 3 experiments taken from the LangChain documentation below to understand each of the workflows mentioned above and how Weights & Biases aids in each! We explore 3 different scenarios with different modules that directly interact with our callback and LLMs within LangChain.

To run this example you can visit this colab here

🆕 Initializing our LLM and Callback

The details for the `WandbCallbackHandler` are described in the code below…

… alongside the default values for some key analytical functionality

Our code will initialize an OpenAI language model and a callback manager with a WandbCallbackHandler. The WandbCallbackHandler will be used to log prompts, outputs, and other analytical data to Weights & Biases. The manager is a collection of callback handlers. The callback handlers are used to manage the LLM’s output and behavior during generation. By default, we do not log text structure visualizations, advanced text complexity metrics, or log streams during inference.

📥 Storing LangChain Sessions to Weights & Biases

The `flush_tracker` function is used to log LangChain sessions to Weights & Biases. It takes in the LangChain module or agent and logs, at minimum, the prompts and generations alongside the serialized form of the LangChain module to the specified Weights & Biases project. By default, we reset the session as opposed to concluding the session outright.

🤝 Collaborative Prompt Engineering

With Reports, you can compare generations across all of your prompting experiments to better understand your own prompting workflows and the LLMs that generated the outputs you’re investigating. With Reports, you can do things such as:

Contextually analyze an LLM’s details alongside its generations and additional metrics or visualizations across experiments

Investigate token usage across your experiments to better plan cost and resource allocation

Collect a trace of all underlying action details that occur within LangChain as generations occur for a more effective debugging process with your LLMs

Check the version details of different saved model sessions…

with the ability to compare serialized model specs across different sessions

You can also investigate system resources that are exhausted during your experiments such as GPU metrics or Network traffic

Conclusion

Prompt engineering is gaining importance with the release of each new foundational language model. Understanding how to use it to get the outputs you want, fine-tune models you’re building, and understand more deeply how your LLM functions are vital skills moving forward.

We hope you enjoyed this introduction to prompt engineering and LangChain alongside W&B. If you’d like to try it yourself, the colab below will get you started. We’ve also included a few recent pieces on LLMs if you want to read more.

Happy modeling!

This blog has been republished by AIIA. To view the original article and see all summaries and analysis please click HERE.