Merge branch 'huggingface:main' into main
jucanbe authored Jan 22, 2025
2 parents bfdf511 + 3c9a583 commit 8050dc4
Showing 53 changed files with 5,944 additions and 8 deletions.
2 changes: 1 addition & 1 deletion 1_instruction_tuning/supervised_fine_tuning.md
@@ -28,7 +28,7 @@ SFT plays a fundamental role in aligning language models with human preferences.

## Supervised Fine-Tuning With Transformer Reinforcement Learning

- A key software package for Supervised Fine-Tuning is Transformer Reinforcement Learning (TRL). TRL is a toolkit used to train transformer language models models using reinforcement learning (RL).
+ A key software package for Supervised Fine-Tuning is Transformer Reinforcement Learning (TRL). TRL is a toolkit used to train transformer language models using reinforcement learning (RL).

Built on top of the Hugging Face Transformers library, TRL allows users to directly load pretrained language models and supports most decoder and encoder-decoder architectures. The library facilitates major processes of RL used in language modelling, including supervised fine-tuning (SFT), reward modeling (RM), proximal policy optimization (PPO), and Direct Preference Optimization (DPO). We will use TRL in a number of modules throughout this repo.

1 change: 1 addition & 0 deletions 4_evaluation/project/generate_dataset.py
@@ -2,6 +2,7 @@
import os
from pydantic import BaseModel, Field
from datasets import Dataset
from typing import List

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
36 changes: 36 additions & 0 deletions 7_inference/README.md
@@ -0,0 +1,36 @@
# Inference

Inference is the process of using a trained language model to generate predictions or responses. While inference might seem straightforward, deploying models efficiently at scale requires careful consideration of various factors like performance, cost, and reliability. Large Language Models (LLMs) present unique challenges due to their size and computational requirements.

We'll explore both simple and production-ready approaches using the [`transformers`](https://huggingface.co/docs/transformers/index) library and [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), two popular frameworks for LLM inference. For production deployments, we'll focus on Text Generation Inference (TGI), which provides optimized serving capabilities.

## Module Overview

LLM inference can be categorized into two main approaches: simple pipeline-based inference for development and testing, and optimized serving solutions for production deployments. We'll cover both approaches, starting with the simpler pipeline approach and moving to production-ready solutions.

## Contents

### 1. [Basic Pipeline Inference](./inference_pipeline.md)

Learn how to use the Hugging Face Transformers pipeline for basic inference. We'll cover setting up pipelines, configuring generation parameters, and best practices for local development. The pipeline approach is perfect for prototyping and small-scale applications. [Start learning](./inference_pipeline.md).

### 2. [Production Inference with TGI](./text_generation_inference.md)

Learn how to deploy models for production using Text Generation Inference. We'll explore optimized serving techniques, batching strategies, and monitoring solutions. TGI provides production-ready features like health checks, metrics, and Docker deployment options. [Start learning](./text_generation_inference.md).

### Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Pipeline Inference | Basic inference with transformers pipeline | 🐢 Set up a basic pipeline <br> 🐕 Configure generation parameters <br> 🦁 Create a simple web server | [Link](./notebooks/basic_pipeline_inference.ipynb) | [Colab](https://githubtocolab.com/huggingface/smol-course/tree/main/7_inference/notebooks/basic_pipeline_inference.ipynb) |
| TGI Deployment | Production deployment with TGI | 🐢 Deploy a model with TGI <br> 🐕 Configure performance optimizations <br> 🦁 Set up monitoring and scaling | [Link](./notebooks/tgi_deployment.ipynb) | [Colab](https://githubtocolab.com/huggingface/smol-course/tree/main/7_inference/notebooks/tgi_deployment.ipynb) |

## Resources

- [Hugging Face Pipeline Tutorial](https://huggingface.co/docs/transformers/en/pipeline_tutorial)
- [Text Generation Inference Documentation](https://huggingface.co/docs/text-generation-inference/en/index)
- [Pipeline WebServer Guide](https://huggingface.co/docs/transformers/en/pipeline_tutorial#using-pipelines-for-a-webserver)
- [TGI GitHub Repository](https://github.com/huggingface/text-generation-inference)
- [Hugging Face Model Deployment Documentation](https://huggingface.co/docs/inference-endpoints/index)
- [vLLM: High-throughput LLM Serving](https://github.com/vllm-project/vllm)
- [Optimizing Transformer Inference](https://huggingface.co/blog/optimize-transformer-inference)
169 changes: 169 additions & 0 deletions 7_inference/inference_pipeline.md
@@ -0,0 +1,169 @@
# Basic Inference with Transformers Pipeline

The `pipeline` abstraction in 🤗 Transformers provides a simple way to run inference with any model from the Hugging Face Hub. It handles all the preprocessing and postprocessing steps, making it easy to use models without deep knowledge of their architecture or requirements.

## How Pipelines Work

Hugging Face pipelines streamline the machine learning workflow by automating three critical stages between raw input and human-readable output:

**Preprocessing Stage**
The pipeline first prepares your raw inputs for the model. This varies by input type:
- Text inputs undergo tokenization to convert words into model-friendly token IDs
- Images are resized and normalized to match model requirements
- Audio is processed through feature extraction to create spectrograms or other representations

**Model Inference**
During the forward pass, the pipeline:
- Handles batching of inputs automatically for efficient processing
- Places computation on the optimal device (CPU/GPU)
- Applies performance optimizations like half-precision (FP16) inference where supported

**Postprocessing Stage**
Finally, the pipeline converts raw model outputs into useful results:
- Decodes token IDs back into readable text
- Transforms logits into probability scores
- Formats outputs according to the specific task (e.g., classification labels, generated text)

This abstraction lets you focus on your application logic while the pipeline handles the technical complexity of model inference.
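
To make these stages concrete, here is a rough sketch of what a text-generation pipeline automates under the hood. It is a manual equivalent using the same checkpoint as the examples below, not the pipeline's exact internals:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"

# Preprocessing: tokenize the raw prompt into model-friendly token IDs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
inputs = tokenizer("Write a short poem about coding:", return_tensors="pt").to(model.device)

# Model inference: run the generation loop on the model's device
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)

# Postprocessing: decode token IDs back into readable text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```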

## Basic Usage

Here's how to use a pipeline for text generation:

```python
from transformers import pipeline

# Create a pipeline with a specific model
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Generate text
response = generator(
    "Write a short poem about coding:",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7
)
print(response[0]['generated_text'])
```

## Key Configuration Options

### Model Loading
```python
# CPU inference
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", device="cpu")

# GPU inference (device 0)
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", device=0)

# Automatic device placement
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    device_map="auto",
    torch_dtype="auto"
)
```

### Generation Parameters

```python
response = generator(
    "Translate this to French:",
    max_new_tokens=100,      # Maximum number of new tokens to generate
    do_sample=True,          # Use sampling instead of greedy decoding
    temperature=0.7,         # Control randomness (higher = more random)
    top_k=50,                # Limit sampling to the top k tokens
    top_p=0.95,              # Nucleus sampling threshold
    num_return_sequences=1   # Number of different generations
)
```

## Processing Multiple Inputs

Pipelines can efficiently handle multiple inputs through batching:

```python
# Prepare multiple prompts
prompts = [
    "Write a haiku about programming:",
    "Explain what an API is:",
    "Write a short story about a robot:"
]

# Process all prompts efficiently
responses = generator(
    prompts,
    batch_size=4,        # Number of prompts to process together
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7
)

# Print results
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response[0]['generated_text']}\n")
```
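
One practical note: batched generation pads prompts to the same length, and some tokenizers do not define a padding token. If you hit a padding-related error (whether this applies depends on the checkpoint you load), a common workaround is to reuse the end-of-sequence token:

```python
# Assumes `generator` is the pipeline created above
if generator.tokenizer.pad_token_id is None:
    generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id
```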

## Web Server Integration

Here's how to integrate a pipeline into a FastAPI application:

```python
from fastapi import FastAPI, HTTPException
from transformers import pipeline
import uvicorn

app = FastAPI()

# Initialize pipeline globally
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    device_map="auto"
)

@app.post("/generate")
async def generate_text(prompt: str):
    if not prompt:
        raise HTTPException(status_code=400, detail="No prompt provided")

    try:
        response = generator(
            prompt,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7
        )
        return {"generated_text": response[0]['generated_text']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5000)
```
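
Because `prompt` is declared as a plain function parameter, FastAPI treats it as a query parameter rather than a JSON body field. A quick client-side check of the endpoint above might look like this (a sketch assuming the server is running locally on port 5000):

```python
import requests

resp = requests.post(
    "http://localhost:5000/generate",
    params={"prompt": "Write a haiku about servers:"},  # query parameter, not JSON body
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```

If you prefer a JSON body, declare a Pydantic model for the request instead of a bare string parameter.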

## Limitations

While pipelines are great for prototyping and small-scale deployments, they have some limitations:

- Limited optimization options compared to dedicated serving solutions
- No built-in support for advanced features like dynamic batching
- May not be suitable for high-throughput production workloads

For production deployments with high throughput requirements, consider using Text Generation Inference (TGI) or other specialized serving solutions.

## Resources

- [Hugging Face Pipeline Tutorial](https://huggingface.co/docs/transformers/en/pipeline_tutorial)
- [Pipeline API Reference](https://huggingface.co/docs/transformers/en/main_classes/pipelines)
- [Text Generation Parameters](https://huggingface.co/docs/transformers/en/main_classes/text_generation)
- [Model Quantization Guide](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one)
136 changes: 136 additions & 0 deletions 7_inference/text_generation_inference.md
@@ -0,0 +1,136 @@
# Text Generation Inference (TGI)

Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving Large Language Models (LLMs). It's designed to enable high-performance text generation for popular open-source LLMs. TGI is used in production by Hugging Chat, an open-source interface for open-access models.

## Why Use Text Generation Inference?

Text Generation Inference addresses the key challenges of deploying large language models in production. While many frameworks excel at model development, TGI specifically optimizes for production deployment and scaling. Some key features include:

- **Tensor Parallelism**: TGI can split models across multiple GPUs through tensor parallelism, essential for serving larger models efficiently.
- **Continuous Batching**: The continuous batching system maximizes GPU utilization by dynamically processing requests, while optimizations like Flash Attention and Paged Attention significantly reduce memory usage and increase speed.
- **Token Streaming**: Real-time applications benefit from token streaming via Server-Sent Events, delivering responses with minimal latency.
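
To illustrate what token streaming looks like on the wire, here is a rough sketch that reads the Server-Sent Events emitted by a local TGI instance (assuming a server is already running on port 8080, as in the examples below, and the OpenAI-compatible streaming format). The client libraries shown later handle this parsing for you:

```python
import json
import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "stream": True,
    "max_tokens": 50,
}

# Each event arrives as a line of the form `data: {...}`; the stream ends with `data: [DONE]`
with requests.post("http://localhost:8080/v1/chat/completions", json=payload, stream=True) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        data = line[len(b"data:"):].strip()
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)
```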

## How to Use Text Generation Inference

TGI exposes a simple yet powerful REST API, which makes it easy to integrate with your applications.

### Using the REST API

TGI exposes a RESTful API that accepts JSON payloads. This makes it accessible from any programming language or tool that can make HTTP requests. Here's a basic example using curl:

```bash
# Basic generation request
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
        "model": "tgi",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ],
        "stream": true,
        "max_tokens": 20
    }' \
    -H 'Content-Type: application/json'
```

### Using the `huggingface_hub` Python Client

The `huggingface_hub` Python client handles connection management, request formatting, and response parsing. Here's how to get started:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8080/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)
```


### Using the OpenAI API

TGI's Messages API is compatible with the OpenAI Chat Completions API, so you can use the official OpenAI client to interact with TGI by pointing it at your endpoint.

```python
from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)
```

## Preparing Models for TGI

To serve a model with TGI, ensure it meets these requirements:

1. **Supported Architecture**: Verify your model architecture is supported (Llama, BLOOM, T5, etc.)
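
One quick way to check is to read the checkpoint's configuration and compare its `model_type` with the architectures listed in the TGI documentation (the set below is an illustrative subset for this sketch, not the authoritative list):

```python
from transformers import AutoConfig

# Illustrative subset only; see the TGI docs for the full list of supported architectures
KNOWN_SUPPORTED = {"llama", "mistral", "bloom", "t5", "gpt_neox", "falcon"}

config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
print(config.model_type)                     # "llama" for this checkpoint
print(config.model_type in KNOWN_SUPPORTED)  # True
```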

2. **Model Format**: Convert weights to safetensors format for faster loading:

```python
from safetensors.torch import save_file
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-model")
state_dict = model.state_dict()
save_file(state_dict, "model.safetensors")
```
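
Note that `save_file` can fail for checkpoints with tied (shared) tensors. For Transformers models it is usually simpler to let `save_pretrained` write safetensors shards directly, for example:

```python
# Writes the config and sharded .safetensors weight files to the output directory
model.save_pretrained("your-model-safetensors", safe_serialization=True)
```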

3. **Quantization** (optional): Quantize your model to reduce memory usage:

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16"
)

model = AutoModelForCausalLM.from_pretrained(
    "your-model",
    quantization_config=quantization_config
)
```

## References

- [Text Generation Inference Documentation](https://huggingface.co/docs/text-generation-inference)
- [TGI GitHub Repository](https://github.com/huggingface/text-generation-inference)
- [Hugging Face Model Hub](https://huggingface.co/models)
- [TGI API Reference](https://huggingface.co/docs/text-generation-inference/api_reference)
36 changes: 36 additions & 0 deletions 8_agents/README.md
@@ -0,0 +1,36 @@
# Agents

AI Agents are autonomous systems that can understand user requests, break them down into steps, and execute actions to accomplish tasks. They combine language models with tools and external functions to interact with their environment. This module covers how to build effective agents using the [`smolagents`](https://github.com/huggingface/smolagents) library, which provides a lightweight framework for creating capable AI agents.

## Module Overview

Building effective agents requires understanding three key components. First, retrieval capabilities allow agents to access and use relevant information from various sources. Second, function calling enables agents to take concrete actions in their environment. Finally, domain-specific knowledge and tooling equip agents for specialized tasks like code manipulation.

## Contents

### 1️⃣ [Retrieval Agents](./retrieval_agents.md)

Retrieval agents combine models with knowledge bases. These agents can search and synthesize information from multiple sources, leveraging vector stores for efficient retrieval and implementing RAG (Retrieval Augmented Generation) patterns. They are great at combining web search with custom knowledge bases while maintaining conversation context through memory systems. The module covers implementation strategies including fallback mechanisms for robust information retrieval.

### 2️⃣ [Code Agents](./code_agents.md)

Code agents are specialized autonomous systems designed for software development tasks. These agents excel at analyzing and generating code, performing automated refactoring, and integrating with development tools. The module covers best practices for building code-focused agents that can understand programming languages, work with build systems, and interact with version control while maintaining high code quality standards.

### 3️⃣ [Custom Functions](./custom_functions.md)

Custom function agents extend basic AI capabilities through specialized function calls. This module explores how to design modular and extensible function interfaces that integrate directly with your application's logic. You'll learn to implement proper validation and error handling while creating reliable function-driven workflows. The focus is on building simple systems where agents can predictably interact with external tools and services.
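
As a taste of what this looks like in practice, here is a minimal sketch of a custom tool wired into a `smolagents` agent (assuming the library's `@tool` decorator and the `CodeAgent`/`HfApiModel` interfaces; the module above walks through the details):

```python
from smolagents import CodeAgent, HfApiModel, tool

@tool
def get_exchange_rate(base: str, target: str) -> float:
    """Return an exchange rate between two currencies (toy data for this sketch).

    Args:
        base: Three-letter code of the currency to convert from.
        target: Three-letter code of the currency to convert to.
    """
    rates = {("USD", "EUR"): 0.95, ("EUR", "USD"): 1.05}
    if (base, target) not in rates:
        raise ValueError(f"Unsupported currency pair: {base}/{target}")
    return rates[(base, target)]

agent = CodeAgent(tools=[get_exchange_rate], model=HfApiModel())
agent.run("How many euros is 100 US dollars?")
```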

### Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Building a Research Agent | Create an agent that can perform research tasks using retrieval and custom functions | 🐢 Build a simple RAG agent <br> 🐕 Add custom search functions <br> 🦁 Create a full research assistant | [Notebook](./notebooks/agents.ipynb) | <a target="_blank" href="https://colab.research.google.com/github/huggingface/smol-course/blob/main/8_agents/notebooks/agents.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |

## Resources

- [smolagents Documentation](https://huggingface.co/docs/smolagents) - Official docs for the smolagents library
- [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) - Anthropic's guide to agent architectures
- [Agent Guidelines](https://huggingface.co/docs/smolagents/tutorials/building_good_agents) - Best practices for building reliable agents
- [LangChain Agents](https://python.langchain.com/docs/how_to/#agents) - Additional examples of agent implementations
- [Function Calling Guide](https://platform.openai.com/docs/guides/function-calling) - Understanding function calling in LLMs
- [RAG Best Practices](https://www.pinecone.io/learn/retrieval-augmented-generation/) - Guide to implementing effective RAG