An application for generating high-quality synthetic datasets for various fine-tuning techniques. The application currently specializes in code generation and SQL query synthesis, with support for custom use cases.
- Support for multiple fine-tuning techniques:
  - ✅ Supervised Fine-Tuning (SFT) - Currently Live
  - 🚧 PPO (Proximal Policy Optimization) - WIP
  - 🚧 ORPO (Odds Ratio Preference Optimization) - WIP
  - 🚧 DPO (Direct Preference Optimization) - WIP
  - 🚧 KTO (Kahneman-Tversky Optimization) - WIP
Generate diverse programming question-answer pairs across multiple domains, complete with detailed explanations and working code examples. The system creates scenarios that test both theoretical understanding and practical implementation skills, producing high-quality training data for code-assistance models.
Generate prompt and SQL pairs from custom data schemas. The resulting data can be used to fine-tune open-source (OSS) models for better text-to-SQL performance.
A flexible framework for implementing additional use cases, so users can create their own workflows.
- Claude 3 Family
- Llama 3 Models
- Mistral Models
- Built-in evaluation capabilities for generated datasets
- Customizable evaluation prompts
- Scoring system with detailed justifications
- Preview Mode (displayed in the front end) for up to 25 prompt-solution pairs
- Batch Mode via Cloudera ML Jobs (available only in a Cloudera environment)
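The split between the two modes can be sketched as a threshold check. This is only an illustration: `choose_mode` and `PREVIEW_LIMIT` are hypothetical names, not part of the application's code, and the sketch assumes that anything above the 25-pair preview limit is meant to run as a batch job.

```python
PREVIEW_LIMIT = 25  # from the list above: preview covers up to 25 pairs


def choose_mode(num_questions: int) -> str:
    """Return "preview" for small requests, "batch" otherwise (hypothetical helper)."""
    if num_questions < 1:
        raise ValueError("num_questions must be a positive integer")
    return "preview" if num_questions <= PREVIEW_LIMIT else "batch"


print(choose_mode(3))    # preview
print(choose_mode(500))  # batch
```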
Built using:
- Backend: FastAPI
- Frontend: React
- Database: SQLite (for metadata storage)
- Clone the repository:

    git clone <repository-url>

- Configure environment variables:

    # AWS Bedrock credentials (in CML environment)
    export AWS_ACCESS_KEY_ID="your key"
    export AWS_SECRET_ACCESS_KEY="your secret key"
    export AWS_DEFAULT_REGION="aws region"

  Note: If using AWS Bedrock, ensure you have access to the LLM you intend to use.

- Build the application:

    python build_client.py

- Start the application:

    python start_application.py
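Before starting the application, it can help to confirm that the three AWS variables from the configuration step are actually set. The following is a minimal sketch using only the standard library; the helper `missing_aws_vars` is hypothetical and is not shipped with the repository.

```python
import os

# Variable names taken from the configuration step above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("AWS Bedrock credentials are set.")
```

This only checks that the variables exist. It does not verify that the credentials are valid or that the account has access to a given model.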
- /synthesis/generate: Generate synthetic Q&A pairs
- /synthesis/evaluate: Evaluate generated examples

- /model/model_ID: Get available model configurations
- /use-cases: List available use cases
- /model/parameters: Get model parameter ranges
- /{use_case}/gen_prompt: Get generation prompts
- /{use_case}/eval_prompt: Get evaluation prompts

- /generations/history: View generation history
- /evaluations/history: View evaluation history
- /generations/display-name: Update generation metadata
- /evaluations/display-name: Update evaluation metadata
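As a rough illustration of calling one of these endpoints, the sketch below prepares a JSON request for `/synthesis/generate` with the standard library. Several details are assumptions, not facts from this README: the base URL `http://localhost:8000`, the use of the POST method, and the helper name `build_post`. The request is built but not sent, so the snippet runs without a live server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your deployment's URL


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request for an endpoint path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the README does not state the HTTP method
    )


request = build_post(
    "/synthesis/generate",
    {"use_case": "code_generation", "num_questions": 3},
)
print(request.get_method(), request.full_url)

# To send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```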
{
  "use_case": "code_generation",
  "model_id": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "num_questions": 3,
  "technique": "sft",
  "topics": ["python_basics", "data_structures"],
  "examples": [
    {
      "question": "How do you create a list in Python and add elements to it?",
      "solution": "# Example solution code..."
    }
  ],
  "model_params": {
    "temperature": 0.0,
    "top_p": 1.0,
    "max_tokens": 4096
  }
}
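A request like the one above can be checked locally before it is submitted. The sketch below is an assumption-laden illustration: it treats every top-level key in the sample as required, which this README does not confirm, and `check_request` is a hypothetical helper and not part of the application.

```python
# Assumption: all top-level keys from the sample request are required.
REQUIRED_KEYS = {
    "use_case", "model_id", "num_questions",
    "technique", "topics", "examples", "model_params",
}


def check_request(payload: dict) -> list:
    """Return a list of problems found in a generation request (empty if none)."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(payload))]
    for index, example in enumerate(payload.get("examples", [])):
        if not {"question", "solution"} <= set(example):
            problems.append(f"example {index} needs 'question' and 'solution'")
    return problems


payload = {
    "use_case": "code_generation",
    "model_id": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "num_questions": 3,
    "technique": "sft",
    "topics": ["python_basics", "data_structures"],
    "examples": [
        {
            "question": "How do you create a list in Python and add elements to it?",
            "solution": "# Example solution code...",
        }
    ],
    "model_params": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 4096},
}
print(check_request(payload))  # []
```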
{
  "use_case": "text2sql",
  "model_id": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "num_questions": 3,
  "technique": "sft",
  "topics": ["basic_queries", "joins"],
  "schema": "CREATE TABLE users (...)",
  "examples": [
    {
      "question": "How do you select all employees from the employees table?",
      "solution": "SELECT * FROM employees;"
    }
  ],
  "model_params": {
    "temperature": 0.0,
    "top_p": 1.0,
    "max_tokens": 4096
  }
}
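For text2sql requests, seed examples are most useful when they query tables that the supplied schema actually defines. The following is a naive sketch of such a consistency check. It relies on simple regular expressions and does not parse SQL properly, and the helper names are hypothetical. Run on the sample request above, it would flag `employees`, because the placeholder schema only declares `users`.

```python
import re


def tables_in_schema(schema: str) -> set:
    """Collect table names declared by CREATE TABLE statements (lowercased)."""
    pattern = r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)"
    return {name.lower() for name in re.findall(pattern, schema, flags=re.IGNORECASE)}


def tables_in_query(sql: str) -> set:
    """Collect table names that follow FROM or JOIN (lowercased)."""
    pattern = r"\b(?:FROM|JOIN)\s+(\w+)"
    return {name.lower() for name in re.findall(pattern, sql, flags=re.IGNORECASE)}


def unknown_tables(payload: dict) -> set:
    """Return tables used in example solutions but not declared in the schema."""
    known = tables_in_schema(payload.get("schema", ""))
    used = set()
    for example in payload.get("examples", []):
        used |= tables_in_query(example.get("solution", ""))
    return used - known


sample = {
    "schema": "CREATE TABLE users (...)",
    "examples": [{"solution": "SELECT * FROM employees;"}],
}
print(unknown_tables(sample))  # {'employees'}
```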
IMPORTANT: Please read the following before proceeding. This AMP includes or otherwise depends on certain third party software packages. Information about such third party software packages are made available in the notice file associated with this AMP. By configuring and launching this AMP, you will cause such third party software packages to be downloaded and installed into your environment, in some instances, from third parties' websites. For each third party software package, please see the notice file and the applicable websites for more information, including the applicable license terms.
If you do not wish to download and install the third party software packages, do not configure, launch or otherwise use this AMP. By configuring, launching or otherwise using the AMP, you acknowledge the foregoing statement and agree that Cloudera is not responsible or liable in any way for the third party software packages.
Copyright (c) 2024 - Cloudera, Inc. All rights reserved.