[FEATURE] basic use of pipeline to generate SFT dataset from documents #1076
for more information, see https://pre-commit.ci
Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1076/
CodSpeed Performance Report: Merging #1076 will degrade performances by 49.04%.
@burtenshaw can we get rid of the

    from datasets import Dataset
    import wikipedia

    from distilabel.pipeline import InstructionResponsePipeline

    pipeline = InstructionResponsePipeline(num_instructions=5)
    distiset = pipeline.pipeline.run(
        use_cache=False,
        dataset=Dataset.from_list(
            [
                {
                    "input": wikipedia.page(title="Transfer_learning").content,
                }
            ]
        ),
    )
…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline
@burtenshaw I think it would be worth having a dedicated section on this somewhere in the docs. After that, and once the tests are resolved, we should be able to merge. I would add it to the quickstart and perhaps to the components gallery under "pipelines", or something more explicit like "ready-to-go pipelines".
Thanks. I agree with those suggestions. I'll work on this next week.
@burtenshaw perhaps you can add the documentation in this PR?
Also, perhaps I like some more explicit naming like
- Create a new base class `BasePipelineTemplate` for pipeline templates
- Update `DatasetInstructionResponsePipeline` and `InstructionResponsePipeline` to inherit from `BasePipelineTemplate`
- Enhance documentation for pipeline classes with detailed attribute and column descriptions
- Add pipeline components to the components gallery generation
- Update components gallery index to include a new Pipelines section
- Update Hugging Face steps gallery to include `HuggingFaceHubCheckpointer`
- Improve checkpointing documentation with updated links and formatting
- Reorganize import in `checkpointer.py` for better code structure
…batch size guidance
@davidberenstein1957 This looks great. Thanks for the help. Are we ready to go?
This is a continuation of #1059.
It implements a pipeline template abstraction that runs a `SelfInstruct` step followed by text generation on a dataset of documents. This should help bootstrap basic users who want to build SFT datasets.
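
To make the data flow in the description concrete, here is a minimal, library-free sketch of the idea: each document yields a handful of instructions, and each instruction then gets a generated response. It is not distilabel's implementation. The function names, the stubbed "model" calls, and the `instruction`/`response` column names are all assumptions for illustration; only the `input` column name comes from the snippet earlier in this thread.

```python
from typing import Callable, Dict, List


def stub_generate_instructions(document: str, num_instructions: int) -> List[str]:
    """Hypothetical stand-in for a SelfInstruct-style step.

    A real pipeline would ask an LLM for instructions grounded in the document;
    here we simply turn the first few sentences into prompts.
    """
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return [
        f"Explain the following statement: {s}."
        for s in sentences[:num_instructions]
    ]


def stub_generate_response(instruction: str) -> str:
    """Hypothetical stand-in for a text generation step (no model is called)."""
    return f"[model response to: {instruction}]"


def build_sft_rows(
    documents: List[Dict[str, str]],
    num_instructions: int = 5,
    instruct: Callable[[str, int], List[str]] = stub_generate_instructions,
    respond: Callable[[str], str] = stub_generate_response,
) -> List[Dict[str, str]]:
    """Turn rows with an 'input' document into instruction/response pairs."""
    rows: List[Dict[str, str]] = []
    for doc in documents:
        for instruction in instruct(doc["input"], num_instructions):
            rows.append(
                {"instruction": instruction, "response": respond(instruction)}
            )
    return rows


documents = [
    {
        "input": "Transfer learning reuses a trained model. "
        "It reduces the data needed for a new task."
    }
]
rows = build_sft_rows(documents, num_instructions=2)
for row in rows:
    print(row["instruction"])
```

Swapping the two stub callables for real model calls is where a library such as distilabel would take over; the sketch only shows the shape of the rows that end up in the SFT dataset.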