[FEATURE] basic use of pipeline to generate SFT dataset from documents #1076

Merged

Conversation

burtenshaw
Contributor

@burtenshaw burtenshaw commented Dec 2, 2024

This continues #1059.

It implements a pipeline template that runs a SelfInstruct step followed by text generation over a dataset of documents. This should help new users bootstrap SFT datasets.

from datasets import Dataset
import wikipedia
from distilabel.pipeline import DatasetInstructionResponsePipeline

pipeline = DatasetInstructionResponsePipeline(num_instructions=5)

distiset = pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        [
            {
                "input": wikipedia.page(title="Transfer_learning").content,
            }
        ]
    ),
)

github-actions bot commented Dec 2, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1076/

codspeed-hq bot commented Dec 2, 2024

CodSpeed Performance Report

Merging #1076 will degrade performances by 49.04%

Comparing feat/dataset-instruction-response-pipeline (0303d0f) with develop (f5ddbc6)

Summary

❌ 1 regressions

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
test_cache_time 2.2 s 4.3 s -49.04%
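
For readers checking the arithmetic: the reported change looks like it is measured relative to the HEAD time rather than the BASE time. The sketch below is only an inference from the rounded numbers in the table, not CodSpeed's documented formula, and `relative_change` is a hypothetical helper name. With the rounded inputs it gives roughly -49%; the small gap from -49.04% presumably comes from the unrounded timings.

```python
def relative_change(base_s: float, head_s: float) -> float:
    """Change relative to the HEAD measurement (assumed convention)."""
    return (base_s - head_s) / head_s


# Rounded values taken from the benchmark table above.
change = relative_change(base_s=2.2, head_s=4.3)
print(f"{change:.2%}")
```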

@burtenshaw burtenshaw marked this pull request as draft December 2, 2024 14:16
@davidberenstein1957
Member

davidberenstein1957 commented Dec 10, 2024

@burtenshaw can we get rid of the pipeline.pipeline.run? Also, perhaps we could limit the exposure to different classes with something like the following. Under the hood it can still use the same implementation; we would just pass different arguments. WDYT?

from datasets import Dataset
import wikipedia
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline(num_instructions=5)

distiset = pipeline.pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        [
            {
                "input": wikipedia.page(title="Transfer_learning").content,
            }
        ]
    ),
)

@burtenshaw burtenshaw marked this pull request as ready for review December 16, 2024 12:26
@davidberenstein1957
Member

@burtenshaw I think it would be worth having a dedicated section on this somewhere in the docs. After that, and once the tests are resolved, we should be able to merge.

I would add it to the quickstart, and perhaps to the components gallery under "pipelines" or something more explicit like "ready-to-go pipelines".

@burtenshaw
Contributor Author

Thanks. I agree with those suggestions. I'll work on this next week.

@davidberenstein1957
Member

@burtenshaw perhaps you can add the documentation in this PR?

@davidberenstein1957
Member

Also, I think I would prefer more explicit naming, like InstructionResponseFromDataPipeline or InstructionResponseFromSeedDataPipeline.

davidberenstein1957 and others added 3 commits January 29, 2025 17:28
- Create a new base class `BasePipelineTemplate` for pipeline templates
- Update `DatasetInstructionResponsePipeline` and `InstructionResponsePipeline` to inherit from `BasePipelineTemplate`
- Enhance documentation for pipeline classes with detailed attribute and column descriptions
- Add pipeline components to the components gallery generation
- Update components gallery index to include a new Pipelines section
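
The base-class refactor described in the commit notes above could look roughly like the following. This is a dependency-free sketch of the template pattern only, not distilabel's actual `BasePipelineTemplate`: every class name, method, and the placeholder step logic here is hypothetical, and real templates would call SelfInstruct and text-generation steps backed by an LLM. It also shows how a template can expose `run` directly, so callers do not need `pipeline.pipeline.run`.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Step = Callable[[List[Row]], List[Row]]


class PipelineTemplateSketch(ABC):
    """Hypothetical base class: builds its steps once and exposes run()."""

    def __init__(self) -> None:
        self.steps: List[Step] = self._build_steps()

    @abstractmethod
    def _build_steps(self) -> List[Step]:
        """Return the ordered steps this template runs."""

    def run(self, dataset: List[Row]) -> List[Row]:
        # Copy the rows so the caller's data is not mutated.
        rows = [dict(row) for row in dataset]
        for step in self.steps:
            rows = step(rows)
        return rows


class DatasetInstructionResponseSketch(PipelineTemplateSketch):
    """Hypothetical template: documents -> instructions -> responses."""

    def __init__(self, num_instructions: int = 2) -> None:
        self.num_instructions = num_instructions
        super().__init__()

    def _build_steps(self) -> List[Step]:
        def make_instructions(rows: List[Row]) -> List[Row]:
            # Placeholder for an instruction-generation step.
            return [
                {**row, "instruction": f"Question {i + 1} about: {row['input'][:30]}"}
                for row in rows
                for i in range(self.num_instructions)
            ]

        def respond(rows: List[Row]) -> List[Row]:
            # Placeholder for a text-generation step.
            return [{**row, "response": f"Answer to: {row['instruction']}"} for row in rows]

        return [make_instructions, respond]


template = DatasetInstructionResponseSketch(num_instructions=3)
result = template.run([{"input": "Transfer learning reuses a trained model."}])
```

Keeping step construction in `_build_steps` means each template differs only in its constructor arguments and step list, which matches the goal of limiting how many classes users have to learn.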
@davidberenstein1957
Member

[Two screenshots attached; images not preserved.]

burtenshaw and others added 3 commits January 29, 2025 19:33
- Update Hugging Face steps gallery to include HuggingFaceHubCheckpointer
- Improve checkpointing documentation with updated links and formatting
- Reorganize import in checkpointer.py for better code structure
@burtenshaw
Contributor Author

@davidberenstein1957 This looks great. Thanks for the help.

Are we ready to go?

@davidberenstein1957 davidberenstein1957 merged commit 0009d1c into develop Jan 30, 2025
7 of 8 checks passed
@davidberenstein1957 davidberenstein1957 deleted the feat/dataset-instruction-response-pipeline branch January 30, 2025 08:36