[FEATURE] basic use of pipeline to generate SFT dataset from documents #1076

burtenshaw · 2024-12-02T12:14:31Z

This is a continuation of #1059.

It implements a pipeline template abstraction that runs a SelfInstruct step and a text generation step over a dataset of documents. This should help new users bootstrap SFT datasets from their own documents.

from datasets import Dataset
import wikipedia
from distilabel.pipeline import DatasetInstructionResponsePipeline

pipeline = DatasetInstructionResponsePipeline(num_instructions=5)

distiset = pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        [
            {
                "input": wikipedia.page(title="Transfer_learning").content,
            }
        ]
    ),
)

review-notebook-app · 2024-12-02T12:14:38Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB


github-actions · 2024-12-02T12:15:49Z

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1076/

codspeed-hq · 2024-12-02T12:19:03Z

CodSpeed Performance Report

Merging #1076 will degrade performances by 49.04%

Comparing feat/dataset-instruction-response-pipeline (0303d0f) with develop (f5ddbc6)

Summary

❌ 1 regressions

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

|  | Benchmark | `BASE` | `HEAD` | Change |
|---|---|---|---|---|
| ❌ | `test_cache_time` | 2.2 s | 4.3 s | -49.04% |
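The reported change can be sanity-checked from the rounded timings. The sketch below assumes the percentage is measured relative to the HEAD time, i.e. (BASE - HEAD) / HEAD. That formula is inferred from the numbers in the table and is not stated in the report. With the rounded values it gives roughly -48.8%, and the remaining gap to -49.04% is plausibly due to the timings being shown with one decimal place.

```python
def relative_change(base_s: float, head_s: float) -> float:
    """Change of HEAD versus BASE, as a percentage of the HEAD time (assumed formula)."""
    return (base_s - head_s) / head_s * 100.0


# Rounded timings taken from the benchmark table.
change = relative_change(base_s=2.2, head_s=4.3)
print(f"{change:.1f}%")
```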

davidberenstein1957 · 2024-12-10T10:05:21Z

@burtenshaw can we get rid of the `pipeline.pipeline.run`? Also, perhaps we could limit the exposure to different classes with something like the following. Under the hood it can still use the same implementation; we would just pass different arguments. WDYT?

from datasets import Dataset
import wikipedia
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline(num_instructions=5)

distiset = pipeline.pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        [
            {
                "input": wikipedia.page(title="Transfer_learning").content,
            }
        ]
    ),
)
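One way to avoid the nested `pipeline.pipeline.run` call is for the template class to expose its own `run` method that delegates to the wrapped pipeline. The following is only a sketch of that idea in plain Python: `_InnerPipeline` and `InstructionResponseTemplate` are hypothetical names and are not the distilabel implementation. Making `dataset` optional also illustrates how one class could serve both the seed-only and the dataset-driven case through different arguments.

```python
from typing import Any, Dict, List, Optional


class _InnerPipeline:
    """Hypothetical stand-in for the wrapped pipeline object (not distilabel code)."""

    def run(
        self,
        dataset: Optional[List[Dict[str, Any]]] = None,
        use_cache: bool = True,
    ) -> Dict[str, Any]:
        # A real pipeline would generate data; this stub only echoes its inputs.
        return {"rows": len(dataset or []), "use_cache": use_cache}


class InstructionResponseTemplate:
    """Hypothetical template that exposes `run` directly."""

    def __init__(self, num_instructions: int = 5) -> None:
        self.num_instructions = num_instructions
        self.pipeline = _InnerPipeline()

    def run(self, **kwargs: Any) -> Dict[str, Any]:
        # Delegate so callers write `template.run(...)`
        # instead of `template.pipeline.run(...)`.
        return self.pipeline.run(**kwargs)


template = InstructionResponseTemplate(num_instructions=5)
result = template.run(dataset=[{"input": "some document text"}], use_cache=False)
```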


davidberenstein1957 · 2025-01-10T07:29:35Z

@burtenshaw I think it would be worth having a dedicated section on this somewhere in the docs. After that and resolving the tests, we should be able to merge.

I would add it to the quickstart and perhaps to the components gallery under "pipelines", or something more explicit like "ready-to-go pipelines".

burtenshaw · 2025-01-10T08:29:10Z

> @burtenshaw I think it would be worth having a dedicated section on this somewhere in the docs. After that and resolving the tests, we should be able to merge.
>
> I would add it to the quickstart and perhaps to the components gallery under "pipelines", or something more explicit like "ready-to-go pipelines".

Thanks. I agree with those suggestions. I'll work on this next week.

davidberenstein1957 · 2025-01-20T07:46:22Z

@burtenshaw perhaps you can add the documentation in this PR?

davidberenstein1957 · 2025-01-20T07:48:51Z

Also, I think I would prefer more explicit naming, like InstructionResponseFromDataPipeline or InstructionResponseFromSeedDataPipeline.


- Create a new base class `BasePipelineTemplate` for pipeline templates
- Update `DatasetInstructionResponsePipeline` and `InstructionResponsePipeline` to inherit from `BasePipelineTemplate`
- Enhance documentation for pipeline classes with detailed attribute and column descriptions
- Add pipeline components to the components gallery generation
- Update components gallery index to include a new Pipelines section
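The shared-base-class refactor listed above can be pictured roughly as follows. This is a minimal sketch under stated assumptions: `BaseTemplateSketch`, its subclasses, the `describe` helper, and the step names are all hypothetical illustrations, and the real `BasePipelineTemplate` may be structured differently.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseTemplateSketch(ABC):
    """Hypothetical base class holding behaviour common to all templates."""

    def __init__(self, num_instructions: int = 5) -> None:
        self.num_instructions = num_instructions

    @abstractmethod
    def step_names(self) -> List[str]:
        """Return the ordered step names the concrete template wires together."""

    def describe(self) -> Dict[str, Any]:
        # Shared logic lives here once instead of being copied into each template.
        return {
            "template": type(self).__name__,
            "steps": self.step_names(),
            "num_instructions": self.num_instructions,
        }


class SeedTemplateSketch(BaseTemplateSketch):
    """Hypothetical template that starts from seed settings only."""

    def step_names(self) -> List[str]:
        return ["generate_instructions", "generate_responses"]


class DatasetTemplateSketch(BaseTemplateSketch):
    """Hypothetical template that starts from a dataset of documents."""

    def step_names(self) -> List[str]:
        return ["load_documents", "self_instruct", "text_generation"]


summary = DatasetTemplateSketch(num_instructions=3).describe()
```

Keeping the common constructor and helper methods in one base class means a new template only has to declare how its steps differ.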


davidberenstein1957 · 2025-01-29T17:11:50Z

- Update Hugging Face steps gallery to include HuggingFaceHubCheckpointer
- Improve checkpointing documentation with updated links and formatting
- Reorganize import in checkpointer.py for better code structure


burtenshaw · 2025-01-30T08:35:09Z

@davidberenstein1957 This looks great. Thanks for the help.

Are we ready to go?

burtenshaw added 2 commits November 25, 2024 22:23

feat: implement abstraction on pipeline form datasets

ab8c385

docs: update class doc string and examples

81697ca

[pre-commit.ci] auto fixes from pre-commit.com hooks

ff18c78

for more information, see https://pre-commit.ci

burtenshaw requested review from gabrielmbmb and plaguss December 2, 2024 12:14

burtenshaw marked this pull request as draft December 2, 2024 14:16

burtenshaw requested a review from davidberenstein1957 December 10, 2024 09:56

davidberenstein1957 reviewed Dec 10, 2024


burtenshaw and others added 2 commits December 16, 2024 12:36

feat: respond to small changes

3266e70

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8f3310

for more information, see https://pre-commit.ci

davidberenstein1957 reviewed Dec 16, 2024


burtenshaw and others added 5 commits December 16, 2024 13:23

add kwargs to docstring

45e10f1

Merge branch 'feat/dataset-instruction-response-pipeline' of https://…

6e69361

…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline

[pre-commit.ci] auto fixes from pre-commit.com hooks

68524f5

for more information, see https://pre-commit.ci

remove notebook

a2b7356

Merge branch 'feat/dataset-instruction-response-pipeline' of https://…

f76bc38

…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline

burtenshaw marked this pull request as ready for review December 16, 2024 12:26

Merge branch 'develop' into feat/dataset-instruction-response-pipeline

a0c23f6

burtenshaw and others added 4 commits January 28, 2025 12:21

remove ununsed import

f1fb538

Merge branch 'feat/dataset-instruction-response-pipeline' of https://…

d8e71ea

…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline

add documentation

636e2c2

Update Hugging Face Inference Endpoints tests to mock both sync and a…

fb55221

…sync clients

davidberenstein1957 and others added 3 commits January 29, 2025 17:28

Merge branch 'develop' into feat/dataset-instruction-response-pipeline

da1720f

[pre-commit.ci] auto fixes from pre-commit.com hooks

f0a1c46

for more information, see https://pre-commit.ci

burtenshaw and others added 3 commits January 29, 2025 19:33

fix: change liscence spacing

a2e092d

docs: Add HuggingFaceHubCheckpointer to documentation

786be3e


docs: Update checkpointing documentation numbering and clarify input …

0303d0f

…batch size guidance

davidberenstein1957 merged commit 0009d1c into develop Jan 30, 2025
7 of 8 checks passed

davidberenstein1957 deleted the feat/dataset-instruction-response-pipeline branch January 30, 2025 08:36
