
[BUG] LoadDataFromFileSystem gets stuck when used with fsspec #1110

Open
gabrielmbmb opened this issue Jan 22, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@gabrielmbmb
Member

Describe the bug

LoadDataFromFileSystem never finishes loading when used in a Pipeline because the call to its load method gets stuck. The root cause is that distilabel needs to know in advance which outputs a step will produce, and it reads them from the outputs property, which is accessed from the main process. In LoadDataFromFileSystem, the outputs property calls the load method to load the datasets.Dataset and read its column_names. When the load method is then called again from the child process, it gets stuck, probably for the same reason as in fsspec/s3fs#464.
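The call pattern described above can be sketched with a single-process toy model. This is not distilabel's code: `ToyLoader`, its `load_calls` counter, and the in-memory rows are hypothetical names introduced only to show that reading `outputs` triggers a first `load`, so the later `load` issued on behalf of the step is the second one.

```python
class ToyLoader:
    """Hypothetical stand-in for a generator step (not the real class)."""

    def __init__(self, rows):
        self._rows = rows
        self._dataset = None
        self.load_calls = 0  # how many times load() has run

    def load(self):
        # The real step would open the remote file system here.
        self.load_calls += 1
        self._dataset = list(self._rows)

    @property
    def outputs(self):
        # Pattern from the report: load the data only to learn its columns.
        self.load()
        return sorted(self._dataset[0].keys())


step = ToyLoader([{"text": "hello", "label": 0}])
columns = step.outputs  # what the main process does up front
step.load()             # what the child process does later
```

In the toy model both calls run in one process, so nothing hangs. In the reported case the first call happens in the main process and the second in a child process, which is where the load gets stuck.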

To reproduce

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromFileSystem

with Pipeline() as pipeline:
    load_data = LoadDataFromFileSystem(
        data_files="s3://my_path/*.parquet",
        num_examples=10,
        output_mappings={"text": "seed"},
    )

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)

Expected behavior

The step doesn't get stuck.

Screenshots

No response

Environment

  • Distilabel Version: 1.5.1
  • Python Version: 3.11

Additional context

No response

@gabrielmbmb gabrielmbmb added the bug Something isn't working label Jan 22, 2025