[PAUSED] Fix for predict file type when directory is provided as input #1422

Open
wants to merge 3 commits into base: branch-21.06
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -36,6 +36,7 @@
- #1413 Fix for null tests 13 and 23 of windowFunctionTest
- #1416 Fix full join when both tables contains nulls
- #1423 Fix temporary directory for hive partition test
- #1422 Fix for predict file type when directory is provided as input


## Deprecated Features
31 changes: 22 additions & 9 deletions pyblazing/pyblazing/apiv2/context.py
@@ -2270,15 +2270,6 @@ def create_table(self, table_name, input, **kwargs):
# /path/to/data/folder/ -> name_file = /path/to/data/folder/, extension = ''
name_file, extension = os.path.splitext(input[0])

if not recognized_extension(extension) and file_format_hint == "undefined":
raise Exception(
"ERROR: Your input file doesn't have a recognized extension, "
+ "you have to specify the `file_format` parameter. "
+ "Recognized extensions are: [orc, parquet, csv, json, psv]."
+ "\nFor example if you are using a *.log file you must pass file_format='csv' "
+ "with all the needed extra parameters. See https://docs.blazingdb.com/docs/creating-tables"
)

if (
file_format_hint == "undefined"
and extension == ""
@@ -2296,6 +2287,28 @@ def create_table(self, table_name, input, **kwargs):
kwargs["names"].pop(id)
kwargs["dtype"].pop(id)

# if the input is a directory and files do not have extension we want to raise an error
if name_file[-1] == "/" and extension == "":
all_files = os.listdir(name_file)
Contributor

Something is wrong here. If we are creating a table from a folder, the files get listed out by the C++ layer. I can't remember where or how that happens, but the Python layer can't be the one doing it. If you have an HDFS, S3, or GCS filesystem, Python can't do os.listdir.
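
A minimal sketch of one way to address this concern, assuming the list of files has already been produced by whichever layer can talk to the filesystem. The helper `infer_file_format` and its parameters are hypothetical and not part of pyblazing; the recognized extensions are copied from the error message in this diff. Because it only does string handling on the paths, it never touches the local disk and so behaves the same for local, HDFS, S3, or GCS URIs.

```python
import posixpath

# Extensions listed in the error message of this PR.
RECOGNIZED_EXTENSIONS = {"orc", "parquet", "csv", "json", "psv"}


def infer_file_format(file_paths, file_format_hint="undefined"):
    """Hypothetical helper: choose a file format from already-listed paths.

    `file_paths` is assumed to come from a layer that can enumerate the
    storage backend; this function performs no filesystem access itself.
    """
    # An explicit hint always wins over guessing from the extension.
    if file_format_hint != "undefined":
        return file_format_hint

    if not file_paths:
        raise ValueError("No files were found for the given input.")

    for path in file_paths:
        # posixpath.splitext is pure string manipulation, so it also works
        # on URIs such as s3://bucket/key.parquet.
        extension = posixpath.splitext(path)[1].lstrip(".").lower()
        if extension in RECOGNIZED_EXTENSIONS:
            return extension

    raise ValueError(
        "No file has a recognized extension; pass `file_format` explicitly."
    )
```

With this shape, the check for an unrecognized extension could run after the file listing is available, rather than before it.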


if len(all_files) == 0:
raise Exception(
"ERROR: You need to ensure the current directory is not empty."
)

first_file, extension_file = os.path.splitext(all_files[0])
if (
not recognized_extension(extension_file)
and file_format_hint == "undefined"
):
raise Exception(
"ERROR: Your input file doesn't have a recognized extension, "
+ "you have to specify the `file_format` parameter. "
+ "Recognized extensions are: [orc, parquet, csv, json, psv]."
+ "\nFor example if you are using a *.log file you must pass file_format='csv' "
+ "with all the needed extra parameters. See https://docs.blazingdb.com/docs/creating-tables"
)

parsedSchema, parsed_mapping_files = self._parseSchema(
input,
file_format_hint,
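
The directory check in the added lines depends on how `os.path.splitext` treats a path ending in `/`. The sketch below illustrates that behavior with `posixpath` so the result does not depend on the host OS. `looks_like_directory` is a hypothetical name used only for illustration. It uses `str.endswith` instead of indexing with `[-1]`, on the assumption that an empty string should be treated as "not a directory" rather than raise `IndexError`.

```python
import posixpath


def looks_like_directory(path):
    """Hypothetical helper mirroring the PR's check: trailing slash, no extension."""
    name_file, extension = posixpath.splitext(path)
    # endswith() is safe on an empty string, whereas name_file[-1] is not.
    return name_file.endswith("/") and extension == ""


# A trailing slash leaves the whole path in name_file and an empty extension.
print(posixpath.splitext("/path/to/data/folder/"))
# A regular file path yields its extension, including the leading dot.
print(posixpath.splitext("/path/to/data/file.csv"))
```

Note that a directory passed without a trailing slash, such as `/path/to/data/folder`, is not detected by this check, since nothing in the string distinguishes it from an extensionless file.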