This is mostly done; the compressed_file_reader is the only processor left to convert. Its slicer is substantially different from the file_reader's, so I think the biggest question here is whether to modernize the compressed_file_reader on its own or to fold it into the file_reader.
Currently, the processor uncompresses files to a separate working directory before slicing them for processing. Once the last slice of a file is processed, an archive mechanism (known slice order issue in #17) moves the file to an "archive" directory. If the job is a persistent job, a timer mechanism in the slicer also checks the specified directory at some interval for new files. Finally, the processor maintains on-disk state for each file being processed (I think this should be removed in favor of just logging file statuses where applicable).
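To make the timer behaviour concrete, here is a minimal sketch of a directory poller. It is not the existing slicer code; `findNewFiles` and `watchDirectory` are hypothetical names, and tracking seen files in an in-memory `Set` is an assumption standing in for whatever state the real slicer keeps.

```javascript
const fs = require('fs');

// Hypothetical helper: return the names of regular files in `dir` that are
// not yet in `seen`, and record them so they are only reported once.
function findNewFiles(dir, seen) {
    const fresh = [];
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
        if (entry.isFile() && !seen.has(entry.name)) {
            seen.add(entry.name);
            fresh.push(entry.name);
        }
    }
    return fresh.sort();
}

// Hypothetical poller: check once immediately, then every `intervalMs`.
// Returns a function that stops the timer.
function watchDirectory(dir, intervalMs, onNewFiles) {
    const seen = new Set();
    const check = () => {
        const fresh = findNewFiles(dir, seen);
        if (fresh.length > 0) onNewFiles(fresh);
    };
    check();
    const timer = setInterval(check, intervalMs);
    return () => clearInterval(timer);
}
```

A persistent job would call something like `watchDirectory(path, interval, queueFilesForSlicing)` on start and invoke the returned function on shutdown; a job that runs once would only need a single `findNewFiles` call.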
If compression is added as an option for the file_reader, I imagine it would have a compression_type option with this schema:
    compression_type: {
        doc: 'Determines whether or not to uncompress files',
        default: 'uncompressed',
        format: ['uncompressed', 'lz4', ...]
    }
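As a rough illustration of how such an option could be resolved at job setup, the sketch below treats the `format` array as an enumeration of allowed values. That interpretation is an assumption, `resolveCompressionType` is a hypothetical helper, and only the two values named above are listed since the rest of the list is left open.

```javascript
// Only the two values named in the proposal are listed here; the real list
// of formats is open-ended.
const schema = {
    compression_type: {
        doc: 'Determines whether or not to uncompress files',
        default: 'uncompressed',
        format: ['uncompressed', 'lz4']
    }
};

// Hypothetical helper: apply the default when the option is missing and
// reject any value that is not listed in `format`.
function resolveCompressionType(opConfig) {
    const { default: fallback, format } = schema.compression_type;
    const value = opConfig.compression_type === undefined
        ? fallback
        : opConfig.compression_type;
    if (!format.includes(value)) {
        throw new Error(`compression_type must be one of: ${format.join(', ')}`);
    }
    return value;
}
```

With this shape, existing file_reader jobs that never set `compression_type` would keep their current behaviour, because the default is `'uncompressed'`.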
For decompression jobs, the slicer could just decompress the files in place and add both the compressed path and the uncompressed path as metadata for each record. For now, the files would be left on disk as-is, to be cleaned up after the job by an operator or some other program. Adding this to the file_reader should be fairly straightforward, since it would just be a matter of adding the compression utilities to the slicer. The next question is whether or not to preserve the persistent job logic. I think all of the file reader jobs I have encountered so far were "once" jobs, but if there is a need for persistent file reader jobs, this functionality should be extended to the file_reader as well.
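The in-place decompression step could look roughly like the sketch below. It uses gzip through Node's built-in `zlib` module purely as a stand-in, because Node ships no lz4 codec; an lz4 implementation would need a third-party library. The helper name `decompressInPlace` and the metadata keys `compressed_path` and `uncompressed_path` are assumptions for illustration.

```javascript
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');

// Hypothetical helper: decompress a file next to the original and return
// both paths so the slicer can attach them to each record as metadata.
// gzip stands in for lz4 here (assumption; Node has no built-in lz4).
function decompressInPlace(compressedPath, compressionType) {
    if (compressionType === 'uncompressed') {
        // Nothing to do: the file is read as-is.
        return { compressed_path: null, uncompressed_path: compressedPath };
    }
    if (compressionType !== 'gzip') {
        throw new Error(`Unsupported compression_type: ${compressionType}`);
    }
    const parsed = path.parse(compressedPath);
    // Drop the final extension (e.g. ".gz"); if there is none, add a suffix
    // so the output never overwrites the compressed input.
    const outputName = parsed.ext ? parsed.name : `${parsed.base}.uncompressed`;
    const uncompressedPath = path.join(parsed.dir, outputName);
    const data = zlib.gunzipSync(fs.readFileSync(compressedPath));
    fs.writeFileSync(uncompressedPath, data);
    // The compressed file is left on disk for later cleanup.
    return { compressed_path: compressedPath, uncompressed_path: uncompressedPath };
}
```

This version reads the whole file into memory, which keeps the sketch short; for large files a streaming approach with `zlib.createGunzip()` would likely be a better fit.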
This still uses the old-style processor APIs and should be updated at some point.