-
Notifications
You must be signed in to change notification settings - Fork 136
Decompress util #244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Decompress util #244
Conversation
turicas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR! :)
I've requested some changes/questions and we can discuss about them. =)
rows/utils.py
Outdated
| lzma_mime_types = ( | ||
| 'application/x-xz', | ||
| 'application/x-lzma' | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think mimetype detection is not needed here - we can expect the user will call this function only if she knows the file is compressed and in one of the supported algorithms; we can do this detection automatically on the command-line interface using file-magic and then pass the correct arguments to decompress. Do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally : )
rows/utils.py
Outdated
| return export_function(table, uri, *args, **kwargs) | ||
|
|
||
|
|
||
| def decompress(path, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add an algorithm parameter to this function? It should defaults to None (if None, use file extension to define it).
rows/utils.py
Outdated
| msg = "Couldn't identify file mimetype, or lzma module isn't available" | ||
| raise RuntimeError(msg) | ||
|
|
||
| with open_compressed(path, **kwargs) as handler: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think having kwargs on decompress is really needed? Could you give me an example use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically encoding. On UNIX for example I barely use it. But Windows user should always add utf-8 I was told.
tests/tests_utils.py
Outdated
|
|
||
| def setUp(self): | ||
| self.contents = six.b('Ahoy') | ||
| self.temp = tempfile.TemporaryDirectory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the tests are failing here. I've replaced self.tmp with self.temp (was receiving a NameError) but they still fail:
======================================================================
ERROR: test_decompress_with_bz2 (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/turicas/projects/rows/tests/tests_utils.py", line 99, in test_decompress_with_bz2
decompressed = rows.utils.decompress(compressed)
File "/home/turicas/projects/rows/rows/utils.py", line 359, in decompress
raise RuntimeError(msg)
RuntimeError: Couldn't identify file mimetype, or lzma module isn't available
======================================================================
ERROR: test_decompress_with_gz (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/turicas/projects/rows/tests/tests_utils.py", line 107, in test_decompress_with_gz
self.assertEqual(self.contents, decompressed.read())
File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/gzip.py", line 272, in read
self._check_not_closed()
File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/_compression.py", line 14, in _check_not_closed
raise ValueError("I/O operation on closed file")
ValueError: I/O operation on closed file
======================================================================
ERROR: test_decompress_with_incompatible_file (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/turicas/projects/rows/tests/tests_utils.py", line 126, in test_decompress_with_incompatible_file
with self.assertRaises():
TypeError: assertRaises() missing 1 required positional argument: 'expected_exception'
======================================================================
ERROR: test_decompress_with_lzma (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/turicas/projects/rows/tests/tests_utils.py", line 112, in test_decompress_with_lzma
with lzma.open(compressed) as compressed_handler:
File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/lzma.py", line 302, in open
preset=preset, filters=filters)
File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/lzma.py", line 120, in __init__
self._fp = builtins.open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpr5s7bse8/test.lzma'
======================================================================
ERROR: test_decompress_with_xz (tests.tests_utils.UtilsDecompressTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/turicas/projects/rows/tests/tests_utils.py", line 120, in test_decompress_with_xz
with lzma.open(compressed) as compressed_handler:
File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/lzma.py", line 302, in open
preset=preset, filters=filters)
File "/home/turicas/software/pyenv/versions/3.6.2/lib/python3.6/lzma.py", line 120, in __init__
self._fp = builtins.open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpz5l7xm1q/test.gz'
----------------------------------------------------------------------
Ran 184 tests in 0.848s
Are the tests passing in your machine?
rows/utils.py
Outdated
| kwargs are passed to either `bz2.openn`, `gzip.open` or `lzma.open`. | ||
| :param path: (str) path to a bz2, gzip or lzma file | ||
| """ | ||
| filename = os.path.basename(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the current API accepts filenames or file-objects,
this function should also do (this decision was inspired in the Python stdlib modules, such as csv). You can get some help using rows.plugins.utils.get_filename_and_fobj.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was indeed pretty helpful, thanks ; )
rows/utils.py
Outdated
| raise RuntimeError(msg) | ||
|
|
||
| with open_compressed(path, **kwargs) as handler: | ||
| return handler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very important to ensure the file object returned is open in binary mode (so the plugins will decode the data using the desired encoding). Could you please add a test for this case?
|
Ok, sorry I took too long to get back to this PR. Life's got hectic around here. Anyway, I addressed many issues in this refactor:
About tests: I've just ran Do you prefer to implement |
fa144f0 to
bbb2c57
Compare
This PR implements
rows.utils.decompressas suggested in #230. The API is:A
RunTimeerror is raised if:rowscan't properly guess the mimetype as a known mimetype of a lzma or gzip filerowsidentifies a lzma mimetype but thelzmamodule is not available (a lzma lib has to be available in the OS when compiling Python it self as explained here)