Add Marisa trie index #6
Man, this is great, thanks for the contrib! This is a good excuse to clear out the PR backlog here. Sorry ahead of time if it takes a couple of days to get back, currently on some paper deadlines.

Hell, thank you for building this thing; it scratched an itch I've had for a long time: getting so spoiled with [...]. Oh, yeah, I also removed [...]. I was intrigued about including a snapshot of the package index for immediate use after installation. Was that your intention? Would you like to keep that functionality? Maybe I could include it as an extra, for use with [...]

    def get_pip_cache_data_dir():
        return os.path.join(get_xdg_data_dir(), 'pip-cache')


    def get_raw_index_filename():
        return os.path.join(get_pip_cache_data_dir(), 'all-packages.txt')


    def get_index_filename():
        return os.path.join(get_pip_cache_data_dir(), 'all-packages.marisa')

FYI: I changed these to functions, instead of global constants, so I could easily patch them out during testing.
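
To illustrate why functions are easier to patch than constants, here is a minimal sketch. The body of get_xdg_data_dir and the helper index_filename_under are assumptions made up for this example, not code from the PR. Because get_index_filename looks up get_pip_cache_data_dir at call time, swapping that one function redirects every derived path.

```python
import os
from unittest import mock


def get_xdg_data_dir():
    # Assumed stand-in for the project's helper: honour XDG_DATA_HOME,
    # falling back to the usual default location.
    return os.environ.get('XDG_DATA_HOME',
                          os.path.expanduser('~/.local/share'))


def get_pip_cache_data_dir():
    return os.path.join(get_xdg_data_dir(), 'pip-cache')


def get_index_filename():
    # Resolved on every call, so a patched data dir is picked up here.
    return os.path.join(get_pip_cache_data_dir(), 'all-packages.marisa')


def index_filename_under(tmp_dir):
    """Hypothetical test helper: index path with the data dir patched."""
    patched = {'get_pip_cache_data_dir': lambda: tmp_dir}
    with mock.patch.dict(globals(), patched):
        return get_index_filename()
```

With a module-level constant, the index path would already have been computed at import time, so a test would have to patch every derived constant separately.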

    if not os.path.isdir(pip_cache_data_dir):
        os.mkdir(pip_cache_data_dir)

    print('Connecting to PyPi...', end='', flush=True)

FYA: there's a neat little package, manrajgrover/halo, which adds a little spinner in the console and can be used as a context manager:

    with Halo('Connecting to PyPi'):
        client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')
    with Halo('Downloading package names'):
        packages = client.list_packages()
    with Halo('Writing packages to cache'):
        with open(raw_index_filename, 'w') as f:
            f.write('\n'.join(packages))
    with Halo('Indexing packages'):
        trie = marisa_trie.Trie(packages)
        trie.save(index_filename)

I like to save the text of each thing, though:

    class task(Halo):
        def __exit__(self, type, value, traceback):
            self.succeed()

    with task('Connecting to PyPi'):
        client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')
    with task('Downloading package names'):
        packages = client.list_packages()
    with task('Writing packages to cache'):
        with open(raw_index_filename, 'w') as f:
            f.write('\n'.join(packages))
    with task('Indexing packages'):
        trie = marisa_trie.Trie(packages)
        trie.save(index_filename)

    @pytest.fixture(scope='session')
    def pip_cache_dir(request):
        path = request.config.cache.makedir('pip-cache-data')

This'll create a directory under .pytest_cache/d/, which'll remain in between test runs. To clear the cache, py.test --cache-clear can be used; though, deleting .pytest_cache would prolly work just as well :P

        if not request.config.cache.get(CACHE_KEY, None):
            pip_cache.update_package_list()
            request.config.cache.set(CACHE_KEY, True)

This'll get stored as .pytest_cache/v/pip-cache/is-cache-built, and I think it's a JSON file



I was curious about the performance differences, so I implemented the package name cache as a Marisa trie. Here are some of my results.
Methodology
I used ionelmc/pytest-benchmark to measure performance in my tests. I ran pkgnames with a few different prefixes:
- '': the empty string, to measure results when all package names are returned
- 'i': a large number of packages to be returned, but filtered from all packages
- 'ipy': a small number of packages
Base
First, I ran the benchmarks against the existing package
Filtering the package names results in pretty damn decent results... though, if all packages were requested, the performance was pretty poor, at ~0.5s.
All packages perf
First, I wanted to fix that 0.5s for requesting all packages.
I noticed pkgnames() printing each package line-by-line in a for loop.
In general, when doing perf work, the core tenet is not to make things faster, but to do fewer things. And also, when it comes to I/O, the language/OS has probably built better buffering than I ever could; I seek to provide as much data upfront, so those buffers can remain full.
I changed the printing to generate the entire response in one fell swoop, with '\n'.join(matching_packages), and just printing that whole fella.
And results from the benchmark:
Aww yiss, the all-packages performance matched the filtered performance
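
For reference, the change amounts to roughly the following. This is a sketch, not the actual diff: the function names emit_line_by_line and emit_joined are made up for illustration, and the output goes to any file-like object so the two versions can be compared.

```python
import io


def emit_line_by_line(names, out):
    # Original approach: one print call (and one write) per package name.
    for name in names:
        print(name, file=out)


def emit_joined(names, out):
    # Build the whole response first, then hand it over in a single write.
    # The guard keeps the output empty when there are no matches, the same
    # as the loop version.
    if names:
        out.write('\n'.join(names) + '\n')


if __name__ == '__main__':
    buf = io.StringIO()
    emit_joined(['ipdb', 'ipython'], buf)
    print(buf.getvalue(), end='')
```

Both versions produce the same bytes; the second just does far fewer calls to get there.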
Trie
Next, I wanted to try out the Marisa trie. It ended up being a lot simpler to implement than I expected, with instantiation just being Trie(packages), saving to file as easy as trie.save(path), and loading with trie = Trie(); trie.load(path).
I altered pkgnames to read from the trie, and these were the results:
And wow! The speeds were measured in microseconds, not milliseconds! For the 'i' input, the trie was 25x faster; for 'ipy', it was 200x faster!
But pulling all the packages from the trie still took 60ms, which was slower than the regular index.
Trie w/ All Packages
To address the all-packages performance w/ the trie, I made a special case: if an empty prefix is provided, simply parrot the raw package names file (the original package names index).
17ms — not bad!
All packages optimization
I realized that for empty prefix / all-packages response, I was still splitting all the lines in the raw index, and rejoining them when emitting the response. I changed it to simply parrot the raw index file — read the file, then print it out.
Boom! 2ms :) (Well, 4ms max)
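
Put together, the empty-prefix special case looks something like the sketch below. The signature is an assumption for illustration: raw_index_path and lookup are hypothetical parameters, with lookup standing in for whatever returns the names matching a prefix (in this PR, the trie), so the sketch runs without marisa_trie installed.

```python
import os
import tempfile


def pkgnames(prefix, raw_index_path, lookup):
    """Return newline-separated package names starting with prefix."""
    if not prefix:
        # All packages requested: the raw index already is the answer,
        # so read it and pass it through without splitting or rejoining.
        with open(raw_index_path) as f:
            return f.read()
    # Otherwise ask the prefix index for the matching names only.
    return '\n'.join(lookup(prefix))


if __name__ == '__main__':
    names = ['ipdb', 'ipython', 'requests']
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'all-packages.txt')
        with open(path, 'w') as f:
            f.write('\n'.join(names))
        # A plain list scan stands in for the trie lookup here.
        scan = lambda p: [n for n in names if n.startswith(p)]
        print(pkgnames('', path, scan))
        print(pkgnames('ipy', path, scan))
```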
Summary
So, with a few optimizations, overall performance is improved 10x — and in some cases, 200x. Success :)