Add Marisa trie index by theY4Kman · Pull Request #6 · brunobeltran/pip-cache

theY4Kman · 2020-05-16T22:39:12Z

I was curious about the performance differences, so I implemented the package name cache as a Marisa trie. Here are some of my results.

Methodology

I used ionelmc/pytest-benchmark to measure performance in my tests. I ran pkgnames with a few different prefixes:

'' — empty string, to measure results when all package names are returned
'i' — a large number of packages to be returned, but filtered from all packages
'ipy' — a small number of packages
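The measurement idea can be approximated with the standard library alone. The sketch below is not the pytest-benchmark suite from this PR: it uses `timeit` as a stand-in, and the package list and `get_package_names` are hypothetical placeholders for the real index lookup.

```python
import timeit

# Hypothetical, tiny package list; the real index holds every name on PyPI.
PACKAGES = ['idna', 'ipykernel', 'ipython', 'ipywidgets', 'isort', 'numpy', 'requests']


def get_package_names(prefix=''):
    """Stand-in for the real lookup: a linear scan over all names."""
    return [name for name in PACKAGES if name.startswith(prefix)]


def time_prefix(prefix, repeat=5, number=1000):
    """Return the best observed time per call, in seconds."""
    timer = timeit.Timer(lambda: get_package_names(prefix))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# The same three prefixes as above: everything, many matches, few matches.
results = {prefix: time_prefix(prefix) for prefix in ('', 'i', 'ipy')}
for prefix, seconds in results.items():
    print(f'{prefix!r:>6}: {seconds * 1e6:.2f} us per call')
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; pytest-benchmark reports a fuller set of statistics, as the tables below show.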

Base

First, I ran the benchmarks against the existing package:

---------------------------------------------------------------- benchmark 'pkgnames method=<function pkgnames at 0x7fed776fe550>': 4 tests ---------------------------------------------------------------
Name (time in ms)                            Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0-pkgnames]          32.1004 (1.0)       39.3018 (1.16)      32.8190 (1.01)     1.2665 (4.77)      32.5297 (1.00)     0.6084 (2.23)          1;2  30.4702 (0.99)         31           1
test_pkgnames_speed[ipy-pkgnames]        32.2818 (1.01)      33.8813 (1.0)       32.5978 (1.0)      0.4063 (1.53)      32.4363 (1.0)      0.3085 (1.13)          4;4  30.6769 (1.0)          30           1
test_pkgnames_speed[i-pkgnames]          42.8242 (1.33)      44.0079 (1.30)      43.3055 (1.33)     0.2656 (1.0)       43.3043 (1.34)     0.2733 (1.0)           7;1  23.0918 (0.75)         24           1
test_pkgnames_speed[<all>-pkgnames]     484.3362 (15.09)    492.8921 (14.55)    487.4714 (14.95)    3.8232 (14.39)    485.3129 (14.96)    6.1898 (22.64)         1;0   2.0514 (0.07)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Filtering the package names gives pretty damn decent times (~32-43ms)... though, if all packages were requested, the performance was pretty poor, at ~0.5s.

All packages perf

First, I wanted to fix that 0.5s for requesting all packages.

I noticed pkgnames() was printing each package line-by-line in a for loop:

def pkgnames(prefix=''):
    matching_packages = get_package_names(prefix=prefix)
    for package in matching_packages:
        print(package)

In general, when doing perf work, the core tenet is not to make things faster, but to do fewer things. And also, when it comes to I/O, the language/OS has probably built better buffering than I ever could; I seek to provide as much data upfront, so those buffers can remain full.

I changed the printing to generate the entire response in one fell swoop, with '\n'.join(matching_packages), and just printing that whole fella:

def pkgnames(prefix=''):
    matching_packages = get_package_names(prefix=prefix)
    response = '\n'.join(matching_packages)
    print(response)

And results from the benchmark:

------------------------------------------------------------- benchmark 'pkgnames method=<function pkgnames_bulk_print at 0x7fed776fe5e0>': 4 tests --------------------------------------------------------------
Name (time in ms)                                      Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0-pkgnames_bulk_print]         31.5530 (1.0)      32.7390 (1.01)     31.9732 (1.0)      0.3079 (2.34)     31.8860 (1.0)      0.3953 (2.80)         10;0  31.2762 (1.0)          31           1
test_pkgnames_speed[ipy-pkgnames_bulk_print]       31.8123 (1.01)     32.3486 (1.0)      32.0379 (1.00)     0.1316 (1.0)      32.0156 (1.00)     0.1414 (1.0)           9;1  31.2131 (1.00)         32           1
test_pkgnames_speed[i-pkgnames_bulk_print]         32.8958 (1.04)     37.6176 (1.16)     33.4168 (1.05)     1.0162 (7.72)     33.1444 (1.04)     0.2340 (1.65)          2;3  29.9251 (0.96)         22           1
test_pkgnames_speed[<all>-pkgnames_bulk_print]     37.1408 (1.18)     39.4088 (1.22)     38.1502 (1.19)     0.6027 (4.58)     38.2545 (1.20)     0.7840 (5.54)          9;0  26.2122 (0.84)         25           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Aww yiss, the all-packages performance matched the filtered performance
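The two printing strategies can be compared in isolation. This is only a rough sketch: it writes to an in-memory `io.StringIO` rather than a real terminal or pipe, the package names are made up, and the timings it prints will differ from the benchmark numbers above.

```python
import io
import time
from contextlib import redirect_stdout


def print_per_line(names):
    """One print() call per package, as in the original pkgnames()."""
    for name in names:
        print(name)


def print_joined(names):
    """Build the whole response first, then emit it with a single print()."""
    print('\n'.join(names))


def timed(func, names):
    """Run func with stdout captured; return (elapsed seconds, captured text)."""
    sink = io.StringIO()
    start = time.perf_counter()
    with redirect_stdout(sink):
        func(names)
    return time.perf_counter() - start, sink.getvalue()


# Hypothetical package names, just to have a large amount of output.
names = [f'package-{i}' for i in range(50_000)]

per_line_time, per_line_out = timed(print_per_line, names)
joined_time, joined_out = timed(print_joined, names)

print(f'per-line: {per_line_time * 1000:.1f} ms, joined: {joined_time * 1000:.1f} ms')
```

Both functions produce byte-for-byte identical output; only the number of write calls differs.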

Trie

Next, I wanted to try out the Marisa trie. It ended up being a lot simpler than I expected to implement, with instantiation just being Trie(packages), saving to file as easy as trie.save(path), and loading with trie = Trie(); trie.load(path).

I altered pkgnames to read from the trie, and these were the results:

--------------------------------------------------------------------------- benchmark 'pkgnames method=<function pkgnames_trie at 0x7f4c3bdb6a60>': 4 tests ---------------------------------------------------------------------------
Name (time in us)                                    Min                    Max                   Mean                StdDev                 Median                   IQR            Outliers         OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0-pkgnames_trie]            121.5001 (1.0)         238.8821 (1.0)         130.4816 (1.0)          9.6921 (1.0)         127.7820 (1.0)          5.4599 (1.12)      153;161  7,663.9147 (1.0)        2374           1
test_pkgnames_speed[ipy-pkgnames_trie]          144.4730 (1.19)        372.6850 (1.56)        157.3764 (1.21)        21.5655 (2.23)        153.1346 (1.20)         4.8894 (1.0)       142;276  6,354.1947 (0.83)       3276           1
test_pkgnames_speed[i-pkgnames_trie]          1,346.6370 (11.08)     2,199.2889 (9.21)      1,394.0925 (10.68)       85.1560 (8.79)      1,379.6790 (10.80)       26.3505 (5.39)        18;20    717.3125 (0.09)        635           1
test_pkgnames_speed[<all>-pkgnames_trie]     60,827.8541 (500.64)   72,994.6129 (305.57)   62,315.6022 (477.58)   2,990.3667 (308.54)   61,359.3940 (480.19)   1,137.2961 (232.60)        1;1     16.0473 (0.00)         16           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And wow! The speeds were measured in microseconds, not milliseconds! Going by the mean times against the bulk-print run, the trie was about 24x faster for the 'i' input (33.4ms vs. 1.39ms) and about 200x faster for 'ipy' (32.0ms vs. 0.157ms)!

But pulling all the packages from the trie still took ~60ms, which was slower than the ~38ms of the regular index with bulk printing.
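Why an index helps for narrow prefixes but not for the empty one can be illustrated without the trie. The sketch below is not the Marisa trie: it swaps in a sorted list with `bisect` as a standard-library stand-in, and the package names are made up. The point is the same, though: an indexed lookup jumps straight to the matching range, while a full listing still has to touch every entry.

```python
import bisect


def linear_prefix_scan(names, prefix):
    """Check every name: cost grows with the size of the whole index."""
    return [name for name in names if name.startswith(prefix)]


def sorted_prefix_range(sorted_names, prefix):
    """Binary-search to the first candidate, then walk only the matches.

    In a sorted list, all names sharing a prefix sit next to each other,
    so the cost is roughly log(index size) plus the number of matches.
    """
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = lo
    while hi < len(sorted_names) and sorted_names[hi].startswith(prefix):
        hi += 1
    return sorted_names[lo:hi]


# Hypothetical package names.
names = sorted(['ipython', 'ipykernel', 'idna', 'requests', 'ipywidgets', 'numpy'])

few = sorted_prefix_range(names, 'ipy')
everything = sorted_prefix_range(names, '')
```

With an empty prefix the "range" is the entire list, so the indexed lookup has no shortcut left to take.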

Trie w/ All Packages

To address the all-packages performance w/ the trie, I made a special case: if an empty prefix is provided, simply parrot the raw package names file (the original package names index).

def pkgnames(prefix=''):
    if not prefix:
        matching_packages = get_all_package_names()
    else:
        matching_packages = get_package_names(prefix=prefix)

    response = '\n'.join(matching_packages)
    print(response)

-------------------------------------------------------------------------------------------- benchmark 'pkgnames': 4 tests --------------------------------------------------------------------------------------------
Name (time in us)                      Min                    Max                   Mean                StdDev                 Median                 IQR            Outliers         OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0]            121.5001 (1.0)         168.2369 (1.0)         127.9703 (1.0)          6.5916 (1.0)         125.5025 (1.0)        4.6890 (1.0)       267;253  7,814.3134 (1.0)        2342           1
test_pkgnames_speed[ipy]          147.9900 (1.22)        319.0329 (1.90)        159.2223 (1.24)        13.8756 (2.11)        155.8540 (1.24)       5.0691 (1.08)      172;270  6,280.5287 (0.80)       3287           1
test_pkgnames_speed[i]          1,312.9509 (10.81)     2,971.6729 (17.66)     1,362.7967 (10.65)       90.5921 (13.74)     1,341.6645 (10.69)     20.2591 (4.32)        49;86    733.7851 (0.09)        690           1
test_pkgnames_speed[<all>]     17,952.0501 (147.75)   24,011.7370 (142.73)   20,125.1347 (157.26)   1,040.7176 (157.89)   19,827.9684 (157.99)   320.2048 (68.29)        9;11     49.6891 (0.01)         50           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About 18ms at best (20ms mean), not bad!

All packages optimization

I realized that for the empty-prefix / all-packages response, I was still splitting all the lines in the raw index, and rejoining them when emitting the response. I changed it to simply parrot the raw index file: read the file, then print it out.

----------------------------------------------------------------------------------------- benchmark 'pkgnames': 4 tests -----------------------------------------------------------------------------------------
Name (time in us)                     Min                   Max                  Mean              StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0]           123.2328 (1.0)        262.3170 (1.0)        133.0901 (1.0)       15.0659 (1.0)        129.9269 (1.0)        4.0630 (1.0)        80;175  7,513.7051 (1.0)        2289           1
test_pkgnames_speed[ipy]         150.7460 (1.22)       355.0630 (1.35)       164.6641 (1.24)      20.6171 (1.37)       159.6075 (1.23)       4.9237 (1.21)      119;213  6,072.9679 (0.81)       2132           1
test_pkgnames_speed[i]         1,390.0651 (11.28)    3,571.6922 (13.62)    1,485.3559 (11.16)    161.0224 (10.69)    1,461.4101 (11.25)     26.2756 (6.47)        14;34    673.2393 (0.09)        636           1
test_pkgnames_speed[<all>]     2,341.6900 (19.00)    3,823.8191 (14.58)    2,697.3013 (20.27)    194.8775 (12.94)    2,642.0900 (20.34)    220.8641 (54.36)        47;9    370.7409 (0.05)        263           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Boom! 2ms :) (Well, 4ms max)
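The read-and-parrot shortcut might look roughly like the sketch below. It is an illustration under assumptions, not the code from the PR: the `raw_index_path`, `lookup`, and `out` parameters and the `fake_lookup` helper are hypothetical, introduced so the example runs on its own.

```python
import io
import os
import tempfile


def pkgnames(prefix, raw_index_path, lookup, out):
    """Write matching package names to `out`, one per line."""
    if not prefix:
        # Empty prefix: no splitting or rejoining, just copy the file through.
        with open(raw_index_path) as f:
            out.write(f.read())
    else:
        out.write('\n'.join(lookup(prefix)))
    out.write('\n')


def fake_lookup(prefix):
    """Hypothetical stand-in for the trie lookup."""
    return [n for n in ('idna', 'ipython', 'numpy') if n.startswith(prefix)]


with tempfile.TemporaryDirectory() as tmp:
    raw_index_path = os.path.join(tmp, 'all-packages.txt')
    with open(raw_index_path, 'w') as f:
        f.write('idna\nipython\nnumpy')

    all_out, ipy_out = io.StringIO(), io.StringIO()
    pkgnames('', raw_index_path, fake_lookup, all_out)
    pkgnames('ipy', raw_index_path, fake_lookup, ipy_out)
```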

Summary

So, with a few optimizations, overall performance is improved 10x — and in some cases, 200x. Success :)

brunobeltran · 2020-05-17T19:42:05Z

Man, this is great, thanks for the contrib! This is a good excuse to clear out the PR backlog here. Sorry ahead of time if it takes a couple of days to get back, currently on some paper deadlines.

theY4Kman · 2020-05-17T21:02:04Z

Hell, thank you for building this thing; it scratched an itch I've had for a long time: getting so spoiled with apt install <tab><tab> and wanting it for pip.

Oh, yeah, I also removed all-packages.txt from data_files, because it was being included in the wheel with your full home directory path :P

I was intrigued about including a snapshot of the package index for immediate use after installation. Was that your intention? Would you like to keep that functionality? Maybe I could include it as an extra, for use with pip install pip-cache[full] or something

theY4Kman · 2020-05-17T21:04:14Z

+def get_pip_cache_data_dir():
+    return os.path.join(get_xdg_data_dir(), 'pip-cache')
+
+
+def get_raw_index_filename():
+    return os.path.join(get_pip_cache_data_dir(), 'all-packages.txt')
+
+
+def get_index_filename():
+    return os.path.join(get_pip_cache_data_dir(), 'all-packages.marisa')


FYI: I changed these to methods, instead of global constants, so I could easily patch them out during testing

theY4Kman · 2020-05-17T21:23:13Z

+    if not os.path.isdir(pip_cache_data_dir):
+        os.mkdir(pip_cache_data_dir)
+
+    print('Connecting to PyPi...', end='', flush=True)


FYA: there's a neat little package, manrajgrover/halo, which adds a little spinner in the console and can be used as a context manager

with Halo('Connecting to PyPi'):
    client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')

with Halo('Downloading package names'):
    packages = client.list_packages()

with Halo('Writing packages to cache'):
    with open(raw_index_filename, 'w') as f:
        f.write('\n'.join(packages))

with Halo('Indexing packages'):
    trie = marisa_trie.Trie(packages)
    trie.save(index_filename)

I like to save the text of each thing, though

class task(Halo):
    def __exit__(self, type, value, traceback):
        self.succeed()

with task('Connecting to PyPi'):
    client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')

with task('Downloading package names'):
    packages = client.list_packages()

with task('Writing packages to cache'):
    with open(raw_index_filename, 'w') as f:
        f.write('\n'.join(packages))

with task('Indexing packages'):
    trie = marisa_trie.Trie(packages)
    trie.save(index_filename)

theY4Kman · 2020-05-17T21:30:11Z

+
+@pytest.fixture(scope='session')
+def pip_cache_dir(request):
+    path = request.config.cache.makedir('pip-cache-data')


This'll create a directory under .pytest_cache/d/, which'll remain in between test runs. To clear the cache, py.test --cache-clear can be used; though, deleting .pytest_cache would prolly work just as well :P

theY4Kman · 2020-05-17T21:30:55Z

+
+    if not request.config.cache.get(CACHE_KEY, None):
+        pip_cache.update_package_list()
+        request.config.cache.set(CACHE_KEY, True)


This'll get stored as .pytest_cache/v/pip-cache/is-cache-built, and I think it's a JSON file

theY4Kman added 5 commits May 16, 2020 18:04

feat: add marisa trie package index

b975aa9

test: add benchmark tests

2cea4a7

chore: don't install package cache w/ package

03523f7

chore: add marisa-trie req, and pytest-benchmark test req

90c9321

chore: improve performance of all-packages responses

a302739

theY4Kman wants to merge 5 commits into brunobeltran:master from theY4Kman:feat/add-marisa-trie
