Add Marisa trie index #6

Open
theY4Kman wants to merge 5 commits into brunobeltran:master from theY4Kman:feat/add-marisa-trie

@theY4Kman

I was curious about the performance differences, so I implemented the package name cache as a Marisa trie. Here are some of my results.

Methodology

I used ionelmc/pytest-benchmark to measure performance in my tests. I ran pkgnames with a few different prefixes:

  1. '' — empty string, to measure results when all package names are returned
  2. 'i' — a large number of packages to be returned, but filtered from all packages
  3. 'ipy' — a small number of packages

Base

First, I ran the benchmarks against the existing package:

---------------------------------------------------------------- benchmark 'pkgnames method=<function pkgnames at 0x7fed776fe550>': 4 tests ---------------------------------------------------------------
Name (time in ms)                            Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0-pkgnames]          32.1004 (1.0)       39.3018 (1.16)      32.8190 (1.01)     1.2665 (4.77)      32.5297 (1.00)     0.6084 (2.23)          1;2  30.4702 (0.99)         31           1
test_pkgnames_speed[ipy-pkgnames]        32.2818 (1.01)      33.8813 (1.0)       32.5978 (1.0)      0.4063 (1.53)      32.4363 (1.0)      0.3085 (1.13)          4;4  30.6769 (1.0)          30           1
test_pkgnames_speed[i-pkgnames]          42.8242 (1.33)      44.0079 (1.30)      43.3055 (1.33)     0.2656 (1.0)       43.3043 (1.34)     0.2733 (1.0)           7;1  23.0918 (0.75)         24           1
test_pkgnames_speed[<all>-pkgnames]     484.3362 (15.09)    492.8921 (14.55)    487.4714 (14.95)    3.8232 (14.39)    485.3129 (14.96)    6.1898 (22.64)         1;0   2.0514 (0.07)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Filtering the package names results in pretty damn decent results... though, if all packages were requested, the performance was pretty poor, at ~0.5s.

All packages perf

First, I wanted to fix that 0.5s for requesting all packages.

I noticed pkgnames() printing each package line-by-line in a for loop:

def pkgnames(prefix=''):
    matching_packages = get_package_names(prefix=prefix)
    for package in matching_packages:
        print(package)

In general, when doing perf work, the core tenet is not to make things faster, but to do fewer things. And when it comes to I/O, the language/OS has probably built better buffering than I ever could, so I try to provide as much data upfront as possible to keep those buffers full.

I changed the printing to generate the entire response in one fell swoop, with '\n'.join(matching_packages), and then print that whole fella in a single call:

def pkgnames(prefix=''):
    matching_packages = get_package_names(prefix=prefix)
    response = '\n'.join(matching_packages)
    print(response)

And results from the benchmark:

------------------------------------------------------------- benchmark 'pkgnames method=<function pkgnames_bulk_print at 0x7fed776fe5e0>': 4 tests --------------------------------------------------------------
Name (time in ms)                                      Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0-pkgnames_bulk_print]         31.5530 (1.0)      32.7390 (1.01)     31.9732 (1.0)      0.3079 (2.34)     31.8860 (1.0)      0.3953 (2.80)         10;0  31.2762 (1.0)          31           1
test_pkgnames_speed[ipy-pkgnames_bulk_print]       31.8123 (1.01)     32.3486 (1.0)      32.0379 (1.00)     0.1316 (1.0)      32.0156 (1.00)     0.1414 (1.0)           9;1  31.2131 (1.00)         32           1
test_pkgnames_speed[i-pkgnames_bulk_print]         32.8958 (1.04)     37.6176 (1.16)     33.4168 (1.05)     1.0162 (7.72)     33.1444 (1.04)     0.2340 (1.65)          2;3  29.9251 (0.96)         22           1
test_pkgnames_speed[<all>-pkgnames_bulk_print]     37.1408 (1.18)     39.4088 (1.22)     38.1502 (1.19)     0.6027 (4.58)     38.2545 (1.20)     0.7840 (5.54)          9;0  26.2122 (0.84)         25           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Aww yiss, the all-packages performance matched the filtered performance
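The difference between the two versions can also be checked outside the benchmark harness. This is only a rough sketch using the standard library's timeit rather than pytest-benchmark. The generated package names and the helper names (print_each, print_joined, captured) are made up for illustration, and the timings will vary by machine:

```python
import io
import timeit
from contextlib import redirect_stdout

# Hypothetical stand-in for the real package list
names = [f"package-{i}" for i in range(50_000)]


def print_each(packages):
    # One print() call, and therefore one write, per package
    for package in packages:
        print(package)


def print_joined(packages):
    # Build the whole response first, then hand it over in one call
    print('\n'.join(packages))


def captured(func, packages):
    # Run func with stdout redirected into an in-memory buffer
    buf = io.StringIO()
    with redirect_stdout(buf):
        func(packages)
    return buf.getvalue()


# For a non-empty list, both versions emit exactly the same text
assert captured(print_each, names) == captured(print_joined, names)

t_each = timeit.timeit(lambda: captured(print_each, names), number=5)
t_joined = timeit.timeit(lambda: captured(print_joined, names), number=5)
print(f"per-line: {t_each:.3f}s, joined: {t_joined:.3f}s")
```

Writing to an in-memory buffer leaves out the terminal or pipe, so this only shows the per-call overhead and not the full I/O cost measured above.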

Trie

Next, I wanted to try out the Marisa trie. It ended up being a lot simpler than I expected to implement, with instantiation just being Trie(packages), saving to file as easy as trie.save(path), and loading with trie = Trie(); trie.load(path).
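For anyone who hasn't used marisa_trie, here is a minimal sketch of that build/save/load round trip plus a prefix query. The package list and the temporary path are made up for illustration, and this is not the exact code from the PR:

```python
import os
import tempfile

import marisa_trie

# Hypothetical sample; the real index holds every package name on PyPI
packages = ['ipython', 'ipykernel', 'ipywidgets', 'idna', 'requests']

# Build the trie and write it to disk
trie = marisa_trie.Trie(packages)
index_path = os.path.join(tempfile.mkdtemp(), 'all-packages.marisa')
trie.save(index_path)

# Load it back into a fresh, empty Trie
loaded = marisa_trie.Trie()
loaded.load(index_path)

# keys(prefix) returns every stored key that starts with the prefix
ipy_matches = sorted(loaded.keys('ipy'))
i_matches = sorted(loaded.keys('i'))
print(ipy_matches)
print(i_matches)
```

The results are sorted here only to make the output stable; the trie does not promise to hand keys back in insertion order.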

I altered pkgnames to read from the trie, and these were the results:

--------------------------------------------------------------------------- benchmark 'pkgnames method=<function pkgnames_trie at 0x7f4c3bdb6a60>': 4 tests ---------------------------------------------------------------------------
Name (time in us)                                    Min                    Max                   Mean                StdDev                 Median                   IQR            Outliers         OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0-pkgnames_trie]            121.5001 (1.0)         238.8821 (1.0)         130.4816 (1.0)          9.6921 (1.0)         127.7820 (1.0)          5.4599 (1.12)      153;161  7,663.9147 (1.0)        2374           1
test_pkgnames_speed[ipy-pkgnames_trie]          144.4730 (1.19)        372.6850 (1.56)        157.3764 (1.21)        21.5655 (2.23)        153.1346 (1.20)         4.8894 (1.0)       142;276  6,354.1947 (0.83)       3276           1
test_pkgnames_speed[i-pkgnames_trie]          1,346.6370 (11.08)     2,199.2889 (9.21)      1,394.0925 (10.68)       85.1560 (8.79)      1,379.6790 (10.80)       26.3505 (5.39)        18;20    717.3125 (0.09)        635           1
test_pkgnames_speed[<all>-pkgnames_trie]     60,827.8541 (500.64)   72,994.6129 (305.57)   62,315.6022 (477.58)   2,990.3667 (308.54)   61,359.3940 (480.19)   1,137.2961 (232.60)        1;1     16.0473 (0.00)         16           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And wow! The speeds were measured in microseconds, not milliseconds! For the 'i' input, the trie was 25x faster; for 'ipy', it was 200x faster!

But pulling all the packages from the trie was still 60ms, which was slower than the regular index.

Trie w/ All Packages

To address the all-packages performance w/ the trie, I made a special case: if an empty prefix is provided, simply parrot the raw package names file (the original package names index)

def pkgnames(prefix=''):
    if not prefix:
        matching_packages = get_all_package_names()
    else:
        matching_packages = get_package_names(prefix=prefix)

    response = '\n'.join(matching_packages)
    print(response)

And results from the benchmark:

-------------------------------------------------------------------------------------------- benchmark 'pkgnames': 4 tests --------------------------------------------------------------------------------------------
Name (time in us)                      Min                    Max                   Mean                StdDev                 Median                 IQR            Outliers         OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0]            121.5001 (1.0)         168.2369 (1.0)         127.9703 (1.0)          6.5916 (1.0)         125.5025 (1.0)        4.6890 (1.0)       267;253  7,814.3134 (1.0)        2342           1
test_pkgnames_speed[ipy]          147.9900 (1.22)        319.0329 (1.90)        159.2223 (1.24)        13.8756 (2.11)        155.8540 (1.24)       5.0691 (1.08)      172;270  6,280.5287 (0.80)       3287           1
test_pkgnames_speed[i]          1,312.9509 (10.81)     2,971.6729 (17.66)     1,362.7967 (10.65)       90.5921 (13.74)     1,341.6645 (10.69)     20.2591 (4.32)        49;86    733.7851 (0.09)        690           1
test_pkgnames_speed[<all>]     17,952.0501 (147.75)   24,011.7370 (142.73)   20,125.1347 (157.26)   1,040.7176 (157.89)   19,827.9684 (157.99)   320.2048 (68.29)        9;11     49.6891 (0.01)         50           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About 20ms on average (18ms at best), not bad!

All packages optimization

I realized that for empty prefix / all-packages response, I was still splitting all the lines in the raw index, and rejoining them when emitting the response. I changed it to simply parrot the raw index file — read the file, then print it out.

----------------------------------------------------------------------------------------- benchmark 'pkgnames': 4 tests -----------------------------------------------------------------------------------------
Name (time in us)                     Min                   Max                  Mean              StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_pkgnames_speed[0]           123.2328 (1.0)        262.3170 (1.0)        133.0901 (1.0)       15.0659 (1.0)        129.9269 (1.0)        4.0630 (1.0)        80;175  7,513.7051 (1.0)        2289           1
test_pkgnames_speed[ipy]         150.7460 (1.22)       355.0630 (1.35)       164.6641 (1.24)      20.6171 (1.37)       159.6075 (1.23)       4.9237 (1.21)      119;213  6,072.9679 (0.81)       2132           1
test_pkgnames_speed[i]         1,390.0651 (11.28)    3,571.6922 (13.62)    1,485.3559 (11.16)    161.0224 (10.69)    1,461.4101 (11.25)     26.2756 (6.47)        14;34    673.2393 (0.09)        636           1
test_pkgnames_speed[<all>]     2,341.6900 (19.00)    3,823.8191 (14.58)    2,697.3013 (20.27)    194.8775 (12.94)    2,642.0900 (20.34)    220.8641 (54.36)        47;9    370.7409 (0.05)        263           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Boom! Under 3ms on average :) (Well, 4ms max)
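A rough sketch of what that empty-prefix shortcut could look like. The signature here (an explicit raw_index_path, a lookup callable standing in for the trie query, and an out stream) is an assumption made to keep the example self-contained, not the PR's actual code:

```python
import io
import os
import tempfile


def pkgnames(prefix, raw_index_path, lookup, out):
    """Write matching package names to `out`, one per line."""
    if not prefix:
        # Empty prefix: every name matches, so copy the raw index through
        # verbatim instead of splitting it into lines and rejoining them.
        with open(raw_index_path) as f:
            out.write(f.read())
    else:
        out.write('\n'.join(lookup(prefix)))
    out.write('\n')


# Hypothetical sample data, written the same way as the raw index
# (names joined by newlines, no trailing newline).
names = ['idna', 'ipykernel', 'ipython', 'requests']
raw_index_path = os.path.join(tempfile.mkdtemp(), 'all-packages.txt')
with open(raw_index_path, 'w') as f:
    f.write('\n'.join(names))


def lookup(prefix):
    # Stand-in for the trie query; a plain scan keeps the sketch stdlib-only.
    return [name for name in names if name.startswith(prefix)]


everything = io.StringIO()
pkgnames('', raw_index_path, lookup, everything)

filtered = io.StringIO()
pkgnames('ipy', raw_index_path, lookup, filtered)
```

The point of the shortcut is that the raw index file is already the finished response, so the fastest thing to do with it is nothing at all.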

Summary

So, with a few optimizations, mean times improved over the base run by roughly 30x for the 'i' prefix, about 180x for all packages, and about 200x for 'ipy'. Success :)
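For reference, the speedups can be recomputed from the Mean columns of the first (base) and last (trie plus raw-index parroting) benchmark tables. The dictionary names are just for this sketch:

```python
# Mean times copied from the base run (milliseconds)
base_ms = {'ipy': 32.5978, 'i': 43.3055, '<all>': 487.4714}

# Mean times copied from the final run (microseconds)
final_us = {'ipy': 164.6641, 'i': 1485.3559, '<all>': 2697.3013}

# Convert the base times to microseconds, then divide
speedup = {key: base_ms[key] * 1000 / final_us[key] for key in base_ms}

for key, factor in speedup.items():
    print(f"{key!r}: {factor:.0f}x faster")
```

That works out to roughly 29x for 'i', 181x for all packages, and 198x for 'ipy'.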

@brunobeltran
Owner

Man, this is great, thanks for the contrib! This is a good excuse to clear out the PR backlog here. Sorry ahead of time if it takes a couple of days to get back, currently on some paper deadlines.

@theY4Kman
Author

theY4Kman commented May 17, 2020

Hell, thank you for building this thing; it scratched an itch I've had for a long time: getting so spoiled with apt install <tab><tab> and wanting it for pip.

Oh, yeah, I also removed all-packages.txt from data_files, because it was being included in the wheel with your full home directory path :P

I was intrigued about including a snapshot of the package index for immediate use after installation. Was that your intention? Would you like to keep that functionality? Maybe I could include it as an extra, for use with pip install pip-cache[full] or something

Comment thread pip_cache/__init__.py
Comment on lines +19 to +28
def get_pip_cache_data_dir():
    return os.path.join(get_xdg_data_dir(), 'pip-cache')


def get_raw_index_filename():
    return os.path.join(get_pip_cache_data_dir(), 'all-packages.txt')


def get_index_filename():
    return os.path.join(get_pip_cache_data_dir(), 'all-packages.marisa')
Author

FYI: I changed these to functions, instead of global constants, so I could easily patch them out during testing
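A small sketch of why functions are easier to patch than constants. The pip_cache namespace below is a hypothetical miniature built with SimpleNamespace, not the real module, and the fake directory is made up:

```python
import os
import types
from unittest import mock

# Hypothetical miniature of the module: a namespace holding two path helpers.
pip_cache = types.SimpleNamespace()


def get_pip_cache_data_dir():
    return os.path.join(os.path.expanduser('~'), '.local', 'share', 'pip-cache')


def get_raw_index_filename():
    # Resolved on every call, so a patched data dir is picked up here too.
    return os.path.join(pip_cache.get_pip_cache_data_dir(), 'all-packages.txt')


pip_cache.get_pip_cache_data_dir = get_pip_cache_data_dir
pip_cache.get_raw_index_filename = get_raw_index_filename

# A constant, by contrast, is computed once and never sees the patch.
RAW_INDEX_FILENAME = get_raw_index_filename()

fake_dir = os.path.join('tmp', 'fake-pip-cache')
with mock.patch.object(pip_cache, 'get_pip_cache_data_dir', return_value=fake_dir):
    patched = pip_cache.get_raw_index_filename()
    constant_during_patch = RAW_INDEX_FILENAME

print(patched)
print(constant_during_patch)
```

Patching the one directory function redirects every path derived from it, while the constant still points at the real home directory.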

Comment thread pip_cache/__init__.py
if not os.path.isdir(pip_cache_data_dir):
    os.mkdir(pip_cache_data_dir)

print('Connecting to PyPi...', end='', flush=True)
Author

FYA: there's a neat little package, manrajgrover/halo, which adds a little spinner in the console and can be used as a context manager

with Halo('Connecting to PyPi'):
    client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')

with Halo('Downloading package names'):
    packages = client.list_packages()

with Halo('Writing packages to cache'):
    with open(raw_index_filename, 'w') as f:
        f.write('\n'.join(packages))

with Halo('Indexing packages'):
    trie = marisa_trie.Trie(packages)
    trie.save(index_filename)

[Screen recording: Peek 2020-05-17 17-20]

I like to save the text of each thing, though

[Screen recording: Peek 2020-05-17 17-21]

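class task(Halo):
    def __exit__(self, type, value, traceback):
        # Keep the line on screen: a check mark on success, a cross if the block raised
        if type is None:
            self.succeed()
        else:
            self.fail()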

with task('Connecting to PyPi'):
    client = xmlrpclib.ServerProxy('https://pypi.python.org/pypi')

with task('Downloading package names'):
    packages = client.list_packages()

with task('Writing packages to cache'):
    with open(raw_index_filename, 'w') as f:
        f.write('\n'.join(packages))

with task('Indexing packages'):
    trie = marisa_trie.Trie(packages)
    trie.save(index_filename)

Comment thread tests/conftest.py

@pytest.fixture(scope='session')
def pip_cache_dir(request):
    path = request.config.cache.makedir('pip-cache-data')
Author

This'll create a directory under .pytest_cache/d/, which'll remain in between test runs. To clear the cache, py.test --cache-clear can be used; though, deleting .pytest_cache would prolly work just as well :P

Comment thread tests/conftest.py

if not request.config.cache.get(CACHE_KEY, None):
    pip_cache.update_package_list()
    request.config.cache.set(CACHE_KEY, True)
Author

This'll get stored as .pytest_cache/v/pip-cache/is-cache-built, and I think it's a JSON file
