31 commits
eb5cdd9
Added changelog with all changes since version 0.3.2
Apr 11, 2021
5cb0066
added StringGrouper attribute function _get_true_max_n_matches() and
ParticularMiner Apr 12, 2021
f9f1868
modified ing-bank's sparse_dot_topn to get n_max_matches true value and
ParticularMiner Apr 14, 2021
68a51a1
made significant performance enhancements:
ParticularMiner Apr 14, 2021
798daf7
updated setup.py
ParticularMiner Apr 14, 2021
5b2ce38
attempted to remove n_max_matches restriction altogether
ParticularMiner Apr 17, 2021
d6f3127
removed the restriction n_max_matches put on memory allocation
ParticularMiner Apr 17, 2021
5a12efb
defragmented temporary memory allocations in sparse_dot_topn routines
ParticularMiner Apr 23, 2021
30712de
made ntop always flexible (i.e., not only when ntop >= B.shape[1])
ParticularMiner Apr 24, 2021
4b86ab1
removed code-redundancies in sparse_dot_topn
ParticularMiner Apr 26, 2021
2cf60a0
made README.md "pypi.org-friendly"
ParticularMiner Apr 28, 2021
c96ec50
rearranged code in string_grouper.py
ParticularMiner Apr 28, 2021
0f0b2c3
corrected optional_kwargs for awesome_cossim_dotn in _build_matches()
ParticularMiner Apr 29, 2021
57c4122
added scouting function that determines the amount of memory needed for
ParticularMiner Apr 29, 2021
6b7ee4b
introduced heuristic to reduce over-estimate of memory allocation for
ParticularMiner May 2, 2021
0b3bc8a
tried vector reserve
ParticularMiner May 3, 2021
80d388b
fixed bug related to single-valued input Series
ParticularMiner May 4, 2021
2c6b102
fixed bug related to single-valued input Series
ParticularMiner May 4, 2021
1b8ddec
modified GitHub workflow action script test.yml
ParticularMiner May 4, 2021
75fdf3d
renamed sparse_dot_topn sub-package to string_grouper_topn to avoid
ParticularMiner May 5, 2021
7b6289a
Merge branch 'master' into free
ParticularMiner May 5, 2021
29dcb42
added unittest for get_groups() with single-valued input Series
ParticularMiner May 5, 2021
6f6ff50
fixed other squeeze() bugs
ParticularMiner May 8, 2021
90a6fd1
made PEP8-conforming modifications
ParticularMiner May 11, 2021
eaee025
Merge branch 'master' into free
ParticularMiner Jun 10, 2021
32d7136
removed string_grouper_topn submodule
ParticularMiner Jun 10, 2021
bce1ce7
restored dependency on upgraded package sparse_dot_topn
ParticularMiner Jun 10, 2021
3fd7329
updated GitHub workflow action test script
ParticularMiner Jun 10, 2021
7742fe4
updated dependency on latest version of sparse_dot_topn (v 0.3.1)
ParticularMiner Jun 10, 2021
36f7316
updated CHANGELOG.md
ParticularMiner Jun 11, 2021
64e4f85
added new keyword argument tfidf_matrix_dtype (the datatype for the
ParticularMiner Jun 11, 2021
6 changes: 4 additions & 2 deletions .github/workflows/test.yml
@@ -21,8 +21,10 @@ jobs:
       with:
         python-version: ${{ matrix.python-version }}

-    - name: Install package
-      run: pip install .
+    - name: Install dev-package
+      run: |
+        python -m pip install --upgrade pip
+        pip install -v -e .

     - name: Run tests
       run: python -m unittest
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.4.1?] - 2021-06-11

### Added

* Added new keyword argument **`tfidf_matrix_dtype`** (the datatype for the tf-idf values of the matrix components). Allowed values are `numpy.float32` and `numpy.float64` (used by the required external package `sparse_dot_topn` version 0.3.1). Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit less numerical precision than `numpy.float64`.)

### Changed

* Changed dependency on `sparse_dot_topn` from version 0.2.9 to 0.3.1
* Changed the default datatype for cosine similarities from numpy.float64 to numpy.float32 to boost computational performance at the expense of numerical precision.
* Changed the default value of the keyword argument `max_n_matches` from 20 to the number of strings in `duplicates` (or `master`, if `duplicates` is not given).
* Changed warning issued when the condition \[`include_zeroes=True` and `min_similarity` ≤ 0 and `max_n_matches` is not sufficiently high to capture all nonzero-similarity-matches\] is met to an exception.

### Removed

* Removed the keyword argument `suppress_warning`
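
The `tfidf_matrix_dtype` trade-off noted in the 0.4.1 entry above (smaller footprint, less precision) can be illustrated with a rough, dependency-free sketch. It uses the standard-library `array` module as a stand-in for `numpy.float32` and `numpy.float64`, which is an assumption made only to keep the example self-contained. The sample values are hypothetical and not taken from `string_grouper`.

```python
from array import array

# Hypothetical similarity-like values; not produced by string_grouper.
values = [0.1, 0.8, 1.0 / 3.0]

# 'f' is a C float (32-bit on common platforms); 'd' is a C double (64-bit).
single = array("f", values)
double = array("d", values)

# Storage needed for the same three numbers at each precision.
bytes_single = single.itemsize * len(single)
bytes_double = double.itemsize * len(double)

# Largest rounding error introduced by storing each value at that precision.
err_single = max(abs(s - v) for s, v in zip(single, values))
err_double = max(abs(d - v) for d, v in zip(double, values))

print(f"32-bit: {bytes_single} bytes, max rounding error {err_single:.2e}")
print(f"64-bit: {bytes_double} bytes, max rounding error {err_double:.2e}")
```

The 32-bit array takes half the space but stores each value with a small rounding error, while the 64-bit array reproduces the Python floats exactly.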

## [0.4.0] - 2021-04-11

### Added
6 changes: 3 additions & 3 deletions README.md
@@ -134,16 +134,16 @@ All functions are built using a class **`StringGrouper`**. This class can be use
All keyword arguments not mentioned in the function definitions above are used to update the default settings. The following optional arguments can be used:

* **`ngram_size`**: The amount of characters in each n-gram. Default is `3`.
+* **`tfidf_matrix_dtype`**: The datatype for the tf-idf values of the matrix components. Allowed values are `numpy.float32` and `numpy.float64`. Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit less numerical precision than `numpy.float64`.)
* **`regex`**: The regex string used to clean-up the input string. Default is `"[,-./]|\s"`.
-* **`max_n_matches`**: The maximum number of matches allowed per string in `master`. Default is `20`.
+* **`max_n_matches`**: The maximum number of matches allowed per string in `master`. Default is the number of strings in `duplicates` (or `master`, if `duplicates` is not given).
* **`min_similarity`**: The minimum cosine similarity for two strings to be considered a match.
Defaults to `0.8`
* **`number_of_processes`**: The number of processes used by the cosine similarity calculation. Defaults to
`number of cores on a machine - 1.`
* **`ignore_index`**: Determines whether indexes are ignored or not. If `False` (the default), index-columns will appear in the output, otherwise not. (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
* **`replace_na`**: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced or not by index-labels from `duplicates`. Defaults to `False`. (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
-* **`include_zeroes`**: When `min_similarity` ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md) for a demonstration.) **Warning:** Make sure the kwarg `max_n_matches` is sufficiently high to capture ***all*** nonzero-similarity-matches, otherwise some zero-similarity-matches returned will be false.
-* **`suppress_warning`**: when `min_similarity` ≤ 0 and `include_zeroes` is `True`, determines whether or not to suppress the message warning that `max_n_matches` may be too small. Defaults to `False`.
+* **`include_zeroes`**: When `min_similarity` ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md).) **Note:** If `include_zeroes` is `True` and the kwarg `max_n_matches` is set then it must be sufficiently high to capture ***all*** nonzero-similarity-matches, otherwise an error is raised and `string_grouper` suggests an alternative value for `max_n_matches`. To allow `string_grouper` to automatically use the appropriate value for `max_n_matches` then do not set this kwarg at all.
* **`group_rep`**: For function `group_similar_strings`, determines how group-representatives are chosen. Allowed values are `'centroid'` (the default) and `'first'`. See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation.
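
The `max_n_matches` default and the `include_zeroes` requirement listed above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the code in `string_grouper`. The helper names `default_max_n_matches` and `check_max_n_matches`, the dense `similarity_rows` input, and the error wording are all hypothetical.

```python
from typing import Optional, Sequence


def default_max_n_matches(master: Sequence[str],
                          duplicates: Optional[Sequence[str]] = None) -> int:
    """Hypothetical helper: number of strings in duplicates, else in master."""
    return len(duplicates) if duplicates is not None else len(master)


def check_max_n_matches(similarity_rows: Sequence[Sequence[float]],
                        max_n_matches: int) -> None:
    """Hypothetical helper: raise if a row holds more nonzero matches than allowed.

    Each row stands for one string in master. A truncated row would make a
    dropped nonzero match look like a zero-similarity match.
    """
    needed = max(
        (sum(1 for s in row if s != 0) for row in similarity_rows),
        default=0,
    )
    if max_n_matches < needed:
        raise ValueError(
            f"max_n_matches={max_n_matches} is too small; "
            f"try max_n_matches={needed}"
        )


# Made-up data: three strings, with the first two similar to each other.
master = ["foo", "food", "bar"]
rows = [[1.0, 0.7, 0.0], [0.7, 1.0, 0.0], [0.0, 0.0, 1.0]]

# The default is large enough by construction, so this check passes silently.
check_max_n_matches(rows, default_max_n_matches(master))
```

Leaving `max_n_matches` unset corresponds to the first helper. Setting it too low corresponds to the `ValueError` branch, which suggests a workable value.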

## Examples
2 changes: 1 addition & 1 deletion setup.py
@@ -25,6 +25,6 @@
, 'scipy'
, 'scikit-learn'
, 'numpy'
-, 'sparse_dot_topn>=0.2.6'
+, 'sparse_dot_topn>=0.3.1'
]
)