All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added "guesstimate" as the default value for `n_blocks`. This guesses an optimal number of blocks based on empirical observation.
- matrix-blocking/splitting as a performance-enhancer (see README.md for details)
- new keyword arguments `force_symmetries` and `n_blocks` (see README.md for details)
- new dependency on packages `topn` and `sparse_dot_topn_for_blocks` to help with the matrix-blocking
- capability to reuse a previously initialized StringGrouper (that is, the corpus can now persist across high-level function calls like `match_strings()`; see README.md for details)
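
As a rough illustration of the matrix-blocking idea mentioned above, the following sketch splits two small row-major matrices into row blocks and assembles the product block by block. It is not the package's implementation: the helper names (`matmul_t`, `split_rows`, `blocked_matmul_t`) are hypothetical, plain Python lists stand in for sparse matrices, and treating `n_blocks` as a pair of integers is an assumption of this sketch.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul_t(a: Matrix, b: Matrix) -> Matrix:
    """Return a @ b.T, where a and b are lists of equal-length rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def split_rows(m: Matrix, n_blocks: int) -> List[Matrix]:
    """Split m into at most n_blocks contiguous row blocks of near-equal size."""
    size = max(1, -(-len(m) // n_blocks))  # ceiling division, never zero
    return [m[i:i + size] for i in range(0, len(m), size)]


def blocked_matmul_t(a: Matrix, b: Matrix, n_blocks: Tuple[int, int] = (1, 1)) -> Matrix:
    """Compute a @ b.T one (left block, right block) pair at a time.

    Only one small partial product is held at any moment, which is the
    memory-saving point of blocking; the assembled result is identical
    to the unblocked product.
    """
    result: Matrix = []
    for left in split_rows(a, n_blocks[0]):
        rows: Matrix = [[] for _ in left]
        for right in split_rows(b, n_blocks[1]):
            partial = matmul_t(left, right)
            # Append this block's columns to the rows being assembled.
            for row, extra in zip(rows, partial):
                row.extend(extra)
        result.extend(rows)
    return result


a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [[1.0, 2.0], [3.0, 4.0]]
blocked = blocked_matmul_t(a, b, n_blocks=(2, 2))
unblocked = matmul_t(a, b)
```

Here `blocked` and `unblocked` hold the same values; blocking changes only how much intermediate data exists at once.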
- Improved the performance of the function `match_most_similar`.
- The Series `duplicates` is now the left operand, while `master` is the right operand, in the underlying left-join operation that does the string-matching.
- Changed the default value of the keyword argument `max_n_matches` to the total number of strings in `master`. (`max_n_matches` is now defined as the maximum number of matches allowed per string in `duplicates`, or in `master` if `duplicates` is not given.)
- Added new keyword argument `tfidf_matrix_dtype` (the datatype for the tf-idf values of the matrix components). Allowed values are `numpy.float32` and `numpy.float64` (used by the required external package `sparse_dot_topn` version 0.3.1). Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint, albeit with less numerical precision than `numpy.float64`.)
- Changed dependency on `sparse_dot_topn` from version 0.2.9 to 0.3.1
- Changed the default datatype for cosine similarities from `numpy.float64` to `numpy.float32` to boost computational performance at the expense of numerical precision.
- Changed the default value of the keyword argument `max_n_matches` from 20 to the number of strings in `duplicates` (or `master`, if `duplicates` is not given).
- Changed the warning issued when the condition [`include_zeroes=True` and `min_similarity` ≤ 0 and `max_n_matches` is not sufficiently high to capture all nonzero-similarity-matches] is met to an exception.
- Removed the keyword argument `suppress_warning`
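
The trade-off between 32-bit and 64-bit floats noted in the entries above can be seen in a small sketch. To stay dependency-free it uses only the standard library (`struct` and `array`) rather than NumPy, so it illustrates the general IEEE 754 behaviour and not anything specific to this package; the helper `round_trip_float32` is a hypothetical name.

```python
import struct
from array import array


def round_trip_float32(x: float) -> float:
    """Store x as a 4-byte float and read it back as a Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1  # Python floats are 64-bit

# Precision: the 32-bit representation differs slightly from the 64-bit one.
as_float32 = round_trip_float32(value)
error = abs(as_float32 - value)

# Memory: 1000 values stored as 32-bit vs 64-bit floats.
n_values = 1000
bytes_float32 = array("f", [value] * n_values).itemsize * n_values
bytes_float64 = array("d", [value] * n_values).itemsize * n_values
```

On typical platforms the 32-bit array needs half the bytes of the 64-bit one, while `error` is small but nonzero.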
- Added group representative functionality; by default the centroid is used. From @ParticularMiner
- Added string_grouper_utils package with additional group-representative functionality:
  - new_group_rep_by_earliest_timestamp
  - new_group_rep_by_completeness
  - new_group_rep_by_highest_weight

  From @ParticularMiner
- Original indices are now added by default to output of `group_similar_strings`, `match_most_similar` and `match_strings`. From @ParticularMiner
- `compute_pairwise_similarities` function From @ParticularMiner
- Default group representative is now the centroid. Used to be the first string in the series belonging to a group. From @ParticularMiner
- Output of `match_most_similar` and `match_strings` is now a `pandas.DataFrame` object instead of a `pandas.Series` by default. From @ParticularMiner
- Fixed a bug which occurs when `min_similarity=0`. From @ParticularMiner
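
To make the difference between the two group-representative choices above concrete, here is a minimal sketch. It assumes that "centroid" means the group member with the highest total similarity to the members of its group, and it substitutes a toy character-trigram Jaccard similarity for the package's own similarity measure. The function names `ngrams`, `jaccard`, `centroid` and `first` are hypothetical and not part of the package.

```python
from typing import Callable, Sequence, Set


def ngrams(s: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of s (the whole string if shorter than n)."""
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}


def jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard index of the lower-cased trigram sets."""
    ga, gb = ngrams(a.lower()), ngrams(b.lower())
    return len(ga & gb) / len(ga | gb)


def centroid(group: Sequence[str], similarity: Callable[[str, str], float] = jaccard) -> str:
    """Pick the member whose summed similarity to all group members is largest."""
    return max(group, key=lambda s: sum(similarity(s, other) for other in group))


def first(group: Sequence[str]) -> str:
    """Pick the first member of the group."""
    return group[0]


names = ["ACME Corporation", "ACME Corp", "ACME Corp."]
rep_by_first = first(names)
rep_by_centroid = centroid(names)
```

With this toy data the two strategies disagree: `rep_by_first` is `"ACME Corporation"`, while `rep_by_centroid` is `"ACME Corp"`, the string most similar to the others in the group.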