Inconsistent keyword list in Parser.getKeywords() causes inconsistent scoring

I expect that the issues and PRs I've posted here are going to my own fork and no further, but if anyone out there is reading, your feedback is welcome.

In `Parser.getKeywords()`:
```py
# ...
uniqueWords = list(set(words))

keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords] 
keywords = sorted(keywords, key=lambda x: -x['count'])

# ... returns: ...
keywords[0]  = {'word': 'foo', 'count': 100}  # rank: 1st (most common keyword)
keywords[1]  = {'word': 'eggs', 'count': 25}  # rank: 9th
keywords[2]  = {'word': 'bacon', 'count': 32} # rank: 8th
# ...
keywords[10] = {'word': 'spam', 'count': 25}  # rank: 9th
keywords[11] = {'word': 'bar', 'count': 19}   # rank: 13th
keywords[12] = {'word': 'ham', 'count': 25}   # rank: 9th
# ...
```

`set()` does not return an ordered enumerable, so `uniqueWords` is an unordered list. Iterating it to build `keywords` means this list is also unordered, so the return order of `sorted()` on the count key alone is inconsistent. 

Down in `Summarizer.summarize()` it means sentence scores are also inconsistent:
```py
# ... 
(keywords, wordCount) = self.parser.getKeywords(text)

topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)

# ... iterating sentences ...
sbsFeature = self.sbs(words, topKeywords, keywordList)
dbsFeature = self.dbs(words, topKeywords, keywordList)

# ... calculate sentence score based on these features ...
# ... 
```
For consistency, when the `summarize` method calls `getTopKeywords()`, should it still pass a fixed list of ten keywords that's been sorted with more tiebreakers…

```py
# replace: keywords = sorted(keywords, key=lambda x: -x['count'])
keywords = sorted(keywords, key=lambda x: (-x['count'], -len(x['word'], x['word']))
```
…every keyword that *ranks* 10th or better, i.e. this comprehension…
```py
topKeywordSlice = [kw for kw in keywords if kw['count'] >= keywords[9]['count']]
```
… or maybe both?

Computationally, the advanced sort seems unnecessarily expensive, but I don't know if there's a rationale for exactly ten top keywords. What's the best way to make this work?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent keyword list in Parser.getKeywords() causes inconsistent scoring #11

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Inconsistent keyword list in Parser.getKeywords() causes inconsistent scoring #11

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions