IndexError in pyspark_kmodes

I'm getting an IndexError on line 317:
random_element = random.choice(clusters[biggest_cluster].members)
I have a large dataframe (10000+ rows and 15+ columns), and I first tried this with k=2. I debugged the program and found that the error occurs because two elements of cluster_sizes are 0, but I can't work out why that happens.
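
A minimal sketch of this failure mode, not taken from the pyspark_kmodes source: `random.choice` raises IndexError when it is given an empty sequence, which is what happens if the cluster selected as the biggest has no members. The `Cluster` class and the `pick_from_biggest` helper are hypothetical names used only for illustration; the real library's objects may look different.

```python
import random


class Cluster:
    """Hypothetical stand-in for a cluster object that keeps a `members` list."""

    def __init__(self, members):
        self.members = list(members)


def pick_from_biggest(clusters):
    """Pick a random member of the largest cluster, failing loudly if it is empty."""
    cluster_sizes = [len(c.members) for c in clusters]
    # Index of the cluster with the most members.
    biggest_cluster = max(range(len(clusters)), key=cluster_sizes.__getitem__)
    if cluster_sizes[biggest_cluster] == 0:
        # Without this check, random.choice([]) would raise IndexError here.
        raise ValueError("all clusters are empty: cluster_sizes=%r" % cluster_sizes)
    return random.choice(clusters[biggest_cluster].members)
```

A guard like this does not explain why the sizes are 0, but it reports the state of cluster_sizes at the point of failure, which may help narrow down where the members are lost.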

If I limit my dataframe to, say, 100 rows, this error goes away, but after 3 iterations of the algorithm I get a different error: 'More clusters than data points?'

Any ideas on how to solve this?



IndexError in pyspark_kmodes #6
