
Categorical encoding #11

@FAMILIAR-project

Description

Some options are tristate (n, y, m) and we need a strategy to encode their values.
As discussed, our current solution (0, 1, 2) has limitations for some learning algorithms: it imposes an artificial ordering and magnitude on values that are purely categorical.

We could try strategies like one-hot encoding:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
http://contrib.scikit-learn.org/categorical-encoding/index.html
or dummy variables ("dummification"):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
https://en.wikiversity.org/wiki/Dummy_variable_(statistics)

but there are some subtleties to think about:
https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn

In fact, we need to think about these subtleties for all kinds of algorithms: the right encoding can differ depending on whether we use linear regression or neural networks.
I like the reading of https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
which recommends using the `drop` parameter when linear regression is employed.
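A minimal sketch of why dropping a reference category matters for linear models (the classic dummy-variable trap):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single tristate option observed over four configurations.
X = np.array([["n"], ["y"], ["m"], ["y"]])

# Full one-hot: 3 columns that always sum to 1, hence perfectly collinear
# with a linear model's intercept (the "dummy variable trap").
full = OneHotEncoder().fit_transform(X).toarray()

# drop='first' removes one reference category per feature (k-1 columns),
# which breaks the collinearity for linear regression.
dropped = OneHotEncoder(drop="first").fit_transform(X).toarray()

print(full.shape, dropped.shape)
```

For tree-based or neural models the full k-column encoding is usually harmless, which is exactly why the choice should depend on the learning algorithm.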

Another appealing idea from @llesoil is to consider 'm' as similar to 'n' with respect to size (basically, 'm' does not affect the kernel size).
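That idea could be sketched as a preprocessing step before any encoding (column name hypothetical), collapsing the tristate into a binary feature:

```python
import pandas as pd

# Hypothetical option column with the usual tristate n/y/m values.
df = pd.DataFrame({"CONFIG_A": ["n", "y", "m", "y"]})

# Treat 'm' like 'n': the assumption is that modules are not linked into
# the kernel image, so they do not contribute to its size.
merged = df.replace({"m": "n"})

# The tristate then collapses to a single binary indicator per option,
# sidestepping the categorical-encoding question entirely for this target.
binary = (merged == "y").astype(int)
```

Whether the underlying assumption holds should of course be validated against measured kernel sizes before adopting it.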

Metadata

Labels: enhancement (New feature or request)
