Description
Some options are tristate (n, y, m), and we need a strategy to encode their values.
As discussed, our current encoding (0, 1, 2) imposes an ordinal relationship between the states that does not really exist, which is a limitation for some learning algorithms.
We could try strategies such as one-hot encoding:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
http://contrib.scikit-learn.org/categorical-encoding/index.html
or dummy variables ("dummification"):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
https://en.wikiversity.org/wiki/Dummy_variable_(statistics)
but there are some subtleties to think about:
https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn
In fact, we need to consider these subtleties for all kinds of algorithms: the appropriate encoding can differ depending on whether we use, say, linear regression or neural networks.
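To make the two routes concrete, here is a minimal sketch of both encodings on tristate values. The option names (`CONFIG_A`, `CONFIG_B`) and the sample rows are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample: two tristate options (names are invented).
df = pd.DataFrame({
    "CONFIG_A": ["y", "n", "m", "y"],
    "CONFIG_B": ["n", "n", "y", "m"],
})

# pandas route: one indicator column per (option, value) pair,
# e.g. CONFIG_A_n, CONFIG_A_y, CONFIG_A_m, ...
dummies = pd.get_dummies(df, columns=["CONFIG_A", "CONFIG_B"])

# scikit-learn route: same encoding, but as a fitted transformer that
# can be reused inside a Pipeline on unseen configurations.
# Declaring the categories explicitly guarantees all three states get
# a column even if one is absent from the training sample.
enc = OneHotEncoder(categories=[["n", "y", "m"]] * 2)
X = enc.fit_transform(df).toarray()
```

Both produce 2 options x 3 states = 6 columns; the scikit-learn version has the advantage of remembering the category set at fit time.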
I like the reading of https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
which recommends dropping one category per feature (the drop parameter of OneHotEncoder) when linear regression is employed.
Another appealing idea of @llesoil is to consider that 'm' is similar to 'n' with respect to size (basically, 'm' does not have an effect on kernel size).
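If we adopt that idea, the tristate columns collapse to binary before any encoding, which sidesteps the whole question for size prediction. A minimal sketch, assuming the options arrive as a DataFrame of 'n'/'y'/'m' strings (option names invented):

```python
import pandas as pd

def collapse_modules(df: pd.DataFrame) -> pd.DataFrame:
    """Treat 'm' like 'n': under the assumption above, a module is not
    built into the kernel image, so it does not contribute to its size."""
    return df.replace("m", "n")

df = pd.DataFrame({
    "CONFIG_A": ["y", "m", "n"],
    "CONFIG_B": ["m", "m", "y"],
})
binary = collapse_modules(df)
```

After this step each option is an ordinary binary feature ('n'/'y'), so it can be encoded as 0/1 directly, with no ordinality problem left.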