Description
Some options are tristate (n, y, m), and we need a strategy to encode their values.
As discussed, our current encoding (0, 1, 2) imposes an ordinal relationship between the states that does not really exist, which is a limitation for some learning algorithms.
We could try strategies such as one-hot encoding:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
http://contrib.scikit-learn.org/categorical-encoding/index.html
or dummy variables ("dummification"):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
https://en.wikiversity.org/wiki/Dummy_variable_(statistics)
but there are some subtleties to think about:
https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn
In fact, we need to consider these subtleties for all kinds of algorithms: the appropriate encoding can differ depending on whether we use, say, linear regression or neural networks.
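To make the two routes concrete, here is a minimal sketch of both encodings on tristate values. The option names (`CONFIG_A`, `CONFIG_B`) and the sample rows are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample: two tristate options (names are invented).
df = pd.DataFrame({
    "CONFIG_A": ["y", "n", "m", "y"],
    "CONFIG_B": ["n", "n", "y", "m"],
})

# pandas route: one indicator column per (option, value) pair,
# e.g. CONFIG_A_n, CONFIG_A_y, CONFIG_A_m, ...
dummies = pd.get_dummies(df, columns=["CONFIG_A", "CONFIG_B"])

# scikit-learn route: same encoding, but as a fitted transformer that
# can be reused inside a Pipeline on unseen configurations.
# Declaring the categories explicitly guarantees all three states get
# a column even if one is absent from the training sample.
enc = OneHotEncoder(categories=[["n", "y", "m"]] * 2)
X = enc.fit_transform(df).toarray()
```

Both produce 2 options x 3 states = 6 columns; the scikit-learn version has the advantage of remembering the category set at fit time.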
I like the reading of https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
which recommends dropping one category per feature (the drop parameter of OneHotEncoder) when linear regression is employed.
Another appealing idea of @llesoil is to consider that 'm' is similar to 'n' with respect to size (basically, 'm' does not have an effect on kernel size).
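If we adopt that idea, the tristate columns collapse to binary before any encoding, which sidesteps the whole question for size prediction. A minimal sketch, assuming the options arrive as a DataFrame of 'n'/'y'/'m' strings (option names invented):

```python
import pandas as pd

def collapse_modules(df: pd.DataFrame) -> pd.DataFrame:
    """Treat 'm' like 'n': under the assumption above, a module is not
    built into the kernel image, so it does not contribute to its size."""
    return df.replace("m", "n")

df = pd.DataFrame({
    "CONFIG_A": ["y", "m", "n"],
    "CONFIG_B": ["m", "m", "y"],
})
binary = collapse_modules(df)
```

After this step each option is an ordinary binary feature ('n'/'y'), so it can be encoded as 0/1 directly, with no ordinality problem left.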