Notes
Just a gathering place for notes.
- Removing dummy Qs from the corpus: Dummy Qs are present in every chapter, and they make up a large portion of the misclassified items (a small filtering sketch follows).
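A minimal sketch of that filtering step, assuming each question record is a dict with a hypothetical `is_dummy` flag (the real corpus may mark dummies differently, e.g. by question text):

```python
def drop_dummy_questions(corpus):
    # Keep only real questions; `is_dummy` is an assumed flag name,
    # not necessarily what the actual corpus uses.
    return [q for q in corpus if not q.get("is_dummy")]
```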
- Finding words that are not so useful. This was the process (see the sketch after this list):
  - Find the top-occurring words in each chapter
  - Find the words that occur most often across those per-chapter top lists
  - Those are the words that are most confused among chapters
  - It turns out that a lot of these words are LaTeX words.
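A rough sketch of that process, assuming a hypothetical dict `questions_by_chapter` mapping chapter names to lists of question strings (the names and the top-n cutoff are illustrative):

```python
from collections import Counter

def top_words(texts, n=50):
    """Return the n most common lowercase tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(n)]

def confused_words(questions_by_chapter, n=50):
    """Words that show up in the top-n lists of more than one chapter."""
    overlap = Counter()
    for texts in questions_by_chapter.values():
        overlap.update(set(top_words(texts, n)))
    # Words in several chapters' top lists are the least discriminative;
    # in this corpus many of them turn out to be LaTeX tokens.
    return [w for w, c in overlap.most_common() if c > 1]
```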
- Trying to classify without ML: I tried to do classification directly based on the words of a question. We have the table above with the most commonly occurring words per chapter, so we just predict the chapter with the most matching words (rough sketch below).
  - Unfortunately, this does not work so well. But it was worth trying, eh?
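A rough sketch of that no-ML classifier, assuming a hypothetical `top_words_by_chapter` dict mapping each chapter name to a set of its most common words (e.g. built from the table above):

```python
def predict_chapter(question, top_words_by_chapter):
    # Count how many of the question's tokens appear in each chapter's
    # top-word set and return the chapter with the largest overlap.
    tokens = set(question.lower().split())
    return max(top_words_by_chapter,
               key=lambda chapter: len(tokens & top_words_by_chapter[chapter]))
```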
- SVM and kNN don't perform as well as Naive Bayes either, so Naive Bayes it is.
- The Naive Bayes implementation in R doesn't support multinomial distributions, which is what you want for word-count features. The Python sklearn one does, so can we use that?
- We can also use Bernoulli Naive Bayes (word presence/absence); see the sklearn sketch below.
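A minimal sklearn sketch of both variants, with tiny placeholder data standing in for the real corpus (the actual inputs would be the repository's question texts and chapter labels):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import make_pipeline

# Placeholder data; the real inputs are question strings and chapter labels.
texts = ["find the determinant of this matrix", "integrate x squared from 0 to 1"]
labels = ["linear-algebra", "calculus"]

# Multinomial NB models word counts...
multinomial_model = make_pipeline(CountVectorizer(), MultinomialNB())
multinomial_model.fit(texts, labels)

# ...Bernoulli NB models word presence/absence.
bernoulli_model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
bernoulli_model.fit(texts, labels)

print(multinomial_model.predict(["what is the integral of x cubed"]))
```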
- Where do we integrate?
  - At the point of question entry
  - Run the classifier on the repository questions
  - API? Pass the text as a parameter and get the classification back as the result (see the sketch after this list)
- For integrating with question entry:
  - How fast does the server need to respond?
  - Can we use the same API as above?
- Separate place to train and run the models -- so just use it as an API?
- Beginning of a new service for all of ML?
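One possible shape for that API, sketched with Flask; the endpoint name, port, and model file path are all assumptions, and the model is just a fitted pipeline like the Naive Bayes sketch above:

```python
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
# Hypothetical path to a pipeline trained and saved elsewhere.
model = joblib.load("chapter_classifier.pkl")

@app.route("/classify", methods=["POST"])
def classify():
    # Caller POSTs JSON like {"text": "..."} and gets back the predicted chapter.
    question_text = request.json["text"]
    chapter = model.predict([question_text])[0]
    return jsonify({"chapter": chapter})

if __name__ == "__main__":
    app.run(port=5000)
```

This keeps training and serving separate: the question-entry page (or a batch job over the repository questions) just calls the endpoint and never needs the model code locally.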