Pritam Sukumar edited this page May 2, 2018 · 2 revisions

Notes

Just a gathering place for notes.

  • Removing dummy Qs from the corpus: dummy Qs are present in every chapter, and they make up a large portion of the misclassified questions.
  • Finding words that are not so useful: This was the process:
    • Find the top occurring words in each chapter
    • Find the words that occur most often across those per-chapter top lists
    • Those are the words that are most confused among chapters
    • It turns out that a lot of these words are LaTeX words.
  • Trying to classify without ML: I tried to classify directly on the words of a question. We have the above table of the most commonly occurring words per chapter, so we just pick the chapter that has the most matching words, and that's our prediction.
    • Unfortunately, this does not work so well. But, it was worth trying, eh?
  • SVM and kNN don't perform as well as Naive Bayes either, so Naive Bayes it is.
  • The Naive Bayes implementation in R doesn't support multinomial distributions, which is what you need for count features such as word frequencies. The Python sklearn one does, so can we use that?
    • We can also use Bernoulli Naive Bayes
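
The word-finding process and the no-ML classifier from the notes above can be sketched roughly as follows. This is only an illustration: the function names, the tokenizer, and the toy corpus are all made up here, and the real corpus and word table may be built differently.

```python
from collections import Counter
import re

def tokenize(text):
    # Lowercase and keep alphabetic runs only, so a LaTeX command such as
    # \frac shows up as the plain token "frac".
    return re.findall(r"[a-z]+", text.lower())

def top_words_per_chapter(questions_by_chapter, n=5):
    """Map each chapter to the set of its n most frequent words."""
    top = {}
    for chapter, questions in questions_by_chapter.items():
        counts = Counter(w for q in questions for w in tokenize(q))
        top[chapter] = {w for w, _ in counts.most_common(n)}
    return top

def confused_words(top, min_chapters=2):
    """Words found in the top list of at least min_chapters chapters."""
    spread = Counter(w for words in top.values() for w in words)
    return {w for w, c in spread.items() if c >= min_chapters}

def predict_chapter(question, top, ignore=frozenset()):
    """Pick the chapter whose top words overlap the question the most.

    Ties go to whichever chapter comes first in `top`.
    """
    words = set(tokenize(question)) - ignore
    return max(top, key=lambda ch: len(words & (top[ch] - ignore)))

# Hypothetical toy corpus, not taken from the real question repository.
corpus = {
    "algebra": [
        "Solve \\frac{x}{2} = 4 for x",
        "Solve the equation x + 3 = 7",
    ],
    "geometry": [
        "Find the area of the triangle",
        "The triangle has base \\frac{1}{2}; find its angle",
    ],
}

top = top_words_per_chapter(corpus, n=50)
noisy = confused_words(top)  # words shared between chapters, e.g. LaTeX tokens
print(predict_chapter("What is the area of the triangle?", top, noisy))
```

Passing the confused words as `ignore` is one possible way to keep them out of the match count; whether that helps on the real data is untested here.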

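To make the multinomial point concrete, here is a dependency-free sketch of multinomial Naive Bayes over word counts with add-one smoothing. It is not the sklearn implementation (there the ready-made class is `sklearn.naive_bayes.MultinomialNB`); the class name `TinyMultinomialNB` and the toy training data are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # documents per chapter
        self.word_counts = defaultdict(Counter)  # chapter -> word -> count
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_docs.values())
        best, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical toy training set, not from the real corpus.
docs = [
    "solve the equation for x",
    "factor the polynomial",
    "area of the triangle",
    "angle of the circle",
]
labels = ["algebra", "algebra", "geometry", "geometry"]

model = TinyMultinomialNB().fit(docs, labels)
print(model.predict("solve the polynomial"))
```

Each repeated word in a question adds its log-probability again, which is what makes the model multinomial rather than Bernoulli.
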
Integration with server

  • Where do we integrate?
    • At the point where a question is entered
    • Run the classifier on the repository questions
    • API? Pass the text as a parameter and get the classification back as the result
  • For integrating with question entry
    • How fast does the server need to respond?
    • Can we use the same API as above?
  • Separate place to train and run the models -- so just use it as an API?
    • Beginning of a new service for all of ML?
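
One possible shape for the "pass the text, get the classification" API, sketched with only the Python standard library. Everything here is an assumption: the `/classify` path, the `text` parameter name, the JSON response shape, and the placeholder `classify` function, which stands in for whatever trained model ends up being used.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def classify(text):
    # Placeholder rule; a real service would call the trained model here.
    return "geometry" if "triangle" in text.lower() else "unknown"

def handle_query(query_string):
    """Turn a query string like 'text=...' into (status, JSON body)."""
    params = parse_qs(query_string)
    if "text" not in params:
        return 400, json.dumps({"error": "missing 'text' parameter"})
    text = params["text"][0]
    return 200, json.dumps({"chapter": classify(text)})

class ClassifyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/classify":
            self.send_error(404)
            return
        status, body = handle_query(url.query)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8000):
    # Blocks until interrupted; meant to be run as its own process.
    HTTPServer(("127.0.0.1", port), ClassifyHandler).serve_forever()
```

Keeping `handle_query` separate from the HTTP handler means the same function could serve both the question-entry path and a batch run over the repository questions, if one API turns out to be enough for both.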
