Tree #3

Open
manishamde wants to merge 19 commits into amplab:master from manishamde:tree

Conversation

@manishamde
Contributor

Decision Tree algorithm implemented on top of Spark RDD.

Key features:

  • Supports both classification and regression
  • Supports gini, entropy and variance for information gain calculation
  • Supports calculating quantiles using a configurable fraction of the data
  • Prediction accuracy verified by comparison with scikit-learn
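
For readers unfamiliar with the impurity measures listed above, here is a minimal sketch of how Gini, entropy, and variance can feed an information gain calculation. It is written in plain Python for brevity and is not the PR's Scala code; the function names (`gini`, `entropy`, `variance`, `information_gain`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (classification)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of a list of class labels (classification)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(values):
    """Population variance of a list of numeric targets (regression)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted
```

A split that separates `[0, 0, 1, 1]` into `[0, 0]` and `[1, 1]` removes all impurity, so the gain equals the parent's impurity under either classification measure.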

@etrain
Contributor

etrain commented Oct 11, 2013

This looks awesome, thanks so much for the contribution Manish!

The big question I have is whether you looked at using MLTable and its API for your input? Were there big hurdles preventing that from being an option? We'd like to build ML algorithms around that API, so if there are things we need to change to add this case, let us know! Decision trees are fairly different than algorithms that work by evaluating some linear loss function and optimizing via gradient descent, so this is a good test for something different that may not fit our existing model.

@manishamde
Contributor Author

Thanks Evan. Looking forward to contributing more to the library.

Unfortunately, I haven't looked at MLTable since the code was written prior to the open sourcing of the MLI library. As I mentioned in an earlier comment, I will look to make this code compatible with the MLI API and give feedback for any improvements.

The fixes should not take me too long. The non-linear data generator will be the trickiest part.

When do you think we can start testing performance once I am done?

@manishamde
Contributor Author

Evan,

I have just performed a major refactoring of the code based on your feedback without changing functionality.

A few tasks remain:

  1. Making the interface to the tree algorithm consistent with the MLI API. I am wondering whether some "implicit" magic might make the algorithm work with both MLTable and RDD. However, I couldn't find any variance calculation logic for MLTable features. I am currently using the StatCounter class in the Spark library for that purpose.
  2. Using the utils and deprecating TreeUtils class. I don't think this should be a show stopper for now and could be deprecated in the future. TreeRunner and TreeUtils are helper/example classes to get started.
  3. Non-linear data generator. I will think about this problem a little more. We will need to come up with a configurable (features, size, etc.) data generator.
  4. Using a CLI parser (scopt/sumac/etc.)
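
As background for the variance point in task 1: the idea behind a mergeable statistics counter, which is the role StatCounter plays in Spark, can be sketched as follows. This is a plain-Python illustration under my own assumptions, not Spark's implementation; the class name `RunningStats` and its methods are hypothetical. It keeps a count, a mean, and a sum of squared deviations, so partial results from different partitions can be combined without revisiting the data.

```python
class RunningStats:
    """Count, mean, and variance that can be updated and merged incrementally."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def add(self, x):
        """Fold in a single value (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other):
        """Fold in another counter, e.g. one computed on a different partition."""
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total
        return self

    def variance(self):
        """Population variance; NaN when no values have been seen."""
        return self.m2 / self.n if self.n else float("nan")
```

A feature-level variance for MLTable could in principle be built the same way: one counter per partition, merged at the driver.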

I think task 1 is the most important for now. Task 2 can be done in the future. I am wondering whether we can use the same data that you might have used for testing logistic regression or SVM for performance testing while we work on Task 3. Task 4 is again one for the future.
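
To make task 3 concrete, here is one possible shape for a configurable non-linear data generator, sketched in plain Python. Everything in it is an assumption for illustration: the function name `generate_xor_data`, its parameters, and the choice of an XOR-style labeling rule on the first two features (a pattern no single linear boundary separates, but a depth-2 tree can).

```python
import random


def generate_xor_data(num_points, num_features, noise=0.0, seed=42):
    """Return (label, features) pairs with an XOR pattern on features 0 and 1.

    Features are uniform in [-1, 1]. The label is 1 when the first two
    features have opposite signs, else 0. Each label is flipped with
    probability `noise`.
    """
    if num_features < 2:
        raise ValueError("num_features must be at least 2")
    rng = random.Random(seed)  # seeded so runs are reproducible
    data = []
    for _ in range(num_points):
        features = [rng.uniform(-1.0, 1.0) for _ in range(num_features)]
        label = 1 if (features[0] > 0) != (features[1] > 0) else 0
        if rng.random() < noise:
            label = 1 - label
        data.append((label, features))
    return data
```

Size, dimensionality, and label noise are the configurable parts here; other non-linear rules could be swapped in for the labeling line.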

@manishamde
Contributor Author

Some more changes.

  1. Quantile and split calculations are now performed in memory (significant improvement in performance)
  2. Calculated training error after tree building. Could not find the best place to store the training error in the tree model. Didn't want to hack it right now but will do it more systematically while building ensembles.
  3. Started testing on a dataset with ~0.5 million instances, but without access to a Spark cluster I cannot measure performance accurately.
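
For context on item 1, the general idea of computing quantile-based candidate splits in memory from a sampled fraction of the data can be sketched like this. It is a plain-Python illustration, not the PR's Scala code; the function name `candidate_thresholds` and its parameters are hypothetical.

```python
import random


def candidate_thresholds(values, num_bins, fraction=1.0, seed=0):
    """Pick up to num_bins - 1 split thresholds at evenly spaced quantiles.

    Each value is kept with probability `fraction`, the sample is sorted
    in memory, and thresholds are read off at equal-frequency positions.
    Duplicate thresholds are dropped.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < fraction]
    if not sample:
        return []
    sample.sort()
    thresholds = []
    for i in range(1, num_bins):
        idx = min((i * len(sample)) // num_bins, len(sample) - 1)
        t = sample[idx]
        if not thresholds or t != thresholds[-1]:
            thresholds.append(t)
    return thresholds
```

Sampling trades a little precision in the thresholds for a sort over far fewer values, which is what makes doing the calculation in memory practical.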

@etrain
Contributor

etrain commented Oct 20, 2013

This is awesome, thanks Manish - we'll plan to test your code for scalability on a cluster this week.


@manishamde
Contributor Author

Sounds great Evan!


This is a good idea, thanks.


3 participants