Tree by manishamde · Pull Request #3 · amplab/MLI

manishamde · 2013-10-07T06:28:56Z

Decision Tree algorithm implemented on top of Spark RDD.

Key features:

Supports both classification and regression
Supports Gini, entropy, and variance as impurity measures for the information gain calculation
Supports calculating quantiles using a configurable fraction of the data
Accuracy verified by comparison with scikit-learn
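
For readers unfamiliar with the impurity measures listed above, here is a minimal single-machine sketch in Python of how Gini, entropy, and variance are typically plugged into an information gain calculation. It is not the code from this pull request (which runs on Spark RDDs); the function names `gini`, `entropy`, `variance`, and `information_gain` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (classification)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of a list of class labels (classification)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(values):
    """Population variance of a list of numeric targets (regression)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def information_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted
```

Passing the impurity function as an argument keeps the gain calculation identical for classification and regression; only the measure changes.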

Added basic documentation.

etrain · 2013-10-11T18:13:37Z

This looks awesome, thanks so much for the contribution Manish!

The big question I have is whether you looked at using MLTable and its API for your input? Were there big hurdles preventing that from being an option? We'd like to build ML algorithms around that API, so if there are things we need to change to support this case, let us know! Decision trees are fairly different from algorithms that work by evaluating some linear loss function and optimizing via gradient descent, so this is a good test for something different that may not fit our existing model.

manishamde · 2013-10-11T22:14:39Z

Thanks Evan. Looking forward to contributing more to the library.

Unfortunately, I haven't looked at MLTable because the code was written before the MLI library was open-sourced. As I mentioned in an earlier comment, I will look to make this code compatible with the MLI API and give feedback on any improvements.

The fixes should not take me too long. The non-linear data generator will be the trickiest part.

When do you think we can start testing performance once I am done?

manishamde · 2013-10-13T07:15:00Z

Evan,

I have just performed a major refactoring of the code based on your feedback without changing functionality.

A few tasks remain:

1. Making the interface to the tree algorithm consistent with the MLI API. I am wondering whether some "implicit" magic might make the algorithm work with both MLTable and RDD. However, I couldn't find any variance calculation logic for MLTable features. I am currently using the StatCounter class in the Spark library for that purpose.
2. Using the utils and deprecating the TreeUtils class. I don't think this is a showstopper for now; TreeUtils could be deprecated in the future. TreeRunner and TreeUtils are helper/example classes to get started.
3. Non-linear data generator. I will think about this problem a little more. We will need to come up with a configurable (features, size, etc.) data generator.
4. Using a CLI parser (scopt/sumac/etc.)
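
To illustrate the kind of variance calculation mentioned in the first task, here is a rough Python sketch of a mergeable running-statistics accumulator. It shows the general technique (Welford's update plus a pairwise merge of partial results), which is the sort of thing a distributed variance computation needs; it is not Spark's StatCounter implementation, and the class name `RunningStats` and its methods are hypothetical.

```python
class RunningStats:
    """Tracks count, mean, and sum of squared deviations in one pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        """Welford's online update for a single value."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def merge(self, other):
        """Combine with statistics computed on another partition."""
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.mean += delta * other.n / total
        # self.n still holds the pre-merge count here
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.n = total
        return self

    def variance(self):
        """Population variance; NaN when no values have been seen."""
        return self.m2 / self.n if self.n else float("nan")


# Usage: compute per-partition statistics, then merge them.
left = RunningStats().add(1.0).add(2.0)
right = RunningStats().add(3.0).add(4.0)
combined = left.merge(right)
```

Because `merge` is associative, partial results from different partitions can be combined in any order, which is what makes this approach suitable for a reduce step.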

I think Task 1 is the most important for now. Task 2 can be done in the future. While we work on Task 3, I am wondering whether we can use the same data that you might have used for testing logistic regression or SVM for performance testing. Task 4 is again one for the future.

manishamde · 2013-10-20T01:43:34Z

Some more changes.

Quantile and split calculations are now performed in memory (significant improvement in performance).
Calculated training error after tree building. I could not find a good place to store the training error in the tree model. I didn't want to hack it in right now but will do it more systematically while building ensembles.
Started testing on a dataset with ~0.5 million instances, but I am limited by the lack of access to a Spark cluster for measuring performance accurately.
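
As a rough illustration of quantile-based split candidates computed in memory from a sampled fraction of the data, here is a small Python sketch. It assumes a single numeric feature and simple Bernoulli sampling; the function name `candidate_splits` and its parameters (`num_bins`, `fraction`, `seed`) are hypothetical and do not come from this pull request.

```python
import random


def candidate_splits(values, num_bins, fraction=1.0, seed=0):
    """Return up to num_bins - 1 split thresholds taken at sample quantiles.

    Each value is kept with probability `fraction`; the sample is then
    sorted locally and thresholds are read off at evenly spaced ranks.
    """
    rng = random.Random(seed)
    sample = sorted(v for v in values if rng.random() < fraction)
    if not sample:
        return []
    n = len(sample)
    splits = []
    for i in range(1, num_bins):
        q = sample[min(n - 1, (i * n) // num_bins)]
        # Skip repeated thresholds so each candidate split is distinct.
        if not splits or q > splits[-1]:
            splits.append(q)
    return splits


# Usage: quartile boundaries of 0..99 using all of the data.
quartiles = candidate_splits(list(range(100)), num_bins=4)
```

A smaller `fraction` trades threshold precision for less data to sort, which is the usual motivation for making the sampling rate configurable.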

etrain · 2013-10-20T02:14:50Z

This is awesome, thanks Manish. We'll plan to test your code for scalability on a cluster this week.


manishamde · 2013-10-20T02:21:22Z

Sounds great Evan!

etrain · 2013-11-08T19:29:38Z

.gitignore

This is a good idea, thanks.

manishamde added 7 commits September 30, 2013 23:22

1eba6f3 migrating tree code to MLI
e2231ad added design file
95d45ab added accuracy score calculation
6754a40 added mean square error and test directory
49b8797 placeholder test file
376e241 basic documentation
35cebe6 Update README.md

Added basic documentation.

manishamde added 5 commits October 12, 2013 19:11

9ad0dd5 adding empty lines above comments
fc773de moved impurity classes to a different package
e29493c reorganized code
6047ed8 refactored decison nodes into separate package
2a1185b making variance serializable

manishamde added 4 commits October 13, 2013 00:21

b2447a8 fixing usage and example
68ad6c8 drastic speedup of split calculation
729a3b1 moving metrics to new class and changing root node depth to 1 from 0
d956cd3 calculating training error while building decision tree
422ed7d renaming error to accuracy :-)

manishamde wants to merge 19 commits into amplab:master from manishamde:tree
