Featured on Data.World website.
What are classification models?
A classification model draws a conclusion from observed values. Given one or more inputs (features), a classification model tries to predict the value of one or more outcomes. Outcomes are labels that can be applied to a dataset. Spam filtering is a classic example: an email is labeled either ‘Spam’ or ‘Not Spam’ based on its text, sender, or IP address.
There are two broad approaches to machine learning: supervised and unsupervised. In a supervised model, a labeled training dataset is fed into the classification algorithm (a general rule of thumb is 80% training data and 20% test data). That lets the model learn what, for example, an “authorized” transaction looks like. The model is then run against the test data to determine whether a transaction is “fraudulent”. This type of learning falls under “Classification”.
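The 80/20 split mentioned above can be sketched in a few lines of plain Python (the function name and the toy data are my own, for illustration only):

```python
import random

def train_test_split(rows, train_frac=0.8, seed=42):
    """Shuffle the rows and cut them into train/test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# Toy example: 100 labeled transactions -> 80 for training, 20 for testing
data = [{"id": i, "label": "authorized"} for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

In practice, most libraries ship a ready-made splitter, but the idea is exactly this: shuffle, then cut at the chosen fraction.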
Unsupervised models are fed an unlabeled dataset and look for clusters of data points. They can be used to detect patterns or identify outliers within a dataset. A typical use case is finding similar images. Unsupervised models can also be used to find “fraudulent” transactions by looking for anomalies within a dataset. This type of learning falls under “Clustering.”
There are a number of classification models, including logistic regression, decision tree, random forest, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
Setting the Scene
The question I tried answering:
Based on NBA rookie stats, which players have a career longevity of 5 years or more?
This question is posted on the data.world website, and here is the link. Luckily, the data is also provided, and you can access it from the same link. My next question was: how can I aggregate and use this data to get the best probability for each player?
I thought a sensible solution would be feeding as many ‘features’ as possible into a linear classifier: logistic regression. A linear classifier uses the training data to learn a weight (coefficient) for each feature, then makes a prediction (both class and probability) for each player. After finding the logistic regression weights and predictions, I can compute the accuracy of the model, inspect the coefficients, and make them talk to me.
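The mechanics of a linear classifier like logistic regression are simple enough to sketch by hand: compute a weighted sum of the features, then squash it through the sigmoid to get a probability. The weights below are made-up illustrations, not the fitted model's coefficients:

```python
import math

def sigmoid(score):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_proba(features, weights, intercept):
    """Linear classifier: weighted sum of features pushed through the sigmoid."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

def predict_class(features, weights, intercept):
    """Class is 1 (5-year-or-longer career) when the probability reaches 0.5."""
    return 1 if predict_proba(features, weights, intercept) >= 0.5 else 0

# Illustrative weights only -- three hypothetical features
weights = [0.4, -0.2, 0.1]
p = predict_proba([10.0, 2.0, 5.0], weights, intercept=-1.0)  # ~0.96
```

A positive weight means the feature pushes the player toward the positive label; a negative weight pushes toward the negative label.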
For this ML model, instead of using Pandas, I used SFrames. There is no particular reason; I just wanted to work with SFrames, and this dataset was perfect for it.
To train the classifier with logistic regression, I used the 19 features given in the dataset, and my target was the ‘Target_5yrs’ column, which represented the outcome: 1 (positive label) if career length is 5 years or more, or 0 (negative label) if career length is less than 5 years.
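The SFrame-based classifier handles the fitting for you, but under the hood the training loop looks roughly like the toy gradient-ascent sketch below (my own minimal illustration on made-up one-feature data, not the actual 19-feature NBA model):

```python
import math

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Fit logistic regression by stochastic gradient ascent on the
    log-likelihood. Returns (intercept, weights)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-score))
            err = yi - p  # gradient of the log-likelihood w.r.t. the score
            b += lr * err
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
    return b, w

# Tiny separable toy data: one feature, label 1 when the feature is large
X = [[1.0], [1.5], [3.0], [4.0]]
y = [0, 0, 1, 1]
b, w = train_logistic(X, y)
```

After fitting, the learned intercept plus the per-feature weights are exactly the "coefficients" extracted in the next step: with 19 features, that gives 19 weights plus 1 intercept, i.e. 20 coefficients.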
Once the model was fitted, I extracted the coefficients. There were 20 coefficients in the model: 11 positive and 9 negative.
Using Logistic Regression
The part you’ve all been waiting for (drum roll). I explored this in the context of 3 examples in the test dataset, which I refer to as sample_test_data. Here is the result:
I checked the predictions; as you might recall, ‘1’ is positive and ‘0’ is negative.
Phew! It is working. How can I tell? Look at the stats on the dataset above. Although Tony Smith has an outcome of 5 years or more, the model gives Gerald Paddio a higher probability based on his features.
I then examined the whole test dataset (test_data) to form predictions, finding the 20 players in the entire test_data with the highest probability of having a 5-year career longevity. Here are the names with probability predictions:
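Ranking players by predicted probability is a one-liner once every player has a probability attached. A small sketch with hypothetical (player, probability) pairs — the names and numbers are placeholders, not the actual model output:

```python
# Hypothetical (player, probability) pairs for illustration
predictions = [
    ("Player A", 0.91),
    ("Player B", 0.34),
    ("Player C", 0.78),
    ("Player D", 0.55),
]

def top_k(preds, k):
    """Players sorted by predicted probability, highest first."""
    return sorted(preds, key=lambda pair: pair[1], reverse=True)[:k]

best = top_k(predictions, 2)                 # head of the ranking
ranked = top_k(predictions, len(predictions))
tail_enders = ranked[-2:]                    # lowest-probability players
```

The same sorted list gives both ends: slice the front for the most likely 5-year careers, the back for the tail-enders.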
Well, you might be thinking, what about the tail-enders? Just because you asked, here:
Visualization of the probability predictions, because Donald Knuth, in his legendary 1974 essay Computer Programming as an Art, states:
Science is knowledge which we understand so well that we can teach it to a computer. Everything else is art.
The accuracy of the Classifier:
I computed accuracy using the trained model by counting the number of data points where the predicted class labels match the ground-truth labels, then dividing the total number of correct predictions by the total number of data points in the dataset. The accuracy of the model on the test_data was 0.7.
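The accuracy computation described above reduces to a few lines (the toy label lists below are my own, chosen so the example lands on 0.7 like the real model did):

```python
def accuracy(predicted, actual):
    """Fraction of data points where the prediction matches the ground truth."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Toy example: 7 of these 10 predictions are correct -> accuracy 0.7
predicted = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
actual    = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
acc = accuracy(predicted, actual)  # 0.7
```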
Baseline: Majority class prediction
It is quite common to use the majority class classifier as the baseline (or reference) model for comparison with your classifier model. The majority class classifier predicts the majority class for all data points. At the very least, your model should comfortably beat the majority class classifier; otherwise, the model is (usually) pointless.
Guess what? The model was better than the majority class classifier, whose accuracy was 0.62.
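The baseline is cheap to compute: the majority classifier's accuracy is just the frequency of the most common label. A minimal sketch, with toy labels chosen to reproduce the 0.62 figure:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy labels: 62 positives out of 100, so the majority classifier scores 0.62
labels = [1] * 62 + [0] * 38
base = majority_baseline_accuracy(labels)  # 0.62
```

So 0.7 versus 0.62: the model's edge over the baseline is what tells you the 19 features are actually carrying signal.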
That’s all folks.
I’ve uploaded an accompanying script on GitHub that provides the whole recipe of how everything was computed.
Please don’t hesitate to contact me if you have any questions or comments!