Exploring supervised learning

Jose Pablo
5 min read · Jul 5, 2021

In this article I’m going to present the results of using 4 different algorithms for classification tasks: logistic regression, decision trees, k-nearest neighbors, and neural networks. To do this I’ll use the Red Wine Quality dataset, and I use sklearn to implement each of the algorithms. The implementation can be found in this link.


I. Dataset introduction and preprocessing

The dataset contains 11 physicochemical properties (features), plus the variable to predict (quality). The quality of the wine is rated on a scale from 3 to 8, and the problem we are going to solve consists of predicting whether a wine is good or bad using the known features. To make things a bit easier, we are going to handle the problem as a binary classification task.

Eleven physicochemical properties of the wine.
The physicochemical data statistics, taken from the original paper.

The classes are not balanced (e.g., there are many more normal wines than excellent or poor ones); the image below shows the distribution of the variable to predict.

Distribution of variable to predict.

We are going to work with 2 different scenarios. In the first scenario, bad wines (class 0) have a quality from 3 to 6, and good ones (class 1) have a quality of 7 or 8.

First scenario’s distribution of variable to predict.

In the second scenario, bad wines (class 0) have a quality from 3 to 5, and good ones (class 1) from 6 to 8. These 2 thresholds have a direct impact on the data distribution.

Second scenario’s distribution of variable to predict.

As we can see, in scenario 1 the data is highly imbalanced, while in scenario 2 the imbalance is much milder. This will have an impact on the results, as we discuss later.
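To make the two labelings concrete, here is a minimal sketch with pandas; the file path and separator are assumptions based on the UCI distribution of the dataset.

```python
import pandas as pd

# Path and separator are assumptions; the UCI copy of the
# dataset is semicolon-separated.
df = pd.read_csv("winequality-red.csv", sep=";")

X = df.drop(columns="quality")

# Scenario 1: only quality 7 or 8 counts as good (class 1).
y_scenario_1 = (df["quality"] >= 7).astype(int)

# Scenario 2: quality 6 or higher counts as good (class 1).
y_scenario_2 = (df["quality"] >= 6).astype(int)
```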

Another step in the preprocessing stage was to standardize the data; we used sklearn’s RobustScaler, aiming to reduce the impact of possible outliers.
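A minimal sketch of this step, assuming an 80/20 stratified train/test split (the split ratio is not stated in the article):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Split first, then fit the scaler on the training data only,
# so no test-set statistics leak into preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_scenario_1, test_size=0.2, random_state=42, stratify=y_scenario_1
)

# RobustScaler centers on the median and scales by the IQR,
# which makes it less sensitive to outliers than mean/std scaling.
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```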

II. Experiments and results

We used GridSearchCV with k-fold cross-validation to tune the hyperparameters in each of the 4 experiments.

a. Logistic regression. For both scenarios GridSearchCV found that liblinear is the best solver, 200 iterations are enough, and L2 is the best penalty.
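A sketch of how such a search can look; the exact candidate values tried in the article are not listed, so the grid below is an assumption that simply includes the reported winners:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumed candidate grid; liblinear/200/l2 are the reported picks.
param_grid = {
    "solver": ["liblinear", "lbfgs"],
    "max_iter": [100, 200, 500],
    "penalty": ["l2"],
}

# cv=5 runs 5-fold cross-validation for every parameter combination.
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```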

Results for scenario 1. We got an accuracy of 0.87. The precision for class 0 was 0.9, and for class 1 it was 0.5. This difference is due to the imbalanced data. The figure below displays the ROC curve and the AUC (0.64).

ROC curve for scenario 1.

Results for scenario 2. In this case the accuracy decreases a little bit, to 0.75. The precision for class 0 also decreases, to 0.74. However, the precision for class 1 increases to 0.76. This improvement is due to the data being more balanced. The AUC increases by 11 points to 0.75, which represents a great gain. The next plot shows the ROC curve.

ROC curve for scenario 2.

Another useful diagnostic is the confusion matrix; the next figure shows the one corresponding to scenario 2.

Confusion matrix for scenario 2.
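The ROC curve, AUC, and confusion matrix can be reproduced with sklearn and matplotlib; a minimal sketch, assuming grid is the fitted GridSearchCV from the previous snippet:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)               # hard labels for the matrix
y_score = best_model.predict_proba(X_test)[:, 1]  # class-1 probabilities

print(confusion_matrix(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_score))

fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance-level reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```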

b. Decision trees. For both scenarios GridSearchCV found that the best criterion is gini, and the best splitter is random.
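The same search pattern applies here; the grid below is an assumption that includes the reported winners:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

tree_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"criterion": ["gini", "entropy"], "splitter": ["best", "random"]},
    cv=5,
)
tree_grid.fit(X_train, y_train)
# Reported best: criterion='gini', splitter='random'
```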

Results for scenario 1. We got an accuracy of 0.88. A precision of 0.93 for class 0, and 0.53 for class 1. The AUC is 0.74. In the image below we can see the ROC curve and the confusion matrix.

ROC curve plot and confusion matrix for scenario 1.

Results for scenario 2. The accuracy was 0.74. The precision for class 0 was 0.72, and for class 1 it was 0.76. Again we can see how we get a better precision for class 1 when the data is balanced. In the image below it’s possible to see that the AUC was the same (0.74); however, the ROC curve behaves a bit differently. The confusion matrix helps us understand why the ROCs are distinct: the scenario 1 model classifies a larger share of samples correctly, since its true positives and true negatives sum to 87.5%, while in scenario 2 the same sum is 74.06%.

c. K-nearest neighbors. For both scenarios GridSearchCV found that the best parameters are 6 neighbors, distance-based weights, and auto as the algorithm.
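Again the same pattern, with a grid assumed to include the reported winners:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn_grid = GridSearchCV(
    KNeighborsClassifier(),
    {
        "n_neighbors": [3, 4, 5, 6, 7, 8, 9, 10],
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree"],
    },
    cv=5,
)
knn_grid.fit(X_train, y_train)
# Reported best: n_neighbors=6, weights='distance', algorithm='auto'
```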

Results for scenario 1. We got an accuracy of 0.89. A precision of 0.94 for class 0, and 0.61 for class 1. The AUC is 0.73.

Results for scenario 2. The accuracy was 0.79. The precision for class 0 was 0.81, and for class 1 it was 0.77. It presents the same behavior as the 2 previous algorithms, where scenario 2 gets a better precision for class 1. The AUC is 0.78, which is better than the one obtained in scenario 1.

d. Neural networks. Based on GridSearchCV we used a hidden layer size of 100, ReLU as the activation function, and Adam as the solver.
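Interpreting the hidden layer size of 100 as sklearn’s default hidden_layer_sizes=(100,), i.e., a single hidden layer of 100 units, a minimal sketch could look like this (max_iter=500 is an assumption, to give Adam room to converge):

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),  # one hidden layer with 100 units
    activation="relu",
    solver="adam",
    max_iter=500,        # assumption; the default 200 may warn about convergence
    random_state=42,
)
mlp.fit(X_train, y_train)
```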

Results for scenario 1. We got an accuracy of 0.86. A precision of 0.94 for class 0, and 0.9 for class 1. The AUC is 0.64.

Results for scenario 2. The accuracy was 0.74. The precision for class 0 was 0.73, and for class 1 it was 0.76. The AUC is 0.75, which is better than the one obtained in scenario 1.

Conclusions

The distribution of the target variable has a direct impact on the overall performance of the models. In our experiments a better AUC was obtained in 3 of the 4 cases (logistic regression, kNN, and neural networks) when the distribution of the dependent variable was a bit more uniform, as in scenario 2, where class 0 represented 46.5% of the data and class 1 represented 53.5%.

Getting the same AUC value with two different models does not imply that the models behave in the same way. In the decision tree experiment we got the same AUC (0.74) for both scenarios; however, in scenario 1 (imbalanced data) the true- and false-negative counts were higher, and the ROC curve was different.

Accuracy is not the only metric for measuring the performance of a model; there are others, such as precision and AUC (and more). In this experiment the best results were obtained using the kNN algorithm.

