Machine Learning with scikit-learn

Sep 11, 2021

This is a quick reference for applying supervised machine learning using sklearn in Python. This guide assumes that you know basic concepts of supervised machine learning.

A model that can’t classify — Photo by Lindsay Doyle on Unsplash

Regression using Linear Model

Linear regression finds the weights (w) and constant term (b) that minimize the sum of squared errors of the model over the training data (this sum is the loss function). Plain least-squares linear models have no parameters to control model complexity; however, feature preprocessing and regularization are used to improve model performance.

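import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

fruits = pd.read_table('filename.txt')
X = fruits[['mass', 'width', 'height']]  # double brackets select several columns
y = fruits['fruit_label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
lin = LinearRegression().fit(X_train, y_train)

train_score = lin.score(X_train, y_train)  # R-squared on the training set
test_score = lin.score(X_test, y_test)     # R-squared on the test set
intercept = lin.intercept_
coef = lin.coef_
Number_of_non_zero_features = np.sum(lin.coef_ != 0)

# to predict on new examples (X_new needs the same three feature columns as X)
X_new = X_test.iloc[:5]  # stand-in for new rows
y_new = lin.predict(X_new)  # same as X_new @ coef + intercept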
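To make the role of the weights and the constant term concrete, here is a minimal sketch on synthetic data (the data and the names X_demo, y_demo and lin_demo are made up for illustration and are not part of the fruits example). It checks that predict is just the features multiplied by the learned weights, plus the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2*x1 - 1*x2 + 0.5 plus a little noise
rng = np.random.RandomState(123)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 2 * X_demo[:, 0] - 1 * X_demo[:, 1] + 0.5 + rng.normal(0, 0.1, size=50)

lin_demo = LinearRegression().fit(X_demo, y_demo)

# predict() is the dot product of the features with the weights, plus the intercept
manual_pred = X_demo @ lin_demo.coef_ + lin_demo.intercept_
sklearn_pred = lin_demo.predict(X_demo)

print(lin_demo.coef_, lin_demo.intercept_)
print(np.allclose(manual_pred, sklearn_pred))
```

The learned weights should land close to the 2 and -1 used to generate the data, since the noise is small.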

Regression using K-Nearest Neighbours (KNN)

The score method of a regressor returns R-squared (also known as the coefficient of determination), which measures how well a regression model fits the data. R-squared = 0 corresponds to a constant model that always predicts the mean of the target values. R-squared = 1 means perfect prediction.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

fruits = pd.read_table('filename.txt')
X = fruits[['mass', 'width', 'height']]
y = fruits['fruit_label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

train_score = knn.score(X_train, y_train)  # R-squared on the training set
test_score = knn.score(X_test, y_test)     # R-squared on the test set

# to predict on new examples (X_new needs the same three feature columns as X)
X_new = X_test.iloc[:5]  # stand-in for new rows
y_new = knn.predict(X_new)
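The R-squared score described above can be computed by hand. The following is a small sketch with made-up target values; the helper name r_squared is my own and not a scikit-learn function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])               # made-up target values
constant_model = np.full_like(y_true, y_true.mean())  # always predicts the mean
perfect_model = y_true.copy()                         # predicts every value exactly

print(r_squared(y_true, constant_model))  # 0.0
print(r_squared(y_true, perfect_model))   # 1.0
```

A model that does worse than predicting the mean gets a negative value, which is why R-squared on a test set can fall below 0.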

Classification using K-Nearest Neighbours (KNN)

In KNN, all neighbors can be treated equally, or closer neighbors (training samples whose features are more similar to the point being classified) can be given more influence. The parameter that controls this is weights, which can be set to weights='uniform' or weights='distance'.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

fruits = pd.read_table('filename.txt')

# to get human readable labels like {1: 'apple', 2: 'mandarin', 3: 'orange', 4: 'lemon'}, use the below code
lookup_fruit_name = dict(zip(fruits.fruit_label.unique(), fruits.fruit_name.unique()))

# create train test split
X = fruits[['mass', 'width', 'height']]  # this is where I choose which columns of dataframe 'fruits' will be features
y = fruits['fruit_label']  # this is where I choose which column will be the 'target variable'
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)  # splits in 75%:25% ratio of train:test

# create a classifier object
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

# train the classifier using the training data
knn.fit(X_train, y_train)

# assess the accuracy
knn.score(X_test, y_test)  # accuracy = correct predictions / # of samples, i.e. (TN + TP) / # of samples for two classes

# checking for individual instances: mass 20 g, width 10 cm, height 5.5 cm (values only, no units)
X_new = pd.DataFrame([[20, 10, 5.5]], columns=['mass', 'width', 'height'])
fruit_prediction_number = knn.predict(X_new)
fruit_prediction_text = lookup_fruit_name[fruit_prediction_number[0]]
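To see how the weights setting can change a prediction, here is a from-scratch sketch of the voting step only. The helper knn_vote and the neighbour distances are made up for illustration; scikit-learn does this internally, with weights='distance' weighting each neighbour by the inverse of its distance.

```python
import numpy as np

def knn_vote(distances, labels, weights="uniform"):
    """Pick a class from the k nearest neighbours' labels.

    'uniform' counts each neighbour once; 'distance' weights each by 1/distance.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    if weights == "uniform":
        w = np.ones_like(distances)
    else:
        w = 1.0 / distances  # assumes no neighbour sits at distance exactly 0
    classes = np.unique(labels)
    totals = np.array([w[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# Made-up neighbourhood: one very close 'lemon' and two distant 'orange' points
distances = [0.1, 2.0, 2.5]
labels = ["lemon", "orange", "orange"]

print(knn_vote(distances, labels, weights="uniform"))   # orange (2 votes against 1)
print(knn_vote(distances, labels, weights="distance"))  # lemon (10.0 against 0.9)
```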

Improving performance of KNN & Linear Models:

1. By understanding Bias-Variance Tradeoff

With a small K (e.g. K=1), the classifier is good at learning individual points in the training set, but the prediction is sensitive to noise, outliers and mislabelled data.
With a larger K (e.g. K=5), the decision areas are not as fragmented and are more robust to noise, but there may be more mistakes on individual points.
As K increases from 1 to 10 to 25, we move from complex models with low bias and high variance to simpler models that may have higher bias but lower variance, and so may be more robust.
To get a more reliable result from the code below, repeat it over multiple train-test splits (choosing K this way is part of a problem called model selection).

k_range = range(1, 20)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
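One way to act on the advice about multiple train-test splits is to average the test score over several random splits for each K. The sketch below assumes the iris dataset as a stand-in for the fruits file, and ten splits is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)  # iris stands in for the fruits file
k_range = range(1, 20)
seeds = range(10)  # ten different train-test splits (an arbitrary choice)

mean_scores = []
for k in k_range:
    split_scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=seed)
        knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        split_scores.append(knn_k.score(X_te, y_te))
    mean_scores.append(np.mean(split_scores))  # average accuracy for this K

best_k = k_range[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```

Averaging smooths out the luck of any single split, so the chosen K depends less on which rows happened to land in the test set.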

KNN and linear models fit using least squares are two of the most common methods. Other models include decision trees, kernelized Support Vector Machines, and Neural Networks.

2. By handling Overfitting & Underfitting

Generalisability is the goal: how well the model performs on test/unseen data.
Model complexity: problems arise when trying to fit a model that is too complex for the amount of training data available.
Underfit: the model is too simple and doesn't even fit the training data well, let alone the test data. Overfit: the model is too complex, memorizes the examples in the training data too well, and performs poorly on test data.
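Underfitting and overfitting can be seen by comparing training and test error as model complexity grows. The sketch below is an illustration on made-up noisy sine data, using NumPy polynomial fits of degree 1, 3 and 9 as the simple, moderate and complex models (these choices are mine, not from the course).

```python
import numpy as np

rng = np.random.RandomState(123)
x = np.sort(rng.uniform(-3, 3, size=40))
y = np.sin(x) + rng.normal(0, 0.2, size=40)  # made-up noisy data

# Alternate points go to train and test so both cover the same range
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

def mse(coefs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coefs, xs) - ys) ** 2))

results = {}
for degree in [1, 3, 9]:
    coefs = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    results[degree] = (mse(coefs, x_tr, y_tr), mse(coefs, x_te, y_te))
    print(degree, results[degree])  # (train error, test error)
```

Typically the degree 1 fit shows high error on both sets (underfit), while the highest degree reaches the lowest training error without a matching gain on the test set.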

3. Through Feature normalisation

Pre-process the data by normalizing feature values before applying Ridge regularization. Because the Ridge penalty depends on the size of each coefficient, features measured on very different scales would otherwise be penalized unevenly. MinMaxScaler() is a common normalization method: x' = (x - x_min) / (x_max - x_min)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Notice 'transform', not 'fit_transform': the scaler is fit on training data only, to prevent data leakage
X_test_scaled = scaler.transform(X_test)
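The min-max formula is easy to apply by hand, which also shows why the test set must reuse the training minimum and maximum. This is a NumPy sketch with made-up numbers, not a replacement for MinMaxScaler.

```python
import numpy as np

X_train_demo = np.array([[100.0, 4.0],
                         [150.0, 6.0],
                         [200.0, 8.0]])   # made-up mass (g) and width (cm)
X_test_demo = np.array([[175.0, 9.0]])

# Learn the minimum and maximum from the training data only
x_min = X_train_demo.min(axis=0)
x_max = X_train_demo.max(axis=0)

X_train_scaled_demo = (X_train_demo - x_min) / (x_max - x_min)
X_test_scaled_demo = (X_test_demo - x_min) / (x_max - x_min)

print(X_train_scaled_demo)  # each column spans 0 to 1
print(X_test_scaled_demo)   # [[0.75 1.25]]: test values can fall outside [0, 1]
```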

4. Polynomial Transformations

Polynomial transformation creates a richer feature set without the need to add new samples to the training dataset. The degree parameter controls the highest power to which features are raised when creating new combinations. For a dataset with two features X1 and X2, the new features added with degree=2 are X1*X1, X1*X2 and X2*X2.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)  # reuse the transformation fitted on the training data

lin = LinearRegression().fit(X_train_poly, y_train)
train_score = lin.score(X_train_poly, y_train)
test_score = lin.score(X_test_poly, y_test)
intercept = lin.intercept_
coef = lin.coef_
Number_of_non_zero_features = np.sum(lin.coef_ != 0)
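A tiny example makes the generated columns visible. The sketch below transforms a single made-up sample with X1 = 2 and X2 = 3; with the default settings, PolynomialFeatures also keeps the original features and prepends a constant column of ones.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_two = np.array([[2.0, 3.0]])  # one made-up sample with X1 = 2, X2 = 3
poly_demo = PolynomialFeatures(degree=2)
X_two_poly = poly_demo.fit_transform(X_two)

# Columns: 1 (constant), X1, X2, X1*X1, X1*X2, X2*X2
print(X_two_poly)  # [[1. 2. 3. 4. 6. 9.]]
```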

5. Ridge (L2), lasso (L1) and polynomial feature transformation

Ridge regression adds a penalty on large values of the w parameters; this is called regularization. Since Ridge penalizes the sum of squared weights, it is called L2 regularization.
Regularization reduces the complexity of the final model. If two models have the same least-squares loss, Ridge regression prefers the one with the smaller sum of squared feature weights. It can improve accuracy significantly when the feature set is large (100s of features). The amount of regularization is controlled by the alpha parameter: a larger value means more regularization and simpler models. The default value is 1.0. Regularization becomes less important as the amount of training data increases.

import numpy as np
from sklearn.linear_model import Ridge

for alpha in [0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha=alpha).fit(X_train_scaled, y_train)  # use the loop's alpha
    train_score = linridge.score(X_train_scaled, y_train)
    test_score = linridge.score(X_test_scaled, y_test)
    intercept = linridge.intercept_
    coef = linridge.coef_
    Number_of_non_zero_features = np.sum(linridge.coef_ != 0)
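The shrinking effect of alpha can be checked with the closed-form ridge solution. This is a NumPy sketch on synthetic data that leaves out the intercept for simplicity (scikit-learn's Ridge fits one by default); the helper name ridge_weights is my own.

```python
import numpy as np

rng = np.random.RandomState(123)
X_r = rng.normal(size=(30, 5))  # synthetic features
y_r = X_r @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 0.5, size=30)

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0, 1, 10, 20, 50, 100, 1000]
norms = [float(np.linalg.norm(ridge_weights(X_r, y_r, a))) for a in alphas]
for a, n in zip(alphas, norms):
    print(a, round(n, 3))  # the size of the weight vector shrinks as alpha grows
```

With alpha = 0 this reduces to ordinary least squares, and every increase in alpha pulls the weights closer to zero.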

Lasso penalizes the sum of the absolute values of the weights (|w|) instead of their squares as in Ridge, which is why it is called L1 regularization. Prefer Lasso if only a few features have medium/large effects. The code is almost the same. Lasso sets more coefficients to exactly zero, so observing the non-zero weights and sorting them by magnitude gives a good idea of the most crucial features in the model.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# alpha = 0 is plain least squares; scikit-learn advises using LinearRegression for that case
for alpha in [0, 1, 10, 20, 50, 100, 1000]:
    linlasso = Lasso(alpha=alpha).fit(X_train_scaled, y_train)
    train_score = linlasso.score(X_train_scaled, y_train)
    test_score = linlasso.score(X_test_scaled, y_test)
    intercept = linlasso.intercept_
    coef = linlasso.coef_
    Number_of_non_zero_features = np.sum(linlasso.coef_ != 0)  # with lasso, more features are set to zero as compared to Ridge

linlasso = Lasso(alpha=20.0).fit(X_train_scaled, y_train)
train_score = linlasso.score(X_train_scaled, y_train)
test_score = linlasso.score(X_test_scaled, y_test)

# list the features with non-zero weights, largest magnitude first
X_labels = X.columns  # feature names; assumes X is the DataFrame of features
for e in sorted(zip(X_labels, linlasso.coef_), key=lambda e: -abs(e[1])):
    if e[1] != 0:
        print(e)
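The difference in how Lasso and Ridge treat unimportant features can be seen on synthetic data where only two of ten features influence the target. The data, the alpha value of 0.05 and the variable names below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(123)
X_s = rng.uniform(0, 1, size=(100, 10))  # ten features already in the range [0, 1]
y_s = 5 * X_s[:, 0] - 3 * X_s[:, 1] + rng.normal(0, 0.1, size=100)  # only two features matter

lasso_demo = Lasso(alpha=0.05).fit(X_s, y_s)
ridge_demo = Ridge(alpha=0.05).fit(X_s, y_s)

lasso_non_zero = int(np.sum(lasso_demo.coef_ != 0))
ridge_non_zero = int(np.sum(ridge_demo.coef_ != 0))
print(lasso_non_zero, ridge_non_zero)
print(np.round(lasso_demo.coef_, 2))
```

Ridge usually keeps a small non-zero weight on every feature, while Lasso tends to drive the weights of the irrelevant ones to exactly zero.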

Decision Trees

A decision tree repeatedly finds the feature that leads to the most informative split. Overfitting is a typical problem, as the Decision Tree Classifier keeps adding nodes until each leaf is pure.
There are three parameters to control the complexity of these trees and prevent them from overfitting: max_depth, max_leaf_nodes, and min_samples_leaf (the minimum number of samples a leaf must contain). These are pre-pruning controls, where I decide beforehand how big/complex I want my decision trees to be. (Since version 0.22, scikit-learn also supports post-pruning through the ccp_alpha parameter.)
Advantages: No need to do feature preprocessing, easy to visualise, and works well if features are of different types (categorical, continuous, binary).
Disadvantages: Even after tuning, trees still tend to overfit, and an ensemble of trees is usually needed to get good generalization performance.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=123)
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # fit on the training labels
train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)

What each node of the visualised tree shows:
decision: the split criterion
samples: number of samples reaching the node
distribution: [# samples class A, # samples class B, # samples class C, …]
class: name of the class with the most samples at this node
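To inspect a fitted tree without plotting, scikit-learn's export_text prints the split rules as text. The sketch below also compares an unrestricted tree with one limited by max_depth=3; the variable names are my own, and the exact tree you get depends on the split and random_state.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris_demo.data, iris_demo.target, random_state=123)

# No limits: the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=123).fit(X_tr, y_tr)
# Pre-pruned: at most three levels of splits
small_tree = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_tr, y_tr)

print(full_tree.get_depth(), full_tree.get_n_leaves())
print(small_tree.get_depth(), small_tree.get_n_leaves())

# Text view of the pre-pruned tree: one line per split rule or leaf
print(export_text(small_tree, feature_names=list(iris_demo.feature_names)))
```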

Improving Decision Trees with Random Forests and Gradient Boosting

1. Ensemble — Random Forests

n_estimators is the number of decision trees to use. Each tree is built from a different sample of the data, called a bootstrapped sample (random selection with replacement, which allows a row to be included multiple times). Instead of searching over all features, each split considers only a random subset of features, whose size is controlled by the max_features parameter.
Combining results after all decision trees are computed: For regression, the prediction for each sample is the average of the individual trees' predictions. For classification, each tree gives a probability for each class for every sample; the probabilities are averaged across trees, and the class with the highest average probability is chosen.
n_estimators should be larger for larger data sets to reduce overfitting.
max_features influences the diversity of trees in the forest. Default is good enough, but tweaking might give some gains.
max_depth defaults to None, which grows each tree until every leaf is pure, so set it explicitly to limit tree size.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# max_features is the number of features considered at each split
clf = RandomForestClassifier(n_estimators=10, max_features=3, random_state=123).fit(X_train, y_train)
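The averaging step for classification can be reproduced by hand from the individual trees, which are available in the fitted forest's estimators_ attribute. This is a sketch on the iris dataset (my choice of data), checking the manual average against the forest's own output.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_f, y_f = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_f, y_f, random_state=123)

forest = RandomForestClassifier(n_estimators=10, max_features=3, random_state=123).fit(X_tr, y_tr)

# Average the class probabilities predicted by each individual tree
tree_probas = np.mean([tree.predict_proba(X_te) for tree in forest.estimators_], axis=0)
# Pick the class with the highest average probability for each sample
manual_classes = forest.classes_[np.argmax(tree_probas, axis=1)]

print(np.allclose(tree_probas, forest.predict_proba(X_te)))
print(np.array_equal(manual_classes, forest.predict(X_te)))
```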

2. Gradient boosted Decision Trees (GBDT)

GBDTs use an ensemble like Random Forest, but instead of building trees in parallel and averaging their results, they build trees in series, where each tree tries to correct the mistakes of the previous ones.
They start with weak learners (shallow trees) and improve on them downstream. A high learning_rate means each successive tree puts strong emphasis on correcting the mistakes of its predecessors, which results in a more complex model.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, max_depth=3, random_state=123).fit(X_train, y_train)

n_estimators is the number of weak learners; it is usually adjusted first to increase accuracy and to make the best use of the available computing resources.
learning_rate is then adjusted, keeping n_estimators fixed.
max_depth is usually kept small (3–5) since GBDTs assume each tree is a weak learner.
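Because the trees are added one after another, the effect of n_estimators can be inspected on a single fitted model with staged_predict, which yields the ensemble's predictions after each additional tree. The sketch below uses the iris dataset as an assumed example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_g, y_g = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_g, y_g, random_state=123)

gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, max_depth=3,
                                  random_state=123).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators boosting stages
staged_test_scores = [accuracy_score(y_te, y_pred) for y_pred in gbdt.staged_predict(X_te)]

print(staged_test_scores[0], staged_test_scores[-1])
```

Plotting staged_test_scores against the number of trees is a cheap way to see where extra trees stop helping, without refitting the model for every value of n_estimators.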

Gradient Boosted Decision Trees are among the best off-the-shelf supervised learning methods available and need only modest memory and runtime for prediction (model training, though, is computationally very intensive). They also have the advantages that come with Decision Trees: no feature preprocessing/normalisation is needed, and they can accommodate multiple feature types. Many major commercial applications of machine learning are based on GBDTs. And, just like other Decision Trees, GBDTs aren't recommended for classification problems with a very high number of features, such as those in Speech Recognition and Text Classification.

This guide is essentially a summary of some parts of the popular University of Michigan Course on Applied Machine Learning with Python.