In the previous blog, we learned how to build a regressor. It is equally important to understand how to evaluate the quality of that regressor. In this context, an error is defined as the difference between the actual value and the value predicted by the regressor.
A regressor can be evaluated using several different metrics, such as the following (a small hand-computed sketch of them follows the list):
Mean Absolute Error: This is the average of absolute errors of all the data points in the given dataset.
Mean Squared Error: This is the average of the squares of the errors of all the data points in the given dataset. It is one of the most popular metrics out there!
Median Absolute Error: This is the median of the absolute errors of all the data points in the given dataset. The main advantage of this metric is that it is robust to outliers: a single bad point in the test dataset will not skew the entire metric, as it can with the mean-based metrics.
Explained Variance Score: This score measures how well our model can account for the variation in our dataset. A score of 1.0 indicates that our model is perfect.
R² score: This is pronounced R-squared, and it refers to the coefficient of determination. It tells us how well unseen samples are likely to be predicted by our model. The best possible score is 1.0, and the value can also be negative.
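To make these definitions concrete, here is a minimal sketch that computes a few of the metrics by hand on made-up numbers (the values below are purely illustrative):
import numpy as np
# Purely illustrative actual and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
errors = y_true - y_pred
print("MAE =", np.mean(np.abs(errors)))      # mean absolute error -> 0.75
print("MSE =", np.mean(errors ** 2))         # mean squared error -> 0.875
print("MedAE =", np.median(np.abs(errors)))  # median absolute error -> 0.75
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
print("R2 =", 1 - ss_res / ss_tot)           # coefficient of determination -> about 0.72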
For real models, we can use scikit-learn, which provides functions to compute all of these metrics, so we do not have to implement them ourselves.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn import linear_model
import sklearn.metrics as sm
# Filepath to your dataset
filename = "/Users/frosthash/Downloads/insurance.csv"
# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(filename)
# Features and target
features = ['age', 'bmi', 'children', 'sex', 'smoker', 'region']
target = 'charges'
# Split the data into training and test sets (80% training, 20% testing)
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for categorical variables
categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']
# Use OneHotEncoder for categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ]
)
# Create a pipeline with preprocessing and linear regression
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', linear_model.LinearRegression())
])
# Train the model
model.fit(X_train, y_train)
# Predict on the test data
y_test_pred = model.predict(X_test)
# Calculate and print metrics
print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))
Building a Ridge Regressor
One of the main problems with linear regression is that it is sensitive to outliers. During data collection in the real world, it is quite common to measure the output incorrectly. As we have discussed, linear regression uses ordinary least squares, which tries to minimize the sum of squared errors. Outliers cause problems because they contribute disproportionately to this overall error, and that disrupts the entire model.
Consider the following figure:
The points at the bottom are outliers, but the model tries to fit all the points, so the overall model tends to be inaccurate. By visual inspection, we can see that the following figure represents a better model:
Ordinary least squares considers every single data point when building the model, so the actual model ends up looking like the dotted line shown in the preceding figure, which is clearly suboptimal. To avoid this, we use regularization, where a penalty is imposed on the size of the coefficients. This method is called Ridge Regression.
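Before applying it to the insurance data, here is a minimal sketch (on made-up numbers) of what the penalty does: as alpha grows, Ridge shrinks the coefficients relative to ordinary least squares.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
# Tiny illustrative dataset with a known linear relationship plus noise
rng = np.random.RandomState(0)
X_demo = rng.rand(20, 3)
y_demo = 5 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.randn(20) * 0.1
print("OLS coefficients:", LinearRegression().fit(X_demo, y_demo).coef_)
for alpha in [0.1, 1.0, 10.0]:
    print(f"Ridge alpha={alpha}:", Ridge(alpha=alpha).fit(X_demo, y_demo).coef_)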
Here is how to perform Ridge Regression in Python.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
import sklearn.metrics as sm
import matplotlib.pyplot as plt
# Filepath to your dataset
filename = "/Users/frosthash/Downloads/insurance.csv"
# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(filename)
# Features and target
features = ['age', 'bmi', 'children', 'sex', 'smoker', 'region']
target = 'charges'
# Split the data into training and test sets (80% training, 20% testing)
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for categorical variables
categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']
# Use OneHotEncoder for categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ]
)
# Create a pipeline with preprocessing and Ridge regression
ridge_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Ridge(alpha=1.0))  # You can adjust alpha for regularization strength
])
# Train the Ridge regression model
ridge_model.fit(X_train, y_train)
# Predict on the training and test data
y_train_pred = ridge_model.predict(X_train)
y_test_pred = ridge_model.predict(X_test)
# Calculate and print metrics
print("Ridge Regression Metrics:")
print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))
# Plot actual vs predicted for both datasets
plt.figure(figsize=(14, 6))
# Training dataset plot
plt.subplot(1, 2, 1)
plt.scatter(y_train, y_train_pred, alpha=0.6, color='blue')
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'k--', lw=2)
plt.title('Training Data: Actual vs. Predicted (Ridge Regression)')
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.grid(True)
# Test dataset plot
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_test_pred, alpha=0.6, color='green')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.title('Test Data: Actual vs. Predicted (Ridge Regression)')
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.grid(True)
plt.tight_layout()
plt.show()
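The alpha value in the pipeline above was fixed at 1.0, but as the comment notes, it can be adjusted. One simple way to explore it is sketched below, reusing the data splits and preprocessor already defined; in practice, cross-validation on the training data is the more rigorous way to choose alpha.
# Compare a few candidate alpha values (illustrative choices)
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    candidate = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', Ridge(alpha=alpha))
    ])
    candidate.fit(X_train, y_train)
    print(f"alpha={alpha}: R2 on test data =", round(sm.r2_score(y_test, candidate.predict(X_test)), 3))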
Building a Polynomial Regressor
One of the main constraints of a linear regression model is that it tries to fit a linear function to the input data. The polynomial regression model overcomes this by allowing the fitted function to be a polynomial, which can improve the accuracy of the model on data with non-linear patterns.
Consider the following figure:
We can see that there is a natural curve to the pattern of the data points, which a linear model is unable to capture. A polynomial model would look like this:
In the figure above, the dotted line represents the linear regression model and the solid line represents the polynomial regression model. The curviness of the model is controlled by the degree of the polynomial. As the degree increases, the model can follow the training data more closely; the trade-off is that the added complexity makes the model slower and more prone to overfitting.
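To see what the polynomial transformation actually does to the inputs, here is a small sketch (on made-up numbers) of how PolynomialFeatures with degree 2 expands two features into their squares and pairwise product:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Two illustrative features, roughly in the range of age and bmi
X_demo = np.array([[30, 25.0],
                   [40, 31.5]])
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X_demo))  # each row becomes [a, b, a^2, a*b, b^2]
# With scikit-learn 1.0+, the generated feature names can be inspected:
print(poly.get_feature_names_out(['age', 'bmi']))
With that in mind, here is how to build a polynomial regressor on the same insurance dataset: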
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import sklearn.metrics as sm
import matplotlib.pyplot as plt
# Filepath to your dataset
filename = "/Users/frosthash/Downloads/insurance.csv"
# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(filename)
# Features and target
features = ['age', 'bmi', 'children', 'sex', 'smoker', 'region']
target = 'charges'
# Split the data into training and test sets (80% training, 20% testing)
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for categorical and numerical variables
categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']
# Polynomial regression with degree 2
poly_degree = 2
preprocessor = ColumnTransformer(
    transformers=[
        ('num', PolynomialFeatures(degree=poly_degree, include_bias=False), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ]
)
# Create a pipeline with preprocessing and linear regression
polynomial_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),  # Optional: Scale the features for better performance
    ('regressor', LinearRegression())
])
# Train the Polynomial Regression model
polynomial_model.fit(X_train, y_train)
# Predict on the training and test data
y_train_pred = polynomial_model.predict(X_train)
y_test_pred = polynomial_model.predict(X_test)
# Calculate and print metrics
print("Polynomial Regression Metrics (Degree 2):")
print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))
# Plot Actual vs Predicted Charges for Training Data
plt.figure(figsize=(14, 6))
plt.scatter(y_train, y_train_pred, color='blue', alpha=0.5, label='Training Data')
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--', label='Perfect Prediction Line')
plt.title('Polynomial Regression: Actual vs Predicted Charges (Training Data)')
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.legend()
plt.grid(True)
plt.show()
# Plot Actual vs Predicted Charges for Test Data
plt.figure(figsize=(14, 6))
plt.scatter(y_test, y_test_pred, color='green', alpha=0.5, label='Test Data')
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--', label='Perfect Prediction Line')
plt.title('Polynomial Regression: Actual vs Predicted Charges (Test Data)')
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.legend()
plt.grid(True)
plt.show()
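To check whether a higher degree actually helps on this dataset, a quick sketch (reusing the objects already defined in this script) is to refit the pipeline with a few different degrees and compare the R2 scores on the test data:
# Compare a few polynomial degrees (degree 1 reduces to plain linear regression)
for degree in [1, 2, 3]:
    candidate_preprocessor = ColumnTransformer(transformers=[
        ('num', PolynomialFeatures(degree=degree, include_bias=False), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])
    candidate = Pipeline(steps=[
        ('preprocessor', candidate_preprocessor),
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])
    candidate.fit(X_train, y_train)
    print(f"degree={degree}: R2 on test data =", round(sm.r2_score(y_test, candidate.predict(X_test)), 3))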
In this case, the results show that plain linear regression actually outperforms the polynomial model on this dataset, so the simpler model is the better choice here. In the next blog, we will learn about classifiers: how to build one, logistic regression classifiers, and Naive Bayes classifiers. Stay tuned for an insightful and practical guide!
Sincerely, Frosthash
Education, Execution, and Consistency