Regression is the process of estimating the relationship between input data and continuous output data. The data typically consists of real numbers, and the goal is to recover the underlying function that maps the inputs to the output.
Consider the following mapping between inputs and outputs: 1 → 2, 3 → 6, 4.3 → 8.6, 7.1 → 14.2. If I asked you to find the relationship between the inputs and outputs, you could easily do it by looking at the pattern: the output is twice the input in each case, so the function would be f(x) = 2x. This is a simple function connecting the input values to the output values. However, in the real world, relationships are rarely this straightforward!
The goal of linear regression is to extract the underlying linear model that relates the input variables to the output variable. It does this by minimizing the sum of the squares of the differences between the actual outputs and the outputs predicted by a linear function: for a single input, that means finding the slope a and intercept b that minimize Σᵢ (yᵢ − (a·xᵢ + b))². This method is called Ordinary Least Squares (OLS).
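To make this concrete, here is a minimal sketch (assuming NumPy is available; it is not needed for the main example later) that recovers the f(x) = 2x mapping from the toy data above by ordinary least squares:
import numpy as np
# Toy data from the mapping above
x = np.array([1.0, 3.0, 4.3, 7.1])
y = np.array([2.0, 6.0, 8.6, 14.2])
# Fit a straight line (degree-1 polynomial) by ordinary least squares
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # close to 2.0 and 0.0, i.e. f(x) = 2x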
In this blog, I will be using the Medical Cost Personal Datasets from Kaggle. I have downloaded the CSV file that I will be working with; you will need to create a Kaggle account to access the dataset. The link to the dataset is: https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download
When we build a machine learning model, we need a way to validate it and check whether it is performing at a satisfactory level. To do this, we separate our data into two groups: a training dataset and a testing dataset. The training dataset is used to build the model, and the testing dataset is used to see how the trained model performs on unseen data. In the code below, we will split the dataset into training and testing sets, using 80% of the data for training and 20% for testing.
The code for building a linear regression model is as follows:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Filepath to your dataset
filename = "/Users/frosthash/Downloads/insurance.csv"
# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(filename)
# Display the first few rows of the DataFrame
print(df.head())
# Features and target
features = ['age', 'bmi', 'children', 'sex', 'smoker', 'region']
target = 'charges'
# Split the data into training and test sets (80% training, 20% testing)
X = df[features]
y = df[target]
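# random_state=42 fixes the shuffle seed so the split is reproducible across runs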
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for categorical variables
categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']
# Use OneHotEncoder for categorical data
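# drop='first' removes one dummy column per categorical feature to avoid
# perfectly collinear columns (the dummy-variable trap)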
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ]
)
# Create a pipeline with preprocessing and linear regression
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', linear_model.LinearRegression())
])
# Train the model
model.fit(X_train, y_train)
# Predict on the training and test data
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Evaluate the model
mse_train = mean_squared_error(y_train, y_train_pred)
r2_train = r2_score(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)
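# A large gap between training and test metrics would indicate overfitting;
# comparable values suggest the model generalizes to unseen data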
# Display results
print("\nModel Coefficients:")
print(f"Intercept: {model.named_steps['regressor'].intercept_}")
print(f"Coefficients: {model.named_steps['regressor'].coef_}")
print("\nEvaluation Metrics (Training):")
print(f"Mean Squared Error: {mse_train}")
print(f"R^2 Score: {r2_train}")
print("\nEvaluation Metrics (Testing):")
print(f"Mean Squared Error: {mse_test}")
print(f"R^2 Score: {r2_test}")
# Plot for training data
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.scatter(y_train, y_train_pred, alpha=0.6, color='blue')
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'k--', lw=2)
plt.title('Training Data: Actual vs. Predicted')
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.grid(True)
# Plot for test data
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_test_pred, alpha=0.6, color='green')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.title('Test Data: Actual vs. Predicted')
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.grid(True)
plt.tight_layout()
plt.show()
After building and running the model, we can examine the results. The linear regression model performs well at predicting insurance charges: it explains a substantial portion of the variability in both the training and testing sets, which suggests it generalizes effectively to unseen data. The coefficients reveal that certain features, such as whether the individual is a smoker, have a strong influence on insurance charges. That said, the error values show there is still room for improvement, and additional refinement or more advanced modeling techniques could yield better performance.
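The raw coefficient array printed above is hard to read because one-hot encoding expands the categorical columns. To see which feature each coefficient belongs to, we can ask the fitted preprocessor for its output column names. Here is a short sketch, assuming scikit-learn 1.0 or newer (where ColumnTransformer provides get_feature_names_out):
# Map each coefficient to the column it multiplies
feature_names = model.named_steps['preprocessor'].get_feature_names_out()
coefficients = model.named_steps['regressor'].coef_
for name, coef in zip(feature_names, coefficients):
    print(f"{name}: {coef:.2f}")
Printing the coefficients this way makes it easy to confirm which features, such as the smoker indicator, carry the most weight.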
In the next blog, we will learn how to measure regression accuracy after building a model. We will cover mean absolute error, mean squared error, median absolute error, and the R-squared score. Stay tuned for an insightful and practical guide!
Sincerely, Frosthash
Education, Execution, and Consistency