Introduction
A quick refresher: supervised learning is the building of a machine learning model from labeled samples. For example, if we build a system to estimate the price of a house from given parameters, such as size (square feet), state, credit score, etc., we would need to store those parameters in a database and label them. Based on the saved data, our machine learning algorithm will be able to estimate the price of a house from the input parameters mentioned above.
Unsupervised learning is the opposite of what we just discussed: there is no labeled data here. We often have data points that we want to separate into groups, without knowing in advance exactly what the criteria of separation should be. So, an unsupervised learning algorithm will try to separate the given dataset into a fixed number of groups in the best possible way.
Python, with its powerful packages like NumPy, SciPy, scikit-learn, and matplotlib, is essential when working on supervised learning models.
Preprocessing Data Using Different Techniques
In the real world, we have to deal with a lot of data, and raw data is not ready to be ingested by machine learning algorithms. Before feeding the data into various algorithms, we have to preprocess it with techniques such as data cleansing and data deduplication.
We can start with a simple example. Create a sample Python file, preprocessor.py:
import numpy as np
from sklearn import preprocessing
# We can create some sample data here
data = np.array([[3, -1.5, 2, -5.4], [0, 4, -0.3, 2.1], [1, 3.3, -1.9, -4.3]])
# Now we can operate on this data
Common Preprocessing Techniques
Mean removal
It is common practice, and beneficial, to remove the mean from each feature so that it is centered on zero. This helps us remove any bias from the features. We can add the following lines to the file we started earlier:
data_standardized = preprocessing.scale(data)
print("\nMean =", data_standardized.mean(axis=0))
print("Std deviation =", data_standardized.std(axis=0))
We can run the code by typing the following in the terminal:
python preprocessor.py
From the results of running the code, we can see that the mean is almost zero and the standard deviation is 1.
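For intuition, preprocessing.scale standardizes each column by subtracting its mean and dividing by its standard deviation. As a quick sanity check, here is a minimal sketch (reusing the data and data_standardized arrays from above) that reproduces the result manually with NumPy:
# Manual standardization: subtract the column mean and divide by the column
# standard deviation; this should match preprocessing.scale(data) exactly.
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print("Matches preprocessing.scale:", np.allclose(manual, data_standardized))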
Scaling
It is important to understand that the values of each feature in a dataset can vary widely. Therefore, it is crucial to scale them to create a "level playing field." We can add the following to the preprocessor.py file we created earlier:
# Scaling the data
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)
print("\nMin max scaled data =", data_scaled)
After scaling, all the feature values fall within the specified range (0 to 1 in this case).
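Since each column is mapped independently via (x - min) / (max - min), the output for our sample data should look something like this (exact float formatting may vary by NumPy version):
Min max scaled data = [[1.         0.         1.         0.        ]
 [0.         1.         0.41025641 1.        ]
 [0.33333333 0.87272727 0.         0.14666667]]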
Normalization
Data normalization is used to adjust the values in a feature vector so they can be measured on a common scale. A common form of normalization in machine learning adjusts the values of a feature vector so that their absolute values add up to 1.
We can add the following to the previously created file.
# Data normalization
data_normalized = preprocessing.normalize(data, norm='l1')
print("\nL1 normalized data =", data_normalized)
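With the L1 norm, each row is divided by the sum of the absolute values of its elements, so the absolute values in every row sum to 1. For our sample data, the output should look something like this:
L1 normalized data = [[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]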
This technique is often used to ensure that data points aren't artificially inflated because of the inherent nature of their features.
Binarization
Binarization is used when you want to convert your numerical feature vector into a Boolean vector. Add the following lines to the Python file:
# Binarization
data_binarized = preprocessing.Binarizer(threshold=1.4).transform(data)
print("\nBinarized data =", data_binarized)
Every value above the threshold of 1.4 becomes 1, and everything else becomes 0. This is a very useful technique that's usually used when we have some prior knowledge of the data.
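For our sample array, the output should look something like this:
Binarized data = [[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [0. 1. 0. 0.]]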
One Hot Encoding
In the real world, we often deal with numeric data that is sparse and scattered all over the place. We can use one-hot encoding as a tool to tighten the feature vector. The encoder looks at each feature, identifies the total number of distinct values it takes, and encodes it using a one-of-K scheme: if a feature has k distinct values, it is transformed into a k-dimensional vector in which exactly one value is 1 and all the others are 0. For example, with a 4-dimensional feature vector whose parameters are size, color, credit score, and age, the encoder goes through the n-th feature of every sample and counts the distinct values in each of these features before encoding it. Since we only mark which category is present rather than storing large raw values, this representation can also be more efficient in terms of space. We can now add the following to the Python file:
encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print("\nEncoded vector =", encoded_vector)
This should be the expected output:
Encoded vector = [[0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
In the above example, let's consider the third feature in each feature vector. The values are 1, 5, 2, and 4. There are four distinct values here, which means the one-hot encoded portion for this feature will be of length 4. The distinct values are ordered as 1, 2, 4, 5, so if you want to encode the value 5, it will be the vector [0, 0, 0, 1]. Only one value can be 1 in this vector; the fourth element is 1, which indicates that the value is 5.
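As a side note, on newer scikit-learn releases (0.20 and later) you can inspect the distinct values the encoder found for each feature through its categories_ attribute. Treat this as a version-dependent sketch, since older releases exposed different attributes:
# Newer scikit-learn (>= 0.20): one array of distinct values per feature
print("Categories per feature:", encoder.categories_)
# Expected: [array([0, 1, 2]), array([2, 3]), array([1, 2, 4, 5]), array([3, 12])]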
Label Encoding
In supervised learning, we often work with different types of labels, which can be numbers or words. If they are numbers, the algorithm can use them directly. However, labels are often in a human-readable form, so people usually label the training data with words. Label encoding is the process of converting these word labels into numbers so that algorithms can work with them. Let's take a look at how to do this.
We can start by creating a new Python file:
# Label encoding
# import the preprocessing package
from sklearn import preprocessing
# This package contains various functions that are needed for data preprocessing.
label_encoder = preprocessing.LabelEncoder()
# The label_encoder object knows how to understand word labels.
input_classes = ['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw']
# We are now ready to encode these labels
label_encoder.fit(input_classes)
print("\nClass mapping:")
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)
# The words have been transformed into 0-indexed numbers.
# Now, when you encounter a set of labels, you can simply transform them,
# as follows
labels = ['toyota', 'ford', 'audi']
encoded_labels = label_encoder.transform(labels)
print("\nLabels =", labels)
print("Encoded labels =", list(encoded_labels))
# You can check the correctness by transforming numbers back
# to word labels:
encoded_labels = [2, 1, 0, 3, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("\nEncoded labels =", encoded_labels)
print("Decoded labels =", list(decoded_labels))
After running the code, we should see output along these lines:
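Class mapping:
audi --> 0
bmw --> 1
ford --> 2
toyota --> 3

Labels = ['toyota', 'ford', 'audi']
Encoded labels = [3, 2, 0]

Encoded labels = [2, 1, 0, 3, 1]
Decoded labels = ['ford', 'bmw', 'audi', 'toyota', 'bmw']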
In the next blog, we will discuss how to build a linear regression model. We will cover the fundamentals of linear regression, including its mathematical foundation and the assumptions that need to be met for this model to be effective. Additionally, we will walk through the steps of preparing your data, selecting the appropriate features, and splitting your dataset into training and testing sets.
After that, we'll dive into the actual implementation using popular libraries like Scikit-learn and Pandas in Python. We’ll provide examples and explain how to interpret the results, evaluate the model’s performance, and make predictions. Lastly, we’ll touch on common pitfalls to avoid and ways to improve the model's accuracy. Stay tuned for an insightful and practical guide!
Sincerely, Frosthash
Education, Execution, and Consistency