Machine Learning is gaining traction, and companies are looking to integrate ML solutions to enhance their businesses. Much of this has become possible because of the community and developer support that makes ML more popular and easier to use. One outcome of that effort is Scikit-learn, an open-source framework created by Machine Learning developers. It serves as a foundation so that ML practitioners do not need to write everything from scratch.
In this article, we will discuss the basic support Scikit-learn provides for all the stages of Machine Learning model development. Like other software frameworks, it contains numerous tools and features, and it is impossible to cover everything in one blog. Here, we focus on the specific features that a beginner should know.
We will cover each stage of Machine Learning model development and highlight the support Scikit-learn provides in each stage.
But before discussing all this, let's learn more about the library and its installation steps.
Scikit-learn, also known as sklearn, is a free, open-source, Python-based Machine Learning library that supports tasks in data mining, data analytics, data science, and Machine Learning. It is built on top of the well-known Python packages SciPy, NumPy, and Matplotlib. Since it is open-source, we can easily access its codebase and dive deeper into the code behind each piece of functionality it provides.
The official GitHub repository of the Scikit-learn library can be found here. The repository shows more than 2,700 contributors and 55.5k stars, reflecting the library's popularity in the ML community.
There is a direct command for installing Scikit-learn using Pip:
pip install -U scikit-learn
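Once installed, a quick sanity check is to import the library and print its version (the exact version number will vary with your setup):
import sklearn
print(sklearn.__version__)  # e.g., 1.4.2; any recent version confirms the installation worked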
Let's think from a beginner's perspective: What would be required if someone starts their journey in the ML field?
Let's now look at each of these areas in greater detail, with code examples, since that is what beginners are most interested in.
Data lies at the core of the Machine Learning process. In practice, data is specific to the problem the ML model is being developed for, but from a learning perspective, we need some ready-made datasets to help us experiment with multiple algorithms and understand their behavior.
To begin the journey in Machine Learning, the Scikit-learn library provides a large set of freely available datasets that can be imported directly into our programs. The toy datasets are the most famous of the dataset collections provided by Scikit-learn; popular datasets in this set include the Iris plants, Diabetes, Digits, Linnerud, Wine, and Breast cancer datasets.
Refer here to check out all the datasets in the toy category. Apart from the toy datasets, some real-world datasets are incorporated in the Scikit-learn library, such as the Olivetti faces, 20 newsgroups, Labeled Faces in the Wild, and California housing datasets.
Several other datasets prepared from real-world scenarios can be found here. Scikit-learn even lets users generate random datasets tailored to the requirements of the model being tested.
Now that we know about the available datasets, let's quickly see how to load them into our programs.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target      # feature matrix and target labels
features = iris.feature_names      # names of the four input features
labels = iris.target_names         # names of the three flower categories

print('Available Features :', features)
print('Categories :', labels)
print(len(X))                      # total number of samples
print(len(y[y==0]))                # samples in each category
print(len(y[y==1]))
print(len(y[y==2]))
'''
Available Features : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Categories : ['setosa' 'versicolor' 'virginica']
150
50
50
50
'''
The above code shows how to load a sample dataset (the iris flower dataset) from sklearn.datasets, view its attributes, and print the total number of samples in each category. We can use the command dir(sklearn.datasets) to check all the datasets this package provides.
import sklearn
print(dir(sklearn.datasets))
Scikit-learn also provides the option to generate an entirely new dataset as per our requirements. For example, sklearn.datasets can generate two-dimensional data shaped as two interleaving half circles (make_moons) or concentric circles (make_circles), which can later be used for classification or clustering tasks. Apart from these, the package provides several other loaders and fetchers, such as load_svmlight_file and fetch_openml.
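For instance, here is a minimal sketch of generating such synthetic data with make_moons and make_circles (the sample counts and noise levels below are arbitrary choices):
from sklearn.datasets import make_moons, make_circles

# Two interleaving half circles: handy for non-linear classification demos
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=42)

# Two concentric circles: handy for clustering demos
X_circles, y_circles = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=42)

print(X_moons.shape, y_moons.shape)      # (200, 2) (200,)
print(X_circles.shape, y_circles.shape)  # (200, 2) (200,)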
Real-world datasets are rarely flawless and usually demand preprocessing before we can extract meaning from them. Scikit-learn provides many built-in modules we can use to analyze and preprocess data, so let's have a look.
The objective of the data preprocessing stage is to get the data into a trainable format. This typically requires handling missing values, encoding non-numeric attributes, and scaling features to comparable ranges. Let's see each of these in detail.
We can use the Scikit-learn library to fill in missing values in the dataset, a process called imputation. There are many ways to do this, but here we will focus on using SimpleImputer to replace missing values.
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[19, 18, np.nan, 26],
                 [85, 53, 76, 45],
                 [83, 97, 1, np.nan],
                 [73, 28, 38, 37],
                 [87, np.nan, 86, 66],
                 [23, 28, 11, 10]])

print('Original Data :')
print(data)                             # Check data before imputing
print(np.isnan(data).any())             # Check presence of missing values

imp = SimpleImputer(strategy='median')  # Define imputer with a strategy (mean/median/most_frequent)
data_new = imp.fit_transform(data)      # Transform the data as per the strategy

print('New Data :')
print(data_new)                         # Check data after imputing
Original Data :
[[19. 18. nan 26.]
 [85. 53. 76. 45.]
 [83. 97.  1. nan]
 [73. 28. 38. 37.]
 [87. nan 86. 66.]
 [23. 28. 11. 10.]]
True
New Data :
[[19. 18. 38. 26.]
 [85. 53. 76. 45.]
 [83. 97.  1. 37.]
 [73. 28. 38. 37.]
 [87. 28. 86. 66.]
 [23. 28. 11. 10.]]
We can also use strategies like 'mean' or 'most_frequent' to replace missing values with the mean or mode of the corresponding feature (column).
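As a small self-contained sketch (on a made-up array), switching the strategy is a one-argument change:
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1.0, np.nan],
                 [2.0, 3.0],
                 [np.nan, 3.0],
                 [2.0, 5.0]])

print(SimpleImputer(strategy='mean').fit_transform(data))           # column means fill the gaps
print(SimpleImputer(strategy='most_frequent').fit_transform(data))  # column modes fill the gaps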
At times, attributes of a dataset can be non-numeric yet informative, and ML models can only process numbers, so these non-numeric values cannot be used directly. That's when a label encoder comes into the picture: it replaces non-numerical values with numerical ones and makes these attributes understandable to machines.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
path = 'D:/EnjoyAlgorithm/PlayTennis.csv'
PlayTennis = pd.read_csv(path, header = 0, skiprows = 0) #Loading the Text Dataset
print ("Dataset Length: ", len(PlayTennis))
print ("Dataset Shape: ", PlayTennis.shape)
print(PlayTennis) #Before processing
Le = LabelEncoder()
for label in PlayTennis.columns:
    PlayTennis[label] = Le.fit_transform(PlayTennis[label])
print(PlayTennis) #After processing
For example, in the above PlayTennis dataset, the Label Encoder assigned a numerical value to each non-numerical data entry (say 'overcast' = 0, 'rainy' = 1, 'sunny' = 2). The processed data is now suitable for developing a Machine Learning model.
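Since the CSV file above is stored locally, here is a small self-contained sketch of the same idea on a made-up list of outlook values:
from sklearn.preprocessing import LabelEncoder

outlook = ['sunny', 'overcast', 'rainy', 'sunny', 'rainy']

le = LabelEncoder()
encoded = le.fit_transform(outlook)          # categories are sorted alphabetically before encoding

print(encoded)                               # [2 0 1 2 1]
print(list(le.classes_))                     # ['overcast', 'rainy', 'sunny']
print(list(le.inverse_transform(encoded)))   # maps the numbers back to the original labels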
In many real-world datasets, different attributes lie in different numerical ranges, meaning their minimums and maximums do not match. This creates problems because attributes with larger magnitudes will be weighted more heavily (or less, depending on the algorithm).
For example, a hiring manager has to develop a plan to propose the salary for an individual. Their inputs include previous wages and the number of years of work experience. If we use an ML algorithm, e.g., KNN, the previous salary feature (being in the higher magnitude range) will outweigh the work experience as the numerical quantity of work experience will vary in the range of 0–70, but salary numbers will range from thousands to millions. Hence, we need to scale these features to assign them equal importance.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# define data ---> [salary ($), work ex (yrs)]
data = np.array([[3000, 1],
                 [3300, 2],
                 [4500, 2],
                 [3800, 1],
                 [4800, 3],
                 [5000, 5]])
print(data)

scaler = MinMaxScaler()                 # define an object of the MinMaxScaler class
new_data = scaler.fit_transform(data)   # fit and transform the data
print(new_data)
'''
Original
[[3000    1]
 [3300    2]
 [4500    2]
 [3800    1]
 [4800    3]
 [5000    5]]
Scaled
[[0.   0.  ]
 [0.15 0.25]
 [0.75 0.25]
 [0.4  0.  ]
 [0.9  0.5 ]
 [1.   1.  ]]
'''
We can use different scalers, such as MaxAbsScaler and StandardScaler; the choice depends on the problem statement and the nature of the dataset.
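For instance, here is a minimal sketch of StandardScaler on the same salary/experience data; unlike MinMaxScaler, it centers each column to zero mean and unit variance rather than squeezing it into [0, 1]:
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[3000, 1],
                 [3300, 2],
                 [4500, 2],
                 [3800, 1],
                 [4800, 3],
                 [5000, 5]])

scaler = StandardScaler()            # standardize each column: (x - mean) / std
print(scaler.fit_transform(data))
print(scaler.mean_)                  # per-column means used for centering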
In feature engineering, we prepare the proper input features for ML models. Some popular techniques to perform feature engineering are:
Vectorization using Scikit-learn: Vectorization techniques are mainly used when data is not in a tabular format, such as text data, JSON files, or dictionaries. They convert the data into a vector format, making it more understandable to machines.
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False, dtype=int)   # return a dense integer array
data = [
    {'price': 1125000, 'rooms': 4, 'State': 'New York'},
    {'price': 1000000, 'rooms': 3, 'State': 'California'},
    {'price': 750000, 'rooms': 3, 'State': 'Washington'},
    {'price': 800000, 'rooms': 2, 'State': 'California'},
    {'price': 850000, 'rooms': 2, 'State': 'New York'},
]
new_data = vec.fit_transform(data)
print(new_data)
'''
Output =
[[ 0 1 0 1125000 4]
[ 1 0 0 1000000 3]
[ 0 0 1 750000 3]
[ 1 0 0 800000 2]
[ 0 1 0 850000 2]]
'''
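For raw text rather than dictionaries, scikit-learn also offers vectorizers such as CountVectorizer from sklearn.feature_extraction.text; here is a minimal sketch on two made-up sentences:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['machine learning is fun',
          'scikit-learn makes machine learning easy']

vec = CountVectorizer()
counts = vec.fit_transform(corpus)    # sparse matrix of word counts

print(vec.get_feature_names_out())    # vocabulary learned from the corpus (sklearn >= 1.0)
print(counts.toarray())               # dense view: one row per sentence, one column per word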
Dimensionality Reduction using Scikit-learn
Dimensionality reduction maps data samples from a high-dimensional space to a low-dimensional space while retaining as much information as possible. We cannot visualize datasets with more than three dimensions, so we use dimensionality reduction techniques to bring them into lower dimensions and then visualize them. PCA is one such technique, and the Scikit-learn library supports it.
Let's see how to use Scikit-learn to reduce the dimensionality from 3 to 2.
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
#____CREATE RANDOM CLASSIFICATION DATASET____
X, y = datasets.make_classification(n_samples=300, n_features=3, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3, 0.2], random_state=42)
pca = PCA(n_components=2, svd_solver='randomized')
X_fitted = pca.fit_transform(X)      # fit PCA on X and project it onto 2 components
print("Explained Variance: %s" % (pca.explained_variance_ratio_))
#_____PLOT ORIGINAL DATA_____#
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(xs = X[:,0], ys = X[:,1], zs = X[:,2], c=y)
ax.set_title("Original 3-featured data")
ax.set_xlabel("X0")
ax.set_ylabel("X1")
ax.set_zlabel("X2")
plt.show()
#_____PLOT REDUCED DIMENSION DATA_____#
fig, ax = plt.subplots(figsize=(9, 6))
plt.title("Reduced 2-featured data")
plt.xlabel("X_fitted_0", fontsize=20)
plt.ylabel("X_fitted_1", fontsize=20)
plt.scatter(X_fitted[:,0], X_fitted[:,1], s=50, c=y)
plt.show()
As we can see, there is a reduction in the dimension from 3 to 2.
We are now ready to apply ML algorithms to the prepared Dataset and build our model. Let's see what Scikit-learn provides here.
The sklearn library provides support for various machine learning models grouped by type (linear models, tree-based, SVM-based, ensemble-based, etc.). The imports below show some standard algorithms and how they can be brought into our Python programs. Check out the complete list here.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
The general usage paradigm for Scikit-learn is:
#EX1: Creating and deploying a Supervised Learning Model
model = DecisionTreeClassifier() # Create an instance of the Decision Tree Classifier
model = model.fit(X_train,y_train) # Fit the training data into the model
model.predict(X_test) # Use model to make prediction
#EX2: Creating and deploying a Dimensionality Reduction Model
pca = PCA(n_components = 2) # Create an instance of the PCA
X_transformed_data = pca.fit_transform(X_data) # Fit and transform the data to new dimensions
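Putting the fit/predict paradigm together on real data, a minimal end-to-end sketch using the iris dataset loaded earlier might look like this:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)   # create the estimator
model.fit(X_train, y_train)                       # fit it on the training split
print(model.score(X_test, y_test))                # mean accuracy on the unseen test split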
Once the model is trained, we can evaluate its performance. Scikit-learn provides a wide range of modules to evaluate our models.
We evaluate our trained ML models on the train and test sets using the evaluation functions provided by Scikit-learn. Based on the performance, we decide whether the model suffers from problems like underfitting or overfitting.
#EX1: Evaluating the Model performance using R2-score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_pred = model.predict(X_test)
r2_score(y_test, y_pred)
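The metrics above target regression problems. For classifiers like the ones used in this article, scikit-learn also provides accuracy, confusion matrix, and classification report utilities; here is a sketch assuming a fitted classifier model and the usual X_train/X_test, y_train/y_test splits:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Comparing train vs. test accuracy helps spot overfitting or underfitting
print('Train accuracy:', accuracy_score(y_train, model.predict(X_train)))
print('Test accuracy :', accuracy_score(y_test, model.predict(X_test)))

print(confusion_matrix(y_test, model.predict(X_test)))        # per-class error breakdown
print(classification_report(y_test, model.predict(X_test)))   # precision, recall, and F1 per class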
So far, we have seen how to extract trainable data from raw data and then use it to train our ML algorithms. This complete process can be organized sequentially into what is known as a pipeline. A pipeline lets the data processing and the evaluation of a trained model run end to end. Scikit-learn can wrap this entire process into a single Pipeline object, which makes it readily employable and deployable.
For example, let's see an end-to-end pipeline building using sklearn on the 'iris' flower dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import numpy as np
iris=load_iris()
iris_data = iris.data.copy()
iris_target = iris.target
#print('Iris Data before replacing samples with NaN',iris_data)
c = 10
mask = np.ones(iris_data.shape)
mask.ravel()[np.random.choice(mask.size, c, replace=False)] = 0
#print(np.where(mask==0)) #Checking the (c=10) locations where number is replaced by NaN
iris_data[mask==0] = np.nan
#print('Iris Data after replacing samples with NaN',iris_data)
X_train,X_test,y_train,y_test=train_test_split(iris_data,iris_target,test_size=0.3,random_state=42)
pipeline=Pipeline([('Imputer',SimpleImputer(strategy='mean')),('Scalar',StandardScaler()),
('PCA',PCA(n_components=2)),('SVC',SVC(kernel = 'linear'))])
model = pipeline.fit(X_train, y_train)
print('SVM performance on Iris Classification',model.score(X_test,y_test))
#To view the data in any intermediate stage of the pipeline
imputer_output = model.named_steps["Imputer"].transform(X_train)
scalar_output = model.named_steps["Scalar"].transform(imputer_output)
pca_output = model.named_steps["PCA"].transform(scalar_output)
model_output = model.named_steps["SVC"].predict(pca_output)
#SVM performance on Iris Classification 0.9333333333333333
To build this pipeline, we need to import Pipeline from the sklearn.pipeline module. The pipeline takes as input the different transformations we want to apply to our dataset. Suppose we want to start with imputation.
The iris dataset has no missing values, so we will randomly replace some values in the dataset with NaN (not a number) to observe the pipeline at work.
Now that the data is ready, we will split it into training and testing sets. Sklearn provides train_test_split, which can split the data into the desired fractions, as sketched below.
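For instance, a quick sketch of the split and the resulting shapes (the 70/30 split mirrors the pipeline code above):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4) -> 70% for training, 30% for testing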
The next step is building the pipeline. The pipeline takes its input as a list of tuples: the element at index 0 of each tuple is the desired name for the step, and the element at index 1 is the transformation (or estimator) to be applied. Our pipeline consists of the following steps: imputation with SimpleImputer, scaling with StandardScaler, dimensionality reduction with PCA, and classification with SVC.
We can then fit the pipeline on the training dataset and compute the accuracy on the test dataset. To view the output of any intermediate step, use named_steps["transformation_name"] as shown in the code above. This allows us to inspect intermediate results and understand how the pipeline works.
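Because the pipeline behaves like any other estimator, it can also be handed as a whole to model-selection utilities such as cross_val_score; here is a minimal sketch reusing the pipeline, iris_data, and iris_target objects from the code above:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the imputer, scaler, and PCA are re-fit on each training fold
scores = cross_val_score(pipeline, iris_data, iris_target, cv=5)
print(scores.mean(), scores.std())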
Scikit-learn is an open-source Machine Learning library built on top of famous Python packages, and it provides support for every stage of Machine Learning model development. This article covered an essential introduction to the Scikit-learn library from a beginner's perspective and discussed its support for every stage of ML model development and deployment. We hope you found the article enjoyable and learned something new.
Enjoy Learning! Enjoy Algorithms!
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.