Much of what we learn comes through vision: we observe, we analyze, and we learn. Machine learning models can now do something similar and recognize the characters present in images. Identifying characters using machine learning or computer vision has become very popular in industry. Google, Microsoft, Twitter, MathWorks, and many other technology companies use Optical Character Recognition (OCR) techniques to solve various tasks, including spam classification, automatic reply, and number-plate detection.
Character recognition is a primary step in recognizing whether any text or character is present in the image or not. We design algorithms for our machines to make them able to recognize characters present inside the image. To understand it clearly, let’s take an example.
We have all seen movies where cops trace vehicles involved in crimes using their number plates. Imagine a scenario where we have a list of all such number plates associated with several crime scenes. Cops want to trace all these vehicles, so they install cameras at several check-posts on the road. Manually observing the camera output for every vehicle crossing a check-post, and then checking whether that vehicle’s number is present in the crime list, would be highly inefficient.
So, let’s help our cops by providing them with a more sophisticated solution using machine learning. Cops don’t have to check any vehicle manually. Our machine learning model will take the camera images as an input, recognize the characters on the number plate and automatically check whether that vehicle number is present in the crime list or not.
The image below demonstrates a similar machine learning model identifying the characters on a vehicle’s number plate. This recognition can be mapped to data association problems, such as identifying the vehicle with a particular ID and checking whether it was present at a specific location or violated any traffic law.
Deep-learning networks perform best when the data is in image form. Deep learning algorithms can extract the hidden features present in an image and complete the recognition task very well. As a trade-off, deep learning models are computationally heavy and hard to explain.
This is where classical machine learning algorithms come to the rescue. We do need deeper neural networks to solve complex tasks involving image data, but a linear model can do a decent job on specific, simple image classification tasks. Compared with deep networks, such linear models are computationally lighter and easier to explain.
The problem of simple character recognition can be solved using algorithms like Multi-Layer Perceptron (MLP), SVMs, Logistic Regression, etc. This article will describe the steps to implement a Logistic Regression classifier for identifying the numbers in the image. For this task, we will be using the famous MNIST dataset. But before that, we need to understand some basic things related to image data.
Computers read an image as a matrix whose entries are pixel values. Each value represents the color intensity at a particular location in the image and, for standard 8-bit images, lies in the range [0, 255]. Images are generally stored in the RGB (Red-Green-Blue) format, so computers represent them as a 3-dimensional matrix in which the third dimension holds the color channels.
The image below represents one sample form of the “image-matrix”. It is a 3D matrix with three dimensions [height, width, number of channels].
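As a small illustration of this layout, the sketch below builds a tiny synthetic image with NumPy instead of loading a file. The array name `tiny_image` and its 2x3 size are made up for this example.

```python
import numpy as np

# A hypothetical 2x3 image with 3 color channels: shape is (height, width, channels).
# dtype uint8 restricts every pixel value to the range [0, 255].
tiny_image = np.zeros((2, 3, 3), dtype=np.uint8)

# Set the top-left pixel to (255, 0, 0) and the bottom-right pixel to (0, 0, 255).
# Whether the first channel means red or blue depends on the library's channel order.
tiny_image[0, 0] = [255, 0, 0]
tiny_image[1, 2] = [0, 0, 255]

height, width, channels = tiny_image.shape
print("Height =", height, "Width =", width, "Channels =", channels)
print("Smallest value:", tiny_image.min(), "Largest value:", tiny_image.max())
```

Indexing with `tiny_image[row, column]` returns the three channel values of one pixel, which is the same access pattern used on real images later in the article.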
Several Python libraries, such as OpenCV and Pillow, can be used to read images. To install the opencv-python package, run the command below in the terminal (for Mac and Linux systems) or command prompt (for Windows systems).
pip install opencv-python # for Windows
sudo pip3 install opencv-python # for Mac and Linux
The following code reads a file named “download.jpeg” located in the same directory as the Python script.
import cv2
import numpy as np

# cv2.imread returns None if the file cannot be read.
# OpenCV loads color images in BGR channel order.
img = cv2.imread('download.jpeg')
print("Height = ", img.shape[0], "Width = ", img.shape[1], "Channel = ", img.shape[2])
print("Type of img variable : ", type(img))
cv2.imshow('image', img)
cv2.waitKey(0)  # wait for a key press before closing the window
## Height =  225 Width =  225 Channel =  3
## Type of img variable :  <class 'numpy.ndarray'>
We may not need color information for smaller tasks. In that case, we can drop the color channels and convert the image to grayscale, which transforms our 3-dimensional matrix into a 2-dimensional one.
img3 = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print("Image shape : ", img3.shape)
# Image shape :  (225, 225)
# The image below shows the conversion of the color image to grayscale
# using the cv2 library. Note the new shape of the image.
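To make the drop from three dimensions to two concrete, here is a rough NumPy-only sketch of a grayscale conversion. It assumes the common luminance weights 0.299, 0.587, and 0.114 for red, green, and blue. The helper `to_grayscale` and the random test image are hypothetical and are not part of OpenCV.

```python
import numpy as np

def to_grayscale(bgr_image):
    """Collapse a (height, width, 3) BGR image into a (height, width) grayscale image."""
    # Weights are listed in B, G, R order to match OpenCV's channel layout.
    weights = np.array([0.114, 0.587, 0.299])
    gray = bgr_image.astype(np.float64) @ weights
    return np.round(gray).astype(np.uint8)

# A random stand-in for a real 225x225 color image.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(225, 225, 3), dtype=np.uint8)
gray_image = to_grayscale(fake_image)

print("Color shape :", fake_image.shape)
print("Gray shape  :", gray_image.shape)
```

Each output pixel is a single weighted sum of the three channel values, which is why the channel dimension disappears from the shape.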
Now, we are ready to understand our Digit prediction MNIST dataset. So let’s start without any further delay.
MNIST is an open-source dataset of grayscale images of hand-written digits, each 28x28 pixels in size, and it can be loaded conveniently through the Scikit-learn framework.
The above image is a sample from the MNIST hand-written digit dataset. We will use such samples to show how a linear model can recognize hand-written digits.
Digits can take any value from [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], so we can say that we have ten classes in the output data which we want to predict using any classification algorithm. The input of this algorithm will be the processed image data. For cases where the image is well enhanced (in a human-readable format), no restoration (cleaning of images) is needed. In such scenarios, a linear model can deliver a decent level of accuracy.
We will be using the sklearn library to import the data. The MNIST dataset has images of size 28x28 which, when flattened out, give a vector of size 784x1 (28*28 = 784). The fetch_openml function from sklearn.datasets downloads the already flattened dataset from OpenML, where it is stored under the name ‘mnist_784’.
from sklearn.datasets import fetch_openml

X, y = fetch_openml('mnist_784', version=1,
                    return_X_y=True, as_frame=False)
print("Shape of input : ", X.shape, "Shape of target : ", y.shape)
# Shape of input :  (70000, 784) Shape of target :  (70000,)
There are 70000 images in this dataset, so loading will take a little time.
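The flattening mentioned above can be sketched with NumPy on a made-up array. The values here simply count upward and do not come from MNIST, and the variable names are chosen for illustration only.

```python
import numpy as np

# A hypothetical 28x28 "image" whose pixel values simply count upward from 0.
image_2d = np.arange(28 * 28).reshape(28, 28)

# Flattening concatenates the rows, giving one feature vector of length 784.
flat = image_2d.reshape(-1)
print("Flattened shape :", flat.shape)

# The pixel at (row, col) ends up at position row * 28 + col.
row, col = 3, 5
print("Same pixel      :", image_2d[row, col] == flat[row * 28 + col])

# Reshaping back recovers the original image exactly.
restored = flat.reshape(28, 28)
print("Round trip OK   :", np.array_equal(image_2d, restored))
```

Because the mapping is reversible, a row of the flattened dataset can always be reshaped to 28x28 again for plotting.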
If we visualize three samples from our dataset, it will be like the image shown below.
from matplotlib import pyplot as plt
import numpy as np

plt.figure()
for idx, image in enumerate(X[:3]):
    plt.subplot(1, 3, idx + 1)
    # each row of X is a flattened image, so reshape it back to 28x28
    plt.imshow(np.reshape(image, (28, 28)), cmap=plt.cm.gray)
    plt.title('Training: ' + y[idx], fontsize=20);
We have chosen the Logistic Regression model, so let’s quickly split the data into training and testing sets.
There are 70000 samples, so let’s choose 60k samples for training and 10k for testing purposes. We will use the train_test_split function of Scikit-learn.
from sklearn.model_selection import train_test_split
y = [int(i) for i in y]  # targets are strings, so convert them to int
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=1/7, random_state=0)
print("training samples shape = ", X_train.shape)
print("testing samples shape = ", X_test.shape)
## training samples shape = (60000, 784)
## testing samples shape = (10000, 784)
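Conceptually, a random split shuffles the sample indices and cuts them into two parts. The NumPy sketch below illustrates that idea on indices alone. It is not a description of how train_test_split is implemented internally, and `n_samples`, `n_test`, and the other names are chosen for illustration only.

```python
import numpy as np

n_samples = 70000
n_test = round(n_samples * (1 / 7))   # one seventh of 70000 samples

# Shuffle the indices 0..69999 reproducibly, then cut them into two parts.
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_samples)

test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print("training samples =", train_idx.size)
print("testing samples  =", test_idx.size)
# No sample lands in both sets.
print("overlap          =", np.intersect1d(train_idx, test_idx).size)
```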
We can quickly verify whether the data is balanced by plotting a histogram of how often each of the 10 labels occurs in our dataset.
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.hist(y_train, bins=20, rwidth=0.9, color='#607c8e');
plt.title('Frequency of different classes - Training Set');
plt.subplot(1,2,2)
plt.hist(y_test, bins=20, rwidth=0.9, color='#607c8e');
plt.title('Frequency of different classes - Test set');
The classes are not perfectly balanced, but the difference in sample counts between any two labels is small, so we can skip the step of balancing the data.
Learners are free to do the balancing step. If the accuracy improves, please leave a message for us in the chat section.
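For a numeric check alongside the histogram, class frequencies can also be counted directly. The sketch below uses a short, made-up label array in place of y_train, so the counts shown are not those of MNIST.

```python
import numpy as np

# Hypothetical labels standing in for y_train; the real labels come from MNIST.
labels = np.array([0, 1, 1, 2, 2, 2, 9, 9, 7, 4])

counts = np.bincount(labels, minlength=10)   # counts[d] = number of samples with digit d
fractions = counts / counts.sum()

for digit, (count, fraction) in enumerate(zip(counts, fractions)):
    print(f"digit {digit}: {count} samples ({fraction:.1%})")

# Ratio between the most and least frequent classes that actually occur.
present = counts[counts > 0]
print("imbalance ratio:", present.max() / present.min())
```

A ratio close to 1 suggests the classes are roughly balanced; what counts as "close enough" is a judgment call for the task at hand.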
Logistic Regression falls in the category of Generalized Linear Models (GLM) and is very much like linear regression, except that it predicts categorical target variables. Its output values are probabilities between 0 and 1, which are used to assign each observation to a particular category.
The problem that we are solving comes under the category of Multinomial Logistic Regression, which we discussed in this blog. The model will predict the presence of each label and finally give the output of that label for which it was most confident.
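The "most confident label" idea can be sketched in a few lines of NumPy. This is an illustration of the softmax formulation of multinomial logistic regression with random placeholder weights, not the trained model from this article; `W`, `b`, `X_demo`, and `softmax` are names introduced only for this example.

```python
import numpy as np

def softmax(scores):
    """Turn rows of raw class scores into rows of probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # improves numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_features, n_classes = 784, 10

# Placeholder parameters; a trained model would supply learned values instead.
W = rng.normal(scale=0.01, size=(n_features, n_classes))
b = np.zeros(n_classes)

X_demo = rng.random((4, n_features))      # four fake flattened images
probabilities = softmax(X_demo @ W + b)   # shape (4, 10): one probability per class
predicted = probabilities.argmax(axis=1)  # most confident class for each image

print("Row sums   :", probabilities.sum(axis=1))
print("Predictions:", predicted)
```

Each image gets one linear score per digit, the scores are normalized into probabilities, and the class with the highest probability becomes the prediction.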
We will use the Logistic Regression model imported from sklearn.linear_model with tunable parameters such as the strength of regularization and penalty type.
As the data size is significant, the model will take some time in training. Be patient :).
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(fit_intercept=True,
                           multi_class='auto',
                           penalty='l1',   # L1 (lasso-style) regularization
                           solver='saga',  # saga supports the L1 penalty
                           max_iter=1000,
                           C=50,           # inverse of regularization strength
                           verbose=2,      # output progress
                           n_jobs=5,       # parallelize over 5 processes
                           tol=0.01
                           )
model
# LogisticRegression(C=50, max_iter=1000, n_jobs=5, penalty='l1',
#                    solver='saga', tol=0.01, verbose=2)
model.fit(X_train, y_train)
### convergence after 48 epochs took 131 seconds
This is a classification problem, and we have discussed the different evaluation metrics used to evaluate a classification model here. We can check the training and testing accuracies directly by printing model.score(X_train, y_train) and model.score(X_test, y_test). Later, we are going to plot the confusion matrix.
print("Training Accuracy = ", np.around(model.score(X_train, y_train)*100,3))
print("Testing Accuracy = ", np.around(model.score(X_test, y_test)*100, 3))
## Training Accuracy = 93.742
## Testing Accuracy = 91.94
from sklearn import metrics

pred_y_test = model.predict(X_test)
cm = metrics.confusion_matrix(y_true=y_test,
                              y_pred=pred_y_test,
                              labels=model.classes_)

# Let's see this confusion matrix using the seaborn library
import seaborn as sns

plt.figure(figsize=(12, 12))
sns.heatmap(cm, annot=True,
            linewidths=.5, square=True, cmap='Blues_r', fmt='0.4g');
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
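To show what the heatmap encodes, the sketch below builds a small confusion matrix by hand for a made-up 3-class problem and derives accuracy from it. The label arrays are hypothetical and unrelated to the MNIST predictions.

```python
import numpy as np

# Hypothetical true and predicted labels for a 3-class problem.
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2])

n_classes = 3
cm_demo = np.zeros((n_classes, n_classes), dtype=int)
# Rows index the actual label, columns index the predicted label.
np.add.at(cm_demo, (y_true_demo, y_pred_demo), 1)
print(cm_demo)

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = np.trace(cm_demo) / cm_demo.sum()
print("Accuracy:", round(accuracy, 3))

# Per-class recall: the fraction of each actual class that was predicted correctly.
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)
print("Recall per class:", recall)
```

Off-diagonal cells show which pairs of classes get confused with each other, which is the information the accuracy number alone hides.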
There are some misclassifications. Let’s visualize some misclassified images as well. First, let’s extract the 10 sample indexes for which our model did not predict the true class.
misclassified_images = []
for index, (label, predict) in enumerate(zip(y_test, pred_y_test)):
    if label != predict:
        misclassified_images.append(index)
    if len(misclassified_images) == 10:  # stop after the first ten mistakes
        break
print("Ten Indexes are : ", misclassified_images)
## Ten Indexes are : [4, 5, 18, 61, 78, 82, 129, 134, 141, 161]
Now, let’s plot the images, their actual values, and corresponding predicted values.
plt.figure(figsize=(10, 10))
plt.suptitle('Misclassifications');
for plot_index, bad_index in enumerate(misclassified_images):
    p = plt.subplot(4, 5, plot_index + 1)  # 4x5 plot
    p.imshow(X_test[bad_index].reshape(28, 28), cmap=plt.cm.gray,
             interpolation='bilinear')
    p.set_xticks(()); p.set_yticks(())  # remove ticks
    p.set_title(f'Pred: {pred_y_test[bad_index]}, Actual: {y_test[bad_index]}');
Learners are advised to vary the model hyper-parameters, for example by sweeping C from 5e-4 to 500. Note that C is the inverse of the regularization strength, so small values of C mean strong regularization. Also, see the effect of different types of scaling-based pre-processing.
First, we will see how the model performs under different scenarios, using various combinations of pre-processing type and regularization strength. To evaluate the model, we will use accuracy as the metric.
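One possible way to run such an experiment is sketched below. To keep it fast, it makes several assumptions that differ from the main example: it uses scikit-learn's small built-in digits dataset (8x8 images) instead of MNIST, the default L2 penalty and solver instead of L1 with saga, a simple divide-by-16 scaling, and three arbitrarily chosen values of C. The resulting accuracies therefore say nothing about the MNIST model above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small stand-in dataset: 8x8 digit images with pixel values in [0, 16].
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0)

results = {}
for scaled in (False, True):
    # Simple scaling: map pixel values from [0, 16] to [0, 1].
    factor = 16.0 if scaled else 1.0
    for C in (5e-4, 0.5, 500):
        clf = LogisticRegression(C=C, max_iter=5000)
        clf.fit(X_tr / factor, y_tr)
        results[(scaled, C)] = clf.score(X_te / factor, y_te)

for (scaled, C), acc in results.items():
    print(f"scaled={scaled!s:5}  C={C:<7}  test accuracy={acc:.3f}")
```

Comparing the rows with and without scaling at the same C gives a first impression of how strongly the two choices interact.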
Let us view the weight matrix corresponding to each output class.
coef = model.coef_.copy()
scale = np.abs(coef).max()  # symmetric color range around zero
plt.figure(figsize=(13, 7))
for i in range(10):  # 0-9
    coef_plot = plt.subplot(2, 5, i + 1)  # 2x5 plot
    coef_plot.imshow(coef[i].reshape(28, 28),
                     cmap=plt.cm.RdBu,
                     vmin=-scale, vmax=scale,
                     interpolation='bilinear')
    coef_plot.set_xticks(()); coef_plot.set_yticks(())  # remove ticks
    coef_plot.set_xlabel(f'Class {i}')
plt.suptitle('Coefficients for various classes');
When plotted, the weight matrix shows which pixels the model relies on when deciding in favor of each class. This is a benefit of classical algorithms over deep learning, where the weights are spread across many layers and are much harder to interpret in this direct way.
Character recognition has become an integral part of computer vision, and several corporations are actively working to improve upon the current state of the art. Several industries are developing algorithms that categorize hand-written digits and notes precisely, and tools for Optical Character Recognition (OCR) are now used in practice for such objectives.
Google Cloud Vision lets algorithm designers integrate optical character recognition (OCR) and vision detection features, including image labeling. These algorithms can be easily incorporated into the user’s overall pipeline. Check out the documentation to learn how to use such APIs and the comprehensive package of their services.
Microsoft Azure’s Computer Vision API includes Optical Character Recognition (OCR) capabilities that extract printed or hand-written text from images. This API provides developers with access to advanced image processing algorithms that have attained the current state-of-the-art. This API works in several languages, making it one of the most used character recognition APIs.
Based on this project, interviewers can ask these questions:
This article discussed one of the most widely used machine learning applications: optical character recognition. We discussed the steps to implement a linear classifier on the MNIST dataset and evaluated its performance. After that, we discussed some companies that currently offer this technology. We hope you enjoyed the article and saw how effective even linear machine learning models can be.