Optical Character Recognition Using Machine Learning

Introduction

Humans learn a great deal through sight: we observe, we analyze, and we learn. Machine learning models can now do something similar and recognize the characters present in images. Identifying characters with machine learning or computer vision has become very popular across industries. Google, Microsoft, Twitter, MathWorks, and many other technology giants use Optical Character Recognition (OCR) techniques to solve various tasks, including spam classification, automatic reply, and number-plate detection.

Key takeaways from this blog

What is character recognition?
Basic overview of image data.
Gist about the MNIST data.
Logistic Regression in action.
Implementation of the Logistic Regression.
Evaluation of model.
Company use-case for this machine learning application.
Possible interview questions on this project.

What is character recognition?

Character recognition is the task of detecting whether any text is present in an image and identifying which characters it contains. We design algorithms that enable machines to recognize the characters inside an image. To understand it clearly, let’s take an example.

We have all seen movies where cops trace vehicles involved in crimes using their number plates. Imagine a scenario where we have a list of all such number plates associated with several crime scenes. Cops want to trace all these vehicles, so they install cameras at several check-posts along the road. Manually watching the camera output for every vehicle crossing a check-post, and then checking whether that vehicle’s number is in the crime list, would be highly inefficient.

So, let’s help our cops by providing them with a more sophisticated solution using machine learning. Cops don’t have to check any vehicle manually. Our machine learning model will take the camera images as an input, recognize the characters on the number plate and automatically check whether that vehicle number is present in the crime list or not.

An efficient way, right?
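The final lookup step of this scenario can be sketched in a few lines of plain Python. This is only an illustration: the helper names `normalize_plate` and `is_flagged` and the plate numbers are hypothetical, and the recognized text is assumed to come from some OCR model upstream.

```python
import re

def normalize_plate(text):
    """Upper-case the OCR output and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())

def is_flagged(recognized_text, watchlist):
    """Return True if the recognized plate is on the watchlist."""
    return normalize_plate(recognized_text) in watchlist

# Hypothetical watchlist of plate numbers, stored in normalized form
watchlist = {normalize_plate(p) for p in ["MH 12 AB 1234", "DL-01-C-0042"]}

print(is_flagged("mh12 ab-1234", watchlist))   # True
print(is_flagged("KA 05 MN 7777", watchlist))  # False
```

Storing the list as a set keeps each lookup cheap, which matters when every passing vehicle triggers a check.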

The image shown below demonstrates a similar machine learning model identifying the characters on a vehicle’s number plate. This recognition can be mapped to data association problems, such as identifying the vehicle with a particular ID and checking whether it was present at a specific location or has violated any traffic law.

How to use optical character recognition in real-life?

Deep-learning networks perform best when the data is in image form. Deep learning algorithms can extract the hidden features present in an image and complete the recognition task very well. But as a consequence, deep learning models are computationally heavy, and their predictions are hard to explain.

This is where classical machine learning algorithms come to the rescue. Indeed, we need a deeper neural network to solve complex tasks involving image data. Still, a linear model can do a decent job on specific, simple image classification tasks. The advantages of these linear models are:

They are computationally cheap and hence can be easily deployed on IoT devices.
They are efficient in terms of time and provide predictions in real-time.
The reason behind any prediction can be easily explained.

The problem of simple character recognition can be solved using algorithms like Multi-Layer Perceptron (MLP), SVMs, Logistic Regression, etc. This article will describe the steps to implement a Logistic Regression classifier for identifying the numbers in the image. For this task, we will be using the famous MNIST dataset. But before that, we need to understand some basic things related to image data.

Standard Description About Image Data

Computers read images as matrices whose entries are pixel values. These values encode the color present at each location in the image and lie in the range [0, 255]. Generally, images are stored in the RGB (Red-Green-Blue) format. To represent such an image, computers use a 3-dimensional matrix in which the third dimension holds the color channels, one per color.

Color representation in 0-255 format

The image below represents one sample form of the “image-matrix”. It is a 3D matrix with three dimensions [height, width, number of channels].

How computers read images?
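To make the [height, width, number of channels] layout concrete, here is a minimal NumPy sketch that builds a tiny image by hand. The pixel values are made up, and the channel order is assumed to be RGB (some libraries, such as OpenCV, load images in BGR order instead).

```python
import numpy as np

# A tiny 2x3 RGB image: shape is (height, width, channels)
image = np.zeros((2, 3, 3), dtype=np.uint8)

image[0, 0] = [255, 0, 0]      # top-left pixel: pure red
image[0, 1] = [0, 255, 0]      # next pixel: pure green
image[1, 2] = [255, 255, 255]  # bottom-right pixel: white

print(image.shape)               # (2, 3, 3)
print(image[:, :, 0])            # the red channel as a 2x3 matrix
print(image.min(), image.max())  # 0 255
```

Indexing with `image[row, col]` returns the three channel values of one pixel, while `image[:, :, c]` returns one full color channel.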

There are several libraries in Python, like OpenCV and Pillow, that can be used to read images. To install opencv-python, run the command below in the terminal (for Mac and Linux systems) or command prompt (for Windows systems).

pip install opencv-python # for Windows
sudo pip3 install opencv-python # for Mac and Linux

The code below reads a file named “download.jpeg” located in the same directory as the Python script.

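import cv2

img = cv2.imread('download.jpeg')  # returns None if the file cannot be read
if img is None:
    raise FileNotFoundError("Could not read 'download.jpeg'")

print("Height = ", img.shape[0], "Width = ", img.shape[1], "Channel = ", img.shape[2])
print("Type of img variable : ", type(img))

cv2.imshow('image', img)
cv2.waitKey(0)            # wait for a key press
cv2.destroyAllWindows()   # close the display window

## Height = 225, Width = 225, Channel = 3
## Type of img variable : <class 'numpy.ndarray'>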

For smaller tasks, we may not need color information. In that case, we can drop the color channels and convert the image to grayscale, which transforms our 3-dimensional matrix into a 2-dimensional one.

img3 = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # OpenCV loads images in BGR order
print("Image shape : ", img3.shape)

# Image shape :  (225, 225)
# The image below shows the conversion of the color image to grayscale
# using the cv2 library. Please note the new dimensions of the image.
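To see what such a conversion does to the matrix, here is a minimal NumPy sketch of a weighted grayscale conversion. It assumes the standard luminance weights (0.299 for red, 0.587 for green, 0.114 for blue) and BGR channel order; the function name `bgr_to_gray` is hypothetical, and the rounding may differ slightly from OpenCV's own implementation.

```python
import numpy as np

def bgr_to_gray(img_bgr):
    """Collapse a (height, width, 3) BGR image into a (height, width) grayscale image."""
    weights = np.array([0.114, 0.587, 0.299])     # B, G, R luminance weights
    gray = img_bgr.astype(np.float64) @ weights   # weighted sum over the channel axis
    return np.round(gray).astype(np.uint8)

# 1x3 image in BGR order: pure blue, pure green, pure red
img_bgr = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
gray = bgr_to_gray(img_bgr)

print(gray.shape)     # (1, 3)
print(gray.tolist())  # [[29, 150, 76]]
```

Green contributes the most to the gray value and blue the least, which roughly matches how bright each color appears to the human eye.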

Now, we are ready to understand our Digit prediction MNIST dataset. So let’s start without any further delay.

Dataset description

MNIST is an open-source dataset of 28x28-pixel grayscale images of hand-written digits, and it can be loaded conveniently through the Scikit-learn framework.

MNIST dataset sample present in Scikit-learn library

The above image is a sample from the MNIST hand-written digit dataset. We will use such samples to show how a linear model can recognize hand-written digits.

Digits can take any value from [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], so the output data has ten classes that we want to predict using a classification algorithm. The input of this algorithm will be the processed image data. For cases where the image is already clean (in a human-readable format), no restoration (cleaning of images) is needed. In such scenarios, a linear model can deliver a decent level of accuracy.

Dataset

We will be using the sklearn library to import the data. The MNIST dataset has images of size 28x28 which, when flattened, give a vector of length 784 (28*28 = 784). The fetch_openml function from sklearn.datasets downloads the already flattened dataset from OpenML, where it is stored under the name ‘mnist_784’.

from sklearn.datasets import fetch_openml

X, y = fetch_openml('mnist_784', version=1, 
                    return_X_y=True, as_frame=False)

print("Shape of input : ", X.shape, "Shape of target : ", y.shape)


# Shape of input :  (70000, 784) Shape of target :  (70000,)

There are 70000 images in this dataset; hence loading will take a little time.
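The flattening mentioned above can be illustrated with a small NumPy sketch. The array here is a stand-in filled with the numbers 0 to 783 rather than real pixels, and row-major ordering is assumed, which is what a plain reshape to (28, 28) relies on.

```python
import numpy as np

# A stand-in for one 28x28 grayscale image (values 0..783 instead of real pixels)
image = np.arange(28 * 28).reshape(28, 28)

flat = image.reshape(-1)         # row-major flattening -> shape (784,)
restored = flat.reshape(28, 28)  # inverse operation, useful for plotting

print(flat.shape)                       # (784,)
print(np.array_equal(image, restored))  # True
# The pixel at (row, col) sits at index row * 28 + col in the flat vector
print(flat[3 * 28 + 5] == image[3, 5])  # True
```

Flattening loses no information, but the model no longer sees which pixels are neighbors; each of the 784 positions is treated as an independent feature.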

Visualize the dataset

If we visualize three samples from our dataset, it will be like the image shown below.

from matplotlib import pyplot as plt
import numpy as np

plt.figure()
for idx, image in enumerate(X[:3]):
    plt.subplot(1, 3, idx + 1)
    plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
    plt.title('Training: ' + y[idx], fontsize = 20);

Training samples used for logistic regression model training

We have chosen the Logistic Regression model, so let’s quickly split the data into training and testing sets.

Data Splitting

There are 70000 samples, so let’s choose 60k samples for training and 10k for testing. We will use the train_test_split function of Scikit-learn.

from sklearn.model_selection import train_test_split
y = [int(i) for i in y]  # targets are strings, so convert them to int

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=1/7,random_state=0)
print("training samples shape = ", X_train.shape)
print("testing samples shape = ", X_test.shape)


## training samples shape =  (60000, 784)
## testing samples shape =  (10000, 784)

We can quickly verify whether the data is balanced by plotting the histogram plot for the frequency of all the samples corresponding to 10 labels present in our dataset.

bins = np.arange(11) - 0.5  # one bin centered on each digit 0-9

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.hist(y_train, bins=bins, rwidth=0.9, color='#607c8e');
plt.title('Frequency of different classes - Training Set');

plt.subplot(1,2,2)
plt.hist(y_test, bins=bins, rwidth=0.9, color='#607c8e');
plt.title('Frequency of different classes - Test set');

Training and Testing sample count presented as the bar graph

The labels are not perfectly balanced, but the difference between the sample counts of any two labels is small, so we can skip the step of balancing the data.

Learners are free to do the balancing step. If the accuracy improves, please leave a message for us in the chat section.
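Besides looking at a plot, the balance can also be checked numerically. The sketch below is one possible way to do it; the helper name `class_distribution` is hypothetical, and the toy labels are made-up values standing in for y_train, not real MNIST counts.

```python
import numpy as np

def class_distribution(labels, n_classes=10):
    """Return per-class counts and the ratio between the largest and smallest class."""
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=n_classes)
    # Note: this ratio is undefined if any class has zero samples
    return counts, counts.max() / counts.min()

# Toy labels standing in for y_train (hypothetical values)
toy_labels = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9]
counts, ratio = class_distribution(toy_labels)

print(counts.tolist())  # [1, 2, 3, 2, 2, 2, 2, 2, 2, 2]
print(ratio)            # 3.0
```

A ratio close to 1 suggests that the classes are nearly balanced, while a large ratio is a hint that balancing may be worth trying.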

Model Building

Logistic Regression falls in the category of Generalized Linear Models (GLM) and is very much like linear regression, except that it predicts categorical target variables. This means that the outputs of Logistic Regression are probability values between 0 and 1, which are used to classify an observation into a particular category.

The problem that we are solving comes under the category of Multinomial Logistic Regression, which we discussed in this blog. The model will predict the presence of each label and finally give the output of that label for which it was most confident.
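The idea of scoring every label and returning the most confident one can be sketched with NumPy as follows. This is a simplified illustration of a multinomial (softmax) model, not Scikit-learn's internal code; the weights are random stand-ins, and the names `softmax` and `predict_digit` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def predict_digit(x, W, b):
    """x: (784,) pixel vector, W: (10, 784) weights, b: (10,) intercepts."""
    probs = softmax(W @ x + b)            # one probability per digit class
    return int(np.argmax(probs)), probs   # the most confident class wins

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(10, 784))  # random stand-in weights
b = np.zeros(10)
x = rng.random(784)                         # stand-in for one flattened image

label, probs = predict_digit(x, W, b)
print(label, probs.sum())
```

Each row of W plays the role of a template for one digit, and the predicted label is simply the row whose weighted sum with the image is the largest.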

We will use the Logistic Regression model imported from sklearn.linear_model with tunable parameters such as the strength of regularization and penalty type.

Handwritten digit recognition project pipeline using logistic regression model

As the data size is significant, the model will take some time in training. Be patient :).

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    fit_intercept=True,
    multi_class='auto',
    penalty='l1',    # L1 (lasso-style) penalty
    solver='saga',   # saga supports the L1 penalty
    max_iter=1000,
    C=50,            # inverse of regularization strength
    verbose=2,       # output progress
    n_jobs=5,        # parallelize over 5 processes
    tol=0.01,
)

model

# LogisticRegression(C=50, max_iter=1000, n_jobs=5, penalty='l1',
#                    solver='saga', tol=0.01, verbose=2)


model.fit(X_train, y_train)

### convergence after 48 epochs took 131 seconds

Model Evaluation

This is a classification problem, and we have discussed the different metrics used to evaluate classification models here. We can directly print the training and testing accuracies using model.score(X_train, y_train) and model.score(X_test, y_test). Later, we are going to plot the confusion matrix.

print("Training Accuracy = ", np.around(model.score(X_train, y_train)*100, 3))
print("Testing Accuracy = ", np.around(model.score(X_test, y_test)*100, 3))

## Training Accuracy =  93.742
## Testing Accuracy =  91.94

from sklearn import metrics
pred_y_test = model.predict(X_test)

cm = metrics.confusion_matrix(y_true=y_test,
                              y_pred=pred_y_test,
                              labels=model.classes_)

# Let's see this confusion matrix using the seaborn library
import seaborn as sns
plt.figure(figsize=(12,12))

sns.heatmap(cm, annot=True, 
            linewidths=.5, square = True, cmap = 'Blues_r', fmt='0.4g');

plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Confusion matrix for the logistic regression model used for handwritten digit recognition
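A confusion matrix also lets us compute per-class metrics. The sketch below derives the recall of each class and the overall accuracy from a small 3-class matrix with made-up counts; it assumes rows are actual labels and columns are predicted labels, and the helper name `per_class_recall` is hypothetical.

```python
import numpy as np

def per_class_recall(cm):
    """Recall per class: correct predictions (diagonal) / actual samples (row sums)."""
    return np.diag(cm) / cm.sum(axis=1)

# Small 3-class confusion matrix with made-up counts
# (rows = actual label, columns = predicted label)
cm = np.array([[50,  2,  3],
               [ 4, 40,  6],
               [ 1,  9, 35]])

recall = per_class_recall(cm)
accuracy = np.trace(cm) / cm.sum()  # all correct predictions / all samples

print(np.round(recall, 3).tolist())  # [0.909, 0.8, 0.778]
print(round(float(accuracy), 3))     # 0.833
```

Classes with noticeably lower recall than the rest are the ones the model confuses most often, and the off-diagonal entries in their rows show which digits they are mistaken for.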

There are some misclassifications. Let’s visualize some misclassified images as well. First, let’s extract the 10 sample indexes for which our model did not predict the true class.

index = 0
misclassified_images = []
for label, predict in zip(y_test, pred_y_test):
    if label != predict: 
        misclassified_images.append(index)
    index +=1
    
    if len(misclassified_images) == 10:
        break
print("Ten Indexes are : ",misclassified_images)

## Ten Indexes are :  [4, 5, 18, 61, 78, 82, 129, 134, 141, 161]
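The same indexes can also be found without an explicit loop. Here is a small vectorized sketch using NumPy; the function name `first_misclassified` and the toy labels are hypothetical.

```python
import numpy as np

def first_misclassified(y_true, y_pred, k=10):
    """Return the indexes of the first k samples where prediction != true label."""
    mismatches = np.asarray(y_true) != np.asarray(y_pred)
    return np.flatnonzero(mismatches)[:k].tolist()

# Toy labels and predictions (hypothetical values)
y_true = [7, 2, 1, 0, 4, 1, 4, 9]
y_pred = [7, 2, 1, 0, 9, 7, 4, 9]

print(first_misclassified(y_true, y_pred, k=10))  # [4, 5]
```

On large test sets, the vectorized comparison is usually faster than a Python loop, although for ten indexes either approach works fine.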

Now, let’s plot the images, their actual values, and corresponding predicted values.

plt.figure(figsize=(10,10))
plt.suptitle('Misclassifications');
for plot_index, bad_index in enumerate(misclassified_images):
    p = plt.subplot(4,5, plot_index+1) # 4x5 grid; only the first 10 cells are used
    
    p.imshow(X_test[bad_index].reshape(28,28), cmap=plt.cm.gray,
            interpolation='bilinear')
    p.set_xticks(()); p.set_yticks(()) # remove ticks
    
    p.set_title(f'Pred: {pred_y_test[bad_index]}, Actual: {y_test[bad_index]}');

Wrong classified samples by the logistic regression model on MNIST dataset

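Learners are advised to vary the regularization hyper-parameter C over a wide range, for example from 5e-4 to 500. Note that in Scikit-learn, C is the inverse of the regularization strength, so smaller values mean stronger regularization. Also, see the effect of different types of scaling-based pre-processing.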
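One possible way to set up such an experiment is sketched below. To keep the run short, it uses the small 8x8 digits dataset bundled with Scikit-learn instead of MNIST, and the default L2-penalized solver instead of the saga/L1 setup above, so the numbers it prints are not comparable with the results in this article. The choice of scalers and C values is only an assumption for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small 8x8 digits dataset bundled with Scikit-learn (not MNIST), used to keep this fast
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scalers = {"none": None, "min-max": MinMaxScaler(), "standard": StandardScaler()}
results = {}

for scaler_name, scaler in scalers.items():
    for C in [5e-4, 5e-2, 5, 500]:  # small C = strong regularization
        clf = LogisticRegression(C=C, max_iter=2000)
        estimator = clf if scaler is None else make_pipeline(scaler, clf)
        estimator.fit(X_train, y_train)
        results[(scaler_name, C)] = estimator.score(X_test, y_test)

for (scaler_name, C), acc in results.items():
    print(f"scaling={scaler_name:8s} C={C:<8g} test accuracy={acc:.3f}")
```

Wrapping the scaler and the classifier in a pipeline ensures that the scaling statistics are learned from the training data only and then reused on the test data.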

Analyze the model prediction

First, we will see how the model performs under different scenarios. We will try various combinations of pre-processing type and regularization strength. To evaluate the model, we will use accuracy as the metric.

How does the hyperparameter C affect the prediction accuracy of the logistic regression model?

Observations

Scaling-based pre-processing does not always improve a model’s generalization ability. Here, the model already performed well without any pre-processing.
Tuning a model’s hyperparameters can improve its generalization.

Model weight visualization

Let us view the weight matrix corresponding to each output class.

Collect one sample of each category of the MNIST dataset.
The model weights for each class are reshaped to (28, 28) so that they can be plotted as the images below.

coef = model.coef_.copy()
scale = np.abs(coef).max()

plt.figure(figsize=(13,7))

for i in range(10): # 0-9
    coef_plot = plt.subplot(2, 5, i + 1) # 2x5 plot
    coef_plot.imshow(coef[i].reshape(28,28), 
                     cmap=plt.cm.RdBu,
                     vmin=-scale, vmax=scale,
                    interpolation='bilinear')
    
    coef_plot.set_xticks(()); coef_plot.set_yticks(()) # remove ticks
    coef_plot.set_xlabel(f'Class {i}')
    
plt.suptitle('Coefficients for various classes');

How to visualize the learned weights of logistic regression model on MNIST dataset?

When plotted out, the weight matrix shows which pixels the model relies on when deciding for or against each class. This is a benefit of linear models over deep learning: although the weights of a deep network are also accessible, they are spread over many layers and are much harder to interpret directly.
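Because the model is linear, the score of any class can be split into one contribution per pixel, which is what makes these weight images meaningful. The following sketch shows the idea with random stand-in arrays in place of model.coef_, model.intercept_, and a real image; the function name `class_score_breakdown` is hypothetical.

```python
import numpy as np

def class_score_breakdown(x, coef, intercept, class_idx):
    """Split the linear score of one class into per-pixel contributions."""
    contributions = coef[class_idx] * x  # weight * pixel value, shape (784,)
    score = contributions.sum() + intercept[class_idx]
    return contributions.reshape(28, 28), score

rng = np.random.default_rng(0)
coef = rng.normal(size=(10, 784))  # stand-in for model.coef_
intercept = rng.normal(size=10)    # stand-in for model.intercept_
x = rng.random(784)                # stand-in for one flattened image

contrib_map, score = class_score_breakdown(x, coef, intercept, class_idx=3)

print(contrib_map.shape)                              # (28, 28)
print(np.isclose(score, coef[3] @ x + intercept[3]))  # True
```

Plotting `contrib_map` for a misclassified image would show which pixels pushed the score of the wrong class up, which is a simple way to explain an individual prediction.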

Company Use-case

Character recognition has become an integral part of computer vision, and several corporations are actively working to improve upon the current state of the art. Several industries are developing algorithms that categorize hand-written digits and notes precisely. Tools for Optical Character Recognition (OCR) are now in practical use and achieve state-of-the-art results on such objectives.

Google Cloud API

Cloud Vision lets developers integrate optical character recognition (OCR) and other vision detection features, including image labeling, into their applications. These algorithms can be easily incorporated into the user’s overall pipeline. Check out the documentation to learn how to use such APIs and the comprehensive package of their services.

Computer Vision API, Microsoft Azure

Microsoft Azure’s Computer Vision API includes Optical Character Recognition (OCR) capabilities that extract printed or hand-written text from images. This API provides developers with access to advanced image processing algorithms that have attained the current state-of-the-art. This API works in several languages, making it one of the most used character recognition APIs.

Possible Interview Questions

Based on this project, interviewers can ask these questions:

What is Logistic Regression? Is it a classification algorithm or a regression algorithm?
Why are linear models a good choice for less complex problems?
What is the cost function associated with linear regression?
What are the hyperparameters that need to be tuned to achieve better accuracy?
What are generalized linear models?
Did you apply any data pre-processing to the images? What is data augmentation, and what kind of pre-processing is required for image data?

Conclusion

This article discussed one of the best machine learning applications: optical character recognition. We discussed the steps to implement a linear classifier model over the MNIST dataset and evaluated the performance. After that, we discussed some of the use-cases of companies currently using this technology. We hope you have enjoyed the article and sensed how efficient even the linear machine learning models could be.
