Machine learning models can tackle tasks that are hard to solve by hand. One such task is predicting the quality of wine from quantitative measurements. Judging wine quality manually is difficult; even professional wine tasters have an accuracy of only 71%.
Earning the title of wine taster is an involved process. The Master Sommelier’s Diploma exam is the world’s most challenging wine-tasting examination, and only 200 people have passed since the exam’s inception 40 years ago. With the advancements in machine learning and artificial intelligence, predicting wine quality takes only minutes if we have all the required parameters.
After reading this blog, we will have insights into:
Now that we have a basic understanding of the parameters, let’s dive into the data analysis.
In this blog, we will use the Kaggle Red Wine Quality Dataset. It contains 1,599 rows, one per red wine sample. This dataset is interesting because the problem can be interpreted in two ways:
We will confine ourselves to approaching this problem as a regression task.
Let’s load the data and take a look!
import pandas as pd
wine_quality = pd.read_csv('winequality-red.csv')
wine_quality.head(5)
As we can see, all the parameters have float data types except for quality, which is also our target variable. Since all the independent features are continuous, we can learn something from their distributions. A distribution plot shows how the values of a variable are spread. Let’s plot the distribution for each variable using the Seaborn library.
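import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize = [20,10])
cols = wine_quality.columns
cnt = 1
for col in cols:
    # One subplot per column, arranged on a 4x3 grid
    plt.subplot(4,3,cnt)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(wine_quality[col], kde=True, color='blue', edgecolor='k', linewidth=1)
    cnt+=1
plt.tight_layout()
plt.show()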
Most distributions are approximately normal, while others have some skewness. The wine quality scores 5 and 6 are more frequent than others.
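One way to put a number on the skewness seen in such plots is the `skew()` method of a pandas DataFrame. The sketch below is only an illustration: it uses a small synthetic frame whose column names and values are made up, not taken from the wine data. With the real data, the same call would be `wine_quality.skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: values are illustrative, not from the wine dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(loc=10.0, scale=1.0, size=1000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
})

# Sample skewness per column: close to 0 for symmetric data,
# clearly positive for a distribution with a long right tail
skewness = demo.skew()
print(skewness)
```

Columns with a large positive or negative skewness are the ones that deviate most from the bell shape.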
We also need to understand how the parameters depend on each other. A correlation plot helps visualize such dependencies. Let’s plot the correlation heat map to understand them!
# Hue values for diverging_palette must lie in [0, 359]; 220 (blue) to 10 (red)
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(wine_quality.corr(), cmap=cmap, center=0, square=True)
plt.show()
From the above correlation heat map, we can infer that wine quality is positively correlated with alcohol content and sulphates. In contrast, volatile acidity has a considerable negative correlation with wine quality. It is reasonable that a lower level of volatile acidity is favored in quality tests.
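The same information can be read off as numbers by taking the `quality` column of the correlation matrix and sorting it. The following is a minimal sketch on synthetic data: the column names are borrowed from the dataset, but the values and the coefficients used to generate them are assumptions made purely for illustration. On the real data, the equivalent call would start from `wine_quality.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic data with a built-in positive effect of alcohol and a
# negative effect of volatile acidity (coefficients are made up)
rng = np.random.default_rng(42)
n = 500
alcohol = rng.normal(10.5, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.15, n)
noise = rng.normal(0.0, 0.5, n)
quality = 5.5 + 0.4 * (alcohol - 10.5) - 2.0 * (volatile_acidity - 0.5) + noise

demo = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "quality": quality,
})

# Correlation of every feature with the target, strongest positive first
corr_with_quality = (
    demo.corr()["quality"].drop("quality").sort_values(ascending=False)
)
print(corr_with_quality)
```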
Let’s confirm the relationship mentioned above using the reg-plots.
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20,5))
cols = ['volatile acidity', 'alcohol', 'sulphates']
# One regression plot per feature against the quality score
for col, ax in zip(cols, axs.flat):
    sns.regplot(x = wine_quality[col],y = wine_quality["quality"], color = 'purple', ax=ax)
plt.show()
Wine quality has a clear negative relationship with volatile acidity and a positive relationship with alcohol and sulphates. Let’s dive deeper by visualizing the dependency between wine quality and each of our numeric variables of interest (the independent features).
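fig = plt.figure(figsize = [20,10])
# Plot every feature against quality; quality itself is left out
cols = wine_quality.columns.drop('quality')
cnt = 1
for col in cols:
    plt.subplot(4,3,cnt)
    sns.boxplot(x="quality",y=col,data=wine_quality,palette="coolwarm")
    cnt = cnt + 1
plt.tight_layout()
plt.show()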
Let’s summarise our findings from the above boxplots
Assessing a wine manually is tedious and requires an experienced practitioner to evaluate the quality. We will address this problem by building a regression model that takes the wine parameters as input and returns a predicted quality score. Through this approach, we aim to eliminate the manual tasting and scoring process. To accomplish this task, we must select a regression algorithm that satisfies our requirements.
The following are some regression algorithms that can be used for predicting red wine quality.
Linear models are relatively simple and explainable, but they perform poorly on data containing outliers. They also perform poorly on nonlinear datasets. In such cases, nonlinear regression algorithms such as the Random Forest Regressor and the XGBoost Regressor fit the data better.
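The difference between linear and nonlinear models can be illustrated with a small experiment. The sketch below assumes a synthetic one-feature dataset with a quadratic relationship (not the wine data) and compares scikit-learn's `LinearRegression` with `RandomForestRegressor` by test RMSE. The sample size, noise level, and number of trees are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data: y = x^2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(600, 1))
y = X[:, 0] ** 2 + rng.normal(0.0, 0.3, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its RMSE on the held-out data
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

print(rmse)
```

On data like this, a straight line cannot follow the curve, so the linear model's error stays well above that of the tree-based model.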
Which algorithm is best suited for our use case?
We don’t have significant outliers in this data, so we can use either linear or more complex models. However, the model should have the following qualities:
Keeping all the points mentioned above in mind, we need to select a regression model. For this tutorial, we will be using the k-NN Regressor.
The k-NN (K-Nearest Neighbors) algorithm is a supervised machine learning algorithm that can solve both classification and regression tasks. It was used extensively in statistical estimation and pattern recognition during the early 1970s. The k-NN algorithm uses feature similarity to predict the value of a new data point: for regression, it estimates the value as the average of the target values of the K closest training points. The most commonly used measure of closeness between two data points is the Euclidean distance.
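To make the idea concrete, here is a minimal from-scratch sketch of k-NN regression in NumPy. The function name `knn_predict` and the toy data are hypothetical and only serve the illustration; scikit-learn's `KNeighborsRegressor` is what one would normally use in practice.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Predict x_new as the mean target of its k nearest training points."""
    # Euclidean distance: sqrt(sum_j (x_train_j - x_new_j)^2)
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy training set: three points close together and one far away
X_train = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]])
y_train = np.array([5.0, 6.0, 7.0, 3.0])

# The three nearest neighbors of (2.0, 2.5) are the first three points,
# so the prediction is the mean of 5, 6 and 7
prediction = knn_predict(X_train, y_train, np.array([2.0, 2.5]), k=3)
print(prediction)
```

The far-away point at (10, 10) does not influence the prediction, because it is not among the three nearest neighbors.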
Before implementing the k-NN Regressor, we need to scale the features, because the algorithm assumes they are on comparable scales. The distance between a pair of samples is influenced by the measurement units of each feature, so features with large numeric ranges would dominate the result. To avoid this, we should normalize the data before implementing k-NN.
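The effect of measurement units on distance can be seen in a tiny example. The sketch below uses three made-up samples with two features on very different scales and compares Euclidean distances before and after `MinMaxScaler`. The numbers are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up samples: column 0 is in tens, column 1 is a small fraction
X = np.array([
    [10.0, 0.2],
    [60.0, 0.3],
    [12.0, 0.9],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(((a - b) ** 2).sum()))

# Unscaled: the large-range column decides almost entirely who is "close"
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# Scaled to [0, 1]: both columns contribute on an equal footing
X_scaled = MinMaxScaler().fit_transform(X)
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

Before scaling, sample 2 looks far closer to sample 0 than sample 1 does, purely because of column 0. After scaling, the large difference in column 1 counts as well and the two distances become comparable.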
The main hyperparameter of this algorithm is K, the number of samples that are treated as the nearest neighbors. One way to find the optimum value of K is to plot the error obtained on the test set against a range of K values and then choose the K corresponding to the minimum error.
Let’s implement the k-NN Regressor
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

target = wine_quality['quality']
features = wine_quality.drop('quality', axis = 1)
X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.3)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = pd.DataFrame(scaler.fit_transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))

# Train one model per K and record its RMSE on the test set
k_values = list(range(1, 75))
rms_error = []
for K in k_values:
    model = neighbors.KNeighborsRegressor(n_neighbors = K)
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    error = np.sqrt(mean_squared_error(Y_test, pred))
    rms_error.append(error)

fig, ax = plt.subplots(figsize = [8,5])
ax.plot(k_values, rms_error)
# Mark the K with the lowest RMSE (the annot_optimum helper is not defined in this post)
best_k = k_values[int(np.argmin(rms_error))]
ax.axvline(best_k, linestyle='--', color='red')
plt.xlabel('K - Values')
plt.ylabel('RMSE Error')
plt.show()
In our run, we found the optimum model at K = 27, where the RMS error is minimum. Our dataset is relatively small, which made it practical to search over K this way. As the dataset grows, the k-NN algorithm slows down quickly, because every prediction requires computing distances to the training samples; this is a limitation of the algorithm.
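Choosing K from the test-set error can make the reported error look better than it really is, because the test data has then been used for tuning. A common alternative is cross-validation on the training data. The sketch below is one possible setup, not the approach used in this blog: it assumes synthetic data from `make_regression` and an arbitrary search range of 1 to 30, and wraps the scaler and the regressor in a `Pipeline` so that scaling is refit inside each fold.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data standing in for the wine features and target
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Scaling happens inside the pipeline, so each fold is scaled on its own training part
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsRegressor()),
])

# Try K = 1..30 with 5-fold cross-validation, scored by (negated) RMSE
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(1, 31))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
best_rmse = -search.best_score_
print(best_k, best_rmse)
```

With this setup, the test set stays untouched until the final evaluation of the chosen K.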
Pros:
Cons:
HEINEKEN N.V
Heineken is the second-largest producer of beer in the world. However, they also own Zoetermeer Winery for wine production. Heineken relies on regression analysis to keep track of the quality of the wine. As discussed above, even professional wine tasters are just 71% accurate in determining the quality of wine.
GRAMMARLY, INC.
Grammarly is a cross-platform, cloud-based writing assistant that reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes. Grammarly relies on the KNN classification algorithm for categorising similar sentences and textual documents.
NETFLIX
Netflix uses the KNN algorithm to group shows with similar content. Netflix also uses KNN as a recommendation engine: to recommend similar items, they compare the sets of users who like each item, and when largely the same set of users likes two different items, the items are treated as similar.
Based on this project, the following questions can be asked in any machine learning interview:
We started with a brief introduction to wine quality tasting and the problems with the manual wine-tasting approach. Moving on, we discussed the impact of each parameter and started the data analysis. Based on the correlation heat map, we found the most significant parameters. We further confirmed their impact on wine quality using boxplots and regplots. Finally, we built a K-Nearest Neighbors regression model to predict the quality of wine and looked at the pros and cons of using the k-NN Regressor model. We could also have approached this problem as a multiclass classification task.
Next Blog: Introduction to Naive Bayes Algorithm