Data Science and Machine Learning are increasingly used across the tech industry. A typical project involves several stages, and data visualization plays a critical role in each of them, either directly or indirectly. The idea is simple: data visualization presents information as easy-to-understand graphics rather than only as statistical terms like mean, median, and distribution.
After going through this blog, we will be able to understand the following things:
Data visualization is the graphical representation of data and statistics. It uses visual elements like graphs, bars, and charts to help us gain insights from the data. It is especially beneficial in industries where technical data often needs to be presented to non-technical audiences, because information is easier to understand and retain through visuals than through numbers alone.
For example, Data Scientists perform complex analyses of business data and provide output to Business Analysts. Business Analysts represent this output in a graphical format so that the board, containing non-technical members, can quickly understand it and make valuable decisions.
In Data Science, data passes through six major stages, and data visualization plays a crucial role in every one of them. Let's discuss each of these stages.
There are six major stages of data visualization in Data Science.
Understanding the business problems: There are numerous tools available today, such as Google Analytics, that can be integrated with business websites and provide various analyses like customer retention, bounce rates, purchase rates, etc. Sometimes, business stakeholders use these tools to convey their requirements through visualizations. For example, they can show that the website's bounce rate is rising and ask for the reasons behind it.
Data Collection: We gather raw data through various mediums, including sensors, the internet, IoT devices, etc. Data visualization techniques can be used to show the data from a sensor in the form of signals. Based on this visualization, particular readings can then be recorded for later use.
Removing impurities from the collected data: Raw data contains impurities, and Data Scientists try to make data useful by removing various impurities. There are two parts: identification of impurities and removal of these impurities. The identification part is challenging, and data visualization helps there. Data Scientists plot various graphs for collected data and observe the patterns present in them. Based on this analysis, they decide which data samples need to be dropped.
Analyzing the cleaned data: Data Analysis is the primary field where data visualization plays the most crucial role. Data analysts or scientists try to represent the data using various visualization techniques so that it can convey information hidden in it easily. We can also represent the available information in data through different visualization techniques to present it to a broader audience.
Model Development: Once we have the final data on which we want to train our Machine Learning or Deep Learning models, we start training and monitor whether the loss is decreasing. To do this, we plot the loss values recorded at every iteration. If the loss is not decreasing, the model is not learning, so training needs to be stopped.
Delivering results and insights: After training is finished, the model provides its predictions in numerical form. Data Scientists and ML engineers need to represent the insights given by the model using visualization techniques to make them easily understandable to non-technical people. ML engineers also plot metrics such as accuracy or latency to justify the potential of the trained model.
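The model-development stage described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the loss values are made up, and `loss_has_stalled` with its `window` and `min_improvement` parameters is a hypothetical helper introduced here. The plotting function assumes matplotlib is installed.

```python
def loss_has_stalled(losses, window=3, min_improvement=1e-3):
    """Return True if the loss improved by less than `min_improvement`
    over the last `window` iterations (a simple, assumed stopping rule)."""
    if len(losses) <= window:
        return False  # not enough history to judge
    return losses[-window - 1] - losses[-1] < min_improvement


def plot_loss_curve(losses):
    """Plot loss against iteration number and return the Axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.plot(range(1, len(losses) + 1), losses, marker="o")
    ax.set_xlabel("Iteration")
    ax.set_ylabel("Loss")
    ax.set_title("Training loss per iteration")
    return ax


# Illustrative (made-up) loss values from two training runs.
learning_run = [0.90, 0.62, 0.45, 0.33, 0.26, 0.21]
stalled_run = [0.90, 0.70, 0.6995, 0.6994, 0.6993, 0.6993]

print(loss_has_stalled(learning_run))  # False
print(loss_has_stalled(stalled_run))   # True
```

Calling `plot_loss_curve(stalled_run)` would show the curve flattening after the second iteration, which is the visual cue the text refers to.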
Four essential features determine how useful the Graph will be for our audience. These are:
Aesthetics: This is about the appealing aspect of a picture. The appropriate combination of axis, shape, color, layout, etc., is important for aesthetic beauty. This will help to improve the user's focus on particular information. For this, data scientists can use their knowledge about the audience to make the Graph more appealing.
Novelty: It refers to a unique way of representing the data to the audience, or a format that attracts users' attention at first glance and helps them understand the data better. Please note that not every graph needs novelty to express meaningful information. A bar graph, for example, sacrifices novelty but is still helpful because it is easy to understand. So, there is a tradeoff between novelty and ease of understanding.
Informativeness: The graph should be able to convey the intended information as per our usage. Otherwise, no matter how novel and beautiful it looks, it will not be helpful.
Efficient communication: The graph's primary purpose is to convey information, and it should do so as straightforwardly as possible.
We have to balance all four features depending on our intended message and the context of usage, that is, what information we want to show to our audience and why we are showing it.
Let's take a typical example of 2008 United States election data. It shows which candidate won between the Democratic and Republican parties in the 2008 US presidential election. In the geographical map of the USA below, we can see which candidate won in each state.
As we can see, the map above is a good representation if we are interested in geographic area. But it can mislead us about the weight of the votes. For example, the combined area of Idaho, Montana, Wyoming, North Dakota, and South Dakota is more than 476,000 square miles, which is about 55 times the area of New Jersey. Yet in terms of electoral votes, these five states together have 16, while New Jersey alone has 15. That's why we need to look at this data from a critical perspective.
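The arithmetic behind this comparison can be checked directly. The sketch below uses the figures quoted above (476,000 square miles, 16 and 15 electoral votes); the New Jersey area of roughly 8,700 square miles is an assumption added here for illustration.

```python
# Figures quoted in the text above.
five_states_area_sq_mi = 476_000  # Idaho, Montana, Wyoming, North Dakota, South Dakota
five_states_votes = 16
new_jersey_votes = 15

# Assumption for illustration: New Jersey covers roughly 8,700 square miles.
new_jersey_area_sq_mi = 8_700

area_ratio = five_states_area_sq_mi / new_jersey_area_sq_mi
vote_ratio = five_states_votes / new_jersey_votes

print(f"Area ratio: {area_ratio:.1f}x")            # Area ratio: 54.7x
print(f"Electoral vote ratio: {vote_ratio:.2f}x")  # Electoral vote ratio: 1.07x
```

Under these numbers, a geographic map gives the five states about 55 times the visual weight of New Jersey, while their electoral weight is almost the same.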
We can represent the same data in a different way, where each electoral vote is drawn as one unit square, and the map can be redrawn as shown below. This gives us a much more accurate picture of who is winning. Please note that we sacrificed aesthetics here to make the map more informative for our context, which is worth it.
We talked about four essential features that help us show the information effectively. But why do we even need to display information? There are two primary purposes for that:
Let us look into a typical real-life example where data visualization is used simultaneously for both.
Most of us are familiar with the periodic table from our class 10 chemistry chapters. It is a classic example of data visualization. The world is made up of different kinds of elements like Gold (Au), Hydrogen (H), Oxygen (O), etc. These elements have their own properties and atomic numbers. Oxygen is a non-metal, while Gold is a metal. All the known elements are arranged in the periodic table, as shown below.
The nearer two elements are in the table, the more similar their properties tend to be. Elements in the same column, especially in the columns at the two extremes of the table, have similar properties. Mendeleev arranged the elements so that those with similar chemical properties are placed near each other.
Mendeleev's table also left some places empty, which helped predict which elements were yet to be discovered and what properties they were likely to have. So this is a case where we represent the available information systematically and also facilitate discovery.
In the current era, firms store gigabytes of data without using it. They do so out of FOMO (Fear of Missing Out), i.e., the worry that some of this data might be required in the future. This data needs to be organized before ML and Big Data engineers can use it.
The data types can be broadly divided into four categories. We need to understand that each data type can be represented by a specific set of Graphs. These help us get insights which we might have missed. We can see different graphs for each data type, along with their explanation below.
Discrete Data: It is data in a countable format like 1, 2, 3, 4 that cannot take values in between. For example, the number of hospitalization cases reported by a hospital in a day. We can use a bar graph to represent such data.
Continuous Data: It is numerical data that can be measured using an instrument like a ruler. For example, in shot put, the distance of each throw is continuous data. Line charts and box plots are used to represent continuous data. The box plot displays a summary of the data using the minimum, first quartile, median, third quartile, and maximum, and also helps us identify outliers in our dataset.
Nominal Data: It is data to which we can apply a tag or label, with no inherent order among the labels. For example, the gender of a person can be recorded as male or female. Scatter plots, pie charts, and bar plots are commonly used to represent such data in graph format.
Ordinal Data: It is data in which we can rank the categories. For example, weather like mild, hot, and extremely hot can be ranked. A point plot can represent ordinal data, and it helps us see the difference in data present for each category.
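The five-number summary that a box plot displays for continuous data can be computed with Python's standard library. This is a minimal sketch: the shot put distances are made up, and flagging points beyond 1.5 times the interquartile range is a common box-plot convention assumed here, not something the text above specifies. The helper names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up shot put distances in metres; 21.7 is deliberately extreme.
distances = [12.1, 13.4, 13.9, 14.2, 14.8, 15.0, 15.3, 21.7]

print(five_number_summary(distances))
print(iqr_outliers(distances))  # [21.7]
```

With matplotlib installed, `matplotlib.pyplot.boxplot(distances)` would draw the same kind of summary, with the extreme throw shown as a separate point.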
The graphs can be divided into four categories, as shown below:
Relationship graph: These graphs help us observe the relationship between different data points. A scatter plot is an example of a Relationship graph.
Comparison graph: These graphs are used to compare two entities present in the data. Bar charts and line charts are two examples used for comparison. For instance, we draw a multi-line chart to compare house prices in Mumbai and Delhi over the years.
Distribution graph: These help us observe the distribution of our data points. Histograms and Box plots fall under this category. A Box plot displays a five-number summary of data, which includes the minimum, first quartile, median, third quartile, and maximum, and also helps us identify outliers in our dataset.
Composition graph: These graphs are used when we want to know the composition of data points in the whole dataset. Pie charts, Stacked Bar Charts, and Stacked Area Charts are examples of composition graphs.
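The four categories above can be tried out with one small example each. The sketch below assumes matplotlib is installed and uses made-up numbers throughout; `CHARTS_BY_CATEGORY` and `draw_category_examples` are hypothetical names introduced for illustration.

```python
# Chart types named in the text for each graph category.
CHARTS_BY_CATEGORY = {
    "relationship": ["scatter plot"],
    "comparison": ["bar chart", "line chart"],
    "distribution": ["histogram", "box plot"],
    "composition": ["pie chart", "stacked bar chart", "stacked area chart"],
}


def draw_category_examples():
    """Draw one made-up example per category on a 2x2 grid; return the Figure."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Relationship: scatter plot of two made-up variables.
    axes[0, 0].scatter([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.1, 9.8])
    axes[0, 0].set_title("Relationship: scatter plot")

    # Comparison: multi-line chart of made-up price indices for two cities.
    years = [2019, 2020, 2021, 2022]
    axes[0, 1].plot(years, [100, 104, 110, 118], label="Mumbai")
    axes[0, 1].plot(years, [100, 102, 107, 112], label="Delhi")
    axes[0, 1].set_xticks(years)
    axes[0, 1].legend()
    axes[0, 1].set_title("Comparison: multi-line chart")

    # Distribution: histogram of made-up measurements.
    axes[1, 0].hist([4, 5, 5, 6, 6, 6, 7, 7, 8, 12], bins=5)
    axes[1, 0].set_title("Distribution: histogram")

    # Composition: pie chart of made-up shares.
    axes[1, 1].pie([50, 30, 20], labels=["A", "B", "C"], autopct="%1.0f%%")
    axes[1, 1].set_title("Composition: pie chart")

    fig.tight_layout()
    return fig
```

Calling `draw_category_examples().savefig("categories.png")` would write the four panels to an image file; the file name is only an example.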
To become an expert in Data Science, we need to follow specific processes before representing our analysis in graphical form:
Step 1: We need to "develop the purpose" of drawing graphs, i.e., who our audience is, what information we need to show, and what the purpose of showing graphs to the respective audience is.
Step 2: We need to "collect the data" as per our needs (discussed in step 1) from various sources like the internet, company servers, etc.
Step 3: We need to preprocess the collected data, as the raw data can contain anomalies. We also need to normalize the data so that some features don't appear more important than they are in reality.
Step 4: We need to choose a suitable graph to represent our thoughts discussed in step 1. There are multiple graphs, and their purposes vary as per our needs. This selection is guided by the four essential features: aesthetics, novelty, informativeness, and efficient communication.
Step 5: We need to choose a tool to draw the required graph. Several software tools, such as Tableau and Power BI, support a wide variety of plots. Libraries that help us plot graphs in Python are:
Step 6: We need to draw the graph with proper labels and legends to communicate our thoughts to the audience.
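Steps 3 and 6 can be sketched together in Python. This is only an illustration under stated assumptions: min-max scaling is one common way to normalize and is chosen here because the steps above do not name a method, the feature values are made up, the helper names are hypothetical, and the plotting function assumes matplotlib is installed.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def plot_normalized_features(features):
    """Plot each normalized feature as a labelled line and return the Axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    for name, values in features.items():
        ax.plot(min_max_normalize(values), label=name)
    ax.set_xlabel("Sample index")
    ax.set_ylabel("Normalized value (0 to 1)")
    ax.set_title("Features on a common scale")
    ax.legend()
    return ax


# Made-up features on very different scales.
features = {
    "income (USD)": [32_000, 45_000, 51_000, 78_000],
    "age (years)": [23, 35, 41, 58],
}

print(min_max_normalize(features["age (years)"]))
```

Without normalization, the income line would dwarf the age line on a shared axis; after rescaling, both span 0 to 1, and the axis labels and legend tell the audience what each line means.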
Data visualization is an integral part of data science, and we have seen its importance at every stage of a data science project. When creating a visualization, the critical step is choosing the most suitable graph, which depends on the four essential features: aesthetics, novelty, informativeness, and efficient communication. We hope you enjoyed the article.
If you have any queries/doubts/feedback, please write us at contact@enjoyalgorithms.com. Enjoy learning, Enjoy data science, Enjoy algorithms!