Pandas is a Python library widely used by Data Scientists and Machine Learning engineers. It offers a variety of features for quick data analysis and preprocessing, making it a valuable tool for mastering Data Science. Its broad support and easy integration with other Python data analysis packages like scikit-learn make it even more widespread. This article will guide you through installing Pandas on Python-enabled systems and introduce some of its essential functions commonly used in machine learning projects.
Let us start by learning more about Pandas.
Pandas is a software library comprising numerous tools that efficiently cater to data manipulation and analysis tasks. Wes McKinney developed the basics of Pandas in 2008 and made it public in 2009. Since then, it has remained a favorite among Data Analysts and Engineers.
In the machine learning community, Pandas is highly praised, but why? Let's see some practical use cases of Pandas.
Pandas is a widely used library for managing and analyzing datasets. It is commonly used in domains such as economics, data science, and the stock market. In economics, Pandas helps to visualize large amounts of data. It has structures like DataFrames that make it easy to work with massive datasets. Pandas is also helpful in preprocessing and analyzing data used to build machine learning models. Specifically, in the stock market, it is used to provide quick analysis of market trends.
Now that we understand the importance of Pandas, let's begin learning its basics. Before we start, let's ensure the library is installed on our Python-enabled systems.
Detailed instructions for installing Pandas on all operating systems can be found in our "make your system machine learning enabled" blog. To install Pandas via Python's package installer (pip), we can use the commands below:
Python2 on terminal → pip install pandas
Python3 on terminal → pip3 install pandas
Jupyter notebook python2 → !pip install pandas
Once installed, we can import the library and use it in our code. For example:
import pandas as pd
We have imported Pandas and shortened its name to "pd", so in future sections, we will use 'pd' instead of the complete name Pandas. Pandas is mainly a data-handling library, and to store and access data efficiently, we need data structures. So let's start discussing the various data structures in Pandas.
There are mainly two data structures present in Pandas: the Series and the DataFrame.
A Pandas Series can be thought of as a single column of a table. It is a one-dimensional array that can hold data of any data type in Python.
pd.Series([1,2,3])
###output
'''
0 1
1 2
2 3
dtype: int64
'''
The output contains the elements and their corresponding positions as indexes. We can define indexes at our convenience. The default index ranges from 0 to n-1, where n is the length of the Series. We access elements the same way as in a Python list, where the index can be the default or a custom one, as shown.
l = pd.Series([1,2,3],index = ["A","B","C"])
print(l)
'''
## Output
A 1
B 2
C 3
dtype: int64
'''
## To access the element at index A:
l["A"]
###output
1
Pandas DataFrame is a 2-dimensional labeled data structure consisting of rows and columns, where different columns can hold different data types. It is the most commonly used data structure when building data science applications. Let's see how we can create a DataFrame to store our existing data.
While collecting a dataset, we store the readings in files of various formats like CSV, Excel, and JSON. These datasets need to be preprocessed before applying machine learning, and to perform the preprocessing on our dataset, we first need to bring it into the Python environment by reading it as a Pandas DataFrame. Let's see how to read some of the most frequent file formats using Pandas.
A CSV file contains ','-separated data. For example, a dataset of song statistics will look as shown below. This can be saved to a file with a '.csv' extension (e.g., song_stats.csv).
df = pd.DataFrame({'Name of Track':["Roar","Dark Horse","Blank Space"],"Duration of track":[269,225,272],"Singer":["Katy Perry","Katy Perry","Taylor Swift"]})
df
'''
###output
Name of Track Duration of track Singer
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
'''
As shown below, we can read the saved CSV file as a DataFrame using the Pandas read_csv function.
df = pd.read_csv('song_stats.csv')
df
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
'''
Please note that we have not specified a delimiter, i.e., the symbol that separates column values in a CSV file; by default, it is a comma. Also, the first row of the file was used as the header of the DataFrame, which in our case is "Name of track, Duration of track, Singer name".
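If the file used a different delimiter, we could pass it explicitly via the sep parameter. A minimal sketch, assuming a hypothetical tab-separated file song_stats.tsv with the same columns:
df_tsv = pd.read_csv('song_stats.tsv', sep='\t') # '\t' tells Pandas the columns are tab-separated
df_tsv.head()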
A dictionary is a built-in Python data type that stores data as (key, value) pairs, where each key is unique. We can pass a dictionary to the DataFrame constructor, as shown in the example below; it will convert our dictionary into a DataFrame.
df = pd.DataFrame({'Name of Track':["Roar","Dark Horse","Blank Space"],"Duration of track":[269,225,272],"Singer":["Katy Perry","Katy Perry","Taylor Swift"]})
df
###output
Name of Track Duration of track Singer
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
The keys act as the column headings, and the values corresponding to a key fill that key's column in the DataFrame. For example, take the key 'Name of Track', whose corresponding values are ["Roar","Dark Horse","Blank Space"]. We can see above that the first column of the DataFrame is headed by the key 'Name of Track', and its column values are the values from our dictionary.
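As a side note, the same DataFrame can also be built from a list of row dictionaries, where each dictionary becomes one row. A small sketch with the same data:
pd.DataFrame([
{"Name of Track": "Roar", "Duration of track": 269, "Singer": "Katy Perry"},
{"Name of Track": "Dark Horse", "Duration of track": 225, "Singer": "Katy Perry"},
{"Name of Track": "Blank Space", "Duration of track": 272, "Singer": "Taylor Swift"},
])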
JSON stands for JavaScript Object Notation, a text-based format for representing semi-structured data. It is commonly used for transmitting data in web applications. A sample JSON file is shown below.
[
  {
    "color": "red",
    "value": "#f00"
  },
  {
    "color": "green",
    "value": "#0f0"
  }
]
We can read a JSON file directly into a DataFrame using the Pandas read_json function, as shown below. (Here we assume 'data.json' stores the song records from our earlier example rather than the color data above.)
# Reading a JSON file as a DataFrame
df10 = pd.read_json('data.json')
df10
###output
Name of Track Duration of track Singer
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
We have learned how to read data from a file into a Pandas DataFrame. Let's now look at some functions Pandas provides to help us analyze and preprocess the dataset.
df.head(n) → It shows the first n rows of the DataFrame. By default, it shows the first five rows unless we specify a value for n.
df.head(2)
'''
#Here we specified the value of n as 2, so we see the first two rows
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
'''
df.tail(n) → Similar to the head function, the tail function shows rows of the DataFrame, but counted from the end. By default, it shows the last five rows of the DataFrame unless we specify a value for n.
df.tail(1)
'''
#Here we specified the value of n as 1, so we see the last row
###output
Name of track Duration of track Singer name
2 Blank Space 272 Taylor Swift
'''
df.info() → As the name suggests, it summarizes our DataFrame by showing column names, memory usage, non-null counts, data types, etc. Columns containing null values can be spotted here, which helps us understand our data better. For example, some ML models will give errors if the data has null values; we can use the info function to find details about null data and take the necessary actions to handle it.
df.info()
'''
###output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name of track 3 non-null object
1 Duration of track 3 non-null int64
2 Singer name 3 non-null object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
'''
df.shape → As the name suggests, it tells us the shape of the DataFrame, i.e., the number of rows and columns, as a tuple. Note that shape is an attribute rather than a method, so it is written without parentheses.
df.shape
'''
###output
(3, 3)
'''
df.to_numpy() → This gives us a NumPy representation of our dataset. It can be beneficial when we have a DataFrame (or a part of one) consisting of numerical values and we want to perform mathematical operations on them, as NumPy's processing speed is very fast compared to Pandas.
df.to_numpy()
'''
###output
array([['Roar', 269, 'Katy Perry'],
['Dark Horse', 225, 'Katy Perry'],
['Blank Space', 272, 'Taylor Swift']], dtype=object)
'''
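For instance, a quick sketch of a numerical operation on the duration column (column names as in our CSV example):
durations = df['Duration of track'].to_numpy()
durations.mean()
###output
255.33333333333334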
df.describe(): It gives us summary statistics for all numerical columns, like count, mean, standard deviation, min/max, and quartiles. Continuing the same example, the only numerical column in our DataFrame is the track duration, and its various statistics can be seen below using the df.describe function.
df.describe()
'''
###output
Duration of track
count 3.000000
mean 255.333333
std 26.312228
min 225.000000
25% 247.000000
50% 269.000000
75% 270.500000
max 272.000000
'''
df.columns: It gives us the names of all columns present in our DataFrame. Like shape, it is an attribute rather than a method.
df.columns
'''
###output
Index(['Name of track', 'Duration of track', 'Singer name'], dtype='object')
'''
df.set_index(column_name): Here, we can use a particular column of the DataFrame as the index instead of the default 0 to n-1. Please note that the index of the DataFrame below has changed.
df_custom_index = df.set_index("Name of Track")
df_custom_index
'''
###output
Duration of track Singer
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
Blank Space 272 Taylor Swift
'''
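The change can be undone with reset_index, which moves the index back into a regular column:
df_custom_index.reset_index()
'''
###output
Name of Track Duration of track Singer
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
'''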
df[col_name].unique() → It returns all unique values in that particular column. Sometimes, for categorical features (classes or their numerical representations) stored in a column, we need to see how many unique classes are present in the data before deciding which ML model would be best suited. This function helps there.
df['Singer name'].unique()
###output
###here we see that we get one 'Katy Perry' instead of two
array(['Katy Perry', 'Taylor Swift'], dtype=object)
df[col_name].value_counts() → Closely related, it returns each unique value in a column along with its number of occurrences.
df['Singer name'].value_counts()
###output
Katy Perry 2
Taylor Swift 1
Name: Singer name, dtype: int64
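If we only need the number of distinct values rather than the values themselves, nunique is a handy shortcut:
df['Singer name'].nunique()
###output
2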
Now that we have seen the essential functions of Pandas, it is important to know that after making changes to a DataFrame, we can save it to disk for later use. Let's see how we can do that.
Here we use a filename with the required extension to save our DataFrame. For example:
df.to_csv('songs.csv') # Saving the file in CSV format
df.to_json('songs.json') # Saving the file in JSON format
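By default, to_csv also writes the DataFrame index as an extra column in the file; if we do not want that, we can pass index=False:
df.to_csv('songs.csv', index=False) # the index values are not written to the file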
Generally, companies store a lot of data in single or multiple files. Often, only some of this data is helpful, and we need just a part of it. In the next section, we will see how to extract parts of DataFrames as per our requirements.
We often want only a part of the DataFrame to get helpful information out of it, and for this, we use slicing. This can be done in three ways: direct slicing with df[row_index1:row_index2], the loc function, and the iloc function.
Let us look at how we can implement these three methods.
df[row_index1:row_index2]: Here, we give the starting and ending index of the rows we want to extract from the DataFrame. Please note that row_index1 is inclusive while row_index2 is exclusive, just as with Python lists. We get a new DataFrame after slicing.
1)###code when both row indexes are mentioned
df[0:1]
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
We can also give only the starting index, in which case all rows from that index onward are included, as shown below:
2)###code when only the starting index is mentioned
df[1:]
###output
Name of track Duration of track Singer name
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
Directly accessing a row by its index number will give an error, because df[...] with a single value is interpreted as a column name; e.g., accessing the first row as df[0] raises a KeyError.
3)###code for accessing an element by direct index
df[0]
### we get the following error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 0
Using the loc function: This can be used to access a group of rows and columns by label or by a boolean array. More details can be found in the official Pandas documentation. Note that df_custom_index, defined earlier using df.set_index, is shown below.
df_custom_index
###output
Duration of track Singer
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
Blank Space 272 Taylor Swift
loc_result = df_custom_index.loc["Dark Horse"]
loc_result
###output
Duration of track 225
Singer Katy Perry
Name: Dark Horse, dtype: object
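Since loc also accepts a boolean array, we can use it to filter rows by a condition. A small sketch with our columns:
df_custom_index.loc[df_custom_index["Duration of track"] > 250]
'''
###output
Duration of track Singer
Name of Track
Roar 269 Katy Perry
Blank Space 272 Taylor Swift
'''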
Using the iloc function: This is integer-based positional indexing that lets us get the required rows from a DataFrame using a numerical index ranging from 0 to n-1.
iloc_result = df_custom_index.iloc[1]
iloc_result
'''
###output
Duration of track 225
Singer Katy Perry
Name: Dark Horse, dtype: object
'''
Please note that both the loc and iloc functions can be used for slicing. An example of each is shown below.
# 1) Using loc
df_custom_index.loc["Roar":"Blank Space"]
'''
###output
Duration of track Singer
Name of Track
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
Blank Space 272 Taylor Swift
'''
# 2) Using iloc
df_custom_index.iloc[0:2]
'''
###output
Duration of track Singer
Name of Track
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
'''
Please note that in the case of loc, both endpoints are inclusive, while in the case of iloc, only the starting index is inclusive. Also, an iloc slice does not have to stay within the bounds of 0 to n-1, which is similar to slicing NumPy arrays and Python lists.
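For example, a small sketch of an out-of-bounds slice (the hypothetical end index 100 is simply capped at the number of rows):
df_custom_index.iloc[0:100]
'''
###output
Duration of track Singer
Name of Track
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
Blank Space 272 Taylor Swift
'''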
We can access columns of a DataFrame in two ways:
1) df[col_name] → giving a single column name returns that column as a Series.
2) df[[col_name1, col_name2]] → giving a list of column names returns a DataFrame; note that even when selecting a single column this way, the name must be inside a list.
1)###code for Point1
df["Singer name"]
###output
0 Katy Perry
1 Katy Perry
2 Taylor Swift
Name: Singer name, dtype: object
2)###code for Point2
df[["Singer name","Name of track"]]
###output
Singer name Name of track
0 Katy Perry Roar
1 Katy Perry Dark Horse
2 Taylor Swift Blank Space
We can combine the techniques learned earlier: first slice rows, then select columns, i.e., df[row_slice][columns].
As with column access, we get a Series if we mention a single column name and a DataFrame if we pass a list of columns. The row part follows the same rules as in Retrieving Rows, and the column part follows the same rules as in Retrieving Columns.
#1)###Example 1
df[1:3]["Singer name"]
'''
###output
1 Katy Perry
2 Taylor Swift
Name: Singer name, dtype: object
'''
#2)###Example 2
df[1:3][["Singer name"]]
'''
###output
Singer name
1 Katy Perry
2 Taylor Swift
'''
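A more idiomatic way to select rows and columns in one step is a single loc call (remembering that both of loc's endpoints are inclusive). A sketch:
df.loc[1:2, "Singer name"]
'''
###output
1 Katy Perry
2 Taylor Swift
Name: Singer name, dtype: object
'''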
Now we have learned how to extract parts of a DataFrame instead of the whole. But what if we want the data sorted in a particular order before slicing it? Let's see how we can sort DataFrames in the next section.
df.sort_index(ascending=False): This sorts the DataFrame in descending order of its row index values. If we specify ascending=True instead, it is sorted in ascending order.
df.sort_index(ascending=False)
###output
Name of track Duration of track Singer name
2 Blank Space 272 Taylor Swift
1 Dark Horse 225 Katy Perry
0 Roar 269 Katy Perry
df.sort_values(by=column_name): It sorts the DataFrame using the values in the column we specify. In the code below, we sort the DataFrame to get songs in increasing order of track duration.
df.sort_values(by='Duration of track')
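The expected output, with rows ordered by increasing duration:
'''
###output
Name of track Duration of track Singer name
1 Dark Horse 225 Katy Perry
0 Roar 269 Katy Perry
2 Blank Space 272 Taylor Swift
'''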
We may encounter missing or inconsistent values in the rows or columns of real-world datasets. Often these incomplete samples hamper our analysis and need to be handled or removed before building the final dataset.
In this section, we will see how to identify null values and handle them. Let's create dummy data with some None values.
df3 = df.replace({"Taylor Swift":None})
df3
'''
### here we create a dataset with a None value
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 None
'''
Now let's see some essential functions that can inform us about these missing values and provide some remedies.
df.isnull().sum() → isnull flags each cell as True or False depending on whether it is null, and summing the flags gives the count of null values per column.
df3.isnull().sum()
'''
###output
Name of track 0
Duration of track 0
Singer name 1
dtype: int64
'''
df.dropna(subset=["col_name"]): This drops all rows containing null values when no subset is specified. If a subset is specified, it drops only the rows that have a null value in one of the listed columns.
df3.dropna()
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
'''
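Instead of dropping rows, we can also fill the missing entries with a default value using fillna — a small sketch:
df3.fillna("Unknown")
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Unknown
'''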
Please note that the above methods do not change the existing DataFrame; they return a new DataFrame with the changes applied. To make the changes happen on the same DataFrame, many Pandas methods accept the argument inplace=True.
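For example, a sketch of dropping null rows in place:
df3.dropna(inplace=True) # df3 itself is modified; the method returns None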
Sometimes data can be in multiple files, and we need to combine two different DataFrames. Let's see how we can do that.
These are useful when the data is spread across different files, but we want it in a single data structure.
###here we split a single dataframe into two using slicing; we will then concatenate them back
df4 = df[0:1]
df5 = df[1:3].reset_index(drop=True)
df5
'''
###output
Name of track Duration of track Singer name
0 Dark Horse 225 Katy Perry
1 Blank Space 272 Taylor Swift
'''
df4
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
'''
pd.concat([dataframe1,dataframe2],ignore_index=True) → It helps us concatenate two (or more) DataFrames.
pd.concat([df4,df5],ignore_index=True)
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
'''
Also, 'ignore_index=True' is used so that the original index values are discarded and the result gets a fresh index. The default value of ignore_index is False, in which case we would have two or more rows sharing the same index number:
pd.concat([df4,df5])
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
0 Dark Horse 225 Katy Perry
1 Blank Space 272 Taylor Swift
'''
The pd.merge() function joins two DataFrames on their common column(s) (here 'name') using the specified join method. An example is shown below.
data_1 = {
"name": ["Katy", "Taylor", "John"],
"Experience": [3, 5, 2]
}
data_2 = {
"name": ["Katy", "Taylor", "John"],
"experience": [1, 4, 7]
}
df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
newdf = df_1.merge(df_2, how='right')
newdf
###Output
name Experience experience
0 Katy 3 1
1 Taylor 5 4
2 John 2 7
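To see why the 'how' argument matters, here is a sketch with hypothetical non-matching keys: a right join keeps every row of the right DataFrame and fills the missing left-side values with NaN.
left = pd.DataFrame({"name": ["Katy", "Taylor"], "Experience": [3, 5]})
right = pd.DataFrame({"name": ["Taylor", "John"], "experience": [4, 7]})
left.merge(right, how='right')
'''
###output
name Experience experience
0 Taylor 5.0 4
1 John NaN 7
'''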
Here we utilize the benefits of a lambda function on the DataFrame using the apply function, as shown in the example. We use a lambda function when we want to operate on every value of one column in our data.
df["Singer name"] = df["Singer name"].apply(lambda x : x[:4])
df
'''
##output
Name of track Duration of track Singer name
0 Roar 269 Katy
1 Dark Horse 225 Katy
2 Blank Space 272 Tayl
'''
In the above example, we wanted to shorten the singer names and keep only the first four characters of each name for a singing contest. Also, we reassigned the output to the 'Singer name' column of the original DataFrame; otherwise, apply would just return the result as a Series, and nothing would change in the original DataFrame.
That's all for the basics of Pandas, but before closing this session, let's think about one critical question: why do we need DataFrames at all?
We need to know that different data structures have different processing speeds. In terms of time taken for numerical operations, list > DataFrame > NumPy ndarray, i.e., the ndarray is the fastest. Lists are Python's most commonly used data structure as they are easy to use and can hold multiple data types, but they are not convenient for mathematical processing. So we use ndarrays to perform mathematical operations, as they are extremely fast. But what if we want to merge data from multiple datasets or read data from Excel or HDF5 files? For that, we need DataFrames.
Each data structure has its benefits, uses, and processing time, but if we want a one-stop destination library of Python for analyzing data, Pandas would be the best.
Some of the most common mistakes people make while using Pandas are:
Believing that performing a single operation on a DataFrame per statement makes the code efficient and easy to understand. This is not true: the performance of the code depends on two things, the algorithm's efficiency and the memory consumption. The algorithm's efficiency remains the same even if multiple operations are chained in a single step instead of creating multiple DataFrames, so the deciding factor becomes memory utilization. Please see the code below.
import pandas as pd
df1 = pd.read_csv('song_stats.csv')
df2 = df1.dropna()
df3 = df2.groupby('Singer name')
Now, in the end, we only use df3, but df1 and df2 are also kept around, which wastes memory and hence affects the performance of the code. To avoid this, we can chain all the steps into one, as shown below:
df = pd.read_csv('song_stats.csv').dropna().groupby('Singer name')
We have now covered all the essentials of Pandas. Let's wrap up with a quick summary.
Pandas is a valuable tool for Python developers, allowing for efficient data visualization, exploration, and cleaning. In today's data-driven world, it has become necessary for developers in various fields, from model building to data engineering. In this article, we covered the basics of Pandas, including handling missing values, performing basic operations, and joining DataFrames. We hope you found the information helpful.