Pandas is a Python library widely used by Data Scientists and Machine Learning engineers. It offers a variety of features for quick data analysis and preprocessing, making it a valuable tool for mastering Data Science. Its broad support and easy integration with other Python data analysis packages like scikit-learn make it even more widespread. This article will guide you through installing Pandas on Python-enabled systems and introduce some of its essential functions commonly used in machine learning projects.
Let us start by learning more about Pandas.
Pandas is a software library comprising numerous tools that efficiently cater to data manipulation and analysis tasks. Wes McKinney developed the basics of Pandas in 2008 and made it public in 2009. Since then, it has remained a favorite among Data Analysts and Engineers.
In the machine learning community, Pandas is highly praised, but why? Let's see some practical use cases of Pandas.
Pandas is a widely used library for managing and analyzing datasets. It is commonly used in domains such as economics, data science, and the stock market. In economics, Pandas helps to visualize large amounts of data. It has structures like DataFrames that make it easy to work with massive datasets. Pandas is also helpful in preprocessing and analyzing data used to build machine learning models. Specifically, in the stock market, it is used to provide quick analysis of market trends.
Now that we understand the importance of Pandas, let's begin learning its basics. Before we start, let's ensure the library is installed on our Python-enabled systems.
Detailed instructions for installing Pandas on all operating systems can be found in our "make your system machine learning enabled" blog. To install Pandas via Python's package installer (pip), we can use the commands below:
Python2 on terminal → pip install pandas
Python3 on terminal → pip3 install pandas
Jupyter notebook python2 → !pip install pandas
Once installed, we can import the library and use it in our code. For example:
import pandas as pd
We have imported Pandas and shortened its name to "pd", so in future sections, we will use 'pd' instead of the complete name Pandas. Pandas is mainly a data-handling library, and to store and access data efficiently, we need data structures. So let's start discussing the various data structures in Pandas.
There are mainly two data structures present in Pandas: the Series and the DataFrame.
A Pandas Series can be thought of as a single column of a table. It is a one-dimensional array that can hold data of any data type in Python.
pd.Series([1,2,3])
###output
'''
0 1
1 2
2 3
dtype: int64
'''
The output contains the elements and their corresponding positions as indexes. We can define indexes at our convenience. The default index ranges from 0 to n-1, where n is the length of the Series. We access elements the same way as in a Python list, where the index can be the default or a custom one, as shown.
l = pd.Series([1,2,3],index = ["A","B","C"])
print(l)
'''
## Output
A 1
B 2
C 3
dtype: int64
'''
## To access the element at index A:
l["A"]
###output
1
Pandas DataFrame is a 2-dimensional labeled data structure consisting of rows and columns, where different columns can hold different data types. It is the most commonly used data structure when building data science applications. Let's see how we can create a DataFrame to store our existing data.
While collecting a dataset, we store the readings in files of various formats like CSV, Excel, and JSON. These datasets need to be preprocessed before applying machine learning, and to perform the preprocessing on our dataset, we first need to bring it into the Python environment by reading it as a Pandas DataFrame. Let's see how to read some of the most frequent file formats using Pandas.
A CSV file contains ','-separated data. For example, a dataset of song statistics will look as shown below. This can be saved to a file with a '.csv' extension (e.g., song_stats.csv).
df = pd.DataFrame({'Name of Track':["Roar","Dark Horse","Blank Space"],"Duration of track":[269,225,272],"Singer":["Katy Perry","Katy Perry","Taylor Swift"]})
df
'''
###output
Name of Track Duration of track Singer
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
'''
As shown below, we can read the saved CSV file as a DataFrame using the Pandas read_csv function.
df = pd.read_csv('song_stats.csv')
df
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
'''
Please note that we have not specified a delimiter, i.e., the symbol that separates column values in a CSV file; by default, it is a comma. Also, the first row of the file was used as the header of the DataFrame, which in our case is "Name of track, Duration of track, Singer name".
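If the file used a different delimiter, we could pass it explicitly via the sep parameter. A minimal sketch, assuming a hypothetical tab-separated file song_stats.tsv with the same columns:
df_tsv = pd.read_csv('song_stats.tsv', sep='\t') # '\t' tells Pandas the columns are tab-separated
df_tsv.head()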
A dictionary is a built-in Python data type that stores data as (key, value) pairs, where each key is unique. We can pass a dictionary to the DataFrame constructor, as shown in the example below; it will convert our dictionary into a DataFrame.
df = pd.DataFrame({'Name of Track':["Roar","Dark Horse","Blank Space"],"Duration of track":[269,225,272],"Singer":["Katy Perry","Katy Perry","Taylor Swift"]})
df
###output
Name of Track Duration of track Singer
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
The keys act as the column headings, and the values corresponding to a key fill that key's column in the DataFrame. For example, take the key 'Name of Track', whose corresponding values are ["Roar","Dark Horse","Blank Space"]. We can see above that the first column of the DataFrame is headed by the key 'Name of Track', and its column values are the values from our dictionary.
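As a side note, the same DataFrame can also be built from a list of row dictionaries, where each dictionary becomes one row. A small sketch with the same data:
pd.DataFrame([
{"Name of Track": "Roar", "Duration of track": 269, "Singer": "Katy Perry"},
{"Name of Track": "Dark Horse", "Duration of track": 225, "Singer": "Katy Perry"},
{"Name of Track": "Blank Space", "Duration of track": 272, "Singer": "Taylor Swift"},
])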
JSON stands for JavaScript Object Notation, a text-based format for representing semi-structured data. It is commonly used for transmitting data in web applications. A sample JSON file is shown below.
[
  {
    "color": "red",
    "value": "#f00"
  },
  {
    "color": "green",
    "value": "#0f0"
  }
]
We can read a JSON file directly into a DataFrame using the Pandas read_json function, as shown below. (Here we assume 'data.json' stores the song records from our earlier example rather than the color data above.)
# Reading a JSON file as a DataFrame
df10 = pd.read_json('data.json')
df10
###output
Name of Track Duration of track Singer
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
We have learned how to read data from a file into a Pandas DataFrame. Let's now look at some functions Pandas provides to help us analyze and preprocess the dataset.
df.head(n) → It shows the first n rows of the DataFrame. By default, it shows the first five rows unless we specify a value for n.
df.head(2)
'''
#Here we specified the value of n as 2, so we see the first two rows
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
'''
df.tail(n) → Similar to the head function, the tail function shows rows of the DataFrame, but counted from the end. By default, it shows the last five rows of the DataFrame unless we specify a value for n.
df.tail(1)
'''
#Here we specified the value of n as 1, so we see the last row
###output
Name of track Duration of track Singer name
2 Blank Space 272 Taylor Swift
'''
df.info() → As the name suggests, it summarizes our DataFrame by showing column names, memory usage, non-null counts, data types, etc. Columns containing null values can be spotted here, which helps us understand our data better. For example, some ML models will give errors if the data has null values; we can use the info function to find details about null data and take the necessary actions to handle it.
df.info()
'''
###output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name of track 3 non-null object
1 Duration of track 3 non-null int64
2 Singer name 3 non-null object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
'''
df.shape → As the name suggests, it tells us the shape of the DataFrame, i.e., the number of rows and columns, as a tuple. Note that shape is an attribute rather than a method, so it is written without parentheses.
df.shape
'''
###output
(3, 3)
'''
df.to_numpy() → This gives us a NumPy representation of our dataset. It can be beneficial when we have a DataFrame (or a part of one) consisting of numerical values and we want to perform mathematical operations on them, as NumPy's processing speed is very fast compared to Pandas.
df.to_numpy()
'''
###output
array([['Roar', 269, 'Katy Perry'],
['Dark Horse', 225, 'Katy Perry'],
['Blank Space', 272, 'Taylor Swift']], dtype=object)
'''
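For instance, a quick sketch of a numerical operation on the duration column (column names as in our CSV example):
durations = df['Duration of track'].to_numpy()
durations.mean()
###output
255.33333333333334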
df.describe(): It gives us summary statistics for all numerical columns, like count, mean, standard deviation, min/max, and quartiles. Continuing the same example, the only numerical column in our DataFrame is the track duration, and its various statistics can be seen below using the df.describe function.
df.describe()
'''
###output
Duration of track
count 3.000000
mean 255.333333
std 26.312228
min 225.000000
25% 247.000000
50% 269.000000
75% 270.500000
max 272.000000
'''
df.columns: It gives us the names of all columns present in our DataFrame. Like shape, it is an attribute rather than a method.
df.columns
'''
###output
Index(['Name of track', 'Duration of track', 'Singer name'], dtype='object')
'''
df.set_index(column_name): Here, we can use a particular column of the DataFrame as the index instead of the default 0 to n-1. Please note that the index of the DataFrame below has changed.
df_custom_index = df.set_index("Name of Track")
df_custom_index
'''
###output
Duration of track Singer
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
Blank Space 272 Taylor Swift
'''
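The change can be undone with reset_index, which moves the index back into a regular column:
df_custom_index.reset_index()
'''
###output
Name of Track Duration of track Singer
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
'''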
df[col_name].unique() → It returns all unique values in that particular column. Sometimes, for categorical features (classes or their numerical representations) stored in a column, we need to see how many unique classes are present in the data before deciding which ML model would be best suited. This function helps there.
df['Singer name'].unique()
###output
###here we see that we get one 'Katy Perry' instead of two
array(['Katy Perry', 'Taylor Swift'], dtype=object)
df[col_name].value_counts() → Closely related, it returns each unique value in a column along with its number of occurrences.
df['Singer name'].value_counts()
###output
Katy Perry 2
Taylor Swift 1
Name: Singer name, dtype: int64
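If we only need the number of distinct values rather than the values themselves, nunique is a handy shortcut:
df['Singer name'].nunique()
###output
2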
Now that we have seen the essential functions of Pandas, it is important to know that after making changes to a DataFrame, we can save it to disk for later use. Let's see how we can do that.
Here we use a filename with the required extension to save our DataFrame. For example:
df.to_csv('songs.csv') # Saving the file in CSV format
df.to_json('songs.json') # Saving the file in JSON format
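By default, to_csv also writes the DataFrame index as an extra column in the file; if we do not want that, we can pass index=False:
df.to_csv('songs.csv', index=False) # the index values are not written to the file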
Generally, companies store a lot of data in single or multiple files. Often, only some of this data is helpful, and we need just a part of it. In the next section, we will see how to extract parts of DataFrames as per our requirements.
We often want only a part of the DataFrame to get helpful information out of it, and for this, we use slicing. This can be done in three ways: direct slicing with df[row_index1:row_index2], the loc function, and the iloc function.
Let us look at how we can implement these three methods.
df[row_index1:row_index2]: Here, we give the starting and ending index of the rows we want to extract from the DataFrame. Please note that row_index1 is inclusive while row_index2 is exclusive, just as with Python lists. We get a new DataFrame after slicing.
1)###code when both row indexes are mentioned
df[0:1]
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
We can also give only the starting index, in which case all rows from that index onward are included, as shown below:
2)###code when only the starting index is mentioned
df[1:]
###output
Name of track Duration of track Singer name
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
Directly accessing a row by its index number will give an error, because df[...] with a single value is interpreted as a column name; e.g., accessing the first row as df[0] raises a KeyError.
3)###code for accessing an element by direct index
df[0]
### we get the following error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 0
Using the loc function: This can be used to access a group of rows and columns by label or by a boolean array. More details can be found in the official Pandas documentation. Note that df_custom_index, defined earlier using df.set_index, is shown below.
df_custom_index
###output
Duration of track Singer
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
Blank Space 272 Taylor Swift
loc_result = df_custom_index.loc["Dark Horse"]
loc_result
###output
Duration of track 225
Singer Katy Perry
Name: Dark Horse, dtype: object
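Since loc also accepts a boolean array, we can use it to filter rows by a condition. A small sketch with our columns:
df_custom_index.loc[df_custom_index["Duration of track"] > 250]
'''
###output
Duration of track Singer
Name of Track
Roar 269 Katy Perry
Blank Space 272 Taylor Swift
'''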
Using the iloc function: This is integer-based positional indexing that lets us get the required rows from a DataFrame using a numerical index ranging from 0 to n-1.
iloc_result = df_custom_index.iloc[1]
iloc_result
'''
###output
Duration of track 225
Singer Katy Perry
Name: Dark Horse, dtype: object
'''
Please note that both the loc and iloc functions can be used for slicing. An example of each is shown below.
# 1) Using loc
df_custom_index.loc["Roar":"Blank Space"]
'''
###output
Duration of track Singer
Name of Track
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
Blank Space 272 Taylor Swift
'''
# 2) Using iloc
df_custom_index.iloc[0:2]
'''
###output
Duration of track Singer
Name of Track
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
'''
Please note that in the case of loc, both endpoints are inclusive, while in the case of iloc, only the starting index is inclusive. Also, an iloc slice does not have to stay within the bounds of 0 to n-1, which is similar to slicing NumPy arrays and Python lists.
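For example, a small sketch of an out-of-bounds slice (the hypothetical end index 100 is simply capped at the number of rows):
df_custom_index.iloc[0:100]
'''
###output
Duration of track Singer
Name of Track
Roar 269 Katy Perry
Dark Horse 225 Katy Perry
Blank Space 272 Taylor Swift
'''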
We can access columns of a DataFrame in two ways:
1) df[col_name] → giving a single column name returns that column as a Series.
2) df[[col_name1, col_name2]] → giving a list of column names returns a DataFrame; note that even when selecting a single column this way, the name must be inside a list.
1)###code for Point1
df["Singer name"]
###output
0 Katy Perry
1 Katy Perry
2 Taylor Swift
Name: Singer name, dtype: object
2)###code for Point2
df[["Singer name","Name of track"]]
###output
Singer name Name of track
0 Katy Perry Roar
1 Katy Perry Dark Horse
2 Taylor Swift Blank Space
We can combine the techniques learned earlier: first slice rows, then select columns, i.e., df[row_slice][columns].
As with column access, we get a Series if we mention a single column name and a DataFrame if we pass a list of columns. The row part follows the same rules as in Retrieving Rows, and the column part follows the same rules as in Retrieving Columns.
#1)###Example 1
df[1:3]["Singer name"]
'''
###output
1 Katy Perry
2 Taylor Swift
Name: Singer name, dtype: object
'''
#2)###Example 2
df[1:3][["Singer name"]]
'''
###output
Singer name
1 Katy Perry
2 Taylor Swift
'''
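A more idiomatic way to select rows and columns in one step is a single loc call (remembering that both of loc's endpoints are inclusive). A sketch:
df.loc[1:2, "Singer name"]
'''
###output
1 Katy Perry
2 Taylor Swift
Name: Singer name, dtype: object
'''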
Now we have learned how to extract parts of a DataFrame instead of the whole. But what if we want the data sorted in a particular order before slicing it? Let's see how we can sort DataFrames in the next section.
df.sort_index(ascending=False): This sorts the DataFrame in descending order of its row index values. If we specify ascending=True instead, it is sorted in ascending order.
df.sort_index(ascending=False)
###output
Name of track Duration of track Singer name
2 Blank Space 272 Taylor Swift
1 Dark Horse 225 Katy Perry
0 Roar 269 Katy Perry
df.sort_values(by=column_name): It sorts the DataFrame using the values in the column we specify. In the code below, we sort the DataFrame to get songs in increasing order of track duration.
df.sort_values(by='Duration of track')
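The expected output, with rows ordered by increasing duration:
'''
###output
Name of track Duration of track Singer name
1 Dark Horse 225 Katy Perry
0 Roar 269 Katy Perry
2 Blank Space 272 Taylor Swift
'''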
We may encounter missing or inconsistent values in the rows or columns of real-world datasets. Often these incomplete samples hamper our analysis and need to be handled or removed before building the final dataset.
In this section, we will see how to identify null values and handle them. Let's create dummy data with some None values.
df3 = df.replace({"Taylor Swift":None})
df3
'''
### here we create a dataset with a None value
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 None
'''
Now let's see some essential functions that can inform us about these missing values and provide some remedies.
df.isnull().sum() → isnull flags each cell as True or False depending on whether it is null, and summing the flags gives the count of null values per column.
df3.isnull().sum()
'''
###output
Name of track 0
Duration of track 0
Singer name 1
dtype: int64
'''
df.dropna(subset=["col_name"]): This drops all rows containing null values when no subset is specified. If a subset is specified, it drops only the rows that have a null value in one of the listed columns.
df3.dropna()
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
'''
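Instead of dropping rows, we can also fill the missing entries with a default value using fillna — a small sketch:
df3.fillna("Unknown")
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Unknown
'''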
Please note that the above methods do not change the existing DataFrame; they return a new DataFrame with the changes applied. To make the changes happen on the same DataFrame, many Pandas methods accept the argument inplace=True.
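For example, a sketch of dropping null rows in place:
df3.dropna(inplace=True) # df3 itself is modified; the method returns None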
Sometimes data can be in multiple files, and we need to combine two different DataFrames. Let's see how we can do that.
These are useful when the data is spread across different files, but we want it in a single data structure.
###here we split a single dataframe into two using slicing; we will then concatenate them back
df4 = df[0:1]
df5 = df[1:3].reset_index(drop=True)
df5
'''
###output
Name of track Duration of track Singer name
0 Dark Horse 225 Katy Perry
1 Blank Space 272 Taylor Swift
'''
df4
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
'''
pd.concat([dataframe1,dataframe2],ignore_index=True) → It helps us concatenate two (or more) DataFrames.
pd.concat([df4,df5],ignore_index=True)
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
1 Dark Horse 225 Katy Perry
2 Blank Space 272 Taylor Swift
'''
Also, 'ignore_index=True' is used so that the original index values are discarded and the result gets a fresh index. The default value of ignore_index is False, in which case we would have two or more rows sharing the same index number:
pd.concat([df4,df5])
'''
###output
Name of track Duration of track Singer name
0 Roar 269 Katy Perry
0 Dark Horse 225 Katy Perry
1 Blank Space 272 Taylor Swift
'''
The pd.merge() function joins two DataFrames on their common column(s) (here 'name') using the specified join method. An example is shown below.
data_1 = {
"name": ["Katy", "Taylor", "John"],
"Experience": [3, 5, 2]
}
data_2 = {
"name": ["Katy", "Taylor", "John"],
"experience": [1, 4, 7]
}
df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)
newdf = df_1.merge(df_2, how='right')
newdf
###Output
name Experience experience
0 Katy 3 1
1 Taylor 5 4
2 John 2 7
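To see why the 'how' argument matters, here is a sketch with hypothetical non-matching keys: a right join keeps every row of the right DataFrame and fills the missing left-side values with NaN.
left = pd.DataFrame({"name": ["Katy", "Taylor"], "Experience": [3, 5]})
right = pd.DataFrame({"name": ["Taylor", "John"], "experience": [4, 7]})
left.merge(right, how='right')
'''
###output
name Experience experience
0 Taylor 5.0 4
1 John NaN 7
'''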
Here we utilize the benefits of a lambda function on the DataFrame using the apply function, as shown in the example. We use a lambda function when we want to operate on every value of one column in our data.
df["Singer name"] = df["Singer name"].apply(lambda x : x[:4])
df
'''
##output
Name of track Duration of track Singer name
0 Roar 269 Katy
1 Dark Horse 225 Katy
2 Blank Space 272 Tayl
'''
In the above example, we wanted to shorten the singer names and keep only the first four characters of each name for a singing contest. Also, we reassigned the output to the 'Singer name' column of the original DataFrame; otherwise, apply would just return the result as a Series, and nothing would change in the original DataFrame.
That's all for the basics of Pandas, but before closing this session, let's think about one critical question: why do we need DataFrames at all?
We need to know that different data structures have different processing speeds. In terms of time taken for numerical operations, list > DataFrame > NumPy ndarray, i.e., the ndarray is the fastest. Lists are Python's most commonly used data structure as they are easy to use and can hold multiple data types, but they are not convenient for mathematical processing. So we use ndarrays to perform mathematical operations, as they are extremely fast. But what if we want to merge data from multiple datasets or read data from Excel or HDF5 files? For that, we need DataFrames.
Each data structure has its benefits, uses, and processing time, but if we want a one-stop destination library of Python for analyzing data, Pandas would be the best.
Some of the most common mistakes people make while using Pandas are:
Believing that performing a single operation on a DataFrame per statement makes the code efficient and easy to understand. This is not true: the performance of the code depends on two things, the algorithm's efficiency and the memory consumption. The algorithm's efficiency remains the same even if multiple operations are chained in a single step instead of creating multiple DataFrames, so the deciding factor becomes memory utilization. Please see the code below.
import pandas as pd
df1 = pd.read_csv('song_stats.csv')
df2 = df1.dropna()
df3 = df2.groupby('Singer name')
Now, in the end, we only use df3, but df1 and df2 are also kept around, which wastes memory and hence affects the performance of the code. To avoid this, we can chain all the steps into one, as shown below:
df = pd.read_csv('song_stats.csv').dropna().groupby('Singer name')
We have now covered all the essentials of Pandas. Let's wrap up with a quick summary.
Pandas is a valuable tool for Python developers, allowing for efficient data visualization, exploration, and cleaning. In today's data-driven world, it has become necessary for developers in various fields, from model building to data engineering. In this article, we covered the basics of Pandas, including handling missing values, performing basic operations, and joining DataFrames. We hope you found the information helpful.