Database Indexing is a technique to improve the performance of a database by organizing data in a way that makes it easier to retrieve information. An index is similar to an index in a book, serving as a reference to the data stored in a database table.
When you index a database, you create an additional data structure that helps quickly locate a specific field value and its corresponding record. This structure is usually sorted for efficient binary search operation, which makes it faster to perform read operations. Indexing can also help to enforce the uniqueness of data in the database.
There are several different indexing methods available. But before exploring those, it's important to understand the need for indexing. For example, indexing might be necessary when a database has many fields and needs to be organized for improved performance.
Suppose you are in your city’s central library and have 5 minutes to find a particular book. Would you accomplish this task and find the book in the specified time frame? It would be nearly impossible to search for any book in such an extensive collection by checking each section. However, if we request access to the library’s index, our task would be easier because indexes provide all the information required to locate the book quickly. Similarly, a database index includes all of the information needed to access data efficiently.
Nowadays, IT industry is generating vast amounts of data. All such data need to be stored, but we need a way to efficiently store and retrieve such a massive amount of data. The ability of a company to access the data quickly is critical to its success. One way to achieve this is using the similar idea of maintaining indexes as we saw in the Library example. The indexes function as lookup tables for faster retrieval.
There are primarily three methods of database indexing:
Cluster indexing is a form of storage in which more than two records are stored in the same file. We may reduce the search cost by using cluster indexing since many records related to the same item are stored in one location.
In this method, an ordered data file is used to describe the clustering index, and a non-key field is used to sort the data file. In some instances, the index is built on columns other than the primary key, which may or may not be unique for each record. In such instances, we can group two or more columns to get the unique values and index them to make it easier to find the data. Hence records with similar characteristics are grouped, and indexes for these groups are generated.
A secondary index indicates where the data is stored by providing a list of virtual pointers or references to the data’s location. The data is not physically processed in the index’s order but is stored in leaf nodes.
For example, consider a book’s contents page. Each entry specifies the information’s page number or position. Although the data is not structured, we have an ordered reference showing where the data points are located. We can only have dense ordering in the non-clustered index because data is not physically ordered, so sparse ordering is unlikely. It takes longer than the clustered index because more work is required to extract the data by following the pointer further. The data is immediately in front of the index in the case of a clustered index.
Indexes expand in tandem with the size of the database. A single-level index can become too big to store with multiple disc accesses because it is stored in the main memory. Multilevel indexing divides the main block into smaller blocks stored together in a single block. Internal blocks are separated into outer blocks, which are then pointed to the data blocks. With fewer overheads, this can be conveniently stored in the main memory.
Data is stored in rows and organized into tables in a database. Each row has a unique key that distinguishes it from the others, and these keys are stored in an index for easy retrieval. Since keys are stored in indexes, the index is automatically modified whenever a new row with a specific key is inserted.
As data is saved on a disk-based storage unit, it is organized into blocks. The atomic disc access operation is when these blocks are accessed in their entirety. Disk blocks are organized similarly to linked lists: they have a data section and a pointer to the next node (or block), and they don’t have to be stored in the same order.
We may claim that searching on a field that isn’t sorted requires a Linear Search because multiple records can only be sorted on one field. If that field is a non-key field, N block accesses are needed to search the entire table space. With a sorted field with log2 N block accesses, a Binary Search can be used. Furthermore, since the data is sorted based on a non-key field, the rest of the table does not need to be searched for duplicate values until a higher value is discovered. As a result, the performance boost is essential.
Indexes are a powerful method that a database uses in the background to speed up querying. Indexes power queries by allowing for a fast lookup of the requested information. It’s a data structure technique for quickly locating and accessing data in a database. Indexing is a technique for improving database efficiency by reducing the number of disc accesses needed when a query is running.