Count Distinct Elements in Every k Size Window

Difficulty: Medium, Asked-in: Facebook, Amazon, Microsoft

Key takeaway: An excellent problem to learn problem-solving using hashing and the sliding window technique.

Let’s understand the problem!

You are given an array X[] of size n and an integer k. Write a program to count the distinct elements present in every window of size k of the array.

  • The size of the array will always be greater than or equal to k.
  • A window of size k is a contiguous block of k elements of the array.
  • We need to return an array with the count of distinct elements in every window of size k.

Examples

Input: X[ ] = [1, 1, 1, 3, 4, 2, 3], k = 4, Output: [2, 3, 4, 3]

Explanation:

  • 1st window = [1, 1, 1, 3], Distinct elements count = 2
  • 2nd window = [1, 1, 3, 4], Distinct elements count = 3
  • 3rd window = [1, 3, 4, 2], Distinct elements count = 4
  • 4th window = [3, 4, 2, 3], Distinct elements count = 3

Input: X[ ] = [1, 2, 4, 4, 2], k = 3, Output: [3, 2, 2]

Explanation:

  • 1st window = [1, 2, 4], Distinct elements count = 3
  • 2nd window = [2, 4, 4], Distinct elements count = 2
  • 3rd window = [4, 4, 2], Distinct elements count = 2

Discussed solutions approaches

  • Brute force approach by considering every window
  • Further improvement using sorting and binary search
  • Further improvement using sorting and single loop
  • Further improvement using a hash table
  • Efficient approach using the sliding window technique

Brute force approach by considering every window

Solution idea

The basic idea would be to consider each window of size k while traversing the array and simultaneously counting the distinct elements in each window.

Solution steps

  • We declare an output[] array of size n - k + 1 to store the distinct element count of each window. Note: In an array of size n, there are n - k + 1 windows of size k. For example, for n = 7 and k = 4 (the first example above), there are 7 - 4 + 1 = 4 windows. Think!
  • Now we traverse the array using nested loops to explore each window of size k.
  • We run the outer loop from i = 0 to n - k and explore each window starting from index i and ending at index i + k - 1. Before beginning the inner loop, we initialize a variable distCount to count the distinct elements in the current window.
  • Now we traverse each window using the inner loop from j = i to i + k - 1 and count the distinct elements in that window.
  • Inside the inner loop, for each element X[j], we search linearly in the window prefix from index i to j - 1 to check whether it has already appeared. If the element is not present in this prefix, it is the first occurrence inside the window, so we increase distCount by 1.
  • By the end of the inner loop, we store the distCount of the current window in the output[] array.
  • By the end of the outer loop, we return the output[] array containing the distinct count of each window.

Solution code C++

int* getDistinctCount(int X[], int n, int k)
{
    int* output = new int[n - k + 1];
    for (int i = 0; i <= n - k; i = i + 1)
    {
        int distCount = 0;
        // Count distinct elements in the window [i, i + k - 1].
        for (int j = i; j < i + k; j = j + 1)
        {
            distCount = distCount + 1;
            // If X[j] already appeared in the prefix [i, j - 1],
            // it is a duplicate, so undo the count.
            for (int p = i; p < j; p = p + 1)
            {
                if (X[j] == X[p])
                {
                    distCount = distCount - 1;
                    break;
                }
            }
        }
        output[i] = distCount;
    }
    return output;
}

Time and space complexity analysis

Here we are using three nested loops: the outer loop runs n - k + 1 times, the middle loop runs k times, and the inner loop runs j - i times, i.e., at most k times. So the total number of loop iterations in the worst case = (n - k + 1) * k * k = O(n*k^2). We do O(1) work per iteration, so time complexity = O(n*k^2) * O(1) = O(n*k^2).

Space complexity = O(1), we are using constant extra space to generate the output. Think!

Further improvement using sorting and binary search

Now the critical question is: can we improve the time complexity of the above approach further? In the inner loop, the critical operation is searching for a duplicate of element X[j] in the prefix [i, j-1] of the current window. So, instead of running two loops to count the distinct elements, can we apply the idea of binary search by sorting the current window? Let's think!

The idea is simple: for every window, we copy it into a temporary array and sort the copy (sorting the window in place would rearrange the original array and corrupt later windows). Then, inside the inner loop, we apply binary search on each element of the sorted copy to check whether it already appears in the prefix before it. If it is not present in that prefix, it is a new distinct element, so we increase distCount by 1. Think!

Solution pseudocode

int[] getDistinctCount(int X[], int n, int k)
{
    int output[n - k + 1]
    for (int i = 0; i <= n - k; i = i + 1)
    {
        int W[k] = copy of the window X[i] to X[i + k - 1]
        sort(W, 0, k - 1)
        int distCount = 0
        for (int j = 0; j < k; j = j + 1)
        {
            distCount = distCount + 1
            int index = binarySearch(W, 0, j - 1, W[j])
            if (index != -1)
                distCount = distCount - 1
        }
        output[i] = distCount
    }
    return output
}

Solution code C++

int* getDistinctCount(int X[], int n, int k)
{
    int* output = new int[n - k + 1];
    for (int i = 0; i <= n - k; i = i + 1)
    {
        // Sort a copy of the current window so that the original
        // array (and hence later windows) stays unchanged.
        vector<int> W(X + i, X + i + k);
        sort(W.begin(), W.end());
        int distCount = 0;
        for (int j = 0; j < k; j = j + 1)
        {
            distCount = distCount + 1;
            // If W[j] already appears in the sorted prefix W[0..j-1],
            // it is a duplicate inside this window.
            if (binary_search(W.begin(), W.begin() + j, W[j]))
                distCount = distCount - 1;
        }
        output[i] = distCount;
    }
    return output;
}

Time and space complexity analysis

Time complexity = (n - k + 1) * (Time complexity of copying and sorting a k size window + Time complexity of searching each element using binary search) = (n - k + 1) * (O(k) + O(k*logk) + k * O(logk)) = O(n) * O(k*logk) = O(n*k*logk)

Space complexity = O(k), for the temporary copy of the window that we sort. Think!

Further improvement using sorting and single loop

Can we further optimize the above approach? Let's think!

After sorting the window copy in the above approach, all duplicate elements are placed adjacent to each other. So rather than applying binary search for each window element, we can count the distinct elements in a single traversal of the sorted copy: whenever the value changes, we have found a new distinct element. Here, we can use the idea of the fast and slow pointers used in this blog: Remove duplicates from sorted array.

Solution code C++

int* getDistinctCount(int X[], int n, int k)
{
    int* output = new int[n - k + 1];
    for (int i = 0; i <= n - k; i++)
    {
        // Sort a copy of the current window so that the original
        // array (and hence later windows) stays unchanged.
        vector<int> W(X + i, X + i + k);
        sort(W.begin(), W.end());
        // In a sorted window, duplicates are adjacent: count a new
        // distinct element whenever the value changes.
        int slow = 0;
        int fast = 1;
        int distCount = 1;
        while (fast < k)
        {
            if (W[slow] != W[fast])
            {
                distCount = distCount + 1;
                slow = fast;   // slow now points to the latest distinct value
            }
            fast = fast + 1;
        }
        output[i] = distCount;
    }
    return output;
}
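
As a side note, the counting step after sorting can also be done with the C++ standard library: std::unique moves consecutive duplicates past the logical end of the range, so the distance from the beginning to the returned iterator is the distinct count. Below is a minimal sketch of this variant for a single window; the helper name distinctInWindow is an illustrative assumption and not part of the solution above.

#include <algorithm>
#include <vector>
using namespace std;

// Sketch: count distinct values in the window [i, i + k - 1] using
// a sorted copy and std::unique instead of fast/slow pointers.
int distinctInWindow(int X[], int i, int k)
{
    vector<int> W(X + i, X + i + k);   // copy the window
    sort(W.begin(), W.end());          // duplicates become adjacent
    // unique() returns an iterator one past the last unique element.
    return unique(W.begin(), W.end()) - W.begin();
}

Calling this helper for every i from 0 to n - k gives the same output with the same O(n*k*logk) time and O(k) space.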

Time and space complexity analysis

Time complexity = (n - k + 1) * (Time complexity of copying and sorting a k size window + Time complexity of counting unique elements in the sorted copy) = (n - k + 1) * (O(k) + O(k*logk) + O(k)) = O(n) * O(k*logk) = O(n*k*logk). The time complexity looks similar to the above approach, but we avoid the per-element binary search, so it is slightly faster in practice.

Space complexity = O(k), for the temporary copy of the window that we sort. Think!

Further improvement using a hash table

Solution idea

In the above approaches, sorting contributes the dominant factor to the time complexity. Can we count the unique elements in a window without sorting? Can we apply the idea of hashing within a window to improve the time complexity of searching for distinct elements? Let's think!

The idea is: we traverse each window of size k using a loop and use a hash table to store the unique elements of that window. Whenever a window element is not present in the hash table, we insert it and increase the distinct count by 1. Think!

Solution pseudocode

int[] getDistinctCount(int X[], int n, int k)
{
    int output[n - k + 1]
    for(int i = 0; i <= n - k; i = i + 1)
    {
        hash table H
        int distCount = 0
        for(int j = i; j < i + k; j = j + 1)
        {
            if (H.search(X[j]) == false)
            {
                H.insert(X[j])
                distCount = distCount + 1
            }    
        }    
        output[i] = distCount
    }
    return output
}

Solution code C++

int* getDistinctCount(int X[], int n, int k)
{
    int* output = new int[n - k + 1];
    for (int i = 0; i <= n - k; i = i + 1)
    {
        unordered_map<int, bool> H;
        int distCount = 0;
        for (int j = i; j < i + k; j = j + 1)
        {
            if (H.count(X[j]) == 0)
            {
                H[X[j]] = true;
                distCount = distCount + 1;
            }
        }
        output[i] = distCount;
    }
    return output;
}

Solution code Python

def getDistinctCount(X, n, k):
    output = [0] * (n - k + 1)
    for i in range(n - k + 1):
        H = {}
        distCount = 0
        for j in range(i, i + k):
            if X[j] not in H:
                H[X[j]] = True
                distCount = distCount + 1
        output[i] = distCount
    return output

Solution code Java

import java.util.HashMap;

public class Solution
{
    public int[] getDistinctCount(int[] X, int n, int k) 
    {
        int[] output = new int[n - k + 1];
        for (int i = 0; i <= n - k; i = i + 1) 
        {
            HashMap<Integer, Boolean> H = new HashMap<>();
            int distCount = 0;
            for (int j = i; j < i + k; j = j + 1) 
            {
                if (!H.containsKey(X[j])) 
                {
                    H.put(X[j], true);
                    distCount = distCount + 1;
                }
            }
            output[i] = distCount;
        }
        return output;
    }
}
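
In C++, since we only need to test membership and never use the mapped boolean value, an unordered_set works equally well. Below is a minimal sketch of this variant; it packages the result in a vector instead of a raw pointer, which is my own choice and not part of the code above.

#include <unordered_set>
#include <vector>
using namespace std;

// Sketch: the same hash-based idea, but with one unordered_set per window.
vector<int> getDistinctCountSet(const vector<int>& X, int k)
{
    int n = X.size();
    vector<int> output;
    for (int i = 0; i + k <= n; i = i + 1)
    {
        // The set keeps one copy of each value, so its size is the
        // distinct count of the window [i, i + k - 1].
        unordered_set<int> window(X.begin() + i, X.begin() + i + k);
        output.push_back((int)window.size());
    }
    return output;
}

The time and space complexity is the same as the map-based version: O(n*k) time and O(k) extra space per window.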

Time and space complexity analysis

Time complexity = (n - k + 1) * Time complexity of counting unique elements in a k size window using a hash table = O(n) * k * O(1) = O(n*k). Note: the time complexity of searching and insertion in a hash table is O(1) on average.

Space complexity = O(k), for using the O(k) size hash table to count distinct elements in the k size window.

Efficient approach using the sliding window technique

Solution idea

Now the critical question is: can we do better and solve this problem in a single scan of the array? Can we use some additional insight from the problem to get the idea of an efficient solution?

If we observe the problem, windows of size k are placed linearly adjacent to each other. For example, the first window is [0, k-1], the second window is [1, k], the third window is [2, k+1], and so on. Any window of size k shares k - 1 elements with the previous adjacent window, i.e., we drop the first element of the previous window and include one new element at its end.

The idea behind the sliding window is: we can easily calculate the distinct element count of the current window if we know the distinct element count of the previous window. So we find the count of distinct elements in each window of size k by shifting the window one position to the right and reusing the computation done in the previous step.

Now the critical question is: how do we track the unique element count of the current window using the unique element count of the previous window?

The idea is: we can use a hash table to store the frequency count of the elements in the current window. Suppose the elements from index i to i + k - 1 are stored in the hash table as element-frequency pairs. To update the hash table for the window from i + 1 to i + k, we decrease the frequency count of the ith element by one and increase the frequency count of the (i + k)th element by one. For example, moving from the window [1, 1, 1, 3] to [1, 1, 3, 4] in the first example: freq[1] drops from 3 to 2 (the distinct count stays 2) and freq[4] rises from 0 to 1 (the distinct count becomes 3).

Solution steps

  • We create an empty hash table freq to store element-frequency pairs and initialize the count of distinct elements to 0, i.e., distCount = 0.
  • Now we traverse the first window using a loop and store the frequency count of every element in the hash table, updating distCount along the way. By the end of the loop, we store the distinct count of the first window in the output.
  • Now we traverse the remaining array and process each window of size k. For each window:
  • We remove the first element of the previous window: if its frequency was 1, it no longer appears in the new window, so we decrease distCount by 1. In every case, we decrease its frequency count by 1.
  • We add the new element, i.e., the last element of the current window: if its frequency is currently 0 (it is not present in the window), we increase distCount by 1. In every case, we increase its frequency count by 1.
  • At the end of this step, we store the distinct count of the current window in the output.
  • By the end of the loop, the distinct count of each window is stored in the output array, and we return it.

Solution pseudocode

int[] getDistinctCount(int X[], int n, int k)
{
    int output[n - k + 1]
    Hash Table <int, int> freq
    int distCount = 0
    for (int i = 0; i < k; i = i + 1) 
    {
        if (freq.search(X[i]) == false)
            distCount = distCount + 1
        freq[X[i]] = freq[X[i]] + 1
    }
    
    output[0] = distCount
    for (int i = k; i < n; i = i + 1)
    {
        if (freq[X[i - k]] == 1) 
            distCount = distCount - 1
          
        freq[X[i - k]] = freq[X[i - k]] - 1
        
        if (freq[X[i]] == 0) 
            distCount = distCount + 1
            
        freq[X[i]] = freq[X[i]] + 1
        output[i - k + 1] = distCount
   }
   return output
}

Solution code C++

int* getDistinctCount(int X[], int n, int k)
{
    int* output = new int[n - k + 1];
    unordered_map<int, int> freq;
    int distCount = 0;
    // Count distinct elements of the first window [0, k - 1].
    for (int i = 0; i < k; i = i + 1) 
    {
        if (freq.count(X[i]) == 0)
            distCount = distCount + 1;
        freq[X[i]] = freq[X[i]] + 1;
    }
    
    output[0] = distCount;
    // Slide the window one step at a time.
    for (int i = k; i < n; i = i + 1)
    {
        // Remove the leftmost element of the previous window.
        if (freq[X[i - k]] == 1) 
            distCount = distCount - 1;
          
        freq[X[i - k]] = freq[X[i - k]] - 1;
        
        // Add the new rightmost element of the current window.
        if (freq[X[i]] == 0) 
            distCount = distCount + 1;
            
        freq[X[i]] = freq[X[i]] + 1;
        output[i - k + 1] = distCount;
   }
   return output;
}

Solution code Python

from collections import defaultdict

def getDistinctCount(X, n, k):
    output = []
    freq = defaultdict(int)
    distCount = 0

    # Count distinct elements of the first window.
    for i in range(k):
        if freq[X[i]] == 0:
            distCount = distCount + 1
        freq[X[i]] = freq[X[i]] + 1

    output.append(distCount)

    # Slide the window one step at a time.
    for i in range(k, n):
        # Remove the leftmost element of the previous window.
        if freq[X[i - k]] == 1:
            distCount = distCount - 1
        freq[X[i - k]] = freq[X[i - k]] - 1

        # Add the new rightmost element: check the stored count, not
        # key presence, because a key may remain with frequency 0.
        if freq[X[i]] == 0:
            distCount = distCount + 1
        freq[X[i]] = freq[X[i]] + 1

        output.append(distCount)

    return output

Solution code Java

import java.util.HashMap;

public class Solution
{
    public int[] getDistinctCount(int[] X, int n, int k) 
    {
        int[] output = new int[n - k + 1];
        HashMap<Integer, Integer> freq = new HashMap<>();
        int distCount = 0;
        
        // Count distinct elements of the first window.
        for (int i = 0; i < k; i = i + 1) 
        {
            if (freq.getOrDefault(X[i], 0) == 0)
                distCount = distCount + 1;

            freq.put(X[i], freq.getOrDefault(X[i], 0) + 1);
        }
        
        output[0] = distCount;
        // Slide the window one step at a time.
        for (int i = k; i < n; i = i + 1) 
        {
            // Remove the leftmost element of the previous window.
            if (freq.get(X[i - k]) == 1) 
                distCount = distCount - 1;

            freq.put(X[i - k], freq.get(X[i - k]) - 1);
            
            // Add the new rightmost element: check the stored count,
            // not key presence, because a key may remain with frequency 0.
            if (freq.getOrDefault(X[i], 0) == 0)
                distCount = distCount + 1;

            freq.put(X[i], freq.getOrDefault(X[i], 0) + 1);
            output[i - k + 1] = distCount;
        }
        return output;
    }
}
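
To see the sliding window solution in action, here is a small C++ driver for the first example. This is only a sketch: it assumes the C++ getDistinctCount function defined above is available in the same file, and since that function allocates the result with new[], the caller releases it with delete[].

#include <iostream>
using namespace std;

int main()
{
    int X[] = {1, 1, 1, 3, 4, 2, 3};
    int n = 7, k = 4;
    int* output = getDistinctCount(X, n, k);
    // There are n - k + 1 windows, so print that many counts.
    for (int i = 0; i <= n - k; i = i + 1)
        cout << output[i] << " ";      // expected output: 2 3 4 3
    cout << endl;
    delete[] output;                   // free the array allocated with new[]
    return 0;
}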

Time and space complexity analysis

Time complexity = Time complexity of counting unique elements of the first window + Time complexity of processing the remaining n - k windows using the sliding window = O(k) + O(n - k) = O(n). Note: the time complexity of searching and insertion in a hash table is O(1) on average.

Space complexity = O(k), for using the O(k) size hash table.

Critical ideas to think!

  • What would be the best and worst-case scenario for all the above approaches?
  • What sorting algorithm will you prefer when there are duplicates in the input?
  • How does the time complexity of the above approaches depend on the value of k? Analyze the above approaches for k = 2, n/4, n/2, and n - 2.
  • Can you modify the above approaches to find the unique elements of each window?
  • In the last two approaches, what would be the total number of hash table operations?
  • Can you think of solving this problem using some other approach?
  • Why are we not considering output[] array in space complexity?

Comparison of time and space complexities

  • Brute force approach: Time = O(n*k^2), Space = O(1)
  • Sorting and binary search: Time = O(n*k*logk), Space = O(k) for the window copy
  • Sorting and single loop: Time = O(n*k*logk), Space = O(k) for the window copy
  • Hash table: Time = O(n*k), Space = O(k)
  • Sliding window approach: Time = O(n), Space = O(k)

Suggested coding questions to practice

  • Longest substring without repeating characters
  • Longest consecutive sequence
  • Minimum size subarray sum
  • Sliding window maximum
  • Minimum window substring
  • Subarray product less than K
  • First missing positive
  • N-repeated element in size 2N array

Please write comments if you find an error or you want to share more insights about the topic. Enjoy learning, Enjoy coding, Enjoy algorithms!
