Find the Index of the First Occurrence in a String

Easy

Find the Index of the First Occurrence in a String

Easy

Solution

Brute Force Approach

One straightforward approach is to use a brute-force method. We iterate through the haystack string, and at each index, check if the needle string matches the substring of haystack starting from that index. If a match is found, we return the starting index. If we reach the end of the haystack without finding a match, we return -1.

Code (Brute Force)

def strStr_brute_force(haystack: str, needle: str) -> int:
    if not needle:
        return 0
    
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i+len(needle)] == needle:
            return i
    
    return -1

Time Complexity (Brute Force)

O(m*n), where n is the length of haystack and m is the length of needle. In the worst case, we might need to compare needle with almost every substring of haystack.

Space Complexity (Brute Force)

O(1), as we are only using a constant amount of extra space.

Optimal Solution: KMP Algorithm

A more efficient approach is to use the Knuth-Morris-Pratt (KMP) algorithm. The KMP algorithm preprocesses the needle string to create a table that helps to avoid unnecessary comparisons. This table, often called the longest proper prefix which is also a suffix (LPS) table, stores the length of the longest proper prefix of needle that is also a suffix of the prefix ending at each index. The KMP algorithm achieves O(n) time complexity where n is the length of the haystack.

Code (KMP)

def strStr(haystack: str, needle: str) -> int:
    if not needle:
        return 0

    def computeLPSArray(needle: str) -> list[int]:
        length = len(needle)
        lps = [0] * length
        length_of_prefix = 0 # Length of the previous longest prefix suffix
        i = 1

        while i < length:
            if needle[i] == needle[length_of_prefix]:
                length_of_prefix += 1
                lps[i] = length_of_prefix
                i += 1
            else:
                if length_of_prefix != 0:
                    length_of_prefix = lps[length_of_prefix - 1]
                else:
                    lps[i] = 0
                    i += 1
        return lps

    lps = computeLPSArray(needle)
    i = 0 # index for haystack
    j = 0 # index for needle
    while i < len(haystack):
        if needle[j] == haystack[i]:
            i += 1
            j += 1

        if j == len(needle):
            return i - j
        elif i < len(haystack) and needle[j] != haystack[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return -1

Time Complexity (KMP)

O(n + m), where n is the length of haystack and m is the length of needle. The computeLPSArray function takes O(m) time, and the main loop takes O(n) time.

Space Complexity (KMP)

O(m), as we need to store the LPS array, which has the same length as the needle.

Edge Cases

Empty needle: If needle is an empty string, the function should return 0, as an empty string is always found at the beginning of any string.
needle longer than haystack: If needle is longer than haystack, the function should return -1, as needle cannot be a substring of haystack.
haystack or needle containing non-lowercase English characters: The problem statement specifies lowercase characters only, but you could clarify this during the interview and ask about handling other characters. If required to handle all characters, the same logic applies; the character set does not affect the algorithm's time complexity.

Solution

Brute Force Approach

Code (Brute Force)

def strStr_brute_force(haystack: str, needle: str) -> int:
    if not needle:
        return 0
    
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i+len(needle)] == needle:
            return i
    
    return -1

Time Complexity (Brute Force)

O(m*n), where n is the length of haystack and m is the length of needle. In the worst case, we might need to compare needle with almost every substring of haystack.

Space Complexity (Brute Force)

O(1), as we are only using a constant amount of extra space.

Optimal Solution: KMP Algorithm

Code (KMP)

def strStr(haystack: str, needle: str) -> int:
    if not needle:
        return 0

    def computeLPSArray(needle: str) -> list[int]:
        length = len(needle)
        lps = [0] * length
        length_of_prefix = 0 # Length of the previous longest prefix suffix
        i = 1

        while i < length:
            if needle[i] == needle[length_of_prefix]:
                length_of_prefix += 1
                lps[i] = length_of_prefix
                i += 1
            else:
                if length_of_prefix != 0:
                    length_of_prefix = lps[length_of_prefix - 1]
                else:
                    lps[i] = 0
                    i += 1
        return lps

    lps = computeLPSArray(needle)
    i = 0 # index for haystack
    j = 0 # index for needle
    while i < len(haystack):
        if needle[j] == haystack[i]:
            i += 1
            j += 1

        if j == len(needle):
            return i - j
        elif i < len(haystack) and needle[j] != haystack[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return -1

Time Complexity (KMP)

O(n + m), where n is the length of haystack and m is the length of needle. The computeLPSArray function takes O(m) time, and the main loop takes O(n) time.

Space Complexity (KMP)

O(m), as we need to store the LPS array, which has the same length as the needle.

Edge Cases

Empty needle: If needle is an empty string, the function should return 0, as an empty string is always found at the beginning of any string.
needle longer than haystack: If needle is longer than haystack, the function should return -1, as needle cannot be a substring of haystack.
haystack or needle containing non-lowercase English characters: The problem statement specifies lowercase characters only, but you could clarify this during the interview and ask about handling other characters. If required to handle all characters, the same logic applies; the character set does not affect the algorithm's time complexity.