Master Rolling Hash for Fast String Matching & Interview Prep

Rolling hash is a string hashing technique that updates a substring hash in constant time as the window moves. It matters because search systems, plagiarism detectors, log scanners, and DNA sequence tools often need fast pattern matching over millions of characters. After reading, you can implement it safely and explain its trade-offs.

String algorithms usually compare characters one by one, but hashing converts substrings into compact numeric fingerprints. Rolling hash builds on hashing and powers the Rabin-Karp algorithm, substring equality checks, duplicate detection, palindrome queries, and two-dimensional pattern matching in grids or images.

You will be able to choose bases and moduli, compute prefix hashes, run Rabin-Karp, handle collisions, compare substrings in O(1), and answer common GATE-style and interview questions with clear complexity analysis.

Core Concepts

Rolling hash is not one isolated trick. It combines string encoding, polynomial hashing, prefix precomputation, modular arithmetic, sliding window updates, and collision-aware verification. The table below covers the standard variants and use cases you should know before implementing a production-quality or interview-ready solution.

1.Character Encoding

String hashing begins by converting characters into integers. The hash function does not understand letters directly; it works on numbers. For lowercase English letters, a common mapping is a = 1, b = 2, and so on. For general Unicode text, using the code point with ord() is safer, though it may require normalization when visually identical text can have different byte representations.

A familiar example is searching for a customer name inside a CRM export: the word “Riya” must first become numbers before hashing can begin. An industry-specific example is scanning hospital discharge summaries where tokens such as “HbA1c” and “insulin” should be encoded consistently before medical record deduplication.

Never map a valid character to zero in polynomial hashing unless you fully understand the consequence. Leading zero values can make different strings easier to confuse, especially when prefixes are involved.

Code Example

2.Polynomial Hashing

Polynomial hashing treats a string like a number written in a custom base. For a string such as “abc”, one common form is 1 × p² + 2 × p¹ + 3 × p⁰, computed under a large modulus to keep numbers small. The base spreads character positions apart, while the modulus prevents integer overflow and keeps operations efficient.

A familiar example is checking whether two saved coupon codes in an e-commerce app are likely identical before doing a full string comparison. A more industry-specific example is matching invoice reference strings in banking reconciliation, where millions of transaction remarks may need quick equality checks.

A standard GATE/interview question asks for the time complexity of hashing a string of length n. Computing one polynomial hash from scratch is O(n); comparing already computed hash values is O(1), subject to collision handling.

Code Example

3.Prefix Hashes

Prefix hashes precompute the hash of every prefix of a string, allowing any substring hash to be extracted in O(1). If H[i] stores the hash of the first i characters, the substring from l to r can be obtained by subtracting the earlier prefix after aligning powers of the base.

A familiar example is checking repeated phrases inside a long WhatsApp export without comparing every character each time. A SaaS-specific example is comparing route patterns in API gateway logs, where substrings of request paths such as “/v1/payments” must be matched quickly.

The most common prefix-hash bug is forgetting to multiply the left prefix by the correct base power before subtraction. The formula must align both hashes to the same positional scale.

Code Example

4.Rolling Window Hash

A rolling window hash updates the hash of a fixed-length substring when the window shifts by one character. Instead of recomputing the hash for every window in O(m), it removes the outgoing character, shifts the remaining contribution, and adds the incoming character. This is the core performance improvement behind fast pattern matching.

A familiar example is detecting whether a short OTP template appears inside SMS text while scanning inbox messages. An observability example is searching a repeated error signature inside large server logs streamed from Kubernetes pods.

For a window of length m, precompute base^(m-1). That value is needed to remove the outgoing character contribution before shifting the window.

Code Example

5.Rabin-Karp Algorithm

The Rabin-Karp algorithm uses rolling hash to search a pattern of length m inside a text of length n. It hashes the pattern once, then rolls a window of size m across the text. When the hash matches, it verifies the actual substring to avoid false positives caused by collisions.

A familiar example is finding a train number pattern inside an IRCTC notification message. A content-platform example is scanning uploaded article text for a copyrighted paragraph snippet before publication.

The standard answer: Rabin-Karp has expected O(n + m) time with a good hash and verification only on hash matches. The worst case is O(nm) when many collisions force repeated full comparisons.

Code Example

6.Collision Handling

A collision occurs when two different strings produce the same hash. Hashing is fast because it compresses a potentially long string into a fixed-size value, but that compression means collisions are mathematically possible. Correct pattern matching code must either verify candidate substrings or use a collision-reduction strategy such as double hashing.

A familiar example is two different food item names in a Zomato menu producing the same fingerprint by chance. A banking-security example is a fraud scanner comparing transaction remarks; a hash match should trigger verification, not automatic classification as identical text.

Never say rolling hash guarantees equality. Equal strings have equal hashes, but equal hashes do not always prove equal strings unless the original substring is also verified or a collision-free domain-specific method is used.

Code Example

7.Double Hashing

Double hashing computes two independent hash values, usually with different large prime moduli. A substring is considered a candidate match only when both hashes match. This does not make collisions impossible, but it makes accidental collisions far less likely than a single hash in typical programming-contest and interview settings.

A familiar example is comparing PAN-like identifiers in a cleaned dataset where one hash may be too risky for deduplication. An ed-tech example is detecting duplicate answer snippets across thousands of submissions, where false matches can unfairly flag students unless the system verifies or uses stronger hashing.

Double hashing is a practical compromise: faster than full comparison for every window and safer than one modulus. In security-sensitive applications, use cryptographic hashes or direct verification instead of relying on polynomial hashes alone.

Code Example

8.Base And Modulus

The base and modulus control both distribution and performance. A small base can create more structure in the hash values, while a poor modulus can increase collisions. Common practice is to use a large random-looking base and one or two large prime moduli. In C++, unsigned 64-bit overflow is also common because arithmetic naturally wraps around modulo 2^64.

A familiar example is hashing product SKUs in an online grocery catalogue where many strings share prefixes such as “MILK”. A healthcare-analytics example is hashing lab-test codes where systematic prefixes like “CBC” and “LFT” make good distribution important.

Use the same base, modulus, and character mapping for every hash you compare. Hashes computed under different parameters are not comparable.

Code Example

9.Multiple Pattern Search

Rabin-Karp can search multiple patterns efficiently when the patterns have the same length. Hash every pattern and store those hashes in a set or dictionary. Then roll one window across the text and check whether the window hash belongs to the pattern-hash set. For different pattern lengths, group patterns by length and run one rolling scan per group.

A familiar example is scanning SMS messages for several bank-alert phrases of equal length. A cybersecurity example is detecting a set of known malicious command fragments in shell history logs from cloud servers.

For k patterns of equal length m, Rabin-Karp can hash all patterns in O(km) and scan the text in O(n), with verification on candidate matches. If pattern lengths differ, group by length.

Code Example

10.Palindrome Queries

Forward-reverse hashing answers palindrome queries by comparing the hash of a substring in the original string with the corresponding hash in the reversed string. If both hashes match, the substring is a palindrome candidate. With double hashing or verification, this becomes a fast technique for repeated palindrome checks.

A familiar example is checking whether usernames contain palindromic segments for a puzzle app. A bioinformatics example is identifying reverse-complement-like patterns in simplified DNA strings after transforming symbols into a comparable representation.

Index conversion is the main palindrome-hash trap. A substring text[l:r] maps to reversed_text[n-r:n-l] in the reversed string.

Code Example

11.Longest Duplicate Substring

Rolling hash can be combined with binary search to find the longest duplicate substring. The idea is to guess a length, hash every substring of that length, and check whether any hash repeats with verification. If a duplicate exists, try a longer length; otherwise, try a shorter length.

A familiar example is finding repeated clauses in a rental agreement draft. A publishing-platform example is locating copied paragraphs across uploaded manuscripts where repeated substrings can indicate reused content.

The tested pattern is binary search on answer plus rolling hash check. The typical expected complexity is O(n log n), assuming O(n) duplicate checking for each guessed length.

Code Example

12.Two-Dimensional Hashing

Two-dimensional rolling hash extends the same idea to matrices. First hash horizontal sequences in each row, then roll vertically over those row-hash values. This lets you search for a smaller grid pattern inside a larger grid, which is useful for images, board games, maps, and structured tables.

A familiar example is finding a small word-search block inside a school puzzle grid. An industry-specific example is detecting a repeated layout fragment in screenshot testing for a fintech mobile app.

Two-dimensional hashing is conceptually just nested rolling hash: one dimension compresses rows, and the second dimension compresses the sequence of row hashes.

Code Example

13.Dynamic String Hashes

Static prefix hashes work when the string does not change. For editable strings, you need a data structure that supports updates and range-hash queries, such as a Fenwick tree or segment tree. Each character contributes its value multiplied by a positional power, and updates adjust only the affected position.

A familiar example is checking a changing search query while a user types in a shopping app. A collaborative-document example is maintaining fast fingerprints of paragraphs as multiple editors modify shared policy documents.

For dynamic strings, mention the trade-off clearly: prefix hashes give O(1) queries but expensive updates; Fenwick or segment trees give O(log n) updates and O(log n) range queries.

Code Example

Rolling hash speeds up candidate detection, not logical equality. For correctness, combine it with consistent parameters, modular arithmetic discipline, and verification or double hashing when collisions matter.

Complexity Summary

Rolling hash is popular because it separates preprocessing from querying. Once prefix hashes and powers are ready, substring hashes are constant-time. Rabin-Karp uses this property to avoid comparing the pattern against every text position character by character.

Common Pitfalls

Most rolling hash bugs are not syntax errors; they are mathematical mismatches. Wrong index ranges, negative modulo results, inconsistent bases, or skipped verification can produce answers that pass small tests but fail hidden cases.

A rolling hash implementation that passes sample tests can still be wrong if it treats modulo subtraction differently across languages. Python normalizes with %, while C and C++ may need explicit addition of mod before taking %.

Learning Path

Build rolling hash in layers. Start with ordinary hashing and modular arithmetic, then move to prefix hashes, Rabin-Karp, collision handling, and advanced applications such as duplicate substring search and dynamic updates.

Frequently Asked Questions

What is rolling hash?

Rolling hash is a hashing technique that updates the hash of a moving substring without recomputing it from scratch. It is used in fast pattern matching, substring equality, duplicate detection, and algorithms such as Rabin-Karp.

What is the difference between string hashing and rolling hash?

String hashing converts a full string or substring into a numeric fingerprint. Rolling hash is a specific kind of string hashing where the fingerprint can be updated efficiently when a window slides by one position.

How does the Rabin-Karp algorithm use rolling hash?

Rabin-Karp hashes the pattern once and compares it with the rolling hash of each text window of the same length. When hashes match, the algorithm verifies the actual substring to avoid errors caused by collisions.

Is rolling hash better than KMP?

Rolling hash is flexible and easy to extend to multiple patterns, substring queries, and duplicate detection. KMP gives deterministic O(n + m) matching without collisions for a single pattern, while Rabin-Karp has expected O(n + m) but can degrade with many collisions.

Can two different strings have the same hash?

Yes. That situation is called a collision. Polynomial hashing reduces the chance of collisions with good parameters, and double hashing reduces it further, but correctness-sensitive code should still verify candidate matches.

Which base and modulus should I choose?

Use a base larger than the alphabet size or a large random-looking base, and use a large prime modulus such as 1,000,000,007 or 1,000,000,009. For safer comparisons, use two moduli and compare the pair of hashes.

Why does Rabin-Karp have worst-case O(nm) time?

If many text windows have the same hash as the pattern, the algorithm must verify many candidate substrings. Each verification can take O(m), so repeated collisions can cause O(nm) worst-case time.

When should I use rolling hash in interviews?

Use rolling hash when the problem asks for repeated substring comparison, duplicate substring search, plagiarism-style matching, palindrome queries, or multiple same-length pattern search. Always mention collision handling because interviewers often test whether you know hashing is probabilistic.

Key Takeaways

Rolling hash turns substring comparison into fast numeric comparison by using polynomial hashing, prefix hashes, and constant-time sliding-window updates. Rabin-Karp applies this idea to pattern matching by hashing the pattern once and rolling across the text. Double hashing, careful base-modulus selection, and direct verification are essential for collision-safe implementations.

For GATE and interviews, focus on the exact substring-hash formula, rolling update formula, expected O(n + m) Rabin-Karp complexity, worst-case O(nm) collision behaviour, and the difference between hash equality and string equality. Problems such as longest duplicate substring and palindrome queries often combine rolling hash with binary search or reverse hashing.

The natural next step is to practise classic string algorithms side by side: implement naive search, Rabin-Karp, KMP, and prefix-hash substring queries on the same test cases, then compare correctness, complexity, and collision behaviour.

Menu

Rolling Hash: String Hashing, Rabin-Karp & Fast Pattern Matching

Core Concepts

1.Character Encoding

Code Example

2.Polynomial Hashing

Code Example

3.Prefix Hashes

Code Example

4.Rolling Window Hash

Code Example

5.Rabin-Karp Algorithm

Code Example

6.Collision Handling

Code Example

7.Double Hashing

Code Example

8.Base And Modulus

Code Example

9.Multiple Pattern Search

Code Example

10.Palindrome Queries

Code Example

11.Longest Duplicate Substring

Code Example

12.Two-Dimensional Hashing

Code Example

13.Dynamic String Hashes

Code Example

Complexity Summary

Common Pitfalls

Learning Path

Frequently Asked Questions

What is rolling hash?

What is the difference between string hashing and rolling hash?

How does the Rabin-Karp algorithm use rolling hash?

Is rolling hash better than KMP?

Can two different strings have the same hash?

Which base and modulus should I choose?

Why does Rabin-Karp have worst-case O(nm) time?

When should I use rolling hash in interviews?

Key Takeaways

Share this