Rolling Hash: String Hashing, Rabin-Karp & Fast Pattern Matching
Rolling hash is a string hashing technique that updates a substring hash in constant time as the window moves. It matters because search systems, plagiarism detectors, log scanners, and DNA sequence tools often need fast pattern matching over millions of characters. After reading, you can implement it safely and explain its trade-offs.
String algorithms usually compare characters one by one, but hashing converts substrings into compact numeric fingerprints. Rolling hash builds on hashing and powers the Rabin-Karp algorithm, substring equality checks, duplicate detection, palindrome queries, and two-dimensional pattern matching in grids or images.
You will be able to choose bases and moduli, compute prefix hashes, run Rabin-Karp, handle collisions, compare substrings in O(1), and answer common GATE-style and interview questions with clear complexity analysis.
Core Concepts
Rolling hash is not one isolated trick. It combines string encoding, polynomial hashing, prefix precomputation, modular arithmetic, sliding window updates, and collision-aware verification. The table below covers the standard variants and use cases you should know before implementing a production-quality or interview-ready solution.
1.Character Encoding
String hashing begins by converting characters into integers. The hash function does not understand letters directly; it works on numbers. For lowercase English letters, a common mapping is a = 1, b = 2, and so on. For general Unicode text, using the code point with ord() is safer, though it may require normalization when visually identical text can have different byte representations.
A familiar example is searching for a customer name inside a CRM export: the word “Riya” must first become numbers before hashing can begin. An industry-specific example is scanning hospital discharge summaries where tokens such as “HbA1c” and “insulin” should be encoded consistently before medical record deduplication.
Code Example
2.Polynomial Hashing
Polynomial hashing treats a string like a number written in a custom base. For a string such as “abc”, one common form is 1 × p² + 2 × p¹ + 3 × p⁰, computed under a large modulus to keep numbers small. The base spreads character positions apart, while the modulus prevents integer overflow and keeps operations efficient.
A familiar example is checking whether two saved coupon codes in an e-commerce app are likely identical before doing a full string comparison. A more industry-specific example is matching invoice reference strings in banking reconciliation, where millions of transaction remarks may need quick equality checks.
Code Example
3.Prefix Hashes
Prefix hashes precompute the hash of every prefix of a string, allowing any substring hash to be extracted in O(1). If H[i] stores the hash of the first i characters, the substring from l to r can be obtained by subtracting the earlier prefix after aligning powers of the base.
A familiar example is checking repeated phrases inside a long WhatsApp export without comparing every character each time. A SaaS-specific example is comparing route patterns in API gateway logs, where substrings of request paths such as “/v1/payments” must be matched quickly.
Code Example
4.Rolling Window Hash
A rolling window hash updates the hash of a fixed-length substring when the window shifts by one character. Instead of recomputing the hash for every window in O(m), it removes the outgoing character, shifts the remaining contribution, and adds the incoming character. This is the core performance improvement behind fast pattern matching.
A familiar example is detecting whether a short OTP template appears inside SMS text while scanning inbox messages. An observability example is searching a repeated error signature inside large server logs streamed from Kubernetes pods.
Code Example
5.Rabin-Karp Algorithm
The Rabin-Karp algorithm uses rolling hash to search a pattern of length m inside a text of length n. It hashes the pattern once, then rolls a window of size m across the text. When the hash matches, it verifies the actual substring to avoid false positives caused by collisions.
A familiar example is finding a train number pattern inside an IRCTC notification message. A content-platform example is scanning uploaded article text for a copyrighted paragraph snippet before publication.
Code Example
6.Collision Handling
A collision occurs when two different strings produce the same hash. Hashing is fast because it compresses a potentially long string into a fixed-size value, but that compression means collisions are mathematically possible. Correct pattern matching code must either verify candidate substrings or use a collision-reduction strategy such as double hashing.
A familiar example is two different food item names in a Zomato menu producing the same fingerprint by chance. A banking-security example is a fraud scanner comparing transaction remarks; a hash match should trigger verification, not automatic classification as identical text.
Code Example
7.Double Hashing
Double hashing computes two independent hash values, usually with different large prime moduli. A substring is considered a candidate match only when both hashes match. This does not make collisions impossible, but it makes accidental collisions far less likely than a single hash in typical programming-contest and interview settings.
A familiar example is comparing PAN-like identifiers in a cleaned dataset where one hash may be too risky for deduplication. An ed-tech example is detecting duplicate answer snippets across thousands of submissions, where false matches can unfairly flag students unless the system verifies or uses stronger hashing.
Code Example
8.Base And Modulus
The base and modulus control both distribution and performance. A small base can create more structure in the hash values, while a poor modulus can increase collisions. Common practice is to use a large random-looking base and one or two large prime moduli. In C++, unsigned 64-bit overflow is also common because arithmetic naturally wraps around modulo 2^64.
A familiar example is hashing product SKUs in an online grocery catalogue where many strings share prefixes such as “MILK”. A healthcare-analytics example is hashing lab-test codes where systematic prefixes like “CBC” and “LFT” make good distribution important.
Code Example
9.Multiple Pattern Search
Rabin-Karp can search multiple patterns efficiently when the patterns have the same length. Hash every pattern and store those hashes in a set or dictionary. Then roll one window across the text and check whether the window hash belongs to the pattern-hash set. For different pattern lengths, group patterns by length and run one rolling scan per group.
A familiar example is scanning SMS messages for several bank-alert phrases of equal length. A cybersecurity example is detecting a set of known malicious command fragments in shell history logs from cloud servers.
Code Example
10.Palindrome Queries
Forward-reverse hashing answers palindrome queries by comparing the hash of a substring in the original string with the corresponding hash in the reversed string. If both hashes match, the substring is a palindrome candidate. With double hashing or verification, this becomes a fast technique for repeated palindrome checks.
A familiar example is checking whether usernames contain palindromic segments for a puzzle app. A bioinformatics example is identifying reverse-complement-like patterns in simplified DNA strings after transforming symbols into a comparable representation.
Code Example
11.Longest Duplicate Substring
Rolling hash can be combined with binary search to find the longest duplicate substring. The idea is to guess a length, hash every substring of that length, and check whether any hash repeats with verification. If a duplicate exists, try a longer length; otherwise, try a shorter length.
A familiar example is finding repeated clauses in a rental agreement draft. A publishing-platform example is locating copied paragraphs across uploaded manuscripts where repeated substrings can indicate reused content.
Code Example
12.Two-Dimensional Hashing
Two-dimensional rolling hash extends the same idea to matrices. First hash horizontal sequences in each row, then roll vertically over those row-hash values. This lets you search for a smaller grid pattern inside a larger grid, which is useful for images, board games, maps, and structured tables.
A familiar example is finding a small word-search block inside a school puzzle grid. An industry-specific example is detecting a repeated layout fragment in screenshot testing for a fintech mobile app.
Code Example
13.Dynamic String Hashes
Static prefix hashes work when the string does not change. For editable strings, you need a data structure that supports updates and range-hash queries, such as a Fenwick tree or segment tree. Each character contributes its value multiplied by a positional power, and updates adjust only the affected position.
A familiar example is checking a changing search query while a user types in a shopping app. A collaborative-document example is maintaining fast fingerprints of paragraphs as multiple editors modify shared policy documents.
Code Example
Complexity Summary
Rolling hash is popular because it separates preprocessing from querying. Once prefix hashes and powers are ready, substring hashes are constant-time. Rabin-Karp uses this property to avoid comparing the pattern against every text position character by character.
Common Pitfalls
Most rolling hash bugs are not syntax errors; they are mathematical mismatches. Wrong index ranges, negative modulo results, inconsistent bases, or skipped verification can produce answers that pass small tests but fail hidden cases.
Learning Path
Build rolling hash in layers. Start with ordinary hashing and modular arithmetic, then move to prefix hashes, Rabin-Karp, collision handling, and advanced applications such as duplicate substring search and dynamic updates.
Frequently Asked Questions
What is rolling hash?
Rolling hash is a hashing technique that updates the hash of a moving substring without recomputing it from scratch. It is used in fast pattern matching, substring equality, duplicate detection, and algorithms such as Rabin-Karp.
What is the difference between string hashing and rolling hash?
String hashing converts a full string or substring into a numeric fingerprint. Rolling hash is a specific kind of string hashing where the fingerprint can be updated efficiently when a window slides by one position.
How does the Rabin-Karp algorithm use rolling hash?
Rabin-Karp hashes the pattern once and compares it with the rolling hash of each text window of the same length. When hashes match, the algorithm verifies the actual substring to avoid errors caused by collisions.
Is rolling hash better than KMP?
Rolling hash is flexible and easy to extend to multiple patterns, substring queries, and duplicate detection. KMP gives deterministic O(n + m) matching without collisions for a single pattern, while Rabin-Karp has expected O(n + m) but can degrade with many collisions.
Can two different strings have the same hash?
Yes. That situation is called a collision. Polynomial hashing reduces the chance of collisions with good parameters, and double hashing reduces it further, but correctness-sensitive code should still verify candidate matches.
Which base and modulus should I choose?
Use a base larger than the alphabet size or a large random-looking base, and use a large prime modulus such as 1,000,000,007 or 1,000,000,009. For safer comparisons, use two moduli and compare the pair of hashes.
Why does Rabin-Karp have worst-case O(nm) time?
If many text windows have the same hash as the pattern, the algorithm must verify many candidate substrings. Each verification can take O(m), so repeated collisions can cause O(nm) worst-case time.
When should I use rolling hash in interviews?
Use rolling hash when the problem asks for repeated substring comparison, duplicate substring search, plagiarism-style matching, palindrome queries, or multiple same-length pattern search. Always mention collision handling because interviewers often test whether you know hashing is probabilistic.
Key Takeaways
Rolling hash turns substring comparison into fast numeric comparison by using polynomial hashing, prefix hashes, and constant-time sliding-window updates. Rabin-Karp applies this idea to pattern matching by hashing the pattern once and rolling across the text. Double hashing, careful base-modulus selection, and direct verification are essential for collision-safe implementations.
For GATE and interviews, focus on the exact substring-hash formula, rolling update formula, expected O(n + m) Rabin-Karp complexity, worst-case O(nm) collision behaviour, and the difference between hash equality and string equality. Problems such as longest duplicate substring and palindrome queries often combine rolling hash with binary search or reverse hashing.
The natural next step is to practise classic string algorithms side by side: implement naive search, Rabin-Karp, KMP, and prefix-hash substring queries on the same test cases, then compare correctness, complexity, and collision behaviour.