Boyer Moore Algorithm | Good Suffix heuristic
We have already discussed
variation of Boyer Moore algorithm. In this article we will discuss
Good Suffix
heuristic for pattern searching. Just like bad character heuristic, a preprocessing table is generated for good suffix heuristic.
Good Suffix Heuristic
Let
t
be substring of text
T
which is matched with substring of pattern
P
. Now we shift pattern until : 1) Another occurrence of t in P matched with t in T. 2) A prefix of P, which matches with suffix of t 3) P moves past t
Case 1: Another occurrence of t in P matched with t in T
Pattern P might contain few more occurrences of
t
. In such case, we will try to shift the pattern to align that occurrence with t in text T. For example-
Explanation:
In the above example, we have got a substring t of text T matched with pattern P (in green) before mismatch at index 2. Now we will search for occurrence of t (“AB”) in P. We have found an occurrence starting at position 1 (in yellow background) so we will right shift the pattern 2 times to align t in P with t in T. This is weak rule of original Boyer Moore and not much effective, we will discuss a
Strong Good Suffix rule
shortly.
Case 2: A prefix of P, which matches with suffix of t in T
It is not always likely that we will find the occurrence of t in P. Sometimes there is no occurrence at all, in such cases sometimes we can search for some
suffix of t
matching with some
prefix of P
and try to align them by shifting P. For example –
Explanation:
In above example, we have got t (“BAB”) matched with P (in green) at index 2-4 before mismatch . But because there exists no occurrence of t in P we will search for some prefix of P which matches with some suffix of t. We have found prefix “AB” (in the yellow background) starting at index 0 which matches not with whole t but the suffix of t “AB” starting at index 3. So now we will shift pattern 3 times to align prefix with the suffix.
Case 3: P moves past t
If the above two cases are not satisfied, we will shift the pattern past the t. For example –
Explanation:
If above example, there exist no occurrence of t (“AB”) in P and also there is no prefix in P which matches with the suffix of t. So, in that case, we can never find any perfect match before index 4, so we will shift the P past the t ie. to index 5.
Strong Good suffix Heuristic
Suppose substring
q = P[i to n]
got matched with
t
in T and
c = P[i-1]
is the mismatching character. Now unlike case 1 we will search for t in P which is not preceded by character
c
. The closest such occurrence is then aligned with t in T by shifting pattern P. For example –
Explanation:
In above example,
q = P[7 to 8]
got matched with t in T. The mismatching character
c
is “C” at position P[6]. Now if we start searching t in P we will get the first occurrence of t starting at position 4. But this occurrence is preceded by “C” which is equal to c, so we will skip this and carry on searching. At position 1 we got another occurrence of t (in the yellow background). This occurrence is preceded by “A” (in blue) which is not equivalent to c. So we will shift pattern P 6 times to align this occurrence with t in T.We are doing this because we already know that character
c = “C”
causes the mismatch. So any occurrence of t preceded by c will again cause mismatch when aligned with t, so that’s why it is better to skip this.
Preprocessing for Good suffix heuristic
As a part of preprocessing, an array
shift
is created. Each entry
shift[i]
contain the distance pattern will shift if mismatch occur at position
i-1
. That is, the suffix of pattern starting at position
i
is matched and a mismatch occur at position
i-1
. Preprocessing is done separately for strong good suffix and case 2 discussed above.
1) Preprocessing for Strong Good Suffix
Before discussing preprocessing, let us first discuss the idea of border. A
border
is a substring which is both proper suffix and proper prefix. For example, in string
“ccacc”
,
“c”
is a border,
“cc”
is a border because it appears in both end of string but
“cca”
is not a border. As a part of preprocessing an array
bpos
(border position) is calculated. Each entry
bpos[i]
contains the starting index of border for suffix starting at index i in given pattern P. The suffix
?
beginning at position m has no border, so
bpos[m]
is set to
m+1
where
m
is the length of the pattern. The shift position is obtained by the borders which cannot be extended to the left. Following is the code for preprocessing –
void preprocess_strong_suffix(int *shift, int *bpos,
char *pat, int m)
{
int i = m, j = m+1;
bpos[i] = j;
while(i > 0)
{
while(j <= m && pat[i-1] != pat[j-1])
{
if (shift[j] == 0)
shift[j] = j-i;
j = bpos[j];
}
i--; j--;
bpos[i] = j;
}
}
Explanation:
Consider pattern
P = “ABBABAB”, m = 7
.
- The widest border of suffix “AB” beginning at position i = 5 is ?(nothing) starting at position 7 so bpos[5] = 7.
- At position i = 2 the suffix is “BABAB”. The widest border for this suffix is “BAB” starting at position 4, so j = bpos[2] = 4.
- We can understand
- bpos[i] = j
- using following example –
- If character
- #
- Which is at position
- i-1
- is equivalent to character
- ?
- at position
- j-1
- , we know that border will be
- ? + border of suffix at position i
- which is starting at position
- j
- which is equivalent to saying that
- border of suffix at i-1 begin at j-1
- or
- bpos[ i-1 ] = j-1
- or in the code –
-
i--;
j--;
bpos[ i ] = j - But if character
- #
- at position
- i-1
- do not match with character
- ?
- at position
- j-1
- then we continue our search to the right. Now we know that –
-
- Border width will be smaller than the border starting at position j ie. smaller than x…?
- Border has to begin with # and end with ? or could be empty (no border exist).
- With above two facts we will continue our search in sub string
- x…?
- from position
- j to m
- . The next border should be at
- j = bpos[j]
- . After updating
- j
- , we again compare character at position
- j-1 (?)
- with # and if they are equal then we got our border otherwise we continue our search to right
- until j>m
- . This process is shown by code –
-
while(j <= m && pat[i-1] != pat[j-1])
{
j = bpos[j];
}
i--; j--;
bpos[i]=j; - In above code look at these conditions –
-
pat[i-1] != pat[j-1]
- This is the condition which we discussed in case 2. When the character preceding the occurrence of t in pattern P is different than mismatching character in P, we stop skipping the occurrences and shift the pattern. So here
- P[i] == P[j]
- but
- P[i-1] != p[j-1]
- so we shift pattern from
- i to j
- . So
- shift[j] = j-i
- is recorder for
- j
- . So whenever any mismatch occur at position
- j
- we will shift the pattern
- shift[j+1]
- positions to the right. In above code the following condition is very important –
-
if (shift[j] == 0 )
- This condition prevent modification of
- shift[j]
- value from suffix having same border. For example, Consider pattern
- P = “addbddcdd”
- , when we calculate bpos[ i-1 ] for i = 4 then j = 7 in this case. we will be eventually setting value of shift[ 7 ] = 3. Now if we calculate bpos[ i-1 ] for i = 1 then j = 7 and we will be setting value shift[ 7 ] = 6 again if there is no test shift[ j ] == 0. This mean if we have a mismatch at position 6 we will shift pattern P 3 positions to right not 6 position.
- 2) Preprocessing for Case 2
- In the preprocessing for case 2, for each suffix the
- widest border of the whole pattern
- that is contained in that suffix is determined. The starting position of the widest border of the pattern at all is stored in
- bpos[0]
- In the following preprocessing algorithm, this value bpos[0] is stored initially in all free entries of array shift. But when the suffix of the pattern becomes shorter than bpos[0], the algorithm continues with the next-wider border of the pattern, i.e. with bpos[j]. Following is the implementation of the search algorithm –
-
C++
/* C program for Boyer Moore Algorithm with
Good Suffix heuristic to find pattern in
given text string */
#include <stdio.h>
#include <string.h>
// preprocessing for strong good suffix rule
void
preprocess_strong_suffix(
int
*shift,
int
*bpos,
char
*pat,
int
m)
{
// m is the length of pattern
int
i=m, j=m+1;
bpos[i]=j;
while
(i>0)
{
/*if character at position i-1 is not equivalent to
character at j-1, then continue searching to right
of the pattern for border */
while
(j<=m && pat[i-1] != pat[j-1])
{
/* the character preceding the occurrence of t in
pattern P is different than the mismatching character in P,
we stop skipping the occurrences and shift the pattern
from i to j */
if
(shift[j]==0)
shift[j] = j-i;
//Update the position of next border
j = bpos[j];
}
/* p[i-1] matched with p[j-1], border is found.
store the beginning position of border */
i--;j--;
bpos[i] = j;
}
}
//Preprocessing for case 2
void
preprocess_case2(
int
*shift,
int
*bpos,
char
*pat,
int
m)
{
int
i, j;
j = bpos[0];
for
(i=0; i<=m; i++)
{
/* set the border position of the first character of the pattern
to all indices in array shift having shift[i] = 0 */
if
(shift[i]==0)
shift[i] = j;
/* suffix becomes shorter than bpos[0], use the position of
next widest border as value of j */
if
(i==j)
j = bpos[j];
}
}
/*Search for a pattern in given text using
Boyer Moore algorithm with Good suffix rule */
void
search(
char
*text,
char
*pat)
{
// s is shift of the pattern with respect to text
int
s=0, j;
int
m =
strlen
(pat);
int
n =
strlen
(text);
int
bpos[m+1], shift[m+1];
//initialize all occurrence of shift to 0
for
(
int
i=0;i<m+1;i++) shift[i]=0;
//do preprocessing
preprocess_strong_suffix(shift, bpos, pat, m);
preprocess_case2(shift, bpos, pat, m);
while
(s <= n-m)
{
j = m-1;
/* Keep reducing index j of pattern while characters of
pattern and text are matching at this shift s*/
while
(j >= 0 && pat[j] == text[s+j])
j--;
/* If the pattern is present at the current shift, then index j
will become -1 after the above loop */
if
(j<0)
{
printf
(
"pattern occurs at shift = %d\n"
, s);
s += shift[0];
}
else
/*pat[i] != pat[s+j] so shift the pattern
shift[j+1] times */
s += shift[j+1];
}
}
//Driver
int
main()
{
char
text[] =
"ABAAAABAACD"
;
char
pat[] =
"ABA"
;
search(text, pat);
return
0;
}
Java
/* Java program for Boyer Moore Algorithm with
Good Suffix heuristic to find pattern in
given text string */
class
GFG
{
// preprocessing for strong good suffix rule
static
void
preprocess_strong_suffix(
int
[]shift,
int
[]bpos,
char
[]pat,
int
m)
{
// m is the length of pattern
int
i = m, j = m +
1
;
bpos[i] = j;
while
(i >
0
)
{
/*if character at position i-1 is not
equivalent to character at j-1, then
continue searching to right of the
pattern for border */
while
(j <= m && pat[i -
1
] != pat[j -
1
])
{
/* the character preceding the occurrence of t
in pattern P is different than the mismatching
character in P, we stop skipping the occurrences
and shift the pattern from i to j */
if
(shift[j] ==
0
)
shift[j] = j - i;
//Update the position of next border
j = bpos[j];
}
/* p[i-1] matched with p[j-1], border is found.
store the beginning position of border */
i--; j--;
bpos[i] = j;
}
}
//Preprocessing for case 2
static
void
preprocess_case2(
int
[]shift,
int
[]bpos,
char
[]pat,
int
m)
{
int
i, j;
j = bpos[
0
];
for
(i =
0
; i <= m; i++)
{
/* set the border position of the first character
of the pattern to all indices in array shift
having shift[i] = 0 */
if
(shift[i] ==
0
)
shift[i] = j;
/* suffix becomes shorter than bpos[0],
use the position of next widest border
as value of j */
if
(i == j)
j = bpos[j];
}
}
/*Search for a pattern in given text using
Boyer Moore algorithm with Good suffix rule */
static
void
search(
char
[]text,
char
[]pat)
{
// s is shift of the pattern
// with respect to text
int
s =
0
, j;
int
m = pat.length;
int
n = text.length;
int
[]bpos =
new
int
[m +
1
];
int
[]shift =
new
int
[m +
1
];
//initialize all occurrence of shift to 0
for
(
int
i =
0
; i < m +
1
; i++)
shift[i] =
0
;
//do preprocessing
preprocess_strong_suffix(shift, bpos, pat, m);
preprocess_case2(shift, bpos, pat, m);
while
(s <= n - m)
{
j = m -
1
;
/* Keep reducing index j of pattern while
characters of pattern and text are matching
at this shift s*/
while
(j >=
0
&& pat[j] == text[s+j])
j--;
/* If the pattern is present at the current shift,
then index j will become -1 after the above loop */
if
(j <
0
)
{
System.out.printf(
"pattern occurs at shift = %d\n"
, s);
s += shift[
0
];
}
else
/*pat[i] != pat[s+j] so shift the pattern
shift[j+1] times */
s += shift[j +
1
];
}
}
// Driver Code
public
static
void
main(String[] args)
{
char
[]text =
"ABAAAABAACD"
.toCharArray();
char
[]pat =
"ABA"
.toCharArray();
search(text, pat);
}
}
// This code is contributed by 29AjayKumar
Python3
# Python3 program for Boyer Moore Algorithm with
# Good Suffix heuristic to find pattern in
# given text string
# preprocessing for strong good suffix rule
def
preprocess_strong_suffix(shift, bpos, pat, m):
# m is the length of pattern
i
=
m
j
=
m
+
1
bpos[i]
=
j
while
i >
0
:
'''if character at position i-1 is
not equivalent to character at j-1,
then continue searching to right
of the pattern for border '''
while
j <
=
m
and
pat[i
-
1
] !
=
pat[j
-
1
]:
''' the character preceding the occurrence
of t in pattern P is different than the
mismatching character in P, we stop skipping
the occurrences and shift the pattern
from i to j '''
if
shift[j]
=
=
0
:
shift[j]
=
j
-
i
# Update the position of next border
j
=
bpos[j]
''' p[i-1] matched with p[j-1], border is found.
store the beginning position of border '''
i
-
=
1
j
-
=
1
bpos[i]
=
j
# Preprocessing for case 2
def
preprocess_case2(shift, bpos, pat, m):
j
=
bpos[
0
]
for
i
in
range
(m
+
1
):
''' set the border position of the first character
of the pattern to all indices in array shift
having shift[i] = 0 '''
if
shift[i]
=
=
0
:
shift[i]
=
j
''' suffix becomes shorter than bpos[0],
use the position of next widest border
as value of j '''
if
i
=
=
j:
j
=
bpos[j]
'''Search for a pattern in given text using
Boyer Moore algorithm with Good suffix rule '''
def
search(text, pat):
# s is shift of the pattern with respect to text
s
=
0
m
=
len
(pat)
n
=
len
(text)
bpos
=
[
0
]
*
(m
+
1
)
# initialize all occurrence of shift to 0
shift
=
[
0
]
*
(m
+
1
)
# do preprocessing
preprocess_strong_suffix(shift, bpos, pat, m)
preprocess_case2(shift, bpos, pat, m)
while
s <
=
n
-
m:
j
=
m
-
1
''' Keep reducing index j of pattern while characters of
pattern and text are matching at this shift s'''
while
j >
=
0
and
pat[j]
=
=
text[s
+
j]:
j
-
=
1
''' If the pattern is present at the current shift,
then index j will become -1 after the above loop '''
if
j <
0
:
print
(
"pattern occurs at shift = %d"
%
s)
s
+
=
shift[
0
]
else
:
'''pat[i] != pat[s+j] so shift the pattern
shift[j+1] times '''
s
+
=
shift[j
+
1
]
# Driver Code
if
__name__
=
=
"__main__"
:
text
=
"ABAAAABAACD"
pat
=
"ABA"
search(text, pat)
# This code is contributed by
# sanjeev2552
C#
/* C# program for Boyer Moore Algorithm with
Good Suffix heuristic to find pattern in
given text string */
using
System;
class
GFG
{
// preprocessing for strong good suffix rule
static
void
preprocess_strong_suffix(
int
[]shift,
int
[]bpos,
char
[]pat,
int
m)
{
// m is the length of pattern
int
i = m, j = m + 1;
bpos[i] = j;
while
(i > 0)
{
/*if character at position i-1 is not
equivalent to character at j-1, then
continue searching to right of the
pattern for border */
while
(j <= m && pat[i - 1] != pat[j - 1])
{
/* the character preceding the occurrence of t
in pattern P is different than the mismatching
character in P, we stop skipping the occurrences
and shift the pattern from i to j */
if
(shift[j] == 0)
shift[j] = j - i;
//Update the position of next border
j = bpos[j];
}
/* p[i-1] matched with p[j-1], border is found.
store the beginning position of border */
i--; j--;
bpos[i] = j;
}
}
//Preprocessing for case 2
static
void
preprocess_case2(
int
[]shift,
int
[]bpos,
char
[]pat,
int
m)
{
int
i, j;
j = bpos[0];
for
(i = 0; i <= m; i++)
{
/* set the border position of the first character
of the pattern to all indices in array shift
having shift[i] = 0 */
if
(shift[i] == 0)
shift[i] = j;
/* suffix becomes shorter than bpos[0],
use the position of next widest border
as value of j */
if
(i == j)
j = bpos[j];
}
}
/*Search for a pattern in given text using
Boyer Moore algorithm with Good suffix rule */
static
void
search(
char
[]text,
char
[]pat)
{
// s is shift of the pattern
// with respect to text
int
s = 0, j;
int
m = pat.Length;
int
n = text.Length;
int
[]bpos =
new
int
[m + 1];
int
[]shift =
new
int
[m + 1];
// initialize all occurrence of shift to 0
for
(
int
i = 0; i < m + 1; i++)
shift[i] = 0;
// do preprocessing
preprocess_strong_suffix(shift, bpos, pat, m);
preprocess_case2(shift, bpos, pat, m);
while
(s <= n - m)
{
j = m - 1;
/* Keep reducing index j of pattern while
characters of pattern and text are matching
at this shift s*/
while
(j >= 0 && pat[j] == text[s + j])
j--;
/* If the pattern is present at the current shift,
then index j will become -1 after the above loop */
if
(j < 0)
{
Console.Write(
"pattern occurs at shift = {0}\n"
, s);
s += shift[0];
}
else
/*pat[i] != pat[s+j] so shift the pattern
shift[j+1] times */
s += shift[j + 1];
}
}
// Driver Code
public
static
void
Main(String[] args)
{
char
[]text =
"ABAAAABAACD"
.ToCharArray();
char
[]pat =
"ABA"
.ToCharArray();
search(text, pat);
}
}
// This code is contributed by PrinciRaj1992
Javascript
// Preprocessing for strong good suffix rule
function
preprocessStrongSuffix(shift, bpos, pat, m) {
let i = m;
let j = m + 1;
bpos[i] = j;
while
(i > 0) {
// If character at position i-1 is not equivalent to character at j-1,
// then continue searching to the right of the pattern for a border.
while
(j <= m && pat[i - 1] !== pat[j - 1]) {
// The character preceding the occurrence of t in pattern P is different
// than the mismatching character in P, so stop skipping the occurrences
// and shift the pattern from i to j.
if
(shift[j] === 0) {
shift[j] = j - i;
}
// Update the position of the next border.
j = bpos[j];
}
// pat[i-1] matched with pat[j-1], a border is found.
// Store the beginning position of the border.
i--;
j--;
bpos[i] = j;
}
}
// Preprocessing for case 2
function
preprocessCase2(shift, bpos, pat, m) {
let j = bpos[0];
for
(let i = 0; i <= m; i++) {
// Set the border position of the first character of the pattern
// to all indices in the shift array having shift[i] = 0.
if
(shift[i] === 0) {
shift[i] = j;
}
// If the suffix becomes shorter than bpos[0], use the position of
// the next widest border as the value of j.
if
(i === j) {
j = bpos[j];
}
}
}
// Search for a pattern in given text using Boyer-Moore algorithm with Good Suffix rule
function
search(text, pat) {
let s = 0;
let j;
const m = pat.length;
const n = text.length;
const bpos =
new
Array(m + 1);
const shift =
new
Array(m + 1).fill(0);
// Initialize all occurrences of shift to 0
for
(let i = 0; i < m + 1; i++) {
shift[i] = 0;
}
// Perform preprocessing
preprocessStrongSuffix(shift, bpos, pat, m);
preprocessCase2(shift, bpos, pat, m);
while
(s <= n - m) {
j = m - 1;
// Keep reducing index j of pattern while characters of
// pattern and text are matching at this shift s
while
(j >= 0 && pat[j] === text[s + j]) {
j--;
}
// If the pattern is present at the current shift, then index j
// will become -1 after the above loop
if
(j < 0) {
console.log(`Pattern occurs at shift = ${s}`);
s += shift[0];
}
else
{
// pat[i] != pat[s+j] so shift the pattern shift[j+1] times
s += shift[j + 1];
}
}
}
// Driver
function
main() {
const text =
"ABAAAABAACD"
;
const pat =
"ABA"
;
search(text, pat);
}
main();
- Output:
-
pattern occurs at shift = 0
pattern occurs at shift = 5 - References
-
- http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
- < If you like w3wiki and would like to contribute, you can also write an article using
- write.w3wiki.net
- or mail your article to review-team@w3wiki.net. See your article appearing on the w3wiki main page and help other Beginner.
Contact Us