jueves, 21 de febrero de 2013

Homework 2

To this entry undertook analyze algorithms "String matching" seen in class, these are the algorimo Boyer-Moore and Knuth-Morris-Pratt.

Boyer-Moore
The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are m - n + 1), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible.

The algorithm begins at alignment k = n, so the start of P is aligned with the start of T. Characters in P and T are then compared starting at index n in P and k in T, moving downward: the strings are matched from the end and toward the beginning of P. The comparisons continue until either a mismatch occurs or the beginning of P is reached (which means there is a match), after which the alignment is shifted to the right according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T.

The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.


Knuth-Morris-Pratt.

A string matching algorithm wants to find the starting index m in string S[] that matches the search word W[].

A straightforward algorithm simply tries successive values of m until it finds a match or fails. Each trial involves using a loop that checks S[m+i] = W[i] for each character W[i] in the search word.

Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26^2 (1 in 676). So if the characters are random, then the expected complexity of searching string S[] of length k is on the order of k comparisons or O(k). The expected performance is very good. If S[] is 1 billion characters and W[] is 1000 characters, then the string search should complete after about 1 billion character comparisons (which might take a few seconds).


here the code:

import random
import sys
import time
def Chain(size):
word = ''
for i in range(size):
letter = chr(random.randint(97,101))
word+=letter
return word
def String(size):
pattern =''
for i in range(size):
letter = chr(random.randint(97,101))
pattern += letter
return pattern
def boyerMoore(text, pattern):
Txt = len(text)
Pat = len(pattern)
fail = failure(text,pattern,Txt,Pat)
fix = suff(pattern,Pat)
matches = []
j = 0
cont = 0
while (j <= Txt - Pat):
i = Pat-1
while True:
cont +=1
if i>=0 and pattern[i] == text[i+j]:
i -= 1
else:
break
if i<0:
matches.append(j)
j+= fix[0]
else:
j += max(fix[i], fail[text[i+j]]-Pat+1+i)
return cont
def failure(text,pattern,t,p):
'''Dictionary that tells us
'''
wrongchar = {}
for i in range(t-1):
if text[i] not in pattern:
wrongchar[text[i]] = p
for j in range(p-1):
wrongchar[pattern[j]] = p-j-1
return wrongchar
def suffixes(pattern, x):
a = x - 1
b = a
suff = []
for i in range(x):
suff.append(0)
suff[x-1] = x
for i in range(x-2,-1,-1):
if i > a and suff[i+x-1-b] < i-a:
suff[i]= suff[i+x-1-b]
else:
if i < a:
a = i
b = i
while (a>=0 and pattern[a] == pattern[a+x-1-b]):
a-=1
suff[i] = b - a
return suff
def suff(pattern,x):
Suff = []
s = suffixes(pattern,x)
for i in range(x):
Suff.append(x)
for i in range(x-2,-1,-1):
if (s[i] == i+1):
for j in range(x-2):
if Suff[j] == x:
Suff[j] = x-1-i
for i in range(x-1):
Suff[x-1-s[i]] = x-1-i
return Suff
def main():
text = 'abbaaabbbabbabbabaaabaabbaabbbabba'
pattern = 'abba'
initialTime = time.time()
cont = boyerMoore(text,pattern)
endTime = time.time() - initialTime
print cont, ' ', endTime
if __name__ == "__main__":
main()
view raw Boyer.py hosted with ❤ by GitHub
here a image of the result of the code
makes 44 comparasions, and just take a few time

Now the code of the bruteforce

#!/usr/bin/python
import time
def bruteforce(t,p):
i = 0
while i <= len(t)-len(p):
if t[i:i+len(p)] == p:
return i
else:
i += 1
t = "abbabababbabbbbaabbabbbabaabababbababaababaaababababba"
p = "abba"
initialTime = time()
print bruteforce(t, p)
print "Finish %fs"%(time()-initialTime)
view raw bruteforce.py hosted with ❤ by GitHub
I have an error but this algorithm is more fast that Boyer

Bibliografias:
http://cs.indstate.edu/~kmandumula/presentation.pdf
http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

1 comentario:

  1. The second algorithm is missing. When the experiment shows that a naive algorithm outperforms a fancy one, you could conclude that your experimental design is bad and you should feel bad. Code: 3 pts, report 2 pts.

    ResponderEliminar