python - Text segmentation: Algorithm to match input with the longest words from the dictionary -
i need split string words such each word comes dictionary. make sure longest possible word left chosen. hence
thisisinsane => insane (correct longest possible word left) thisisinsane => in sane(wrong) assuming 'this', 'is', 'in', 'insane' words in dictionary.
i managed solved problem traversing end of string beginning matching longest word possible. problem started cropping problems these..
shareasale => share ale(wrong 'ale' not in dictionary) shareasale => share sale(correct) assuming 'share', 'a', 'sale' words in dictionary unlike 'ale'.
i tried solve problem removing valid segments found before encountering error i.e.
shareasale => 'share' , 'as' (error = 'ale')
and removing them once dictionary , solve problem. so
shareasale => no valid segments when share removed shareasale => share sale (when 'as' removed dictionary.
thus managed solve problem too. unable solve
asignas => 'as' ( error = 'ignas')
my solution remove 'as' dictionary , try solve it
asignas => 'a' 'sign' (error = 'as')
because in new recursive call 'as' has been removed dictionary. function wrote in link. hope can go through , me find better algorithm solve else suggest modification existing algorithm.
essentially problem tree problem, @ every level words form prefix of tree form branches. branch leaves no part of string correct solution.
thisisinsane | | (this)isinsane / \ / \ (this,i)sinsane (this,is)insane / / \ / / \ (this,i,sin)ane (this,is,in)sane (this,is,insane) / / (this,is,in,sane)
so in example there 2 solutions, want select solution using longest words, want explore tree right using depth-first-search strategy.
so our algorithm should:
- sort dictionary descending length.
- find prefixes of current string. if there none, return
false
. - set
prefix
longest unexplored prefix. - remove string. if string empty, found solution, return list of prefixes.
- recurse 2.
- this branch has failed, return
false
.
a sample implementation of solution:
def segment(string,wset): """segments string words prefering longer words givens dictionary wset.""" # sort wset in decreasing string order wset.sort(key=len, reverse=true) result = tokenize(string, wset, "") if result: result.pop() # remove empty string token result.reverse() # put list correct order return result else: raise exception("no possible segmentation!") def tokenize(string, wset, token): """returns either false if string can't segmented current wset or list of words segment string in reverse order.""" # done yet? if string == "": return [token] # find possible prefixes pref in wset: if string.startswith(pref): res = tokenize(string.replace(pref, '', 1), wset, pref) if res: res.append(token) return res # not possible return false print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # insane print segment("shareasale", ['share', 'a', 'sale', 'as']) # share sale print segment("asignas", ['as', 'sign', 'a']) # sign
Comments
Post a Comment