python - Text segmentation: Algorithm to match input with the longest words from the dictionary -


i need split string words such each word comes dictionary. make sure longest possible word left chosen. hence

thisisinsane => insane (correct longest possible word left) thisisinsane => in sane(wrong)  assuming 'this', 'is', 'in', 'insane' words in dictionary. 

i managed solved problem traversing end of string beginning matching longest word possible. problem started cropping problems these..

shareasale => share ale(wrong 'ale' not in dictionary) shareasale => share sale(correct) assuming 'share', 'a', 'sale' words in dictionary unlike 'ale'.  

i tried solve problem removing valid segments found before encountering error i.e.

shareasale => 'share' , 'as' (error = 'ale') 

and removing them once dictionary , solve problem. so

shareasale => no valid segments when share removed shareasale => share sale (when 'as' removed dictionary. 

thus managed solve problem too. unable solve

asignas => 'as' ( error = 'ignas') 

my solution remove 'as' dictionary , try solve it

asignas => 'a' 'sign' (error = 'as') 

because in new recursive call 'as' has been removed dictionary. function wrote in link. hope can go through , me find better algorithm solve else suggest modification existing algorithm.

essentially problem tree problem, @ every level words form prefix of tree form branches. branch leaves no part of string correct solution.

                      thisisinsane                           |                           |                      (this)isinsane                      /            \                     /              \           (this,i)sinsane         (this,is)insane               /                     /           \              /                     /             \   (this,i,sin)ane          (this,is,in)sane    (this,is,insane)                                 /                                /                        (this,is,in,sane) 

so in example there 2 solutions, want select solution using longest words, want explore tree right using depth-first-search strategy.

so our algorithm should:

  1. sort dictionary descending length.
  2. find prefixes of current string. if there none, return false.
  3. set prefix longest unexplored prefix.
  4. remove string. if string empty, found solution, return list of prefixes.
  5. recurse 2.
  6. this branch has failed, return false.

a sample implementation of solution:

def segment(string,wset):     """segments string words prefering longer words givens     dictionary wset."""     # sort wset in decreasing string order     wset.sort(key=len, reverse=true)     result = tokenize(string, wset, "")     if result:         result.pop() # remove empty string token         result.reverse() # put list correct order         return result     else:         raise exception("no possible segmentation!")  def tokenize(string, wset, token):     """returns either false if string can't segmented      current wset or list of words segment string     in reverse order."""     # done yet?     if string == "":         return [token]     # find possible prefixes     pref in wset:         if string.startswith(pref):             res = tokenize(string.replace(pref, '', 1), wset, pref)             if res:                 res.append(token)                 return res     # not possible     return false  print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # insane print segment("shareasale", ['share', 'a', 'sale', 'as'])     # share sale print segment("asignas", ['as', 'sign', 'a'])                 # sign 

Comments

Popular posts from this blog

java - Run spring boot application error: Cannot instantiate interface org.springframework.context.ApplicationListener -

reactjs - React router and this.props.children - how to pass state to this.props.children -

Excel VBA "Microsoft Windows Common Controls 6.0 (SP6)" Location Changes -