Remove duplicate lines in each paragraph of a file using sed or awk

- September 15, 2010

I want to remove duplicate lines from the paragraphs in a file that begin with "set current". Among paragraphs that share the same first line, each line should appear only once, but duplicate lines that belong to paragraphs with different first lines must be kept. For example:

If I have the following file:

    set current = 'aaa' ;
    create syn file1 1000.file1 ;
    create syn file2 1000.file2 ;
    create syn file3 1001.file3 ;
    create syn file3 1001.file3 ;

    set current = 'aaa' ;
    create syn file1 1000.file1 ;
    create syn file2 1000.file2 ;
    create syn file7 1000.file7 ;

    set current = 'bbb' ;
    create syn file5 1002.file5 ;
    create syn file6 1003.file6 ;

    set current = 'bbb' ;
    create syn file1 1000.file1 ;
    create syn file8 1002.file8 ;
    create syn file6 1003.file6 ;

the result should be:

    set current = 'aaa' ;
    create syn file1 1000.file1 ;
    create syn file2 1000.file2 ;
    create syn file3 1001.file3 ;

    set current = 'aaa' ;
    create syn file7 1000.file7 ;

    set current = 'bbb' ;
    create syn file5 1002.file5 ;
    create syn file6 1003.file6 ;

    set current = 'bbb' ;
    create syn file1 1000.file1 ;
    create syn file8 1002.file8 ;

With awk, you can do it like this:

    awk 'NF==0{print;next};/^set current/{c=$4;print;next}!seen[c,$0]++' file

With comments, to make it more readable:

    awk '
      NF == 0 {          # if we find an empty line
          print          # print it
          next           # and skip to the next record
      }
      /^set current/ {   # if we find a line beginning with "set current"
          c = $4         # store the value of the 4th field
          print          # print the current line
          next           # and skip to the next record
      }
      !seen[c,$0]++      # print the line only if the combination of "c" and
                         # the current line has not yet been stored in the
                         # array "seen", then store the combination (to
                         # prevent later identical lines from being printed)
      ' file
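To check the command against the sample from the question, the following sketch pipes the input through the one-liner from a shell variable instead of a file. The variable names "input" and "result" are illustrative assumptions and not part of the original answer.

```shell
#!/bin/sh
# Sample input from the question, held in a shell variable.
# ("input" and "result" are hypothetical names used only for this sketch.)
input="set current = 'aaa' ;
create syn file1 1000.file1 ;
create syn file2 1000.file2 ;
create syn file3 1001.file3 ;
create syn file3 1001.file3 ;

set current = 'aaa' ;
create syn file1 1000.file1 ;
create syn file2 1000.file2 ;
create syn file7 1000.file7 ;

set current = 'bbb' ;
create syn file5 1002.file5 ;
create syn file6 1003.file6 ;

set current = 'bbb' ;
create syn file1 1000.file1 ;
create syn file8 1002.file8 ;
create syn file6 1003.file6 ;"

# NF is the built-in field count in awk and must be written in uppercase.
result=$(printf '%s\n' "$input" |
    awk 'NF==0{print;next} /^set current/{c=$4;print;next} !seen[c,$0]++')

printf '%s\n' "$result"
```

The printed text should match the expected result shown in the question: the repeated file3 line and the file1, file2 and file6 lines already seen under the same "set current" value are dropped, while file1 is kept under 'bbb' because it was only seen under 'aaa'.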

The !seen[c,$0]++ works like this: when you use a comma in an array index, the two tokens are combined into a single string joined by the SUBSEP character. In this case the index is the combination of the value of c and the current line ($0), since that combination is what needs to be unique after filtering. !seen[c,$0] checks whether this combination already exists as an index in the array. If the index is not present, the expression evaluates to true, which results in the line being printed. If the index is present, the expression evaluates to false, and the line is not printed. The postfix increment operator counts the occurrences of the index, so the line is printed only at its first occurrence and not at subsequent matches.
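The behaviour of the index and of the negated postfix increment can be observed in isolation. The following is a minimal sketch; the variable names "first", "second", "joined" and "demo" are assumptions introduced only for illustration.

```shell
#!/bin/sh
demo=$(awk 'BEGIN {
    c = "aaa"
    line = "create syn file1 1000.file1 ;"

    # First test: the key is absent, so the counter is 0 and !0 is true (1).
    first = !seen[c, line]++
    # Second test: the counter is now 1, so the negation is false (0).
    second = !seen[c, line]++

    # The comma subscript is stored as one string: c, then SUBSEP, then line.
    joined = ((c SUBSEP line) in seen)

    print first, second, joined, seen[c, line]
}')
printf '%s\n' "$demo"
```

This prints 1 0 1 2: the first test succeeds, the second fails, the explicitly joined key is found in the array, and the counter has reached 2 after two tests.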
