Remove duplicate lines in each paragraph of a file using sed or awk
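I want to remove duplicate lines from the paragraphs that begin with "set current" in a file. Paragraphs that share the same first line should not contain the same lines more than once between them, but duplicate lines that belong to paragraphs with different first lines must not be removed. Example:

If I have the following file: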
    set current = 'aaa' ;
    create syn file1 1000.file1 ;
    create syn file2 1000.file2 ;
    create syn file3 1001.file3 ;
    create syn file3 1001.file3 ;

    set current = 'aaa' ;
    create syn file1 1000.file1 ;
    create syn file2 1000.file2 ;
    create syn file7 1000.file7 ;

    set current = 'bbb' ;
    create syn file5 1002.file5 ;
    create syn file6 1003.file6 ;

    set current = 'bbb' ;
    create syn file1 1000.file1 ;
    create syn file8 1002.file8 ;
    create syn file6 1003.file6 ;

The result should be:
    set current = 'aaa' ;
    create syn file1 1000.file1 ;
    create syn file2 1000.file2 ;
    create syn file3 1001.file3 ;

    set current = 'aaa' ;
    create syn file7 1000.file7 ;

    set current = 'bbb' ;
    create syn file5 1002.file5 ;
    create syn file6 1003.file6 ;

    set current = 'bbb' ;
    create syn file1 1000.file1 ;
    create syn file8 1002.file8 ;
With awk you can do it like this:

    awk 'NF==0{print;next};/^set current/{c=$4;print;next}!seen[c,$0]++' file

With comments, to make it more readable:
    awk '
    NF == 0 {               # if the line is empty
        print               # print it
        next                # and skip to the next record
    }
    /^set current/ {        # if the line begins with "set current"
        c = $4              # store the value of the 4th field
        print               # print the current line
        next                # and skip to the next record
    }
    !seen[c,$0]++           # print the line if the combination of "c" and
                            # the current line has not yet been stored in
                            # the array "seen", then store the combination
                            # (to prevent later duplicates from being printed)
    ' file

The !seen[c,$0]++ works like this: when you use a comma in an array index, the two tokens are combined into a single string, joined by the SUBSEP character. In this case the index is the combination of the value of c and the current line ($0), since that combination is what needs to be unique after filtering. !seen[c,$0] checks whether that combination already exists as an index in the array. If the index is not present, the expression evaluates to true, which results in the line being printed. If the index is present, the expression evaluates to false, and the line is not printed. The post-fix increment operator counts the occurrences of the index, so a line is printed only at its first occurrence and not on subsequent matches.
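To try the answer end to end, here is a minimal sketch that wraps the same awk program in a shell function and runs it on the sample file from the question. The function name dedupe_paragraphs and the temporary file are only illustrative, and the sample assumes the paragraphs are separated by blank lines. The last command is a small check that seen[c,$0] and seen[c SUBSEP $0] refer to the same array element.

```shell
#!/bin/sh
# Hypothetical helper name; the awk program itself is the one from the answer.
dedupe_paragraphs() {
    awk '
        NF == 0        { print; next }          # keep blank separator lines
        /^set current/ { c = $4; print; next }  # remember the paragraph key
        !seen[c, $0]++                          # print first occurrence per key
    ' "$@"
}

# Sample input from the question (assumed: blank lines between paragraphs).
sample=$(mktemp)
cat > "$sample" <<'EOF'
set current = 'aaa' ;
create syn file1 1000.file1 ;
create syn file2 1000.file2 ;
create syn file3 1001.file3 ;
create syn file3 1001.file3 ;

set current = 'aaa' ;
create syn file1 1000.file1 ;
create syn file2 1000.file2 ;
create syn file7 1000.file7 ;

set current = 'bbb' ;
create syn file5 1002.file5 ;
create syn file6 1003.file6 ;

set current = 'bbb' ;
create syn file1 1000.file1 ;
create syn file8 1002.file8 ;
create syn file6 1003.file6 ;
EOF

dedupe_paragraphs "$sample"
rm -f "$sample"

# A comma-separated subscript is stored as one string joined by SUBSEP.
subsep_check=$(awk 'BEGIN {
    a["x", "y"] = 1
    if (("x" SUBSEP "y") in a) print "same key"; else print "different key"
}')
echo "$subsep_check"
```

Run on the sample, this should reproduce the expected result shown in the question. Because the key includes c, the line for file1 is still printed in the last 'bbb' paragraph even though it already appeared under 'aaa'.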