sed - Ruby file reading parallelism


I have a file with a lot of lines (say 1 billion). A script iterates through all of those lines to compare them against another data set.

Since this is running on 1 thread/1 core at the moment, I'm wondering if I could start multiple forks, each processing a part of the file simultaneously.

The only solution that has come to mind so far is the sed Unix command. With sed it's possible to read "slices" of a file (line x to line y). So a couple of forks could each process the output of its corresponding sed. The problem is that Ruby would load the whole sed output into RAM first.

Are there better solutions than sed, or is there a way to "stream" the sed output into Ruby?

What you are asking for won't actually help you.

Firstly, to jump to line n of a file, you first have to read the previous part of the file to count the number of line breaks in it. For example:

    $ ruby -e '(1..10000000).each { |i| puts "This is line number #{i}"}' > large_file.txt
    $ du -h large_file.txt
    266M    large_file.txt
    $ purge # Mac OS X command - clears any in-memory disk caches in use
    $ time sed -n -e "5000000p; 5000000q" large_file.txt
    This is line number 5000000
    sed -n -e "5000000p; 5000000q" large_file.txt  0.52s user 0.13s system 28% cpu 2.305 total
    $ time sed -n -e "5000000p; 5000000q" large_file.txt
    This is line number 5000000
    sed -n -e "5000000p; 5000000q" large_file.txt  0.49s user 0.05s system 99% cpu 0.542 total

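Note how the sed command wasn't instant: it had to read through the initial part of the file to figure out where the 5 millionth line was. That's why running it a second time was so much faster for me (0.542s versus 2.305s total): my computer had cached the file in RAM.

Even if you do pull this off (by splitting the file manually), you will get poor IO performance if you are constantly jumping between different parts of the file, or between files, to read the next line.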
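For completeness, the "streaming" half of the question does have an answer: Ruby does not have to load the whole sed output into RAM. The sketch below assumes a Unix-like system with sed on the PATH; `each_sed_slice` is a made-up helper name, not part of any library. It only solves the memory problem: each sed still has to scan the file from the start to find its slice, as described above.

```ruby
# Hypothetical helper: yields lines first..last of a file, one at a time,
# by reading sed's output through a pipe instead of buffering it all.
def each_sed_slice(path, first, last)
  # Print the requested range, then quit at `last` so sed stops reading early.
  script = "#{first},#{last}p;#{last}q"

  # The array form of IO.popen runs sed directly, without going through a shell.
  IO.popen(["sed", "-n", "-e", script, path]) do |sed|
    sed.each_line { |line| yield line }
  end
end

# Example use (assumes large_file.txt exists):
#   each_sed_slice("large_file.txt", 5_000_000, 5_000_010) { |line| puts line }
```

Only one line at a time is held in Ruby's memory here, so several forks could each consume their own slice this way.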


What would be better is to process every nth line on a separate thread (or process) instead. This allows the use of multiple CPU cores, yet still gives good IO performance. It can easily be done with the parallel library.

Example use (my computer has 4 cores):

    $ ruby -e '(1..100000).each { |i| puts "This is line number #{i}"}' > large_file.txt # use a smaller file to speed up the tests
    $ time ruby -r parallel -e "Parallel.each(File.open('large_file.txt').each_line, in_processes: 4) { |line| puts line if (line * 10000) =~ /9999/ }"
    This is line number 9999
    This is line number 19999
    This is line number 29999
    This is line number 39999
    This is line number 49999
    This is line number 59999
    This is line number 69999
    This is line number 79999
    This is line number 89999
    This is line number 99990
    This is line number 99991
    This is line number 99992
    This is line number 99993
    This is line number 99994
    This is line number 99995
    This is line number 99996
    This is line number 99997
    This is line number 99999
    This is line number 99998
    ruby -r parallel -e   55.84s user 10.73s system 400% cpu 16.613 total

    $ time ruby -r parallel -e "Parallel.each(File.open('large_file.txt').each_line, in_processes: 1) { |line| puts line if (line * 10000) =~ /9999/ }"
    This is line number 9999
    This is line number 19999
    This is line number 29999
    This is line number 39999
    This is line number 49999
    This is line number 59999
    This is line number 69999
    This is line number 79999
    This is line number 89999
    This is line number 99990
    This is line number 99991
    This is line number 99992
    This is line number 99993
    This is line number 99994
    This is line number 99995
    This is line number 99996
    This is line number 99997
    This is line number 99998
    This is line number 99999
    ruby -r parallel -e   47.04s user 7.46s system 97% cpu 55.738 total

The version using 4 processes completed in 29.81% of the time of the single-process version (16.613s versus 55.738s total), which is roughly 3.4 times faster.
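If adding the parallel gem is not an option, the "every nth line in its own process" idea can be approximated with plain `fork` and pipes. This is only a sketch under some assumptions: `each_nth_line_in_forks` is a hypothetical name, `fork` is not available on Windows, and each child scans the file sequentially by itself, so it relies on the OS disk cache to keep the repeated reads cheap.

```ruby
# Hypothetical helper (not part of the parallel gem): forks `workers` children.
# Child k reads the file sequentially and handles lines k, k + workers, ...
# The block returns a String to report back to the parent, or nil to skip.
# Results come back grouped by worker, not in file order.
def each_nth_line_in_forks(path, workers: 4)
  children = Array.new(workers) do |k|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      File.foreach(path).with_index do |line, i|
        next unless i % workers == k
        result = yield(line)
        writer.puts(result) if result
      end
      writer.close
      exit!(0) # leave without running at_exit handlers inherited from the parent
    end
    writer.close # the parent keeps only the read end
    [pid, reader]
  end

  # Drain each child's pipe until EOF, then reap that child.
  children.flat_map do |pid, reader|
    lines = reader.readlines(chomp: true)
    reader.close
    Process.wait(pid)
    lines
  end
end

# Example use (assumes large_file.txt exists), same CPU-heavy test as above:
#   hits = each_nth_line_in_forks("large_file.txt", workers: 4) do |line|
#     line.chomp if (line * 10000) =~ /9999/
#   end
#   puts hits
```

Sending results back through a pipe, rather than calling `puts` in each child, keeps the output in one place so the parent can sort or post-process it.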

