python - Likely memory leak in generator loop with islice
I'm working with large files holding several million records each (approx. 2 GB unpacked, several hundred MB gzipped).

I iterate over the records with islice, which lets me take either a small portion (for debugging and development) or the whole thing when I want to test the code. I have noticed absurdly large memory usage from this code and am trying to track down the leak.

Below is the memory_profiler output for the paired read (where I open two files and zip the records), run on 10**5 values (i.e. the default value is overridden).
Line #    Mem usage    Increment   Line Contents
================================================
   137     27.488 MiB      0.000 MiB   @profile
   138                                 def paired_read(read1, read2, nbrofitems = 10**8):
   139                                     """ Procedure for reading both sequences and stitching them """
   140     27.488 MiB      0.000 MiB       seqfreqs = Counter()
   141     27.488 MiB      0.000 MiB       linker_str = "~"
   142                                     #for rec1, rec2 in izip(read1, read2):
   143   3013.402 MiB   2985.914 MiB       for rec1, rec2 in islice(izip(read1, read2), nbrofitems):
   144   3013.398 MiB     -0.004 MiB           rec1 = rec1[9:]                          # trim primer variable sequence
   145   3013.398 MiB      0.000 MiB           rec2 = rec2[:150].reverse_complement()   # trim low quality half of 3' read and take rev complement
   146                                         #aaseq = Seq.translate(rec1 + rec2)
   147
   148                                         global nseqs
   149   3013.398 MiB      0.000 MiB           nseqs += 1
   150
   151   3013.402 MiB      0.004 MiB           if filter_seq(rec1, direction=5) and filter_seq(rec2, direction=3):
   152   3013.395 MiB     -0.008 MiB               aakey = str(Seq.translate(rec1)) + linker_str + str(Seq.translate(rec2))
   153   3013.395 MiB      0.000 MiB               seqfreqs.update({ aakey : 1 })
   154
   155   3013.402 MiB      0.008 MiB       print "========================================"
   156   3013.402 MiB      0.000 MiB       print "# of total sequences: %d" % nseqs
   157   3013.402 MiB      0.000 MiB       print "# of filtered sequences: %d" % sum(seqfreqs.values())
   158   3013.461 MiB      0.059 MiB       print "# of repeated occurrences: %d" % (sum(seqfreqs.values()) - len(list(seqfreqs)))
   159   3013.461 MiB      0.000 MiB       print "# of low-score sequences (<20): %d" % lowqseq
   160   3013.461 MiB      0.000 MiB       print "# of sequences with a stop codon: %d" % starseqs
   161   3013.461 MiB      0.000 MiB       print "========================================"
   162   3013.504 MiB      0.043 MiB       pprint(seqfreqs.most_common(100), width = 240)
The code, in short, does some filtering on the records and keeps track of how many times the strings occur in the file (a zipped pair of strings in this particular case).

100,000 strings of 150 chars with integer values in a Counter should land at around 100 MB tops, which I checked using the size function from @AaronHall's answer.
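(That function isn't reproduced here; as a rough stand-in, a recursive sys.getsizeof walk along the same lines looks something like the sketch below. This is my own approximation, not the exact code from that answer.)

import sys

def total_size(obj, seen=None):
    """Very rough recursive sys.getsizeof: the object plus the objects it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:               # don't double-count shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):         # Counter is a dict subclass
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

# e.g. print total_size(seqfreqs) after the loop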
Given the memory_profiler output, I suspect that islice doesn't let go of the previous entities over the course of the iteration. A Google search landed me at this bug report, but it is marked as solved for Python 2.7, which is what I'm running at the moment.

Any opinions?
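For reference, my understanding is that islice only keeps a reference to the underlying iterator and a counter, not the items it has already yielded. A synthetic sketch like the one below (dummy generators standing in for the real record iterators; none of this is my actual pipeline) should stay flat in memory when profiled, which is what makes the 3 GB so surprising:

from itertools import islice, izip

def fake_records(size=150):
    # Synthetic stand-in for a record iterator: endless dummy strings.
    while True:
        yield "A" * size

def drain(n=10**5):
    # Consume n zipped pairs without keeping any of them around.
    count = 0
    for rec1, rec2 in islice(izip(fake_records(), fake_records()), n):
        count += 1          # nothing is stored, so memory should stay flat
    return count

print drain()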
Edit: I have tried to skip islice, per a comment below, and use a for loop such as

for rec in list(next(read1) for _ in xrange(10**5)):

which makes no significant difference. This is in the case of a single file, in order to also avoid izip, which comes from itertools.
A secondary troubleshooting idea I had was to check whether gzip.open() reads and expands the whole file into memory and thereby causes the issue here. However, running the script on decompressed files makes no difference.
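(A quick way to convince yourself of this: gzip.open() returns a file-like object that decompresses on the fly, so iterating over it yields one line at a time. A minimal check, with a hypothetical filename, would be:)

import gzip
from itertools import islice

# Peek at the first few lines without reading the whole file into memory.
with gzip.open("reads_1.fastq.gz", "rb") as handle:    # hypothetical filename
    for line in islice(handle, 4):
        print line.rstrip()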
Note that memory_profiler reports the maximum memory consumption for each line. For long loops this can be misleading, because the first line of the loop will always seem to report a disproportionate amount of memory.

That is because it compares the first line of the loop against the memory consumption of the line before it, which is outside of the loop. It doesn't mean that the first line of the loop consumes 2985 MB; rather, the peak memory reached inside the loop is 2985 MB higher than the level outside of it.
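As a minimal illustration of that effect (a toy script, not the code from the question): all of the growth below happens in the append line, yet the line-by-line report attributes nearly the whole increment to the for line, because that line's peak is compared against the pre-loop line.

from memory_profiler import profile

@profile
def grow():
    acc = []
    # Memory grows a little on every iteration of the loop body...
    for i in xrange(10**5):
        acc.append(str(i) * 200)
    # ...but in the report the 'for' line shows almost the whole increment,
    # because its peak is compared against the line just before the loop.
    return len(acc)

if __name__ == "__main__":
    grow()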