python - Likely memory leak in generator loop with islice


I am working with large files holding several million records each (approx. 2 GB unpacked, several hundred MB gzipped).

I iterate over the records with islice, which lets me take either a small portion (for debugging and development) or the whole thing when I want to test the code. I have noticed an absurdly large memory usage for my code, so I am trying to find a memory leak in it.

Below is the output from memory_profiler on a paired read (where I open two files and zip the records), for only 10**5 values (the default value gets overwritten).

Line #    Mem usage    Increment   Line Contents
================================================
   137   27.488 MiB    0.000 MiB   @profile
   138                             def paired_read(read1, read2, nbrofitems = 10**8):
   139                                 """ procedure for reading both sequences and stitching them """
   140   27.488 MiB    0.000 MiB       seqfreqs = Counter()
   141   27.488 MiB    0.000 MiB       linker_str = "~"
   142                                 #for rec1, rec2 in izip(read1, read2):
   143 3013.402 MiB 2985.914 MiB       for rec1, rec2 in islice(izip(read1, read2), nbrofitems):
   144 3013.398 MiB   -0.004 MiB           rec1 = rec1[9:]                         # trim primer variable sequence
   145 3013.398 MiB    0.000 MiB           rec2 = rec2[:150].reverse_complement()  # trim low quality half of 3' read and take rev complement
   146                                     #aaseq = seq.translate(rec1 + rec2)
   147
   148                                     global nseqs
   149 3013.398 MiB    0.000 MiB           nseqs += 1
   150
   151 3013.402 MiB    0.004 MiB           if filter_seq(rec1, direction=5) and filter_seq(rec2, direction=3):
   152 3013.395 MiB   -0.008 MiB               aakey = str(seq.translate(rec1)) + linker_str + str(seq.translate(rec2))
   153 3013.395 MiB    0.000 MiB               seqfreqs.update({ aakey : 1 })
   154
   155 3013.402 MiB    0.008 MiB       print "========================================"
   156 3013.402 MiB    0.000 MiB       print "# of total sequences: %d" % nseqs
   157 3013.402 MiB    0.000 MiB       print "# of filtered sequences: %d" % sum(seqfreqs.values())
   158 3013.461 MiB    0.059 MiB       print "# of repeated occurances: %d" % (sum(seqfreqs.values()) - len(list(seqfreqs)))
   159 3013.461 MiB    0.000 MiB       print "# of low-score sequences (<20): %d" % lowqseq
   160 3013.461 MiB    0.000 MiB       print "# of sequences with stop codon: %d" % starseqs
   161 3013.461 MiB    0.000 MiB       print "========================================"
   162 3013.504 MiB    0.043 MiB       pprint(seqfreqs.most_common(100), width = 240)

The code, in short, does some filtering on the records and keeps track of how many times each string occurs in the file (a zipped pair of strings in this particular case).
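For readers who want to reproduce the pattern without the sequencing libraries, here is a minimal sketch of the same filter-and-count loop. The names `count_pairs` and `keep` are hypothetical; `keep` merely stands in for `filter_seq`, and plain strings stand in for the sequence records.

```python
from collections import Counter
from itertools import islice


def count_pairs(read1, read2, keep, limit=None, linker="~"):
    """Count how often each joined pair of records occurs.

    `keep` is a hypothetical predicate standing in for filter_seq;
    `limit=None` means "process everything".
    """
    freqs = Counter()
    total = 0
    # zip is lazy on Python 3; on Python 2 use itertools.izip instead.
    for rec1, rec2 in islice(zip(read1, read2), limit):
        total += 1
        if keep(rec1) and keep(rec2):
            freqs[rec1 + linker + rec2] += 1
    return total, freqs


# Toy input: plain strings instead of sequence records.
reads_a = iter(["AAA", "CCC", "AAA", "NNN"])
reads_b = iter(["GGG", "TTT", "GGG", "GGG"])
total, freqs = count_pairs(reads_a, reads_b,
                           keep=lambda s: "N" not in s, limit=3)
```

With `limit=3` only the first three pairs are consumed, so the fourth pair is never read from either iterator.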

100 000 strings of 150 chars with integer values in a Counter should land around 100 MB tops, which I checked using the following function by @aaronhall.
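A rough way to sanity-check that estimate is sketched below. This is not the function referenced above; `shallow_counter_size` is a hypothetical helper that simply adds `sys.getsizeof` of the container to that of each distinct key and value, and the exact figure depends on the Python version and platform.

```python
import sys
from collections import Counter


def shallow_counter_size(counter):
    """Container size plus the size of each distinct key and value object."""
    seen = set()
    total = sys.getsizeof(counter)
    for key, value in counter.items():
        for obj in (key, value):
            # Shared objects (e.g. the cached small int 1) are counted once.
            if id(obj) not in seen:
                seen.add(id(obj))
                total += sys.getsizeof(obj)
    return total


# 100 000 distinct 150-character keys, each counted once.
counts = Counter("%0150d" % i for i in range(100000))
size_mib = shallow_counter_size(counts) / (1024.0 * 1024.0)
```

The raw character data alone is about 14 MiB, so anything in the tens of MiB is plausible for `size_mib`, which is far below the roughly 3 GB reported by the profiler.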

Given the memory_profiler output, I suspect that islice doesn't let go of previous entities over the course of the iteration. A Google search landed me at this bug report; however, it is marked as solved for Python 2.7, which is what I am running at the moment.
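One way to test that suspicion directly is to track the yielded objects with weak references and see how many are alive at any time. The sketch below assumes CPython, where reference counting frees an object as soon as the last reference disappears; `Record` and `records` are hypothetical stand-ins for the real sequence records and file iterators.

```python
import weakref
from itertools import islice


class Record(object):
    """Stand-in for a sequence record; a plain class can be weakly referenced."""

    def __init__(self, index):
        self.index = index


def records():
    """Endless stream of fresh Record objects."""
    index = 0
    while True:
        yield Record(index)
        index += 1


live = weakref.WeakSet()   # holds records only while something else does
peak_live = 0
for rec in islice(records(), 1000):
    live.add(rec)
    peak_live = max(peak_live, len(live))
del rec                    # drop the loop variable's reference to the last record
```

If islice kept earlier items alive, `peak_live` would grow with the number of iterations; if it stays at one or two, the growth has to come from somewhere else, such as the Counter or the records' own internals.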

Any opinions?

EDIT: I have tried to skip islice, as per the comment below, and use a for loop like

for rec in list(next(read1) for _ in xrange(10**5)):

which makes no significant difference. This is for the case of a single file, in order to also avoid izip, which comes from itertools as well.

A secondary troubleshooting idea I had was to check whether gzip.open() reads and expands the file into memory, and thus causes the issue here. However, running the script on decompressed files doesn't make a difference.
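The gzip hypothesis can also be checked in isolation. The sketch below is written for Python 3 (it relies on `tracemalloc` and text-mode `gzip.open`, neither of which exists in Python 2.7), and the file name and line content are made up for the test. It writes a small gzipped file, iterates over it line by line, and compares the peak traced memory with the uncompressed size.

```python
import gzip
import os
import tempfile
import tracemalloc

line = "ACGT" * 25 + "\n"            # 101 characters per line
n_lines = 50000                       # roughly 5 MB uncompressed
path = os.path.join(tempfile.mkdtemp(), "reads.txt.gz")  # hypothetical name

# Build the test file.
with gzip.open(path, "wt") as handle:
    for _ in range(n_lines):
        handle.write(line)

# Read it back one line at a time while tracing Python allocations.
tracemalloc.start()
seen = 0
with gzip.open(path, "rt") as handle:
    for _ in handle:
        seen += 1
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

uncompressed_bytes = len(line) * n_lines
os.remove(path)
```

A `peak_bytes` well below `uncompressed_bytes` indicates that the file is decompressed in chunks rather than expanded into memory in one go, which is consistent with the observation that decompressed input makes no difference.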

Note that memory_profiler only reports the maximum memory consumption for each line. For long loops this can be misleading, as the first line of the loop always seems to report a disproportionate amount of memory.

That is because it compares the first line of the loop with the memory consumption of the line before it, which lies outside the loop. It doesn't mean that the first line of the loop consumes 2985 MB, but rather that the peak memory within the loop is 2985 MB higher than the memory used outside the loop.
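The effect can be illustrated with a toy model. This is not memory_profiler's actual implementation, only a sketch of the rule described above: keep the maximum memory seen on each line, then report each line's increment relative to the line before it. The function `per_line_report` and the trace values are hypothetical.

```python
def per_line_report(samples):
    """samples: (line_number, memory_mib) pairs in execution order.

    Returns (line, peak, increment) tuples, where increment is the
    difference to the peak of the previous line in the source.
    """
    peak = {}
    for line, mem in samples:
        peak[line] = max(peak.get(line, mem), mem)
    report = []
    previous = None
    for line in sorted(peak):
        increment = 0.0 if previous is None else peak[line] - previous
        report.append((line, peak[line], increment))
        previous = peak[line]
    return report


# Hypothetical trace: line 1 is setup, line 2 is the loop header and
# line 3 the loop body, which allocates 10 MiB per iteration.
samples = [(1, 27.0)]
mem = 27.0
for _ in range(5):
    samples.append((2, mem))   # loop header, before this iteration's allocation
    mem += 10.0
    samples.append((3, mem))   # loop body, after the allocation
samples.append((2, mem))       # header is evaluated once more when the loop ends

report = per_line_report(samples)
# report == [(1, 27.0, 0.0), (2, 77.0, 50.0), (3, 77.0, 0.0)]
```

Although every allocation happens on line 3, the whole 50 MiB is attributed to the loop header on line 2, just as line 143 carries the 2985 MiB increment in the output above.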

