Characters get corrupted if spark.executor.memory is not set properly when importing CSV to DataFrame


Update: Please hold on this question. I found this might be a problem with Spark 1.5 itself, as I am not using the official version of Spark. I'll keep updating this question. Thank you!

I noticed a strange bug when using spark-csv to import a CSV file into a DataFrame in Spark.

Here is my sample code:

    import org.apache.spark.sql.types._

    // AutoLogger, SingletonSparkContext and SingletonSqlContext are the
    // poster's own helper objects; their definitions are not shown.
    object SparkTry {
      def main(args: Array[String]): Unit = {
        AutoLogger.setLevel("INFO")

        val sc = SingletonSparkContext.getInstance()
        val sql_context = SingletonSqlContext.getInstance(sc)

        // spark-csv reader options
        val options = new collection.mutable.HashMap[String, String]()
        options += "header" -> "true"
        options += "charset" -> "UTF-8"

        // Explicit schema: four string columns and one numeric column
        val customSchema = StructType(Array(
          StructField("year", StringType),
          StructField("brand", StringType),
          StructField("category", StringType),
          StructField("model", StringType),
          StructField("sales", DoubleType)))

        val dataframe = sql_context.read.format("com.databricks.spark.csv")
          .options(options)
          .schema(customSchema)
          .load("hdfs://myhdfsserver:9000/bigdata/carsales.csv")

        // Print the first 10 rows
        dataframe.head(10).foreach(x => AutoLogger.info(x.toString))
      }
    }

carsales.csv is a small CSV file. I noticed that when spark.master is not local, setting spark.executor.memory above 16 GB results in a corrupted DataFrame. The output of this program is shown below (I copied the text from the log; in this case spark.executor.memory is set to 32 GB):

    16/03/07 12:39:50.190 INFO DAGScheduler: Job 1 finished: head at SparkTry.scala:35, took 8.009183 s
    16/03/07 12:39:50.225 INFO AutoLogger$: [       ,  ,      ,ries       ,142490.0]
    16/03/07 12:39:50.225 INFO AutoLogger$: [       ,  ,      ,ries       ,112464.0]
    16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,      ,ries       ,90960.0]
    16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,      ,ries       ,100910.0]
    16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,      ,ries       ,94371.0]
    16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,      ,ries       ,54142.0]
    16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,       ,ries       ,14773.0]
    16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,       ,ries       ,12276.0]
    16/03/07 12:39:50.227 INFO AutoLogger$: [       ,  ,       ,ries       ,9254.0]
    16/03/07 12:39:50.227 INFO AutoLogger$: [       ,  ,       ,ries       ,12253.0]

while the first 10 lines of the file are:

    1/1/2007,bmw,compact,bmw 3-series,142490.00
    1/1/2008,bmw,compact,bmw 3-series,112464.00
    1/1/2009,bmw,compact,bmw 3-series,90960.00
    1/1/2010,bmw,compact,bmw 3-series,100910.00
    1/1/2011,bmw,compact,bmw 3-series,94371.00
    1/1/2007,bmw,compact,bmw 5-series,54142.00
    1/1/2007,bmw,fullsize,bmw 7-series,14773.00
    1/1/2008,bmw,fullsize,bmw 7-series,12276.00
    1/1/2009,bmw,fullsize,bmw 7-series,9254.00
    1/1/2010,bmw,fullsize,bmw 7-series,12253.00

I noticed that after changing spark.executor.memory to 16 GB on my machine, the first 10 lines are correct, but setting it above 16 GB results in corruption.

What's more, on one of my servers, which has 256 GB of memory, setting it to 16 GB also produces this bug, whereas setting it to 48 GB makes it work fine. In addition, I tried printing dataframe.rdd, and it shows that the content of the RDD is correct while the content of the DataFrame is not.
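For anyone who wants to repeat that RDD-versus-DataFrame check, here is a minimal sketch. `RowDiff` and `firstMismatch` are hypothetical names, not part of Spark or spark-csv; the helper is plain Scala and only compares two row dumps that you collect yourself.

```scala
// Hypothetical helper (not part of Spark or spark-csv): reports the index of
// the first row at which two row dumps disagree, or None if they are equal.
object RowDiff {
  def firstMismatch(expected: Seq[Seq[Any]], actual: Seq[Seq[Any]]): Option[Int] = {
    // zipAll pads the shorter dump with null so a missing row counts as a mismatch
    val idx = expected.zipAll(actual, null, null).indexWhere { case (e, a) => e != a }
    if (idx >= 0) Some(idx) else None
  }
}

// Assumed usage with the `dataframe` value from the sample code:
//   val viaRdd = dataframe.rdd.take(10).map(_.toSeq).toSeq
//   val viaDf  = dataframe.head(10).map(_.toSeq).toSeq
//   RowDiff.firstMismatch(viaRdd, viaDf)
```

A `Some(n)` result would mean the two access paths disagree at row `n`, which is the symptom described above.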

Does anyone have any idea about this problem?

Thank you!

It turns out to be a bug in Kryo serialization in Spark 1.5.1 and 1.5.2.

https://github.com/databricks/spark-csv/issues/285#issuecomment-193633716

This is fixed in 1.6.0. It has nothing to do with spark-csv.
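If you want a job to warn when it runs on one of the releases named above, a small guard along these lines might help. `AffectedVersions` and `isAffected` are hypothetical names, and the check is an exact string match only, so a custom or vendor build with a different version string would not be recognised.

```scala
// Hypothetical guard (not part of Spark): flags the two releases that the
// answer above names as affected. Exact string match only.
object AffectedVersions {
  private val affected = Set("1.5.1", "1.5.2")

  def isAffected(sparkVersion: String): Boolean =
    affected.contains(sparkVersion.trim)
}

// Assumed usage with an existing SparkContext named `sc`:
//   if (AffectedVersions.isAffected(sc.version)) {
//     println("This Spark release may return corrupted DataFrame rows; consider 1.6.0")
//   }
```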

