Characters get corrupted if spark.executor.memory is not set properly when importing a CSV into a DataFrame
Update: Please hold on this question. I found that it might be a problem with Spark 1.5 itself, as I am not using an official release of Spark. I'll keep updating this question. Thank you!
I noticed a strange bug when using spark-csv to import a CSV file into a DataFrame in Spark.
Here is the sample code:
import org.apache.spark.sql.types._

object SparkTry {
  def main(args: Array[String]) {
    AutoLogger.setLevel("INFO")

    val sc = SingletonSparkContext.getInstance()
    val sql_context = SingletonSQLContext.getInstance(sc)

    // spark-csv options: the file has a header row and is UTF-8 encoded
    val options = new collection.mutable.HashMap[String, String]()
    options += "header" -> "true"
    options += "charset" -> "UTF-8"

    val customSchema = StructType(Array(
      StructField("year", StringType),
      StructField("brand", StringType),
      StructField("category", StringType),
      StructField("model", StringType),
      StructField("sales", DoubleType)))

    val dataframe = sql_context.read.format("com.databricks.spark.csv")
      .options(options)
      .schema(customSchema)
      .load("hdfs://myhdfsserver:9000/bigdata/carsales.csv")

    dataframe.head(10).foreach(x => AutoLogger.info(x.toString))
  }
}
CarSales is a small CSV file. I noticed that when spark.master is not local, setting spark.executor.memory above 16GB results in corruption of the DataFrame. The output of the program is shown below (I copied the text from the log; in this case spark.executor.memory is set to 32GB):
16/03/07 12:39:50.190 INFO DAGScheduler: Job 1 finished: head at SparkTry.scala:35, took 8.009183 s
16/03/07 12:39:50.225 INFO AutoLogger$: [ , , ,ries ,142490.0]
16/03/07 12:39:50.225 INFO AutoLogger$: [ , , ,ries ,112464.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,90960.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,100910.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,94371.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,54142.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,14773.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,12276.0]
16/03/07 12:39:50.227 INFO AutoLogger$: [ , , ,ries ,9254.0]
16/03/07 12:39:50.227 INFO AutoLogger$: [ , , ,ries ,12253.0]
While the first 10 lines of the file are:
1/1/2007,bmw,compact,bmw 3-series,142490.00
1/1/2008,bmw,compact,bmw 3-series,112464.00
1/1/2009,bmw,compact,bmw 3-series,90960.00
1/1/2010,bmw,compact,bmw 3-series,100910.00
1/1/2011,bmw,compact,bmw 3-series,94371.00
1/1/2007,bmw,compact,bmw 5-series,54142.00
1/1/2007,bmw,fullsize,bmw 7-series,14773.00
1/1/2008,bmw,fullsize,bmw 7-series,12276.00
1/1/2009,bmw,fullsize,bmw 7-series,9254.00
1/1/2010,bmw,fullsize,bmw 7-series,12253.00
I noticed that with spark.executor.memory set to 16GB on my machine, the first 10 lines are correct, but setting it above 16GB results in the corruption.
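For reference, here is a minimal sketch of how the setting is applied, assuming SingletonSparkContext builds its context from a plain SparkConf (the master URL below is hypothetical, and this is not my actual helper code):

import org.apache.spark.{SparkConf, SparkContext}

// Roughly what SingletonSparkContext would set up (a sketch, not the actual helper).
val conf = new SparkConf()
  .setAppName("SparkTry")
  .setMaster("spark://master-host:7077") // hypothetical URL; the bug only shows up when spark.master is not "local"
  .set("spark.executor.memory", "16g")   // correct output at 16g and below; corrupted above 16g on this machine
val sc = new SparkContext(conf)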
What's more: on one of my servers, which has 256GB of memory, setting it to 16GB also produces the bug; instead, setting it to 48GB makes it work fine. In addition, I tried printing dataframe.rdd, and it shows that the content of the RDD is correct, while the DataFrame is not.
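A minimal sketch of that comparison, using the same dataframe as in the code above:

// The RDD view prints the expected rows, while the DataFrame view prints the corrupted ones.
dataframe.rdd.take(10).foreach(row => AutoLogger.info(row.toString)) // content is correct
dataframe.head(10).foreach(row => AutoLogger.info(row.toString))     // content is corrupted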
Does anyone have any idea about this problem?
Thank you!
It turns out to be a bug in Kryo serialization in Spark 1.5.1 & 1.5.2:
https://github.com/databricks/spark-csv/issues/285#issuecomment-193633716
This is fixed in 1.6.0. It has nothing to do with spark-csv.
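For anyone hitting this, here is a sketch of the dependency bump that resolves it; the spark-csv version shown (1.3.0) is an assumption, and any release compatible with Spark 1.6 should do:

// build.sbt (sketch): upgrade Spark to 1.6.0; spark-csv itself does not need a fix
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0",
  "org.apache.spark" %% "spark-sql"  % "1.6.0",
  "com.databricks"   %% "spark-csv"  % "1.3.0" // assumed version; the bug is in Spark, not spark-csv
)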