python - PySpark: how to add between two RDDs where the same key matches
Suppose I have 2 RDDs, where rdd1 contains rows of the form (key1, key2, value) and rdd2 contains rows of the form (key1, value).
Now I want a combine operation (plus or minus) that applies rdd2 to rdd1 wherever key1 matches. Here is an example:

rdd1 = sc.parallelize([(1, 1, 3), (1, 2, 2), (2, 2, 5)])
rdd2 = sc.parallelize([(1, 1)])

I want the result to be

rdd3 = [(1, 1, 4), (1, 2, 3), (2, 2, 5)]

The first and second rows had the value added, while the third one did not, since its key1 (2) has no match in rdd2. I tried using a left outer join to find the matches on key1, but the operation lost the rows that have no match and need no operation. Is there a way to apply the operation to only part of the data?
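For what it's worth, a left outer join on plain RDDs does not have to drop the unmatched rows; here is a minimal sketch of that approach (variable names are illustrative, and it assumes an existing SparkContext sc and the (key1, key2, value) layout above):

rdd1 = sc.parallelize([(1, 1, 3), (1, 2, 2), (2, 2, 5)])
rdd2 = sc.parallelize([(1, 1)])

# Key rdd1 by key1 so it can be joined against rdd2
keyed = rdd1.map(lambda r: (r[0], (r[1], r[2])))

# leftOuterJoin keeps every rdd1 row; the right side is None when key1 has no match
rdd3 = (keyed
        .leftOuterJoin(rdd2)
        .map(lambda kv: (kv[0],          # key1
                         kv[1][0][0],    # key2
                         kv[1][0][1] + (kv[1][1] if kv[1][1] is not None else 0))))

rdd3.collect()  # [(1, 1, 4), (1, 2, 3), (2, 2, 5)] (order may vary)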
Assuming you want pairwise operations, or that the data contains only 0..1 relationships, the simplest thing you can do is to convert both RDDs to DataFrames:
from pyspark.sql.functions import coalesce, lit

df1 = sc.parallelize([
    (1, 1, 3), (1, 2, 2), (2, 2, 5)
]).toDF(("key1", "key2", "value"))

df2 = sc.parallelize([(1, 1)]).toDF(("key1", "value"))

new_value = (
    df1["value"] +                   # old value
    coalesce(df2["value"], lit(0))   # if there is no match (NULL) take 0
).alias("value")                     # set an alias

df1.join(df2, ["key1"], "leftouter").select("key1", "key2", new_value)

You can adjust this to handle other scenarios by applying an aggregation on df2 before joining the DataFrames.
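As a sketch of that adjustment, assuming df2 could carry several rows per key1 and that summing them per key is the intended semantics (both are assumptions, not something stated above):

from pyspark.sql import functions as F

# Collapse df2 to at most one row per key1 before joining,
# so the left outer join stays a 0..1 relationship
df2_agg = df2.groupBy("key1").agg(F.sum("value").alias("value"))

result = (df1
          .join(df2_agg, ["key1"], "leftouter")
          .select("key1", "key2",
                  (df1["value"] + F.coalesce(df2_agg["value"], F.lit(0)))
                  .alias("value")))

result.show()

The same pattern covers subtraction: replace + with - in the column expression.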