python - PySpark: how to add values between two RDDs where the key matches


Suppose I have two RDDs,

where rdd1 has rows of the form (key1, key2, value)

and rdd2 has rows of the form (key1, value).

Now I want to apply a combining operation (+ or minus) from rdd2 onto rdd1 wherever key1 matches. Here is an example:

rdd1 = sc.parallelize([(1, 1, 3), (1, 2, 2), (2, 2, 5)]) and rdd2 = sc.parallelize([(1, 1)])

I want the result to be

rdd3 = [(1, 1, 4), (1, 2, 3), (2, 2, 5)], where the first and second rows had the rdd2 value added, while the third one was left alone because its key1 has no match.

I tried to use a left outer join to find the matches on key1, but the operation then lost the data that didn't need the operation. Is there a way to apply the operation to only part of the data?
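To make the attempt concrete, a rough RDD-only sketch of that left outer join (keying rdd1 by key1 first; the names here are illustrative, not from the question):

keyed1 = rdd1.map(lambda r: (r[0], (r[1], r[2])))  # (key1, (key2, value))
joined = keyed1.leftOuterJoin(rdd2)  # (key1, ((key2, value), v2)), v2 is None where there is no match
# a plain value + v2 runs into trouble on the unmatched rows, because v2 is None there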

Assuming you want pairwise operations, or your data contains 1-to-0..1 relationships, the simplest thing you can do is convert both RDDs to DataFrames:

from pyspark.sql.functions import coalesce, lit

df1 = sc.parallelize([
    (1, 1, 3), (1, 2, 2), (2, 2, 5)
]).toDF(("key1", "key2", "value"))

df2 = sc.parallelize([(1, 1)]).toDF(("key1", "value"))

new_value = (
    df1["value"] +  # old value
    coalesce(df2["value"], lit(0))  # if no match (null) take 0
).alias("value")  # set alias

df1.join(df2, ["key1"], "leftouter").select("key1", "key2", new_value)
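For reference, running this on the sample data should produce the result asked for in the question (row order may vary):

result = df1.join(df2, ["key1"], "leftouter").select("key1", "key2", new_value)
result.show()
# +----+----+-----+
# |key1|key2|value|
# +----+----+-----+
# |   1|   1|    4|
# |   1|   2|    3|
# |   2|   2|    5|
# +----+----+-----+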

You can adjust this to handle other scenarios by applying an aggregation on df2 before joining the DataFrames, as sketched below.
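For example, if df2 could hold several rows per key1 (a 1-to-many relationship), one sketch is to sum them per key first and then reuse the join above (df2_many is a hypothetical input, not from the question):

from pyspark.sql.functions import sum as sum_

df2_many = sc.parallelize([(1, 1), (1, 2)]).toDF(("key1", "value"))
df2_agg = df2_many.groupBy("key1").agg(sum_("value").alias("value"))  # one row per key1
# then join df1 with df2_agg exactly as with df2 above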

