I need to join two dataframes in PySpark.
The first dataframe, df1, looks like:
city user_count_city meeting_session
NYC 100 5
LA 200 10
....
The second dataframe, df2, looks like:
total_user_count total_meeting_sessions
1000 100
I need to calculate user_percentage and meeting_session_percentage, so I need something like a left join:
df1 left join df2
How can I join the two dataframes when they do not share a common key?
I took a look at the solution in this post, Joining two dataframes without a common column,
but that is not the same as my case.
Expected results
city user_count_city meeting_session total_user_count total_meeting_sessions
NYC 100 5 1000 100
LA 200 10 1000 100
....
You are looking for a cross join:
result = df1.crossJoin(df2)
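If you then want the percentage columns, here is a minimal sketch on top of the cross join, assuming the percentages are simply the city-level counts divided by the global totals (that formula is my assumption, the column names come from your example):

from pyspark.sql import functions as F

result = df1.crossJoin(df2)

# Hypothetical percentage definitions: city-level counts over the global totals
result = (result
    .withColumn("user_percentage",
                F.col("user_count_city") / F.col("total_user_count") * 100)
    .withColumn("meeting_session_percentage",
                F.col("meeting_session") / F.col("total_meeting_sessions") * 100))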
Related
I need to get the difference in data between two dataframes. I'm using subtract() for this.
# Dataframes that need to be compared
df1
df2
#df1-df2
new_df1 = df1.subtract(df2)
#df2-df1
new_df2 = df2.subtract(df1)
It works fine and the output is what I needed, but my only issue is with the performance.
Even for comparing 1 GB of data, it takes around 50 minutes, which is far from ideal.
Is there any other optimised method to perform the same operation?
Following are some of the details regarding the dataframes:
df1 size = 9397995 * 30
df2 size = 1500000 * 30
All 30 columns are of dtype string.
Both dataframes are being loaded from a database through a JDBC connection.
Both dataframes have same column names and in same order.
You could use a WHERE clause (a filter) on the datasets to drop rows you don't need before comparing them. For example, filter by primary key: if you know df1 only contains PKs ranging from 1 to 100, filter df2 down to that same range of PKs. Obviously, apply this after any union.
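A rough sketch of that idea, assuming both tables carry a primary key column, here called id (the column name and the range check are my assumptions, not from your schema):

from pyspark.sql import functions as F

# Find the PK range present in df1 ("id" is a hypothetical column name)
min_id, max_id = df1.agg(F.min("id"), F.max("id")).first()

# Only compare the rows of df2 that fall inside that range
df2_filtered = df2.filter((F.col("id") >= min_id) & (F.col("id") <= max_id))

new_df1 = df1.subtract(df2_filtered)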
What is the best way to merge two pandas dataframes with 150k records each? The merge should happen on one common column, such as ID.
df1 (1st dataframe, 150k records)
df2 (2nd dataframe, 150k records)
df3 = pd.merge(df1,df2,on='ID',how='left')
Is this the correct way to merge?
I have 2 large dataframes. Each row has lat/lon data. My goal is to join the two dataframes and find all pairs of points that are within a given distance of each other, e.g. 100 m.
df1: (id, lat, lon, geohash7)
df2: (id, lat, lon, geohash7)
I want to partition df1 and df2 on geohash7, and then only join within the partitions. I want to avoid joining between partitions to reduce computation.
df1 = df1.repartition(200, "geohash7")
df2 = df2.repartition(200, "geohash7")
df_merged = df1.join(df2, (df1["geohash7"] == df2["geohash7"]) & (dist(df1["lat"], df1["lon"], df2["lat"], df2["lon"]) < 100))
So basically, join on geohash7 and then make sure the distance between the points is less than 100 m.
The problem is that Spark will actually cross join all the data. How can I make it join only within partitions (intra-partition) and not across partitions (inter-partition)?
After much playing with the data, it seems that Spark is smart enough to evaluate the equality condition ("geohash7") first, so if there is no match there it never calculates the "dist" function.
It also appears that with the equality condition present it no longer does a cross join, so I didn't have to change anything else; the join above works fine.
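For reference, here is a runnable PySpark sketch of that join with a hedged haversine-style dist written as a UDF; the function body, the Earth-radius constant, and the 100 m threshold are my assumptions based on the description:

import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Hypothetical great-circle distance in metres (haversine formula)
@F.udf(DoubleType())
def dist(lat1, lon1, lat2, lon2):
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

df_merged = df1.join(
    df2,
    (df1["geohash7"] == df2["geohash7"]) &
    (dist(df1["lat"], df1["lon"], df2["lat"], df2["lon"]) < 100),
)

Because the equality on geohash7 comes first, Spark plans this as an equi-join and only evaluates the UDF on rows whose geohashes match, which is consistent with what was observed above.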
I have two large pyspark dataframes df1 and df2 containing GBs of data.
The columns in first dataframe are id1, col1.
The columns in second dataframe are id2, col2.
The dataframes have equal number of rows.
All values of id1 and id2 are unique.
Each value of id1 corresponds to exactly one value of id2.
For example, the first few entries of df1 and df2 are as follows:
df1:
id1 | col1
12 | john
23 | chris
35 | david
df2:
id2 | col2
23 | lewis
35 | boon
12 | cena
So I need to join the two dataframes on key id1 and id2.
df = df1.join(df2, df1.id1 == df2.id2)
I am afraid this may suffer from shuffling.
How can I optimize the join operation for this special case?
To avoid shuffling at join time, repartition the data on your id columns beforehand.
The repartition itself performs a full shuffle, but it will optimize subsequent joins if there is more than one.
df1 = df1.repartition('id1')
df2 = df2.repartition('id2')
Another way to avoid shuffles at join is to leverage bucketing.
Save both dataframes using a bucketBy clause on the id column; when you later read them back, rows with the same id land on the same executors, avoiding the shuffle.
But to get the benefit of bucketing you need a Hive metastore, because the bucketing information is stored there.
It also adds the extra steps of writing the bucketed tables and reading them back.
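A minimal sketch of the bucketing route, assuming you can persist both dataframes as tables; the table names and the bucket count of 100 are placeholders:

# Write both sides bucketed (and sorted) on the join keys; needs a Hive metastore
(df1.write
    .bucketBy(100, "id1")
    .sortBy("id1")
    .saveAsTable("df1_bucketed"))

(df2.write
    .bucketBy(100, "id2")
    .sortBy("id2")
    .saveAsTable("df2_bucketed"))

# Reading the tables back picks up the bucketing metadata,
# so this join can avoid a full shuffle
b1 = spark.table("df1_bucketed")
b2 = spark.table("df2_bucketed")
df = b1.join(b2, b1.id1 == b2.id2)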
I have 2 dataframes in Spark. Both of them have an id which is unique.
The structure is the following
df1:
id_df1 values
abc abc_map_value
cde cde_map_value
fgh fgh_map_value
df2:
id_df2 array_id_df1
123 [abc, fgh]
456 [cde]
I want to get the following dataframe result:
result_df:
id_df2 array_values
123 [map(abc,abc_map_value), map(fgh,fgh_map_value)]
456 [map(cde,cde_map_value)]
I can use Spark SQL to do this, but I don't think it is the most efficient way, since the ids are unique.
Is there a way to store a key/value dictionary in memory and look up values by key rather than doing a join? Would that be more efficient than a join?
If you explode df2 into (id_df2, id_df1) pairs, the join becomes easy and only a groupBy is needed afterwards.
You could experiment with other aggregations and reductions for more efficiency / parallelism.
import org.apache.spark.sql.functions.{explode, collect_list, struct}
import spark.implicits._

df2
  .select('id_df2, explode('array_id_df1).alias("id_df1"))
  .join(df1, "id_df1")
  .groupBy('id_df2)
  .agg(collect_list(struct('id_df1, 'values)).alias("array_values"))
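If df1 is small enough to fit in executor memory (an assumption on my part), the same pipeline in PySpark with a broadcast hint gets close to the in-memory lookup you describe:

from pyspark.sql import functions as F

result_df = (df2
    .select("id_df2", F.explode("array_id_df1").alias("id_df1"))
    .join(F.broadcast(df1), on="id_df1")  # ship df1 to every executor so the join avoids a shuffle
    .groupBy("id_df2")
    .agg(F.collect_list(F.struct("id_df1", "values")).alias("array_values")))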