Pivoting streaming dataframes without aggregation in pyspark - apache-spark

ID  type   value
A   car    camry
A   price  20000
B   car    tesla
B   price  40000
Example of the dataframe that is being streamed.
I need the output to look like this. Does anyone have suggestions?
ID  car    price
A   camry  20000
B   tesla  40000
What's a good way to transform this? I have been researching pivoting, but it requires an aggregation, which is not something I need.

You could filter the frame (df) twice and join the results:
(
    df.filter(df.type == "car").withColumnRenamed("value", "car")
    .join(
        df.filter(df.type == "price").withColumnRenamed("value", "price"),
        on="ID"
    )
    .select("ID", "car", "price")
)
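For context, here is a minimal end-to-end sketch of the same filter-and-join pivot applied to a streaming source. The file-based JSON source, the input path, and the console sink are illustrative assumptions, not part of the original question:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pivot_without_agg").getOrCreate()

# Assumed schema matching the example rows: ID, type, value.
schema = "ID STRING, type STRING, value STRING"

# Hypothetical streaming source: JSON files landing in a directory.
df = spark.readStream.schema(schema).json("/tmp/stream_input")

cars = df.filter(F.col("type") == "car").withColumnRenamed("value", "car")
prices = df.filter(F.col("type") == "price").withColumnRenamed("value", "price")

# Stream-stream inner join on ID; watermarks are optional for inner joins
# but help bound the join state in long-running queries.
pivoted = cars.join(prices, on="ID").select("ID", "car", "price")

query = pivoted.writeStream.outputMode("append").format("console").start()
query.awaitTermination()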

Related

spark join two dataframe without common column

Need to join two dataframes in pyspark.
One dataframe df1 is like:
city  user_count_city  meeting_session
NYC   100              5
LA    200              10
....
Another dataframe df2 is like:
total_user_count  total_meeting_sessions
1000              100
I need to calculate user_percentage and meeting_session_percentage, so I need something like a left join:
df1 left join df2
How can I join the two dataframes since they do not have a common key?
I took a look at the solution from this post: Joining two dataframes without a common column.
But it is not the same as my case.
Expected results
city  user_count_city  meeting_session  total_user_count  total_meeting_sessions
NYC   100              5                1000              100
LA    200              10               1000              100
....
You are looking for a cross join:
result = df1.crossJoin(df2)
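As a follow-up, here is a hedged sketch of deriving the two percentage columns mentioned in the question after the cross join; the column names come from the question, while the exact formula (a simple ratio) is an assumption:
from pyspark.sql import functions as F

result = (
    df1.crossJoin(df2)
    .withColumn("user_percentage",
                F.col("user_count_city") / F.col("total_user_count"))
    .withColumn("meeting_session_percentage",
                F.col("meeting_session") / F.col("total_meeting_sessions"))
)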

Optimize Join of two large pyspark dataframes

I have two large pyspark dataframes df1 and df2 containing GBs of data.
The columns in first dataframe are id1, col1.
The columns in second dataframe are id2, col2.
The dataframes have an equal number of rows. All values of id1 and id2 are unique, and each value of id1 corresponds to exactly one value of id2.
For example, the first few entries of df1 and df2 are as follows:
df1:
id1 | col1
12 | john
23 | chris
35 | david
df2:
id2 | col2
23 | lewis
35 | boon
12 | cena
So I need to join the two dataframes on key id1 and id2.
df = df1.join(df2, df1.id1 == df2.id2)
I am afraid this may suffer from shuffling.
How can I optimize the join operation for this special case?
To avoid shuffling at the time of the join operation, repartition the data on the id column beforehand.
The repartition itself performs a full shuffle, but it will optimize subsequent joins if there is more than one:
df1 = df1.repartition('id1')
df2 = df2.repartition('id2')
Another way to avoid a shuffle at join time is to leverage bucketing.
Save both dataframes with a bucketBy clause on the id column; when you read them back later, rows with the same id reside in the same partitions on the same executors, so the join avoids the shuffle.
To get the benefit of bucketing, however, you need a Hive metastore, since that is where the bucketing information is stored.
It also adds the extra steps of writing the bucketed tables and then reading them back.
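A minimal PySpark sketch of the bucketing approach, assuming a Hive-enabled SparkSession; the bucket count (16) and the table names are hypothetical:
# Write both dataframes bucketed (and sorted) on their join keys.
df1.write.bucketBy(16, "id1").sortBy("id1").mode("overwrite").saveAsTable("df1_bucketed")
df2.write.bucketBy(16, "id2").sortBy("id2").mode("overwrite").saveAsTable("df2_bucketed")

# Reading them back, the join can use the bucketing metadata and skip the
# shuffle, provided both tables were written with the same number of buckets.
b1 = spark.table("df1_bucketed")
b2 = spark.table("df2_bucketed")
joined = b1.join(b2, b1.id1 == b2.id2)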

How to read particular row from groupby dataframe?

I am preprocessing my data about car sales, where the price of many used cars is 0. I want to replace the 0 values with the mean price of similar cars.
Here, I have found the mean values for each car with the groupby function:
df2= df1.groupby(['car','body','year','engV'])['price'].mean()
This is my dataframe extracted from the actual data where the price is zero:
rep_price=df[df['price']==0]
I want to assign the mean price value from df2['Land Rover'] to the rows in rep_price for 'Land Rover' whose price is 0.
Since I'm not sure about the columns you have in those dataframes, I'm going to take a wild guess to give you a head start: let's say 'Land Rover' is a value in a column called 'Car_Type' in df1, and you grouped your data like this:
df2= df1.groupby(['Car_Type'])['price'].mean()
In that case something like this should cover your need:
df1.loc[df1['Car_Type']=='Land Rover','price'] = df2['Land Rover']
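If you need this for every car type rather than just 'Land Rover', here is a hedged sketch (still assuming the hypothetical 'Car_Type' and 'price' columns) that fills all zero prices with their group means in one pass:
import pandas as pd

# Small hypothetical dataset mirroring the assumed columns.
df1 = pd.DataFrame({
    "Car_Type": ["Land Rover", "Land Rover", "Toyota", "Toyota"],
    "price": [50000.0, 0.0, 20000.0, 0.0],
})

# Treat zeros as missing so they do not drag the group means down,
# then fill each zero with the mean of its own group.
nonzero_prices = df1["price"].where(df1["price"] != 0)
group_means = nonzero_prices.groupby(df1["Car_Type"]).transform("mean")
df1.loc[df1["price"] == 0, "price"] = group_means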

SparkSQL joining parent/child datasets

I am using SparkSQL 2.2.0 to load data from Cassandra and index it in Elasticsearch. The data I have consists of customers (first table, people) and orders (second table, orders).
Table orders has a column person_id that points to the corresponding customer.
I need to query (and later index in Elasticsearch) the people table together with the orders, so that for each customer I have the number of orders she placed.
The easiest approach I figured out is to read the two tables into org.apache.spark.sql.Dataset<Row>s and join them on the person_id column. Then I groupBy(person_id).
That gives me a Dataset with two columns, person_id and count, which I am obliged to join back with the people table so I can have the count alongside the other person data.
Dataset<Row> peopleWithOrders = people.join(orders, people.col("id").equalTo(orders.col("person_id")), "left_outer");
Dataset<Row> peopleOrdersCounts = peopleWithOrders.groupBy("id").count().withColumnRenamed("id", "personId");
Dataset<Row> peopleWithOrderCounts = people.join(peopleOrdersCounts, people.col("id").equalTo(peopleOrdersCounts.col("personId")), "left_outer")
        .withColumnRenamed("count", "nbrOfOrders")
        .select("id", "name", "birthDate", "nbrOfOrders");
The people table has 1_000_000 rows and the orders one 2_500_000. Each customer has 2 or 3 orders.
I am using a MacBook Pro with a 2.2 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 memory. Cassandra, the Spark 2.2 master, and the (single) worker all run on the same machine.
These 3 joins take 15 to 20 seconds.
My question is: is there room for performance gains? Would window aggregate functions have benefits, given that I see ShuffleMapTask in the logs?
Thanks in advance
I think the first step is unnecessary. You could just do this:
Dataset<Row> peopleOrdersCounts = orders.groupBy("person_id").count();
Dataset<Row> peopleWithOrderCounts = people.join(peopleOrdersCounts, people.col("id").equalTo(peopleOrdersCounts.col("person_id")), "left_outer")
        .withColumnRenamed("count", "nbrOfOrders")
        .select("id", "name", "birthDate", "nbrOfOrders");
I hope this helps.

Efficient way to do a join between dataframe when joining field is unique

I have 2 dataframes in Spark. Both of them have an id which is unique.
The structure is the following:
df1:
id_df1  values
abc     abc_map_value
cde     cde_map_value
fgh     fgh_map_value
df2:
id_df2  array_id_df1
123     [abc, fgh]
456     [cde]
I want to get the following dataframe result:
result_df:
id_df2  array_values
123     [map(abc, abc_map_value), map(fgh, fgh_map_value)]
456     [map(cde, cde_map_value)]
I can use Spark SQL to do so, but I don't think it's the most efficient way since the ids are unique.
Is there a way to store a key/value dictionary in memory to look up the value based on the key rather than doing a join? Would that be more efficient than a join?
If you explode df2 into key/value pairs, the join becomes easy and only a groupBy is needed.
You could experiment with other aggregations and reductions for more efficiency / parallelisation:
import org.apache.spark.sql.functions.{collect_list, explode, struct}
import spark.implicits._

df2
  .select('id_df2, explode('array_id_df1).alias("id_df1"))
  .join(df1, usingColumn = "id_df1")
  .groupBy('id_df2)
  .agg(collect_list(struct('id_df1, 'values)).alias("array_values"))
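If you are working in PySpark rather than Scala, here is a hedged equivalent of the same explode-then-group approach (the collected structs mirror the answer above; adjust the aggregation if you specifically need map values):
from pyspark.sql import functions as F

result_df = (
    df2.select("id_df2", F.explode("array_id_df1").alias("id_df1"))
    .join(df1, on="id_df1")
    .groupBy("id_df2")
    .agg(F.collect_list(F.struct("id_df1", "values")).alias("array_values"))
)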
