PySpark: Inconsistent count() result after join - apache-spark

I am completely baffled with the following problem:
When I join 2 data frames and return the row count, I get a slightly different count on each try. Here are the details:
I would like to join the data frames: 'df_user_ids' and 'df_conversions':
df_user_ids.show()
>>>
+--------------------+
| user_id|
+--------------------+
|AMsySZY-cqcufnXst...|
|AMsySZY1Oo75A6vKU...|
|AMsySZY4nbqZiuEMR...|
|AMsySZY5RSfgj6Xvi...|
|AMsySZY5geAmTx0er...|
|AMsySZY6Gskv_kEAv...|
|AMsySZY6MIOyPWM4U...|
|AMsySZYCEZYS00UB9...|
df_conversions.show()
>>>
+--------------------+----------------------+---------+
| user_id|time_activity_observed|converted|
+--------------------+----------------------+---------+
|CAESEAl1YPOZpaWVx...| 2018-03-23 12:15:37| 1|
|CAESEAuvSBzmfc_f3...| 2018-03-23 21:58:25| 1|
|CAESEBXWsSYm4ntvR...| 2018-03-30 12:16:53| 1|
|CAESEC-5uPwWGFdnv...| 2018-03-23 08:52:48| 1|
|CAESEDB3Z-NNvz7zL...| 2018-03-24 21:37:05| 1|
|CAESEDu7S7rGTVlj2...| 2018-04-01 17:00:12| 1|
|CAESEE4s6g1-JlUEt...| 2018-03-23 19:32:23| 1|
|CAESEELlJt0mE2xjn...| 2018-03-24 18:26:15| 1|
Both data frames have the key column named: "user_id",
and both are created using ".sampleBy()" with a fixed seed:
.sampleBy("converted", fractions={0: 0.035, 1: 1}, seed=0)
Before I join the data frames I persist them to disk:
df_user_ids.persist(StorageLevel.DISK_ONLY)
df_conversions.persist(StorageLevel.DISK_ONLY)
Then I verify that the row count of both data frames is consistent:
df_user_ids.count()
>>> 584309
df_user_ids.count()
>>> 584309
df_conversions.count()
>>> 5830
df_conversions.count()
>>> 5830
And check that the key column of both data frames does not contain duplicates:
df_user_ids.count()
>>> 584309
df_user_ids.select('user_id').distinct().count()
>>> 584309
df_conversions.count()
>>> 5830
df_conversions.select('user_id').distinct().count()
>>> 5830
Then I get the inconsistent row counts when I join them!
df_user_ids.join(df_conversions, ["user_id"], "left").count()
>>> 584314
df_user_ids.join(df_conversions, ["user_id"], "left").count()
>>> 584317
df_user_ids.join(df_conversions, ["user_id"], "left").count()
>>> 584304
How is this possible??
Sometimes the joined count is higher than "df_user_ids.count()" and sometimes it is lower. I am running this code in a Zeppelin notebook on an AWS EMR cluster.
I already tried what is suggested in the link below:
".persist(StorageLevel.DISK_ONLY)" doesn't help.
I don't use monotonically_increasing_id.
spark inconsistency when running count command

Looking at the series of operations you are performing on the DataFrames, I think the issue is the join. A join triggers a shuffle, where every node talks to every other node and they exchange data according to which node holds a given join key. While data is being exchanged across executors, if an executor does not have the DataFrame available from the persisted copy, it will re-compute the DAG, and sampleBy is not guaranteed to return the same fraction of rows in the dataframe on each execution.
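If that re-computation is indeed the cause, one common workaround is to materialize the sampled DataFrames before joining, so a shuffle-triggered re-computation can never re-run sampleBy. A minimal sketch, assuming a SparkSession named spark and placeholder S3 paths:
# Write the sampled DataFrames out and read them back, so the join works
# on immutable files instead of a re-computable sampled lineage.
# The paths below are hypothetical placeholders.
df_user_ids.write.mode("overwrite").parquet("s3://my-bucket/tmp/df_user_ids")
df_conversions.write.mode("overwrite").parquet("s3://my-bucket/tmp/df_conversions")

df_user_ids_fixed = spark.read.parquet("s3://my-bucket/tmp/df_user_ids")
df_conversions_fixed = spark.read.parquet("s3://my-bucket/tmp/df_conversions")

df_user_ids_fixed.join(df_conversions_fixed, ["user_id"], "left").count()
Calling df.checkpoint() (with a checkpoint directory set via spark.sparkContext.setCheckpointDir) achieves a similar effect without managing the paths yourself.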

Related

Why does spark data frame show different results?

This statement outputs the partitionId and the number of records in that partition:
from pyspark.sql.functions import spark_partition_id, asc
data_frame.toDF().withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().orderBy(asc("count")).show()
+-----------+-----+
|partitionId|count|
+-----------+-----+
| 3| 22|
+-----------+-----+
This statement outputs number of partitions:
logger.warning('Num partitions: %s', data_frame.toDF().rdd.getNumPartitions())
WARNING:root:Num partitions 4
Shouldn't they both report the same number of partitions? The first result shows only one partition, while the second says there are 4 partitions.
Spark actually created 4 partitions but 3 are empty.
logger.warning("Partitions structure: {}".format(dynamic_frame.toDF().rdd.glom().collect()))
Partitions structure: [[Row(.....), Row(...)], [], [], []]
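The groupBy on spark_partition_id() can only report partitions that contain at least one row, which is why only one partition appears in the first output. A small sketch of counting rows per partition including the empty ones, assuming the same DynamicFrame converted with .toDF() as above:
# Count rows per partition at the RDD level; empty partitions report 0
# instead of being dropped from the aggregation.
per_partition = (dynamic_frame.toDF().rdd
                 .mapPartitionsWithIndex(lambda idx, rows: [(idx, sum(1 for _ in rows))])
                 .collect())
print(per_partition)  # one (partitionId, count) pair per partition, including empty ones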

How to join the two dataframe by condition in PySpark?

I have two dataframes as described below
Dataframe 1
P_ID P_Name P_Description P_Size
100 Moto Mobile 16
200 Apple Mobile 15
300 Oppo Mobile 18
Dataframe 2
P_ID List_Code P_Amount
100 ALPHA 20000
100 BETA 60000
300 GAMMA 15000
Requirement:
Need to join the two dataframes by P_ID.
Information about the dataframes:
In dataframe 1, P_ID is a primary key, and dataframe 2 doesn't have any primary attribute.
How to join the dataframes
New columns need to be created in dataframe 1 from the values of dataframe 2's List_Code, with "_price" appended to each name. If List_Code in dataframe 2 contains 20 unique values, we need to create 20 columns in dataframe 1. Then the newly created columns in dataframe 1 have to be filled from dataframe 2's P_Amount column, matched on P_ID, and filled with zero where there is no match. Once those columns exist in dataframe 1, the two dataframes can be joined on P_ID. My problem is creating the new columns with the expected values.
The expected dataframe is shown below
Expected dataframe
P_ID P_Name P_Description P_Size ALPHA_price BETA_price GAMMA_price
100 Moto Mobile 16 20000 60000 0
200 Apple Mobile 15 0 0 0
300 Oppo Mobile 18 0 0 15000
Can you please help me to solve the problem, thanks in advance.
For your application, you need to pivot the second dataframe and then left-join the first dataframe onto the pivoted result on P_ID.
See the code below.
import pandas as pd
from pyspark.sql import functions as f

df_1 = pd.DataFrame({'P_ID': [100, 200, 300], 'P_Name': ['Moto', 'Apple', 'Oppo'], 'P_Size': [16, 15, 18]})
sdf_1 = spark.createDataFrame(df_1)
df_2 = pd.DataFrame({'P_ID': [100, 100, 300], 'List_Code': ['ALPHA', 'BETA', 'GAMMA'], 'P_Amount': [20000, 60000, 10000]})
sdf_2 = spark.createDataFrame(df_2)

# Pivot List_Code into columns (summing P_Amount), then left-join onto the first dataframe
sdf_pivoted = sdf_2.groupby('P_ID').pivot('List_Code').agg(f.sum('P_Amount')).fillna(0)
sdf_joined = sdf_1.join(sdf_pivoted, on='P_ID', how='left').fillna(0)
sdf_joined.show()
+----+------+------+-----+-----+-----+
|P_ID|P_Name|P_Size|ALPHA| BETA|GAMMA|
+----+------+------+-----+-----+-----+
| 300| Oppo| 18| 0| 0|10000|
| 200| Apple| 15| 0| 0| 0|
| 100| Moto| 16|20000|60000| 0|
+----+------+------+-----+-----+-----+
You can change the column names or ordering of the dataframe as needed.
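If you also need the "_price" suffix from the question, the pivoted columns can be renamed before the join. A short sketch continuing from the sdf_pivoted and sdf_1 variables above:
# Append "_price" to every pivoted column except the join key
for c in sdf_pivoted.columns:
    if c != 'P_ID':
        sdf_pivoted = sdf_pivoted.withColumnRenamed(c, c + '_price')

sdf_joined = sdf_1.join(sdf_pivoted, on='P_ID', how='left').fillna(0)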

Python Spark: How to join 2 datasets containing >2 elements for each tuple

I'm trying to join data from these two datasets, based on the common "stock" key
stock, sector
GOOG Tech
stock, date, volume
GOOG 2015 5759725
The join method joins these together; however, the resulting RDD I got is of the form:
GOOG, (Tech, 2015)
I'm trying to obtain:
(Tech, 2015) 5759725
Additionally, how do I go about reducing the results by the keys (e.g. (Tech, 2015)) in order to obtain a numerical summation for each sector and year?
from pyspark.sql.functions import struct, col, sum

# sample data
df1 = sc.parallelize([['GOOG', 'Tech'],
                      ['AAPL', 'Tech'],
                      ['XOM', 'Oil']]).toDF(["stock", "sector"])
df2 = sc.parallelize([['GOOG', '2015', '5759725'],
                      ['AAPL', '2015', '123'],
                      ['XOM', '2015', '234'],
                      ['XOM', '2016', '789']]).toDF(["stock", "date", "volume"])

# final output
df = df1.join(df2, ['stock'], 'inner') \
    .withColumn('sector_year', struct(col('sector'), col('date'))) \
    .drop('stock', 'sector', 'date')
df.show()

# numerical summation for each sector and year
df.groupBy('sector_year').agg(sum('volume')).show()
Output is:
+-------+-----------+
| volume|sector_year|
+-------+-----------+
| 123|[Tech,2015]|
| 234| [Oil,2015]|
| 789| [Oil,2016]|
|5759725|[Tech,2015]|
+-------+-----------+
+-----------+-----------+
|sector_year|sum(volume)|
+-----------+-----------+
|[Tech,2015]| 5759848.0|
| [Oil,2015]| 234.0|
| [Oil,2016]| 789.0|
+-----------+-----------+
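If you would rather stay at the RDD level, as in the original question, the same result can be sketched with join, map and reduceByKey. The tuple layout below is an assumption based on the sample rows shown:
# (stock, sector) and (stock, (date, volume)) pair RDDs
sectors = sc.parallelize([('GOOG', 'Tech'), ('AAPL', 'Tech'), ('XOM', 'Oil')])
volumes = sc.parallelize([('GOOG', ('2015', 5759725)),
                          ('AAPL', ('2015', 123)),
                          ('XOM', ('2015', 234)),
                          ('XOM', ('2016', 789))])

# join -> (stock, (sector, (date, volume))); re-key by (sector, date) and sum volumes
totals = (sectors.join(volumes)
          .map(lambda kv: ((kv[1][0], kv[1][1][0]), kv[1][1][1]))
          .reduceByKey(lambda a, b: a + b))
totals.collect()  # e.g. [(('Tech', '2015'), 5759848), (('Oil', '2015'), 234), (('Oil', '2016'), 789)]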

Apache Spark - Finding Array/List/Set subsets

I have 2 dataframes, each having an Array[String] as one of its columns. For each entry in one dataframe, I need to find the subsets, if any, in the other dataframe. An example is here:
DF1:
----------------------------------------------------
id : Long | labels : Array[String]
---------------------------------------------------
10 | [label1, label2, label3]
11 | [label4, label5]
12 | [label6, label7]
DF2:
----------------------------------------------------
item : String | labels : Array[String]
---------------------------------------------------
item1 | [label1, label2, label3, label4, label5]
item2 | [label4, label5]
item3 | [label4, label5, label6, label7]
After the subset operation I described, the expected o/p should be
DF3:
----------------------------------------------------
item : String | id : Long
---------------------------------------------------
item1 | [10, 11]
item2 | [11]
item3 | [11, 12]
It is guaranteed that DF2 will always have corresponding subsets in DF1, so there won't be any leftover elements.
Can someone please help with the right approach here? It looks like for each element in DF2 I need to scan DF1 and do a subset operation (or set subtraction) on the 2nd column until I find all the subsets and exhaust the labels in that row, accumulating the list of "id" fields while doing so. How do I do this in a compact and efficient manner? Any help is greatly appreciated. Realistically, I may have hundreds of elements in DF1 and thousands of elements in DF2.
I'm not aware of a way to perform this kind of operation efficiently. However, here is one possible solution using a UDF together with a Cartesian (cross) join.
The UDF takes two sequences and checks if all strings in the first exists in the second:
import org.apache.spark.sql.functions.{udf, collect_list}

val matchLabel = udf((array1: Seq[String], array2: Seq[String]) => {
  array1.forall(x => array2.contains(x))
})
To use a Cartesian join, it needs to be enabled explicitly, since it is computationally expensive:
val spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.crossJoin.enabled", true)
The two dataframes are joined together utilizing the UDF. Afterwards the resulting dataframe is grouped by the item column to collect a list of all ids. Using the same DF1 and DF2 as in the question:
val DF3 = DF2.join(DF1, matchLabel(DF1("labels"), DF2("labels")))
  .groupBy("item")
  .agg(collect_list("id").as("id"))
The result is as follows:
+-----+--------+
| item| id|
+-----+--------+
|item3|[11, 12]|
|item2| [11]|
|item1|[10, 11]|
+-----+--------+
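For reference, here is a rough PySpark sketch of the same idea, assuming Spark 2.4+ (for array_except) and a SparkSession named spark; the array_except subset test stands in for the Scala UDF:
from pyspark.sql import functions as F

# Rebuild the question's data
DF1 = spark.createDataFrame(
    [(10, ["label1", "label2", "label3"]),
     (11, ["label4", "label5"]),
     (12, ["label6", "label7"])], ["id", "labels"])
DF2 = spark.createDataFrame(
    [("item1", ["label1", "label2", "label3", "label4", "label5"]),
     ("item2", ["label4", "label5"]),
     ("item3", ["label4", "label5", "label6", "label7"])], ["item", "labels"])

# DF1.labels is a subset of DF2.labels exactly when array_except(DF1.labels, DF2.labels) is empty
is_subset = F.size(F.array_except(DF1["labels"], DF2["labels"])) == 0

DF3 = (DF2.crossJoin(DF1)
          .where(is_subset)
          .groupBy("item")
          .agg(F.collect_list("id").alias("id")))
DF3.show()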

Inserting and Deleting data in a Spark Dataframe

I have a PySpark Dataframe input_dataframe as shown below:
cust_id source_id value
10 11 test_value
10 12 test_value2
I have another dataframe delta_dataframe, which has updated records from input_dataframe plus some new records, as shown below:
cust_id source_id value
10 11 update_value
10 15 new_value2
In both dataframes, the primary key is the combination of cust_id and source_id.
I have to generate a new dataframe output_dataframe, which will have the records from input_dataframe with the updated records from delta_dataframe applied, so my final dataframe is as below:
cust_id source_id value
10 11 update_value
10 12 test_value2
10 15 new_value2
Can someone please suggest how I can achieve this in PySpark? Any help will be appreciated.
Subtract the two dataframes based on the primary key, inner-join the result with input_dataframe, and then take the union of that with delta_dataframe. You will get the proper output; see the sketch below.
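A minimal sketch of that subtract-and-union approach, assuming the input_dataframe and delta_dataframe from the question:
keys = ["cust_id", "source_id"]

# Keys present in the input but absent from the delta, i.e. rows the delta does not touch
unchanged_keys = input_dataframe.select(*keys).subtract(delta_dataframe.select(*keys))

# Keep only those untouched rows from the input, then append every row from the delta
unchanged_rows = input_dataframe.join(unchanged_keys, on=keys, how="inner")
output_dataframe = unchanged_rows.unionByName(delta_dataframe)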
You need to join input_dataframe and delta_dataframe on the two key columns:
output_df = input_df.join(
    delta_df,
    (input_df['cust_id'] == delta_df['cust_id']) & (input_df['source_id'] == delta_df['source_id']),
    'left_outer')
And then select only the required fields from output_df
We can use an outer join and select the required value (with pyspark.sql.functions imported as F):
>>> from pyspark.sql import functions as F
>>> input_dataframe.join(delta_dataframe,['custid','sourceid'],'outer').select('custid','sourceid',F.coalesce(delta_dataframe['value'],input_dataframe['value']).alias('value')).show()
+------+--------+-------------+
|custid|sourceid| value|
+------+--------+-------------+
| 10| 15| new_value2|
| 10| 11|updated_value|
| 10| 12| test_value2|
+------+--------+-------------+
