Apache Spark Streaming Left Outer Join Output Records - apache-spark

We are working on Spark Streaming application where it sources data from kafka.We have producer which publishes data to kafka from a file which has 1.5 million records. Out of 1.5 million records, there are 1 million records of A type and 0.5 million records of B type. Here in this file we have 0.5 million records have same column value which will be used to join in Spark streaming.
So in spark streaming, we are doing left outer join of these two streams(A left outer join B). So output expected should have 1 million records( 0.5 million records which are inner join and 0.5 million records for null join from left stream).
However we have received different output count for different watermark:
1) If we provide watermark as 15 minutes, we do receive 0.5 million records(inner join) and 0.5 million records(unmatched records with null).
2) If we provide watermark as 2 minutes, we receive more than 1 million records (around 1.1 million records) where we found that output file has duplicate records (records which are part of inner join also appear as part of null join)
Please help me to understand why left outer join emitting more records when provided less time in watermark.
When i say more number records, i meant it emits records which are:
-> matched(0.5 million) + records which are unmatched( 0.5 million with null values from left stream) + 0.1 million records with null join from left stream.
The 0.1 million records are the extra records which are suprisingly already part of inner join (0.5 million) and still present with null join from left stream.

Related

Custom partitioning on JDBC in PySpark

I have a huge table in an oracle database that I want to work on in pyspark. But I want to partition it using a custom query, for example imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in pyspark?
P.S.: The reason that I need to control the partitioning, is that I need to perform some aggregations on each partition (partitions have meaning, not just to distribute the data) and so I want them to be on the same machine to avoid any shuffles. Is this possible? or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turned out that the spark has a way of controlling the partitioning logic exactly. And that is the predicates option in spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition it based on userId and productId so that all the records of an entity is kept together on the same machine, and we can perform aggregations on these entities without shuffling)
First, produce the histogram of every column that you want to partition by (count of each value):
userId
count
123456
1640
789012
932
345678
1849
901234
11
...
...
productId
count
123456789
5435
523485447
254
363478326
2343
326484642
905
...
...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId
bin
123456
1
789012
1
345678
1
901234
2
...
...
productId
bin
123456789
1
523485447
2
363478326
2
326484642
3
...
...
Then, store these in the database
Then update your query and join on these tables to get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password#address:port:dbname'
query = ```
(SELECT
MY_TABLE.*,
USER_PARTITION.BIN as USER_BIN,
PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
ON my_table.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
ON my_table.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY```
df = spark.read\
.option('driver', 'oracle.jdbc.driver.OracleDriver')\
jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates. One for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in the workers' memory, then you need to do a sub-partitioning. Meaning that you should first add a new random numeric column to your data (between 0 and some chunk_size like 10000 or something), then do the partitioning based on the combination of that number and the original IDs (like userId). This causes each entity to be split into fixed-sized chunks (i.e., 10000) to ensure it fits in the workers' memory.
And after the aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again.
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way you can achieve the desired results.

Spark condition on partition column from another table (Performance)

I have a huge parquet table partitioned on registration_ts column - named stored.
I'd like to filter this table based on data obtained from small table - stream
In sql world the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
In Dataframe world:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but the performance is suffering, because the partition pruning is not applied. Spark full-scans stored table, which is too expensive.
For example this runs 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the 2-nd example the partition filter is propagated to joined stream table.
I'd like to achieve partition filtering on dynamic set of partitions.
The only solution I was able to come up with:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
Which collects the partitions to driver and makes a 2-nd query. This works fine only for small number of partitions. When I've tried this solution with 500k distinct partitions, the delay was significant.
But there must be a better way ...
Here's one way that you can do it in PySpark and I've verified in Zeppelin that it is using the set of values to prune the partitions
# the collect_set function returns a distinct list of values and collect function returns a list of rows. Getting the [0] element in the list of rows gets you the first row and the [0] element in the row gets you the value from the first column which is the list of distinct values
from pyspark.sql.functions import collect_set
filter_list = spark.read.orc(HDFS_PATH)
.agg(collect_set(COLUMN_WITH_FILTER_VALUES))
.collect()[0][0]
# you can use the filter_list with the isin function to prune the partitions
df = spark.read.orc(HDFS_PATH)
.filter(col(PARTITION_COLUMN)
.isin(filter_list))
.show(5)
# you may want to do some checks on your filter_list value to ensure that your first spark.read actually returned you a valid list of values before trying to do the next spark.read and prune your partitions

Spark sql query causing huge data shuffle read / write

I am using spark sql for processing the data. Here is the query
select
/*+ BROADCAST (C) */ A.party_id,
IF(B.master_id is NOT NULL, B.master_id, 'MISSING_LINK') as master_id,
B.is_matched,
D.partner_name,
A.partner_id,
A.event_time_utc,
A.funnel_stage_type,
A.product_id_set,
A.ip_address,
A.session_id,
A.tdm_retailer_id,
C.product_name ,
CASE WHEN C.product_category_lvl_01 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_01 END as product_category_lvl_01,
CASE WHEN C.product_category_lvl_02 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_02 END as product_category_lvl_02,
CASE WHEN C.product_category_lvl_03 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_03 END as product_category_lvl_03,
CASE WHEN C.product_category_lvl_04 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_04 END as product_category_lvl_04,
C.brand_name
from
browser_data A
INNER JOIN (select partner_name, partner_alias_tdm_id as npa_retailer_id from npa_retailer) D
ON (A.tdm_retailer_id = D.npa_retailer_id)
LEFT JOIN
(identity as B1 INNER JOIN (select random_val from random_distribution) B2) as B
ON (A.party_id = B.party_id and A.random_val = B.random_val)
LEFT JOIN product_taxonomy as C
ON (A.product_id = C.product_id and D.npa_retailer_id = C.retailer_id)
Where,
browser_data A - Its around 110 GB data with 519 million records,
D - Small dataset which maps to only one value in A. As this is small spark sql automatically broadcast it (confirmed in the execution plan in explain)
B - 5 GB with 45 million records contains only 3 columns. This dataset is replicated 30 times (done with cartesian product with dataset which contains 0 to 29 values) so that skewed key (lot of data against one in dataset A) issue is solved.This will result in 150 GB of data.
C - 900 MB with 9 million records. This is joined with A with broadcast join (so no shuffle)
Above query works well. But when I see spark UI I can observe above query triggers shuffle read of 6.8 TB. As dataset D and C are joined as broadcast it wont cause any shuffle. So only join of A and B should cause the shuffle. Even if we consider all data shuffled read then it should be limited to 110 GB (A) + 150 GB (B) = 260 GB. Why it is triggering 6.8 TB of shuffle read and 40 GB of shuffle write.
Any help appreciated. Thank you in advance
Thank you
Manish
The first thing I would do is use DataFrame.explain on it. That will show you the execution plan so you can see exactly what is actually happen. I would check the output to confirm that the Broadcast Join is really happening. Spark has a setting to control how big your data can be before it gives up on doing a broadcast join.
I would also note that your INNER JOIN against the random_distribution looks suspect. I may have recreated your schema wrong, but when I did explain I got this:
scala> spark.sql(sql).explain
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
LocalRelation [party_id#99]
and
LocalRelation [random_val#117]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
Finally, is your input data compressed? You may be seeing the size differences because of a combination of no your data no longer being compressed, and because of the way it is being serialized.

How to optimize a join?

I have a query to join the tables. How do I optimize to run it faster?
val q = """
| select a.value as viewedid,b.other as otherids
| from bm.distinct_viewed_2610 a, bm.tets_2610 b
| where FIND_IN_SET(a.value, b.other) != 0 and a.value in (
| select value from bm.distinct_viewed_2610)
|""".stripMargin
val rows = hiveCtx.sql(q).repartition(100)
Table descriptions:
hive> desc distinct_viewed_2610;
OK
value string
hive> desc tets_2610;
OK
id int
other string
the data looks like this:
hive> select * from distinct_viewed_2610 limit 5;
OK
1033346511
1033419148
1033641547
1033663265
1033830989
and
hive> select * from tets_2610 limit 2;
OK
1033759023
103973207,1013425393,1013812066,1014099507,1014295173,1014432476,1014620707,1014710175,1014776981,1014817307,1023740250,1031023907,1031188043,1031445197
distinct_viewed_2610 table has 1.1 million records and i am trying to get similar id's for that from table tets_2610 which has 200 000 rows by splitting second column.
for 100 000 records it is taking 8.5 hrs to complete the job with two machines
one with 16 gb ram and 16 cores
second with 8 gb ram and 8 cores.
Is there a way to optimize the query?
Now you are doing cartesian join. Cartesian join gives you 1.1M*200K = 220 billion rows. After Cartesian join it filtered by where FIND_IN_SET(a.value, b.other) != 0
Analyze your data.
If 'other' string contains 10 elements in average then exploding it will give you 2.2M rows in table b. And if suppose only 1/10 of rows will join then you will have 2.2M/10=220K rows because of INNER JOIN.
If these assumptions are correct then exploding array and join will perform better than Cartesian join+filter.
select distinct a.value as viewedid, b.otherids
from bm.distinct_viewed_2610 a
inner join (select e.otherid, b.other as otherids
from bm.tets_2610 b
lateral view explode (split(b.other ,',')) e as otherid
)b on a.value=b.otherid
And you do not need this :
and a.value in (select value from bm.distinct_viewed_2610)
Sorry I cannot test query, do it yourself please.
If you are using orc formate change to parquet as per your data i woud say choose range partition.
Choose proper parallization to execute fast.
I have answred on follwing link may be help you.
Spark doing exchange of partitions already correctly distributed
Also please read it
http://dev.sortable.com/spark-repartition/

Can I use loops in Spark Data frame

My data is as shown below
Store ID Amount,...
1 1 10
1 2 20
2 1 10
3 4 50
I have to create separate directory for each store
Store 1/accounts
ID Amount
1 10
2 20
store 2/accounts directory:
ID Amount
1 10
For this purpose Can I use loops in Spark dataframe. It is working in local machine. Will it be a problem in cluster
while storecount<=50:
query ="SELECT * FROM Sales where Store={}".format(storecount)
DF =spark.sql(query)
DF.write.format("csv").save(path)
count = count +1
If I correctly understood the problem , you really want to do is partitioning in the dataframe.
I would suggest to do this
df.write.partitionBy("Store").mode(SaveMode.Append).csv("..")
This will write the dataframe into several partitions like
store=2/
store=1/
....
Yes you can run a loop here as it is not an Nested operation on data frame.
Nested operation on RDD or Data frame is not allowed as Spark Context is not Serializable.

Resources