Databricks - Spark is not counting table number of rows after union - apache-spark

I need to open a lot of parquet files and get one column from each of them. I wrote a loop to get this field, applying the following code to each path:
if 'attribute_name' in spark.read.parquet(p[0]).columns:
    i = i + 1
    table = spark.read.parquet(p[0]) \
        .select('attribute_name').distinct() \
        .withColumn('Path', F.lit(element[0]))
    k = k + table.count()
    if i == 1:
        table1 = table
    else:
        table1 = table1.union(table)
I created i and k only for counting purposes: i is the number of files and k is the total number of rows. element is an element of a list of paths (the list of all the files).
According to my counts I read 457 parquet files and got 21,173 distinct rows in total (across all the parquet files).
When I try to count the rows directly on table1 (with table1.count()) I get the following error:
Job aborted due to stage failure: Total size of serialized results of 4932 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
To be honest I don't understand what is happening and I have no idea how to solve it. I tried repartitioning table1 after the union and it didn't work. Do you have any ideas on how to solve this?
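A possible mitigation (a minimal sketch, not a verified fix): the count over table1 launches thousands of tasks, and the serialized per-task results together exceed spark.driver.maxResultSize. Since each per-file result is tiny after distinct(), coalescing it to a single partition before the union keeps the final job at roughly one task per file, and reading each path only once also halves the work. The paths list and the use of functools.reduce are assumptions for illustration:

from functools import reduce
import pyspark.sql.functions as F

parts = []
for path in paths:                      # paths: the list of parquet file paths
    df = spark.read.parquet(path)       # read each file only once
    if 'attribute_name' in df.columns:
        parts.append(df.select('attribute_name').distinct()
                       .withColumn('Path', F.lit(path))
                       .coalesce(1))    # one small partition per file

i = len(parts)                          # number of files containing the column
table1 = reduce(lambda a, b: a.union(b), parts)
k = table1.count()                      # total number of distinct rows

Raising spark.driver.maxResultSize is the other common workaround, but keeping the task count low avoids the problem at its source.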

Related

Try to avoid shuffle by manual control of table read per executor

I have:
a really huge (let's say 100s of TB) Iceberg table B which is partitioned by main_col, truncate[N, stamp]
a small table S with columns main_col, stamp_as_key
I want to get a dataframe (actually a table) with this logic:
b = spark.read.table(B)
s = spark.read.table(S)
# chained comparisons don't work on Column objects, so the time window is split
df = b.join(F.broadcast(s),
            (b.main_col == s.main_col)
            & (s.stamp_as_key - W0 <= b.stamp) & (b.stamp <= s.stamp_as_key + W0))
df = df.groupBy('main_col', 'stamp_as_key').agg(make_some_transformations)
I want to avoid a shuffle when reading table B. Iceberg has metadata tables describing all of the parquet files in a table and their contents. What might be possible to do:
read only the metadata table of B
join it with table S
repartition by the expected columns
collect the s3 paths of the real B data files
read those files from the executors independently
Is there a better way to make this work? I can also change the schema of table B if needed, but main_col should stay as the first partitioner.
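For what it's worth, a rough sketch of the metadata-table idea, assuming an Iceberg catalog is configured in Spark; the catalog/table names are placeholders and the time-window filtering is left out:

import pyspark.sql.functions as F

# Iceberg's "files" metadata table lists every data file together with its partition
files_meta = spark.read.table("catalog.db.B.files")
s = spark.read.table("catalog.db.S")

# keep only the data files whose main_col partition value appears in S
wanted = (files_meta
          .join(F.broadcast(s.select("main_col").distinct()),
                F.col("partition.main_col") == F.col("main_col"))
          .select("file_path")
          .distinct())

paths = [r.file_path for r in wanted.collect()]

# read just those parquet files; the reads are distributed across executors as usual
df = spark.read.parquet(*paths)

Keep in mind that reading the parquet files directly bypasses Iceberg's delete files and schema evolution handling, so this is only safe for an append-only table.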
One more question: suppose I have such a dataframe and have saved it as a table, and I need to join such tables efficiently. Am I correct that this is also impossible to do without a shuffle in classic Spark code?
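On the last point, one pattern that can avoid the shuffle at join time is bucketing (a sketch under the assumption that both sides can be rewritten): save both tables bucketed and sorted by the join key with the same number of buckets, and Spark can sort-merge join them without an Exchange. The dataframes, table names and bucket count below are placeholders:

# df_a and df_b are the two dataframes to be saved as joinable tables
(df_a.write
     .bucketBy(64, "main_col")
     .sortBy("main_col")
     .saveAsTable("db.table_a_bucketed"))

(df_b.write
     .bucketBy(64, "main_col")
     .sortBy("main_col")
     .saveAsTable("db.table_b_bucketed"))

a = spark.table("db.table_a_bucketed")
b = spark.table("db.table_b_bucketed")
joined = a.join(b, "main_col")
joined.explain()   # no Exchange should appear on either side of the join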

Inconsistent duplicated row on Spark

I'm seeing weird behavior with Apache Spark, which I run in a Python notebook on Azure Databricks. I have a dataframe with 2 columns of interest: name and ftime.
I found that I sometimes have duplicated values and sometimes not, depending on how I fetch the data:
df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').count()
# Result is 1
But when I run
len(df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').collect())
# Result is 2
I get a result of 2 rows, which are exactly identical. The two cells are run one after the other; the order doesn't change anything.
I tried creating a temp view in spark with
df.createOrReplaceTempView('df_referential')
but I run into the same problem:
%sql
SELECT name, ftime, COUNT(*)
FROM df_referential
GROUP BY name, ftime
HAVING COUNT(*) > 1
returns no result, while
%sql
SELECT *
FROM df_referential
WHERE name = 'test' AND ftime = '2022-07-18'
returns two rows, perfectly identical.
I'm having a hard time understanding why this happens. I expect these to return only one row, and the JSON file that the data is read from contains only one occurrence of the record.
If someone could point me at what I'm doing wrong, it would be of great help.
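One way to narrow this down (just a diagnostic sketch; df is assumed to be the dataframe read from the JSON file): materialize the filtered rows once with cache() and run both actions against the cached data. If the two numbers then agree, every un-cached action was recomputing the dataframe from the source, and that recomputation is what is non-deterministic:

from pyspark.sql.functions import col

dup = df.where((col('name') == 'test') & (col('ftime') == '2022-07-18')).cache()
dup.count()                # first action populates the cache

print(dup.count())         # count over the cached rows
print(len(dup.collect()))  # collect over the same cached rows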

Loss of data while storing Spark data frame in parquet format

I have a csv data file which I can load into pyspark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("MYAPP").getOrCreate()
df = spark.read.csv(path=csvfilepath, sep="|", schema=my_schema, nullValue="NULL", mode="DROPMALFORMED")
Checking the number of rows of the data frame gives approximately 20 million rows:
df.count()
I then store my data frame as parquet:
df.write.mode("overwrite").parquet( parquetfilepath )
I then load the parquet data:
df = spark.read.parquet( parquetfilepath )
Now when I count the rows (df.count()) I only get 3 million rows.
Why did I lose 85% of the rows, and how can I solve this problem? I have also tried "repartition" and "coalesce" when creating the parquet data, with the same result.
I am answering my own question as I have now understood what the problem is. It is actually a very simple beginner's mistake: the reason I lose data is simply that I asked for it, by reading the df with the option mode="DROPMALFORMED". When I count the rows of the data frame I find 20 million rows, but some of them are inconsistent with the schema and are dropped when the data is actually written to disk (i.e. the dropping of malformed rows, which I requested when reading the csv, was deferred to that point). What I didn't realize was that there were errors in my data.
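A small sketch of how the dropped rows could have been surfaced instead of silently lost (my_schema and csvfilepath are from the question; the extra _corrupt_record field is an assumption required by PERMISSIVE mode):

from pyspark.sql.functions import col
from pyspark.sql.types import StructField, StringType, StructType

# copy the original schema and add a string field that receives the raw text of
# any row that does not fit the schema
schema_with_corrupt = StructType(my_schema.fields + [StructField("_corrupt_record", StringType(), True)])

df = spark.read.csv(
    path=csvfilepath, sep="|", nullValue="NULL",
    schema=schema_with_corrupt,
    mode="PERMISSIVE",                  # keep malformed rows instead of dropping them
    columnNameOfCorruptRecord="_corrupt_record")

# rows that failed to parse end up with a non-null _corrupt_record
df.filter(col("_corrupt_record").isNotNull()).show(truncate=False)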

Spark condition on partition column from another table (Performance)

I have a huge parquet table partitioned on the registration_ts column, named stored.
I'd like to filter this table based on data obtained from a small table, named stream.
In the SQL world the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
In the DataFrame world:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but the performance suffers because partition pruning is not applied: Spark full-scans the stored table, which is too expensive.
For example, this takes 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the second example the partition filter is propagated to the joined stream table.
I'd like to achieve partition filtering on a dynamic set of partitions.
The only solution I was able to come up with:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
This collects the partitions to the driver and issues a second query. It works fine only for a small number of partitions; when I tried this solution with 500k distinct partitions, the delay was significant.
But there must be a better way ...
Here's one way you can do it in PySpark; I've verified in Zeppelin that it uses the set of values to prune the partitions:
from pyspark.sql.functions import col, collect_set

# collect_set returns the distinct values of the column; collect() returns a list of
# rows, so [0][0] takes the first column of the first row, i.e. the list of values
filter_list = (spark.read.orc(HDFS_PATH)
               .agg(collect_set(COLUMN_WITH_FILTER_VALUES))
               .collect()[0][0])

# use filter_list with isin() to prune the partitions
df = (spark.read.orc(HDFS_PATH)
      .filter(col(PARTITION_COLUMN).isin(filter_list)))
df.show(5)

# you may want to do some checks on filter_list to ensure that the first spark.read
# actually returned a valid list of values before trying the second read

Spark sql query causing huge data shuffle read / write

I am using Spark SQL to process the data. Here is the query:
select
/*+ BROADCAST (C) */ A.party_id,
IF(B.master_id is NOT NULL, B.master_id, 'MISSING_LINK') as master_id,
B.is_matched,
D.partner_name,
A.partner_id,
A.event_time_utc,
A.funnel_stage_type,
A.product_id_set,
A.ip_address,
A.session_id,
A.tdm_retailer_id,
C.product_name ,
CASE WHEN C.product_category_lvl_01 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_01 END as product_category_lvl_01,
CASE WHEN C.product_category_lvl_02 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_02 END as product_category_lvl_02,
CASE WHEN C.product_category_lvl_03 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_03 END as product_category_lvl_03,
CASE WHEN C.product_category_lvl_04 is NULL THEN 'OUTOFSALE' ELSE product_category_lvl_04 END as product_category_lvl_04,
C.brand_name
from
browser_data A
INNER JOIN (select partner_name, partner_alias_tdm_id as npa_retailer_id from npa_retailer) D
ON (A.tdm_retailer_id = D.npa_retailer_id)
LEFT JOIN
(identity as B1 INNER JOIN (select random_val from random_distribution) B2) as B
ON (A.party_id = B.party_id and A.random_val = B.random_val)
LEFT JOIN product_taxonomy as C
ON (A.product_id = C.product_id and D.npa_retailer_id = C.retailer_id)
Where:
browser_data A - around 110 GB of data with 519 million records.
D - a small dataset which maps to only one value in A. As it is small, Spark SQL automatically broadcasts it (confirmed in the execution plan from explain).
B - 5 GB with 45 million records, containing only 3 columns. This dataset is replicated 30 times (via a cartesian product with a dataset containing the values 0 to 29) so that the skewed-key issue (a lot of data against one key in dataset A) is solved. This results in 150 GB of data.
C - 900 MB with 9 million records. This is joined with A with a broadcast join (so no shuffle).
The above query works well, but in the Spark UI I can see that it triggers a shuffle read of 6.8 TB. As datasets D and C are joined as broadcasts, they should not cause any shuffle, so only the join of A and B should cause one. Even if all of that data were shuffled, the read should be limited to 110 GB (A) + 150 GB (B) = 260 GB. Why is it triggering 6.8 TB of shuffle read and 40 GB of shuffle write?
Any help appreciated. Thank you in advance,
Manish
The first thing I would do is use DataFrame.explain on it. That will show you the execution plan so you can see exactly what is actually happening. I would check the output to confirm that the broadcast join is really happening. Spark has a setting that controls how big your data can be before it gives up on doing a broadcast join.
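For reference, a minimal PySpark-flavored sketch of both checks (sql is assumed to hold the query string above; the 100 MB value is only an example):

# print the physical plan and confirm that C and D appear as BroadcastHashJoin
spark.sql(sql).explain()

# spark.sql.autoBroadcastJoinThreshold is the size (in bytes) up to which Spark will
# automatically broadcast a relation; -1 disables automatic broadcasting
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)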
I would also note that your INNER JOIN against random_distribution looks suspect. I may have recreated your schema wrong, but when I ran explain I got this:
scala> spark.sql(sql).explain
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
LocalRelation [party_id#99]
and
LocalRelation [random_val#117]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
Finally, is your input data compressed? You may be seeing the size differences because of a combination of your data no longer being compressed and the way it is being serialized.
