Azure Synapse Notebook: Data Skew Issue - azure

I need to transform and union about 20 Parquet files (with different schemas) into one large DataFrame in an Azure Synapse notebook. I didn't run into any issues when querying each individual DataFrame. However, after they are unioned into one table, I get a Data Skew warning message when querying it. It doesn't stop the operation, so I can still call display(), count rows, or even write to ADLS, but I'm wondering whether I need to be concerned about this and whether there's any way to fix it.
Thank you!

Related

How to do an "overwrite" output mode using spark structured streaming without deleting all the data and the checkpoint

I have a Delta Lake in ADLS that we sink data into through Spark Structured Streaming. We usually append new data from our data source to the Delta Lake, but in some cases we find errors in the data and need to reprocess everything. What we do then is delete all the data and the checkpoints and re-run the pipeline, so that the correct data ends up in ADLS.
The problem with this approach is that the end users go a whole day without data to analyze (because we need to delete it to re-run). To fix that, I would like to know if there's a way to do an "overwrite" output with Structured Streaming, so that we can overwrite the data into a new Delta version while the end users can still query the current version.
I don't know if it's possible using streaming, but I would like to know if anyone has had a similar problem and how you solved it :)
Thanks!

Azure Databricks: Switching from batch to streaming mode

Hello spark community,
Currently we have a batch pipeline in Azure Databricks that reads from a Delta table. The job runs once per night. We extract the data, save it in our own location as a Delta table again, and then write it to an Azure SQL Database table. Everything is partitioned on date, like this: Date=2021-01-01, etc.
Things are about to change, since our source Delta table will soon be refreshed every 2-3 minutes. The requirement is to change the ETL from nightly batch to streaming mode while still using the same source and target tables as well as the same SQL database table.
This streaming challenge raises several questions:
1. Our Delta source table is quite large (30B+ rows). So far, in our nightly batch, we extracted only the changed date keys and used MERGE INTO to write the updates/inserts. Once we switch to streaming mode, is it possible to tell Spark to start streaming from a specific point in time, since we do not want to load the whole table again?
2. How do you write a stream to a target Delta table with MERGE INTO when a huge number of keys has to be checked on both sides? I suppose we can make use of foreachBatch and do that per micro-batch, but how can each new micro-batch check for existing/non-existing keys in our target Delta table in a time-efficient manner (30B+ rows in the current target table)?
3. Is it possible for a streaming Spark job to write directly to an Azure SQL database, or would this become a bottleneck and therefore not be advisable?
I am really looking forward to some good advice and design decisions, since this would be quite a big issue with the current data size. I appreciate every opinion here!
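The foreachBatch idea from the second question could be sketched roughly as below in PySpark. This is only a sketch under stated assumptions: the paths, the key column `id`, and the helper names `merge_condition`, `make_upsert` and `start_stream` are hypothetical, and the source is assumed to receive appends only. `startingTimestamp` is a Delta streaming-source option that skips the existing history, and the merge condition is restricted to the Date partitions present in each micro-batch so the target can be pruned instead of scanned in full.

```python
def merge_condition(dates, key_cols, part_col="Date"):
    """Build a MERGE condition limited to the partitions seen in one micro-batch."""
    quoted = ", ".join("'{}'".format(d) for d in sorted(str(d) for d in dates))
    parts = [
        "t.{0} IN ({1})".format(part_col, quoted),  # lets Delta prune target partitions
        "t.{0} = s.{0}".format(part_col),
    ]
    parts += ["t.{0} = s.{0}".format(k) for k in key_cols]
    return " AND ".join(parts)


def make_upsert(spark, target_path, key_cols):
    """Return a foreachBatch function that merges each micro-batch into the target."""
    from delta.tables import DeltaTable  # lazy import; needs the delta-spark package

    def upsert(batch_df, batch_id):
        dates = [r["Date"] for r in batch_df.select("Date").distinct().collect()]
        if not dates:  # empty micro-batch, nothing to merge
            return
        (DeltaTable.forPath(spark, target_path).alias("t")
            .merge(batch_df.alias("s"), merge_condition(dates, key_cols))
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())

    return upsert


def start_stream(spark, source_path, target_path, checkpoint_path, since):
    """Stream changes newer than `since` (e.g. "2021-01-01") into the target table."""
    return (spark.readStream.format("delta")
            .option("startingTimestamp", since)
            .load(source_path)
            .writeStream
            .foreachBatch(make_upsert(spark, target_path, ["id"]))  # "id" is an assumed key
            .option("checkpointLocation", checkpoint_path)
            .start())
```

Whether this is fast enough for 30B+ rows depends on how many distinct dates each micro-batch touches; the fewer partitions per batch, the more the IN clause helps.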

Issues with long lineages (DAG) in Spark

We usually use Spark as the processing engine for data stored on S3 or HDFS. We use the Databricks and EMR platforms.
One of the issues I frequently face is that when the task size grows, job performance degrades severely. For example, let's say I read data from five tables with different levels of transformation (filtering, exploding, joins, etc.), union subsets of the data from these transformations, then do further processing (e.g. remove some rows based on a criterion that requires window functions), then run some other processing stages, and finally save the output to a destination S3 path. If we run this job without staging intermediate results, it takes a very long time. However, if we save (stage) temporary intermediate DataFrames to S3 and use the saved data for the next steps of the query, the job finishes faster. Does anyone have similar experience? Is there a better way to handle this kind of long lineage other than checkpointing?
What is even stranger is that for longer lineages Spark throws an unexpected error such as "column not found", while the same code works if intermediate results are temporarily staged.
Writing the intermediate data by saving the DataFrame, or using a checkpoint, is the only way to fix it. You're probably running into an issue where the optimizer takes a really long time to generate the plan. The quickest and most efficient way to fix this is to use localCheckpoint, which materializes a checkpoint locally and truncates the lineage.
val checkpointed = df.localCheckpoint()
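For a pipeline with many stages, the same idea can be applied periodically instead of at a single point. A minimal PySpark sketch, assuming the stages can be written as a list of functions from DataFrame to DataFrame; the helper name `apply_with_checkpoints` and the `every` parameter are hypothetical:

```python
def apply_with_checkpoints(df, steps, every=3):
    """Apply transformations in order, cutting the lineage every `every` steps."""
    for i, step in enumerate(steps, start=1):
        df = step(df)
        if i % every == 0:
            # localCheckpoint() materializes the data on the executors and
            # returns a DataFrame whose plan no longer contains earlier steps.
            df = df.localCheckpoint()
    return df


# Hypothetical usage in a notebook, where `raw` is an existing DataFrame:
# result = apply_with_checkpoints(
#     raw,
#     [lambda d: d.filter("amount > 0"), lambda d: d.dropDuplicates(["id"])],
#     every=2,
# )
```

Local checkpoints are kept on the executors, so they are lost if an executor dies. If that matters, df.checkpoint() with a checkpoint directory set on the SparkContext is the reliable (but slower) alternative.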

Spark SQL Update/Delete

Currently, I am working on a project using PySpark that reads in a few Hive tables and stores them as DataFrames, and I have to perform a few updates/filters on them. I am avoiding Spark syntax at all costs in order to create a framework that only takes SQL from a parameter file and runs it through my PySpark framework.
Now the problem is that I have to perform UPDATE/DELETE queries on my final DataFrame. Are there any possible workarounds for performing these operations on my DataFrame?
Thank you so much!
A DataFrame is immutable: you cannot change it, so you are not able to update or delete in place.
If you want to "delete", there is the .filter option (it creates a new DataFrame that excludes records based on the condition you pass to filter).
If you want to "update", the closest equivalent is .map, where you can "modify" each record and the new value ends up in a new DataFrame. The catch is that the function iterates over all the records in the DataFrame.
Another thing to keep in mind: if you load data into a DataFrame from some source (e.g. a Hive table) and perform some operations, the updated data won't be reflected in your source data. DataFrames live in memory until you persist that data.
So you cannot work with a DataFrame like a SQL table for those operations. Depending on your requirements, you need to analyze whether Spark is a solution for your specific problem.
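Since the question asks for a SQL-only framework, the filter/map idea above can also be expressed as plain SELECT statements over a temp view. The sketch below is one way to do that, not the only one; the helper names (`delete_as_select`, `update_as_select`, `run`) and the view names in the tests are hypothetical. The COALESCE keeps rows whose predicate evaluates to NULL, which matches what a real DELETE would do.

```python
def delete_as_select(view, predicate):
    """SELECT returning the rows that DELETE FROM view WHERE predicate would keep."""
    return "SELECT * FROM {} WHERE NOT COALESCE(({}), FALSE)".format(view, predicate)


def update_as_select(view, columns, assignments, predicate):
    """SELECT equivalent of UPDATE view SET col = expr, ... WHERE predicate.

    columns     -- all column names of the view, in output order
    assignments -- dict mapping column name to the SQL expression for its new value
    """
    exprs = []
    for col in columns:
        if col in assignments:
            exprs.append("CASE WHEN ({}) THEN {} ELSE {} END AS {}".format(
                predicate, assignments[col], col, col))
        else:
            exprs.append(col)
    return "SELECT {} FROM {}".format(", ".join(exprs), view)


def run(spark, sql, result_view):
    """Register the result under a new view name; the source table is untouched."""
    spark.sql(sql).createOrReplaceTempView(result_view)
```

The result still only lives in the session. To make it permanent it has to be written back, for example to a Hive table.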

How to write data into a Hive table?

I use Spark 2.0.2.
While learning the concept of writing a dataset to a Hive table, I understood that we do it in two ways:
1. using sparkSession.sql("your sql query")
2. dataframe.write.mode(SaveMode."type of mode").insertInto("tableName")
Could anyone tell me what is the preferred way of loading a Hive table using Spark?
In general I prefer option 2. First, because for multiple rows you cannot reasonably build such a long SQL statement, and second, because it reduces the chance of errors or other issues such as SQL injection attacks.
For the same reason, I use PreparedStatements as much as possible with JDBC.
Think of it this way: we need to apply updates to Hive on a daily basis.
This can be achieved in two ways:
1. Process all the data in the Hive table.
2. Process only the affected partitions.
For the first option SQL works like a gem, but keep in mind that the table has to be small enough to reprocess in full.
The second option works well if you want to process only the affected partitions. Use an overwrite write with partitionBy on the path (roughly data.write.mode("overwrite").partitionBy(...)).
You should write the logic in such a way that it processes only the affected partitions. This approach applies to tables with millions to billions of records.
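The "affected partitions only" option can also be written with plain Hive-style SQL, which works on older Spark versions such as the 2.0.2 mentioned in the question. A minimal sketch, assuming the updated rows are registered as a staging view, that partition values are strings, and that `columns` lists the non-partition columns in table order; the helper names and the table/view names in the tests are hypothetical:

```python
def overwrite_partition_sql(table, staging_view, part_col, part_value, columns):
    """INSERT OVERWRITE statement that replaces exactly one partition of `table`."""
    return (
        "INSERT OVERWRITE TABLE {t} PARTITION ({p}='{v}') "
        "SELECT {cols} FROM {s} WHERE {p} = '{v}'"
    ).format(t=table, p=part_col, v=part_value, s=staging_view,
             cols=", ".join(columns))


def overwrite_affected_partitions(spark, table, staging_view, part_col, columns):
    """Rewrite only the partitions that actually appear in the staging view."""
    rows = spark.table(staging_view).select(part_col).distinct().collect()
    for row in rows:
        spark.sql(overwrite_partition_sql(table, staging_view, part_col, row[0], columns))
```

Partitions that do not appear in the staging view are left untouched, which is what keeps this workable for very large tables.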
