Pyspark code improvement idea (to lower the duration) - apache-spark

I have several CSV files stored in HDFS; each file is about 500 MB (roughly 200k lines and 20 columns). The data is mostly of String type (90%).
I perform basic operations on those files, such as filtering, mapping, and regexp_extract, to generate two "clean" files.
I do not ask Spark to infer the schema.
The filter steps:
# Filter with document type
self.data = self.data.filter('type=="1" or type=="3"')
# Remove message document without ID
self.data = self.data.filter('id_document!="1" and id_document!="7" and id_document is not NULL')
# Remove message when the Service was unavailable.
self.data = self.data.filter('message!="Service Temporarily Unavailable"')
# Choose messages with the following actions
expression = '(.*receiv.*|.*reject.*|.*[^r]envoi$|creation)'
self.data = self.data.filter(self.data['action'].rlike(expression))
# Choose messages with a valid date format
self.data = self.data.filter(self.data['date'].rlike('(\d{4}-\d{2}-\d{2}.*)'))
# Parse message
self.data = self.data.withColumn('message_logiciel', regexp_extract(col('message'), r'.*app:([A-Za-z\s-\déèàç
Then I do a mapping based on a dictionary, and I finish with steps like this:
self.log = self.log.withColumn('action_result', col('message')) \
    .withColumn('action_result', regexp_replace('action_result',
                                                '.*(^[Ee]rror.{0,15}422|^[Ee]rror.{0,15}401).*',
                                                ': error c')) \
    .withColumn('action_result', regexp_replace('action_result',
                                                '.*(^[Ee]rror.*failed to connect.*|^[Ee]rror.{0,15}502).*',
                                                ': error ops')) \
    .withColumn('action_result', regexp_replace('action_result', '.*(.*connector.*off*).*',
                                                ': connector off'))
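A possibly cheaper equivalent of that chain (a minimal sketch, not my original logic; the patterns are copied from the snippet above) is a single when/otherwise expression that classifies each message in one pass instead of rewriting the same column several times:
from pyspark.sql.functions import col, when

# Sketch: classify the message in one pass instead of chained regexp_replace calls.
# Patterns mirror the snippet above; adjust them to the real data.
self.log = self.log.withColumn(
    'action_result',
    when(col('message').rlike('^[Ee]rror.{0,15}(422|401)'), ': error c')
    .when(col('message').rlike('^[Ee]rror(.*failed to connect|.{0,15}502)'), ': error ops')
    .when(col('message').rlike('connector.*off'), ': connector off')
    .otherwise(col('message')))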
Finally, I write the cleaned data to Hive.
My issue is the duration of the process: it takes 15 minutes per 500 MB file, which does not seem right to me. I am guessing that:
either my code does not follow Spark logic (too many regexp_extract calls?),
or the slow part is the write to Hive.
I am wondering why it takes so long, and whether you have any clue how to lower this duration.
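One way to tell which of the two guesses above is right is to force the transformations with an action before the Hive write and time the two stages separately. A rough diagnostic sketch (self.log stands for the cleaned DataFrame, and 'clean_log' is a placeholder table name, not the real one):
import time

start = time.time()
self.log.cache()                      # keep the cleaned data around for the write
row_count = self.log.count()          # forces all the filters / regexes to execute
print('transformations: %.0f s for %d rows' % (time.time() - start, row_count))

start = time.time()
self.log.write.mode('append').saveAsTable('clean_log')   # 'clean_log' is a placeholder table name
print('write to Hive: %.0f s' % (time.time() - start))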
My environment is the following:
I code in PySpark in a Jupyter notebook connected to the cluster with Livy; the cluster is a simple 3-node cluster (4x3 cores, 3x16 GB RAM).
I tried to increase the allocated cores and memory, but the duration stays the same.
I have done the same exercise with pandas in plain Python, and it works just fine.
Thank you for your help.

Related

Auto Loader with Merge Into for multiple tables

I am trying to implement Auto Loader using MERGE INTO on multiple tables with the code below, as stated in the documentation:
def upsert_data(df, epoch_id):
    deltaTable = DeltaTable.forPath(spark, target_location)
    deltaTable.alias("t").merge(df.alias("s"),
                                "t.xx = s.xx and t.xx1 = s.xx1") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()

for i in range(len(list_of_files()[0])):
    schema = list_of_files()[2][i]
    raw_data = list_of_files()[1][i]
    checkpoint = list_of_files()[3][i]
    target_location = list_of_files()[4][i]

    dfSource = list_of_files(raw_data)
    dfMergedSchema = dfSource.where("1=0")
    dfMergedSchema.createOrReplaceGlobalTempView("test1")
    dfMergedSchema.write.option("mergeSchema", "true").mode("append").format("delta") \
        .save(target_location)

    stream = spark.readStream \
        .format("cloudFiles") \
        .option("cloudFiles.format", "parquet") \
        .option("header", "true") \
        .schema(schema) \
        .load(raw_data)

    stream.writeStream.format("delta") \
        .outputMode("append") \
        .foreachBatch(upsert_data) \
        .option("dataChange", "false") \
        .trigger(once=True) \
        .option("checkpointLocation", checkpoint) \
        .start()
My scenario:
We have a Landing Zone where Parquet files are appended into multiple folders for example as shown below:
Landing Zone ---|
                |-----folder 0 ---|----parquet1
                |                 |----parquet2
                |                 |----parquet3
                |
                |-----folder 1 ---|----parquet1
                                  |----parquet2
                                  |----parquet3
Then I am needing Auto Loader to create the tables as shown below with the checkpoints:
Staging Zone ---|
                |-----folder 0 ---|----checkpoint
                |                 |----table
                |
                |-----folder 1 ---|----checkpoint
                                  |----table
I am noticing that without the foreachBatch option in the writeStream, but with Trigger Once, the code works as expected and inserts into multiple tables as shown above. The code also works when we have both foreachBatch and the trigger option on individual tables without the for loop. However, when I enable both options (foreachBatch and Trigger Once) for multiple tables in the for loop, Auto Loader merges all the table contents into one table. For folder 0 in the Staging Zone you get a checkpoint but no table contents, and for folder 1 you get a checkpoint plus the delta files that make up the table contents of both folder 0 and folder 1 in its table folder. It is merging both tables into one.
I also get the ConcurrentAppendException.
I read about the ConcurrentAppendException in the documentation, and what I found is that you should either use partitioning or have a disjoint condition in the upsert_data function passed into the foreachBatch option of writeStream. I tried both and neither works.
How can one isolate the streams for the different folders in this scenario for the Staging Zone, while using foreachBatch and Trigger Once in this for loop? There is something I am definitely missing with the foreachBatch option here, because without it Auto Loader is able to isolate the streams to folder 0 and folder 1, but with it, it's not.
I spoke with a Databricks Solution Architect today, and he mentioned that I needed to use a ThreadPoolExecutor, which is something outside Auto Loader or Databricks itself, but native to Python. It goes in a helper function that specifies the number of streams handling the tables in parallel with Auto Loader. That way one can use a single Auto Loader notebook instance for multiple tables, which meets my use case. Thanks!
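A minimal sketch of that idea (everything here is illustrative: run_stream is a hypothetical helper wrapping the per-table readStream/writeStream code from the loop above, and the worker count is arbitrary). Passing the per-table paths into the helper, rather than relying on the shared target_location global used by upsert_data above, is what keeps the streams isolated:
from concurrent.futures import ThreadPoolExecutor

def run_stream(i):
    # Hypothetical helper: runs the readStream/writeStream block from the loop
    # above for table index i, using its own schema, checkpoint and target
    # location so each table gets its own isolated stream and merge target.
    ...

n_tables = len(list_of_files()[0])
with ThreadPoolExecutor(max_workers=4) as pool:   # 4 parallel streams is an arbitrary choice
    futures = [pool.submit(run_stream, i) for i in range(n_tables)]
    for f in futures:
        f.result()                                # surfaces any exception raised inside a stream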

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems upserting some tables upstream. I know the text below is quite long, but I tried to describe my problem as clearly as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data from the JSON files, but in table format. This table is in Delta format. No transformations are done, except for transforming the JSON structure into a tabular structure (I did an explode and then created columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format": "JSON",
             "cloudFiles.schemaLocation": sourceschemalocation,
             "cloudFiles.inferColumnTypes": True}

@dlt.view(name='landing_view')
def inc_view():
    df = (spark
          .readStream
          .format('cloudFiles')
          .options(**cloudfile)
          .load(filpath_to_landing))
    <Some transformations to go from JSON to tabular (explode, ...)>
    return df

dlt.create_target_table('raw_table',
                        table_properties={'delta.enableChangeDataFeed': 'true'})

dlt.apply_changes(target='raw_table',
                  source='landing_view',
                  keys=['id'],
                  sequence_by='updated_at')
This code works as expected: I run it, add a changes.JSON file to the landing zone, rerun the pipeline, and the upserts are correctly applied to the 'raw_table'.
(However, each time a new parquet file with all the data is created in the delta folder. I would have expected that only a parquet file with the inserted and updated rows is added, and that some information about the current version is kept in the delta logs. Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true; the readStream for 'intermediate_table' then has option('readChangeFeed', 'true').)
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
@dlt.table(name='V_raw_table', table_properties={'delta.enableChangeDataFeed': 'True'})
def raw_table():
    df = (spark.readStream
          .format('delta')
          .option('readChangeFeed', 'true')
          .table('LIVE.raw_table'))
    df = df.withColumn('ExtraCol', <Transformation>)
    return df
dlt.create_target_table('intermediate_table')

dlt.apply_changes(target='intermediate_table',
                  source='V_raw_table',
                  keys=['id'],
                  sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I looked into 'ignoreChanges', but I don't think this is what I want. I would expect the autoloader to be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with append, but that is why I would expect that after the 'raw_table' is updated, a new parquet file would be added to the delta folder with only the inserts and updates. This added parquet file is then detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!
As readStream only works with appends, any change in the source table will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. With settings like "optimized writes", or even without them, apply_changes can add or remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".
When you change the settings to include the option .option('readChangeFeed', 'true'), you should start with a full refresh (there is a dropdown next to Start). Doing this will resolve the 'Detected a data update xxx' error, and your code should work for incremental updates.
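If you want to check this on your own table, a small sketch using DESCRIBE HISTORY, which exposes the same operationMetrics as the _delta_log JSON files (assuming 'raw_table' is resolvable by name in your workspace; otherwise point it at delta.`<path>`):
from pyspark.sql.functions import col

history = spark.sql("DESCRIBE HISTORY raw_table")   # table name is an assumption

(history
 .select('version', 'operation',
         col('operationMetrics')['numTargetFilesAdded'].alias('files_added'),
         col('operationMetrics')['numTargetFilesRemoved'].alias('files_removed'))
 .show(truncate=False))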

Spark Reading and Writing to same S3 Path Giving Unable to infer Schema Error

I need to perform update insert (Upsert) on old data with new data.
Pseudo Code:
old_data = spark.read.parquet('s3://bucket/old_data/')
new_data = spark.read.parquet('s3://bucket/new_data/')

common_records = old_data.join(new_data, on=opk, how="inner")
non_match_records = old_data.join(new_data, on=opk, how="left_anti")
new_records = new_data.join(old_data, on=opk, how="left_anti")

dfs = [common_records, non_match_records, new_records]
final_data = reduce(DataFrame.unionAll, dfs)
final_data.cache()
final_data.write.parquet('s3://bucket/old_data/')
Error :
Even though I cached the data, it still looks for the old_data path when writing. Is there any way to write directly to the old_data S3 path?
I have tried writing to a temp path, reading it back, and then writing to the main path, like below. That works, but the process takes time when the data runs into billions of rows.
final_data.write.parquet('s3://bucket/temp/')
df = spark.read.parquet('s3://bucket/temp/')
df.write.parquet('s3://bucket/old_data/')
I want to avoid this temporary writing and reading step.
Thanks in advance :)
You need to perform an action to trigger the dataframe caching. Thus you should modify the last lines of your code snippet as follows:
...
final_data = final_data.cache()
final_data.count()
final_data.write.parquet('s3://bucket/old_data/')
By performing a count action on your dataframe, you trigger the caching process and are then able to write to the same directory you read from.
However, I don't know whether this will improve the performance of your application, since cache falls back to writing to disk when the dataframe is too big for memory. If your use case is updating parquet files, I advise you to look at Delta Lake, which was created to solve exactly this.
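For reference, a minimal sketch of what the same upsert could look like with Delta Lake. Everything here is an assumption for illustration: the table at 's3://bucket/old_data/' would first have to be converted to Delta format, and a single key column named 'id' stands in for opk:
from delta.tables import DeltaTable

new_data = spark.read.parquet('s3://bucket/new_data/')
old_table = DeltaTable.forPath(spark, 's3://bucket/old_data/')   # assumes the path holds a Delta table

(old_table.alias('old')
    .merge(new_data.alias('new'), 'old.id = new.id')             # 'id' stands in for the opk key columns
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())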

Pyspark - Read data inside a map function?

I have a dataframe as below:
id | file_path
--------------------------
abc | s3://data/file1.json
def | s3://data/file2.json
For every row in this dataframe, I want to read the contents of the file located in file_path in a distributed manner.
Here's what I tried:
rdd_paths = df.rdd.map(lambda x: x.file_path)
rdd_contents = rdd_paths.map(lambda y: spark.read.parquet(y))
rdd_contents.take(2)
This gave me the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I checked SPARK-5063 but did not get clear pointers to solve this. How can I read from the file paths in a distributed manner?
The Spark context can only be accessed from the driver node. Since map() is executed on the worker nodes, the code inside it has no access to spark, which is why spark.read.parquet(y) fails there.
You need to modify your logic. For example (not necessarily the best approach), you can collect the column values for the S3 paths and pass them to wholeTextFiles, which yields pairs of (file name, file content):
paths = [row.file_path for row in df.select('file_path').collect()]
rdd4 = sc.wholeTextFiles(",".join(paths))
Now you can apply map or iterate over the pairs to store the values (file contents) as columns in a dataframe, for example with a join.
rdd4.foreach(lambda f: print(f))
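Another fully distributed option (a sketch, assuming the files are JSON as in the example paths and share a common schema) is to let Spark read all the paths at once and join the contents back to the original dataframe via input_file_name():
from pyspark.sql.functions import input_file_name

paths = [row.file_path for row in df.select('file_path').collect()]

contents = (spark.read.json(paths)                  # one dataframe holding every file's records
            .withColumn('file_path', input_file_name()))

# Note: input_file_name() returns the full URI, which may differ slightly
# (e.g. s3a:// vs s3://) from the stored path, so the join key may need normalising.
result = df.join(contents, on='file_path', how='left')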

Databricks - Creating output file

I'm pretty new to Databricks, so excuse my ignorance.
I have a Databricks notebook that creates a table to hold data. I'm trying to output the data to a pipe-delimited file using another notebook, which uses Python. If I use an ORDER BY clause, each record ends up in a separate file. If I leave the clause out of the code I get one file, but it's not in order.
The code from the notebook is as follows:
%python
try:
    dfsql = spark.sql("select field_1, field_2, field_3, field_4, field_5, field_6, field_7, field_8, field_9, field_10, field_11, field_12, field_13, field_14, field_15, field_16 from dbsmets1mig02_technical_build.tbl_tech_output_bsmart_update ORDER BY MSN,Sort_Order") #Replace with your SQL
except:
    print("Exception occurred")

if dfsql.count() == 0:
    print("No data rows")
else:
    dfsql.write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/publisheddatasmets1mig/smetsmig1/mmt/bsmart")
Spark creates a file per partition when writing files, so your ORDER BY is creating lots of partitions. Generally you want multiple files, as that means more throughput: with one file/partition you are only using one thread, so only one CPU on your workers is active while the others sit idle, which makes it a very expensive way of solving your problem.
You could leave the order by in and coalesce back into a single partition:
dfsql.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/publisheddatasmets1mig/smetsmig1/mmt/bsmart")
Even if you have multiple files you can point your other notebook at the folder and it will read all files in the folder.
To accomplish this I have done something similar to what simon_dmorias suggested. I am not sure if there is a better way to do it; this doesn't scale very well, but if you are working with a small dataset it will work.
simon_dmorias suggested: df.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/mnt/mountone/data/")
This writes a single partition into a directory as /mnt/mountone/data/data-<guid>-.csv, which I believe is not what you are looking for, right? You just want /mnt/mountone/data.csv, similar to pandas' .to_csv function.
Therefore, I will write it to a temporary location on the cluster (not on the mount).
df.coalesce(1).write.format("com.databricks.spark.csv").option("header","false").option("delimiter", "|").mode("overwrite").save("/tmpdir/data")
I then use the dbutils.fs.ls("/tmpdir/data") command to list the directory contents and identify the name of the CSV file that was written to the directory, i.e. /tmpdir/data/data-<guid>-.csv.
Once you have the CSV file name, use the dbutils.fs.cp function to copy the file to a mount location and rename it. This gives you a single file without the directory, which is what I believe you were looking for.
dbutils.fs.cp("/tmpdir/data/data-<guid>-.csv", "/mnt/mountone/data.csv")
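Putting those steps together, a small sketch of the whole sequence (paths are the hypothetical ones used above; the ls step looks up the generated part file so it can be copied and renamed):
tmp_dir = "/tmpdir/data"                     # temporary output directory (hypothetical path)
target = "/mnt/mountone/data.csv"            # final single-file location (hypothetical path)

(dfsql.coalesce(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "false")
      .option("delimiter", "|")
      .mode("overwrite")
      .save(tmp_dir))

# Find the single CSV part file Spark wrote into the directory...
csv_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.endswith(".csv")][0]

# ...and copy it to the mount under the desired name.
dbutils.fs.cp(csv_file, target)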
