Handling Duplicates in Databricks autoloader

I am new to Databricks Auto Loader. We have a requirement to load data from AWS S3 into a Delta table using Auto Loader. While testing, I ran into a duplicate problem: if I upload a file named, say, emp_09282021.csv that contains the same data as emp_09272021.csv, no duplicates are detected and the rows are simply inserted again. So if emp_09272021.csv had 5 rows, the table has 10 rows after I upload emp_09282021.csv.
Below is the code that I tried:
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.format("delta") \
.option("mergeSchema", "true") \
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start("s3://some-s3-path/spark_stream_processing/target/")
Any guidance on how to handle this, please?

Detecting duplicates is not Auto Loader's job: it lets you ingest data, but you need to handle duplicates yourself. There are several approaches:
Use the built-in dropDuplicates function. It's recommended to use it with watermarking to avoid building up a huge state, but you need a column that serves as event time, and it must be part of the dropDuplicates column list (see the docs for more details):
streamingDf \
.withWatermark("eventTime", "10 seconds") \
.dropDuplicates(["col1", "eventTime"])
Use Delta's merge capability: you only insert data that isn't already in the Delta table, but you need to use foreachBatch for that. Something like this (note that the table must already exist, or you need to add handling for a non-existent table):
from delta.tables import DeltaTable

def drop_duplicates(df, epoch):
    table = DeltaTable.forPath(
        spark, "s3://some-s3-path/spark_stream_processing/target/")
    dname = "destination"
    uname = "updates"
    dup_columns = ["col1", "col2"]
    merge_condition = " AND ".join(
        [f"{dname}.{col} = {uname}.{col}" for col in dup_columns])
    # Deduplicate within the micro-batch too: the merge only compares
    # incoming rows against rows that are already in the table.
    batch = df.dropDuplicates(dup_columns)
    table.alias(dname).merge(batch.alias(uname), merge_condition) \
        .whenNotMatchedInsertAll().execute()

# ....
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.foreachBatch(drop_duplicates)\
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start()
In this code you need to change the dup_columns variable to specify columns that are used to detect duplicates.
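The merge condition built in drop_duplicates can be factored into a small helper, which makes it easier to change the duplicate-detection columns. The sketch below is plain Python; the helper name build_merge_condition and its null_safe flag are assumptions for illustration, not part of the Delta API. With null_safe=True it uses Spark SQL's null-safe equality operator <=>, which may be worth considering when key columns can contain NULL, since a plain = never matches NULL against NULL and such rows would be inserted again.

```python
def build_merge_condition(columns, target="destination", source="updates",
                          null_safe=False):
    """Build the ON clause for an insert-only Delta merge.

    Hypothetical helper: `columns` are the columns that identify a duplicate;
    `target` and `source` are the aliases given to the Delta table and the
    micro-batch DataFrame.
    """
    if not columns:
        raise ValueError("at least one column is needed to detect duplicates")
    # `<=>` is Spark SQL's null-safe equality; `=` yields NULL for NULL inputs.
    op = "<=>" if null_safe else "="
    return " AND ".join(f"{target}.{c} {op} {source}.{c}" for c in columns)


condition = build_merge_condition(["id", "name", "age", "city"])
print(condition)
```

The resulting string would be passed as the second argument of merge(...) in place of merge_condition.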

Related

Can I exclude the column used for partitioning when writing to parquet?

I need to create Parquet files from data read over JDBC. The table is quite big and all columns are varchars, so I created a new column with a random int to use for partitioning.
My JDBC read looks something like this:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
and my write to parquet looks something like this:
data_df.write.mode("overwrite").parquet("parquetfile.parquet").partitionBy('random_number')
The generated Parquet files also contain the random_number column, but I only created that column for partitioning. Is there a way to exclude it when writing the Parquet files?
Thanks for any help, I'm new to Spark :)
I would like to exclude the random_number column, but I don't know whether that is possible while still using the column for partitioning.

If you want to repartition in memory using a column without writing it out, you can call .repartition(col("random_number")) and then drop the column before writing your data:
from pyspark.sql.functions import col

data_df = sparkSession.read.format('jdbc') \
    .option('url', 'jdbc:netezza://host:port/db') \
    .option('dbtable', """(SELECT * FROM schema.table) A""") \
    .option('user', 'user') \
    .option('password', 'password') \
    .option('partitionColumn','random_number') \
    .option('lowerBound','1') \
    .option('upperBound','200') \
    .option('numPartitions','200') \
    .load() \
    .repartition(col("random_number")).drop("random_number")
then:
data_df.write.mode("overwrite").parquet("parquetfile.parquet")

Create Spark output streams with function

I use Databricks Auto Loader to ingest files that contain data with different schemas, and I want to write them to the corresponding Delta tables using update mode.
There may be many (>15) different message types in a stream, so I'd have to write an output stream for every one of them. There is an "upsert" function for every table.
Can this be condensed using a function (example given below) to save a few keystrokes?
upload_path = '/example'
# Set up the stream to begin reading incoming files from the
# upload_path location.
df = spark.readStream.format('cloudFiles') \
.option('cloudFiles.format', 'avro') \
.load(upload_path)
# filter messages and apply JSON schema
table1_df = filter_and_transform(df, json_schema1)
table2_df = filter_and_transform(df, json_schema2)
table3_df = filter_and_transform(df, json_schema3)
# each table has its own upsert function
def create_output_stream(df, table_name, upsert_function):
    # Build the stream writer and return it; the caller starts it.
    # No sink format is set because foreachBatch does the writing.
    return df.writeStream \
        .trigger(once=True) \
        .foreachBatch(upsert_function) \
        .queryName(f"autoLoader_query_{table_name}") \
        .option("checkpointLocation", f"dbfs:/delta/somepath/{table_name}") \
        .outputMode("update")
output_stream1 = create_output_stream(table1_df, "table_name1", upsert_function1).start() # start stream in outer environment
output_stream2 = create_output_stream(table2_df, "table_name2", upsert_function2).start()
output_stream3 = create_output_stream(table3_df, "table_name3", upsert_function3).start()

Yes, of course it's possible to do it this way; it's quite a standard pattern.
But take one thing into consideration: if your input data isn't partitioned by message type, you will scan the same files multiple times (once per message type). An alternative is to perform the filtering and upsert of all message types in a single foreachBatch, like this:
df = spark.readStream.format('cloudFiles') \
.option('cloudFiles.format', 'avro') \
.load(upload_path)
def do_all_upserts(df, epoch):
    # cache the micro-batch so it is read only once for all filters
    df.cache()
    table1_df = filter_and_transform(df, json_schema1)
    table2_df = filter_and_transform(df, json_schema2)
    table3_df = filter_and_transform(df, json_schema3)
    # the writes could also run in parallel, e.g. using multithreading
    do_upsert(table1_df)
    do_upsert(table2_df)
    ...
    # free resources
    df.unpersist()

# table_name is not defined in this snippet; the combined stream
# needs a single checkpoint location of its own
df.writeStream \
    .trigger(once=True) \
    .foreachBatch(do_all_upserts) \
    .option("checkpointLocation", f"dbfs:/delta/somepath/{table_name}") \
    .start()
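The comment about running the writes with multithreading could look roughly like the sketch below. It is plain Python using the standard library's ThreadPoolExecutor; run_upserts_in_parallel and the (function, DataFrame) pairing are assumptions for illustration, and the stand-in upsert functions only report the size of their input. Inside do_all_upserts the pairs would be the per-table upsert functions and the filtered DataFrames.

```python
from concurrent.futures import ThreadPoolExecutor


def run_upserts_in_parallel(tasks, max_workers=4):
    """Run each (upsert_function, batch_df) pair in its own thread.

    Hypothetical helper. Returns the results in task order and re-raises
    the first exception, so a failed upsert fails the whole micro-batch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, batch_df) for fn, batch_df in tasks]
        # result() blocks until the task is done and propagates exceptions.
        return [f.result() for f in futures]


# Stand-ins for real upsert functions and filtered DataFrames.
def upsert_table1(batch_df):
    return ("table1", len(batch_df))


def upsert_table2(batch_df):
    return ("table2", len(batch_df))


results = run_upserts_in_parallel([
    (upsert_table1, [1, 2, 3]),
    (upsert_table2, [4, 5]),
])
print(results)
```

Because the helper only returns once every thread has finished, df.unpersist() can safely follow the call.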

How to include partitioned column in pyspark dataframe read method

I am writing Avro files from a Parquet file. I read the file as below:
Reading data
dfParquet = spark.read.format("parquet").option("mode", "FAILFAST") \
    .load("/Users/rashmik/flight-time.parquet")
Writing data
I wrote the data in Avro format as below:
dfParquetRePartitioned.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
As expected, I got data partitioned by OP_CARRIER.
Reading Avro partitioned data from a specific partition
In another job, I need to read data from the output of the above job, i.e. from datasink/avro directory. I am using the below code to read from datasink/avro
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load("datasink/avro/OP_CARRIER=AA")
It reads the data successfully, but as expected the OP_CARRIER column is not available in the dfAvro DataFrame, as it is a partition column of the first job. Now my requirement is to also include the OP_CARRIER field in the second DataFrame, i.e. in dfAvro. Could somebody help me with this?
I have been going through the Spark documentation, but I am not able to locate the relevant information. Any pointer will be very helpful.

You can replicate the same column value under a different alias.
from pyspark.sql.functions import col, lit

dfParquetRePartitioned.withColumn("OP_CARRIER_1", col("OP_CARRIER")) \
.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
This gives you what you want, but under a different alias.
Or you can do it while reading. If the location is dynamic, you can easily append the column:
path = "datasink/avro/OP_CARRIER=AA"
newcol = path.split("/")[-1].split("=")
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load(path).withColumn(newcol[0], lit(newcol[1]))
If the value is static, it's even easier to add it while reading the data.
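The path-splitting step can be generalised to paths with several partition levels. The sketch below is plain Python; partition_values is a hypothetical helper name, and it assumes Hive-style key=value directory names like the ones partitionBy produces.

```python
def partition_values(path):
    """Return {column: value} for every key=value segment in a path.

    Hypothetical helper; assumes Hive-style partition directories.
    """
    values = {}
    for segment in path.strip("/").split("/"):
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


parts = partition_values("datasink/avro/OP_CARRIER=AA")
print(parts)
```

Each entry could then be appended with withColumn(name, lit(value)), as in the snippet above; note that the values come back as strings. Spark's file sources also have a basePath read option for partition discovery, which may make the manual step unnecessary; check the documentation for your Spark version.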

pyspark parse filename on load

I'm quite new to Spark and there is one thing that I don't understand: how to manipulate column content.
I have a set of CSV files as follows:
Each dsX is a table, and I would like to load the data for each table at once.
So far no problems:
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*")
But one piece of information is missing: the client id, which is the first part of the CSV file name: clientId_table_category.csv
So I tried to do this:
def extract_path(patht):
print(patht)
return patht
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*") \
.withColumn("clientId", fn.lit(extract_path(fn.input_file_name())))
But the print returns:
Column<b'input_file_name()'>
And I can't do much with this.
I'm quite stuck here, how do you manipulate data in this configuration?
Another solution for me is to load each csv one by one and parse the clientId from the file name manually, but I was wondering if there wouldn't be a more powerful solution with spark.

You are overcomplicating this a little:
from pyspark.sql import functions as fn

df = spark.read.csv(
    table+"/*",
    header=True,
    escape='"'
).withColumn("clientId", fn.input_file_name())
This will create a column with the full path. Then you just need some extra string manipulation, which is easy using a UDF. You can also do that with built-in functions, but it is trickier.
from pyspark.sql.types import StringType

@fn.udf(StringType())
def get_id(in_string):
    # keep the file name only, then take the part before the first "_"
    return in_string.split("/")[-1].split("_")[0]

df = df.withColumn(
    "clientId",
    get_id(fn.col("clientId"))
)
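The string logic inside get_id can be checked on its own before wrapping it in a UDF. The sketch below is plain Python; client_id_from_path and the sample path are made up for illustration, and it assumes file names follow the clientId_table_category.csv pattern from the question.

```python
def client_id_from_path(path):
    """Extract the client id from a path ending in clientId_table_category.csv.

    Hypothetical helper; returns None when the input is None or the file
    name contains no underscore.
    """
    if path is None:
        return None
    file_name = path.rsplit("/", 1)[-1]
    client_id, sep, _ = file_name.partition("_")
    return client_id if sep else None


print(client_id_from_path("file:///data/ds1/42_ds1_sales.csv"))
```

The same extraction can probably also be expressed with the built-in regexp_extract function applied to input_file_name(), which would avoid the UDF.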

Pass additional arguments to foreachBatch in pyspark

I am using foreachBatch in pyspark structured streaming to write each microbatch to SQL Server using JDBC. I need to use the same process for several tables, and I'd like to reuse the same writer function by adding an additional argument for table name, but I'm not sure how to pass the table name argument.
The example here is pretty helpful, but in the python example the table name is hardcoded, and it looks like in the scala example they're referencing a global variable(?) I would like to pass the name of the table into the function.
The function given in the python example at the link above is:
def writeToSQLWarehose(df, epochId):
    df.write \
        .format("com.databricks.spark.sqldw") \
        .mode('overwrite') \
        .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
        .option("forward_spark_azure_storage_credentials", "true") \
        .option("dbtable", "my_table_in_dw_copy") \
        .option("tempdir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
        .save()
I'd like to use something like this:
def writeToSQLWarehose(df, epochId, tableName):
    df.write \
        .format("com.databricks.spark.sqldw") \
        .mode('overwrite') \
        .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
        .option("forward_spark_azure_storage_credentials", "true") \
        .option("dbtable", tableName) \
        .option("tempdir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
        .save()
But I'm not sure how to pass the additional argument through foreachBatch.

Something like this should work.
streamingDF.writeStream.foreachBatch(lambda df,epochId: writeToSQLWarehose(df, epochId,tableName )).start()

Samellas' solution does not work if you need to run multiple streams. The foreachBatch function gets serialised and sent to the Spark worker. The parameter still seems to be a shared variable within the worker and may change during execution.
My solution is to add the parameter as a literal column in the batch DataFrame (here, passing a silver data lake table path to the merge operation):
    .withColumn("dl_tablePath", func.lit(silverPath)) \
    .writeStream.format("delta") \
    .foreachBatch(insertIfNotExisting)
In the batch function insertIfNotExisting, I pick up the parameter and drop the parameter column:
def insertIfNotExisting(batchDf, batchId):
    tablePath = batchDf.select("dl_tablePath").limit(1).collect()[0][0]
    realDf = batchDf.drop("dl_tablePath")
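One plausible cause of the behaviour described above, which is an assumption and not confirmed by the posts, is Python's late binding of closures: a lambda created in a loop looks up tableName when it is called, not when it is defined, so every stream ends up seeing the last value. functools.partial binds the value at creation time. The sketch below is plain Python with a stand-in writer function (write_batch and the table names are made up); in a streaming job the partial object would be passed to foreachBatch.

```python
from functools import partial


def write_batch(df, epoch_id, table_name):
    """Stand-in for a foreachBatch writer; just reports its target table."""
    return f"batch {epoch_id} -> {table_name}"


tables = ["table_a", "table_b"]

# Late binding: both lambdas read `name` when called, after the loop ended.
late = [lambda df, epoch_id: write_batch(df, epoch_id, name) for name in tables]

# Early binding: partial stores the current value of `name` right away.
bound = [partial(write_batch, table_name=name) for name in tables]

print([f(None, 0) for f in late])
print([f(None, 0) for f in bound])
```

If this is indeed the cause, binding the table name with partial would avoid adding and dropping a helper column on every batch.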
