I have ~3 PB of Parquet on S3. I want to read it file by file with Spark Structured Streaming and join some metadata to it before writing it out. The metadata is small enough to be broadcast. Files in the source data are ~60 MB each; none are huge.
val r = spark.readStream
.option("maxFilesPerTrigger", "100")
.schema(pschema)
.parquet("s3://mybigdata/sourcedata/")
.withColumn("id", regexp_extract(col("mycol"), "someregex", 1).cast(IntegerType))
.alias("p")
.join(broadcast(idmap.alias("i")), $"p.id" === $"i.id", "inner") //idmap is a small dataframe
.drop($"i.id")
.withColumn("date", regexp_extract($"filename", "someregex", 1))
val w = r.writeStream.format("delta")
.partitionBy("date", "some_id")
.option("checkpointLocation", "s3://mybigdata/checkpoint/")
.option("path", "s3://mybigdata/destination/")
.start()
When I do this, I get MASSIVE spills to memory and disk, which of course is a disaster. How is it that I am getting these massive spills when I'm rate limiting via maxFilesPerTrigger to 100 x 60 MB files at a time? It seems to be trying to read the entire S3 dataset and isn't streaming at all.
What is going wrong here?
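For scale, here is a back-of-the-envelope calculation of what the rate limit should imply, as a minimal Python sketch. It assumes decimal units (1 PB = 10^15 bytes) and treats every file as exactly 60 MB; the variable names are only illustrative.

```python
import math

# Assumptions: decimal units, uniform 60 MB files (the real sizes vary).
TOTAL_BYTES = 3 * 10**15          # ~3 PB of source parquet
FILE_BYTES = 60 * 10**6           # ~60 MB per file
MAX_FILES_PER_TRIGGER = 100       # value passed to maxFilesPerTrigger

total_files = TOTAL_BYTES // FILE_BYTES
bytes_per_trigger = MAX_FILES_PER_TRIGGER * FILE_BYTES
triggers_needed = math.ceil(total_files / MAX_FILES_PER_TRIGGER)

print(f"files in source:   {total_files:,}")
print(f"input per trigger: {bytes_per_trigger / 10**9:.1f} GB")
print(f"triggers needed:   {triggers_needed:,}")
```

Under these assumptions each micro-batch reads about 6 GB of parquet input, spread over roughly 500,000 triggers for the whole dataset.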
I am trying to do a streaming merge between delta tables using this guide - https://docs.delta.io/latest/delta-update.html#upsert-from-streaming-queries-using-foreachbatch
Our Code Sample (Java):
Dataset<Row> sourceDf = sparkSession
.readStream()
.format("delta")
.option("inferSchema", "true")
.load(sourcePath);
DeltaTable deltaTable = DeltaTable.forPath(sparkSession, targetPath);
sourceDf.createOrReplaceTempView("vTempView");
StreamingQuery sq = sparkSession.sql("select * from vTempView").writeStream()
.format("delta")
.foreachBatch((microDf, id) -> {
deltaTable.alias("e").merge(microDf.alias("d"), "e.SALE_ID = d.SALE_ID")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute();
})
.outputMode("update")
.option("checkpointLocation", util.getFullS3Path(target)+"/_checkpoint")
.trigger(Trigger.Once())
.start();
Problem:
The source path and target path are already in sync via the checkpoint folder. The target table has around 8 million rows, amounting to around 450 MB of parquet files.
When new data arrives in the source path (say 987 rows), the code above picks it up and merges it into the target table. During this operation Spark performs a BroadcastHashJoin and broadcasts the target table, which has 8M rows.
Here's a DAG snippet for the merge operation (with a table of 1M rows):
Expectation:
I am expecting the smaller dataset (i.e. the 987 rows) to be broadcast. If not, then at least Spark should not broadcast the target table, as it is larger than the configured spark.sql.autoBroadcastJoinThreshold and we are not providing a broadcast hint anywhere.
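The expectation can be written down as a simplified size check. This is a minimal Python sketch of the rule, not Spark's actual planner code: it assumes the default spark.sql.autoBroadcastJoinThreshold of 10 MB (the value configured in the job is not shown here), the function name is hypothetical, and the micro-batch size is a made-up figure.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB); assumed here
# because the configured value is not given in the question.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024

def is_broadcast_candidate(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified model: a join side qualifies for automatic broadcast only
    if its estimated size does not exceed the threshold.
    A threshold of -1 disables automatic broadcasting."""
    if threshold_bytes < 0:
        return False
    return 0 <= estimated_size_bytes <= threshold_bytes

target_table_bytes = 450 * 1024 * 1024   # ~450 MB of parquet (8M rows)
micro_batch_bytes = 100 * 1024           # hypothetical size for the 987 new rows

print(is_broadcast_candidate(target_table_bytes))  # False
print(is_broadcast_candidate(micro_batch_bytes))   # True
```

Because the check uses the planner's size estimate rather than the real size, an underestimated table can still pass it.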
Things I have tried:
I searched around and got this article - https://learn.microsoft.com/en-us/azure/databricks/kb/sql/bchashjoin-exceeds-bcjointhreshold-oom.
It provides 2 solutions:
Run "ANALYZE TABLE ..." (but since we are reading the target table from a path and not from a table, this is not possible).
Cache the table you are broadcasting. DeltaTable does not have any provision for caching a table, so I can't do this.
I thought this was because we are using DeltaTable.forPath() method for reading target table and spark is unable to calculate target table metrics. So I also tried a different approach,
Dataset<Row> sourceDf = sparkSession
.readStream()
.format("delta")
.option("inferSchema", "true")
.load(sourcePath);
Dataset<Row> targetDf = sparkSession
.read()
.format("delta")
.option("inferSchema", "true")
.load(targetPath);
sourceDf.createOrReplaceTempView("vtempview");
targetDf.createOrReplaceTempView("vtemptarget");
targetDf.cache();
StreamingQuery sq = sparkSession.sql("select * from vtempview").writeStream()
.format("delta")
.foreachBatch((microDf, id) -> {
microDf.createOrReplaceTempView("vtempmicrodf");
microDf.sparkSession().sql(
"MERGE INTO vtemptarget as t USING vtempmicrodf as s ON t.SALE_ID = s.SALE_ID WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * "
);
})
.outputMode("update")
.option("checkpointLocation", util.getFullS3Path(target)+"/_checkpoint")
.trigger(Trigger.Once())
.start();
In the snippet above I am also caching targetDf so that Spark can calculate its metrics and not broadcast the target table. But it didn't help, and Spark still broadcasts it.
Now I am out of options. Can anyone give me some guidance on this?
I have a Delta table with about 300 billion rows. I am applying a UDF to one column to create another column.
My code is something like this:
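from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def my_udf(data):
    pass  # placeholder for the actual transformation
udf_func = udf(my_udf, StringType())
data = spark.sql("""SELECT * FROM large_table """)
data = data.withColumn('new_column', udf_func(data.value))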
The issue is that this takes a long time, as Spark will process all 300 billion rows and only then write the output. Is there a way to do some micro-batching and write the output of each batch regularly to the output Delta table?
The first rule is usually to avoid UDFs as much as possible. What kind of transformation do you need to perform that isn't available in Spark itself?
Second rule: if you can't avoid using a UDF, at least use Pandas UDFs, which process data in batches and don't have such a large serialization/deserialization overhead. Regular UDFs handle data row by row, encoding and decoding the data for each row.
If your table was built up over time and consists of many files, you can try Spark Structured Streaming with Trigger.AvailableNow (requires DBR 10.3 or 10.4), something like this:
maxNumFiles = 10  # max number of parquet files processed at once
df = spark.readStream \
    .option("maxFilesPerTrigger", maxNumFiles) \
    .table("large_table")
# use the streaming DataFrame's own column (df.value), not the batch DataFrame
df = df.withColumn('new_column', udf_func(df.value))
df.writeStream \
    .option("checkpointLocation", "/some/path") \
    .trigger(availableNow=True) \
    .toTable("my_destination_table")
This will read the source table chunk by chunk, apply your transformation, and write the data into a destination table.
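As a plain-Python illustration of the chunking idea (not Spark code), the sketch below splits a list of file names into batches of at most maxNumFiles, the way maxFilesPerTrigger caps the number of files per micro-batch. The file names and the helper name are made up for the example.

```python
from itertools import islice

def micro_batches(files, max_files):
    """Yield consecutive batches containing at most max_files items each."""
    it = iter(files)
    while True:
        batch = list(islice(it, max_files))
        if not batch:
            return
        yield batch

# Hypothetical file listing for the source table.
files = [f"part-{i:05d}.parquet" for i in range(25)]

batch_sizes = [len(b) for b in micro_batches(files, max_files=10)]
print(batch_sizes)  # [10, 10, 5]
```

With the availableNow trigger the stream stops once everything that was available at start has been processed, and the checkpoint lets a restarted job continue after the last completed batch.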
In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
    print("Fetching now")
    streamingInputDF = (
        spark
        .readStream
        .format("delta")
        .option("maxBytesPerTrigger",1024)
        .table(source_table)
        .where("measurementId IN (1351,1350)")
        .where("year >= '2021'")
    )
    query = (
        streamingInputDF
        .writeStream
        .outputMode("append")
        .option("checkpointLocation", "/streaming_checkpoints/5")
        .foreachBatch(customWriter)
        .start()
        .awaitTermination()
    )
    return query
def customWriter(batchDF,batchId):
    print(batchId)
    print(batchDF.count())
    batchDF.show(10)
    length = batchDF.count()
    print("batchId,batch size:",batchId,length)
If I change the where clause in streamingInputDF to add more measurementId values, the Structured Streaming job doesn't always pick up the change and fetch the new data. Sometimes it continues to run as if nothing has changed, whereas at other times it starts fetching the new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of the Delta table:
col_name       data_type
measurementId  int
year           int
time           timestamp
q              smallint
v              string
"In Structured Streaming, will the checkpoints keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
A filter condition added when restarting your Structured Streaming query will therefore not be applied to historic records, only to records that were added to the Delta table after version 2.
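To check which version a stream has reached, the offset file can be parsed by hand. The sketch below is a minimal Python example that assumes the three-line layout shown above (a version marker, a batch metadata line, then one JSON line for the Delta source). The sample string is shortened, with the elided conf replaced by an empty object, and the function name is hypothetical.

```python
import json

def read_delta_offset(offset_file_text):
    """Return (metadata, source_offset) parsed from an offsets file body."""
    lines = [ln for ln in offset_file_text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("v"):
        raise ValueError("expected a version marker such as 'v1' on the first line")
    metadata = json.loads(lines[1])
    source_offset = json.loads(lines[2])
    return metadata, source_offset

# Shortened sample; the real file carries a populated "conf" entry.
sample = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":{}}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
"""

metadata, offset = read_delta_offset(sample)
print(offset["reservoirVersion"], offset["isStartingVersion"])
```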
To see this behavior in action, you can use the code below and analyse the content of the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"
// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)
// run this code for the first time, then add filter condition, then run again
val query = spark.readStream
.format("delta")
.load(deltaPath)
.filter(col("id").isin("1")) // in the second run add "2"
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart it with the new filter condition, the new condition will be applied to the new data.
I am trying to optimize the performance of a Spark job using bucketing. I read .parquet and .csv files and do some transformations. Afterwards I bucket the data and join the two DataFrames. Then I write the joined DF to parquet, but I get an empty file of ~500 B instead of ~500 MB.
Cloudera (cdh5.15.1)
Spark 2.3.0
Blob
val readParquet = spark.read.parquet(inputP)
readParquet
.write
.format("parquet")
.bucketBy(23, "column")
.sortBy("column")
.mode(SaveMode.Overwrite)
.saveAsTable("bucketedTable1")
val firstTableDF = spark.table("bucketedTable1")
val readCSV = spark.read.csv(inputCSV)
readCSV
.filter(..)
.orderBy(someColumn)
.write
.format("parquet")
.bucketBy(23, "column")
.sortBy("column")
.mode(SaveMode.Overwrite)
.saveAsTable("bucketedTable2")
val secondTableDF = spark.table("bucketedTable2")
val resultDF = secondTableDF
.join(firstTableDF, Seq("column"), "fullouter")
.
.
resultDF
.coalesce(1)
.write
.mode(SaveMode.Overwrite)
.parquet(output)
When I launch the Spark job from the command line over ssh I get the correct result, a ~500 MB parquet file which I can see using Hive. If I run the same job through an Oozie workflow I get an empty file (~500 bytes).
When I call .show() on my resultDF I can see the data, but the parquet file is empty.
+-----------+---------------+----------+
| col1| col2 | col3|
+-----------+---------------+----------+
|33601234567|208012345678910| LOL|
|33601234567|208012345678910| LOL|
|33601234567|208012345678910| LOL|
There is no problem writing to parquet when I am not saving data as a table. It occurs only with DF created from table.
Any suggestions ?
Thanks in advance for any thoughts!
I figured it out. In my case I just added the option .option("path", "/sources/tmp_files_path"). Now I can use bucketing and I have data in my output files.
readParquet
.write
.option("path", "/sources/tmp_files_path")
.mode(SaveMode.Overwrite)
.bucketBy(23, "column")
.sortBy("column")
.saveAsTable("bucketedTable1")
In Spark batch jobs I usually have a JSON data source written to a file, and I can use the corrupt-record column feature of the DataFrame reader to write the corrupt data out to a separate location, and another reader to write the valid data, both from the same job. (The data is written as parquet.)
But in Spark Structured Streaming I first read the stream from Kafka as a string and then use from_json to get my DataFrame. from_json uses JsonToStructs, which uses FailFast mode in the parser and does not return the unparsed string in a column of the DataFrame (see the note in Ref). How can I then write corrupt data that doesn't match my schema, and possibly invalid JSON, to another location using Spark Structured Streaming?
Finally, in the batch case the same job can write both DataFrames, but Spark Structured Streaming requires special handling for multiple sinks. So for Spark 2.3.1 (my current version), an answer should include details on how to write both the corrupt and the valid streams properly...
Ref: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Expression-JsonToStructs.html
val rawKafkaDataFrame=spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", config.broker)
.option("kafka.ssl.truststore.location", path.toString)
.option("kafka.ssl.truststore.password", config.pass)
.option("kafka.ssl.truststore.type", "JKS")
.option("kafka.security.protocol", "SSL")
.option("subscribe", config.topic)
.option("startingOffsets", "earliest")
.load()
val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
// does not provide a corrupt column or way to work with corrupt
jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
When you convert a string to JSON with from_json and it cannot be parsed with the provided schema, it returns null. You can filter for the null values and select the original string. Something like this:
val jsonDF = jsonDataFrame.withColumn("json", from_json(col("value"), schema))
val invalidJsonDF = jsonDF.filter(col("json").isNull).select("value")
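The same null-on-failure pattern, sketched in plain Python outside Spark for illustration: a parser that returns None instead of raising, followed by a split into valid and invalid records. The helper names and sample records are hypothetical, and unlike from_json this sketch only checks that the text is a JSON object, not that it matches a schema.

```python
import json

def parse_or_none(raw):
    """Return the parsed JSON object, or None if raw is not a JSON object."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return None
    return parsed if isinstance(parsed, dict) else None

def split_records(raw_values):
    """Separate parsed records from raw strings that failed to parse."""
    valid, invalid = [], []
    for raw in raw_values:
        parsed = parse_or_none(raw)
        if parsed is None:
            invalid.append(raw)
        else:
            valid.append(parsed)
    return valid, invalid

valid, invalid = split_records(['{"id": 1}', '{"id": ', 'not json', '{"id": 2}'])
print(len(valid), len(invalid))
```

Keeping the raw string for the failed records is what makes it possible to write them to a separate location later.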
I was just trying to figure out the _corrupt_record equivalent for structured streaming as well. Here's what I came up with; hopefully it gets you closer to what you're looking for:
// add a status column to partition our output by
// optional: only keep the unparsed json if it was corrupt
// writes up to 2 subdirs: 'out.par/status=OK' and 'out.par/status=CORRUPT'
// additional status codes for validation of nested fields could be added in similar fashion
df.withColumn("struct", from_json($"value", schema))
.withColumn("status", when($"struct".isNull, lit("CORRUPT")).otherwise(lit("OK")))
.withColumn("value", when($"status" <=> lit("CORRUPT"), $"value"))
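.writeStream // assumes df is a streaming DataFrame, which needs writeStream and a checkpoint
.format("parquet")
.partitionBy("status")
.option("checkpointLocation", "out.par/_checkpoint") // placeholder path, adjust as needed
.start("out.par")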