I am reading a table in PySpark:
df = spark.readStream.format("delta").load("mySourceTable")
And I write it using:
df.writeStream.format("delta").outputMode("append").option("checkpointLocation", "/_checkpoints/myOutputTable").start("myOutputTable")
My question is: how can I remove the checkpoints so that PySpark reads mySourceTable from the beginning, instead of from where it last left off?
Thank you.
I don't know how to remove the checkpoints in "/_checkpoints/myOutputTable".
After stopping the Spark application, you can go directly to the checkpointLocation directory on your file system (or wherever it is stored, e.g. S3) and move or delete it.
When you then restart the Spark application, it will process mySourceTable from the beginning.
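As a minimal local sketch of the removal step (a stand-in directory is used here; for a real checkpoint on S3 you would delete the prefix with the AWS CLI or console instead, and on Databricks with `dbutils.fs.rm(path, recurse=True)`):

```python
import os
import shutil
import tempfile

# Stand-in checkpoint directory (the real one would be /_checkpoints/myOutputTable).
ckpt = os.path.join(tempfile.mkdtemp(), "myOutputTable")
os.makedirs(os.path.join(ckpt, "offsets"))  # streaming queries record offsets here

# Stop the streaming query first, then delete the whole checkpoint tree.
shutil.rmtree(ckpt)

print(os.path.exists(ckpt))  # → False: the next start() reprocesses from scratch
```

Deleting the directory discards all recorded offsets, so the query has no memory of what it already read.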
Related
I have a Spark stream that processes files in an S3 prefix. The issue is that there are many TBs of data already in this prefix, so the EMR cluster underneath Spark gets throttled trying to process it when the stream is turned on.
What I want is to ignore all files before a certain date, and then have the stream run normally. Is there a recommended way to do this?
I think I found what I need.
val df = spark.readStream
  .schema(testSchema)
  .option("maxFileAge", "7d")  // use an explicit duration; a bare "1" is ambiguous
  .parquet("s3://bucket/prefix")
This ignores everything older than a week.
I have a spark job which reads (using structured streaming API) from a source s3://bucket/source-1 folder and writes to s3://bucket/destination-1 folder. The checkpoints are saved at s3://bucket/checkpoint-1.
Now I want to read data with the same schema from s3://bucket/source-2 (with checkpointing at s3://bucket/checkpoint-2), but I want to append it to the same s3://bucket/destination-1 folder. Is it possible?
Yes, of course it is possible to write into the same location. But there are different things that you need to take into account, such as:
what data format are you using as output (Parquet, Delta, something else)?
are both streaming jobs running at the same time? Could they conflict when writing data?
(potentially) what is the partitioning schema for the destination?
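As a sketch (hypothetical paths; this assumes a Delta destination, which tolerates concurrent appenders), the two jobs differ only in their source and checkpoint, while the destination is shared:

```python
# Each stream gets its OWN checkpoint; both append to ONE destination.
streams = [
    ("s3://bucket/source-1", "s3://bucket/checkpoint-1"),
    ("s3://bucket/source-2", "s3://bucket/checkpoint-2"),
]
destination = "s3://bucket/destination-1"

def start_append_stream(spark, source, checkpoint):
    """Start one append-mode stream into the shared destination (sketch)."""
    return (spark.readStream.format("delta").load(source)
            .writeStream.format("delta")
            .outputMode("append")
            .option("checkpointLocation", checkpoint)
            .start(destination))

# Checkpoints must never be shared between the two queries.
assert len({ckpt for _, ckpt in streams}) == len(streams)
```

The critical rule is one checkpoint location per query: sharing a checkpoint between the two streams would corrupt both.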
We are using an Apache Spark (2.4.5) job via EMR. It reads an S3 prefix {bucket}/{prefix}/*.json, does some data massaging, and then rewrites it back to the same {bucket}/{prefix} via the Spark job's save() in overwrite mode. My question is: if the Spark job fails while it is re-writing the data to that S3 prefix-partition, is there any way to restore the data in that prefix-partition atomically or transactionally?
Do Spark, EMR, or S3 (any or all of these) support it?
Spark writes multiple new files to the folder because the cluster nodes write in parallel, and writing many files is more efficient. So when you use the overwrite action, Spark removes the folder contents first and then writes the result.
The problem is that Spark does not cache the whole original dataset, only the part of the data that the code actually needs. If you write the result back to the original path, Spark deletes the original first and then writes the cached result into the folder.
You could use append mode instead, but it creates new files rather than adding data to the existing ones. Spark is not designed for in-place rewrites, and there is no way to revert if an overwrite fails.
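The failure mode can be sketched with plain local files (a toy stand-in for the S3 prefix; the only point is the ordering: delete first, then write):

```python
import os
import shutil
import tempfile

# A toy "prefix" holding the original data.
prefix = tempfile.mkdtemp()
with open(os.path.join(prefix, "part-0.json"), "w") as f:
    f.write('{"a": 1}\n')

def overwrite(path, rows):
    """Mimic overwrite mode: delete the target first, then write the new data."""
    shutil.rmtree(path)                     # old contents are gone at this point
    os.makedirs(path)
    raise IOError("job failed mid-write")   # simulate a failure before writing
    # (writing rows would happen here)

try:
    overwrite(prefix, ['{"a": 2}\n'])
except IOError:
    pass

print(os.listdir(prefix))  # → [] : the original data cannot be recovered
```

A common mitigation is to write the result to a separate staging prefix first and only replace the original after the job succeeds.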
I am working with Spark Structured Streaming, using an existing multilevel-partitioned Parquet file as the source, and I have the following issue:
I want to start the streaming job reading from a particular partition instead of from the beginning.
Suppose we observe a data quality issue in partition year=2018/month=10/hour=10, and I have since corrected the data by replacing the affected files.
Now the question is how to reprocess data starting from that day instead of from the beginning. With a file stream source, Structured Streaming loads all files, but I want to ignore a few of them. I would also need to remove my checkpoint directory, because it holds offsets up to the present.
I don't think it's possible in Spark Structured Streaming (I wish I were mistaken).
Since we're talking about a streaming query, you'd have to rewind the "stream". The only way to achieve it (that I can think of) is to re-upload the data (no idea how to do that) or simply process data that would "delete" the previous version of the partition year=2018/month=10/hour=10 and then upload the new, corrected version.
The question is how to inform the Parquet data source that whatever has already been processed should be "evicted" from the result (which may have already been sent out to external sources for further processing).
I have a daemon process which dumps data as files into HDFS. I need to create an RDD over the new files, de-duplicate them, and store them back on HDFS. The file names should be preserved when writing back to HDFS.
Any pointers to achieve this?
I am open to achieving it with or without Spark Streaming.
I tried creating a Spark Streaming process which processes data directly (using Java code on the worker nodes) and pushes it into HDFS without creating an RDD.
But this approach fails for larger files (greater than 15 GB).
I am looking into JavaStreamingContext.fileStream now.
Any pointers would be a great help.
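For what it's worth, the per-file de-duplication step can be sketched locally (a stand-in for the HDFS version, assuming duplicates are whole lines and first-seen order should be kept; the file name is preserved because the file is rewritten in place):

```python
import os
import tempfile

def dedupe_in_place(path):
    """Remove duplicate lines from a file, keeping first occurrences and the name."""
    seen = set()
    kept = []
    with open(path) as f:
        for line in f:
            if line not in seen:
                seen.add(line)
                kept.append(line)
    with open(path, "w") as f:
        f.writelines(kept)

# Toy input standing in for one dumped HDFS file.
path = os.path.join(tempfile.mkdtemp(), "dump-0001.log")
with open(path, "w") as f:
    f.write("a\nb\na\nc\nb\n")

dedupe_in_place(path)
print(open(path).read())  # → "a\nb\nc\n"
```

On HDFS the same pattern would apply per file, with the read/write going through the HDFS client instead of the local file system.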
Thanks and Regards,
Abhay Dandekar