what is the use of _spark_metadata directory - apache-spark

I am trying to get my head around how streaming works in spark.
I have a file in a /data/flight-data/csv/ directory. It has the following data:
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
I thought to test what will happen if I read the file as a stream instead of as a batch. I first created a Dataframe using read
scala> val dataDF = spark.read.option("inferSchema","true").option("header","true").csv("data/flight-data/csv/2015-summary.csv");
dataDF: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
then took the schema from it and created a new DataFrame
scala> val staticSchema = dataDF.schema;
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))
scala> val dataStream = spark.readStream.schema(staticSchema).option("header","true").csv("data/flight-data/csv");
dataStream: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
Then I started the stream. The path for the checkpoint and the output (I suppose) is the /home/manu/test directory, which is initially empty.
scala> dataStream.writeStream.option("checkpointLocation","home/manu/test").start("/home/manu/test");
res5: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5c7df5f1
The return value of start is a StreamingQuery, which I read is "a handle to a query that is executing continuously in the background as new data arrives. All these methods are thread-safe."
I notice that the directory now contains a _spark_metadata directory, but nothing else.
Question1 - What is _spark_metadata directory? I notice it is empty. What is it used for?
Question 2 - I don't see anything else happening. Is it because I am not running any query on the Dataframe dataStream (or shall I say that the query isn't doing anything useful)?


Is spark.read or spark.sql lazy transformations?

In Spark, if the source data changes between two action calls, why do I still get the previous output rather than the most recent one? Through the DAG, all operations, including the read, should be executed again once an action is called, shouldn't they?
df = spark.sql("select * from dummy.table1")
#Reading from spark table which has two records into dataframe.
#Gives count as 2 records
Now a record is inserted into the table and the action is called again without re-running command1.
#Still gives count as 2 records.
I was expecting Spark will execute read operation again and fetch total 3 records into dataframe.
Where is my understanding wrong?
Contrary to your assertion, the example below does give a different result, using a Databricks notebook (separate cells). You do not say how the insert was done.
The following, using Parquet- or CSV-based Spark tables rather than a Hive table, does produce different results as the files making up the table change. For a DAG re-compute, though, the same set of files is used, as far as I know.
//1st time in a cell
val df = spark.read.csv("/FileStore/tables/count.txt")
//1st time in another cell
val df2 = spark.sql("select * from tab2")
//4 is returned
//2nd time in a different cell
val df = spark.read.csv("/FileStore/tables/count.txt")
//2nd time in another cell
//8 is returned
Refutes your assertion. Also tried with .enableHiveSupport(), no difference.
Even if creating a Hive table directly in Databricks:
spark.sql("CREATE TABLE tab5 (id INT, name STRING, age INT) STORED AS ORC;")
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
spark.sql(""" INSERT INTO tab5 VALUES (2, 'Amy SmithS', 77) """)
Still get updated counts.
However, for a Hive-created ORC SerDe table, the following "hive" approach, or an insert via spark.sql:
val dfX = Seq((88,"John", 888)).toDF("id" ,"name", "age")
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
will sometimes show and sometimes not show an updated count when just the 2nd df.count() is issued. This is due to Hive / Spark lack of synchronization that may depend on some internal flagging of changes. In any event not consistent. Double-checked.
This is mostly related to immutability, as I see it. DataFrames are immutable, hence changes in the original table are not reflected in them.
Once a DataFrame is evaluated, it is never calculated again. So once the DataFrame named df is evaluated, it is a picture of table1 at the time of evaluation; it doesn't matter if table1 changes, df won't. So the second df.count() does not trigger evaluation, it just returns the previous result, which is 2.
If you want the desired result, you have to load the DataFrame again into a different variable:
val df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
val df2 = spark.sql("select * from dummy.table1")
df2.count() //Will trigger evaluation and return 3
Or using var instead of val (which is bad)
var df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 3
That said: yes, spark.read and spark.sql are lazy; they are not executed until an action is found, but once that happens, evaluation won't be triggered again for that DataFrame.

Refresh metadata for Dataframe while reading parquet file

I am trying to read a Parquet file as a DataFrame. The file is updated periodically (its path is /folder_name). Whenever new data arrives, the old Parquet path (/folder_name) is renamed to a temp path, then we union the new data with the old data and store the result in the old path (/folder_name).
Suppose we have a Parquet file at hdfs://folder_name/part-xxxx-xxx.snappy.parquet before the update; after the update it becomes hdfs://folder_name/part-00000-yyyy-yyy.snappy.parquet.
The issue occurs when I try to read the Parquet file while the update is in progress.
sparksession.read.parquet("filename") => it picks up the old path hdfs://folder_name/part-xxxx-xxx.snappy.parquet (the path exists at that point)
When an action is called on the DataFrame, it tries to read the data from hdfs://folder_name/part-xxxx-xxx.snappy.parquet, but because of the update the file name has changed, and I get the error below:
java.io.FileNotFoundException: File does not exist: hdfs://folder_name/part-xxxx-xxx.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am using Spark 2.2
Can anyone help me refresh the metadata?
That error occurs when you try to read a file that doesn't exist.
Correct me if I'm wrong but I suspect you are overwriting all the files when you save the new dataframe (using .mode("overwrite")). While this process is running you are trying to read a file that was deleted and that exception is thrown - this makes the table unavailable for a period of time (during the update).
As far as I know there is no direct way of "refreshing the metadata" as you want.
Two (of several possible) ways of solving this:
1 - Use append mode
If you just want to append the new DataFrame to the old one, there is no need to create a temporary folder and overwrite the old one. You can just change the save mode from overwrite to append. This way you can add new part files to an existing Parquet dataset without having to rewrite the existing ones.
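As a minimal sketch of the append approach (the object and method names are hypothetical, and the path is whatever folder the readers already use), the write could look like this:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object AppendExample {
  // Hypothetical helper: writes newData as additional part files under `path`.
  // SaveMode.Append does not delete or rewrite the part files that are already there.
  def appendToParquet(newData: DataFrame, path: String): Unit =
    newData.write.mode(SaveMode.Append).parquet(path)
}
```

Calling AppendExample.appendToParquet(newDf, "/folder_name") repeatedly only ever adds files, which is why readers do not hit a missing part file in this mode.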
This is by far the simplest solution and there is no need to read the data that was already stored. This, however, won't work if you have to update the old data (ex: if you are doing an upsert). For that you have option 2:
2 - Use a Hive view
You can create hive tables and use a view to point to the most recent (and available) one.
Here is an example on the logic behind this approach:
Part 1
If the view <table_name> does not exist we create a new table called
<table_name>_alpha0 to store the new data
After creating the table
we create a view <table_name> as select * from
Part 2
If the view <table_name> exists we need to see to which table it is pointing (<table_name>_alphaN)
You do all the operations you want with the new data save it as a table named <table_name>_alpha(N+1)
After creating the table we alter the view <table_name> to select * from <table_name>_alpha(N+1)
And a code example:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._
import spark.implicits._
//This method verifies if the view exists and returns the table it is pointing to (using the query 'describe formatted')
def getCurrentTable(spark: SparkSession, databaseName: String, tableName: String): Option[String] = {
  if (spark.catalog.tableExists(s"${databaseName}.${tableName}")) {
    val rdd_desc = spark.sql(s"describe formatted ${databaseName}.${tableName}")
      .filter("col_name == 'View Text'")
      .rdd
    if (rdd_desc.isEmpty()) {
      None
    } else {
      //The second column of the 'View Text' row holds the SQL text of the view
      Option(rdd_desc.first().getString(1).stripPrefix("select * from "))
    }
  } else None
}
//This method saves a dataframe in the next "alpha table" and updates the view. It maintains 'rounds' tables (default=3). I.e. if the current table is alpha2, the next one will be alpha0 again.
def saveDataframe(spark: SparkSession, databaseName: String, tableName: String, new_df: DataFrame, rounds: Int = 3): Unit = {
  val currentTable = getCurrentTable(spark, databaseName, tableName).getOrElse(s"${databaseName}.${tableName}_alpha${rounds-1}")
  //asDigit gives the numeric value of the trailing character (toInt would give its character code)
  val nextAlphaTable = currentTable.replace(s"_alpha${currentTable.last}", s"_alpha${(currentTable.last.asDigit + 1) % rounds}")
  //Write step reconstructed from the description above; the parquet format is an assumption
  new_df.write.mode("overwrite").format("parquet").saveAsTable(nextAlphaTable)
  spark.sql(s"create or replace view ${databaseName}.${tableName} as select * from ${nextAlphaTable}")
}
//An example on how to use this:
//SparkSession: spark
val df = Seq((1,"I"),(2,"am"),(3,"a"),(4,"dataframe")).toDF("id","text")
val new_data = Seq((5,"with"),(6,"new"),(7,"data")).toDF("id","text")
val dbName = "test_db"
val tableName = "alpha_test_table"
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
println("Saving dataframe")
saveDataframe(spark, dbName, tableName, df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
val processed_df = df.unionByName(new_data) //Or other operations you want to do
println("Saving new dataframe")
saveDataframe(spark, dbName, tableName, processed_df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
Current table: Table does not exist
Saving dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha0
+---+---------+
| id|     text|
+---+---------+
|  3|        a|
|  4|dataframe|
|  1|        I|
|  2|       am|
+---+---------+
Saving new dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha1
+---+---------+
| id|     text|
+---+---------+
|  3|        a|
|  4|dataframe|
|  5|     with|
|  6|      new|
|  7|     data|
|  1|        I|
|  2|       am|
+---+---------+
By doing this you can guarantee that a version of the view <table_name> will always be available. This also has the advantage (or not, depending on your case) of maintaining the previous versions of the table, i.e. the previous version of <table_name>_alpha1 will be <table_name>_alpha0.
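The suffix rotation can be checked on its own, without Spark. This is only a sketch: the object and method names are hypothetical, and it assumes a single trailing digit (so at most 10 rounds). It reads the digit with asDigit, which returns the numeric value of the character.

```scala
object AlphaRotation {
  // Hypothetical helper: given "<db>.<table>_alphaN", return the name with the next suffix,
  // wrapping back to _alpha0 after _alpha(rounds - 1).
  def nextAlphaTable(currentTable: String, rounds: Int = 3): String = {
    val current = currentTable.last.asDigit // '2' -> 2
    val next = (current + 1) % rounds
    currentTable.stripSuffix(s"_alpha$current") + s"_alpha$next"
  }
}
```

With the default of three rounds, alpha0 is followed by alpha1, alpha1 by alpha2, and alpha2 wraps around to alpha0, so the oldest table is the one that gets overwritten.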
3 - A bonus
If upgrading your Spark version is an option, take a look at Delta Lake (minimum Spark version: 2.4.2)
Hope this helps :)
Cache the parquet first, then do overwrite.
var tmp = sparkSession.read.parquet("path/to/parquet_1").cache()
tmp.write.mode(SaveMode.Overwrite).parquet("path/to/parquet_1") // same path
The error is thrown because Spark evaluates lazily. When the DAG is executed on the write command, it starts reading the Parquet files and overwriting them at the same time.
Spark doesn't have a transaction manager like Zookeeper to take locks on files, hence concurrent reads and writes are a challenge that needs to be taken care of separately.
To refresh the catalog you can do the following:
spark.sql(s"REFRESH TABLE $tableName")
A simple solution would be to use df.cache.count to bring in memory first, then do union with new data and write to /folder_name with mode overwrite. You won't have to use temp path in this case.
You mentioned that you are renaming the /folder_name to some temp path. So you should read the old data from that temp path rather than hdfs://folder_name/part-xxxx-xxx.snappy.parquet.
From reading your question, I think this might be your issue; if so, you should be able to run your code without using Delta Lake. In the use case below, Spark runs the code as follows: (1) load inputDF and store locally the file names found in the folder location [in this case the explicit part file names]; (2a) reach line 2 and overwrite the files within tempLocation; (2b) load the contents of inputDF and output them to tempLocation; (3) follow the same steps as in 1, but on tempLocation; (4a) delete the files within the inputLocation folder; and (4b) try to load the part files cached in 1 in order to load the data from inputDF and run the union, then break because the files no longer exist.
val inputDF = spark.read.format("parquet").load(inputLocation)
val tempDF = spark.read.format("parquet").load(tempLocation)
val outputDF = inputDF.unionAll(tempDF)
From my experience, you can follow two pathways: persistence, or temporarily outputting everything used for the overwrite.
In the use case below, we load inputDF, immediately save it as another element, and persist it. When the action follows, the persist applies to the data and not to the file paths within the folder.
Alternatively, you can persist outputDF, which will have roughly the same effect. Because the persistence is tied to the data and not to the file paths, destroying the inputs will not cause the file paths to be missing during the overwrite.
val inputDF = spark.read.format("parquet").load(inputLocation)
val inputDF2 = inputDF.persist
val tempDF = spark.read.format("parquet").load(tempLocation)
val outputDF = inputDF2.unionAll(tempDF)
outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
Temporary load
Instead of loading the temporary output as input to the union, if you write outputDF in full to a temporary location and reload it from there for the final output, you shouldn't see the file-not-found error.
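A minimal sketch of the temporary-load pathway, assuming a staging location separate from inputLocation; the object, method and stagingLocation names are hypothetical and not part of the original question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StagedOverwrite {
  // Hypothetical helper: materialise the result away from the input folder first,
  // then overwrite the input folder from that staged copy.
  def overwriteViaStaging(spark: SparkSession,
                          outputDF: DataFrame,
                          stagingLocation: String,
                          inputLocation: String): Unit = {
    // Runs the whole plan (including the read of inputLocation) before anything is deleted there.
    outputDF.write.format("parquet").mode("overwrite").save(stagingLocation)
    // This DataFrame depends only on files under stagingLocation.
    val staged = spark.read.format("parquet").load(stagingLocation)
    staged.write.format("parquet").mode("overwrite").save(inputLocation)
  }
}
```

The cost is writing the data twice, which is the trade-off against persisting it in memory.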

Spark job reading the sorted AVRO files in dataframe, but writing to kafka without order

I have AVRO files sorted with ID and each ID has folder called "ID=234" and data inside the folder is in AVRO format and sorted on the basis of date.
I am running spark job which takes input path and reads avro in dataframe. This dataframe then writes to kafka topic with 5 partition.
val properties: Properties = getProperties(args)
val spark = SparkSession.builder().master(properties.getProperty("master")).getOrCreate()
val sqlContext = spark.sqlContext
val sourcePath = properties.getProperty("sourcePath")
val dataDF = sqlContext.read.avro(sourcePath).as("data")
val count = dataDF.count();
val schemaRegAdd = properties.getProperty("schemaRegistry")
val schemaRegistryConfs = Map(
SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> schemaRegAdd,
SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME
)
val start = Instant.now
dataDF.select(functions.struct(properties.getProperty("message.key.name")).alias("key"), functions.struct("*").alias("value"))
.toConfluentAvroWithPlainKey(properties.getProperty("topic"), properties.getProperty("schemaName"),
My use case is to write all messages for each ID (sorted by date) sequentially, so that all the sorted data for ID 1 is added first, then the data for ID 2, and so on. The Kafka message has the ID as its key.
Don't forget that the data inside an RDD/Dataset is shuffled when you apply transformations, so you lose the order.
The best way to achieve this is to read the files one by one and send each to Kafka, instead of reading the full directory given by val sourcePath = properties.getProperty("sourcePath").
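If you go the one-folder-at-a-time route, the folders have to be visited in numeric ID order rather than lexicographic order. A minimal sketch, assuming folder names look exactly like ID=234 as in the question; the object and method are hypothetical helpers, and each returned folder would then be read and sent to Kafka in turn:

```scala
object IdFolderOrder {
  private val IdFolder = """ID=(\d+)""".r

  // Keeps only names of the form ID=<digits> and orders them by the numeric ID,
  // so that ID=2 comes before ID=10.
  def sortById(folderNames: Seq[String]): Seq[String] =
    folderNames
      .collect { case name @ IdFolder(id) => (id.toLong, name) }
      .sortBy(_._1)
      .map(_._2)
}
```

Note that Kafka itself only preserves order within a single partition, so with 5 partitions the cross-ID order is only observable per partition.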

How to handle small file problem in spark structured streaming?

I have a scenario in my project where I am reading Kafka topic messages using spark-sql version 2.4.1. I am able to process the data using Structured Streaming. Once the data is received and processed, I need to save it into the respective Parquet files in an HDFS store.
I am able to store and read the Parquet files. I set a trigger time of between 15 seconds and 1 minute. The resulting files are very small, so there are many of them.
These Parquet files need to be read later by Hive queries.
1) Does this strategy work in a production environment, or does it lead to a small-file problem later?
2) What are the best practices (i.e. the industry standard) for handling or designing this kind of scenario?
3) How are these kinds of things generally handled in production?
Thank you.
I know this question is quite old. I had a similar problem, and I used Spark Structured Streaming query listeners to solve it.
My use case is fetching data from kafka & storing in hdfs with year, month, day & hour partitions.
Below code will take previous hour partition data, apply repartitioning & overwrite data in existing partition.
val session = SparkSession.builder().master("local[2]").enableHiveSupport().getOrCreate()
class AppListener(config: Config,spark: SparkSession) extends StreamingQueryListener {
override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
this.synchronized {AppListener.mergeFiles(event.progress.timestamp,spark,config)}
override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
object AppListener {
def mergeFiles(currentTs: String,spark: SparkSession,config:Config):Unit = {
val configs = config.kafka(config.key.get)
if(currentTs.datetime.isAfter(Processed.ts.plusMinutes(5))) {
|Current Timestamp : ${currentTs}
|Merge Files : ${Processed.ts.minusHours(1)}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val ts = Processed.ts.minusHours(1)
val hdfsPath = s"${configs.hdfsLocation}/year=${ts.getYear}/month=${ts.getMonthOfYear}/day=${ts.getDayOfMonth}/hour=${ts.getHourOfDay}"
val path = new Path(hdfsPath)
if(fs.exists(path)) {
val hdfsFiles = fs.listLocatedStatus(path)
.filter(lfs => lfs.isFile && !lfs.getPath.getName.contains("_SUCCESS"))
|Total files in HDFS location : ${hdfsFiles.length}
| ${hdfsFiles.length > 1}
if(hdfsFiles.length > 1) {
|Merge Small Files
|HDFS Path : ${hdfsPath}
|Total Available files : ${hdfsFiles.length}
|Status : Running
val df = spark.read.format(configs.writeFormat).load(hdfsPath).cache()
Processed.ts = Processed.ts.plusHours(1).toDateTime("yyyy-MM-dd'T'HH:00:00")
|Merge Small Files
|HDFS Path : ${hdfsPath}
|Total files : ${hdfsFiles.length}
|Status : Completed
def apply(config: Config,spark: SparkSession): AppListener = new AppListener(config,spark)
object Processed {
var ts: DateTime = DateTime.now(DateTimeZone.forID("UTC")).toDateTime("yyyy-MM-dd'T'HH:00:00")
Sometimes the data is huge, and I divide it into multiple files using the logic below. File size will be around ~160 MB.
val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
val dataSize = bytes.toLong
val numPartitions = (bytes.toLong./(1024.0)./(1024.0)./(10240)).ceil.toInt
df.repartition(if(numPartitions == 0) 1 else numPartitions)
Using spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes we can get the size of the actual DataFrame once it is loaded into memory; for example, see the code below.
scala> val df = spark.read.format("orc").load("/tmp/srinivas/")
df: org.apache.spark.sql.DataFrame = [channelGrouping: string, clientId: string ... 75 more fields]
scala> import org.apache.commons.io.FileUtils
import org.apache.commons.io.FileUtils
scala> val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
bytes: BigInt = 763275709
scala> FileUtils.byteCountToDisplaySize(bytes.toLong)
res5: String = 727 MB
scala> import sys.process._
import sys.process._
scala> "hdfs dfs -ls -h /tmp/srinivas/".!
Found 2 items
-rw-r----- 3 svcmxns hdfs 0 2020-04-20 01:46 /tmp/srinivas/_SUCCESS
-rw-r----- 3 svcmxns hdfs 727.4 M 2020-04-20 01:46 /tmp/srinivas/part-00000-9d0b72ea-f617-4092-ae27-d36400c17917-c000.snappy.orc
res6: Int = 0
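The partition-count arithmetic can be isolated in a small function. This is a sketch, not the code from the listener above: the names are hypothetical, and the target size is a parameter (defaulting to 160 MiB, taken from the "~160 MB" mentioned in the text) rather than the constant used in the snippet.

```scala
object PartitionSizing {
  val Mib: Long = 1024L * 1024L

  // Number of output files needed so that each is at most targetFileBytes,
  // never returning fewer than 1.
  def numPartitions(sizeInBytes: BigInt, targetFileBytes: Long = 160L * Mib): Int = {
    val n = math.ceil(sizeInBytes.toDouble / targetFileBytes).toInt
    math.max(1, n)
  }
}
```

For the 763275709 bytes reported in the transcript, PartitionSizing.numPartitions(BigInt(763275709L)) gives 5 with the default target, and the result can be passed to df.repartition.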
We had a similar problem, too. After a lot of Googling, it seemed the generally accepted way was to write another job that every so often aggregates the many small files and writes them elsewhere in larger, consolidated files. This is what we now do.
As an aside: there is a limit to what you can do here anyway as the more parallelism you have, the greater the number of files because each executor thread writes to its own file. They never write to a shared file. This appears to be the nature of the beast that is parallel processing.
This is a common burning question in Spark streaming, with no fixed answer.
I took an unconventional approach based on the idea of append.
As you are using Spark 2.4.1, this solution will be helpful.
If append were supported in columnar file formats like Parquet or ORC, it would be easier: new data could be appended to the same file, and the file would grow bigger after every micro-batch.
However, as it is not supported, I took a versioning approach to achieve this. After every micro-batch, the data is produced with a version partition.
In every micro-batch, we read the old version's data, union it with the new streaming data, and write it again at the same path with a new version. Then we delete the old versions. This way, after every micro-batch, there is a single version and a single file in every partition. The files in each partition keep growing.
As a union of a streaming Dataset and a static Dataset isn't allowed, we can use the foreachBatch sink (available in Spark >= 2.4.0) to convert the streaming Dataset to a static one.
I have described how to achieve this optimally in the link. You might want to have a look.
You can set a trigger.
.option("checkpointLocation", "path/to/checkpoint/dir")
.option("path", "path/to/destination/dir")
.trigger(Trigger.ProcessingTime("30 seconds"))
The longer the trigger interval, the larger the files.
Alternatively, you could run the job with a scheduler (e.g. Airflow) and the trigger Trigger.Once() or, better, Trigger.AvailableNow(). The job then runs only once per period and processes all the available data, producing appropriately sized files.

How to read data from a csv file as a stream

I have the following table:
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I am able to sort the entries as a batch process.
scala> dataDS.sort(col("count")).show(100);
I now want to try if I can do the same using streaming. To do this, I suppose I will have to read the file as a stream.
scala> val staticSchema = dataDS.schema;
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))
scala> val dataStream = spark.
| readStream.
| schema(staticSchema).
| option("header","true").
| csv("data/flight-data/csv/2015-summary.csv");
dataStream: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
scala> dataStream.isStreaming;
res245: Boolean = true
But I am not able to progress further on how to read the data as a stream.
I have executed the sort transformation:
scala> dataStream.sort(col("count"));
res246: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I suppose now I should use Dataset's writeStream method. I ran the following two commands but both returned errors.
scala> dataStream.sort(col("count")).writeStream.
| format("memory").
| queryName("sorted_data").
| outputMode("complete").
| start();
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
and this one
scala> dataStream.sort(col("count")).writeStream.
| format("memory").
| queryName("sorted_data").
| outputMode("append").
| start();
org.apache.spark.sql.AnalysisException: Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;
From the errors, it seems I should be aggregating (group) data but I thought I don't need to do it as I can run any batch operation as a stream.
How can I understand how to sort data which arrives as a stream?
Unfortunately what the error messages tell you is accurate.
Sorting is supported only in complete mode (i.e. when each window returns the complete dataset).
Complete mode requires aggregation (otherwise it would require unbounded memory; see "Why does Complete output mode require aggregation?").
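Aggregating before sorting is what the two error messages ask for. A minimal sketch, assuming the column names from the question; the object, method and query names are hypothetical, and it sorts per-destination totals rather than the raw rows:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.streaming.StreamingQuery

object SortedTotals {
  // Aggregate first, then sort: the result has one row per destination country.
  def totalsByDestination(flights: DataFrame): DataFrame =
    flights
      .groupBy("DEST_COUNTRY_NAME")
      .agg(sum("count").as("total"))
      .orderBy(col("total").desc)

  // Sorting is accepted here because the input is an aggregation written in complete mode.
  def startSortedTotals(dataStream: DataFrame): StreamingQuery =
    totalsByDestination(dataStream).writeStream
      .format("memory")
      .queryName("sorted_totals")
      .outputMode("complete")
      .start()
}
```

After SortedTotals.startSortedTotals(dataStream), the in-memory table can be inspected with spark.sql("select * from sorted_totals").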
The point you make:
but I thought I don't need to do it as I can run any batch operation as a stream.
is not without merit, but it misses a fundamental point: Structured Streaming is not tightly bound to micro-batching.
One could easily come up with some unscalable hack:
import org.apache.spark.sql.functions._
df
  .withColumn("time", window(current_timestamp, "5 minute")) // Some time window
  .withWatermark("time", "0 seconds") // Immediate watermark
  .groupBy("time")
  .agg(sort_array(collect_list(struct($"count", $"DEST_COUNTRY_NAME", $"ORIGIN_COUNTRY_NAME"))).as("data"))
  .withColumn("data", explode($"data"))
  .select($"data.*") // Unpack the struct back into top-level columns
  .select(df.columns.map(col): _*)
