spark structured streaming parquet overwrite - apache-spark

I would like to overwrite my output path with Parquet output,
but overwrite is not among the available output modes (append, complete, update).
Is there another solution?
val streamDF = sparkSession.readStream.schema(schema).option("header","true").parquet(rawData)
val query = streamDF.writeStream.outputMode("overwrite").format("parquet").option("checkpointLocation",checkpoint).start(target)
query.awaitTermination()

Apache Spark supports only Append mode for the File Sink.
You need to write code that deletes the path, folder, or files from the file system before writing the data.
See the linked Stack Overflow answer on ForeachWriter, which should help you achieve this.
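One way to sketch the "replace the output before writing" idea is with foreachBatch, which is available from Spark 2.4 onward and hands each micro-batch to ordinary batch-writer code. The object and method names below (OverwritePerBatch, overwriteWith, start) are hypothetical, and this is an assumption about what fits the use case rather than the approach from the linked answer. Note that every micro-batch replaces whatever was under the target path, so only the latest batch survives.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.StreamingQuery

object OverwritePerBatch {
  // Returns a batch handler that replaces everything under `target`
  // with the rows of the current micro-batch.
  def overwriteWith(target: String): (DataFrame, Long) => Unit =
    (batch: DataFrame, _: Long) =>
      batch.write.mode(SaveMode.Overwrite).parquet(target)

  // Hypothetical wiring: `streamDF`, `target`, and `checkpoint` mirror the
  // names used in the question.
  def start(streamDF: DataFrame, target: String, checkpoint: String): StreamingQuery =
    streamDF.writeStream
      .option("checkpointLocation", checkpoint)
      .foreachBatch(overwriteWith(target))
      .start()
}
```

With the question's variables this would be called as OverwritePerBatch.start(streamDF, target, checkpoint).awaitTermination(). Passing the handler as a typed function value, rather than an inline lambda, avoids the overload ambiguity that foreachBatch can hit under Scala 2.12.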

Related

spark structured streaming producing .c000.csv files

I am trying to fetch data from a Kafka topic and push it to an HDFS location. I am facing the following issue.
After every Kafka message, the HDFS location is updated with part files named in the .c000.csv format. I have created a Hive table on top of the HDFS location, but Hive is not able to read the data written by Spark Structured Streaming.
Below is the file name format produced by Spark Structured Streaming:
part-00001-abdda104-0ae2-4e8a-b2bd-3cb474081c87.c000.csv
Here is my code to insert:
val kafkaDatademostr = spark.readStream.format("kafka").option("kafka.bootstrap.servers","ttt.tt.tt.tt.com:8092").option("subscribe","demostream").option("kafka.security.protocol","SASL_PLAINTEXT").load
val interval=kafkaDatademostr.select(col("value").cast("string")) .alias("csv").select("csv.*")
val interval2=interval.selectExpr("split(value,',')[0] as rog" ,"split(value,',')[1] as vol","split(value,',')[2] as agh","split(value,',')[3] as aght","split(value,',')[4] as asd")
// interval2.writeStream.outputMode("append").format("console").start()
interval2.writeStream.outputMode("append").partitionBy("rog").format("csv").trigger(Trigger.ProcessingTime("30 seconds")).option("path", "hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/").start()
Can someone help me understand why it is creating files like this?
If I run dfs -cat /part-00001-ad35a3b6-8485-47c8-b9d2-bab2f723d840.c000.csv I can see my values, but Hive cannot read them because of a format issue.
These c000 files are temporary files into which the streaming job writes its data. Because you are in append mode, the Spark executor holds the writer thread, which is why you cannot read the files through the Hive serializer at run time, even though hadoop fs -cat works.

Spark Structured Streaming: join stream with data that should be read every micro batch

I have a stream from HDFS and I need to join it with my metadata, which is also in HDFS; both are Parquet.
My metadata is sometimes updated and I need to join with the most recent version, which ideally means reading the metadata from HDFS on every micro-batch.
I tried to test this, but unfortunately Spark reads the metadata once and (supposedly) caches the files, even with spark.sql.parquet.cacheMetadata=false.
Is there a way to re-read it on every micro-batch? ForeachWriter is not what I'm looking for.
Here is the code:
spark.sql("SET spark.sql.streaming.schemaInference=true")
spark.sql("SET spark.sql.parquet.cacheMetadata=false")
val stream = spark.readStream.parquet("/tmp/streaming/")
val metadata = spark.read.parquet("/tmp/metadata/")
val joinedStream = stream.join(metadata, Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()
/tmp/metadata/ is updated by Spark in append mode.
As far as I understand, when the metadata is accessed through a JDBC source (see "JDBC source and Spark Structured Streaming"), Spark queries it on every micro-batch.
As far as I have found, there are two options:
1. Create a temp view and refresh it at an interval:
metadata.createOrReplaceTempView("metadata")
and trigger the refresh in a separate thread:
spark.catalog.refreshTable("metadata")
NOTE: in this case Spark re-reads the same path only; it does not work if you need to read metadata from different folders on HDFS, e.g. folders named by timestamp.
2. Restart the stream at an interval, as Tathagata Das suggested.
This approach is not suitable for me, since my metadata might be refreshed several times per hour.
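The temp-view refresh described above could be driven by a small scheduler thread. The following is a minimal sketch under assumptions: MetadataRefresher, viewName, and periodSeconds are hypothetical names, and whether a running streaming join actually picks up the refreshed files should be verified on your Spark version.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
import org.apache.spark.sql.SparkSession

object MetadataRefresher {
  // Starts a single background thread that calls refreshTable on the given
  // view every `periodSeconds` seconds and returns the scheduler so the
  // caller can shut it down.
  def start(spark: SparkSession, viewName: String, periodSeconds: Long): ScheduledExecutorService = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit =
        // An uncaught exception would cancel all future runs, so log and continue.
        try spark.catalog.refreshTable(viewName)
        catch { case e: Exception => println(s"refresh of $viewName failed: ${e.getMessage}") }
    }
    scheduler.scheduleAtFixedRate(task, periodSeconds, periodSeconds, TimeUnit.SECONDS)
    scheduler
  }
}
```

For example, after metadata.createOrReplaceTempView("metadata"), calling MetadataRefresher.start(spark, "metadata", 300) would refresh the view every five minutes; call shutdown() on the returned scheduler when the streaming query stops.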

Spark Streaming : source HBase

Is it possible to set up a Spark Streaming job that keeps track of an HBase table and reads new or updated rows every batch? The blog post linked here says that HDFS files are among the supported sources, but it seems to use the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation on this. Is it possible to stream from HBase using a Spark StreamingContext? Any help is appreciated.
Thanks!
The linked post does the following:
It reads the streaming data, converts it into HBase Puts, and adds them to the HBase table. Up to this point it is streaming, which means the ingestion process is streaming.
The stats calculation part is, I think, batch; it uses newAPIHadoopRDD. This method treats the data being read as files. In this case the files come from HBase, which is the reason for the following input formats:
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
If you want to read updates to HBase as a stream, you need a handle on HBase's WAL (write-ahead log) at the back end, and then perform your operations on it. HBase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.

spark save and read parquet on HDFS

I am running this code:
val inputData = spark.read.parquet(inputFile)
spark.conf.set("spark.sql.shuffle.partitions",6)
val outputData = inputData.sort($"colname")
outputData.write.parquet(outputFile) //write on HDFS
When I read the contents of outputFile back from HDFS, I do not get the same number of partitions and the data is not sorted. Is this normal?
I am using Spark 2.0.
This is an unfortunate deficiency of Spark. While write.parquet saves files as part-00000.parquet, part-00001.parquet, ... , it saves no partition information, and does not guarantee that part-00000 on disk is read back as the first partition.
We have added functionality for our project to a) read back partitions in the same order (this involves doing some somewhat-unsafe partition casting and sorting based on the contained filename), and b) serialize partitioners to disk and read them back.
As far as I know, there is nothing you can do in stock Spark at the moment to solve this problem. I look forward to seeing a resolution in future versions of Spark!
Edit: My experience is in Spark 1.5.x and 1.6.x. If there is a way to do this in native Spark with 2.0, please let me know!
You should use repartition() instead. This writes the Parquet output the way you want it:
outputData.repartition(6).write.parquet(outputFile)
Then you should get the same layout when you read it back.
Parquet preserves the order of rows. You should use take() instead of show() to check the contents. take(n) returns the first n rows; it works by first reading the first partition to get an idea of the partition size and then fetching the rest of the data in batches.
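Since the answers above disagree about what survives a round trip, one hedged way to check on your own cluster is to write sorted data, read it back, print the partition count, and re-sort explicitly when order matters. This is a minimal sketch with made-up sample data; ReadBackSorted and roundTrip are hypothetical names. The assumption labeled in the code is that the read-side partition count is driven by file sizes and settings such as spark.sql.files.maxPartitionBytes rather than by the number of part files written.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ReadBackSorted {
  // Writes a small sorted DataFrame to `outputFile`, reads it back,
  // and re-sorts before collecting so the result does not depend on
  // how Spark groups the files into read partitions.
  def roundTrip(spark: SparkSession, outputFile: String): Seq[Int] = {
    import spark.implicits._
    val inputData = Seq(3, 1, 5, 2, 4).toDF("colname")
    inputData.sort(col("colname")).write.parquet(outputFile)

    val readBack = spark.read.parquet(outputFile)
    // Assumption: this count reflects file sizes and read settings,
    // not the number of partitions used at write time.
    println(s"partitions after read: ${readBack.rdd.getNumPartitions}")

    readBack.sort(col("colname")).collect().map(_.getInt(0)).toSeq
  }
}
```

Re-sorting after the read costs another shuffle, but it makes the ordering explicit instead of relying on how part files are mapped to partitions.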

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

I have a Spark DataFrame that I want to save as a partitioned Hive table. I tried the following two statements, but they don't work: I don't see any ORC files in the HDFS directory, which is empty. I can see that baseTable exists in the Hive console, but it is empty because there are no files in HDFS.
The saveAsTable() and insertInto() calls below do not work. The registerDataFrameAsTable() method works, but it creates an in-memory table and causes OOM in my use case, as I have thousands of Hive partitions to process. I am new to Spark.
dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable");
dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");
//the following works but creates in memory table and seems to be reason for OOM in my case
hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");
You have probably found your answer already, but I am posting this for others' reference: partitionBy was supported only for Parquet up to Spark 1.4; support for ORC, JSON, text, and Avro was added in version 1.5+. Please refer to the documentation below:
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrameWriter.html

Resources