spark structured streaming producing .c000.csv files - apache-spark

I am trying to fetch data from a Kafka topic and push it to an HDFS location, and I am facing the following issue.
After every Kafka message, the HDFS location is updated with part files named in the .c000.csv format. I have created a Hive table on top of the HDFS location, but Hive is not able to read any of the data written by Spark Structured Streaming.
Below is the file name format produced by Spark Structured Streaming:
part-00001-abdda104-0ae2-4e8a-b2bd-3cb474081c87.c000.csv
Here is my code to insert:
val kafkaDatademostr = spark.readStream.format("kafka").option("kafka.bootstrap.servers","ttt.tt.tt.tt.com:8092").option("subscribe","demostream").option("kafka.security.protocol","SASL_PLAINTEXT").load
val interval=kafkaDatademostr.select(col("value").cast("string")) .alias("csv").select("csv.*")
val interval2=interval.selectExpr("split(value,',')[0] as rog" ,"split(value,',')[1] as vol","split(value,',')[2] as agh","split(value,',')[3] as aght","split(value,',')[4] as asd")
// interval2.writeStream.outputMode("append").format("console").start()
interval2.writeStream.outputMode("append").partitionBy("rog").format("csv").trigger(Trigger.ProcessingTime("30 seconds")).option("path", "hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/").start()
Can someone help me understand why it is creating files like this?
If I run dfs -cat /part-00001-ad35a3b6-8485-47c8-b9d2-bab2f723d840.c000.csv I can see my values, but Hive cannot read them due to a format issue.

These c000 files are temporary files into which the streaming job writes its data. Because you are in append mode, the Spark executor holds on to the writer thread, which is why at run time you cannot read the files through the Hive serializer, even though hadoop fs -cat works.
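One more thing worth checking, as an assumption that is not part of the answer above: because the writer uses partitionBy("rog"), Spark places the part files in rog=<value> subdirectories, and Hive only sees those once the table is declared as partitioned and its partitions are registered. A minimal sketch, assuming a Hive-enabled SparkSession and that every column is a string; the table definition is hypothetical and only the path and column names come from the question:

```scala
import org.apache.spark.sql.SparkSession

object RegisterStreamPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("register-stream-partitions")
      .enableHiveSupport() // needed to talk to the Hive metastore
      .getOrCreate()

    // Hypothetical table definition: the partition column "rog" is not stored
    // inside the CSV files, so it is declared only in PARTITIONED BY.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS area.test_kafcsv (
        |  vol STRING, agh STRING, aght STRING, asd STRING)
        |PARTITIONED BY (rog STRING)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |STORED AS TEXTFILE
        |LOCATION 'hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/'""".stripMargin)

    // Scan the table location and register any rog=<value> directories
    // that the metastore does not know about yet.
    spark.sql("MSCK REPAIR TABLE area.test_kafcsv")

    spark.sql("SELECT * FROM area.test_kafcsv LIMIT 10").show()
  }
}
```

The repair step would need to be rerun whenever the stream creates a directory for a new rog value.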

Related

spark structured streaming parquet overwrite

I would like to be able to overwrite my output path with Parquet format,
but overwrite is not among the available output modes (append, complete, update).
Is there another solution here?
val streamDF = sparkSession.readStream.schema(schema).option("header","true").parquet(rawData)
val query = streamDF.writeStream.outputMode("overwrite").format("parquet").option("checkpointLocation",checkpoint).start(target)
query.awaitTermination()
Apache Spark only supports Append mode for the File Sink.
You need to write code that deletes the path, folder, or files from the file system before writing the data.
Also look at ForeachWriter, which can help you handle this case.
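The "delete before writing" suggestion could be sketched as follows. This is only an illustration under assumptions: the paths and the schema are hypothetical placeholders, and the checkpoint directory is deleted together with the target, since an old checkpoint would otherwise make the query skip input it has already processed.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object OverwriteBeforeStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("overwrite-before-stream").getOrCreate()

    // Hypothetical locations and schema, standing in for the question's variables
    val rawData = "/tmp/raw"
    val target = "/tmp/target"
    val checkpoint = "/tmp/checkpoint"
    val schema = new StructType().add("id", StringType).add("value", StringType)

    // Remove the previous output and its checkpoint so the new run starts clean
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    Seq(target, checkpoint).foreach { dir =>
      val path = new Path(dir)
      val fs = path.getFileSystem(hadoopConf)
      if (fs.exists(path)) fs.delete(path, true) // true = recursive delete
    }

    // The file sink itself still runs in append mode, the only mode it supports
    val streamDF = spark.readStream.schema(schema).parquet(rawData)
    val query = streamDF.writeStream
      .outputMode("append")
      .format("parquet")
      .option("checkpointLocation", checkpoint)
      .start(target)
    query.awaitTermination()
  }
}
```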

How to write Avro Objects to Parquet with partitions in Java ? How to append data to the same parquet?

I am using Confluent's KafkaAvroDeserializer to deserialize Avro objects sent over Kafka.
I want to write the received data to a Parquet file.
I want to be able to append data to the same Parquet file and to create a Parquet dataset with partitions.
I managed to create a Parquet file with AvroParquetWriter, but I didn't find how to add partitions or append to the same file:
Before using Avro I used Spark to write the Parquet files. With Spark, writing Parquet with partitions and using append mode was trivial. Should I try creating RDDs from my Avro objects and use Spark to create the Parquet files?
I want to write the Parquet files to HDFS.
Personally, I would not use Spark for this.
Rather I would use the HDFS Kafka Connector. Here is a config file that can get you started.
name=hdfs-sink
# List of topics to read
topics=test_hdfs
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
# increase to be the sum of the partitions for all connected topics
tasks.max=1
# the folder where core-site.xml and hdfs-site.xml exist
hadoop.conf.dir=/etc/hadoop
# the namenode url, defined as fs.defaultFS in the core-site.xml
hdfs.url=hdfs://hdfs-namenode.example.com:9000
# number of messages per file
flush.size=10
# The format to write the message values
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Setup Avro parser
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry.example.com:8081
value.converter.schemas.enable=true
schema.compatibility=BACKWARD
If you want HDFS Partitions based on a field rather than the literal "Kafka Partition" number, then refer to the configuration docs on the FieldPartitioner. If you want automatic Hive integration, see the docs on that as well.
Let's say you did want to use Spark, though. You can try AbsaOSS/ABRiS to read in an Avro DataFrame, and then you should be able to do something like df.write.format("parquet").save("/some/path") (not exact code, because I have not tried it).
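For the Spark route, the partitioned append that the question describes might look like the sketch below. The Event case class and its fields are hypothetical stand-ins for the decoded Avro records, and the Avro decoding itself is left out.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object AppendPartitionedParquet {
  // Hypothetical record shape standing in for the decoded Avro objects
  case class Event(id: String, day: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("append-partitioned-parquet").getOrCreate()
    import spark.implicits._

    val df = Seq(
      Event("a", "2019-01-01", 1.0),
      Event("b", "2019-01-02", 2.5)
    ).toDF()

    // Append mode adds new part files under /some/path/day=<value>/ and
    // leaves the files from earlier runs untouched.
    df.write
      .mode(SaveMode.Append)
      .partitionBy("day")
      .parquet("/some/path")
  }
}
```

Note that "append" here means adding files to the same dataset directory. Parquet files themselves are not extended in place.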

Spark Structured Streaming: join stream with data that should be read every micro batch

I have a stream from HDFS and I need to join it with my metadata, which is also in HDFS. Both are Parquet.
My metadata sometimes gets updated, and I need to join with the most recent version, which ideally means reading the metadata from HDFS on every micro-batch.
I tried to test this, but unfortunately Spark reads the metadata once and (supposedly) caches the files, even when I set spark.sql.parquet.cacheMetadata=false.
Is there a way to read it on every micro-batch? Foreach Writer is not what I'm looking for?
Here is a code example:
spark.sql("SET spark.sql.streaming.schemaInference=true")
spark.sql("SET spark.sql.parquet.cacheMetadata=false")
val stream = spark.readStream.parquet("/tmp/streaming/")
val metadata = spark.read.parquet("/tmp/metadata/")
val joinedStream = stream.join(metadata, Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()
/tmp/metadata/ is updated using Spark append mode.
As far as I understand, when the metadata is accessed through the JDBC source, Spark will query it on each micro-batch.
As far as I found, there are two options:
1. Create a temp view and refresh it at an interval:
metadata.createOrReplaceTempView("metadata")
and trigger the refresh in a separate thread:
spark.catalog.refreshTable("metadata")
NOTE: in this case Spark will only re-read the same path. It does not work if you need to read the metadata from different folders on HDFS, e.g. folders named by timestamp.
2. Restart the stream at an interval, as Tathagata Das suggested.
This way is not suitable for me, since my metadata might be refreshed several times per hour.
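A third possibility, which is not one of the options listed above and which assumes Spark 2.4 or later, is foreachBatch: the static read is placed inside the batch function, so the metadata path is listed again for every micro-batch. A minimal sketch using the paths from the question, with show() standing in for the real output:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinFreshMetadata {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-fresh-metadata").getOrCreate()
    spark.sql("SET spark.sql.streaming.schemaInference=true")

    val stream = spark.readStream.parquet("/tmp/streaming/")

    // Runs on the driver once per micro-batch; the metadata DataFrame is
    // rebuilt each time, so files appended since the last batch are included.
    val joinBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
      val metadata = spark.read.parquet("/tmp/metadata/")
      batch.join(metadata, Seq("id")).show()
    }

    val query = stream.writeStream
      .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
      .foreachBatch(joinBatch)
      .start()
    query.awaitTermination()
  }
}
```

Declaring the function as a typed value before passing it avoids the overload ambiguity that an inline lambda can hit with foreachBatch on Scala 2.12.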

spark structured streaming: query incoming data via Hive

I am streaming data into Spark Structured Streaming 2.1.1 using Kafka with a writeStream() to append into parquet. This works.
I can create a temporary table over the parquet files using
spark.read.parquet ("/user/markteehan/interval24" ).registerTempTable("interval24")
However, this is only visible in the same Spark session, and the "read.parquet" must be re-run to collect new data. Setting ".queryName()" for the writeStream doesn't create a table in the Hive metastore.
What is the best technique to run SQL dynamically on the parquet data?

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

I have a Spark dataframe which I want to save as Hive table with partitions. I tried the following two statements but they don't work. I don't see any ORC files in HDFS directory, it's empty. I can see baseTable is there in Hive console but obviously it's empty because of no files inside HDFS.
The following two lines, saveAsTable() and insertInto(), do not work. The registerDataFrameAsTable() method works, but it creates an in-memory table and causes OOM in my use case, as I have thousands of Hive partitions to process. I am new to Spark.
dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable");
dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");
//the following works but creates in memory table and seems to be reason for OOM in my case
hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");
I hope you have already found your answer, but I am posting this for others' reference. Up to Spark 1.4, partitionBy was only supported for Parquet. Support for ORC, JSON, text, and Avro was added in version 1.5+. Please refer to the doc below:
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrameWriter.html
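On a version where partitionBy supports ORC, the first statement from the question might be written as in the sketch below. It assumes Spark 2.x with a Hive-enabled SparkSession instead of the 1.x HiveContext used in the question, and the sample rows and the amount column are made up for illustration.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveOrcPartitioned {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save-orc-partitioned")
      .enableHiveSupport() // so that baseTable is recorded in the Hive metastore
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with the two partition columns from the question
    val dataFrame = Seq(
      ("acme", "2016-01-01", 10L),
      ("acme", "2016-01-02", 20L)
    ).toDF("entity", "date", "amount")

    // Writes ORC files under entity=<value>/date=<value>/ directories
    dataFrame.write
      .mode(SaveMode.Append)
      .partitionBy("entity", "date")
      .format("orc")
      .saveAsTable("baseTable")

    spark.table("baseTable").show()
  }
}
```

In Spark 2.x, insertInto() takes its partitioning from the existing table definition, and combining it with partitionBy() is rejected with an AnalysisException, so only the saveAsTable() form is shown here.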
