spark structured streaming: query incoming data via Hive - apache-spark

I am streaming data into Spark Structured Streaming 2.1.1 from Kafka, with a writeStream() that appends into Parquet. This works.
I can create a temporary table over the parquet files using
spark.read.parquet("/user/markteehan/interval24").registerTempTable("interval24")
However, this is only visible in the same Spark session, and the "read.parquet" must be re-run to pick up new data. Setting ".queryName()" on the writeStream doesn't create a table in the Hive metastore.
What is the best technique to run SQL dynamically on the parquet data?
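For reference, one common pattern (not part of the original post, shown only as a minimal sketch) is to declare an external Hive table over the streaming sink directory, assuming the SparkSession was built with enableHiveSupport() and a Hive metastore is configured; the column list is a placeholder for the real parquet schema:
// Register the writeStream output directory as an external table in the metastore.
// The columns below are placeholders for the actual schema of the parquet files.
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS interval24 (key STRING, value STRING) STORED AS PARQUET LOCATION '/user/markteehan/interval24'")
// Files appended later by the streaming query are picked up on read;
// if Spark has cached the file listing, refresh it first.
spark.sql("REFRESH TABLE interval24")
spark.sql("SELECT count(*) FROM interval24").show()
Once registered, the table is visible to any session that talks to the same metastore, which avoids re-running read.parquet.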

Related

Hive table requires 'repair' for every new partition while inserting parquet files using pyspark

I have the Spark conf set as:
sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")
I am using the Spark context to write the parquet files to an HDFS location:
df.write.partitionBy('asofdate').mode('append').parquet('parquet_path')
In the HDFS location the parquet files are stored under 'asofdate' partition directories, but for the Hive table I have to run 'MSCK REPAIR TABLE <tbl_name>' every day. I am looking for a way to recover the table's partitions for every new partition from the Spark script (or at the time of partition creation itself).
It's easier if you integrate Hive with Spark.
After the Hive-Spark integration is set up, you can enable Hive support while creating the SparkSession:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
Now you can access Hive tables from Spark.
You can run the repair command from Spark itself:
spark.sql("MSCK REPAIR TABLE <tbl_name>")
I would suggest writing the dataframe directly as a Hive table instead of writing it to parquet and then repairing the table:
df.write.partitionBy("<partition_column>").mode("append").format("parquet").saveAsTable("<table>")

Can we look up an HBase table in a Spark Structured Streaming process

I have a Spark Structured Streaming application which reads data from Kafka and writes to HBase and Kafka. During this process, after reading the data from Kafka, I would like to look up an HBase table. Can someone please suggest the options available, with sample code if you have one?
I have tried to use SHC (Spark HBase Connector), but using SHC to read the HBase table means iterating one data frame against another with a dynamic filter, which Spark doesn't seem to like.
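No answer was recorded here; purely as an illustration (an assumption, not from the original thread), one commonly suggested pattern on Spark 2.4+ is to resolve the lookup inside foreachBatch, loading the HBase table through SHC once per micro-batch and joining it with the batch dataframe. Broker, topic, catalog and join-key names below are placeholders:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Placeholder SHC catalog describing the HBase lookup table layout
val hbaseCatalog =
  """{"table":{"namespace":"default","name":"lookup_table"},
    |"rowkey":"key",
    |"columns":{"rowkey":{"cf":"rowkey","col":"key","type":"string"},
    |           "attr":{"cf":"cf1","col":"attr","type":"string"}}}""".stripMargin

val kafkaDF = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder brokers and topic
  .option("subscribe", "input_topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS rowkey", "CAST(value AS STRING) AS value")

kafkaDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Load the HBase table via SHC as a static dataframe for this micro-batch
    val hbaseDF = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> hbaseCatalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()
    // Enrich the batch with the lookup data; the sink below is also a placeholder
    batchDF.join(hbaseDF, Seq("rowkey"), "left_outer")
      .write.mode("append").parquet("/tmp/enriched")
  }
  .start()
Doing the HBase read inside foreachBatch keeps the lookup on a static dataframe per micro-batch, instead of trying to filter one streaming dataframe by another, which is the limitation described above.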

spark structured streaming producing .c000.csv files

I am trying to fetch data from a Kafka topic and push it to an HDFS location. I am facing the following issue.
After every Kafka message, the HDFS location is updated with part files in .c000.csv format. I have created a Hive table on top of the HDFS location, but Hive is not able to read whatever data is written by Spark Structured Streaming.
Below is the file name format produced by Spark Structured Streaming:
part-00001-abdda104-0ae2-4e8a-b2bd-3cb474081c87.c000.csv
Here is my code to insert:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

val kafkaDatademostr = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "ttt.tt.tt.tt.com:8092").option("subscribe", "demostream").option("kafka.security.protocol", "SASL_PLAINTEXT").load
val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv").select("csv.*")
val interval2 = interval.selectExpr("split(value,',')[0] as rog", "split(value,',')[1] as vol", "split(value,',')[2] as agh", "split(value,',')[3] as aght", "split(value,',')[4] as asd")
// interval2.writeStream.outputMode("append").format("console").start()
interval2.writeStream.outputMode("append").partitionBy("rog").format("csv").trigger(Trigger.ProcessingTime("30 seconds")).option("path", "hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/").start()
Can someone help me understand why it is creating files like this?
If I do dfs -cat /part-00001-ad35a3b6-8485-47c8-b9d2-bab2f723d840.c000.csv I can see my values, but Hive is not able to read it due to a format issue.
These c000 files are temporary files into which the streaming job writes its data. As you are in append mode, the Spark executor holds that writer thread open; that's why at run time you are not able to read them using the Hive serializer, even though hadoop fs -cat works.

What is the right approach to upload the data into HBase from Apache Spark?

I am working on a Spark job which reads data from Hive and stores it in HBase for real-time access. The executor makes the connection with HBase; what is the right approach to insert the data? I have thought of the following two approaches.
Which one is more appropriate, or is there any other approach?
Write data directly from the Spark job to HBase (see the sketch after this list)
Write data from Spark to HDFS and later move it to HBase
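No answer was posted for this one; as a rough sketch only (not an endorsement of either option), the first approach usually means opening an HBase connection per partition on the executors and issuing Puts with the standard HBase client API. Table, column family and column names below are placeholders:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val hiveDF = spark.table("my_db.my_table")   // placeholder Hive source table

hiveDF.rdd.foreachPartition { rows =>
  // One HBase connection per partition, created on the executor
  val conf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(conf)
  val table = connection.getTable(TableName.valueOf("target_table"))   // placeholder HBase table
  try {
    rows.foreach { row =>
      val put = new Put(Bytes.toBytes(row.getAs[String]("rowkey")))    // placeholder row key column
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
        Bytes.toBytes(row.getAs[String]("value")))                     // placeholder column family/qualifier
      table.put(put)
    }
  } finally {
    table.close()
    connection.close()
  }
}
The second approach typically trades per-record latency for bulk throughput by staging files on HDFS before loading them into HBase.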

Is it possible to use Spark with ORC file format without Hive?

I am working with HDP 2.6.4; to be more specific, Hive 1.2.1 with Tez 0.7.0 and Spark 2.2.0.
My task is simple: store data in ORC file format, then use Spark to process the data. To achieve this, I am doing the following:
Create a Hive table through HiveQL
Use Spark.SQL("select ... from ...") to load data into a dataframe
Process against the dataframe
My questions are:
1. What is Hive's role behind the scene?
2. Is it possible to skip Hive?
You can skip Hive and use Spark SQL to run the command in step 1.
In your case, Hive is defining a schema over your data and providing a query layer through which Spark and external clients can communicate.
Otherwise, Spark's ORC reader and writer (spark.read.orc / df.write.orc) exist for reading and writing dataframes directly on the filesystem.
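For example, a minimal sketch of that Hive-free path (paths and the filter column are placeholders):
// Read and write ORC directly on the filesystem, with no Hive table involved
val orcDF = spark.read.orc("/data/in/orc")                 // placeholder input path
val processed = orcDF.filter("some_column IS NOT NULL")    // placeholder transformation
processed.write.mode("overwrite").orc("/data/out/orc")     // placeholder output path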
