Is it possible to use Hive tables as a source or sink for Spark Structured Streaming?
I could not find Hive tables listed as a source or sink anywhere in the documentation.
My Spark configuration is:
sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")
I am writing the Parquet files to an HDFS location with the DataFrame writer:
df.write.partitionBy('asofdate').mode('append').parquet('parquet_path')
In the HDFS location, the Parquet files are stored in directories partitioned by 'asofdate', but in the Hive table I have to run 'MSCK REPAIR TABLE <tbl_name>' every day. I am looking for a way to register each new partition from the Spark script (or at the time the partition is created).
Your job becomes easier if you integrate Hive with Spark.
After setting up the Hive-Spark integration, you can enable Hive support when creating the SparkSession.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
Now you can access Hive tables from Spark.
You can run the repair command from Spark itself.
spark.sql("MSCK REPAIR TABLE <tbl_name>")
I would suggest writing the DataFrame directly as a Hive table instead of writing it to Parquet and then repairing the table.
df.write.partitionBy("<partition_column>").mode("append").format("parquet").saveAsTable("<table>")
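If repairing the whole table every day is more than you need, another option is to register only the partition that was just written. A minimal sketch, assuming the Hive table is an external table partitioned by asofdate and that the partition directories follow the asofdate=<value> layout that partitionBy produces. The helper names, table name and paths are hypothetical, and the partition value is assumed to be a plain string without quotes in it:

```python
def add_partition_sql(table, column, value, base_path):
    """Build an ALTER TABLE statement that registers one partition directory."""
    # partitionBy writes each partition under <base_path>/<column>=<value>
    location = "{}/{}={}".format(base_path.rstrip("/"), column, value)
    return (
        "ALTER TABLE {t} ADD IF NOT EXISTS "
        "PARTITION ({c}='{v}') LOCATION '{loc}'"
    ).format(t=table, c=column, v=value, loc=location)


def register_partition(spark, table, column, value, base_path):
    """Run the statement through a Hive-enabled SparkSession supplied by the caller."""
    spark.sql(add_partition_sql(table, column, value, base_path))
```

You could call register_partition(spark, "<tbl_name>", "asofdate", "<date>", "<parquet_path>") right after the df.write call, so the metastore learns about the new partition in the same job.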
I have data in Hive tables. I want to apply a bunch of transformations before loading that data into Druid. There are a few ways to do this, but I'm not sure about them.
1. Save the table after applying the transformations and then bulk load it through the Hadoop ingestion method. But I want to avoid the extra write on the server.
2. Use Tranquility. But it is for Spark Streaming and only for Scala and Java, not for Python. Am I right about this?
Is there any other way I can achieve this?
You can achieve this by using the Druid Kafka integration.
I think you should read the data from the tables in Spark, apply the transformations, and then write the result to a Kafka topic.
Once you set up the Druid Kafka integration, it will read the data from Kafka and push it to a Druid datasource.
Here is the documentation on the Druid Kafka integration: https://druid.apache.org/docs/latest/tutorials/tutorial-kafka.html
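The Spark side of this could look roughly like the following. This is only a sketch, assuming Spark 2.2 or later with the spark-sql-kafka-0-10 package available to the job. The function names, broker address, topic and table name are placeholders, and each row is serialized as a JSON string on the assumption that Druid is configured to parse JSON messages:

```python
def kafka_options(bootstrap_servers, topic):
    """Options for Spark's built-in Kafka sink."""
    return {"kafka.bootstrap.servers": bootstrap_servers, "topic": topic}


def write_table_to_kafka(spark, table, bootstrap_servers, topic):
    """Read a Hive table, transform it, and publish each row as a JSON message."""
    df = spark.table(table)
    # Apply your transformations here (placeholder: none are applied).
    transformed = df
    (transformed
        .selectExpr("to_json(struct(*)) AS value")  # the Kafka sink expects a 'value' column
        .write
        .format("kafka")
        .options(**kafka_options(bootstrap_servers, topic))
        .save())
```

For example, write_table_to_kafka(spark, "<table>", "<broker_host>:9092", "<topic>") would be run as a batch job, with Druid consuming the topic independently.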
(Disclaimer: I am a contributor to rovio-ingest)
With rovio-ingest you can batch ingest a Hive table to Druid with Spark. This avoids the extra write.
Is it possible to use Kinesis streams as a data source for Spark structured streaming? I can't find any connector available.
Qubole have a kinesis-sql library for exactly this.
https://github.com/qubole/kinesis-sql
Then you can use the source similar to any other Spark Structured Streaming source:
val source = spark
  .readStream
  .format("kinesis")
  .option("streamName", "spark-source-stream")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("awsAccessKeyId", "<YOUR_AWS_ACCESS_KEY_ID>")
  .option("awsSecretKey", "<YOUR_AWS_SECRET_KEY>")
  .option("startingPosition", "TRIM_HORIZON")
  .load()
I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Example:
warehouse_location = '/user/hive/warehouse'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Pyspark").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark framework or in Hive's MapReduce framework?
I am just wondering how the SQL is processed: in Hive or in Spark?
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but this is no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen on Spark. You can check this in your example by running DF.count() and tracking the job in the Spark UI at http://localhost:4040.
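Another way to check is to print the physical plan for the query, which shows the operators Spark itself will run. A minimal sketch, assuming the Hive-enabled session and the hive_table from the question; the function name is hypothetical:

```python
def show_spark_plan(spark, query):
    """Print Spark's physical plan for a SQL query and return the DataFrame."""
    df = spark.sql(query)
    df.explain()  # prints the physical plan to stdout
    return df
```

Calling show_spark_plan(spark, "select * from hive_table") prints the plan without triggering a full job, so it is a cheap complement to watching the Spark UI.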
I am streaming data into Spark Structured Streaming 2.1.1 using Kafka with a writeStream() to append into parquet. This works.
I can create a temporary table over the parquet files using
spark.read.parquet("/user/markteehan/interval24").registerTempTable("interval24")
However, this is only visible in the same Spark session, and the "read.parquet" must be re-run to pick up new data. Setting ".queryName()" for the writeStream doesn't create a table in the Hive metastore.
What is the best technique to run SQL dynamically on the parquet data?