I am using spark.readStream to read data from Kafka and running an explode on the resulting dataframe.
I am trying to save the result of the explode in a Hive table and I am not able to find any solution for that.
I tried the following method but it doesn't work (it runs but I don't see any new partitions created)
val query = tradelines.writeStream.outputMode("append")
.format("memory")
.option("truncate", "false")
.option("checkpointLocation", checkpointLocation)
.queryName("tl")
.start()
sc.sql("set hive.exec.dynamic.partition.mode=nonstrict;")
sc.sql("INSERT INTO TABLE default.tradelines PARTITION (dt) SELECT * FROM tl")
Check HDFS for the dt partitions on the file system
You need to run MSCK REPAIR TABLE on the hive table to see new partitions.
If you aren't doing anything special with Spark, then it's worth pointing out that Kafka Connect HDFS is capable of registering Hive partitions directly from Kafka.
Related
I am writing the spark streaming data into hdfs partitions using pyspark.
please find the code
data = (spark.readStream.format("json").schema(fileSchema).load(inputDirectoryOfJsonFiles))
output = (data.writeStream
.format("parquet")
.partitionBy("date")
.option("compression", "none")
.option("path" , "/user/hdfs/stream-test")
.option("checkpointLocation", "/user/hdfs/stream-ckp")
.outputMode("append")
.start().awaitTermination())
After writing the data into hdfs, i am creating the hive external partition table.
CREATE EXTERNAL TABLE test (id string,record string)
PARTITIONED BY (`date` date)
STORED AS PARQUET
LOCATION '/user/hdfs/stream-test/'
TBLPROPERTIES ('discover.partitions' = 'true');
But the newly created partitions are not been recognized Hive metastore. i am updating the metastore using the msck command.
msck repair table test sync partitions
Now for the streaming data how to automate this task of updating the hive metastore with the real time partitions.
please suggest a solution to this problem.
Spark structured streaming don't natively support this, but you can use foreachBatch as workaround
val yourStream = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.load()
val query = yourStream.writeStream.foreachBatch((batchDF: DataFrame, batchId: Long) => {
batchDF
.write
.mode(SaveMode.Append)
.insertInto("your_db.your_hive_table");
}).start()
query.awaitTermination()
More details refer https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch
I'd like to load a Hive table (target_table) as a DataFrame after writing a new batch out to HDFS (target_table_dir) using Spark Structured Streaming as follows:
df.writeStream
.trigger(processingTime='5 seconds')
.foreachBatch(lambda df, partition_id:
df.write
.option("path", target_table_dir)
.format("parquet")
.mode("append")
.saveAsTable(target_table))
.start()
When we immediately read same data back from the Hive table we get a "partition not found exception". If we read with some delay, we have data correct.
It seems that Spark is still writing data to HDFS while execution has stopped and Hive Metastore is updated but data is still being written out to HDFS.
How to know when the writing of data to the Hive table (into the HDFS) is complete?
Note:
we have found that if we use processAllAvailable() after writing out,subsequent read works fine.but processAllAvailable() will block execution forever if we are dealing with continuous streams
I create an external partitioned table in hive.
in the logs it shows numinputrows. that means the query is working and sending data. but when I connect to hive using beeline and query, select * or count(*) it's always empty.
def hiveOrcSetWriter[T](event_stream: Dataset[T])( implicit spark: SparkSession): DataStreamWriter[T] = {
import spark.implicits._
val hiveOrcSetWriter: DataStreamWriter[T] = event_stream
.writeStream
.partitionBy("year","month","day")
.format("orc")
.outputMode("append")
.option("compression", "zlib")
.option("path", _table_loc)
.option("checkpointLocation", _table_checkpoint)
hiveOrcSetWriter
}
What can be the issue? I'm unable to understand.
msck repair table tablename
It give go and check the location of the table and adds partitions if new ones exits.
In your spark process add this step in order to query from hive.
Your streaming job is writing new partitions to the table_location. But the Hive metastore is not aware of this.
When you run a select query on the table, the Hive checks metastore to get list of table partitions. Since the information in Metastore is outdated, so the data don't show up in the result.
You need to run -
ALTER TABLE <TABLE_NAME> RECOVER PARTITIONS
command from Hive/Spark to update the metastore with new partition info.
currently my Spark Structured Streaming goes like this (Sink part displayed only):
//Output aggregation query to Parquet in append mode
aggregationQuery.writeStream
.format("parquet")
.trigger(Trigger.ProcessingTime("15 seconds"))
.partitionBy("date", "hour")
.option("path", "hdfs://<myip>:8020/user/myuser/spark/proyecto3")
.option("checkpointLocation", "hdfs://<myip>:8020/user/myuser/spark/checkpointfolder3")
.outputMode("append")
.start()
The above code generates .parquet files in the directory defined by path.
I have externally defined a Impala table that reads from that path, but I need the table to be updated or refreshed after every append of parquet files.
How can this be achieved?
You need to update the partitions of your table after file sink.
import spark.sql
val query1 = "ALTER TABLE proyecto3 ADD IF NOT EXISTS PARTITION (date='20200803') LOCATION '/your/location/proyecto3/date=20200803'"
sql(s"$query1")
import spark.sql
val query2 = "ALTER TABLE proyecto3 ADD IF NOT EXISTS PARTITION (hour='104700') LOCATION '/your/location/proyecto3/date=20200803/hour=104700'"
sql(s"$query2")
I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.
Context:
Running Spark 2.0.1 running local SparkSession. I have a csv dataset that I am saving as parquet file on my disk like so:
val df0 = spark
.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.option("inferSchema", false)
.load("SomeFile.csv"))
val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)
df.write
.mode(SaveMode.Overwrite)
.format("parquet")
.option("inferSchema", false)
.save("SomeFile.parquet")
I am creating 42 partitions by column numerocarte. This should group multiple numerocarte to same partition. I don't want to do partitionBy("numerocarte") at the write time because I don't want one partition per card. It would be millions of them.
After that in another script I read this SomeFile.parquet parquet file and do some operations on it. In particular I am running a window function on it where the partitioning is done on the same column that the parquet file was repartitioned by.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = spark.read
.format("parquet")
.option("header", true)
.option("inferSchema", false)
.load("SomeFile.parquet")
val w = Window.partitionBy(col("numerocarte"))
.orderBy(col("SomeColumn"))
df2.withColumn("NewColumnName",
sum(col("dollars").over(w))
After read I can see that the repartition worked as expected and DataFrame df2 has 42 partitions and in each of them are different cards.
Questions:
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
If it knows, then there will be no shuffle in the window function. True?
If it does not know, It will do a shuffle in the window function. True?
If it does not know, how do I tell Spark the data is already partitioned by the right column?
How can I check a partitioning key of DataFrame? Is there a command for this? I know how to check number of partitions but how to see partitioning key?
When I print number of partitions in a file after each step, I have 42 partitions after read and 200 partitions after withColumn which suggests that Spark repartitioned my DataFrame.
If I have two different tables repartitioned with the same column, would the join use that information?
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
It does not.
If it does not know, how do I tell Spark the data is already partitioned by the right column?
You don't. Just because you save data which has been shuffled, it does not mean, that it will be loaded with the same splits.
How can I check a partitioning key of DataFrame?
There is no partitioning key once you loaded data, but you can check queryExecution for Partitioner.
In practice:
If you want to support efficient pushdowns on the key, use partitionBy method of DataFrameWriter.
If you want a limited support for join optimizations use bucketBy with metastore and persistent tables.
See How to define partitioning of DataFrame? for detailed examples.
I am answering my own question for future reference what worked.
Following suggestion of #user8371915, bucketBy works!
I am saving my DataFrame df:
df.write
.bucketBy(250, "userid")
.saveAsTable("myNewTable")
Then when I need to load this table:
val df2 = spark.sql("SELECT * FROM myNewTable")
val w = Window.partitionBy("userid")
val df3 = df2.withColumn("newColumnName", sum(col("someColumn")).over(w)
df3.explain
I confirm that when I do window functions on df2 partitioned by userid there is no shuffle! Thanks #user8371915!
Some things I learned while investigating it
myNewTable looks like a normal parquet file but it is not. You could read it normally with spark.read.format("parquet").load("path/to/myNewTable") but the DataFrame created this way will not keep the original partitioning! You must use spark.sql select to get correctly partitioned DataFrame.
You can look inside the table with spark.sql("describe formatted myNewTable").collect.foreach(println). This will tell you what columns were used for bucketing and how many buckets there are.
Window functions and joins that take advantage of partitioning often require also sort. You can sort data in your buckets at the write time using .sortBy() and the sort will be also preserved in the hive table. df.write.bucketBy(250, "userid").sortBy("somColumnName").saveAsTable("myNewTable")
When working in local mode the table myNewTable is saved to a spark-warehouse folder in my local Scala SBT project. When saving in cluster mode with mesos via spark-submit, it is saved to hive warehouse. For me it was located in /user/hive/warehouse.
When doing spark-submit you need to add to your SparkSession two options: .config("hive.metastore.uris", "thrift://addres-to-your-master:9083") and .enableHiveSupport(). Otherwise the hive tables you created will not be visible.
If you want to save your table to specific database, do spark.sql("USE your database") before bucketing.
Update 05-02-2018
I encountered some problems with spark bucketing and creation of Hive tables. Please refer to question, replies and comments in Why is Spark saveAsTable with bucketBy creating thousands of files?