Spark Structured Streaming: replacing values of a column - apache-spark

I have the following dataframe
val tDataJsonDF = kafkaStreamingDFParquet
  .filter($"value".contains("tUse"))
  .filter($"value".isNotNull)
  .selectExpr("cast (value as string) as tdatajson", "cast (topic as string) as env")
  .select(from_json($"tdatajson", schema = ParquetSchema.tSchema).as("data"), $"env".as("env"))
  .select("data.*", "env")
  .select($"date", // YYYY/MM/dd
          $"time",
          $"event",
          $"serviceGroupId",
          $"userId",
          $"env")
This streaming dataframe has a column date in the format YYYY/MM/dd.
Because of this, when I use this column as a partitioning column in my parquet write, Spark creates the partition directory as date=2018%04%12 (the slashes are percent-encoded).
Is there a way I can modify the column value on the fly in the above code so that the date value is YYYY-MM-dd or YYYYMMdd?
Parquet write query:
val tunerQuery = tDataJsonDF
  .writeStream
  .format("parquet")
  .option("path", pathtodata)
  .option("checkpointLocation", pathtochkpt)
  .partitionBy("date", "env", "serviceGroupId")
  .start()

I assume you're using Spark 2.2+.
tDataJsonDF.withColumn("formatted_date", date_format(to_date(col("date"), "yyyy/MM/dd"), "yyyy-MM-dd"))
Note that the parsing pattern uses lowercase yyyy; uppercase YYYY is the week-based year and can give wrong dates.
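For example, a minimal sketch of the end-to-end idea (shown in PySpark; the functions have the same names in the Scala API), where the placeholder column formatted_date replaces the original date as the partition key and the path variables are the ones from the question:
from pyspark.sql.functions import col, date_format, to_date

# add the reformatted date and drop the original slash-formatted column
formatted_df = (tDataJsonDF
    .withColumn("formatted_date",
                date_format(to_date(col("date"), "yyyy/MM/dd"), "yyyy-MM-dd"))
    .drop("date"))

# partition the streaming parquet sink on the reformatted column
(formatted_df.writeStream
    .format("parquet")
    .option("path", pathtodata)
    .option("checkpointLocation", pathtochkpt)
    .partitionBy("formatted_date", "env", "serviceGroupId")
    .start())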

Related

Can we do a pivot when we read data from Delta in a streaming way and write to a Delta table (streaming sink), or please suggest an alternative to parse an array column

I wanted to parse an array column of the type ["key: value key1: value1 key2: value2 .. key_35: value_35\n", ""], then take these keys and use them as column names alongside the other columns, with each value being the corresponding key's value.
I am reading from the silver layer of a Delta lake and writing back to the silver layer of the Delta lake again.
This is the code I am using, but I am getting an error like Queries with streaming sources must be executed with writeStream.start(); tahoe
dfInput = (spark.readStream
    .option("maxFilesPerTrigger", 1)
    .format("delta")
    .load(path)
    .withColumn("status", col("status").getItem(0))
    .select(col("date"), col("id"), col("status"))
    .withColumn("status", trim(col("status")))
    .withColumn("status", split(col("status"), " "))
)

(dfInput.writeStream
    .trigger(processingTime='1 seconds')
    .option("checkpointLocation", "checkpointPath")
    .foreachBatch(forEachFunc)
    .start()
)

def forEachFunc(split_df, batch_id):
    dfParsed = (dfInput
        .withColumn("kv", explode(col("status")))
        .withColumn("key", concat(lit("status_"), split(col("kv"), ":")[0]))
        .withColumn("val", split(col("kv"), ":")[1]))
    (dfParsed.groupBy("date", "id")
        .pivot("key")
        .agg(first("val"))
        .write.format("delta")
        .mode("append")
        .save(destPath))
    pass
I tried pivoting inside the foreachBatch function after seeing a Stack Overflow solution (Pivot a streaming dataframe pyspark) and also this link (https://www.mssqltips.com/sqlservertip/6563/pivot-transformations-for-spark-streaming/), but I am still getting the error - Queries with streaming sources must be executed with writeStream.start(); tahoe
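That error usually means the batch function still references the streaming DataFrame: in the code above, forEachFunc pivots dfInput instead of the micro-batch DataFrame (split_df) that foreachBatch passes in, and split_df is a plain static DataFrame on which pivot is allowed. A minimal sketch of that change, reusing the names from the question (destPath is assumed to be the target Delta path):
from pyspark.sql.functions import col, concat, explode, first, lit, split

def forEachFunc(split_df, batch_id):
    # operate on the micro-batch DataFrame passed in, not on the streaming dfInput
    dfParsed = (split_df
        .withColumn("kv", explode(col("status")))
        .withColumn("key", concat(lit("status_"), split(col("kv"), ":")[0]))
        .withColumn("val", split(col("kv"), ":")[1]))
    (dfParsed.groupBy("date", "id")
        .pivot("key")
        .agg(first("val"))
        .write.format("delta")
        .mode("append")
        .save(destPath))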

PySpark dataframe not getting saved in Hive due to file format mismatch

I want to write the streaming data from a Kafka topic to a Hive table.
I am able to create dataframes by reading the Kafka topic, but the data is not getting written to the Hive table due to a file-format mismatch. I have specified dataframe.format("parquet") and the Hive table is created with STORED AS PARQUET.
Below is the code snippet:
def hive_write_batch_data(data, batchId):
    data.write.format("parquet").mode("append").saveAsTable(table)

def write_to_hive(data, kafka_sink_name):
    global table
    table = kafka_sink_name
    data.select(col("key"), col("value"), col("offset")) \
        .writeStream.foreachBatch(hive_write_batch_data) \
        .start().awaitTermination()

if __name__ == '__main__':
    kafka_sink_name = sys.argv[1]
    kafka_config = {
        ....
        ..
    }
    spark = SparkSession.builder.appName("Test Streaming").enableHiveSupport().getOrCreate()
    df = spark.readStream \
        .format("kafka") \
        .options(**kafka_config) \
        .load()
    df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "offset", "timestamp", "partition")
    write_to_hive(df1, kafka_sink_name)
Hive table is created as Parquet:
CREATE TABLE test.kafka_test(
key string,
value string,
offset bigint)
STORED AS PARQUET;
It is giving me the error:
pyspark.sql.utils.AnalysisException: "The format of the existing table test.kafka_test is `HiveFileFormat`. It doesn\'t match the specified format `ParquetFileFormat`.;"
How do I write the dataframe to the Hive table?
I dropped the Hive table and re-ran the Spark streaming job. The table got created with the correct format.
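If dropping the table is not desirable, another option (a sketch, assuming the pre-created test.kafka_test table above) is to keep the Parquet Hive table and have the batch function use insertInto instead of saveAsTable; insertInto writes through the existing table's own format, so the HiveFileFormat/ParquetFileFormat check does not come into play:
from pyspark.sql.functions import col

def hive_write_batch_data(data, batchId):
    # insertInto resolves columns by position, so select them
    # in the same order as the Hive table definition
    data.select(col("key"), col("value"), col("offset")) \
        .write.mode("append") \
        .insertInto(table)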

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
    print("Fetching now")
    streamingInputDF = (
        spark
        .readStream
        .format("delta")
        .option("maxBytesPerTrigger", 1024)
        .table(source_table)
        .where("measurementId IN (1351,1350)")
        .where("year >= '2021'")
    )
    query = (
        streamingInputDF
        .writeStream
        .outputMode("append")
        .option("checkpointLocation", "/streaming_checkpoints/5")
        .foreachBatch(customWriter)
        .start()
        .awaitTermination()
    )
    return query

def customWriter(batchDF, batchId):
    print(batchId)
    print(batchDF.count())
    batchDF.show(10)
    length = batchDF.count()
    print("batchId,batch size:", batchId, length)
If I change the where clause in streamingInputDF to add more measurementIds, the Structured Streaming job doesn't always acknowledge the change and fetch the new data values. It continues to run as if nothing has changed, whereas at times it starts fetching new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of the delta table:
col_name        data_type
measurementId   int
year            int
time            timestamp
q               smallint
v               string
"In structured streaming, will the checkpoints will keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
An additional filter condition added when re-starting your Structured Streaming query will therefore not be applied to historic records, but only to those that were added to the Delta Table after version 2.
In order to see this behavior in action, you can use the code below and analyse the content of the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"

// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)

// run this code for the first time, then add filter condition, then run again
val query = spark.readStream
  .format("delta")
  .load(deltaPath)
  .filter(col("id").isin("1")) // in the second run add "2"
  .writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", checkpointLocation)
  .start()

query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart it with the new filter condition, the condition will be applied to the new data.

How to insert a dataframe having a map column into a Hive table

I have a dataframe with multiple columns, one of which is of type map(string,string). I'm able to print this dataframe; the map column shows data such as Map("PUN" -> "Pune"). I want to write this dataframe to a Hive table (stored as avro) which has a column of the same map type.
Df.withColumn("cname", lit("Pune"))
  .withColumn("city_code_name", map(lit("PUN"), col("cname")))
  .show(false)
//table - created external hive table..stored as avro..with avro schema
After removing this map type column I'm able to save the dataframe to the Hive avro table.
The way I save to the Hive table:
spark.save - saving the avro file
spark.sql - creating the partition on the Hive table with the avro file location
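For reference, a rough sketch of that two-step flow (the table, path and partition names here are hypothetical, and it assumes the built-in avro source of Spark 2.4+; older versions would use com.databricks.spark.avro):
# Step 1: spark.save - write the avro files for one partition value (map column included)
Df.write.format("avro").mode("overwrite").save("/data/city_avro/load_date=2021-01-01")

# Step 2: spark.sql - register that location as a partition of the external Hive table
spark.sql("""
    ALTER TABLE mydb.city_avro ADD IF NOT EXISTS
    PARTITION (load_date='2021-01-01')
    LOCATION '/data/city_avro/load_date=2021-01-01'
""")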
See this test case from the Spark tests as an example:
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want a save-based option, you can try saveAsTable as shown here:
Seq(9 -> "x").toDF("i", "j")
.write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy is the way to create partitions.
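For instance, a minimal sketch (the partition column, format and path are placeholders):
(yourdataframewithmapcolumn
    .write
    .format("avro")                  # or com.databricks.spark.avro on Spark < 2.4
    .mode("append")
    .partitionBy("partition_col")    # placeholder partition column
    .save("/path/to/avro/table"))    # placeholder location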
You can achieve that with saveAsTable
Example:
Df \
    .write \
    .saveAsTable(name='tableName',
                 format='com.databricks.spark.avro',
                 mode='append',
                 path='avroFileLocation')
Change the mode option to whatever suits you

Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application working that reads from csv files into a dataframe. The dataframe can be stored to a Hive table in parquet format using the method
df.saveAsTable(tablename, mode).
The above code works fine, but I have so much data for each day that I want to dynamically partition the Hive table based on the creationdate (a column in the table).
Is there any way to dynamically partition the dataframe and store it to the Hive warehouse? I want to refrain from hard-coding the insert statement using hivesqlcontext.sql(insert into table partition by(date)....).
The question can be considered as an extension to: How to save DataFrame directly to Hive?
Any help is much appreciated.
I believe it works something like this:
df is a dataframe with year, month and other columns
df.write.partitionBy('year', 'month').saveAsTable(...)
or
df.write.partitionBy('year', 'month').insertInto(...)
I was able to write to partitioned hive table using df.write().mode(SaveMode.Append).partitionBy("colname").saveAsTable("Table")
I had to enable the following properties to make it work.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
I also faced the same thing, but I resolved it using the following tricks.
When we make any table partitioned, the partition column becomes case sensitive.
The partition column should be present in the DataFrame with the same name (case sensitive). Code:
var dbName = "your database name"
var finaltable = "your table name"

// First check if table is available or not..
if (sparkSession.sql("show tables in " + dbName).filter("tableName='" + finaltable + "'").collect().length == 0) {
  // If table is not available then it will be created for you..
  println("Table Not Present \n Creating table " + finaltable)
  sparkSession.sql("use " + dbName)
  sparkSession.sql("SET hive.exec.dynamic.partition = true")
  sparkSession.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
  sparkSession.sql("SET hive.exec.max.dynamic.partitions.pernode = 400")
  sparkSession.sql("create table " + dbName + "." + finaltable + "(EMP_ID string,EMP_Name string,EMP_Address string,EMP_Salary bigint) PARTITIONED BY (EMP_DEP STRING)")
  // Table is created, now insert the DataFrame in append mode
  df.write.mode(SaveMode.Append).insertInto(dbName + "." + finaltable)
}
It can be configured on the SparkSession in this way:
spark = SparkSession \
    .builder \
    ...
    .config("spark.hadoop.hive.exec.dynamic.partition", "true") \
    .config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict") \
    .enableHiveSupport() \
    .getOrCreate()
or you can add them to a .properties file.
The spark.hadoop prefix is needed by the Spark config (at least in 2.4), and here is how Spark sets this config:
/**
 * Appends spark.hadoop.* configurations from a [[SparkConf]] to a Hadoop
 * configuration without the spark.hadoop. prefix.
 */
def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
  SparkHadoopUtil.appendSparkHadoopConfigs(conf, hadoopConf)
}
This is what works for me. I set these settings and then put the data in partitioned tables.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
This worked for me using Python and Spark 2.1.0.
Not sure if it's the best way to do this, but it works...
# WRITE DATA INTO A HIVE TABLE
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .enableHiveSupport() \
    .getOrCreate()
### CREATE HIVE TABLE (with one row)
spark.sql("""
CREATE TABLE IF NOT EXISTS hive_df (col1 INT, col2 STRING, partition_bin INT)
USING HIVE OPTIONS(fileFormat 'PARQUET')
PARTITIONED BY (partition_bin)
LOCATION 'hive_df'
""")
spark.sql("""
INSERT INTO hive_df PARTITION (partition_bin = 0)
VALUES (0, 'init_record')
""")
###
### CREATE NON HIVE TABLE (with one row)
spark.sql("""
CREATE TABLE IF NOT EXISTS non_hive_df (col1 INT, col2 STRING, partition_bin INT)
USING PARQUET
PARTITIONED BY (partition_bin)
LOCATION 'non_hive_df'
""")
spark.sql("""
INSERT INTO non_hive_df PARTITION (partition_bin = 0)
VALUES (0, 'init_record')
""")
###
### ATTEMPT DYNAMIC OVERWRITE WITH EACH TABLE
spark.sql("""
INSERT OVERWRITE TABLE hive_df PARTITION (partition_bin)
VALUES (0, 'new_record', 1)
""")
spark.sql("""
INSERT OVERWRITE TABLE non_hive_df PARTITION (partition_bin)
VALUES (0, 'new_record', 1)
""")
spark.sql("SELECT * FROM hive_df").show() # 2 row dynamic overwrite
spark.sql("SELECT * FROM non_hive_df").show() # 1 row full table overwrite
(df1.write
    .mode("append")
    .format("ORC")
    .partitionBy("date")
    .option("path", "/hdfs_path")
    .saveAsTable("DB.Partition_tablename"))
It will create partitions from the "date" column values and will also write the Spark DataFrame as a Hive external table.
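If needed, a quick way to confirm that the partitions were registered (a sketch, using the table name from above):
spark.sql("SHOW PARTITIONS DB.Partition_tablename").show(truncate=False)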
