spark structured streaming batch data refresh issue (partition by clause) - apache-spark

I came across a problem while joining a Spark Structured Streaming data frame with a batch data frame. My scenario: I have an S3 stream that needs a left anti join with history data, which returns the records not present in history (i.e. figures out the new records), and I write these records to history as a new append (partitioned by columns on disk, not in memory).
When I refresh my partitioned history data frame, it does not get updated.
Below are two code snippets: one that works, the other that doesn't.
The only difference between the working and non-working code is the partition_by clause.
Working code (history gets refreshed):
import spark.implicits._

val inputSchema = StructType(
  Array(
    StructField("spark_id", StringType),
    StructField("account_id", StringType),
    StructField("run_dt", StringType),
    StructField("trxn_ref_id", StringType),
    StructField("trxn_dt", StringType),
    StructField("trxn_amt", StringType)
  )
)
val historySchema = StructType(
  Array(
    StructField("spark_id", StringType),
    StructField("account_id", StringType),
    StructField("run_dt", StringType),
    StructField("trxn_ref_id", StringType),
    StructField("trxn_dt", StringType),
    StructField("trxn_amt", StringType)
  )
)

val source = spark.readStream
  .schema(inputSchema)
  .option("header", "false")
  .csv("src/main/resources/Input/")

val history = spark.read
  .schema(inputSchema)
  .option("header", "true")
  .csv("src/main/resources/history/")
  .withColumnRenamed("spark_id", "spark_id_2")
  .withColumnRenamed("account_id", "account_id_2")
  .withColumnRenamed("run_dt", "run_dt_2")
  .withColumnRenamed("trxn_ref_id", "trxn_ref_id_2")
  .withColumnRenamed("trxn_dt", "trxn_dt_2")
  .withColumnRenamed("trxn_amt", "trxn_amt_2")

val readFilePersisted = history.persist()
readFilePersisted.createOrReplaceTempView("hist")

val recordsNotPresentInHist = source
  .join(
    history,
    source.col("account_id") === history.col("account_id_2") &&
      source.col("run_dt") === history.col("run_dt_2") &&
      source.col("trxn_ref_id") === history.col("trxn_ref_id_2") &&
      source.col("trxn_dt") === history.col("trxn_dt_2") &&
      source.col("trxn_amt") === history.col("trxn_amt_2"),
    "leftanti"
  )

recordsNotPresentInHist.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .mode(SaveMode.Append)
      //.partitionBy("spark_id", "account_id", "run_dt")
      .csv("src/main/resources/history/")
    val lkpChacheFileDf1 = spark.read
      .schema(inputSchema)
      .parquet("src/main/resources/history")
    val lkpChacheFileDf = lkpChacheFileDf1
    lkpChacheFileDf.unpersist(true)
    val histLkpPersist = lkpChacheFileDf.persist()
    histLkpPersist.createOrReplaceTempView("hist")
  }
  .start()

println("This is the kafka dataset:")
source
  .withColumn("Input", lit("Input-source"))
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

recordsNotPresentInHist
  .withColumn("reject", lit("recordsNotPresentInHist"))
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

spark.streams.awaitAnyTermination()
Doesn't work (history is not getting refreshed):
import spark.implicits._

val inputSchema = StructType(
  Array(
    StructField("spark_id", StringType),
    StructField("account_id", StringType),
    StructField("run_dt", StringType),
    StructField("trxn_ref_id", StringType),
    StructField("trxn_dt", StringType),
    StructField("trxn_amt", StringType)
  )
)
val historySchema = StructType(
  Array(
    StructField("spark_id", StringType),
    StructField("account_id", StringType),
    StructField("run_dt", StringType),
    StructField("trxn_ref_id", StringType),
    StructField("trxn_dt", StringType),
    StructField("trxn_amt", StringType)
  )
)

val source = spark.readStream
  .schema(inputSchema)
  .option("header", "false")
  .csv("src/main/resources/Input/")

val history = spark.read
  .schema(inputSchema)
  .option("header", "true")
  .csv("src/main/resources/history/")
  .withColumnRenamed("spark_id", "spark_id_2")
  .withColumnRenamed("account_id", "account_id_2")
  .withColumnRenamed("run_dt", "run_dt_2")
  .withColumnRenamed("trxn_ref_id", "trxn_ref_id_2")
  .withColumnRenamed("trxn_dt", "trxn_dt_2")
  .withColumnRenamed("trxn_amt", "trxn_amt_2")

val readFilePersisted = history.persist()
readFilePersisted.createOrReplaceTempView("hist")

val recordsNotPresentInHist = source
  .join(
    history,
    source.col("account_id") === history.col("account_id_2") &&
      source.col("run_dt") === history.col("run_dt_2") &&
      source.col("trxn_ref_id") === history.col("trxn_ref_id_2") &&
      source.col("trxn_dt") === history.col("trxn_dt_2") &&
      source.col("trxn_amt") === history.col("trxn_amt_2"),
    "leftanti"
  )

recordsNotPresentInHist.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .mode(SaveMode.Append)
      .partitionBy("spark_id", "account_id", "run_dt")
      .csv("src/main/resources/history/")
    val lkpChacheFileDf1 = spark.read
      .schema(inputSchema)
      .parquet("src/main/resources/history")
    val lkpChacheFileDf = lkpChacheFileDf1
    lkpChacheFileDf.unpersist(true)
    val histLkpPersist = lkpChacheFileDf.persist()
    histLkpPersist.createOrReplaceTempView("hist")
  }
  .start()

println("This is the kafka dataset:")
source
  .withColumn("Input", lit("Input-source"))
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

recordsNotPresentInHist
  .withColumn("reject", lit("recordsNotPresentInHist"))
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

spark.streams.awaitAnyTermination()
Thanks
Sri

I resolved this problem by using the unionByName function instead of reading the refreshed data back from disk.
Step 1: Read history from S3.
Step 2: Read Kafka and look up history.
Step 3: Write the processed data to disk and append it to the data frame created in step 1 using Spark's unionByName function (see the sketch after step 2).
Step 1 code (reading the history data frame):
val acctHistDF = sparkSession.read
  .schema(schema)
  .parquet(S3path)
val acctHistDFPersisted = acctHistDF.persist()
acctHistDFPersisted.createOrReplaceTempView("acctHist")
Step 2 (refreshing the history data frame with stream data):
val history = sparkSession.table("acctHist")
// unionByName returns a new DataFrame, so assign it before re-registering the view
val refreshedHistory = history.unionByName(stream)
refreshedHistory.createOrReplaceTempView("acctHist")
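Putting steps 2 and 3 together, a sketch of how this can look inside foreachBatch (the variable names reuse the ones above; the partition columns and append mode are the ones from the question):

recordsNotPresentInHist.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Step 3a: append this micro-batch to the partitioned history on disk.
    batchDF.write
      .mode(SaveMode.Append)
      .partitionBy("spark_id", "account_id", "run_dt")
      .parquet(S3path)
    // Step 3b: refresh the in-memory view by unioning the cached history
    // with the micro-batch instead of re-reading the partitioned files.
    val refreshed = sparkSession.table("acctHist").unionByName(batchDF)
    refreshed.createOrReplaceTempView("acctHist")
  }
  .start()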
Thanks
Sri

Related

Spark SQL add comment with withComment, it does not work

I want to add comments (remarks) to the columns of a dataframe and then write it to a Hive table, but it does not work; that is to say, the comments on the table's columns are not added.
I tried Spark 2.4 and Spark 3, and it does not work in either, but lower versions seem to work. I don't know why. I tried to read the source code but found nothing; if you know why, please tell me, thank you.
The code is as follows:
val personRDD: RDD[Row] = GetTestRDD.map((line: String) => {
  val arr: Array[String] = line.split(" ")
  Row(arr(0).toInt, arr(1), arr(2).toInt)
})
val schema: StructType = StructType(List(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))
val frame: DataFrame = sparkSession.createDataFrame(personRDD, schema)
println("Schema metadata before adding comments")
frame.schema.foreach((s: StructField) => println(s.name, s.metadata))
// processing after adding the comments
val commentMap: Map[String, String] = Map("id" -> "unique id", "name" -> "name", "age" -> "age")
val newSchema: Seq[StructField] = frame.schema.map((s: StructField) => {
  println(commentMap(s.name))
  s.withComment(commentMap(s.name))
})
sparkSession.createDataFrame(frame.rdd, StructType(newSchema)).repartition(10)
println("Schema metadata after adding comments")
frame.schema.foreach((s: StructField) => println(s.name, s.metadata))
The output:
Schema metadata before adding comments
(id,{})
(name,{})
(age,{})
Schema metadata after adding comments
(id,{})
(name,{})
(age,{})
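Note that withComment returns a new StructField and createDataFrame returns a new DataFrame; the snippet above discards that new DataFrame and prints frame.schema both times, so the added metadata is never actually inspected. A minimal sketch (the commented name is hypothetical) that looks at the rebuilt frame:

// withComment stores the text under the "comment" metadata key of a new
// StructField; the original frame is immutable and keeps its empty metadata.
val commented: DataFrame = sparkSession.createDataFrame(frame.rdd, StructType(newSchema))
commented.schema.foreach((s: StructField) => println(s.name, s.metadata))
// expected output: (id,{"comment":"unique id"}) and so on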

Read Excel in Spark error: InputStream of class ZipArchiveInputStream is not implementing InputStreamStatistics

I am trying to read Excel files from COS via Spark, like this:
def readExcelData(filePath: String, spark: SparkSession): DataFrame =
  spark.read
    .format("com.crealytics.spark.excel")
    .option("path", filePath)
    .option("useHeader", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "False")
    .option("addColorColumns", "False")
    .load()

def readAllFiles: DataFrame = {
  val objLst = ??? // contains the list of the file paths
  val schema = StructType(
    StructField("col1", StringType, true) ::
      StructField("col2", StringType, true) ::
      StructField("col3", StringType, true) ::
      StructField("col4", StringType, true) :: Nil
  )
  var initialDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  for (file <- objLst) {
    initialDF = initialDF.union(
      readExcelData(file, spark).select($"col1", $"col2", $"col3", $"col4"))
  }
  initialDF
}
In this code I am creating an empty dataframe first, then reading all the Excel files (by iterating over the file paths) and merging the data via a union operation.
It throws an error like this:
java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
The spark-excel version is 0.10.2.
Try removing the .show() from your original statement and converting to a dataframe first.
def readExcel(file: String): DataFrame = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "False")
  .option("addColorColumns", "False")
  .load(file) // load the given path; the file parameter was otherwise unused
val data = readExcel("path to your excel file")
data.show()
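As a side note, the mutable initialDF accumulator in readAllFiles can be avoided; a sketch under the same assumptions (objLst holds the file paths, schema is the four-column schema above):

// Fold the per-file DataFrames together instead of mutating a var;
// reduceOption handles an empty file list without a pre-built empty frame.
val allData: DataFrame = objLst
  .map(file => readExcelData(file, spark).select($"col1", $"col2", $"col3", $"col4"))
  .reduceOption(_ union _)
  .getOrElse(spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema))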

Spark Structured Streaming sink output is delayed

The Spark Structured Streaming code below collects data from Kafka into 10-second windows:
window($"timestamp", "10 seconds")
I was expecting the results to be printed on the console every 10 seconds, but I notice the sink to the console happens only every ~2 minutes or more.
May I know what I am doing wrong?
def streaming(): Unit = {
  System.setProperty("hadoop.home.dir", "/Documents/ ")
  val conf: SparkConf = new SparkConf().setAppName("Histogram").setMaster("local[8]")
  conf.set("spark.eventLog.enabled", "false")
  val sc: SparkContext = new SparkContext(conf)
  val sqlcontext = new SQLContext(sc)
  val spark = SparkSession.builder().config(conf).getOrCreate()
  import sqlcontext.implicits._
  import org.apache.spark.sql.functions.window

  val inputDf = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "wonderful")
    .option("startingOffsets", "latest")
    .load()
  import scala.concurrent.duration._
  val personJsonDf = inputDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    .withWatermark("timestamp", "500 milliseconds")
    .groupBy(
      window($"timestamp", "10 seconds")).count()
  val consoleOutput = personJsonDf.writeStream
    .outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .outputMode(OutputMode.Update())
    .start()
  consoleOutput.awaitTermination()
}

object SparkExecutor {
  val spE: SparkExecutor = new SparkExecutor()
  def main(args: Array[String]): Unit = {
    println("test")
    spE.streaming
  }
}
I think you might be missing the trigger definition for querying personJsonDf during the writeStream operation. The 2-minute period might be a default one (not sure).
The groupBy window you have defined will be used in the query, but it does not define the query's periodicity.
One way to configure this could be:
val consoleOutput = personJsonDf.writeStream
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .format("console")
  .option("truncate", "false")
  .outputMode(OutputMode.Update())
  .start()
Finally, the Trigger class contains some useful methods you may want to check out.
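A quick sketch of the common variants (availability depends on the Spark version; Continuous is experimental):

import org.apache.spark.sql.streaming.Trigger

// Fire a micro-batch every 10 seconds.
Trigger.ProcessingTime("10 seconds")
// Process all data available so far in a single micro-batch, then stop.
Trigger.Once()
// Experimental continuous processing with 1-second checkpoint intervals
// (available from Spark 2.3 onwards).
Trigger.Continuous("1 second")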
Hope it helps.

Use WHERE or FILTER when creating TempView

Is it possible to use where or filter when creating a Spark SQL TempView?
I have a Cassandra table words with:
word   | count
-------+------
apples | 20
banana | 10
I tried
%spark
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "temp", "table" -> "words"))
  .where($"count" > 10)
  .load()
  .createOrReplaceTempView("high_counted")
or
%spark
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "temp", "table" -> "words"))
  .where("count > 10")
  .load()
  .createOrReplaceTempView("high_counted")
You cannot do a WHERE or FILTER without .load()ing the table first, as #undefined_variable suggested.
Try:
%spark
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "temp", "table" -> "words"))
  .load()
  .where($"count" > 10)
df.createOrReplaceTempView("high_counted")
Alternatively, you can do a free-form query as documented here.
Spark evaluates statements lazily, and the statement above is a transformation, so nothing is read eagerly (in case you are worried that we need to filter before we load).
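Another sketch of the same idea (the wordsDf and highCounted names are mine): register the unfiltered table once and express the predicate in SQL instead.

// Register the full table, then filter at query time; `count` is backquoted
// since it collides with the SQL aggregate function name.
val wordsDf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "temp", "table" -> "words"))
  .load()
wordsDf.createOrReplaceTempView("words")
val highCounted = sqlContext.sql("SELECT word, `count` FROM words WHERE `count` > 10")
highCounted.createOrReplaceTempView("high_counted")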

Structured streaming debugging input

Is there a way for me to print out the incoming data? For example, I have a readStream on a folder looking for JSON files, but there seems to be an issue, as I am seeing nulls in the aggregation output.
val schema = StructType(
  StructField("id", LongType, false) ::
    StructField("sid", IntegerType, true) ::
    StructField("data", ArrayType(IntegerType, false), true) :: Nil)

val lines = spark.
  readStream.
  schema(schema).
  json("in/*.json")

val top1 = lines.groupBy("id").count()

val query = top1.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()
To print the data you can add a queryName in the write stream; using that queryName you can then query the output as a table. Note that this works with the memory sink (format("memory")), which stores the query's output in an in-memory table named after the queryName, rather than the console sink.
In your example:
val query = top1.writeStream
  .outputMode("complete")
  .queryName("xyz")
  .format("memory") // memory sink exposes the results as the in-memory table "xyz"
  .start()
Run this and you can display the data with a SQL query
%sql select * from xyz
or you can create a dataframe
val df = spark.sql("select * from xyz")
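To see the raw incoming rows themselves (often the quickest way to spot nulls caused by a schema mismatch), a sketch that attaches a second console sink directly to lines:

// Nulls typically appear when a JSON field does not match the declared type
// (e.g. "id" arriving as a string while the schema declares LongType).
val debugQuery = lines.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()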
