Parquet data and partition issue in Spark Structured streaming - apache-spark

I am using Spark Structured streaming; My DataFrame has the following schema
root
|-- data: struct (nullable = true)
| |-- zoneId: string (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- timeSinceLast: long (nullable = true)
|-- date: date (nullable = true)
How can I do a writeStream with Parquet format and write the data
(containing zoneId, deviceId, timeSinceLast; everything except date) and partition the data by date ? I tried the following code and the partition by clause did
not work
val query1 = df1
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("data.zoneId")
.start()

If you want to partition by date then you have to use it in partitionBy() method.
val query1 = df1
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("date")
.start()
In case if you want to partition data structured by <year>/<month>/<day> you should make sure that the date column is of DateType type and then create columns appropriately formatted:
val df = dataset.withColumn("date", dataset.col("date").cast(DataTypes.DateType))
df.withColumn("year", functions.date_format(df.col("date"), "YYYY"))
.withColumn("month", functions.date_format(df.col("date"), "MM"))
.withColumn("day", functions.date_format(df.col("date"), "dd"))
.writeStream
.format("parquet")
.option("path", "/Users/abc/hb_parquet/data")
.option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
.partitionBy("year", "month", "day")
.start()

I think you should try the method repartition which can take two kinds of arguments:
column name
number of wanted partitions.
I suggest using repartition("date") to partition your data by date.
A great link on the subject: https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

Related

How to use writestream.outputmode(append) in PySpark in combination with Groupby function on Timestamp per hour?

I have a large dataframe with records from machines that do something (event_name) and contains a timestamp:
root
|-- device_Model: string (nullable = true)
|-- application_version: string (nullable = true)
|-- location_country: string (nullable = true)
|-- event_name: string (nullable = true)
|-- data_eventTimeStamp: timestamp (nullable = true)
|-- event_count: long (nullable = true)
What I would like to do is to count the event_name by hour.
To do so I have written the following code:
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import *
## Load the data from its source.
events = spark \
.readStream \
.format("delta") \
.load(load_path)
## Timestamp per hour.
df_timestamp = events \
.withColumn("Timestamp", date_trunc('hour',to_timestamp("data_eventTimestamp","dd-MM-yyyy HH:mm:ss")))
## Groupby and count events per hour.
event_count = df_timestamp \
.groupBy("device_Model","application_version","location_country","event_name","Timestamp","event_count") \
.sum("event_count")
## Write stream to Delta Table
event_count.coalesce(1) \
.writeStream \
.trigger(once=True) \
.outputMode("complete") \
.format("delta") \
.option("checkpointLocation", "...") \
.start("...")
This code works as expected, but every time this code is trigged it will create a new
(large) .snappy.parquet file. All these files will in the end stack up and consume space in a sink.
Instead of complete mode I prefer append mode as it will first create a large .snappy.parquet file of the existing data and writes smaller snappy.parquet files with only new events from the loaded data.
However, if I use append mode it will tell me that it won't work because it lacks a watermark. Yet, I don't see how to use a watermark in this context.
Does someone know how to solve this issue? Thanks!

How can i send my structured streaming dataframe to kafka?

Hello everyone !
I'm trying to send my structured streaming dataframe to one of my kafka topics, detection.
This is the schema of the structued streaming dataframe:
root
|-- timestamp: timestamp (nullable = true)
|-- Sigma: string (nullable = true)
|-- time: string (nullable = true)
|-- duration: string (nullable = true)
|-- SourceComputer: string (nullable = true)
|-- SourcePort: string (nullable = true)
|-- DestinationComputer: string (nullable = true)
|-- DestinationPort: string (nullable = false)
|-- protocol: string (nullable = true)
|-- packetCount: string (nullable = true)
|-- byteCount: string (nullable = true)
but then i try to send the dataframe, with this method:
dfwriter=df \
.selectExpr("CAST(value AS STRING)") \
.writeStream \
.format("kafka") \
.option("checkpointLocation", "/Documents/checkpoint/logs") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("failOnDataLoss", "false") \
.option("topic", detection) \
.start()
Then i got the error :
pyspark.sql.utils.AnalysisException: cannot resolve 'value' given input columns: [DestinationComputer, DestinationPort, Sigma, SourceComputer, SourcePort, byteCount, duration, packetCount, processName, protocol, time, timestamp]; line 1 pos 5;
If i send a dataframe with juste the column value it works, i receive the data on my kafka topic consumer.
Any idea to send my dataframe with all columns ?
Thank you !
Your dataframe has no value column, as the error says.
You'd need to "embed" all columns under a value StructType column, then use a function like to_json, not CAST( .. AS STRING)
In Pyspark, that'd be something like struct(to_json(struct($"*")).as("value") within a select query
Similar question - Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe

Can't read CSV string using PySpark

The scenario is: EventHub -> Azure Databricks (using pyspark)
File format: CSV (Quoted, Pipe delimited and custom schema )
I am trying to read CSV strings comming from eventhub. Spark is successfully creating the dataframe with the proper schema, but the dataframe end up empty after every message.
I managed to do some tests outside streaming environment, and when getting the data from a file, all goes well, but it fails when the data comes from a string.
So I found some links to help me on this, but none worked:
can-i-read-a-csv-represented-as-a-string-into-apache-spark-using-spark-csv?rq=1
Pyspark - converting json string to DataFrame
Right now I have the code below:
schema = StructType([StructField("Decisao",StringType(),True), StructField("PedidoID",StringType(),True), StructField("De_LastUpdated",StringType(),True)])
body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])
df = spark.read \
.option("header", "true") \
.option("mode","FAILFAST") \
.option("delimiter","|") \
.schema(schema) \
.csv(csvData)
df.show()
Is that even possible to do with CSV files?
You can construct schema like this via Row and split on | delimiter
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Row
body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])
schemaDF = csvData\
.map(lambda x: x.split("|"))\
.map(lambda x: Row(x[0],\
x[1],\
x[2],\
x[3],\
x[4]))\
.toDF(["Decisao", "PedidoID", "De_LastUpdated", "col4", "col5"])
for i in schemaDF.take(1): print(i)
Row(Decisao='DECISAO', PedidoID='PEDIDOID', De_LastUpdated='DE_LASTUPDATED\r\n"asdasdas"', col4='"1015905177"', col5='"sdfgsfgd"')
schemaDF.printSchema()
root
|-- Decisao: string (nullable = true)
|-- PedidoID: string (nullable = true)
|-- De_LastUpdated: string (nullable = true)
|-- col4: string (nullable = true)
|-- col5: string (nullable = true)

Does Spark 2.2.0 support Streaming Self-Joins?

I understand JOINS of two different dataframes are not supported in Spark 2.2.0 but I am trying to do self-join so only one stream. Below is my code
val jdf = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "join_test")
.option("startingOffsets", "earliest")
.load();
jdf.printSchema
which print the following
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Now I run the join query below after reading through this SO post
jdf.as("jdf1").join(jdf.as("jdf2"), $"jdf1.key" === $"jdf2.key")
And I get the following Exception
org.apache.spark.sql.AnalysisException: cannot resolve '`jdf1.key`' given input columns: [timestamp, value, partition, timestampType, topic, offset, key];;
'Join Inner, ('jdf1.key = 'jdf2.key)
:- SubqueryAlias jdf1
: +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession#f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]
+- SubqueryAlias jdf2
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession#f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]
I think it will not create any difference if we try to join same streaming data frame or different dataframe.So, It will not be supported.
There are two ways to achieve it.
First, you can join static and streaming dataframe. So, read once as batch data and next as streaming df.
The second solution, you can use Kafka streams. It provides joining of streaming data.

How to load CSVs with timestamps in custom format?

I have a timestamp field in a csv file that I load to a dataframe using spark csv library. The same piece of code works on my local machine with Spark 2.0 version but throws an error on Azure Hortonworks HDP 3.5 and 3.6.
I have checked and Azure HDInsight 3.5 is also using the same Spark version so I don't think it's a problem with Spark version.
import org.apache.spark.sql.types._
val sourceFile = "C:\\2017\\datetest"
val sourceSchemaStruct = new StructType()
.add("EventDate",DataTypes.TimestampType)
.add("Name",DataTypes.StringType)
val df = spark.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","|")
.option("mode","FAILFAST")
.option("inferSchema","false")
.option("dateFormat","yyyy/MM/dd HH:mm:ss.SSS")
.schema(sourceSchemaStruct)
.load(sourceFile)
The whole exception is as follows:
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:237)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply$mcJ$sp(UnivocityParser.scala:142)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply(UnivocityParser.scala:142)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13$$anonfun$apply$2.apply(UnivocityParser.scala:142)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13.apply(UnivocityParser.scala:139)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$13.apply(UnivocityParser.scala:135)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$nullSafeDatum(UnivocityParser.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9.apply(UnivocityParser.scala:135)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9.apply(UnivocityParser.scala:134)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:215)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:187)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:304)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:304)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
... 27 more
The csv file has only one row as follows:
"EventDate"|"Name"
"2016/12/19 00:43:27.583"|"adam"
TL;DR Use timestampFormat option (not dateFormat).
I've managed to reproduce it in the latest Spark version 2.3.0-SNAPSHOT (built from the master).
// OS shell
$ cat so-43259485.csv
"EventDate"|"Name"
"2016/12/19 00:43:27.583"|"adam"
// spark-shell
scala> spark.version
res1: String = 2.3.0-SNAPSHOT
case class Event(EventDate: java.sql.Timestamp, Name: String)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Event].schema
scala> spark
.read
.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.schema(schema)
.load("so-43259485.csv")
.show(false)
17/04/08 11:03:42 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 7)
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:237)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:167)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply$mcJ$sp(UnivocityParser.scala:146)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
at scala.util.Try.getOrElse(Try.scala:79)
The corresponding line in the Spark sources is the "root cause" of the issue:
Timestamp.valueOf(s)
Having read the javadoc of Timestamp.valueOf, you can learn that the argument should be:
timestamp in format yyyy-[m]m-[d]d hh:mm:ss[.f...]. The fractional seconds may be omitted. The leading zero for mm and dd may also be omitted.
Note "The fractional seconds may be omitted" so let's cut it off by first loading the EventDate as a String and only after removing the unneeded fractional seconds convert it to Timestamp.
val eventsAsString = spark.read.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.load("so-43259485.csv")
It turns out that for fields of TimestampType type Spark uses timestampFormat option first if defined and only if not uses the code the uses Timestamp.valueOf.
It turns out the fix is just to use timestampFormat option (not dateFormat!).
val df = spark.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","|")
.option("mode","FAILFAST")
.option("inferSchema","false")
.option("timestampFormat","yyyy/MM/dd HH:mm:ss.SSS")
.schema(sourceSchemaStruct)
.load(sourceFile)
scala> df.show(false)
+-----------------------+----+
|EventDate |Name|
+-----------------------+----+
|2016-12-19 00:43:27.583|adam|
+-----------------------+----+
Spark 2.1.0
Use schema inference in CSV using inferSchema option with your custom timestampFormat.
It's important to trigger schema inference using inferSchema for timestampFormat to take effect.
val events = spark.read
.format("csv")
.option("header", true)
.option("mode","FAILFAST")
.option("delimiter","|")
.option("inferSchema", true)
.option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
.load("so-43259485.csv")
scala> events.show(false)
+-------------------+----+
|EventDate |Name|
+-------------------+----+
|2016-12-19 00:43:27|adam|
+-------------------+----+
scala> events.printSchema
root
|-- EventDate: timestamp (nullable = true)
|-- Name: string (nullable = true)
"Incorrect" initial version left for learning purposes
val events = eventsAsString
.withColumn("date", split($"EventDate", " ")(0))
.withColumn("date", translate($"date", "/", "-"))
.withColumn("time", split($"EventDate", " ")(1))
.withColumn("time", split($"time", "[.]")(0)) // <-- remove millis part
.withColumn("EventDate", concat($"date", lit(" "), $"time")) // <-- make EventDate right
.select($"EventDate" cast "timestamp", $"Name")
scala> events.printSchema
root
|-- EventDate: timestamp (nullable = true)
|-- Name: string (nullable = true)
events.show(false)
scala> events.show
+-------------------+----+
| EventDate|Name|
+-------------------+----+
|2016-12-19 00:43:27|adam|
+-------------------+----+
Spark 2.2.0
As of Spark 2.2 you can use to_timestamp function to do the string to timestamp conversion.
eventsAsString.select($"EventDate", to_timestamp($"EventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
scala> eventsAsString.select($"EventDate", to_timestamp($"EventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
+-----------------------+----------------------------------------------------+
|EventDate |to_timestamp(`EventDate`, 'yyyy/MM/dd HH:mm:ss.SSS')|
+-----------------------+----------------------------------------------------+
|2016/12/19 00:43:27.583|2016-12-19 00:43:27 |
+-----------------------+----------------------------------------------------+
I searched for this issue, and discovered the offical Github issue page https://github.com/databricks/spark-csv/pull/280 which has fixed a related bug for parsing data with custom date format. I reviewed some source codes, and according to the code to find out your issue reason which is set inferSchema with the default value false as below.
inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default
Please change inferSchema with true for your date format yyyy/MM/dd HH:mm:ss.SSS using SimpleDateFormat.

Resources