Spark Structured Streaming watermark error

Followup to this question
I have JSON streaming data in the same format as below
| A   | B                                      |
|-----|----------------------------------------|
| ABC | [{C:1, D:1}, {C:2, D:4}]               |
| XYZ | [{C:3, D:6}, {C:9, D:11}, {C:5, D:12}] |
I need to transform it to the format below
| A   | C | D  |
|-----|---|----|
| ABC | 1 | 1  |
| ABC | 2 | 4  |
| XYZ | 3 | 6  |
| XYZ | 9 | 11 |
| XYZ | 5 | 12 |
To achieve this, I performed the transformations suggested in the previous question.
val df1 = df0.select($"A", explode($"B")).toDF("A", "Bn")
val df2 = df1.withColumn("SeqNum", monotonically_increasing_id()).toDF("A", "Bn", "SeqNum")
val df3 = df2.select($"A", explode($"Bn"), $"SeqNum").toDF("A", "B", "C", "SeqNum")
val df4 = df3.withColumn("dummy", concat( $"SeqNum", lit("||"), $"A"))
val df5 = df4.select($"dummy", $"B", $"C").groupBy("dummy").pivot("B").agg(first($"C"))
val df6 = df5.withColumn("A", substring_index(col("dummy"), "||", -1)).drop("dummy")
Now I am trying to save the result to a CSV file in HDFS.
df6.withWatermark("event_time", "0 seconds")
.writeStream
.trigger(Trigger.ProcessingTime("0 seconds"))
.queryName("query_db")
.format("parquet")
.option("checkpointLocation", "/path/to/checkpoint")
.option("path", "/path/to/output")
// .outputMode("complete")
.start()
Now I get the below error.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
EventTimeWatermark event_time#223: timestamp, interval
My doubt is that I am not performing any aggregation that would require Spark to store aggregated values beyond the processing time for that row. Why do I get this error? Can I keep the watermark at 0 seconds?
Any help on this will be deeply appreciated.

As per my understanding, watermarking is required only when you are performing a window operation on event time. Spark uses watermarking to handle late data, and for that purpose it needs to keep older aggregation state around.
The following link explains this very well with example:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
I don't see any window operations in your transformations, and if that is the case then I think you can try running the streaming query without watermarking.

When grouping Spark Structured Streaming DataFrames, the watermark must already be defined on the DataFrame and must be taken into account while grouping, by including a window over the watermarked event-time column in your aggregation.
df.groupBy(col("dummy"), window(col("event_time"), "1 day"))

How do I make my many-join / many-union datasets compute faster?

I have a series of ~30 datasets that all need to be joined together for making a wide final table. This final table takes ~5 years of individual tables (one table per year) and unions them together, then joins this full history with the full history of other tables (similarly unioned) to make a big, historical, wide table.
The layout of these initial per-year tables is as follows:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
With other year tables like this:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 1 |
| key_2 | 1 |
These are then unioned together to create:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
| key_1 | 1 |
| key_2 | 1 |
Similarly, a second type of table when unioned results in the following:
table_type_2:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
| key_1 | 1 |
| key_2 | 1 |
I now want to join table_type_1 with table_type_2 on primary_key and year to yield a much wider table. I notice that this final join takes a very long time and shuffles a lot of data.
How can I make this faster?
You can use bucketing on the per-year tables over the primary_key and year columns into the exact same number of buckets to avoid an expensive exchange when computing the final join.
- output: table_type_1_year_0
  input: raw_table_type_1_year_0
  hive_partitioning: none
  bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
- output: table_type_1_year_1
  input: raw_table_type_1_year_1
  hive_partitioning: none
  bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
...
- output: table_type_2_year_0
  input: raw_table_type_2_year_0
  hive_partitioning: none
  bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
- output: table_type_2_year_1
  input: raw_table_type_2_year_1
  hive_partitioning: none
  bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
...
- output: all_tables
  input:
    - table_type_1_year_0
    - table_type_1_year_1
    ...
    - table_type_2_year_0
    - table_type_2_year_1
    ...
  hive_partitioning: none
  bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
Note: When you are picking the BUCKET_COUNT value, it's important to understand it should be optimized for the final all_tables output, not for the intermediate tables. This will mean you likely will end up with files that are quite small for the intermediate tables. This is likely to be inconsequential compared to the efficiency gains of the all_tables output since you won't have to compute a massive exchange when joining everything up; your buckets will be pre-computed and you can simply SortMergeJoin on the input files.
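If you are on plain Spark rather than a platform with declarative bucketing configuration, a rough sketch of the same idea using bucketBy (table and column names follow the example above and are illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val bucketCount = 200  // pick this for the size of the final all_tables output

// Write each side bucketed (and sorted) by the join keys, with the same bucket count.
spark.table("raw_table_type_1_year_0")
  .write
  .bucketBy(bucketCount, "primary_key", "year")
  .sortBy("primary_key", "year")
  .saveAsTable("table_type_1_year_0")

spark.table("raw_table_type_2_year_0")
  .write
  .bucketBy(bucketCount, "primary_key", "year")
  .sortBy("primary_key", "year")
  .saveAsTable("table_type_2_year_0")

// When both sides are bucketed and sorted on the join keys with the same
// bucket count, Spark can sort-merge join them without a full exchange.
val wide = spark.table("table_type_1_year_0")
  .join(spark.table("table_type_2_year_0"), Seq("primary_key", "year"))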
For an explicit example of how to write a transform that writes out a specified number of buckets, my answer over here is probably useful.
What I would advise is: first union the small datasets, then broadcast the result of that union; Spark will ship that dataset to its different nodes, which reduces the number of shuffles. The union in Spark is well optimized, so what you have to do is think about the process: select only the columns you need from the beginning, and avoid costly operations before the union, such as groupByKey, because Spark will run those operations when it executes the final plan. I would also advise avoiding Hive, because it uses the MapReduce strategy, which does not compare well with Spark SQL. You can use the following function as an example (just change the key); use Scala if you can, since it interacts directly with Spark:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, col}

// Broadcast-join the small unioned dataset (df2) against the main dataset (df1).
def map_To_cells(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df0 = df2.withColumn("key0", col("key")).drop("key")
  df1.as("main").join(
    broadcast(df0),
    df0("key0") <=> df1("key")
  ).select(col("main.*"))  // keep only the columns you actually need
}

Updated dataframe column value failed to overwrite in Hive

Consider a Hive table tbl with columns aid and bid
| aid | bid |
-------------
|     | 12  |
| 24  | 13  |
| 18  | 3   |
|     | 7   |
-------------
The requirement is that when aid is null or an empty string, aid should be overwritten with the value of bid
| aid | bid |
-------------
| 12  | 12  |
| 24  | 13  |
| 18  | 3   |
| 7   | 7   |
-------------
The code is simple
val df01 = spark.sql("select * from db.tbl")
val df02 = df01.withColumn("aid", when(col("aid").isNull || col("aid") <=> "", col("bid")) otherwise(col("aid")))
When running it in spark-shell, df02.show displayed the correct data, just like the table above.
The problem is when I write the data back to Hive:
df02.write
.format("orc")
.mode("Overwrite")
.option("header", "false")
.option("orc.compress", "snappy")
.insertInto(tbl)
There is no error, but when I validate the data with
select * from db.tbl where aid is null or aid = '' limit 10;
I can still see multiple rows returned from the query with aid being null.
How do I overwrite the data back to Hive after updating the column value as in the example above?
I would try this
import org.apache.spark.sql.SaveMode

df02.write
  .format("orc")
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .insertInto(tbl)
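Two general things worth double-checking with insertInto (this is standard Spark behaviour, not specific to this table): it resolves columns by position rather than by name, and the save mode has to be an overwrite for the existing rows to be replaced. A small sketch, with db.tbl standing in for the target table:
import org.apache.spark.sql.SaveMode

// insertInto matches columns by position, so make the order explicit.
df02.select("aid", "bid")
  .write
  .mode(SaveMode.Overwrite)   // INSERT OVERWRITE semantics; append would keep the old rows
  .insertInto("db.tbl")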

How to CREATE TABLE USING delta with Spark 2.4.4?

This is Spark 2.4.4 and Delta Lake 0.5.0.
I'm trying to create a table using the delta data source and it seems I'm missing something. Although the CREATE TABLE USING delta command worked fine, the table directory is not created and insertInto does not work.
The following CREATE TABLE USING delta worked fine, but insertInto failed.
scala> sql("""
create table t5
USING delta
LOCATION '/tmp/delta'
""").show
scala> spark.catalog.listTables.where('name === "t5").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
| t5| default| null| EXTERNAL| false|
+----+--------+-----------+---------+-----------+
scala> spark.range(5).write.option("mergeSchema", true).insertInto("t5")
org.apache.spark.sql.AnalysisException: `default`.`t5` requires that the data to be inserted have the same number of columns as the target table: target table has 0 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).;
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:341)
...
I thought I'd create the table with the columns defined, but that didn't work either.
scala> sql("""
create table t6
(id LONG, name STRING)
USING delta
LOCATION '/tmp/delta'
""").show
org.apache.spark.sql.AnalysisException: delta does not allow user-specified schemas.;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:194)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3370)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
... 54 elided
The OSS version of Delta does not have the SQL CREATE TABLE syntax yet. This will be implemented in future versions alongside Spark 3.0.
To create a Delta table, you must write out a DataFrame in Delta format. An example in Python:
df.write.format("delta").save("/some/data/path")
Here's a link to the create table documentation for Python, Scala, and Java.
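A Scala sketch of the same path-based approach (on Spark 2.4.4 with Delta 0.5.0 the data is addressed by path rather than by table name, since the SQL route is what is not supported at these versions):
// Writing the DataFrame out in Delta format is what actually creates the
// Delta table; /tmp/delta is the location used in the question.
spark.range(5)
  .write
  .format("delta")
  .mode("overwrite")
  .save("/tmp/delta")

// Read it back by path rather than by table name.
val df = spark.read.format("delta").load("/tmp/delta")
df.show()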
An example with pyspark 3.0.0 & delta 0.7.0
print(f"LOCATION '{location}")
spark.sql(f"""
CREATE OR REPLACE TABLE {TABLE_NAME} (
CD_DEVICE INT,
FC_LOCAL_TIME TIMESTAMP,
CD_TYPE_DEVICE STRING,
CONSUMTION DOUBLE,
YEAR INT,
MONTH INT,
DAY INT )
USING DELTA
PARTITIONED BY (YEAR , MONTH , DAY, FC_LOCAL_TIME)
LOCATION '{location}'
""")
Where "location" is a dir HDFS for spark cluster mode save de delta table.
tl;dr CREATE TABLE USING delta is not supported by Spark before 3.0.0 and Delta Lake before 0.7.0.
Delta Lake 0.7.0 with Spark 3.0.0 (both just released) does support the CREATE TABLE SQL command.
Be sure to "install" Delta SQL using spark.sql.catalog.spark_catalog configuration property with org.apache.spark.sql.delta.catalog.DeltaCatalog.
$ ./bin/spark-submit \
--packages io.delta:delta-core_2.12:0.7.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
scala> spark.version
res0: String = 3.0.0
scala> sql("CREATE TABLE delta_101 (id LONG) USING delta").show
++
||
++
++
scala> spark.table("delta_101").show
+---+
| id|
+---+
+---+
scala> sql("DESCRIBE EXTENDED delta_101").show(truncate = false)
+----------------------------+---------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+---------------------------------------------------------+-------+
|id |bigint | |
| | | |
|# Partitioning | | |
|Not partitioned | | |
| | | |
|# Detailed Table Information| | |
|Name |default.delta_101 | |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/delta_101| |
|Provider |delta | |
|Table Properties |[] | |
+----------------------------+---------------------------------------------------------+-------+

Is spark smart enough to avoid redundant values while performing aggregation?

I have the following Dataset
case class Department(deptId:String,locations:Seq[String])
// using spark 2.0.2
// I have a Dataset `ds` of type Department
+-------+--------------------+
|deptId | locations |
+-------+--------------------+
| d1|[delhi,kerala] |
| d1|[] |
| dp2|[] |
| dp2|[hyderabad] |
+-------+--------------------+
I intended to convert it to
// Dataset `result` of type Department itself
+-------+--------------------+
|deptId | locations |
+-------+--------------------+
| d1|[delhi,kerala] |
| dp2|[hyderabad] |
+-------+--------------------+
I do the following
val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten)
val result = ds.groupBy("deptId")
  .agg(flatten(collect_list("locations")).as("locations"))
My question is: is Spark smart enough not to shuffle around the empty locations, i.e. []?
PS: I am not sure if this is a stupid question.
Yes and no:
Yes - collect_list performs map-side aggregation, so if there are multiple values per grouping key, data will be merged before shuffle.
No - because an empty list is not the same as missing data. If that's not the desired behavior, you should filter the data first:
ds.filter(size($"locations") > 0).groupBy("deptId").agg(...)
but keep in mind that this will yield a different result if a given deptId has only empty arrays: that deptId will be dropped from the output entirely.
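Putting the two pieces together, a sketch that reuses the flatten UDF from the question (ds and Department are as defined there; on Spark 2.4+ the built-in flatten function could replace the UDF, but the question targets 2.0.2):
import org.apache.spark.sql.functions._
import spark.implicits._  // for $ and the Department encoder

val flattenSeqs = udf((xs: Seq[Seq[String]]) => xs.flatten)

val result = ds
  .filter(size($"locations") > 0)                               // drop empty location lists up front
  .groupBy("deptId")
  .agg(flattenSeqs(collect_list("locations")).as("locations"))
  .as[Department]                                               // back to a Dataset[Department]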

Spark create multiple Data frames from one Data frame

I am using Spark 2.1 with Cassandra (3.9) as the data source. C* has a big table with 50 columns, which is not a good data model for my use case, so I created split tables for each of those sensors along with the partition key and clustering key columns.
All sensor table
-----------------------------------------------------
| Device | Time | Sensor1 | Sensor2 | Sensor3 |
| dev1 | 1507436000 | 50.3 | 1 | 1 |
| dev2 | 1507436100 | 90.2 | 0 | 1 |
| dev1 | 1507436100 | 28.1 | 1 | 1 |
-----------------------------------------------------
Sensor1 table
-------------------------------
| Device | Time | value |
| dev1 | 1507436000 | 50.3 |
| dev2 | 1507436100 | 90.2 |
| dev1 | 1507436100 | 28.1 |
-------------------------------
Now I am using Spark to copy data from the old table to the new ones.
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="allsensortables", keyspace="dataks")\
.load().cache()
df.createOrReplaceTempView("data")
query = ('''select device,time,sensor1 as value from data ''' )
vgDF = spark.sql(query)
vgDF.write\
.format("org.apache.spark.sql.cassandra")\
.mode('append')\
.options(table="sensor1", keyspace="dataks")\
.save()
Copying the data one table at a time is taking a lot of time (2.1 hours for a single table). Is there any way I can select * once, create multiple DataFrames (one per sensor), and save them at once (or even sequentially)?
One issue in the code is the cache
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="allsensortables", keyspace="dataks")\
.load().cache()
Here I don't see df being used multiple times apart from the save, so the cache is counterproductive. You are reading the data, filtering it, and saving it to a separate Cassandra table. The only action happening on the DataFrame is the save and nothing else.
So there is no benefit from caching the data here. Removing the cache will give you some speed up.
To create the multiple tables, I would suggest using partitionBy to write the data to HDFS first, partitioned by sensor, and then writing it back to Cassandra, as sketched below.
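A rough Scala sketch of that flow (the question's code is PySpark; the staging path is illustrative, and the stack call assumes exactly the three sensor columns shown in the example):
// Read the wide table once.
val all = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "allsensortables", "keyspace" -> "dataks"))
  .load()

// Reshape wide -> long so there is a sensor column to partition by.
val longDf = all.selectExpr(
  "device", "time",
  "stack(3, 'sensor1', cast(sensor1 as double), 'sensor2', cast(sensor2 as double), 'sensor3', cast(sensor3 as double)) as (sensor, value)")

// One pass over Cassandra, one HDFS directory per sensor.
longDf.write
  .partitionBy("sensor")
  .parquet("/tmp/sensors_staged")

// Each sensor's partition can then be appended to its own Cassandra table.
spark.read.parquet("/tmp/sensors_staged/sensor=sensor1")
  .write
  .format("org.apache.spark.sql.cassandra")
  .mode("append")
  .options(Map("table" -> "sensor1", "keyspace" -> "dataks"))
  .save()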
