spark reduceByKeyAndWindow window has wrong interval

I'm running a Scala Spark Streaming program (streaming 2.1, Spark 1.6); the relevant part of the code is listed below.
According to the documentation, with a window size of 3 and a sliding interval of 2, the windowed stream should appear at time 3 and time 5, but I get it at time 4 and time 6. With a window size of 3 and a sliding interval of 1 the timing is correct.
val inputData: mutable.Queue[RDD[String]] = mutable.Queue()
var outputCollector = new ArrayBuffer[(String, Int)]

val inputStream = ssc.queueStream(inputData)

// Turn each line into (character, 1) pairs, dropping the commas
val patternStream: DStream[(String, Int)] = inputStream.flatMap { line =>
  line.replace(",", "").map(x => (x.toString, 1))
}

// Windowed count with an inverse reduce function; wt1 = window size, st1 = slide size (seconds)
val groupStream = patternStream.reduceByKeyAndWindow(_ + _, _ - _, Seconds(wt1), Seconds(st1))

inputStream.print()

patternStream.foreachRDD { rdd =>
  rdd.collect().foreach(print)
  println("\n")
}

groupStream.foreachRDD { rdd =>
  println("window stream")
  rdd.filter(s => s._2 > 0).sortByKey().collect().foreach(i => outputCollector += i)
}
Window size 3, sliding interval 1 (this works as expected):
Time: 1000 ms
window stream (a,1)(b,1)(f,1)(g,1)
Time: 2000 ms
window stream (a,1)(b,1)(d,1)(e,1)(f,1)(g,1)
Time: 3000 ms
window stream (a,1)(b,1)(c,1)(d,2)(e,1)(f,1)(g,1)
Time: 4000 ms
window stream (a,1)(c,1)(d,3)(e,1)
Time: 5000 ms
window stream (a,1)(c,2)(d,3)
Time: 6000 ms
window stream (a,2)(c,2)(d,2)(f,1)(g,1)
Window size 3, sliding interval 2 (the windowed stream appears at times 2, 4, and 6, which seems wrong):
Time: 1000 ms
Time: 2000 ms
window stream (a,1)(b,1)(d,1)(e,1)(f,1)(g,1)
Time: 3000 ms
############## no windowed stream here
Time: 4000 ms
window stream (a,1)(c,1)(d,3)(e,1)
Time: 5000 ms
############ no windowed stream here
Time: 6000 ms
window stream (a,2)(c,2)(d,2)(f,1)(g,1)
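For reference, this is roughly the driver setup assumed for the run above, with a 1-second batch interval and wt1 = 3, st1 = 2 (the batch interval, master, and checkpoint path are assumptions, not shown in the snippet):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed harness: 1-second batches; reduceByKeyAndWindow with an inverse
// reduce function requires a checkpoint directory.
val conf = new SparkConf().setMaster("local[2]").setAppName("window-test")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/window-test-checkpoint")  // assumed path

val wt1 = 3  // window length in seconds (assumed)
val st1 = 2  // slide interval in seconds (assumed)

// inputData, inputStream, patternStream and groupStream are defined as in the
// code above; one RDD is pushed into inputData per batch, e.g.
//   inputData += ssc.sparkContext.makeRDD(Seq("a,b,f,g"))

ssc.start()
ssc.awaitTerminationOrTimeout(10000)
ssc.stop()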

Related

Spark stage taking too long - 2 executors doing "all" the work

I've been trying to figure this out for the past day, but have not been successful.
Problem I am facing
I'm reading a Parquet file that is about 2 GB. The initial read has 14 partitions, which eventually gets split into 200 partitions. I run a seemingly simple SQL query that takes 25+ minutes, and about 22 minutes of that is spent on a single stage. Looking at the Spark UI, I see that all computation is eventually pushed to about 2 to 4 executors, with lots of shuffling. I don't know what is going on. I would appreciate any help.
Setup
Spark environment - Databricks
Cluster mode - Standard
Databricks Runtime Version - 6.4 ML (includes Apache Spark 2.4.5, Scala 2.11)
Cloud - Azure
Worker Type - 56 GB, 16 cores per machine. Minimum 2 machines
Driver Type - 112 GB, 16 cores
Notebook
Cell 1: Helper functions
load_data = function(path, type) {
  input_df = read.df(path, type)
  input_df = withColumn(input_df, "dummy_col", 1L)
  createOrReplaceTempView(input_df, "__current_exp_data")

  ## Helper function to run query, then save as table
  transformation_helper = function(sql_query, destination_table) {
    createOrReplaceTempView(sql(sql_query), destination_table)
  }

  ## Transformation 0: Calculate max date, used for calculations later on
  transformation_helper(
    "SELECT 1L AS dummy_col, MAX(Date) max_date FROM __current_exp_data",
    destination_table = "__max_date"
  )

  ## Transformation 1: Make initial column calculations
  transformation_helper(
    "
    SELECT
        cId AS cId
      , date_format(Date, 'yyyy-MM-dd') AS Date
      , date_format(DateEntered, 'yyyy-MM-dd') AS DateEntered
      , eId
      , (CASE WHEN isnan(tSec) OR isnull(tSec) THEN 0 ELSE tSec END) AS tSec
      , (CASE WHEN isnan(eSec) OR isnull(eSec) THEN 0 ELSE eSec END) AS eSec
      , approx_count_distinct(eId) OVER (PARTITION BY cId) AS dc_eId
      , COUNT(*) OVER (PARTITION BY cId, Date) AS num_rec
      , datediff(Date, DateEntered) AS analysis_day
      , datediff(max_date, DateEntered) AS total_avail_days
    FROM __current_exp_data
    CROSS JOIN __max_date ON __main_data.dummy_col = __max_date.dummy_col
    ",
    destination_table = "current_exp_data_raw"
  )

  ## Transformation 2: Drop row if Date is not valid
  transformation_helper(
    "
    SELECT
        cId
      , Date
      , DateEntered
      , eId
      , tSec
      , eSec
      , analysis_day
      , total_avail_days
      , CASE WHEN analysis_day == 0 THEN 0 ELSE floor((analysis_day - 1) / 7) END AS week
      , CASE WHEN total_avail_days < 7 THEN NULL ELSE floor(total_avail_days / 7) - 1 END AS avail_week
    FROM current_exp_data_raw
    WHERE
      isnotnull(Date) AND
      NOT isnan(Date) AND
      Date >= DateEntered AND
      dc_eId == 1 AND
      num_rec == 1
    ",
    destination_table = "main_data"
  )

  cacheTable("main_data_raw")
  cacheTable("main_data")
}

spark_sql_as_data_table = function(query) {
  data.table(collect(sql(query)))
}

get_distinct_weeks = function() {
  spark_sql_as_data_table("SELECT week FROM current_exp_data GROUP BY week")
}
Cell 2: Call helper function that triggers the long running task
library(data.table)
library(SparkR)
spark = sparkR.session(sparkConfig = list())
load_data_pq("/mnt/public-dir/file_0000000.parquet")
set.seed(1234)
get_distinct_weeks()
Long running stage DAG
Stats about long running stage
Logs
I trimmed the logs down and show below only entries that appeared multiple times:
BlockManager: Found block rdd_22_113 locally
CoarseGrainedExecutorBackend: Got assigned task 812
ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
InMemoryTableScanExec: Predicate (dc_eId#61L = 1) generates partition filter: ((dc_eId.lowerBound#622L <= 1) && (1 <= dc_eId.upperBound#621L))
InMemoryTableScanExec: Predicate (num_rec#62L = 1) generates partition filter: ((num_rec.lowerBound#627L <= 1) && (1 <= num_rec.upperBound#626L))
InMemoryTableScanExec: Predicate isnotnull(Date#57) generates partition filter: ((Date.count#599 - Date.nullCount#598) > 0)
InMemoryTableScanExec: Predicate isnotnull(DateEntered#58) generates partition filter: ((DateEntered.count#604 - DateEntered.nullCount#603) > 0)
MemoryStore: Block rdd_17_104 stored as values in memory (estimated size <VERY SMALL NUMBER < 10> MB, free 10.0 GB)
ShuffleBlockFetcherIterator: Getting 200 non-empty blocks including 176 local blocks and 24 remote blocks
ShuffleBlockFetcherIterator: Started 4 remote fetches in 1 ms
UnsafeExternalSorter: Thread 254 spilling sort data of <Between 1 and 3 GB> to disk (3 times so far)

Postgres and 1000 multiple calls

I have a PostgreSQL server v11.7, which is used 100% for local development only.
Hardware: 16-core CPU, 112 GB memory, 3 TB M.2 SSD. It runs Ubuntu 18.04, but I get about the same speed on my Windows 10 laptop when I run the exact same query locally on it.
The DB contains ~1500 tables (all with the same structure).
Every call to the DB is custom and specific, so there is nothing to cache here.
From NodeJS I execute a lot of simultaneous calls (via await Promise.all(all 1000 promises)) and afterwards make a lot of different calculations.
Currently my stats look like this (max_connections set to the default of 100):
1 call ~ 100 ms
1,000 calls ~ 15,000 ms (15 ms/call)
I have tried changing various PostgreSQL settings, for example raising max_connections to 1,000, but nothing really seems to improve the performance (and yes, I do remember to restart the PostgreSQL service every time I make a change).
How can I make the execution of the 1,000 simultaneous calls as fast as possible? Should I consider copying all the needed data to another in-memory database like Redis instead?
The DB table looks like this:
CREATE TABLE public.my_table1 (
id int8 NOT NULL GENERATED ALWAYS AS IDENTITY,
tradeid int8 NOT NULL,
matchdate timestamptz NULL,
price float8 NOT NULL,
"size" float8 NOT NULL,
issell bool NOT NULL,
CONSTRAINT my_table1_pkey PRIMARY KEY (id)
);
CREATE INDEX my_table1_matchdate_idx ON public.my_table1 USING btree (matchdate);
CREATE UNIQUE INDEX my_table1_tradeid_idx ON public.my_table1 USING btree (tradeid);
The simple test query fetches 30 minutes of data between two timestamps:
select * from my_table1 where '2020-01-01 00:00' <= matchdate AND matchdate < '2020-01-01 00:30'
total_size_incl_toast_and_indexes: 21 GB total table size --> 143 bytes/row
live_rows_in_text_representation: 13 GB total table size --> 89 bytes/row
My NodeJS code looks like this:
const startTime = new Date();
let allDBcalls = [];
let totalRawTrades = 0;

(async () => {
    for (let i = 0; i < 1000; i++) {
        allDBcalls.push(selectQuery.getTradesBetweenDates(tickers, new Date('2020-01-01 00:00'), new Date('2020-01-01 00:30')).then(function (rawTradesPerTicker) {
            totalRawTrades += rawTradesPerTicker["data"].length;
        }));
    }
    await Promise.all(allDBcalls);
    _wl.info(`Fetched ${totalRawTrades} raw-trades in ${new Date().getTime() - startTime} ms!!`);
})();
I just ran EXPLAIN (ANALYZE, BUFFERS) on the query 4 times:
EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM public.my_table1 where '2020-01-01 00:00' <= matchdate and matchdate < '2020-01-01 00:30';
Index Scan using my_table1_matchdate_idx on my_table1 (cost=0.57..179.09 rows=1852 width=41) (actual time=0.024..0.555 rows=3013 loops=1)
Index Cond: (('2020-01-01 00:00:00+04'::timestamp with time zone <= matchdate) AND (matchdate < '2020-01-01 00:30:00+04'::timestamp with time zone))
Buffers: shared hit=41
Planning Time: 0.096 ms
Execution Time: 0.634 ms
Index Scan using my_table1_matchdate_idx on my_table1 (cost=0.57..179.09 rows=1852 width=41) (actual time=0.018..0.305 rows=3013 loops=1)
Index Cond: (('2020-01-01 00:00:00+04'::timestamp with time zone <= matchdate) AND (matchdate < '2020-01-01 00:30:00+04'::timestamp with time zone))
Buffers: shared hit=41
Planning Time: 0.170 ms
Execution Time: 0.374 ms
Index Scan using my_table1_matchdate_idx on my_table1 (cost=0.57..179.09 rows=1852 width=41) (actual time=0.020..0.351 rows=3013 loops=1)
Index Cond: (('2020-01-01 00:00:00+04'::timestamp with time zone <= matchdate) AND (matchdate < '2020-01-01 00:30:00+04'::timestamp with time zone))
Buffers: shared hit=41
Planning Time: 0.097 ms
Execution Time: 0.428 ms
Index Scan using my_table1_matchdate_idx on my_table1 (cost=0.57..179.09 rows=1852 width=41) (actual time=0.016..0.482 rows=3013 loops=1)
Index Cond: (('2020-01-01 00:00:00+04'::timestamp with time zone <= matchdate) AND (matchdate < '2020-01-01 00:30:00+04'::timestamp with time zone))
Buffers: shared hit=41
Planning Time: 0.077 ms
Execution Time: 0.586 ms

How to avoid sudden spikes in batch size in Spark streaming?

I am streaming data from Kafka and trying to limit the number of events per batch to 10 events. After processing 10-15 batches, there is a sudden spike in the batch size. Below are my settings:
spark.streaming.kafka.maxRatePerPartition=1
spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.pid.minRate=1
spark.streaming.receiver.maxRate=2
Please check this image for the streaming behavior
This is a bug in Spark; please refer to: https://issues.apache.org/jira/browse/SPARK-18371
The pull request isn't merged yet, but you can pick it up and build Spark on your own.
To summarize the issue:
If you have the spark.streaming.backpressure.pid.minRate set to a number <= partition count, then an effective rate of 0 is calculated:
val totalLag = lagPerPartition.values.sum
...
val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
...
(the second line calculates the rate per partition, where rate is the rate coming from the PID estimator and defaults to minRate when the PID calculation would be smaller)
As here: DirectKafkaInputDStream code
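A quick illustration with assumed numbers (8 partitions with equal lag, and the rate coming back from the PID estimator clamped to minRate = 1) shows why each per-partition rate rounds to 0:
// Illustrative sketch only; the partition count and lag values are made up.
val rate = 1L                                      // PID output, clamped to minRate
val lagPerPartition = (0 until 8).map(p => p -> 1000L).toMap
val totalLag = lagPerPartition.values.sum          // 8000

val backpressureRatePerPartition = lagPerPartition.map { case (tp, lag) =>
  tp -> Math.round(lag / totalLag.toFloat * rate)  // 1000 / 8000 * 1 = 0.125, rounds to 0
}
// backpressureRatePerPartition.values.sum == 0, so the clamp code below falls
// back to getOrElse(leaderOffsets), i.e. the head of each partition.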
When this comes out as 0, it causes a fallback to the (unreasonable) head of the partitions:
...
if (effectiveRateLimitPerPartition.values.sum > 0) {
  val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
  Some(effectiveRateLimitPerPartition.map {
    case (tp, limit) => tp -> (secsPerBatch * limit).toLong
  })
} else {
  None
}
...
maxMessagesPerPartition(offsets).map { mmp =>
  mmp.map { case (tp, messages) =>
    val lo = leaderOffsets(tp)
    tp -> lo.copy(offset = Math.min(currentOffsets(tp) + messages, lo.offset))
  }
}.getOrElse(leaderOffsets)
As in DirectKafkaInputDStream#clamp
This makes the backpressure essentially non-functional when your actual and minimum receive rate (messages per batch across partitions) is smaller than or roughly equal to the partition count and you experience significant lag (e.g. messages arrive in spikes while you have constant processing power).

Structured streaming: watermark vs. exactly-once semantics

The programming guide says that Structured Streaming guarantees end-to-end exactly-once semantics using appropriate sources/sinks.
However, I don't understand how this works when the job crashes and we have a watermark applied.
Below is an example of how I currently imagine it working; please correct me on any points that I'm misunderstanding. Thanks in advance!
Example:
Spark Job: Count # events in each 1 hour window, with a 1 hour Watermark.
Messages:
A - timestamp 10am
B - timestamp 10:10am
C - timestamp 10:20am
X - timestamp 12pm
Y - timestamp 12:50pm
Z - timestamp 8pm
We start the job, read A, B, C from the Source and the job crashes at 10:30am before we've written them out to our Sink.
At 6pm the job comes back up and knows to re-process A, B, C using the saved checkpoint/WAL. The final count is 3 for the 10-11am window.
Next, it reads the new messages from Kafka: X, Y, Z, in parallel since they belong to different partitions. Z is processed first, so the max event timestamp gets set to 8pm. When the job reads X and Y, they are now behind the watermark (8pm - 1 hour = 7pm), so they are discarded as old data. The final count is 1 for the 8-9pm window, and the job does not report anything for the 12-1pm window. We've lost the data for X and Y.
---End example---
Is this scenario accurate?
If so, a 1-hour watermark may be sufficient to handle late/out-of-order data when it flows normally from Kafka to Spark, but not when the Spark job goes down or the Kafka connection is lost for a long period of time. Would the only option to avoid data loss be to use a watermark longer than you ever expect the job to be down for?
The watermark is a fixed value during a minibatch. In your example, since X, Y and Z are processed in the same minibatch, the watermark used for these records would be 9:20am. After completion of that minibatch the watermark would be updated to 7pm.
Below is a quote from the design doc for SPARK-18124, the feature which implements the watermarking functionality:
To calculate the drop boundary in our trigger based execution, we have to do the following.
In every trigger, while aggregate the data, we also scan for the max value of event time in the trigger data
After trigger completes, compute watermark = MAX(event time before trigger, max event time in trigger) - threshold
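Applied to the messages in the question (1-hour threshold), that rule works out to:
before batch 0: watermark = initial value = 1970-01-01 00:00
after batch 0:  watermark = 10:20 (max event time so far) - 1 hour = 09:20, used by batch 1
after batch 1:  watermark = max(10:20, 20:00) - 1 hour = 19:00, used by batch 2
So X (12:00) and Y (12:50) are still ahead of the 09:20 watermark in the batch that contains Z and are counted; the watermark only moves to 19:00 for the following batch.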
A simulation is probably more descriptive:
import org.apache.hadoop.fs.Path
import java.sql.Timestamp
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.ProcessingTime

val dir = new Path("/tmp/test-structured-streaming")
val fs = dir.getFileSystem(sc.hadoopConfiguration)
fs.mkdirs(dir)

val schema = StructType(
  StructField("vilue", StringType) ::
  StructField("timestamp", TimestampType) ::
  Nil)

val eventStream = spark
  .readStream
  .option("sep", ";")
  .option("header", "false")
  .schema(schema)
  .csv(dir.toString)

// Watermarked aggregation
val eventsCount = eventStream
  .withWatermark("timestamp", "1 hour")
  .groupBy(window($"timestamp", "1 hour"))
  .count

def writeFile(path: Path, data: String) {
  val file = fs.create(path)
  file.writeUTF(data)
  file.close()
}

// Debug query
val query = eventsCount.writeStream
  .format("console")
  .outputMode("complete")
  .option("truncate", "false")
  .trigger(ProcessingTime("5 seconds"))
  .start()
writeFile(new Path(dir, "file1"), """
|A;2017-08-09 10:00:00
|B;2017-08-09 10:10:00
|C;2017-08-09 10:20:00""".stripMargin)
query.processAllAvailable()
val lp1 = query.lastProgress
// -------------------------------------------
// Batch: 0
// -------------------------------------------
// +---------------------------------------------+-----+
// |window |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3 |
// +---------------------------------------------+-----+
// lp1: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
// ...
// "numInputRows" : 3,
// "eventTime" : {
// "avg" : "2017-08-09T10:10:00.000Z",
// "max" : "2017-08-09T10:20:00.000Z",
// "min" : "2017-08-09T10:00:00.000Z",
// "watermark" : "1970-01-01T00:00:00.000Z"
// },
// ...
// }
writeFile(new Path(dir, "file2"), """
|Z;2017-08-09 20:00:00
|X;2017-08-09 12:00:00
|Y;2017-08-09 12:50:00""".stripMargin)
query.processAllAvailable()
val lp2 = query.lastProgress
// -------------------------------------------
// Batch: 1
// -------------------------------------------
// +---------------------------------------------+-----+
// |window |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3 |
// |[2017-08-09 12:00:00.0,2017-08-09 13:00:00.0]|2 |
// |[2017-08-09 20:00:00.0,2017-08-09 21:00:00.0]|1 |
// +---------------------------------------------+-----+
// lp2: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
// ...
// "numInputRows" : 3,
// "eventTime" : {
// "avg" : "2017-08-09T14:56:40.000Z",
// "max" : "2017-08-09T20:00:00.000Z",
// "min" : "2017-08-09T12:00:00.000Z",
// "watermark" : "2017-08-09T09:20:00.000Z"
// },
// "stateOperators" : [ {
// "numRowsTotal" : 3,
// "numRowsUpdated" : 2
// } ],
// ...
// }
writeFile(new Path(dir, "file3"), "")
query.processAllAvailable()
val lp3 = query.lastProgress
// -------------------------------------------
// Batch: 2
// -------------------------------------------
// +---------------------------------------------+-----+
// |window |count|
// +---------------------------------------------+-----+
// |[2017-08-09 10:00:00.0,2017-08-09 11:00:00.0]|3 |
// |[2017-08-09 12:00:00.0,2017-08-09 13:00:00.0]|2 |
// |[2017-08-09 20:00:00.0,2017-08-09 21:00:00.0]|1 |
// +---------------------------------------------+-----+
// lp3: org.apache.spark.sql.streaming.StreamingQueryProgress =
// {
// ...
// "numInputRows" : 0,
// "eventTime" : {
// "watermark" : "2017-08-09T19:00:00.000Z"
// },
// "stateOperators" : [ ],
// ...
// }
query.stop()
fs.delete(dir, true)
Notice how Batch 0 started with watermark 1970-01-01 00:00:00 while Batch 1 started with watermark 2017-08-09 09:20:00 (max event time of Batch 0 minus 1 hour). Batch 2, while empty, used watermark 2017-08-09 19:00:00.
Z is processed first, so the max event timestamp gets set to 8pm.
That's correct. Even though Z may be processed first, the watermark is derived by subtracting the threshold from the maximum timestamp seen in the current query iteration. This means 08:00 PM is the time the threshold is subtracted from, so 12:00 and 12:50 will be discarded.
From the documentation:
For a specific window starting at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T)
Would the only option to avoid data loss be to use a watermark longer than you expect the job to ever go down for
Not necessarily. Let's assume you set a maximum amount of data to be read per Kafka query to 100 items. If you read small batches, and you read serially from each partition, the maximum timestamp of each batch may not be the timestamp of the latest message in the broker, meaning you won't lose these messages.
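As a concrete illustration of that kind of cap (a sketch only; the broker address and topic name are placeholders, and maxOffsetsPerTrigger is one way to express the limit for the Kafka source in Structured Streaming):
// Sketch: limit how many offsets are consumed per trigger, so a single
// micro-batch after a long outage cannot pull the watermark far ahead.
val events = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .option("maxOffsetsPerTrigger", 100L)              // at most 100 offsets per micro-batch
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")
  .withWatermark("timestamp", "1 hour")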

Graphite importing historical data only for 1 day

I'm trying to import 60 days of historical data at one-hour resolution, but the data is successfully imported only for the last 24 hours. My configuration is below:
Storage schemas in Graphite (/etc/carbon/storage-schemas.conf):
[default]
pattern = .*
retentions = 5m:15d,15m:1y,1h:10y,1d:100y
Storage aggregation (/etc/carbon/storage-aggregation.conf):
[all_sum]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = sum
Restarting carbon-cache and removing the old Whisper data does not solve the problem.
I checked the .wsp files with whisper-info.py:
# whisper-info /var/lib/graphite/whisper/ran/3g/newerlang.wsp
maxRetention: 3153600000
xFilesFactor: 0.0
aggregationMethod: sum
fileSize: 1961584
Archive 0
retention: 1296000
secondsPerPoint: 300
points: 4320
size: 51840
offset: 64
Archive 1
retention: 31536000
secondsPerPoint: 900
points: 35040
size: 420480
offset: 51904
Archive 2
retention: 315360000
secondsPerPoint: 3600
points: 87600
size: 1051200
offset: 472384
Archive 3
retention: 3153600000
secondsPerPoint: 86400
points: 36500
size: 438000
offset: 1523584
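For reference, each archive's retention equals secondsPerPoint × points, which matches the configured retentions of 5m:15d, 15m:1y, 1h:10y, 1d:100y:
Archive 0: 300 s/point × 4320 points = 1,296,000 s = 15 days
Archive 1: 900 s/point × 35040 points = 31,536,000 s = 1 year
Archive 2: 3600 s/point × 87600 points = 315,360,000 s = 10 years
Archive 3: 86400 s/point × 36500 points = 3,153,600,000 s = 100 years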
Any idea if I need to set this up in another file or am I missing something?
