Record by record timestamp difference calculation - apache-spark

I am working on logic to find the consecutive time difference between timestamps in the streaming layer (Spark) by comparing the previous time with the current time and storing the difference in the database.
For eg:
2017-08-01 11:00:00
2017-08-01 11:05:00
2017-08-01 11:07:00
So according to the above timestamps, my consecutive differences will be 5 mins (11:00:00 to 11:05:00) and 2 mins respectively, and when I sum the differences I get 7 mins (5+2), which is the actual time difference. Now the real challenge is when I receive a delayed timestamp.
For eg:
2017-08-01 11:00:00
2017-08-01 11:05:00
2017-08-01 11:07:00
2017-08-01 11:02:00
Here, when I calculate the differences, they will be 5 mins, 2 mins and 5 mins respectively, and when I sum the differences I get 12 mins (5+2+5), which is greater than the actual time difference (7 mins). This is wrong.
Please help me find a workaround to handle delayed timestamps in this record-by-record time difference calculation.

What you are experiencing is the difference between 'event time' and 'processing time'. In the best case, processing time will be nearly identical to event time, but sometimes, an input record is delayed, so the difference will be larger.
When you process streaming data, you define (explicitly or implicitly) a window of records that you look at. If you process records individually, this window has size 1. In your case, your window has size 2. But you could also have a window that is based on time, i.e. you can look at all records that have been received in the last 10 minutes.
If you want to process delayed records in order, you need to wait until the delayed records have arrived and then sort the records within the window. The problem then becomes: how long do you wait? The delayed records may show up 2 days later! How long to wait is a subjective question and depends on your application and its requirements.
Note that if your window is time-based, you will need to handle the case where no previous record is available.
I highly recommend this article: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 to get to grips with streaming terminology and windows.
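As a concrete illustration, here is a minimal PySpark Structured Streaming sketch of that idea: bound how long you wait for late records, then sort the rows of each micro-batch by event time before taking consecutive differences. The socket source, column names and the 10-minute bound are assumptions, not part of the original question.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("consecutive-diff-sketch").getOrCreate()

# Hypothetical source producing one ISO timestamp per line.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
events = lines.select(F.to_timestamp("value").alias("event_time"))

def handle_batch(batch_df, batch_id):
    # Sort this micro-batch by event time, then take the difference between
    # each record and the previous one.
    w = Window.orderBy("event_time")  # acceptable for the small volumes of a sketch
    diffs = (batch_df
             .withColumn("prev_time", F.lag("event_time").over(w))
             .withColumn("diff_seconds",
                         F.col("event_time").cast("long") - F.col("prev_time").cast("long")))
    diffs.show(truncate=False)  # replace with a write to your database

# The watermark is where you express how long to wait for late records (the
# subjective choice discussed above); note that it only drops late data when
# combined with a stateful operation such as a windowed aggregation.
query = (events
         .withWatermark("event_time", "10 minutes")
         .writeStream
         .foreachBatch(handle_batch)
         .start())
query.awaitTermination()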

Related

Spark Structured Streaming - Force microbatch execution even without input rows

We have a Spark Structured Streaming query that counts the number of input rows received in the last hour, updating every minute, performing the aggregation with a time window (windowDuration="1 hour", slideDuration="1 minute"). The query is configured to use a processingTime trigger with a duration of 30 seconds: trigger(processingTime="30 seconds"). The outputMode of the query is append.
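Roughly, the query described above could be reconstructed like this (the rate source, column name and watermark duration below are assumptions, not the original code):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hourly-count-sketch").getOrCreate()

# Stand-in for the real input stream.
events = (spark.readStream.format("rate").option("rowsPerSecond", 1).load()
          .withColumnRenamed("timestamp", "event_time"))

counts = (events
          .withWatermark("event_time", "1 minute")              # required for append mode
          .groupBy(F.window("event_time", "1 hour", "1 minute"))
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("console")
         .trigger(processingTime="30 seconds")
         .start())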
This query produces results as long as new rows are received, which is consistent with the behaviour that the documentation indicates for fixed interval micro-batches:
If no new data is available, then no micro-batch will be kicked off.
However, we would like the query to produce results even when there are NO input rows: our use case is related to monitoring, and we would like to trigger alerts when there are no input messages from the monitored system for a period of time.
For example, for the following input:
event_time  event_id
00:02       1
00:05       2
01:00       3
03:00       4
At processingTime=01:01, we would expect the following output row to be produced:
window.start  window.end  count
00:00         01:00       3
However, from this point, there are no input rows until 03:00, and therefore, no microbatch will be executed until this time, missing the opportunity to produce output rows such as:
window.start  window.end  count
01:01         02:01       0
Which would otherwise produce a monitoring alert in our system.
Is there any workaround for this behaviour, allowing executions of empty microbatches when there are no input rows?
You cannot ask for behaviour that the software simply does not provide, and there are no workarounds for this. There was even an issue, which may still exist, in which the last set of micro-batch data is not processed.

Is there a way to use Spark Structured Streaming to calculate daily aggregates?

I am planning to use structured streaming to calculate daily aggregates across different metrics.
Data volume < 1000 records per day.
Here is a simple example of the input data:
timestamp, Amount
1/1/20 10:00, 100
1/1/20 11:00, 200
1/1/20 23:00, 400
1/2/20 10:00, 100
1/2/20 11:00, 200
1/2/20 23:00, 400
1/2/20 23:10, 400
Expected output
Day, Amount
1/1/20, 700
1/2/20, 1100
I am planning to do something like this in Structured Streaming, but I am not sure if it works or if it's the right way to do it:
parsedDF.withWatermark("date", "25 hours").groupBy("date", window("date", "24 hours")).sum("amount")
There is material overhead in running structured streams. Given that you're writing code to produce a single result every 24 hours, it would seem a better use of resources to do the following, if you can accept an extra couple of minutes of latency in exchange for using far fewer resources (a rough sketch follows the steps below):
Ingest data into a table, partitioned by day
Write a simple SQL query against this table to generate your daily aggregate(s)
Schedule the job to run [watermark] seconds after midnight.
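A minimal PySpark sketch of those steps (the paths, table names, column names and date format are assumptions, not part of the answer):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-aggregate-sketch").getOrCreate()

# 1. Ingest incoming CSV files into a table partitioned by day.
raw = spark.read.option("header", True).option("inferSchema", True).csv("/data/incoming/")
(raw.withColumn("day", F.to_date(F.to_timestamp("timestamp", "M/d/yy H:mm")))
    .write.mode("append").partitionBy("day").saveAsTable("events"))

# 2. A simple aggregate over yesterday's partition, scheduled shortly after midnight.
daily = spark.sql("""
    SELECT day, SUM(amount) AS amount
    FROM events
    WHERE day = date_sub(current_date(), 1)
    GROUP BY day
""")
daily.write.mode("append").saveAsTable("daily_amounts")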
That's under the impression you're using the default output mode, since you didn't specify one. If you want to stick with streaming, more context on your code and your goal would be helpful. For example, how often do you want results, and do you need partial results before the end of the day? How long do you want to wait for late data to update aggregates? What output mode are you planning to use?

How big can the window be when using Spark Streaming?

We have some streaming data that needs to be calculated, and we are considering using Spark Streaming to do it.
We need to generate three kinds of reports. The reports are based on
The last 5 minutes data
The last 1 hour data
The last 24 hour data
The frequency of reports is 5 minutes.
After reading the docs, the most obvious way to solve this seems to be setting up a Spark stream with a 5-minute interval and two windows of 1 hour and 1 day.
But I am worried that windows of one hour and one day may be too big. I do not have much experience with Spark Streaming, so what window lengths do you use in your environment?
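For reference, a minimal sketch of that setup with the DStream API might look like the following; the text-file source and paths are assumptions:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="window-size-sketch")
ssc = StreamingContext(sc, 300)                      # 5-minute batch interval
ssc.checkpoint("/tmp/streaming-checkpoint")          # recommended when using window operations

lines = ssc.textFileStream("/data/incoming/")        # hypothetical source directory

last_5_min = lines.count()                           # each batch already spans 5 minutes
last_hour = lines.window(3600, 300).count()          # 1-hour window, evaluated every 5 minutes
last_day = lines.window(86400, 300).count()          # 24-hour window, evaluated every 5 minutes

last_5_min.pprint()
last_hour.pprint()
last_day.pprint()

ssc.start()
ssc.awaitTermination()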

How big can the Spark Streaming window be?

I have some data flows that need to be calculated. I am thinking about using Spark Streaming to do this job, but there is one thing I am not sure about and am worried by.
My requirements are as follows:
Data comes in as CSV files every 5 minutes. I need reports on the data of the most recent 5 minutes, 1 hour and 1 day. So if I set up a Spark stream to do this calculation, I need a batch interval of 5 minutes, and I also need to set up two windows of 1 hour and 1 day.
Every 5 minutes, 1 GB of data comes in, so the one-hour window will calculate over 12 GB (60/5) of data and the one-day window will calculate over 288 GB (24*60/5) of data.
I do not have much experience with Spark, so this worries me.
Can Spark handle such big windows?
How much RAM do I need to process those 288 GB of data? More than 288 GB of RAM? (I know this may depend on my disk I/O, CPU and the calculation pattern, but I just want an estimated answer based on experience.)
If calculating over one day / one hour of data is too expensive in a stream, do you have any better suggestions?

Tracking metrics using StatsD (via etsy) and Graphite, graphite graph doesn't seem to be graphing all the data

We have a metric that we increment every time a user performs a certain action on our website, but the graphs don't seem to be accurate.
Going off this hunch, we investigated carbon's updates.log and discovered that the action had happened over 4 thousand times today (using grep and wc), but according to the integral of the graph it returned only around 220.
What could be the cause of this? Data is being reported to StatsD using the StatsD PHP library by calling statsd::increment('metric'), and as stated above, the log confirms that 4,000+ updates to this key happened today.
We are using:
graphite 0.9.6 with statsD (etsy)
After some research through the documentation, and some conversations with others, I've found the problem - and the solution.
The way the whisper file format is designed, it expects you (or your application) to publish updates no faster than the minimum interval in your storage-schemas.conf file. This file is used to configure how much data retention you have at different time interval resolutions.
My storage-schemas.conf file was set with a minimum retention time of 1 minute. The default StatsD daemon (from etsy) is designed to flush updates to carbon (the graphite daemon) every 10 seconds. The reason this is a problem is that, over a 60 second period, StatsD reports 6 times, and each write overwrites the previous one (within that 60 second interval, because you're updating faster than once per minute). This produces really weird results on your graph, because the last 10 seconds of a minute could be completely dead and report 0 activity during that period, which ends up nuking all of the data you had written for that minute.
To fix this, I had to re-configure my storage-schemas.conf file to store data at a maximum resolution of 10 seconds, so every update from StatsD would be saved in the whisper database without being overwritten.
Etsy published the storage-schemas.conf configuration that they were using for their installation of carbon, which looks like this:
[stats]
priority = 110
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
This has a 10 second minimum retention time, and stores 6 hours worth of them. However, due to my next problem, I extended the retention periods significantly.
As I let this data collect for a few days, I noticed that it still looked off (and was under reporting). This was due to 2 problems.
StatsD (older versions) only reported an average number of events per second for each 10 second reporting period. This means, if you incremented a key 100 times in 1 second and 0 times for the next 9 seconds, at the end of the 10th second StatsD would report 10 to graphite, instead of 100 (100/10 = 10). This failed to report the total number of events for a 10 second period (obviously). Newer versions of StatsD fix this problem, as they introduced the stats_counts bucket, which logs the total # of events per metric for each 10 second period (so instead of reporting 10 in the previous example, it reports 100). After I upgraded StatsD, I noticed that the last 6 hours of data looked great, but as I looked beyond the last 6 hours, things looked weird, and the next reason is why:
As graphite stores data, it moves data from high precision retention to lower precision retention. This means, using the etsy storage-schemas.conf example, after 6 hours of 10 second precision, data was moved to 60 second (1 minute) precision. In order to move 6 data points from 10s to 60s precision, graphite does an average of the 6 data points. So it'd take the total value of the oldest 6 data points and divide it by 6. This gives an average # of events per 10 seconds for that 60 second period (and not the total # of events, which is what we care about specifically). This is just how graphite is designed, and for some cases it might be useful, but in our case, it's not what we wanted. To "fix" this problem, I increased our 10 second precision retention time to 60 days. Beyond 60 days, I store the minutely and 10-minutely precisions, but they're essentially there for no reason, as that data isn't as useful to us.
I hope this helps someone, I know it annoyed me for a few days - and I know there isn't a huge community of people that are using this stack of software for this purpose, so it took a bit of research to really figure out what was going on and how to get a result that I wanted.
After posting my comment above I found Graphite 0.9.9 has a (new?) configuration file, storage-aggregation.conf, in which one can control the aggregation method per pattern. The available options are average, sum, min, max, and last.
http://readthedocs.org/docs/graphite/en/latest/config-carbon.html#storage-aggregation-conf
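For example, an entry that rolls up the stats_counts hierarchy by summing (instead of averaging) could look like the following; the section name and pattern are illustrative:

[sum_counts]
pattern = ^stats_counts\..*
xFilesFactor = 0
aggregationMethod = sum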
