Spark streaming with Checkpoint - apache-spark

I am a beginner to spark streaming. So have a basic doubt regarding checkpoints. My use case is to calculate the no of unique users by day. I am using reduce by key and window for this. Where my window duration is 24 hours and slide duration is 5 mins. I am updating the processed record to mongodb. Currently I am replace the existing record each time. But I see the memory is slowly increasing over time and kills the process after 1 and 1/2 hours(in aws small instance). The DB write after the restart clears all the old data. So I understand checkpoint is the solution for this. But my doubt is
What should my check point duration be..? As per documentation it says 5-10 times of slide duration. But I need the data of entire day. So it is ok to keep 24 hrs.
Where ideally should the checkpoint be..? Initially when I receive the stream or just before the window operation or after the data reduction has taken place.
Appreciate your help.
Thank you

In streaming scenarios holding 24 hours of data is usually too much. To solve that you use a probabilistic methods instead of exact measures for streaming and perform a later batch computation to get the exact numbers (if needed).
In your case to get a distinct count you can use an algorithm called HyperLogLog. You can see an example of using Twitter's implementation of HyperLogLog (part of a library called AlgeBird) from spark streaming here

Related

Differences between BigQuery BQ.insert_rows_json and BQ.load_from_json?

I want to stream data into BigQuery and I was thinking in use PubSub + Cloud Functions, since there is no transformation needed (for now, at least) and using Cloud Data Flow feels like a little bit over kill for just inserting rows to a table. I am correct?
The data is streamed from a GCP VM using a Python script into PubSub and it has the following format:
{'SEGMENT':'datetime':'2020-12-05 11:25:05.64684','values':(2568.025,2567.03)}
The BigQuery schema is datetime:timestamp, value_A: float, value_B: float.
My questions with all this are:
a) Do I need to push this into BigQuery as json/dictionary with all values as strings or it has to be with the data type of the table?
b) What's the difference between using BQ.insert_rows_json and BQ.load_table_from_json and which one should I use for this task?
EDIT:
What I'm trying to get is actually market data of some assets. Say around 28 instruments and capture all their ticks. On an average day, there are ~60.k ticks per instrument, so we are talking about ~33.6 M invocations per month. What is needed (for now) is to insert them in a table for further analysis. I'm currently not sure if real streaming should be performed or loads per batch. Since the project is in doing analysis yet, I don't feel that Data Flow is needed, but PubSub should be used since it allows to scale to Data Flow easier when the time comes. This is my first implementation of doing streaming pipelines and I'm using all what I've learned through courses and reading. Please, correct me if I'm having a wrong approach :).
What I would absolutely love to do is, for example, perform another insert to another table when the price difference between one tick and the n'th tick is, for example, 10. For this, should I use Data Flow or the Cloud Function approach is still valid? Because this is like a trigger condition. Basically, the trigger would be something like:
if price difference >= 10:
process all these ticks
insert the results in this table
But I'm unsure how to implement this trigger.
In addition to the great answer of Marton (Pentium10)
a) You can stream a JSON in BigQuery, a VALID json. your example isn't. About the type, there is an automatic coercion/conversion according with your schema. You can see this here
b) The load job loads file in GCS or a content that you put in the request. The batch is asynchronous and can take seconds or minutes. In addition, you are limited to 1500 load per days and per table -> 1 per minutes works (1440 minutes per day). There is several interesting aspect of the load job.
Firstly, it's free!
Your data are immediately loaded in the correct partition and immediately request-able in the partition
If the load fail, no data are inserted. So, it's easiest to replay a file without having doubled values.
At the opposite, the streaming job insert in real time the data into BigQuery. It's interesting when you have real time constraint (especially for visualisation, anomalie detections,...). But there is some bad sides
You are limited to 500k rows per seconds (in EU and US), 100k rows in other regions, and 1Gb max per seconds
The data aren't immediately in the partition, they are in a buffer name UNPARTITIONED for a while or up to have this buffer full.. So you have to take into account this specificity when you build and test your real time application.
It's not free. The cheapest region is $0.05 per Gb.
Now that you are aware of this, ask yourselves about your use case.
If you need real time (less than 2 minutes of delay), no doubt, streaming is for you.
If you have few Gb per month, streaming is also the easiest solution, for few $
If you have a huge volume of data (more than 1Gb per second), BigQuery isn't the good service, consider BigTable (that you can request with BigQuery as a federated table)
If you have an important volume of data (1 or 2Gb per minutes) and your use case required data freshness at the minute+, you can consider a special design
Create a PubSub pull subscription
Create a HTTP triggered Cloud Function (or a Cloud Run service) that pull the subscription for 1 minutes and then submit the pulled content to BigQuery as a load job (no file needed, you can post in memory content directly to BigQuery). And then exist gracefully
Create a Cloud Scheduler that trigger your service every minute.
Edit 1:
The cost shouldn't drive your use case.
If, for now, it's only for analytics, you simply imagine to trigger once per days your job to pull the full subscriptions. With your metrics: 60k metrics * 28 instruments * 100 bytes (24 + memory loss), you have only 168Mb. You can store this in Cloud Functions or Cloud Run memory and perform a load job.
Streaming is really important for real time!
Dataflow, in streaming mode, will cost you, at least $20 per month (1 small worker of type n1-standard1. Much more than 1.5Gb of streaming insert in BigQuery with Cloud Functions.
Eventually, about your smart trigger to stream or to batch insert, it's not really possible, you have to redesign the data ingestion if you change your logic. But before all, only if your use case requires this!!
To answer your questions:
a) you need to push to BigQuery using the library's accepting formats usually a collection or either a JSON document formatted to the table's definition.
b) To add data to BigQuery you can Stream data or Load a file.
For your example you need to stream data, so use the 'streaming api' methods insert_rows* family.

Azure Data factory and Data flow taking too much time to process data from staging to Database

So I have one data factory which runs every day, and it selects data from oracle on-premise database around 80M records and moves it to parquet file, which is taking around 2 hours I want to speed up this process... also the data flow process which insert and update data in db
parquet file setting
Next step is from parquet file it call the data flow which move data as upsert to database but this also taking too much time
data flow Setting
Let me know which compute type for data flow
Memory Optimized
Computed Optimized
General Purpose
After Round Robin Update
Sink Time
Can you open the monitoring detailed execution plan for the data flow? Click on each stage in your data flow and look to see where the bulk of the time is being spent. You should see on the top of the view how much time was spent setting-up the compute environment, how much time was taken to read your source, and also check the total write time on your sinks.
I have some examples of how to view and optimize this here.
Well, I would surmise that 45 min to stuff 85M files into a SQL DB is not horrible. You can break the task down into chunks and see what's taking the longest time to complete. Do you have access to Databricks? I do a lot of pre-processing with Databricks, and I have found Spark to be super-super-fast!! If you can pre-process in Databricks and push everything into your SQL world, you may have an optimal solution there.
As per the documentation - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-sink can you try modifying your partition settings under Optimize tab of your Sink ?
I faced similar issue with the default partitioning setting, where the data load was taking close to 30+ mins for 1M records itself, after changing the partition strategy to round robin and provided number of partitions as 5 (for my case) load is happening in less than a min.
Try experimenting with both Source partition (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-source) & Sink partition settings to come up with the optimum strategy. That should improve the data load time

Does Watermark in Update output mode clean the stored state in Spark Structured Streaming?

I am working on a spark streaming application and while understanding about the sinks and watermarking logic, I couldn't find a clear answer as to if I use a watermark with say 10 min threshold while outputting the aggregations with update output mode, will the intermittent state maintained by spark be cleared off after the 10 min threshold has expired?
Watermark allows late arriving data to be considered for inclusion against already computed results for a period of time using windows. Its premise is that it tracks back to a point in time (threshold) before which it is assumed no more late events are supposed to arrive, but if they do, they are discarded.
As a consequence one needs to maintain the state of window / aggregate already computed to handle these potential late updates based on event time. However, this costs resources, and if done infinitely, this would blow up a Structured Streaming App.
Will the intermittent state maintained by spark be cleared off after the 10 min threshold has expired? Yes, it will. There is by design as there is no point holding any longer a state that can no longer be updated due to the threshold having been expired.
You need to run through some simple examples as I note it is easy to forget the subtlety of output.
See
Why does streaming query with update output mode print out all rows?
which gives an excellent example of update mode output as well. Also this gives an even better update example: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Even better - this blog with some good graphics: https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be possible-viable (especially since it uses check-pointing in case of failure).
Does anyone know if this is the case?
This function is documented as: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
Its hard to deal with late data when you're streaming exactly over a day. If your data is tagged with dates, then you need to wait till all your data arrives. A 1 day window in streaming would only be good if you were literally just doing an analysis of the last 24 hours of data regardless of its content.

Adding and retrieving sorted counts in Cassandra

I have a case where I need to record a user action in Cassandra, then later retrieve a sorted list of users with the highest number of that action in an arbitrary time period.
Can anyone suggest a way to store and retrieve this data in a pre-aggregated method?
Outside of Cassandra I would recommend using stream-summary or count min sketch you would be able to solve this with much less space and have immediate results. Just update and periodically serialize and persist it (assuming you don't need guaranteed accuracy)
In Cassandra you can keep a row per period of time like by hours and have a counter per user in that row, incrementing them on use. Then use a batch job to run through them and find the heavy hitters. You would be constrained to having the minimal queryable time be 1 hour and it wont be particularly cheap or fast to compute but it would work.
Generally it would be good treating these as a log of operation, every time there is an event store it and have batch jobs do analytics against it with hadoop or custom. If need it realtime id recommend the above approach of keeping stream summaries in memory.

Resources