I have a Spark stateful streaming job running in Azure Databricks (version: 10.4 LTS) which monitors new records from another table; the records are chunks that are supposed to be merged once all pieces arrive. The state is defined as the arrived chunk indices (for a record group id). The method used is flatMapGroupsWithState -- basically following the official documentation.
Everything seems to work, except that every time I delete the tables and restart the streaming job, the results are different. We expect to have 500,000 mergeable records, but sometimes we get only 50,000 or 100,000 -- it can vary from a few thousand to ~500,000. I looked closely at the code and couldn't find anywhere this kind of randomness could be introduced.
Just wondering if anyone has had similar issues? Thank you.
A third party is producing a complete daily snapshot of their database table (Authors) and is storing it as a Parquet file in S3. Currently the number of records is around 55 million+. This will increase daily. There are 12 columns.
Initially I want to take this whole dataset, do some processing on the records, normalise them, and then block them into groups of authors based on some specific criteria. I will then need to repeat this process daily, filtering it to only include authors that have been added or updated since the previous day.
I am using AWS EMR on EKS (Kubernetes) as my Spark cluster. My current thinking is that I can save my blocks of authors on HDFS.
The main use for the blocks of data will be a separate Spark Streaming job deployed onto the same EMR cluster, which will read events from a Kafka topic, do a quick search to see which blocks of data are related to that event, and then do some matching (pairwise) against each item of that block.
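To make this concrete, here is roughly how I picture the blocked write. This is only a sketch; the paths and the block_key column are made up:

```python
# Sketch only: paths and column names (e.g. block_key) are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("author-blocking").getOrCreate()

# Daily snapshot from the third party (hypothetical path).
authors = spark.read.parquet("s3://third-party-bucket/authors/latest")

# Placeholder for the real normalisation/blocking logic, which would
# assign a block_key to each author based on the blocking criteria.
blocked = authors.withColumn("block_key", authors["name_prefix"])

# One directory per block on HDFS, so the streaming job can read only
# the blocks related to an incoming Kafka event.
blocked.write.mode("overwrite").partitionBy("block_key").parquet("hdfs:///blocks/authors")
```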
I have two main questions:
Is using HDFS a performant and viable option for this use case?
The third party database table dump is only an initial goal. Later on, there will quite possibly be tens or even hundreds of other sources that I would need to do matching against, which could mean trillions of records being blocked, and those blocks need to be stored somewhere. Would this option still be viable at that stage?
I want to stream data into BigQuery and I was thinking of using Pub/Sub + Cloud Functions, since there is no transformation needed (for now, at least) and using Cloud Dataflow feels like a bit of overkill for just inserting rows into a table. Am I correct?
The data is streamed from a GCP VM using a Python script into Pub/Sub, and it has the following format:
{'SEGMENT':'datetime':'2020-12-05 11:25:05.64684','values':(2568.025,2567.03)}
The BigQuery schema is datetime:timestamp, value_A: float, value_B: float.
My questions with all this are:
a) Do I need to push this into BigQuery as a JSON/dictionary with all values as strings, or does it have to match the data types of the table?
b) What's the difference between using BQ.insert_rows_json and BQ.load_table_from_json, and which one should I use for this task?
EDIT:
What I'm actually trying to get is market data for some assets: around 28 instruments, capturing all their ticks. On an average day there are ~60k ticks per instrument, so we are talking about ~33.6M invocations per month. What is needed (for now) is to insert them into a table for further analysis. I'm currently not sure whether real streaming should be performed or batch loads. Since the project is still at the analysis stage, I don't feel that Dataflow is needed, but Pub/Sub should be used since it makes it easier to scale to Dataflow when the time comes. This is my first implementation of a streaming pipeline and I'm using everything I've learned through courses and reading. Please correct me if my approach is wrong :).
What I would absolutely love to do is, for example, perform another insert into another table when the price difference between one tick and the nth tick is, for example, 10. For this, should I use Dataflow, or is the Cloud Function approach still valid? This is like a trigger condition. Basically, the trigger would be something like:
```
if price difference >= 10:
    process all these ticks
    insert the results in this table
```
But I'm unsure how to implement this trigger.
In addition to the great answer from Marton (Pentium10):
a) You can stream JSON into BigQuery, but it must be VALID JSON, and your example isn't. As for the types, there is automatic coercion/conversion according to your schema. You can see this here.
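For reference, a valid JSON row matching your schema (datetime:timestamp, value_A:float, value_B:float) could look like this, with flat keys and double quotes:

```json
{"datetime": "2020-12-05 11:25:05.64684", "value_A": 2568.025, "value_B": 2567.03}
```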
b) The load job loads a file from GCS or content that you put in the request. It is asynchronous and can take seconds or minutes. In addition, you are limited to 1,500 loads per day and per table, so 1 per minute works (there are 1,440 minutes per day). There are several interesting aspects of the load job.
Firstly, it's free!
Your data is immediately loaded into the correct partition and is immediately queryable in that partition.
If the load fails, no data is inserted. So it's easy to replay a file without ending up with duplicated values.
By contrast, streaming inserts write the data into BigQuery in real time. That's interesting when you have real-time constraints (especially for visualisation, anomaly detection, ...). But there are some downsides:
You are limited to 500k rows per second (in the EU and US), 100k rows per second in other regions, and 1 GB max per second.
The data isn't immediately in the partition; it sits in a buffer named UNPARTITIONED for a while, or until that buffer is full. You have to take this peculiarity into account when you build and test your real-time application.
It's not free. The cheapest region is $0.05 per GB.
Now that you are aware of this, ask yourself about your use case.
If you need real time (less than 2 minutes of delay), no doubt, streaming is for you.
If you have a few GB per month, streaming is also the easiest solution, for a few dollars.
If you have a huge volume of data (more than 1 GB per second), BigQuery isn't the right service; consider Bigtable (which you can query from BigQuery as a federated table).
If you have a significant volume of data (1 or 2 GB per minute) and your use case requires data freshness on the order of a minute or more, you can consider a special design:
Create a Pub/Sub pull subscription.
Create an HTTP-triggered Cloud Function (or a Cloud Run service) that pulls the subscription for 1 minute, then submits the pulled content to BigQuery as a load job (no file needed; you can post in-memory content directly to BigQuery), and then exits gracefully (a sketch follows this list).
Create a Cloud Scheduler job that triggers your service every minute.
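A minimal sketch of that design, assuming a made-up subscription "projects/my-project/subscriptions/ticks-sub" and a made-up table "my_dataset.ticks"; error handling and the 1-minute pull loop are simplified away:

```python
# Sketch of the pull-then-load pattern (all names are hypothetical).
import json

from google.cloud import bigquery, pubsub_v1

def pull_and_load(request):
    """HTTP-triggered entry point, invoked by Cloud Scheduler every minute."""
    subscriber = pubsub_v1.SubscriberClient()
    subscription = "projects/my-project/subscriptions/ticks-sub"

    # Pull a batch of messages from the pull subscription.
    response = subscriber.pull(subscription=subscription, max_messages=1000)
    rows = [json.loads(m.message.data) for m in response.received_messages]

    if rows:
        # Post the in-memory content directly to BigQuery as a (free) load job.
        client = bigquery.Client()
        client.load_table_from_json(rows, "my_dataset.ticks").result()

        # Acknowledge only after the load job succeeded, so a failed load
        # can be replayed without losing messages.
        ack_ids = [m.ack_id for m in response.received_messages]
        subscriber.acknowledge(subscription=subscription, ack_ids=ack_ids)
    return "ok"
```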
Edit 1:
The cost shouldn't drive your use case.
If, for now, it's only for analytics, you can simply trigger your job once per day to pull the full subscription. With your metrics: 60k ticks * 28 instruments * 100 bytes (24 bytes of actual data plus overhead), you get only ~168 MB. You can hold this in Cloud Functions or Cloud Run memory and perform a load job.
Streaming is really important for real time!
Dataflow, in streaming mode, will cost you at least $20 per month (1 small worker of type n1-standard-1). That's much more than 1.5 GB of streaming inserts into BigQuery with Cloud Functions.
Finally, about your smart trigger to choose between streaming and batch insert: it's not really possible; you would have to redesign the data ingestion if you change your logic. But above all, do it only if your use case requires it!
To answer your questions:
a) You need to push to BigQuery using one of the formats the library accepts, usually a collection or a JSON document formatted to the table's definition.
b) To add data to BigQuery you can Stream data or Load a file.
For your example you need to stream data, so use the streaming API methods from the insert_rows* family.
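For example, with the Python client, a streaming insert for your schema could look like this (the table name is made up; the row uses the values from your question):

```python
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"datetime": "2020-12-05 11:25:05.64684", "value_A": 2568.025, "value_B": 2567.03},
]

# Streaming insert; returns a list of per-row errors (empty on success).
errors = client.insert_rows_json("my_dataset.ticks", rows)
if errors:
    print("Insert failed:", errors)
```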
I have created a Spark job using the Dataset API. There is a chain of operations performed until the final result, which is written to HDFS.
But I also need to know how many records were read for each intermediate dataset. Let's say I apply 5 operations on a dataset (map, groupBy, etc.); I need to know how many records there were in each of the 5 intermediate datasets. Can anybody suggest how this can be obtained at the dataset level? I guess I can find this out at the task level (using listeners), but I'm not sure how to get it at the dataset level.
Thanks
The nearest thing in the Spark documentation related to metrics is Accumulators. However, these are reliable only for actions; the docs mention that accumulator updates are not guaranteed for transformations.
You can still use count to get the count after each operation. But keep in mind that it's an extra action like any other, and you need to decide whether ingestion should run faster with fewer metrics or slower with all the metrics.
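A minimal sketch of that approach, with a hypothetical input path and column names; note that each count() triggers an extra Spark job over the lineage up to that point:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intermediate-counts").getOrCreate()

df = spark.read.parquet("hdfs:///data/events")      # hypothetical input
print("input:", df.count())

filtered = df.filter(F.col("status") == "valid")    # hypothetical column
print("after filter:", filtered.count())

grouped = filtered.groupBy("group_id").count()      # hypothetical key column
print("after groupBy:", grouped.count())

grouped.write.mode("overwrite").parquet("hdfs:///data/output")
```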
Now, coming back to listeners: a SparkListener can receive events about when applications, jobs, stages, and tasks start and complete, as well as other infrastructure-centric events like drivers being added or removed, when an RDD is unpersisted, or when environment properties change. All the information you can find about the health of Spark applications and the entire infrastructure is in the WebUI.
Your requirement is more of a custom implementation. Not sure if you can achieve this. Some info regarding exporting metrics is here.
All the metrics you can collect are at job start, job end, task start, and task end. You can check the docs here.
Hope the above info guides you to a better solution.
I'm investigating using Spark for a project with hundreds of GB of data being generated per hour. I'm struggling to get the first step optimal, though it feels like it should be simple!
Suppose there is a daily process that splits the data into two parts, neither of which are small enough to cache in memory. Is there a way to do this in a single spark job without the source data having to be loaded from HDFS and parsed twice?
Say I want to write all the "Dog" events into a new HDFS location, and all the "Cat" events into another. As soon as I specify the first "write to file" action (for the dogs), Spark will set off loading each file into RAM in turn and filtering out all the dogs. Then it'll have to re-parse every file to find the cats, even though it could have done both at once. It could have loaded each partition of the raw data and written out the cats and dogs for that partition in one go.
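In code, the pattern I'm describing looks roughly like this (paths and the animal column are made up); each write triggers its own scan and parse of the source:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("split-events").getOrCreate()
events = spark.read.json("hdfs:///raw/events")  # hypothetical source

# First action: scans and parses every source file to find the dogs...
events.filter(F.col("animal") == "Dog").write.parquet("hdfs:///out/dogs")

# ...then scans and parses everything again to find the cats.
events.filter(F.col("animal") == "Cat").write.parquet("hdfs:///out/cats")
```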
If the data fit in memory, I could filter down to both types of events, then cache, then do the two writes. But it doesn't, and I can't see how to make Spark do this on a per-partition basis as it goes along. I've studied the API docs thinking I must be missing something...
Any advice would be greatly appreciated, thanks!
I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files contain just a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this (a rough sketch follows the list):
Load a batch of requests into Spark as an RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for comprehension would be run over the final RDD to send each modified request to a downstream system.
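As a rough sketch (in PySpark for brevity, although we would write the real functions in Scala; all function bodies and names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("request-pipeline").getOrCreate()
sc = spark.sparkContext

def is_valid(request):          # filtering/validation placeholder
    return bool(request)

def enrich(request):            # reference-data lookup placeholder
    return request

def calculate(request):         # calculation / rules-engine placeholder
    return request

def send_downstream(request):   # downstream-system send placeholder
    print(request)

requests = sc.textFile("hdfs:///in/requests")  # hypothetical batch input
results = requests.filter(is_valid).map(enrich).map(calculate)
results.foreach(send_downstream)
```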
Does the above sound correct, or would this not be the right way to use Spark?
Thanks
We have done something similar on a small IoT project. We tested receiving and processing around 50K MQTT messages per second on 3 nodes, and it was a breeze. Our processing included parsing each JSON message, some manipulation of the created object, and saving all the records to a time-series database.
We set the batch time to 1 second; the processing time was around 300 ms and RAM usage was in the ~100s of KB.
A few concerns with streaming. Make sure your downstream system is asynchronous so you won't run into memory issues. It's true that Spark supports back pressure, but you will need to make it happen. Another thing: try to keep state to a minimum. More specifically, you should not keep any state that grows linearly as your input grows. This is extremely important for your system's scalability.
What impressed me the most is how easily you can scale with Spark. With each node we added, the rate of messages we could handle grew linearly.
I hope this helps a little.
Good luck