Spark Streaming: Perform a daily aggregation

I have a streaming dataframe and I want to calculate some daily counters.
So far, I have been using tumbling windows with a watermark, as follows:
from pyspark.sql.functions import window

# df is the input streaming DataFrame
df.withWatermark("timestamp", "10 minutes") \
  .groupBy(window("timestamp", "1 day")) \
  .count()
My question is whether this is the best way (resource-wise) to do this daily aggregation, or whether I should instead perform a series of aggregations over smaller windows (say, hourly or even less) and then aggregate those hourly counters into the daily count.
Moreover, if I try the second approach, meaning the smaller windows, how can I do this?
I cannot perform both aggregations (the hourly and the daily) within the same Spark Streaming application; I keep getting the following:
Multiple streaming aggregations are not supported with streaming
DataFrames/Datasets.
Should I therefore use one Spark application to post the hourly aggregations to a Kafka topic, then read that stream from another Spark application and perform the daily sum-up?
If so, how should I handle the "update" output mode in the producer? The second application would be receiving updated values from the first, so this sum-up would be wrong.
Moreover, adding a trigger will not help with the watermark either, since any late-arriving events will update a previous counter and I would run into the same problem again.

I think you should perform the aggregation on the shortest time span required and then perform a secondary aggregation on those primary aggregates. Aggregating over a full day keeps a large amount of state around, and if it doesn't OOM your job now, it will as data volume grows.
Perform the primary aggregations (hourly or even 5-minute counts) and record them in a time-series DB like Prometheus or Graphite.
Use Grafana to plot those metrics and perform the secondary aggregations, like the daily count, on top of the primary ones.
This adds some DevOps effort, but in return you can visually monitor your application in real time.
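For illustration, here is a minimal sketch of the primary-aggregation side in PySpark, using foreachBatch to push hourly counters to Graphite via the graphyte client (the events DataFrame, the Graphite host, and the metric names are assumptions, not from the question):

from pyspark.sql.functions import window

# Primary aggregation: hourly counts with a 10-minute watermark.
hourly = events \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "1 hour")) \
    .count()

def push_to_graphite(batch_df, batch_id):
    import graphyte
    graphyte.init("graphite.example.com", prefix="app")  # hypothetical host
    # Hourly counters are small, so collecting to the driver is fine here.
    for row in batch_df.collect():
        graphyte.send("hourly_count", row["count"],
                      timestamp=row["window"]["start"].timestamp())

hourly.writeStream \
    .outputMode("update") \
    .foreachBatch(push_to_graphite) \
    .start()

Since Graphite stores one value per metric per timestamp, a window that is re-emitted in "update" mode simply overwrites the previous counter, which sidesteps the double-counting concern from the question.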

Related

Delta Lake partitioning strategy for event data

I'm trying to build a system that ingests, stores, and can query app event data. In the future it will be used for other tasks (ML, analytics, etc.), which is why I think Databricks could be a good option (for now).
The main use case will be retrieving user-action events occurring in the app.
Batches of this event data will land in an S3 bucket about every 5-30 minutes, and Databricks Auto Loader will pick them up and store them in a Delta table.
A typical query will be: get all events where colA = x over the last day, week, or month.
I think the typical strategy here is to partition by date, e.g.:
date_trunc("day", date) # 2020-04-11T00:00:00.000+0000
This will create 365 partitions in a year. I expect each partition to hold about 1 GB of data. In addition to partitioning, I plan on using z-ordering for one of the high-cardinality columns that will frequently be used in the where clause.
Is this too many partitions?
Is there a better way to partition this data?
Since I'm partitioning by day and data is coming in every 5-30 minutes, is it possible to just "append" data to a day's partition instead?
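For concreteness, the ingestion pipeline described above might look like the following sketch (Auto Loader into a date-partitioned Delta table; the paths, the event_ts column, and the checkpoint and schema locations are all hypothetical):

from pyspark.sql.functions import date_trunc, col

stream = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/events") \
    .load("s3://bucket/raw-events/")

stream.withColumn("date", date_trunc("day", col("event_ts"))) \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "s3://bucket/_checkpoints/events") \
    .partitionBy("date") \
    .start("s3://bucket/delta/events")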
It really depends on the amount of data coming in per day and how many files have to be read to answer your query. If it's tens of GB, then a partition per day is OK. But you can also partition by the timestamp truncated to the week, in which case you'll get only 52 partitions per year. Z-ordering will help keep the files optimized, but if you're appending data every 5-30 minutes you'll end up with at least 48 files per day inside each partition, so you will need to run OPTIMIZE with ZORDER every night, or something like that, to decrease the number of files. Also, make sure that you're using optimized writes: although this makes the write operation slower, it decreases the number of files generated. (If you're planning to use z-ordering, then it makes no sense to enable auto-compaction.)
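A sketch of the nightly compaction this answer recommends, assuming a Delta table named events partitioned by date and z-ordered on colA (both names are hypothetical, taken from the question):

# Compact and z-order only recent partitions; OPTIMIZE is incremental,
# so already-compacted files are skipped.
spark.sql("""
    OPTIMIZE events
    WHERE date >= current_date() - INTERVAL 7 DAYS
    ZORDER BY (colA)
""")

# Optimized writes (Databricks-specific setting): slower writes, but
# fewer, larger files per micro-batch.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")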

How to apply rules that can be defined at runtime to streaming datasets?

Not sure the title is well suited to what I'm trying to achieve, so bear with me.
I'll start with defining my use case:
Many (say, millions of) IoT devices are sending data to my Spark stream. Each device sends its current temperature reading every 10 seconds.
The owner of all of these IoT devices can define preset rules, for example: if temperature > 50 then do something.
I'm trying to figure out if I can output how many of these devices have met this "> 50" criterion in some time period. The catch is that the rules are defined in real time and should be applied to the Spark job in real time.
How would I do that? Is Spark the right tool for the job?
Many thanks
Is Spark the right tool for the job?
I think so.
the rules are defined in real time and should be applied to the Spark job at real time.
Let's assume the rules are in a database, so every batch interval Spark would fetch them and apply them one by one. They could also be in a file or any other storage; that's orthogonal to the main requirement.
How would I do that?
The batch interval would be "some time period". I assume the payload has deviceId and temperature. With that, you can use a regular filter over temperature and get the deviceId back. You don't need a stateful pipeline for this unless you want to accumulate data over a period longer than your batch interval.
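A minimal sketch of that per-batch approach, assuming the rules sit in a JDBC-reachable table with rule_id and threshold columns, and that the payload has deviceId and temperature (all of these names are assumptions):

def apply_rules(batch_df, batch_id):
    # Re-read the rules on every micro-batch so rules defined at
    # runtime take effect on the next interval.
    rules = spark.read.jdbc(url="jdbc:postgresql://db-host/rulesdb",
                            table="rules").collect()
    for rule in rules:
        matched = batch_df.filter(batch_df.temperature > rule["threshold"])
        # "Do something": here, count the distinct devices that met the rule.
        print(batch_id, rule["rule_id"],
              matched.select("deviceId").distinct().count())

readings.writeStream \
    .foreachBatch(apply_rules) \
    .start()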

Perform queries over the time-series stream of data

I'm trying to design an architecture of my streaming application and choose the right tools for the job.
This is how it works currently:
Messages from the "application-producer" part have the form of (address_of_sensor, timestamp, content) tuples.
I've already implemented all the functionality before Kafka, and now I've encountered a major flaw in the design. In the "Spark Streaming" part, the consolidated stream of messages is translated into a stream of events. The problem is that events are for the most part composite: they consist of multiple messages which occurred at the same time at different sensors.
I can't rely on "time of arrival to Kafka" as a means of detecting simultaneity. So I have to somehow sort messages in Kafka before extracting them with Spark. Or, more precisely, make queries over Kafka messages.
Maybe Cassandra is the right replacement for Kafka here? I have a really simple data model, and only two possible types of queries to perform: query by address, and range query by timestamp. Maybe this is the right choice?
Does anybody have any numbers on Cassandra's throughput?
If you want to run queries on your time series, Cassandra may be the best fit - it is very write-optimized, and you can build 'wide' rows for your series. It is possible to take slices of your wide rows, so you can select a time range with only one query.
On the other hand, Kafka can be considered a raw data flow - you don't have queries, only recently produced data. In order to collect data based on some key in the same partition, you have to select that key carefully. All data within the same partition is time-sorted.
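A minimal sketch of that wide-row model with the DataStax Python driver (the keyspace, table, and sensor names are hypothetical): one partition per sensor address, clustered by timestamp, so a time-range slice is a single-partition query.

from datetime import datetime, timedelta
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("sensors")
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        address text,
        ts timestamp,
        content text,
        PRIMARY KEY (address, ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Range query by timestamp for one sensor: a single-partition slice.
end = datetime.utcnow()
start = end - timedelta(hours=1)
rows = session.execute(
    "SELECT * FROM readings WHERE address = %s AND ts >= %s AND ts < %s",
    ("sensor-42", start, end))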
Range queries on timestamp are the classic Cassandra use case. If you need address-based queries as well, you would have to make the address part of the primary key, with the timestamp as a clustering column. As far as Cassandra throughput is concerned, if you can invest in proper performance analysis of the cluster, you can achieve very high write throughput. But I have used Spark SQL, the Cassandra driver, and the Spark Cassandra connector, and they don't really give high query throughput until you have a big cluster with high-CPU machines; it does not work well with small datasets.
Kafka should not be used as a data source for queries; it's more of a commit log.

Require help in creating a design for a Cassandra data model for my requirement

I have a Job_Status table with the following columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
A few other fields containing stats (like memory and CPU utilization)
At a regular interval (say 1 minute), entries are inserted into the above table for the jobs running on each machine.
I want to design the data model in Cassandra.
My requirement is to get the list of pairs of jobs which are running at the same time on 2 or more machines.
I have created a table with Job_Id and Job_Time as the primary key, but in order to achieve the desired result I have to do lots of parsing of the data after retrieving the records, which takes a lot of time once the number of records reaches around 500 thousand.
This requirement calls for something like SQL's inner join, but I can't use SQL due to business reasons, and a SQL query over such a huge data set also takes a long time, as I found when I tried it with dummy data in SQL Server.
So I require your help on the below points:
Kindly suggest an efficient data model in Cassandra for this requirement.
How can the join operation of SQL be achieved/implemented in a Cassandra database?
Kindly suggest some alternate design/algorithm. I have been stuck on this problem for a very long time.
That's a pretty broad question. As a general approach, you might want to look at pairing Cassandra with Spark so that you can do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert, so that jobs that don't get deleted explicitly will auto-delete after some time).
When you wanted to update your pairing of jobs, you'd run a Spark batch job that would load the table data into an RDD and then do a map/reduce operation on the data, or use Spark SQL to do a SQL-style join. You'd probably then write the resulting RDD back to a Cassandra table.
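A sketch of that Spark-side join, using DataFrames and the Spark Cassandra connector rather than raw RDDs (the keyspace, table names, and the exact pairing condition are assumptions based on the question):

from pyspark.sql.functions import col

jobs = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(keyspace="monitoring", table="job_status") \
    .load()

# Self-join on job_time to find jobs running at the same time on
# different machines; "<" avoids self-pairs and mirrored duplicates.
a, b = jobs.alias("a"), jobs.alias("b")
pairs = a.join(b,
        (col("a.job_time") == col("b.job_time")) &
        (col("a.machine_id") < col("b.machine_id"))) \
    .select(col("a.job_id").alias("job_a"),
            col("b.job_id").alias("job_b"),
            col("a.job_time"))

pairs.write \
    .format("org.apache.spark.sql.cassandra") \
    .options(keyspace="monitoring", table="concurrent_jobs") \
    .mode("append") \
    .save()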

DSE Hive taking a constant time of 30 seconds for aggregate functions like sum() and count(*). MapReduce jobs taking too much time

Recently I configured DSE on my system for a project. Now I want to use Hive to fetch data from Cassandra. Everything was fine: SELECT * queries took under 1 second. But the problem is that queries using aggregate functions take a constant time of around 30 seconds (that is, whenever a MapReduce job is launched). I edited mapred-site.xml (and also dse-mapred-default.xml :D) based on the DataStax documentation to tune Hive performance, but unfortunately there was no change. Please help me.
Hive is not meant for fast query processing. It's a data-warehouse system, preferred when you want to process huge amounts of data in batches.
If you need faster results, I suggest you try HBase/Cassandra.
