How to partition unequally distributed events on a timeline? - apache-spark

I'm working on an event processing system where I have to read my event data from an HBase table.
The events I read are stored based on their timestamp.
When I read in a whole day (24 hours), I find periods of the day where I get 1 million events per hour (e.g. during regular business hours) and other periods where I only get a few thousand events.
So when I partition the day into equal slices, I get some partitions (and workers) with a lot of work and some with very little.
Is there a concept for partitioning the day so that during off-peak times a partition covers more hours and during the main hours it covers fewer?
This would result in something like the following (sketched in code after the list):
* from 0-6am use 4 partitions
* from 6am to 6pm use 60 partitions
* from 6pm to 12am use 6 partitions
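A minimal sketch of how such uneven time buckets could be generated (plain Python; the spec values just mirror the example above, and turning each range into an HBase scan or Spark partition is left out):

```python
from datetime import datetime, timedelta

def bucket_boundaries(day_start, spec):
    """Yield (start, end) pairs for uneven time buckets.

    spec is a list of (hours_covered, number_of_buckets) tuples,
    e.g. 0-6am in 4 buckets, 6am-6pm in 60 buckets, 6pm-midnight in 6.
    """
    cursor = day_start
    for hours, n_buckets in spec:
        width = timedelta(hours=hours) / n_buckets
        for _ in range(n_buckets):
            yield cursor, cursor + width
            cursor += width

day = datetime(2024, 1, 1)
spec = [(6, 4), (12, 60), (6, 6)]          # matches the example above
ranges = list(bucket_boundaries(day, spec))
# each (start, end) range would become one scan / one partition
```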

If you just use the timestamp as the row key, you already have a problem with region hot-spotting, even before any processing. A simple solution is to add a sharding key before the timestamp.
Row key = (timestamp % number of regions) + timestamp.
This will distribute rows equally across regions.
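A hedged sketch of that salting idea (plain Python; the number of regions and the key layout are assumptions, and HBase row keys are really byte strings whose exact encoding is up to you):

```python
NUM_REGIONS = 20  # assumption: number of regions / salt buckets you pre-split into

def salted_row_key(timestamp_ms: int) -> bytes:
    """Build a row key of the form '<salt>-<timestamp>' as suggested above,
    so consecutive timestamps land in different regions."""
    salt = timestamp_ms % NUM_REGIONS
    # zero-pad both parts so keys still sort lexicographically within a salt bucket
    return f"{salt:02d}-{timestamp_ms:013d}".encode("utf-8")

# Reading a time range back then means one scan per salt prefix
# (b'00-', b'01-', ..., b'19-'), each limited to the same timestamp range.
```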

Related

spark partition strategy comparison between date=dd-mm-yyyy vs yyyy={xxxx}/mm={mm}/dd={xx}

How do I choose a partition strategy in Spark for dates? I have a column in a data frame with the date in 2020-02-19 format. Should I specify the date itself as the partition column while writing, or create multiple columns (dd, mm, yyyy) from the date in the table and specify the columns yyyy, mm, dd in repartition?
What kinds of issues come with each partition strategy?
There is no real gain in choosing a single partition column date=yyyy-mm-dd over multiple partitions year=yyyy/month=mm/day=dd: if you have to process the last 10 days, both layouts give you the same amount of data in the same time. The biggest difference is how you query and how you maintain your data.
With a single date partition, queries for a specific day are easy to write: "I need to run it for 3 days ago", or "I need a date range from 1st of Jan to 1st of May". Having one partition column with the full date makes that much easier.
Having multiple partitions makes monthly analysis easy; querying a whole month or a whole year is straightforward. But you lose the ability to query the data by a simple date range.
Besides those usability differences, from a performance perspective neither choice creates overhead: both solutions read the data at the same speed, because neither breaks the data into smaller files than the other. I prefer a single partition column with the day, because it is easier to maintain from my point of view.
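A minimal PySpark sketch of the two layouts (paths and column names are illustrative, not from the question):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = (spark.createDataFrame([("2020-02-19", 1), ("2020-03-01", 2)],
                            ["event_date", "value"])
          .withColumn("event_date", F.to_date("event_date")))

# Option 1: a single partition column holding the full date
df.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_by_date")

# Option 2: derived year / month / day partition columns
(df.withColumn("year", F.year("event_date"))
   .withColumn("month", F.month("event_date"))
   .withColumn("day", F.dayofmonth("event_date"))
   .write.mode("overwrite").partitionBy("year", "month", "day")
   .parquet("/tmp/events_by_ymd"))

# A date-range query is simpler to express against option 1:
spark.read.parquet("/tmp/events_by_date") \
     .where("event_date BETWEEN '2020-01-01' AND '2020-05-01'")
```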

Using QDigest over a date range

I need to keep a 28 day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who do that event for the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.
The old way was to keep a 28-day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days, they would have 54 rows in the DB. This ends up being a large table, and it is time-consuming even to calculate approx_distinct rather than an exact distinct. The issue is that I also wish to calculate approx_percentiles.
So I started investigating the use of HyperLogLog: https://prestodb.io/docs/current/functions/hyperloglog.html
This works great; it's much more efficient to store the sketches daily rather than the entire list of unique users per day. As I am using approx_distinct, the values are close enough and it works.
I then noticed a similar function for medians: Qdigest.
https://prestodb.io/docs/current/functions/qdigest.html
Unfortunately the documentation on this page is not nearly as good as on the previous ones, so it took me a while to figure it out. It works great for calculating daily medians, but it does not work if I want to calculate the median actions per user over a longer time period. The HyperLogLog examples demonstrate how to calculate approx_distinct users over a time period, but the Qdigest docs give no such example.
When I try something analogous to the HLL date-range example with Qdigest, the results I get look like the 1-day results.
Because you need medians of per-user counts that are aggregated (summed) across multiple days, you'll need to perform that aggregation prior to insertion into the qdigest in order for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent: if daily values are being inserted into the qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
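To illustrate that "units" point outside of Presto, here is a small plain-Python sketch (made-up data) of why the per-user sum has to happen before any percentile sketch is built for a multi-day window:

```python
from collections import defaultdict
from statistics import median

# daily_counts: (day, user_id) -> number of events that user did that day
daily_counts = {
    ("2024-01-01", "u1"): 3, ("2024-01-02", "u1"): 5,
    ("2024-01-01", "u2"): 1, ("2024-01-02", "u2"): 2,
}

# Wrong unit for a 2-day median: feeding the daily values straight into one
# digest gives the median *daily* count, i.e. the median of [3, 5, 1, 2].

# Right unit: sum per user across the window first, then take the median,
# i.e. the median of [8, 3] -- this aggregation has to happen before the
# qdigest (or any percentile sketch) is built for 7- or 28-day windows.
per_user_totals = defaultdict(int)
for (day, user), n in daily_counts.items():
    per_user_totals[user] += n

print(median(per_user_totals.values()))   # median events per user over the window
```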

Simple clustering method for the following statement

I have a dataset:
I want to apply a clustering technique to create clusters of every 5 minutes of data and calculate the average of the last column, i.e. percentage congestion.
How can I create such 5-minute clusters? I want to use this analysis further for decision making; the decision will be made on the basis of the calculated average percentage.
That is a simple aggregation, not clustering.
Use a loop, read one record at a time, and every 5 minutes output the average and reinitialize the accumulators.
Or round every timestamp to 5-minute granularity, then take the average over the now-identical keys. That would be a SQL GROUP BY.
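A minimal sketch of that second approach using pandas (column names are assumed, since the dataset itself is not shown):

```python
import pandas as pd

# assumed columns: a timestamp and the percentage-congestion value
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 10:01", "2024-01-01 10:03",
                                 "2024-01-01 10:07"]),
    "pct_congestion": [40.0, 60.0, 30.0],
})

# round every timestamp down to its 5-minute bucket, then average per bucket
df["bucket"] = df["timestamp"].dt.floor("5min")
avg_per_bucket = df.groupby("bucket")["pct_congestion"].mean()
print(avg_per_bucket)
# 10:00 bucket -> 50.0, 10:05 bucket -> 30.0
```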

Manipulating Cassandra writetime value

Let's say I have an application which periodically receives some measurement data.
I know the exact time the data was measured, and I want every piece of data to be deleted 30 days after it was measured.
I'm not inserting the data into the database immediately, but I want to use the time-to-live functionality of Cassandra.
Is there a way to manipulate the internal system timestamp of a row in Cassandra so that I can set the time-to-live to 60 days, but it actually measures the lifespan of each row from my timestamp?
E.g. I measure something at 27.08.2014 - 19:00. I insert this data at 27.08.2014 - 20:00 into the database and set the time-to-live value to 1 day. I now want the row to be deleted at 28.08.2014 - 19:00 and not at 28.08.2014 - 20:00 as it normally would be.
Is something like this possible?
I suggest the following approach, based on your example:
* before insertion, calculate Δx = insertTime - measureTime
* set TTL = 1 day - Δx for the inserted row
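A minimal sketch of that TTL arithmetic in Python (the commented insert assumes the DataStax cassandra-driver and made-up table/column names):

```python
from datetime import datetime, timedelta

TTL_TARGET = timedelta(days=1)        # desired lifespan, measured from measure_time

measure_time = datetime(2014, 8, 27, 19, 0)
insert_time  = datetime(2014, 8, 27, 20, 0)

delta = insert_time - measure_time               # Δx from the answer above
ttl_seconds = int((TTL_TARGET - delta).total_seconds())

# With the Python cassandra-driver (table/column names are hypothetical):
# session.execute(
#     "INSERT INTO measurements (id, value) VALUES (%s, %s) USING TTL %s",
#     (row_id, value, ttl_seconds),
# )
print(ttl_seconds)   # 82800 -> the row expires 24h after the measurement time
```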
Addition, based on a comment:
You can use the Astyanax client with a batch mutation "to simultaneously enter multiple values at once". It is possible to set a TTL on each column or on the whole row at once.

Given data per second, how can I make a seamless web chart of some rolling period of time?

So let's say I have 24 hours of data split into 1-hour chunks, and the data is per second, sequentially. I want to graph a rolling hour at a time, such that every second the graph updates to reflect the latest hour of data. I want this to look as seamless as possible.
I guess there are two questions here: one is how do you do the physical drawing of the data? The second is: how can you progressively load files? I assume you would load hour 1 and hour 2 at first, chart hour 1, then sort of "queue" the seconds in hour 2, taking one element and drawing it every second. At some point you have to load the next hour of data and continue the process... I'm not really sure how to do this in a web context.
I appreciate the help!
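A language-agnostic sketch of the buffering scheme described in the question (written in Python with hypothetical names; in a browser the same structure maps to an array, a setInterval callback, and an async fetch of the next hour file):

```python
from collections import deque

WINDOW = 3600            # one hour of per-second points shown at a time
CHUNK  = 3600            # each file holds one hour of per-second data

def load_chunk(hour_index):
    """Placeholder for fetching one hour-file of per-second values."""
    return [float(hour_index) for _ in range(CHUNK)]   # fake data

window  = deque(load_chunk(0), maxlen=WINDOW)   # what is currently drawn
pending = deque(load_chunk(1))                  # pre-fetched next hour
next_hour = 2

def tick():
    """Called once per second: slide the window forward by one point."""
    global next_hour
    window.append(pending.popleft())            # maxlen drops the oldest point
    if len(pending) < CHUNK // 2:               # refill well before running dry
        pending.extend(load_chunk(next_hour))
        next_hour += 1
    # redraw the chart from `window` here
```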
