Delta Lake partitioning strategy for event data - apache-spark

I'm trying to build a system that ingests, stores and can query app event data. In the future it will be used for other tasks (ML, Analytics, etc.) hence why I think Databricks could be a good option(for now).
The main use case will be retrieving user-action events occurring in the app.
Batches of this event data will land in an S3 bucket about every 5-30 mins and Databricks Auto Loader will pick them up and store it in a Delta Table.
A typical query will be: get all events where colA = x over the last day, week, or month.
I think the typical strategy here is to partition by date. e.g:
date_trunc("day", date) # 2020-04-11T00:00:00:00.000+000
This will create 365 partitions in a year. I expect each partition to hold about 1GB of data. In addition to partitioning, I plan on using z-ordering for one of the high cardinality columns that will frequently be used in the where clause.
Is this too many partitions?
Is there a better way to partition this data?
Since I'm partitioning by day and data is coming in every 5-30 mins, is it possible to just "append" data to a days partition instead?

It's really depends on the amount of data that are coming per day and how many files should be read to answer your query. If it 10th of Gb then partition per day is ok. But you can also partition by timestamp truncated to week, and in this case you'll get only 52 partitions per year. ZOrdering will help to keep the files optimized, but if you're appending data every 5-30 minutes, you'll get with at least 24 files per day inside the partition, so you will need to run OPTIMIZE with ZOrder every night, or something like this, to decrease the number of files. Also, make sure that you're using optimized writes - although this make write operation slower, it will decrease the number of files generated (if you're planning to use ZOrdering, then it makes no sense to enable autocompaction)

Related

Differences between BigQuery BQ.insert_rows_json and BQ.load_from_json?

I want to stream data into BigQuery and I was thinking in use PubSub + Cloud Functions, since there is no transformation needed (for now, at least) and using Cloud Data Flow feels like a little bit over kill for just inserting rows to a table. I am correct?
The data is streamed from a GCP VM using a Python script into PubSub and it has the following format:
{'SEGMENT':'datetime':'2020-12-05 11:25:05.64684','values':(2568.025,2567.03)}
The BigQuery schema is datetime:timestamp, value_A: float, value_B: float.
My questions with all this are:
a) Do I need to push this into BigQuery as json/dictionary with all values as strings or it has to be with the data type of the table?
b) What's the difference between using BQ.insert_rows_json and BQ.load_table_from_json and which one should I use for this task?
EDIT:
What I'm trying to get is actually market data of some assets. Say around 28 instruments and capture all their ticks. On an average day, there are ~60.k ticks per instrument, so we are talking about ~33.6 M invocations per month. What is needed (for now) is to insert them in a table for further analysis. I'm currently not sure if real streaming should be performed or loads per batch. Since the project is in doing analysis yet, I don't feel that Data Flow is needed, but PubSub should be used since it allows to scale to Data Flow easier when the time comes. This is my first implementation of doing streaming pipelines and I'm using all what I've learned through courses and reading. Please, correct me if I'm having a wrong approach :).
What I would absolutely love to do is, for example, perform another insert to another table when the price difference between one tick and the n'th tick is, for example, 10. For this, should I use Data Flow or the Cloud Function approach is still valid? Because this is like a trigger condition. Basically, the trigger would be something like:
if price difference >= 10:
process all these ticks
insert the results in this table
But I'm unsure how to implement this trigger.
In addition to the great answer of Marton (Pentium10)
a) You can stream a JSON in BigQuery, a VALID json. your example isn't. About the type, there is an automatic coercion/conversion according with your schema. You can see this here
b) The load job loads file in GCS or a content that you put in the request. The batch is asynchronous and can take seconds or minutes. In addition, you are limited to 1500 load per days and per table -> 1 per minutes works (1440 minutes per day). There is several interesting aspect of the load job.
Firstly, it's free!
Your data are immediately loaded in the correct partition and immediately request-able in the partition
If the load fail, no data are inserted. So, it's easiest to replay a file without having doubled values.
At the opposite, the streaming job insert in real time the data into BigQuery. It's interesting when you have real time constraint (especially for visualisation, anomalie detections,...). But there is some bad sides
You are limited to 500k rows per seconds (in EU and US), 100k rows in other regions, and 1Gb max per seconds
The data aren't immediately in the partition, they are in a buffer name UNPARTITIONED for a while or up to have this buffer full.. So you have to take into account this specificity when you build and test your real time application.
It's not free. The cheapest region is $0.05 per Gb.
Now that you are aware of this, ask yourselves about your use case.
If you need real time (less than 2 minutes of delay), no doubt, streaming is for you.
If you have few Gb per month, streaming is also the easiest solution, for few $
If you have a huge volume of data (more than 1Gb per second), BigQuery isn't the good service, consider BigTable (that you can request with BigQuery as a federated table)
If you have an important volume of data (1 or 2Gb per minutes) and your use case required data freshness at the minute+, you can consider a special design
Create a PubSub pull subscription
Create a HTTP triggered Cloud Function (or a Cloud Run service) that pull the subscription for 1 minutes and then submit the pulled content to BigQuery as a load job (no file needed, you can post in memory content directly to BigQuery). And then exist gracefully
Create a Cloud Scheduler that trigger your service every minute.
Edit 1:
The cost shouldn't drive your use case.
If, for now, it's only for analytics, you simply imagine to trigger once per days your job to pull the full subscriptions. With your metrics: 60k metrics * 28 instruments * 100 bytes (24 + memory loss), you have only 168Mb. You can store this in Cloud Functions or Cloud Run memory and perform a load job.
Streaming is really important for real time!
Dataflow, in streaming mode, will cost you, at least $20 per month (1 small worker of type n1-standard1. Much more than 1.5Gb of streaming insert in BigQuery with Cloud Functions.
Eventually, about your smart trigger to stream or to batch insert, it's not really possible, you have to redesign the data ingestion if you change your logic. But before all, only if your use case requires this!!
To answer your questions:
a) you need to push to BigQuery using the library's accepting formats usually a collection or either a JSON document formatted to the table's definition.
b) To add data to BigQuery you can Stream data or Load a file.
For your example you need to stream data, so use the 'streaming api' methods insert_rows* family.

How to model time series data in cassandra when data has non-uniform generation rate?

I am planning to migrate data from my existing database (Postgres) to Cassandra. Here is a brief overview of the system:
Current data set size is around 2 Billion
Each data point represents an event. Properties of this event are - user_id, event_name, timestamp
This data is coming from a finite set of sources - For the sake of simplicity let's assume 3 different sources S1, S2, S3 - all of them pushing in a Kafka Topic. This cassandra microservice is consuming data from this topic.
The rate of data coming from S1, S2 and S3 is different. Assume S1 is pushing 1 event for a single user every minute, S2 is pushing 1 event for each user every 15 minutes and S3 is pushing single event for each user every 1 hour.
There are two types of queries this system should support
Get latest event for a given user
Get list of events for a given user and date range (This data range can have diff of at most 30 days)
I am trying to model this data using few different approaches.
Partition data for a single user into monthly buckets. For this additional parameters timestamp_year, timestamp_month are added. timestamp is used a cluster key.
Pros: Less than 10ms write latency. Max partition size is around ~60MB (working good for cassandra 3.11). Get latest event is working in less than 10ms (99.999 percentile).
Cons: Getting month level data is slow because of too much data being read from a single partition. If i put limit on number of records being fetched (let's say 10000) the latency improves. Partition size is non-uniform because of different rate of data from 3 different sources.
I have tried using weekly buckets instead of monthly buckets and pagination to improve on other parameters. But this is something i am not able to sort out Partition size is non-uniform because of different rate of data from 3 different sources.
How can i keep partition size consistent (almost) in such a data model? Ideas are welcome.
This is a classical problem and there are no easy solutions to make partition size uniform. If you can predict the rate of ingestion per user, probably you can have different buckets of users, such as, high, medium and low ingestion users.
Depending on the type, the time bucket would be different. For a high ingestion user, partition means a day and for a low ingestion user, partition means a month.
For speeding up your month query on a high ingestion user, you can run parallel queries of 30 days and see if it helps to optimize your query time.

Cassandra aggregation

The Cassandra database is not very good for aggregation and that is why I decided to do the aggregation before write. I am storing some data (eg. transaction) for each user which I am aggregating by hour. That means for one user there will be only one row for each our.
Whenever I receive new data, I read the row for current hour, aggregate it with received data and write it back.I use this data to generate hourly reports.
This works fine with low velocity data but I observed considerably high data loss when velocity is very high (eg 100 records for 1 user in a min). This is because reads and writes are happening very fast and because of "delayed write", I am not getting updated data.
I think my approach "aggregate before write" itself is wrong. I was thinking about UDF but I am not sure how will it impact on performance.
What is the best way to store aggregated data in Cassandra ?
My idea would be:
Model data in Cassandra on hour-by-hour buckets.
Store plain data into Cassandra immediately when they arrive.
Process at X all the data of the X-1 hour and store the aggregate result on another table
This would allow you to have very fast incoming rates, process data only once, store the aggregates into another table to have fast reads.
I use Cassandra to pre-aggregate also. I have different tables for hourly, daily, weekly, and monthly. I think you are probably getting data loss as you are selecting the data before your last inserts have replicated to other nodes.
Look into the counter data type to get around this.
You may also be able to specify a higher consistency level in either the inserts or selects to ensure you're getting the most recent data.

What is the best data model for timeseries in Cassandra when *fast sequential reads* are required

I want to store streaming financial data into Cassandra and read it back fast. I will have up to 20000 instruments ("tickers") each containing up to 3 million 1-minute data points. I have to be able to read large ranges of each of these series as speedily as possible (indeed it is the reason I have moved to a columnar-type database as MongoDB was suffocating on this use case). Sometimes I'll have to read the whole series. Sometimes I'll need less but typically the most recent data first. I also want to keep things really simple.
Is this model, which I picked up in a Datastax tutorial, the most effective? Not everyone seems to agree.
CREATE TABLE minutedata (
ticker text,
time timestamp,
value float,
PRIMARY KEY (ticker, time))
WITH CLUSTERING ORDER BY (time DESC);
I like this because there are up to 20 000 tickers so the partitioning should be efficient, and there are only up to 3 million minutes in a row, and Cassandra can handle up to 2 billion. Also with the time descending order I get most recent data when using a limit on the query.
However, the book Cassandra High Availability by Robbie Strickland mentions the above as an anti-pattern (using sensor-data analogy), and I quote the problems he cites from page 144:
Data will be collected for a given sensor indefinitely, and in many
cases at a very high frequency
With sensorID as the partition key, the row will grow by two
columns for every reading (one marker and one reading).
I understand point one would be a problem but it's not in my case due to the 3 million data point limit. But point 2 is interesting. What are these "markers" between each reading? I clearly want to avoid anything that breaks contiguous data storage.
If point 2 is a problem, what is a better way to model timeseries so that they can efficiently be read in large ranges, fast? I'm not particularly keen to break the timeseries into smaller sub-periods.
If your query pattern was to find a few rows for a ticker using a range query, then I would say having all the data for a ticker in one partition would be a good approach since Cassandra is optimized to access partitions efficiently.
But if everything is in one one partition, then that means the query is happening on only one node. Since you say you often want to read large ranges of rows, then you may want more parallelism.
If you split that same data across many nodes and read it in parallel, you may be able to get better performance. For example, if you partitioned your data by ticker and by year, and you had ten nodes, you could theoretically issue ten async queries and have each year queried in parallel.
Now 3 million rows is a lot, but not really that big, so you'd probably have to run some tests to see which approach was actually faster for your situation.
If you're doing more than just retrieving all these rows and are doing some kind of analytics on them, then parallelism will become more attractive and you might want to look into pairing Cassandra with Spark so that the data and be read and processed in parallel on many nodes.

Adding and retrieving sorted counts in Cassandra

I have a case where I need to record a user action in Cassandra, then later retrieve a sorted list of users with the highest number of that action in an arbitrary time period.
Can anyone suggest a way to store and retrieve this data in a pre-aggregated method?
Outside of Cassandra I would recommend using stream-summary or count min sketch you would be able to solve this with much less space and have immediate results. Just update and periodically serialize and persist it (assuming you don't need guaranteed accuracy)
In Cassandra you can keep a row per period of time like by hours and have a counter per user in that row, incrementing them on use. Then use a batch job to run through them and find the heavy hitters. You would be constrained to having the minimal queryable time be 1 hour and it wont be particularly cheap or fast to compute but it would work.
Generally it would be good treating these as a log of operation, every time there is an event store it and have batch jobs do analytics against it with hadoop or custom. If need it realtime id recommend the above approach of keeping stream summaries in memory.

Resources