Apache Spark bucketed table not scalable in time partitioned table - apache-spark

I am trying to utilize Spark Bucketing on a key-value table that is frequently joined on the key column by batch applications. The table is partitioned by timestamp column, and new data arrives periodically and added in a new timestamp partition. Nothing new here.
I thought it was ideal use case for Spark bucketing, but some limitations seem to be fatal when the table is incremental:
Incremental table forces multiple files per bucket, forcing Spark to sort every bucket upon join even though every file is sorted locally. Some Jira's suggest that this is a conscious design choice, not going to change any time soon. This is quite understood, too, as there could be thousands of locally sorted files in each bucket, and iterating concurrently over so many files does not seem a good idea, either.
Bottom line is, sorting cannot be avoided.
Upon map side join, every bucket is handled by a single Task. When the table is incremented, every such Task would consume more and more data as more partitions (increments) are included in the join. Empirically, this ultimately failed on OOM regardless to the configured memory settings. To my understanding, even if the failures can be avoided, this design does not scale at all. It imposes an impossible trade-off when deciding on the number of buckets - aiming for a long term table results in lots of small files during every increment.
This gives the immediate impression that bucketing should not be used with incremental tables. I wonder if anyone has a better opinion on that, or maybe I am missing some basics here.

Related

Athena: $path vs. partition

I'm storing daily reports per client for query with Athena.
At first I thought I'd use a client=c_1/month=12/day=01/ or client=c2/date=2020-12-01/ folder structure, and run MSCK REPAIR TABLE daily to make new day partition available for query.
Then I realized there's the $path special column, so if I store files as 2020-12-01.csv I could run a query with WHERE $path LIKE '%12-01% thus saving a partition and the need to detect/add it daily.
I can see this having an impact on performance if there was a lot of daily data,
But in my case the day partition will include one file at most, so a partition is mostly to have a field to query, not reduce query dataset.
Any other downside?
When using $path column, all table (partition) location needs to be fully listed.
if you have large number of objects in S3, this listing can become a bottleneck.
Partitions avoid this problem.
Of course, having large number of partitions is also a problem.
I don't know what the cardinality of client column, so hard to tell how many partitions to expect with this approach.
Currently Athena does not apply any optimisations for $path, which means that there is no meaningful difference between WHERE "$path" LIKE '%12-01% and WHERE "date" = '2020-12-01' (assuming you have a column date which contains the same date as the file name). Your data probably already has a date or datetime column, and your queries will be more readable using it rather than $path.
You are definitely on the right track questioning whether or not you need the date part of your current partitioning scheme. There are lots of different considerations when partitioning data sets, and it's not easy to always say what is right without analysing the situation in detail.
I would recommend having some kind of time-based partition key. Otherwise you will have no way to limit the amount of data read by queries, and they will be slower and more expensive as time goes. Partitioning on date is probably too fine grained for your use case, but perhaps year or month would work.
However, if there will only be data for a client for a short time (less than one thousand files in total, the size of one S3 listing page), or queries always read all the data for a client, you don't need a time-based partition key.
To do a deeper analysis on how to partition your data I would need to know more about the types of queries you will be running, how the data is updated, how much data files are expected to contain, and how much difference there will be from client to client.

Cassandra data model too many table

I have a single structured row as input with write rate of 10K per seconds. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries needs different WHERE, GROUP BY or ORDER BY, The final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit in number of tables in Cassandra data model (200 is warning and 500 would fail)
Because for every input row I should do an insert in every table, the final write per seconds became big * big data!:
writes per seconds = 10K (input)
* number of tables (queries)
* replication factor
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like spark or hadoop instead of relying on bare datamodel? Or event Hbase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8Gb) going non-stop and I never timed out. Also the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs says that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes incredible (powerful) WHERE for sure. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough, however, when (re-)ordering you lose on speed... Something to keep in mind. On my tests, though, even the ORDER BY equivalents was very fast.

What is the best way to query timeseries data with cassandra?

My table is a time series one. The queries are going to process the latest entries and TTL expire them after successful processing. If they are not successfully processed, TTL will not set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a listenablefuture with my own executor add any value to this considering that the thread doing the select is just processing.
I am concerned about the TTL and tombstones. But if I use clustering key of timeuuid type is this ok?
You are right one important thing getting in your way will be tombstones. By Default you will keep them around for 10 days. Depending on your access patter this might cause significant problems. You can lower this by setting the directly on the table or change it in the cassandra yaml file. Then it will be valid for all the newly created table gc_grace_seconds
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
It is very important that you make sure you are running the repair on whole cluster once within this period. So if you lower this setting to let's say 2 days, then within two days you have to have one full repair done on the cluster. This is very important because processed data will reaper. I saw this happening multiple times, and is never pleasant especially if you are using cassandra as a queue and it seems to me that you might be using it in your solution. I'll try to give some tips at the end of the answer.
I'm slightly worried about you setting the ttl dynamically depending on result. What would be the point of inserting the ttl-ed data that was successful and keeping forever the data that wasn't. I guess some sort of audit or something similar. Again this is a queue pattern, try to avoid this if possible. Also one thing to keep in mind is that you will almost always insert the data once in the beginning and then once again with the ttl should your processing be o.k.
Also getting all entries might be a bit tricky. For very moderate load 10-100 req/s this might be reasonable but if you have thousands per second getting all the requests every time might not be a good idea. At least not if you put them into single partition.
Separating the workload is also good idea. So yes using listenable future seems totally legit.
Setting clustering key to be timeuuid is usually the case with time series thata and I totally agree with you on this one.
In reality as I mentioned earlier you have to to take into account you will be saving 10 days worth of data (unless you tweak it) no matter what you do, it doesn't matter if you ttl it. It's still going to be ther, and every time cassandra will scan the partition will have to read the ttl-ed columns. In short this is just pain. I would seriously consider actually using something as kafka if I were you because what you are describing simply looks to me like a queue.
If you still want to stick with cassandra then please consider using buckets (adding date info to partitioning key and having a composite partitioning key). Depending on the load you are expecting you will have to bucket by month, week, day, hour even minutes. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again this might be out of scope of this question.
Be very careful when using cassandra as a queue, it's a known antipattern. You can do it, but there are a lot of variables and it extremely depends on the load you are using. I once consulted a team that sort of went down the path of cassandra as a queue. Since basically using cassandra there was a must I recommended them bucketing the data by day (did some calculations that proved this is o.k. time unit) and I also had a look at this solution https://github.com/paradoxical-io/cassieq basically there are a lot of good stuff in this repo when using cassandra as a queue, data models etc. Basically this team had zombie rows, slow reading because of the tombstones etc. etc.
Also the way you described it it might happen that you have "hot rows" basically since you would just have one wide partition where all your data would go some nodes in the cluster might not even be that good utilised. This can be avoided by artificial columns.
When using cassandra as a queue it's very easy to mess a lot of things up. (But it's possible for moderate workloads)

Spark: Continuously reading data from Cassandra

I have gone through Reading from Cassandra using Spark Streaming and through tutorial-1 and tutorial-2 links.
Is it fair to say that Cassandra-Spark integration currently does not provide anything out of the box to continuously get the updates from Cassandra and stream them to other systems like HDFS?
By continuously, I mean getting only those rows in a table which have changed (inserted or updated) since the last fetch by Spark. If there are too many such rows, there should be an option to limit the number of rows and the subsequent spark fetch should begin from where it left off. At-least once guarantee is ok but exactly-once would be a huge welcome.
If its not supported, one way to support it could be to have an auxiliary column updated_time in each cassandra-table that needs to be queried by storm and then use that column for queries. Or an auxiliary table per table that contains ID, timestamp of the rows being changed. Has anyone tried this before?
I don't think Apache Cassandra has this functionality out of the box. Internally [for some period of time] it stores all operations on data in sequential manner, but it's per node and it gets compacted eventually (to save space). Frankly, Cassandra's (as most other DB's) promise is to provide latest view of data (which by itself can be quite tricky in distributed environment), but not full history of how data was changing.
So if you still want to have such info in Cassandra (and process it in Spark), you'll have to do some additional work yourself: design dedicated table(s) (or add synthetic columns), take care of partitioning, save offset to keep track of progress, etc.
Cassandra is ok for time series data, but in your case I would consider just using streaming solution (like Kafka) instead of inventing it.
I agree with what Ralkie stated but wanted to propose one more solution if you're tied to C* with this use case. This solution assumes you have full control over the schema and ingest as well. This is not a streaming solution though it could awkwardly be shoehorned into one.
Have you considered using composite key composed of the timebucket along with a murmur_hash_of_one_or_more_clustering_columns % some_int_designed_limit_row_width? In this way, you could set your timebuckets to 1 minute, 5 minutes, 1 hour, etc depending on how "real-time" you need to analyze/archive your data. The murmur hash based off of one or more of the clustering columns is needed to help located data in the C* cluster (and is a terrible solution if you're often looking up specific clustering columns).
For example, take an IoT use case where sensors report in every minute and have some sensor reading that can be represented as an integer.
create table if not exists iottable {
timebucket bigint,
sensorbucket int,
sensorid varchar,
sensorvalue int,
primary key ((timebucket, sensorbucket), sensorid)
} with caching = 'none'
and compaction = { 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowedCompaction' };
Note the use of TimeWindowedCompaction. I'm not sure what version of C* you're using; but with the 2.x series, I'd stay away from DateTieredCompaction. I cannot speak to how well it performs in 3.x. Any any rate, you should test and benchmark extensively before settling on your schema and compaction strategy.
Also note that this schema could result in hotspotting as it is vulnerable to sensors that report more often than others. Again, not knowing the use case it's hard to provide a perfect solution -- it's just an example. If you don't care about ever reading C* for a specific sensor (or column), you don't have to use a clustering column at all and you can simply use a timeUUID or something random for the murmur hash bucketing.
Regardless of how you decide to partition the data, a schema like this would then allow you to use repartitionByCassandraReplica and joinWithCassandraTable to extract the data written during a given timebucket.

Dramatic decrease of Azure Table storage performance after querying whole partition

I use Azure Table storage as a time series database. The database is constantly extended with more rows, (approximately 20 rows per second for each partition). Every day I create new partitions for the day's data so that all partition have a similar size and never get too big.
Until now everything worked flawlessly, when I wanted to retrieve data from a specific partition it would never take more than 2.5 secs for 1000 values and on average it would take 1 sec.
When I tried to query all the data of a partition though things got really really slow, towards the middle of the procedure each query would take 30-40 sec for 1000 values.
So I cancelled the procedure just to re start it for a smaller range. But now all queries take too long. From the beginning all queries need 15-30 secs. Can that mean that data got rearranged in a non efficient way and that's why I am seeing this dramatic decrease in performance? If yes is there a way to handle such a rearrangement?
I would definitely recommend you to go over the links Jason pointed above. You have not given too much detail about how you generate your partition keys but from sounds of it you are falling into several anti patterns. Including by applying Append (or Prepend) and too many entities in a single partition. I would recommend you to reduce your partition size and also put either a hash or a random prefix to your partition keys so they are not in lexicographical order.
Azure storage follows a range partitioning scheme in the background, so even if the partition keys you picked up are unique, if they are sequential they will fall into the same range and potentially be served by a single partition server, which would hamper the ability of azure storage service overall to load balance and scale out your storage requests.
The other aspect you should think is how you are reading the entities back, the best recommendation is point query with partition key and row key, worst is a full table scan with no PK and RK, there in the middle you have partition scan which in your case will also be pretty bad performance due to your partition size.
One of the challenges with time series data is that you can end up writing all your data to a single partition which prevents Table Storage from allocating additional resources to help you scale. Similarly for read operations you are constrained by potentially having all your data in a single partition which means you are limited to 2000 entities / second - whereas if you spread your data across multiple partitions you can parallelize the query and yield far greater scale.
Do you have Storage Analytics enabled? I would be interested to know if you are getting throttled at all or what other potential issues might be going on. Take a look at the Storage Monitoring, Diagnosing and Troubleshooting guide for more information.
If you still can't find the information you want please email AzTableFeedback#microsoft.com and we would be happy to follow up with you.
The Azure Storage Table Design Guide talks about general scalability guidance as well as patterns / anti-patterns (see the append only anti-pattern for a good overview) which is worth looking at.

Resources