Storing arrays in Cassandra

I have lots of fast incoming data that is organised as follows:
Lots of 1D arrays, one per logical object, where the position of each element in the array is important. Each element is calculated and produced individually, in parallel, and so not necessarily in order.
The data arrays themselves are not necessarily written in order.
The length of the arrays may vary.
The data is read as an entire array at a time, so it makes sense to store the whole thing together.
The way I see it, the issue is primarily caused by the way the data is made available for writing. If it was all available together I'd just store the entire lot together at the same time and be done with it.
For smaller data loads I can get away with the Postgres array datatype: one row per logical object, with a key and an array column. This allows me to scale by having one writer per array, writing the elements in any order without blocking any other writer. It is limited by the throughput of a single Postgres node.
In Cassandra/Scylla it looks like I have the options of either:
storing each element as its own row, which would be very fast for writing; reads would be more cumbersome but doable, potentially involving lots of very wide scans;
or converting the array to JSON/a string, reading the cell, tweaking the value and re-writing it, which would be horribly slow and lead to lots of compaction overhead;
or having the writer buffer until it has received all the array values and then writing the array in one go, except the writer won't know how long the array should be and will need a timeout after which it writes down whatever it has, which ultimately means I'll need to update the row at some point in the future if the late data turns up.
What other options do I have?
Thanks

Option 1 seems to be a good match:
I assume each logical object has a unique id (or better, a uuid).
In such a case, you can create something like
CREATE TABLE tbl (id uuid, ord int, v text, PRIMARY KEY (id, ord));
where id is the partition key and ord is the clustering (ordering) key, storing each "array" as a partition and each value as a row.
This allows:
fast retrieval of the entire "array", even a big one, using paging
fast retrieval of a single element by its index in the array
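For illustration, the access pattern against that table might look like this (the question marks are prepared-statement placeholders):
-- each parallel writer inserts its element independently, in any order
INSERT INTO tbl (id, ord, v) VALUES (?, ?, ?);
-- read back the whole "array" in element order (drivers page through large partitions automatically)
SELECT ord, v FROM tbl WHERE id = ?;
-- or fetch a single element by its position
SELECT v FROM tbl WHERE id = ? AND ord = ?;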

Related

Sorting enormous dataset

I have an enormous dataset (over 300 million documents). It is a system for archiving data with rollback capability.
The rollback capability is a cursor which iterates through the whole dataset and performs a few POST requests to some external endpoints; it's a simple piece of code.
The data being iterated over needs to be sent ordered by the timestamp (a field in the document). The DB was down for some time, so a backup DB was used, but it received older data which had been archived manually, and later everything was merged back into the main DB.
The older data breaks the order. I need to sort this dataset, but the problem is the size; there is not enough RAM available to perform this operation at once. How can I achieve this sorting?
PS: The documents do not contain any indexed fields.
There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting them in the desired order is very cheap (beyond the overhead of maintaining the index).
The only way to sort all entries without an index is to fetch the field you want to sort by for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.
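If the store happens to be MongoDB (the question mentions documents and a cursor but doesn't say), the index-then-stream approach might look roughly like this; the collection, field and replay() names are assumptions, not from the question:

from pymongo import ASCENDING, MongoClient

coll = MongoClient()["archive"]["documents"]       # hypothetical connection and collection
coll.create_index([("timestamp", ASCENDING)])      # one-off; built on disk, does not need to fit in RAM

# stream in timestamp order; the cursor fetches in batches, so memory stays bounded
for doc in coll.find().sort("timestamp", ASCENDING).batch_size(1000):
    replay(doc)                                    # hypothetical POST to the external endpoint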

Data storage parallelization in .NET Core

I am a little lost in this task. There is a requirement for our caching solution to split a large data dictionary into partitions and perform operations on them in separate threads.
The scenario is: we have a large pool of data that is to be kept in memory (40m rows). The chosen strategy is to have an outer Dictionary with an int key, which holds 16 inner dictionaries keyed by Guid, each containing a data class.
The number 16 is calculated on startup and indicates CPU core count * 4.
The data class contains a byte[], which is basically a translated set of properties and their values, an int pointer into a metadata dictionary, and a checksum.
Then there is a set of control functions that takes care of locking and assigns/retrieves Guid-keyed data based on dividing the first segment of the Guid (8 hex digits) by a divider. This divider is just FFFFFFFF / 16. This way each key has a corresponding partition assigned.
Now I need to figure out how to perform operations (key lookups, iteration and writes) on these dictionaries in separate threads, in parallel. Should I just wrap these operations in Tasks? Or would it be better to load each of these behemoth dictionaries wholesale into its own thread?
I have a rough idea how to implement data collectors, that will be the easy part I guess.
Also, is using Dictionaries a good approach? Their size is limited to 3 million rows per partition, and if one is full, the control mechanism tries to insert on another server that uses the exact same mechanism.
Is .NET actually a good language to implement this solution?
Any help will be extremely appreciated.
Okay, so I went with ReaderWriterLockSlim and implemented concurrent access through System.Threading.Tasks. I also managed to eliminate the dataClass object from storage; now it is only a dictionary of byte[]s.
It's able to store all 40 million rows in just under 4 GB of RAM, and through some careful SIMD-optimised manipulations it performs EQUALS, <, > and SUM iterations in under 20 ms, so I guess this issue is solved.
The concurrency throughput is also quite good.
I just wanted to post this in case anybody faces a similar issue in the future.
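For anyone trying to picture the partitioning scheme from the question, here is a minimal sketch of the guid-to-partition arithmetic (Python is used purely for brevity; the actual solution is .NET, and all names are illustrative):

PARTITIONS = 16                      # CPU core count * 4, computed at startup
DIVIDER = 0xFFFFFFFF // PARTITIONS   # the "FFFFFFFF / 16" divider from the question

def partition_for(guid: str) -> int:
    # the first segment of the Guid (8 hex digits) decides the partition
    segment = int(guid.replace("-", "")[:8], 16)
    # integer division by the divider; clamp, because the very top of the
    # 32-bit range would otherwise map to index 16
    return min(segment // DIVIDER, PARTITIONS - 1)

print(partition_for("8f14e45f-ceea-467f-a0e8-b1a1d3b6a2f0"))   # some index in 0..15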

Hive/Impala performance with string partition key vs Integer partition key

Are numeric columns recommended for partition keys? Will there be any performance difference when we do a select query on numeric column partitions vs string column partitions?
Well, it does make a difference, if you look at the official Impala documentation.
Instead of elaborating, I will paste the section from the doc, as I think it states it quite well:
"Although it might be convenient to use STRING columns for partition keys, even when those columns contain numbers, for performance and scalability it is much better to use numeric columns as partition keys whenever practical. Although the underlying HDFS directory name might be the same in either case, the in-memory storage for the partition key columns is more compact, and computations are faster, if partition key columns such as YEAR, MONTH, DAY and so on are declared as INT, SMALLINT, and so on."
Reference: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_string.html
No, there is no such recommendation. Consider this:
The thing is that a partition in Hive is represented as a folder with a name like 'key=value' (or just 'value'), but either way it is a string folder name. So the partition key value is stored as a string and cast during read/write. It is not packed inside the data files and is not compressed.
Due to the distributed/parallel nature of map-reduce and Impala, you will never notice a difference in query processing performance. Also, all data is serialized to be passed between processing stages, then deserialized again and cast to some type; this can happen many times for the same query.
There is a lot of overhead created by distributed processing and by serializing/deserializing data. Practically, only the size of the data matters: the smaller the table (its file size), the faster it works. But you will not improve performance by restricting types.
Big string values used as partition keys can affect metadata DB performance, and the number of partitions being processed can also affect performance. Again, the same point applies: only the size of the data matters here, not the types.
1 and 0 can be better than 'Yes' and 'No' just because of size, and compression and parallelism can make even this difference negligible in many cases.

Out of memory [divide and conquer algorithm]

So I have a foo table which is huge, and whenever I try to read all the data from that table Node.JS gives me an out-of-memory error! I can still get chunks of data using offset and limit, but I cannot merge all of the chunks and hold them in memory, because I run out of memory again. In my algorithm I have lots of ids and need to check whether each id exists in the foo table or not; what is the best solution (in terms of algorithmic complexity) when I cannot have all of the data in memory to check whether an id exists in the foo table?
PS: The naive solution is to get chunks of data and check chunk by chunk for each id, but the complexity is n squared; there should be a better way, I believe...
Under the constraints you specified, you could create a hash table containing the IDs you are looking for as keys, with all values initialized to false.
Then read the table chunk by chunk, and for each item in the table, look it up in the hash table. If found, mark the hash table entry as true.
After going through all the chunks, your hash table will hold true for the IDs found in the table.
Provided that the hash table lookup has constant time complexity, this algorithm has a time complexity of O(N).
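A minimal sketch of this approach (Python, with a hypothetical fetch_chunk() helper that runs the LIMIT/OFFSET query and returns the ids in that chunk, or an empty list when the table is exhausted):

ids_to_check = [42, 1337, 99]                    # the ids from your algorithm
found = {id_: False for id_ in ids_to_check}     # the hash table, everything false initially

limit, offset = 10_000, 0
while True:
    chunk = fetch_chunk("SELECT id FROM foo LIMIT %s OFFSET %s", limit, offset)  # hypothetical helper
    if not chunk:
        break
    for row_id in chunk:
        if row_id in found:       # O(1) average-case hash lookup
            found[row_id] = True
    offset += limit

existing = [i for i, ok in found.items() if ok]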
You can sort your ids and break them into chunks. Then you can keep in memory the range of values of each chunk: (lowestId, highestId) for that chunk.
You can quickly find which chunk (if any) an id may be contained in using an in-memory binary search over those ranges, then load that specific chunk into memory and do a binary search on it.
The complexity should be logarithmic for both searches. In general, read about the binary search algorithm.
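A rough sketch of that idea, again in Python: chunk_bounds holds the in-memory (lowestId, highestId) pairs for the sorted chunks, and load_chunk() is a hypothetical helper that loads one sorted chunk:

import bisect

def id_exists(target, chunk_bounds, load_chunk):
    # binary search over the chunk ranges kept in memory
    lows = [low for low, _ in chunk_bounds]
    i = bisect.bisect_right(lows, target) - 1
    if i < 0 or target > chunk_bounds[i][1]:
        return False                          # no chunk's range can contain it
    chunk = load_chunk(i)                     # only this one chunk is brought into memory
    j = bisect.bisect_left(chunk, target)     # binary search inside the sorted chunk
    return j < len(chunk) and chunk[j] == target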
"PS: The naive solution is to get chuncks of data and see chunck by chunck for each id; but the complexity is n squared; there should be a better way I believe..."
Let's say you could load the whole table into your memory. In any case you'll need to check each ID whether or not it is in the DB. You can't do it any better than comparing.
Having said that, a hash table comes to mind. Lets say the IDs are integers, and they are randomly picked. you could hash the IDs you need to check by the last two digits (or the first two for that matter). Then checking the items you have in your memory will be quicker.

Cassandra schema advice needed

I'm designing a Cassandra schema for a browser event collection system, and I was hoping to sanity check my approach. The system collects user events in the browser, like mouse movements, clicks, etc. The events are stored and processed to create heat maps of user activity on a web page. I've chosen Cassandra for persistence, since my use case is more write-heavy than read-heavy: every 50 milliseconds, an ajax call dumps the aggregated events to my server, and into the database. I'm using node.js for the server, and the JSON events look something like this on the server:
{ uuid: dsf86ag487hadf97hadf97, type: 'MOVE', time: 12335234345, pageX: 334, pageY:566, .... }
As you can see, each user has a unique uuid associated with each of their events, generated in the browser and stored in a cookie. My read case will be some map-reduce job. Each top-level domain will be a keyspace, and I was planning on using the uuid as my partition key. The main table will be the events table, where each row will be one event, using a composite primary key consisting of the browser-generated uuid and a Cassandra-generated timeuuid. The primary key must have a timeuuid component, since two events may have the same timestamp on certain browsers. The data types for an event will be strings, ints and timestamps. The total data for a partition should not exceed a few hundred megabytes. So... is this sane? What questions should I be asking myself? I recognize that this use case has many analogs in sensor data collection, etc., so please point me to existing examples. Thanks in advance.
Choosing a partition key
While recording the user ID may be important in some cases for distinguishing events from different users that may occur at the same time, the user ID is probably not the best choice for the partition key. That is, unless you are planning to analyze the behavior of specific users.
You are probably more concerned with how the heatmap changes over time and specifically which areas of the page were involved. These are probably better considerations for your partition key, though perhaps not stored as a timestamp nor as X/Y coordinates, which I'll get into later.
You will generally want to choose a partition key that (1) has a large distribution of values, to create an even load across your cluster, and (2) is made up of values that are relatively "well known". By "well known", I mean something you either know in advance or something that can be computed easily and deterministically. For instance, you will have many users and will gather statistics over many days. While the specific set of days (encoded as, say, YYYY-MM-DD strings) can easily be determined from a known start/end date range or from query input, the set of all valid user IDs (assuming UUIDs or other non-incremental values, or hashes) is much harder to determine without scanning the entire cluster. Avoid doing partition key scans; aim for "exact" random access to your partitions.
Format of the partition key
The partition key is traditionally shown as a single column in many examples, but you can have a multi-column partition key. This can be useful when using date/time information as all or part of the key. You would aim to have as few unique values per column as possible, so that the set of values you need to enumerate is as small as possible, but as many values (or additional columns) as necessary to balance the I/O load and data distribution across the cluster.
For example, if you were to use a timestamp as your partition key, in 64-bit Java timestamp format, there are 1,000 possible partitions per second. Even though you can technically iterate over them, that may be more granular than you need or want. On the other side, if your partition key were simply the 4-digit year, then all of that year's events would go to the same partition (making it very large) and to the same set of replica nodes (hotspots, inefficient cluster use). By choosing a key that balances between these extremes, you can control the size of your partitions and also the number of partitions you must access in order to satisfy a query.
Also consider what you'll do when you ever want to delete old data. The easiest means (within a single column family/table) is to delete an entire partition as this helps avoid accumulating individual column tombstones. If you ever want to run an operation like "delete all data older than 2013" then you definitely don't want to bury the date deep down in the data and would rather have it as part of your partition key.
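As a hedged illustration of that trade-off (the table and column names are made up, not from the question), a composite partition key that buckets events by page and day keeps partitions bounded in size, keeps them easy to enumerate for a date-range query, and turns "delete everything older than 2013" into a series of whole-partition deletes:

CREATE TABLE events_by_page_day (
  page text,
  day text,                 -- e.g. '2013-11-05'
  event_time timeuuid,
  user_id uuid,
  event_type text,
  page_x int,
  page_y int,
  PRIMARY KEY ((page, day), event_time)
);

-- expiring a day's data is one whole-partition delete, with no per-column tombstone build-up:
DELETE FROM events_by_page_day WHERE page = ? AND day = ?;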
Choosing a row (clustering) key
Any additional columns in the primary key that are not part of the partition key become the row key within the partition, and the rows are clustered (ordered) by the sort order of the first of these columns.
That clustering/sorting is important, because it's generally the only native sorting you're going to get with Cassandra. Even if the partition key is down to the level of a specific hour or minute of a specific day, you might choose to cluster the rows by your millisecond timestamp or time UUID, to keep everything within that partition in chronological order.
You can still have additional columns, like your X/Y coordinates or user IDs, in your row keys -- in case it sounded like I was recommending that you put time (only) in both the partition and clustering keys.
Using X/Y coordinates
This part has nothing to do with Cassandra, but if you're heat-mapping the page, be aware that people use different screens and devices at different resolutions. Unless you're doing pixel-perfect layout on your site (and hopefully you're using a fluid, responsive layout instead), the X/Y coordinates of one user aren't going to match the X/Y coordinates of another user. They might not even match for the same user, if that user switches devices.
Consider mapping not by the X/Y coordinate of the mouse, but perhaps by the IDs of elements in the DOM. Have an ID for your "sidebar", "main menu", "main body div" and any specific elements you want to map. These would be string keys, not coordinate pairs, and while they'd still be triggered on mouse enter/leave/click, the logged information doesn't depend on or assume any particular screen geometry.
Perhaps you decide to include the element ID as part of the row or partition key, too.
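If you do go that route, a hedged variant of the sketch above swaps the raw coordinates for a DOM element ID and adds it to the clustering key (again, all names are illustrative):

CREATE TABLE element_events_by_page_day (
  page text,
  day text,
  element_id text,          -- 'sidebar', 'main-menu', 'main-body-div', ...
  event_time timeuuid,
  user_id uuid,
  event_type text,
  PRIMARY KEY ((page, day), element_id, event_time)
);

Clustering by element_id before event_time means all of one element's events for a page/day sit together in chronological order, which suits the heat-map aggregation.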
