How to store Bert embeddings in cassandra - cassandra

I want to use Cassandra as feature store to store precomputed Bert embedding,
Each row would consist of roughly 800 integers (ex. -0.18294132) Should I store all 800 in one large string column or 800 separate columns?
Simple read pattern, On read we would want to read every value in a row. Not sure which would be better for serialization speed.

Having everything as a separate column will be quite inefficient - each value will have its own metadata (writetime, for example) that will add significant overhead (at least 8 bytes per every value). Storing data as string will be also not very efficient, and will add the complexity on the application side.
I would suggest to store data as fronzen list of integers/longs or doubles/floats, depending on your requirements. Something like:
create table ks.bert(
rowid int primary key,
data frozen<list<int>>
);
In this case, the whole list will be effectively serialized as binary blob, occupying just one cell.

Related

How do I most effectively compress my highly-unique columns?

I have a Spark DataFrame consisting of many double columns that are measurements, but I want a way of annotating each unique row by computing a hash of several other non-measurement columns. This hash results in garbled strings that are highly unique, and I've noticed my dataset size increases substantially when this column is present. How can I sort / lay out my data to decrease the overall dataset size?
I know that the Snappy compression protocol used on my parquet files executes best upon runs of similar data, so I think a sort over the primary key could be useful, but I also can't coalesce() the entire dataset into a single file (it's hundreds of GB in total size before the primary key creation step).
My hashing function is SHA2(128) FYI.
If you have a column that can be computed from the other columns, then simply omit that column before compression, and reconstruct it after decompression.

What is the cardinality of a partition key?

If I use a randomly generated unique Id , is it correct that
the cardinality would be rather large ?
If I have a key with a low cardinality like 5 category values that the partition key can take, and I want to distribute it, the recommended approach seems to be to make partition key into composite key.
But this requires that I have to specify all the parts of a composite key in my query to retrieve all records of that key.
Even then the generated token might end up being for the same node.
Is there any way to decide on a the additional column for composite key to that would guarantee that the data would be distributed ?
The thing is that with cassandra you actually want to have partitioning keys "known" so that you can access the data when you need it. I'm not sure what you mean when you say large cardinality on partitioning key. You would get a lot of partitions in the cluster. This is usually o.k.
If you want to distribute the data around the cluster. You can use artificial columns. And this approach is sometimes also called bucketing. Basically if you want to keep 100k+ or in never version 1 million+ columns it's o.k. to split this data into partitions.
Some people simply use a trick and when they insert the data they add some artificial bucket column to partition ... let's say random(1-10) and then when they are reading the data out they simply issue 10 queries or use an in operator and then fetch the data and merge it on the client side. This approach has many benefits in that it prevents appearance of "hot rows" in the cluster.
Chances for every key are more or less 1/NUM_NODES that it will end on the same node. So I would say most of the time this is not something you should worry about too much. Unless you have number of partitions that is smaller then the number of nodes in the cluster.
Basically there are two choices for additional column random (already described) or some function based on some input data i.e. when using time series data and you decide to bucket based on the month you can always calculate the month based on the data that you are going to insert and then you just put it in bucket. When you are retrieving the data then you know ... o.k. I'm looking something in May 2016 and then you know how to select the appropriate bucket.

Cassandra Partition and Cluster Keys to Store Inverted index

I need to use Cassandra to store an inverted index, in which words and their frequencies in articles are stored as follows:
word, article_title, frequency
Number of unique words is about 40M and number of Cassandra nodes = 2.
Which is better to use the first character of the word as Partition key or the word itself?
what about the Primary key?
TL;DR: With regards to your query, I would definitely say to use the word as the partition key.
If you would use only the first character, you will have only 26 partitions. You do not want that, if anything else, you will get hot-spotting. Some rows will be pretty short, since there are not a lot of words starting with a specific letter, and others will be very, very long, maybe even beyond the point it is performant to use. Yes, Cassandra has a two billion column per row limit, but the recommendation is to keep the size of the row in the millions. You also do not want to access all the words starting with 'A' if you want only 'AIRPORT'.
You need a high carnality, as random as possible, partition key so that the rows are easily dispersed throughout the cluster. On the other hand, it has to reflect your access patterns. In your case, you wan't to see stats for a word or set of words. Accessing by partition/primary is basically as fast as it gets with Cassandra.
As for the clustering key, it is more or less obvious, you can use the article title, OR, what I would do is actually use an article identifier (a UUID or such) as the cluster key. Article titles might change (typo?), and you certainly do not want to iterate through all your rows changing the title.

Sparse matrix using column store on MemSQL

I am new to column store db family and some of the concepts are not yet completely clear to me. I want to use MemSQL to store sparse matrix.
The table would look something like this:
CREATE TABLE matrix (
r_id INT,
c_id INT,
cell_data VARCHAR(10),
KEY (`r_id`, `c_id`) USING CLUSTERED COLUMNSTORE,
);
The Queries:
SELECT c_id, cell_data FROM matrix WHERE r_id=<val>; i.e. whole row
SELECT r_id, cell_data FROM matrix WHERE c_id=<val>; i.e. whole column
SELECT cell_data FROM matrix WHERE r_id=<val1> AND c_id=<val2>; i.e. one cell
UPDATE matrix SET cell_data=<val> WHERE r_id=<val1> AND c_id=<val2>;
INSERT INTO matrix VALUES (<v1>, <v2>, <v3>);
The queries 1 and 2 are about equally frequent and 3, 4 and 5 are also equally frequent. One of Q1,2 are equally frequent as one of Q3,4,5 (i.e. Q1,2:Q3,4,5 ~= 1:1).
I do realize that inserting into column store one row at a time creates Row segment group for each insert and thus degrading performance. I cannot batch the inserts. Also I cannot use in-memory row store (the matrix is too big).
I have three questions:
Does the issue with single row inserts concern updates too if only cell_data is changed (i.e. Q4)?
Would it be possible to have in-memory row table in which I would do INSERT (?and UPDATE?) operations and periodically batch the contents to column table?
How would I perform Q1,2 if I need most recent data (?UNION ALL?)?
Is it possible avoid executing Q3 for both tables (?which would mean two round trips?)?
I am concerned by execution speed of Q1 and Q2. Is the Clustered key optimal for those. I am not sure how the records would be stored with table above.
1.
Yes, single-row updates also perform poorly - they are essentially a delete and an insert.
2.
Yes, and in fact we automatically do this behind the scenes - the most recently inserted data (if it is too small a number of rows to be a good columnar segment) is kept in an in-memory rowstore form, and read queries are essentially looking at a UNION ALL of that data and the column-oriented data. We then batch up this data to write into column-oriented form.
If that doesn't work well enough, depending on your workload, you may benefit from explicitly keeping some of your data in a rowstore table instead of relying on the above behavior, in which case:
2a. yes, to see the most recent data you would use UNION ALL
2b. the data could be in either table, so you would have to query both (like for Q1,2, using UNION ALL works). This does not do two round trips, just one.
3.
You can either order by r or c first in the columnstore key - r in your current schema. This makes queries for a row efficient, but queries for a column are going to be very inefficient, they may have to scan basically the full table (depending on the patterns in your data). Unfortunately columnstore tables do not support using multiple keys, so there is no good way to solve this. One potential hacky solution is to maintain two copies of your table, one with key (r, c) and one with key (c, r) - this is essentially manually maintaining two indexes.
Based on the workload you're describing, it sounds like you are doing many single-row queries (Q3,4,5, which is 50% of the workload), which rowstore is much better suited for than columnstore (see http://docs.memsql.com/latest/concepts/columnstore/). Unfortunately, if it doesn't fit in memory, there isn't really a good way around this other than perhaps to add more memory.

Azure Table Storage Delete where Row Key is Between two Row Key Values

Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time that entities will be deleted so I'd rather skip the lookup.
I want to be able to delete things to keep my partitions from getting too big. As far as I know a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends. There will probably be less than 50 types. I need a way to show all the messages of each type that were sent (ex: show recent messages regardless of who sent it of type 0). This is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages though I don't want to let each partition grow indefinitely.
Unfortunately, you need to know precise Partition Keys and Row Keys in order to issue deletes. You do not need to retrieve entities from storage if you know precise RowKeys, but you do need to have them in order to issue batch delete. There is no magic "Delete from table where partitionkey = 10" command like there is in SQL.
However, consider breaking your data up into tables that represent archivable time units. For example in AzureWatch we store all of the metric data into tables that represent one month of data. IE: Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if your keep your time ranges to minimum quantity, amount of unions will not be as big. Basically, this approach allows you to utilize table name as another partitioning opportunity.

Resources