"Capped collections" in Cassandra - cassandra

Cassandra doesn't have capped collections (or row size limits), but one way of simulating them is to use an offline MapReduce job to clean up extra entries. Would it be better to have a second table that stores row counts for the primary keys of the main table? The downside is that you have to scan through the entire row_count table, since counters aren't indexable. Or would it be faster to just scan over the backing table with the real data?
Or is there another technique I should look into?
Edit: I found this: Columns count vs counter column performance. Row counts go over all the data, so I'm leaning away from that.

Counter table in cassandra

What's the point of having no non-key columns in a Cassandra counter table?
I have a table with some key and non-key columns, but I cannot add a counter column to it, although I want the rows to be sorted based on some counter (hits).
If I create a separate table for the counter, how do I relate the two tables for sorting?
Thanks in advance
Counters are a very different type of cell in Cassandra internals. Everything about them is different from most other Cassandra types. They require special care, and it just isn't worth the complexity to be able to mix them in with other cells.
You can use the same primary key structure in two tables, one with counters and one with other cells/columns. You just can't have the other cells/columns in the counter table.
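A minimal sketch of that layout (the table and column names here are just illustrative): a regular table and a counter table sharing the same primary key, with the counter kept on its own.

CREATE TABLE items (
    item_id text PRIMARY KEY,
    category text,
    title text
);

CREATE TABLE item_hits (
    item_id text PRIMARY KEY,
    hits counter
);

-- counters cannot be inserted, only updated
UPDATE item_hits SET hits = hits + 1 WHERE item_id = 'abc123';

You then read the counter with the same key you use for the data table.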

Sparse matrix using column store on MemSQL

I am new to the column-store DB family and some of the concepts are not yet completely clear to me. I want to use MemSQL to store a sparse matrix.
The table would look something like this:
CREATE TABLE matrix (
    r_id INT,
    c_id INT,
    cell_data VARCHAR(10),
    KEY (`r_id`, `c_id`) USING CLUSTERED COLUMNSTORE
);
The Queries:
Q1: SELECT c_id, cell_data FROM matrix WHERE r_id=<val>;                  -- i.e. a whole row
Q2: SELECT r_id, cell_data FROM matrix WHERE c_id=<val>;                  -- i.e. a whole column
Q3: SELECT cell_data FROM matrix WHERE r_id=<val1> AND c_id=<val2>;       -- i.e. one cell
Q4: UPDATE matrix SET cell_data=<val> WHERE r_id=<val1> AND c_id=<val2>;
Q5: INSERT INTO matrix VALUES (<v1>, <v2>, <v3>);
Queries 1 and 2 are about equally frequent, and queries 3, 4 and 5 are also about equally frequent. Taken together, Q1,2 occur about as often as Q3,4,5 (i.e. Q1,2 : Q3,4,5 ~= 1:1).
I do realize that inserting into a columnstore one row at a time creates a row segment group for each insert, degrading performance. I cannot batch the inserts. I also cannot use the in-memory rowstore (the matrix is too big).
I have three questions:
1. Does the issue with single-row inserts affect updates too, if only cell_data is changed (i.e. Q4)?
2. Would it be possible to have an in-memory row table in which I would do the INSERT (and UPDATE?) operations and periodically batch the contents into the column table?
2a. How would I perform Q1,2 if I need the most recent data (UNION ALL?)?
2b. Is it possible to avoid executing Q3 against both tables (which would mean two round trips)?
3. I am concerned by the execution speed of Q1 and Q2. Is the clustered key optimal for those? I am not sure how the records would be stored with the table above.
1. Yes, single-row updates also perform poorly - they are essentially a delete and an insert.
2. Yes, and in fact we automatically do this behind the scenes - the most recently inserted data (if it is too small a number of rows to be a good columnar segment) is kept in an in-memory rowstore form, and read queries are essentially looking at a UNION ALL of that data and the column-oriented data. We then batch up this data to write into column-oriented form.
If that doesn't work well enough, depending on your workload, you may benefit from explicitly keeping some of your data in a rowstore table instead of relying on the above behavior, in which case:
2a. Yes, to see the most recent data you would use UNION ALL.
2b. The data could be in either table, so you would have to query both (as with Q1,2, UNION ALL works). This does not require two round trips, just one.
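As a rough illustration of 2a/2b, assuming a hypothetical rowstore staging table called matrix_recent next to the columnstore table matrix:

-- Q1 (a whole row), reading both the columnstore and the staging rowstore
SELECT c_id, cell_data FROM matrix        WHERE r_id = 42
UNION ALL
SELECT c_id, cell_data FROM matrix_recent WHERE r_id = 42;

-- periodic flush: move the staged rows into the columnstore in one batch
INSERT INTO matrix SELECT r_id, c_id, cell_data FROM matrix_recent;
DELETE FROM matrix_recent;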
3. You can order by either r or c first in the columnstore key - r in your current schema. This makes queries for a row efficient, but queries for a column are going to be very inefficient; they may have to scan basically the full table (depending on the patterns in your data). Unfortunately columnstore tables do not support multiple keys, so there is no good way to solve this. One potential hacky solution is to maintain two copies of your table, one with key (r, c) and one with key (c, r) - this is essentially manually maintaining two indexes.
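A sketch of that two-copy workaround (the table names are made up here):

CREATE TABLE matrix_by_row (
    r_id INT,
    c_id INT,
    cell_data VARCHAR(10),
    KEY (r_id, c_id) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE matrix_by_col (
    r_id INT,
    c_id INT,
    cell_data VARCHAR(10),
    KEY (c_id, r_id) USING CLUSTERED COLUMNSTORE
);

-- Q1 (whole row) is served from matrix_by_row, Q2 (whole column) from matrix_by_col;
-- every insert, update and delete has to be applied to both tables.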
Based on the workload you're describing, it sounds like you are doing many single-row queries (Q3,4,5, which is 50% of the workload), which rowstore is much better suited for than columnstore (see http://docs.memsql.com/latest/concepts/columnstore/). Unfortunately, if it doesn't fit in memory, there isn't really a good way around this other than perhaps to add more memory.

There's no better way to Count Keys In Cassandra?

I have a log table in Cassandra, and now I want to get the row count of the table.
First, I used select count(*) from log, but it's very, very slow.
Then I wanted to use the counter type, but that is where the problem starts: my table has a TTL, all rows are kept for only an hour, and that makes using a counter very difficult.
Cassandra isn't efficient for doing table scan operations. It is good at ingesting high volumes of data and then accessing small slices of that data rather than the whole table.
So if you want to count keys without using a counter, you need to break the table into chunks of data that are small enough to be processed quickly. For example if you want to use count(*), you should only use it on a single partition, and keep the partition size below about 100,000 rows.
In your case you might want to partition your data by hour (or something small like 5 minute intervals if you insert a lot of log lines per second).
Be careful with using a TTL of an hour if you are inserting a lot of data continuously since it could cause a lot of tombstones. To avoid building up tombstones you should delete each hour partition after the hour has passed.
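A minimal sketch of that layout, with hypothetical column names: one partition per hour, so count(*) only touches a single partition, and old partitions are deleted explicitly rather than left to TTL.

CREATE TABLE log (
    hour_bucket text,            -- e.g. '2015-06-01 13:00'
    logged_at timeuuid,
    message text,
    PRIMARY KEY (hour_bucket, logged_at)
);

-- count only the rows in one hour's partition
SELECT count(*) FROM log WHERE hour_bucket = '2015-06-01 13:00';

-- once the hour has passed, drop the whole partition in one delete
DELETE FROM log WHERE hour_bucket = '2015-06-01 12:00';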

Cassandra Performance : Less rows with more columns vs more rows with less columns

We are evaluating whether we can migrate from SQL Server to Cassandra for OLAP. Given the internal storage structure, we can have wide rows. We almost always need to access data by date, and we often need to access data within a date range since we have financial data. If we use the date as the partition key to support filtering by date, we end up with fewer rows, each with a huge number of columns.
Will it hamper performance if we have millions of columns for a single row key in the future, as we process millions of transactions every day?
Do we need to change the access pattern to have more rows with fewer columns per row?
We need some performance insight to proceed in either direction.
Using wide rows is typically fine with Cassandra, there are however a few things to consider:
Ensure that you don't reach the 2 billion column limit in any case
The whole wide row is stored on the same node: it needs to fit on the disk. Also, if you have some dates that are accessed more frequently then other dates (e.g. today) then you can create hotspots on the node that stores the data for that day.
Very wide rows can affect performance however: Aaron Morton from The Last Pickle has an interesting article about this: http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
It is somewhat old, but I believe that the concepts are still valid.
For a good table design decision one needs to know all typical filter conditions. If you have any other fields you typically filter for as an exact match, you could add them to the partition key as well.
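For example, a rough sketch with made-up column names: if the queries always filter on the date plus an exact-match field such as an account id, both can go into the partition key, which also keeps each partition smaller than a single per-day partition would be.

CREATE TABLE transactions (
    trade_date text,
    account_id text,
    txn_time timestamp,
    amount decimal,
    PRIMARY KEY ((trade_date, account_id), txn_time)
);

-- exact match on the partition key, range scan on the clustering column
SELECT * FROM transactions
 WHERE trade_date = '2016-03-01'
   AND account_id = 'A-1001'
   AND txn_time >= '2016-03-01 09:00:00'
   AND txn_time <  '2016-03-01 17:00:00';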

How to optimize a table containing 1 billion rows, fixed row format using myisam engine in mysql?

I have a table containing 1 billion rows, fixed row format, using the MyISAM engine in MySQL. I am thinking of sharding the table, but that development takes time. Are there any temporary solutions for improving performance?
You can take a look at MySQL partitioning: http://dev.mysql.com/doc/refman/5.1/en/partitioning-overview.html
It allows you to distribute portions of individual tables across a file system, transparently to your queries.
As per your comment, if the insert/select ratio really is 100:1, then I don't see any reason to have indexes (apart from the primary key index, if any) on the table. They will only further slow down your inserts.
Also, if you can queue inserts to this table, then you can try creating an in-memory table (http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html), direct all the inserts to it (which will be faster), and then do a bulk insert/periodic flush into your MyISAM-based table.
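A rough sketch of that staging approach (big_table and the column names are hypothetical):

CREATE TABLE log_staging (
    id BIGINT NOT NULL,
    col1 INT,
    col2 VARCHAR(64),
    col3 DATETIME,
    PRIMARY KEY (id)
) ENGINE=MEMORY;

-- cheap single-row inserts go to the MEMORY table
INSERT INTO log_staging VALUES (1, 10, 'abc', NOW());

-- periodic flush into the big MyISAM table, then empty the staging table
INSERT INTO big_table SELECT * FROM log_staging;
DELETE FROM log_staging;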
You can also partition the table on a specific column out of the four you have (if there is a good candidate), or go for hash-based partitioning (if you don't find one). I am not sure why you are saying sharding is going to take dev time; you can partition an existing non-partitioned table too. http://forums.mysql.com/read.php?106,264106,264110
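For instance, a hedged example of partitioning the existing table in place, assuming id is the primary key column:

-- hash partitioning on the primary key; note that MySQL requires the
-- partitioning column to be part of every unique key on the table
ALTER TABLE big_table PARTITION BY HASH(id) PARTITIONS 16;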
