Too many columns in Cassandra

I have 20 columns in a table in Cassandra. Will there be a performance impact when performing
select * from table where partitionKey = 'test';
I am not able to tell from this link:
https://wiki.apache.org/cassandra/CassandraLimitations
What will be the consequence of having too many columns (say 20) in a Cassandra table?

Unless you have a lot of rows in the partition, I don't see an impact from having 20 columns. As stated in the documentation you linked:
The maximum number of cells (rows x columns) in a single partition is 2 billion.
So, unless you expect more than 100 million rows in a single partition (2 billion cells ÷ 20 columns), I don't see why 20 columns would be an issue. Keep in mind that Cassandra is a column family store, which means it is designed to store a large number of columns per partition.
Having said that, I would personally recommend not going over 100 MB per partition; larger partitions can cause problems later on with streaming during repairs.
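If you want to keep an eye on partition sizes as the data grows, one simple check (the keyspace and table names here are placeholders) is:
nodetool tablehistograms my_keyspace my_table
The output includes partition-size percentiles, which makes it easy to spot partitions drifting past that 100 MB mark.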
===============================
To answer your comment: keep in mind that partitions and rows are two different things in Cassandra. A partition is only equal to a row if there are no clustering columns. For instance, take a look at this table creation and the values we insert, and then look at the sstabledump output:
CREATE TABLE tt2 (foo int, bar int, mar int, PRIMARY KEY (foo, bar));
INSERT INTO tt2 (foo, bar, mar) VALUES (1, 2, 3);
INSERT INTO tt2 (foo, bar, mar) VALUES (1, 3, 4);
sstabledump:
./cassandra/tools/bin/sstabledump ~/cassandra/data/data/tk/tt2-1386f69005bd11e89c0bbfb5c1157523/mc-1-big-Data.db
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 32,
        "clustering" : [ "2" ],
        "liveness_info" : { "tstamp" : "2018-01-30T12:57:36.362483Z" },
        "cells" : [
          { "name" : "mar", "value" : 3 }
        ]
      },
      {
        "type" : "row",
        "position" : 32,
        "clustering" : [ "3" ],
        "liveness_info" : { "tstamp" : "2018-01-30T12:58:03.538482Z" },
        "cells" : [
          { "name" : "mar", "value" : 4 }
        ]
      }
    ]
  }
]
Also, using the -d option may make the internal representation easier to read. As you can see, for the same partition we have two distinct rows:
./cassandra/tools/bin/sstabledump -d ~/cassandra/data/data/tk/tt2-1386f69005bd11e89c0bbfb5c1157523/mc-1-big-Data.db
[1]#0 Row[info=[ts=1517317056362483] ]: 2 | [mar=3 ts=1517317056362483]
[1]#32 Row[info=[ts=1517317083538482] ]: 3 | [mar=4 ts=1517317083538482]

Related

CosmosDB - list in aggregate query response

I have the following document structure:
{
  "id": "1",
  "aId": "2",
  "bId": "3",
  ...
},
{
  "id": "2",
  "aId": "2",
  "bId": "4"
}
How do I return a JSON response that, for a given aId, contains the list of all bIds having that aId, plus an additional field with the count of those bIds? So for the example above, with the condition WHERE aId = "2", the response would be:
{
  "aId": "2",
  "bIds" : ["4","3"],
  "bIds count" : 2
}
Assume I only pass one aId as a parameter.
I tried something like:
select
(select 'something') as aId,
(select distinct value c.bId from c where c.aId='something') as bIds
from TableName c
But for the love of me I can't figure out how to get that list + its count + the hardcoded aId in a single JSON response (a single row).
For example this query:
select
(select distinct value 'someId') as aId,
(select distinct value c.bId) as bIds
from c where c.aId='someId'
will return
{ { 'aId': 'someId', 'bIds':'2'},{'aId':'someId','bIds':'4'}}
while what I actually want is
{ {'aId': 'someId', 'bIds': ['2','4']} }
Here is the query that is closest to what I want:
select
c.aId as aId,
count(c2) as bIdCount,
array(select distinct value c2.bId from c2)
from c join (select c.bId from c) as c2
where c.aId = 'SOME_ID'
The only thing is that the line with array makes this query fail; if I delete that line it works (correctly returns the id and count in one row). But I also need to select the contents of this list, and I am lost as to why it's not working; the example is almost copy-pasted from "How to perform array projection Cosmos Db":
https://azurelessons.com/array-in-cosmos-db/#How_to_perform_array_projection_Azure_Cosmos_DB
Here is how you'd return an array of bId:
SELECT distinct value c.bId
FROM c
where c.aId = "2"
This yields:
[
  "3",
  "4"
]
Removing the value keyword:
SELECT distinct c.bId
FROM c
where c.aId = "2"
yields:
[
  { "bId" : "3" },
  { "bId" : "4" }
]
From either of these, you can count the number of array elements returned. If your payload must include count and aId, you'll need to add those to your JSON output.
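For example, assembling the exact payload on the client side could look roughly like this (a sketch using the Python SDK, azure-cosmos; the endpoint, key, database and container names are placeholders):
from azure.cosmos import CosmosClient

# Endpoint, key, database and container names below are placeholders.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<db>").get_container_client("<container>")

a_id = "2"
# Run the DISTINCT VALUE query from above, parameterized on aId.
b_ids = list(container.query_items(
    query="SELECT DISTINCT VALUE c.bId FROM c WHERE c.aId = @aId",
    parameters=[{"name": "@aId", "value": a_id}],
    enable_cross_partition_query=True,
))

# Assemble the final payload, including the count, on the client.
result = {"aId": a_id, "bIds": b_ids, "bIds count": len(b_ids)}
# e.g. {'aId': '2', 'bIds': ['3', '4'], 'bIds count': 2}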

What happens when adding a field in UDT in Cassandra?

For example, suppose I have a basic_info type:
CREATE TYPE basic_info (first_name text, last_name text, nationality text)
And table like this:
CREATE TABLE student_stats (id int PRIMARY KEY, grade text, basics FROZEN<basic_info>)
And I have millions of records in the table.
If I add a field to basic_info like this:
ALTER TYPE basic_info ADD address text;
I want to ask what happens inside Cassandra when you add a new field to a UDT that is already used as a column in a table. The reason for this question is that I'm afraid of side effects when the table contains a lot of data (millions of records). It would be best if you could explain what happens from start to end.
The fields of a UDT are described in the table system_schema.types. When you add a new field, the entry for that type is updated inside Cassandra, but no data on disk changes (SSTables are immutable). Instead, when Cassandra reads data, it checks whether the field is present, and if not (because it wasn't set, or because it's a new UDT field), it returns null for that value without modifying the data on disk.
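You can see the field list Cassandra keeps for a type with a query along these lines (shown for the test.udt type used in the example below):
SELECT field_names, field_types
FROM system_schema.types
WHERE keyspace_name = 'test' AND type_name = 'udt';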
For example, if I have the following type and a table that uses it:
CREATE TYPE test.udt (
id int,
t1 int
);
CREATE TABLE test.u2 (
id int PRIMARY KEY,
u udt
)
And I have some data in the table, so I get:
cqlsh> select * from test.u2;
id | u
----+----------------
5 | {id: 1, t1: 3}
If I add a field to the UDT with alter type test.udt add t2 int;, I immediately see null as the value of the new UDT field:
cqlsh> select * from test.u2;
id | u
----+--------------------------
5 | {id: 1, t1: 3, t2: null}
And if I do sstabledump on the SSTable, I can see that it contains only old data:
[
  {
    "partition" : {
      "key" : [ "5" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 46,
        "liveness_info" : { "tstamp" : "2019-07-28T09:33:12.019Z" },
        "cells" : [
          { "name" : "u", "path" : [ "id" ], "value" : 1 },
          { "name" : "u", "path" : [ "t1" ], "value" : 3 }
        ]
      }
    ]
  }
]
See also my answer about adding/removing columns

Redis: How to store a hashset with 5 fields efficiently

I am working with Redis and want to store more than 1 million keys, each as a hash with 5 fields.
At first it looked like this:
key : {
  a : 1,
  b : 2,
  c : 3,
  d : 4,
  e : 5
}
1M keys used about 130 MB of memory. Then I found a post from the Instagram developers: http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value
So I adapted my approach to:
key1 : {
  1a : 1,
  1b : 2,
  1c : 3,
  1d : 4,
  1e : 5,
  2a : 1,
  ...
}
key2 : {
  1a : 1,
  1b : 2,
  1c : 3,
  1d : 4,
  1e : 5,
  2a : 1,
  ...
}
...
This used more than 200 MB for just 500k entries. So is there a better structure within Redis for my data?
//EDIT
OK, so I have an object named 'user' that I want to save in Redis with 5 fields ('name', 'email', 'age', 'gender', 'active'). At first I used the hmset command to create an entry in Redis via the Node.js 'redis' package.
userId : {
  name : Emil Example,
  email : email#stackoverflow.com,
  age : 20,
  gender : m,
  active : true
}
So I created 1M entries of dummy data and checked the memory usage via the 'INFO' command: about ~130 MB. Then I read the Instagram blog post and their use of buckets, so I implemented it with my data (userID: '12345'):
12 : {
  345name : Emil Example,
  345email : email#stackoverflow.com,
  345age : 20,
  345gender : m,
  345active : true
}
Of course every userID beginning with '12' falls into this bucket. After a run with 500k dummy entries I had a memory usage of more than 500 MB. So I'm asking myself whether I did something wrong, or whether there are other optimization options for my case.
P.S.: Of course I adjusted the 'hash-zipmap-max-entries' setting in the config file and started Redis with it.
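For reference, the two layouts described above correspond to commands roughly like these (redis-cli syntax, with values taken from the example data; the key names are just illustrative).
One hash per user:
HMSET user:12345 name "Emil Example" email "email#stackoverflow.com" age 20 gender m active 1
Bucketed, with userID 12345 split into bucket "12" and field prefix "345":
HMSET user:12 345name "Emil Example" 345email "email#stackoverflow.com" 345age 20 345gender m 345active 1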

Why is sorting in arangodb slow?

I am experimenting to see whether ArangoDB might be suitable for our use case.
We will have large collections of documents with the same schema (like an SQL table).
To try some queries I have inserted about 90K documents, which is low, as we expect document counts on the order of 1 million or more.
Now I want to get a simple page of these documents, without filtering, but sorted in descending order.
So my AQL is:
for a in test_collection
sort a.ARTICLE_INTERNALNR desc
limit 0,10
return {'nr': a.ARTICLE_INTERNALNR}
When I run this in the AQL Editor, it takes about 7 seconds, while I would expect a couple of milliseconds or something like that.
I have tried creating a hash index and a skiplist index on it, but that didn't have any effect:
db.test_collection.getIndexes()
[
  {
    "id" : "test_collection/0",
    "type" : "primary",
    "unique" : true,
    "fields" : [
      "_id"
    ]
  },
  {
    "id" : "test_collection/19812564965",
    "type" : "hash",
    "unique" : true,
    "fields" : [
      "ARTICLE_INTERNALNR"
    ]
  },
  {
    "id" : "test_collection/19826720741",
    "type" : "skiplist",
    "unique" : false,
    "fields" : [
      "ARTICLE_INTERNALNR"
    ]
  }
]
So, am I missing something, or is ArangoDB not suitable for these cases?
If ArangoDB needs to sort all the documents, this will be a relatively slow operation (compared to not sorting), so the goal is to avoid the sorting altogether.
ArangoDB has a skiplist index, which keeps indexed values in sorted order, and if it can be used in a query, it will speed the query up.
There are a few gotchas at the moment:
AQL queries without a FILTER condition won't use an index.
The skiplist index is fine for forward-order traversals, but it has no backward-order traversal facility.
Both these issues seem to have affected you.
We hope to fix both issues as soon as possible.
At the moment there is a workaround to enforce using the index in forward order, using an AQL query as follows:
FOR a IN SKIPLIST(test_collection, { ARTICLE_INTERNALNR: [ [ '>', 0 ] ] }, 0, 10)
  RETURN { nr: a.ARTICLE_INTERNALNR }
The above picks up the first 10 documents via the index on ARTICLE_INTERNALNR with a condition "value > 0". I am not sure if there is a solution for sorting backwards with limit.

Cassandra Hector Client: Is a RangeSlicesQuery on Composite Row Keys possible when using Random Partitioning?

Is there any way to range query rows with a composite row key when using random partitioning?
I'm working with column families created via CQL v3 like this:
CREATE TABLE products (
  rowkey CompositeType(UTF8Type,UTF8Type,UTF8Type,UTF8Type) PRIMARY KEY,
  prod_id varchar,
  class_id varchar,
  date varchar
);
The data in the table looks like this:
RowKey: 6:3:2:19
=> (column=class_id, value=254, timestamp=1346800102625002)
=> (column=date, value=2034, timestamp=1346800102625000)
=> (column=prod_id, value=1922, timestamp=1346800102625001)
-------------------
RowKey: 0:14:1:16
=> (column=class_id, value=144, timestamp=1346797896819002)
=> (column=date, value=234, timestamp=1346797896819000)
=> (column=prod_id, value=4322, timestamp=1346797896819001)
-------------------
I'm trying to find a way to range query over these composite row keys, analogous to how we slice query over composite columns. The following approach sometimes actually succeeds in returning something useful, depending on the start and stop keys I choose.
Composite startKey = new Composite();
startKey.addComponent(0, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(1, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(2, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(3, "3", Composite.ComponentEquality.EQUAL);
Composite stopKey = new Composite();
stopKey.addComponent(0, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(1, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(2, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(3, "6" , Composite.ComponentEquality.GREATER_THAN_EQUAL);
RangeSlicesQuery<Composite, String, String> rangeSlicesQuery =
HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily(columnFamilyName);
rangeSlicesQuery.setKeys(startKey,stopKey);
rangeSlicesQuery.setRange("", "", false, 3);
Most of the time the database returns this:
InvalidRequestException(why:start key's md5 sorts after end key's md5.
this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
Does somebody have an idea if something like this can be achieved WITHOUT using the order preserving partitioner? Do I have to build a custom row key index for this use case?
Thanks a lot!
Additional information:
What I'm trying to do is store sales transaction data in a table that uses composite row keys to encode date/time/place and composite columns to store information about the sold items.
The set of items per transaction varies in size and includes information about size, color and quantity of every item:
{ ... items :
[ { item_id : 43523 , size : 050 , color : 123 , qty : 1 } ,
{ item_id : 64233 , size : 048 , color : 834 , qty : 1 } ,
{ item_id : 23984 , size : 000 , color : 341 , qty : 3 } ,
… ] }
There’s also information about where and when the transaction happened including a unique transaction id:
{ trx_id : 23324827346, store_id : 8934 , date : 20110303 , time : 0947 , …
My initial approach was to put every item in a separate row and let the application group items back together by transaction id. That works fine. But now I'm trying to leverage the structuring capabilities of composite columns to persist the nested item data in a per-item representation like this:
item_id:’size’ = <value> ; item_id:’color’ = <value> ; item_id:’qty’ = <value> ; …
43523:size = 050 ; 43523:color = 123 ; 43523:qty = 1 ; …
The rest of the data would be encoded in a composite row key like this:
date : time : store_id : trx_id
20110303 : 0947 : 001 : 23324827346
I need to be able to run queries like: all items sold between the dates 20110301 and 20110310, between times 1200 and 1400, in stores 25-50. What I have achieved so far with composite columns is using one wide row per store and putting the rest of the data into 3 different composite columns per item:
date:time:<type>:prod_id:transaction_id = <value> ; …
20110303:0947:size:43523:23324827346 = 050 ;
20110303:0947:color:43523:23324827346 = 123 ;
20110303:0947:qty:43523:23324827346 = 1 ;
It’s working, but it doesn’t really look highly efficient.
Is there any other alternative?
You're creating one row per partition, so it should be clear that RandomPartitioner will not give you ordered range queries.
You can do ordered ranges within a partition, which is very common, e.g. http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
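To make the "ordered ranges within a partition" idea concrete, here is a rough CQL 3 sketch with names adapted from the question (the schema is illustrative, not a drop-in replacement): the store becomes the partition key, and date/time/transaction/item become clustering columns, so rows within a store's partition are stored and returned in that order.
CREATE TABLE sales_by_store (
  store_id varchar,
  date varchar,
  time varchar,
  trx_id varchar,
  item_id varchar,
  size varchar,
  color varchar,
  qty int,
  PRIMARY KEY (store_id, date, time, trx_id, item_id)
);

-- ordered range scan within one store's partition
SELECT * FROM sales_by_store
WHERE store_id = '25'
  AND date >= '20110301' AND date <= '20110310';
Queries spanning several stores or an additional time window still have to fan out over multiple partitions (or use an additional table per query pattern), which is essentially what the linked time-series posts describe.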
