Apache Cassandra data modeling

I need some help with a data model to store smart meter data; I'm pretty new to working with Cassandra.
The data that has to be stored:
This is an example of one smart meter gateway:
{"logical_name": "smgw_123",
"ldevs":
[{"logical_name": "sm_1", "objects": [{"capture_time": 390600, "unit": 30, "scaler": -3, "status": "000", "value": 152.361925}]},
{"logical_name": "sm_2", "objects": [{"capture_time": 390601, "unit": 33, "scaler": -3, "status": "000", "value": 0.3208547253907171}]},
{"logical_name": "sm_3", "objects": [{"capture_time": 390602, "unit": 36, "scaler": -3, "status": "000", "value": 162.636025}]}]
}
So this is one smart meter gateway with the logical_name "smgw_123",
and the ldevs array describes the 3 smart meters behind it with their values.
So the smart meter gateway has a relation to the 3 smart meters, and the smart meters in turn have their own data.
Questions
I don't know how I can store this data, with its relations, in a NoSQL database (in my case Cassandra).
Do I then have to use 2 tables? Like smart_meter_gateway (logical_name, smart_meter_1, smart_meter_2, smart_meter_3)
and another smart_meter (logical_name, capture_time, unit, scaler, status, value)?
Another problem is that each smart meter gateway can have a different number of smart meters.
I hope I described my problem understandably.
Thanks

In Cassandra data modelling, the first thing you should do is determine your queries. You will model the partition keys and clustering columns of your tables according to those queries.
In your example, I assume you will query your smart meter gateways by their logical names. I mean, your queries will look like:
select <some_columns>
from smart_meter_gateway
where smg_logical_name = <a_smg_logical_name>;
I also assume each smart meter gateway's logical name is unique and each smart meter in the ldevs array has a unique logical name.
If this is the case, you should create a table with a partition key column of smg_logical_name and a clustering column of sm_logical_name. By doing this, you create a table where each smart meter gateway partition contains a number of smart meter rows:
create table smart_meter_gateway
(
    smg_logical_name text,  -- partition key: one partition per gateway
    sm_logical_name text,   -- clustering column: one row per smart meter
    capture_time int,
    unit int,
    scaler int,
    status text,
    value decimal,
    primary key ((smg_logical_name), sm_logical_name)
);
And you can insert into this table using the following statements:
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_1', 390600, 30, -3, '000', 152.361925);
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_2', 390601, 33, -3, '000', 0.3208547253907171);
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_3', 390602, 36, -3, '000', 162.636025);
And when you query the smart_meter_gateway table by smg_logical_name, you get 3 rows in the result set:
select * from smart_meter_gateway where smg_logical_name = 'smgw_123';
The result of this query is:
 smg_logical_name | sm_logical_name | capture_time | scaler | status | unit | value
------------------+-----------------+--------------+--------+--------+------+--------------------
         smgw_123 |            sm_1 |       390600 |     -3 |    000 |   30 |         152.361925
         smgw_123 |            sm_2 |       390601 |     -3 |    000 |   33 | 0.3208547253907171
         smgw_123 |            sm_3 |       390602 |     -3 |    000 |   36 |         162.636025
You can also add sm_logical_name as a filter to your query:
select *
from smart_meter_gateway
where smg_logical_name = 'smgw_123' and sm_logical_name = 'sm_1';
This time you will get only 1 row in the result set:
 smg_logical_name | sm_logical_name | capture_time | scaler | status | unit | value
------------------+-----------------+--------------+--------+--------+------+------------
         smgw_123 |            sm_1 |       390600 |     -3 |    000 |   30 | 152.361925
Note that there are other ways you could model this data. For example, you could use collection columns for the ldevs array; that approach has its own advantages and disadvantages. As I said at the beginning, it depends on your query needs.
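For illustration, here is a minimal sketch of the collection approach, using a user-defined type per reading and a frozen map keyed by smart meter name (the sm_reading type and the table name are hypothetical):
create type sm_reading (
    capture_time int,
    unit int,
    scaler int,
    status text,
    value decimal
);
create table smart_meter_gateway_by_map
(
    smg_logical_name text primary key,
    ldevs map<text, frozen<sm_reading>>  -- sm_logical_name -> reading
);
This keeps a whole gateway in a single row, which is convenient when you always read the gateway as a unit, but filtering or projecting a single smart meter is more awkward than with the clustering-column design above.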

Related

Why do Azure Cosmos queries have higher RUs when specifying the partition key?

I have a question similar to this one. Basically, I have been testing different ways to use partition keys, and I have noticed that the more a partition key is referenced in a query, the higher the RUs. It is quite consistent, and it doesn't even matter how the partition key is used. So I narrowed it down to the basic queries below.
To start, this database has about 850K documents, all more than 1KB in size. The partition key is basically the id (in number form) modulo 100, it is set to /partitionKey, and the container uses the default indexing policy:
{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        {
            "path": "/*"
        }
    ],
    "excludedPaths": [
        {
            "path": "/\"_etag\"/?"
        }
    ]
}
Here is my basic query test:
SELECT c.id, c.partitionKey
FROM c
WHERE c.partitionKey = 99 AND c.id = '99999'
-- Yields One Document; Actual Request Charge: 2.95 RUs
SELECT c.id, c.partitionKey
FROM c
WHERE c.id = '99999'
-- Yields One Document; Actual Request Charge: 2.85 RUs
Azure Cosmos documentation says that without the partition key, the query will "fan out" to all logical partitions. Therefore I would fully expect the first query to target a single partition and the second to target all of them, meaning the first one should have the lower RU charge. I suppose I am using the RU results as evidence of whether or not Cosmos is fanning out and scanning each partition, and comparing that to what the documentation says should happen.
I know these results are just 0.1 RUs apart. But my point is that the more complex the query, the bigger the difference. For example, here is another query that is ever so slightly more complex:
SELECT c.id, c.partitionKey
FROM c
WHERE (c.partitionKey = 98 OR c.partitionKey = 99) AND c.id = '99999'
-- Yields One Document; Actual Request Charge: 3.05 RUs
Notice the RU charge continues to grow and separate from not having specified a partition key at all. Instead I would expect the above query to target only two partitions, compared to no partition key check, which supposedly fans out to all partitions.
I am starting to suspect the partition key check happens after the other filters (or inside each partition scan). For example, going back to the first query but changing the id to something that does not exist:
SELECT c.id, c.partitionKey
FROM c
WHERE c.partitionKey = 99 AND c.id = '99999x'
-- Yields Zero Documents; Actual Request Charge: 2.79 RUs
SELECT c.id, c.partitionKey
FROM c
WHERE c.id = '99999x'
-- Yields Zero Documents; Actual Request Charge: 2.79 RUs
Notice the RU charges are exactly the same, and both queries (including the one with the partition filter) cost less than when a document exists. This looks like a symptom of the partition filter being applied to the results rather than restricting a fan-out. But this is not what the documentation says.
Why does Cosmos have higher RUs when a partition key is specified?
As the comment specifies, if you are testing via the portal (or via code, but with the query you provided), it will be more expensive, because you are not querying a specific partition but rather querying everything and then applying another filter, which costs more.
What you should do instead is use the proper way in your code to pass in the partition key. My results were quite impressive: 3 RUs with the PK and 20,000 RUs without the PK, so I'm quite confident it works (I had a really large dataset).
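For illustration, here is a minimal sketch of that with the Azure Cosmos DB Java SDK v4 (the endpoint, key, database, and container names are assumptions, not from the question):
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosQueryRequestOptions;
import com.azure.cosmos.models.PartitionKey;
import com.azure.cosmos.util.CosmosPagedIterable;
import com.fasterxml.jackson.databind.JsonNode;

public class PartitionScopedQuery {
    public static void main(String[] args) {
        CosmosClient client = new CosmosClientBuilder()
                .endpoint(System.getenv("COSMOS_ENDPOINT"))  // assumed env vars
                .key(System.getenv("COSMOS_KEY"))
                .buildClient();
        CosmosContainer container = client
                .getDatabase("mydb")            // hypothetical names
                .getContainer("mycontainer");

        // Scope the query to one logical partition via the request options,
        // instead of filtering on c.partitionKey inside the SQL text.
        CosmosQueryRequestOptions options = new CosmosQueryRequestOptions();
        options.setPartitionKey(new PartitionKey(99));
        CosmosPagedIterable<JsonNode> results = container.queryItems(
                "SELECT c.id, c.partitionKey FROM c WHERE c.id = '99999'",
                options, JsonNode.class);
        results.forEach(System.out::println);

        // With both id and partition key known, a point read is cheaper
        // than any query (roughly 1 RU per KB of document).
        JsonNode item = container
                .readItem("99999", new PartitionKey(99), JsonNode.class)
                .getItem();
        System.out.println(item);

        client.close();
    }
}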

Create Computed Column from JSONB column (containing JSON arrays) in CockroachDB

Is there a way we can split a JSON column containing an array into computed columns?
The credit_cards column is of jsonb type, with the sample data below:
[{"bank": "HDFC Bank", "cvv": "8253", "expiry": "2020-05-31T14:22:34.61426Z", "name": "18a81ea99250bf236a5e27a762a32d62", "number": "c4ca4238acf96a36"}, {"bank": "HDFC Bank", "cvv": "9214", "expiry": "2020-05-30T21:44:55.173339Z", "name": "6725a156df733ec2dd33b94f06ee2e06", "number": "c81e728dacf96a36"}, {"bank": "HDFC Bank", "cvv": "1161", "expiry": "2020-05-31T07:59:28.458905Z", "name": "eb102765424d07d8b713211c14e837b4", "number": "eccbc87eacf96a36"}]
I tried this, but it looks like it's not supported in a computed column:
alter table users_combined add column projected_number STRING[] AS (json_array_elements(credit_cards)->>'number') STORED;
ERROR: json_array_elements(): generator functions are not allowed in computed column
Another alternative which worked was:
alter table users_combined add column projected_number STRING[] AS (ARRAY[credit_cards->0->>'number',credit_cards->1->>'number',credit_cards->2->>'number']) STORED;
However, this has the problem that the user has to specify the "indices" of the credit_cards array. If we have more than 3 credit cards, then we'll have to alter the column with new indices.
So is there a way to create a computed column without having to specify the indices?
There is no way to do this. But inverted indexes give you the same lookup capability that I think you're going for here.
If you create an inverted index on the table, then you can search efficiently for rows that have a given number attribute:
demo@127.0.0.1:26257/defaultdb> create table users_combined (credit_cards jsonb);
CREATE TABLE

Time: 3ms total (execution 3ms / network 0ms)

demo@127.0.0.1:26257/defaultdb> create inverted index on users_combined(credit_cards);
CREATE INDEX

Time: 53ms total (execution 4ms / network 49ms)

demo@127.0.0.1:26257/defaultdb> explain select * from users_combined where credit_cards->'number' = '"c4ca4238acf96a36"';
                                           info
----------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • index join
  │ table: users_combined@primary
  │
  └── • scan
        table: users_combined@users_combined_credit_cards_idx
        spans: [/'{"number": "c4ca4238acf96a36"}' - /'{"number": "c4ca4238acf96a36"}']
(10 rows)
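Since the sample credit_cards value is a JSON array of objects, a containment predicate is the usual way to express this lookup, and the inverted index accelerates it as well (a sketch against the sample data):
SELECT * FROM users_combined
WHERE credit_cards @> '[{"number": "c4ca4238acf96a36"}]';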

SELECT rows with primary key of multiple columns

How do I select all relevant records according to the provided list of pairs?
table:
CREATE TABLE "users_groups" (
"user_id" INTEGER NOT NULL,
"group_id" BIGINT NOT NULL,
PRIMARY KEY (user_id, group_id),
"permissions" VARCHAR(255)
);
For example, if I have the following JavaScript array of pairs that I should get from the DB:
[
    {user_id: 1, group_id: 19},
    {user_id: 1, group_id: 11},
    {user_id: 5, group_id: 19}
]
Here we see that the same user_id can be in multiple groups.
I can loop over every array element and build the following query:
SELECT * FROM users_groups
WHERE (user_id = 1 AND group_id = 19)
OR (user_id = 1 AND group_id = 11)
OR (user_id = 5 AND group_id = 19);
But is this the best solution? Let's say the array is very long; as far as I know, query length may go up to ~1GB.
What is the best and quickest solution for this?
Bill Karwin's answer will work for Postgres just as well.
However, my experience is that joining against a VALUES clause is very often faster than a large IN list (with hundreds if not thousands of elements):
select ug.*
from user_groups ug
join (
values (1,19), (1,11), (5,19), ...
) as l(uid, guid) on l.uid = ug.user_id and l.guid = ug.group_id;
This assumes that there are no duplicates in the values provided; otherwise the JOIN would result in duplicated rows, which the IN solution would not produce.
You tagged both mysql and postgresql, so I don't know which SQL database you're really using.
MySQL at least supports tuple comparisons:
SELECT * FROM users_groups WHERE (user_id, group_id) IN ((1,19), (1,11), (5,19), ...)
This kind of predicate can be optimized in MySQL 5.7 and later. See https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor-range-optimization
I don't know whether PostgreSQL supports this type of predicate, or if it optimizes it.
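For what it's worth, PostgreSQL does accept the same row-constructor predicate; a quick sketch (whether it turns into tight index lookups depends on the version and the plan):
SELECT * FROM users_groups
WHERE (user_id, group_id) IN ((1, 19), (1, 11), (5, 19));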

Cassandra CQL retrieve various rows from list of primary keys

At a certain point in my software I have a list of primary keys for which I want to retrieve information from a massively huge table, and I'm wondering what's the most practical way of doing this. Let me illustrate:
Let this be my table structure:
CREATE TABLE table_a (
    name text,
    date timestamp,  -- CQL has no "datetime" type; timestamp is the equivalent
    key int,
    information1 text,
    information2 text,
    PRIMARY KEY ((name, date), key)
);
Say I have a list of primary keys:
list = [['Jack', '2015-01-01 00:00:00', 1],
['Jack', '2015-01-01 00:00:00', 2],
['Richard', '2015-02-14 00:00:00', 5],
['David', '2015-01-01 00:00:00', 9],
...
['Last', '2014-08-13 00:00:00', 12]]
Say this list is huge (hundreds of thousands of entries) and not ordered in any way. I want to retrieve, for every key in the list, the values of the information columns.
As of now, I'm solving this by executing one select query per key, and that has been sufficient hitherto. However, I'm worried about execution times when the list of keys gets too large. Is there a more practical way to query Cassandra for a list of rows whose primary keys I know, without executing one query per key?
If the key were a single field, I could use the select * from table where key in (1,2,6,3,2,4,8) syntax to obtain all the keys I want in one query; however, I don't see how to do this with composite primary keys.
Any light on the case is appreciated.
The best way to go about something like this is to run the queries in parallel. You can do that on the (Java) application side by using async futures, like this:
Future<List<ResultSet>> future = ResultSets.queryAllAsList(session,
"SELECT * FROM users WHERE id=?",
UUID.fromString("0a63bce5-1ee3-4bbf-9bad-d4e136e0d7d1"),
UUID.fromString("7a69657f-39b3-495f-b760-9e044b3c91a9")
);
for (ResultSet rs : future.get()) {
... // process the results here
}
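Note that ResultSets.queryAllAsList above is a helper, not part of the stock driver. For reference, the same fan-out pattern with the plain DataStax Java driver 4.x would look roughly like this (a sketch; the Key record and connection details are assumptions):
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelLookup {
    // hypothetical holder for one composite primary key
    record Key(String name, Instant date, int key) {}

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement ps = session.prepare(
                "SELECT information1, information2 FROM table_a "
                + "WHERE name = ? AND date = ? AND key = ?");

            List<Key> keys = List.of(
                new Key("Jack", Instant.parse("2015-01-01T00:00:00Z"), 1),
                new Key("Jack", Instant.parse("2015-01-01T00:00:00Z"), 2));

            // Fire all queries concurrently; the driver multiplexes them
            // over its connection pool.
            List<CompletableFuture<AsyncResultSet>> futures = keys.stream()
                .map(k -> session
                    .executeAsync(ps.bind(k.name(), k.date(), k.key()))
                    .toCompletableFuture())
                .collect(Collectors.toList());

            for (CompletableFuture<AsyncResultSet> f : futures) {
                Row row = f.join().one();  // one row expected per exact key
                if (row != null) {
                    System.out.println(row.getString("information1"));
                }
            }
        }
    }
}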
Alternatively, create a table that has the 3 key columns' worth of data piped together into a single value, and store that single string value in one column. Make that column the PK. Then you can use the IN clause to filter. For example, select * from table where key IN ('Jack|2015-01-01 00:00:00|1', 'Jack|2015-01-01 00:00:00|2').
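A minimal sketch of that idea in CQL (the table name and separator are illustrative):
CREATE TABLE table_a_by_piped_key (
    piped_key text PRIMARY KEY,  -- 'name|date|key'
    information1 text,
    information2 text
);

SELECT * FROM table_a_by_piped_key
WHERE piped_key IN ('Jack|2015-01-01 00:00:00|1',
                    'Jack|2015-01-01 00:00:00|2');
Keep in mind that each entry in the IN list is still a separate partition read on the server, so this mainly saves client round trips rather than coordinator work.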
Hope that helps!
Adam

How do you insert a string or text as a blob in Cassandra (specifically CQLSH)?

I was trying to insert some text or string as a blob for testing purposes in cqlsh:
insert into test_by_score (commit, delta, test, score)
values (textAsBlob('bdb14fbe076f6b94444c660e36a400151f26fc6f'), 0,
textAsBlob('{"prefix": "enwiki", "title": "\"Aghnadarragh\""}'), 100);
It didn't really work, because after I ran:
select * from test_by_score where commit = 0x0b5db8b91bfdeb0a304b372dd8dda123b3fd1ab6;
it said there were 0 rows... which was a little unexpected (because it didn't throw an error at me), but I guess textAsBlob is not a thing in cqlsh. So does someone know how to do this?
Schema:
CREATE TABLE IF NOT EXISTS test_by_score (
    commit blob,
    delta int,
    score int,
    test blob,
    PRIMARY KEY (commit, delta, test)
);
I have posted my schema a little reluctantly because I believe my question is not really about this specific schema. What I simply want to know is: if there is a column that holds blobs, is it possible to insert a string in that position by first converting it to a blob, in cqlsh?
The following seems to work. The WHERE condition in your SELECT statement may be using the incorrect hex value 0x0b5db8b91bfdeb0a304b372dd8dda123b3fd1ab6.
DROP KEYSPACE example;
CREATE KEYSPACE example WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
USE example;
CREATE TABLE IF NOT EXISTS test_by_score (
commit blob, -- blob representing the commit hash
delta int, -- how much the scores have changed
score int, -- the test score, which is determined by the client
test blob, -- blob for the test
PRIMARY KEY(commit, delta, test)
);
INSERT INTO test_by_score (commit, delta, test, score) VALUES (
textAsBlob('bdb14fbe076f6b94444c660e36a400151f26fc6f'), 0, textAsBlob('{"prefix": "enwiki", "title": "\"Aghnadarragh\""}'), 100
);
INSERT INTO test_by_score (commit, delta, test, score) VALUES (
textAsBlob('cdb14fbe076f6b94444c660e36a400151f26fc6f'), 0, textAsBlob('{"prefix": "enwiki", "title": "\"Aghnadarragh\""}'), 100
);
INSERT INTO test_by_score (commit, delta, test, score) VALUES (
textAsBlob('adb14fbe076f6b94444c660e36a400151f26fc6f'), 0, textAsBlob('{"prefix": "enwiki", "title": "\"Aghnadarragh\""}'), 100
);
To grab all the rows:
SELECT * FROM example.test_by_score;
To select a specific row:
SELECT * FROM example.test_by_score
WHERE commit = 0x62646231346662653037366636623934343434633636306533366134303031353166323666633666;
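If you'd rather not decode hex by hand when reading the data back, blobAsText converts in the other direction:
SELECT blobAsText(commit) AS commit_text, delta, score
FROM example.test_by_score;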
