How to search a Cassandra collection map using QueryBuilder

In my Cassandra table I have a map collection, and I have indexed the map keys.
CREATE TABLE IF NOT EXISTS test.collection_test(
name text,
year text,
attributeMap map<text,text>,
PRIMARY KEY ((name, year))
);
CREATE INDEX ON collection_test (attributeMap);
The QueryBuilder syntax is as below:
select().all().from("test", "collection_test")
.where(eq("name", name)).and(eq("year", year));
How should I put a WHERE condition on attributeMap?

First of all, you will need to create an index on the keys in your map. By default, an index created on a map indexes the values of the map, not the keys. There is special syntax to index the keys:
CREATE INDEX attributeKeyIndex ON collection_test (KEYS(attributeMap));
Next, to SELECT from a map with indexed keys, you'll need the CONTAINS KEY keyword. But currently, there is not a definition for this functionality in the query builder API. However, there is an open ticket to support it: JAVA-677
Currently, to accomplish this with the Java Driver, you'll need to build your own query or use a prepared statement:
PreparedStatement statement = _session.prepare("SELECT * " +
"FROM test.collection_test " +
"WHERE attributeMap CONTAINS KEY ?");
BoundStatement boundStatement = statement.bind(yourKeyValue);
ResultSet results = _session.execute(boundStatement);
Finally, you should read through the DataStax doc on When To Use An Index. Secondary indexes are known to not perform well. I can't imagine that a secondary index on a collection would be any different.
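To make the difference between the two index types concrete, here is a minimal Python sketch that emulates what each predicate matches. It is not driver code: the rows and the function names contains_key and contains_value are made up for illustration, and the only thing it models is that CONTAINS KEY tests the map's keys (served by the KEYS index) while CONTAINS tests the map's values (served by the default index).

```python
# Hypothetical in-memory rows standing in for test.collection_test.
rows = [
    {"name": "a", "year": "2014", "attributeMap": {"color": "red", "size": "L"}},
    {"name": "b", "year": "2015", "attributeMap": {"color": "blue"}},
    {"name": "c", "year": "2015", "attributeMap": {"shape": "size"}},
]

def contains_key(rows, key):
    """Names of rows matched by: WHERE attributeMap CONTAINS KEY <key>."""
    return [r["name"] for r in rows if key in r["attributeMap"]]

def contains_value(rows, value):
    """Names of rows matched by: WHERE attributeMap CONTAINS <value>."""
    return [r["name"] for r in rows if value in r["attributeMap"].values()]

print(contains_key(rows, "size"))    # prints ['a']  ("size" is a key in row a)
print(contains_value(rows, "size"))  # prints ['c']  ("size" is a value in row c)
```

The same string gives different results depending on which side of the map the index covers, which is why the index in the question (on values) cannot serve a CONTAINS KEY query.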

Related

Add String to Array, without creating a new row in Clickhouse table

I have just started learning ClickHouse. I use Python with the clickhouse_connect library, and I can't work out how to add a new string to an Array(String) column.
I tried to append a new string to the array like this.
My code:
import clickhouse_connect
ch_client = clickhouse_connect.get_client(host=ch_host, user=ch_user, password=ch_pass, database=ch_database)
ch_client.command(f'CREATE TABLE IF NOT EXISTS {ch_table} (key String, strings Array(String)) ENGINE MergeTree ORDER BY key')
insert_data = [['123', ['string1']]]
ch_client.insert(ch_table, insert_data, column_names=['key', 'strings'])
insert_data = [['123', ['string2']]]
ch_client.insert(ch_table, insert_data, column_names=['key', 'strings'])
Is there an easy way to append a new string to the array if the key already exists, and to create a new row if it does not?
You could just insert your rows, then write a query that gives you what you want:
SELECT
key,
groupArrayArray(strings)
FROM ch_table
GROUP BY key;
If that works, you could create a materialized view from this query:
CREATE MATERIALIZED VIEW ch_table_view
ENGINE = AggregatingMergeTree
ORDER BY key
POPULATE AS
SELECT
key,
groupArrayArrayState(strings) AS strings_merged
FROM ch_table
GROUP BY key;
Notice the -State aggregate combinator was used, which keeps a "running total" of the array of strings. To read this column, you need to use the corresponding -Merge combinator:
SELECT
key,
groupArrayArrayMerge(strings_merged)
FROM ch_table_view
GROUP BY key;
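If it helps to see what groupArrayArray computes, here is a minimal Python sketch that emulates it on in-memory rows. It is not ClickHouse code: the function name group_array_array and the sample rows are assumptions for illustration. Also note that ClickHouse does not promise a particular element order inside the merged array, whereas this sketch simply keeps insertion order.

```python
from collections import defaultdict

def group_array_array(rows):
    """Emulate: SELECT key, groupArrayArray(strings) FROM ... GROUP BY key.

    rows is an iterable of (key, list_of_strings) pairs, one per inserted row.
    """
    merged = defaultdict(list)
    for key, strings in rows:
        # Concatenate the arrays of all rows that share the same key.
        merged[key].extend(strings)
    return dict(merged)

# Two inserts for key '123', as in the question, plus one unrelated key.
rows = [("123", ["string1"]), ("123", ["string2"]), ("456", ["other"])]
result = group_array_array(rows)
print(result)
```

Each insert still creates its own row in ch_table. The merged array only exists in the query result (or in the materialized view).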

Yugabyte YCQL: check if a set contains a value?

Is there any way to query a SET type (or MAP/LIST) to find out whether it contains a value?
Something like this:
CREATE TABLE test.table_name(
id text,
ckk SET<INT>,
PRIMARY KEY((id))
);
SELECT * FROM table_name WHERE id = '1' AND ckk CONTAINS 4;
Is there any way to reach this query with YCQL api?
And can we use a SET type in a SECONDARY INDEX?
Is there any way to reach this query with YCQL api?
YCQL does not support the CONTAINS keyword yet (feel free to open an issue for this on the YugabyteDB GitHub).
One workaround can be to use MAP<INT, BOOLEAN> instead of SET<INT> and the [] operator.
For instance:
CREATE TABLE test.table_name(
id text,
ckk MAP<int, boolean>,
PRIMARY KEY((id))
);
SELECT * FROM table_name WHERE id = 'foo' AND ckk[4] = true;
And can we use a SET type in a SECONDARY INDEX?
Generally, collection types cannot be part of the primary key, or an index key.
However, "frozen" collections (i.e. collections serialized into a single value internally) can actually be part of either primary key or index key.
For instance:
CREATE TABLE table2(
id TEXT,
ckk FROZEN<SET<INT>>,
PRIMARY KEY((id))
) WITH transactions = {'enabled' : true};
CREATE INDEX table2_idx on table2(ckk);
Another option is to use a compound primary key and define ckk as a clustering key:
cqlsh> CREATE TABLE ybdemo.tt(id TEXT, ckk INT, PRIMARY KEY ((id), ckk)) WITH CLUSTERING ORDER BY (ckk DESC);
cqlsh> SELECT * FROM ybdemo.tt WHERE id='foo' AND ckk=4;

Batch conditional delete from dynamodb without sort key

I am migrating my database from MongoDB to DynamoDB. I have a problem with the delete function for a table where labName is the partition key and serialNumber is the sort key. Each record also has a feedId, and I want to delete all records for a given labName whose feedId is NOT IN an array of ids.
Is there a way to do this with BatchWriteItem, adding a condition on feedId without supplying the sort key?
In MongoDB I do it with the code below:
let dbHandle = await getMongoDbHandle(dbName);
let query = {
feedid: {$nin: feedObjectIds}
}
let output = await dbModule.removePromisify(dbHandle,
dbModule.collectionNames.feeds, query);
In DynamoDB, you can retrieve (GET) or delete (DELETE) an individual record, conditionally or not, only if you supply every attribute of its primary key. For example:
For a simple primary key, you only need to provide a value for the partition key.
For a composite primary key, you must provide values for both the partition key and the sort key.
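Given that constraint, one common workaround is to first read the items for the labName (for example with a Query on the partition key), filter them on feedId in your own code, and then delete the remaining ones by their full key. BatchWriteItem does not accept conditions on individual requests and takes at most 25 requests per call, so the deletes have to be chunked. Below is a minimal Python sketch of just the filtering and chunking step. It makes no AWS calls; the function name build_delete_batches, the table name "feeds" and the sample items are assumptions for illustration, and the items are assumed to have already been fetched.

```python
def build_delete_batches(items, keep_feed_ids, table_name, batch_size=25):
    """Build RequestItems-shaped payloads that delete every item whose
    feedId is NOT in keep_feed_ids. Each payload holds at most batch_size
    DeleteRequest entries (BatchWriteItem allows up to 25 per call).
    """
    keep = set(keep_feed_ids)
    # A delete needs the full primary key: partition key plus sort key.
    keys = [
        {"labName": item["labName"], "serialNumber": item["serialNumber"]}
        for item in items
        if item["feedId"] not in keep
    ]
    batches = []
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        batches.append({table_name: [{"DeleteRequest": {"Key": k}} for k in chunk]})
    return batches

# Hypothetical items, as if returned by a Query on labName = 'lab1'.
items = [
    {"labName": "lab1", "serialNumber": n, "feedId": "feed%d" % (n % 3)}
    for n in range(60)
]
batches = build_delete_batches(items, keep_feed_ids=["feed0"], table_name="feeds")
print(len(batches), [len(b["feeds"]) for b in batches])
```

The keys here use plain Python values. With a low-level client they would need to be typed attribute values such as {"S": "lab1"}. Any unprocessed items reported by a real BatchWriteItem call would also need to be retried, which this sketch leaves out.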

How to make a lookup-table in cassandra

I want to create a table in Cassandra that is used as a lookup table. I have a lot of URLs in my database and want to store ids instead of the URL strings. So my approach is to store the URLs in a table with two columns: id (int) and url (text).
My problem is that I need an index for the url field and also for the id field.
The first index is used while processing new URLs (to find the id for a URL in the database) and the second index is used while displaying data (to get the URL for an id).
How can I implement that in cassandra?
I would suggest creating 2 separate tables for this:
CREATE TABLE id_url (id int primary key, url text);
and
CREATE TABLE url_id (url text primary key, id int);
Inserts to these tables should be done with a batch:
BEGIN BATCH
INSERT INTO id_url (id, url) VALUES (1, '<url1>');
INSERT INTO url_id (url, id) VALUES ('<url1>', 1);
APPLY BATCH
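The idea behind the two tables is simply to keep one mapping per lookup direction and to write both together. As a rough illustration, here is a minimal in-memory Python sketch of that design. It is not Cassandra code; the class name UrlLookup, its method names and the example URLs are made up for this example.

```python
class UrlLookup:
    """In-memory stand-in for the id_url and url_id tables."""

    def __init__(self):
        self.id_url = {}  # plays the role of table id_url (id -> url)
        self.url_id = {}  # plays the role of table url_id (url -> id)

    def add(self, id_, url):
        # Write both mappings together, as the BATCH does for the two tables.
        self.id_url[id_] = url
        self.url_id[url] = id_

    def id_for(self, url):
        """Lookup used while processing new URLs."""
        return self.url_id.get(url)

    def url_for(self, id_):
        """Lookup used while displaying data."""
        return self.id_url.get(id_)

lookup = UrlLookup()
lookup.add(1, "http://example.com/a")
lookup.add(2, "http://example.com/b")
print(lookup.id_for("http://example.com/b"))  # prints 2
print(lookup.url_for(1))                      # prints http://example.com/a
```

Both lookups are direct key accesses, which is the property the two-table layout gives you in Cassandra: each query hits a partition key rather than a secondary index.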
You could create your table like this:
CREATE TABLE urls_table(
id int PRIMARY KEY,
url text
);
and then create an index on the second column:
create index urls_table_url on urls_table (url);
The lookup by id is satisfied because id is the partition key. The lookup by url is satisfied by the index you created on the url column.

Data Versioning in Cassandra with CQL3

I am quite a n00b in Cassandra (I'm mainly from an RDBMS background with some NoSQL here and there, like Google's BigTable and MongoDB) and I'm struggling with the data modelling for the use cases I'm trying to satisfy. I looked at this and this and even this but they're not exactly what I needed.
I have this basic table:
CREATE TABLE documents (
itemid_version text,
xml_payload text,
insert_time timestamp,
PRIMARY KEY (itemid_version)
);
itemid is actually a UUID (and unique for all documents), and version is an int (version 0 is the "first" version). xml_payload is the full XML doc, and can get quite big. Yes, I'm essentially creating a versioned document store.
As you can see, I concatenated the two to create a primary key and I'll get to why I did this later as I explain the requirements and/or use cases:
1. user needs to get the single (1) doc he wants, he knows the item id and version (not necessarily the latest)
2. user needs to get the single (1) doc he wants, he knows the item id but does not know the latest version
3. user needs the version history of a single (1) doc.
4. user needs to get the list (1 or more) of docs he wants, he knows the item id AND version (not necessarily the latest)
I will be writing the client code that will perform the use cases, please excuse the syntax as I'm trying to be language-agnostic
first one's straightforward:
$itemid_version = concat($itemid, $version)
$doc = csql("select * from documents where itemid_version = {0};"
-f $itemid_version)
now to satisfy the 2nd and 3rd use cases, I am adding the following table:
CREATE TABLE document_versions (
itemid uuid,
version int,
PRIMARY KEY (itemid, version)
) WITH clustering order by (version DESC);
new records will be added as new docs and new versions of existing docs are created
now we have this (use case #2):
$latest_itemid, $latest_version = csql("select itemid,
version from document_versions where itemid = {0}
order by version DESC limit 1;" -f $itemid)
$itemid_version = concat($latest_itemid, $latest_version)
$doc = csql("select * from documents where itemid_version = {0};"
-f $itemid_version)
and this (use case #3):
$versions = csql("select version from document_versions where itemid = {0}"
-f $itemid)
for the 4th requirement, I am adding yet another table:
CREATE TABLE latest_documents (
itemid uuid,
version int,
PRIMARY KEY (itemid, version)
)
records are inserted for new docs, records are updated for existing docs
and now we have this:
$latest_itemids, $latest_versions = csql("select itemid, version
from latest_documents where itemid in ({0})" -f $itemid_list.toCSV())
foreach ($one_itemid in $latest_itemids, $one_version in $latest_versions)
$itemid_version = concat($one_itemid, $one_version)
$latest_docs.append(
csql("select * from documents where itemid_version = {0};"
-f $itemid_version))
Now I hope it's clear why I concatenated itemid and version to create an index for documents as opposed to creating a compound key: I cannot have OR in the WHERE clause in SELECT
You can assume that only one process will do the inserts/updates so you don't need to worry about consistency or isolation issues.
Am I on the right track here? There are quite a number of things that don't sit well with me, but mainly because I don't understand Cassandra yet:
I feel that the primary key for documents should be a composite of (itemid, version) but I can't satisfy use case #4 (return a list from a query)...I can't possibly use a separate SELECT statement for each document due to the performance hit (network overhead)...or can (should) I?
2 trips to get a document if the version is not known beforehand. probably a compromise I have to live with, or maybe there's a better way.
How would this work, Dexter?
It is very similar to your solution, except that you can store all versions and still fetch the latest version from a single table (document_versions).
In most cases I think you can get what you want in a single SELECT. The exception is use case #2, fetching the most recent version of a document, where a preliminary SELECT on document_versions is needed first.
SECOND ATTEMPT
(I removed the code from the first attempt, apologies to anyone who was following in the comments).
CREATE TABLE documents (
itemid_version text,
xml_payload text,
insert_time timestamp,
PRIMARY KEY (itemid_version)
);
CREATE TABLE document_versions (
itemid text,
version int,
PRIMARY KEY (itemid, version)
) WITH CLUSTERING ORDER BY (version DESC);
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
INSERT INTO document_versions (itemid, version) VALUES ('doc1', 1);
INSERT INTO document_versions (itemid, version) VALUES ('doc1', 2);
INSERT INTO document_versions (itemid, version) VALUES ('doc2', 1);
INSERT INTO document_versions (itemid, version) VALUES ('doc2', 2);
user needs to get the single (1) doc he wants, he knows the item id and version (not necessarily the latest)
SELECT * FROM documents WHERE itemid_version = 'doc1-2';
user needs to get the single (1) doc he wants, he knows the item id but does not know the latest version
(You would feed concatenated itemid + version in result of first query into second query)
SELECT * FROM document_versions WHERE itemid = 'doc2' LIMIT 1;
SELECT * FROM documents WHERE itemid_version = 'doc2-2';
user needs the version history of a single (1) doc.
SELECT * FROM document_versions WHERE itemid = 'doc2';
user needs to get the list (1 or more) of docs he wants, he knows the item id AND version (not necessarily the latest)
SELECT * FROM documents WHERE itemid_version IN ('doc1-2', 'doc2-1');
Cheers,
Let's see if we can come up with a model in a top-down fashion, starting from your queries:
CREATE TABLE document_versions (
itemid uuid,
name text STATIC,
version int,
xml_payload text,
insert_time timestamp,
PRIMARY KEY ((itemid), version)
) WITH CLUSTERING ORDER BY (version DESC);
Use case 1: user needs to get the single (1) doc he wants, he knows the item id and version (not necessarily the latest)
SELECT * FROM document_versions
WHERE itemid = ? and version = ?;
Use case 2: user needs to get the single (1) doc he wants, he knows the item id but does not know the latest version
SELECT * FROM document_versions
WHERE itemid = ? limit 1;
Use case 3: user needs the version history of a single (1) doc.
SELECT * FROM document_versions
WHERE itemid = ?;
Use case 4: user needs to get the list (1 or more) of docs he wants, he knows the item id AND version (not necessarily the latest)
SELECT * FROM document_versions
WHERE itemid = ? AND version IN (1, 2);
One table for all these queries is the correct approach. I would suggest taking the free DataStax online course: DS220 Data Modeling
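To see why a single table with itemid as the partition key and version as a descending clustering column covers all four use cases, here is a minimal in-memory Python sketch of that layout. It is only an illustration, not Cassandra code: the class name DocumentVersions, its method names and the sample data are assumptions, and the name and insert_time columns are left out for brevity.

```python
from collections import defaultdict

class DocumentVersions:
    """In-memory model: one partition per itemid, rows ordered by version DESC."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # itemid -> {version: xml_payload}

    def insert(self, itemid, version, xml_payload):
        self._partitions[itemid][version] = xml_payload

    def _rows(self, itemid):
        # Mimic CLUSTERING ORDER BY (version DESC) within one partition.
        part = self._partitions.get(itemid, {})
        return [(v, part[v]) for v in sorted(part, reverse=True)]

    def get(self, itemid, version):
        """Use case 1: WHERE itemid = ? AND version = ?"""
        return self._partitions.get(itemid, {}).get(version)

    def latest(self, itemid):
        """Use case 2: WHERE itemid = ? LIMIT 1"""
        rows = self._rows(itemid)
        return rows[0] if rows else None

    def history(self, itemid):
        """Use case 3: WHERE itemid = ?"""
        return [v for v, _ in self._rows(itemid)]

    def get_many(self, itemid, versions):
        """Use case 4: WHERE itemid = ? AND version IN (...)"""
        wanted = set(versions)
        return [(v, p) for v, p in self._rows(itemid) if v in wanted]

store = DocumentVersions()
store.insert("doc1", 1, "<xml>1st</xml>")
store.insert("doc1", 2, "<xml>2nd</xml>")
store.insert("doc1", 3, "<xml>3rd</xml>")
print(store.latest("doc1"))
print(store.history("doc1"))
```

Because the newest version sorts first inside the partition, the "latest" lookup needs no separate table and no second round trip.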

Resources