Cassandra CQL: retrieve multiple rows from a list of primary keys

At a certain point in my software I have a list of primary keys for which I want to retrieve information from a very large table, and I'm wondering what the most practical way of doing this is. Let me illustrate:
Let this be my table structure:
CREATE TABLE table_a (
    name text,
    date timestamp,
    key int,
    information1 text,
    information2 text,
    PRIMARY KEY ((name, date), key)
);
Say I have a list of primary keys:
list = [['Jack', '2015-01-01 00:00:00', 1],
        ['Jack', '2015-01-01 00:00:00', 2],
        ['Richard', '2015-02-14 00:00:00', 5],
        ['David', '2015-01-01 00:00:00', 9],
        ...
        ['Last', '2014-08-13 00:00:00', 12]]
Say this list is huge (hundreds of thousands of entries) and not ordered in any way. For every key in the list, I want to retrieve the values of the information columns.
As of now, I'm solving this by executing one SELECT query per key, and that has been sufficient so far. However, I'm worried about execution times when the list of keys gets too large. Is there a more practical way to query Cassandra for a list of rows whose primary keys I know, without executing one query per key?
If the key were a single field, I could use the select * from table where key in (1,2,6,3,2,4,8) syntax to obtain all the rows I want in one query, but I don't see how to do this with composite primary keys.
Any light on the case is appreciated.

The best way to go about something like this is to run the queries in parallel. You can do that on the (Java) application side by using async futures, like this:
// Note: ResultSets.queryAllAsList is a helper from this answer, not part of the
// DataStax driver itself; it issues one async query per id and collects the futures.
Future<List<ResultSet>> future = ResultSets.queryAllAsList(session,
        "SELECT * FROM users WHERE id=?",
        UUID.fromString("0a63bce5-1ee3-4bbf-9bad-d4e136e0d7d1"),
        UUID.fromString("7a69657f-39b3-495f-b760-9e044b3c91a9"));

for (ResultSet rs : future.get()) {
    ... // process the results here
}
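
If you'd rather not pull in a helper, here is a minimal sketch of the same fan-out against the composite-key table from the question, using the DataStax Java driver's executeAsync directly (driver 3.x API assumed; session and the primaryKeys list of {name, date, key} entries are assumed to exist):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import java.util.ArrayList;
import java.util.List;

// Prepare once; bind and fire one async query per composite key.
PreparedStatement ps = session.prepare(
        "SELECT information1, information2 FROM table_a WHERE name=? AND date=? AND key=?");

List<ResultSetFuture> futures = new ArrayList<>();
for (Object[] pk : primaryKeys) {   // each entry: {name, date, key}
    futures.add(session.executeAsync(ps.bind(pk[0], pk[1], pk[2])));
}

// Collect results; for very large lists, throttle the number of in-flight
// queries (e.g. with a Semaphore) so you don't overload the cluster.
for (ResultSetFuture f : futures) {
    ResultSet rs = f.getUninterruptibly();
    // process rs here
}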

Create a table that has the three key columns piped together into a single string value, stored in a single column, and make that column the PK. Then you can use the IN clause to filter. For example: select * from table where key IN ('Jack|2015-01-01 00:00:00|1', 'Jack|2015-01-01 00:00:00|2').
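For illustration, a minimal sketch of such a denormalized table (the table and column names here are hypothetical):

CREATE TABLE table_a_by_composite_key (
    composite_key text PRIMARY KEY,
    information1 text,
    information2 text
);

SELECT information1, information2 FROM table_a_by_composite_key
WHERE composite_key IN ('Jack|2015-01-01 00:00:00|1', 'Jack|2015-01-01 00:00:00|2');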
Hope that helps!
Adam

Related

SELECT rows with primary key of multiple columns

How do I select all relevant records according to the provided list of pairs?
table:
CREATE TABLE "users_groups" (
"user_id" INTEGER NOT NULL,
"group_id" BIGINT NOT NULL,
PRIMARY KEY (user_id, group_id),
"permissions" VARCHAR(255)
);
For example, say I have the following JavaScript array of pairs that I need to fetch from the DB:
[
    {user_id: 1, group_id: 19},
    {user_id: 1, group_id: 11},
    {user_id: 5, group_id: 19}
]
Here we see that the same user_id can be in multiple groups.
I can loop over the array elements and build the following query:
SELECT * FROM users_groups
WHERE (user_id = 1 AND group_id = 19)
OR (user_id = 1 AND group_id = 11)
OR (user_id = 5 AND group_id = 19);
But is this the best solution? Say the array is very long; as far as I know, query length can reach ~1 GB.
What is the best and quickest way to do this?
Bill Karwin's answer will work for Postgres just as well.
However, in my experience, joining against a VALUES clause is very often faster than a large IN list (with hundreds if not thousands of elements):
select ug.*
from users_groups ug
join (
    values (1,19), (1,11), (5,19), ...
) as l(uid, guid) on l.uid = ug.user_id and l.guid = ug.group_id;
This assumes there are no duplicates in the values provided; otherwise the JOIN would return duplicated rows, which the IN solution would not.
You tagged both mysql and postgresql, so I don't know which SQL database you're really using.
MySQL at least supports tuple comparisons:
SELECT * FROM users_groups WHERE (user_id, group_id) IN ((1,19), (1,11), (5,19), ...)
This kind of predicate can be optimized in MySQL 5.7 and later. See https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor-range-optimization
I don't know whether PostgreSQL supports this type of predicate, or if it optimizes it.
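For what it's worth, PostgreSQL does accept the same row-constructor predicate, so the query above can be tried there verbatim (whether it turns into an efficient index scan depends on the version):

SELECT * FROM users_groups WHERE (user_id, group_id) IN ((1,19), (1,11), (5,19));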

How to avoid Cassandra ALLOW FILTERING?

I have the following data model:
campaigns {
    id int PRIMARY KEY,
    scheduletime text,
    SchduleStartdate text,
    SchduleEndDate text,
    enable boolean,
    actionFlag boolean,
    .... etc
}
Here I need to fetch the data based on start date and end date, without ALLOW FILTERING.
I have received several suggestions to redesign the schema to fulfill the requirement, but I cannot filter the data by id alone, since I need the data between the dates.
Can someone give me a good suggestion for how to execute the following query:
select * from campaigns WHERE startdate='XXX' AND endDate='XXX'; // without the ALLOW FILTERING thing
CREATE TABLE campaigns (
    SchduleStartdate text,
    SchduleEndDate text,
    id int,
    scheduletime text,
    enable boolean,
    PRIMARY KEY ((SchduleStartdate, SchduleEndDate), id)
);
You can then run the following queries against the table:
select * from campaigns where SchduleStartdate = 'xxx' and SchduleEndDate = 'xx'; -- answers the question above
select * from campaigns where SchduleStartdate = 'xxx' and SchduleEndDate = 'xx' and id = 1; -- if you want to filter the data further for specific ids
Here SchduleStartdate and SchduleEndDate are used as the partition key, and id is used as the clustering key to make sure the entries are unique.
This way you can filter by start date, end date, and then id if needed.
One downside is that filtering by id alone won't be possible, since you always need to restrict the partition key first.
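If lookups by id alone are also required, the usual Cassandra pattern (a sketch, not part of the answer above) is to denormalize into a second table keyed by id and write to both tables on every insert:

CREATE TABLE campaigns_by_id (
    id int PRIMARY KEY,
    SchduleStartdate text,
    SchduleEndDate text,
    scheduletime text,
    enable boolean
);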

CQL table design for temporal data

As a Cassandra novice, I have a CQL design question. I want to reuse a concept I've built before on RDBMS systems to keep a history of customerData. The customer himself will only ever see the latest version, so that lookup should be the fastest, but queries over the whole history must also be possible.
My suggested entity properties:
customerId text,
validFromDate date,
validUntilDate date,
customerData text
The first save of customerData just INSERTs it with validFromDate=NOW and validUntilDate=31-12-9999.
Each subsequent save of customerData updates the last record, setting validUntilDate=NOW, and INSERTs the new customerData with validFromDate=NOW and validUntilDate=31-12-9999.
Result:
This way, a query on (customerId, validUntilDate)=(id, 31-12-9999) will give the last saved version.
A query on (customerId) alone will give the whole history.
To query customerData at a certain time t, just use validFromDate < t < validUntilDate.
My guess is the partition key should be customerId, and the clustering key could be validFromDate. Or use PRIMARY KEY = customerId only. Or I could create two tables: one for fast querying of the latest version (no history), and another for historical analysis.
How would you design this the CQL way? I suspect I'm thinking too RDBMS-ish.
Use the change timestamp as the clustering key with DESC order, e.g.:
CREATE TABLE customer_data_versions (
    id text,
    change_time timestamp,
    name text,
    PRIMARY KEY (id, change_time)
) WITH CLUSTERING ORDER BY (change_time DESC);
It will allow you to store data versions per customer id in descending order.
Insert two versions for the same id:
INSERT INTO customer_data_versions (id, change_time, name) VALUES ('id1', totimestamp(now()),'John');
INSERT INTO customer_data_versions (id, change_time, name) VALUES ('id1', totimestamp(now()),'John Doe');
Get last saved version:
SELECT * FROM customer_data_versions WHERE id='id1' LIMIT 1;
Get all versions for the id:
SELECT * FROM customer_data_versions WHERE id='id1';
Get versions between dates:
SELECT * FROM customer_data_versions WHERE id='id1' AND change_time <= before_date AND change_time >= after_date;
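For example, with concrete timestamp literals standing in for before_date/after_date (the dates here are hypothetical):

SELECT * FROM customer_data_versions
WHERE id='id1'
  AND change_time >= '2017-10-01 00:00:00'
  AND change_time <= '2017-10-31 23:59:59';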
Please note, there are some limits on partition size (i.e. how many versions you will be able to store per customer id):
Cells in a partition: ~2 billion (2^31); single column value size: 2 GB (1 MB is recommended)

nested map in cassandra data modelling

I have the following requirement for my dataset and need to understand what datatype I should use and how to save my data accordingly:
CREATE TABLE events (
    id text,
    evntoverlap map<text, map<timestamp,int>>,
    PRIMARY KEY (id)
);
evntoverlap = {
    'Dig1': {'2017-10-09 04:10:05': 0},
    'Dig2': {'2017-10-09 04:11:05': 0, '2017-10-09 04:15:05': 0},
    'Dig3': {'2017-10-09 04:11:05': 0, '2017-10-09 04:15:05': 0, '2017-10-09 04:11:05': 0}
}
This gives an error:
Error from server: code=2200 [Invalid query] message="Non-frozen collections are not allowed inside collections: map<text, map<timestamp, int>>"
How should I store this type of data in a single column? Please suggest a datatype and an insert command for the same.
Thanks,
This is a limitation of Cassandra: you can't nest a collection (or UDT) inside a collection without making it frozen. So you need to freeze one of the collections, either the nested one:
CREATE TABLE events (
    id text,
    evntoverlap map<text, frozen<map<timestamp,int>>>,
    PRIMARY KEY (id)
);
or the top-level one:
CREATE TABLE events (
    id text,
    evntoverlap frozen<map<text, map<timestamp,int>>>,
    PRIMARY KEY (id)
);
See the documentation for more details.
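With the first (nested-frozen) variant, an insert for the question's sample data might look like this (note that CQL map literals use key: value syntax; the id 'event1' is made up):

INSERT INTO events (id, evntoverlap) VALUES ('event1', {
    'Dig1': {'2017-10-09 04:10:05': 0},
    'Dig2': {'2017-10-09 04:11:05': 0, '2017-10-09 04:15:05': 0}
});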
CQL collections are limited to 64 KB, and if you put maps in maps you might push that limit. Also, with frozen maps you are deserializing the entire map, modifying it, and re-inserting it. You might be better off with:
CREATE TABLE events (
    id text,
    evnt_key text,
    value map<timestamp, int>,
    PRIMARY KEY ((id), evnt_key)
);
Or even:
CREATE TABLE events (
    id text,
    evnt_key text,
    evnt_time timestamp,
    value int,
    PRIMARY KEY ((id), evnt_key, evnt_time)
);
This would be more efficient and safer, while giving additional benefits like being able to order the evnt_time values in ascending or descending order.
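For illustration, writes and reads against that last layout might look like this (values taken from the question's sample data; the id is made up):

INSERT INTO events (id, evnt_key, evnt_time, value)
VALUES ('event1', 'Dig2', '2017-10-09 04:11:05', 0);

-- all times for one event key, returned ordered by the evnt_time clustering column
SELECT evnt_time, value FROM events WHERE id='event1' AND evnt_key='Dig2';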

Spark Cassandra Connector keyBy and shuffling

I am trying to optimize my spark job by avoiding shuffling as much as possible.
I am using cassandraTable to create the RDD.
The column family's column names are dynamic, thus it is defined as follows:
CREATE TABLE "Profile" (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='ALL' AND
...
This definition results in CassandraRow RDD elements in the following format:
CassandraRow <key, column1, value>
key - the RowKey
column1 - the value of column1 is the name of the dynamic column
value - the value of the dynamic column
So if I have RK='profile1', with columns name='George' and age='34', the resulting RDD will be:
CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
Then I need to group elements that share the same key together to get a PairRdd:
PairRdd<String, Iterable<CassandraRow>>
It's important to note that all the elements I need to group are on the same Cassandra node (they share the same row key), so I expect the connector to keep the locality of the data.
The problem is that using groupBy or groupByKey causes shuffling. I'd rather group them locally, because all the data is on the same node:
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
    .cassandraTable(ks, "Profile")
    .groupBy(new Function<CassandraRow, String>() {
        @Override
        public String call(CassandraRow row) throws Exception {
            return row.getString("key");
        }
    });
My questions are:
Will using keyBy on the RDD cause shuffling, or will it keep the data local?
Is there a way to group the elements by key without shuffling? I read about mapPartitions, but didn't quite understand how to use it.
Thanks,
Shai
I think you are looking for spanByKey, a cassandra-connector-specific operation that takes advantage of the ordering provided by Cassandra to allow grouping of elements without incurring a shuffle stage.
In your case, it should look like:
sc.cassandraTable("keyspace", "Profile")
  .keyBy(row => row.getString("key"))
  .spanByKey
Read more in the docs:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key
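For reference, spanByKey produces an RDD of (key, Seq[row]) pairs; here is a short sketch of consuming it (assuming the connector's implicits are imported and sc is an existing SparkContext):

import com.datastax.spark.connector._

val grouped = sc.cassandraTable("keyspace", "Profile")
  .keyBy(row => row.getString("key"))
  .spanByKey                 // RDD[(String, Seq[CassandraRow])], no shuffle

// e.g. count the dynamic columns stored under each row key
grouped.mapValues(rows => rows.size).collect().foreach(println)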
