SELECT rows with primary key of multiple columns - node.js

How do I select all relevant records according to the provided list of pairs?
table:
CREATE TABLE "users_groups" (
"user_id" INTEGER NOT NULL,
"group_id" BIGINT NOT NULL,
PRIMARY KEY (user_id, group_id),
"permissions" VARCHAR(255)
);
For example, say I have the following JavaScript array of pairs that I need to fetch from the DB:
[
{user_id: 1, group_id: 19},
{user_id: 1, group_id: 11},
{user_id: 5, group_id: 19}
]
Here we see that the same user_id can be in multiple groups.
I can loop over every array element with a for-loop and build the following query:
SELECT * FROM users_groups
WHERE (user_id = 1 AND group_id = 19)
OR (user_id = 1 AND group_id = 11)
OR (user_id = 5 AND group_id = 19);
But is this the best solution? Let's say the array is very long; as far as I know, the query length can grow to ~1GB.
What is the best and quickest way to do this?

Bill Karwin's answer will work for Postgres just as well.
However, in my experience, joining against a VALUES clause is very often faster than a large IN list (one with hundreds if not thousands of elements):
select ug.*
from users_groups ug
join (
values (1,19), (1,11), (5,19), ...
) as l(uid, gid) on l.uid = ug.user_id and l.gid = ug.group_id;
This assumes that there are no duplicates in the values provided; otherwise, the JOIN would return duplicated rows, which the IN solution would not.
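As an illustration (my own sketch, not part of the answer), the VALUES-join approach can be exercised end-to-end against an in-memory SQLite database. Note that SQLite names the columns of a VALUES table `column1`, `column2`, since it does not accept the `as l(uid, gid)` column-alias form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users_groups (
    user_id INTEGER NOT NULL,
    group_id INTEGER NOT NULL,
    permissions VARCHAR(255),
    PRIMARY KEY (user_id, group_id))""")
conn.executemany(
    "INSERT INTO users_groups VALUES (?, ?, ?)",
    [(1, 19, "admin"), (1, 11, "read"), (5, 19, "write"), (7, 7, "other")],
)

pairs = [(1, 19), (1, 11), (5, 19)]
# One placeholder tuple per pair: (?, ?), (?, ?), ...
values = ", ".join("(?, ?)" for _ in pairs)
params = [v for pair in pairs for v in pair]
sql = f"""SELECT ug.* FROM users_groups ug
          JOIN (VALUES {values}) AS l
            ON l.column1 = ug.user_id AND l.column2 = ug.group_id"""
rows = conn.execute(sql, params).fetchall()
```

Building the placeholder string from the pair count keeps the values bound as parameters instead of interpolated into the SQL text.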

You tagged both mysql and postgresql, so I don't know which SQL database you're really using.
MySQL at least supports tuple comparisons:
SELECT * FROM users_groups WHERE (user_id, group_id) IN ((1,19), (1,11), (5,19), ...)
This kind of predicate can be optimized in MySQL 5.7 and later. See https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor-range-optimization
PostgreSQL also supports this type of row-constructor predicate, though I don't know how well it optimizes it.
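A sketch of building the tuple-IN query with bound parameters (mine, not from the answer). SQLite 3.15+ also supports row values, so it serves as a convenient stand-in for MySQL or PostgreSQL here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users_groups (
    user_id INTEGER NOT NULL,
    group_id INTEGER NOT NULL,
    permissions VARCHAR(255),
    PRIMARY KEY (user_id, group_id))""")
conn.executemany(
    "INSERT INTO users_groups VALUES (?, ?, ?)",
    [(1, 19, "admin"), (1, 11, "read"), (5, 19, "write"), (2, 3, "none")],
)

# The array of pairs from the question, as it might arrive from JS.
pairs = [{"user_id": 1, "group_id": 19},
         {"user_id": 1, "group_id": 11},
         {"user_id": 5, "group_id": 19}]

placeholders = ", ".join("(?, ?)" for _ in pairs)
params = [v for p in pairs for v in (p["user_id"], p["group_id"])]
rows = conn.execute(
    f"SELECT * FROM users_groups WHERE (user_id, group_id) "
    f"IN (VALUES {placeholders})",
    params,
).fetchall()
```

Only the placeholder skeleton is interpolated; the actual ids travel as parameters, so the approach stays safe for user-supplied pairs.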

Related

Cassandra : Key Level access in Map type columns

In Cassandra, suppose we need to access individual keys of a map-type column. How can we do it?
Create statement:
create table collection_tab2(
empid int,
emploc map<text,text>,
primary key(empid));
Insert statement:
insert into collection_tab2 (empid, emploc ) VALUES ( 100,{'CHE':'Tata Consultancy Services','CBE':'CTS','USA':'Genpact LLC'} );
select:
select emploc from collection_tab2;
empid | emploc
------+--------------------------------------------------------------------------
100 | {'CBE': 'CTS', 'CHE': 'Tata Consultancy Services', 'USA': 'Genpact LLC'}
In that case, if I want to access the 'USA' key alone, what should I do?
I tried using an index, but all the values are returned.
CREATE INDEX fetch_index ON killrvideo.collection_tab2 (keys(emploc));
select * from collection_tab2 where emploc CONTAINS KEY 'CBE';
empid | emploc
------+--------------------------------------------------------------------------
100 | {'CBE': 'CTS', 'CHE': 'Tata Consultancy Services', 'USA': 'Genpact LLC'}
But I expected only the entry for the queried key, e.g.:
'CBE': 'CTS'
Just as a data model change I would strongly recommend:
create table collection_tab2(
empid int,
emploc_key text,
emploc_value text,
primary key(empid, emploc_key));
Then you can query and page through it simply, since emploc_key is a clustering column rather than part of a CQL collection, which has multiple limits and negative performance impacts.
Then:
insert into collection_tab2 (empid, emploc_key, emploc_value) VALUES ( 100, 'CHE', 'Tata Consultancy Services');
insert into collection_tab2 (empid, emploc_key, emploc_value) VALUES ( 100, 'CBE', 'CTS');
insert into collection_tab2 (empid, emploc_key, emploc_value) VALUES ( 100, 'USA', 'Genpact LLC');
You can also put these in an unlogged batch, and it will still be applied efficiently and atomically because everything is in the same partition.
To query it as you have it now, you can use [] element selectors in Cassandra 4.0 and later (CASSANDRA-7396), like:
SELECT emploc['USA'] FROM collection_tab2 WHERE empid = 100;
But I would still strongly recommend the data model change, as it is significantly more efficient and works in existing versions with:
SELECT * FROM collection_tab2 WHERE empid = 100 AND emploc_key = 'USA';
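To feed the remodeled table, a small helper (my own sketch; `map_to_rows` is a hypothetical name, not from the answer) can flatten a Python dict into one bind tuple per map key. It only produces the CQL text and parameters, so it works with any driver's prepared-statement API:

```python
def map_to_rows(empid, emploc):
    """Flatten a map value into (empid, emploc_key, emploc_value) rows.

    Returns the INSERT text (prepared-statement style placeholders) plus
    one bind tuple per key, ready to be executed individually or batched.
    """
    cql = ("INSERT INTO collection_tab2 (empid, emploc_key, emploc_value) "
           "VALUES (?, ?, ?)")
    rows = [(empid, key, value) for key, value in sorted(emploc.items())]
    return cql, rows

cql, rows = map_to_rows(100, {"CHE": "Tata Consultancy Services",
                              "CBE": "CTS",
                              "USA": "Genpact LLC"})
```

Since all rows share the same empid partition key, executing them in one unlogged batch stays a single-partition write, as the answer notes.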

Run query on specific partition of partitioned MySQL table

I would like to run my Ecto.Query.from on a specific partition of a partitioned MySQL table.
Example table:
CREATE TABLE `dogs` (
`dog_id` bigint(20) unsigned NOT NULL,
...
PRIMARY KEY (`dog_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (dog_id)
PARTITIONS 10 */
The ideal query for what I would like to achieve:
from(i in dogs, select: i.dog_id, partition: "p1")
The above doesn't work, of course, so
I have achieved this by converting the query to a string with
Ecto.Adapters.SQL.to_sql and editing it:
... <> "PARTITION (#{partition}) AS" <> ...
This feels hacky and might break with future versions.
Is there a way to achieve this with Ecto?

How to use order by(Sorting) on Secondary index using Cassandra DB

My table schema is:
CREATE TABLE users
(user_id BIGINT PRIMARY KEY,
user_name text,
email_ text);
I inserted below rows into the table.
INSERT INTO users(user_id, email_, user_name)
VALUES(1, 'abc@test.com', 'ABC');
INSERT INTO users(user_id, email_, user_name)
VALUES(2, 'abc@test.com', 'ZYX ABC');
INSERT INTO users(user_id, email_, user_name)
VALUES(3, 'abc@test.com', 'Test ABC');
INSERT INTO users(user_id, email_, user_name)
VALUES(4, 'abc@test.com', 'C ABC');
For searching data into the user_name column, I created an index to use the LIKE operator with '%%':
CREATE CUSTOM INDEX idx_users_user_name ON users (user_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'CONTAINS',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'};
Problem:1
When I execute the below query, it returns only 3 records instead of 4.
select *
from users
where user_name like '%ABC%';
Problem:2
When I use the below query, it gives this error:
ERROR: com.datastax.driver.core.exceptions.InvalidQueryException:
ORDER BY with 2ndary indexes is not supported.
Query:
select *
from users
where user_name like '%ABC%'
ORDER BY user_name ASC;
My requirement is to filter the user_name with order by user_name.
The first query works correctly for me using cassandra:latest, which is currently cassandra:3.11.3. You might want to double-check the inserted data (or just recreate it from scratch using the CQL statements you provided).
The second one gives you enough info - ordering by secondary indexes is not possible in Cassandra. You might have to sort the result set in your application.
That being said, I would not recommend running this setup in real apps. At additional scale (when you have many records), this will be suicide performance-wise. I won't go into much detail since you may already understand this and SO is not a wiki/documentation site, so here is a link.
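Since ORDER BY on a secondary index isn't supported, the sorting has to happen client-side. A minimal Python sketch (mine; rows modeled as dicts with the field names from the question) of sorting the already-filtered result set:

```python
# Rows as they might come back from the driver (order not guaranteed).
rows = [{"user_id": 2, "user_name": "ZYX ABC"},
        {"user_id": 1, "user_name": "ABC"},
        {"user_id": 4, "user_name": "C ABC"},
        {"user_id": 3, "user_name": "Test ABC"}]

# Sort in the application, case-insensitively to match the
# case_sensitive=false option of the SASI index.
ordered = sorted(rows, key=lambda r: r["user_name"].lower())
names = [r["user_name"] for r in ordered]
```

This is fine for result sets small enough to hold in memory; for anything larger, the data model itself should be redesigned, as the answer warns.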

CQL table design for temporal data

As a Cassandra novice, I have a CQL design question. I want to reuse a concept I've built before on RDBMS systems: keeping a history of customerData. The customer himself will only see the latest version, so that access should be the fastest, but queries on the whole history can also be performed.
My suggested entity properties:
customerId text,
validFromDate date,
validUntilDate date,
customerData text
The first save of customerData just INSERTs customerData with validFromDate=NOW and validUntilDate=31-12-9999.
Each subsequent save of customerData updates the last record, setting validUntilDate=NOW, and INSERTs the new customerData with validFromDate=NOW and validUntilDate=31-12-9999.
Result:
This way a query of (customerId, validUntilDate)=(id,31-12-9999) will give last saved version.
Query on (customerId) will give all history.
To query customerData at certain time t just use query with validFromDate < t < validUntilDate
My guess is PARTITION_KEY = customerId and CLUSTER_KEY can be validFromDate. Or use PRIMARY KEY = customerId. Or I could create two tables, one for fast querying of lastest version (has no history), and another for historical analyses.
How do you design this in CQL-way? I think I'm thinking too much RDBMish.
Use the change timestamp as a CLUSTERING KEY with DESC order, e.g.:
CREATE TABLE customer_data_versions (
id text,
change_time timestamp,
name text,
PRIMARY KEY (id, change_time)
) WITH CLUSTERING ORDER BY ( change_time DESC );
It will allow you to store data versions per customer id in descending order.
Insert two versions for the same id:
INSERT INTO customer_data_versions (id, change_time, name) VALUES ('id1', totimestamp(now()),'John');
INSERT INTO customer_data_versions (id, change_time, name) VALUES ('id1', totimestamp(now()),'John Doe');
Get last saved version:
SELECT * FROM customer_data_versions WHERE id='id1' LIMIT 1;
Get all versions for the id:
SELECT * FROM customer_data_versions WHERE id='id1';
Get versions between dates:
SELECT * FROM customer_data_versions WHERE id='id1' AND change_time <= before_date AND change_time >= after_date;
Please note, there are some limits for partition size (how much versions you will be able to store per customer id):
Cells in a partition: ~2 billion (2^31); single column value size: 2 GB (1 MB is recommended)
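The access patterns above can be simulated with SQLite (my own sketch, just to make the queries concrete). One difference: Cassandra's `CLUSTERING ORDER BY (change_time DESC)` makes the ordering implicit, so here the ORDER BY has to be spelled out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_data_versions (
    id TEXT,
    change_time INTEGER,
    name TEXT,
    PRIMARY KEY (id, change_time))""")

# Two versions for the same id, in write order.
conn.executemany(
    "INSERT INTO customer_data_versions VALUES (?, ?, ?)",
    [("id1", 1, "John"), ("id1", 2, "John Doe")],
)

# Last saved version (in Cassandra, LIMIT 1 alone suffices because the
# clustering order is already descending).
latest = conn.execute(
    """SELECT name FROM customer_data_versions
       WHERE id = ? ORDER BY change_time DESC LIMIT 1""",
    ("id1",),
).fetchone()

# All versions within a time window.
window = conn.execute(
    """SELECT name FROM customer_data_versions
       WHERE id = ? AND change_time BETWEEN ? AND ?""",
    ("id1", 1, 1),
).fetchall()
```

The id column plays the partition-key role and change_time the clustering-key role, so every query above stays within a single "partition", mirroring the CQL design.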

Cassandra CQL retrieve various rows from list of primary keys

I have at a certain point in my software a list of primary keys of which I want to retrieve information from a massively huge table, and I'm wondering what's the most practical way of doing this. Let me illustrate:
Let this be my table structure:
CREATE TABLE table_a(
name text,
date timestamp,
key int,
information1 text,
information2 text,
PRIMARY KEY ((name, date), key)
)
say I have a list of primary keys:
list = [['Jack', '2015-01-01 00:00:00', 1],
['Jack', '2015-01-01 00:00:00', 2],
['Richard', '2015-02-14 00:00:00', 5],
['David', '2015-01-01 00:00:00', 9],
...
['Last', '2014-08-13 00:00:00', 12]]
Say this list is huge (hundreds of thousands) and not ordered in any way. I want to retrieve, for every key on the list, the value of the information columns.
As of now, I'm solving this by executing a select query for each key, which has been sufficient so far. However, I'm worried about execution times when the list of keys gets too large. Is there a more practical way to query Cassandra for a list of rows whose primary keys I know, without executing one query per key?
If the key were a single field, I could use the select * from table where key in (1,2,6,3,2,4,8) syntax to obtain all the keys I want in one query, but I don't see how to do this with composite primary keys.
Any light on the case is appreciated.
The best way to go about something like this is to run the queries in parallel. You can do that on the (Java) application side by using async futures; ResultSets.queryAllAsList here is a small helper built on the driver's async API, not part of the driver itself:
Future<List<ResultSet>> future = ResultSets.queryAllAsList(session,
"SELECT * FROM users WHERE id=?",
UUID.fromString("0a63bce5-1ee3-4bbf-9bad-d4e136e0d7d1"),
UUID.fromString("7a69657f-39b3-495f-b760-9e044b3c91a9")
);
for (ResultSet rs : future.get()) {
... // process the results here
}
Create a table that has the three columns' worth of data piped together into a single string, and store that string in a single column. Make that column the PK. Then you can use the IN clause to filter, for example: select * from table where key IN ('Jack|2015-01-01 00:00:00|1', 'Jack|2015-01-01 00:00:00|2').
Hope that helps!
Adam
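A sketch of the piped-key idea (my own; SQLite stands in for the CQL IN clause, and make_key is a hypothetical helper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (key TEXT PRIMARY KEY, information1 TEXT)")

def make_key(name, date, seq):
    # Pipe the three PK components into one string, as the answer suggests.
    return f"{name}|{date}|{seq}"

conn.executemany("INSERT INTO table_a VALUES (?, ?)", [
    (make_key("Jack", "2015-01-01 00:00:00", 1), "a"),
    (make_key("Jack", "2015-01-01 00:00:00", 2), "b"),
    (make_key("Richard", "2015-02-14 00:00:00", 5), "c"),
])

wanted = [("Jack", "2015-01-01 00:00:00", 1),
          ("Richard", "2015-02-14 00:00:00", 5)]
placeholders = ", ".join("?" for _ in wanted)
rows = conn.execute(
    f"SELECT * FROM table_a WHERE key IN ({placeholders})",
    [make_key(*w) for w in wanted],
).fetchall()
```

The separator must never occur inside a component (a '|' in a name would produce colliding keys), which is one reason the parallel-queries approach above is usually the safer choice.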