Cassandra where clause as a tuple - cassandra

Table12
CustomerId | CampaignID
1 | 1
1 | 2
2 | 3
1 | 3
4 | 2
4 | 4
5 | 5
val CustomerToCampaign = ((1,1),(1,2),(2,3),(1,3),(4,2),(4,4),(5,5))
Is it possible to write a query like this?
select CustomerId, CampaignID from Table12 where (CustomerId, CampaignID) in (CustomerToCampaign_1, CustomerToCampaign_2)
The input is a list of tuples, but the table stores CustomerId and CampaignID as individual columns, not as a single tuple column.

Sure, it's possible, but only on the clustering keys. That means the table needs something else as a partition key or "bucket." For this example, I'll assume that marketing campaigns are time sensitive, and that using "month" as the bucket (partition) gives a good distribution and ease of querying.
CREATE TABLE stackoverflow.customertocampaign (
campaign_month int,
customer_id int,
campaign_id int,
customer_name text,
PRIMARY KEY (campaign_month, customer_id, campaign_id)
);
Now, I can INSERT the data described in your CustomerToCampaign variable. Then, this query works:
aploetz@cqlsh:stackoverflow> SELECT campaign_month, customer_id, campaign_id
FROM customertocampaign WHERE campaign_month=202004
AND (customer_id,campaign_id) = (1,2);
campaign_month | customer_id | campaign_id
----------------+-------------+-------------
202004 | 1 | 2
(1 rows)
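The query above matches a single tuple with `=`. As far as I know, the same multi-column relation also accepts IN on the clustering columns, which is the form the question asked about. Here is a minimal Python sketch that renders such a statement for the table above. `build_tuple_in_query` is a hypothetical helper, not part of any driver, and it only builds the statement text; a real application would normally use the driver's bind parameters instead of inlining values.

```python
def build_tuple_in_query(month, pairs):
    """Render a CQL SELECT with a multi-column IN on the clustering columns."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    # int() rejects non-numeric input, since values are inlined into the text.
    rendered = ", ".join(f"({int(cust)}, {int(camp)})" for cust, camp in pairs)
    return (
        "SELECT campaign_month, customer_id, campaign_id "
        "FROM customertocampaign "
        f"WHERE campaign_month = {int(month)} "
        f"AND (customer_id, campaign_id) IN ({rendered});"
    )


# The tuples from the question; only the first two are used in the filter.
customer_to_campaign = [(1, 1), (1, 2), (2, 3), (1, 3), (4, 2), (4, 4), (5, 5)]
query = build_tuple_in_query(202004, customer_to_campaign[:2])
print(query)
```

The partition key (campaign_month) still has to be restricted, exactly as in the answer's query.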

Related

Cassandra create duplicate table with different primary key

I'm new to Apache Cassandra and have the following issue:
I have a table with PRIMARY KEY (userid, countrycode, carid). As described in many tutorials, this table can be queried using the following filter criteria:
userid = x
userid = x and countrycode = y
userid = x and countrycode = y and carid = z
This is fine for most cases, but now I need to query the table by filtering only on
userid = x and carid = z
Here, the documentation says that the best solution is to create another table with a modified primary key, in this case PRIMARY KEY (userid, carid, countrycode).
The question is how to copy the data from the "original" table to the new one with the different primary key:
On small tables
On huge tables
And another important question concerning the duplication of a huge table: What about the storage needed to save both tables instead of only one?
You can use the COPY command to export from one table and import into the other table.
From your example - I created 2 tables. user_country and user_car with respective primary keys.
CREATE KEYSPACE user WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 } ;
CREATE TABLE user.user_country ( user_id text, country_code text, car_id text, PRIMARY KEY (user_id, country_code, car_id));
CREATE TABLE user.user_car ( user_id text, country_code text, car_id text, PRIMARY KEY (user_id, car_id, country_code));
Let's insert some dummy data into one table.
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('1', 'IN', 'CAR1');
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('2', 'IN', 'CAR2');
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('3', 'IN', 'CAR3');
cqlsh> select * from user.user_country ;
user_id | country_code | car_id
---------+--------------+--------
3 | IN | CAR3
2 | IN | CAR2
1 | IN | CAR1
(3 rows)
Now we will export the data into a CSV. Observe the sequence of columns mentioned.
cqlsh> COPY user.user_country (user_id,car_id, country_code) TO 'export.csv';
Using 1 child processes
Starting copy of user.user_country with columns [user_id, car_id, country_code].
Processed: 3 rows; Rate: 4 rows/s; Avg. rate: 4 rows/s
3 rows exported to 1 files in 0.824 seconds.
export.csv can now be imported directly into the other table.
cqlsh> COPY user.user_car(user_id,car_id, country_code) FROM 'export.csv';
Using 1 child processes
Starting copy of user.user_car with columns [user_id, car_id, country_code].
Processed: 3 rows; Rate: 6 rows/s; Avg. rate: 8 rows/s
3 rows imported from 1 files in 0.359 seconds (0 skipped).
cqlsh>
cqlsh>
cqlsh> select * from user.user_car ;
user_id | car_id | country_code
---------+--------+--------------
3 | CAR3 | IN
2 | CAR2 | IN
1 | CAR1 | IN
(3 rows)
cqlsh>
About your other question: yes, the data will be duplicated, but that is how Cassandra is meant to be used.

Cassandra query max of a particular column for a particular ID

I am trying to write a Cassandra query and my use case is as follows
Let's say the table is
ID | Version
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
Now what I want is to get the latest version for all the IDs.
So the query should give me 2 rows. The first with Id:1 Version 2 and second with ID:2 Version:3
I tried a query like Select * from table where ID=1 and Version= MAX(Version) but it's not a valid syntax.
Can anybody help in this?
SELECT * FROM table WHERE ID = 1 LIMIT 1 would give you the highest version if Version is your clustering key and it is ordered descending. (Substitute your real table name; table itself is a reserved word in CQL.)
CREATE TABLE table (
id int,
version int,
PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);
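The answer covers one ID per query. To illustrate what "first row of each partition under descending clustering order" means for all IDs, here is a small in-memory Python sketch using the sample rows from the question. `latest_per_id` is a hypothetical helper that only simulates the behaviour; it does not talk to Cassandra. On Cassandra 3.6 and later, PER PARTITION LIMIT 1 should give the same result in a single statement, though that is an addition and not part of the original answer.

```python
from itertools import groupby

# (id, version) rows from the question.
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]


def latest_per_id(rows):
    """Return the highest-version row for each id."""
    # Sort by id, then by version descending, mimicking
    # CLUSTERING ORDER BY (version DESC) inside each partition.
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    # The first row of each id group is what LIMIT 1 returns for that id.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]


latest = latest_per_id(rows)
print(latest)
```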

Insert new rows, continue existing rowset row_number count

I'm attempting to perform a kind of upsert operation in U-SQL, where I pull data from a file every day and compare it with yesterday's data, which is stored in a table in Data Lake Storage.
I have created an ID column in the table in DL using row_number(), and it is this "counter" I wish to continue when appending new rows to the old dataset. E.g.
Last inserted row in DL table could look like this:
ID | Column1 | Column2
---+------------+---------
10 | SomeValue | 1
I want the next rows to have the following ascending ids
11 | SomeValue | 1
12 | SomeValue | 1
How would I go about making sure that the next X rows continue the ID count, so that each new row's ID is 1 higher than the previous one?
You could use ROW_NUMBER and add it to the max value from the original table (i.e. using CROSS JOIN and MAX). A simple demo of the technique:
DECLARE @outputFile string = @"\output\output.csv";
@originalInput =
    SELECT *
    FROM ( VALUES
        ( 10, "SomeValue 1", 1 )
    ) AS x ( id, column1, column2 );
@newInput =
    SELECT *
    FROM ( VALUES
        ( "SomeValue 2", 2 ),
        ( "SomeValue 3", 3 )
    ) AS x ( column1, column2 );
@output =
    SELECT id, column1, column2
    FROM @originalInput
    UNION ALL
    SELECT (int)(x.id + ROW_NUMBER() OVER()) AS id, column1, column2
    FROM @newInput
    CROSS JOIN ( SELECT MAX(id) AS id FROM @originalInput ) AS x;
OUTPUT @output
TO @outputFile
USING Outputters.Csv(outputHeader:true);
You will have to be careful if the original table is empty, and add some additional conditions or null checks, but I'll leave that up to you.
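The same idea, including the empty-table case left open above, can be sketched in plain Python. This is an illustration of the numbering logic only, not U-SQL; `append_with_ids` is a hypothetical helper, and starting at 1 when the original table is empty is an assumption.

```python
def append_with_ids(original, new_rows):
    """Append new_rows to original, continuing the id sequence.

    original: list of (id, column1, column2) tuples
    new_rows: list of (column1, column2) tuples
    """
    # default=0 covers the empty-table case, so numbering starts at 1.
    max_id = max((row[0] for row in original), default=0)
    # enumerate(..., start=1) plays the role of ROW_NUMBER() OVER().
    numbered = [
        (max_id + n, column1, column2)
        for n, (column1, column2) in enumerate(new_rows, start=1)
    ]
    return original + numbered


original_input = [(10, "SomeValue 1", 1)]
new_input = [("SomeValue 2", 2), ("SomeValue 3", 3)]
output = append_with_ids(original_input, new_input)
print(output)
```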

cql-import dynamic map Entries in cassandra

I have 2 mysql tables as given below
Table Employee:
id int,
name varchar
Table Emails
emp_id int,
email_add varchar
Table Emails & Employee are connected by employee.id = emails.emp_id
I have entries like:
mysql> select * from employee;
id name
1 a
2 b
3 c
mysql> select * from emails;
emp_id email_add
1 aa@gmail.com
1 aaa@gmail.com
1 aaaa@gmail.com
2 bb@gmail.com
2 bbb@gmail.com
3 cc@gmail.com
6 rows in set (0.02 sec)
Now I want to import the data into Cassandra in the two formats below.
---format 1---
table in cassandra : emp_details:
id , name , email map<text,text>
i.e. data should be like
1 , a , { 'email_1' : 'aa@gmail.com' , 'email_2' : 'aaa@gmail.com' , 'email_3' : 'aaaa@gmail.com' }
2 , b , { 'email_1' : 'bb@gmail.com' , 'email_2' : 'bbb@gmail.com' }
3 , c , { 'email_1' : 'cc@gmail.com' }
---format 2---
I want to have dynamic columns like
id , name, email_1 , email_2 , email_3 .... email_n
Please help with this. My main concern is importing the data from MySQL into the two formats above.
Edit: change list to map
Since you would not expect a user to have more than 1000 emails, I would suggest using Map<text, text> or even List<text>. This is a good fit for CQL collections.
CREATE TABLE users (
id int,
name text,
emails map<text,text>,
PRIMARY KEY(id)
);
INSERT INTO users(id,name,emails)
VALUES(1, 'a', {'email_1': 'aa@gmail.com', 'email_2': 'aaa@gmail.com', 'email_3': 'aaaa@gmail.com'});
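To get from the two MySQL tables to that map, the email rows have to be grouped per employee and given email_1, email_2, ... keys. A minimal Python sketch of that grouping step, assuming the rows have already been fetched; the hardcoded lists stand in for the results of the two MySQL queries, and `build_email_maps` is a hypothetical helper. Writing the result to Cassandra is not shown.

```python
# Stand-ins for "select * from employee" and "select * from emails".
employees = [(1, "a"), (2, "b"), (3, "c")]
emails = [
    (1, "aa@gmail.com"),
    (1, "aaa@gmail.com"),
    (1, "aaaa@gmail.com"),
    (2, "bb@gmail.com"),
    (2, "bbb@gmail.com"),
    (3, "cc@gmail.com"),
]


def build_email_maps(employees, emails):
    """Return (id, name, {'email_1': ..., 'email_2': ...}) per employee."""
    # Group addresses by employee id, keeping their original order.
    by_employee = {}
    for emp_id, address in emails:
        by_employee.setdefault(emp_id, []).append(address)

    rows = []
    for emp_id, name in employees:
        addresses = by_employee.get(emp_id, [])
        # Number the keys from 1 to match the email_n naming in the question.
        email_map = {f"email_{n}": a for n, a in enumerate(addresses, start=1)}
        rows.append((emp_id, name, email_map))
    return rows


emp_details = build_email_maps(employees, emails)
for row in emp_details:
    print(row)
```

Each resulting tuple corresponds to one INSERT into the users table above, with the dictionary bound to the emails column.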

Query results not ordered despite WITH CLUSTERING ORDER BY

I am storing posts from all users in one table. I want to retrieve the posts from all users that a given user is following.
CREATE TABLE posts (
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (userid, time)
)WITH CLUSTERING ORDER BY (time DESC)
I keep the data about whom each user follows in another table:
CREATE TABLE follow (
userid int,
who_follow_me set<int>,
who_i_follow set<int>,
PRIMARY KEY ((userid))
)
I am running a query like this:
select * from posts where userid in(1,2,3,4....n);
2 questions:
Why do I still get data in random order, even though CLUSTERING ORDER BY is specified on posts?
Is the model correct to satisfy the query optimally (a user can have any number of followers)?
I am using Cassandra 2.0.10.
"why I still get data in random order, though CLUSTERING ORDER BY is specified in posts?"
This is because CLUSTERING ORDER BY only orders rows within a single partition key. So in your case, if you wanted to see all of the posts for a specific user like this:
SELECT * FROM posts WHERE userid=1;
That would return your results ordered by time, as all of the rows within the userid=1 partition are clustered by it.
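One way to picture this: each userid partition comes back already sorted by time, newest first, but the partitions themselves are not interleaved. If a globally ordered timeline is needed without changing the schema, the per-partition results can be merged on the client. This is a different technique from the remodeling suggested below, shown only as a sketch; the post tuples and integer timestamps are made up for illustration, and `merge_timelines` is a hypothetical helper.

```python
import heapq

# Hypothetical per-user results, each already sorted newest first,
# as (time, userid, content). Integers stand in for timestamps.
posts_by_user = {
    1: [(30, 1, "third"), (10, 1, "first")],
    2: [(25, 2, "later"), (20, 2, "earlier")],
}


def merge_timelines(partitions):
    """Merge several newest-first lists into one newest-first list."""
    # reverse=True tells heapq.merge that every input is sorted descending.
    return list(heapq.merge(*partitions, key=lambda post: post[0], reverse=True))


timeline = merge_timelines(posts_by_user.values())
for post in timeline:
    print(post)
```

This keeps one query per followed user, so it only makes sense when the number of followed users is modest.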
"Is model correct to satisfy the query optimally (user can have n number of followers)?"
It will work, as long as you don't care about getting the results ordered by timestamp. To be able to query posts for all users ordered by time, you would need to come up with a different partitioning key. Without knowing too much about your application, you could use a column like GROUP (for instance) and partition on that.
So let's say that you evenly assign all of your users to eight groups: A, B, C, D, E, F, G and H. Let's say your table design changed like this:
CREATE TABLE posts (
group text,
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (group, time, userid)
)WITH CLUSTERING ORDER BY (time DESC)
You could then query all posts for all users for group B like this:
SELECT * FROM posts WHERE group='B';
That would give you all of the posts for all of the users in group B, ordered by time. So basically, for your query to order the posts appropriately by time, you need to partition your post data on something other than userid.
EDIT:
In response to your revised schema, which ends with:
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
That's not going to work. In fact, that should produce the following error:
code=2200 [Invalid query] message="Missing CLUSTERING ORDER for column follows"
And even if you did add follows to your CLUSTERING ORDER clause, you would see this:
code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
The CLUSTERING ORDER clause can only be used on the clustering column(s), which in this case is only the follows column. Alter your PRIMARY KEY definition to cluster on follows (ASC) and created (DESC). I tested this with some sample data, and this query works:
aploetz@cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2 AND follows=1;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(3 rows)
If you query by userid alone, you can see the posts for all of your followers. In that case, though, the posts will only be ordered by time within each follows value, like this:
aploetz@cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 0 | 2015-01-25 13:28:00-0600 | 94da27d0-e91f-4c1f-88f2-5a4bbc4a0096
2 | 0 | 2015-01-25 13:23:00-0600 | 798053d3-f1c4-4c1d-a79d-d0faff10a5fb
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(5 rows)
This is my new schema,
CREATE TABLE posts(id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
Here userid represents who posted it, and follows represents the userid of one of that user's followers. Say user x follows 10 other people: I am making 10+1 inserts. There is definitely a lot of data duplication. However, it is now easier to get the timeline for one user with the following query:
select * from posts where follows=?

Resources