How to populate a related table in Cassandra using CQL?

I am trying to practice Cassandra using this example (under the Composite Columns paragraph):
So, I have created table tweets and it looks like following:
cqlsh:twitter> SELECT * from tweets;
tweet_id | author | body
--------------------------------------+-------------+--------------
73954b90-baf7-11e4-a7d0-27983e9e7f51 | gwashington | I chopped...
(1 rows)
Now I am trying to populate timeline, which is a related table, using CQL, and I am not sure how to do it. I tried the SQL approach, but it did not work:
cqlsh:twitter> INSERT INTO timeline (user_id, tweet_id, author, body) SELECT 'gmason', 73954b90-baf7-11e4-a7d0-27983e9e7f51, author, body FROM tweets WHERE tweet_id = 73954b90-baf7-11e4-a7d0-27983e9e7f51;
Bad Request: line 1:55 mismatched input 'select' expecting K_VALUES
So I have two questions:
How do I populate the timeline table with a SQL-style statement, so that it relates to tweets?
How do I make sure that the timeline physical layout is created as shown in that example?
Thanks.
EDIT:
This is the explanation for my question #2 above (the picture illustrating the timeline physical layout was taken from here and is not reproduced):
tldr;
Use cqlsh COPY to export tweets, modify the file, use COPY to import timeline.
Use cassandra-cli to verify the physical structure.
Long version...
I'll go a different way on this one, and suggest that it will probably be easier using the native COPY command in cqlsh.
I followed similar examples found here. After creating the tweets and timeline tables in cqlsh, I inserted rows into tweets as indicated. My tweets table then looked like this:
aploetz@cqlsh:stackoverflow> SELECT * FROM tweets;
tweet_id | author | body
--------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------
05a5f177-f070-486d-b64d-4e2bb28eaecc | gmason | Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state.
b67fe644-4dbe-489b-bc71-90f809f88636 | jmadison | All men having power ought to be distrusted to a certain degree.
819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1 | gwashington | To be prepared for war is one of the most effectual means of preserving peace.
I then exported them like this:
aploetz@cqlsh:stackoverflow> COPY tweets TO '/home/aploetz/tweets_20150223.txt'
WITH DELIMITER='|' AND HEADER=true;
3 rows exported in 0.052 seconds.
I then edited the tweets_20150223.txt file, adding a user_id column on the front and copying a couple of rows, like this:
user_id|tweet_id|author|body
gmason|05a5f177-f070-486d-b64d-4e2bb28eaecc|gmason|Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state.
jmadison|b67fe644-4dbe-489b-bc71-90f809f88636|jmadison|All men having power ought to be distrusted to a certain degree.
gwashington|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace.
jmadison|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace.
ahamilton|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace.
ahamilton|05a5f177-f070-486d-b64d-4e2bb28eaecc|gmason|Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state.
I saved that file as timeline_20150223.txt, and imported it into the timeline table, like this:
aploetz@cqlsh:stackoverflow> COPY timeline FROM '/home/aploetz/timeline_20150223.txt'
WITH DELIMITER='|' AND HEADER=true;
6 rows imported in 0.016 seconds.
Yes, timeline will be a wide-row table, partitioning on user_id and then clustering on tweet_id. I verified the "under the hood" structure by running the cassandra-cli tool, and listing the timeline column family (table). Here you can see how the rows are partitioned by user_id, and each column has the tweet_id uuid as a part of its name:
[default@stackoverflow] list timeline;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: ahamilton
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:, value=, timestamp=1424707827585904)
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:author, value=676d61736f6e, timestamp=1424707827585904)
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:body, value=54686f73652067656e746c656d656e2c2077686f2077696c6c20626520656c65637465642073656e61746f72732c2077696c6c20666978207468656d73656c76657320696e20746865206665646572616c20746f776e2c20616e64206265636f6d6520636974697a656e73206f66207468617420746f776e206d6f7265207468616e206f6620796f75722073746174652e, timestamp=1424707827585904)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585715)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585715)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585715)
-------------------
RowKey: gmason
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:, value=, timestamp=1424707827585150)
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:author, value=676d61736f6e, timestamp=1424707827585150)
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:body, value=54686f73652067656e746c656d656e2c2077686f2077696c6c20626520656c65637465642073656e61746f72732c2077696c6c20666978207468656d73656c76657320696e20746865206665646572616c20746f776e2c20616e64206265636f6d6520636974697a656e73206f66207468617420746f776e206d6f7265207468616e206f6620796f75722073746174652e, timestamp=1424707827585150)
-------------------
RowKey: gwashington
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585475)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585475)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585475)
-------------------
RowKey: jmadison
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585597)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585597)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585597)
=> (name=b67fe644-4dbe-489b-bc71-90f809f88636:, value=, timestamp=1424707827585348)
=> (name=b67fe644-4dbe-489b-bc71-90f809f88636:author, value=6a6d616469736f6e, timestamp=1424707827585348)
=> (name=b67fe644-4dbe-489b-bc71-90f809f88636:body, value=416c6c206d656e20686176696e6720706f776572206f7567687420746f206265206469737472757374656420746f2061206365727461696e206465677265652e, timestamp=1424707827585348)
4 Rows Returned.
Elapsed time: 35 msec(s).

To accomplish this you would need an ETL tool, such as Hadoop or Spark. There is no INSERT ... SELECT in CQL, and this is for a reason: in the real world you need to execute two inserts from your application, one into each table.
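For example, a sketch of those two inserts against the tables above (the timeuuid literal is the one from the question; a logged batch keeps the two writes together):
BEGIN BATCH
INSERT INTO tweets (tweet_id, author, body)
VALUES (73954b90-baf7-11e4-a7d0-27983e9e7f51, 'gwashington', 'I chopped...');
INSERT INTO timeline (user_id, tweet_id, author, body)
VALUES ('gmason', 73954b90-baf7-11e4-a7d0-27983e9e7f51, 'gwashington', 'I chopped...');
APPLY BATCH;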
You will just have to trust that when you have a primary key combining a partition key with a clustering key, the data is stored in a wide-row format.
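For reference, a sketch of the timeline definition implied by that example (assuming text user ids and timeuuid tweet ids): user_id is the partition key and tweet_id the clustering key, which is exactly what produces the wide-row layout verified with cassandra-cli above.
CREATE TABLE timeline (
    user_id text,
    tweet_id timeuuid,
    author text,
    body text,
    PRIMARY KEY (user_id, tweet_id)
);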

Related

Is Leveled Compaction Strategy still beneficial for reads when Rows Are Write-Once?

Among other cases, this DataStax post says that leveled compaction may not be a good option when rows are write-once:
If your rows are always written entirely at once and are never updated, they will naturally always be contained by a single SSTable when using size-tiered compaction. Thus, there's really nothing to gain from leveled compaction.
Also, slide 30 of the talk The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla) says that LCS fits best for:
Use cases needing very consistent read performance with much higher read to write ratio
Wide-partition data model with limited (or slow-growing) number of total partitions but a lot of updates and deletes, or fully TTL’ed dataset
I understand that if a row is updated or deleted frequently it can end up in several SSTables, and hence this will impact read performance. From Leveled Compaction in Apache Cassandra:
Performance can be inconsistent because there are no guarantees as to how many sstables a row may be spread across: in the worst case, we could have columns from a given row in each sstable.
However, in a scenario where rows are write-once, doesn't this strategy also provide a benefit when reading all rows of a partition key?
Because if I understood correctly, with this strategy the rows with the same partition key tend to end up in the same SSTable, since LCS merges SSTables that overlap, in contrast to Size-Tiered Compaction, which merges SSTables of similar size.
When the rows are written strictly once, there is no effect of choosing LeveledCompactionStrategy over SizeTieredCompactionStrategy with regard to read performance (there are other effects; e.g., LCS requires more IO).
Regarding the quote below from the question:
with this strategy the rows with the same partition key tend to be in
the same SSTable, because merges SSTables that overlaps in contrast to
Size Tiered Compaction that merges SSTables with similar size.
When a row with the same partition key is written exactly once, there is no scenario of merging SSTables, as it is not spread out across different SSTables in the first place.
When we talk in terms of updates, it need not be an existing column within that row being updated. There could be a scenario where we add a completely new set of clustering columns, along with their associated columns, for an already existing partition key.
Here is a sample table
CREATE TABLE tablename (
    emailid text,
    sent_date date,
    column3 text,
    PRIMARY KEY (emailid, sent_date)
);
Now for a given emailid (say hello@gmail.com), a single partition key, there could be inserts at two or more times with different sent_date values. Though they are inserts (essentially upserts) to the same partition key, the partition gets spread across SSTables over time, and hence LeveledCompaction would benefit here.
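For instance, a hypothetical pair of inserts (the date literals are for illustration only):
INSERT INTO tablename (emailid, sent_date, column3) VALUES ('hello@gmail.com', '2017-01-01', 'first');
-- a later insert adds a new clustering row to the same partition, possibly landing in a different SSTable
INSERT INTO tablename (emailid, sent_date, column3) VALUES ('hello@gmail.com', '2017-02-01', 'second');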
But assume the same table with just emailid as the primary key, written exactly once. Then there is no advantage, irrespective of how SSTables are compacted, be it SizeTieredCompactionStrategy or LeveledCompactionStrategy, as the row would always live in only one SSTable.
I think the answer is that when the blog talks about a row, it's referring to a Thrift row and not a CQL row. (I'm not the only one to confuse these terms.)
When we say Thrift row we are talking about a partition (or a set of CQL rows with the same partition key).
From Does CQL support dynamic columns / wide rows?
+--------------------------------------------------+-----------+
| Thrift term | CQL term |
+--------------------------------------------------+-----------+
| row | partition |
| column | cell |
| [cell name component or value] | column |
| [group of cells with shared component prefixes] | row |
+--------------------------------------------------+-----------+
From Understanding How CQL3 Maps to Cassandra’s Internal Data Structure
With the following schema
CREATE TABLE tweets (
    user text,
    time timestamp,
    tweet text,
    lat float,
    long float,
    PRIMARY KEY (user, time)
);
(remember that the partition key is the first column that appears in the primary key, in this case "user")
The following CQL rows
user | time | lat | long | tweet
--------------+--------------------------+--------+---------+---------------------
softwaredoug | 2013-07-13 08:21:54-0400 | 38.162 | -78.549 | Having chest pain.
softwaredoug | 2013-07-21 12:15:27-0400 | 38.093 | -78.573 | Speedo self shot.
jnbrymn | 2013-06-29 20:53:15-0400 | 38.092 | -78.453 | I like programming.
jnbrymn | 2013-07-14 22:55:45-0400 | 38.073 | -78.659 | Who likes cats?
jnbrymn | 2013-07-24 06:23:54-0400 | 38.073 | -78.647 | My coffee is cold.
are internally persisted in Thrift like this:
RowKey: softwaredoug
=> (column=2013-07-13 08:21:54-0400:, value=, timestamp=1374673155373000)
=> (column=2013-07-13 08:21:54-0400:lat, value=4218a5e3, timestamp=1374673155373000)
=> (column=2013-07-13 08:21:54-0400:long, value=c29d1917, timestamp=1374673155373000)
=> (column=2013-07-13 08:21:54-0400:tweet, value=486176696e67206368657374207061696e2e, timestamp=1374673155373000)
=> (column=2013-07-21 12:15:27-0400:, value=, timestamp=1374673155407000)
=> (column=2013-07-21 12:15:27-0400:lat, value=42185f3b, timestamp=1374673155407000)
=> (column=2013-07-21 12:15:27-0400:long, value=c29d2560, timestamp=1374673155407000)
=> (column=2013-07-21 12:15:27-0400:tweet, value=53706565646f2073656c662073686f742e, timestamp=1374673155407000)
-------------------
RowKey: jnbrymn
=> (column=2013-06-29 20:53:15-0400:, value=, timestamp=1374673155419000)
=> (column=2013-06-29 20:53:15-0400:lat, value=42185e35, timestamp=1374673155419000)
=> (column=2013-06-29 20:53:15-0400:long, value=c29ce7f0, timestamp=1374673155419000)
=> (column=2013-06-29 20:53:15-0400:tweet, value=49206c696b652070726f6772616d6d696e672e, timestamp=1374673155419000)
=> (column=2013-07-14 22:55:45-0400:, value=, timestamp=1374673155434000)
=> (column=2013-07-14 22:55:45-0400:lat, value=42184ac1, timestamp=1374673155434000)
=> (column=2013-07-14 22:55:45-0400:long, value=c29d5168, timestamp=1374673155434000)
=> (column=2013-07-14 22:55:45-0400:tweet, value=57686f206c696b657320636174733f, timestamp=1374673155434000)
=> (column=2013-07-24 06:23:54-0400:, value=, timestamp=1374673155485000)
=> (column=2013-07-24 06:23:54-0400:lat, value=42184ac1, timestamp=1374673155485000)
=> (column=2013-07-24 06:23:54-0400:long, value=c29d4b44, timestamp=1374673155485000)
=> (column=2013-07-24 06:23:54-0400:tweet, value=4d7920636f6666656520697320636f6c642e, timestamp=1374673155485000)
We clearly see that the 2 CQL rows with user softwaredoug are a single Thrift row.
The case where a single CQL row corresponds to a single Thrift row (e.g. when the partition key == the primary key) is what Deng and Svihla indicate as an anti-pattern use case for LCS:
Heavy write with all unique partitions
However, I will mark dilsingi's answer as correct because I think he already knew this relation.

Duplicate rows/columns for the same primary key in Cassandra

I have a table/column family in Cassandra 3.7 with sensor data.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using Java and the DataStax Java driver, and been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting to this cluster with the legacy cassandra-cli; perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there any way this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, based on comments, answers, and other sources:
* Selecting using blobAsBigint gives the same big integer for all identical rows.
* Connecting using cassandra-cli, after enabling Thrift, is possible, but reading the table isn't; it's not supported after 3.x.
* Dumping out using sstabledump is ongoing, but expected to take another week or two ;)
I don't expect to see nanoseconds in a timestamp field, and additionally I'm of the impression they're not fully supported. Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense, because I would suspect one of your drivers is using a bigint to insert the timestamp, and the other is likely actually using the datetime.
I tried playing with both timezones and bigints to reproduce this... it seems like only bigint is reproducible:
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
Edit: Tried some shenanigans using bigint in place of a datetime insert, and managed to reproduce it...
Adding some observations on top of what Nick mentioned,
Cassandra Primary key = one or combination of {Partition key(s) + Clustering key(s)}
Keeping in mind that partition keys (the part written inside the inner parentheses, simple with one key or composite with multiple keys) uniquely identify a partition, and that clustering keys sort data within it, the following has been observed.
Query using SELECT: it is sufficient to provide all the partition key(s); you can additionally query using clustering key(s), but only in the same order in which they were listed in the primary key at table creation.
Update using UPDATE ... SET: the update statement needs condition clauses which include not only all the partition key(s) but also all the clustering key(s). Both points are sketched below.
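A minimal sketch against the sensor_data table above (the literal values are hypothetical):
-- SELECT: all partition key columns are required; clustering keys are optional, in declared order
SELECT * FROM test.sensor_data WHERE house_id = 1 AND sensor_id = 2 AND time_bucket = 3;
-- UPDATE: the full primary key must be pinned, partition and clustering columns alike
UPDATE test.sensor_data SET sensor_reading = {1: 101.0}
WHERE house_id = 1 AND sensor_id = 2 AND time_bucket = 3 AND sensor_time = '2016-01-02 03:04:05+0000';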
Answering the question - Is there any way this could happen under known circumstances?
Yes, it is possible when the same data is inserted from different sources.
To explain further: if one tries to insert data from code (an API, etc.) into Cassandra, and then tries inserting the same data from DataStax Studio or any tool used for direct querying, a duplicate record is inserted.
If the same data is pushed multiple times from code alone, from a querying tool alone, or from any other single source repeating the same operation, the write behaves idempotently and the data is not inserted again.
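For instance, a minimal sketch of the idempotent single-source case (values are hypothetical); executed twice, these statements upsert onto the same cells and leave a single row:
INSERT INTO test.sensor_data (house_id, sensor_id, time_bucket, sensor_time, sensor_reading)
VALUES (1, 2, 3, '2016-01-02 03:04:05+0000', {1: 101.0});
INSERT INTO test.sensor_data (house_id, sensor_id, time_bucket, sensor_time, sensor_reading)
VALUES (1, 2, 3, '2016-01-02 03:04:05+0000', {1: 101.0});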
The possible explanation could be the way the underlying storage engine computes internal indexes or hashes to identify a row pertaining to a set of columns (since it is column-based).
Note:
The above information about duplication, when the same data is pushed from different sources, has been observed, tested, and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, on disk, both "visual rows" are stored in a single Cassandra row. In reality, they are just different columns, and CQL just pretends they are two "visual rows".
Clarification - I have not worked with Cassandra for a while, so I might not use the correct terms. When I say "visual rows", I mean what the CQL result shows.
Update
You can run the following experiment (please forgive and fix any syntax errors I may make).
This is supposed to create a table with a composite primary key:
"state" is "Partition Key" and
"city" is "Clustering Column".
create table cities(
    state int,
    city int,
    name text,
    primary key((state), city)
);
insert into cities(state, city, name) values(1, 1, 'New York');
insert into cities(state, city, name) values(1, 2, 'Corona');
select * from cities where state = 1;
This will return something like:
1, 1, New York
1, 2, Corona
But on disk this will be stored in a single row, like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such a composite primary key you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, the primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
This means that
"house_id", "sensor_id", and "time_bucket" form the "Partition Key", and
"sensor_time" is the "Clustering Column".
So when you select, the real row is split and shown as if there are several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".

How to get the raw row content in Cassandra 3.3

I am using Cassandra 3.3 and CQL to create the following table
CREATE TABLE collected_data (
collection_hour int,
source_id int,
entity_id int,
measurement text,
value text,
primary key((collection_hour),source_id,entity_id,measurement)
);
After inserting a bunch of values into this table, I wish to see how each row is really stored in Cassandra. For that I have seen folks using cassandra-cli (the list command), but that is not available anymore in 3.3 (post 3.0).
Is there a way to query Cassandra to see how each row is really stored? I am looking for some tool, or any way to do this from CQL ...
Thank you
PS: in the Cassandra CLI one would use the "list" command and get output similar to the following (for a different table, of course):
RowKey: 1
=> (column=, value=, timestamp=1374546754299000)
=> (column=field2, value=00000002, timestamp=1374546754299000)
=> (column=field3, value=00000003, timestamp=1374546754299000)
RowKey: 4
=> (column=, value=, timestamp=1374546757815000)
=> (column=field2, value=00000005, timestamp=1374546757815000)
=> (column=field3, value=00000006, timestamp=1374546757815000)
The storage engine has been rewritten since Cassandra 3.0 so the on-disk layout has changed completely.
There is no official documentation on this subject, but you can look at several places in the source code to get a big picture of how data is laid out on disk:
UnfilteredSerializer: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L29-L71
Cell storage: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/Cell.java#L145-L163
ClusteringPrefix: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ClusteringPrefix.java#L33-L45
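As a practical alternative, the sstabledump tool that ships with Cassandra 3.x (it is mentioned in the duplicate-rows question above) prints the on-disk rows of an SSTable as JSON. A sketch, where the keyspace, table directory, and generation in the path are placeholders for your own:
nodetool flush    # make sure memtable contents are written to an SSTable on disk
sstabledump /var/lib/cassandra/data/<keyspace>/collected_data-<table_id>/mc-<gen>-big-Data.db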

Cassandra IN query not working if table has SET type column

I am new to Cassandra. I have an issue when using IN in a Cassandra query.
If the table has no column of SET type, it works.
CREATE TABLE test (
test_date bigint,
test_id bigint,
caption text,
PRIMARY KEY(test_date,test_id)
);
select * from test where test_date = 2022015 and test_id IN (1,2);
But if I add a column of SET type to the above table, e.g. tags set<text>, and rerun the select query, it gives an error.
CREATE TABLE test1 (
test_date bigint,
test_id bigint,
tags set<text>,
caption text,
PRIMARY KEY(test_date,test_id)
);
select * from test1 where test_date = 2022015 and test_id IN (1,2);
code=2200 [Invalid query] message="Cannot restrict column "test_id" by
IN relation as a collection is selected by the query"
I'm not sure why this restriction should apply particularly to collections. But in your case you can get around this issue by making test_id part of your partition key:
PRIMARY KEY((test_date,test_id))
This will allow you to do IN queries as long as you specify the first part of the composite key (test_date); a sketch follows.
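A minimal sketch of the reworked table (the rest of the schema unchanged from the question):
CREATE TABLE test1 (
    test_date bigint,
    test_id bigint,
    tags set<text>,
    caption text,
    PRIMARY KEY ((test_date, test_id))
);
-- IN is allowed here because test_id is now the last partition key column
SELECT * FROM test1 WHERE test_date = 2022015 AND test_id IN (1, 2);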
I think you are seeing this error due to Cassandra's underlying storage model. When I query your test1 table within CQLSH (with my own test data), this is what I see:
aploetz@cqlsh:stackoverflow> SELECT * FROM test1;
test_date | test_id | caption | tags
-----------+---------+-----------+-------------------------
2022015 | 1 | blah blah | {'one', 'three', 'two'}
2022015 | 2 | blah blah | {'one', 'three', 'two'}
(2 rows)
This view gives a misleading interpretation of how the data is actually stored. This is what it looks like when I query the same table from within cassandra-cli:
[default@stackoverflow] list test1;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 2022015
=> (name=1:, value=, timestamp=1422895168730184)
=> (name=1:caption, value=626c616820626c6168, timestamp=1422895168730184)
=> (name=1:tags:6f6e65, value=, timestamp=1422895168730184)
=> (name=1:tags:7468726565, value=, timestamp=1422895168730184)
=> (name=1:tags:74776f, value=, timestamp=1422895168730184)
=> (name=2:, value=, timestamp=1422895161891116)
=> (name=2:caption, value=626c616820626c6168, timestamp=1422895161891116)
=> (name=2:tags:6f6e65, value=, timestamp=1422895161891116)
=> (name=2:tags:7468726565, value=, timestamp=1422895161891116)
=> (name=2:tags:74776f, value=, timestamp=1422895161891116)
1 Row Returned.
This suggests that collection (set) values are stored as additional column keys. A restriction on using the IN relation is that it must operate on the last key (partitioning or clustering) of a primary key. So I would guess that this is a limitation based on how Cassandra stores the collection data "under the hood."
And just a warning: using IN for production-level queries is not recommended. Some have even gone as far as to put it on the list of Cassandra anti-patterns. My answer to this question (Is the IN relation in Cassandra bad for queries?) explains why IN queries are not optimal.
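The usual alternative is sketched below: issue one single-partition query per key and let the application or driver run them concurrently (these work against either primary key layout discussed here):
SELECT * FROM test1 WHERE test_date = 2022015 AND test_id = 1;
SELECT * FROM test1 WHERE test_date = 2022015 AND test_id = 2;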
EDIT
Just to see, I tried your schema with a list instead of a set, to see if that made any difference. It still didn't work, but from within the cassandra-cli it appeared to add an additional UUID identifier to the key, and stored the actual value as the column value. That is different from how a set was treated... this must be how sets are restricted to unique values.
You can use a Materialized View with test_id as part of the partitioning expression to satisfy your requirement, if changing the PK on your base table is not an option:
CREATE MATERIALIZED VIEW test1_mv AS
SELECT * FROM test1
WHERE test_date IS NOT NULL AND test_id IS NOT NULL
PRIMARY KEY((test_date,test_id));
Then use the Materialized View instead of the base table in your query:
select * from test1_mv where test_date = 2022015 and test_id IN (1,2);

Cassandra long row with different data types

I have read the following article about Cassandra CQL3 and Thrift API
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
In the article, they give an example of creating a schema for gathering data from sensors.
They show a "wide row" solution by making the timestamp a column. Cassandra's strength, as I see it, is supporting up to 2 billion columns per row and offering a fast way to extract data by column.
In the article, with CQL3 they build a table
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
PRIMARY KEY (sensor_id, collected_at)
) WITH COMPACT STORAGE;
which translates to:
sensor_id | collected_at | volts
1 | 2013-06-05 15:11:00-0500 | 3.1
1 | 2013-06-05 15:11:10-0500 | 4.3
1 | 2013-06-05 15:11:20-0500 | 5.7
2 | 2013-06-05 15:11:00-0500 | 3.2
3 | 2013-06-05 15:11:00-0500 | 3.3
3 | 2013-06-05 15:11:10-0500 | 4.3
In Thrift it translates to:
list data;
RowKey: 1
=> (cell=2013-06-05 15:11:00-0500, value=3.1, timestamp=1370463146717000)
=> (cell=2013-06-05 15:11:10-0500, value=4.3, timestamp=1370463282090000)
=> (cell=2013-06-05 15:11:20-0500, value=5.7, timestamp=1370463282093000)
RowKey: 2
=> (cell=2013-06-05 15:11:00-0500, value=3.2, timestamp=1370463332361000)
RowKey: 3
=> (cell=2013-06-05 15:11:00-0500, value=3.3, timestamp=1370463332365000)
=> (cell=2013-06-05 15:11:10-0500, value=4.3, timestamp=1370463332368000)
I'm trying to think of a Cassandra Schema example for the following sensor data gathering problem.
Let's say I add a new set of sensors which have a bigint (long) value (instead of float).
Any ideas on how to design such a table schema to include both sensor types with different data types, yet keep the columns based on the timestamp?
Thanks,
Guy
If you don't need to use COMPACT STORAGE (and backwards compatibility with Thrift), just create your table as
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
other_field bigint,
volts float,
PRIMARY KEY (sensor_id, collected_at)
)
Cassandra supports sparse columns with basically no overhead, so if you programmatically decide to populate only one of the two fields for any CQL row, you will achieve your goal.
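For instance (hypothetical values), each insert populates only the field matching its sensor type; the unused cell simply never exists on disk:
-- a float sensor writes volts only
INSERT INTO data (sensor_id, collected_at, volts) VALUES (1, '2013-06-05 15:11:00-0500', 3.1);
-- a long sensor writes other_field only
INSERT INTO data (sensor_id, collected_at, other_field) VALUES (2, '2013-06-05 15:11:00-0500', 1234567890);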
Alternatively, you can continue to use COMPACT STORAGE and just switch to the blob type. The blob type does absolutely no interpretation or transformation of the bytes you insert into it, so accuracy can be guaranteed. I would not recommend using a text type for this.
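A minimal sketch of the blob variant (the table name is hypothetical; floatAsBlob and bigintAsBlob are standard CQL conversion functions, and the application must know each sensor's type to decode the bytes):
CREATE TABLE data_blob (
    sensor_id int,
    collected_at timestamp,
    value blob,
    PRIMARY KEY (sensor_id, collected_at)
) WITH COMPACT STORAGE;
INSERT INTO data_blob (sensor_id, collected_at, value) VALUES (1, '2013-06-05 15:11:00-0500', floatAsBlob(3.1));
INSERT INTO data_blob (sensor_id, collected_at, value) VALUES (2, '2013-06-05 15:11:00-0500', bigintAsBlob(1234567890));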
