Cassandra CQL: different SELECT results - cassandra

I am using the latest Cassandra (2.1.0) and get different results for the following queries.
select * from zzz.contact where user_id = 53528c87-0691-46f7-81a1-77173fd8390f
and contact_id = 5ea82764-ce42-45f3-8724-e121c8b7d32e;
returns the one desired record, but
select * from zzz.contact where user_id = 53528c87-0691-46f7-81a1-77173fd8390f;
returns 6 other rows, without the row returned by the first SELECT.
Structure of the keyspace/table is:
CREATE KEYSPACE zzz
WITH replication = { 'class' : 'NetworkTopologyStrategy', 'DC1' : '2' };
CREATE TABLE IF NOT EXISTS contact (
user_id uuid,
contact_id uuid,
approved boolean,
ignored boolean,
adding_initiator boolean,
PRIMARY KEY ( user_id, contact_id )
);
Both instances serve the keyspace and report status UN (Up/Normal):
d:\Tools\apache-cassandra-2.1.0\bin>nodetool status
Starting NodeTool
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DC1
================
Status=Up/Down|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.0.146 135.83 KB 256 51.7% 6d035991-3471-498b-8051-55f99a2fdfed RAC1
UN 192.168.0.216 3.26 MB 256 48.3% d82f3a69-c6f8-4237-b50e-d2f370ac644a RAC1
I have two Cassandra instances.
Running "nodetool repair" didn't help.
Adding ALLOW FILTERING at the end of the queries didn't help either.
Any help is highly appreciated.
UPD:
Here are the results of the queries:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
d:\Tools\apache-cassandra-2.1.0\bin>cqlsh 192.168.0.216
Connected to ClusterZzz at 192.168.0.216:9042.
[cqlsh 5.0.1 | Cassandra 2.1.0 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
cqlsh> select * from zzz.contact where user_id = 53528c87-0691-46f7-81a1-77173fd8390f and contact_id = 5ea82764-ce42-45f3-8724-e121c8b7d32e;
user_id | contact_id | adding_initiator | approved | ignored
--------------------------------------+--------------------------------------+------------------+----------+---------
53528c87-0691-46f7-81a1-77173fd8390f | 5ea82764-ce42-45f3-8724-e121c8b7d32e | False | True | False
(1 rows)
cqlsh> select * from zzz.contact where user_id = 53528c87-0691-46f7-81a1-77173fd8390f;
user_id | contact_id | adding_initiator | approved | ignored
--------------------------------------+--------------------------------------+------------------+----------+---------
53528c87-0691-46f7-81a1-77173fd8390f | 6fc7f6e4-ac48-484e-9660-128476ca5bf9 | False | False | False
53528c87-0691-46f7-81a1-77173fd8390f | 7a240937-8b28-4424-9772-8c4c8e381432 | False | False | False
53528c87-0691-46f7-81a1-77173fd8390f | 8e6cb13a-96e7-45af-b9d8-40ea459df996 | False | False | False
53528c87-0691-46f7-81a1-77173fd8390f | 938af09a-0fe3-4cdd-b02e-cbdfb078335c | False | True | False
53528c87-0691-46f7-81a1-77173fd8390f | d84d9e7a-e81d-42a2-87b3-f163f7a9a646 | False | True | False
53528c87-0691-46f7-81a1-77173fd8390f | fd2ec705-1661-4cf8-98ef-46f627a9a382 | False | False | False
(6 rows)
cqlsh>
UPD #2:
It is worth mentioning that my nodes run on Windows 7 machines. In production we use Linux, where we have not seen this problem.


Displaying indexes with metadata in YugabyteDB YCQL

[Question posted by a user on YugabyteDB Community Slack]
How can I get metadata info about indexes from the driver?
https://github.com/yugabyte/cassandra-java-driver/blob/3.10.0-yb-x/driver-core/src/main/java/com/datastax/driver/core/IndexMetadata.java
I use this class, but I do not see any information about unique status or WHERE conditions.
You can query the system_schema.indexes table:
ycqlsh:ybdemo> select * from system_schema.indexes;
keyspace_name | table_name | index_name | kind | options | table_id | index_id | transactions | is_unique | tablets
---------------+------------+--------------------+------------+--------------------------------------------------------------------------+--------------------------------------+--------------------------------------+---------------------+-----------+---------
ybdemo | emp | emp_by_userid | COMPOSITES | {'include': 'enum', 'target': 'userid'} | 13563f8c-997e-a298-de46-0c05025e00a7 | 57b96b7f-6be9-55b8-4145-c30739b6d467 | {'enabled': 'true'} | True | null
ybdemo | emp | emp_by_userid_bbbb | COMPOSITES | {'include': 'enum', 'predicate': 'lastname = ''x''', 'target': 'userid'} | 13563f8c-997e-a298-de46-0c05025e00a7 | f44cc10d-5251-6189-8040-6d73857f09dc | {'enabled': 'true'} | True | null
The output shows both is_unique and the predicate used for the partial index (inside the options map).
This was done on version 2.15.0.0.
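As a rough illustration of reading those two fields from client code, the sketch below summarizes rows of system_schema.indexes in Python. It assumes each row has already been fetched and converted to a plain dict (how that happens depends on the driver in use); describe_index is a hypothetical helper name, not part of any driver API.

```python
from typing import Any, Dict, Optional


def describe_index(row: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Summarize one row of system_schema.indexes.

    `row` is assumed to be a plain dict keyed by column name.
    """
    options = row.get("options") or {}
    return {
        "name": row["index_name"],
        "unique": bool(row.get("is_unique")),
        "target": options.get("target"),
        # Only partial indexes carry a predicate; otherwise this is None.
        "predicate": options.get("predicate"),
    }


# Sample rows copied from the query output above (unused columns omitted).
rows = [
    {
        "index_name": "emp_by_userid",
        "options": {"include": "enum", "target": "userid"},
        "is_unique": True,
    },
    {
        "index_name": "emp_by_userid_bbbb",
        "options": {
            "include": "enum",
            "predicate": "lastname = 'x'",
            "target": "userid",
        },
        "is_unique": True,
    },
]

summaries = [describe_index(r) for r in rows]
for summary in summaries:
    print(summary)
```

The predicate is stored as text inside the options map, so a missing key is the only signal that an index is not partial.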

Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING

I have a need to generate DDL statements for Hive tables & views programmatically. I tried using Spark and Beeline for this task. Beeline takes around 5-10 seconds for each of the statements whereas Spark completes the same thing in a few milliseconds. I am planning to use Spark since it is faster compared to beeline. One downside of using spark for getting DDL statements from the hive is, it treats CHAR, VARCHAR characters as String and it doesn't preserve the length information that goes with CHAR,VARCHAR data types. At the same time beeline preserves the data type and the length information for CHAR,VARCHAR data types. I am using Spark 2.4.1 and Beeline 2.1.1.
Given below the sample create table command and its show create table output.
Beeline Output:
Spark-Shell:
I wanted to know if there is any configuration on the Spark side to preserve the data type and length information for CHAR,VARCHAR data types. If there are other ways to get DDL from Hive quickly, I will be fine with that also.
I tested this with Hive 3.1.1 and Spark 3.1.1.
To quote your Stack Overflow question:
"I have a need to generate DDL statements for Hive tables & views programmatically. I tried using Spark and Beeline for this task. Beeline takes around 5-10 seconds for each of the statements whereas Spark completes the same thing in a few milliseconds. I am planning to use Spark since it is faster compared to beeline. One downside of using spark for getting DDL statements from the hive is, it treats CHAR, VARCHAR characters as String and it doesn't preserve the length information that goes with CHAR,VARCHAR data types. At the same time beeline preserves the data type and the length information for CHAR,VARCHAR data types. I am using Spark 2.4.1 and Beeline 2.1.1. Given below the sample create table command and its show create table output."
Create a simple table in Hive in the test database:
hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name data_type comment
id bigint
col1 varchar(30)
col2 string
# Detailed Table Information
Database: test
OwnerType: USER
Owner: hduser
CreateTime: Fri Mar 11 18:29:34 GMT 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1647023374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Now let's go to spark-shell
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647023374')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You can see that Spark shows the column types correctly, including VARCHAR(30).
Now let us create the same table in Hive through beeline:
0: jdbc:hive2://rhes75:10099/default> use test
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
. . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
0: jdbc:hive2://rhes75:10099/default> desc formatted etc
. . . . . . . . . . . . . . . . . . > +-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| id | bigint | |
| col1 | varchar(30) | |
| col2 | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | test | NULL |
| OwnerType: | USER | NULL |
| Owner: | hduser | NULL |
| CreateTime: | Fri Mar 11 18:51:00 GMT 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
| | bucketing_version | 2 |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1647024660 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
33 rows selected (0.159 seconds)
Now check that in spark-shell again
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647024660')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The output is correct again. In summary, Spark reports the column definitions as you defined them in Hive.
The versions you mention ("I am using Spark 2.4.1 and Beeline 2.1.1") are older releases of Spark and Hive, which may have had this issue.
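For the programmatic DDL generation discussed in the question, a minimal PySpark sketch might look like the following. It assumes an existing SparkSession (passed in as spark) with Hive support; show_create_table and column_types are hypothetical helper names, and the regular expression only handles simple backquoted column definitions like those in the output above.

```python
import re

# Matches a backquoted column name followed by its type, e.g. `col1` VARCHAR(30).
_COLUMN_RE = re.compile(r"`(\w+)`\s+([A-Za-z]+(?:\(\d+(?:\s*,\s*\d+)?\))?)")


def show_create_table(spark, table: str) -> str:
    """Return the DDL string Spark reports for `table`.

    `spark` is assumed to be an existing SparkSession with Hive support.
    """
    row = spark.sql(f"SHOW CREATE TABLE {table}").first()
    return row["createtab_stmt"]


def column_types(ddl: str) -> dict:
    """Extract {column_name: TYPE} from a SHOW CREATE TABLE statement."""
    return {name: col_type.upper() for name, col_type in _COLUMN_RE.findall(ddl)}


# DDL copied from the spark-shell output above (TBLPROPERTIES omitted).
sample_ddl = """CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text"""

types = column_types(sample_ddl)
print(types)
# A quick check that the length information survived:
print(types["col1"] == "VARCHAR(30)")
```

A check like the last line can be run per table to detect whether a given Spark version degrades CHAR or VARCHAR columns to STRING.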

Cassandra - Update/Delete based on timestamp datatype

I have the table structure below, which stores failed records.
CREATE TABLE if not exists dummy_plan (
id uuid,
payload varchar,
status varchar,
bucket text,
create_date timestamp,
modified_date timestamp,
primary key ((bucket), create_date, id))
WITH CLUSTERING ORDER BY (create_date ASC)
AND COMPACTION = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 1};
My table looks like this:
| id | payload | status | bucket | create_date | modified_date |
| abc| text1 | Start | 2021-02-15 | 2021-02-15 08:07:50+0000 | |
The table is created and records are inserted successfully. After processing, however, we want to update the record (if processing failed) or delete it (if processing succeeded) based on its id.
The problem is with the timestamp column: I supply the same value that is stored, but the row is still not deleted or updated.
It seems that Cassandra does not support equality (EQ) on timestamp columns.
Please advise.
Thank you in advance.
Cassandra works just fine with timestamp columns, and you can use the equality operator on them. But you need to include the milliseconds in the value; otherwise it won't match:
cqlsh> insert into test.dummy_service_plan_contract (id, create_date, bucket)
values (1, '2021-02-15T11:00:00.123Z', '123');
cqlsh> select * from test.dummy_service_plan_contract;
bucket | create_date | id | modified_date | payload | status
--------+---------------------------------+----+---------------+---------+--------
123 | 2021-02-15 11:00:00.123000+0000 | 1 | null | null | null
(1 rows)
cqlsh> delete from test.dummy_service_plan_contract where bucket = '123' and
id = 1 and create_date = '2021-02-15T11:00:00Z';
cqlsh> select * from test.dummy_service_plan_contract;
bucket | create_date | id | modified_date | payload | status
--------+---------------------------------+----+---------------+---------+--------
123 | 2021-02-15 11:00:00.123000+0000 | 1 | null | null | null
(1 rows)
cqlsh> delete from test.dummy_service_plan_contract where bucket = '123' and
id = 1 and create_date = '2021-02-15T11:00:00.123Z';
cqlsh> select * from test.dummy_service_plan_contract;
bucket | create_date | id | modified_date | payload | status
--------+-------------+----+---------------+---------+--------
(0 rows)
If you don't see the milliseconds in the cqlsh output, configure the datetimeformat setting in .cqlshrc.
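The reason the first DELETE matches nothing is that a CQL timestamp is stored as a count of milliseconds since the Unix epoch, so two values that differ only in their milliseconds are different clustering keys. The sketch below reproduces that comparison in plain Python without a Cassandra connection; to_epoch_millis is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text: str) -> int:
    """Convert an ISO-8601 UTC string such as '2021-02-15T11:00:00.123Z'
    to milliseconds since the Unix epoch."""
    # Accept values with or without a fractional-seconds part.
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in text else "%Y-%m-%dT%H:%M:%SZ"
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    delta = parsed - _EPOCH
    # Integer arithmetic avoids floating-point rounding of the milliseconds.
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


stored = to_epoch_millis("2021-02-15T11:00:00.123Z")   # value written by the INSERT
truncated = to_epoch_millis("2021-02-15T11:00:00Z")    # value used by the first DELETE
exact = to_epoch_millis("2021-02-15T11:00:00.123Z")    # value used by the second DELETE

print(stored == truncated)  # False: the first DELETE addresses a different key
print(stored == exact)      # True: the second DELETE addresses the stored row
print(stored - truncated)   # 123
```

When the timestamp is generated by an application, it is usually safest to keep the exact value that was written (or read it back first) rather than retyping it from formatted output.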

cassandra stateful set in kubernetes

I've been trying to set up a redundant stateful set in Kubernetes with the Google Cassandra image, as described in the Kubernetes 1.7 documentation.
According to the image used, it's a stateful set with a consistency level of ONE.
In my test I'm using SimpleStrategy replication with a replication factor of 3, since I have set up 3 replicas in the stateful set, in one datacenter only.
I've defined cassandra-0, cassandra-1 and cassandra-2 as seeds, so all nodes are seeds.
I've created a keyspace and a table:
"create keyspace if not exists testing with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 }"
"create table testing.test (id uuid primary key, name text, age int, properties map<text,text>, nickames set<text>, goals_year map<int,int>, current_wages float, clubs_season tuple<text,int>);"
I am testing by inserting data from another, unrelated pod using the cqlsh binary, and I can see that the data ends up in every container, so replication is successful.
nodetool status on all pods comes up with:
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.16.0.161 71.04 KiB 32 100.0% 4ad4e1d3-f984-4f0c-a349-2008a40b7f0a Rack1-K8Demo
UN 10.16.0.162 71.05 KiB 32 100.0% fffca143-7ee8-4749-925d-7619f5ca0e79 Rack1-K8Demo
UN 10.16.2.24 71.03 KiB 32 100.0% 975a5394-45e4-4234-9a97-89c3b39baf3d Rack1-K8Demo
...and all cassandra pods have the same data in the table created before:
id | age | clubs_season | current_wages | goals_year | name | nickames | properties
--------------------------------------+-----+--------------+---------------+------------+----------+----------+--------------------------------------------------
b6d6f230-c0f5-11e7-98e0-e9450c2870ca | 26 | null | null | null | jonathan | null | {'goodlooking': 'yes', 'thinkshesthebest': 'no'}
5fd02b70-c0f8-11e7-8e29-3f611e0d5e94 | 26 | null | null | null | jonathan | null | {'goodlooking': 'yes', 'thinkshesthebest': 'no'}
5da86970-c0f8-11e7-8e29-3f611e0d5e94 | 26 | null | null | null | jonathan | null | {'goodlooking': 'yes', 'thinkshesthebest': 'no'}
But then I delete one of those database replica pods (cassandra-0). A new cassandra-0 pod springs up as expected (thanks, Kubernetes!), and I now see that all the pods have lost one of those 3 rows:
id | age | clubs_season | current_wages | goals_year | name | nickames | properties
--------------------------------------+-----+--------------+---------------+------------+----------+----------+--------------------------------------------------
5fd02b70-c0f8-11e7-8e29-3f611e0d5e94 | 26 | null | null | null | jonathan | null | {'goodlooking': 'yes', 'thinkshesthebest': 'no'}
5da86970-c0f8-11e7-8e29-3f611e0d5e94 | 26 | null | null | null | jonathan | null | {'goodlooking': 'yes', 'thinkshesthebest': 'no'}
...and nodetool status now comes up with:
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.16.0.161 71.04 KiB 32 81.7% 4ad4e1d3-f984-4f0c-a349-2008a40b7f0a Rack1-K8Demo
UN 10.16.0.162 71.05 KiB 32 78.4% fffca143-7ee8-4749-925d-7619f5ca0e79 Rack1-K8Demo
DN 10.16.2.24 71.03 KiB 32 70.0% 975a5394-45e4-4234-9a97-89c3b39baf3d Rack1-K8Demo
UN 10.16.2.28 85.49 KiB 32 69.9% 3fbed771-b539-4a44-99ec-d27c3d590f18 Rack1-K8Demo
...shouldn't the Cassandra ring replicate all the data to the newly created pod, so that all 3 rows are still present in every Cassandra pod?
...this experiment is documented on GitHub.
...has anyone tried this experiment, and what might be wrong in this test setup?
Many thanks in advance.
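I think that after bringing down the node, you need to inform the other peers in the cluster that the node is dead and needs replacing. Your second nodetool status output still lists the old node (10.16.2.24) as DN alongside the new one (10.16.2.28).
I would recommend some further reading in order to set up a correct test case.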
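As a small sketch of how the dead node could be identified, the Python below scans nodetool status output for entries in the DN (Down/Normal) state and prints a candidate nodetool removenode command for each host ID. Whether removenode or starting the new pod as a replacement for the old address is the right remedy depends on the setup and is an assumption here, not something the answer specifies; down_nodes is a hypothetical helper.

```python
import re
from typing import List, Tuple

_UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def down_nodes(status_output: str) -> List[Tuple[str, str]]:
    """Return (address, host_id) pairs for nodes reported as DN."""
    result = []
    for line in status_output.splitlines():
        fields = line.split()
        # Node lines start with a two-letter status/state code such as UN or DN.
        if len(fields) < 2 or fields[0] != "DN":
            continue
        match = _UUID_RE.search(line)
        if match:
            result.append((fields[1], match.group(0)))
    return result


# Output copied from the question, taken after cassandra-0 was deleted.
status_output = """\
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.16.0.161 71.04 KiB 32 81.7% 4ad4e1d3-f984-4f0c-a349-2008a40b7f0a Rack1-K8Demo
UN 10.16.0.162 71.05 KiB 32 78.4% fffca143-7ee8-4749-925d-7619f5ca0e79 Rack1-K8Demo
DN 10.16.2.24 71.03 KiB 32 70.0% 975a5394-45e4-4234-9a97-89c3b39baf3d Rack1-K8Demo
UN 10.16.2.28 85.49 KiB 32 69.9% 3fbed771-b539-4a44-99ec-d27c3d590f18 Rack1-K8Demo
"""

dead = down_nodes(status_output)
for address, host_id in dead:
    # Printed for review only; nothing is executed against the cluster.
    print(f"{address}: nodetool removenode {host_id}")
```

The sketch only prints the command so that an operator can decide how to handle the dead entry before touching the ring.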

retrieving data from cassandra database

I'm working with smart parking data stored in a Cassandra database, and I'm trying to get the last status of each device.
I'm working with a self-made dataset.
Here is the description of the table:
table description
select * from parking.meters
I need help getting the last status of each device, please.
In Cassandra, you need to design your tables according to your query patterns. Building a table, filling it with data, and then trying to fulfill a query requirement is a very backward approach. The point is that if you really need to satisfy that query, then your table should have been designed to serve that query from the beginning.
That being said, there may still be a way to make this work. You haven't mentioned which version of Cassandra you are using, but if you are on 3.6+, you can use the PER PARTITION LIMIT clause on your SELECT.
If I build your table structure and INSERT some of your rows:
aploetz#cqlsh:stackoverflow> SELECT * FROM meters ;
parking_id | device_id | date | status
------------+-----------+----------------------+--------
1 | 20 | 2017-01-12T12:14:58Z | False
1 | 20 | 2017-01-10T09:11:51Z | True
1 | 20 | 2017-01-01T13:51:50Z | False
1 | 7 | 2017-01-13T01:20:02Z | False
1 | 7 | 2016-12-02T16:50:04Z | True
1 | 7 | 2016-11-24T23:38:31Z | False
1 | 19 | 2016-12-14T11:36:26Z | True
1 | 19 | 2016-11-22T15:15:23Z | False
(8 rows)
And I consider your PRIMARY KEY and CLUSTERING ORDER definitions:
PRIMARY KEY ((parking_id, device_id), date, status)
) WITH CLUSTERING ORDER BY (date DESC, status ASC);
You are at least clustering by date (which should be an actual date type, not a text), so that will order your rows in a way that helps you here:
aploetz#cqlsh:stackoverflow> SELECT * FROM meters PER PARTITION LIMIT 1;
parking_id | device_id | date | status
------------+-----------+----------------------+--------
1 | 20 | 2017-01-12T12:14:58Z | False
1 | 7 | 2017-01-13T01:20:02Z | False
1 | 19 | 2016-12-14T11:36:26Z | True
(3 rows)
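For Cassandra versions before 3.6, where PER PARTITION LIMIT is not available, the same result can be approximated on the client. The sketch below is plain Python over rows that are assumed to have been fetched already (no driver calls are shown); it keeps the newest row per (parking_id, device_id) partition. Meter and latest_per_device are hypothetical names, and the date comparison relies on the ISO-8601 strings sharing one format so that they sort chronologically.

```python
from typing import Dict, List, NamedTuple, Tuple


class Meter(NamedTuple):
    parking_id: int
    device_id: int
    date: str      # ISO-8601 text, as in the table above
    status: bool


def latest_per_device(rows: List[Meter]) -> List[Meter]:
    """Keep the row with the greatest date for each (parking_id, device_id)."""
    latest: Dict[Tuple[int, int], Meter] = {}
    for row in rows:
        key = (row.parking_id, row.device_id)
        if key not in latest or row.date > latest[key].date:
            latest[key] = row
    return list(latest.values())


# Rows copied from the SELECT output above.
rows = [
    Meter(1, 20, "2017-01-12T12:14:58Z", False),
    Meter(1, 20, "2017-01-10T09:11:51Z", True),
    Meter(1, 20, "2017-01-01T13:51:50Z", False),
    Meter(1, 7, "2017-01-13T01:20:02Z", False),
    Meter(1, 7, "2016-12-02T16:50:04Z", True),
    Meter(1, 7, "2016-11-24T23:38:31Z", False),
    Meter(1, 19, "2016-12-14T11:36:26Z", True),
    Meter(1, 19, "2016-11-22T15:15:23Z", False),
]

result = latest_per_device(rows)
for row in result:
    print(row.parking_id, row.device_id, row.date, row.status)
```

This reads every row, so it is only reasonable for small tables; on 3.6+ the PER PARTITION LIMIT query lets the server do the same work.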
