My Cassandra DB is not returning the expected row count. Please see below the details of my keyspace creation and my COUNT(*) query.
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE key1 WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};
cqlsh> CREATE TABLE Key.Transcation_CompleteMall (i text, i1 text static, i2 bigint , i3 int static, i4 decimal static, i5 bigint static, i6 decimal static, i7 decimal static, PRIMARY KEY ((i),i1));
cqlsh> COPY Key1.CompleteMall (i,i1,i2,i3,i4,i5,i6,i7) FROM '/home/gpadmin/all.csv' WITH HEADER = TRUE; Using 16 child processes
Starting copy of Key1.completemall with columns [i, i1, i2, i3, i4, i5, i6, i7]. Processed: 25461792 rows; Rate: 15162 rows/s; Avg. rate: 54681 rows/s
25461792 rows imported from 1 files in 7 minutes and 45.642 seconds (0 skipped).
cqlsh> select count(*) from Key1.transcation_completemall; OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1 cqlsh> exit
[gpadmin@hmaster ~]$ cqlsh --request-timeout=3600
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> select count(*) from starhub.transcation_completemall;
count
---------
2865767
(1 rows)
Warnings :
Aggregation query used without partition key
cqlsh>
I got only 2865767 rows, but the COPY command shows that 25461792 rows were accepted by Cassandra. The all.csv file is 2.5 GB. To evaluate, I exported the table to another file, test.csv, which to my surprise is only 252 MB.
My question is: does Cassandra automatically remove duplicate rows?
If yes, how does Cassandra detect a duplicate in a table: by primary key repetition, by partition key, or by exact duplication of every field?
Or is there some other way the data could have been lost?
I would appreciate your suggestions.
Thanks in advance to you all.
Cassandra will overwrite data with the same primary key. (Ideally, no database allows duplicate values for a primary key: some throw a constraint error, while others overwrite the data.)
Example:
CREATE TABLE test(id int, id1 int, name text, PRIMARY KEY(id, id1));
INSERT INTO test(id,id1,name) VALUES(1,2,'test');
INSERT INTO test(id,id1,name) VALUES(1,1,'test1');
INSERT INTO test(id,id1,name) VALUES(1,2,'test2');
INSERT INTO test(id,id1,name) VALUES(1,1,'test1');
SELECT * FROM test;
--------------------
| id | id1 | name  |
--------------------
|  1 |   2 | test2 |
--------------------
|  1 |   1 | test1 |
--------------------
The statements above leave only 2 records in the table: one with primary key (1,1) and the other with primary key (1,2).
So in your case, if combinations of i and i1 repeat, the earlier data is overwritten.
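This last-write-wins behaviour can be sketched outside the database. The following Python snippet (an illustration, not driver code) models the primary key as a dict key, so a later write with the same key replaces the earlier row instead of adding a new one:

```python
# Sketch: Cassandra-style upsert semantics simulated with a dict.
# The primary key (id, id1) is the dict key; a second write with the
# same key overwrites the first instead of creating a duplicate row.
writes = [
    (1, 2, "test"),
    (1, 1, "test1"),
    (1, 2, "test2"),
    (1, 1, "test1"),
]

table = {}
for id_, id1, name in writes:
    table[(id_, id1)] = name   # last write wins, like a Cassandra upsert

print(len(table))        # 2: only two distinct primary keys survive
print(table[(1, 2)])     # test2: the later write replaced "test"
```

Four inserts, but only two rows remain, which mirrors how 25461792 CSV lines can shrink to 2865767 stored rows when the key columns repeat.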
Maybe also check the LIMIT option on the SELECT statement; see the reference doc here.
The reference doc says:
Specifying rows returned using LIMIT
Using the LIMIT option, you can specify that the query return a limited number of rows.
SELECT COUNT(*) FROM big_table LIMIT 50000;
SELECT COUNT(*) FROM big_table LIMIT 200000;
If the table had 105,291 rows, the output of these statements would be 50000 and 105,291 respectively. The cqlsh shell has a default row limit of 10,000. The Cassandra server and native protocol do not limit the number of rows that can be returned, although a timeout stops long-running queries to protect against malformed queries that would cause system instability.
Related
I am trying to find a way to determine if the table is empty in Cassandra DB.
cqlsh> SELECT * from examples.basic ;
key | value
-----+-------
(0 rows)
I am running count(*) to get the number of rows, but I get a warning message, so I want to know whether there is a better way to check that the table is empty (zero rows).
cqlsh> SELECT count(*) from examples.basic ;
count
-------
0
(1 rows)
Warnings :
Aggregation query used without partition key
cqlsh>
Aggregations like count can be overkill for what you are trying to accomplish, especially with the star wildcard: if there is any data in your table, the query needs to do a full table scan, which can be quite expensive if you have many records.
One way to get the result you are looking for is the query
cqlsh> SELECT key FROM keyspace1.table1 LIMIT 1;
Empty table:
The resultset will be empty
cqlsh> SELECT key FROM keyspace1.table1 LIMIT 1;
key
-----
(0 rows)
Table with data:
The resultset will have a record
cqlsh> SELECT key FROM keyspace1.table1 LIMIT 1;
key
----------------------------------
uL24bhnsHYRX8wZItWM6xKdS0WLvDsgi
(1 rows)
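The LIMIT 1 check is easy to wrap in a small helper. This Python sketch assumes you already have an iterable of result rows back from the driver for the `SELECT key ... LIMIT 1` query; the session and query objects themselves are omitted:

```python
# Sketch: decide emptiness from the LIMIT 1 result set.
# `rows` stands in for whatever iterable of rows your driver returns
# for "SELECT key FROM keyspace1.table1 LIMIT 1".
def table_is_empty(rows):
    # If the iterator yields nothing, the table had zero rows.
    return next(iter(rows), None) is None

print(table_is_empty([]))                # True: no row came back
print(table_is_empty([("somekey",)]))    # False: at least one row exists
```

Because the query is capped at one row, the cost stays bounded no matter how large the table grows, unlike count(*).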
I have a table/column family in Cassandra 3.7 with sensor data.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using Java and the DataStax Java driver, and been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting the legacy cassandra-cli to this cluster; perhaps that would have told me something I can't see using cqlsh.
So, the questions are:
* Is there any way this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigint gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling Thrift, is possible, but reading the table isn't; it is not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
I wouldn't expect to see nanoseconds in a timestamp field, and additionally I'm under the impression they're not supported at all. Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense, because I suspect one of your drivers is using a bigint to insert the timestamp, while the other is actually using a datetime.
I tried playing with both timezones and bigints to reproduce this; it seems only the bigint route is reproducible:
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using a bigint in place of a datetime insert, and managed to reproduce it...
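To see why two bigint inserts a single millisecond apart can look identical in cqlsh, this Python sketch (an illustration, not Cassandra internals) renders both epoch-millisecond values at second resolution, the precision the shell displays:

```python
from datetime import datetime, timezone

# Sketch: two epoch-millisecond values that differ by 1 ms are distinct
# clustering values on disk, but display identically once truncated to
# second resolution, as in a cqlsh timestamp column.
a_ms = 1451692800000
b_ms = 1451692800001

fmt = "%Y-%m-%d %H:%M:%S"
a = datetime.fromtimestamp(a_ms / 1000, tz=timezone.utc).strftime(fmt)
b = datetime.fromtimestamp(b_ms / 1000, tz=timezone.utc).strftime(fmt)

print(a)             # 2016-01-02 00:00:00
print(a == b)        # True: identical once truncated to seconds
print(a_ms == b_ms)  # False: distinct values stored as clustering keys
```

So a driver writing raw milliseconds and another writing a coarser datetime can produce rows that are genuinely distinct to Cassandra yet indistinguishable in the shell output.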
Adding some observations on top of what Nick mentioned:
Cassandra primary key = one or a combination of {partition key(s) + clustering key(s)}.
Keeping in mind that the partition keys (the part inside the inner parentheses, which can be simple (one key) or composite (multiple keys)) uniquely identify the partition, while the clustering keys sort data within it, the following has been observed.
Query using SELECT: it is sufficient to query using all the partition key(s); additionally you can query using clustering key(s), but only in the same order in which they were declared in the primary key at table creation.
Update using SET or UPDATE: the update statement's condition clause must include not only all the partition key(s) but also all the clustering key(s).
Answering the question - Is there any way this could happen under known circumstances?
Yes, it is possible when the same data is inserted from different sources.
To explain further: if one inserts data from code (an API, etc.) into Cassandra and then inserts the same data from DataStax Studio or any other tool used for direct querying, a duplicate record is inserted.
If the same data is pushed multiple times from code alone, from a querying tool alone, or repeatedly from any other single source, the writes behave idempotently and the data is not inserted again.
A possible explanation is the way the underlying storage engine computes internal indexes or hashes to identify a row for a given set of columns (since the store is column-based).
Note:
The duplication described above, when the same data is pushed from different sources, has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3
"sensor_time" is part of the primary key. It is not in the partition key, but is a clustering column; this is why you get two "rows".
However, on disk both "visual rows" are stored in a single Cassandra row. In reality they are just different columns, and CQL just pretends they are two "visual rows".
Clarification: I have not worked with Cassandra for a while, so I might not use the correct terms. When I say "visual rows", I mean what the CQL result shows.
Update
You can run the following experiment (please forgive and fix any syntax errors I make).
This is supposed to create a table with a composite primary key:
"state" is the partition key and
"city" is the clustering column.
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
this will return something like:
1, 1, New York
1, 2, Corona
But on disk this will be stored in a single row, like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such a composite primary key you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, the primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
This means
"house_id", "sensor_id" and "time_bucket" form the partition key, and
"sensor_time" is the clustering column.
So when you select, the real row is split and shown as if there were several rows.
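The split of one storage row into several CQL rows can be sketched in Python. The names below are illustrative only, not actual Cassandra internals: one partition holds all clustering values, and each (partition, clustering) pair is presented as its own row.

```python
# Sketch of the storage-engine view: the partition state=1 holds every
# column, each prefixed by its clustering value (the city). CQL shows
# one "visual row" per clustering value.
partition = {
    "state=1": {
        (1, "name"): "New York",   # column prefixed by city = 1
        (2, "name"): "Corona",     # column prefixed by city = 2
    }
}

# Reconstruct the rows CQL shows for: SELECT * FROM cities WHERE state = 1;
rows = [(1, city, value)
        for (city, col), value in sorted(partition["state=1"].items())]
print(rows)   # [(1, 1, 'New York'), (1, 2, 'Corona')]
```

One dict entry (one storage row), two CQL rows: the same transposition that makes the two sensor_time values above appear as separate rows.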
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".
How can I get the creation date and time of a Cassandra table?
I tried cqlsh's DESC TABLE, but there is no information about the creation timestamp...
Depending on your version of Cassandra, you can check the schema tables. Each table gets a unique ID when it is created, and that ID gets written to the schema tables. If you query the WRITETIME of that ID, it should give you a UNIX timestamp (in microseconds) of when it was created.
Cassandra 2.2.x and down:
> SELECT keyspace_name, columnfamily_name, writetime(cf_id)
FROM system.schema_columnfamilies
WHERE keyspace_name='stackoverflow' AND columnfamily_name='book';
keyspace_name | columnfamily_name | writetime(cf_id)
---------------+-------------------+------------------
stackoverflow | book | 1446047871412000
(1 rows)
Cassandra 3.0 and up:
> SELECT keyspace_name, table_name, writetime(id)
FROM system_schema.tables
WHERE keyspace_name='stackoverflow' AND table_name='book';
keyspace_name | table_name | writetime(id)
---------------+------------+------------------
stackoverflow | book | 1442339779097000
(1 rows)
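Since writetime() returns microseconds since the Unix epoch, a quick conversion turns the value above into a readable date. A minimal Python sketch:

```python
from datetime import datetime, timezone

# Sketch: writetime() values are microseconds since the epoch, so
# dividing by 1e6 yields a Unix timestamp the standard library renders.
write_time_us = 1442339779097000
created = datetime.fromtimestamp(write_time_us / 1_000_000, tz=timezone.utc)
print(created.strftime("%Y-%m-%d %H:%M:%S"))   # 2015-09-15 17:56:19
```

So the `book` table in the 3.0 example was created on 2015-09-15 (UTC).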
Is it possible to create an index on a UUID/TIMEUUID column in Cassandra? I'm testing out a model design which would have an index on a UUID column, but queries on that column always return 0 rows found.
I have a table like this:
create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
I create an index with this command:
create index idx on some_data (run_id) ;
No errors are thrown by CQL when I create this index.
I have a small bit of test data in the table:
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-----------------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
However, when I run the query:
select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
CQLSH just returns: (0 rows)
If I use an int for the run_id then the index behaves as expected.
Yes, you can create a secondary index on a UUID. The real question is "should you?"
In any case, I followed your steps, and got it to work.
Connected to Test Cluster at 192.168.23.129:9042.
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
aploetz@cqlsh> use stackoverflow ;
aploetz@cqlsh:stackoverflow> create table some_data (site_id int, user_id int, run_id uuid, value int, primary key((site_id, user_id), run_id));
aploetz@cqlsh:stackoverflow> create index idx on some_data (run_id) ;
aploetz@cqlsh:stackoverflow> INSERT INTO some_data (site_id, user_id, run_id, value) VALUES (1,1,9e118af0-ac92-11e4-81ae-8d1bc921f26d,3);
aploetz@cqlsh:stackoverflow> select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
code=2200 [Invalid query] message="unconfigured columnfamily usr_rec3"
aploetz@cqlsh:stackoverflow> select * from some_data where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d;
site_id | user_id | run_id | value
---------+---------+--------------------------------------+-------
1 | 1 | 9e118af0-ac92-11e4-81ae-8d1bc921f26d | 3
(1 rows)
Notice though, that when I ran this command, it failed:
select * from usr_rec3 where run_id = 9e118af0-ac92-11e4-81ae-8d1bc921f26d
Are you sure that you didn't mean to select from some_data instead?
Also, creating secondary indexes on high-cardinality columns (like a UUID) is generally not a good idea. If you need to query by run_id, then you should revisit your data model and come up with an appropriate query table to serve that.
Clarification:
Using secondary indexes in general is not considered good practice. In the new book Cassandra High Availability, Robbie Strickland identifies their use as an anti-pattern, due to poor performance.
Just because a column is of the UUID data type doesn't necessarily make it high-cardinality. That's more of a data model question for you. But knowing the nature of UUIDs and their underlying purpose toward being unique, is setting off red flags.
Put these two points together, and there isn't anything about creating an index on a UUID that sounds appealing to me. If it were my cluster, and (more importantly) I had to support it later, I wouldn't do it.
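The cardinality point can be illustrated with a toy index. In this Python sketch (illustrative only, not Cassandra's secondary-index implementation), a UUID column yields roughly one row per index entry, the worst case for a secondary index, since each lookup still touches the index without narrowing the search much:

```python
import uuid

# Sketch: a toy secondary index over a high-cardinality UUID column.
# Nearly every index entry points at exactly one row, so the index
# is about as large as the table and barely narrows any lookup.
rows = {row_id: uuid.uuid4() for row_id in range(1000)}   # row id -> run_id

index = {}
for row_id, run_id in rows.items():
    index.setdefault(run_id, []).append(row_id)

avg_rows_per_entry = len(rows) / len(index)
print(avg_rows_per_entry)   # ~1.0: one row per index entry
```

Contrast this with a low-cardinality column (say, a status flag with 3 values), where each index entry would cover hundreds of rows and the index actually prunes the scan.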
Folks,
I have recently started reading about NoSQL databases, as I am currently working on a data-warehousing-related application.
I have the following questions. I have already read the basics.
Question 1) How is an entire row retrieved in a column-oriented database, given that data for the same column is stored together?
Let's say we store data in the following format; internally, a column-oriented DB will store
test|test1 together and 5|10 together.
key 1 : { name : test, value : 5 }
key 2 : { name : test1 , value : 10 }
So if we have to retrieve the data for key1, how does that happen? (A and B are my guesses)
A) If it has to pick the data from each column's storage separately, it will be very costly.
B) Is there an indexing mechanism to fetch this data across all columns for a given row key?
Question 2)
I was reading through some of the docs and found that a column-oriented database is better suited to running aggregation functions over a single column, as the I/O will be lower.
I did not find proper support for aggregation functions like SUM, AVG, etc. in NoSQL column-oriented stores like Cassandra and HBase. (There could be some tweaking/hacking/extra code, as in the links below.)
How does Apache Cassandra do aggregate operations?
realtime querying/aggregating millions of records - hadoop? hbase? cassandra?
How to use hbase coprocessor to implement groupby?
Question 3) How do joins happen internally in a column-oriented database, and is it advisable to do them?
Nice question.
1) In Cassandra, if you are using cqlsh, data will look much as it does in MySQL or other RDBMS stores.
Connected to Test Cluster at localhost:9160.
[cqlsh 3.1.7 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> create keyspace test with replication={'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE test ;
cqlsh:test> create table entry(key text PRIMARY KEY, name text, value int );
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key1', 'test',5);
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key2', 'test1',10);
cqlsh:test> select * from entry;
key | name | value
------+-------+-------
key1 | test | 5
key2 | test1 | 10
cqlsh:test>
Note:- you can select rows using the key, or by criteria on other columns using secondary indexes.
But in hbase the structure will look like following
rowkey | column family | column | value
key1 | entry | name | test
key1 | entry | value | 5
key2 | entry | name | test1
key2 | entry | value | 10
Note:- you can select each row using the key or any column value; it's very easy.
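To illustrate question 1, here is a toy column store in Python (illustrative only, not how Cassandra or HBase is implemented): each column's values are kept together, keyed by row key, so reconstructing a full row is one keyed lookup per column rather than a scan.

```python
# Sketch: a toy column-oriented store. The values of each column live
# together ("test|test1" and "5|10"), indexed by row key, so assembling
# the full row for key1 is a dictionary lookup per column.
name_col = {"key1": "test", "key2": "test1"}
value_col = {"key1": 5, "key2": 10}

def get_row(key):
    # One keyed lookup per column store; no scan over other rows.
    return {"name": name_col[key], "value": value_col[key]}

print(get_row("key1"))   # {'name': 'test', 'value': 5}
```

So guess B is closer to the truth: the per-column storage is itself indexed by row key, which keeps whole-row retrieval cheap even though the columns are stored apart.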
2) Yes, NoSQL stores also support batch operations, but only for DML statements.
3) Joins are not supported in any of the NoSQL datastores; they are not meant for joins.
Hope this helps.