Get cassandra tables creation date - cassandra

How can I get the creation date and time of a Cassandra table?
I tried cqlsh's DESC TABLE, but it shows no information about the creation timestamp.

Depending on your version of Cassandra, you can check the schema tables. Each table gets a unique ID when it is created, and that ID gets written to the schema tables. If you query the WRITETIME of that ID, it should give you a UNIX timestamp (in microseconds) of when it was created.
Cassandra 2.2.x and earlier:
> SELECT keyspace_name, columnfamily_name, writetime(cf_id)
FROM system.schema_columnfamilies
WHERE keyspace_name='stackoverflow' AND columnfamily_name='book';
 keyspace_name | columnfamily_name | writetime(cf_id)
---------------+-------------------+------------------
 stackoverflow |              book | 1446047871412000
(1 rows)
Cassandra 3.0 and later:
> SELECT keyspace_name, table_name, writetime(id)
FROM system_schema.tables
WHERE keyspace_name='stackoverflow' AND table_name='book';
 keyspace_name | table_name | writetime(id)
---------------+------------+------------------
 stackoverflow |       book | 1442339779097000
(1 rows)
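The WRITETIME value is in microseconds since the Unix epoch, so it still has to be converted to a readable date. A minimal sketch in Python, assuming you copy the number out of cqlsh by hand (the function name writetime_to_datetime is made up for this example, not part of any driver):

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime, so the result is explicit UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def writetime_to_datetime(writetime_us: int) -> datetime:
    """Convert a Cassandra WRITETIME value (microseconds since epoch) to UTC."""
    return EPOCH + timedelta(microseconds=writetime_us)

# Value taken from the Cassandra 3.0 query output above.
created = writetime_to_datetime(1442339779097000)
print(created.isoformat())
```

Keep in mind that this is the write time of the id cell in the schema table, which is why it works as a creation timestamp.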

Related

SparkSQL/Hive: equivalent of MySQL's `information_schema.table.{data_length, table_rows}`?

In MySQL, we can query the table information_schema.tables and obtain useful information such as data_length or table_rows
select
    data_length
  , table_rows
from
    information_schema.tables
where
    table_schema='some_db'
    and table_name='some_table';
+-------------+------------+
| data_length | table_rows |
+-------------+------------+
|        8368 |        198 |
+-------------+------------+
1 row in set (0.01 sec)
Is there an equivalent mechanism for SparkSQL/Hive?
I am fine with using SparkSQL or a programmatic API such as HiveMetaStoreClient (Java API org.apache.hadoop.hive.metastore.HiveMetaStoreClient). For the latter, I read the API docs and could not find any method related to table row counts or sizes.
There is no single command for meta-information. Instead, there is a set of commands you can use:
Describe Table/View/Column
desc [formatted|extended] schema_name.table_name;
show table extended like part_table;
SHOW TBLPROPERTIES tblname("foo");
Display Column Statistics (Hive 0.14.0 and later)
DESCRIBE FORMATTED [db_name.]table_name column_name;
DESCRIBE FORMATTED [db_name.]table_name column_name PARTITION (partition_spec);

Cassandra Result Not Responding Correct Rows in Tables

My Cassandra database is not returning the row count I expect. Below are the details of my keyspace creation and the COUNT(*) query.
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE key1 WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};
cqlsh> CREATE TABLE Key.Transcation_CompleteMall (i text, i1 text static, i2 bigint , i3 int static, i4 decimal static, i5 bigint static, i6 decimal static, i7 decimal static, PRIMARY KEY ((i),i1));
cqlsh> COPY Key1.CompleteMall (i,i1,i2,i3,i4,i5,i6,i7) FROM '/home/gpadmin/all.csv' WITH HEADER = TRUE;
Using 16 child processes
Starting copy of Key1.completemall with columns [i, i1, i2, i3, i4, i5, i6, i7].
Processed: 25461792 rows; Rate: 15162 rows/s; Avg. rate: 54681 rows/s
25461792 rows imported from 1 files in 7 minutes and 45.642 seconds (0 skipped).
cqlsh> select count(*) from Key1.transcation_completemall;
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
cqlsh> exit
[gpadmin@hmaster ~]$ cqlsh --request-timeout=3600
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> select count(*) from starhub.transcation_completemall;
 count
---------
 2865767
(1 rows)
Warnings :
Aggregation query used without partition key
cqlsh>
I got only 2865767 rows, but the COPY command reports that Cassandra accepted 25461792 rows. The all.csv file is 2.5 GB. To check, I exported the table to another file, test.csv, and was surprised to find it is only 252 MB.
My question: does Cassandra automatically remove duplicate rows?
If so, how does Cassandra decide what counts as a duplicate: a repeated primary key, a repeated partition key, or an exact duplicate of every field?
Or
is there some other way the data could have been lost?
I would appreciate your suggestions.
Thanks in advance.
Cassandra overwrites data that has the same primary key. (A database does not keep duplicate values for a primary key: some throw a constraint error, while others, like Cassandra, overwrite the existing row.)
Example:
CREATE TABLE test(id int,id1 int,name text,PRIMARY KEY(id,id1));
INSERT INTO test(id,id1,name) VALUES(1,2,'test');
INSERT INTO test(id,id1,name) VALUES(1,1,'test1');
INSERT INTO test(id,id1,name) VALUES(1,2,'test2');
INSERT INTO test(id,id1,name) VALUES(1,1,'test1');
SELECT * FROM test;
 id | id1 | name
----+-----+-------
  1 |   1 | test1
  1 |   2 | test2
After the statements above, the table holds only 2 records: one with primary key (1,1) and the other with primary key (1,2).
So in your case, if several CSV rows share the same values of i and i1, each later row overwrites the earlier one.
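The overwrite behaviour can be imitated outside Cassandra to see why the row count shrinks. The following is only a rough Python model, not driver code: a dict keyed by the primary key stands in for the table, and the helper name apply_inserts is invented for this illustration. It ignores timestamps and assumes the inserts arrive in order.

```python
def apply_inserts(inserts, key_columns):
    """Replay INSERTs as upserts: rows with the same primary key are merged."""
    table = {}
    for row in inserts:
        primary_key = tuple(row[col] for col in key_columns)
        # Same key seen before: columns from the newer insert replace the old ones.
        table.setdefault(primary_key, {}).update(row)
    return table

# The four INSERT statements from the example above.
inserts = [
    {"id": 1, "id1": 2, "name": "test"},
    {"id": 1, "id1": 1, "name": "test1"},
    {"id": 1, "id1": 2, "name": "test2"},
    {"id": 1, "id1": 1, "name": "test1"},
]
table = apply_inserts(inserts, key_columns=("id", "id1"))
print(len(inserts), "inserts ->", len(table), "rows")
```

Running the same kind of count over the distinct (i, i1) pairs in all.csv would show how many rows the table can actually hold.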
Maybe check the LIMIT option on the SELECT statement; see the reference documentation.
The reference documentation says:
Specifying rows returned using LIMIT
Using the LIMIT option, you can specify that the query return a limited number of rows.
SELECT COUNT(*) FROM big_table LIMIT 50000;
SELECT COUNT(*) FROM big_table LIMIT 200000;
The output of these statements if you had 105,291 rows in the database would be: 50000, and 105,291. The cqlsh shell has a default row limit of 10,000. The Cassandra server and native protocol do not limit the number of rows that can be returned, although a timeout stops running queries to protect against running malformed queries that would cause system instability.

Group by in CQL for Cassandra DB not working

I have this table in a Cassandra keyspace:
create table hashtags(
    id uuid,
    text text,
    frequence int,
    primary key ((text), frequence, id))
with clustering order by (frequence desc, id asc);
So text is the partition key, and frequence and id are the clustering columns.
According to Cassandra documentation regarding the support for GROUP BY operations, I should be able to run this sort of query:
select text, sum(frequence) from hashtags
group by text;
But I keep getting this error:
com.datastax.driver.core.exceptions.SyntaxError: line 2:0 no viable alternative at input 'group' (...text, sum(frequence) from [hashtags] group...)
Is there something I misunderstood in the guide? How can I run this query correctly?
Thanks for helping.
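It worked for me on Apache Cassandra 3.10, tested from cqlsh. GROUP BY was introduced in 3.10, so a syntax error at 'group' suggests the server is running an older version.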
cqlsh:test> select * from hashtags ;
 text  | frequence | id
-------+-----------+--------------------------------------
 hello |         5 | 07ef8ee4-6492-4112-babb-fc3ac2893701
 hello |         4 | 3f6f3b1d-4a33-4a07-ad60-2274a9dc5577
 hello |         1 | 4adf7e2a-f3b9-41eb-85cf-f4c4bdc5d322
    hi |         7 | 71718f46-455e-4012-a306-f31f1cb2454a
(4 rows)
cqlsh:test> select text, sum(frequence) from hashtags group by text;
 text  | system.sum(frequence)
-------+-----------------------
 hello |                    10
    hi |                     7
(2 rows)
Warnings :
Aggregation query used without partition key

Add Column in Apache Cassandra

How can I check from Node.js whether a column exists in Apache Cassandra?
I need to add a column only if it does not exist yet.
I have read that I must run a SELECT first, but if I select a column that does not exist, it returns an error.
Note that if you're on Cassandra 3.x and up, you'll want to query from the columns table on the system_schema keyspace:
aploetz@cqlsh:system_schema> SELECT * FROM system_schema.columns
WHERE keyspace_name='stackoverflow'
AND table_name='vehicle_information'
AND column_name='name';
 keyspace_name | table_name          | column_name | clustering_order | column_name_bytes | kind    | position | type
---------------+---------------------+-------------+------------------+-------------------+---------+----------+------
 stackoverflow | vehicle_information |        name |             none |        0x6e616d65 | regular |       -1 | text
(1 rows)
On Cassandra 2.x, you can check whether a column exists with a SELECT query on the system.schema_columns table.
Suppose you have the table test_table in keyspace test, and you want to check whether the column test_column exists.
Use the query below:
SELECT * FROM system.schema_columns WHERE keyspace_name = 'test' AND columnfamily_name = 'test_table' AND column_name = 'test_column';
If the query returns a result, the column exists; otherwise it does not.
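Since the schema table differs between versions, the lookup can be wrapped in a small helper. This is a sketch in Python rather than Node.js, and the same logic carries over. The function names column_lookup_query and column_exists are hypothetical, the %s placeholders assume a driver that accepts positional parameters in that style, and the actual driver call is left out.

```python
def column_lookup_query(major_version):
    """Pick the schema query for a column, based on the Cassandra major version."""
    if major_version >= 3:
        # Cassandra 3.x and up: schema lives in the system_schema keyspace.
        return ("SELECT column_name FROM system_schema.columns "
                "WHERE keyspace_name = %s AND table_name = %s AND column_name = %s")
    # Cassandra 2.x: schema lives in system.schema_columns.
    return ("SELECT column_name FROM system.schema_columns "
            "WHERE keyspace_name = %s AND columnfamily_name = %s AND column_name = %s")

def column_exists(rows):
    """The column exists if the schema query returned at least one row."""
    return len(list(rows)) > 0

print(column_lookup_query(3))
```

Only when column_exists is false for the query result would you go on to issue the ALTER TABLE ... ADD statement.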

Column oriented database related

Folks,
I have started reading about NoSQL databases, as I am currently working on a data warehousing application.
I have the following questions. I have already read the basics.
Question 1) How is an entire row retrieved in a column-oriented database, given that the data of each column is stored together?
Let's say we store data in the following format; internally, a column-oriented DB would store it like this:
test|test1 together and 5|10 together.
key 1 : { name : test, value : 5 }
key 2 : { name : test1 , value : 10 }
So if we have to retrieve the data for key 1, how does that happen? (A and B are my guesses.)
A) If it has to pick the data from each column's storage separately, that will be very costly.
B) Is there an indexing mechanism to fetch the data of all columns for a given row key?
Question 2)
I was reading through some of the docs and found that a column-oriented database is better suited to running aggregation functions on a single column, because less I/O is needed.
I did not find proper support for aggregation functions like SUM, AVG, etc. in NoSQL column-oriented stores like Cassandra and HBase. (There may be workarounds involving tweaks or extra code, as in the questions below.)
How does Apache Cassandra do aggregate operations?
realtime querying/aggregating millions of records - hadoop? hbase? cassandra?
How to use hbase coprocessor to implement groupby?
Question 3) How do joins happen internally in a column-oriented database, and is it advisable to use them?
Nice question.
1) In Cassandra, if you use cqlsh, the data looks much as it would in MySQL or another RDBMS.
Connected to Test Cluster at localhost:9160.
[cqlsh 3.1.7 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> create keyspace test with replication={'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE test ;
cqlsh:test> create table entry(key text PRIMARY KEY, name text, value int );
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key1', 'test',5);
cqlsh:test> INSERT INTO entry (key, name , value ) VALUES ( 'key2', 'test1',10);
cqlsh:test> select * from entry;
 key  | name  | value
------+-------+-------
 key1 |  test |     5
 key2 | test1 |    10
cqlsh:test>
Note: you can select rows by key, or by a condition on another column if you create a secondary index on it.
But in HBase the structure looks like this:
rowkey | column family | column | value
key1   | entry         | name   | test
key1   | entry         | value  | 5
key2   | entry         | name   | test1
key2   | entry         | value  | 10
Note: you can select each row by its key or by any column value; it is very easy.
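To make the answer to question 1 more concrete: in the HBase-style layout above, every cell carries its row key, and cells are kept sorted by row key, so all columns of one row sit next to each other and can be read with one short scan. Here is a toy Python model of that idea. It is not HBase client code, and get_row is a name invented for this sketch.

```python
from bisect import bisect_left

# Cells sorted by (rowkey, column family, column), mirroring the layout above.
CELLS = sorted(
    [
        (("key1", "entry", "name"), "test"),
        (("key1", "entry", "value"), 5),
        (("key2", "entry", "name"), "test1"),
        (("key2", "entry", "value"), 10),
    ],
    key=lambda cell: cell[0],
)

def get_row(cells, rowkey):
    """Return {column: value} for one row key using a single range scan."""
    keys = [key for key, _ in cells]
    # Binary search for the first cell whose key starts at or after this row key.
    start = bisect_left(keys, (rowkey,))
    row = {}
    for key, value in cells[start:]:
        if key[0] != rowkey:
            break  # cells are sorted, so the row's cells have ended
        row[key[2]] = value
    return row

print(get_row(CELLS, "key1"))
```

So neither guess A nor a separate index per column is needed in this model: the sort order on the row key is what makes fetching a whole row cheap.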
2) Yes, NoSQL stores also support batch operations, but only for DML.
3) Joins are not supported in any of these NoSQL datastores. They are not meant for joins.
Hope this helps.

Resources