How do I delete all data in a Cassandra column family? - cassandra

I'm looking for a way to delete all of the rows from a given column family in Cassandra.
This is the equivalent of TRUNCATE TABLE in SQL.

You can use the truncate Thrift call, or the TRUNCATE <table> command in CQL.
http://www.datastax.com/docs/1.0/references/cql/TRUNCATE
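If you go the Thrift route, a minimal Java sketch with the raw Thrift client might look like the following (my_keyspace and my_column_family are placeholders; it assumes the default Thrift port 9160 and a framed transport):
import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ThriftTruncate {
    public static void main(String[] args) throws Exception {
        // Thrift listens on port 9160 by default and expects a framed transport
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));

        // truncate() operates on a column family in the current keyspace
        client.set_keyspace("my_keyspace");
        client.truncate("my_column_family");

        transport.close();
    }
}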

You can also do this via Cassandra CQL.
$ cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh> TRUNCATE my_keyspace.my_column_family;

It's very simple in Astyanax; it's just a single-line statement:
/* keyspace variable is Keyspace Type */
keyspace.truncateColumnFamily(ColumnFamilyName);

If you are using Hector it is easy as well:
cluster.truncate("your keyspace name here", "your column family name here");

If you are using cqlsh, then you can do it in either of two ways:
use keyspace; and then truncate column_family;
truncate keyspace.column_family;
If you want to use the DataStax Java driver, you can look at
http://www.datastax.com/drivers/java/1.0/com/datastax/driver/core/querybuilder/QueryBuilder.html
or
http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/querybuilder/Truncate.html
depending on your version.
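For example, a rough sketch against the 2.0+ driver's query builder (the contact point, keyspace, and table names below are placeholders):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import com.datastax.driver.core.querybuilder.Truncate;

public class TruncateWithQueryBuilder {
    public static void main(String[] args) {
        // Adjust the contact point for your cluster
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Builds and runs "TRUNCATE my_keyspace.my_column_family"
        Truncate truncate = QueryBuilder.truncate("my_keyspace", "my_column_family");
        session.execute(truncate);

        cluster.close();
    }
}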

If you are working on a cluster setup, truncate can only be used when all the nodes of the cluster are UP.
By using truncate we will lose the data (and we are not always sure how important that data is), so a very safe way, as well as a trick to delete the data, is to use the COPY command:
1) back up the data using the cqlsh COPY command
COPY tablename TO 'path';
2) duplicate the file using the Linux cp command
cp 'src path' 'dst path'
3) edit the duplicate file in the dst path and delete all lines except the first line, then save the file.
4) use the cqlsh COPY command to import it back
COPY tablename FROM 'dst path';

Related

Hbase | Hbase col qualifier hidden using Hbase shell cmds but visible via hbaserdd spark code

I am stuck in a very odd situation related to HBase design, I would say.
HBase version >> Version 2.1.0-cdh6.2.1
So, the problem statement is: in HBase, we have a row in our table.
We perform a new insert and then subsequent updates of the same HBase row, as we receive the data from downstream.
Say we received data like below:
INSERT of {a=1,b=1,c=1,d=1,rowkey='row1'}
UPDATE of {b=1,c=1,d=1,rowkey='row1'}
and say the final row is like this in our HBase table:
hbase(main):008:0> get 'test', 'row1'
COLUMN CELL
cf:b timestamp=1288380727188, value=value1
cf:c timestamp=1288380727188, value=value1
cf:d timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds
So the cf:a column qualifier is missing in the above data when fetched via scan or get commands, but as per our ingestion flow/process it should have been there. We are triaging where it went or what happened, but the analysis is still in progress and we are fairly clueless as to where it is.
Now, to cut a long story short, we have a Spark utility to read the HBase table into an RDD via the hbasecontext.hbaseRdd API function, convert it into a dataframe, and display the tabular data. We ran this Spark utility on the same table to help locate this row and, very surprisingly, it returned 2 rows for this same rowkey 'row1': the 1st row was the same as the get/scan row above, and the 2nd row had our missing column cf:a (surprisingly, it had the exact value that was expected). Say the output dataframe appeared something like below.
rowkey |cf:a |cf:b|cf:c|cf:d
row1 |null | 1 | 1 | 1 >> cf:a col qualifier missing (same as in Hbase shell)
row1 | 1 | 1 | 1 | 1 >> This cf:a was expected
We checked our HBase table schema as well; we don't have multiple versions of cf:a in the describe, and we don't do versioning on the table. The schema from the HBase table describe has
VERSIONS => '1'
Anyway, I am clueless as to how hbaseRdd is able to read that row with the missing column qualifier, while the HBase shell commands via get and scan do not read the missing column qualifier or row.
Any HBase experts or suggestions, please.
FYI, I tried the HBase shell commands as well via get with VERSIONS on the row, but it only returns the above get data and not the missing cf:a.
Is the column qualifier cf:a marked for deletion or something like that, which the HBase shell commands don't show?
Any help would be appreciated.
Thanks!!
This is a strange problem, which I suspect has to do with puts with the same rowkey having different column qualifiers at different times. However, I just tried to recreate this behaviour and I don't seem to be getting this problem. But I have a regular HBase 2.x build, as opposed to yours.
One option I would recommend to explore the problem more closely is to inspect the HFiles physically, outside of hbase shell. You can use the HBase HFile utility to print the physical key-value content at the HFile level. Obviously try to do this on a small HFile! Don't forget to flush and major-compact your table before you do it though, because HBase stores all updates in memory while it can.
You can launch the utility as below, and it will print all key-values sequentially:
hbase hfile -f hdfs://HDFS-NAMENODE:9000/hbase/data/default/test/29cfaecf083bff2f8aa2289c6a078678/f/09f569670678405a9262c8dfa7af8924 -p --printkv
In the above command, HDFS-NAMENODE is your HDFS server, default is your namespace (assuming you have none), test is your table name, and f is the column family name. You can find out the exact path to your HFiles by using the HDFS browse command recursively:
hdfs dfs -ls /hbase/data
[Updated] We worked with Cloudera and found the issue was due to the HBase region servers getting overlapped. Cloudera fixed it for us. I don't have the full details of how they did it.

updating all Cassandra tables starting with a specific name

I am trying to alter my Cassandra tables starting with a specific name.
My tables start with sample_1, sample_2, sample_13567, sample_adgf and so on...
The table names are random but start with the same prefix.
I want to add a new column to all these tables.
Can someone suggest the update query using a regex for table names?
If you are using Linux you can do this in two steps:
First, generate all the ALTER commands into a file like below:
for i in {1..13567}; do echo "ALTER TABLE sample_$i ADD test text;"; done > alter.cql
The above command will create ALTER commands to add a test text column to tables sample_1 through sample_13567 and store them in the file alter.cql.
Now you can just load the CQL file into cqlsh like below:
cqlsh 127.0.0.1 -u cassandra -p cassandra -k ashraful_test -f alter.cql
Here
-u username
-p password
-k keyspace_name
-f file name to load
By the way, having too many tables is not a good idea.
Check this link https://stackoverflow.com/a/33389204/2320144

How can I describe a table in a Cassandra database?

$describe = new Cassandra\SimpleStatement(<<<EOD
describe keyspace.tablename
EOD
);
$session->execute($describe);
I used the above code but it is not working.
How can I fetch field names and their data types from a Cassandra table?
Refer to CQL documentation. Describe expects a table/schema/keyspace.
describe table keyspace.tablename
It's also a cqlsh command, not an actual CQL command. To get this information, query the system tables. Try
select * from system.schema_columns;
- or for more recent versions -
select * from system_schema.columns ;
If using the PHP driver, you may want to check out http://datastax.github.io/php-driver/features/#schema-metadata
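For illustration, here is a rough sketch of the same system-table lookup from application code using the Java driver (my_keyspace and my_table are placeholders; works against the 3.x system_schema tables, and the PHP driver exposes equivalent schema metadata through the link above):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DescribeColumns {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // On Cassandra 2.x, query system.schema_columns (columnfamily_name / validator) instead
        ResultSet rs = session.execute(
            "SELECT column_name, type FROM system_schema.columns " +
            "WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table'");

        for (Row row : rs) {
            System.out.println(row.getString("column_name") + " : " + row.getString("type"));
        }
        cluster.close();
    }
}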
Try desc table keyspace.tablename;

CQL3.2: DROP TABLE with certain prefix?

I have a Cassandra 2.1.8 database with a bunch of tables, all in the form of either "prefix1_tablename" or "prefix2_tablename".
I want to DROP every table that begins with prefix1_ and leave anything else alone.
I know I can grab table names using the query:
SELECT columnfamily_name FROM system.schema_columnfamilies
WHERE keyspace_name='mykeyspace'
And I thought about filtering the results somehow to get only prefix1_ tables, putting them into a table with DROP TABLE prepended to each one, then executing all the statements in my new table. It was similar thinking to strategies I've seen for people solving the same problem with MySQL or Oracle.
With CQL3.2 though, I don't have access to User-Defined Functions (at least according to the docs I've read...) and I don't know how to do something like execute statements off of a table query result, as well as even how to filter out prefix1_ tables with no LIKE operator in Cassandra.
Is there a way to accomplish this?
I came up with a Bash shell script to solve my own issue. Once I realized that I could export the column families table to a CSV file, it made more sense to me to perform the filtering and text manipulation with grep and awk as opposed to finding a 'pure' cqlsh method.
The script I used:
#!/bin/bash
# No need for a USE command by making delimiter a period
cqlsh -e "COPY system.schema_columnfamilies (keyspace_name, columnfamily_name)
TO 'alltables.csv' WITH DELIMITER = '.';"
cat alltables.csv | grep -e '^mykeyspace.prefix1_' \
| awk '{print "DROP TABLE " $0 ";"}' >> remove_prefix1.cql
cqlsh -f 'remove_prefix1.cql'
rm alltables.csv remove_prefix1.cql

RPC timeout error while exporting data from CQL

I am trying to export data from Cassandra using the CQL client. A column family has about 100000 rows in it. When I am copying data into a CSV file using the COPY TO command I get the following rpc_timeout error.
copy mycolfamily to '/root/mycolfamily.csv'
Request did not complete within rpc_timeout.
I am running:
[cqlsh 3.1.6 | Cassandra 1.2.8 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
How can I increase the RPC timeout limit?
I tried adding rpc_timeout_in_ms: 20000 (default is 10000) in my conf/cassandra.yaml file, but while restarting Cassandra I get:
[root#user ~]# null; Can't construct a java object for tag:yaml.org,2002:org.apache.cassandra.config.Config; exception=Cannot create property=rpc_timeout_in_ms for JavaBean=org.apache.cassandra.config.Config#71bfc4fc; Unable to find property 'rpc_timeout_in_ms' on class: org.apache.cassandra.config.Config
Invalid yaml; unable to start server. See log for stacktrace.
The COPY command currently does the same thing as SELECT with LIMIT 99999999. So it will eventually time out as your data grows. Here's the export function:
https://github.com/apache/cassandra/blob/trunk/bin/cqlsh#L1524
I'm doing the same export on production. What I'm doing is the following (a rough sketch of this loop follows the steps):
make a select * from table where timeuuid = someTimeuuid limit 10000
write the result set to a CSV file in >> (append) mode
make the next selects with respect to the last timeuuid
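A rough Java-driver sketch of that loop, assuming a hypothetical table with partition key id, a timeuuid clustering column ts, and a text column payload (all of these names are placeholders, not from the original answer):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.List;
import java.util.UUID;

public class ChunkedExport {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        UUID lastSeen = null;      // timeuuid of the last exported row
        boolean more = true;

        // FileWriter in append mode plays the role of the ">>" redirection above
        try (PrintWriter out = new PrintWriter(new FileWriter("mycolfamily.csv", true))) {
            while (more) {
                // Hypothetical schema: id (partition key), ts (timeuuid clustering column), payload (text)
                String cql = (lastSeen == null)
                    ? "SELECT id, ts, payload FROM mycolfamily WHERE id = 'somePartition' LIMIT 10000"
                    : "SELECT id, ts, payload FROM mycolfamily WHERE id = 'somePartition' AND ts > "
                      + lastSeen + " LIMIT 10000";
                List<Row> rows = session.execute(cql).all();
                for (Row r : rows) {
                    out.println(r.getString("id") + "," + r.getUUID("ts") + "," + r.getString("payload"));
                    lastSeen = r.getUUID("ts");
                }
                more = (rows.size() == 10000);   // stop once a short page comes back
            }
        }
        cluster.close();
    }
}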
You can pipe a command into cqlsh with the following shell command:
echo "{$cql}" | /usr/bin/cqlsh -u user -p password localhost 9160 > file.csv
You can use automatic paging by specifying the fetch size in the DataStax Java driver:
Statement stmt = new SimpleStatement("SELECT id FROM mycolfamily");
stmt.setFetchSize(500);
ResultSet result = session.execute(stmt);
for (Row r : result) {
    // iterating the ResultSet fetches further pages transparently; write each row to the file here
}
I encountered the same problem a few minutes ago, and then I found CAPTURE and it worked:
First start capturing in cqlsh and then run your query with some limit of your choice.
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/capture_r.html
The best way to export the data is to use the nodetool snapshot option. This returns immediately and can be restored later on. The only issue is that this export is per node, so it has to be done on every node of the cluster.
Example:
nodetool -h localhost -p 7199 snapshot
See reference:
http://docs.datastax.com/en/archived/cassandra/1.1/docs/backup_restore.html#taking-a-snapshot
