I am a NoSQL n00b, just trying things out. I have the following keyspace with a single table in Cassandra 2.0.2:
CREATE KEYSPACE PersonDB WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
};
USE PersonDB;
CREATE TABLE Persons (
id int,
lastname text,
firstname text,
PRIMARY KEY (id)
);
I have close to 500 entries in the Persons table. I want to select any random row from the table. Is there an efficient way to do it in CQL? I am using Groovy to invoke the APIs exposed by the DataStax driver.
If you want to get "any" row, you can just use LIMIT:
select * from persons LIMIT 1;
You would get the row whose partition key (id) hashes to the lowest token.
It will not be random; it depends on your partitioner, but you would get A row.
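If "any" row isn't enough and you want something closer to random, one common workaround is to pick a random token and read the first row at or after it, wrapping around if nothing is found. A rough sketch with the DataStax Java driver (the contact point is a placeholder; it assumes the default Murmur3Partitioner, and the selection is only approximately uniform, since rows owning larger token gaps get picked more often):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.concurrent.ThreadLocalRandom;

public class RandomPerson {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("persondb")) {
            // Murmur3 tokens are signed 64-bit values, so draw one at random.
            long randomToken = ThreadLocalRandom.current().nextLong();
            Row row = session.execute(
                    "SELECT id, firstname, lastname FROM persons WHERE token(id) >= ? LIMIT 1",
                    randomToken).one();
            if (row == null) {
                // Landed past the highest token; wrap around to the start of the ring.
                row = session.execute(
                        "SELECT id, firstname, lastname FROM persons WHERE token(id) >= ? LIMIT 1",
                        Long.MIN_VALUE).one();
            }
            System.out.println(row.getInt("id") + ": "
                    + row.getString("firstname") + " " + row.getString("lastname"));
        }
    }
}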
I am trying to update a few fields of each row of a big MySQL table (having close to 500 million rows). The table doesn't have any primary key (or has a string primary key like a UUID). I don't have enough executor memory to read and hold the entire data at once. Can anyone please let me know what my options are for processing such tables?
Below is the schema:
CREATE TABLE Persons (
Personid varchar(255) NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255) DEFAULT NULL,
Email varchar(255) DEFAULT NULL,
Age int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The Spark code looks like this:
SparkSession spark = SparkSession.builder().master("spark://localhost:7077").appName("KMASK").getOrCreate();
Dataset<Row> rawDataFrame = spark.read().format("jdbc").load();
rawDataFrame.createOrReplaceTempView("data");
//encrypt is UDF
String sql = "select Personid, LastName, FirstName, encrypt(Email), Age from data";
Dataset<Row> newData = spark.sql(sql);
newData.write().mode(SaveMode.Overwrite).format("jdbc").options(options).save();
This table has around 150 million records, and the size of the data is around 6 GB. My executor memory is just 2 GB. Can I process this table using Spark JDBC?
Ideally, you can alter the Spark JDBC fetchsize option to reduce/increase how many records are fetched and processed at a time.
Partitioning the data can also help to reduce shuffles and additional overhead. Since you have Age as a numerical field, you can process the data in partitions determined by Age. First determine the min and max age, then use the Spark JDBC options, notably:
partitionColumn : Age
lowerBound : min age you identified
upperBound : max age you identified
numPartitions : really dependent on the number of cores and worker nodes (see the Spark JDBC documentation for more guidance)
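For illustration, a rough sketch of how those options might be wired into the Java snippet above (the JDBC URL, credentials, and age bounds are placeholders, not values from the question):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch only: url, credentials, and the bounds below are placeholder values.
Dataset<Row> rawDataFrame = spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/mydb")
        .option("dbtable", "Persons")
        .option("user", "user")
        .option("password", "password")
        .option("fetchsize", "10000")        // rows fetched per round trip
        .option("partitionColumn", "Age")    // numeric column the read is split on
        .option("lowerBound", "1")           // min age you identified
        .option("upperBound", "100")         // max age you identified
        .option("numPartitions", "16")       // tune to your cores and executors
        .load();
rawDataFrame.createOrReplaceTempView("data");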
You may also use custom queries to select and update only a few records that can be held in memory, via the query option. NB: when using the query option, you should not use the dbtable option.
I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text,
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records that were inserted within a given time range? I know that internally Cassandra must be using some timestamp to track the insertion time of each record, which is used by features like TTL.
Since there is no explicit column for the insertion timestamp in the given schema, is there a way to use the implicit timestamp, or is there a better approach?
The records are never updated after insertion.
It's an interesting question...
All columns that aren't part of the primary key have a so-called WriteTime that can be retrieved using the writetime(column_name) function of CQL (warning: it doesn't work with collection columns, and it returns null for UDTs!). But because we don't have nested queries in CQL, you will need to write a program to fetch the data, filter out entries by WriteTime, and delete the entries whose WriteTime is older than your threshold. (Note that the writetime value is in microseconds, not milliseconds as in CQL's timestamp type.)
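For a quick sanity check of what writetime returns before deleting anything, here is a small sketch with the DataStax Java driver (session is assumed to be a Session already connected to the right keyspace):
import com.datastax.driver.core.Row;

for (Row row : session.execute(
        "SELECT documentId, sequenceNo, writetime(clientId) AS wt FROM TestTable LIMIT 10")) {
    // writetime values are microseconds since the epoch
    System.out.println(row.getString("documentId") + "/" + row.getLong("sequenceNo")
            + " written at " + (row.getLong("wt") / 1000L) + " ms");
}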
The easiest way is to use the Spark Cassandra Connector's RDD API, something like this:
import com.datastax.spark.connector._  // brings in cassandraTable, writeTime and deleteFromCassandra

// writetime values are in microseconds, so convert the cutoff accordingly
val timestamp = someDate.toInstant.getEpochSecond * 1000000L
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
.select("prk1", "prk2", "reg_col".writeTime as "writetime")
.filter(row => row.getLong("writetime") < timestamp)
oldData.deleteFromCassandra(srcKeyspace, srcTable,
keyColumns = SomeColumns("prk1", "prk2"))
where prk1, prk2, ... are all the components of the primary key (documentId and sequenceNo in your case), and reg_col is any of the "regular" columns of the table that isn't a collection or UDT (for example, clientId). It's important that the list of primary key columns in select and deleteFromCassandra is the same.
Hello, we have a table in Cassandra whose structure is as below:
CREATE TABLE dmp.user_profiles_6 (
vuid text PRIMARY KEY,
brand_model text,
first_seen timestamp,
last_seen timestamp,
total_day_count int,
total_usage_count int,
user_type text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.1
AND speculative_retry = '99PERCENTILE';
I read a few articles about data modeling in Cassandra from DataStax. They said that a primary key consists of a partition key and a clustering key.
Now, in the above case we have a vuid column which is an identifier for every unique user, and it is the primary key. We have 400M unique users. So does that mean that Cassandra is making 400M partitions? Surely this must degrade performance. In one DataStax article about data modeling, an example table shows the primary key on a uuid column which is unique and has very high cardinality. I am totally confused; can anyone help me identify which column can be set as the partition key and which as the clustering key?
The queries can be as below:
1. Select a record directly on the basis of vuid
2. Select vuids on the basis of a range of last_seen or first_seen
Select a record directly on the basis of vuid >>
Your table already does that; it has vuid as the primary key.
Select vuids on the basis of a range of last_seen or first_seen >>
There are two options here:
Either add last_seen or first_seen as clustering columns (you can do range selection on clustering columns only).
In this case you would need to provide vuid along with last_seen and first_seen in the query. I don't think you want that.
OR
Create another table which has the same data (yes, in C* we create another table for a different query with the same data and change the keys as per the query; welcome to data duplication). In this table you have to add a dummy column as the partition key and make last_seen and first_seen the clustering keys. You then pass these seen dates in the query to fetch vuids.
Hope this is clear.
You need to create 3 tables, as below.
table 1:-
CREATE TABLE dmp.user_profiles_ZZZZ (
Dummy_column uuid,
vuid text,
........other columns
PRIMARY KEY((Dummy_column, vuid))
) .....
table 2:-
CREATE TABLE dmp.user_profiles_YYYY (
Dummy_column uuid,
.......other columns
PRIMARY KEY((Dummy_column), first_seen)
) .....
table 3:-
CREATE TABLE dmp.user_profiles_XXXX (
Dummy_column uuid,
.....other columns
PRIMARY KEY((Dummy_column), last_seen)
) .....
In Cassandra (a query-driven model), tables are created to satisfy the query; this is different from relational database data modeling.
In Cassandra, the primary key consists of 2 types of keys:
1. Partition key -> defines the partitions
2. Clustering key -> defines the order within a partition
depending on the use case.
If the columns mentioned in the partition key and clustering key are not enough to provide uniqueness, then we need to add the primary key of the relationship to the primary key.
Apart from that, as a tip:
[Column name XX] = ? -> equality check, so add the column name to the partition key
[Column name yy] >= ? -> range check, so add the column name to the clustering key
Here in the question it is not mentioned which query should be served.
Please share the query; based on that, the table can be created.
Can anybody please help me understand why Cassandra is inserting null values in columns that were skipped? Isn't it supposed to skip the column? It should not insert any value (not even null) if I skip the column entirely while inserting data, right? I am a bit confused because, as per the following tutorial, data is stored by row key with the columns (see the diagram of a column family); if that is true then I should not get null for the column.
Or is the whole concept I learned about the Cassandra column family wrong?
http://www.tutorialspoint.com/cassandra/cassandra_data_model.htm
Here is the CQL script:
create keyspace test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
create table users (firstname text, lastname text, age int, gender ascii, primary key (firstname));
insert into users (firstname, age, gender, lastname) values ('Michael', 30, 'male', 'smith');
Here, I am skipping a column, but when I run a select query, it shows null for that column. Why is Cassandra filling in null for that column?
insert into users(firstname,age,gender) values('Jane',23,'female');
select * from users;
Why don't you go to the most comprehensive source of documentation and learning for Cassandra: http://academy.datastax.com ? And it's free. The content on tutorialspoint.com is very old and has not been updated in ages (SuperColumns have been deprecated since 2011 - 2012 ...).
Here, I am skipping a column, but when I run a select query, it shows null for that column. Why is Cassandra filling in null for that column?
In CQL, null == the value is not present, or the value has been deleted.
Since you did not insert any value for the column lastname, Cassandra will return null (== not present, in this case).
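To illustrate (a rough sketch assuming a DataStax Java Driver Session named session connected to the test keyspace): skipping a column writes nothing at all for that cell, whereas explicitly writing null actually issues a delete (a tombstone) for the cell, even though a SELECT displays both as null.
// Omitting lastname: nothing is written for that cell; SELECT simply shows null.
session.execute("INSERT INTO users (firstname, age, gender) VALUES ('Jane', 23, 'female')");

// Explicitly writing null: a tombstone is written for lastname; SELECT still shows null.
session.execute("INSERT INTO users (firstname, age, gender, lastname) VALUES ('John', 25, 'male', null)");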
We are trying to store lots of attributes for a particular profile_id inside a table (using CQL3) and cannot wrap our heads around which approach is best:
a. create table mytable (profile_id int, a1 int, a2 int, a3 int, a4 int ... a3000 int, primary key (profile_id));
OR
b. create MANY tables, e.g.
create table mytable_a1 (profile_id int, value int, primary key (profile_id));
create table mytable_a2 (profile_id int, value int, primary key (profile_id));
...
create table mytable_a3000 (profile_id int, value int, primary key (profile_id));
OR
c. create table mytable (profile_id int, a_all text, primary key (profile_id));
and just store 3000 "columns" inside a_all, like:
insert into mytable (profile_id, a_all) values (1, 'a1:1,a2:5,a3:55, .... a3000:5');
OR
d. none of the above
The type of query we would be running on this table:
select * from mytable where profile_id in (1,2,3,4,5423,44)
We tried the first approach, and the queries keep timing out and sometimes even kill Cassandra nodes.
The answer would be to use a clustering column. A clustering column allows you to create dynamic columns that you can use to hold the attribute name (col name) and its value (col value).
The table would be
create table mytable (
profile_id text,
attr_name text,
attr_value int,
PRIMARY KEY(profile_id, attr_name)
);
This allows you to do inserts like:
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a1', 3);
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'a2', 1031);
.....
insert into mytable (profile_id, attr_name, attr_value) values ('131', 'an', 2);
This would be the optimal solution.
Because you then want to do the following:
'The type of query we would be running on this table: select * from mytable where profile_id in (1,2,3,4,5423,44)'
This would require 6 queries under the hood, but Cassandra should be able to do this in no time, especially if you have a multi-node cluster.
Also, if you use the DataStax Java Driver, you can run these requests asynchronously and concurrently against your cluster.
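A rough sketch of that approach (assuming driver 2.x/3.x and an already-connected Session named session, with the mytable schema above; note profile_id is text in that schema, so the ids are bound as strings):
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Fire one asynchronous query per profile_id instead of a single large IN clause.
PreparedStatement ps = session.prepare(
        "SELECT attr_name, attr_value FROM mytable WHERE profile_id = ?");

List<ResultSetFuture> futures = new ArrayList<>();
for (String profileId : Arrays.asList("1", "2", "3", "4", "5423", "44")) {
    futures.add(session.executeAsync(ps.bind(profileId)));
}

// Collect the results as each query completes.
for (ResultSetFuture future : futures) {
    for (Row row : future.getUninterruptibly()) {
        System.out.println(row.getString("attr_name") + " = " + row.getInt("attr_value"));
    }
}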
For more on data modelling and the DataStax Java Driver, check out DataStax's free online training. It's worth a look:
http://www.datastax.com/what-we-offer/products-services/training/virtual-training
Hope it helps.