Cassandra-stress does not generate random values for every row

With the DDL and profile YAML below, I generate random data for my table using cassandra-stress. The results I get for the columns amount and status don't match my expectation: the random values seem to be drawn once per partition, not once per row.
If, for example, cassandra-stress generates 5 rows with the same business_date (i.e. one partition), the amount and status values are repeated 5 times; the next random draw only happens when the business_date changes. How can I get a new draw of amount and status for every row?
Sample output; notice that the last two columns only change value when the first column changes.
2018-09-26,y~8.>6MZ,00000000-0004-0a3c-0000-000000040a3c,5.133114565746717E10,3PR|I{3B
2018-09-26,y~8.>6MZ,00000000-004c-4e7e-0000-0000004c4e7e,5.133114565746717E10,3PR|I{3B
2018-09-26,y~8.>6MZ,00000000-003d-b97f-0000-0000003db97f,5.133114565746717E10,3PR|I{3B
2018-09-26,y~8.>6MZ,00000000-004f-db3f-0000-0000004fdb3f,5.133114565746717E10,3PR|I{3B
2018-09-26,y~8.>6MZ,00000000-008c-f0ea-0000-0000008cf0ea,5.133114565746717E10,3PR|I{3B
2018-10-14,Y ?R| |u,00000000-002b-5707-0000-0000002b5707,6.698617679577381E10,fkb[cU~N!
...
Table structure:
CREATE TABLE IF NOT EXISTS record (
    business_date date,
    region text,
    id uuid,
    status text,
    amount double,
    PRIMARY KEY (business_date, region, id)
);
Profile YAML:
keyspace: dev
table: record
columnspec:
  - name: business_date
    population: uniform(17800..17845)
  - name: region
    size: fixed(10)
    population: seq(10..16)
    cluster: fixed(7)
  - name: id
    size: fixed(32)
    population: seq(1..10M)
    cluster: fixed(5)
  - name: status
    size: fixed(10)
    population: uniform(1000..1010)
  - name: amount
    population: uniform(500000..10M)
insert:
  partitions: fixed(1)
  select: fixed(1)/35
queries:
  selectall:
    cql: select * from record where business_date = ? and region = ?
    fields: samerow
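For reference, a profile like this is normally driven with cassandra-stress in user mode; a minimal invocation sketch, assuming the YAML is saved as record.yaml (the host and row count below are placeholders, not from the original post):

# hypothetical run: 100k inserts against a local node using the profile above
cassandra-stress user profile=./record.yaml "ops(insert=1)" n=100000 -node 127.0.0.1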

Related

How to update fields in cassandra frozen UDT column?

I am aware that updating individual fields in a frozen UDT column is not possible and that the entire value needs to be updated. Does that imply an update of a frozen UDT column is not possible at all? And if a field of a frozen UDT column does need to change, does one have to insert a new record and delete the older one?
You are correct that you cannot update individual fields of a frozen UDT column, but you can update the whole column value. You do not need to delete the previous record; it's fine to update the column directly. Let me illustrate with an example I created on Astra.
Here is a user-defined type that stores a user's address:
CREATE TYPE address (
    number int,
    street text,
    city text,
    zip int
)
and here is the definition for the table of users:
CREATE TABLE users (
    name text PRIMARY KEY,
    address frozen<address>
)
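As a quick aside (not part of the original answer), the row shown below can be created with a UDT literal, for example:

INSERT INTO users (name, address)
VALUES ('alice', {number: 100, street: 'Main Rd', city: 'Melbourne', zip: 3000});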
In this table, there is one user with their address stored as:
cqlsh> SELECT * FROM users ;
name | address
-------+----------------------------------------------------------------
alice | {number: 100, street: 'Main Rd', city: 'Melbourne', zip: 3000}
Let's say that the street number is incorrect. If we try to update just the street number field with:
cqlsh> UPDATE users SET address = {number: 456} WHERE name = 'alice';
We'll end up with an address that only has the street number and nothing else:
cqlsh> SELECT * FROM users ;
name | address
-------+----------------------------------------------------
alice | {number: 456, street: null, city: null, zip: null}
This is because the whole value (not just the street number field) got overwritten by the update. The correct way to update the street number is to explicitly set a value for all the fields of the address with:
cqlsh> UPDATE users SET address = {number: 456, street: 'Main Rd', city: 'Melbourne', zip: 3000} WHERE name = 'alice';
so we end up with:
cqlsh> SELECT * FROM users ;
name | address
-------+----------------------------------------------------------------
alice | {number: 456, street: 'Main Rd', city: 'Melbourne', zip: 3000}
Cheers!
You can update a column that is a frozen UDT, but you'll need to supply values for all the fields inside that UDT. So you can just do a normal update of that column only:
UPDATE table SET udt_col = new_value WHERE pk = ....
without the need to delete anything first, etc.
Basically, a frozen value is just a blob obtained by serializing the UDT or collection; it is stored as a single cell inside the row and has a single timestamp. That's different from a non-frozen value, where the individual pieces of the UDT/collection can be stored in different places and have different timestamps.
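You can observe the single-timestamp behaviour yourself; a small sketch using the users table from the answer above (WRITETIME on a frozen column returns one timestamp for the whole cell):

SELECT name, WRITETIME(address) FROM users WHERE name = 'alice';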

How to model for word search in cassandra

My model is designed to save word searches from a checkbox, and it must support updating the word search and its status, plus (soft) deletes. My old model uses a uuid (the id of the word search) as the partition key and a secondary index on status (enable, disable, deleted),
but I don't want an index on the status column (I think it's very bad to index a frequently updated column), and I can't change the database.
Is there a better way to model this?
You should not create an index on a very-low-cardinality column like status:
Avoid very low cardinality index e.g. index where the number of distinct values is very low. A good example is an index on the gender of an user. On each node, the whole user population will be distributed on only 2 different partitions for the index: MALE & FEMALE. If the number of users per node is very dense (e.g. millions) we’ll have very wide partitions for MALE & FEMALE index, which is bad
Source : https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
The best ways to handle this type of case:
Create a separate table for each status,
or use status together with a known parameter (year, month, etc.) as the partition key.
Example of the 2nd option:
CREATE TABLE save_search (
    year int,
    status int,
    uuid uuid,
    category text,
    word_search text,
    PRIMARY KEY ((year, status), uuid)
);
Here you can see that I have made a composite partition key of year and status because of the low-cardinality issue. If you expect a huge amount of data for a single status, you should also add month to the composite partition key.
If your dataset is small, you can just remove the year field:
CREATE TABLE save_search (
    status int,
    uuid uuid,
    category text,
    word_search text,
    PRIMARY KEY (status, uuid)
);
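Note that with status in the partition key, changing a record's status means moving the row to a different partition. A sketch of such a move as a logged batch (the uuid and values are illustrative only):

BEGIN BATCH
    DELETE FROM save_search WHERE status = 0 AND uuid = 5132b130-ae79-11e4-ab27-0800200c9a66;
    INSERT INTO save_search (status, uuid, category, word_search)
    VALUES (1, 5132b130-ae79-11e4-ab27-0800200c9a66, 'news', 'cassandra');
APPLY BATCH;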
Or, if you are using Cassandra version 3.x or above, you can use a materialized view:
CREATE MATERIALIZED VIEW search_by_status AS
    SELECT *
    FROM your_main_table
    WHERE uuid IS NOT NULL AND status IS NOT NULL
    PRIMARY KEY (status, uuid);
You can then query by status, for example:
SELECT * FROM search_by_status WHERE status = 0;
All the deletes, updates and inserts you make on your main table will be synced by Cassandra to the materialized view.

Cassandra data loading sequence for columns disabling

I am new to Cassandra. I created the Cassandra table below with primary keys.
Create table Query:
create table DB.EMP(
    Name text,
    age int,
    id int,
    loc text,
    salary double,
    PRIMARY KEY (id, salary)
);
I loaded above table with below data using command:
Command:
copy emp from '/home/data' with delimiter=',';
Data (/home/data):
"Sdd,25,123,Chennai,28000"
I am getting this error:
Using 1 child processes
Starting copy of pmm.emp with columns ['id', 'salary', 'age', 'loc', 'name'].
Failed to import 1 rows: ParseError - invalid literal for int() with base 10: 'Sdd' - given up without retries
Failed to process 1 rows; failed rows written to import_db_emp.err
Processed: 0 rows; Rate: 0 rows/s; Avg. rate: 0 rows/s
0 rows imported from 1 files in 0.170 seconds (0 skipped).
Please suggest how I can load the data.
Is there any way I can disable the alphabetical ordering of columns, except for the primary keys?
Is there any way I can disable the alphabetical ordering of columns, except for the primary keys?
No. Cassandra stores the column names that way to ensure proper on-disk order.
An easy solution would be to specify your column order in your COPY command:
aploetz@cqlsh:stackoverflow> COPY emp (name,age,id,loc,salary)
FROM '/home/aploetz/data.txt' WITH DELIMITER=',';
Reading options from the command line: {'delimiter': ','}
Using 1 child processes
Starting copy of stackoverflow.emp with columns [name, age, id, loc, salary].
Processed: 1 rows; Rate: 0 rows/s; Avg. rate: 1 rows/s
1 rows imported from 1 files in 1.919 seconds (0 skipped).

How to define keyspaces for a timeseries data in Cassandra?

There are 100s of data points; each data point has its own separate table, with the schema and queries as mentioned below:
Current schema in SQLite
Table name: the name of the data point, e.g. Temperature
Column 1: Name: Timestamp, Type: TEXT (yyyy-MM-dd HH:mm:ss.ttt format), PRIMARY KEY
Column 2: Name: Value, Type: FLOAT
Column 3: Name: Quality, Type: TEXT ("GOOD", "BAD")
Queries for SQLite
SELECT * FROM data-point-name;
SELECT * FROM data-point-name WHERE Timestamp BETWEEN timestamp-1 AND timestamp-2;
INSERT INTO data-point-name (Timestamp, Value, Quality) VALUES ("2016-01-01 00:00:05.254", 123.25454, "GOOD"); (this is an example)
Currently I have an SQLite db with a table per data point using the above schema; essentially I have 100s of tables. This way, reads/writes on one data point do not disturb queries running on different data points.
How to translate this schema to be used in Cassandra?
In your case, you can store all your data points in a single table (note that CQL's varchar/text type takes no length parameter):
CREATE TABLE datapoints (
    datatype text,
    time timestamp,
    value float,
    quality text,
    PRIMARY KEY (datatype, time)
);
With this structure, you can run queries like:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name';
SELECT *
FROM datapoints
WHERE datatype = 'data-point-name'
AND time >= '2016-01-01 00:00:00'
AND time <= '2016-01-02 00:00:00';
But with this structure, Cassandra will partition the data by data point name;
if a data point has many measurements, its partition will be huge and you can run into query performance issues.
You can also refine the partitioning by decomposing the time:
CREATE TABLE datapoints (
    datatype text,
    year int,
    month int,
    day int,
    milisecondsinday int,
    value float,
    quality text,
    PRIMARY KEY ((datatype, year, month, day), milisecondsinday)
) WITH CLUSTERING ORDER BY (milisecondsinday ASC);
In this case, the structure allows Cassandra to store the data in much smaller partitions than the first example, and it's more powerful if you query your data by day:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1;
To get all points for 'data-point-type' on 2016-01-01 between 00:00 and 01:00:
SELECT *
FROM datapoints
WHERE datatype = 'data-point-type'
AND year = 2016
AND month = 1
AND day = 1
AND milisecondsinday >= 0
AND milisecondsinday <= 3600000;
Of course, you can partition by day (as in the example) or at other time scales (hours, minutes, seconds or milliseconds). If you can, keep partitions small; that is good for performance.
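To make the write path concrete, here is what an insert into the bucketed table could look like; the client computes year, month, day and the milliseconds-in-day offset from the timestamp (this sketch reuses the example values from the question, where 2016-01-01 00:00:05.254 is 5254 ms after midnight):

INSERT INTO datapoints (datatype, year, month, day, milisecondsinday, value, quality)
VALUES ('Temperature', 2016, 1, 1, 5254, 123.25454, 'GOOD');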
Hope this can help you.

Cassandra: Is there a limit to amount of data that a collection column can hold?

In the table below, what is the maximum size the phone_numbers column can accommodate?
Like normal columns, is it 2GB?
Or is it 64K*64K as mentioned here?
CREATE TABLE d2.employee (
    id int PRIMARY KEY,
    doj timestamp,
    name text,
    phone_numbers map<text, text>
)
Collection types in Cassandra are represented as a set of distinct cells in the internal data model: you will have a cell for each key of your phone_numbers column. Therefore they are not normal columns, but a set of columns. You can verify this by executing the following command in cassandra-cli (1001 stands for a valid employee id):
use d2;
get employee[1001];
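Note that cassandra-cli was removed in Cassandra 3.0; on modern versions you can inspect the individual cells on disk with sstabledump after a flush (the data file path below is illustrative):

nodetool flush d2 employee
sstabledump /var/lib/cassandra/data/d2/employee-<table-id>/nb-1-big-Data.db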
The correct answer is your point 2 (64K*64K).
