I am new to Cassandra. I created the table below with a compound primary key.
Create table query:
create table DB.EMP(
    Name text,
    age int,
    id int,
    loc text,
    salary double,
    PRIMARY KEY (id, salary)
);
I loaded the table with the data below using this command:
Command: copy emp from '/home/data' with delimiter=',';
Data (/home/data):
"Sdd,25,123,Chennai,28000"
I am getting this error:
Using 1 child processes
Starting copy of pmm.emp with columns ['id', 'salary', 'age', 'loc', 'name'].
Failed to import 1 rows: ParseError - invalid literal for int() with base 10: 'Sdd' - given up without retries
Failed to process 1 rows; failed rows written to import_db_emp.err
Processed: 0 rows; Rate: 0 rows/s; Avg. rate: 0 rows/s
0 rows imported from 1 files in 0.170 seconds (0 skipped).
Please suggest how I can load the data.
Is there any way I can disable the alphabetical column ordering, except for the primary key columns?
Is there any way I can disable the alphabetical column ordering, except for the primary key columns?
No. Cassandra stores the column names that way to ensure proper on-disk order.
An easy solution would be to specify your column order in your COPY command:
aploetz@cqlsh:stackoverflow> COPY emp (name,age,id,loc,salary)
FROM '/home/aploetz/data.txt' WITH DELIMITER=',';
Reading options from the command line: {'delimiter': ','}
Using 1 child processes
Starting copy of stackoverflow.emp with columns [name, age, id, loc, salary].
Processed: 1 rows; Rate: 0 rows/s; Avg. rate: 1 rows/s
1 rows imported from 1 files in 1.919 seconds (0 skipped).
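An alternative not covered in the answer above: if you would rather script the load, the DataStax Python driver lets you spell out the column order in the INSERT itself. This is only a rough sketch, assuming the driver is installed (pip install cassandra-driver), a locally reachable node, and the keyspace and file path from the question, with one row of five comma-separated values per line:
import csv
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])   # assumption: single local node
session = cluster.connect("db")    # keyspace from the CREATE TABLE statement

# The column order is stated explicitly here, so the CSV only has to match this list.
insert = session.prepare(
    "INSERT INTO emp (name, age, id, loc, salary) VALUES (?, ?, ?, ?, ?)"
)

with open("/home/data", newline="") as f:
    for name, age, emp_id, loc, salary in csv.reader(f):
        session.execute(insert, (name, int(age), int(emp_id), loc, float(salary)))

cluster.shutdown()
For a one-off load of a small file, though, the COPY command above is the simpler option.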
Related
I am trying to identify and insert only the delta records into the target Hive table from a PySpark program. I am using a left anti join on the ID columns, and it identifies the new records successfully. But I noticed that the total number of delta records is not the same as the difference between the table record count before and after the load.
delta_df = src_df.join(tgt_df, src_df.JOIN_HASH == tgt_df.JOIN_HASH, how="leftanti") \
    .select(src_df.columns).drop("JOIN_HASH")

delta_df.count()  # gives the correct delta count

delta_df.write.mode("append").format("hive") \
    .option("compression", "snappy").saveAsTable(hivetable)
But delta_df.count() is not the same as (count(*) from hivetable after writing the data) minus (count(*) from hivetable before writing the data); the difference always comes out higher than the delta count.
I have a unique timestamp column for each load in the source, and to my surprise, the count of records in the target for the current load (grouping by the unique timestamp) is less than the delta count.
I am not able to identify the issue here. Do I have to write df.write in some other way?
It was a problem with the line delimiter. When the table is created with spark.write, no line.delim is specified in the SERDEPROPERTIES, and column values containing * were getting split into multiple rows.
After adding the SERDEPROPERTIES below, the data is stored correctly.
'line.delim'='\n'
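For reference, one way to apply that property from PySpark; this sketch is not part of the original answer and assumes an active SparkSession named spark, an existing Hive table named hivetable, and the delta_df from the question:
# Mirrors the SERDEPROPERTIES entry from the answer above (hypothetical table name).
spark.sql(r"ALTER TABLE hivetable SET SERDEPROPERTIES ('line.delim' = '\n')")

# Subsequent appends then use the explicit line delimiter.
delta_df.write.mode("append").format("hive") \
    .option("compression", "snappy").saveAsTable("hivetable")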
I have a Cassandra column of type date with values in timestamp format, as shown below. How can I filter rows where this column's date is greater than today's date?
Example:
Type: date
Timestamp: 2021-06-29 11:53:52 +00:00
TTL: null
Value: 2021-03-16T00:00:00.000+0000
I was able to filter rows using columnname <= '2021-09-25', which gives ten rows, some of them with dates of Sep 23 and 24. When I filter using columnname < '2021-09-24', I get an error like the one below:
An error occurred on line 1 (use Ctrl-L to toggle line numbers):
Cassandra failure during read query at consistency ONE (1 responses were required but only 0 replica responded, 1 failed)
The CQL timestamp data type is encoded as the number of milliseconds since Unix epoch (Jan 1, 1970 00:00 GMT) so you need to be precise when you're working with timestamps.
Depending on where you're running the query, the filter could be translated in the local timezone. Let me illustrate with this example table:
CREATE TABLE community.tstamptbl (
    id int,
    tstamp timestamp,
    PRIMARY KEY (id, tstamp)
);
These 2 statements may appear similar but translate to 2 different entries:
INSERT INTO tstamptbl (id, tstamp) VALUES (5, '2021-08-09');
INSERT INTO tstamptbl (id, tstamp) VALUES (5, '2021-08-09 +0000');
The first statement creates an entry with a timestamp in my local timezone (Melbourne, Australia) while the second statement creates an entry with a timestamp in UTC (+0000):
cqlsh:community> SELECT * FROM tstamptbl WHERE id = 5;
id | tstamp
----+---------------------------------
5 | 2021-08-08 14:00:00.000000+0000
5 | 2021-08-09 00:00:00.000000+0000
Similarly, you need to be precise when reading the data. You need to specify the timezone to remove ambiguity. Here are some examples:
SELECT * FROM tstamptbl WHERE id = 5 AND tstamp < '2021-08-09 +0000';
SELECT * FROM tstamptbl WHERE id = 1 AND tstamp < '2021-08-10 12:00+0000';
SELECT * FROM tstamptbl WHERE id = 1 AND tstamp < '2021-08-10 12:34:56+0000';
In the second part of your question, the error isn't directly related to your filter. The problem is that the replica(s) failed to respond for whatever reason (e.g. unresponsive/overloaded, down, etc). You need to investigate that issue separately. Cheers!
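As an aside (not from the original answer), the same precision matters when querying from application code. With the DataStax Python driver, passing a timezone-aware datetime removes the ambiguity; a minimal sketch against the example table above, assuming a locally reachable cluster:
from datetime import datetime, timezone
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])        # assumption: single local node
session = cluster.connect("community")  # keyspace used in the example table

# A timezone-aware datetime is sent as an unambiguous point in time (UTC here).
cutoff = datetime(2021, 8, 9, tzinfo=timezone.utc)

rows = session.execute(
    "SELECT * FROM tstamptbl WHERE id = %s AND tstamp < %s",
    (5, cutoff),
)
for row in rows:
    print(row.id, row.tstamp)

cluster.shutdown()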
Given a table
CREATE TABLE sensors_by_id (
    id uuid,
    time timeuuid,
    some_text text,
    PRIMARY KEY (id, time)
);
Will this scale when there are a lot of entries? I'm not sure if a UUID field is sufficient as a good partition key, or whether there is a need to create some artificial key like week_first_day or something similar.
It really depends on how you will insert your data. If you generate the UUID randomly for every insert, the chance of duplicates is very low, and you'll get so-called "skinny rows" (a lot of partitions with one row each). Even if you do start to get duplicates, there will not be many rows per partition...
It could be a problem with partition size, because Cassandra has practical limits on how much data a single partition should hold.
A good rule of thumb is to keep the number of rows per partition below 100,000 and the partition size on disk under 100 MB.
It is easy to calculate the partition size using the standard data-modeling formula.
You can read more about data modeling here.
So in your case, with the current schema, 1,000,000 rows per partition, and an average size of 100 bytes for the some_text column, it will be:
Number of Values: (1000000 * (3 - 2 - 0) + 0) = 1000000
Partition Size on Disk: (16 + 0 + (1000000 * 116) + (8 * 1000000))
= 124000016 bytes (118.26 MB)
So as you can see, at 118.26 MB per partition you are over the limit, and you need to optimize your partition key.
I calculated it using my open-source project, cql-calculator.
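To make the arithmetic above easy to re-run (this sketch is mine, not part of the original answer), here is the same estimate in Python, using the usual partition-size formula and the sizes assumed above (16 bytes for uuid/timeuuid values, ~100 bytes for some_text, 8 bytes of per-value overhead):
# Rough partition-size estimate for sensors_by_id.
rows_per_partition = 1_000_000
n_columns = 3        # id, time, some_text
n_primary_key = 2    # id (partition key) + time (clustering column)
n_static = 0

# Number of values: Nv = Nr * (Nc - Npk - Ns) + Ns
n_values = rows_per_partition * (n_columns - n_primary_key - n_static) + n_static

partition_key_bytes = 16    # id: uuid
static_bytes = 0
row_bytes = 16 + 100        # time (timeuuid) + some_text (~100 bytes average)

# Size on disk: partition key + statics + Nr * row size + 8 bytes per value
size_bytes = partition_key_bytes + static_bytes \
    + rows_per_partition * row_bytes + 8 * n_values

print(n_values)                            # 1000000
print(size_bytes, size_bytes / 1024 ** 2)  # 124000016 bytes, ~118.26 MB
This reproduces the 118.26 MB figure and shows why the partition key needs to be rethought at that row count.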
With the DDL and profile YAML below, I generate random data for my table using cassandra-stress. The results I get for the columns amount and status don't match my expectations: the random values seem to be drawn once per partition, not once for each row.
If, for example, cassandra-stress generates 5 rows with the same business_date (i.e. one partition), the amount and status values are repeated 5 times; the "next" random value only appears when the business_date changes. How can I get a new draw of amount and status for every row?
Sample output; notice the last two columns only change value once the first column changes.
2018-09-26,y~8.>6MZ,00000000-0004-0a3c-0000-000000040a3c,5.133114565746717E10,3PR|I{3B
2018-09-26,y~8.>6MZ,00000000-004c-4e7e-0000-0000004c4e7e,5.133114565746717E10,3PR|I{3B
2018-09-26,y~8.>6MZ,00000000-003d-b97f-0000-0000003db97f,5.133114565746717E10,3PR|I{3B
2018-09-26,y~8.>6MZ,00000000-004f-db3f-0000-0000004fdb3f,5.133114565746717E10,3PR|I{3B
2018-09-26,y~8.>6MZ,00000000-008c-f0ea-0000-0000008cf0ea,5.133114565746717E10,3PR|I{3B
2018-10-14,Y ?R| |u,00000000-002b-5707-0000-0000002b5707,6.698617679577381E10,,fkb[cU~N!
...
Table structure:
CREATE TABLE IF NOT EXISTS record (
    business_date date,
    region text,
    id uuid,
    status text,
    amount double,
    PRIMARY KEY (business_date, region, id)
);
Profile YAML:
keyspace: dev
table: record
columnspec:
  - name: business_date
    population: uniform(17800..17845)
  - name: region
    size: fixed(10)
    population: seq(10..16)
    cluster: fixed(7)
  - name: id
    size: fixed(32)
    population: seq(1..10M)
    cluster: fixed(5)
  - name: status
    size: fixed(10)
    population: uniform(1000..1010)
  - name: amount
    population: uniform(500000..10M)
insert:
  partitions: fixed(1)
  select: fixed(1)/35
queries:
  selectall:
    cql: select * from record where business_date = ? and region = ?
    fields: samerow
CQL Execution [returns instantly, assuming it uses the clustering key index]:
cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';
count
-------
5447
Presto Execution [takes around 8 seconds]:
presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
c
------
5447
(1 row)
Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]
Why does Presto process 147K rows when Cassandra itself responds with just 5447 rows for the same query [I tried select * too]?
Why is Presto not able to use the clustering key optimization?
I tried all possible values (timestamp, date, different date formats) and saw no effect on the number of rows being fetched.
CF Reference:
CREATE TABLE events (
    month text,
    day timestamp,
    test_data text,
    some_random_column text,
    event_time timestamp,
    PRIMARY KEY (month, day, event_time)
) WITH comment = 'Test Data'
    AND read_repair_chance = 1.0;
Update: I added event_time as a constraint too, in response to Dain's answer:
presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
_col0
-------
1
(1 row)
Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]
The Presto engine will push down simple WHERE clauses like this to a connector (you can see this in the Hive connector), so the question is: why does the Cassandra connector not take advantage of this? To see why, we'll have to look at the code.
The pushdown system first interacts with connectors in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method, so looking at the CassandraSplitManager, I see it is delegating the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) for every column in the primary key, so in your case, you would need to add a constraint on event_time.
I don't know why the code insists on having a constraint on every column in the primary key, but I'd guess that it is a bug. It should be easy to tweak this code to remove that constraint.