I read the Cassandra docs about "Good use of BATCH statement - single partition batch example".
I want to understand multi- vs. single-partition batches.
According to the docs, this is a single-partition batch:
CREATE TABLE cycling.cyclist_expenses (
cyclist_name text,
balance float STATIC,
expense_id int,
amount float,
description text,
paid boolean,
PRIMARY KEY (cyclist_name, expense_id)
);
BEGIN BATCH
INSERT INTO cycling.cyclist_expenses (cyclist_name, expense_id, amount, description, paid) VALUES ('Vera ADRIAN', 2, 13.44, 'Lunch', true);
INSERT INTO cycling.cyclist_expenses (cyclist_name, expense_id, amount, description, paid) VALUES ('Vera ADRIAN', 3, 25.00, 'Dinner', false);
...
APPLY BATCH;
To me, the first partition is 'Vera ADRIAN', 2 and the second partition is 'Vera ADRIAN', 3.
Could you please explain why this is a single-partition batch?
In other docs I found this example of a multi-partition batch:
CREATE TABLE shopping_chart (
  cart_id uuid,
  item_id uuid,
  price decimal,
  total decimal STATIC,
  PRIMARY KEY ((cart_id), item_id)
);
INSERT INTO shopping_chart (cart_id, item_id, price, total)
VALUES (ABC12345, ABCITEM12345, 0.01, 0.01);
BEGIN BATCH
INSERT INTO shopping_chart (cart_id, item_id, price) VALUES (ABC12345, ABCITEM123451, 1.00);
INSERT INTO shopping_chart (cart_id, item_id, price) VALUES (ABC12345, ABCITEM1234512, 2.00);
UPDATE ... cart_id = ABC12345 IF total = 0.01;
APPLY BATCH;
I can't understand why this is a multi-partition batch. Could you please explain? It only works with one partition, cart_id = ABC12345.
"First partition is 'Vera ADRIAN', 2; second partition is 'Vera ADRIAN', 3. Could you please explain why it is a single-partition batch?"
Sure. It's because expense_id is not part of the partition key: in PRIMARY KEY (cyclist_name, expense_id), only cyclist_name is the partition key, and expense_id is a clustering key. So 'Vera ADRIAN' is the same partition key value used in both INSERTs.
For the second part of your question, you're right in that the second example does not appear to be a multi-partition batch, as the cart_ids are the same. Following your link above, I quickly found a bad use of BATCH (multi-partition): https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useBatchBadExample.html
A single-partition batch is one where all statements target the same partition - in that case, Cassandra packs the whole batch into a single operation (also called a "mutation").
The description of the second example is incorrect - it's still a single-partition batch.
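To make this concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver) - the contact point and the driver-based style are my assumptions, not part of the docs example:

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(['127.0.0.1']).connect()
insert = session.prepare(
    "INSERT INTO cycling.cyclist_expenses "
    "(cyclist_name, expense_id, amount, description, paid) "
    "VALUES (?, ?, ?, ?, ?)")
batch = BatchStatement()
# Both rows share the partition key value 'Vera ADRIAN' (cyclist_name is the
# only partition key column), so Cassandra applies the batch as one mutation.
batch.add(insert, ('Vera ADRIAN', 2, 13.44, 'Lunch', True))
batch.add(insert, ('Vera ADRIAN', 3, 25.00, 'Dinner', False))
session.execute(batch)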
I have a huge table in an Oracle database that I want to work on in PySpark, and I want to partition it using a custom query. For example, imagine there is a column that contains the user's name, and I want to partition the data based on the first letter of the name; or imagine that each record has a date, and I want to partition based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in PySpark?
P.S.: The reason I need to control the partitioning is that I need to perform some aggregations on each partition (the partitions have meaning; they are not just there to distribute the data), so I want them to be on the same machine to avoid any shuffles. Is this possible? Or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city, etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turns out that Spark has a way of controlling the partitioning logic exactly: the predicates option of spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition them based on userId and productId, so that all the records of an entity are kept together on the same machine and we can perform aggregations on these entities without shuffling.)
First, produce the histogram of every column that you want to partition by (count of each value):
userId    count
123456    1640
789012    932
345678    1849
901234    11
...       ...
productId    count
123456789    5435
523485447    254
363478326    2343
326484642    905
...          ...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want); a greedy sketch of this step follows the tables below.
userId    bin
123456    1
789012    1
345678    1
901234    2
...       ...
productId    bin
123456789    1
523485447    2
363478326    2
326484642    3
...          ...
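The multifit step can be approximated with a simple greedy heuristic. Here is a minimal sketch (plain Python; the counts are the sample values from the tables above, and n=2 is an arbitrary choice) that always assigns the next-largest value to the currently lightest bin:

import heapq

def balanced_bins(counts, n):
    # counts: {id: row count}; returns {id: bin number in 1..n}
    heap = [(0, b) for b in range(1, n + 1)]  # (rows assigned so far, bin)
    heapq.heapify(heap)
    assignment = {}
    for key in sorted(counts, key=counts.get, reverse=True):
        load, b = heapq.heappop(heap)  # lightest bin so far
        assignment[key] = b
        heapq.heappush(heap, (load + counts[key], b))
    return assignment

user_bins = balanced_bins({123456: 1640, 789012: 932, 345678: 1849, 901234: 11}, n=2)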
Then, store these bin tables in the database.
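Continuing the sketch above (user_bins from the binning snippet; spark and url as in the code below), the mapping could be written back with Spark's JDBC writer - the table name USER_PARTITION matches the query below, and the same would apply to PRODUCT_PARTITION:

# Persist the id -> bin mapping so the main query can join on it.
bins_df = spark.createDataFrame(list(user_bins.items()), ['USER_ID', 'BIN'])
bins_df.write.jdbc(url=url, table='USER_PARTITION', mode='overwrite',
                   properties={'driver': 'oracle.jdbc.driver.OracleDriver'})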
Then update your query and join on these tables to get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'
query = '''
(SELECT
    MY_TABLE.*,
    USER_PARTITION.BIN AS USER_BIN,
    PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
    ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
    ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY'''
df = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates, one for each partition (note that predicates must be defined before the spark.read call above), like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
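A compact way to generate this list for n partitions (n from the binning step above) would be:

predicates = ['USER_BIN = {0} OR PRODUCT_BIN = {0}'.format(i)
              for i in range(1, n + 1)]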
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in the workers' memory, then you need to do sub-partitioning. That means you should first add a new random numeric column to your data (between 0 and some chunk_size, like 10000), then do the partitioning based on the combination of that number and the original ID (like userId). This causes each entity to be split into fixed-size chunks (e.g., 10000) to ensure it fits in the workers' memory.
And after the aggregations, you need to group your data by the original IDs to aggregate all the chunks together and make each entity whole again.
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way to achieve the desired result.
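As a sketch of those last two steps (with hypothetical column names: CHUNK is the added random column and PRICE the value being aggregated), the per-chunk partial results are merged by a final group-by on the original ID:

from pyspark.sql import functions as F

# Partial aggregate per (entity, chunk); with the JDBC partitioning above,
# all rows of a given (USER_ID, CHUNK) pair already sit in one partition.
partial = df.groupBy('USER_ID', 'CHUNK').agg(F.sum('PRICE').alias('chunk_sum'))

# Merge each entity's chunks back together - the one inevitable shuffle.
totals = partial.groupBy('USER_ID').agg(F.sum('chunk_sum').alias('total'))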
Is the performance impacted if I provide only the partition key while querying a table containing both partition key and clustering key?
For example, for a table with partition key p1 and clustering key c1, would
SELECT * FROM table1 where p1 = 'abc';
be less efficient than
SELECT * FROM table1 where p1 = 'abc' and c1 >= 'some range start value' and c1 <= 'some range end value';
My goal is to fetch all rows with p1 = 'abc'.
The main cost of going to a particular row, versus just a particular partition, is the extra work of deserializing the clustering-key index at the beginning of the partition. The following post is a bit old and based on Thrift, but the gist of it remains true:
http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
(Note: the row-level bloom filter has since been removed.)
When reading from the beginning of a partition you can save a little work, which will improve the latency.
I wouldn't worry too much about it as long as your queries are not spanning multiple partitions. You will generally only have issues if the partitions get to be hundreds of MB or GBs in size.
Given a table
CREATE TABLE sensors_by_id (
id uuid,
time timeuuid,
some_text text,
PRIMARY KEY (id, time)
)
Will this scale when there are a lot of entries? I'm not sure whether a UUID field is sufficient as a good partition key, or whether there is a need to create some artificial key like week_first_day or something similar.
It really depends on how you will insert your data: if you generate the UUID truly randomly for every insert, then the chance of duplicates is very low, and you'll get so-called "skinny rows" (a lot of partitions with one row inside). Even if you start to get duplicates, there will not be many rows per partition...
Partition size could be a problem, though, because Cassandra has practical limits on the disk size of a single partition.
A good rule of thumb is to keep the maximum number of rows per partition below 100,000 and the disk size under 100 MB.
It is easy to calculate the partition size using the standard estimation formula, and you can read more about data modeling in the Cassandra docs.
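For reference, here is that estimate written out as a small Python helper (my transcription of the standard formula; the 8-byte term is the per-value overhead):

def partition_size_bytes(rows, n_cols, n_pk, n_static,
                         pk_bytes, static_bytes, row_bytes):
    # values stored = rows * (columns - primary key columns - statics) + statics
    n_values = rows * (n_cols - n_pk - n_static) + n_static
    return pk_bytes + static_bytes + rows * row_bytes + 8 * n_values

partition_size_bytes(1_000_000, n_cols=3, n_pk=2, n_static=0,
                     pk_bytes=16, static_bytes=0, row_bytes=116)  # 124000016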
So in your case, with the current schema, 1,000,000 rows per partition, and an average size of 100 bytes for the some_text column, it will be:
Number of values: (1000000 * (3 - 2 - 0) + 0) = 1000000
Partition size on disk: (16 + 0 + (1000000 * 116) + (8 * 1000000)) = 124000016 bytes (118.26 MB)
So as you can see, you are over the limit with 118.26 MB per partition, and you need to optimize your partition keys.
I calculated it using my open source project - cql-calculator.
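As one possible optimization - a sketch only, borrowing the week_first_day idea from the question and assuming an open cassandra-driver session - the time bucket can be folded into the partition key so that a partition holds at most one week of data per sensor:

session.execute("""
    CREATE TABLE IF NOT EXISTS sensors_by_id_week (
        id uuid,
        week_first_day date,
        time timeuuid,
        some_text text,
        PRIMARY KEY ((id, week_first_day), time)
    )
""")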
select count (*) from my_table gives me OperationTimedOut: errors={}, last_host=127.0.0.1
I have already tried to change the value of request_timeout_in_ms in cassandra.yaml and request_timeout in cqlshrc.sample (both are in C:\Programs\DataStax-DDC\apache-cassandra\conf), but without success.
How can I increase the timeout?
select count (*) is not doing what you think. It is actually expensive, as it counts the rows one by one. You can track the number of records using a separate column family with a counter, which you will need to increment for every insert you do into your table. For example:
CREATE TABLE IF NOT EXISTS my_table_counter (
mykey text,
count counter,
PRIMARY KEY (mykey)
);
Then, for every insert into your table, update the counter:
INSERT into my_table (mykey, mydata) VALUES (?, ?);
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;
To get the count:
SELECT count FROM my_table_counter WHERE mykey = ?;
Note that counters are not idempotent, so in the rare event of a failure your data might be under- or over-counted. Also, the code above assumes that you only insert with new keys.
If you need precise counting, Cassandra may not be a good fit. Also, if you are not inserting with unique keys, you may need to consider using a lightweight transaction on the insert (IF NOT EXISTS) and updating the counter only if the transaction was applied.
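A minimal sketch of that lightweight-transaction variant with the DataStax Python driver (session, key and data are assumed to exist):

result = session.execute(
    "INSERT INTO my_table (mykey, mydata) VALUES (%s, %s) IF NOT EXISTS",
    (key, data))
if result.was_applied:  # count only rows that were actually inserted
    session.execute(
        "UPDATE my_table_counter SET count = count + 1 WHERE mykey = %s",
        (key,))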
After reading this blog post at PlanetCassandra, I'm wondering how a CQL3 composite key with three fields maps to the Thrift column-family world. For example:
CREATE TABLE comments (
article_id uuid,
posted_at timestamp,
author text,
karma int,
content text,
PRIMARY KEY (article_id, posted_at)
)
Here the column article_id will be mapped to the internal row key and posted_at will be mapped to (the first part of) the cell name.
What if the table design is:
CREATE TABLE comments (
author_id varchar,
posted_at timestamp,
article_id uuid,
author text,
karma int,
content text,
PRIMARY KEY (author_id, posted_at, article_id)
)
Will the internal row key be mapped to the first two fields of the composite key, with article_id mapped to the cell name - essentially slicing for as many articles as needed, up to 2 billion entries - so that any query on an author_id and posted_at combination is one seek on disk?
Is the behavior the same for any number of fields in a composite key?
Your answers are much appreciated.
The observation in the other answer below is incorrect; the correct mapping is here. I've personally verified:
In the first case: article_id = partition key, posted_at = clustering key.
In the second case: author_id = partition key, (posted_at, article_id) = clustering key.
The first part of the composite key (author_id) is called the "partition key"; the rest (posted_at, article_id) are the clustering keys.
Cassandra stores columns differently when composite keys are used. The partition key becomes the row key, and the clustering keys are concatenated with each column name (with ":" as the separator) to form the internal column names. The column values remain unchanged.
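To illustrate the mapping, here is a toy Python model of it (illustrative only; real Cassandra encodes composite names in binary, not with literal ":" strings):

def thrift_cells(row):
    # Clustering values (posted_at, article_id) prefix every non-key column name.
    prefix = '{}:{}'.format(row['posted_at'], row['article_id'])
    return {'{}:{}'.format(prefix, col): row[col]
            for col in ('author', 'karma', 'content')}

thrift_cells({'posted_at': 't1', 'article_id': 'a1',
              'author': 'alice', 'karma': 10, 'content': 'hi'})
# -> {'t1:a1:author': 'alice', 't1:a1:karma': 10, 't1:a1:content': 'hi'}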
The clustering keys are ordered, and it's not allowed to filter on an arbitrary one of them: you have to start with the first one and then move to the second one, and so on. This is evident from the "Bad Request" error you get otherwise.
There's an excellent explanation by Aaron Morton at his site, The Last Pickle.
In the first case:
article_id = partition key, posted_at = clustering key.
In the second case:
author_id + posted_at = partition key, article_id = clustering key.
Hence, be mindful of disk seeks if you go with the second method, and check that the row is not getting too wide and that it gives a real benefit compared to the first case.
If you aren't crossing the 2 billion limit and are well within bounds, don't overdo it by adopting the second method, as the records are dispersed on the combined key.