Performance of query with only partition key - cassandra

Is the performance impacted if I provide only the partition key while querying a table containing both partition key and clustering key?
For example, for a table with partition key p1 and clustering key c1, would
SELECT * FROM table1 where p1 = 'abc';
be less efficient than
SELECT * FROM table1 where p1 = 'abc' and c1 >= 'some range start value' and c1 <= 'some range end value';
My goal is to fetch all rows with p1 = 'abc'.

The main cost of going to a particular row rather than a whole partition is the extra work of deserializing the clustering key index at the beginning of the partition. The following post is a bit old and based on Thrift, but the gist of it remains true:
http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
(note: row level bloom filter was removed)
When reading from the beginning of a partition you can skip that work, which improves latency a little.
I wouldn't worry too much about it as long as your queries do not span multiple partitions. Beyond that, you will generally only have issues if partitions grow to hundreds of MB or several GB in size.

Related

Custom partitioning on JDBC in PySpark

I have a huge table in an Oracle database that I want to work on in PySpark, but I want to partition it using a custom query. For example, imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. Because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. Can I do that in PySpark?
P.S.: The reason I need to control the partitioning is that I need to perform some aggregations on each partition (the partitions have meaning; they are not just a way to distribute the data), so I want the related records to be on the same machine to avoid any shuffles. Is this possible, or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turns out that Spark has a way of controlling the partitioning logic exactly: the predicates option of spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition them based on userId and productId so that all the records of an entity are kept together on the same machine and we can perform aggregations on these entities without shuffling.)
First, produce the histogram of every column that you want to partition by (count of each value):
userId       count
123456       1640
789012       932
345678       1849
901234       11
...          ...
productId    count
123456789    5435
523485447    254
363478326    2343
326484642    905
...          ...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId       bin
123456       1
789012       1
345678       1
901234       2
...          ...
productId    bin
123456789    1
523485447    2
363478326    2
326484642    3
...          ...
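The binning step can be sketched in a few lines of Python. Note that this is not the multifit algorithm itself: for brevity it swaps in the simpler greedy "largest count first, into the currently lightest bin" heuristic, and the function name assign_bins is hypothetical. The input is the userId histogram from the example above.

```python
import heapq

def assign_bins(histogram, n_bins):
    """Greedy balancing (NOT multifit): place each value, largest count
    first, into the bin that currently holds the fewest records.
    Bins are numbered from 1, as in the tables above."""
    # Min-heap of (current_load, bin_number).
    heap = [(0, b) for b in range(1, n_bins + 1)]
    heapq.heapify(heap)
    bins = {}
    for value, count in sorted(histogram.items(), key=lambda kv: kv[1], reverse=True):
        load, b = heapq.heappop(heap)      # lightest bin so far
        bins[value] = b
        heapq.heappush(heap, (load + count, b))
    return bins

user_histogram = {123456: 1640, 789012: 932, 345678: 1849, 901234: 11}
user_bins = assign_bins(user_histogram, n_bins=2)
```

The resulting mapping (value to bin number) is what gets stored in the database in the next step. A real multifit implementation would usually balance the bins somewhat better.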
Then, store these mappings in the database (here as the tables USER_PARTITION and PRODUCT_PARTITION).
Then update your query and join on these tables to get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'
# Subquery (aliased MY_QUERY) that attaches the bin numbers to every record.
query = """
(SELECT
    MY_TABLE.*,
    USER_PARTITION.BIN AS USER_BIN,
    PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
    ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
    ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY"""
# `predicates` is the list built in the next step; define it before this read.
df = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates. One for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
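Each predicate is added to the query as a WHERE clause and defines one partition, which means that all the records of the users in bin 1 go to the same machine, and all the records of the products in bin 1 go to that same machine as well. Keep in mind that a record whose USER_BIN and PRODUCT_BIN differ matches two predicates and is therefore read into two partitions, so when aggregating users within a partition you should only consider the rows whose USER_BIN equals that partition's bin (and likewise PRODUCT_BIN for products).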
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
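The predicate list shown above does not have to be written by hand. A minimal sketch, assuming the bin columns are named USER_BIN and PRODUCT_BIN as in the query and that bins are numbered 1..n (build_predicates is a hypothetical helper name):

```python
def build_predicates(n_bins, columns=("USER_BIN", "PRODUCT_BIN")):
    """Return one WHERE-clause expression per partition (bins are 1-based)."""
    return [
        " OR ".join(f"{col} = {b}" for col in columns)
        for b in range(1, n_bins + 1)
    ]

# Three partitions, matching the first three entries of the list above.
predicates = build_predicates(3)
```

The returned list can be passed as the predicates argument of spark.read.jdbc, and Spark creates one partition per entry.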
Also, note that if there are some users or products whose records don't fit in the workers' memory, then you need to do sub-partitioning. This means that you first add a new random numeric column to your data (an integer between 0 and some chunk_size such as 10000), then partition on the combination of that number and the original ID (like userId). This splits each entity into smaller chunks so that each chunk fits in the workers' memory.
After the per-chunk aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again.
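The two-stage idea can be illustrated in plain Python, without Spark. This is only a sketch under stated assumptions: the records are hypothetical (user_id, amount) pairs, the aggregation is a sum, and the function names are made up for the example. It only works for aggregations whose partial results can be merged (sum, count, min, max and similar).

```python
import random
from collections import defaultdict

def salted_totals(records, n_chunks, seed=0):
    """Stage 1: sum amounts per (user_id, salt) chunk.
    The salt is the random numeric column described above."""
    rng = random.Random(seed)
    partial = defaultdict(int)
    for user_id, amount in records:
        salt = rng.randrange(n_chunks)
        partial[(user_id, salt)] += amount
    return partial

def merge_chunks(partial):
    """Stage 2: regroup on the original ID to make each entity whole again."""
    totals = defaultdict(int)
    for (user_id, _salt), subtotal in partial.items():
        totals[user_id] += subtotal
    return dict(totals)

records = [(123456, 10), (123456, 5), (789012, 2), (123456, 1)]
totals = merge_chunks(salted_totals(records, n_chunks=4))
```

Whatever salts the records receive, the merged totals equal the totals computed directly per user, because each record contributes to exactly one chunk.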
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way you can achieve the desired results.

Cassandra multi/single partition batch explanation

I read the Cassandra docs about good use of the BATCH statement (the single-partition batch example), and I want to understand multi-partition versus single-partition batches.
According to the docs, this is a single-partition batch:
CREATE TABLE cycling.cyclist_expenses (
cyclist_name text,
balance float STATIC,
expense_id int,
amount float,
description text,
paid boolean,
PRIMARY KEY (cyclist_name, expense_id)
);
BEGIN BATCH
INSERT INTO cycling.cyclist_expenses (cyclist_name, expense_id, amount, description, paid) VALUES ('Vera ADRIAN', 2, 13.44, 'Lunch', true);
INSERT INTO cycling.cyclist_expenses (cyclist_name, expense_id, amount, description, paid) VALUES ('Vera ADRIAN', 3, 25.00, 'Dinner', false);
...
APPLY BATCH;
First partition is - 'Vera ADRIAN', 2
Second partition - 'Vera ADRIAN', 3
Could you please explain why this is a single-partition batch?
In another doc I found this example of a multi-partition batch:
Create table shopping_chart
(cart_id UUID,item_id UUID,price Decimal, total Decimal static,
primary key ((cart_id),item_id));
insert into shopping_chart(cart_id,item_id,price,total)
values (ABC12345,ABCITEM12345,0.01,0.01);
Begin Batch
insert into shopping_chart(cart_id,item_id,price) values ( ABC12345,ABCITEM123451,1.00);
insert into shopping_chart(cart_id,item_id,price) values ( ABC12345,ABCITEM1234512,2.00);
Update …. cart_id=ABC12345 IF total =0.01;
Apply Batch;
I can't understand why this is a multi-partition batch. Could you please explain? It only works with one partition, cart_id = ABC12345.
You wrote: "First partition is - 'Vera ADRIAN', 2. Second partition - 'Vera ADRIAN', 3. Could you please explain why this is a single-partition batch?"
Sure. Because the expense_id is not part of the partition key. Therefore, Vera ADRIAN is the same partition key value used in both INSERTs.
For the 2nd part of your question, you're right in that the 2nd example does not appear to be a multi-partition query as the cart_ids are the same. Following your link above, I quickly found a bad use of BATCH (multi-partition): https://docs.datastax.com/en/dse/6.8/cql/cql/cql_using/useBatchBadExample.html
A single-partition batch is one in which all of your queries target the same partition. In this case, Cassandra packs all the queries into a single operation (also called a "mutation").
The description of the second example is incorrect: it is still a single-partition batch.
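To make the distinction concrete, here is a small Python sketch (not Cassandra code; partitions_touched is a hypothetical helper) that counts how many distinct partitions a set of rows touches. For cyclist_expenses the partition key is only cyclist_name, the first component of the PRIMARY KEY; expense_id is a clustering column.

```python
def partitions_touched(rows, partition_key):
    """Return the set of distinct partition key values among the rows."""
    return {tuple(row[col] for col in partition_key) for row in rows}

expenses = [
    {"cyclist_name": "Vera ADRIAN", "expense_id": 2},
    {"cyclist_name": "Vera ADRIAN", "expense_id": 3},
]

# Correct reading: only cyclist_name is the partition key.
single = partitions_touched(expenses, partition_key=("cyclist_name",))

# The reading from the question, which treats the clustering column
# expense_id as if it were part of the partition key.
mistaken = partitions_touched(expenses, partition_key=("cyclist_name", "expense_id"))
```

With the correct partition key both inserts land in one partition, which is why the batch is single-partition; the two "partitions" in the question only appear when expense_id is counted as well.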

Cassandra OperationTimedOut

select count (*) from my_table gives me OperationTimedOut: errors={}, last_host=127.0.0.1
I have already tried to change the values in request_timeout_in_ms in cassandra.yaml and request_timeout in cqlshrc.sample. (Both are in C:\Programs\DataStax-DDC\apache-cassandra\conf) But without success.
How can I increase the timeout?
SELECT count(*) is more expensive than it looks: Cassandra has to read the rows one by one to count them. Instead, you can track the number of records in a separate table (column family) with a counter column, which you increment for every insert into your table. For example:
CREATE TABLE IF NOT EXISTS my_table_counter (
mykey text,
count counter,
PRIMARY KEY (mykey)
);
Then for every insert into your table, do counter update:
INSERT into my_table (mykey, mydata) VALUES (?, ?);
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;
To get the count:
SELECT count FROM my_table_counter WHERE mykey = ?;
Note that counter updates are not idempotent, so in the rare event of a failure your data might be under- or over-counted. The code above also assumes that you only insert rows with a new key.
If you need precise counting, Cassandra may not be a good fit. And if you are not inserting with unique keys, consider using a lightweight transaction for the insert (IF NOT EXISTS) and updating the counter only if the transaction was applied.

Ranges (intervals) request in Cassandra DB - CQL

Apologies if this is a duplicate. I've found a few questions about time ranges here, but my case seems a little different and has not been discussed yet.
I would like to store quite big chunks (bins) of data (blobs of 2-4 MB; this is "black-box data" and I can't change its layout) and access them by interval keys:
...
primary key ( bin_id int, from_item_id int, to_item_id int )
...
with the ability to select by item ranges, as in this pseudo-code that selects all chunks contained in the item interval [110, 200]:
select chunk from tb1 where chunk_id = 100500 and from_item_id >= 110 and to_item_id <= 200;
An attempt to run such a query directly ended with this error:
code=2200 [Invalid query] message="PRIMARY KEY column "to_item_id" cannot be restricted (preceding column "from_item_id" is restricted by a non-EQ relation)"
Currently the only solution I've found is to add another table (tb_map) with a reverse mapping from item_id to bin_id and query it with something like this:
...
-- in tb_map
primary key (dummy_id, item_id)
...
select bin_id from tb_map where dummy_id = SOME_MAGIK and item_id >= 110 and item_id <= 200;
And then use bin_id to retrieve chunks from tb1 with EQ or IN operator like here:
select * from tb1 where bin_id in (...);
But I can't use this model due to insert performance issues (the application should avoid many inserts into an additional table and should avoid maintaining additional data structures; it should be "as simple as a nail").
Is there any simple solution that stays within one table (or several simple tables)? I'm stuck with no idea how to model such behaviour in C* (maybe slices should be used?). Could the local C* gurus provide any hints?
I'm using CQL 3.1.
From CQL3 reference:
Moreover, for a given partition key, the clustering columns induce an ordering of rows and relations on them is restricted to the relations that allow to select a contiguous (for the ordering) set of rows.
In your case the query doesn't select a contiguous set of rows, so Cassandra refuses to process it.
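A small Python sketch shows why the selected rows are not contiguous. It assumes some made-up (from_item_id, to_item_id) pairs within one partition, sorted the way clustering columns order them; the helper name is_contiguous is hypothetical.

```python
def is_contiguous(indices):
    """True if the indices form one unbroken run, e.g. [1, 2, 3]."""
    return bool(indices) and indices == list(range(indices[0], indices[-1] + 1))

# Rows of one partition, in clustering order (from_item_id, then to_item_id).
rows = sorted([(100, 150), (110, 180), (120, 300), (130, 190), (150, 250)])

# Restricting only the first clustering column selects one slice.
first_only = [i for i, (f, t) in enumerate(rows) if f >= 110]

# Restricting both columns with ranges skips rows in the middle.
both = [i for i, (f, t) in enumerate(rows) if f >= 110 and t <= 200]
```

Here first_only is the unbroken run [1, 2, 3, 4], while both is [1, 3]: the row (120, 300) sits between two matching rows, so the result cannot be read as a single slice of the partition.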

fetching timeseries/range data in cassandra

I am new to Cassandra and trying to see if it fits my data query needs. I am populating test data in a table and fetching them using cql client in Golang.
I am storing time series data in Cassandra, sorted by timestamp. I store data on a per-minute basis.
Schema is like this:
parent: string
child: string
bytes: int
val2: int
timestamp: date/time
I need to answer queries where a timestamp range and a child name are given. The result needs to be the bytes value for that time range (a single value, not a series). I made the primary key (child, timestamp). I followed this approach rather than the column-family, comparator-type with timeuuid approach, since that was not supported in CQL.
Since the data stored in every timestamp(every minute) is the accumulated value, when I get a range query for time t1 to t2, I need to find the bytes value at t2, bytes value at t1 and subtract the 2 values before returning. This works fine if t1 and t2 actually had entries in the table. If they do not, I need to find those times between (t1, t2) that have data and return the difference.
One approach I can think of is to run "select * from tablename WHERE timestamp <= t2 AND timestamp >= t1;" and then find the difference between the first and last entries in the returned rows. Is this the best way to do it? Since MIN and MAX queries are not supported, is there a way to find the maximum timestamp in the table that is less than a given value? Thanks for your time.
Are you storing each entry as a new row with a different partition key (the first column in the primary key)? If so, select * from x where f < a and f > b is a cluster-wide query, which will cause you problems. Consider adding a "fake" partition key, or use a partition key per day / week / month etc., so that your queries hit a single partition.
Also, your queries in cassandra are >= and <= even if you specify > and <. If you need strictly greater than or less than, you'll need to filter client side.
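The client-side part of the approach from the question (take the first and last samples that fall inside [t1, t2] and subtract) can be sketched in Python as follows. The sample data and the function name bytes_in_range are hypothetical; the samples stand for the rows of a single child, already sorted by timestamp.

```python
from bisect import bisect_left, bisect_right

def bytes_in_range(samples, t1, t2):
    """samples: list of (timestamp, cumulative_bytes) sorted by timestamp.
    Returns the growth between the first and last sample inside [t1, t2],
    or None if no sample falls in the range."""
    timestamps = [ts for ts, _ in samples]
    lo = bisect_left(timestamps, t1)     # first sample with ts >= t1
    hi = bisect_right(timestamps, t2)    # one past the last sample with ts <= t2
    if hi <= lo:
        return None
    return samples[hi - 1][1] - samples[lo][1]

samples = [(0, 100), (60, 250), (120, 400), (240, 900)]
delta = bytes_in_range(samples, 30, 200)   # uses the samples at 60 and 120
```

Only the two boundary rows are actually needed. As an assumption not stated in the answer: with timestamp as a clustering column, each boundary could be fetched with a query restricted to one child that uses ORDER BY timestamp with LIMIT 1, instead of reading the whole range.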
