There is a Cassandra table as follows.
Student(eName text, eId int PRIMARY KEY, m1 int, m2 int, average float)
How can the average value be auto-calculated when inserting values for the other fields of the row?
That is, we want to insert only eName, eId, m1 and m2; average should be auto-calculated and stored in the row.
Thanks.
Cassandra has built-in CQL aggregate functions such as AVG(), which computes the average of all values returned by a query; that is, the aggregation takes place at read time (as opposed to write time).
You can also write your own user-defined aggregates (UDAs).
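For example (a minimal sketch against the question's Student table; note that CQL lower-cases unquoted identifiers), the built-in aggregate is applied when you read:
-- Averages m1 across the rows returned by the query (here a full-table scan),
-- not across the columns of a single row.
SELECT AVG(m1) FROM student;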
It is possible to implement your own CQL TRIGGER, which executes Java code when data is written to a table, but triggers have been considered experimental for a long time and I don't recommend using them.
The general recommendation is to perform the aggregation within your application prior to writing the data to the table, for example as sketched below. Cheers!
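A minimal sketch of that (the literal values are made up, and (m1 + m2) / 2.0 is assumed to be the intended formula):
-- average computed in the application as (80 + 90) / 2.0 before the write
INSERT INTO student (eid, ename, m1, m2, average)
VALUES (1001, 'Alice', 80, 90, 85.0);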
We have a frozen UDT with ~2000 fields as one of the columns in a table.
We use this table to implement append-only writes so that the data is auditable and not overwritten.
We are seeing degradation in write performance when only 1 (out of 2000) field in the UDT is populated.
Trying to understand the performance implication of using sparsely populated frozen UDTs. How are UDTs serialized/deserialized internally? Any documentation of this will be highly appreciated.
We tried to gather some metrics from cass session, but couldn't get much information.
Edit: using the C++ Cassandra driver with prepared statements for writes.
Cassandra version: 3.11.6
Data Model:
CREATE TYPE udt_xyx (
  field1 bigint,
  field2 bigint,
  ...
  field2000 bigint
);
CREATE TABLE table_xyz (
  key_1 text,
  txn_id int,
  fields frozen<udt_xyx>,
  PRIMARY KEY ((key_1), txn_id)
);
Workflow:
A request comes in from the caller to write n fields (out of 2000) for a given key_1.
We assign a unique txn_id (transaction_id) to the request.
Then we create a UDT object which has 2000 fields but only populate n of those fields and persist it in the table.
A new request that comes in for the same key_1 with different (or the same) fields will be assigned a new txn_id and written to the table as a new record.
That way we are not updating any previously written UDT, but always creating a new record in the table (associated with a new txn_id).
When the UDT is sparsely populated, we are experiencing write performance degradation.
EDIT:
After doing some analysis we narrowed down the slowness to this:
https://github.com/datastax/cpp-driver/blob/master/src/data_type.hpp#L352-L380
Basically, every time we bind a UDT the "check" method runs and compares the string names of every field in the UDT.
Since we have ~2000 fields and do over 100,000 binds, that comes to roughly 200 million string comparisons.
What performance are you measuring here? Are you comparing inserts into a table using only non-UDT columns against inserts using both non-UDT columns and UDT-typed columns?
A column whose type is a frozen collection (set, map, or list) or UDT can only have its value replaced as a whole. In other words, we can't add, update, or delete individual elements of the collection as we can with non-frozen collection types. So the frozen keyword can be useful, for example, when we want to protect collections against single-value updates.
For example, in the case of the snippet below,
CREATE TYPE IF NOT EXISTS race (
race_title text,
race_date date
);
CREATE TABLE IF NOT EXISTS race_data (
id INT PRIMARY KEY,
races frozen<list<race>>
...
);
the list (and the UDT nested in it) is frozen, so the entire list will be read as a whole when querying the table.
Since you did not describe how you're updating the frozen collection, it is hard to triage why there is a performance concern here.
References for exploration:
Freezing collections
Essentially, you will not be able to do an append-only operation with a frozen type, as you will always have to perform a read-before-write operation for any upsert.
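To illustrate (a sketch only; the values are made up and the field names follow the udt_xyx definition above), a frozen UDT column can only be written as a whole:
-- Not allowed: a single field of a frozen UDT cannot be set on its own
-- UPDATE table_xyz SET fields.field1 = 42 WHERE key_1 = 'abc' AND txn_id = 7;
-- Allowed: the whole frozen value is replaced; omitted fields become null
UPDATE table_xyz SET fields = { field1: 42, field2: 99 } WHERE key_1 = 'abc' AND txn_id = 7;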
I am using Cassandra as my dumping ground, on which I have multiple jobs running to process the data and update different systems. Below are the job-related filters:
Job 1: filter data based on active_flag, update_date_time and expiry_time, and process the filtered data.
Job 2: filter data based on update_date_time and process the data.
Job 3: filter data based on created_date_time and active flag.
The DB columns on which the WHERE conditions would run (one or many columns in one query) are:
Active -> yes/no
created_date -> timestamp
expiry_time -> timestamp
updated_date -> timestamp
My questions on these conditions:
How should I form my Cassandra primary key? I don't see any way to achieve uniqueness on these columns (an id is present, but it is not required for processing the data).
Do I even need a primary key if I do the filtering in Spark code using a table scan?
Consider that this is for processing millions of records.
Answering your question - you need to have a primary key, even if it consists only of the partition key :-)
A more detailed answer really depends on how often these jobs run, how much data there is overall, how many nodes are in the cluster, what hardware is used, etc. Usually we try to push as much filtering to Cassandra as possible, so it returns only the relevant data, not everything. This filtering is most effective on the first clustering column; for example, if I want to process only newly created entries, I can use a table with the following structure:
create table test.test (
pk int,
tm timestamp,
c2 int,
v1 int,
v2 int,
primary key(pk, tm, c2));
and then I can fetch only newly created entries by using:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("test", "test").load()
val filtered = data.filter("tm >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
Or I can fetch entries in the given time period:
val filtered = data.filter("""tm >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND tm <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
The effect of the filter pushdown can be checked by executing explain on the dataframe and inspecting the PushedFilters section - conditions that are marked with * will be executed on the Cassandra side.
But it's not always possible to design tables to match all queries, so you'll need to design the primary key for the jobs that are executed most often. In your case, update_date_time could be a good candidate for that, but if you make it a clustering column, you'll need to take care when updating it - you'll have to perform the change as a batch, something like this:
begin batch
delete from table where pk = ... and update_date_time = old_timestamp;
insert into table (pk, update_date_time, ...) values (..., new_timestamp, ...);
apply batch;
or something like this.
I need to select the Nth row from a Cassandra table based on a number produced by my logic, i.e. if the logic outputs 23, I need to get the details of the 23rd row. Since there is no auto-increment in Cassandra, I can't match on an ID key. In SQL this is done with OFFSET and LIMIT; I don't know how to achieve the same in Cassandra.
Can we achieve this using any UDF concept? Thanks in advance.
Table Schema :
CREATE TABLE new_card (
customer_id bigint,
card_number text,
active tinyint,
auto_pay int,
available_credit_limit double,
average_card_spend_half_yearly double,
average_card_spend_monthly double,
average_card_spend_quarterly double,
average_card_spend_yearly double,
avg_half_yearly_spend_mcc double,
PRIMARY KEY (customer_id, card_number)
);
If you are using the Java driver, refer to Paging.
Note that Cassandra does not support direct offsets; pages are read sequentially. If you need to use offsets in your queries, you might want to revisit your data model. You could create a composite partition key that includes the row number as an additional column on top of your existing partition key column, as sketched below.
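A rough sketch of that idea (the table name, the row_number column, and the literal values are illustrative, not part of the original schema):
-- Hypothetical variant of new_card with an application-maintained row number
-- in the partition key, so "row N" for a customer can be addressed directly.
CREATE TABLE new_card_by_row (
  customer_id bigint,
  row_number int,
  card_number text,
  active tinyint,
  PRIMARY KEY ((customer_id, row_number), card_number)
);
-- Fetch "row 23" for a customer without any offset or paging:
SELECT * FROM new_card_by_row WHERE customer_id = 42 AND row_number = 23;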
You simply can't select the Nth row from a table, because a Cassandra table is made of partitions, and you can order rows within a partition, but not the partitions themselves. Paging will go through the whole table, but there will be no chronological order to the rows selected with such an approach (disregarding the fact that the partitions can change while you are paging through them).
If you want to select row number N from Cassandra, you need to implement an auto-increment field at the application level and use it as the key.
There are ways to do that with Cassandra, using lightweight transactions for example, but they have a high cost from a performance perspective. See several solutions here:
How to create auto increment IDs in Cassandra
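One such approach, sketched below (the table and column names are made up, and every increment is a Paxos round, which is why it is expensive):
-- A single-row ID generator updated with lightweight transactions.
CREATE TABLE id_generator (
  name text PRIMARY KEY,
  next_id bigint
);
-- Initialise the sequence once:
INSERT INTO id_generator (name, next_id) VALUES ('row_id', 1) IF NOT EXISTS;
-- Claim an ID: read next_id, then conditionally bump it; retry if [applied] comes back false.
UPDATE id_generator SET next_id = 2 WHERE name = 'row_id' IF next_id = 1;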
I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that Cassandra will filter in this order:
1. By the partition column (day)
2. By the range column (start) on 1's result
3. By the action column on 2's result
So ALLOW FILTERING will not be a bad choice for this query.
When there are multiple filtering parameters in the WHERE clause and the non-indexed column is the last one, how will the filtering work?
Please explain.
Is this ALLOW FILTERING efficient?
When you write "this" you mean in the context of your query and your model; however, the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data, this is a hard question to answer.
I am expecting that cassandra will filter in this order...
Yeah, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is, you're not following some guidelines of Cassandra modeling (specifically "one query <--> one table").
As a solution, I would suggest including the action field in the clustering key just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
having only the minor issue that if one record "switches" action values you cannot update the single action field (because it's now part of the clustering key), so you need to perform a delete with the old action value and an insert with the correct new value. But if you have Cassandra 3.0+, all this can be done with the help of the new Materialized View implementation; have a look at the documentation for further information.
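A sketch of such a view over the original test table (the view name is illustrative; every primary-key column has to be restricted with IS NOT NULL):
CREATE MATERIALIZED VIEW test_by_action AS
  SELECT * FROM test
  WHERE day IS NOT NULL AND action IS NOT NULL
    AND start IS NOT NULL AND id IS NOT NULL
  PRIMARY KEY ((day), action, start, id);
-- The range query can then be served by the view, while updates go to the base table:
SELECT * FROM test_by_action WHERE day = 1 AND action = 'accept' AND start > 1475485412 AND start < 1485785654;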
In general ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which Cassandra has to use ALLOW FILTERING) and the size of the data it is being fetched from.
In your case, Cassandra does not need filtering up to:
By the range column(start) on the 1's result
as you mentioned. But after that, it will rely on filtering to search the data, which you are allowing in the query itself.
Now, keep the following in mind:
If your table contains, for example, 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient: Cassandra will load 999,998 rows for nothing. If the query is used often, it is probably better to add an index on the filtered column (action in your case).
So check this first. If it works in your favour, use ALLOW FILTERING.
Otherwise, it would be wise to add a secondary index on 'action'.
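For example, against the test table shown above (the index gets a generated name unless you provide one):
-- Secondary index on the action column, as suggested above
CREATE INDEX ON test (action);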
Pretty new to Cassandra - I have data that looks like this:
<geohash text, category int, payload text>
The only query I want to run is:
SELECT category, payload FROM table WHERE geohash IN (list of 9 geohashes)
What would be the best schema in this case?
I know I could simply make my geohash the primary key and be done with it, but is there a better approach?
What are the benefits for defining PRIMARY KEY (geohash, category, payload)?
It depends on the size of your data for each row (geohash text, category int, payload text). If your payload size does not reach tens of MB, then you may want to put more geohash values into the same partition by using an artificial bucketId int, so your query can be served by a single server. The schema would look like this:
geohash text, bucketId int, category int, payload text, where the partition key is geohash and bucketId (sketched below).
The recommendation is to have sizeable partitions of <= 100 MB, so you don't have to look up too many partitions. More is available here.
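A sketch of that layout (the table name is illustrative, and category is used here as a clustering column, which is just one possible choice):
CREATE TABLE geohash_by_bucket (
  geohash text,
  bucketId int,
  category int,
  payload text,
  -- partition key combines geohash and bucketId, as described above
  PRIMARY KEY ((geohash, bucketId), category)
);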
If you have a primary key of (geohash, category, payload), then your data is sorted by category and payload in ascending order.
So based on the query, it sounds like you're considering a CQL schema that looks like this:
CREATE TABLE geohash_data (
geohash text,
category int,
data text,
PRIMARY KEY (geohash)
);
In Cassandra, the first (and in this case only) column in your PRIMARY KEY is the Partition Key. The Partition Key is what's used to distribute data around the cluster. So when you do your SELECT ... IN () query, you're basically querying for the data in 9 different partitions which, depending on how large your cluster is, the replication factor, and the consistency level you use to do the query, could end up querying at least 9 servers (and maybe more). Why does that matter?
Latency: The more partitions (and thus replicas/servers) involved in our query, the more potential for a slow server being able to negatively impact how quickly the data is returned.
Availability: The more partitions (and thus replicas/servers) involved in our query, the more potential that a single server going down could make it impossible for the query to be satisfied at all.
Both of those are bad scenarios, so (as Toan rightly points out in his answer and the link he provided) we try to model data in Cassandra so that our queries hit as few partitions (and thus replicas/servers) as possible. What does that mean for your scenario? Without knowing all the details, it's hard to say for sure, but let me make a couple of guesses about your scenario and give you an example of how I'd try to solve it.
It sounds like maybe you already know the list of possible geohash values ahead of time (maybe they're at some regularly spaced interval of a predefined grid). It also sounds like maybe you're querying for 9 geohash values because you're doing some kind of "proximity" search where you're trying to get the data for the 9 geohashes in each direction around a given point.
If that's the case, the trick could be to denormalize the data at write time into a data model optimized for reading. For example, a schema like this:
CREATE TABLE geohash_data (
geohash text,
data_geohash text,
category int,
data text,
PRIMARY KEY (geohash, data_geohash)
);
When you INSERT a data point, you'd calculate the geohashes for the surrounding areas where you expect that data to show up in results. You'd then INSERT the data multiple times, once for each geohash you calculated. So the value for geohash is the calculated value where you expect it to show up in the query results, and the value for data_geohash is the actual value from the data you're inserting. Thus you'd have multiple (up to 9?) rows in your partition for a given geohash, representing the data of the surrounding geohashes.
This means your SELECT query now doesn't have to do an IN and hit multiple partitions. You just query WHERE geohash = ? for the point you want to search around.
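A sketch of the write and read paths under that model (the geohash strings and values are made up; computing the neighbouring geohashes is left to the application):
-- Write: the same data point is inserted under its own geohash and under a neighbour's
INSERT INTO geohash_data (geohash, data_geohash, category, data)
VALUES ('9q8yy', '9q8yy', 1, 'some data');
INSERT INTO geohash_data (geohash, data_geohash, category, data)
VALUES ('9q8yv', '9q8yy', 1, 'some data');
-- Read: a single-partition query instead of an IN over 9 partitions
SELECT category, data FROM geohash_data WHERE geohash = '9q8yy';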