Range-based search on a date in Astyanax - Cassandra

I have, however, one more situation. In my column family I have rows with columns such as name, salary, and dob (date of birth), and all the columns are indexed. I want to do a range-based index search on dob. I'd appreciate it if you could let me know how we can do this.

You could move from astyanax to playOrm and just do:
@NoSqlQuery(name="findByDate", query="PARTITIONS p(:partitionId) SELECT p FROM TABLE p where p.date > :beginDate and p.date <= :endDate")
You need a schema that can be partitioned if you want to scale, though, and then you just query that single partition. Some partition by customer, and some by time. There are many ways to divide up the schema.
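As a rough CQL-level sketch of that partition-then-range idea (the partition_id and employee_id names below are made up for illustration, not from the original question), you keep a partition key of your choosing and make dob a clustering column, so the date range query stays inside one partition:
CREATE TABLE employee_by_partition (
    partition_id text,
    dob timestamp,
    employee_id uuid,
    name text,
    salary double,
    PRIMARY KEY (partition_id, dob, employee_id)
);
SELECT name, salary, dob FROM employee_by_partition
WHERE partition_id = 'dept-42' AND dob > '1980-01-01' AND dob <= '1990-01-01';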
OR, if you really want to use astyanax, playOrm does a batched range query where it gets the columns in batches each time (so you don't blow out memory). The code is on line 326 for setting up the astyanax range builder and line 385 for using the builder to create your query.
https://github.com/deanhiller/playorm/blob/8a4f3405631ad78e6822795633da8c59cb25bb29/input/javasrc/com/alvazan/orm/layer9z/spi/db/cassandra/CassandraSession.java
Note that playOrm is doing batching as well, so you will see it setting the batch sizes there too.
later,
Dean

Related

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?
I have a table definition like this:
CREATE TABLE domain(
    domain_name text,
    status int,
    last_scanned_date bigint,
    PRIMARY KEY(domain_name, last_scanned_date)
);
My requirement is to get all the domains which have not been scanned in the last 24 hours. I wrote the following query, but this query is not efficient, as Cassandra tries to fetch the entire dataset because of ALLOW FILTERING:
SELECT * FROM domain WHERE last_scanned_date <= <last24hourstimeinmillis> ALLOW FILTERING;
Then I decided to do it in two queries.
1st query:
SELECT DISTINCT domain_name FROM domain;
2nd query:
Use the IN operator to query the domains which were not scanned in the last 24 hours:
SELECT * FROM domain WHERE
    domain_name IN ('domain1', 'domain2')
    AND last_scanned_date <= <last24hourstimeinmillis>;
My second approach works, but comes with an extra overhead of querying first for distinct values.
Is there any better approach than this?
You should update your table definition. Currently you are using the domain name as your partition key, while you cannot have more than 2 billion records in a single Cassandra partition.
I would suggest using time as part of your partition key. If you are not going to receive more than 2 billion records per day, try using the day since epoch as the partition key. You can do composite partition keys, but they won't be helpful for your query.
While querying, you have to scan at most two partitions, with an additional filter in the query or in your application filtering out the results which do not belong to the range you have specified.
Go over the following concepts before finalizing your design.
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html
https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
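As a rough sketch of that suggestion (the table and column names here are illustrative), the table could be re-keyed by the day since epoch, so each query targets one day's partition and applies the time filter inside it:
CREATE TABLE domain_by_day (
    day_since_epoch int,
    last_scanned_date bigint,
    domain_name text,
    status int,
    PRIMARY KEY (day_since_epoch, last_scanned_date, domain_name)
);
SELECT domain_name FROM domain_by_day
WHERE day_since_epoch = ? AND last_scanned_date <= <last24hourstimeinmillis>;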
Cassandra can effectively perform range queries only inside one partition. The same is true for aggregations such as DISTINCT. So in your case you would need to have only one partition containing all the data, but that is bad design.
You may try to split this big partition into smaller ones by using TLDs as separate partition keys, and perform fetching in parallel from every partition - but this will also lead to imbalance, as some TLDs will have more sites than others.
Another issue with your schema is that you have last_scanned_date as a clustering column, which means that when you update last_scanned_date you effectively insert a new row into the database - you'll need to explicitly remove the row for the previous last_scanned_date, otherwise the query last_scanned_date <= <last24hourstimeinmillis> will keep fetching old rows that you have already scanned.
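For example (the domain name and timestamp values below are made up), re-scanning a domain under the current schema means inserting the new clustering row and explicitly deleting the old one:
BEGIN BATCH
    DELETE FROM domain WHERE domain_name = 'example.com' AND last_scanned_date = 1500000000000;
    INSERT INTO domain (domain_name, status, last_scanned_date) VALUES ('example.com', 1, 1500086400000);
APPLY BATCH;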
Part of the problem with your current design could be solved by using Spark, which is able to scan the full table effectively via a token range scan plus a range scan for every individual partition - this will return only the data in the given time range. Or, if you don't want to use Spark, you can perform the token range scan in your own code, something like the sketch below.
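A minimal sketch of such a token range scan, assuming the application iterates over the cluster's token ranges and binds the bounds for each range:
SELECT domain_name, status, last_scanned_date FROM domain
WHERE token(domain_name) > ? AND token(domain_name) <= ?;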

How to Select Nth Row in Cassandra DB Table

I need to select the Nth row from a Cassandra table based on a particular number I get from my logic, i.e. if the logic output is 23, I need to get the 23rd row's details. Since there is no auto increment in Cassandra, I can't go with an ID key match. In SQL this is done using OFFSET and LIMIT. I don't know how to achieve this feat in Cassandra.
Can we achieve this using any UDF concept? Thanks in advance.
Table schema:
CREATE TABLE new_card (
customer_id bigint,
card_number text,
active tinyint,
auto_pay int,
available_credit_limit double,
average_card_spend_half_yearly double,
average_card_spend_monthly double,
average_card_spend_quarterly double,
average_card_spend_yearly double,
avg_half_yearly_spend_mcc double,
PRIMARY KEY (customer_id, card_number)
);
If you are using the Java driver, refer to Paging.
Note that Cassandra does not support direct offsets; pages are read sequentially. If you have to use offsets in your queries, you might want to revisit your data model. You could create a composite partition key that includes the row number as an additional column on top of your existing partition key column.
You simply can't select the Nth row from a table, because a Cassandra table is made of partitions, and you can order rows within a partition but not the partitions themselves. Going with paging will walk through the whole table, but there will be no chronological order to the rows selected using such an approach (disregarding the fact that the partitions can change while you are paging through them).
If you want to select row number N from Cassandra, you need to implement an auto-increment field at the application level and use it as a key.
There are ways to do that with Cassandra, using lightweight transactions for example, but it has a high cost from a performance perspective. See several solutions here:
How to create auto increment IDs in Cassandra

How to solve 'Secondary indexes cardinality' for cfs.inode?

In OpsCenter 6.0.3, I got the following problem:
The above figure appeared after clicking 'Services' -> 'Best Practice Service' -> 'Performance Service - Table Metrics Advisor' -> 'Secondary indexes cardinality' in turn.
The inode table viewed in DevCenter looks as follows:
As far as I know, inode tracks each file's metadata and block locations. But what can I do to fix this problem?
OpsCenter version: 6.0.3, Cassandra version: 2.1.15.1423, DataStax Enterprise version: 4.8.10
Don't use a secondary index for a high-cardinality column.
High-cardinality refers to columns with values that are very uncommon or unique. High-cardinality column values are typically identification numbers, email addresses, or user names. An example of a data table column with high-cardinality would be a USERS table with a column named USER_ID.
Problems using a high-cardinality column index (from the DataStax docs):
If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Solution:
Create another table with that column in the partition key.
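Reusing the USERS example quoted above, a sketch of that approach could look like the following (the extra columns are hypothetical); you then query the new table by its partition key instead of going through a secondary index:
CREATE TABLE users_by_id (
    user_id bigint,
    email text,
    name text,
    PRIMARY KEY (user_id)
);
SELECT email, name FROM users_by_id WHERE user_id = 42;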

Secondary indexes for low-cardinality columns in Cassandra

We have a table with 15 million records, and ours is a 10-node Cassandra cluster. We have a column which has close to 20 distinct values that repeat across rows. Is it advisable to build a secondary index on this column?
Assuming a completely uniform distribution on that column, each column value would map to 750,000 rows. Now, while the DataStax doc on When To Use An Index states that...
built-in indexes are best on a table having many rows that contain the indexed value.
750,000 rows certainly qualifies as "many." But even given that, remember that you're also talking about 14,250,000 rows that Cassandra has to ignore when fulfilling your query.
Also, unless you have an RF of 10 (and I doubt that you would with 10 nodes), you are going to incur network time as Cassandra works between all of the different nodes required to fulfill your query. For 750,000 rows, that's probably going to time out.
The only way I think this could be efficient, would be to first restrict your query by a partition key. Using the secondary index while also restricting with a partition key will help Cassandra find your rows more quickly. Even so, with a dataset that big, I would re-evaluate your data model and try to figure out a different table to fulfill that query without requiring a secondary index.
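As a sketch of that partition-key-plus-index combination (the purchases table, its customer_id partition key, and the roughly 20-value category column are all hypothetical here):
CREATE INDEX IF NOT EXISTS purchases_category_idx ON purchases (category);
SELECT * FROM purchases
WHERE customer_id = 1001 AND category = 'electronics';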

Cassandra super column structure

I'm new to Cassandra, and I'm not familiar with super columns.
Consider this scenario: suppose we have some fields of a customer entity, like
Name
Contact_no
address
and we can store all these values in normal columns. I want to arrange that when a person moves from one location to another (the representative field could store the longitude and latitude), the values are stored consecutively with respect to the customer's location. I think we can do this with super columns, but I'm confused about how to design the schema to accomplish this.
Please help me create this schema and understand the concepts behind super columns.
Super columns are really not recommended anymore... they are still used, but more and more people have switched to composite columns. For example, playOrm uses this concept for indexing. If I am indexing an integer, an indexing row may look like this:
rowkey = 10.pk56 10.pk39 11.pk50
Here the column name type is a composite of an integer and a string. These rows can hold up to about 10 million columns, though I have only run experiments up to 1 million myself. For example, playOrm's queries use these types of indexes to do a query that took 60 ms on 1,000,000 rows.
With playOrm, you can do scalable relational models in NoSQL... you just need to figure out how to partition your data correctly, as you can have as many partitions as you want in each table, but a partition should really not be over 10 million rows.
Back to the example though: if you have a table with columns numShares, price, username, and age, you may want to index numShares, and the above row would be that index, so you could grab the index by key OR, better yet, grab all column names with numShares > 20 and numShares < 50.
Once you have those columns, you can then get the second half of each column name, which is the primary key. The reason the primary key is NOT a value is that, as in the example above, there are two rows, pk56 and pk39, with the same 10, and you can't have two columns named 10, but you can have a 10.pk56 and a 10.pk39.
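In modern CQL terms, that composite-column index row maps to a table with a compound primary key (the names below are illustrative, not playOrm's actual layout):
CREATE TABLE num_shares_index (
    index_name text,
    num_shares int,
    row_key text,
    PRIMARY KEY (index_name, num_shares, row_key)
);
SELECT num_shares, row_key FROM num_shares_index
WHERE index_name = 'numSharesIdx' AND num_shares > 20 AND num_shares < 50;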
later,
Dean

Resources