How to retrieve a very big cassandra table and delete some unuse data from it? - cassandra

I hava created a cassandra table with 20 million records. Now I want to delete the expired data decided by one none primary key column. But it doesn't support the operation on the column. So I try to retrieve the table and get the data line by line to delete the data.Unfortunately,it is too huge to retrieve. Otherwise,I couldn't delete the whole table, how could I achieve my goal?

Your question is actually, how to get the data from the table in bulks (also called pagination).
You can do that by selecting different slices from your primary key: For example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want to do with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, will be to use fetch_size. You can see a Python example here and a Java example here.

Related

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?
I have a table definition like this
CREATE TABLE domain(
domain_name text,
status int,
last_scanned_date long
PRIMARY KEY(text,last_scanned_date)
)
My requirement is to get all the domains which are not scanned in the last 24 hours. I wrote the following query, but this query is not efficient as Cassandra is trying to fetch entire dataset because of ALLOW FILTERING
SELECT * FROM domain where last_scanned_date<=<last24hourstimeinmillis> ALLOW FILTERING;
Then I decided to do it in two queries
1st query:
SELECT DISTINCT name from domain;
2nd query:
Use IN operator to query domains which are not scanned i nlast 24 hours
SELECT * FROM domain where
domain_name IN('domain1','domain2')
AND
last_scanned_date<=<last24hourstimeinmillis>
My second approach works, but comes with an extra overhead of querying first for distinct values.
Is there any better approach than this?
You should update your structure table definition. Currently, you are selecting domain name as your partition key while you can not have more than 2 billion records in single Cassandra partition.
I would suggest you should use your time as part of your partition key. If you are not going to receive more than 2 billion requests per day. Try to use day since epoch as the partition key. You can do composite partition keys but they won't be helpful for your query.
While querying you have to scan at max two partitions with an additional filter in a query or in your application filtering out results which do not belong to a
the range you have specified.
Go over following concepts before finalizing your design.
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html
https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
Cassandra can effectively perform range queries only inside one partition. The same is for use of the aggregations, such as DISTINCT. So in your case you'll need to have only one partition that will contain all data. But that's is bad design.
You may try to split this big partition into smaller ones, by using TLDs as separate partition keys, and perform fetching in parallel from every partition - but this also will lead to imbalance, as some TLDs will have more sites than other.
Another issue with your schema is that you have last_scanned_date as clustering column, and this means that when you update last_scanned_date, you're effectively insert a new row into database - you'll need to explicitly remove row for previous last_scanned_date, otherwise the query last_scanned_date<=<last24hourstimeinmillis> will always fetch old rows that you already scanned.
Partially your problem with your current design could be solved by using the Spark that is able to perform effective scanning of full table via token range scan + range scan for every individual row - this will return only data in given time range. Or if you don't want to use Spark, you can perform token range scan in your code, something like this.

Cassandra Altering the table

I have a table in Cassandra say employee(id, email, role, name, password) with only id as my primary key.
I want to ...
1. Add another column (manager_id) in with a default value in it
I know that I can add a column in the table but there is no way i can provide a default value to that column through CQL. I can also not update the value for manager_id later since I need to know the id (Partition key and the values are randomly generated unique values which i don't know) to update the row. Is there any way I can achieve this?
2. Rename this table to all_employee.
I also know that its not allowed to rename a table in cassandra. So I am trying to copy the data of table(employee) to csv and copy from csv to new table (all_employee) and deleting the old table(employee). I am doing this through an automated script with cql queries in it and script works fine but will fail if it gets executed again(Which i can not restrict) since the table employee will not be there once its deleted. Essentially I am looking for "If exists" clause in COPY query which is not supported in cql. Is there any other way I can achieve the outcome?
Please note that the amount of data in the table is very small so performance in not an issue.
For #1
I dont think cassandra support default column . You need to do that from your appliaction. Write some default value every time you insert a row.
For #2
you can check if the table exists before trying to copy from it.
SELECT your_table_name FROM system_schema.tables WHERE keyspace_name='your_keyspace_name';

Cassandra pagination and token function; selecting a partition key

I've been doing a lot of reading lately on Cassandra data modelling and best practices.
What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.
My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.
From this post: https://stackoverflow.com/a/24953331/1224608
I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.
However, if I'm using the token function to page through results, eg:
SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;
That would mean that the number of results returned from my partition may not necessarily match the number of results I show on my page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results.
The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value in my page and retrieve the next query with an almost identical query:
SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;
But I'm not certain if it's good practice to have a unique partition key for a complex table.
Any help is greatly appreciated!
But I'm not certain if it's good practice to have a unique partition key for a complex table.
It depends on requirement and Data Model how you should choose your partition key. If you have one key as partition key it has to be unique otherwise data will be upsert (overridden with new data). If you have wide row (a clustering key), then make your partition key unique (a key that appears once in a table) will not serve the purpose of wide row. In CQL “wide rows” just means that there can be more than one row per partition. But here there will be one row per partition. It would be better if you can provide the schema.
Please follow below link about pagination of Cassandra.
You do not need to use tokens if you are using Cassandra 2.0+.
Cassandra 2.0 has auto paging. Instead of using token function to
create paging, it is now a built-in feature.
Results pagination in Cassandra (CQL)
https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
Saving and reusing the paging state
You can use pagingState object that represents where you are in the result set when the last page was fetched.
EDITED:
Please check the below link:
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows
I recently did a POC for a similar problem. Maybe adding this here quickly.
First there is a table with two fields. Just for illustration we use only few fields.
1.Say we insert a million rows with this
Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI. Assuming that there are hundred entries 10 pages each.
For this we update the table with a column called page_no.
Create a secondary index for this column.
Then do a one time update for this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no as value 10.
Since we can query on a secondary index each page can be queried independently.
Code is self explanatory and here - https://github.com/alexcpn/testgo
Note caution on how to use secondary index properly abound. Please check it. In this use case I am hoping that i am using it properly. Have not tested with multiple clusters.
"In practice, this means indexing is most useful for returning tens,
maybe hundreds of results. Bear this in mind when you next consider
using a secondary index." From http://www.wentnet.com/blog/?p=77

Astyanax retrieve by UUID

How do we retrieve a row in Cassandra using Astyanax?
I have a web application which requires pagination to be done on the server side, the db is cassandra. The row key is a UUID and I have few columns within a row, so I am trying to do pagination on the row keys.
I have put together a solution with which i am not completely happy. The issue is that when i do my first search based on the search filter i get from UI, I don't know the UUID of the first row, So I prepare a Query which gives me first 6 records, I store the key of the 6th record in a MAP and put it in a session, so when user from UI request for a second page, i retrieve this key(UUID) and take it as a start for the next set of records to be retrieved. I was trying to find a cleaner approach.
EDIT in response to question changes...
In that case, sounds like you are doing it just fine. Also sounds like you are using OOP then as otherwise the rows are not in order. While playOrm's solution is more elegant returning you a cursor that you store in the session, I think what you have is just fine.
EDIT since our code changed
line 74 in this link is how we do it(you can drill down into that cursor class basically to see what we do).
https://github.com/deanhiller/playorm/blob/master/src/main/java/com/alvazan/orm/layer9z/spi/db/cassandra/CassandraSession.java
which if you have 10 row keys you just pass in the list of keys.
I am not sure what you mean by pagination???? Are you saying you have a from and to row key and want the rows between them? Are you using order the cluster by row key then?
Another completely different direction to go for ordering is playOrm though which can do S-SQL and if things are partitioned, you can do joins and other such(that link above is to one of the files in playOrm actually).
I am really not sure what you mean "you don't have the row key with you"
later,
Dean

Azure Table Storage: Order by

I am building a web site that has a wish list. I want to store the wish list(s) in azure table storage, but also want the user to be able to sort their wish list, when viewing it, a number of different ways - date added, date added reversed, item name etc. I also want to implement paging which I believe I can implement by making use of the continuation token.
As I understand it, "order by" isn't implemented and the order that results are returned from table storage is based on the partition key and row key. Therefore if I want to implement the paging and sorting that I describe, is the best way to implement this by storing the wish list multiple times with different partition key / row key?
In this simple case, it is likely that the wish list won't be that large and I could in fact restrict the maximum number of items that can appear in the list, then get rid of paging and sort in memory. However, I have more complex cases that I also need to implement paging and sorting for.
On today’ s hardware having 1000’s of rows to hold, in a list, in memory and sort is easily supportable. What the real issue is, how possible is it for you to access the rows in table storage using the Keys and not having to do a table scan. Duplicating rows across multiple tables could get quite cumbersome to maintain.
An alternate solution, would be to temporarily stage your rows into SQL Azure and apply an order by there. This may be effective if your result set is too large to work in memory. For best results the temporary table would need to have the necessary indexes.
Azure Storage keeps entities in lexicographical order, indexed by Partition Key as primary index and Row Key as secondary index. In general for your scenario it sounds like UserId would be a good fit for a partition key, so you have the Row Key to optimize for per each query.
If you want the user to see the wish lists latest on top, then you can use the log tail pattern where your row key will be the inverted Date Time Ticks of the DateTime when the wish list was entered by the user.
https://learn.microsoft.com/azure/storage/tables/table-storage-design-patterns#log-tail-pattern
If you want user to see their wish lists ordered by the item name you could have your item name as your row key, and so the entities will naturally sorted by azure.
When you are writing the data you may want to denormalize the data and do multiple writes with these different row key schemas. Since you will have the same partition key as user id, you can at that stage do a batch insert operation and not worry about consistency since azure table batch operations are atomic.
To differentiate the different rowkey schemas, you may want to prepend each with a const string value. Like your inverted ticks row key value for instance woul dbe something like "InvertedTicks_[InvertedDateTimeTicksOfTheWishList]" and your item names row key value would be "ItemName_[ItemNameOfTheWishList]"
Why not do all of this in .net using a List.
For this type of application I would have thought SQL Azure would have been more appropriate.
Something like this worked just fine for me:
List<TableEntityType> rawData =
(from c in ctx.CreateQuery<TableEntityType>("insysdata")
where ((c.PartitionKey == "PartitionKey") && (c.Field == fieldvalue))
select c).AsTableServiceQuery().ToList();
List<TableEntityType> sortedData = rawData.OrderBy(c => c.DateTime).ToList();

Resources