Cassandra batch select and batch update

I have a requirement to update all users with a specific value in a job.
I have millions of users in my Cassandra database. Is it okay to query a million users first and then do some kind of batch update, or is there an existing implementation for this kind of work? I am using the Hector API to interact with Cassandra. What would be the best way to do this?

You never want to fetch 1 million users and keep them locally. Ideally you want to iterate over all those user keys using a range query; in Hector this is a RangeSlicesQuery. There is a good example here:
http://irfannagoo.wordpress.com/2013/02/27/hector-slice-query-options-with-cassandra/
Use null for the start and end key, and also add rangeQuery.setRowCount(100) to fetch 100 rows at a time.
Do this inside a loop. The first time, fetch with null as both start and end key; the last key you get from the first result set becomes the start key of your next query, and you continue paginating like this.
You can then use batch mutate and update in batches.
http://hector-client.github.io/hector/source/content/API/core/1.0-1/me/prettyprint/cassandra/service/BatchMutation.html
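A rough sketch of that loop with Hector might look like the following; the column family ("Users"), column name, and value are placeholders for your schema, and the rest follows the RangeSlicesQuery / batch mutate pattern described above.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.RangeSlicesQuery;

public class BulkUserUpdate {

    private static final StringSerializer SE = StringSerializer.get();
    private static final int PAGE_SIZE = 100;

    public static void updateAllUsers(Keyspace keyspace) {
        String startKey = null;                                 // first page: null start and end key
        while (true) {
            RangeSlicesQuery<String, String, String> rangeQuery =
                    HFactory.createRangeSlicesQuery(keyspace, SE, SE, SE);
            rangeQuery.setColumnFamily("Users");                // placeholder column family
            rangeQuery.setKeys(startKey, null);
            rangeQuery.setReturnKeysOnly();                     // only the keys are needed here
            rangeQuery.setRowCount(PAGE_SIZE);

            OrderedRows<String, String, String> rows = rangeQuery.execute().get();
            if (rows.getCount() == 0) {
                break;
            }

            // batch mutate: one insertion per user, executed once per page
            Mutator<String> mutator = HFactory.createMutator(keyspace, SE);
            for (Row<String, String, String> row : rows) {
                mutator.addInsertion(row.getKey(), "Users",
                        HFactory.createStringColumn("some_column", "new_value")); // placeholder update
            }
            mutator.execute();

            if (rows.getCount() < PAGE_SIZE) {
                break;                                          // last page reached
            }
            // the last key of this page becomes the start key of the next page
            // (it is returned again on the next page, which is harmless for an idempotent update)
            startKey = rows.peekLast().getKey();
        }
    }
}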

Related

What is the "correct" way to achieve this postgres + redis data retrieval?

I have latest versions of postgres and redis being used from a nodejs server app.
The app makes thousands of PG queries to retrieve rows from a table of items, each item has a foreign key pointing to the row id of the user in the users table.
When returning a set of data, I need to get the actual username of the person that owns the item row, for every row. Each set of rows is 50 to 250 ish in count and it's possible that many of those are owned by the same person (and therefore the same username).
I am wondering which is the "correct" way to handle this.
Option one: I do a join on my PG query, or make another PG query to fetch each username.
Option two: I store the users and their ids in redis and fetch those once PG has returned the set of items.
I am leaning towards the latter, but if that is what I end up doing, there is likely more than one piece of data I would like to associate with every user, some of which should expire after some time. So in the end I might be doing half a dozen redisclient.set() calls per user and the same number when fetching that data. Is that still more efficient than using a join in my PG query?
If something needs elaborating, please let me know.
Thanks in advance!

Azure CosmosDB Document - Bulk Deletion

Recently, I was asked to delete a few million records from a total of 14 TB of Cosmos DB data.
When I looked on the internet, I found a stored procedure to do the bulk delete, which works based on the partition key.
My scenario is that we have 4 attributes in each document:
1. id
2. number [ Partition Key]
3. startdate
4. enddate
The requirement is to delete the documents based on startdate.
Delete * from c where c.startdate >= '' and c.startdate <= ''
The above query goes through all the partitions and deletes the records.
I also looked at running the query in Databricks to pull all the Cosmos DB records into a temp DataFrame, add a TTL attribute, and then upsert to Cosmos DB again.
Is there a better way to achieve the same?
Generally speaking, bulk deletion has the methods listed in this article.
Since your data is very large, bulkDelete.js may not be suitable any more; after all, a stored procedure has an execution time limit. In addition to the solution described in your question, I also suggest that you could use SDK code to write a method yourself:
Set maxItemCount = 100 and EnableCrossPartitionQuery = true in your query request. You can then get a continuation token for the next page of data. Process the data in batches; you could also borrow some snippets of code from the .NET bulk delete library (GeneratePartitionKeyDocumentIdTuplesToBulkDelete and BulkDeleteAsync).
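The snippet below is a minimal sketch of that page-then-delete loop, written against the Azure Cosmos Java SDK v4 rather than the older .NET SDK the answer refers to (v4 runs cross-partition queries by default, so there is no EnableCrossPartitionQuery flag). The id, number and startdate attribute names come from the question; the container wiring and the assumption that the partition key is stored as a string are illustrative.

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosQueryRequestOptions;
import com.azure.cosmos.models.FeedResponse;
import com.azure.cosmos.models.PartitionKey;
import com.azure.cosmos.models.SqlParameter;
import com.azure.cosmos.models.SqlQuerySpec;
import com.fasterxml.jackson.databind.JsonNode;

import java.util.Arrays;
import java.util.Iterator;

public class DateRangeDelete {

    public static void deleteByStartDate(CosmosContainer container, String from, String to) {
        SqlQuerySpec query = new SqlQuerySpec(
                "SELECT c.id, c.number FROM c WHERE c.startdate >= @from AND c.startdate <= @to",
                Arrays.asList(new SqlParameter("@from", from), new SqlParameter("@to", to)));

        String continuationToken = null;
        do {
            // fetch one page of 100 matching documents (the equivalent of maxItemCount = 100)
            Iterator<FeedResponse<JsonNode>> pages = container
                    .queryItems(query, new CosmosQueryRequestOptions(), JsonNode.class)
                    .iterableByPage(continuationToken, 100)
                    .iterator();
            if (!pages.hasNext()) {
                break;
            }
            FeedResponse<JsonNode> page = pages.next();

            // delete the batch, addressing each document by id + partition key ("number",
            // assumed here to be stored as a string)
            for (JsonNode doc : page.getResults()) {
                container.deleteItem(
                        doc.get("id").asText(),
                        new PartitionKey(doc.get("number").asText()),
                        new CosmosItemRequestOptions());
            }

            continuationToken = page.getContinuationToken();    // null once the last page is reached
        } while (continuationToken != null);
    }
}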

Managing multiple database connections and data with foreachPartition

I'll try to make this as clear as possible so an example isn't required, as this has to be a concept I didn't grasp properly and am struggling with, rather than a problem with the data or the Spark code itself.
I'm required to insert each city's data into its own database (MongoDB), and I'm trying to perform those upserts as fast as possible.
Consider a sample DataFrame with the following columns, where I want to do some upserts against MongoDB based on, for example, year, city and zone:
year - city - zone - num_business - num_vehicles
Having grouped by those columns, all that remains is to perform the upsert into the DB.
Using the MongoDB Driver I'm required to instantiate several WriteConfigs to cope with multiple databases (1 database per city).
// the 'getDatabaseWriteConfigsPerCity' method filters the 'df' so it only contains the docs from a single city.
for (cityDBConnection <- getDatabaseWriteConfigsPerCity(df)) {
  cityDBConnection.getDf.foreach(
    ... // set MongoDB upsert criteria.
  )
}
Doing it that way works, but more performance can still be gained by using foreachPartition, as I hope that the records within the DF are spread across the executors and more data is being upserted concurrently.
However, I get erroneous results when using foreachPartition. Erroneous because they seem incomplete: counters are way off and such.
I suspect this is because, among the partitions, the same keys end up in different partitions, and it's not until those are merged on the master that they are inserted into MongoDB as a single record.
Is there any way I can make sure partitions contain the total of documents related to an upsert key?
Don't really know if I'm being clear enough, but if it's still too complicated I will update as soon as possible.
Is there any way I can make sure partitions contain the total of documents related to an upsert key?
If you do:
df.repartition(col("city")).foreachPartition{...}
You can be sure that all records with the same city are in the same partition (but there will probably be more than 1 city per partition!)
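To make that concrete, here is a minimal sketch using Spark's Java API (the same idea applies to the Scala snippet above); the column names and the per-city write lookup are placeholders.

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

public class CityPartitionedUpsert {

    public static void upsertPerCity(Dataset<Row> grouped) {
        grouped
            .repartition(col("city"))   // all rows for a given city land in the same partition
            .foreachPartition((ForeachPartitionFunction<Row>) rows -> {
                // runs on the executor: open the MongoDB connection(s) here, once per partition
                while (rows.hasNext()) {
                    Row row = rows.next();
                    String city = row.getAs("city");
                    // ... look up the WriteConfig / database for `city` and issue the upsert;
                    // a partition can still hold several cities, so resolve the target per row
                }
                // close the connections before the partition function returns
            });
    }
}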

How to read only all row keys in cassandra efficiently...

Accessing all rows from all nodes in Cassandra would be inefficient. Is there a way to have some access to index.db, which already has the row keys? Is something of this sort supported built in to Cassandra?
There is no way to get all keys with one request without reaching every node in the cluster. There is, however, paging built into most Cassandra drivers. For example, in the Java driver: https://docs.datastax.com/en/developer/java-driver/3.3/manual/paging/
This puts less stress on each node, as it only fetches a limited amount of data per request. Each subsequent request continues from the last, so you will eventually touch every result for the query you're making.
Edit: This is probably what you want: How can I get the primary keys of all records in Cassandra?
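As a rough illustration of that driver-side paging with the 3.x Java driver (keyspace, table, and column names are placeholders), something along these lines fetches the keys one page at a time:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedKeyScan {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            Statement stmt = new SimpleStatement("SELECT DISTINCT partn_col_name FROM table_name");
            stmt.setFetchSize(100);   // 100 rows per page; the driver pages transparently

            // iterating the result set fetches the next page on demand
            for (Row row : session.execute(stmt)) {
                System.out.println(row.getString("partn_col_name"));
            }
        }
    }
}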
One possible option could be querying all the token ranges.
For example,
SELECT distinct <partn_col_name> FROM <table_name> where token(partn_col_name) >= <from_token_range> and token(partn_col_name) < <to_token_range>
With the above query, you can get all the partition keys available within a given token range. Adjust the token ranges depending on execution time.
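A sketch of that token-range scan with the 3.x Java driver, using the same placeholder names and the driver's (start, end] range convention, might look like this:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.TokenRange;

public class TokenRangeKeyScan {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            PreparedStatement ps = session.prepare(
                "SELECT DISTINCT partn_col_name FROM table_name "
              + "WHERE token(partn_col_name) > :start AND token(partn_col_name) <= :end");

            Metadata metadata = cluster.getMetadata();
            for (TokenRange range : metadata.getTokenRanges()) {
                for (TokenRange sub : range.unwrap()) {          // split ranges that wrap around the ring
                    BoundStatement bound = ps.bind()
                        .setToken("start", sub.getStart())
                        .setToken("end", sub.getEnd());
                    for (Row row : session.execute(bound)) {
                        System.out.println(row.getString("partn_col_name"));
                    }
                }
            }
        }
    }
}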

knex.js multiple updates optimised

Right now the way I am doing my workflow is like this:
get a list of rows from a Postgres database (let's say 10,000)
for each row I need to call an API endpoint and get a value, so 10,000 values are returned from the API
for each row that has a value returned, I need to update a field in the database: 10,000 rows updated
Right now I am doing an update after each API fetch, but as you can imagine this isn't the most optimized way.
What other option do I have?
The bottleneck in that code is probably fetching the data from the API. This trick only lets you send many small queries to the DB faster, without having to wait for a round trip between each update.
To do multiple updates in a single query, you could use common table expressions and pack multiple small queries into a single CTE query:
https://runkit.com/embed/uyx5f6vumxfy
knex
  .with('firstUpdate', knex.raw('?', [knex('table').update({ colName: 'foo' }).where('id', 1)]))
  .with('secondUpdate', knex.raw('?', [knex('table').update({ colName: 'bar' }).where('id', 2)]))
  .select(1)
The knex.raw trick there is a workaround, since the .with(string, function) implementation has a bug.
