Firestore Batch delete while reading document

I was wondering what the best way is to delete my collection (which has subcollections) in Firestore. I need to delete the entire collection (using code such as this: https://firebase.google.com/docs/firestore/solutions/delete-collections) every day at 20:00 UTC.
My concern is that users will still be able to query and write documents in the collection and its subcollections while it is being deleted. If they try to read, update, or delete a document in the collection while the batch delete is running, will this cause any problems?
I have thought of writing Firestore security rules that block reads when the request time is between 20:00 and 20:05 UTC, but that seems a bit hacky and I am not sure it's even possible.
Could anyone help me with how to handle potential reads happening at the same time as the batch delete?
Thanks a lot.
Side note: the delete-collections code mentions a required token, functions.config().fb.token. Is this always the same if the code is running on Cloud Functions?

There are two main approaches I can think of here:
Retry deleting the collection after the first pass, to get any documents created while your code was deleting.
Block the users from writing, with a global lock in security rules.
Even if you do the second, I'd still also do the first - as it's very easy to miss a write when there are enough users.
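
The first approach can be sketched as a loop that keeps running delete passes until one of them finds nothing. This is only a minimal sketch, not the code from the linked Firebase guide: listBatch and deleteBatch are hypothetical callbacks that you would back with a limited Firestore query and a batched write, and the fakeDocs array is an in-memory stand-in for the collection.

```javascript
// Sketch: repeat delete passes until a pass finds no documents.
// listBatch and deleteBatch are hypothetical callbacks (assumptions), e.g.
// backed by a limited Firestore query and a batched write.
async function deleteUntilEmpty(listBatch, deleteBatch, maxPasses = 100) {
  let deleted = 0;
  for (let pass = 0; pass < maxPasses; pass++) {
    const batch = await listBatch();
    if (batch.length === 0) return deleted; // nothing left, including late writes
    await deleteBatch(batch);
    deleted += batch.length;
  }
  throw new Error('Collection still not empty after ' + maxPasses + ' passes');
}

// In-memory stand-in for a collection, for illustration only.
const fakeDocs = ['a', 'b', 'c', 'd', 'e'];
let lateWriteDone = false;
const listBatch = async () => fakeDocs.slice(0, 2);
const deleteBatch = async (batch) => {
  for (const id of batch) fakeDocs.splice(fakeDocs.indexOf(id), 1);
  // Simulate a user writing a document while the delete is running.
  if (!lateWriteDone) { fakeDocs.push('late'); lateWriteDone = true; }
};

const deletedCount = deleteUntilEmpty(listBatch, deleteBatch);
```

The maxPasses cap is there so that a steady stream of writes cannot keep the job running forever, which is also why combining this with a write lock is worthwhile.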

Related

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues each request in RabbitMQ. This seems clumsy to me: if the Node process fails at any time, I have to restart it (I currently keep track of the number of users skipped per query, and can resume from the last skipped count found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether the email has been sent to that person, but in my current situation I can't do so: I can neither create a new table nor add a new row to an existing table. (This seems to me like an idempotency problem?)
How would you approach this problem? Do you think compound indexes might help? Please explain.

The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you will need to store who got the email somewhere so if you are given hard restrictions on not storing this in your main database, get creative with another storage mechanism.
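
As a rough illustration of the "store who got the email" idea, the sketch below checks a separate record of sent user IDs before queueing anything. The names processUsers, sentLog, and enqueue are hypothetical; sentLog is an in-memory Set standing in for whatever external storage you are allowed to use.

```javascript
// Sketch: skip users already recorded in a separate "sent" store, so a
// restart neither repeats nor drops emails. sentLog is an in-memory stand-in
// (assumption) for whatever external storage you are allowed to use.
function processUsers(users, sentLog, enqueue) {
  let queued = 0;
  for (const user of users) {
    if (sentLog.has(user.id)) continue; // already handled in an earlier run
    enqueue(user);                      // e.g. publish to RabbitMQ
    sentLog.add(user.id);               // record only after queueing succeeds
    queued++;
  }
  return queued;
}

const users = [{ id: 1 }, { id: 2 }, { id: 3 }];
const sentLog = new Set([1]);           // user 1 was emailed before a crash
const queue = [];
const queuedNow = processUsers(users, sentLog, (u) => queue.push(u.id));
const queuedOnRerun = processUsers(users, sentLog, (u) => queue.push(u.id));
```

A crash between queueing and recording can still produce one duplicate for the user in flight, so this gives at-least-once delivery rather than exactly-once.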

About speedy mass deletion of users in Kentico10

I want to delete more than 1 million user records in Kentico 10.
I tried deleting them with UserInfoProvider.DeleteUser(); (see the following documentation), but a simple calculation suggests it would take nearly a year.
https://docs.kentico.com/api10/configuration/users#Users-Deletingauser
Since that is only a rough estimate, I think it would actually be somewhat shorter, but it would still take a long time.
Is there any other way to delete users in a short time?

Of course make sure you have a backup of your database before you do any of this.
Depending on the features you're using, you could get away with a SQL statement. Due to the complexities of the references of a user to multiple other tables, the SQL statement can get pretty complex and you need to make sure you remove the other references before removing the actual user record.
I'd highly recommend an API approach and delete users through the API so it removes all the references for you automatically. In your API calls make sure you wrap the delete action in the following so it stops the logging of the events and other labor-intensive activities not needed.
using (var context = new CMSActionContext())
{
    // Turn off event logging and other side effects for this block
    context.DisableAll();
    // delete your user
}
In your code, I'd only select the top 100 or so at a time and delete them in batches. Assuming you don't need this done all in one run, you could let the scheduled task run your custom code for a week and see where you're at.
If all else fails, figure out how to delete the user and the 70+ foreign key references and you'll be golden.

Why don't you delete them with SQL query? - I believe it will be much faster.

Bulk delete functionality exists starting from version 10.
UserInfoProvider has a BulkDelete method. In fact, any InfoProvider object inherited from AbstractInfoProvider has a BulkDelete method.

MongoDB: can I trigger secondary replication only at a given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about my server setup.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write operations. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time or, better, is triggered before the read operations start.
I do all my processing from Node.js, so one option I see is to export the data created in some period, like [yesterday, today], import it into the read instance myself, and run the calculations once the import is done. I looked at replica sets and master/slave replication as possible setups, but I couldn't figure out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?

Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell in advance how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
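
To make the $match plus $out idea concrete, here is a minimal sketch that builds such a pipeline in Node.js. The field name createdAt, the target collection name logs_last_month, and the helper buildSnapshotPipeline are assumptions for illustration; adapt them to your own schema.

```javascript
// Sketch: build a $match + $out pipeline for the last 30 days of logs.
// The field name "createdAt" and the collection name "logs_last_month"
// are assumptions for illustration.
function buildSnapshotPipeline(now, days = 30, target = 'logs_last_month') {
  const since = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return [
    { $match: { createdAt: { $gte: since } } }, // keep only recent logs
    { $out: target },                           // write results to a collection
  ];
}

const pipeline = buildSnapshotPipeline(new Date('2015-06-30T00:00:00Z'));
// With the official Node.js driver this would be run roughly as:
//   await db.collection('logs').aggregate(pipeline).toArray();
```

The reporting aggregations would then be pointed at the target collection instead of the live log collection.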
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

Node.js - Scaling with Redis atomic updates

I have a Node.js app that performs the following:
get data from Redis
perform a calculation on the data
write the new result back to Redis
This process may take place several times per second. The issue I now face is that I want to run multiple instances of this process, and I am seeing out-of-date data being written, because each instance updates after another has already read the last value.
How would I make the above process atomic?
I cannot add the operation to a transaction within Redis as I need to get the data (which would force a commit) before I can process and update.
Can anyone advise?

Apologies for the lack of clarity with the question.
After further reading, I can indeed use transactions. What I was struggling to understand was that I need to separate the read from the update, wrap only the update in the transaction, and use WATCH on the key before the read. This causes the update transaction to fail if another update has taken place in the meantime.
So the workflow is:
WATCH key
GET key
MULTI
SET key
EXEC
Hopefully this is useful for anyone else looking to do an atomic get and update.
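
Since a failed EXEC means the whole read, calculate, write cycle has to be repeated, the workflow usually sits inside a retry loop. The sketch below models that loop without a real Redis client: FakeRedis, execSet, and atomicUpdate are hypothetical names, and the version counter only imitates the rule that EXEC fails when a watched key has changed.

```javascript
// Sketch of the WATCH / GET / MULTI / SET / EXEC pattern with a retry loop.
// FakeRedis is an in-memory stand-in (assumption), not a real client: it only
// models "EXEC fails if the watched key changed since WATCH".
class FakeRedis {
  constructor() { this.data = new Map(); this.versions = new Map(); }
  version(key) { return this.versions.get(key) || 0; }
  get(key) { return this.data.get(key); }
  set(key, value) {
    this.data.set(key, value);
    this.versions.set(key, this.version(key) + 1);
  }
  // Returns false (like a null EXEC reply) if the key changed after WATCH.
  execSet(key, value, watchedVersion) {
    if (this.version(key) !== watchedVersion) return false;
    this.set(key, value);
    return true;
  }
}

function atomicUpdate(redis, key, compute, beforeExec = () => {}, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const watched = redis.version(key);      // WATCH key
    const current = redis.get(key);          // GET key
    const next = compute(current);           // calculation happens outside MULTI
    beforeExec(attempt);                     // test hook: lets a rival writer sneak in
    if (redis.execSet(key, next, watched)) { // MULTI, SET key, EXEC
      return { value: next, attempts: attempt };
    }
  }
  throw new Error('Gave up after ' + maxAttempts + ' conflicting updates');
}

const redis = new FakeRedis();
redis.set('counter', 10);
// A second process writes 20 just before our first EXEC, forcing one retry.
const result = atomicUpdate(redis, 'counter', (v) => v + 1, (attempt) => {
  if (attempt === 1) redis.set('counter', 20);
});
```

With a real client, a null reply from EXEC is the signal to go around the loop again.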

Redis supports atomic transactions: http://redis.io/topics/transactions

Users last-access time with CouchDB

I am new to CouchDB, but that is not related to the problem. The question is simple, yet not clear to me.
For example: Boris was on the site 5 seconds ago, and Ivan sees this when viewing Boris's profile.
How do I correctly implement this feature (users' last-access time)?
The problem is that if we update the user's profile document in CouchDB (e.g. a last_access_time property) each time a page is refreshed, then we will have the most up-to-date information (with MySQL we did it this way), but on the other hand, the document's _rev will be somewhere around 100000+ by the end of the day.
So, how do you do that or do you have any ideas?

This is not a full answer but a possible optimization. It will work in addition to any other answers here.
Instead of storing every new timestamp, update the stored timestamp only if it has changed by more than a threshold, e.g. 5 seconds or 60 seconds.
Assume a user refreshes every second for a day. That is 86,400 updates. But if you only update the timestamp at 5-second intervals, that is 17,280; at 60-second intervals it is 1,440.
You can do this on the client side. When you want to update the timestamp, fetch the current document and check the old timestamp. If it is less than 5 seconds old, don't do anything. Otherwise, update it normally.
You can also do it on the server side. Write an _update function in CouchDB, which you can query like e.g. POST /db/_design/my_app/_update/last-access/the_doc_id?time=2011-01-31T05:05:31.872Z. The update function will do the same thing: check the old timestamp, and either do nothing, or update it, depending on the elapsed time.
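
A server-side version might look roughly like the update function below. It is a sketch under assumptions: the field name last_access_time is taken from the question, while the function name lastAccessUpdate, the 5-second threshold, and the response strings are made up for illustration. In a design document the function would be stored as an anonymous function under the updates key.

```javascript
// Sketch of a CouchDB update function that throttles last-access writes.
// The field name "last_access_time" comes from the question; the 5-second
// threshold and response strings are assumptions.
function lastAccessUpdate(doc, req) {
  const MIN_INTERVAL_MS = 5000;
  if (!doc) return [null, 'no such document'];
  const incoming = new Date(req.query.time).getTime();
  if (isNaN(incoming)) return [null, 'bad time parameter'];
  const previous = doc.last_access_time
    ? new Date(doc.last_access_time).getTime()
    : 0;
  if (incoming - previous < MIN_INTERVAL_MS) {
    return [null, 'unchanged'];   // returning null as the doc saves nothing
  }
  doc.last_access_time = req.query.time;
  return [doc, 'updated'];        // CouchDB saves the returned doc
}

// Local check with plain objects standing in for the document and request.
const doc = { _id: 'boris', last_access_time: '2011-01-31T05:05:30.000Z' };
const tooSoon = lastAccessUpdate(doc, { query: { time: '2011-01-31T05:05:31.872Z' } });
const later = lastAccessUpdate(doc, { query: { time: '2011-01-31T05:05:36.000Z' } });
```

Because the skipped case returns null instead of a document, it creates no new revision.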

If there is a (large) part of a document that is relatively static, and a (small) part that is highly dynamic, I would consider splitting it into two different documents.
Another option might be to use something more suited to the high write throughput of small pieces of data of that nature such as Redis or possibly MongoDB, and (if necessary) have a background task to occasionally write the info to CouchDB.

CouchDB has no problem with rapid document updates. Just do it, like MySQL. High _rev is no problem.
The only thing is, you have to be responsible about your couch from day 1. All CouchDB users must do this anyway, however you may have to do it sooner. (Applications with few updates have lower risk of a full disk, so developers can postpone this work.)
Poll your database and run compaction if it needs it (based on size, document count, seq_id number).
Poll your views and run compaction on them too.
Always have enough disk capacity and i/o bandwidth to support compaction. Mathematical worst-case: you need 2x the database size, and 2x the write speed; however, most applications require less. Since you are updating documents, not adding them, you will need way less.
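
The "poll and compact if needed" step could be driven by a check like the one below, applied to the JSON returned by GET /{db}. This is a sketch with assumed thresholds: needsCompaction is a hypothetical helper, and the 2x ratio and 100 MB floor are arbitrary values to tune for your own data.

```javascript
// Sketch: decide whether a database looks worth compacting, from the JSON
// returned by GET /{db}. The sizes.file / sizes.active fields are what recent
// CouchDB versions report (older ones use disk_size / data_size); the 2x
// ratio and 100 MB floor are arbitrary assumptions.
function needsCompaction(info, maxRatio = 2, minFileBytes = 100 * 1024 * 1024) {
  const file = info.sizes ? info.sizes.file : info.disk_size;
  const active = info.sizes ? info.sizes.active : info.data_size;
  if (!file || !active || file < minFileBytes) return false;
  return file / active >= maxRatio;
}

const MB = 1024 * 1024;
const bloated = needsCompaction({ sizes: { file: 500 * MB, active: 100 * MB } });
const healthy = needsCompaction({ sizes: { file: 150 * MB, active: 100 * MB } });
const legacy = needsCompaction({ disk_size: 400 * MB, data_size: 100 * MB });
```

When the check returns true, a POST to /{db}/_compact starts compaction for that database.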
