https://firebase.google.com/docs/firestore/manage-data/enable-offline
How does Firestore work with offline data?
How are writes merged by many clients editing the same data offline, that then come online at the same time?
How long is offline data persisted? If my user uses my app for 5 years offline, then comes back online, will this be an issue? Do offline changes persist after device restarts?
Does query performance of offline data degrade as the data set gets larger?
I'm specifically interested in the web Firestore client.
Do all the language clients implement the above in the same manner?
Thanks.
How are writes merged by many clients editing the same data offline, that then come online at the same time?
The write operations take place on the Firebase servers in the order in which they happened locally. When several offline clients have edited the same data, the last operation (the most recent one) is the one that will be in the database once synchronization has occurred.
How long is offline data persisted? If my user uses my app for 5 years offline, then comes back online, will this be an issue?
The problem is not how long the device stays offline but how many operations you make while it is offline. While offline, Firestore keeps all the write operations in a queue. As this queue grows, local operations and app startup slow down. Nothing major, but over time these delays may add up. The bigger problem is that, until the device reconnects, the data on the server stays unmodified, and then what is the purpose of a realtime database? Firestore is really designed as an online database that can work for short to intermediate periods of being disconnected, not one that stays offline for 5 years. Besides that, in 5 years it might be a problem of compatibility rather than of the number of writes.
Do offline changes persist after device restarts?
Offline persistence is also called disk persistence. It is enabled by default in Cloud Firestore on Android and iOS (on the web it has to be enabled explicitly), and it means that recently listened data, as well as any pending writes from the app to the database, is persisted to disk. The data in this cache survives app restarts and device reboots.
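For the web client specifically, here is a minimal sketch of opting in to persistence, assuming the modular Firebase web SDK (older namespaced SDKs expose the equivalent firebase.firestore().enablePersistence()); the config values are placeholders:

```ts
import { initializeApp } from "firebase/app";
import { getFirestore, enableIndexedDbPersistence } from "firebase/firestore";

// Placeholder config: substitute your own project settings.
const app = initializeApp({ projectId: "your-project-id" });
const db = getFirestore(app);

// Opt in to IndexedDB persistence so cached documents and pending writes
// survive page reloads and browser restarts.
enableIndexedDbPersistence(db).catch((err) => {
  if (err.code === "failed-precondition") {
    // Persistence can only be enabled in one open tab at a time.
  } else if (err.code === "unimplemented") {
    // This browser does not support the features required for persistence.
  }
});
```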
Does query performance of offline data degrade as the data set gets larger?
Yes, it does, as explained above.
Do all the language clients implement the above in the same manner?
No. On iOS and Android the offline feature works fine, while on the web this feature is still experimental.
I am building a Node.js application which uses a few global variables to track data such as online users and statuses, information about other servers, and ongoing events, but having this information be lost in the event of server restart/crash is not ideal.
As these things are frequently read & modified, I figure it would not be a good idea to put that extra strain on my existing MySQL database. I have looked into Redis but unfortunately my application is hosted on a Windows server so I would have to use an old unsupported version of it which isn't ideal.
I'm currently considering setting up a NoSQL database such as MongoDB, but I'm not sure if this is an efficient solution and if it would be too much on my relatively weak server to have an application and 2 different databases running.
What would be the best solution for persistent storage of data that needs to be frequently accessed and updated by an application?
Making my comments into an answer...
If it's a reasonable amount of data, you can just write JSON to a single data file. No database required. Just overwrite the file with a new block of JSON to save the new state. This is very fast, efficient and simple. I've used this before as a quick and easy way to regularly save snapshots of state that you want to be able to reload if your server restarts. Read the state into memory upon server start, then use it from memory, then regularly save a new snapshot to disk however often your application desires.
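A minimal sketch of that snapshot approach in Node.js; the file name and save interval are illustrative choices, not requirements:

```ts
import { promises as fs } from "fs";

// Hypothetical snapshot file; pick a path that suits your deployment.
const SNAPSHOT_FILE = "./state.json";
let state: Record<string, unknown> = {};

// Load the last snapshot (if any) when the server starts.
export async function loadState(): Promise<void> {
  try {
    state = JSON.parse(await fs.readFile(SNAPSHOT_FILE, "utf8"));
  } catch {
    state = {}; // first run, or no snapshot written yet
  }
}

// Overwrite the file with a fresh block of JSON representing the current state.
export async function saveState(): Promise<void> {
  await fs.writeFile(SNAPSHOT_FILE, JSON.stringify(state), "utf8");
}

// Save a snapshot every 30 seconds (or however often your application desires).
setInterval(() => { saveState().catch(console.error); }, 30_000);
```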
If some data changes a lot and some data doesn't change very much, you can break the data into multiple files so you're writing less data on the more frequent interval. Obviously, there is a threshold of amount of data or frequency of writes or complexity of data access where a database would be warranted, but you should at least consider the simpler option first and only add a new database when you think you really need it.
If you cluster your servers in the future, that would argue for a multi-user database (one with appropriate concurrency-management features) as your master keeper of state. But you're going to have other design issues to work through if you're trying to share multi-user state (like online status) across all clustered servers: you can no longer keep that state in memory on any one server unless either all state changes are broadcast to every server so each can update its in-memory copy, or you make users sticky to a particular server (which complicates load balancing in clustering). That does somewhat call for a Redis-like central store that all clustered servers can access.
My app uses CoreData with iCloud as backend. Multiple devices can access the iCloud database which is thus .public.
The local CoreData store is synchronized with iCloud using an NSPersistentCloudKitContainer.
I use history tracking according to Apple’s suggestions.
There, Apple suggests pruning history when possible. They say:
Because persistent history tracking transactions take up space on disk, determine a clean-up strategy to remove them when they are no longer needed. Before pruning history, a single gatekeeper should ensure that your app and its clients have consumed the history they need.
Originally this was also suggested in the WWDC 2017 talk starting at 26:10.
My question is: How do I implement this single gatekeeper?
I assume the idea is that a single instance knows at what time every user of the app has last synchronized their device. If so the history of transactions before this date can be pruned.
But what if a user synchronized the local data and then does not use the app for a long time? In this case the history cannot be pruned until this user again synchronizes the local data, so the history data could grow arbitrarily large. This seems to me to be a central problem that I don’t know how to solve.
The Apple docs cited above suggest:
Similar to fetching history, you can use deleteHistory(before:) to delete history older than a token, a transaction, or a date. For example, you can delete all transactions older than seven days.
But this does not solve the problem to my mind.
Aside from this general problem, my idea is to have an iCloud record type in the public iCloud database that stores for every device directly (i.e. without CoreData) the last date when the local database was updated. Since all devices can read these records, it is easy to identify the last time when all local databases were updated, and I could prune the history before this date.
Is this the right way to handle the problem?
EDIT:
The problem has recently been addressed in this post. The author demonstrates with tests with Apple's demo app that there is indeed a problem, if the history is purged too early. My answer there indicates that with the suggested delay of 7 days, an error is probably extremely rare.
UPDATE:
In this post from a WWDC22 Core Data Lab, an Apple Core Data framework engineer answers the question "Do I ever need to purge the persistent history tracking data?" as follows:
No. We don’t recommend it. NSPersistentCloudKitContainer uses the persistent history token to track what to sync. If you delete history the cloud sync is reset and has to upload everything from scratch. It will recover but it’s not a good customer experience. It shouldn’t normally be necessary to delete history. For example, the Apple Photos app doesn’t trim its history, so unless you’re generating massive amounts of history don’t do it.
By now I think my question was partly based on a misunderstanding:
In CoreData, a persistent store is handled by one or more persistent store coordinators.
If there is only one, the coordinator has complete control over the store, and there is no need for history tracking.
If there is more than one coordinator, one coordinator can change the store while another is not aware of the changes.
Thus, persistent history tracking of the store records all transactions in the store.
The store can then notify other users of the store by sending a NSPersistentStoreRemoteChange notification.
Upon receiving this notification, the transaction history can be fetched and processed.
Once a transaction has been processed, it is no longer needed by the user that processed it.
In a CoreData + CloudKit situation, a persistent store is mirrored to iCloud.
This means there is in the simplest situation one persistent store coordinator of the app, and - invisible to the app - one persistent store coordinator that does the mirroring.
Since both coordinators can change the store independently, history tracking is required.
If the app changes the store, I assume that Apple’s mirroring software receives the NSPersistentStoreRemoteChange notifications, processes the transactions, and forwards them to iCloud. Normally, i.e. if there is an iCloud connection, this takes only seconds, so the transaction history is only needed for a short time.
If iCloud changes are mirrored to the store, the app receives the NSPersistentStoreRemoteChange notifications, and has to process the transactions.
After they have been processed, they are no longer needed by either the app or the mirroring software and can be pruned.
This means that if there is only one user of the persistent store on the app’s device, pruning can indeed be done a short time after processing the notification.
If the device is offline, e.g. in flight mode or switched off, it will not receive NSPersistentStoreRemoteChange notifications, and will not prune the transaction history.
So it is indeed safe to prune the persistent history, say, seven days after it has been processed.
The situation is different if there is more than one user of the store on a device, e.g. an additional app extension. In this case one has to ensure that other targets than the app have also processed the transactions before the history is pruned. This can indeed be done by a single gatekeeper. How this can be done is e.g. described in this post.
I have been working on a Web App for visualizing live data. It is crucial that this data is kept up to date on the client side without such updates being invoked directly by the client (e.g. no button presses or refreshing the page). Currently, on page load, I grab the current data set from a database (DynamoDB) via Ajax, and subsequent updates are pushed to any listening clients every 5 minutes via a Websockets connection (using Socket.io).
I have overlooked the computational load of this update job. It has to mine some data, process it, update the database, and send the update out to all clients. As a result, the web server is left unresponsive for about 30 seconds with each update. Furthermore, my current architecture limits me from putting my server behind a load balancer, which is something I anticipate coming up in the future. For both these reasons, I really need to get this update job off my web server.
I am relatively inexperienced in web development, and I don't feel I am knowledgeable enough about these technologies to know the drawbacks of the solutions I have come up with. Currently, I am considering:
Break the update off into a separate process so it does not block the Node event loop. This would solve my issue in the short term, but if I ever want to load balance my application, I can't have the update running on multiple machines.
Drop Websockets entirely and just have the client query the database every 5 minutes, while a separate process (or separate server if I want load balancing) keeps the database up to date without interacting directly with the client. Will this kind of access pattern put too much load on my db?
Have a separate server run the update, and send the result via Websockets (or maybe some other protocol) to my load balanced application servers, which then push that update to all listening clients as usual. Is this even possible?
Perhaps there are other solutions. It seems like this would be a relatively common problem, so I was hoping I could find some guidance here. What are the potential issues with the solutions I have proposed, and are there other possible solutions that may suit my use case better?
It sounds like you want one process sitting somewhere which crunches the data and publishes it to a stream. Clients can then subscribe to the stream as and when they like. Redis handles streams nicely; you could process your data and push it into a Redis stream. You could then create a small Node service which subscribes to the Redis stream and pushes the formatted data out over a websocket or via polling.
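As a rough sketch of that split (assuming the ioredis client, a stream named "live-data", and a single JSON payload field, all of which are illustrative), the crunching process appends entries to the stream and a small subscriber process relays them to connected clients:

```ts
import Redis from "ioredis";

const redis = new Redis(); // publisher connection, defaults to localhost:6379

// The data-crunching process appends each processed result to the stream.
export async function publishUpdate(payload: object): Promise<void> {
  await redis.xadd("live-data", "*", "payload", JSON.stringify(payload));
}

// The subscriber process blocks on the stream and fans results out to clients,
// e.g. via a Socket.io server passed in as `io`.
export async function consume(io: { emit: (event: string, data: unknown) => void }): Promise<void> {
  const sub = new Redis(); // blocking reads need their own connection
  let lastId = "$";        // start with new entries only
  for (;;) {
    const res = await sub.xread("BLOCK", 0, "STREAMS", "live-data", lastId);
    if (!res) continue;
    for (const [, entries] of res) {
      for (const [id, fields] of entries) {
        lastId = id;
        io.emit("update", JSON.parse(fields[1])); // fields = ["payload", "<json>"]
      }
    }
  }
}
```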
In this scenario you can then scale up either the publishing process (the one crunching the numbers) if your data load goes up, or scale up your subscriber process (which serves the data over a websocket to browsers) if you get an influx of clients watching the data.
You can also easily distribute the hosting of these services across other machines, and even write them in different languages if you decide the number crunching needs something like threading.
You're then left with the issue of clients (web browsers) consuming this data with a load balancer in between. This can be a hard problem if you use websockets and comes with pros and cons. But importantly you'll have separated your data crunching from your result publishing, and that'll isolate your issue to only the load balancing.
I have done pretty much the same to check resources on some of our servers.
I have a C# service getting the information on each server that we manage, sending them to a queue (Amq).
From there, I have a STOMP client fetching data from Amq and emitting it to a websocket.
My main microservice fetches the data and saves it into a DB.
My visualisation webapp is connected to the same websocket and fetches the data as it is sent, in order to display it.
The Amq step isn't mandatory at all, it's just something I had to work with (historical).
I don't know what type of data you are working with, so I don't know if my solution can apply to you.
Don't hesitate if I'm not clear or you have any question.
This is a big question and I'm not going to try and give you a definitive answer.
For option 2
It really depends on how expensive your queries are. You can make DynamoDB fast if you pay for enough throughput. That said, on the face of it, re-loading your whole dataset, when it sounds like it's probably large, probably isn't good engineering.
For option 3
This option seems best to me if it's achievable, although admittedly it's hard to say with such a complex system - obviously you can't share your whole project.
Given you are already using AWS, you might want to look into AWS Lambda. If you can move the update process into a stand-alone job, you can host it on Lambda and move the load off the web server. Lambda is essentially infinitely scalable and you only pay for the compute you use.
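A minimal sketch of what that stand-alone job could look like as a Lambda on a 5-minute schedule; the table name, item shape, and the notify step are assumptions for illustration only:

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const handler = async (): Promise<void> => {
  // 1. Mine and process the data (placeholder for the real update job).
  const result = { id: "latest", computedAt: new Date().toISOString() };

  // 2. Write the processed result to DynamoDB (hypothetical "LiveData" table).
  await ddb.send(new PutCommand({ TableName: "LiveData", Item: result }));

  // 3. Notify the web tier (e.g. via a queue, SNS, or an internal HTTP call)
  //    so it can push the fresh result out to connected clients as before.
};
```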
This really depends on you being able to split the update task off into a separate service. It's likely you would need a fair bit of refactoring to isolate it as a service. If you can break little bits off at a time, and make the move gradually, even better.
If you consider trying this, and you've not used Lambda before, I would definitely start small with some hello world examples. Then try a very simple service in your application, and build up to taking on the update service.
You might also consider looking into AWS Simple Queue Service (SQS) to handle the comms between clients and server.
Database tuning
If a lot of your update time is spent waiting for database actions to complete, rather than server processing, you can consider tuning that side of things up. Things to consider are:
Buying more throughput
Using batch operations, as these move load from your server to DynamoDB (see the sketch after this list)
Tuning keys, indexes and database access
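To illustrate the batch-operations point above (a sketch only; the "LiveData" table and item shape are placeholders), the AWS SDK for JavaScript v3 lets you write up to 25 items per request with BatchWriteCommand:

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchWriteCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Write a batch of processed results in one round trip instead of one put per item.
// DynamoDB accepts at most 25 write requests per batch.
export async function saveBatch(items: Record<string, any>[]): Promise<void> {
  const response = await ddb.send(new BatchWriteCommand({
    RequestItems: {
      LiveData: items.slice(0, 25).map((item) => ({ PutRequest: { Item: item } })),
    },
  }));
  // Any items DynamoDB could not process should be retried with backoff.
  if (response.UnprocessedItems && Object.keys(response.UnprocessedItems).length > 0) {
    console.warn("Unprocessed items remain:", response.UnprocessedItems);
  }
}
```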
I have a naïve version of a PokerApp running as an Azure Website.
The server stores in its memory the state of the tables, (whose turn it is, blinds value, cards…) etc.
The problem here is that I don't know how much I can rely on the WebServer's memory to be "permanent". A simple restart of the server would cause that memory to be lost and therefore all the games in progress before the restart would get lost / cause trouble.
I've read about using Table Storage to keep session data and share it between instances, but in my case it's not just a string of text that I want to share but, let's say for example, a Lobby object which contains all the info associated with the games.
This is very roughly the structure of the object I have in memory
After some of your comments, you can see that the object that needs to be stored is quite big and is being modified almost constantly. I don't know how well serializing and deserializing is going to work for me here...
Should I consider an azure VM which I'm hoping is going to have persistent memory instead of a Website?
Or is there a better approach to achieve something like this?
Thanks all for the answers and comments, you've made it clear that one can't rely on local memory when working on the cloud.
I'm going to do some refactoring and optimize the "state" object and then use a caching service.
Two question come to my mind though, and once you throw some light on these ones I promise I will shut up and accept #astaykov's great answer.
CONCURRENCY AT INSTANCE LEVEL - I have classic thread locks in my app to avoid concurrency problems, so I'm hoping there is something equivalent for those caching services you guys propose?
Also, I have a few timeouts per table (increase blinds, number of seconds the players have to act…). Let's say a user has just folded a hand, he's finished interacting with the state object so I update the cache. While that state object (to which the timers belong) is cached, my timers will stop ticking…
I know I'm not explaining myself very well here but I hope you guys see my point.
I'd suggest using the Azure Redis Cache.
Here is a nice sample how to build MVC App with Redis Cache in 15 minutes.
You can, of course, use the Azure Managed Cache. Or end up with Azure Tables. And Azure Tables can hold much more than just a string. But I believe the caching solutions would have lower latency in communication.
Either way, your objects have to be serializable. And yes - the objects will get serialized/deserialized on every access. You can do it manually, or let the framework do it for you. From what I've read, Newtonsoft.Json is quite a good and optimized JSON serializer/deserializer.
UPDATE
As for asking for a VM running in the cloud - any VM will be restarted sooner or later! The application pool will recycle, planned maintenance will occur, unplanned maintenance will occur, a hard disk will fail, a memory module will fail, an unforeseen disaster will happen.
Only one thing is for sure - if you want your data to survive server crashes, change the way you think about and design software, and take data out of (local) memory. Or just live with the fact that the application may lose state sometimes.
Second update - for the clocks
Well, you have to play with your imagination and experience. I would question whether your clocks work anyway in the context of the ASP.NET app (unless all of them are static properties of a static type, which would be a little hell). My approach would be to heavily extend my app to the client as well (JavaScript). There are a lot of great frameworks out there - SignalR, AngularJS, KnockoutJS - none of them to be underestimated! By extending your object model to the client, you can maintain the player objects' lifetime on the client (keeping the clock ticking) and send updates from the client to the server for all those events. If you take a look at SignalR, you can keep real-time communication between multiple clients (say, players) and the server. And the server side of SignalR scales out nicely with Azure Service Bus and even Redis.
I'm looking at building an application which has many data sources, each of which put events into my system. Events have a well defined data structure and could be encoded using JSON or XML.
I would like to be able to guarantee that events are saved persistently, and that the events are used as a part of a publish/subscribe bus with multiple subscribers possible per event.
For the database, availability is very important even as it scales to multiple nodes, and partition tolerance is important so that I can scale the number of places which can store my events. Eventual consistency is good enough for me.
I was thinking of using a JMS enterprise messaging bus (e.g. Mule) or an AMQP enterprise messaging bus (such as RabbitMQ or ZeroMQ).
But for my application, it seems that if I could set up a publish/subscribe system with CouchDB or something similar, it would solve my problem without having to integrate an enterprise messaging bus and a persistent storage system.
Which would work better: CouchDB + scaling + load balancing + some kind of pub/sub mechanism, or an explicit pub/sub messaging system with attached eventually-consistent, available, partition-tolerant storage? Which one is easier to set up, administer, and operate? Which solution will have high throughput for a given cost? Why?
Also, are there any more questions I should ask before selecting my technologies? (BTW, Java is the server-side and client-side language).
I am using a CouchDB message queue in production. (It is not pub/sub, so I do not consider this answer complete.)
Currently (June 2011), CouchDB has huge potential as a messaging substrate:
Good data persistence
Well-poised for clustering (on a LAN, using BigCouch or Lounge)
Well-poised for distribution (between data centers, world-wide)
Good platform. Despite the shortcomings listed below, I love CouchDB because I can re-use my DB and it works from Erlang, NodeJS, and every web browser.
The _changes query (see the sketch after this list)
Continuous feeds, instant delivery without polling
Network going down is no problem, just retry later from the previous position
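A rough sketch of following a continuous _changes feed, using the standard fetch API for illustration; the database URL is a placeholder:

```ts
// Follow a CouchDB continuous _changes feed and log each change as it arrives.
async function followChanges(dbUrl: string, since: string = "now"): Promise<void> {
  const res = await fetch(`${dbUrl}/_changes?feed=continuous&include_docs=true&since=${since}`);
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break; // connection dropped: reconnect later from the last seq you saw
    buffer += decoder.decode(value, { stream: true });
    let newline;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) {
        const change = JSON.parse(line); // one JSON object per line
        console.log("change", change.seq, change.id);
      }
    }
  }
}

// Usage (placeholder URL): followChanges("http://localhost:5984/mydb").catch(console.error);
```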
Still, even a low-volume message system in CouchDB requires careful planning and maintenance. CouchDB is potentially a great messaging server. (It is inspired by Lotus Notes, which handles high email volume.)
However, these are the challenges with CouchDB:
Append-only database files grow fast
Be mindful about disk capacity
Be mindful about disk i/o. Compaction will read and re-write all live documents
Deleted documents are not really deleted. They are marked deleted=true and kept forever, even after compaction! This is in fact uniquely good about CouchDB, because the deleted action will propagate through the cluster, even if the network goes down for a time.
Propagating (replicating) deletes is great, but what about the buildup of deleted docs? Eventually it will outstrip everything else. The solution is to purge them, which actually removes them from disk. Unfortunately, if you do 2 or more purges before querying a map/reduce view, the view will completely rebuild itself. That may take too much time, depending on your needs.
As usual, we hear NoSQL databases shouting "free lunch!", "free lunch!" while CouchDB says "you are going to have to work for this."
Unfortunately, unless you have compelling pressure to re-use CouchDB, I would use a dedicated messaging platform. I had a good experience with ejabberd as a messaging platform and for communicating to/from Google App Engine.
I think that the best solution would be CouchDB + Jabber/XMPP server (ejabberd) + book: http://professionalxmpp.com
JSON is the natural storing mechanism for CouchDB
Jabber/XMPP server includes pubsub support
The book is a must read
While you can use a database as an alternative to a message queueing system, no database is a message queueing system, not even CouchDB. A message queueing system like AMQP provides more than just persistence of messages; in fact with RabbitMQ, persistence is just an invisible service under the hood that takes care of all of the challenges you would otherwise have to deal with yourself on CouchDB.
Take a good look at the RabbitMQ website where there is lots of information about AMQP and how to make use of it. They have done a great job of collecting together articles and blogs about message queueing.