RavenDB (a .NET JSON document database with querying) manages caching and memory aggressively under its own control (via its own storage engine, Munin), with config parameters to tweak the various cache sizes. Google Groups threads suggest that, at least in earlier releases (this may not be the case with the latest ones), untuned parameters could produce occasional out-of-memory exceptions once the database/index grew large enough.
CouchDB seems to take a different approach and leaves caching to the operating system. That means when I GET /db1/doc-id-1, it is, in programming terms, essentially a file read against the filesystem, which the OS can optimize away via its own caches. I believe the same holds for views and reduce results (multiple parts of the B-tree need to be loaded/computed from disk, depending on the range).
The latter seems superior to me: the OS has years of evolution in caching/paging behind it, and pressure from other services can balance memory across the machine.
My questions:
Am I correct in my understanding?
Is CouchDB's approach unique to Unix-based OSes (although I see they have a Windows port)?
Is there a reason a .NET DB can't rely on the OS to optimize away file reads?
What are the disadvantages and advantages of each approach that would influence the choice when building a data store?
Side note: I believe Redis is similar, just keeping the index in memory; each GET key is a disk read (which may or may not hit the disk heads, depending on the OS file cache).
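To make the OS-cache point concrete, here is a small Node sketch (the file path is a placeholder): the same file is read twice, and the second read is normally served from the OS page cache with no application-level caching involved.

```typescript
import { readFileSync } from "node:fs";

// Time a synchronous read of the whole file, in milliseconds.
function timedRead(path: string): number {
  const start = process.hrtime.bigint();
  readFileSync(path);
  return Number(process.hrtime.bigint() - start) / 1e6;
}

const path = "/var/lib/somedb/db1.couch"; // placeholder path
console.log(`first read:  ${timedRead(path).toFixed(2)} ms`); // may hit the disk
console.log(`second read: ${timedRead(path).toFixed(2)} ms`); // usually page cache
```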
Jia93,
One of the reasons we work the way we do is that we have a stronger separation between the layers. CouchDB has much the same optimizations as we do (keeping things in memory), but it does that on top of the B-tree structure that is directly exposed to the application.
Another reason for caching the results is to avoid the cost of parsing the JSON on every single request.
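To illustrate that second point, here is a minimal sketch (not RavenDB's actual code) of keeping deserialized documents keyed by id and validated by etag, so an unchanged document is never re-parsed:

```typescript
type CachedDoc = { etag: string; doc: unknown };

const parsedCache = new Map<string, CachedDoc>();

// Hypothetical storage-engine read returning raw JSON plus the doc's etag.
function readRawDocument(id: string): { etag: string; json: string } {
  return { etag: "v1", json: `{"id":"${id}"}` }; // stand-in for a real read
}

function loadDocument(id: string): unknown {
  const raw = readRawDocument(id);
  const cached = parsedCache.get(id);
  if (cached && cached.etag === raw.etag) {
    return cached.doc; // hit: skip JSON.parse entirely
  }
  const doc = JSON.parse(raw.json); // pay the parse cost only when changed
  parsedCache.set(id, { etag: raw.etag, doc });
  return doc;
}
```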
In a Next.js app, what would be the most efficient (fastest) way to retrieve the user's country?
Among other things, I would use it to determine which scripts are loaded using next/script.
I looked into node-geoip and fast-geoip, but even though fast-geoip has a very thorough explanation below, I do not understand the mechanisms behind Next.js/Node.js well enough to evaluate the methods properly.
Concretely, what geoip-lite does is that, on startup, it reads the whole database from disk, parses it, and puts it all in memory. This increases startup time by about ~233 ms and the process's memory use by around ~110 MB, in exchange for any new queries being resolved with low sub-millisecond latencies (~0.02 ms).
This works if you have a long-running process that will need to geolocate a lot of IPs and don't care about the increases in memory usage and startup time, but if, for example, your use case requires geolocating only a single IP, these trade-offs don't make much sense, as only a small part of the database is needed to answer that query, not all of it.
This library tries to provide a solution for these use cases by separating the database into chunks and building an indexing tree around them, so that IP lookups only have to read the parts of the database that are needed for the query at hand. This results in the first query taking around 9 ms and subsequent ones that hit the disk cache taking 0.7 ms, while memory consumption is kept at around 0.7 MB.
Wrapping it up, geoip-lite has huge overhead costs but sub-millisecond queries, whereas this library doesn't have any overhead costs but its queries are slower (0.7-9 ms).
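For concreteness, the two lookup styles compared above look roughly like this (both calls are the packages' documented entry points; the IP address is arbitrary and the return shapes are abbreviated):

```typescript
// geoip-lite: the whole database is parsed into memory at import time,
// so lookups are synchronous and very fast (~0.02 ms).
import geoipLite from "geoip-lite";
const hit = geoipLite.lookup("8.8.8.8");
console.log(hit?.country);

// fast-geoip: nothing is preloaded; each lookup reads only the index
// chunks it needs from disk (~0.7-9 ms), keeping resident memory tiny.
import fastGeoip from "fast-geoip";
const info = await fastGeoip.lookup("8.8.8.8");
console.log(info?.country);
```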
As geoip would be called for every visitor, I assume it would have to read the whole database on each initialization, thereby making fast-geoip the best choice?
Or is there some built-in mechanism that keeps it in memory across subsequent requests when frequently loaded, hence making node-geoip the best choice?
Or am I approaching the problem the wrong way, and should I rather see if there is some way to get the location via the user's browser?
Would appreciate any feedback, even if there is a completely different path worth exploring :-)
I read the documentation for fast-geoip. It's designed for "serverless" cloud services such as AWS Lambda, GCP Cloud Functions, and Cloudflare Workers, where RAM is limited and expensive.
Note the package author's emphasis on low steady-state RAM use in the explanation quoted above.
In summary, assuming a cloud VM/bare-metal deployment and the need to call the IP-to-location lookup on every page request, there is probably no compelling reason to use the above package.
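On the question of a "built-in mechanism": Node's module cache means an import like geoip-lite is initialized once per process, not once per request, so in a long-running server the startup cost is paid a single time. A minimal sketch (module and function names are illustrative):

```typescript
// geo.ts - evaluated once per Node process thanks to the module cache.
import geoip from "geoip-lite"; // the ~233 ms / ~110 MB cost is paid here, once

// Every subsequent call is a pure in-memory lookup (~0.02 ms).
export function countryFor(ip: string): string | undefined {
  return geoip.lookup(ip)?.country;
}
```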
PS: Check if the above packages require you to rotate a DB file on disk every few weeks (or rebuild+redeploy your Node app) to keep data up to date. There are commercial REST APIs such as the one in my bio (I am the developer) that may mitigate this hassle, YMMV.
New to Hazelcast; I want to understand the roles of clients and servers in a cluster.
Let's say I have 4 different servers/machines (not referring to Hazelcast servers) and I want to maximize RAM utilization:
Do I start 4 server instances, one on each server/machine?
Do I start 4 client instances, one on each server/machine?
Is business logic written only in the client instances? If so, do the server instances not contain any logic apart from managing the lifecycle?
I know this will vary with requirements, but I want to get a general idea.
Adding on to Ernest's statements: you would usually expect data to be held in the cache and processing to be done on the client. However, with Hazelcast, it doesn't have to be that way. Check out some interesting features like ExecutorService and EntryProcessor in the documentation.
You may also want to look at the concept of a Near Cache, where you still hold the data on dedicated Hazelcast instances (servers) while maintaining a near cache in the client. Be wary of the data-sync challenges around this, though it works well in most cases (again, very subjective).
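For example, a near cache can be enabled entirely from client configuration. A sketch with the Node.js client (hazelcast-client); the map name and limits are illustrative, and the option names follow that client's near-cache config:

```typescript
import { Client } from "hazelcast-client";

async function main() {
  const client = await Client.newHazelcastClient({
    nearCaches: {
      customers: {
        invalidateOnChange: true, // cluster pushes invalidations on updates
        timeToLiveSeconds: 300,   // bound how stale a local entry may get
        evictionPolicy: "LRU",
        evictionMaxSize: 10000,
      },
    },
  });

  const customers = await client.getMap("customers");
  await customers.get("id-1"); // first read goes to the cluster
  await customers.get("id-1"); // repeated read is served from client memory
  await client.shutdown();
}

main().catch(console.error);
```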
Hope these pointers give you some ideas to start off with. All the best!
There is no single answer to your question; there are many factors to be considered. For example, one of your questions is where the business logic resides. This depends heavily on how Hazelcast is used. Let's say Hazelcast is used purely for caching purposes; the business logic then resides entirely on the client side.
Alternatively, if Hazelcast is full of rich POJOs and domain-driven design is used, then the logic lies entirely on the Hazelcast instances themselves. Usually, in real life, the truth is somewhere in between.
In terms of memory utilization, again, this depends very much on your setup, budget, and so on. We can say that if you have one server with a lot of RAM and you don't use any commercial add-ons from Hazelcast (like off-heap memory), then running several Hazelcast instances on the same machine, each with a limited amount of memory, would be more beneficial than running a single node with a lot of memory.
It should also be noted that allocating more than ~32 GB of heap pushes the JVM out of compressed object pointers (compressed oops) and into full 64-bit references, so each reference costs 8 bytes instead of 4.
Again, this depends on many factors. If you have a live, interactive application, you cannot tolerate big GC pauses, so you would lean toward more Hazelcast instances with small heaps. If you have a non-interactive application that is tolerant of big GC pauses, it is the other way around: you can have a big heap. So you see, there is no simple answer to your question.
I have an API service in Node.js; basically, it takes an id from the request, reads the record with that id from the database, and returns it in the response.
While there are many clients with different ids, usually only about 10-20 of them are used in a given timespan.
Is it a good idea to create an object with ids as keys and store the resulting record along with a last_requested time, to emulate a small database with fast access? Whenever a record is requested, I would update the last_requested field with new Date(). I would also create a setInterval() to delete keys that have not been used for some time.
Records in the database do not change often, and when they do I can restart the service (there are several instances running simultaneously via PM2, so they can be gracefully restarted).
If the required id is not found in this "database", a request to the real database is performed and the result is stored in the object under a new key.
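A minimal sketch of the scheme described in the question, with fetchFromDb standing in for the real database call:

```typescript
type Entry = { record: unknown; last_requested: Date };

const cache = new Map<string, Entry>();
const MAX_IDLE_MS = 10 * 60 * 1000; // evict after 10 minutes unused

// Hypothetical stand-in for the real database read.
async function fetchFromDb(id: string): Promise<unknown> {
  return { id }; // placeholder
}

async function getRecord(id: string): Promise<unknown> {
  const hit = cache.get(id);
  if (hit) {
    hit.last_requested = new Date(); // refresh on every request
    return hit.record;
  }
  const record = await fetchFromDb(id);
  cache.set(id, { record, last_requested: new Date() });
  return record;
}

// Periodically drop entries that have not been requested for a while.
setInterval(() => {
  const now = Date.now();
  for (const [id, entry] of cache) {
    if (now - entry.last_requested.getTime() > MAX_IDLE_MS) cache.delete(id);
  }
}, 60 * 1000);
```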
You're talking about caching. And it's very useful if:
You have a lot of reads, but not a lot of writes. i.e. Lots of people request a record, and it changes rarely.
You have a lot of free memory, or not many records.
You have a good indication of when to invalidate the cache.
For trivial use cases (i.e. under 50 requests/second), you probably don't need an in-memory cache in front of the database. Moreover, database access is very fast if you use the tools the database gives you (like persistent connection pools, consistent parameterized queries, a query cache, etc.).
It all depends on your specific use case. But I wouldn't do it until I actually start encountering performance problems and determine that the database is the bottleneck.
It's not just a good idea; caching is a necessity at various levels of a computing system. Caching starts at the CPU level (L1, L2, L3) and the OS level, and goes up to the application level, where it must be done by the developer.
Even if you have a well-structured database with good indexes, there is still overhead for the TCP/IP communication between your app and the database. So if you are going to access some rows frequently, it is a must to keep them in your app's process.
The good news is that a Node.js app is a single process resident in memory (unlike PHP or other scripting programs, whose processes come and go), so you can keep frequently required data loaded and skip the database access.
The best mechanism to store the records is an LRU (least-recently-used) cache. There are several LRU cache packages available for Node.js:
https://github.com/adzerk/node-lru-native
https://github.com/isaacs/node-lru-cache
https://www.npmjs.com/package/simple-lru-cache
In an LRU cache you can define how much memory the cache may use, the expiry age of each item, and how many items it can store. Or you can write your own!
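For instance, with the second package above (lru-cache; the options shown follow its classic v6-style API, where newer major versions renamed maxAge to ttl):

```typescript
import LRU from "lru-cache";

const cache = new LRU<string, Record<string, unknown>>({
  max: 500,              // keep at most 500 records
  maxAge: 1000 * 60 * 5, // each entry expires 5 minutes after being set
});

// Hypothetical database read used on cache misses.
async function fetchFromDb(id: string): Promise<Record<string, unknown>> {
  return { id };
}

async function getRecord(id: string): Promise<Record<string, unknown>> {
  const hit = cache.get(id); // also marks the entry as recently used
  if (hit) return hit;
  const record = await fetchFromDb(id);
  cache.set(id, record);
  return record;
}
```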
We want to implement caching in Azure for two main reasons:
Speed up repetitive data access
Reduce stress on the database
Here are the characteristics of the data we are planning to cache:
Relatively small (1 - 100 kb)
Specific to each customer
Not private, but we don't really want random people navigating through our entire cache
XML or JSON
Consumed by C# (i.e. not linked to directly in the html)
Most weeks the data will not change, although some days the data could change several times
For this specific purpose Table storage appears better than Blob storage (we did just implement Blob storage for images, CSS, and JavaScript) and Windows Azure Caching appears better than Windows Azure Shared Cache (perhaps almost always better and the shared caching is mostly a legacy feature at this point).
The programming API of both appears straightforward. Compared to what we pay for cloud sites, the cost of each seems negligible.
So far we are leaning toward Table Storage due to what we perceive to be the pros and cons of Azure Caching. As old .NET guys we are much more familiar with in-memory caching than with NoSQL-style solutions:
Problems with Windows Azure Caching:
If the VM is moved to a different server (by Microsoft, for load balancing or whatever reason), is the in-memory cache moved intact?
We are guessing that whenever we publish changes to the cloud, it wipes out the existing in-memory cache
While users rarely make changes to the cached data, when they do, they are likely to make multiple updates within seconds, and we are not sure how this will work with a cache located across multiple nodes running web roles, especially with increased traffic. (This is probably a concern with Table Storage as well!)
Table Storage looks like it will be easier to debug
Advantages of Windows Azure Caching
Somewhat faster
Your familiarity with in-memory caching is the model that you need to understand to implement caching on Windows Azure. The 'NoSql style' is not caching, but storage. So table storage rather replaces SQL than it replaces caching. Table storage is for persistent, reliable storage — with all of the latency and other disadvantages of persistence that do not exist with in-memory cache.
Writing to cache is always secondary. When your users 'make changes to the cached data' you will always be writing out the data to disk (e.g. SQL), and then writing out the same data to the cache because you might as well, since you have the data on-hand (although secondary effects on written data may mean that you should invalidate or re-read the cached item).
The wiping out of data when a machine recycles should not be much of a concern, as the data is stored elsewhere. Every read from the cache should be followed by an 'if not found then read from database' kind of statement. You can warm-up the cache when a role starts to pre-populate items that you know that you are going to need.
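A sketch of that read-then-fallback pattern (an in-memory Map stands in for whichever distributed cache API you use, and readFromSql/writeToSql are hypothetical helpers):

```typescript
const cache = new Map<string, unknown>();

// Hypothetical persistent-store accessors ("e.g. SQL" in the text above).
async function readFromSql(key: string): Promise<unknown> {
  return { key }; // placeholder
}
async function writeToSql(key: string, value: unknown): Promise<void> {}

async function getItem(key: string): Promise<unknown> {
  const cached = cache.get(key);
  if (cached !== undefined) return cached; // cache hit
  const value = await readFromSql(key);    // "if not found then read from database"
  cache.set(key, value);                   // repopulate for the next reader
  return value;
}

async function updateItem(key: string, value: unknown): Promise<void> {
  await writeToSql(key, value); // writing to cache is always secondary:
  cache.set(key, value);        // persist first, then refresh the cached copy
}
```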
Caching on Azure is distributed across the nodes, and updating an existing item will always update it on the node where it resides. Quick updates may be less of a problem than you think.
For in-memory caching use Windows Azure caching (you are right about shared caching being legacy) and, depending on your needs, look at other caching technologies like memcached. Caching and table storage are not comparable. Table storage is for long-term persistence. Don't unnecessarily hack table storage to do caching — making table storage temporary creates a whole bunch of things that you need to worry about yourself, like expiry and invalidation.
Is it worth caching data from Azure Table Storage with the Azure Caching preview?
Or is Table Storage fast enough in large-scale applications?
Thanks
The short answer is: it depends. In the application I am currently working on, we cache some information both to handle the latency of retrieving data from Table Storage and to accommodate the desired number of transactions per second.
We started out serving the information from Table Storage and moved to caching only when our performance requirements dictated it. I'd recommend a similar approach: make it work, then make it fast.
In addition to what Robert said, you should also consider the following points:
Windows Azure Table Storage allows you to store up to 100 TB of data (in chunks). At first glance, that amount of data may seem overwhelming. However, Table Storage can be partitioned. Each partition can be moved to a separate server by the Azure controller, thereby reducing the load on any single server and improving performance.
If your application is under very high load, a cache with frequent inserts will approach its maximum size very quickly, and then the cache-item eviction process starts. In most cases, frequent inserts into the cache combined with frequent evictions end up degrading performance instead of improving it. You would then need to increase the maximum cache size, which in turn affects your application's cost (and sometimes that can be a blocker).
Last but not least, you can access Windows Azure Table Storage data using the OData protocol and LINQ queries with the WCF Data Services .NET libraries; you do not have that ability with Azure Cache.
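The answer refers to the WCF Data Services .NET libraries; the same OData-style filtering is sketched below with the current Node.js SDK (@azure/data-tables) purely for illustration, with the connection string and table/partition names as placeholders. A plain key-value cache offers lookups by key only, with no comparable server-side querying:

```typescript
import { TableClient, odata } from "@azure/data-tables";

const client = TableClient.fromConnectionString(
  process.env.STORAGE_CONNECTION_STRING!, // placeholder
  "customers"
);

// Server-side OData filter: something a key-value cache cannot express.
async function entitiesForCustomer(customerId: string) {
  const results = [];
  const iter = client.listEntities({
    queryOptions: { filter: odata`PartitionKey eq ${customerId}` },
  });
  for await (const entity of iter) results.push(entity);
  return results;
}
```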
Please bear in mind that these points may or may not be valid in your case; it all depends on your system architecture, expected load, etc.
I hope my answer will help you in making good system architecture decisions.