Is there a size limit for object addition in the Geode region?

We are trying to do a POC to change the way we store content in the Geode region. We operate on sketches (sizes can vary from 1GB to 30GB) and currently break them into parcels and store the parcels in the region. We then read these parcels and merge them to create a complete sketch for our processing. We are seeing some inconsistencies in the data due to cache eviction, and we are trying to come up with an approach that stores the complete object in the region instead of storing the parts.
I looked at the Geode documentation but could not find a size limit for an entry in the region, so I wanted to reach a broader group in case anyone has done something similar or has some insights into it.
Thanks for your response in advance.
Best Regards,
Amit

According to what I've been investigating, the maximum object size is set to 1GB; you can have a look at GEODE-478 and commit 1e3f89ddcd for further details. It's worth mentioning, as a side note, that objects that big might cause problems with GC, so you might want to stay away from them.
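If you stay with parcels, one way to at least detect the eviction inconsistencies is to store a small manifest next to the parcels and refuse to merge when anything is missing. The sketch below is only an illustration: a plain dict stands in for the region, and the key layout and the function names `split_into_parcels` and `reassemble` are made up for this example, not part of any Geode API.

```python
import hashlib


def split_into_parcels(sketch_id, payload, parcel_size):
    """Return the entries to store: numbered parcels plus one manifest."""
    if parcel_size <= 0:
        raise ValueError("parcel_size must be positive")
    entries = {}
    count = 0
    for offset in range(0, len(payload), parcel_size):
        # Hypothetical key layout: "<sketch id>/parcel/<index>"
        entries[f"{sketch_id}/parcel/{count}"] = payload[offset:offset + parcel_size]
        count += 1
    # The manifest records how many parcels exist and a checksum of the whole.
    entries[f"{sketch_id}/manifest"] = {
        "parcels": count,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return entries


def reassemble(sketch_id, region):
    """Merge parcels back, failing loudly if any entry is gone."""
    manifest = region.get(f"{sketch_id}/manifest")
    if manifest is None:
        raise LookupError("manifest missing (possibly evicted)")
    parts = []
    for index in range(manifest["parcels"]):
        part = region.get(f"{sketch_id}/parcel/{index}")
        if part is None:
            raise LookupError(f"parcel {index} missing (possibly evicted)")
        parts.append(part)
    payload = b"".join(parts)
    if hashlib.sha256(payload).hexdigest() != manifest["sha256"]:
        raise ValueError("checksum mismatch after merge")
    return payload
```

With this, a reader that hits an evicted parcel gets an error it can retry or reload on, instead of silently processing a partial sketch.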
Cheers.


Data Lake Blob Storage

I'm after a bit of understanding. I'm not stuck on anything, but I'm trying to understand something better.
When loading a data warehouse, why is it always suggested that we load data into blob storage or a data lake first? I understand that it's very quick to pull data from there, but in my experience there are a couple of pitfalls. The first is that there is a file size limit: if you load too much data into one file, as I've seen happen, the load errors out and we have to switch to incremental loads. This brings me to my second issue. I always thought the point of loading into blob storage was to chuck all the data in there so you can access it in the future without stressing the front-end systems. If I can't do that because of file limits, then what's the point of using blob storage at all? We might as well load data straight into staging tables. It just seems like an unnecessary step to me, since I've run data warehouses in the past without this part involved and, to me, they worked better.
Anyway my understanding of this part is not as good as I'd like it to be, and I've tried finding articles that answer these specific questions but none have really explained the concept to me correctly. Any help or links to good articles I could read would be much appreciated.
One reason for placing the data in blob storage or a data lake is so that multiple parallel readers can work on the data at the same time. The goal is to read the data in a reasonable time. Not all data sources support this type of read operation. Given the size of your file, a single reader would take a very long time.
One such example is SFTP. Not all SFTP servers support offset reads, and some have further restrictions on concurrent connections. Moving the data to Azure services first gives you a known set of capabilities and limitations.
In your case, I think what you need is to partition the file, like HDFS might do. If I knew what data source you are using, I could make a further suggestion.
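To make the parallel-reader point concrete, here is a minimal sketch of offset reads over byte ranges. It assumes a local file as a stand-in for the blob; the helper names `byte_ranges`, `read_range` and `parallel_read` are made up for the example, and a real blob store would use its own ranged-download call instead of `seek`.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(total_size, part_size):
    """Split [0, total_size) into consecutive (start, end) ranges."""
    return [
        (start, min(start + part_size, total_size))
        for start in range(0, total_size, part_size)
    ]


def read_range(path, start, end):
    """Offset read: each reader opens the source and fetches only its slice."""
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start)


def parallel_read(path, part_size, workers=4):
    """Read all ranges concurrently and join them in their original order."""
    ranges = byte_ranges(os.path.getsize(path), part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the parts line up.
        parts = pool.map(lambda r: read_range(path, r[0], r[1]), ranges)
        return b"".join(parts)
```

A source that cannot serve `read_range` (no offset reads, or only one connection at a time) forces a single sequential reader, which is the limitation described above.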

ArangoDB Key/Value Model: value maximum size

With regard to the Key/Value model of ArangoDB, does anyone know the maximum size per value? I have spent hours searching the Internet for this information but to no avail; you would think that this is classified information. Thanks in advance.
The answer depends on different things, like the storage engine and whether you mean theoretical or practical limit.
In case of MMFiles, the maximum document size is determined by the startup option wal.logfile-size if wal.allow-oversize-entries is turned off. If it's on, then there's no immediate limit.
In case of RocksDB, it might be limited by some of the server startup options such as rocksdb.intermediate-commit-size, rocksdb.write-buffer-size, rocksdb.total-write-buffer-size or rocksdb.max-transaction-size.
When using arangoimport to import a 1GB JSON document, you will run into the default batch-size limit. You can increase it, but it appears to max out at 805306368 bytes (0.75GB). The HTTP API seems to have the same limitation (/_api/cursor with bindVars).
What you should keep in mind: mutating the document is potentially a slow operation because of the append-only nature of the storage layer. In other words, a new copy of the document with a new revision number is persisted and the old revision will be compacted away some time later (I'm not familiar with all the technical details, but I think this is fair to say). For a 500MB document, it seems to take a few seconds to update or copy it using RocksDB on a rather strong system. It's much better to have many small documents.
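As a rough illustration of "many small documents", the sketch below groups a list of records into documents whose JSON-encoded item list stays under a byte budget. It is not tied to any ArangoDB driver; `split_into_documents`, the `items` attribute and the key scheme are assumptions made for the example (only `_key` is the actual document key attribute).

```python
import json


def encoded_size(items):
    """Size in bytes of the JSON encoding of a list of records."""
    return len(json.dumps(items).encode("utf-8"))


def split_into_documents(base_key, records, max_bytes):
    """Group records into small documents instead of one huge one.

    The budget applies to the encoded item list only. A single record
    larger than max_bytes still gets a document of its own.
    """
    documents = []
    current = []

    def flush():
        if current:
            documents.append(
                {"_key": f"{base_key}-{len(documents)}", "items": list(current)}
            )
            current.clear()

    for record in records:
        # Re-encoding the batch each time is simple but not fast;
        # acceptable for a sketch.
        if current and encoded_size(current + [record]) > max_bytes:
            flush()
        current.append(record)
    flush()
    return documents
```

Updating one record then rewrites one small document rather than a new 500MB revision.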

Running two instances of MongoDB

I am working on a highly I/O Intensive application (A selection based on the availability of seats) using MERN Stack.
The app is expected to get 2000 concurrent users.
I want to know whether it's wise to use two instances of MongoDB, one on the RAM (in memory) and another on the Hard drive.
The RAM one to be used to store the available seats.
And the Hard drive one to backup the data after regular intervals.
But at the same time I know that if the server crashes my MongoDB data on the RAM is lost.
Could anyone guide me please?
I am using Socket IO instead of AJAX...
I don't think you need this. You can get a good server, with a good amount of RAM, and if you create your indexes correctly, everything should work fine.
Also Mongo 3 won't lock the entire database on each update, like Mongo 2 used to do.
I believe the best approach would be to use something like Memcached to improve reads. Also, to improve database performance and get automated failover, use sharding and replica sets.
Consider also the headaches you would have when your server restarts and you lose your data...
This seems unnecessary, because MongoDB already behaves exactly like that out-of-the-box.
The old engine (MMAPv1) was using memory-mapped files, which means that if you have as much RAM as you have data, it practically behaves like an in-memory database with automatic hard-drive backing.
The new engine (WiredTiger) works a bit differently in detail, but the same in general. It allows you to set a cache size (config key storage.wiredTiger.engineConfig.cacheSizeGB). When the cache is large enough, you again have an in-memory database with automatic hard-drive mirroring.
More about that in the storage FAQ.
What you are talking about is a scaling problem. You have two options when it comes to scaling: add the resources that are causing the bottleneck to your existing setup (more RAM and faster disks, usually), or expand your setup. You should add resources first, almost up to the point where adding more no longer gives you a proportionate bang for the buck.
At some point, this "scaling up" will not be feasible any more and you have to distribute the load amongst more nodes.
MongoDB comes with a feature for distributing load amongst (logical) nodes: sharding.
Basically, it works like this: multiple replica sets each form a logical node called a shard. Each shard in turn holds only a subset of your data. Instead of connecting to the shards directly, you access your data via a mongos query router, which knows which shard holds the data needed to answer a query and where to write new data.
By carefully selecting your shard key, your reads and writes should be evenly distributed between the shards.
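To show why the choice of shard key matters, here is a toy model (not MongoDB's actual chunk assignment) comparing a hashed key with a range-based key on a monotonically increasing value. The functions `hashed_shard_for` and `range_shard_for` and the boundary values are assumptions made for the illustration.

```python
import hashlib
from collections import Counter


def hashed_shard_for(key, shard_count):
    """Toy hashed routing: hash the key, then take it modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def range_shard_for(value, upper_bounds):
    """Toy range routing: upper_bounds are exclusive; the last shard takes the rest."""
    for shard, upper in enumerate(upper_bounds):
        if value < upper:
            return shard
    return len(upper_bounds)


# Hashed keys: writes spread over all four shards.
hashed = Counter(hashed_shard_for(f"seat-{n}", 4) for n in range(10000))

# Range key on an ever-increasing value: every new write lands on the last shard.
ranged = Counter(range_shard_for(n, [1000, 2000, 3000]) for n in range(3000, 3100))
```

In this model `hashed` ends up with roughly a quarter of the keys on each shard, while `ranged` puts all 100 new writes on shard 3, which is the kind of hot spot a careful shard key selection avoids.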
Side note: putting production data on a standalone instance instead of a replica set crosses the border of negligence in my book. Given the prices of today's (rented) hardware, it has never been easier to eliminate a single point of failure than with a MongoDB replica set.

Are GridCacheQueue elements also GridCacheElements?

I'm in the process of evaluating GridGain and have read and re-read all the documentation I could find. While much of it is very thorough, you can tell that it's mostly written by the developers. It would be great if there were a reference book written from an outsider's perspective.
Anyway, I have five basic questions I'm hoping someone from GridGain can answer and clarify for me.
1. It's my understanding that GridCacheQueue (and the other Distributed Data Structures) are built on top of the GridCache implementation. Does that mean that each element of the GridCacheQueue is really just a GridCacheElement of the GridCache map, or is each GridCacheQueue a GridCacheElement, or do I have this totally wrong?
2. If I set a default TTL on the GridCache, will the elements of a GridCacheQueue expire in the TTL time, or does it only apply to GridCacheElements (which might be answered in #1 above)?
3. Is there a way to make a GridCacheQueue expire after some period of time without having to remove it manually?
4. If a cache is set up to be backed up onto other nodes and the cache is using off-heap memory and/or swap storage, is the off-heap memory and/or swap storage also replicated onto the back-up nodes?
5. Is it possible to create a new cache dynamically, or can it only be created via configuration when the node is created?
Thanks for any insightful information!
-Colin
After experimenting with a GridCache and a GridCacheQueue, here's what I've learned about my 5 questions:
1. I don't know how the GridCacheQueue or its elements are attached to a GridCache, but I know that the elements of a GridCacheQueue DO NOT show up as GridCacheElements of the GridCache.
2. If you set a TTL on a GridCache and add a GridCacheQueue to it, once the elements of the GridCache begin expiring, the GridCacheQueue becomes unusable and will cause a GridRuntimeException to be thrown.
3. Yes, see #2 above. However, there doesn't seem to be a safe way to test if the queue is still in existence once the elements of the GridCache start to expire.
4. Still have no information about this yet. Would REALLY like some feedback on that.
5. That was a question I never should have asked. A GridCache can be created entirely in code and configured.
Let me first of all say that GridGain supports several queue configuration parameters:
Colocated vs. non-colocated. In colocated mode you can have many queues. Each queue will be assigned to some grid node and all the data in that queue will be cached on that grid node. This way, if you have many queues, each queue may be cached on a different node, but the queues themselves should be evenly distributed across all nodes. Non-colocated mode, on the other hand, is meant for larger queues, where data for the same queue is partitioned across multiple nodes.
Capacity - this parameter defines the maximum queue capacity. When the queue reaches this capacity, it will automatically start evicting the oldest elements.
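The capacity behavior described above can be modeled locally as follows. This is only a plain in-process stand-in to show the eviction semantics, not the GridGain API; the class name `BoundedQueue` and its methods are made up for the example.

```python
from collections import deque


class BoundedQueue:
    """FIFO queue that evicts the oldest element once capacity is reached."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen drops items from the opposite end when full.
        self._items = deque(maxlen=capacity)

    def add(self, item):
        # Appending on the right evicts the leftmost (oldest) item if full.
        self._items.append(item)

    def poll(self):
        """Remove and return the oldest element, or None if empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


queue = BoundedQueue(capacity=3)
for value in [1, 2, 3, 4, 5]:
    queue.add(value)
# The queue now holds 3, 4, 5; the elements 1 and 2 were evicted.
```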
Now, let me try to tackle some of these questions.
1. I believe each element of GridCacheQueue is a separate element in the cache, but the implementation marks them as internal elements. That is why you don't see these elements when iterating through the cache.
2. TTL should not be used with elements in the queue (GridGain will be adding this feature soon). For now, you should limit the maximum size of the queue by specifying the queue 'capacity' at creation time.
3. I don't believe so, but I think this feature is being added. For now, you can try using org.gridgain.grid.schedule.GridScheduler to schedule a job that will delete a queue later.
4. The answer is YES. Data in both off-heap and swap spaces is backed up and replicated the same way as the main on-heap cache data.
5. A cache should be created in configuration, either from code or XML. However, GridGain has a cool notion of GridCacheProjection, which allows you to create various sub-caches (cache views) on the same cache. For example, if you store Person and Organization classes in the same cache, then you can use a cache projection for type Person when working with the Person class, and a cache projection for type Organization when working with the Organization class.

web development - deletion of user data?

I have finished my first complex web application and I have found out it is probably better to use "isDeleted" flags in the db than to hard-delete records. But I wonder what the recommended approach is for data stored on the filesystem (e.g. photos). Should I delete them when their related entity is (soft-)deleted or keep them as they are? Can junk accumulation cause running out of storage in practice?
It definitely can - you'll need to gather some stats on how much data the typical account generates, and then figure out how many deletions you're seeing to sort out how much junk data will pile up and/or when you'll fill up your storage.
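A back-of-the-envelope version of that estimate might look like this. All the numbers in the usage lines are invented for illustration, and `months_until_full` is a hypothetical helper, so plug in your own stats.

```python
def months_until_full(capacity_gb, used_gb, new_accounts_per_month,
                      gb_per_account, deleted_accounts_per_month,
                      purge_on_delete):
    """Estimate how many months remain before storage fills up.

    If purge_on_delete is False, files of soft-deleted accounts stay on
    disk, so deletions do not slow the growth at all.
    Returns None when storage is not growing.
    """
    growth = new_accounts_per_month * gb_per_account
    if purge_on_delete:
        growth -= deleted_accounts_per_month * gb_per_account
    if growth <= 0:
        return None
    return (capacity_gb - used_gb) / growth


# Made-up figures: 1000 GB disk, 200 GB used, 100 new accounts a month
# at 0.5 GB each, 40 accounts deleted a month.
keep_files = months_until_full(1000, 200, 100, 0.5, 40, purge_on_delete=False)
purge_files = months_until_full(1000, 200, 100, 0.5, 40, purge_on_delete=True)
```

With these figures, keeping the files gives 16 months of headroom and purging them gives about 26.7, so the junk is what costs you the difference.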
You might also want to try using something like S3 to store your data - at that point, the only reason you would need to delete things would be because it was costing you too much to store it.
