StackExchange.Redis: reading a large file = timeouts (Azure)

I have an application that uses a large in-memory file (just under 2 GB).
I'm trying to use Redis lists (in Azure) as the storage (vs SQL). Building the list in Redis is pretty fast; I can load it in about 5 minutes, but then I need to read from the list back into the application.
This is extremely slow. I've tried increasing the thread pool minimums, extending syncTimeout, etc., to no avail.
ThreadPool.SetMinThreads(200, 200);
I'm using a C# implementation of a Redis-backed list that I found online, and I pass it to the code that builds the in-memory collection via a foreach loop. Internally this is how it processes the data (I've omitted the rest of the class):
public class RedisList<T> : IList<T>
{
    private static ConnectionMultiplexer _cnn;
    private readonly string _key;

    private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(ConfigurationManager.AppSettings["AzureRedisCacheUrl"]));

    public ConnectionMultiplexer Connection => LazyConnection.Value;

    public RedisList(string key)
    {
        this._key = key;
        _cnn = Connection;
    }

    public IEnumerator<T> GetEnumerator()
    {
        // Note: each iteration is a separate LINDEX call, i.e. one network
        // round trip (plus a deserialization) per element in the list.
        for (var i = 0; i < this.Count; i++)
        {
            yield return Deserialize<T>(GetRedisDb().ListGetByIndex(_key, i).ToString());
        }
    }
}
Is there a more efficient way of reading in the data? Am I insane to do it this way? :D Thanks

Am I insane to do it this way?
Absolutely. Redis isn't designed to store large binary data like a nearly 2 GB file (or even a 100 MB one).
Redis is about indexing small chunks of data so they can be retrieved later in a way that is very optimized and efficient in terms of both CPU and memory. Remember that Redis is an in-memory database; the fact that it snapshots its data to disk (for example, to RDB files) doesn't change the fact that the data itself lives in your RAM.
Instead of storing this large binary data in Redis, just use Redis as a file index, and leverage its data structures to get back to the files in a breeze.
The OP said in a comment:
Hi Matias, the redis list is a collection of individual files ranging from a few KB to around 5 MB. The legacy application I'm working on loads all of this into a huge static object on application start (I know, awful). The original version was an in-memory file loaded from SQL; this proved too slow when moving to Azure, so we need a faster intermediary store for the data.
Anyway, Redis isn't designed to be an in-memory file store.
I would suggest taking a look at memory-mapped files: you can load all of the files into a single memory-mapped file and retrieve them using offsets (for example, bytes 0 to 243843 are file1, and so on). This should improve overall performance, and you won't be using the wrong tool for the job.
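For illustration, here is a minimal sketch of that idea using .NET memory-mapped files. The container file name and the (offset, length) values are made up for the example; the index of offsets per logical file could live in Redis or anywhere else:

using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedFileStore
{
    // Reads one logical file out of a large container file via a memory-mapped view.
    // In practice you would keep an index of (offset, length) per logical file.
    public static byte[] ReadSlice(string containerPath, long offset, long length)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(containerPath, FileMode.Open))
        using (var view = mmf.CreateViewStream(offset, length, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[length];
            view.Read(buffer, 0, buffer.Length);
            return buffer;
        }
    }
}

// Usage, following the example above: bytes 0 to 243843 hold file1.
// var file1 = MappedFileStore.ReadSlice("allfiles.bin", 0, 243844);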

Related

Caching relational data using redis

I'm building a small social network (users have posts and posts have comments - very basic), using a clustered node.js server and Redis as a distributed cache.
My approach to caching users' posts is to have a sorted set that contains all of a user's post ids ordered by rate (which should be updated every time someone adds a like or comment), and the actual objects stored as hashes.
So the get user's posts flow should look like this:
1. using zrange to get a range of ids from the sorted set.
2. using multi/exec and hgetall to fetch all the objects at once.
I have a couple of questions:
1. Regarding performance: will my approach scale as the cache gets bigger, or should I use Lua scripting or something similar?
2. If I want to continue with the current approach, where should I persist the sorted set in case Redis crashes? Using Redis persistence would affect overall performance, so I thought about using a dedicated Redis server for the sets. (I searched for a way to back up only part of the Redis data but didn't find anything about it.)
My approach => getTopObjects({userID}, 0, 20) :
self.zrange = function(setID, start, stop, multi)
{
    return execute(this, "zrange", [setID, start, stop], multi);
};

self.getObject = function(key, multi)
{
    return execute(this, "hgetall", key, multi);
};

self.getObjects = function(keys)
{
    // Queue one HGETALL per key on a single MULTI, then EXEC them in one round trip.
    let multi = this.client.multi();
    let promiseArray = [];
    for (var i = 0, len = keys.length; i < len; i++)
    {
        promiseArray.push(this.getObject(keys[i], multi));
    }
    return execute(this, "exec", [], multi).then(function(results)
    {
        //TODO: do something with the result.
        return Promise.all(promiseArray);
    });
};

self.getTopObjects = function(setID, start, stop)
{
    //TODO: validate the range
    let thisArg = this;
    return this.zrevrange(setID, start, stop).then(function(keys)
    {
        return thisArg.getObjects(keys);
    });
};
It's an interesting intellectual exercise, but in my opinion this is classic premature optimization.
1) It's probably way too early to have even introduced Redis, let alone to be thinking about whether Redis is fast enough. Your social network is almost certainly just fine up to about 1,000 users running off raw SQL queries against MySQL / Postgres / some random RDS instance. If it starts to slow down, get data on slow-running queries and fix them with query optimizations and appropriate indexes. That'll get you past 10,000 users.
2) Now you can start introducing Redis. In general, I'd encourage you to think of Redis purely as a cache and not as permanent storage; it shouldn't matter if it gets blown away, it just means your site is slower for the next few seconds because your users are getting their page loads from SQL queries instead of Redis hits (each query re-populating that user's sorted list of posts in Redis, of course).
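To make the cache-aside idea concrete, here is a rough sketch using C# and StackExchange.Redis (the question's own code is node.js); the key name, the rate-as-score convention, and the LoadPostIdsFromSql helper are hypothetical stand-ins, not anything from the question:

using System;
using StackExchange.Redis;

public class PostsCache
{
    private readonly IDatabase _redis;

    public PostsCache(IDatabase redis) => _redis = redis;

    // Cache-aside: serve post ids from Redis when present, otherwise fall back
    // to SQL and re-populate the per-user sorted set.
    public RedisValue[] GetTopPostIds(string userId, int start, int stop)
    {
        RedisKey key = "user:" + userId + ":posts";

        var ids = _redis.SortedSetRangeByRank(key, start, stop, Order.Descending);
        if (ids.Length > 0)
            return ids;

        // Cache miss: rebuild the sorted set from the relational store (hypothetical helper).
        foreach (var (postId, rate) in LoadPostIdsFromSql(userId))
            _redis.SortedSetAdd(key, postId, rate);

        return _redis.SortedSetRangeByRank(key, start, stop, Order.Descending);
    }

    private static (string PostId, double Rate)[] LoadPostIdsFromSql(string userId)
        => Array.Empty<(string, double)>(); // stand-in for the real SQL query
}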
Your strategy and example code for using redis seem fine to me, but until you have actual data on how users use your site (which may be drastically different than your current expectations), it's simply impossible to know what types of SQL indexes you will need, what keys and lists are ideal for caching in redis, etc.
I faced similar issues and needed a way to query the data more efficiently. I can't say for sure, but I've heard that because Redis is single-threaded, running Lua scripts blocks the main thread, and I'm sure that's not good for a social networking site. I've heard about Tarantool and it looks promising; I'm currently trying to wrap my head around it.
If you are concerned about your cache size growing, I think most social networks keep two weeks' worth of data in each user's cache; anything older than two weeks gets deleted, and you simply implement a scrolling feature that works with pagination: once the user scrolls down, fetch the next two weeks' worth of data and add it back into memory only for that specific user (don't forget to set a new TTL for the newly added data). This helps keep your cache size lean.
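A rough sketch of that trimming idea, written in C# with StackExchange.Redis; the key name, the timestamp-as-score convention, and the 14-day window are assumptions for illustration only:

using System;
using StackExchange.Redis;

public static class TimelineCache
{
    // Keep roughly two weeks of a user's timeline in a sorted set.
    // Assumes scores are Unix timestamps; key name and window are illustrative.
    public static void CachePost(IDatabase db, string userId, string postId, DateTimeOffset postedAt)
    {
        RedisKey key = "user:" + userId + ":posts";

        // Add the post with its timestamp as the score.
        db.SortedSetAdd(key, postId, postedAt.ToUnixTimeSeconds());

        // Trim anything older than two weeks.
        var cutoff = DateTimeOffset.UtcNow.AddDays(-14).ToUnixTimeSeconds();
        db.SortedSetRemoveRangeByScore(key, double.NegativeInfinity, cutoff);

        // Refresh the TTL so an inactive user's cache eventually disappears on its own.
        db.KeyExpire(key, TimeSpan.FromDays(14));
    }
}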
When Redis, or whatever in-memory data tool you are using, crashes, you simply reload the data back into memory; they all have features for saving data to files as a backup. I'm thinking of adding another database layer, say Cassandra or MongoDB, that holds each user's timeline since inception. Sure, this creates more overhead, because you have to keep three data layers (e.g. MySQL, Redis, and MongoDB) in sync!
If this looks like a lot of work, feel free to use a third-party service to host your in-memory data; at least you can sleep easy, but it's going to cost you.
That said, this is highly opinionated. I got tired of people telling me to wait until my site explodes with users, or giving me the so-called "premature optimization" reply you got :)

ServiceStack.OrmLite: Slow write/reads?

UPDATE June 30
This question led to a cleaner benchmark, and mythz found an issue and resolved it:
ServiceStack benchmark continued: why does persisting a simple (complex) to JSON slow down SELECTs?
ARE WRITE/READ SPEEDS REASONABLE?
In my trials with OrmLite, I am going to test converting all our current data/objects from our own database-persistence implementation over to OrmLite.
First, though, I did a simple benchmark/speed test where I compared our current serialize-and-write-to-DB path, and the corresponding read-from-DB-and-deserialize path, against OrmLite.
What I found is that ServiceStack is much slower than our current approach (we currently just serialize the object using FastSerializer and write the byte[] data to a BLOB field, so it's fast to write and read, but of course with obvious drawbacks).
The test I did uses the Customer class, which has a bunch of properties (it's used in our products, so it's a class we work with every day in our current versions).
If I create 10,000 such objects and then measure how long it takes to persist them to a MySQL database (serialization + write to DB), the results are:
UPDATE
As the "current implementation" is cheating (its just BLOBing a byte[] to database), I implemented a simple RelationDbHandler that persists the 10 000 objects in the normal way, with a simple SQL query. Results are added below.
WRITE 10 000 objects:
Current implementation: 33 seconds
OrmLite (using .Save): 94 seconds
Relational approach: 24.7 seconds
READ 10 000 objects:
Current implementation: 1.5 seconds
OrmLite (using Select<>): 28 seconds
Relational approach: 16 seconds
I am running it locally, on a SSD disk, with no other load on CPU or disk.
I expected our current implementation to be faster, but not that much faster.
I read some benchmark material on the ServiceStack webpage (https://github.com/ServiceStack/ServiceStack/wiki/Real-world-performance), but most of the links are dead. Some claim that reading 25,000 rows takes 245 ms, but I have no idea what such a row looks like.
Question 1: Are there any benchmarks I can read more about?
Question 2: The Customer object is specified below. Does mythz think the write/read times above are reasonable?
TEST CASE:
This is the Customer object as it looks in the database after OrmLite created the table. I only populate 5 properties, and one is "complex" (so only one field has a JSON-serialized representation in the row), but since all fields are written, I don't think that matters much?
Code to save using OrmLite:
public void MyTestMethod<T>(T coreObject) where T : CoreObject
{
    long id = 0;
    using (var _db = _dbFactory.Open())
    {
        id = _db.Insert<T>(coreObject, selectIdentity: true);
    }
}
Code to read all from table:
internal List<T> FetchAll<T>()
{
    using (var _db = _dbFactory.Open())
    {
        List<T> list = _db.Select<T>();
        return list;
    }
}
Use Insert() for inserting rows. Save() will check whether the record already exists and update it if it does; it also populates auto-increment primary keys, if any were created.
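For example (a minimal sketch; the Customer instances and _dbFactory are assumed to be the ones from the question, and an [AutoIncrement] Id property is an assumption):

using (var db = _dbFactory.Open())
{
    // Insert() issues a plain INSERT; use it when you know the row is new.
    db.Insert(customer);

    // Save() checks whether the record exists, INSERTs or UPDATEs accordingly,
    // and populates the auto-increment primary key back onto the object.
    db.Save(anotherCustomer);
    var newId = anotherCustomer.Id; // populated by Save() if Id is [AutoIncrement]
}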
There are also InsertAsync() APIs available, but Oracle's official MySql.Data NuGet package doesn't have a proper async implementation, in which case using https://github.com/mysql-net/MySqlConnector can yield better results: install the ServiceStack.OrmLite.MySqlConnector NuGet package and use its MySqlConnectorDialect.Provider.
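Switching providers is just a change to the connection factory registration (a sketch; the connection string is a placeholder):

using ServiceStack.OrmLite;

// After installing the ServiceStack.OrmLite.MySqlConnector NuGet package,
// pass MySqlConnectorDialect.Provider instead of MySqlDialect.Provider.
var dbFactory = new OrmLiteConnectionFactory(
    "Server=localhost;Database=test;Uid=user;Pwd=pass;", // placeholder
    MySqlConnectorDialect.Provider);

using (var db = dbFactory.Open())
{
    db.Insert(new Customer { /* ... */ }); // same OrmLite APIs as before
}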
You'll also get better performance using .NET Core, which will use the latest 8.x MySql.Data NuGet package.
Note: the results in https://github.com/tedekeroth/OrmLiteVsFastSerializer are not comparable; that benchmark is essentially comparing using MySQL as NoSQL blob storage vs. a quasi-relational model in OrmLite with multiple complex-type blobbed fields.
In my tests I've also noticed several serialization exceptions being swallowed, which will negatively impact performance. You can have exceptions bubble up by configuring this on startup:
OrmLiteConfig.ThrowOnError = JsConfig.ThrowOnError = true;

Should I use Hazelcast to detect duplicate requests to a REST service

I have a simple use case: a system where duplicate requests to a REST service (with dozens of instances) are not allowed. However, they are also difficult to prevent because of a complicated datastore configuration and downstream services.
So the only way I can prevent duplicate "transactions" is to have some centralized place where I write a unique hash of the request data. Each REST endpoint first checks whether the hash of a new request already exists and only proceeds if no such hash exists.
For purposes of this question assume that it's not possible to do this with database constraints.
One solution is to create a table in the database where I store my request hashes and always write to this table before proceeding with the request. However, I want something lighter than that.
Another solution is to use something like Redis and write my unique hashes to it before proceeding with the request. However, I don't want to spin up a Redis cluster and maintain it, etc.
I was thinking of embedding Hazelcast in each of my app instances and write my unique hashes there. In theory, all instances will see the hash in the memory grid and will be able to detect duplicate requests. This solves my problem of having a lighter solution than a database and the other requirement of not having to maintain a Redis cluster.
OK, now for my question, finally: is it a good idea to use Hazelcast for this use case?
Will Hazelcast be fast enough to detect duplicate requests that come in milliseconds or microseconds apart?
Say request 1 comes into instance 1 and request 2 comes into instance 2 microseconds apart. Instance 1 writes a hash of the request to Hazelcast, and instance 2 checks Hazelcast for the existence of that hash only milliseconds later. Will the hash be detected? Is Hazelcast going to propagate the data across the cluster in time? Does it even need to?
Thanks in advance, all ideas are welcome.
Hazelcast is definitely a good choice for this kind of use case, especially if you just use a Map<String, Boolean> and test with Map::containsKey instead of retrieving the element and checking for null. You should also set a TTL when putting the element, so you won't run out of memory. However, as with Redis, we recommend running Hazelcast as a standalone cluster for "bigger" datasets, since the lifecycle of cached elements normally interferes with the rest of the application and complicates GC optimization. Running Hazelcast embedded is a choice that should be made only after serious consideration and testing of your application at runtime.
Yes, you can use a Hazelcast distributed map to detect duplicate requests to a REST service, because whenever there is a put operation into a Hazelcast map, the data becomes available to all of the other clustered instances.
From what I've read and seen in the tests, it doesn't actually replicate. It uses a data grid to distribute the primary data evenly across all the nodes rather than each node keeping a full copy of everything and replicating to sync the data. The great thing about this is that there is no data lag, which is inherent to any replication strategy.
There is a backup copy of each node's data stored on another node, and that obviously depends on replication, but the backup copy is only used when a node crashes.
See the code below, which creates two clustered Hazelcast instances and gets the distributed map. One Hazelcast instance puts data into the distributed IMap and the other instance reads the data from it.
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class TestHazelcastDataReplication {

    // Create 1st instance
    public static final HazelcastInstance instanceOne = Hazelcast
            .newHazelcastInstance(new Config("distributedFirstInstance"));

    // Create 2nd instance
    public static final HazelcastInstance instanceTwo = Hazelcast
            .newHazelcastInstance(new Config("distributedSecondInstance"));

    // Insert into distributedMap using instance one
    static IMap<Long, Long> distributedInsertMap = instanceOne.getMap("distributedMap");

    // Read from distributedMap using instance two
    static IMap<Long, Long> distributedGetMap = instanceTwo.getMap("distributedMap");

    public static void main(String[] args) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                for (long i = 0; i < 100000; i++) {
                    // Inserting data into distributedMap using the 1st instance
                    distributedInsertMap.put(i, System.currentTimeMillis());
                    // Reading the data back from distributedMap using the 2nd instance
                    System.out.println(i + " : " + distributedGetMap.get(i));
                }
            }
        }).start();
    }
}

Service Fabric - (reaching MaxReplicationMessageSize) Huge amount of data in a reliable dictionary

EDIT (question summary):
I want to expose an endpoint that is capable of returning portions of XML data based on some query parameters.
I have a stateful service that keeps the XML data, converted to DTOs, in a reliable dictionary.
I use a single, named partition (I just can't tell which partition holds the data from the query parameters passed, so I can't implement a smarter partitioning strategy).
I am using service remoting for communication between the stateless Web API service and the stateful one.
The XML data may reach 500 MB.
Everything is OK when the XML is only around 50 MB.
When the data gets larger, Service Fabric starts complaining about MaxReplicationMessageSize.
And a summary of my questions below: how can one store a large amount of data in a reliable dictionary?
TL;DR:
Apparently, I am missing something...
I want to parse huge XML files and load them into a reliable dictionary for later queries.
I am using a single, named partition.
I have an XMLData stateful service that loads these XMLs into a reliable dictionary in its RunAsync method via this piece of code:
var myDictionary = await this.StateManager
    .GetOrAddAsync<IReliableDictionary<string, List<HospitalData>>>("DATA");

using (var tx = this.StateManager.CreateTransaction())
{
    var result = await myDictionary.TryGetValueAsync(tx, "data");
    ServiceEventSource.Current.ServiceMessage(this, "data status: {0}",
        result.HasValue ? "loaded" : "not loaded yet, starts loading");

    if (!result.HasValue)
    {
        Stopwatch timer = new Stopwatch();
        timer.Start();

        var converter = new DataConverter(XmlFolder);
        List<HospitalData> data = converter.LoadData();

        // The whole list is stored under a single key, so it replicates as one operation.
        await myDictionary.AddOrUpdateAsync(tx, "data", data, (key, value) => data);

        timer.Stop();
        ServiceEventSource.Current.ServiceMessage(this,
            string.Format("Loading of data finished in {0} ms",
                timer.ElapsedMilliseconds));
    }
    await tx.CommitAsync();
}
I have a stateless Web API service that communicates with the above stateful one via service remoting, querying the dictionary via this code:
ServiceUriBuilder builder = new ServiceUriBuilder(DataServiceName);
IDataService DataServiceClient = ServiceProxy.Create<IDataService>(builder.ToUri(),
    new Microsoft.ServiceFabric.Services.Client.ServicePartitionKey("My.single.named.partition"));

try
{
    var data = await DataServiceClient.QueryData(SomeQuery);
    return Ok(data);
}
catch (Exception ex)
{
    ServiceEventSource.Current.Message("Web Service: Exception: {0}", ex);
    throw;
}
It works really well as long as the XMLs do not exceed 50 MB.
After that I get errors like:
System.Fabric.FabricReplicationOperationTooLargeException: The replication operation is larger than the configured limit - MaxReplicationMessageSize ---> System.Runtime.InteropServices.COMException
Questions:
I am almost certain that this is about the partitioning strategy and that I need to use more partitions. But how do I reference a particular partition from within the RunAsync method of the stateful service? (The stateful service is invoked via RPC from the Web API, where I explicitly specify a partition, so there I can easily choose among partitions when using the ranged partitioning strategy; but how do I do that during the initial loading of the data in the RunAsync method?)
Are these thoughts of mine correct: the code in a stateful service operates on a single partition, so loading the huge amount of data and partitioning it should happen outside the stateful service (for example, in an Actor); then, after determining the partition key, I just invoke the stateful service via RPC, pointing it at that particular partition?
Actually, is this a partitioning problem at all, and what (where, who) defines the size of a replication message? I.e., does the partitioning strategy influence the replication message sizes?
Would extracting the loading logic into a stateful Actor help in any way?
For any help on this - thanks a lot!
The issue is that you're trying to add a large amount of data into a single dictionary record. When Service Fabric tries to replicate that data to other replicas of the service, it hits a quota of the replicator, MaxReplicationMessageSize, which indeed defaults to 50 MB (documented here).
You can increase the quota by specifying a ReliableStateManagerConfiguration:
internal sealed class Stateful1 : StatefulService
{
    public Stateful1(StatefulServiceContext context)
        : base(context, new ReliableStateManager(context,
            new ReliableStateManagerConfiguration(new ReliableStateManagerReplicatorSettings
            {
                MaxReplicationMessageSize = 1024 * 1024 * 200
            })))
    { }
}
But I strongly suggest you change the way you store your data. The current method won't scale very well and isn't the way Reliable Collections were meant to be used.
Instead, you should store each HospitalData in a separate dictionary item. Then you can query the items in the dictionary (see this answer for details on how to use LINQ). You will not need to change the above quota.
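A rough sketch of that per-item layout, reusing the loading code from the question (assuming HospitalData exposes some unique Id to key on; that property name is an assumption):

var myDictionary = await this.StateManager
    .GetOrAddAsync<IReliableDictionary<string, HospitalData>>("DATA");

var converter = new DataConverter(XmlFolder);
List<HospitalData> data = converter.LoadData();

using (var tx = this.StateManager.CreateTransaction())
{
    foreach (var item in data)
    {
        // One dictionary entry per HospitalData keeps each replication operation small.
        await myDictionary.SetAsync(tx, item.Id, item); // Id is an assumed unique key
    }
    await tx.CommitAsync();
}

Depending on the data volume, it may also be worth committing in smaller batches rather than in one huge transaction.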
PS - You don't necessarily have to use partitioning for 500 MB of data. But regarding your question: you could use partitions even if you can't derive the key from the query, simply by querying all partitions and then combining the data.
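For completeness, a sketch of that fan-out from the Web API side; the FabricClient/LowKey addressing and the shape of QueryData's result are assumptions for illustration, while IDataService, ServiceUriBuilder and SomeQuery come from the question:

var fabricClient = new FabricClient();
var serviceUri = new ServiceUriBuilder(DataServiceName).ToUri();

// Ask Service Fabric for every partition of the stateful service...
var partitions = await fabricClient.QueryManager.GetPartitionListAsync(serviceUri);

var combined = new List<HospitalData>();
foreach (var partition in partitions)
{
    // ...and address each ranged partition by any key inside its range, e.g. its LowKey.
    var info = (Int64RangePartitionInformation)partition.PartitionInformation;
    var proxy = ServiceProxy.Create<IDataService>(serviceUri,
        new ServicePartitionKey(info.LowKey));

    combined.AddRange(await proxy.QueryData(SomeQuery)); // assumes QueryData returns a collection
}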

How do I optimize working with large datasets in MongoDB

We have multiple collections of about 10,000 documents each (this will keep growing in the future) that are generated in node.js and need to be stored/queried/filtered/projected multiple times, for which we have a MongoDB aggregation pipeline. Once certain conditions are met, the documents are regenerated and stored.
Everything worked fine when we had 5,000 documents. We inserted them as an array inside a single document and used $unwind in the aggregation pipeline. However, at a certain point the documents no longer fit in a single document because they exceed the 16 MB document size limit. We needed to store everything as separate documents in bulk, and add some identifiers so we know which 'collection' they belong to and can run the pipeline on those documents only.
Problem: writing the documents, which is necessary before we can query them in a pipeline, is problematically slow. The bulk.execute() part alone can easily take 10-15 seconds. Adding them to an array in node.js and writing the <16 MB doc to MongoDB takes only a fraction of a second.
bulk = col.initializeOrderedBulkOp();

for (var i = 0, l = docs.length; i < l; i++) {
    bulk.insert({
        doc   : docs[i],
        group : group.metadata
    });
}

bulk.execute(bulkOpts, function(err, result) {
    // ...
});
How can we address the overhead/latency of these bulk writes?
Thoughts so far:
A memory-based collection that temporarily handles queries while the data is being written to disk.
Figure out whether the In-Memory Storage Engine (alert: considered beta and not for production) is worth the MongoDB Enterprise licensing.
Perhaps the WiredTiger storage engine has improvements over MMAPv1 beyond compression and encryption.
Storing a single (array) document anyway, but splitting it into <16 MB chunks.
