arangodb truncate fails on large a collection - arangodb

I get a timeout in arangosh and the arangodb service gets unresponsive if I try to truncate a large collection of ~40 million docs. Message:
arangosh [database_xxx]> db.[collection_yyy].truncate() ; JavaScript exception in file '/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js' at 104,13: [ArangoError 2001: Error reading from: 'tcp://127.0.0.1:8529' 'timeout during read'] !
throw new ArangoError(requestResult); ! ^ stacktrace: Error
at Object.exports.checkRequestResult (/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js:104:13)
at ArangoCollection.truncate (/usr/share/arangodb/js/client/modules/org/arangodb/arango-collection.js:468:12)
at <shell command>:1:11
ArangoDB 2.6.9 on Debian Jessie, AWS ec2 m4.xlarge, 16G RAM, SSD.
The service gets unresponsive. I suspect it got stuck (not just busy), because it doesn't work until after I stop, delete database in /var/lib/arangodb/databases/, then start again.
I know I may be leaning towards the limits of performance due to the size, but I would guess that it is the intention not to fail regardless of size.
However on a non cloud Windows 10, 16GB RAM, SSD the same action succeeded well - after a while.
Is it a bug? I have some python code that loads dummy data into a collection if it helps. Please let me know if I shall provide more info.
Would it help to fiddle with --server.request-timeout ?

Increasing --server.request-timeout for the ArangoShell will only increase the timeout that the shell will use before it closes an idle connection.
The arangod server will also shut down lingering keep-alive connections, and that may happen earlier. This is controlled via the server's --server.keep-alive-timeout setting.
However, increasing both won't help much. The actual problem seems to be the truncate() operation itself. And yes, it may be very expensive. truncate() is a transactional operation, so it will write a deletion marker for each document it removes into the server's write-ahead log. It will also buffer each deletion in memory so the operation can be rolled back if it fails.
A much less intrusive operation than truncate() is to instead drop the collection and re-create it. This should be very fast.
However, indexes and special settings of the collection need to be recreated / restored manually if they existed before dropping it.
For a document collection, it can be achieved like this:
function dropAndRecreateCollection (collectionName) {
// save state
var c = db._collection(collectionName);
var properties = c.properties();
var type = c.type();
var indexes = c.getIndexes();
// drop existing collection
db._drop(collectionName);
// restore collection
var i;
if (type == 2) {
// document collection
c = db._create(collectionName, properties);
i = 1;
}
else {
// edge collection
c = db._createEdgeCollection(collectionName, properties);
i = 2;
}
// restore indexes
for (; i < indexes.length; ++i) {
c.ensureIndex(indexes[i]);
}
}

Related

Throttling EF queries to save DTUs

We have an asp.Net application using EF 6 hosted in Azure. The database runs at about 20% DTU usage for most of the time except for certain rare actions.
These are almost like db dumps in Excel format, like having all orders of the last X years etc. which the (power) users can trigger and then get the result later by email.
The problem is that these queries use up all DTU and the whole application goes into a crawl. We would like to kind of throttle these non-critical queries as it doesn't matter if this takes 10-15min longer.
Googling I found the option to reduce the DEADLOCK_PRIORITY but this wont fix the issue of using up all resources.
Thanks for any pointers, ideas or solutions.
Optimizing is going to be hard as it is more or less a db dump.
Azure SQL Database doesn't have Resource Governor available, so you'll have to handle this in code.
Azure SQL Database runs in READ COMMITTED SNAPSHOT mode, so slowing down the session that dumps the data from a table (or any streaming query plan) should reduce its DTU consumption without adversely affecting other sessions.
To do this put waits in the loop that reads the query results, either an IEnumerable<TEntity> returned from a LINQ query or a SqlDataReader returned from an ADO.NET SqlCommand.
But you'll have to directly loop over the streaming results. You can't copy the query results into memory first using IQueryable<TEntity>.ToList() or DataTable.Load(), SqlDataAdapter.Fill(), etc as that would read as fast as possible.
eg
var results = new List<TEntity>();
int rc = 0;
using (var dr = cmd.ExecuteReader())
{
while (dr.Read())
{
rc++;
var e = new TEntity();
e.Id = dr.GetInt(0);
e.Name = dr.GetString(1);
// ...
results.Add(e);
if (rc%100==0)
Thread.CurrentThread.Sleep(100);
}
}
or
var results = new List<TEntity>();
int rc = 0;
foreach (var e in db.MyTable.AsEnumerable())
{
rc++;
var e = new TEntity();
e.Id = dr.GetInt(0);
e.Name = dr.GetString(1);
// ...
results.Add(e);
if (rc%100==0)
Thread.CurrentThread.Sleep(100);
}
For extra credit, use async waits and stream the results directly to the client without batching in memory.
Alternatively, or in addition, you can limit the number of sessions that can concurrently perform the dump to one, or one per table, etc using named Application Locks.

What happens when you do a ToList() on a gigantic table storage table?

I had some really old code somewhere on my application that I accidently triggered:
var json = table.CreateQuery<ActionLog>().ToList().ToJson();
another suspect is:
var action_log_list = await table.CreateQuery<ActionLog>()
.Where(log => log.StartTime > startTime)
.AsTableQuery()
.[...]
The problem is that this table is gigantic - probably hundreds of millions.
About the same time I hit this code it took out one instance of my application and that one didn't come back for more then one hour. Even after Restarts.
Now I was actually investigating some mild performance problems, so I'm wondering; was this a coincidence or could the code above bring down a table storage - like a 'really long running query' and after that blocking i.e. inserts or reads on that table?
About the same time I hit this code it took out one instance of my application and that one didn't come back for more then one hour. Even after Restarts.
Based on my knowledge, we could use the ExecuteSegmentedAsync to improve the peformance. The following is demo code.
var query = table.CreateQuery<ActionLog>().AsTableQuery();
TableContinuationToken continuationToken = null;
do
{
// Execute the query async until there is no more result
var queryResult = await query.ExecuteSegmentedAsync(continuationToken);
// to do something
continuationToken = queryResult.ContinuationToken;
} while (continuationToken != null);
As it is gigantic table, it may still need long time to do that. I don't test it on my side.
But based on my experience, if we want to deal with so huge records, I recommand that you could use Azure Data factory to do that.

How to perform massive data uploads to firebase firestore

I have about ~300mb of data (~180k json objects) that gets updated once every 2-3 days.
This data is divided into three "collections", that I must keep up to date.
I decided to take the Node.JS way, but any solution in a language i know ( Java, Python) will be welcomed.
Whenever I perform a batch set using the node.JS firebase-admin client, not only it consumes an aberrant amount of ram ( about 4-6GB!), but it also tends to crash with errors that don't have a clear ( up to page 4 of google search without a meaningful answer ) reason.
My code is frankly simple, this is it:
var collection = db.collection("items");
var batch = db.batch();
array.forEach(item => {
var ref = collection.doc(item.id);
batch.set(ref, item);
});
batch.commit().then((res) => {
console.log("YAY",res);
});
I haven't found anywhere if there is a limit on the number of writes in a limited span of time (I understand doing 50-60k writes should be easy peasy with a backend the size of firebase), and also found that this can go up the ram train and have like 4-6GB of ram allocated.
I can confirm that when the errors are thrown, or the ram usage clogs my laptop, whatever happens first, I am still at less than 1-4% my daily usage quotas, so that is not the issue.

How to trigger the pre-load of Hazelcast NearCache?

I understand that the NearCache gets loaded only after first get operation is performed on that key on the IMap. But I am interested in knowing if there is any way to trigger the pre-load of the NearCache with all the entries from its cluster.
Use Case:
The key is a simple bean object and the value is a DAO object of type TIntHashMap containing lot of entries.
Size:
The size of value object ranges from 0.1MB to 24MB (and >90% of the entries have less than 5MB). The number of entries range from 150-250 in the IMap.
Benchmarks:
The first call to the get operation is taking 2-3 seconds and later calls are taking <10 ms.
Right now I have created below routine which reads the IMap and reads each entries to refresh the NearCache.
long startTime = System.currentTimeMillis();
IMap<Object, Object> map = client.getMap("utility-cache");
log.info("Connected to the Cache cluster. Starting the NearCache refresh.");
int i = 0;
for (Object key : map.keySet()) {
Object value = map.get(key);
if(log.isTraceEnabled()){
SizeOf sizeOfKey = new SizeOf(key);
SizeOf sizeOfValue = new SizeOf(value);
log.info(String.format("Size of %s Key(%s) Object = %s MB - Size of %s Value Object = %s MB", key.getClass().getSimpleName(), key.toString(),
sizeOfKey.sizeInMB(), value.getClass().getSimpleName(), sizeOfValue.sizeInMB()));
}
i++;
}
log.info("Refreshed NearCache with " + i + " Entries in " + (System.currentTimeMillis() - startTime) + " ms");
As you said, the Near Cache gets populated on get() calls on IMap or JCache data structures. At the moment there is no system to automatically preload any data.
For efficiency you can use getAll() which will get the data in batches. This should improve the performance of your own preloading functionality. You can vary your batch sizes until you find the optimum for your use case.
With Hazelcast 3.8 there will be a Near Cache preloader feature, which will store the keys in the Near Cache on disk. When the Hazelcast client is restarted the previous data set will be pre-fetched to re-populate the previous hot data set in the Near Cache as fast as possible (only the keys are stored, the data is fetched again from the cluster). So this won't help for the first deployment, but for all following restarts. Maybe this is already what you are looking for?
You can test the feature in the 3.8-EA or the recent 3.8-SNAPSHOT version. The documentation for the configuration can be found here: http://docs.hazelcast.org/docs/latest-dev/manual/html-single/index.html#configuring-near-cache
Please be aware that we changed the configuration parameter from file-name to filename between EA and the actual SNAPSHOT. I recommend the SNAPSHOT version, since we also made some other improvements in the preloader code.

Alternatives to MongoDB cursor.toArray() in node.js

I am currently using MongoDB cursor's toArray() function to convert the database results into an array:
run = true;
count = 0;
var start = process.hrtime();
db.collection.find({}, {limit: 2000}).toArray(function(err, docs){
var diff = process.hrtime(start);
run = false;
socket.emit('result', {
result: docs,
time: diff[0] * 1000 + diff[1] / 1000000,
ticks: count
});
if(err) console.log(err);
});
This operation takes about 7ms on my computer. If I remove the .toArray() function then the operation takes about 0.15ms. Of course this won't work because I need to forward the data, but I'm wondering what the function is doing since it takes so long? Each document in the database simply consists of 4 numbers.
In the end I'm hoping to run this on a much smaller processor, like a Raspberry Pi, and here the operation where it fetches 500 documents from the database and converts it to an array takes about 230ms. That seems like a lot to me. Or am I just expecting too much?
Are there any alternative ways to get data from the database without using toArray()?
Another thing that I noticed is that the entire Node application slows remarkably down while getting the database results. I created a simple interval function that should increment the count value every 1 ms:
setInterval(function(){
if(run) count++;
}, 1);
I would then expect the count value to be almost the same as the time, but for a time of 16 ms on my computer the count value was 3 or 4. On the Raspberry Pi the count value was never incremented. What is taking so much CPU usage? The monitor told me that my computer was using 27% CPU and the Raspberry Pi was using 92% CPU and 11% RAM, when asked to run the database query repeatedly.
I know that was a lot of questions. Any help or explanations are much appreciated. I'm still new to Node and MongoDB.
db.collection.find() returns a cursor, not results, and opening a cursor is pretty fast.
Once you start reading the cursor (using .toArray() or by traversing it using .each() or .next()), the actual documents are being transferred from the database to your client. That operation is taking up most of the time.
I doubt that using .each()/.next() (instead of .toArray(), which—under the hood—uses one of those two) will improve the performance much, but you could always try (who knows). Since .toArray() will read everything in memory, it may be worthwhile, although it doesn't sound like your data set is that large.
I really think that MongoDB on Raspberry Pi (esp a Model 1) is not going to work well. If you don't depend on the MongoDB query features too much, you should consider using an alternative data store. Perhaps even an in-memory storage (500 documents times 4 numbers doesn't sound like lots of RAM is required).

Resources