Why aren't the unused segment files being deleted? - azure

I don't know what changed; things were working relatively well with our Lucene implementation. But now the number of files in the index directory just keeps growing. It started with _0 files, then _1 files appeared, then _2 and _3 files. I am passing false for the IndexWriter constructor's 'create' parameter when there are existing files in that directory at startup:
indexWriter = new IndexWriter(azureDirectory, analyzer, (azureDirectory.ListAll().Length == 0), IndexWriter.MaxFieldLength.UNLIMITED);
if (indexWriter != null)
{
    // Set the number of segments to save in memory before writing to disk.
    indexWriter.MergeFactor = 1000;
    indexWriter.UseCompoundFile = false;
    indexWriter.SetRAMBufferSizeMB(800);
    ...
    indexWriter.Dispose();
    indexWriter = null;
}
Maybe it's related to the UseCompoundFile flag?
Every couple of minutes, I create a new IndexWriter, process 10,000 documents, then dispose the IndexWriter. The index works, but the growing number of files is very bad, because I'm using AzureDirectory which copies every file out of Azure into a cache directory before starting the Lucene write.
Thanks.

This is the normal behavior. If you want a single index segment you have some options:
Use compound files
Use a low MergeFactor if you use LogMergePolicy, which is the default policy for Lucene 3.0 (2 is the lowest value LogMergePolicy accepts). Note that the MergeFactor property you set on the IndexWriter is just a convenience that forwards to mergePolicy.MergeFactor, as long as mergePolicy is an instance of LogMergePolicy.
Run an optimization after each update to your index
A low merge factor or optimizing after every update can have serious drawbacks for your app's performance, depending on the type of indexing you do.
See this link, which documents the effects of MergeFactor:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/core/org/apache/lucene/index/LogMergePolicy.html#setMergeFactor%28%29
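For example, sticking with the writer setup from the question, a minimal sketch of those options could look like this (which of them you actually want depends on your indexing load, and you would normally pick only one or two):
using (var writer = new IndexWriter(azureDirectory, analyzer, (azureDirectory.ListAll().Length == 0), IndexWriter.MaxFieldLength.UNLIMITED))
{
    writer.UseCompoundFile = true;   // option 1: one compound (.cfs) file per segment instead of many small files
    writer.MergeFactor = 2;          // option 2: merge aggressively (2 is the lowest value LogMergePolicy accepts)
    // ... AddDocument / UpdateDocument calls ...
    writer.Optimize();               // option 3: merge everything down to a single segment before closing
}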

Related

Throttling EF queries to save DTUs

We have an ASP.NET application using EF 6 hosted in Azure. The database runs at about 20% DTU usage most of the time, except during certain rare actions.
These are essentially database dumps in Excel format, e.g. all orders of the last X years, which the (power) users can trigger and then receive later by email.
The problem is that these queries use up all the DTUs and the whole application slows to a crawl. We would like to throttle these non-critical queries, as it doesn't matter if they take 10-15 minutes longer.
Googling around, I found the option to reduce DEADLOCK_PRIORITY, but this won't fix the issue of using up all resources.
Thanks for any pointers, ideas or solutions.
Optimizing is going to be hard as it is more or less a db dump.
Azure SQL Database doesn't have Resource Governor available, so you'll have to handle this in code.
Azure SQL Database runs in READ COMMITTED SNAPSHOT mode, so slowing down the session that dumps the data from a table (or any streaming query plan) should reduce its DTU consumption without adversely affecting other sessions.
To do this, put waits in the loop that reads the query results, whether that's an IEnumerable<TEntity> returned from a LINQ query or a SqlDataReader returned from an ADO.NET SqlCommand.
But you'll have to loop directly over the streaming results. You can't copy the query results into memory first using IQueryable<TEntity>.ToList(), DataTable.Load(), SqlDataAdapter.Fill(), etc., as those read as fast as possible.
e.g.
var results = new List<TEntity>();
int rc = 0;
using (var dr = cmd.ExecuteReader())
{
    while (dr.Read())
    {
        rc++;
        var e = new TEntity();
        e.Id = dr.GetInt32(0);
        e.Name = dr.GetString(1);
        // ...
        results.Add(e);
        if (rc % 100 == 0)
            Thread.Sleep(100);
    }
}
or
var results = new List<TEntity>();
int rc = 0;
foreach (var e in db.MyTable.AsEnumerable())
{
    rc++;
    // the entity is already materialized here; just collect or process it
    results.Add(e);
    if (rc % 100 == 0)
        Thread.Sleep(100);
}
For extra credit, use async waits and stream the results directly to the client without batching in memory.
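A rough sketch of that, assuming the same cmd as above and the same 100-row pacing (row handling elided):
int rc = 0;
using (var dr = await cmd.ExecuteReaderAsync())
{
    while (await dr.ReadAsync())
    {
        rc++;
        // ... write the row straight to the response/output stream here instead of buffering it ...
        if (rc % 100 == 0)
            await Task.Delay(100);   // non-blocking pause; throttles without tying up a thread
    }
}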
Alternatively, or in addition, you can limit the number of sessions that can concurrently perform the dump to one (or one per table, etc.) using named Application Locks.
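A hedged sketch of that application-lock approach, assuming the dump runs on its own open SqlConnection conn (the lock name 'order-dump' is illustrative):
// Acquire an exclusive application lock; a second session asking for the same
// lock will wait here until the first one releases it or disconnects.
using (var lockCmd = conn.CreateCommand())
{
    lockCmd.CommandText = "EXEC sp_getapplock @Resource = N'order-dump', @LockMode = 'Exclusive', @LockOwner = 'Session', @LockTimeout = -1;";
    lockCmd.ExecuteNonQuery();
}

// ... run the long dump query on the same connection ...

using (var releaseCmd = conn.CreateCommand())
{
    releaseCmd.CommandText = "EXEC sp_releaseapplock @Resource = N'order-dump', @LockOwner = 'Session';";
    releaseCmd.ExecuteNonQuery();
}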

Node: Check a Firebase db and execute a function when an object's time matches the current time

Background
I have a Node and React based application. I'm using Firebase for my storage and database. In my application users can fill out a form where they upload an image and select a time for the image to be added to their website. I save each image update as an object in my Firebase database like so. Images are arranged in order of ascending update time.
user-name: {
  images: [
    {
      src: 'image-src-url',
      updateTime: 1503953587727
    },
    {
      src: 'image-src-url',
      updateTime: 1503958424838
    }
  ]
}
Scale
My application's db could potentially get very large, with a lot of users and images. I'd like to ensure scalability.
Issue
How do I check when a specific image object's time has been reached and then execute a function? (I don't need help with the function that runs, just with checking the db for a specific time.)
Attempts
I've thought about doing a cron job using node-cron that checks the entire database every 60s (users can only specify the minute the image will update, not the seconds); if it finds a matching updateTime, it executes my function. My concern is that at a large scale the cron job will take a while to search the db and could potentially miss a time.
I've also thought about dynamically creating a specific cron job for that time whenever a user schedules a new update, but I'm unsure how to accomplish this.
Any other methods that may work? Are my concerns about node-cron not valid?
There are two approaches I can think of:
Keep track of the last timestamp you processed
Keep the "things to process" in a queue
Keep track of the last timestamp you processed
When you process items, you use the current timestamp as the cut-off point for your query. Something like:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
Now make sure to store this now value somewhere (e.g. in your database) so that you can reuse it next time to retrieve the next batch of items:
var previous = ...; // the value of now stored by the previous run
var now = Date.now();
var query = ref.orderByChild("updateTime").startAt(previous).endAt(now);
With this you're only processing a single slice at a time. The only tricky bit is that somebody might insert a new node with an updateTime that falls in a range you've already processed. If this is a concern for your use case, you can prevent that with a validation rule on updateTime:
".validate": "newData.val() >= root.child('lastProcessed').val()"
As you add more items to the database, you will indeed be querying more items, so there is a scalability limit to this approach. But it should work well for anything up to a few hundred thousand nodes (I haven't tested in a while, so YMMV).
For a few previous questions on list size:
Firebase Performance: How many children per node?
Firebase Scalability Limit
How many records / rows / nodes is alot in firebase?
Keep the "things to process" in a queue
An alternative approach is to keep a queue of items that still need to be processed. The clients add the items they want processed to the queue, with an updateTime of when they want them to be processed. Your server picks items off the queue, performs the necessary updates, and removes them from the queue:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now);
query.once("value").then(function(snapshot) {
  snapshot.forEach(function(child) {
    // TODO: process the child node

    // remove the child node from the queue
    child.ref.remove();
  });
});
The difference with the earlier approach is that a queue's stable state is going to be empty (or at least quite small), so your queries will run against a much smaller list. That's also why you won't need to keep track of the last timestamp you processed: any item in the queue up to now is eligible for processing.

Task server on ML

I have a query that may return up to 2000 documents.
Within these documents I need six pcdata items returned as string values.
There is a possibility of an expanded tree cache error, since the document sizes range from small to very large.
I am looking at xdmp:spawn-function to break up my result set.
I will pass wildcard values, based on a known "unique key structure", and will know the maximum number of results possible; each wildcard value will return 100 documents max.
Note: The pcdata for the unique key structure does have a range index on it.
Am I on the right track with the code below?
The task server will create three tasks.
The task server will allow multiple queries to run, but what stops them from all running simultaneously and blowing out the expanded tree cache?
i.e. What, if anything, forces one thread to wait for another, or one task to wait for another, so they don't all blow out the expanded tree cache together?
xquery version "1.0-ml";
let $messages :=
  (:each wildcard values will return 100 documents max:)
  for $message in ("WILDCARDVAL1", "WILDCARDVAL2", "WILDCARDVAL3")
  let $_ := xdmp:log("Starting")
  return
    xdmp:spawn-function(
      function() {
        let $_ := xdmp:sleep(5000)
        let $_ := xdmp:log(concat("Searching on wildcard val=", $message))
        return concat("100 pcdata items from the matched documents for ", $message)
      },
      <options xmlns="xdmp:eval">
        <result>true</result>
        <transaction-mode>update-auto-commit</transaction-mode>
      </options>)
return $messages
The Task Server configuration listed in the Admin UI defines the maximum number of simultaneous threads. If more tasks are spawned than there are threads, they are queued (FIFO I think, although ML9 has task priority options that modify that behavior), and the first queued task takes the next available thread.
The <result>true</result> option will force the spawning query to block until the tasks return. The tasks themselves are run independently and in parallel, and they don't wait on each other to finish. You may still run into problems with the expanded tree cache, but by splitting up the query into smaller ones, it could be less likely.
For a better understanding of why you are blowing out the cache, take a look at the functions xdmp:query-trace() and xdmp:query-meters(). Using the Task Server is more of a brute force solution, and you will probably get better results by optimizing your queries using information from those functions.
If you can't make your query more selective than 2000 documents, but you only need a few string values, consider creating range indexes on those values and using cts:values to select only those values directly from the index, filtered by the query. That method would avoid forcing the database to load documents into the cache.
It might be more efficient to use MarkLogic's capability to return co-occurrences, or even 3+ way tuples of value combinations from within documents, using functions like cts:value-tuples. You can blend in a cts:uri-reference (http://docs.marklogic.com/cts:uri-reference) to get the document URI returned as part of the tuples.
It requires having range indexes on all those values, though.
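A rough sketch of that lexicon-based approach, assuming element range indexes exist on the key and pcdata elements (the element names and the wildcard are illustrative, and cts:uri-reference requires the URI lexicon to be enabled):
xquery version "1.0-ml";
(: values come straight from the range indexes; no documents are loaded into the expanded tree cache :)
cts:value-tuples(
  (cts:uri-reference(),
   cts:element-reference(xs:QName("pcdataItem"))),
  (),
  cts:element-value-query(xs:QName("uniqueKey"), "WILDCARDVAL1*", "wildcarded")
)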
HTH!

How to trigger the pre-load of Hazelcast NearCache?

I understand that the NearCache gets loaded only after the first get operation is performed on that key on the IMap. But I am interested in knowing whether there is any way to trigger a pre-load of the NearCache with all the entries from its cluster.
Use Case:
The key is a simple bean object and the value is a DAO object of type TIntHashMap containing a lot of entries.
Size:
The size of the value objects ranges from 0.1 MB to 24 MB (>90% of the entries are less than 5 MB). The number of entries in the IMap ranges from 150 to 250.
Benchmarks:
The first call to the get operation takes 2-3 seconds; later calls take <10 ms.
Right now I have created the routine below, which reads each entry of the IMap to refresh the NearCache.
long startTime = System.currentTimeMillis();
IMap<Object, Object> map = client.getMap("utility-cache");
log.info("Connected to the Cache cluster. Starting the NearCache refresh.");
int i = 0;
for (Object key : map.keySet()) {
    Object value = map.get(key);
    if (log.isTraceEnabled()) {
        SizeOf sizeOfKey = new SizeOf(key);
        SizeOf sizeOfValue = new SizeOf(value);
        log.info(String.format("Size of %s Key(%s) Object = %s MB - Size of %s Value Object = %s MB",
                key.getClass().getSimpleName(), key.toString(), sizeOfKey.sizeInMB(),
                value.getClass().getSimpleName(), sizeOfValue.sizeInMB()));
    }
    i++;
}
log.info("Refreshed NearCache with " + i + " Entries in " + (System.currentTimeMillis() - startTime) + " ms");
As you said, the Near Cache gets populated on get() calls on IMap or JCache data structures. At the moment there is no system to automatically preload any data.
For efficiency you can use getAll(), which fetches the data in batches. This should improve the performance of your own preloading routine. You can vary the batch size until you find the optimum for your use case.
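For example, a batched variant of the routine from the question might look roughly like this (the batch size of 50 is arbitrary, and this assumes getAll() warms the Near Cache the same way individual get() calls do):
IMap<Object, Object> map = client.getMap("utility-cache");
Set<Object> batch = new HashSet<Object>();
for (Object key : map.keySet()) {
    batch.add(key);
    if (batch.size() == 50) {
        map.getAll(batch);   // one network round trip for the whole batch
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    map.getAll(batch);       // fetch whatever is left over
}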
With Hazelcast 3.8 there will be a Near Cache preloader feature, which will store the keys in the Near Cache on disk. When the Hazelcast client is restarted the previous data set will be pre-fetched to re-populate the previous hot data set in the Near Cache as fast as possible (only the keys are stored, the data is fetched again from the cluster). So this won't help for the first deployment, but for all following restarts. Maybe this is already what you are looking for?
You can test the feature in the 3.8-EA or the recent 3.8-SNAPSHOT version. The documentation for the configuration can be found here: http://docs.hazelcast.org/docs/latest-dev/manual/html-single/index.html#configuring-near-cache
Please be aware that we changed the configuration parameter from file-name to filename between the EA and the current SNAPSHOT. I recommend the SNAPSHOT version, since we also made some other improvements to the preloader code.

arangodb truncate fails on a large collection

I get a timeout in arangosh and the arangodb service gets unresponsive if I try to truncate a large collection of ~40 million docs. Message:
arangosh [database_xxx]> db.[collection_yyy].truncate() ; JavaScript exception in file '/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js' at 104,13: [ArangoError 2001: Error reading from: 'tcp://127.0.0.1:8529' 'timeout during read'] !
throw new ArangoError(requestResult); ! ^ stacktrace: Error
at Object.exports.checkRequestResult (/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js:104:13)
at ArangoCollection.truncate (/usr/share/arangodb/js/client/modules/org/arangodb/arango-collection.js:468:12)
at <shell command>:1:11
ArangoDB 2.6.9 on Debian Jessie, AWS ec2 m4.xlarge, 16G RAM, SSD.
The service becomes unresponsive. I suspect it got stuck (not just busy), because it doesn't work again until I stop the service, delete the database in /var/lib/arangodb/databases/, and start it again.
I know I may be pushing the limits of performance due to the size, but I would guess the intention is that it should not fail regardless of size.
However, on a non-cloud Windows 10 machine (16 GB RAM, SSD) the same action succeeded, after a while.
Is it a bug? I have some Python code that loads dummy data into a collection, if that helps. Please let me know if I should provide more info.
Would it help to fiddle with --server.request-timeout?
Increasing --server.request-timeout for the ArangoShell will only increase the timeout that the shell will use before it closes an idle connection.
The arangod server will also shut down lingering keep-alive connections, and that may happen earlier. This is controlled via the server's --server.keep-alive-timeout setting.
However, increasing both won't help much. The actual problem seems to be the truncate() operation itself. And yes, it may be very expensive. truncate() is a transactional operation, so it will write a deletion marker for each document it removes into the server's write-ahead log. It will also buffer each deletion in memory so the operation can be rolled back if it fails.
A much less intrusive operation than truncate() is to instead drop the collection and re-create it. This should be very fast.
However, indexes and special settings of the collection need to be recreated / restored manually if they existed before dropping it.
It can be achieved with a helper like this (the function below handles both document and edge collections):
function dropAndRecreateCollection (collectionName) {
  // save state
  var c = db._collection(collectionName);
  var properties = c.properties();
  var type = c.type();
  var indexes = c.getIndexes();

  // drop existing collection
  db._drop(collectionName);

  // restore collection
  var i;
  if (type == 2) {
    // document collection
    c = db._create(collectionName, properties);
    i = 1;
  }
  else {
    // edge collection
    c = db._createEdgeCollection(collectionName, properties);
    i = 2;
  }

  // restore indexes
  for (; i < indexes.length; ++i) {
    c.ensureIndex(indexes[i]);
  }
}
