How does ManifoldCF job scheduling behave?

I am integrating ManifoldCF (MCF) with the Alfresco CMS as a repository connector using CMIS queries, with Solr as the output connection where the index is stored. This works fine, and I can search the documents in the Solr index.
As the next part of the implementation, I am planning to introduce more repositories, such as SharePoint and file systems, so I will have three document repositories: Alfresco, SharePoint and the file system. I plan to have scheduled jobs that crawl each repository at particular intervals, but I have the following concerns.
Although the jobs are scheduled at frequent intervals, I want to make sure that MCF jobs pick up only content that is new or updated. Say I have 100 docs during the current job run and 110 at the next run; I want the job to process only the 10 new docs, not all 110.
As there are relatively few MCF tutorials available, I have no way to verify that MCF jobs behave this way. I assume MCF is intelligent enough to do so, but I have no proof to substantiate it.
I want to know more about the MCF job schedule types: scan every document once / rescan documents dynamically. Similarly, I want to know more about job invocation: complete / minimal. Apologies for being a newbie.
I am also considering some custom coding to ensure that only new or updated docs are eligible for processing, but again I would be going through the code alone, since little documentation is available.
Is it wise to do custom coding in this case, or does MCF provide all of these features OOTB?
Many thanks in advance.

ManifoldCF schedules each job according to what you have configured for that job.
Whether only new or changed documents are indexed depends on how your repository connector is written. Usually, when a job runs, it calls the repository connector's getDocumentVersions(). If the version string of a document is different from the earlier version, ManifoldCF indexes that document; otherwise it does not. Usually the document version string is the last-modified date of the document.
Unfortunately, ManifoldCF does not have much documentation from the developer's perspective; your best bet is to go through the code. It is quite self-explanatory.
This is how "minimal" is described in the MCF documentation on jobs:
"Using the "minimal" variant of the listed actions will perform the minimum possible amount of work, given the model that the connection type for the job uses. In some cases, this will mean that additions and modifications are indexed, but deletions are not detected."
If you do write custom code, you should implement your logic in public String[] getDocumentVersions(..).
The OOTB features are quite enough. One additional thing to consider is the permissions of the documents: if the permissions of a document change, you can choose to change the version of the document.
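To make the version-string idea concrete, here is a minimal sketch in Python. It is not the ManifoldCF API; the function names and the choice to fold a hash of the permissions into the version string are assumptions for illustration only. It shows how comparing version strings between two crawls selects only new or changed documents.

```python
import hashlib


def version_string(last_modified, permissions):
    """Build a version string from the last-modified stamp and the permissions.

    Hypothetical helper: folding the ACL into the string makes a permission
    change look like a new version, as suggested above.
    """
    acl = "|".join(sorted(permissions)).encode("utf-8")
    return last_modified + "+" + hashlib.sha1(acl).hexdigest()


def documents_to_index(previous, current):
    """Return the ids whose version string is new or differs from the last crawl.

    previous / current: dicts mapping document id -> version string.
    """
    return sorted(
        doc_id
        for doc_id, version in current.items()
        if previous.get(doc_id) != version
    )
```

With 100 unchanged documents and 10 new ones, documents_to_index returns only the 10 new ids; a document whose permissions changed is returned as well, because its version string differs.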

Related

Elasticsearch indexing speed with Nodejs

I have an Elasticsearch index with a large number of documents. I've been using JavaScript with Node.js up until this point. I made a cron job that runs every 24 hours and updates the documents individually based on any metadata changes. As some of you may know, this is probably the slowest possible way to do it: single-threaded Node.js with individual indexing requests to Elasticsearch. When I run the cron job, it runs at a snail's pace. I can update a document every 1.5-2 seconds or so. This means it would take around 27 hours to update all the documents. I'm using a free-tier AWS ES instance, so I don't have access to certain features that would help me speed up the process.
Does anyone know of a way to speed up the indexing? If I were to call for a bulk update, how would that manifest in javascript? If I were to use another language to multi-thread it, what would be the fastest option?
I did not understand your question "If I were to call for a bulk update, how would that manifest in javascript?".
A bulk update should be the best solution irrespective of language or framework. Of course, you can explore other languages such as Ruby to leverage threads and send bulk updates in parallel.
From experience, a bulk update with a batch size between 4k and 7k works just fine. You may want to fine-tune the size within this range.
Ensure that refresh_interval is set to a large value. This ensures that the index is not refreshed very frequently while you are updating. IMO the default value will also do. Read more here.
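As a rough illustration of what a bulk update looks like on the wire, the sketch below builds the newline-delimited JSON bodies that the _bulk endpoint accepts, one body per batch. It is written in Python only to keep it short; the same structure applies from Node.js. The index name and field names in the usage are made up, and depending on the Elasticsearch version the action line may also need a _type.

```python
import json


def bulk_update_bodies(index, docs, batch_size=5000):
    """Yield NDJSON request bodies for the _bulk endpoint, one per batch.

    docs: iterable of (doc_id, partial_doc) pairs; each pair becomes an
    "update" action line followed by a {"doc": ...} line.
    """
    lines = []
    count = 0
    for doc_id, partial in docs:
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": partial}))
        count += 1
        if count == batch_size:
            # A bulk body must end with a newline.
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"
```

Each yielded body would be sent as one POST to the _bulk endpoint, so a few thousand updates cost one round trip instead of a few thousand.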

Azure Search | Total ordering of index updates

I've read through this excellent feedback on Azure Search. However, I have to be a bit more explicit in questioning one of the answers to question #1 from that list...
...When you index data, it is not available for querying immediately.
...Currently there is no mechanism to control concurrent updates to the same document in an index.
Eventual consistency is fine - I perform a few updates and eventually I will see my updates on read/query.
However, having no guarantee on the ordering of updates is really problematic. Perhaps I'm misunderstanding. Let's assume this basic scenario:
1) update index entry E.fieldX w/ foo at time 12:00:01
2) update index entry E.fieldX w/ bar at time 12:00:02
From what I gather, it's entirely possible that E.fieldX will contain "foo" after all updates have been processed?
If that is true, it seems to severely limit the applicability of this product.
Currently, Azure Search does not provide document-level optimistic concurrency, primarily because the overwhelming majority of scenarios don't require it. Please vote for the External Version UserVoice suggestion to help us prioritize this ask.
One way to manage data ingress concurrency today is to use Azure Search indexers. Indexers guarantee that they will process only the current version of a source document at each point of time, removing potential for races.
Ordering is unknown if you issue multiple concurrent requests, since you cannot predict in which order they'll reach the server.
If you issue indexing batches in sequence (that is, start the second batch only after you saw an ACK from the service from the first batch) you shouldn't see reordering.
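The "one batch at a time" approach can be sketched as follows. This is a generic Python sketch, not Azure Search client code: send is a hypothetical callable that submits one batch and returns True only once the service has acknowledged it.

```python
def send_in_order(batches, send):
    """Send batches strictly one after another.

    The next batch is sent only after the previous one was acknowledged.
    Stops at the first batch that is not acknowledged and returns how many
    batches were acknowledged, so the caller knows where to resume.
    """
    acknowledged = 0
    for batch in batches:
        if not send(batch):
            break
        acknowledged += 1
    return acknowledged
```

In the foo/bar scenario from the question, the "foo" update would go in the first batch and "bar" in the second, so "bar" is only issued after "foo" has been acknowledged.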

Updating lucene index frequently causing performance degrade

I am trying to add Lucene.NET to my project, where searching is getting more complicated. However, the tables are transactional and are modified frequently (new data is inserted, or fields that are used in the Lucene index are modified).
Is it good to use Lucene.NET searching here?
How can I find the modified fields and update the specific Lucene index entries that have already been created? The Lucene index also contains documents that have been deleted from the table; how can I remove them from the index?
Right now, while loading the page, I:
remove index entries that are no longer available in the table, based on a unique field
insert an entry if it does not exist in the index, otherwise update all entries that match the table's unique field
The page takes more time than normal to load because of these remove/insert/update calls.
How should I proceed?
Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely threadsafe. NEVER open multiple IWs on the same folder.
Perform all updates/deletes via this singleton. (I had one project doing 100's of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are hard to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near real time) pattern. NEVER create a separate IndexReader on an index folder that is also open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also threadsafe. So you can also keep an instance of an IS through which all your searches go. Then recreate it periodically (eg with a timer) then swap it using Interlocked.Exchange.
I previously created a small framework to isolate this from the app and make it reusable.
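The "keep one searcher, recreate it periodically, then swap it" idea can be sketched in a language-neutral way. The Python below is only an illustration of the pattern, not Lucene.NET code: open_searcher is a hypothetical factory standing in for new IndexSearcher(writer.GetReader()), and a lock stands in for Interlocked.Exchange.

```python
import threading


class SearcherHolder:
    """Holds the single shared searcher and swaps in a fresh one on demand."""

    def __init__(self, open_searcher):
        self._open_searcher = open_searcher
        self._lock = threading.Lock()
        self._current = open_searcher()

    def get(self):
        """Return the searcher that all searches should currently use."""
        with self._lock:
            return self._current

    def refresh(self):
        """Open a fresh searcher, swap it in, and return the old one.

        The new searcher is built outside the lock so searches are not
        blocked while it opens; the caller is responsible for disposing
        of the returned old searcher once in-flight searches finish.
        """
        fresh = self._open_searcher()
        with self._lock:
            old, self._current = self._current, fresh
        return old
```

A timer would call refresh() at whatever interval matches the acceptable latency between a change and its visibility to searches.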
Caveat
Having said that... Hosting this inside IIS does raise some problems. IIS will occasionally restart your app. It will also (by default) start the new instance before stopping the existing one, then swap them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So, you are actually better off creating a separate service to host your Lucene bits. A simple self-hosted WebAPI Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes and the ability to host it on a different machine (which isolates the disk IO load). It also means that the service can be accessed from other parts of your system, tested separately, etc.
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service in-house and in C#, it stays simple, you have control over the capabilities, and the shape of the API can be tuned to your needs.
No "right" answer. You'll have to make choices based on your situation.
You should preferably index on a schedule (periodically). The easiest approach is to keep the date of the last indexing run, query for all changes since then, and index new records, update changed ones and remove deleted ones. To keep track of entries removed from the database, you will need a log of deleted records with the date each was removed. You can then query that log by date to find what needs to be removed from the Lucene index.
Now simply run that job every 2 minutes or so.
That said, Lucene.NET is not really suited for web applications; you should consider using ElasticSearch, Solr or Azure Search. Basically, a server that can handle load and multithreading better.
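A minimal sketch of that scheduled job, in Python with an in-memory dict standing in for the Lucene index. The row layout ("id", "modified") and the shape of the deletion log are assumptions made for illustration.

```python
from datetime import datetime


def sync_index(index, rows, deleted_log, last_indexed, now):
    """Apply all changes made after last_indexed and return the new checkpoint.

    index:       dict mapping row id -> indexed row (stand-in for Lucene)
    rows:        iterable of dicts with "id" and "modified" keys
    deleted_log: iterable of (row_id, deleted_at) pairs
    now:         timestamp captured before querying; stored as the next
                 last_indexed value
    """
    for row in rows:
        if row["modified"] > last_indexed:
            # Covers both inserts and updates: the entry is replaced by id.
            index[row["id"]] = row
    for row_id, deleted_at in deleted_log:
        if deleted_at > last_indexed:
            index.pop(row_id, None)
    return now
```

Each run passes in the checkpoint returned by the previous run, so only the rows touched in between are written to the index.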

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about my server setup.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write operations. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is something like all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting reads and writes to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time or, better, is triggered before the read operations start.
I do all my processing from Node.js, so one option I see is to export the data created in some period, like [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I looked at replica sets and master/slave replication as possible setups, but I couldn't work out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
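The pipeline described above is short enough to write out. The sketch below only builds the pipeline as plain Python data; the collection name logs_report and the date field createdAt are hypothetical, and "last month" is approximated as the last 30 days.

```python
from datetime import datetime, timedelta


def last_month_snapshot_pipeline(now, target="logs_report", date_field="createdAt"):
    """Build a $match + $out pipeline copying the last 30 days into target.

    $out must be the final stage; it replaces the target collection with
    the documents that passed the $match stage.
    """
    since = now - timedelta(days=30)
    return [
        {"$match": {date_field: {"$gte": since}}},
        {"$out": target},
    ]
```

With a driver such as pymongo this would be run as db.logs.aggregate(pipeline) (assuming the source collection is called logs), and the reporting aggregations would then read from the target collection instead of the live one.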
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

Running agent against a 200K+ documents view (log.nsf)

I have an agent that is run manually against the log file. The view I'll be using (Usage/By Date) contains more than 200,000 documents and is categorized twice. I heard somewhere that you cannot run an agent against a view with more than 200K docs, but I cannot confirm it in my research. Is this true? If so, is there a way I can query this particular view? Thanks a lot!
There is a list of limits of a Lotus Notes database mentioned in the documentation. There's nothing about the maximum number of documents you can operate on in an Agent nor any Agent timeout. I think it'll just take a long time but should work.
http://publib.boulder.ibm.com/infocenter/domhelp/v8r0/index.jsp?topic=%2Fcom.ibm.notes85.help.doc%2Ffram_limits_of_notes_r.html
In 20 years of working with Notes and Domino, I've never heard of a limit like that. However, there are time limits that the Agent Manager imposes on agent runtime. These are configurable, so you should check with your server admins to determine whether your agent will have sufficient time to complete the job. Since these limits might change, and the number of documents might increase, it might be prudent to write your code under the assumption that it will need to divide the workload across several runs.
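Dividing the workload across several runs usually comes down to a time budget plus a saved position. The sketch below shows that shape in Python rather than LotusScript; it does not use any Notes API, and handle, the budget and the way the position is stored between runs are all assumptions.

```python
import time


def process_in_runs(doc_ids, handle, start=0, time_budget_s=60.0, clock=time.monotonic):
    """Process documents from position start until the time budget is used up.

    Returns the position to resume from on the next run; when it equals
    len(doc_ids), the whole workload is done.
    """
    deadline = clock() + time_budget_s
    position = start
    while position < len(doc_ids) and clock() < deadline:
        handle(doc_ids[position])
        position += 1
    return position
```

Each run would load the saved position, call process_in_runs with a budget safely below the agent time limit, and store the returned position for the next run.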
