Scheduled incremental crawl behavior on a paused crawl - SharePoint

Quick question: I started an incremental crawl on one of my content sources by mistake, then paused it so it wouldn't impact users.
Will it resume on the next scheduled incremental crawl, or do I need to resume it manually?

Yes, it should resume on the next scheduled run. Note that even though the incremental crawl is scheduled, you can also run it manually at any time in between, and there is no harm in that; in fact, that is part of the purpose of an incremental crawl. If business needs require bringing some items into the search results on demand between scheduled crawls, you can start an incremental crawl manually at any time :-)
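For reference, a paused crawl can also be resumed by hand from the SharePoint Management Shell rather than waiting for the schedule. A minimal sketch, assuming the default service application name and a content source called "Local SharePoint sites" (both placeholders for your own):

    $ssa = Get-SPEnterpriseSearchServiceApplication "Search Service Application"
    $cs  = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Local SharePoint sites"
    # ResumeCrawl() continues the paused crawl; StartIncrementalCrawl() would begin a fresh one
    if ($cs.CrawlStatus -eq "Paused") { $cs.ResumeCrawl() }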

Related

How to schedule millions of jobs in Node.js properly?

I am using Node.js, MongoDB, and the node-cron npm module to schedule jobs. For 10K jobs it takes little time and memory, but when I schedule 100K jobs it takes more than 10 minutes, uses nearly 1.5 GB of RAM, and sometimes runs out of memory. Is there a better way to achieve this, for example with ActiveMQ or RabbitMQ?
One strategy is to schedule only the next job to run. When it runs, you query the database, find the following job, and schedule that.
If you add a new job, you check whether it should run sooner than the currently scheduled next job; if so, you schedule the new job and deschedule the previous one (it will be rescheduled later, after the new job runs).
If you remove a job, you check whether it is the current next job. If it is, you deschedule it, find the next job in the database, and schedule that instead.
If your database is indexed for efficient querying by job run time, this is very fast, uses hardly any memory, and scales to an essentially unlimited number of jobs.
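A minimal sketch of this pattern in Node.js with the official mongodb driver. The jobs collection, its name/runAt/done fields, and the runJob handler are illustrative assumptions, not part of the question:

    // Sketch: keep only the next due job armed on a single timer.
    // Assumes a "jobs" collection of { name, runAt, done } documents with an
    // index on runAt, so finding the next job is one cheap query.
    const { MongoClient } = require('mongodb');

    let timer = null;

    async function scheduleNext(jobs) {
      if (timer) clearTimeout(timer); // deschedule the previous "next job"
      const next = await jobs.find({ done: false }).sort({ runAt: 1 }).limit(1).next();
      if (!next) return; // nothing pending
      timer = setTimeout(async () => {
        await runJob(next); // your job handler goes here
        await jobs.updateOne({ _id: next._id }, { $set: { done: true } });
        await scheduleNext(jobs); // chain to the following job
      }, Math.max(0, next.runAt - Date.now()));
    }

    async function addJob(jobs, name, runAt) {
      await jobs.insertOne({ name, runAt, done: false });
      await scheduleNext(jobs); // the new job may be due sooner than the armed one
    }

    async function runJob(job) {
      console.log('running', job.name);
    }

    async function main() {
      const client = await MongoClient.connect('mongodb://localhost:27017');
      const jobs = client.db('scheduler').collection('jobs');
      await jobs.createIndex({ runAt: 1 });
      await scheduleNext(jobs);
      await addJob(jobs, 'example', Date.now() + 5000);
    }

    main().catch(console.error);

Removing a job works the same way: delete its document, and if it was the armed job, call scheduleNext() again.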

Cron job sometimes "Missed | Too late for the schedule", sometimes running

Sometimes the crons run and sometimes they are missed. I have attached all the settings and results. Can anyone take a look and advise?
This is completely normal behaviour. Some jobs are skipped because they fall outside the scheduled time frame for the given cron job. In your case the reindex process is scheduled every minute; if there is a lot to index (many changes to products, categories, etc.), one minute is not enough to complete it. Also, there is only one process per cron group, in your case index. "Use Separate Process" in the cron configuration means that the index group will run as a separate process from the other cron groups.
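For reference, in Magento 2 these knobs live in a module's etc/cron_groups.xml. A sketch for the index group; the values shown are illustrative, not recommendations:

    <?xml version="1.0"?>
    <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:noNamespaceSchemaLocation="urn:magento:module:Magento_Cron:etc/cron_groups.xsd">
        <group id="index">
            <!-- minutes a queued job may wait before being marked "Missed | Too late for the schedule" -->
            <schedule_lifetime>240</schedule_lifetime>
            <!-- 1 = run this group in a separate process from other cron groups -->
            <use_separate_process>1</use_separate_process>
        </group>
    </config>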

NetSuite Map/Reduce yielding

I read in the documentation that soft limits on governance cause Map/Reduce scripts to yield and reschedule. My problem is that I cannot see where the docs explain what happens during the yield. Is getInputData called again to regather the same data set to be mapped, or is the initial data set persisted somewhere, with already-mapped and already-reduced records excluded from processing?
With yielding, the getInputData stage is not called again. From the docs:

    If a job monopolizes a processor for too long, the system can naturally finish the job after the current map or reduce function has completed. In this case, the system creates a new job to continue executing remaining key/value pairs. Based on its priority and submission timestamp, the new job either starts right after the original job has finished, or it starts later, to allow higher-priority jobs processing other scripts to execute. For more details, see SuiteScript 2.0 Map/Reduce Yielding.
This is different from server restarts or interruptions, however.
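For context, here is a skeletal SuiteScript 2.0 Map/Reduce script, as a sketch of where those semantics land; the invoice search is a placeholder, not a recommendation:

    /**
     * @NApiVersion 2.x
     * @NScriptType MapReduceScript
     */
    define(['N/search'], function (search) {
        function getInputData() {
            // Runs once per deployment; NOT re-run after a yield.
            return search.create({ type: 'invoice', columns: ['entity'] });
        }
        function map(context) {
            // After a yield, a new job resumes with the remaining key/value
            // pairs; pairs already processed here are not reprocessed.
            context.write({ key: context.key, value: context.value });
        }
        function reduce(context) {
            // The same resume semantics apply to the reduce stage.
        }
        function summarize(summary) {
            log.audit('yields', summary.yields); // how many times the script yielded
        }
        return { getInputData: getInputData, map: map, reduce: reduce, summarize: summarize };
    });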

How to get the job status of crawl tasks in Nutch

In a crawl cycle, there are many tasks/phases: inject, generate, fetch, parse, updatedb, invertlinks, dedup, and an index job.
Now I would like to know: are there any means of getting the status of a crawl task (whether it is running or failed) other than referring to the hadoop.log file?
To be more precise, I would like to know whether I can track the status of the generate/fetch/parse phases. Any help would be appreciated.
You should always run Nutch with Hadoop in pseudo- or fully-distributed mode. That way you can use the Hadoop UI to track the progress of your crawls, see the logs for each step, and access the counters (extremely useful!).
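If you would rather poll from the command line than watch the UI, the standard Hadoop clients expose the same information (the job ID below is a placeholder):

    # list running MapReduce applications; each Nutch phase surfaces as its own job
    yarn application -list -appStates RUNNING

    # completion state and counters for a single job
    mapred job -status job_1700000000000_0001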

How to process scheduled, recurring jobs with Kue?

In my webapp, users can create recurring invoices that need to be generated and sent out at certain dates each month. For example, an invoice might need to be sent out on the 5th of every month.
I am using Kue to process all my background jobs so I want to do it in this case as well.
My current solution is to use setInterval() to create a processRecurringInvoices job every hour. This job then finds all recurring invoices in the database and creates a separate generateInvoice job for each one.
The generateInvoice job then actually generates the invoice and, if needed, in turn creates a sendInvoiceToEmail job that emails the invoice.
At the moment this solution looks good to me because it has a nice separation of concerns. However, I have the following questions:
Should I wait for all the "child" jobs to complete before I call done() on the main processRecurringInvoices job?
Where should I handle errors? Should I pass them back to the processRecurringInvoices job, or should I handle them separately for each job?
How can I make sure that if processing takes a long time (more than an hour) and either processRecurringInvoices or any of the child jobs is still running, the processRecurringInvoices job is not created again? Something like a unique job, or mutual exclusion?
Instead of "processRecurringInvoices", it might be easier to think of it as a job that initiates other, separate invoice-processing jobs. Thinking of it this way, once the invoice-processing jobs have been enqueued, you can safely call done() on the job that kicks them all off (see the sketch below).
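A hedged sketch of that shape, where findRecurringInvoices is a hypothetical database lookup standing in for your own:

    // The parent job only enqueues child jobs, then completes immediately.
    const kue = require('kue');
    const queue = kue.createQueue();

    queue.process('processRecurringInvoices', function (job, done) {
      findRecurringInvoices(function (err, invoices) { // hypothetical DB lookup
        if (err) return done(err);                     // surface lookup errors here
        invoices.forEach(function (invoice) {
          queue.create('generateInvoice', { invoiceId: invoice.id }).save();
        });
        done(); // safe: the children are queued, not awaited
      });
    });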
Thinking of the problem in the way described in question 1, errors should be handled within each of the individual invoice-processing jobs. If an error occurs while finding the potential invoice jobs, that would probably be handled in the processRecurringInvoices job.
You can use kue.Job.rangeByType() to search for currently active jobs. If a job is active, you can skip kicking it off again, as in the sketch below.
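A sketch of that mutual-exclusion check from question 3; kue.Job.rangeByType() lists jobs of a given type in a given state:

    // Enqueue processRecurringInvoices only if none is currently active.
    const kue = require('kue');
    const queue = kue.createQueue();

    function enqueueIfIdle() {
      kue.Job.rangeByType('processRecurringInvoices', 'active', 0, 1, 'asc',
        function (err, jobs) {
          if (err) return console.error(err);
          if (jobs.length > 0) return; // previous run still active: skip this tick
          queue.create('processRecurringInvoices', {}).save();
        });
    }

    setInterval(enqueueIfIdle, 60 * 60 * 1000); // hourly, as in the question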
