Configuring RAM in Nutch

I am using Nutch 1.10 to crawl websites for my organization. I use a system with 16 GB of RAM to do this crawl. As of now, my Nutch process uses only 3-4 GB of RAM while crawling the data, and it takes almost 10 hours to finish. Is there some way I can configure Nutch to use more than 12 GB of RAM to finish the same task? All suggestions are most welcome!

Under the assumption that the script bin/nutch or bin/crawl is used for crawling in local mode (no Hadoop cluster): the environment variable NUTCH_HEAPSIZE defines the heap size in MB.
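For example, a minimal sketch (the 12288 MB value and the bin/crawl arguments below are placeholders, not values from the question):

    # Give the local Nutch JVM a larger heap; the value is in megabytes.
    export NUTCH_HEAPSIZE=12288   # ~12 GB

    # Then run the crawl exactly as before. The arguments below are placeholders
    # (seed dir, crawl dir, number of rounds); use whatever you normally pass.
    bin/crawl urls/ crawl/ 2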

Related

Ambari dashboard memory usage explanation for spark cluster

I am using Ambari to monitor my Spark cluster, and I'm a little confused by all the memory categories. Can somebody with expertise explain what these terms mean? Thanks in advance!
Here is a screenshot of the Ambari Memory Usage graph, zoomed out:
Basically, what do the Swap, Share, Cache, and Buffer memory usage figures stand for? (I think I understand Total well.)
There is nothing specific to Spark or Ambari here. These are basic Linux / Unix memory management terms:
In short (a quick way to inspect the underlying counters is sketched after this list):
Swap is a portion of memory that has been written out to disk. See Wikipedia and "What is swap memory?".
Buffer and cache are used for caching filesystem metadata and file data. See "What is the difference between buffer vs cache memory in Linux?" and "Overview of memory management".
Shared memory is a part of virtual memory used for shared libraries.
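If it helps to tie these terms to actual numbers, the same counters can be read from the kernel on any node. A sketch (the mapping to the Ambari labels is my assumption, and the values are made up):

    # Show the kernel counters behind the dashboard categories.
    grep -E '^(MemTotal|Buffers|Cached|SwapTotal|SwapFree|Shmem)' /proc/meminfo
    # MemTotal:       16336868 kB    <- Total
    # Buffers:          184524 kB    <- Buffer
    # Cached:          5321112 kB    <- Cache
    # SwapTotal:       8191996 kB    <- Swap (total)
    # SwapFree:        8191996 kB
    # Shmem:            102400 kB    <- Share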

Buffer/cache exhaustion Spark standalone inside a Docker container

I have a very weird memory issue (which is what a lot of people will most likely say ;-)) with Spark running in standalone mode inside a Docker container. Our setup is as follows: we have a Docker container in which a Spring Boot application runs Spark in standalone mode. This Spring Boot app also contains a few scheduled tasks (managed by Spring). These tasks trigger Spark jobs. The Spark jobs scrape a SQL database, shuffle the data a bit, and then write the results to a different SQL table (writing the results doesn't go through Spark). Our current data set is very small (the table contains a few million rows).
The problem is that the Docker host (a CentOS VM) that runs the Docker container crashes after a while because the memory gets exhausted. I have currently limited Spark's memory usage to 512 MB (I have set both executor and driver memory), and in the Spark UI I can see that the largest job only takes about 10 MB of memory. I know that Spark runs best if it has 8 GB of memory or more available. I have tried that as well, but the results are the same.
After digging a bit further, I noticed that Spark eats up all the buffer/cache memory on the machine. After clearing this manually by forcing Linux to drop caches (echo 2 > /proc/sys/vm/drop_caches, which clears the dentries and inodes), the cache usage drops considerably. But if I don't keep doing this regularly, the cache usage slowly keeps going up until all memory is used by buffer/cache.
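For reference, the workaround described above amounts to something like this (run as root; the sync call is my addition, and the note on the other levels reflects standard kernel behaviour):

    # Flush dirty pages first so dropping caches doesn't throw away unwritten data.
    sync
    # 2 = free dentries and inodes (as used above); 1 = page cache only; 3 = both.
    echo 2 > /proc/sys/vm/drop_caches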
Does anyone have an idea what I might be doing wrong / what is going on here?
Big thanks in advance for any help!

How to run Nutch in a production environment

I was experimenting with some crawl cycles in Nutch and would like to set up a distributed crawl environment. But I wonder how I can trigger Nutch for incoming crawl requests in a production system. I read about the Nutch REST API. Is that the only real option that I have? Or can I run Nutch as a continuously running distributed server by some other means?
My preferred Nutch version is 1.12.
As sujen stated, there are two options for this (a rough sketch of both follows below):
Use the REST API if you want to submit crawl requests to Nutch remotely.
Steps to get this running are described here:
How to run nutch server on distributed environment
Otherwise, you can run the bin/crawl script from runtime/deploy to launch the crawl on a distributed Hadoop cluster.
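A rough sketch of both options (the port, paths, and crawl arguments are placeholders, and the exact bin/crawl usage may differ slightly between Nutch versions):

    # Option 1: start the Nutch REST server and submit crawl requests over HTTP.
    bin/nutch startserver -port 8081

    # Option 2: launch the crawl on the Hadoop cluster from runtime/deploy
    # (seed dir, crawl dir and number of rounds are placeholder values).
    cd runtime/deploy
    bin/crawl urls/ crawl/ 2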

Buffer/Cache use 100% Memory

I have a Linux box running CentOS 6.6 with 7 GB of RAM and Apache on top of it. Every night, buffer and cache consume 6 GB of memory out of 7 GB, but when I check through the top command no process uses that much RAM; only buffer/cache does. Please help.
Linux tries to make good use of all the free memory, so it uses it to cache system I/O (data transferred to/from disk) in order to reduce further disk access (in your case, serving the static content faster).
It dynamically reduces the buffer/cache when processes require more space, for example if you change the Apache configuration to use more modules or spawn more workers.
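You can see this with free: the memory reported under buffers/cache is effectively reclaimable. A sketch of what this might look like on a 7 GB CentOS 6 box (the numbers are made up):

    free -m
    #              total       used       free     shared    buffers     cached
    # Mem:          7168       6900        268         10        300       5800
    # -/+ buffers/cache:        800       6368    <- what processes actually use / could still get
    # Swap:         2047          0       2047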

node.js out of memory when run as a cron job

I've written some JavaScript code which runs under node.js to back up a large (~20 MB) file to an Azure blob. This works when run from a bash shell, but fails with the following error when run as a cron job (run as root in both contexts):
FATAL ERROR: v8::Context::New() V8 is no longer usable
Presumably this means that it runs out of heap space, but where is the limit set for cron jobs?
(This is on a 64-bit RHEL 6.2 server, with 8GB RAM and 426 GB free disk space. Node.js is version 0.8.1 and Azure is from the file azure-2012-06.tar.gz.)
Thanks
Keith
When you say large, do you really mean ~20 MB counts as a large file? How long does your cron job run before hitting OOM? Also, when you run the cron job, have you checked the memory usage to verify that this is actually the cause?
About your memory/heap limit in a cron job: cron just acts as a job scheduling engine, so node.js still uses its own memory/heap settings; this should not be specific to the cron scheduler. If you want to change the V8 memory limit, you can use the --max-old-space-size option to boost it to a higher value.
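For example, a sketch of what the crontab entry could look like with a larger old-space limit (the schedule, paths, and 1024 MB value are placeholders, not taken from the question):

    # m h dom mon dow  command
    # Run the backup nightly with V8's old space raised to ~1 GB; cron has a minimal
    # PATH, so use the full path to node, and redirect output to a log for debugging.
    0 2 * * * /usr/local/bin/node --max-old-space-size=1024 /opt/backup/backup.js >> /var/log/backup.log 2>&1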
I am interested to see how you run your node.js code as a cron job. There is some possibility that, in the way you have written the cron job, you are consuming lots of memory while making calls to Azure Storage.
