I am trying to configure Nutch for multi-threaded crawling.
However, I am facing an issue: I am not able to run a crawl with multiple threads. I have modified nutch-site.xml to use 25 threads, but I can still see only one thread running.
<property>
<name>fetcher.threads.fetch</name>
<value>25</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>25</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
I always get output like:
activeThreads=25, spinWaiting=24, fetchQueues.totalSize=some value.
What is the meaning of this? Can you please explain what the issue is and how I can solve it?
I will highly appreciate your help.
Thanks,
Sumit
spinWaiting=24 means that 24 of your 25 fetcher threads are idle, spin-waiting because no URL in the fetch queues is currently eligible for them (typically due to per-host politeness limits). I think your issue is related to a known bug with the new Nutch fetcher; see NUTCH-721.
You can try using OldFetcher (if you have Nutch 1.0) to see if that solves your problem.
-- Ken
I see that my Spark application is using the FAIR scheduler, but I can't confirm whether it is using the two pools I set up (pool1, pool2). Here is a thread function I implemented in PySpark, which is called twice: once with "pool1" and once with "pool2".
def do_job(f1, f2, id, pool_name, format="json"):
spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
...
I thought the "Stages" page was supposed to show the pool info, but I don't see it. Does that mean the pools are not set up correctly, or am I looking in the wrong place?
I am using PySpark 3.3.0 on top of EMR 6.9.0.
You can confirm it from the Spark UI, as shown in the diagrams in my article. There I created three pools (module1, module2, module3) based on certain logic, each job using a specific pool, and built the diagrams from that.
Note: please see the verification steps in the article.
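To make the pattern concrete, here is a minimal runnable sketch of the setup described in the question (the job body and config values are my assumptions; your fairscheduler.xml must define pool1 and pool2). Reading the property back with getLocalProperty is a quick sanity check, and the "Stages" page of the UI should then list each stage under its pool:

import threading
from pyspark.sql import SparkSession

# Assumes FAIR mode is enabled and a fairscheduler.xml defining pool1/pool2
# is supplied via spark.scheduler.allocation.file.
spark = (SparkSession.builder
         .appName("fair-pools-demo")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

def do_job(pool_name):
    # The pool property is thread-local, so it must be set inside the thread.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    # Sanity check: read the property back on this thread.
    print(pool_name, "->", spark.sparkContext.getLocalProperty("spark.scheduler.pool"))
    # Any job submitted from this thread now runs in that pool.
    spark.range(10_000_000).count()

threads = [threading.Thread(target=do_job, args=(p,)) for p in ("pool1", "pool2")]
for t in threads:
    t.start()
for t in threads:
    t.join()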
The WildFly 18 eviction tag is not parsing; it gives a "Failed to parse configuration" error.
This happens when I upgrade WildFly 11 to 18. In WildFly 11 (infinispan 4) it works fine:
<subsystem xmlns="urn:jboss:domain:infinispan:4.0">
<cache-container name="security" default-cache="auth-cache">
<local-cache name="auth-cache">
<locking acquire-timeout="${infinispan.cache-container.security.auth-cache.locking.acquire-timeout}"/>
<eviction strategy="LRU" max-entries="${infinispan.cache-container.security.auth-cache.eviction.max-entries}"/>
<expiration max-idle="-1"/>
</local-cache>
</cache-container>
</subsystem>
In WildFly 18 I have the section below (NOT WORKING):
<subsystem xmlns="urn:jboss:domain:infinispan:9.0">
<cache-container name="security" default-cache="auth-cache">
<local-cache name="auth-cache">
<locking acquire-timeout="${infinispan.cache-container.security.auth-cache.locking.acquire-timeout}"/>
<eviction strategy="LRU" max-entries="${infinispan.cache-container.security.auth-cache.eviction.max-entries}"/>
<expiration max-idle="-1"/>
</local-cache>
</cache-container>
</subsystem>
It gives: 'eviction' isn't an allowed element here. The infinispan 9.4 documentation says eviction is configured by adding the <memory/> element, but even that gives an "unrecognized tag: memory" error.
How do I add eviction strategy="LRU", or what is the replacement for strategy="LRU"?
According to the docs, in infinispan 9.0 eviction is configured by adding the <memory/> element to your <*-cache/> configuration sections. Eviction is handled by Caffeine utilizing the TinyLFU algorithm with an additional admission window. This was chosen as it provides a high hit rate while also requiring low memory overhead, giving a better hit ratio than LRU while requiring less memory than LIRS.
In general there are two types:
COUNT - This type of eviction removes entries based on how many there are in the cache. Once the count of entries has grown larger than the configured size, an entry is removed to make room.
MEMORY - This type of eviction estimates how much each entry takes up in memory and removes an entry when the total size of all entries is larger than the configured size. This type only works with primitive wrapper, String and byte[] types, so if custom types are desired you must enable storeAsBinary. Also, MEMORY-based eviction only works with the LRU policy.
So I would think you define it like this:
<cache-container name="security" default-cache="auth-cache">
<local-cache name="auth-cache">
<...your other options...>
<object-memory/>
</local-cache>
</cache-container>
OR
<binary-memory eviction-type="MEMORY/COUNT"/>
OR
<off-heap-memory eviction-type="MEMORY/COUNT"/>
AND you can always specify the size:
size="${infinispan.cache-container.security.auth-cache.eviction.max-entries}"
Storage types:
object-memory (stores entries as objects in the Java heap; this is the default storage type; it supports COUNT only, so you do not need to explicitly set the eviction type; policy: TinyLFU)
binary-memory (stores entries as byte[] in the Java heap; eviction type: COUNT or MEMORY; policy: TinyLFU)
off-heap-memory (stores entries as byte[] in native memory outside the Java heap; eviction type: COUNT or MEMORY; policy: LRU)
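Putting it together for the cache from the question, a minimal sketch of the migrated configuration might look like the following (it reuses the original property placeholders; using the size attribute as the stand-in for max-entries is my assumption based on the schema described above):

<subsystem xmlns="urn:jboss:domain:infinispan:9.0">
<cache-container name="security" default-cache="auth-cache">
<local-cache name="auth-cache">
<locking acquire-timeout="${infinispan.cache-container.security.auth-cache.locking.acquire-timeout}"/>
<!-- replaces the old eviction element: object-memory is COUNT-based, so size plays the role of max-entries -->
<object-memory size="${infinispan.cache-container.security.auth-cache.eviction.max-entries}"/>
<expiration max-idle="-1"/>
</local-cache>
</cache-container>
</subsystem>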
Lonzak's response is correct.
Additionally, you can just use your "urn:jboss:domain:infinispan:4.0" configuration from WildFly 11 in WildFly 19. WildFly will automatically update the configuration to its equivalent in the current schema version.
In a crawl cycle, we have many tasks/phases like inject, generate, fetch, parse, updatedb, invertlinks, dedup, and an index job.
Now I would like to know: is there any way to get the status of a crawl task (whether it is running or failed) other than referring to the hadoop.log file?
To be more precise, I would like to know whether I can track the status of the generate/fetch/parse phases. Any help would be appreciated.
You should always run Nutch with Hadoop in pseudo- or fully-distributed mode; this way you'll be able to use the Hadoop UI to track the progress of your crawls, see the logs for each step, and access the counters (extremely useful!).
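If you also want to check job status from the command line, a small sketch with the stock Hadoop/YARN CLI (the application ID below is a placeholder) would be:

# Each Nutch phase runs as its own MapReduce job, so listing jobs
# shows which phase is currently running.
mapred job -list

# On YARN you can also list applications and query one application's
# state and progress.
yarn application -list
yarn application -status application_1234567890123_0001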
I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
I am using the bin/crawl script and made the following tweaks to trigger a fetch with multiple map tasks; however, I am unable to do so.
Added maxNumSegments and numFetchers parameters to the generate phase.
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
Removed the topN parameter and removed the noParsing parameter because I want the parsing to happen at the time of fetch.
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
The generate phase is not generating more than one segment.
As a result, the fetch phase is not creating multiple map tasks; also, I believe that the way the script is written, it does not allow the fetch to fetch multiple segments even if generate were to produce them.
Can someone please let me know how they got the script to run in a distributed Hadoop cluster? Or is there a different version of the script that should be used?
Thanks.
Are you using Nutch 1.x for this? In that case, the Generator class looks for a flag called "mapred.job.tracker" and checks whether it is local. This property has been deprecated in Hadoop 2, and the default value is set to local. You will have to override the value of the property to something other than local, and the Generator will then generate multiple partitions for the segments.
I've recently faced this problem and thought it'd be a good idea to build upon Keith's answer to provide a more thorough explanation about how to solve this issue.
I've tested this with Nutch 1.10 and Hadoop 2.4.0.
As Keith said, the if block on line 542 of Generator.java reads the mapred.job.tracker property and sets the value of the variable numLists to 1 if the property is local. This variable seems to control the number of reduce tasks and influences the number of map tasks.
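For reference, the check looks roughly like this (paraphrased from Nutch 1.x's Generator.java, so treat it as a sketch rather than the exact source):

// When the job tracker is "local", the generator forces a single
// partition, which in turn yields a single fetch map task per segment.
if ("local".equals(job.get("mapred.job.tracker")) && numLists != 1) {
  LOG.info("Generator: jobtracker is 'local', generating exactly one partition.");
  numLists = 1;
}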
Overwriting the value of said property in mapred-site.xml fixes this:
<property>
<name>mapred.job.tracker</name>
<value>distributed</value>
</property>
(Or any other value you like except local).
The problem is this wasn't enough in my case to generate more than one fetch map task. I also had to update the value of the numSlaves parameter in the runtime/deploy/bin/crawl script. I didn't find any mention of this parameter in the Nutch 1.x docs, so I stumbled upon it after a bit of trial and error.
#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################
# set the number of slaves nodes
numSlaves=3
# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`
...
Hawtio has a lovely view of a running Camel application, but I can't find any information on what the numbers on the profile page refer to.
E.g. we see Mean, Min and Max.
What do these refer to? Processes per second? Throughput?
Thanks.
Yeah, these numbers are from Apache Camel, from the MBeans that Camel offers. You can see some details here:
http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/api/management/mbean/package-summary.html
And yeah, min / max / mean are what they say, e.g. the lowest / highest and average processing time, in milliseconds.
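If you want to read the same values directly over JMX (which is essentially what Hawtio is doing), a minimal sketch might look like the following; the Camel context name "MyCamel" and route id "route1" are placeholders for your own:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class CamelStats {
    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // Camel registers routes under org.apache.camel:context=...,type=routes,name="..."
        ObjectName route = new ObjectName(
                "org.apache.camel:context=MyCamel,type=routes,name=\"route1\"");
        // These attributes back the Min/Max/Mean columns (processing time in ms).
        System.out.println("min  = " + mbs.getAttribute(route, "MinProcessingTime"));
        System.out.println("max  = " + mbs.getAttribute(route, "MaxProcessingTime"));
        System.out.println("mean = " + mbs.getAttribute(route, "MeanProcessingTime"));
    }
}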