How to speed up crawling in Nutch

I am trying to develop an application in which I'll give a constrained set of URLs to the urls file in Nutch. I am able to crawl these URLs and get their contents by reading the data from the segments.
I crawl with a depth of 1, since I am not at all concerned about the outlinks or inlinks of the pages. I only need the contents of the pages listed in the urls file.
But performing this crawl takes time, so please suggest a way to decrease the crawl time and increase the crawl speed. I also don't need indexing, because I am not concerned with the search part.
Does anyone have suggestions on how to speed up the crawl?

The main thing for getting more speed is configuring nutch-site.xml:
<property>
<name>fetcher.threads.per.queue</name>
<value>50</value>
<description></description>
</property>

You can scale up the threads in nutch-site.xml. Increasing fetcher.threads.per.host and fetcher.threads.fetch will both increase the speed at which you crawl. I have noticed drastic improvements. Use caution when increasing these, though: if you do not have the hardware or connection to support the increased traffic, the number of errors during crawling can significantly increase.

For me, this property helped a lot, because a single slow domain can slow down the whole fetch phase:
<property>
<name>generate.max.count</name>
<value>50</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generate.count.mode.
</description>
</property>
For example, if you respect robots.txt (the default behaviour) and a domain is very slow to crawl, the delay between requests will be fetcher.max.crawl.delay. Having many URLs from such a domain in a queue will slow down the whole fetch phase, so it's better to limit them with generate.max.count.
You can add this property to limit the duration of the fetch phase in the same way:
<property>
<name>fetcher.throughput.threshold.pages</name>
<value>1</value>
<description>The threshold of minimum pages per second. If the fetcher downloads less
pages per second than the configured threshold, the fetcher stops, preventing slow queues
from stalling the throughput. This threshold must be an integer. This can be useful when
fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
</description>
</property>
But please don't touch the fetcher.threads.per.queue property, or you will end up on a blacklist... It's not a good way to improve crawl speed.

Hello, I am also new to crawling, but I have used some methods and got some good results; maybe they will help you.
I changed my nutch-site.xml with these properties:
<property>
<name>fetcher.server.delay</name>
<value>0.5</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overridden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>400</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>25</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
Kindly suggest some more options.
Thanks

I had similar issues and was able to improve the speed with the help of
https://wiki.apache.org/nutch/OptimizingCrawls
It has useful information about what can be slowing down your crawl and what you can do to address each of those issues.
Unfortunately, in my case the queues are quite unbalanced, and I can't make requests too fast to the bigger ones, otherwise I get blocked, so I probably need to move to a cluster solution or Tor before I speed up the threads further.

If you don't need to follow links, I see no reason to use Nutch. You can simply take your list of URLs and fetch them with an HTTP client library or a simple script using curl.
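For example, here is a minimal sketch in Java using the standard java.net.http client (the urls.txt seed file name and the output handling are just placeholders for your own setup):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class FetchUrls {
    public static void main(String[] args) throws Exception {
        // One URL per line, just like a Nutch seed list
        List<String> urls = Files.readAllLines(Path.of("urls.txt"));
        HttpClient client = HttpClient.newHttpClient();
        for (String url : urls) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Store response.body() wherever you need the page content;
            // here we just report the status and size
            System.out.println(url + " -> " + response.statusCode()
                    + " (" + response.body().length() + " chars)");
        }
    }
}

If you still want politeness (a delay between successive requests to the same host), you would have to add that yourself; that is the main thing Nutch gives you for free.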

Related

ArangoDB Key/Value Model: value maximum size

With regard to the Key/Value model of ArangoDB, does anyone know the maximum size per Value? I have spent hours searching the Internet for this information, but to no avail; you would think this is classified information. Thanks in advance.
The answer depends on different things, like the storage engine and whether you mean theoretical or practical limit.
In case of MMFiles, the maximum document size is determined by the startup option wal.logfile-size if wal.allow-oversize-entries is turned off. If it's on, then there's no immediate limit.
In case of RocksDB, it might be limited by some of the server startup options such as rocksdb.intermediate-commit-size, rocksdb.write-buffer-size, rocksdb.total-write-buffer-size or rocksdb.max-transaction-size.
When using arangoimport to import a 1GB JSON document, you will run into the default batch-size limit. You can increase it, but it appears to max out at 805306368 bytes (0.75GB). The HTTP API seems to have the same limitation (/_api/cursor with bindVars).
What you should keep in mind: mutating the document is potentially a slow operation because of the append-only nature of the storage layer. In other words, a new copy of the document with a new revision number is persisted, and the old revision will be compacted away some time later (I'm not familiar with all the technical details, but I think this is fair to say). For a 500MB document it seems to take a few seconds to update or copy it using RocksDB on a rather strong system. It's much better to have many small documents.

Cassandra High client read request latency compared to local read latency

We have a 20-node Cassandra cluster running a lot of read requests (~900k/sec at peak). Our dataset is fairly small, so everything is served directly from memory (OS page cache). Our data model is quite simple (just key/value) and all of our reads are performed with consistency level ONE (RF 3).
We use the Java Datastax driver with TokenAwarePolicy, so all of our reads should go directly to one node that has the requested data.
These are some metrics extracted from one of the nodes regarding client read request latency and local read latency.
org_apache_cassandra_metrics_ClientRequest_50thPercentile{scope="Read",name="Latency",} 105.778
org_apache_cassandra_metrics_ClientRequest_95thPercentile{scope="Read",name="Latency",} 1131.752
org_apache_cassandra_metrics_ClientRequest_99thPercentile{scope="Read",name="Latency",} 3379.391
org_apache_cassandra_metrics_ClientRequest_999thPercentile{scope="Read",name="Latency",} 25109.16
org_apache_cassandra_metrics_Keyspace_50thPercentile{keyspace="<keyspace>",name="ReadLatency",} 61.214
org_apache_cassandra_metrics_Keyspace_95thPercentile{keyspace="<keyspace>",name="ReadLatency",} 126.934
org_apache_cassandra_metrics_Keyspace_99thPercentile{keyspace="<keyspace>",name="ReadLatency",} 182.785
org_apache_cassandra_metrics_Keyspace_999thPercentile{keyspace="<keyspace>",name="ReadLatency",} 454.826
org_apache_cassandra_metrics_Table_50thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 105.778
org_apache_cassandra_metrics_Table_95thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 1131.752
org_apache_cassandra_metrics_Table_99thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 3379.391
org_apache_cassandra_metrics_Table_999thPercentile{keyspace="<keyspace>",scope="<table>",name="CoordinatorReadLatency",} 25109.16
Another important detail is that most of our queries (~70%) don't return anything, i.e., they are for records not found. So, bloom filters play an important role here and they seem to be fine:
Bloom filter false positives: 27574
Bloom filter false ratio: 0.00000
Bloom filter space used:
Bloom filter off heap memory used: 6760992
As can be seen, reads on each of the nodes are really fast: the 99.9th percentile is less than 0.5 ms. However, the client request latency is way higher, going above 4 ms at the 99th percentile. If I'm reading with CL ONE and using TokenAwarePolicy, shouldn't both values be similar to each other, since no coordination is required? Am I missing something? Is there anything else I could check to see what's going on?
Thanks in advance.
#luciano
There are various reasons why the coordinator and the replica can report different 99th percentiles for read latencies, even with token awareness configured in the client.
These can be anything that manifests between the coordinator code and the replica's storage engine code in the read path.
Examples can be:
read repairs (not directly tied to a particular request, as they are asynchronous to the read that triggered them, but they can cause issues),
host timeouts (and/or speculative retries),
token awareness failure (the dynamic snitch simply not keeping up),
GC pauses.
Look for per-host metric anomalies and overlaps with GC, and even try to capture traces for some of the slower requests and investigate whether they're doing everything you expect from C* (e.g. token awareness).
Well-tuned and well-specced clusters may also see the dynamic snitch simply not being able to keep up and do its intended job. In such situations, disabling the dynamic snitch can fix the high latencies for top-end read percentiles. See https://issues.apache.org/jira/browse/CASSANDRA-6908
Be careful though: measure and confirm hypotheses, as misapplied solutions can easily have negative effects!
Even when using TokenAwarePolicy, the driver can't apply the policy if it doesn't know what the partition key is.
If you are using simple statements, no routing information is provided, so you need to give the driver the additional information by calling setRoutingKey.
The DataStax Java Driver's manual is a good friend.
http://docs.datastax.com/en/developer/java-driver/3.1/manual/load_balancing/#requirements
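As a rough sketch of what that looks like with the 3.x Java driver (the keyspace, table and key names here are made up; the point is only the setRoutingKey call):

import java.nio.ByteBuffer;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ProtocolVersion;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.TypeCodec;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class RoutingKeyExample {
    public static void main(String[] args) {
        // Token awareness wraps another policy; the contact point is a placeholder
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .build();
        // Connecting to the keyspace lets the policy fall back to it for routing
        Session session = cluster.connect("my_ks");

        // Hypothetical key/value table: my_ks.kv(key text PRIMARY KEY, value text)
        String key = "some-key";
        SimpleStatement stmt =
                new SimpleStatement("SELECT value FROM kv WHERE key = ?", key);

        // Simple statements carry no routing information, so provide it explicitly
        ByteBuffer routingKey =
                TypeCodec.varchar().serialize(key, ProtocolVersion.NEWEST_SUPPORTED);
        stmt.setRoutingKey(routingKey);

        Row row = session.execute(stmt).one();
        System.out.println(row == null ? "not found" : row.getString("value"));
        cluster.close();
    }
}

With prepared statements the driver can compute the routing key from the bound values itself, so preparing the query is usually the simpler fix.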
If TokenAware is working properly, the CoordinatorReadLatency value is mostly the same as the ReadLatency value. You should check that too.
http://cassandra.apache.org/doc/latest/operating/metrics.html?highlight=coordinatorreadlatency
Thanks for your reply, and sorry about the delay in getting back to you.
One thing I’ve found out is that our clusters had:
dynamic_snitch_badness_threshold=0
in the config files. Changing that to the default value (0.1) helped a lot in terms of the client request latency.
The GC seems to be stable, even under high load. The pauses are constant (~10 ms/sec) and I haven’t seen spikes (not even full GCs). We’re using CMS with a bigger Xmn (2.5GB).
Read repairs happen all the time (we have the chance set to 10%), so when the system is handling 800k requests/sec, we have ~80k read repairs/sec happening in the background.
It also seems that we’re asking too much of a 20-machine cluster. From the client's point of view, latency is quite stable up to 800k qps; after that it starts to spike a bit, but it's still under a manageable threshold.
Thanks for all the tips, the dynamic snitch thing was really helpful!

Azure Search: does a document lookup count as a query?

If I look up a document to bring back data from Azure Search, does it affect the queries per second per index indicated here?
I want to know if I can use Azure Search to host some data and access it without affecting search performance.
thanks
Yes, a lookup is considered a query. Please note that we do not throttle your queries, and the number listed on the page you point to is only meant as a very rough indication of what a single search unit with an "average" index and an "average" set of queries could handle. In some cases (for example, if you were just doing lookups, which are very simple queries), you might very likely get more than 15 QPS with a very good latency rate. In other cases (for example, if you have queries with a huge number of facets), you might get less. Please note that although we do not throttle you, it is also possible that you could exceed the resources of the units allocated to you and start to receive throttling HTTP responses.
In general, the best thing to do is track the latency of your queries. If you start seeing the latency go higher than what you find acceptable, that is typically a good time to consider adding another replica.
Ultimately, the only way to know for sure is to test your specific index with the types of queries and load you expect.
I hope that helps.
Liam

Regarding maxIndexingThreads config in solrconfig.xml

I have a Solr cluster with 8 servers (4 shards, with one replica for each). I have 80 client threads indexing to this cluster. The client is running on a different machine. I am trying to figure out the optimal number of indexing threads.
Now, solrconfig.xml has a config for maxIndexingThreads:
"The maximum number of simultaneous threads that may be indexing documents at once in IndexWriter; if more than this many threads arrive they will wait for others to finish. Default in Solr/Lucene is 8."
I want to know whether this configuration is per Solr instance or per core (or collection).
Also, is there a way to specify the number of threads used by queries?

How to avoid crawling a shared disk without bringing it down?

I am using Nutch. I plan to crawl a shared disk instead of internet websites.
One thing I am worried about is that crawling will make the disk really slow.
How can I crawl the shared disk without bringing it down?
You can set the number of threads and wait time between requests in conf/nutch-site.xml.
Try overriding these properties and set them to a value that you feel comfortable with:
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>1</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time.
</description>
</property>
