How to crawl a shared disk without bringing it down? - nutch

I am using Nutch. I plan to crawl a shared disk instead of an internet website.
One thing I am worried about is that crawling will make the disk really slow.
How can I crawl the shared disk without bringing it down?

You can set the number of threads and the wait time between requests in conf/nutch-site.xml.
Try overriding these properties and setting them to values that you feel comfortable with:
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>1</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time.
</description>
</property>
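A minimal sketch for checking which values are in effect and what upper bound on simultaneous requests they imply. It assumes the overrides sit inside the usual <configuration> root element of conf/nutch-site.xml. The names read_properties, max_concurrent_requests and SAMPLE are illustrative and not part of Nutch. The bound is fetcher.threads.fetch times the number of nodes, as the property description above states.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the contents of conf/nutch-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def max_concurrent_requests(props, nodes=1):
    """Upper bound on simultaneous requests: fetcher threads * nodes."""
    return int(props["fetcher.threads.fetch"]) * nodes


props = read_properties(SAMPLE)
print(props)
print(max_concurrent_requests(props, nodes=1))
```

For a shared disk, starting with a small fetcher.threads.fetch and raising it while watching the disk load is probably the safer direction.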

Related

Cassandra concurrent read and write

I am trying to understand concurrent reads and writes in Cassandra. I came across the property called
concurrent_reads (default: 8)
A good rule of thumb is 4 concurrent_reads per processor core. May increase the value for systems with fast I/O storage.
So as per the definition (correct me if I am wrong), 4 threads can access the database concurrently. So let's say I am trying to run the following query:
SELECT max(column1) FROM testtable WHERE duration = 'month';
If I just execute this query, what role does concurrent_reads play?
That's how many active reads can run at a single time per host. This is viewable under the read stage if you run nodetool tpstats. If the active count is pegged at the number of concurrent readers and you have a pending queue, it may be worth trying to increase this. It's pretty normal for people to have this at ~128 when using decent-sized heaps and SSDs. This is very hardware dependent, so the defaults are conservative.
Keep in mind that the activity on this thread is very fast, usually measured in sub-milliseconds. But assuming each read takes 1 ms, even with only 4 concurrent reads, Little's law gives a maximum of 4000 local reads per second per node (1000/1 * 4). With RF=3 and quorum consistency, you're doing a minimum of 2 reads per request, so you can divide by 2 to get a theoretical maximum throughput (real life is ickier).
The aggregation functions (e.g. max) are processed on the coordinator after fetching the data from the replicas (each doing a local read and sending a response). They are not directly impacted by concurrent_reads, since they are handled in the native transport and request response stages.
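The arithmetic above can be sketched as follows. This is only a back-of-the-envelope estimate under the stated assumptions (1 ms per local read, RF=3, quorum consistency). The function names are illustrative and not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Number of replicas that must answer a quorum read."""
    return replication_factor // 2 + 1


def local_reads_per_second(concurrent_reads, read_latency_ms):
    """Little's law: throughput = concurrency / latency."""
    return concurrent_reads * (1000.0 / read_latency_ms)


def requests_per_second(concurrent_reads, read_latency_ms, reads_per_request):
    """Theoretical ceiling once each request costs several local reads."""
    return local_reads_per_second(concurrent_reads, read_latency_ms) / reads_per_request


reads = local_reads_per_second(concurrent_reads=4, read_latency_ms=1.0)
requests = requests_per_second(4, 1.0, reads_per_request=quorum(3))
print(reads)     # 4000.0 local reads per second per node
print(requests)  # 2000.0 requests per second, theoretical maximum
```

Plugging in 128 concurrent reads instead of 4 raises the ceiling proportionally, which is why the setting matters mostly on hardware that can actually serve that many reads at once.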
From Cassandra 2.2 onward, the standard aggregate functions min, max, avg, sum and count are built in. So I don't think concurrent_reads will have any effect on your query.

Regarding maxIndexingThreads config in solrconfig.xml

I have a Solr cluster with 8 servers (4 shards with one replica for each). I
have 80 client threads indexing to this cluster. The client is running on a
different machine. I am trying to figure out the optimal number of indexing
threads.
Now, solrconfig.xml has a config for maxIndexingThreads:
"The maximum number of simultaneous threads that may be indexing documents
at once in IndexWriter; if more than this many threads arrive they will wait
for others to finish. Default in Solr/Lucene is 8. "
I want to know whether this configuration is per Solr instance or per
core (or collection).
Also, is there a way to specify the number of threads used by queries?

Azure Table Storage transaction limitations

I'm running performance tests against ATS and it's behaving a bit weirdly when using multiple virtual machines against the same table / storage account.
The entire pipeline is non-blocking (await/async) and uses TPL for concurrent and parallel execution.
First of all, it's very strange that with this setup I'm only getting about 1200 insertions per second. This is running on an L VM box, which has 4 cores + 800 Mbps.
I'm inserting 100,000 rows with unique PK and unique RK, which should give the best possible distribution.
An even more consistent behavior is the following:
When I run 1 VM I get about 1200 insertions per second.
When I run 3 VMs I get about 730 insertions per second on each.
It's quite humorous to read the blog post where they specify their targets.
https://azure.microsoft.com/en-gb/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/
Single Table Partition– a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the 20,000 entities/second, which is the overall account target described above.
What should I do to be able to utilize the 20k per second, and how would it be possible to execute more than 1.2k per VM?
--
Update:
I've now also tried using 3 storage accounts, one for each individual node, and am still getting the same performance / throttling behavior, which I can't find a logical reason for.
--
Update 2:
I've optimized the code further and am now able to execute about 1550 insertions per second.
--
Update 3:
I've now also tried in US West. The performance is worse there, about 33% lower.
--
Update 4:
I tried executing the code from an XL machine, which has 8 cores instead of 4 and double the memory and bandwidth, and got a 2% increase in performance, so clearly this problem is not on my side.
A few comments:
You mention that you are using unique PK/RK to get ultimate
distribution, but you have to keep in mind that the PK balancing is
not immediate. When you first create a table, the entire table will
be served by 1 partition server. So if you are doing inserts across
several different PKs, they will still be going to one partition
server and be bottlenecked by the scalability target for a single
partition. The partition master will only start splitting your
partitions among multiple partition servers after it has identified hot
partition servers. In your <2 minute test you will not see the
benefit of multiple partition servers or PKs. The throughput in the
article is targeted towards a well distributed PK scheme with
frequently accessed data, causing the data to be divided amongst
multiple partition servers.
The size of your VM is not the issue as
you are not blocked on CPU, Memory, or Bandwidth. You can achieve
full storage performance from a small VM size.
Check out
http://research.microsoft.com/en-us/downloads/5c8189b9-53aa-4d6a-a086-013d927e15a7/default.aspx.
I just now did a quick test using that tool from a WebRole VM in the
same datacenter as my storage account and I achieved, from a single
instance of the tool on a single VM, ~2800 items per second upload
and ~7300 items per second download. This is using 1024 byte
entities, 10 threads, and 100 batch size. I don't know how efficient this tool is or whether it disables Nagle's algorithm, as I was unable to get great results (I got ~1000/second) using a batch size of 1, but at least with the 100 batch size it shows that you can achieve high items/second. This was done in US West.
Are you using storage client library 1.7 (Microsoft.WindowsAzure.StorageClient.dll) or 2.0 (Microsoft.WindowsAzure.Storage.dll)? The 2.0 library has some performance improvements and should yield better results.
I suspect this may have to do with TCP Nagle.
See this MSDN article and this blog post.
In essence, TCP Nagle is a protocol-level optimization that batches up small requests. Since you are sending lots of small requests this is likely to negatively affect your performance.
You can disable TCP Nagle by executing this code when starting your application:
ServicePointManager.UseNagleAlgorithm = false;
Are the compute instances and storage account in the same affinity group? Affinity groups ensure that network proximity between the services is optimal and should result in lower latency at the network level.
You can find affinity group configuration under the network tab.
I would tend to believe that the maximum throughput is for an optimized load. For example, I bet that you can achieve higher performance using batch requests than with the individual requests you are doing now. And of course, if you use GUIDs for your PK, you can't batch in your current test.
So what if you changed your test to batch insert entities in groups of 100 (the maximum per batch), still using GUIDs, but with all 100 entities in a batch sharing the same PK?
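The grouping suggested above can be sketched as follows. It only prepares the batches (GUID partition keys shared by up to 100 entities each). Submitting them requires the storage client library and is deliberately not shown. The helper names make_entities and make_batches, and the dictionary layout of an entity, are assumptions made for illustration.

```python
import uuid
from collections import defaultdict

MAX_BATCH = 100  # maximum entities per batch, as stated in the answer above


def make_entities(count, rows_per_partition=MAX_BATCH):
    """Generate test entities; every rows_per_partition rows share one GUID PK."""
    entities = []
    partition_key = None
    for i in range(count):
        if i % rows_per_partition == 0:
            partition_key = str(uuid.uuid4())
        entities.append({"PartitionKey": partition_key, "RowKey": str(uuid.uuid4())})
    return entities


def make_batches(entities, max_batch=MAX_BATCH):
    """Group entities by partition key, then cut each group into batches."""
    by_partition = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity)
    batches = []
    for rows in by_partition.values():
        for start in range(0, len(rows), max_batch):
            batches.append(rows[start:start + max_batch])
    return batches


entities = make_entities(250)
batches = make_batches(entities)
print(len(batches), [len(batch) for batch in batches])
```

Each resulting batch contains a single partition key, so it could be sent as one batch request instead of up to 100 individual inserts.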

What is the maximum number of keyspaces in Cassandra?

What is the maximum number of keyspaces allowed in a Cassandra cluster? The wiki page on limitations doesn't mention one. Is there such a limit?
A keyspace is basically just a Map entry to Cassandra... you can have as many as you have memory for. Millions, easily.
ColumnFamilies are more expensive, since Cassandra will reserve a minimum of 1MB for each CF's memtable: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
You should have a look at: https://community.datastax.com/questions/12579/limit-on-number-of-cassandra-tables.html
We recommend a maximum of 200 tables total per cluster across all
keyspaces (regardless of the number of keyspaces). Each table uses 1MB
of memory to hold metadata about the tables so in your case where 1GB
is allocated to the heap, 500-600MB is used just for table metadata
with hardly any heap space left for other operations.
It is a recommendation and there is no hard-limit on the number of tables you can create in a cluster. You can create thousands if you were so inclined.
More importantly, applications take a long time to startup since the
drivers request the cluster metadata (including the schema) during the
initialisation/discovery phase. Retrieving the schema for 200 tables
takes significantly less time than it would to load 500, 1000 or 3000.
This may not be important to you but there are lots of use cases where
short startup times are crucial, most notably for short-lived
serverless functions where execution time costs money and reducing
execution where possible results in thousands of dollars in savings.
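The heap arithmetic in the quoted recommendation can be reproduced with a small sketch. It assumes the figure of 1MB of metadata per table given in the quote, and the function names are illustrative.

```python
def table_metadata_mb(table_count, mb_per_table=1.0):
    """Heap used for table metadata, assuming a fixed cost per table."""
    return table_count * mb_per_table


def heap_fraction(table_count, heap_mb, mb_per_table=1.0):
    """Share of the heap consumed by table metadata alone."""
    return table_metadata_mb(table_count, mb_per_table) / heap_mb


# Compare the recommended 200 tables with 500 and 600 tables on a 1GB heap.
for tables in (200, 500, 600):
    used = table_metadata_mb(tables)
    share = heap_fraction(tables, heap_mb=1024)
    print(tables, used, round(share, 2))
```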

How to speed up crawling in Nutch

I am trying to develop an application in which I give a constrained set of URLs to the urls file in Nutch. I am able to crawl these URLs and get their contents by reading the data from the segments.
I crawl with depth 1, as I am not concerned about the outlinks or inlinks of the webpages. I only need the contents of the webpages listed in the urls file.
But performing this crawl takes time. So please suggest a way to decrease the crawl time and increase the crawl speed. I also don't need indexing, because I am not concerned about the search part.
Does anyone have suggestions on how to speed up the crawl?
The main thing for getting speed is configuring nutch-site.xml:
<property>
<name>fetcher.threads.per.queue</name>
<value>50</value>
<description></description>
</property>
You can scale up the threads in nutch-site.xml. Increasing fetcher.threads.per.host and fetcher.threads.fetch will both increase the speed at which you crawl. I have noticed drastic improvements. Use caution when increasing these, though. If you do not have the hardware or connection to support the increased traffic, the number of crawl errors can significantly increase.
For me, this property helped a lot, because a slow domain can slow down the whole fetch phase:
<property>
<name>generate.max.count</name>
<value>50</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
For example, if you respect robots.txt (the default behaviour) and a domain is too slow to crawl, the delay can be as long as fetcher.max.crawl.delay. Many URLs from that domain in a queue will slow down the whole fetch phase, so it's better to limit generate.max.count.
In the same way, you can add this property to limit the duration of the fetch phase:
<property>
<name>fetcher.throughput.threshold.pages</name>
<value>1</value>
<description>The threshold of minimum pages per second. If the fetcher downloads less
pages per second than the configured threshold, the fetcher stops, preventing slow queues
from stalling the throughput. This threshold must be an integer. This can be useful when
fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
</description>
</property>
But please, don't touch the fetcher.threads.per.queue property; you will end up on a blacklist. It's not a good way to improve the crawl speed.
Hello, I am also new to crawling, but I have used some methods that gave me good results; maybe they will help you.
I have changed my nutch-site.xml with these properties:
<property>
<name>fetcher.server.delay</name>
<value>0.5</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overridden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>400</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>25</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
Kindly suggest some more options.
Thanks
I had similar issues and could improve the speed with the help of
https://wiki.apache.org/nutch/OptimizingCrawls
It has useful information on what can be slowing down your crawl and what you can do to address each of those issues.
Unfortunately, in my case the queues are quite unbalanced, and I can't send requests too fast to the bigger one or I get blocked, so I probably need to move to a cluster solution or Tor before I speed up the threads further.
If you don't need to follow links, I see no reason to use Nutch. You can simply take your list of urls and fetch those with an http client library or a simple script using curl.
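A minimal sketch of that approach using only the Python standard library. It assumes a plain-text URL list with one URL per line. The helper names and the worker count are illustrative choices rather than anything taken from Nutch, and no politeness delay or robots.txt handling is included.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def read_url_list(text):
    """Parse one URL per line, skipping blank lines and '#' comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def fetch_one(url, timeout=10):
    """Return (url, body bytes), or (url, None) on a network or URL error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return url, response.read()
    except (OSError, ValueError):
        return url, None


def fetch_all(urls, fetch=fetch_one, max_workers=4):
    """Fetch all URLs with a small thread pool; returns {url: body}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch, urls))


# Example usage (performs real HTTP requests, so it is left commented out):
# pages = fetch_all(read_url_list(open("urls.txt").read()))
# print({url: len(body) if body else None for url, body in pages.items()})
```

The fetch parameter is injectable so the pool logic can be tried without network access. Keeping max_workers small limits the load on any single server.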
