NUTCH does not crawl a particular website

I'm using Apache Nutch version 2.2.1 to crawl some websites. Everything works fine except for one site, http://eur-lex.europa.eu/homepage.html.
I tried Apache Nutch version 1.8 as well and got the same behaviour: nothing is fetched.
Nutch fetches and parses the entry page, but after that it is as if it cannot extract any of its links.
I always see the following:
------------------------------
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread4, activeThreads=3
-finishing thread FetcherThread3, activeThreads=2
-finishing thread FetcherThread2, activeThreads=1
0/1 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 1 queues
-finishing thread FetcherThread0, activeThreads=0
-----------------
Any idea?

This might be because the site's robots.txt file restricts your crawler's access to the site.
By default Nutch checks the robots.txt file, located at http://yourhostname.com/robots.txt, and if it is not allowed to crawl that site it will not fetch any pages.
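For example, a robots.txt that blocks all crawlers from the entire site looks like this (an illustration, not the actual eur-lex.europa.eu file):

User-agent: *
Disallow: /

You can check the actual rules by opening http://eur-lex.europa.eu/robots.txt and looking for Disallow lines that apply to your crawler's agent name (set via the http.agent.name property in nutch-site.xml).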

Related

What happens to gunicorn pending requests?

I have a Python application deployed with gunicorn. The system configuration is an 8-core CPU and 64 GB of RAM. I have a worker:thread combination of 2:3 and sent 500 requests at once. I need to understand how gunicorn manages these requests, because after the run completed, only 371 of the 500 requests had finished successfully; the others were lost as if I had never sent them. I could not even find those requests in the logs.
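For reference, a 2:3 worker:thread setup is typically started like this (a sketch; the app:app module name is an assumption, and --backlog is shown at its default value of 2048):

# 2 workers x 3 threads = 6 requests handled concurrently;
# further pending connections wait in the listen backlog
gunicorn --workers 2 --threads 3 --backlog 2048 app:app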

Reasons for Cloudfront ClientCommError and ways to get around it

I am serving a few website assets from CloudFront (backed by S3) and periodically see errors like this:
2022-02-09 21:20:48 LAX3-C4 0 208.48.9.194 GET my_distribution.cloudfront.net /my/assets/3636.23f5cbf8445b5e7edb3f.js 000 https://my.site.com/ Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64;%20rv:96.0)%20Gecko/20100101%20Firefox/96.0 - - Error 7z652evl8PjlvQ65TxEtHHK3qoTU7Tf9F6CW3yHGYxRUYFGxjTlKAw== my_distribution.cloudfront.net https 61 0.003 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error HTTP/2.0 - - 62988 0.000 ClientCommError - - - -
CloudFront's explanation of ClientCommError: "The response to the viewer was interrupted due to a communication problem between the server and the viewer."
I have already introduced retries that attempt to load the resource 3 times before giving up, but for the most part it doesn't help. Also, the locations from which the resources are requested are often close by (not overseas, often even on the same US coast), and my files are pretty small (e.g. 475 B), so the issue can't be file size.
What are ways to mitigate such load errors and ensure all resources can be downloaded?
I wasted two hours on the same thing... It turns out I had naively used curl to test it, and since curl (sensibly) refused to output binary data to my console, nothing was actually pulled from S3 into CloudFront. Once I added --output to curl, I started getting hits from CloudFront.
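In other words, something like this (the path is taken from the log line above; the output file name is arbitrary):

# --output writes the response body to a file instead of the terminal
curl --output /tmp/asset.js https://my_distribution.cloudfront.net/my/assets/3636.23f5cbf8445b5e7edb3f.js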

How mod_jk handles node failure

We have configured mod_jk with two Tomcat servers behind two Apache web servers. We want to know how mod_jk handles node failure, and how it does health checks.
You can use a watchdog by setting the JkWatchdogInterval directive. From the documentation:
This directive configures the watchdog thread interval in seconds. The workers are maintained periodically by a background thread running periodically every watchdog_interval seconds. Worker maintenance checks for idle connections, corrects load status and is able to detect backend health status.
The maintenance only happens, if since the last maintenance at least worker.maintain seconds have passed. So setting the JkWatchdogInterval much smaller than worker.maintain is not useful.
The default value is 0 seconds, meaning the watchdog thread will not be created, and the maintenance is done in combination with normal requests instead.
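For example, to run worker maintenance from a background thread every 60 seconds (the value is illustrative):

# httpd.conf / mod_jk.conf
JkWatchdogInterval 60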
Among the Advanced Worker Directives, you can use "redirect", which is set to the name of the preferred failover worker, e.g.: worker.server-four.redirect=server-two
If the worker matching the SESSION ID is in an error state, the redirect worker will be used instead.
This feature was added in jk 1.2.9.
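A minimal workers.properties sketch putting the directive in context (the worker names follow the example above; the rest of the load-balancer setup is assumed):

# workers.properties
worker.list=lb
worker.lb.type=lb
worker.lb.balance_workers=server-two,server-four
# if server-four is in an error state, its sessions go to server-two
worker.server-four.redirect=server-two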
Status of mod_jk
Add the block below to your mod_jk.conf file:
<Location /status>
    JkMount status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>
Then you can see the status of mod_jk at the below URL:
http://webserverIP:port(from httpd.conf)/status
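Note that the "status" name in JkMount refers to a status worker, which is assumed to be defined in workers.properties, e.g.:

# workers.properties (add "status" to your existing worker.list)
worker.list=server-two,server-four,status
worker.status.type=status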

Apache 2.4.6 on Centos 7 - Process concurrent POST requests in parallel

I have an httpd server running on CentOS 7. The Apache details: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips PHP/5.4.16
I've configured the below for multi-threaded behavior:
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
ServerLimit 300
MaxClients 300
MaxRequestsPerChild 0
</IfModule>
From the front end, I can see that once the user clicks the "submit" button, 30 AJAX POST calls are made in parallel and wait for a response. Each of these AJAX calls is meant to run a SQL query against the DB. The query execution time on the DB itself, once the query reaches it, is very low.
Though there are 30 AJAX requests fired in parallel, the Apache server doesn't seem to fire them all off to the DB at the same time for some reason.
Is there any configuration that needs to be enabled on apache to do this?
I've tried to explain what I've observed; please let me know if my question could be worded better. I look forward to some guidance here.
Apache server doesn't seem to fire them all off to the DB
Apache does not do that. It just handles the request and hands it off to the resource the client requested; it is the script/program being requested that connects to the DB.
The SQL query execution time on the DB once the query hits the DB is very less.
If your 30 AJAX queries run in parallel at almost the same time and they are all identical, the DB server will likely cache the result and just return the cached result.
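As a side note, the mpm_prefork block in the question only takes effect if prefork is the active MPM; you can confirm which MPM is in use with:

# prints a line like "Server MPM: prefork"
httpd -V | grep -i MPM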

Tomcat Web Application threads stuck in Service Stage - causing app hang-ups

We are using Tomcat 6 / IIS to host our Java MVC web applications (Spring MVC and Frontman). We recently started running into problems where we see threads stuck in the Service stage for hours.
Using Lambda Probe, we see the threads start to pile up until eventually the app becomes unresponsive. The processing time keeps increasing with zero bytes in or out. The URL is reachable, and the logs show that the request starts but never finishes.
IP Stage processing time bytes-in bytes-out url
111.11.111.111 Service 00:57:26.0 0b 0b GET /Application/command/monitor
All of this is on a test server set up as follows:
ISAPI filter worker:
worker.testuser.type=ajp13
worker.testuser.host=localhost
worker.testuser.port=8009
worker.testuser.socket_timeout=300
worker.testuser.connection_pool_timeout=600
Server.xml:
<Connector
    port="8009"
    protocol="AJP/1.3"
    redirectPort="8443"
    tomcatAuthentication="false"
    connectionTimeout="6000"
/>
Any thoughts on why this happens or how to configure Tomcat to kill ancient application threads?
You can use the Java monitoring package (java.lang.management) to get all the threads and their thread dumps, and kill a stuck thread by its id (Thread.stop() is deprecated, but it does the work):
http://docs.oracle.com/javase/1.5.0/docs/guide/management/overview.html
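A minimal sketch of that approach using ThreadMXBean from java.lang.management (the class name and the stuckId value are illustrative; it must run inside the same JVM, e.g. from a diagnostic servlet):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class StuckThreadDump {
    public static void main(String[] args) {
        // List every live thread with its state and stack trace so
        // long-running "Service" threads can be identified by id.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds(), Integer.MAX_VALUE);
        for (ThreadInfo info : infos) {
            if (info == null) continue; // thread exited between the two calls
            System.out.println("id=" + info.getThreadId()
                    + " name=" + info.getThreadName()
                    + " state=" + info.getThreadState());
            for (StackTraceElement el : info.getStackTrace()) {
                System.out.println("    at " + el);
            }
        }

        long stuckId = 42L; // illustrative: an id found in the dump above
        // Last resort: match the stuck thread's id against the live
        // Thread objects and stop it.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getId() == stuckId) {
                t.stop(); // deprecated and unsafe, but it does the work
            }
        }
    }
}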
