Multithreaded crawler gets slower and slower after running for some time - python-3.x

I wrote a multithreaded web crawler under Windows. The libraries that I used were requests and threading. I found the program became slower and slower after running for some time (about 500 pages). When I stop the program and run again, the program speeds up again. It seems that there are many pending connections, causing the slowdown. How should I manage the problem?
My code:
import queue
import threading
import requests
req = requests.Session()
urlQueue = queue.Queue()
pageList = []
urlList = [url1,url2,....url500]
for i in urlList:
    urlQueue.put(i)
def parse(urlQueue):
    # Keep taking URLs until the queue is empty
    while True:
        try:
            url = urlQueue.get_nowait()
        except queue.Empty:
            break
        try:
            page = req.get(url)
            pageList.append(page)
        except requests.RequestException:
            continue
if __name__ == '__main__':
    threadNum = 4
    threadList = []
    for i in range(threadNum):
        t = threading.Thread(target=parse, args=(urlQueue,))
        threadList.append(t)
    for thread in threadList:
        thread.start()
    for thread in threadList:
        thread.join()
I searched for the problem. One answer said it was caused by the reuse and recycling of TCP connections under Linux. I don't understand that answer very well. It is reproduced below, translated from Chinese.
Type this command in a Linux shell: netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
The TIME_WAIT count was nearly 20,000, so there must be many lingering TCP connections.
Use the following commands to enable TCP TIME_WAIT reuse and recycling, respectively:
echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
That answer seems correct. It should be a network problem. How should I solve this under Windows?
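To check the same thing on Windows, where awk is usually not available, the per-state count from the awk one-liner above can be approximated in Python. This is a minimal sketch: the function name count_tcp_states and the sample rows are made up for illustration, and it assumes the state is the last column of each TCP row, as in the output of netstat -n.

```python
from collections import Counter

def count_tcp_states(netstat_output):
    """Count TCP connections per state in the text printed by `netstat -n`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Linux rows start with "tcp"/"tcp6", Windows rows with "TCP"
        if len(fields) >= 4 and fields[0].lower().startswith("tcp"):
            states[fields[-1]] += 1
    return states

# Made-up sample rows in the Windows `netstat -n` layout, for illustration only
sample = """\
  Proto  Local Address          Foreign Address        State
  TCP    192.168.0.5:50123      93.184.216.34:80       TIME_WAIT
  TCP    192.168.0.5:50124      93.184.216.34:80       TIME_WAIT
  TCP    192.168.0.5:50125      93.184.216.34:443      ESTABLISHED
"""
print(count_tcp_states(sample))
```

On a live machine the text could come from subprocess.run(["netstat", "-n"], capture_output=True, text=True).stdout instead of the sample string. A large TIME_WAIT count would be consistent with the explanation in the translated answer.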

A multithreaded crawler can exhaust the available TCP connections. Lowering TcpTimedWaitDelay lets Windows release closed connections from the TIME_WAIT state sooner, so they can be reused. You can change it either by editing the registry manually or from code.
How to do it on Windows with code:
(You need to run the code as an administrator; otherwise an error will be raised.)
import win32api, win32con
key = win32api.RegOpenKey(win32con.HKEY_LOCAL_MACHINE, r'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters', 0, win32con.KEY_SET_VALUE)
# TcpTimedWaitDelay is a DWORD value holding a number of seconds
win32api.RegSetValueEx(key, 'TcpTimedWaitDelay', 0, win32con.REG_DWORD, 30)
win32api.RegCloseKey(key)
How to do it on Windows manually:
Open Run and type regedit
Find: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters
Click Edit - New - DWORD (32-bit) Value
Create TcpTimedWaitDelay (if this entry already exists, you do not need to create it)
Change the value to 30 (decimal). (The value ranges from 30 to 300 seconds, and the default is 120 seconds. The default is too long for a multithreaded crawler.)
Thank you all for your contributions to this question. It helps a lot of people.
Reference site

Related

How can you ensure a viable endpoint for a stanza CoreNLPClient?

I would like to use the stanza CoreNLPClient to extract noun phrases, similar to this method.
However, I cannot seem to find a good port to start the server on. The default is 9000, but this is often occupied, as indicated by the error message:
PermanentlyFailedException: Error: unable to start the CoreNLP server
on port 9000 (possibly something is already running there)
EDIT: Port 9000 is in use by python.exe, which is why I can't just shut the process down to make space for the CoreNLPClient.
Then, when I select other ports such as 7999, 8000, or 8080, the server keeps listening indefinitely without executing the subsequent lines of code, showing only the following:
2021-07-19 12:05:55 INFO: Starting server with command: java -Xmx8G -cp C:\Users\timjo\stanza_corenlp* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 7998 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-2e15724b8064491b.props -preload -outputFormat serialized
I have the latest version of stanza installed, and am running the following code from an .ipynb file in VS Code:
from stanza.server import CoreNLPClient
# sample sentence
sentence = "Albert Einstein was a German-born theoretical physicist."
# start the client as indicated in the docs
with CoreNLPClient(properties='corenlp_server-2e15724b8064491b.props', endpoint='https://localhost:7998', memory='8G', be_quiet=True) as client:
    matches = client.tregex(text=sentence, pattern = 'NP')
# extract the noun phrases and their indices
noun_phrases = [[text, begin, end] for text, begin, end in
    zip([sentence[match_id]['spanString'] for sentence in matches['sentences'] for match_id in sentence],
        [sentence[match_id]['characterOffsetBegin'] for sentence in matches['sentences'] for match_id in sentence],
        [sentence[match_id]['characterOffsetEnd'] for sentence in matches['sentences'] for match_id in sentence])]
Main question: How can I ensure that the server starts on an open port, and closes afterwards? I would prefer having a semi-automatic way to finding open / shutting down occupied ports for the client to run on.
In general it is sufficient to choose another number that nothing else is using – maybe 9017? There are lots of numbers to choose from! But the more careful choice would be to create the CoreNLPClient in a while loop with a try/except and to increment the port number until you find one that is open.
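One way to sketch the "increment the port number until one is open" idea is to probe ports with the standard socket module before starting the client. The helper name find_free_port and the 9000-9099 range are assumptions for illustration, not part of stanza. Another process could still grab the port between the check and the server start, so this only reduces the chance of a collision.

```python
import socket

def find_free_port(start=9000, end=9100, host="localhost"):
    """Return the first port in [start, end) that can currently be bound on host."""
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return port
    raise RuntimeError("no free port between %d and %d" % (start, end - 1))

port = find_free_port()
print("free port:", port)

# Hypothetical use with the client from the question (requires stanza and CoreNLP):
# with CoreNLPClient(endpoint="http://localhost:%d" % port, memory="8G", be_quiet=True) as client:
#     matches = client.tregex(text=sentence, pattern="NP")
```

Leaving the with block is what shuts the server down again, so the port is released once the code inside it has finished.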
After 2 hours of working on this, I now know the following:
Taking port 9000 is not an option, given that it is used by python. Informal evidence points towards this having something to do with using a jupyter notebook as opposed to a 'regular' python .py file.
Regarding the Client not closing when using other endpoints: I should've simply used 'http://localhost:port' instead of 'https://...'.
Hopefully this can help someone else struggling with this problem. I guess this was my non-computer science background seeping through.
(edited to resolve typos)

Detecting when a child process is waiting for stdin

I am making a terminal program that is able to run any executable (please ignore safety concerns). I need to detect when the child process is waiting for the user input (from stdin). I start the child process using:
process = subprocess.Popen(command, close_fds=False, shell=True, **file_descriptors)
I can think of 2 ways of detecting if the child process is waiting for stdin:
Writing a character then backspace and checking if the child has processed those 2 bytes. But here it says that "CMD does support the backspace key". So I need to find a character that when printed to the screen will delete what ever is in the stdin buffer in the command prompt.
The second method is to use the pywin32 library and use the WaitForInputIdle function as described here. I looked at the source code for the subprocess library and found that it uses pywin32 and it keeps a reference to the process handle. So I tried this:
win32event.WaitForInputIdle(proc._handle, 100)
But I got this error:
(1471, 'WaitForInputIdle', 'Unable to finish the requested operation because the specified process is not a GUI process.')
Also, the Windows API documentation here says: "WaitForInputIdle waits only once for a process to become idle; subsequent WaitForInputIdle calls return immediately, whether the process is idle or busy.". I think that means I can't use the function for its purpose more than once, which wouldn't solve my problem.
Edit:
This only needs to work on Windows, but later I might try to make my program compatible with Linux as well. Also, I am using pipes for stdin/stdout/stderr.
Why I need to know if the child is waiting for stdin:
Currently, when the user presses the enter key, I send all of the data, that they have written so far, to stdin and disable the user from changing it. The problem is when the child process is sleeping/calculating and the user writes some input and wants to change it before the process starts reading from stdin again.
Basically, let's take this program:
sleep(10)
input("Enter value:")
and let's say that I enter "abc\n". When using cmd, it will allow me to press backspace and delete the input if the child is still sleeping. Currently my program marks all of the text as read-only when it detects the "\n" and sends it to stdin.
from threading import Lock, Thread
from time import sleep
class STDINHandle:
    def __init__(self, read_handle, write_handle):
        self.handled_write = False
        self.working = Lock()
        self.write_handle = write_handle
        self.read_handle = read_handle
    def check_child_reading(self):
        with self.working:
            # Reset the flag
            self.handled_write = True
            # Write a character that cmd will ignore
            self.write_handle.write("\r")
            thread = Thread(target=self.try_read)
            thread.start()
            sleep(0.1)
            # We need to stop the other thread by giving it data to read
            if self.handled_write:
                # Writing only 1 "\r" fails for some reason.
                # For good measure we write 10 "\r"s
                self.write_handle.write("\r"*10)
                return True
            return False
    def try_read(self):
        data = self.read_handle.read(1)
        self.handled_write = False
    def write(self, text):
        self.write_handle.write(text)
I did a bit of testing and I think cmd ignores "\r" characters. I couldn't find a case where cmd interprets it as an actual character (as happened with "\b"). The idea is to send a "\r" character and test whether it stays in the pipe. If it stays in the pipe, the child hasn't processed it. If we can't read it back from the pipe, the child has processed it. But there is a problem: if we can't read from the pipe, we need to stop the pending read, otherwise it will interfere with the next write to stdin. To do that we write more "\r"s to the pipe.
Note: I might have to change the timing on the sleep(0.1) line.
I am not sure this is a good solution, but you can give it a try if interested. I assumed that the child process is run for its output, given two inputs: data and TIMEOUT.
import subprocess
# Hypothetical wrapper function, added so that the return statements below are valid
def run_child(command, data, TIMEOUT, **file_descriptors):
    process = subprocess.Popen(command, close_fds=False, shell=True, **file_descriptors)
    try:
        output, _ = process.communicate(data, TIMEOUT)
    except subprocess.TimeoutExpired:
        print("Timeout expired while waiting for the child process.")
        # Do whatever you want here
        return None
    cmd_output = output.decode()
    return cmd_output
You can find more examples for TimeoutExpired here.
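A self-contained way to try the timeout approach is sketched below. The helper name run_with_timeout and the two child commands are made up for the demonstration. It assumes stdin and stdout are pipes, and it kills the child on timeout, which may or may not be what a terminal program wants.

```python
import subprocess
import sys

def run_with_timeout(args, data, timeout):
    """Send data to the child's stdin and return its decoded stdout, or None on timeout."""
    process = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        output, _ = process.communicate(data, timeout)
    except subprocess.TimeoutExpired:
        # The child is still busy: stop it and collect whatever is left
        process.kill()
        process.communicate()
        return None
    return output.decode()

# Demonstration children: one reads a line right away, one sleeps past the timeout
echo_child = [sys.executable, "-c", "print(input().upper())"]
slow_child = [sys.executable, "-c", "import time; time.sleep(5)"]

print(run_with_timeout(echo_child, b"abc\n", 10))
print(run_with_timeout(slow_child, b"", 0.5))
```

Note that a timeout only shows that the child did not finish in time. It does not distinguish a child that is computing from one that is waiting on stdin.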

Write pcap file about TCP traffic of a web-crawler

A URL request and sniff(count=x) don't work together. sniff(count=x) waits for x packets, and because I have to put that line before the URL request, it blocks the program: the URL request never starts, so no packet is ever sniffed.
When I opened two windows on the Ubuntu command line, it worked. In the first window I started Python in interactive mode and activated the sniffer. After that, I started the web crawler in the second window, and the sniffer in the first window received the packets correctly and printed them to the screen / wrote them to a pcap file.
Now the easiest way would be to write two scripts and start them from two different windows, but I want to do the complete work in one script: crawling the web, sniffing the packets, and writing them to a pcap file.
Here is the code that does not work:
class spider():
    …
    def parse(self):
        a = sniff(filter="icmp and host 128.65.210.181", count=1)
        req = urllib.request.urlopen(self.next_url.replace(" ",""))
        a.nsummary()
        charset = req.info().get_content_charset()
Now the first line blocks the program, waiting for a packet that cannot arrive because the request is only made on the next line. Swapping the lines doesn't work either. I think the only way to resolve the problem is to use parallelism, so I've also tried this:
class protocoller():
    ...
    def run(self):
        self.pkt = sniff(count=5) # and here it blocks
    …
prot = protocoller()
Main.thr = threading.Thread(target=prot.run())
Main.thr.start()
I always thought that a thread runs independently of the main program, but it blocks it as if it were part of it. Any suggestions?
So what I would need is a solution in which the web-crawler and the IP/TCP protocoller based on scapy are running independently from each other.
Could the sr()-function of scapy be an alternative?
https://scapy.readthedocs.io/en/latest/usage.html
Is it possible to put the request manually in the packet and to put the received packet into the pcap-file?
Your example doesn't show what's going on in other threads so I assume you've got a second thread to do the request etc. If all that is in order the obvious error is here:
Main.thr = threading.Thread(target=prot.run())
This executes the function prot.run and passes the result to the target parameter of Thread. It should be:
Main.thr = threading.Thread(target=prot.run)
This passes the function itself to Thread, which then calls it in the new thread.
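The difference can be seen without scapy. In the sketch below, blocking_task is a made-up stand-in for a blocking call such as sniff(count=5). Because the function object is passed as target, start() returns almost immediately and the work happens in the background.

```python
import threading
import time

def blocking_task(results):
    """Stand-in for a blocking call such as sniff(count=5)."""
    time.sleep(0.5)
    results.append("finished")

results = []
# Pass the function itself; writing target=blocking_task(results) would run it here, in the main thread
worker = threading.Thread(target=blocking_task, args=(results,))
started = time.monotonic()
worker.start()
start_cost = time.monotonic() - started
print("start() returned after %.3f s" % start_cost)

# The main thread is free to do other work here, for example the URL request
worker.join()
print(results)
```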
The other answer works great.
FYI, Scapy 2.4.3 also has a native way of doing this:
https://scapy.readthedocs.io/en/latest/usage.html#asynchronous-sniffing

Is there any way to stop the implicit wait during try/except?

I have a selenium script that automates signing up on a website. During the process, I have driver.implicitly_wait(60), BUT there is a segment of code with a try/except statement that tries to click something and continues if it can't be found. The issue is that if the element isn't there to be clicked, it waits 60 seconds before running the except part. Is there any way to have it not wait the 60 seconds before running the except part? Here is my code:
if PROXYSTATUS==False:
    driver.find_element_by_css_selector("img[title='中国大陆']").click()
else:
    try:
        driver.find_element_by_css_selector("img[title='中国大陆']").click()
    except:
        pass
In other words if a proxy is used, a pop up will occasionally display, but sometimes it won't. That's why I need the try/except.
You can use set_page_load_timeout to change the default timeout to a lower value that suits you.
You will still need to wait for some amount of time, otherwise you might simply never click on the element you are looking for, because your script will be faster than the page load.
In the try block you can lower the timeout, say to 10 or even to 0, by calling driver.implicitly_wait(10). Place this before the find-element statement in the try block. Add a finally block that sets it back to 60 with driver.implicitly_wait(60).
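The lower-then-restore pattern can be wrapped in a small context manager so the restore step is not forgotten. This is a sketch: temporary_implicit_wait is a made-up helper, not part of selenium, and it only assumes the driver has an implicitly_wait(seconds) method. The default of 60 matches the value in the question.

```python
from contextlib import contextmanager

@contextmanager
def temporary_implicit_wait(driver, seconds, restore_to=60):
    """Use a shorter implicit wait inside the block, then put the old value back."""
    driver.implicitly_wait(seconds)
    try:
        yield driver
    finally:
        # Runs even if the lookup inside the block raised an exception
        driver.implicitly_wait(restore_to)

# Hypothetical use with the code from the question (requires selenium and a running driver):
# with temporary_implicit_wait(driver, 0):
#     try:
#         driver.find_element_by_css_selector("img[title='中国大陆']").click()
#     except NoSuchElementException:
#         pass
```

With a wait of 0 the pop-up has to be present already when the lookup runs, so a small value such as 2 or 3 seconds may be a safer choice.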

X3270 Connection and Programming

I'm looking at using an x3270 terminal emulator. I have looked over the source material at http://x3270.bgp.nu/ and still don't see how to start using the tool or configure it.
I'm wondering how I can open a terminal and connect. Another question: how could I integrate this into a Python program?
edit:
here is a snippet:
em = Emulator()
em.connect(ip)
em.send_string('*user name*')
em.exec_command('Tab')
em.send_string('*user password*')
em.send_enter()
em.send_enter()
em.wait_for_field()
em.save_screen("{0}screenshot".format(*path*))
Looking at the saved screen, I see that the cursor hasn't moved. I can move the cursor using
em.move_to(7,53)
but after that, no text I send gets through. Any ideas?
Here's what I do; it works 100% of the time:
from py3270 import *
import sys, os
host = "%s" % sys.argv[1].upper()
try:
    e = Emulator()
    e.connect(host)
    e.wait_for_field()
except WaitError:
    print("py3270.connect(%s) failed" % (host))
    sys.exit(1)
print("--- connection made to %s ---" % (host))
If you haven't got a network connection to your host, that wait_for_field() call is going to wait for a full 120 seconds. No matter what I do, I don't seem to be able to affect the length of that timeout.
But your user doesn't have to wait that long, just have him kill your script with a KeyboardInterrupt. Hopefully, your user will grow accustomed to success equaling the display of that "--- connection made ..." message so he'll know he's in trouble when/if the host doesn't respond.
And that's a point I need to make: you don't connect to a terminal (as you described), rather you connect to a host. That host can be either a VTAM connection or some kind of LPAR, usually TSO or z/VM, sometimes CICS or IMS, that VTAM will take you to. Each kind of host has differing prompts & screen content you might need to test for, and sometimes those contents are different depending on whose system you're trying to connect to. Your script becomes the "terminal", depending on what you want to show your user.
What you need to do next depends on what kind of system you're trying to talk to. Through VTAM? (Need to select a VTAM application first?) To z/VM? TSO? Are you logging on or DIALing? What's the next keystroke/field you have to use when you're working with a graphic x3270/c3270 terminal? You need to know that in order to choose your next command.
Good luck!
Please read my comment above first - it would be helpful to have more detail as to what you need to do.
After considering that…have you looked at the py3270 package at https://pypi.python.org/pypi/py3270/0.1.5 ? The summary says it talks to x3270.

Resources