Python Threading Issue, Is this Right? - multithreading

I am attempting to make a few thousand DNS queries. I have written my script to use python-adns, and I have added threading and queues to make the script run as efficiently as possible.
However, I can only achieve mediocre results. The responses are choppy and intermittent: they start and stop, and most times pause for 10 to 20 seconds.
tlock = threading.Lock()  # printing to screen

def async_dns(i):
    s = adns.init()
    for i in names:
        tlock.acquire()
        q.put(s.synchronous(i, adns.rr.NS)[0])
        response = q.get()
        q.task_done()
        if response == 0:
            dot_net.append("Y")
            print(i + ", is Y")
        elif response == 300:
            dot_net.append("N")
            print(i + ", is N")
        tlock.release()

q = queue.Queue()
threads = []
for i in range(100):
    t = threading.Thread(target=async_dns, args=(i,))
    threads.append(t)
    t.start()
print(threads)
I have spent countless hours on this. I would appreciate some input from experienced Pythonistas. Is this a networking issue? Can these bottlenecks and intermittent responses be solved by switching servers?
Thanks.

Without answers to the questions I asked in the comments above, I'm not sure how well I can diagnose the issue you're seeing, but here are some thoughts:
It looks like each thread is processing all names instead of just a portion of them.
Your Queue seems to be doing nothing at all.
Your lock seems to guarantee that you actually only do one query at a time (defeating the purpose of having multiple threads).
Rather than trying to fix up this code, might I suggest using multiprocessing.pool.ThreadPool instead? Below is a full working example. (You could use adns instead of socket if you want... I just couldn't easily get it installed and so stuck with the built-in socket.)
In my testing, I also sometimes see pauses; my assumption is that I'm getting throttled somewhere.
import itertools
from multiprocessing.pool import ThreadPool
import socket
import string

def is_available(host):
    print('Testing {}'.format(host))
    try:
        socket.gethostbyname(host)
        return False
    except socket.gaierror:
        return True

# Test the first 1000 three-letter .com hosts
hosts = [''.join(tla) + '.com' for tla in itertools.permutations(string.ascii_lowercase, 3)][:1000]
with ThreadPool(100) as p:
    results = p.map(is_available, hosts)
for host, available in zip(hosts, results):
    print('{} is {}'.format(host, 'available' if available else 'not available'))
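If you would rather keep the thread-and-queue structure from the question, the following is a minimal sketch of one way it could be arranged so that it addresses the three points listed earlier in this answer: each name is taken from the queue exactly once, the queue is what distributes the work, and the lock is held only while recording and printing a result. The `lookup` function is a hypothetical stand-in for the real DNS query (it is not part of adns), and mapping every non-zero status to "N" is a simplification.

```python
import queue
import threading


def lookup(name):
    # Hypothetical stand-in for the real DNS query; returns a fake status code.
    return 0 if len(name) % 2 == 0 else 300


def worker(q, results, print_lock):
    while True:
        name = q.get()
        if name is None:        # sentinel: no more work for this thread
            q.task_done()
            break
        status = lookup(name)   # runs outside the lock, so lookups can overlap
        with print_lock:        # lock held only while recording and printing
            results[name] = "Y" if status == 0 else "N"
            print(name + ", is " + results[name])
        q.task_done()


names = ["ab.net", "abc.net", "abcd.net"]
q = queue.Queue()
results = {}
print_lock = threading.Lock()

threads = [threading.Thread(target=worker, args=(q, results, print_lock))
           for _ in range(4)]
for t in threads:
    t.start()
for name in names:
    q.put(name)                 # each name is queued, and processed, once
for _ in threads:
    q.put(None)                 # one sentinel per worker thread
for t in threads:
    t.join()
```

The number of worker threads (4 here) is arbitrary; the point is only that the slow call happens outside the lock.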

Related

Multiple HTTP request to the same page without consuming much CPU

Currently, I'm trying to improve some code that sends multiple HTTP requests to a webpage until it can capture some text (which the code locates through a known pattern) or until 180 seconds run out (the time we wait for the page to give us an answer).
This is the relevant part of the code (slightly edited for privacy purposes):
if matches == None:
    txt = "No answer til now"
    print(txt)
    Solution = False
    start = time.time()
    interval = 0
    while interval < 180:
        response = requests.get("page address")
        subject = response.text
        matches = re.search(pattern, subject, re.IGNORECASE)
        if matches != None:
            Solution =matches.group(1)
            time = "{:.2f}".format(time.time()-start)
            txt = "Found an anwswer "+ Solution + "time needed : "+ time
            print(txt)
            break
        interval = time.time()-start
else:
    Solution = matches.group(1)
It runs OK, but I was told that doing "infinite requests in a loop" could cause high CPU usage on the server. Do you know of something I can use to avoid that?
PS: I heard that in PHP people use curl_multi_select() for things like this. I don't know if that's correct, though.
Usually an HTTP REST API will specify in the documentation how many requests you can make in a given time period against which endpoint resources.
For a website, if you are not hitting a request limit and getting flagged/banned for too many requests, then you should be okay to continuously loop like this, but you may want to introduce a time.sleep call into your while loop.
An alternative to the 180 second timeout:
Since HTTP requests are I/O operations and can take a variable amount of time, you may want to change the exit condition of the loop to a fixed number of requests (like 25 or so) and then incorporate the aforementioned sleep call.
That could look like:
# ...
if matches is None:
    solution = None
    num_requests = 25
    start = time.time()
    while num_requests:
        response = requests.get("page address")
        if response.ok:  # It's good to attempt to handle potential HTTP/Connectivity errors
            subject = response.text
            matches = re.search(pattern, subject, re.IGNORECASE)
            if matches:
                solution = matches.group(1)
                elapsed = "{:.2f}".format(time.time()-start)
                txt = "Found an anwswer " + solution + "time needed : " + elapsed
                print(txt)
                break
        else:
            # Maybe raise an error here?
            pass
        time.sleep(2)
        num_requests -= 1
else:
    solution = matches.group(1)
Notes:
Regarding PHP's curl_multi_select - (NOT a PHP expert here...) it seems that this function is designed to allow you to watch multiple connections to different URLs in an asynchronous manner. Async doesn't really apply to your use case here because you are only scraping one webpage (URL), and are just waiting for some data to appear there.
If the response.text you are searching through is HTML and you aren't already using it somewhere else in your code, I would recommend trying Beautiful Soup or Scrapy before regex for searching for string patterns in webpage markup.
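The two exit conditions discussed above (a time limit and a pause between requests) can also be combined. The following is only a sketch under assumptions: `poll_for_match` and its `fetch_text` parameter are hypothetical names introduced here, and `fetch_text` stands in for something like `lambda: requests.get("page address").text` so that the example runs without a network.

```python
import re
import time


def poll_for_match(fetch_text, pattern, timeout=180.0, interval=2.0):
    """Call fetch_text() repeatedly until pattern matches or timeout expires.

    Returns the first capture group, or None if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        match = re.search(pattern, fetch_text(), re.IGNORECASE)
        if match:
            return match.group(1)
        if time.monotonic() + interval > deadline:
            return None          # another sleep would overshoot the deadline
        time.sleep(interval)     # pause so the server is not hit continuously


# Fake page that only contains the answer on the third request.
pages = iter(["pending", "pending", "Result: 42"])
answer = poll_for_match(lambda: next(pages), r"result:\s*(\d+)",
                        timeout=5.0, interval=0.01)
print(answer)
```

time.monotonic() is used instead of time.time() so that a system clock adjustment cannot shorten or lengthen the wait.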

Memory efficient massive http requests

I need to make an unlimited number of HTTP requests to a web API, one after another, and have it work efficiently and quite fast. (I need it for a utility, so it should work no matter how many times I use it; it should also be usable on a web server, where several people use it at the same time.)
Right now I'm using threading with a queue, but after a while I get errors like:
'cant start a new thread'
'MemoryError'
or it may work for a bit, but pretty slowly.
This is part of my code:
concurrent = 25
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()
for url in get_urls():
    q.put(url.strip())
q.join()
*get_urls() is a simple function that returns a list of URLs (of unknown length).
This is my receiveJson (the thread target):
def receiveJson():
    while True:
        url = q.get()
        res = request.get(url).json()
        q.task_done()
The problem comes from your threads never ending: notice that there is no exit condition in your receiveJson function. The simplest way to signal that a thread should end is usually to enqueue None:
def receiveJson():
    while True:
        url = q.get()
        if url is None:  # Exit condition allows thread to complete
            q.task_done()
            break
        res = request.get(url).json()
        q.task_done()
and then you can change the other code as follows:
concurrent = 25
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()
for url in get_urls():
    q.put(url.strip())
for i in range(concurrent):
    q.put(None)  # Add a None for each thread to be able to get and complete
q.join()
There are other ways of doing this, but this is how to do it with the least amount of change to your code. If this code runs often, it might be worth looking into the concurrent.futures.ThreadPoolExecutor class to avoid the cost of starting new threads every time.
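As a rough illustration of the ThreadPoolExecutor suggestion, here is a minimal sketch. The names `fetch_all` and `fake_fetch` are hypothetical; `fake_fetch` stands in for a real request such as `requests.get(url).json()` so that the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=25):
    """Apply fetch to every URL using a bounded pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in input order; leaving the
        # with-block waits for the workers and shuts them down, so no
        # threads are left running between calls.
        return list(executor.map(fetch, (u.strip() for u in urls)))


def fake_fetch(url):
    # Hypothetical stand-in for: requests.get(url).json()
    return {"url": url, "length": len(url)}


results = fetch_all([" http://a.example ", "http://bb.example"], fake_fetch)
print(results)
```

Because the pool is created and torn down inside the function, repeated calls do not accumulate threads the way never-ending workers do.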

Proper way to start a Trio server that manages multiple TCP Connections

I recently finished a project using a mix of Django and Twisted and realized it's overkill for what I need which is basically just a way for my servers to communicate via TCP sockets. I turned to Trio and so far I'm liking what I see as it's way more direct (for what I need). That said though, I just wanted to be sure I was doing this the right way.
I followed the tutorial which taught the basics but I need a server that could handle multiple clients at once. To this end, I came up with the following code
import trio
from itertools import count

PORT = 12345
BUFSIZE = 16384
CONNECTION_COUNTER = count()

class ServerProtocol:
    def __init__(self, server_stream):
        self.ident = next(CONNECTION_COUNTER)
        self.stream = server_stream

    async def listen(self):
        while True:
            data = await self.stream.receive_some(BUFSIZE)
            if data:
                print('{} Received\t {}'.format(self.ident, data))
                # Process data here

class Server:
    def __init__(self):
        self.protocols = []

    async def receive_connection(self, server_stream):
        sp: ServerProtocol = ServerProtocol(server_stream)
        self.protocols.append(sp)
        await sp.listen()

async def main():
    await trio.serve_tcp(Server().receive_connection, PORT)

trio.run(main)
My issue here seems to be that each ServerProtocol runs listen on every cycle instead of waiting for data to be available to receive.
I get the feeling I'm using Trio wrong. If so, are there Trio best practices that I'm missing?
Your overall structure looks fine to me. The issue that jumps out at me is:
while True:
    data = await self.stream.receive_some(BUFSIZE)
    if data:
        print('{} Received\t {}'.format(self.ident, data))
        # Process data here
The guarantee that receive_some makes is: if the other side has closed the connection already, then it immediately returns an empty byte-string. Otherwise, it waits until there is some data to return, and then returns it as a non-empty byte-string.
So your code should work fine... until the other end closes the connection. Then it starts doing an infinite loop, where it keeps checking for data, getting an empty byte-string back (data = b""), so the if data: ... block doesn't run, and it immediately loops around to do it again.
One way to fix this would be (last 3 lines are new):
while True:
    data = await self.stream.receive_some(BUFSIZE)
    if data:
        print('{} Received\t {}'.format(self.ident, data))
        # Process data here
    else:
        # Other side has gone away
        break
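The same "empty bytes means the peer closed the connection" convention also holds for plain blocking sockets in the standard library, which makes the fix easy to try without Trio installed. The sketch below deliberately uses socket.socketpair() instead of Trio streams; `read_until_closed` is a hypothetical helper name.

```python
import socket


def read_until_closed(sock, bufsize=16384):
    """Collect data from sock until the other side closes the connection."""
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if not data:
            # recv() returned b"": the other side has gone away, so stop
            # instead of looping forever on empty reads.
            break
        chunks.append(data)
    return b"".join(chunks)


a, b = socket.socketpair()
a.sendall(b"hello")
a.close()                      # closing one end makes the peer see EOF
received = read_until_closed(b)
b.close()
print(received)
```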

How to set timeout for a block of code which is not a function python3

After spending many hours looking for a solution on Stack Overflow, I did not find a good way to set a timeout for a block of code. There are approaches for setting a timeout on a function. Nevertheless, I would like to know how to set a timeout without having a function. Let's take the following code as an example:
print("Doing different things")
for i in range(0,10):
    # Doing some heavy stuff
    pass
print("Done. Continue with the following code")
So, how would you break out of the for loop if it has not finished after x seconds? The code should just continue (maybe saving some bool variables to record that the timeout was reached), despite the fact that the for loop did not finish properly.
I don't think it is possible to implement this efficiently without using functions. Look at this code:
import datetime as dt

print("Doing different things")
# store the timeout and the start time
time_out_after = dt.timedelta(seconds=60)
start_time = dt.datetime.now()
for i in range(10):
    if dt.datetime.now() > start_time + time_out_after:
        break
    else:
        # Doing some heavy stuff
        pass
print("Done. Continue with the following code")
The problem: the timeout is only checked at the beginning of every loop cycle, so it may take longer than the specified timeout period to break out of the loop. In the worst case it may never interrupt the loop at all, because it cannot interrupt code that never finishes an iteration.
Update:
As the OP replied that he wants a more efficient way, this is a proper way to do it, but using functions.
import asyncio

async def test_func():
    print('doing thing here , it will take long time')
    await asyncio.sleep(3600)  # this emulates a heavy task by actually sleeping for one hour
    return 'yay!'  # this will not be executed, as the timeout occurs first

async def main():
    # Wait for at most 1 second
    try:
        result = await asyncio.wait_for(test_func(), timeout=1.0)  # call your function with a specific timeout
        # do something with the result
    except asyncio.TimeoutError:
        # when the timeout happens, the program breaks out of test_func and executes the code here
        print('timeout!')
    print('lets continue to do other things')

asyncio.run(main())
Expected output:
doing thing here , it will take long time
timeout!
lets continue to do other things
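Note:
The timeout now fires after approximately the time you specify (one second in this example code), as long as the task reaches an await point; asyncio.wait_for cannot interrupt code that never yields control to the event loop.
You would replace this line:
await asyncio.sleep(3600)
with your actual task code.
Try it and let me know what you think. Thank you.
See the asyncio documentation for details.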
Update 24/2/2019:
The OP noted that asyncio.run was introduced in Python 3.7 and asked for an alternative for Python 3.6.
An asyncio.run alternative for Python older than 3.7:
Replace
asyncio.run(main())
with this code for older versions (3.5 to 3.6, I think, since async def requires at least Python 3.5):
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
You may try the following way:
import time

start = time.time()
for val in range(10):
    # some heavy stuff
    time.sleep(.5)
    if time.time() - start > 3:  # 3 is timeout in seconds
        print('loop stopped at', val)
        break  # stop the loop, or sys.exit() to stop the script
else:
    print('successfully completed')
I guess it is a fairly viable approach. The actual timeout is greater than 3 seconds and depends on the execution time of a single step.
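For a block that is not wrapped in a function, another option on Unix-like systems is a signal-based context manager. This is a technique swapped in here, not one proposed in the answers above, and it is only a sketch under assumptions: it relies on SIGALRM (so it does not work on Windows), it must run in the main thread, and `time_limit` and `BlockTimeout` are hypothetical names.

```python
import signal
import time
from contextlib import contextmanager


class BlockTimeout(Exception):
    """Raised inside the with-block when the time limit expires."""


@contextmanager
def time_limit(seconds):
    def _on_alarm(signum, frame):
        raise BlockTimeout()

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)    # arm the timer
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # cancel a pending alarm
        signal.signal(signal.SIGALRM, previous)      # restore the old handler


print("Doing different things")
timed_out = False
completed = 0
try:
    with time_limit(0.2):
        for i in range(10):
            time.sleep(0.1)   # stand-in for the heavy work
            completed += 1
except BlockTimeout:
    timed_out = True          # the loop was cut short
print("Done. Continue with the following code")
```

Unlike the loop-level check, the alarm can interrupt the block in the middle of an iteration, although a long-running call into C code may delay the exception until that call returns.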

How can i use multithreading (or multiproccessing?) for faster data upload?

I have a list of issues (jira issues):
listOfKeys = [id1,id2,id3,id4,id5...id30000]
I want to get the worklogs of these issues. For this I used the jira-python library and this code:
listOfWorklogs=pd.DataFrame() # I used the pandas (pd) lib
lst={} #dictionary for help, where the worklogs will be stored
for i in range(len(listOfKeys)):
    worklogs=jira.worklogs(listOfKeys[i]) #getting list of worklogs
    if(len(worklogs)) == 0:
        i+=1
    else:
        for j in range(len(worklogs)):
            lst = {
                'self': worklogs[j].self,
                'author': worklogs[j].author,
                'started': worklogs[j].started,
                'created': worklogs[j].created,
                'updated': worklogs[j].updated,
                'timespent': worklogs[j].timeSpentSeconds
            }
            listOfWorklogs = listOfWorklogs.append(lst, ignore_index=True)
########### Below there is the recording to the .xlsx file ################
So I simply go through the worklog of each issue in a plain loop, which is equivalent to requesting the link
https://jira.mycompany.com/rest/api/2/issue/issueid/worklogs and retrieving the information from it.
The problem is that there are more than 30,000 such issues,
and the loop is very slow (approximately 3 seconds per issue).
Can I somehow start multiple loops / processes / threads in parallel to speed up the process of getting worklogs (maybe without the jira-python library)?
I recycled a piece of code I made into your code, I hope it helps:
from multiprocessing import Manager, Process, cpu_count

def insert_into_list(worklog, queue):
    lst = {
        'self': worklog.self,
        'author': worklog.author,
        'started': worklog.started,
        'created': worklog.created,
        'updated': worklog.updated,
        'timespent': worklog.timeSpentSeconds
    }
    queue.put(lst)
    return

# Number of cpus in the pc
num_cpus = cpu_count()
index = 0
# Manager and queue to hold the results
manager = Manager()
# The queue has controlled insertion, so processes don't step on each other
queue = manager.Queue()
listOfWorklogs=pd.DataFrame()
lst={}
for i in range(len(listOfKeys)):
    worklogs=jira.worklogs(listOfKeys[i]) #getting list of worklogs
    if(len(worklogs)) == 0:
        i+=1
    else:
        # This loop replaces your "for j in range(len(worklogs))" loop
        index = 0  # start from the first worklog of each issue
        while index < len(worklogs):
            processes = []
            elements = min(num_cpus, len(worklogs) - index)
            # Create a process for each cpu
            for i in range(elements):
                process = Process(target=insert_into_list, args=(worklogs[i+index], queue))
                processes.append(process)
            # Run the processes
            for i in range(elements):
                processes[i].start()
            # Wait for them to finish
            for i in range(elements):
                processes[i].join(timeout=10)
            index += num_cpus
# Dump the queue into the dataframe
while queue.qsize() != 0:
    # append returns a new DataFrame, so the result has to be assigned back
    listOfWorklogs = listOfWorklogs.append(queue.get(), ignore_index=True)
This should work and reduce the time by a factor of a little less than the number of CPUs in your machine. You can try changing that number manually for better performance. In any case, I find it very strange that it takes about 3 seconds per operation.
PS: I couldn't try the code because I have no example data, so it probably has some bugs.
I have some troubles((
1) The indentation in the code where the first "for" loop appears and the first "if" instruction begins (this instruction and everything below it should be inside the loop, right?)
for i in range(len(listOfKeys)-99):
    worklogs=jira.worklogs(listOfKeys[i]) #getting list of worklogs
    if(len(worklogs)) == 0:
    ....
2) cmd, the conda prompt and Spyder did not allow your code to run, for this reason:
Python Multiprocessing error: AttributeError: module '__main__' has no attribute '__spec__'
After researching on Google, I had to set __spec__ = None a bit higher in the code (but I'm not sure if this is correct), and this error disappeared.
By the way, the code in Jupyter Notebook ran without this error, but listOfWorklogs is empty, and this is not right.
3) When I corrected the indentation and set __spec__ = None, a new error occurred at this place:
processes[i].start ()
The error looks like this:
"PicklingError: Can't pickle : attribute lookup PropertyHolder on jira.resources failed"
If I remove the parentheses from the start and join methods, the code will run, but I will not have any entries in listOfWorklogs(((
I ask again for your help!)
How about thinking about it not from a technical standpoint but a logical one? You know your code works, but at a rate of 3 seconds per issue, which means it would take 25 hours to complete. If you are able to split up the set of Jira issues passed into the script (maybe by date or issue key, etc.), you could create several .py files with basically the same code and pass each one a different list of Jira tickets. You could then run, say, 4 of them at the same time and reduce the total time to about 6.25 hours.
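The same splitting idea can probably be done inside one script with threads, on the assumption that most of the roughly 3 seconds per issue is spent waiting for the HTTP response rather than computing. The following is only a sketch: `fetch_worklogs` is a hypothetical stand-in for `jira.worklogs(issue_key)` that returns plain dictionaries, `collect_worklogs` is a hypothetical helper, and whether a shared jira client can safely be used from several threads is something you would need to verify.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_worklogs(issue_key):
    # Hypothetical stand-in for jira.worklogs(issue_key); the sleep
    # imitates the time spent waiting on the network.
    time.sleep(0.05)
    return [{"issue": issue_key, "timespent": 60}]


def collect_worklogs(issue_keys, fetch, max_workers=8):
    """Fetch the worklogs of many issues concurrently and flatten them."""
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() keeps the input order; only the waiting overlaps.
        for worklogs in executor.map(fetch, issue_keys):
            rows.extend(worklogs)
    return rows


list_of_keys = ["id{}".format(n) for n in range(1, 17)]
rows = collect_worklogs(list_of_keys, fetch_worklogs)
print(len(rows))
```

Threads share memory, so nothing has to be pickled; building the DataFrame once from the collected rows (for example with pd.DataFrame(rows)) also avoids appending to it row by row.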
