Scraping with a multithreaded queue + urllib3 suffers a drastic slowdown

I am trying to scrape a huge number of URLs (approximately 3 million) that contain JSON-formatted data in the shortest time possible. To achieve this, I have Python 3 code that uses Queue, multithreading and urllib3. Everything works fine during the first 3 minutes, then the code begins to slow down, and eventually it appears to be totally stuck. I have read everything I could find on this issue, but unfortunately the solution seems to require knowledge that lies far beyond me.
I tried to limit the number of threads: it did not fix anything. I also tried to limit the maxsize of my queue and to change the socket timeout, but that did not help either. The remote server is neither blocking nor blacklisting me, as I am able to re-launch my script any time I want with good results in the beginning (the code starts to slow down at a fairly random time). Besides, sometimes my internet connection seems to be cut, as I cannot browse any website, but this specific issue does not appear every time.
Here is my code (go easy on me please, I'm a beginner):
#!/usr/bin/env python
import urllib3,json,csv
from queue import Queue
from threading import Thread
csvFile = open("X.csv", 'wt',newline="")
writer = csv.writer(csvFile,delimiter=";")
writer.writerow(('A','B','C','D'))
def do_stuff(q):
    http = urllib3.connectionpool.connection_from_url('http://www.XXYX.com/',maxsize=30,timeout=20,block=True)
    while True:
        try:
            url = q.get()
            url1 = http.request('GET',url)
            doc = json.loads(url1.data.decode('utf8'))
            writer.writerow((doc['A'],doc['B'], doc['C'],doc['D']))
        except:
            print(url)
        finally:
            q.task_done()
q = Queue(maxsize=200)
num_threads = 15
for i in range(num_threads):
    worker = Thread(target=do_stuff, args=(q,))
    worker.setDaemon(True)
    worker.start()
for x in range(1,3000000):
    if x < 10:
        url = "http://www.XXYX.com/?i=" + str(x) + "&plot=short&r=json"
    elif x < 100:
        url = "http://www.XXYX.com/?i=tt00000" + str(x) + "&plot=short&r=json"
    elif x < 1000:
        url = "http://www.XXYX.com/?i=0" + str(x) + "&plot=short&r=json"
    elif x < 10000:
        url = "http://www.XXYX.com/?i=00" + str(x) + "&plot=short&r=json"
    elif x < 100000:
        url = "http://www.XXYX.com/?i=000" + str(x) + "&plot=short&r=json"
    elif x < 1000000:
        url = "http://www.XXYX.com/?i=0000" + str(x) + "&plot=short&r=json"
    else:
        url = "http://www.XXYX.com/?i=00000" + str(x) + "&plot=short&r=json"
    q.put(url)
q.join()
csvFile.close()
print("done")

As shazow said, the problem is not the threads but the rate at which each thread requests data from the server. Try adding a delay to your code (this needs from time import sleep):
        finally:
            sleep(50)
            q.task_done()
It could also be improved with adaptive delays: for example, you could measure how much data you successfully retrieved, and if that amount decreases, increase the sleep time, and vice versa.
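The adaptive idea could be sketched roughly as follows. This is only an illustration, not the answerer's code: the AdaptiveDelay class, its method names, and the doubling/halving factor are all assumptions, and the right bounds depend on the server.

```python
import time


class AdaptiveDelay:
    """Hypothetical helper: grows the pause after failures, shrinks it after successes."""

    def __init__(self, initial=0.5, minimum=0.1, maximum=60.0, factor=2.0):
        self.delay = initial
        self.minimum = minimum
        self.maximum = maximum
        self.factor = factor

    def record(self, success):
        """Update the delay from the outcome of the last request and return it."""
        if success:
            # Requests are getting through: speed up, but never below the floor.
            self.delay = max(self.minimum, self.delay / self.factor)
        else:
            # A request failed or timed out: back off, but never above the cap.
            self.delay = min(self.maximum, self.delay * self.factor)
        return self.delay

    def wait(self):
        """Sleep for the current delay."""
        time.sleep(self.delay)
```

In do_stuff, each worker thread could create its own AdaptiveDelay before the while loop, call record(True) after a successful writer.writerow, record(False) in the except branch, and wait() in the finally branch before q.task_done(). Keeping one instance per thread avoids having to lock the shared state.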

Related

Python 3 not opening in console anymore, but in Powershell (Windows 11)

I previously used this StackOverflow question to add the following code to my program:
import os
os.system('mode con: cols=20 lines=5')
When I double-click this Python script in File Explorer, it used to open the console window (the command window, I believe) at the specified size. Recently, however, it has stopped doing that. Instead, it now opens Windows PowerShell, and the resizing doesn't work.
My desired result is to resize the window that my print statements are printed to. Also, I should be able to resize the window after making some new print statements. For full code, see below. If I should reduce the code for readability, please let me know.
import os
import math
import sys
# defining round_half_up
def round_half_up(n, decimals=0):
    multiplier = 10 ** decimals
    return math.floor(n * multiplier + 0.5) / multiplier
def main():
    IDLE = "idlelib" in sys.modules
    if not IDLE:
        os.system('mode con: cols=20 lines=5')
    while True: # get some input until it is valid
        try:
            points = float(input("Total points: "))
            if points==int(points):
                points = int(points)
            break
        except ValueError:
            print("That's not a number.\n")
    while True: # get some more input while it is valid
        try:
            grade = int(input("Maximum grade: "))
            break
        except ValueError:
            print("That's not a whole number.\n")
    if not IDLE:
        # calculate the size of the window
        cols = 5 + 4*(4+4+10) -10
        lines = math.ceil(points) + 5
        # apply the new size of the window
        os.system('mode con: cols={:d} lines={:d}'.format(cols, lines))
    # print all the results
    for i in range(math.ceil(points)):
        lst = [0.]*8
        for j in range(4):
            lst[j*2] = i+j*0.25
            lst[j*2+1] = round_half_up((i+j*0.25)*grade/points, 1)
        print(("{:5.2f} -->{:4.1f}{:10.2f} -->{:4.1f}{:10.2f}" + \
               "-->{:4.1f}{:10.2f} -->{:4.1f}").format(*lst))
    print("{:5.2f} -->{:4.1f}".format(math.ceil(points), grade))
    print()
    if points==int(points):
        tickets = "90% is either {:.2f}/{:d} or {:.1f}/{:d}"
    else:
        tickets = "90% is either {:.2f}/{:.1f} or {:.1f}/{:d}"
    print(tickets.format(0.9*points, points, 0.9*grade, grade))
    print()
if __name__=='__main__':
    while True: # keep window open and/or get new inputs until enter is pressed
        main()
        again = input("Type anything to start again, or just press enter to close:")
        if again == "":
            break

Python asynchronous requests with timer

I'm trying to send a series of HTTP GET requests to a URL, following a trace of interarrival times between the requests (occasionally with several concurrent requests at the same time).
I'm currently using asyncio and aiohttp. However, the run takes much longer than the expected time (the one defined by the list of interarrival times), because the requests still seem to block execution.
My biggest issue is that I also need to measure the response time of every request.
Here's the snippet of code:
async def main():
    iat = #list of seconds
    times = {}
    start_time = time.time()
    url = 'http://10.250.0.12:31112/function/weather-station'
    async with aiohttp.ClientSession() as session:
        for i in range(len(iat)): ##iat == list of requests interarrival times
            s = iat[i] / 1000
            n = countlist[i]
            await asyncio.sleep(s)
            j = 1
            t0 = time.time()
            results[i] = []
            elapsed[i] = []
            while j <= n: ## n>1 if there are multiple requests to be sent at the same time
                async with session.get(url) as response:
                    r = await response.json()
                    tr = time.time()
                    results[i].append(r['direction'])
                    times[i].append(tr - t0)
                j += 1
    end_time = time.time()
    print("exit, time:")
    print(end_time - start_time)
asyncio.run(main())
Is it possible to achieve this? Am I using the wrong methods?
I'm using Python 3.7 on top of Windows 10.
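One possible direction, sketched under assumptions: in the snippet above every request is awaited inside the loop, so the next asyncio.sleep cannot start until the previous response has arrived. Scheduling each request with asyncio.create_task (available since Python 3.7) lets the loop move on immediately. The names replay, timed_request and fake_request below are hypothetical, and fake_request is only a stand-in for the real aiohttp session.get call so that the sketch runs without a server.

```python
import asyncio
import time


async def timed_request(fetch, results, index):
    """Await one request and record (index, response time in seconds)."""
    t0 = time.perf_counter()
    await fetch()
    results.append((index, time.perf_counter() - t0))


async def replay(iat_ms, counts, fetch):
    """Launch counts[i] concurrent requests after waiting iat_ms[i] milliseconds."""
    results = []
    tasks = []
    for i, (gap, n) in enumerate(zip(iat_ms, counts)):
        await asyncio.sleep(gap / 1000)
        for _ in range(n):
            # create_task schedules the request and returns immediately,
            # so a slow response does not delay the next arrival.
            tasks.append(asyncio.create_task(timed_request(fetch, results, i)))
    await asyncio.gather(*tasks)  # wait for every outstanding request
    return results


async def fake_request():
    """Stand-in for the real HTTP call; pretends the server takes 0.2 s."""
    await asyncio.sleep(0.2)


if __name__ == "__main__":
    start = time.perf_counter()
    timings = asyncio.run(replay([0, 50, 50], [1, 2, 1], fake_request))
    print(len(timings), "requests in", round(time.perf_counter() - start, 2), "s")
```

With a real session, fetch would wrap the async with session.get(url) block and the response.json() call. Each response time is measured inside its own task, so it reflects that request alone rather than the time since the batch was started.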

Python 3.6 Bitonic Sort with Multiprocessing library and multiple processes

I am trying to implement bitonic sort with the Python multiprocessing library and a shared resource array that will be sorted at the end of the program.
The problem I am running into is that when I run the program, I get a prompt that asks "Your program is still running! Are you sure you want to cancel it?", and when I click cancel N - 1 times (where N is the number of processes I am trying to spawn), it just hangs.
When this is run from the command line, it just outputs the unsorted array. Of course, I expect it to be sorted at the program's finish.
I've been using this resource to try and get a firm grasp on how I can mitigate my errors but I haven't had any luck, and now I am here.
ANY help would be appreciated, as I really don't have anywhere else to turn to.
I wrote this using Python 3.6 and here is the program in its entirety:
from multiprocessing import Process, Array
import sys
from random import randint
# remember to move this to separate file
def createInputFile(n):
    input_file = open("input.txt","w+")
    input_file.write(str(n)+ "\n")
    for i in range(n):
        input_file.write(str(randint(0, 1000000)) + "\n")
def main():
    # createInputFile(1024) # uncomment this to create 'input.txt'
    fp = open("input.txt","r") # remember to read from sys.argv
    length = int(fp.readline()) # guaranteed to be power of 2 by instructor
    arr = Array('i', range(length))
    nums = fp.read().split()
    for i in range(len(nums)):
        arr[i]= int(nums[i]) # overwrite shared resource values
    num_processes = 8 # remember to read from sys.argv
    process_dict = dict()
    change_in_bounds = len(arr)//num_processes
    low_b = 0 # lower bound
    upp_b = change_in_bounds # upper bound
    for i in range(num_processes):
        print("Process num: " + str(i)) # are all processes being generated?
        process_dict[i] = Process(target=bitonic_sort, args=(True, arr[low_b:upp_b]) )
        process_dict[i].start()
        low_b += change_in_bounds
        upp_b += change_in_bounds
    for i in range(num_processes):
        process_arr[i].join()
    print(arr[:]) # Print our sorted array (hopefully)
def bitonic_sort(up, x):
    if len(x) <= 1:
        return x
    else:
        first = bitonic_sort(True, x[:len(x) // 2])
        second = bitonic_sort(False, x[len(x) // 2:])
        return bitonic_merge(up, first + second)
def bitonic_merge(up, x):
    # assume input x is bitonic, and sorted list is returned
    if len(x) == 1:
        return x
    else:
        bitonic_compare(up, x)
        first = bitonic_merge(up, x[:len(x) // 2])
        second = bitonic_merge(up, x[len(x) // 2:])
        return first + second
def bitonic_compare(up, x):
    dist = len(x) // 2
    for i in range(dist):
        if (x[i] > x[i + dist]) == up:
            x[i], x[i + dist] = x[i + dist], x[i] #swap
main()
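I won't go into all the syntax errors in your code, since I am sure your IDE tells you about those. The problem is that you are missing an if __name__ == '__main__' guard. Without it, each child process re-imports the script on platforms that spawn a fresh interpreter (such as Windows) and runs the top-level main() call again. I renamed your def main() to def sort() and wrote this:
if __name__ == '__main__':
    sort()
And it worked (after fixing all the syntax errors).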
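A minimal sketch of the guard together with a shared Array may help here. It is not the asker's bitonic sort: sort_chunk and sort_chunks_in_parallel are hypothetical names, and the built-in sorted() stands in for the per-chunk sorting step. It also illustrates a second pitfall in the question's code: slicing a shared Array (arr[low_b:upp_b]) produces an ordinary list copy, so a worker that only receives the slice can never change the parent's array.

```python
from multiprocessing import Array, Process


def sort_chunk(shared, low, high):
    """Sort shared[low:high] in place (sorted() is used only for brevity)."""
    # Slicing a shared Array yields a plain list copy, so the sorted values
    # must be written back through slice assignment to reach the parent.
    shared[low:high] = sorted(shared[low:high])


def sort_chunks_in_parallel(values, num_processes):
    """Sort equal-sized chunks of `values` in separate processes; return the result."""
    shared = Array('i', values)
    step = len(values) // num_processes
    workers = [
        Process(target=sort_chunk, args=(shared, i * step, (i + 1) * step))
        for i in range(num_processes)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return shared[:]


if __name__ == '__main__':
    # Without this guard, platforms that spawn a fresh interpreter (such as
    # Windows) re-import the module in every child and re-run this code.
    print(sort_chunks_in_parallel([4, 3, 2, 1, 8, 7, 6, 5], 2))
```

Each chunk ends up sorted on its own; merging the chunks into one fully sorted array (the bitonic merge step) is deliberately left out of this sketch.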

Stopwatch program

I need to make a stopwatch program with Start, Stop, Lap, Reset and Quit functions. The program needs to print elapsed times whenever the Stop or Lap key is pressed. When the user chooses to quit, the program should write a log file containing all the timing data (event and time) acquired during the session in a human-readable format.
import os
import time
log = ' '
def cls():
    os.system('cls')
def logFile(text):
    logtime = time.asctime( time.localtime(time.time()) )
    f = open('log.txt','w')
    f.write('Local current time :', logtime, '\n')
    f.write(text, '\n\n')
    f.close()
def stopWatch():
    import time
    p = 50
    a = 0
    hours = 0
    while a < 1:
        cls()
        for minutes in range(0, 60):
            cls()
            for seconds in range(0, 60):
                time.sleep(1)
                cls()
                p +=1
                print ('Your time is: ', hours, ":" , minutes, ":" , seconds)
                print (' H M S')
                if p == 50:
                    break
        hours += 1
stopWatch()
I have it ticking the time; however, the way I have it won't allow me to stop, lap, or take any input. I worked for a few hours to find a way to do it, but no luck. Any ideas on how I can get the functions working?
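One way to sidestep the problem, offered only as a sketch: rather than counting seconds with time.sleep, store the moment the watch was started and compute the elapsed time on demand from time.monotonic(). The program can then block on input() for as long as it likes without losing time. The Stopwatch class, its method names, and the log format are assumptions, not part of the original code.

```python
import time


class Stopwatch:
    """Minimal stopwatch sketch; `clock` is injectable so it can be tested."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started_at = None   # None means the watch is stopped
        self._accumulated = 0.0   # time collected over earlier start/stop runs
        self.events = []          # (event name, elapsed seconds) for the log

    def elapsed(self):
        """Return the total running time in seconds."""
        running = 0.0 if self._started_at is None else self._clock() - self._started_at
        return self._accumulated + running

    def start(self):
        if self._started_at is None:
            self._started_at = self._clock()
        self.events.append(("start", self.elapsed()))

    def stop(self):
        if self._started_at is not None:
            self._accumulated += self._clock() - self._started_at
            self._started_at = None
        self.events.append(("stop", self.elapsed()))
        return self.elapsed()

    def lap(self):
        self.events.append(("lap", self.elapsed()))
        return self.elapsed()

    def reset(self):
        self._started_at = None
        self._accumulated = 0.0
        self.events.append(("reset", 0.0))

    def write_log(self, path):
        """Write every recorded event and its time in a human-readable form."""
        with open(path, "w") as f:
            for name, seconds in self.events:
                f.write("{:<6}{:10.2f} s\n".format(name, seconds))
```

A simple command loop could then read a key with input(), call start(), stop(), lap() or reset() accordingly, print the returned time for Stop and Lap, and call write_log('log.txt') when the user chooses Quit.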

breaking a loop in the middle python

I am trying to have a loop that keeps running until something is entered that breaks the loop. However, if I put 'stop = input()' in the loop, then it has to get through that first before doing anything else. My code is like this (it uses some Minecraft commands; basically I'm trying to make a block move down a 20x20 square and be able to stop it in the middle on command):
from mcpi import minecraft
mc=minecraft.Minecraft.create()
from time import sleep
pos=mc.player.getTilePos
x=pos.x
y=pos.y
z=pos.z
mc.setBlocks(x-10,y,z+30,x+10,y+20,z+30,49)
BB=0
while BB<20:
    BB=BB+1
    sleep(.7)
    mc.setBlock(x,y+(20-BB),z+30,35)
    mc.setBlock(x,y+(21-BB),z+30,49)
    stop=input()
    if stop=='stp':
        break
How do I keep the loop going until someone inputs 'stp'? Currently it moves one block, then stops and waits until I input something. The loop works if I take out the last three lines.
Whenever you run into input() in your code, Python will stop until it receives an input. If you're running the script in your console, then pressing Ctrl+C will stop execution of the program. (I'll assume you are, because how else would you be able to input 'stp'?)
You can run your logic in a different thread and signal this thread whenever you get an input.
from mcpi import minecraft
import threading
from time import sleep
mc=minecraft.Minecraft.create()
pos=mc.player.getTilePos()  # call the method to get the actual position
stop = False  # set to True by the main thread to end the loop in play()
def play():
    x=pos.x
    y=pos.y
    z=pos.z
    mc.setBlocks(x-10,y,z+30,x+10,y+20,z+30,49)
    BB=0
    while BB<20:
        BB=BB+1
        sleep(.7)
        mc.setBlock(x,y+(20-BB),z+30,35)
        mc.setBlock(x,y+(21-BB),z+30,49)
        if stop:
            break
t = threading.Thread(target=play)
t.start()
while True:
    s = input()
    if s == 'stp':
        stop = True # the thread will see that and act appropriately
        break       # nothing left to wait for once the stop was requested
I know this is kind of old, but I programmed something similar earlier, so here is a quick adaptation to your needs (hope you don't need it anymore, though xD):
from mcpi.minecraft import Minecraft
from threading import Thread
from time import sleep
mc = Minecraft.create()
class Placer(Thread):
    def __init__(self):
        self.stop = False
        Thread.__init__(self)
    def run(self):
        x, y, z = mc.player.getPos()
        mc.setBlocks(x - 10, y, z + 30, x + 10, y + 20, z + 30, 49)
        for i in range(20, -1, -1):
            mc.setBlock(x, y + i, z + 30, 35)
            mc.setBlock(x, y + i + 1 - 1, z + 30, 49)
            sleep(0.7)
            if self.stop:
                break
placer = Placer()
placer.start()
while True:
    text = input(": ")
    if text == "stp":
        placer.stop = True
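Both answers signal the worker thread through a plain boolean. As a variation (not taken from either answer), the standard library's threading.Event can serve the same purpose, and its wait(timeout) method doubles as a sleep that ends early when the stop is requested. The sketch below is independent of Minecraft: worker, run_until_stopped and the log list are hypothetical names, and appending to the list stands in for the setBlock calls.

```python
import threading
import time


def worker(stop_event, steps, log):
    """Do up to `steps` units of work, checking for a stop request between them."""
    for i in range(steps):
        # wait() doubles as an interruptible sleep: it returns True as soon as
        # the event is set, or False once the timeout expires.
        if stop_event.wait(timeout=0.05):
            break
        log.append(i)  # stands in for the mc.setBlock(...) calls


def run_until_stopped(read_command, steps=1000):
    """Run the worker until read_command() returns 'stp' or the work is done."""
    stop_event = threading.Event()
    log = []
    thread = threading.Thread(target=worker, args=(stop_event, steps, log))
    thread.start()
    while thread.is_alive():
        if read_command() == "stp":
            stop_event.set()
            break
    thread.join()
    return log
```

Interactively this would be called as run_until_stopped(input). Passing the reader as a parameter is just a convenience for trying the function without a keyboard.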

Resources