Memory efficient massive http requests

I need to make an unbounded number of HTTP requests to a web API, one after another, and I need this to be memory-efficient and reasonably fast. (It is for a utility, so it should work no matter how many times I use it, and it should also be usable on a web server where several people use it at the same time.)
Right now I'm using threading with a queue, but after running for a while I get errors like:
'cant start a new thread'
'MemoryError'
or it works for a while, but quite slowly.
This is part of my code:
concurrent = 25
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()
for url in get_urls():
    q.put(url.strip())
q.join()
*get_urls() is a simple function that returns a list of URLs (of unknown length).
This is my receiveJson (the thread target):
def receiveJson():
    while True:
        url = q.get()
        res = requests.get(url).json()
        q.task_done()

The problem comes from your threads never ending: notice that there is no exit condition in your receiveJson function. The simplest way to signal that a worker should stop is usually to enqueue None:
def receiveJson():
    while True:
        url = q.get()
        if url is None:  # Exit condition allows thread to complete
            q.task_done()
            break
        res = requests.get(url).json()
        q.task_done()
and then you can change the other code as follows:
concurrent = 25
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()
for url in get_urls():
    q.put(url.strip())
for i in range(concurrent):
    q.put(None)  # Add a None for each thread to be able to get and complete
q.join()
There are other ways of doing this, but this is how to do it with the least amount of change to your code. If this happens often, it might be worth looking into the concurrent.futures.ThreadPoolExecutor class to avoid the cost of creating threads repeatedly.
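As a rough illustration of the ThreadPoolExecutor route, here is a minimal sketch. The helper name fetch_all and its fetch parameter are assumptions made for this example, not part of the original code; fetch stands for whatever performs one request, such as a small function wrapping requests.get(url).json().

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=25):
    """Run fetch(url) for every URL on a fixed-size pool of worker threads.

    `fetch` is a caller-supplied callable (hypothetical here), for example
    lambda url: requests.get(url, timeout=10).json().
    Returns a dict mapping each URL to its result or to the exception raised.
    """
    results = {}
    # Leaving the with-block waits for the workers and shuts the pool down,
    # so no threads are left behind between runs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url.strip()): url.strip() for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one request fails
                results[url] = exc
    return results
```

Note that this sketch creates one future per URL up front, so for a very long URL list it may be worth feeding the pool in batches to keep memory bounded.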

Related

Queue and thread from file customize working threads

I am planning to write a Python script that reads URLs from a file and checks the status code of each URL using requests. To speed up the process, I intend to use multiple threads at the same time.
import threading
import queue
q = queue.Queue()
def CheckUrl():
    while True:
        project = q.get()
        # Do the URL checking here
        q.task_done()
threading.Thread(target=CheckUrl, daemon=True).start()
file = open("TextFile.txt", "r")
while True:
    next_line = file.readline()
    if not next_line:  # an empty string means end of file
        break
    q.put(next_line)
file.close()
print('project requests sent\n', end='')
q.join()
print('projects completed')
My problem: if I understand correctly, the code now reads all the text at once and creates as many threads as there are lines in the text file. I would like to read 20 lines at a time, check the status codes of those 20 URLs, and move on to the next line as soon as one or more checks are done.
Is there something like
threading.Thread(target=CheckUrl, daemon=True, THREADSATSAMETIME=20).start()

Seems I have to stick with this one:
def threads_run():
    for i in range(20):  # create 20 threads
        threading.Thread(target=CheckUrl, daemon=True).start()
threads_run()
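To make the "20 at a time" idea concrete, here is a minimal sketch of a fixed pool of workers fed through a bounded queue. The names check_all and check are assumptions for this example; check stands for whatever tests one URL, such as a function returning requests.head(url).status_code.

```python
import queue
import threading


def check_all(lines, check, workers=20):
    """Feed lines to a fixed number of worker threads and collect results.

    `check` is a caller-supplied callable (hypothetical here).
    Returns a list of (url, outcome) pairs in completion order.
    """
    q = queue.Queue(maxsize=workers * 2)  # bounded: put() blocks when full
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            try:
                if item is None:  # sentinel: no more work for this thread
                    return
                try:
                    outcome = check(item)
                except Exception as exc:  # record the failure, keep working
                    outcome = exc
                with lock:
                    results.append((item, outcome))
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for t in threads:
        t.start()
    for line in lines:  # also works with an open file object, read lazily
        url = line.strip()
        if url:
            q.put(url)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Because the queue is bounded, the reading loop pauses whenever the workers fall behind, so only a small window of lines is held in memory at any time.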

How can I ensure each gRPC stream gets updated once and avoids race conditions?

What I'm trying to do: When I make an update to the state of an object, all gRPC clients should be given the update via a gRPC stream. It's important that each client gets every update, and that they get it exactly once.
What I expect to happen: When I do event.set() and then event.clear() immediately after, all of the clients will run one time, yielding the new status.
What actually happens: the clients are missing updates. For example, in my serve function I send out 10 updates to the version. On the client side some of these are missed: I'll see updates 1 and 2 arrive, then update 3 (or some other update) is skipped, and then they start arriving again.
Server version 1, this doesn't work because clients are missing some updates:
class StatusStreamer(pb2_grpc.StatusServiceServicer):
    def __init__(self, status, event):
        self.continue_running = True
        self.status = status
        self.event = event
    def StatusSubscribe(self, request, context):
        while self.continue_running:
            self.event.wait()
            yield self.status
def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    status = status_builder()
    event = threading.Event()
    status_streamer = StatusStreamer(status, event)
    pb2_grpc.add_StatusServiceServicer_to_server(status_streamer, server)
    server.add_insecure_port('[::]:50051')
    server.start()
    print('server started')
    try:
        while True:
            _ = input('enter a key to update')
            for _ in range(10):
                #make an update and send it out to all clients
                status.version = str(int(status.version) + 1)
                print('update:',status.version)
                event.set()
                event.clear()
    except KeyboardInterrupt:
        print('\nstopping...')
        event.set()
        status_streamer.continue_running = False
        server.stop(0)
Server version 2, this one works but I think there's a race condition:
In this second version instead of using a threading.Event I use a boolean, new_update which is shared among all of the threads. Inside the serve function I set it to true and then all of the threads set it to False.
class StatusStreamer(pb2_grpc.StatusServiceServicer):
    def __init__(self, status):
        self.continue_running = True
        self.new_update = False
        self.status = status
    def StatusSubscribe(self, request, context):
        while self.continue_running:
            if self.new_update:
                yield self.status
                self.new_update = False #race condition I believe, that maybe doesn't occur because of the GIL.
def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    status = status_builder()
    status_streamer = StatusStreamer(status)
    pb2_grpc.add_StatusServiceServicer_to_server(status_streamer, server)
    server.add_insecure_port('[::]:50051')
    server.start()
    print('server started')
    try:
        while True:
            _ = input('enter a key to update')
            for _ in range(10):
                #make an update and send it out to all clients
                status.version = str(int(status.version) + 1)
                print('update:', status.version)
                status_streamer.new_update = True #Also a race condition I believe.
    except KeyboardInterrupt:
        print('\nstopping...')
        status_streamer.continue_running = False
        server.stop(0)
I believe the second version only works because it relies on CPython's global interpreter lock, which ensures that no two threads mutate new_update at the same time. I do not like this solution. What are my options? Also, I'm aware that I could create a queue or list, store all of the changes, and keep track of where each connected client is, but I do not want to allocate the memory to do that.

For server version 1, updates are missed because once the main thread holds the GIL, it may execute event.set() (and event.clear()) several times before yielding the GIL to other threads. The other threads therefore do not wake from event.wait() once per update, which results in missed updates. A potential fix is to keep a counter of connections and block the next version update until the server has sent the current one to all connections.
For server version 2, using a threading.Lock or threading.RLock may solve your race condition. However, this version burns a lot of CPU cycles on flag checking, which may impair the business logic in other threads. It is also possible that the main thread holds the GIL for so long that the server has not yet sent the message to all connections.
Unfortunately, I don't have a perfect solution that satisfies your requirement. The gRPC team has a servicer implementation with similar functionality at https://github.com/grpc/grpc/blob/v1.18.x/src/python/grpcio_health_checking/grpc_health/v1/health.py.
In that implementation, the servicer keeps references to the response iterators it has returned. When the status is updated, the servicer explicitly adds the message to the corresponding response iterators. Hence, no status update is missed.
Hope this answers your question.
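The idea of pushing each update explicitly to every open stream can be sketched without gRPC. The Broadcaster class below is an assumption made for illustration, not the code from the linked health servicer: it gives each subscriber a private queue, which is exactly the memory cost the question hoped to avoid, but it is also what makes "every update, exactly once" straightforward.

```python
import queue
import threading


class Broadcaster:
    """Fan out each update to every subscriber exactly once (sketch)."""

    _STOP = object()  # sentinel that ends a subscriber's stream

    def __init__(self):
        self._lock = threading.Lock()
        self._subscribers = set()

    def subscribe(self):
        """Register a subscriber and return its private queue."""
        q = queue.Queue()
        with self._lock:
            self._subscribers.add(q)
        return q

    def unsubscribe(self, q):
        with self._lock:
            self._subscribers.discard(q)

    def publish(self, update):
        """Push one update onto every registered subscriber's queue."""
        with self._lock:
            for q in self._subscribers:
                q.put(update)

    def close(self):
        """Tell every current subscriber that the stream is over."""
        self.publish(self._STOP)

    def stream(self, q):
        """Generator a streaming handler could iterate to yield responses."""
        while True:
            item = q.get()  # blocks without busy-waiting
            if item is self._STOP:
                return
            yield item
```

In a handler like StatusSubscribe, one would presumably call subscribe() on entry, yield from stream(q), and call unsubscribe(q) in a finally block. Each published update should be an immutable snapshot (for example, a copy of the status message) rather than one shared object that is mutated afterwards.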

How to have my defined refresh function running in the background of my twisted server

I have a simple Twisted TCP server that runs absolutely fine. It basically handles database requests and displays the right results; it is just an echo client with a bunch of functions. The database being read is also updated over time, so I have a refresh function that opens the database and refreshes it. However, if I call it from the message-handling functions, the server takes too long to respond, because the refresh takes around 6 to 7 seconds to complete. My initial idea was to run this function in a while loop, refreshing every 5 to 10 minutes, but after reading about the global interpreter lock I'm not sure that is possible. Any suggestions on how to run this function in the background of my code would be greatly appreciated.
I've tried running it in a thread, but it doesn't seem to run at all when I start the thread. I put it under the if __name__ == '__main__': block, with no luck!
Here is my refresh function:
def refreshit():
    Application = win32com.client.Dispatch("Excel.Application")
    Workbook = Application.Workbooks.open(database)
    Workbook.RefreshAll()
    Workbook.Save()
    Application.Quit()
    xlsx = pd.ExcelFile(database)
    global datess
    global refss
    df = pd.read_excel(xlsx, sheet_name='Sheet1')
    datess = df.groupby('documentDate')
    refss = df.groupby('reference')
class Echo(Protocol):
    global Picked_DFS
    Picked_DFS = None
    label = None
    global errors
    global picked
    errors = []
    picked = []
    def dataReceived(self, data):
        """
        As soon as any data is received, write it back.
        """
        response = self.handle_message(data)
        print('responding with this')
        print(response)
        self.transport.write(response)
def main():
    f = Factory()
    f.protocol = Echo
    reactor.listenTCP(8000, f)
    reactor.run()
if __name__ == '__main__':
    main()
I had tried this to no avail:
if __name__ == '__main__':
    main()
    thread = Thread(target = refreshit())
    thread.start()
    thread.join()

You have an important error on this line:
thread = Thread(target = refreshit())
Though you have not included the definition of refreshit (perhaps a function to consider renaming), I assume refreshit is a function that performs your refresh.
In this case, what you are doing here is calling refreshit and waiting for it to return a value. Then, the value it returns is used as the target of the Thread you create here. This is probably not what you meant. Instead:
thread = Thread(target = refreshit)
That is, refreshit itself is what you want the target of the thread to be.
You also need to be sure to sequence your operations so that everything gets to run concurrently:
if __name__ == '__main__':
    # Start your worker/background thread.
    thread = Thread(target = refreshit)
    thread.start()
    # Run Twisted
    main()
    # Cleanup/wait on your worker/background thread.
    thread.join()
You may also just want to use Twisted's thread support instead of using the threading module directly (but this is not mandatory).
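For the repeated refresh the question asks about (every 5 to 10 minutes), one option that stays within the standard threading module is a small loop driven by an Event. This is only a sketch; the helper name start_periodic and its parameters are assumptions for this example.

```python
import threading


def start_periodic(func, interval, stop_event):
    """Call func() repeatedly in a daemon thread until stop_event is set.

    Waits `interval` seconds between calls; returns the started thread.
    """
    def loop():
        while not stop_event.is_set():
            try:
                func()
            except Exception as exc:  # one failed refresh must not kill the loop
                print('refresh failed:', exc)
            # wait() returns early as soon as stop_event is set
            stop_event.wait(interval)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Usage would look roughly like: create stop = threading.Event(), call worker = start_periodic(refreshit, 600, stop) before main(), then stop.set() and worker.join() after the reactor exits. Since refreshit drives Excel through COM, it will probably also need pythoncom.CoInitialize() called at the start of the worker thread.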

Python27 Is it able to make timer without thread.Timer?

Basically, I want to make a timer, but I don't want to use threading.Timer, for efficiency reasons.
It creates a new thread by itself, which is not efficient, so it is better not to use it.
I searched for articles on this and found that using threads is slow.
E.g., when a single process's work was divided into N parts and run in threads, it was slower.
However, I need to use a thread for this.
class Works(object):
    def __init__(self):
        self.symbol_dict = config.ws_api.get("ASSET_ABBR_LIST")
        self.dict = {}
        self.ohlcv1m = []
    def on_open(self, ws):
        ws.send(json.dumps(config.ws_api.get("SUBSCRIPTION_DICT")))
    # Every time I get a message from the websocket server, I store it in self.dict
    def on_message(self, ws, message):
        message = json.loads(message)
        if len(message) > 2:
            ticker = message[2]
            pair = self.symbol_dict[(ticker[0])]
            baseVolume = ticker[5]
            timestmap = time.time()
            try:
                type(self.dict[pair])
            except KeyError as e:
                self.dict[pair] = []
            self.dict[pair].append({
                'pair': pair,
                'baseVolume': baseVolume,
            })
    def run(self):
        websocket.enableTrace(True)
        ws = websocket.WebSocketApp(
            url = config.ws_api.get("WEBSOCK_HOST"),
            on_message = self.on_message,
            on_open = self.on_open
        )
        ws.run_forever(sslopt = {"cert_reqs": ssl.CERT_NONE})
    # Runs once every 60s: aggregates self.dict into self.ohlcv1m, which is
    # sent to the db. Then self.dict and self.ohlcv1m are reset to store the
    # next minute of data from the server.
    def every60s(self):
        threading.Timer(60, self.every60s).start()
        for symbol in self.dict:
            tickerLists = self.dict[symbol]
            self.ohlcv1m.append({
                "V": sum([
                    float(ticker['baseVolume']) for ticker in tickerLists])
            })
        #self.ohlcv1m will go to database every 1m
        self.ohlcv1m = [] #init again
        self.dict = {} #init again
if __name__ == "__main__":
    work = Works()
    t1 = threading.Thread(target=work.run)
    t1.daemon = True
    t1.start()
    work.every60s()
(sorry for the indention)
I connect to the socket by running run_forever() and receive real-time data.
Every 60s I need to check and aggregate the data.
Is there any way to run something every 60s without a thread in Python 2.7?
I would appreciate any advice.
Thank you

The answer comes down to whether you need the code to run exactly every 60 seconds, or whether you can just wait 60 seconds between runs (i.e. if the logic takes 5 seconds, it'll run every 65 seconds).
If you're happy with just a 60 second gap between runs, you could do:
import time
while True:
    every60s()
    time.sleep(60)
If you're really set on not using threads but having it start every 60 seconds regardless of the last poll time, you could time the last execution and subtract that from 60 seconds to get the sleep time.
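That timing approach could look like the sketch below. The function name run_every and its iterations, clock and sleep parameters are assumptions added so the loop can be bounded and tested; a real poller would loop forever. It is written for Python 3, where time.monotonic exists; on Python 2.7, time.time could be passed as the clock instead.

```python
import time


def run_every(interval, func, iterations, clock=time.monotonic, sleep=time.sleep):
    """Call func() `iterations` times, starting each call `interval` seconds apart.

    The sleep is shortened by however long func() took, so a 5-second call
    followed by a 55-second sleep still gives a 60-second period.
    """
    for _ in range(iterations):
        started = clock()
        func()
        elapsed = clock() - started
        # If func() overran the interval, start the next run immediately.
        sleep(max(0.0, interval - elapsed))
```

If the work ever takes longer than the interval, this version simply starts the next run late rather than trying to catch up.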
However, really, with the code you've got there you're not going to run into any of the issues with Python threads you might have read about. Those issues come in when you've got multiple threads all running at the same time and all CPU bound, which doesn't seem to be the case here unless there's some very slow, CPU intensive work that's not in your provided code.

Python Threading Issue, Is this Right?

I am attempting to make a few thousand DNS queries. I have written my script to use python-adns, and I have tried to add threading and queues so that the script runs efficiently.
However, I can only achieve mediocre results. The responses are choppy and intermittent: they start and stop, and most of the time they pause for 10 to 20 seconds.
tlock = threading.Lock()#printing to screen
def async_dns(i):
    s = adns.init()
    for i in names:
        tlock.acquire()
        q.put(s.synchronous(i, adns.rr.NS)[0])
        response = q.get()
        q.task_done()
        if response == 0:
            dot_net.append("Y")
            print(i + ", is Y")
        elif response == 300:
            dot_net.append("N")
            print(i + ", is N")
        tlock.release()
q = queue.Queue()
threads = []
for i in range(100):
    t = threading.Thread(target=async_dns, args=(i,))
    threads.append(t)
    t.start()
print(threads)
I have spent countless hours on this. I would appreciate some input from experienced Pythonistas. Is this a networking issue? Can these bottlenecks and intermittent responses be solved by switching servers?
Thanks.

Without answers to the questions I asked in the comments above, I'm not sure how well I can diagnose the issue you're seeing, but here are some thoughts:
It looks like each thread is processing all names instead of just a portion of them.
Your Queue seems to be doing nothing at all.
Your lock seems to guarantee that you actually only do one query at a time (defeating the purpose of having multiple threads).
Rather than trying to fix up this code, might I suggest using multiprocessing.pool.ThreadPool instead? Below is a full working example. (You could use adns instead of socket if you want... I just couldn't easily get it installed and so stuck with the built-in socket.)
In my testing, I also sometimes see pauses; my assumption is that I'm getting throttled somewhere.
import itertools
from multiprocessing.pool import ThreadPool
import socket
import string
def is_available(host):
    print('Testing {}'.format(host))
    try:
        socket.gethostbyname(host)
        return False
    except socket.gaierror:
        return True
# Test the first 1000 three-letter .com hosts
hosts = [''.join(tla) + '.com' for tla in itertools.permutations(string.ascii_lowercase, 3)][:1000]
with ThreadPool(100) as p:
    results = p.map(is_available, hosts)
for host, available in zip(hosts, results):
    print('{} is {}'.format(host, 'available' if available else 'not available'))
