I'm trying to send 6 API requests in one session and record how long they took, repeating this n times. I store each elapsed time in a list, then print some summary info about the list and visualize its data.
What works: downloading as .py, removing the ipython references, and running the code as a command-line script.
What also works: running the cells manually, including the erroring cell once the loop cell has completed.
What doesn't work: restarting and running all cells within the Jupyter notebook. The last cell does not seem to wait for the prior one: it appears to execute first and complains about an empty list. Error in image below.
Cell 1:
import asyncio
import aiohttp

# submit 6 models at the same time
# using support / first specified DPE above
auth = aiohttp.BasicAuth(login=USERNAME, password=API_TOKEN)

async def make_posts():
    for i in range(0, 6):
        yield df_input['deployment_id'][i]

async def synch6():
    #url = "%s/predApi/v1.0/deployments/%s/predictions" % (PREDICTIONSENDPOINT, DEPLOYMENT_ID)
    async with aiohttp.ClientSession(auth=auth) as session:
        post_tasks = []
        # prepare the coroutines that post
        async for x in make_posts():
            post_tasks.append(do_post(session, x))
        # now execute them all at once
        await asyncio.gather(*post_tasks)

async def do_post(session, x):
    url = "%s/predApi/v1.0/deployments/%s/predictions" % (PREDICTIONSENDPOINT, x)
    async with session.post(url, data=df_scoreme.to_csv(), headers=PREDICTIONSHEADERS_csv) as response:
        data = await response.text()
        #print (data)
Cell 2:
import datetime
import time

chonk_start = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S.000Z")
perf1 = []
n = 100
for i in range(0, n):
    start_ts = round(time.time() * 1000)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(synch6())
    end_ts = round(time.time() * 1000)
    perf1.append(end_ts - start_ts)
Cell 3:
perf_string(perf1, 'CHONKS')
The explicit error (see image) appears to be simply the result of operating on an empty list. The list appears to be empty because that cell executes before the timing-loop cell has actually populated it, although I don't know why. This appears to be a problem only inside the notebook.
EDIT: In further testing... this appears to work fine in my local (python3, Mac) Jupyter notebook. Where it fails is in an AWS SageMaker conda python3 notebook.
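A plausible explanation, not confirmed in the original post: some notebook kernels (depending on the ipykernel/tornado versions, as on SageMaker) already run their own asyncio event loop, so loop.run_until_complete() in Cell 2 can raise instead of running, leaving perf1 empty for Cell 3. A minimal sketch of two common workarounds under that assumption:

# Option 1: patch the already-running notebook loop with the third-party
# nest_asyncio package so run_until_complete() can nest inside it
import nest_asyncio
nest_asyncio.apply()

# Option 2: on ipykernel >= 5 you can await coroutines directly in a cell,
# so Cell 2's timing loop becomes:
# for i in range(0, n):
#     start_ts = round(time.time() * 1000)
#     await synch6()  # no explicit event loop needed
#     end_ts = round(time.time() * 1000)
#     perf1.append(end_ts - start_ts)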
Related
I have the following function to scrape a webpage.
def parse(link: str, list_of_samples: list, index: int) -> None:
    # Some code to scrape the webpage (link is given)
    # The code will generate a list of strings, say sample
    list_of_samples[index] = sample
I have another function that calls the above function for all URLs present in a list:
def call_that_guy(URLs: list) -> list:
    samples = [None for i in range(len(URLs))]
    for i in range(len(URLs)):
        parse(URLs[i], samples, i)
    return samples
Some other function then calls the above function:
def caller() -> None:
    URLs = [url_1, url_2, url_3, ..., url_n]
    # n will not exceed 12
    samples = call_that_guy(URLs)
    print(samples)
    # Prints the list of samples, but is taking too much time
One thing I noticed is that the parse function takes around 10 seconds to parse a single webpage (I am using Selenium), so parsing all the URLs in the list takes around 2 minutes. I want to speed it up, probably using multithreading.
I tried doing the following instead.
import threading

def call_that_guy(URLs: list) -> list:
    threads = [None for i in range(len(URLs))]
    samples = [None for i in range(len(URLs))]
    for i in range(len(URLs)):
        threads[i] = threading.Thread(target=parse, args=(URLs[i], samples, i))
        threads[i].start()
    return samples
But, when I printed the returned value, all of its contents were None.
What I am trying to achieve:
I want to scrape a list of URLs asynchronously and populate the list of samples. Once the list is populated, I have some other statements to execute (they should execute only after samples is populated, otherwise they will raise exceptions). I want to scrape the list of URLs concurrently instead of one after another.
(I can explain more clearly with an image if needed.)
Your threaded version returns samples immediately after starting the threads, before any of them has finished writing its result, which is why everything is still None; you would need to join() each thread before returning. Why don't you use the concurrent.futures module instead?
Here is a very simple but very fast example using concurrent.futures:
import concurrent.futures

def scrape_url(url):
    print(f'Scraping {url}...')
    scraped_content = '<html>scraped content!</html>'
    return scraped_content

urls = ['https://www.google.com', 'https://www.facebook.com', 'https://www.youtube.com']

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(scrape_url, urls)

print(list(results))
# Expected output:
# ['<html>scraped content!</html>', '<html>scraped content!</html>', '<html>scraped content!</html>']
If you want to learn threading, I recommend watching this short tutorial: https://www.youtube.com/watch?v=IEEhzQoKtQU
Also note that this is not multiprocessing, it is multithreading, and the two are not the same. If you want to know more about the difference, you can read this article: https://realpython.com/python-concurrency/
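Applied to your own code, a sketch might look like the one below. It assumes parse is rewritten to return its result instead of writing into a shared list, which fits executor.map more naturally; map() preserves input order, and the with-block waits for all workers to finish, so samples is fully populated before it is returned.

import concurrent.futures

def parse(link: str) -> list:
    # Some code to scrape the webpage (link is given) and build
    # a list of strings; this placeholder just echoes the link
    sample = [link]
    return sample

def call_that_guy(URLs: list) -> list:
    # n will not exceed 12, so one thread per URL is fine
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(URLs)) as executor:
        samples = list(executor.map(parse, URLs))
    return samples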
Hope this solves your problem.
I have a list of issues (Jira issues):
listOfKeys = [id1,id2,id3,id4,id5...id30000]
I want to get the worklogs of these issues. For this I used the jira-python library and this code:
listOfWorklogs = pd.DataFrame()  # (I used the pandas (pd) lib)
lst = {}  # helper dictionary where the worklogs will be stored
for i in range(len(listOfKeys)):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if len(worklogs) == 0:
        continue  # no worklogs for this issue
    else:
        for j in range(len(worklogs)):
            lst = {
                'self': worklogs[j].self,
                'author': worklogs[j].author,
                'started': worklogs[j].started,
                'created': worklogs[j].created,
                'updated': worklogs[j].updated,
                'timespent': worklogs[j].timeSpentSeconds
            }
            listOfWorklogs = listOfWorklogs.append(lst, ignore_index=True)
########### Below there is the recording to the .xlsx file ################
So I simply go into the worklog of each issue in a simple loop, which is equivalent to requesting the link
https://jira.mycompany.com/rest/api/2/issue/issueid/worklogs and retrieving information from it.
The problem is that there are more than 30,000 such issues, and the loop is very slow (approximately 3 seconds per issue).
Can I somehow start multiple loops / processes / threads in parallel to speed up the process of getting the worklogs (maybe without the jira-python library)?
I recycled a piece of code I had made into your code; I hope it helps:
from multiprocessing import Manager, Process, cpu_count

def insert_into_list(worklog, queue):
    lst = {
        'self': worklog.self,
        'author': worklog.author,
        'started': worklog.started,
        'created': worklog.created,
        'updated': worklog.updated,
        'timespent': worklog.timeSpentSeconds
    }
    queue.put(lst)
    return

# Number of cpus in the pc
num_cpus = cpu_count()

# Manager and queue to hold the results
manager = Manager()
# The queue has controlled insertion, so processes don't step on each other
queue = manager.Queue()

listOfWorklogs = pd.DataFrame()

for i in range(len(listOfKeys)):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if len(worklogs) == 0:
        continue
    # This loop replaces your "for j in range(len(worklogs))" loop
    index = 0  # restart at the first worklog of each issue
    while index < len(worklogs):
        processes = []
        elements = min(num_cpus, len(worklogs) - index)
        # Create a process for each cpu
        for j in range(elements):
            process = Process(target=insert_into_list, args=(worklogs[j + index], queue))
            processes.append(process)
        # Run the processes
        for j in range(elements):
            processes[j].start()
        # Wait for them to finish
        for j in range(elements):
            processes[j].join(timeout=10)
        index += elements

# Dump the queue into the dataframe
while queue.qsize() != 0:
    listOfWorklogs = listOfWorklogs.append(queue.get(), ignore_index=True)
This should work and reduce the time by a factor of a little less than the number of CPUs in your machine. You can try changing that number manually for better performance. In any case, I find it very strange that it takes about 3 seconds per operation.
PS: I couldn't try the code because I have no examples; it probably has some bugs.
I have some troubles:
1) Indentation in the code where the first for loop appears and the first if statement begins (this statement and everything below should be included in the loop, right?):
for i in range(len(listOfKeys)-99):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if(len(worklogs)) == 0:
        ....
2) cmd, the conda prompt and Spyder did not allow your code to work, with an error like:
Python Multiprocessing error: AttributeError: module '__main__' has no attribute '__spec__'
After researching on Google, I had to set __spec__ = None a bit higher in the code (but I'm not sure if this is correct), and this error disappeared.
By the way, the code worked in Jupyter Notebook without this error, but listOfWorklogs is empty, and this is not right.
3) When I corrected the indentation and set __spec__ = None, a new error occurred at this place:
processes[i].start()
with an error like this:
"PicklingError: Can't pickle : attribute lookup PropertyHolder on jira.resources failed"
If I remove the parentheses from the start and join methods, the code runs, but I get no entries in listOfWorklogs.
I ask again for your help!
How about thinking about it not from a technical standpoint but a logical one? You know your code works, but at a rate of 3 seconds per issue it would take 25 hours to complete all 30,000. If you have the ability to split up the set of Jira issues that is passed into the script (maybe by date or issue key, etc.), you could create multiple .py files with essentially the same code and just pass each one a different list of Jira tickets, as sketched below. Run, say, 4 of them at the same time and you reduce the time to about 6.25 hours.
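A minimal sketch of that idea (the chunk count, the file name chunk.py, and the stand-in key list are illustrative, not from the original post): each copy of the script receives a chunk number on the command line and processes only its own slice of the keys.

import sys

# chunk.py: launch as "python chunk.py 0", "python chunk.py 1", ...
# in separate terminals so the copies run at the same time
NUM_CHUNKS = 4
chunk_no = int(sys.argv[1])

listOfKeys = ['id%d' % k for k in range(30000)]  # stand-in for the real list

# every NUM_CHUNKS-th key belongs to this copy of the script
myKeys = listOfKeys[chunk_no::NUM_CHUNKS]

for key in myKeys:
    # fetch and record worklogs for this key exactly as in the original
    # loop (jira.worklogs(key), append to the dataframe, ...)
    pass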
So I am trying to run a defined function that sends a requests.post, takes its input from a pandas dataframe, and saves the result to the same dataframe in a different column.
import requests, json
import pandas as pd
import argparse
def postRequest(input, url):
    '''Post response from url'''
    headers = {'content-type': 'application/json'}
    r = requests.post(url=url, json=json.loads(input), headers=headers)
    response = r.json()
    return response

def payload(text):
    # get proper payload from text
    std_payload = { "auth_key":"key",
                    "org":{ "id":org_id, "name":"org" },
                    "ver":{ "id":ver_id, "name":"ver" },
                    "mess":{ "id":80 }}
    std_payload['mess']['text'] = text  # key is "mess", matching the dict above
    std_payload = json.dumps(std_payload)
    return std_payload
def find(df):
    ff = pd.DataFrame(columns=['text','expected','word','payload','response'])
    count = 0
    for leng in range(0, len(df)):
        search = df.text[leng].split()
        ff.loc[count] = df.iloc[leng]
        ff.loc[count,'word'] = 'orginalphrase'
        count = count + 1
        for w in range(0, len(search)):
            if df.text[leng] == "3174":
                ff.append(df.iloc[leng], ignore_index=True)
                ff.loc[count,'text'] = "3174"
                ff.loc[count,'word'] = None
                ff.loc[count,'expected'] = '[]'
                continue
            word = search[:]
            ff.loc[count,'word'] = word[w]
            word[w] = 'z'
            phrase = ' '.join(word)
            ff.loc[count,'text'] = phrase
            ff.loc[count,'expected'] = df.loc[leng,'expected']
            count = count + 1
        if df.text[leng] == "3174":
            continue
    return ff
# read in csv of phrases to be tested
df = pd.read_csv(filename, engine='python')
# allows empty cells by setting them to the sentinel phrase "3174"
df = df.fillna("3174")

sf = find(df)
for i in sf.index:
    sf.loc[i, 'payload'] = payload(sf.text[i])  # per-row assignment
for index in df.index:
    sf.loc[index, 'response'] = postRequest(sf.payload[index], url)  # post the JSON payload built above
From all my tests, this operation runs over the dataframe one row at a time, which can take a few hours when the dataframe is large.
Searching online for running things in parallel gives me a few methods, but I do not understand what those methods are doing, even though I can get the examples to work. For instance:
Simultaneously run POST in Python
Asynchronous Requests with Python requests
When I try to apply them to my code, specifically to postRequest, I cannot get any method to work; it still goes one by one.
Can anyone provide assistance in getting the parallelism to work correctly? If more information is required, please let me know.
Edit: here is the last thing I was working with:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(postRequest, sf.payload[index], trends_url): index for index in range(10)}
    for future in concurrent.futures.as_completed(future_to_url):
        repo = future_to_url[future]
        data = future.result()
        # as_completed yields in completion order, so store the result
        # by its original row index rather than by a running counter
        sf.loc[repo, 'response'] = data
Also, the dataframe has anywhere between 2,000 and 4,000 rows, so doing it in sequence can take up to 4 hours.
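For what it's worth, a minimal sketch of one way to wire postRequest into a thread pool (trends_url is taken from the edit above; everything else keeps the names from the question). executor.map returns results in input order, which avoids the index bookkeeping needed with as_completed:

import concurrent.futures

def post_row(pl):
    # one POST per payload; each call runs in a worker thread
    return postRequest(pl, trends_url)

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map() preserves input order, so results line up with sf's rows
    results = list(executor.map(post_row, sf.payload))

sf['response'] = results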
I have the following function that gets some data from a web page and stores it in a dictionary, with the time stamp as key and the data (a list) as value.
import time
import requests
import numpy as np
from bs4 import BeautifulSoup

def getData(d):
    page = requests.get('http://www.transportburgas.bg/bg/%D0%B5%D0%BB%D0%B5%D0%BA%D1%82%D1%80%D0%BE%D0%BD%D0%BD%D0%BE-%D1%82%D0%B0%D0%B1%D0%BB%D0%BE')
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all("table", class_="table table-striped table-hover")
    rows = table[0].find_all('tr')
    table_head = table[0].find_all('th')
    header = []
    tr_l = []
    rows = []
    for tr in table[0].find_all('tr'):
        value = tr.get_text()
        rows.append(value)
    time_stamp = rows[1].split("\n")[1]
    data = []
    for i in rows[1:]:
        a = i.split("\n")
        if time.strptime(a[1], '%d/%m/%Y %H:%M:%S')[:4] == time.localtime()[:4]:
            data.append(int(a[3]))
        else:
            data.append(np.nan)
    d[time_stamp] = data
The data on the web page gets updated every 5 minutes, and I would like the function to run automatically every 5 minutes. I am trying to do it with time.sleep and this function:
def period_fun(it):
    iterations = it
    while iterations != 0:
        getData(dic)
        time.sleep(300)
        iterations = iterations - 1
However, this function runs only once and I end up with only one item in the dictionary. I have tried it with a simple print(1) instead of getData and it works (1 gets printed several times), but when I plug in the function it doesn't.
I would really appreciate any suggestions on these functions or on how else I could achieve my goal!
How about using a library that handles scheduled jobs?
Schedule looks nice, although it does not use cron-like syntax: https://github.com/dbader/schedule
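A minimal sketch with schedule, reusing the getData function and a dic dictionary as in the question:

import time
import schedule  # pip install schedule

dic = {}
# run getData every 5 minutes, matching the page's update interval
schedule.every(5).minutes.do(getData, dic)

while True:
    schedule.run_pending()  # executes any job that is due
    time.sleep(1)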
The while loop below should print "\n\nInside while..." 10 times, but when I run the graph, "\n\nInside while..." is printed exactly once. Why is that?
import tensorflow as tf

i = tf.constant(0)

def condition(i):
    return i < 10

def body(i):
    print("\n\nInside while...", str(i))
    return i + 1

r = tf.while_loop(condition, body, [i])
Your issue comes from conflating TensorFlow graph building with graph execution.
The functions you pass to tf.while_loop get executed once, to generate the TensorFlow graph responsible for executing the loop itself. So if you had put a tf.Print in there (for example, saying return tf.Print(i+1, [i+1])) you'd see it print 10 times when the loop is actually executed by the TensorFlow system.
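A minimal sketch of that suggestion, assuming TensorFlow 1.x (where tf.Print and tf.Session are available):

import tensorflow as tf

i = tf.constant(0)

def condition(i):
    return i < 10

def body(i):
    # tf.Print is an identity op that logs its data argument (to stderr)
    # every time the graph actually executes this node
    return tf.Print(i + 1, [i + 1], message="Inside while... ")

r = tf.while_loop(condition, body, [i])

with tf.Session() as sess:
    print(sess.run(r))  # logs ten times while running, then prints 10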
I know practically nothing about TensorFlow and cannot help you with your immediate problem, but you can accomplish something similar (maybe) if you write your code differently. Following the logic of your program, a different implementation of while_loop is devised below. It takes your condition and body and runs a plain Python while loop parameterized with the functions passed to it. Shown below is a conversation with the interpreter demonstrating how this can be done.
>>> def while_loop(condition, body, local_data):
...     while condition(*local_data):
...         local_data = body(*local_data)
...     return local_data
...
>>> i = 0
>>> def condition(i):
...     return i < 10
...
>>> def body(i):
...     print('Inside while', i)
...     return i + 1,
...
>>> local_data = while_loop(condition, body, (i,))
Inside while 0
Inside while 1
Inside while 2
Inside while 3
Inside while 4
Inside while 5
Inside while 6
Inside while 7
Inside while 8
Inside while 9
>>> local_data
(10,)
>>>