Running requests.post in parallel over a pandas DataFrame - python-3.x

I am trying to run a function that wraps requests.post, takes its input from a pandas DataFrame, and saves the response into a different column of the same DataFrame.
import requests, json
import pandas as pd
import argparse

def postRequest(input, url):
    '''Post input to url and return the JSON response'''
    headers = {'content-type': 'application/json'}
    r = requests.post(url=url, json=json.loads(input), headers=headers)
    response = r.json()
    return response

def payload(text):
    # build the proper payload from text (org_id and ver_id are defined elsewhere)
    std_payload = {"auth_key": "key",
                   "org": {"id": org_id, "name": "org"},
                   "ver": {"id": ver_id, "name": "ver"},
                   "mess": {"id": 80}}
    std_payload['mess']['text'] = text  # key must match the dict built above
    std_payload = json.dumps(std_payload)
    return std_payload
def find(df):
    '''Expand each phrase in df into one row per word, with that word replaced by "z".'''
    ff = pd.DataFrame(columns=['text', 'expected', 'word', 'payload', 'response'])
    count = 0
    for leng in range(0, len(df)):
        search = df.text[leng].split()
        # copy the original row and mark it as the original phrase
        ff.loc[count] = df.iloc[leng]
        ff.loc[count, 'word'] = 'orginalphrase'
        count = count + 1
        for w in range(0, len(search)):
            if df.text[leng] == "3174":  # "3174" marks an originally empty cell
                ff.append(df.iloc[leng], ignore_index=True)  # note: no-op, append returns a new frame that is discarded
                ff.loc[count, 'text'] = "3174"
                ff.loc[count, 'word'] = None
                ff.loc[count, 'expected'] = '[]'
                continue
            # replace the w-th word with 'z' and store the new phrase
            word = search[:]
            ff.loc[count, 'word'] = word[w]
            word[w] = 'z'
            phrase = ' '.join(word)
            ff.loc[count, 'text'] = phrase
            ff.loc[count, 'expected'] = df.loc[leng, 'expected']
            count = count + 1
        if df.text[leng] == "3174":
            continue
    return ff
# read in csv of phrases to be tested (filename and url are defined elsewhere)
df = pd.read_csv(filename, engine='python')
# allow empty cells by setting them to the placeholder phrase "3174"
df = df.fillna("3174")
sf = find(df)
for i in sf.index:
    sf.loc[i, 'payload'] = payload(sf.text[i])  # build the JSON payload row by row
for index in sf.index:
    sf.loc[index, 'response'] = postRequest(sf.payload[index], url)
From all my tests, this runs over the DataFrame one row at a time, which can take a few hours when the DataFrame is large.
Searching online for ways to run things in parallel gives me a few methods, but I do not fully understand what those methods are doing. I have seen pooling and threading examples, and I can get the examples themselves to work, such as:
Simultaneously run POST in Python
Asynchronous Requests with Python requests
But when I try to apply them to my code, specifically to postRequest, I cannot get any method to work; the requests still run one at a time.
Can anyone provide assistance in getting the parallelism to work correctly? If more information is required, please let me know.
Thanks
Edit:
Here is the last thing I was working with:
import concurrent.futures

# trends_url and sf.payload come from the code above
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(postRequest, sf.payload[index], trends_url): index
                     for index in range(10)}
    for future in concurrent.futures.as_completed(future_to_url):
        repo = future_to_url[future]      # the row index this future belongs to
        data = future.result()
        sf.loc[repo, 'response'] = data   # assign by row index; completion order is not submission order
Also, the DataFrame has anywhere between 2000 and 4000 rows, so doing it in sequence can take up to 4 hours.
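A minimal sketch of one way to run the posts in parallel with a thread pool (not from the original post; post_all and max_workers=10 are names and values chosen for illustration, and it assumes sf already has its payload column filled and url is defined):

import concurrent.futures

def post_all(sf, url, max_workers=10):
    # run postRequest for every payload in parallel; executor.map preserves input order,
    # so the results line up with the DataFrame rows
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        responses = list(executor.map(lambda p: postRequest(p, url), sf['payload']))
    sf['response'] = responses
    return sf

sf = post_all(sf, url)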

Related

Python - Speeding up Web Scraping using multiprocessing

I have the following function to scrape a webpage.
def parse(link: str, list_of_samples: list, index: int) -> None:
    # Some code to scrape the webpage (link is given)
    # The code will generate a list of strings, say sample
    list_of_samples[index] = sample
I have another function that calls the above function for all URLs present in a list:
def call_that_guy(URLs: list) -> list:
    samples = [None for i in range(len(URLs))]
    for i in range(len(URLs)):
        parse(URLs[i], samples, i)
    return samples
Some other function that calls the above function
def caller() -> None:
    URLs = [url_1, url_2, url_3, ..., url_n]
    # n will not exceed 12
    samples = call_that_guy(URLs)
    print(samples)
    # Prints the list of samples, but is taking too much time
One thing I noticed is that the parse function takes around 10 seconds to parse a single webpage (I am using Selenium), so parsing all the URLs in the list takes around 2 minutes. I want to speed it up, probably using multithreading.
I tried doing the following instead.
import threading

def call_that_guy(URLs: list) -> list:
    threads = [None for i in range(len(URLs))]
    samples = [None for i in range(len(URLs))]
    for i in range(len(URLs)):
        threads[i] = threading.Thread(target=parse, args=(URLs[i], samples, i))
        threads[i].start()
    return samples
But, when I printed the returned value, all of its contents were None.
What am I trying to achieve:
I want to asynchronously scrape a list of URLs and populate the list of samples. Once the list is populated, I have some other statements to execute (they should execute only after samples is populated, otherwise they will raise exceptions). I want to scrape the list of URLs faster (asynchronously is fine) instead of scraping them one after another.
(I can explain it more clearly with an image if needed.)
Why don't you use the concurrent.futures module?
Here is a very simple but super fast code using concurrent.futures:
import concurrent.futures

def scrape_url(url):
    print(f'Scraping {url}...')
    scraped_content = '<html>scraped content!</html>'
    return scraped_content

urls = ['https://www.google.com', 'https://www.facebook.com', 'https://www.youtube.com']

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(scrape_url, urls)
    print(list(results))

# Expected output:
# ['<html>scraped content!</html>', '<html>scraped content!</html>', '<html>scraped content!</html>']
If you want to learn threading, I recommend watching this short tutorial: https://www.youtube.com/watch?v=IEEhzQoKtQU
Also note that this is not multiprocessing, this is multithreading and the two are not the same. If you want to know more about the difference, you can read this article: https://realpython.com/python-concurrency/
Hope this solves your problem.
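For completeness (an addition, not part of the original answer): the threading version from the question can also be made to work simply by joining the threads before returning, so samples is only read after every thread has finished. A sketch assuming the same parse signature:

import threading

def call_that_guy(URLs: list) -> list:
    samples = [None for i in range(len(URLs))]
    threads = [threading.Thread(target=parse, args=(URLs[i], samples, i))
               for i in range(len(URLs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # wait for every thread to finish before reading samples
    return samples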

How to loop through api using Python to pull records from specific group of pages

I wrote a Python script that pulls JSON data from an API. The API has 10k records.
While setting up my script I was only pulling 10 pages (each page contains 25 items), so it worked fine: dump everything into a .csv and also put everything into a MySQL db.
When I ran what I thought would be my last test and attempted to pull the data from all 500 pages, I got an internal server error. I researched that and think it is because I am pulling all this data at once. The API documentation is kind of crappy and I can't find any rate limit info, anyway...
Since this is not my API, I thought a quick solution would be just to run my script and let it pull the data from the first 10 pages, then the second 10 pages, the third 10 pages, etc.
For obvious reasons I can't show all the code, but below are the basics/snippets. It is pretty simple: grab the URL, manipulate it a bit so I can add the page number, then count the number of pages, then loop through and grab the data content.
Could someone help by explaining/showing how I can loop through my URL, get the content from pages 1-10, then loop through and get the content from pages 11-20, and so on?
Any insight, suggestions, examples would be greatly appreciated.
import requests
import json
import pandas as pd
import time
from datetime import datetime, timedelta

time_now = time.strftime("%Y%m%d-%H%M%S")

# Make our request (baseurl, endpoint and headers are defined elsewhere)
def main_request(baseurl, x, endpoint, headers):
    r = requests.get(baseurl + f'{x}' + endpoint, headers=headers)
    return r.json()

# determine how many pages are needed to loop through, use for pagination
def get_pages(response):
    # return response['page']['size']
    return 2

def parse_json(response):
    animal_list = []
    for item in response['_embedded']['results']:
        animal_details = {
            'animal type': item['_type'],
            'animal title': item['title'],
            'animal phase': item['type']['value']
        }
        animal_list.append(animal_details)
    return animal_list

animal_main_list = []
animal_data = main_request(baseurl, 1, endpoint, headers)
for x in range(1, get_pages(animal_data) + 1):
    print(x)
    animal_main_list.extend(parse_json(main_request(baseurl, x, endpoint, headers)))
Got it working, using the range function and the f'{x}' in my URL. I set the range in a for loop so it updates the URL, and I used time.sleep to slow the retrieval down.
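A minimal sketch of that batching idea, reusing main_request and parse_json from above; the 500-page total, the batch size of 10, and the 2-second pause are assumptions for illustration:

total_pages = 500      # assumed total page count; get_pages() would return the real value
batch_size = 10

animal_main_list = []
for start in range(1, total_pages + 1, batch_size):
    end = min(start + batch_size - 1, total_pages)
    for x in range(start, end + 1):
        animal_main_list.extend(parse_json(main_request(baseurl, x, endpoint, headers)))
    print(f'pulled pages {start}-{end}')
    time.sleep(2)      # pause between batches to avoid hammering the API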

Why my function that creates a pandas dataframe changes the dtype to none when called

I'm working on processing CSV files. I was writing my code without functions and it worked, albeit with some problems when trying to fillna with a string, before I added a try and except.
For some reason it didn't work before I created the while loop.
My question is: why does a DataFrame object, created inside a function by reading a CSV file whose name I pass when I call the function, end up unavailable (effectively empty) afterwards? I thought that once the DataFrame was in memory it wouldn't be destroyed. What am I missing?
My code:
import pandas as pd

grossmargin = 1.2

def read_wholesalefile(name):
    mac = name
    apple = pd.read_csv(mac)
    apple['price'] = apple['Wholesale'] * grossmargin
    while True:
        try:
            apple.fillna('N/A', inplace=True)
            break
        except ValueError:
            print('Not Valid')

read_wholesalefile('Wholesalelist5182021.csv')
Well, sorry guys, I figured it out by myself:
I was missing the scope. Sorry again for the newb stuff; I just started coding in Python a few months ago (last December) and I'm learning in the process.
What worked for me was to declare apple as global within the function. Seriously, I didn't know DataFrames behaved like local variables inside a function.
# My modified code that works
import pandas as pd

grossmargin = 1.2

def read_wholesalefile(name):
    global apple   # make apple visible outside the function
    mac = name
    apple = pd.read_csv(mac)
    apple['price'] = apple['Wholesale'] * grossmargin
    while True:
        try:
            apple.fillna('N/A', inplace=True)
            break
        except ValueError:
            print('Not Valid')

read_wholesalefile('Wholesalelist5182021.csv')
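Not part of the original post, but a more common alternative to global is to return the DataFrame from the function and assign it at the call site; a sketch under the same assumptions about the CSV columns:

import pandas as pd

grossmargin = 1.2

def read_wholesalefile(name):
    apple = pd.read_csv(name)
    apple['price'] = apple['Wholesale'] * grossmargin
    apple.fillna('N/A', inplace=True)
    return apple   # hand the DataFrame back to the caller

apple = read_wholesalefile('Wholesalelist5182021.csv')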

How to get a certain number of words from a website in python

I want to fetch data from cheat.sh using the requests lib and the discord.py lib... but since Discord only allows 2000 characters per message, I want to fetch only a certain amount of the text (words/digits/newlines), say about 1800 characters. How can I do so?
A small code example showing my idea:
import requests
url = "https://cheat.sh/python/string+annotations" #this gets the docs of string annotation in python
response = requests.get(url)
data = response.text # This gives approximately 2403 words...but i want to get only 1809 words
print(data)
import requests

url = "https://cheat.sh/python/string+annotations"  # this gets the docs of string annotations in python
response = requests.get(url)
data = response.text[:1800]   # keep only the first 1800 characters
print(data)
This should be the correct code.
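If cutting in the middle of a line is a concern, a small variation (my own addition, not from the answer above) truncates at the last newline before the 1800-character limit:

import requests

url = "https://cheat.sh/python/string+annotations"
data = requests.get(url).text

if len(data) > 1800:
    cut = data.rfind('\n', 0, 1800)                   # last newline within the first 1800 characters
    data = data[:cut] if cut != -1 else data[:1800]   # fall back to a hard cut if no newline found
print(data)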

python3, Trying to get an output from my function I defined, need some guidance

I found a pretty cool ASN API tool that allows me to supply an AS number, and it will go out and pull down the subnets that relate to that ASN.
Here is rough but partial code. I am defining a function asnfinder that takes ASNNUMBER (which I will supply through another file).
When I call it here, it just gives me an n...
What I'm trying to do here is append str(ASNNUMBER) to the end of the ?q= parameter in the URL.
Once I do that, I'd like to display my results and output them to a file.
import requests

def asnfinder(ASNNUMBER):
    print('n\n######## Running ASNFinder ########\n')
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
The result I'd like to get is the output of the GET request I'm performing. At the moment all that prints is:
######## Running ASNFinder ########
n
Try writing something like this:
import requests

def asnfinder(ASNNUMBER):
    print('\n######## Running ASNFinder ########\n')
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
    data = response.text
    print(data)
    with open('filename', 'w') as f:   # open for writing, not reading
        f.write(data)
It should work fine.
P.S. If it helped ya, please make sure you mark this as the answer :)
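For example, a hypothetical call (the AS number is chosen purely for illustration):

asnfinder(15169)   # prints the lookup result and writes it to the file named 'filename'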
