Python - Multiprocessing passing Pandas Dataframe - python-3.x

I'm trying to use multiprocessing to do some web scraping, add the results to a separate DataFrame for each Process, and then merge all the DataFrames together at the end to avoid locks.
I tried the code below and it runs, but the DataFrame only holds the data inside the child process; once the process finishes running, the DataFrame in the parent is still empty.
Am I missing something?
import pandas as pd
import multiprocessing

def add_row_to_db(database):
    database.loc[len(database.index)] = ['Sample', 'Testing']
    print('Added')
    print(f'{database}\n')

if __name__ == '__main__':
    columns_name = ['name', 'power']
    db = pd.DataFrame(columns=columns_name)

    process = multiprocessing.Process(target=add_row_to_db, args=(db,))
    process.start()
    process.join()

    print(db)
Output

Added
     name    power
0  Sample  Testing

Empty DataFrame
Columns: [name, power]
Index: []

Process finished with exit code 0
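The behaviour follows from how multiprocessing works: the child Process receives a pickled copy of the DataFrame, so rows added in the worker never reach the parent. Below is a minimal sketch of the merge-at-the-end approach described above, using a Pool so each worker's DataFrame is returned to the parent and concatenated (the scrape_one helper is a placeholder, not code from the question):

import pandas as pd
from multiprocessing import Pool

def scrape_one(name):
    # placeholder for the real scraping; build and return a per-task DataFrame
    return pd.DataFrame([[name, 'Testing']], columns=['name', 'power'])

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        frames = pool.map(scrape_one, ['Sample1', 'Sample2', 'Sample3'])
    db = pd.concat(frames, ignore_index=True)
    print(db)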

Related

loop over a python list

I have a Python list of ids which I call in my function. There are around 200 ids. I would like to know the best way to call these ids in chunks: call 10 or 20 ids at a time, then the next 20 on the next call, and so on. I have used multithreading here to make it faster, but it seems to take a lot of time. Here is the code I managed:
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import datetime as dt
import pandas as pd

df = pd.ExcelFile('ids.xlsx').parse('Sheet1')
x = []
x.append(df['external_ids'].to_list())

def download():
    # client is my python sdk
    dtest_df = client.datapoints.retrieve_dataframe(external_id=x[0], start=0, end="now", granularity='1m')
    dtest_df = dtest_df.rename(columns={'index': 'timestamp'})
    client.datapoints.insert_dataframe(dtest_df, external_id_headers=True, dropna=True)
    print(dtest_df)

with ThreadPoolExecutor(max_workers=20) as executor:
    future = executor.submit(download)
    print(future.result())
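One common way to do the chunking asked about here is to slice the id list into fixed-size batches and submit one task per batch, so the 20 workers actually run in parallel. A rough sketch, assuming client and the external_ids list come from the question's setup (passing a list of ids to retrieve_dataframe is taken from the question's own call):

from concurrent.futures import ThreadPoolExecutor

external_ids = df['external_ids'].to_list()  # from the question's Excel file

def chunks(seq, size):
    # yield successive fixed-size slices of a list
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def download_chunk(ids):
    # fetch one batch of external ids; client is the SDK object from the question
    return client.datapoints.retrieve_dataframe(external_id=ids, start=0, end="now", granularity='1m')

with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(download_chunk, chunks(external_ids, 20)))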

How to confirm multiprocessing library is being used?

I am trying to use multiprocessing for the code below. The code seems to run a bit faster than the for loop inside the function.
How can I confirm that I am using the library and not just the for loop?
from multiprocessing import Pool
from multiprocessing import cpu_count
import requests
import pandas as pd

data = pd.read_csv('~/Downloads/50kNAE000.txt.1', sep="\t", header=None)
data = data[0].str.strip("0 ")
lst = []

def request(x):
    for i, v in x.items():
        print(i)
        file = requests.get(v)
        lst.append(file.text)
        # time.sleep(1)

if __name__ == "__main__":
    pool = Pool(cpu_count())
    results = pool.map(request(data))
    pool.close()  # 'TERM'
    pool.join()   # 'KILL'
Multiprocessing has overhead. It has to start the process and transfer the function's data via an interprocess mechanism. Running a single function in another process versus running that same function normally is always going to be slower. The advantage comes from real parallelism, with enough work in each function call that the overhead becomes negligible.
You can call multiprocessing.current_process().name to see the process name change.
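A minimal sketch of that check, printing the worker's process name from inside the mapped function. Note also that pool.map takes the function and the iterable as two separate arguments (pool.map(request, data)); writing request(data) inside the call evaluates it in the parent process before the pool is ever used:

from multiprocessing import Pool, current_process

def fetch(item):
    # in a Pool the name differs per worker (e.g. ForkPoolWorker-1), unlike MainProcess
    print(current_process().name, item)
    return item

if __name__ == "__main__":
    with Pool(4) as pool:
        pool.map(fetch, ['a', 'b', 'c', 'd'])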

Fastest iteration on a dataframe by applying an url function

I need to request some data from a URL, inserting variable=var for each row of my dataframe. I wrote a function that iterates over each row:
import requests
from pandas import json_normalize  # pandas.io.json.json_normalize on older pandas

def df_eval(data):
    data_eval = data.copy()
    df_price = []
    for i in data_eval.index:
        var = data_eval.at[i, 'var']
        url = "http://blablabla/params&cid={}".format(var)
        r_json = requests.get(url).json()
        df = json_normalize(r_json)
        df_price.append(df['price'])
        print(df_price)
    data_eval['price_eval'] = df_price
    return data_eval
Could you suggest a faster way to do this? Currently it takes about 30 minutes for 23,000 rows.
You could parallelize your calls like this:
import pandas as pd
import numpy as np
from multiprocessing import Pool

n_cores = 4  # number of worker processes

data_split = np.array_split(data, n_cores)
pool = Pool(n_cores)
data = pd.concat(pool.map(df_eval, data_split))
pool.close()
pool.join()
Source: https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1

Multi-Processing to share memory between processes

I am trying to update a variable of a class by calling a function of that class from another function, which is run in multiple processes.
To achieve the desired result, process (p1) needs to update the variable "transactions", which should then get modified by process (p2).
I tried the code below. I know I should use multiprocessing.Value or a Manager to achieve the desired result, but I am not sure how, since the variable to be updated lives in another class.
Below is the code:
from multiprocessing import Process
from helper import Helper

camsource = ['a', 'b']
Pros = []

def sub(i):
    HC.trail_func(i)

def main():
    for i in camsource:
        print("Camera Thread {} Started!".format(i))
        p = Process(target=sub, args=(i,))
        Pros.append(p)
        p.start()

    # block until all the threads finish (i.e. block until all function_x calls finish)
    for t in Pros:
        t.join()

if __name__ == "__main__":
    HC = Helper()
    main()
Here is the helper code:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

class Helper():
    def __init__(self):
        self.transactions = []

    def trail_func(self, preview):
        if preview == 'a':
            self.transactions.append({"Apple": 1})
        else:
            if self.transactions[0]['Apple'] == 1:
                self.transactions[0]['Apple'] = self.transactions[0]['Apple'] + 1
        print(self.transactions)
Desired Output:
p1:
transactions = {"Apple":1}
p2:
transactions = {"Apple":2}
I've recently released a module that can help you with your code: all of its data frames (data models that can hold any type of data) have locks on them in order to solve concurrency issues. Anyway, take a look at the README file and the examples.
I've made an example here too, if you'd like to check.
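For reference, a minimal sketch of the Manager route the question mentions, where a Manager().list() is handed to Helper so both processes see the same transactions; this wiring is an assumption, not code from the thread:

from multiprocessing import Process, Manager

class Helper:
    def __init__(self, transactions):
        # transactions is a Manager().list() proxy shared across processes
        self.transactions = transactions

    def trail_func(self, preview):
        if preview == 'a':
            self.transactions.append({"Apple": 1})
        elif self.transactions and self.transactions[0]['Apple'] == 1:
            # reassign the whole element so the proxy registers the nested change
            self.transactions[0] = {"Apple": 2}
        print(list(self.transactions))

if __name__ == "__main__":
    manager = Manager()
    hc = Helper(manager.list())
    for cam in ['a', 'b']:
        p = Process(target=hc.trail_func, args=(cam,))
        p.start()
        p.join()  # join each one so 'a' updates before 'b' reads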

Customize Python Script On Azure ML

I want to use fuzzywuzzy matching in a Python script. I implemented it this way, but I don't get anything back.
This is my python script code:
import pandas as pd
from fuzzywuzzy import process

def azureml_main(dataframe1=None):
    return dataframe1,

def get_matches(query, choice, limit=6):
    result = process.extract(query, choice, limit=limit)
    return result,

get_matches("admissibility", dataframe1)
