Dynamically updating a nested dictionary with multiprocessing.pool (speed issue) - python-3.x

I have written a small script to understand how the lack of communication between child processes leads to random results when using multiprocessing.Pool. I pass a nested dictionary, wrapped as a DictProxy object created by multiprocessing.Manager:
manager = Manager()
my_dict = manager.dict()
my_dict['nested'] = nested
into a pool of 16 worker processes. The nested dictionary is defined below; my_function simply squares each number stored in the elements of the nested dictionary.
As expected, because multiprocessing.dummy uses threads that share memory, I get the correct result when I use it:
{0: 1, 1: 4, 2: 9, 3: 16}
{0: 4, 1: 9, 2: 16, 3: 25}
{0: 9, 1: 16, 2: 25, 3: 36}
{0: 16, 1: 25, 2: 36, 3: 49}
{0: 25, 1: 36, 2: 49, 3: 64}
but when I use multiprocessing, the result is incorrect and differs randomly from run to run. One example of an incorrect result is:
{0: 1, 1: 2, 2: 3, 3: 4}
{0: 4, 1: 9, 2: 16, 3: 25}
{0: 3, 1: 4, 2: 5, 3: 6}
{0: 16, 1: 25, 2: 36, 3: 49}
{0: 25, 1: 36, 2: 49, 3: 64}
In this particular run, the 'data' in elements 1 and 3 was not updated. I understand that this happens because each child process reads and rewrites the whole nested dictionary without coordination, so the updates made by one process can be overwritten by another instead of being properly propagated. However, can someone help me use Manager.Queue to organize this inter-process communication and get correct results, ideally with minimal runtime?
Code (Python 3.5)
from multiprocessing import Pool, Manager
import numpy as np
def my_function(A):
    arg1 = A[0]
    my_dict = A[1]
    temporary_dict = my_dict['nested']
    for arg2 in np.arange(len(my_dict['nested']['elements'][arg1]['data'])):
        temporary_dict['elements'][arg1]['data'][arg2] = temporary_dict['elements'][arg1]['data'][arg2] ** 2
    my_dict['nested'] = temporary_dict
if __name__ == '__main__':
    # nested dictionary definition
    strs1 = {}
    strs2 = {}
    strs3 = {}
    strs4 = {}
    strs5 = {}
    strs1['data'] = {}
    strs2['data'] = {}
    strs3['data'] = {}
    strs4['data'] = {}
    strs5['data'] = {}
    for i in [0, 1, 2, 3]:
        strs1['data'][i] = i + 1
        strs2['data'][i] = i + 2
        strs3['data'][i] = i + 3
        strs4['data'][i] = i + 4
        strs5['data'][i] = i + 5
    nested = {}
    nested['elements'] = [strs1, strs2, strs3, strs4, strs5]
    nested['names'] = ['series1', 'series2', 'series3', 'series4', 'series5']
    # parallel processing
    pool = Pool(processes = 16)
    manager = Manager()
    my_dict = manager.dict()
    my_dict['nested'] = nested
    sequence = np.arange(len(my_dict['nested']['elements']))
    pool.map(my_function, ([seq, my_dict] for seq in sequence))
    pool.close()
    pool.join()
    # printing the data in all elements of the nested dictionary
    print(my_dict['nested']['elements'][0]['data'])
    print(my_dict['nested']['elements'][1]['data'])
    print(my_dict['nested']['elements'][2]['data'])
    print(my_dict['nested']['elements'][3]['data'])
    print(my_dict['nested']['elements'][4]['data'])
One way to work around this and get correct results is to use multiprocessing.Lock, but that kills the speed:
from multiprocessing import Pool, Manager, Lock
import numpy as np
def init(l):
    global lock
    lock = l

def my_function(A):
    arg1 = A[0]
    my_dict = A[1]
    with lock:
        temporary_dict = my_dict['nested']
        for arg2 in np.arange(len(my_dict['nested']['elements'][arg1]['data'])):
            temporary_dict['elements'][arg1]['data'][arg2] = temporary_dict['elements'][arg1]['data'][arg2] ** 2
        my_dict['nested'] = temporary_dict
if __name__ == '__main__':
    # nested dictionary definition
    strs1 = {}
    strs2 = {}
    strs3 = {}
    strs4 = {}
    strs5 = {}
    strs1['data'] = {}
    strs2['data'] = {}
    strs3['data'] = {}
    strs4['data'] = {}
    strs5['data'] = {}
    for i in [0, 1, 2, 3]:
        strs1['data'][i] = i + 1
        strs2['data'][i] = i + 2
        strs3['data'][i] = i + 3
        strs4['data'][i] = i + 4
        strs5['data'][i] = i + 5
    nested = {}
    nested['elements'] = [strs1, strs2, strs3, strs4, strs5]
    nested['names'] = ['series1', 'series2', 'series3', 'series4', 'series5']
    # parallel processing
    manager = Manager()
    l = Lock()
    my_dict = manager.dict()
    my_dict['nested'] = nested
    pool = Pool(processes = 16, initializer=init, initargs=(l,))
    sequence = np.arange(len(my_dict['nested']['elements']))
    pool.map(my_function, ([seq, my_dict] for seq in sequence))
    pool.close()
    pool.join()
    # printing the data in all elements of the nested dictionary
    print(my_dict['nested']['elements'][0]['data'])
    print(my_dict['nested']['elements'][1]['data'])
    print(my_dict['nested']['elements'][2]['data'])
    print(my_dict['nested']['elements'][3]['data'])
    print(my_dict['nested']['elements'][4]['data'])
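A lock-free alternative (a sketch, not part of the original post): because each call to my_function touches only its own element, the workers do not need to share the dictionary at all. Each worker can receive a plain (index, data) pair, return the squared data, and let the parent merge the results, so no Manager, Lock, or Queue is needed. The function name square_element and the pool size are illustrative.

from multiprocessing import Pool

def square_element(args):
    # worker receives a plain (index, data-dict) pair; nothing is shared
    idx, data = args
    return idx, {k: v ** 2 for k, v in data.items()}

if __name__ == '__main__':
    # build the same nested structure as in the question
    nested = {
        'elements': [{'data': {i: i + offset for i in range(4)}} for offset in range(1, 6)],
        'names': ['series1', 'series2', 'series3', 'series4', 'series5'],
    }
    work = [(i, elem['data']) for i, elem in enumerate(nested['elements'])]
    with Pool(processes=4) as pool:
        results = pool.map(square_element, work)
    # only the parent writes back, so there is no race
    for idx, squared in results:
        nested['elements'][idx]['data'] = squared
    for elem in nested['elements']:
        print(elem['data'])

Because all writes happen in the parent after map returns, the output matches the multiprocessing.dummy result without any locking.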

Related

ast nodes not preserving some properties (lineno or/and col_offset)

I'm trying to replace every break statement with exec('break') in a piece of code. So far I've got this:
import ast

source = '''some_list = [2, 3, 4, 5]
for i in some_list:
    if i == 4:
        p = 0
        break
exec('d = 9')'''

tree = ast.parse(source)

class NodeTransformer(ast.NodeTransformer):
    def visit_Break(self, node: ast.Break):
        print(ast.dump(node))
        exec_break = ast.Call(func=ast.Name(id='exec', ctx=ast.Load()),
                              args=[ast.Constant(value='break')],
                              keywords=[])
        return ast.copy_location(exec_break, node)

NodeTransformer().visit(tree)
print(ast.unparse(tree))
However, the output puts p = 0 and exec('break') on the same line:
some_list = [2, 3, 4, 5]
for i in some_list:
    if i == 4:
        p = 0exec('break')
exec('d = 9')
I created an ast.Call node that calls the exec function with the argument 'break', but the transformation does not come out properly. What did I miss?
I've found the bug: the ast.Call node has to be wrapped in an ast.Expr node:
def visit_Break(self, node: ast.Break):
    exec_break = ast.Call(func=ast.Name(id='exec', ctx=ast.Load()),
                          args=[ast.Constant(value='break')],
                          keywords=[])
    new_node = ast.Expr(value=exec_break)
    ast.copy_location(new_node, node)
    ast.fix_missing_locations(new_node)
    return new_node
Reference: https://greentreesnakes.readthedocs.io/en/latest/examples.html#simple-test-framework
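For completeness (a quick check, not part of the original answer): with the fixed visit_Break, ast.unparse(tree) should place exec('break') on its own statement line, roughly:

some_list = [2, 3, 4, 5]
for i in some_list:
    if i == 4:
        p = 0
        exec('break')
exec('d = 9')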

How to customize width table in python-docx

I want to make a table in a Word document using python-docx, but the width of the table always stretches to the full width of the ruler. How can I customize this?
My code:
from docx import Document
from docx.shared import Cm, Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = Document()  # document object assumed; not shown in the original snippet

def table_columns(text, my_rows):
    row = table.rows[0].cells
    paragraph = row[my_rows].add_paragraph()
    get_paragraph = paragraph.add_run(text)
    paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
    get_paragraph.bold = True
    font = get_paragraph.font
    font.size = Pt(10)

table = doc.add_table(rows = 1, cols = 5, style = 'Table Grid')
columns_width = {
    0: 2,
    1: 35,
    2: 35,
    3: 42,
    4: 170
}
for column_idx in range(len(table.columns)):
    table.cell(0, column_idx).width = Cm(columns_width[column_idx])
for rows_idx in range(len(table.rows)):
    table.rows[rows_idx].height = Cm(1.25)
columns_names = {
    0: 'NO',
    1: 'VALUE1',
    2: 'VALUE2',
    3: 'VALUE3',
    4: 'VALUE4'
}
for column_idx in range(len(table.columns)):
    table_columns(columns_names[column_idx], column_idx)
I have also tried changing columns_width, but it gives the same result.
Here is the result I get and what I want to achieve (screenshots were attached to the original post):
Thanks for your help.
Cell width is what matters here. You are using:
columns_width = {
    0: 2,
    1: 35,
    2: 35,
    3: 42,
    4: 170
}
table.cell(0, column_idx).width = Cm(columns_width[column_idx])
to set the cell widths, which is fine, but you are using large Cm() (centimeter) lengths to do it. For example, 170 cm is 1.7 meters.
If you use Pt() instead or possibly Mm() I think you'll get better results.
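As a sketch of that suggestion (not part of the original answer, and assuming the same table and columns_width mapping from the question), you could treat the numbers as millimetres, disable autofit, and set the width on every cell of each column, since Word stores widths per cell:

from docx.shared import Mm

table.autofit = False  # keep Word from resizing the columns automatically
for column_idx, width in columns_width.items():
    for row in table.rows:
        row.cells[column_idx].width = Mm(width)

Setting the width on the cells of every row, rather than only the header row, is commonly needed before Word honours it.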

Is the rear item in a Queue the last item added or the item at the end of a Queue?

My professor wrote a Queue class that uses arrays. I was giving it multiple test cases and got confused by one specific part: I want to figure out whether the last item added is the rear of the queue. Let's say I enqueued 8 elements:
[1, 2, 3, 4, 5, 6, 7, 8]
Then I dequeued. And now:
[None, 2, 3, 4, 5, 6, 7, 8]
I enqueued 9 onto the queue and it went into the first slot of the underlying array. However, when I called the method that returns the rear item of the queue, q.que_rear, it returned 8. I thought the rear item would be 9, since it was the last item added.
Here is how I tested it in case anyone is confused:
>>> q = ArrayQueue()
>>> q.enqueue(1)
>>> q.enqueue(2)
>>> q.enqueue(3)
>>> q.enqueue(4)
>>> q.data
[1, 2, 3, 4, None, None, None, None]
>>> q.dequeue()
1
>>> q.enqueue(5)
>>> q.enqueue(6)
>>> q.enqueue(7)
>>> q.enqueue(8)
>>> q.data
[None, 2, 3, 4, 5, 6, 7, 8]
>>> q.enqueue(9)
>>> q.data
[9, 2, 3, 4, 5, 6, 7, 8]
>>> q.que_rear()
Rear item is 8
EDIT
I just want to know what is supposed to be the "rear of the Queue": the last element added, or the element at the end of the underlying list? In the case I showed, is it supposed to be 8 or 9?
Here is my code:
class ArrayQueue:
    INITIAL_CAPACITY = 8

    def __init__(self):
        self.data = [None] * ArrayQueue.INITIAL_CAPACITY
        self.rear = ArrayQueue.INITIAL_CAPACITY - 1
        self.num_of_elems = 0
        self.front_ind = None

    # O(1) time
    def __len__(self):
        return self.num_of_elems

    # O(1) time
    def is_empty(self):
        return len(self) == 0

    # Amortized worst case running time is O(1)
    def enqueue(self, elem):
        if self.num_of_elems == len(self.data):
            self.resize(2 * len(self.data))
        if self.is_empty():
            self.data[0] = elem
            self.front_ind = 0
            self.num_of_elems += 1
        else:
            back_ind = (self.front_ind + self.num_of_elems) % len(self.data)
            self.data[back_ind] = elem
            self.num_of_elems += 1

    def dequeue(self):
        if self.is_empty():
            raise Exception("Queue is empty")
        elem = self.data[self.front_ind]
        self.data[self.front_ind] = None
        self.front_ind = (self.front_ind + 1) % len(self.data)
        self.num_of_elems -= 1
        if self.is_empty():
            self.front_ind = None
        # As with dynamic arrays, we shrink the underlying array (by half) if we are using less than 1/4 of the capacity
        elif len(self) < len(self.data) // 4:
            self.resize(len(self.data) // 2)
        return elem

    # O(1) running time
    def first(self):
        if self.is_empty():
            raise Exception("Queue is empty")
        return self.data[self.front_ind]

    def que_rear(self):
        if self.is_empty():
            print("Queue is empty")
        print("Rear item is", self.data[self.rear])

    # Resizing takes time O(n) where n is the number of elements in the queue
    def resize(self, new_capacity):
        old_data = self.data
        self.data = [None] * new_capacity
        old_ind = self.front_ind
        for new_ind in range(self.num_of_elems):
            self.data[new_ind] = old_data[old_ind]
            old_ind = (old_ind + 1) % len(old_data)
        self.front_ind = 0
The que_rear function seems to have been added post hoc, in an attempt to understand how the internal circular queue operates. But notice that self.rear (the variable que_rear uses to determine what the "rear" is) is a meaningless garbage variable, in spite of its promising name. In the initializer, it's set to the last index of the initial internal array (INITIAL_CAPACITY - 1) and never gets touched again, so it's pure luck if it prints out the rear or anything remotely related to the rear.
The true rear is actually the variable back_ind, which is computed on the spot whenever enqueue is called, which is the only time it matters where the back is. Typically, queue data structures don't permit access to the back or rear (if they did, that would make them a deque, or double-ended queue), so all of this is irrelevant and implementation-specific from the perspective of the client (the code which is using the class to do a task as a black box, without caring how it works).
Here's a function that gives you the actual rear. Unsurprisingly, it's pretty much a copy of part of enqueue:
def queue_rear(self):
    if self.is_empty():
        raise Exception("Queue is empty")
    back_ind = (self.front_ind + self.num_of_elems - 1) % len(self.data)
    return self.data[back_ind]
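With that method added to the class, the test sequence from the question gives the answer you expected: front_ind is 1 and num_of_elems is 8, so back_ind wraps around to index 0, which holds the 9.

>>> q.queue_rear()
9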
Also, I understand this class is likely for educational purposes, but I'm obliged to mention that in a real application you should use collections.deque for all your queueing needs (unless you need a synchronized queue).
Interestingly, CPython doesn't use a circular array to implement the deque, but Java does in its ArrayDeque class, which is worth a read.
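For reference (an illustration, not from the original answer), collections.deque gives you the queue operations, plus direct access to both ends, out of the box:

from collections import deque

q = deque()
q.append(1)        # enqueue
q.append(2)
q.append(3)
q.popleft()        # dequeue, returns 1
print(q[0])        # front item -> 2
print(q[-1])       # rear item  -> 3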

How can I make my program use multiple cores of my system in Python?

I want to run my program on all the cores I have. Here is the code I used (it is part of my full program; I have managed to write out the working flow):
def ssmake(data):
    sslist = []
    for cols in data.columns:
        sslist.append(cols)
    return sslist

def scorecal(slisted):
    subspaceScoresList = []
    if __name__ == '__main__':
        pool = mp.Pool(4)
        feature, FinalsubSpaceScore = pool.map(performDBScan, ssList)
        subspaceScoresList.append([feature, FinalsubSpaceScore])
        #for feature in ssList:
        #    FinalsubSpaceScore = performDBScan(feature)
        #    subspaceScoresList.append([feature, FinalsubSpaceScore])
    return subspaceScoresList

def performDBScan(subspace):
    minpoi = 2
    Epsj = 2
    final_data = df[subspace]
    db = DBSCAN(eps=Epsj, min_samples=minpoi, metric='euclidean').fit(final_data)
    labels = db.labels_
    FScore = calculateSScore(labels)
    return subspace, FScore

def calculateSScore(cluresult):
    score = random.randint(1, 21) * 5
    return score

def StartingFunction(prvscore, curscore, fe_select, df):
    while prvscore <= curscore:
        featurelist = ssmake(df)
        scorelist = scorecal(featurelist)

a = {'a': [1, 2, 3, 1, 2, 3], 'b': [5, 6, 7, 4, 6, 5], 'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry']}
df2 = pd.DataFrame(a)
previous = 0
current = 0
dim = []
StartingFunction(previous, current, dim, df2)
The scorecal(slisted) method had a for loop (commented out above) that takes each column, performs DBSCAN on it, and calculates a score for that column based on the result (I use a random score in this example). That loop makes my code run for a long time, so I tried to parallelize the DBSCAN calls across the columns of the DataFrame on the cores of my system and wrote the code in the fashion above, which does not give the result I need. I am new to the multiprocessing library and was not sure where to place '__main__' in my program. I would also like to know if there is any other way in Python to run things in parallel. Any help is appreciated.
Your code has everything needed to run on more than one core of a multi-core processor, but it is a mess. I don't know what problem you are trying to solve with it, and I cannot run it since I don't know what DBSCAN is. To fix your code you should take several steps.
Function scorecal():
def scorecal(feature_list):
    pool = mp.Pool(4)
    result = pool.map(performDBScan, feature_list)
    return result
result is a list containing all the results returned by performDBScan(). You don't have to populate the list manually.
Main body of the program:
# imports
# functions

if __name__ == '__main__':
    # your code after the functions' definitions, where you call StartingFunction()
I created a very simplified version of your code (a pool with 4 processes handling 8 columns of my data) with dummy for loops (to get a CPU-bound operation) and tried it. I got 100% CPU load (I have a 4-core i5 processor), which naturally resulted in roughly 4x faster computation (20 seconds vs 74 seconds) compared with a single-process implementation using a for loop.
EDIT.
The complete code I used to try multiprocessing (I use Anaconda (Spyder) / Python 3.6.5 / Win10):
import multiprocessing as mp
import pandas as pd
import time

def ssmake():
    pass

def score_cal(data):
    if True:
        pool = mp.Pool(4)
        result = pool.map(
            perform_dbscan,
            (data.loc[:, col] for col in data.columns))
    else:
        result = list()
        for col in data.columns:
            result.append(perform_dbscan(data.loc[:, col]))
    return result

def perform_dbscan(data):
    assert isinstance(data, pd.Series)
    for dummy in range(5 * 10 ** 8):
        dummy += 0
    return data.name, 101

def calculate_score():
    pass

def starting_function(data):
    print(score_cal(data))

if __name__ == '__main__':
    data = {
        'a': [1, 2, 3, 1, 2, 3],
        'b': [5, 6, 7, 4, 6, 5],
        'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry'],
        'd': [1, 1, 1, 1, 1, 1]}
    data = pd.DataFrame(data)
    start = time.time()
    starting_function(data)
    print(
        'running time = {:.2f} s'
        .format(time.time() - start))

Get rid of zombie processes

I'm having trouble getting rid of some zombie processes. I've read some of the other answers to this problem and, from what I gather, it occurs when your child processes do not close correctly. I wasn't having this problem until I added a while loop to my code. Take a look:
import multiprocessing
import subprocess

def worker(self):
    cmd = ["/home/orlando/CountMem", "400000000", "2000"]
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    id_list = []
    id_list.append(p.pid)
    while len(id_list) > 0:
        for num in id_list:
            stat_file = open("/proc/{0}/status".format(num))
            mem_dict = {}
            for i, line in enumerate(stat_file):
                if i == 3:
                    #print line
                    mem_dict['ID'] = line
                    print(mem_dict)
                if i == 10:
                    #print line
                    mem_dict['Mem'] = line
                    print(mem_dict)
    return id_list

if __name__ == '__main__':
    count = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes = count)
    print(pool.map(worker, ['ls'] * count))
My code loops through the "/proc/PID/status" file of each child process multiple times, grabbing information. Without the while loop it doesn't spawn zombie processes, but it also doesn't do what I want. With the loop it does what I want, but it also spawns zombie processes. My question is: how do I keep my code from spawning zombies? Below is some of the output I get:
{'ID': 'Pid:\t2446\n'}
{'ID': 'Pid:\t2441\n'}
{'Mem': 'VmPeak:\t 936824 kB\n', 'ID': 'Pid:\t2446\n'}
{'Mem': 'VmPeak:\t 542360 kB\n', 'ID': 'Pid:\t2441\n'}
{'ID': 'Pid:\t2442\n'}
{'Mem': 'VmPeak:\t 1037580 kB\n', 'ID':
This continues until the child processes are complete, then it immediately begins printing this:
{'ID': 'Pid:\t2602\n'}
{'ID': 'Pid:\t2607\n'}
{'ID': 'Pid:\t2606\n'}
{'ID': 'Pid:\t2604\n'}
{'ID': 'Pid:\t2605\n'}
{'Mem': 'Threads:\t1\n', 'ID': 'Pid:\t2606\n'}
{'Mem': 'Threads:\t1\n', 'ID': 'Pid:\t2607\n'}
{'Mem': 'Threads:\t1\n', 'ID': 'Pid:\t2605\n'}
{'Mem': 'Threads:\t1\n', 'ID': 'Pid:\t2604\n'}
Can anyone help me understand and solve what is happening?
I figured out the answer: I needed to add p.poll(), and I added it inside the while loop.
import multiprocessing
import subprocess

def worker(self):
    cmd = ["/home/orlando/CountMem", "400000000", "2000"]
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    id_list = []
    id_list.append(p.pid)
    while len(id_list) > 0:
        for num in id_list:
            stat_file = open("/proc/{0}/status".format(num))
            mem_dict = {}
            for i, line in enumerate(stat_file):
                if i == 3:
                    #print line
                    mem_dict['ID'] = line
                    print(mem_dict)
                if i == 10:
                    #print line
                    mem_dict['Mem'] = line
                    print(mem_dict)
        p.poll()
    return id_list

if __name__ == '__main__':
    count = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes = count)
    print(pool.map(worker, ['ls'] * count))
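A side note (not from the original answer): Popen.poll() reaps the child only after it has exited, so an equivalent and slightly more explicit fix is to wait for the process once the monitoring loop is done. The command below is just a placeholder:

import subprocess

p = subprocess.Popen(["sleep", "1"], stdout=subprocess.PIPE)
# ... monitor /proc/<pid>/status while the child runs ...
p.wait()  # blocks until the child exits and reaps it, so no zombie is left behind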
