I have the following:
from struct import pack_into
from mmap import mmap
from multiprocessing import Pool
mem_label = "packed_ints"
total = 5 * 10**7
def create_mmap(size = total):
    ''' Seems only the Windows version of mmap accepts labels '''
    is_this_pickled = mmap(-1, total * 4, mem_label)

def pack_into_mmap(idx_nums_tup):
    idx, ints_to_pack = idx_nums_tup
    pack_into(str(len(ints_to_pack)) + 'i', mmap(-1, total * 4, mem_label), idx*4*total//2, *ints_to_pack)

if __name__ == '__main__':
    create_mmap()
    ints_to_pack = range(total)
    pool = Pool()
    pool.map(pack_into_mmap, enumerate((ints_to_pack[:total//2], ints_to_pack[total//2:])))
I "hid" the initial mmap inside a function, but I would like to know for certain what is being pickled.
Can I monitor / tap into that information in Python?
I am not certain if there is an easy way to tell what information is pickled and what information is inherited when using multiprocessing.Pool. However, in your code example, I am confident that the is_this_pickled variable is in fact not pickled since it is never passed to the Pool object in any fashion. The underlying mmap object should be inherited by child processes.
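If you want to observe pickling directly, one option is to instrument an object's __reduce__ hook, which pickle (and therefore multiprocessing) calls whenever it serializes that object. A minimal sketch, with hypothetical names (PickleSpy, worker) chosen purely for illustration:

import pickle
from multiprocessing import Pool

class PickleSpy:
    """Wraps a value and reports whenever pickle serializes it."""
    def __init__(self, value):
        self.value = value

    def __reduce__(self):
        # Called by pickle, and hence by multiprocessing, when the object is serialized.
        print("PickleSpy: being pickled")
        return (PickleSpy, (self.value,))

def worker(spy):
    return spy.value * 2

if __name__ == '__main__':
    with Pool(2) as pool:
        print(pool.map(worker, [PickleSpy(21)]))  # the print fires in the parent when the task is sent

Anything that never triggers such a hook was not pickled; in your example nothing large crosses the pipe, because each worker re-opens the mapping by its tag inside pack_into_mmap.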
Related
I am currently developing some code that deals with big multidimensional arrays. Of course, Python gets very slow if you try to perform these computations serially, so I looked into parallelizing the code, and one of the possible solutions I found is the multiprocessing library.
What I have come up with so far is to first divide the big array into smaller chunks and then operate on each of those chunks in parallel, using a Pool of workers from multiprocessing. For that to be efficient, and based on this answer, I believe I should use a shared-memory array object defined as a global variable, to avoid copying it every time a process from the pool is called.
Here is a minimal example of what I'm trying to do, to illustrate the issue:
import numpy as np
from functools import partial
import multiprocessing as mp
import ctypes
class Trials:
    # Perform computation along first dimension of shared array, representing the chunks
    def Compute(i, shared_array):
        shared_array[i] = shared_array[i] + 2

    # The function you actually call
    def DoSomething(self):
        # Initializer function for Pool, should define the global variable shared_array
        # I have also tried putting this function outside DoSomething, as a part of the class,
        # with the same results
        def initialize(base, State):
            global shared_array
            shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100) + State

        base = mp.Array(ctypes.c_float, 125*100*100)  # Create base array
        state = np.random.rand(125, 100, 100)         # Create seed

        # Initialize pool of workers and perform calculations
        with mp.Pool(processes = 10,
                     initializer = initialize,
                     initargs = (base, state,)) as pool:
            run = partial(self.Compute,
                          shared_array = shared_array)  # Here the error says that shared_array is not defined
            pool.map(run, np.arange(125))
            pool.close()
            pool.join()

        print(shared_array)

if __name__ == '__main__':
    Trials = Trials()
    Trials.DoSomething()
The trouble I am encountering is that when I define the partial function, I get the following error:
NameError: name 'shared_array' is not defined
From what I understand, that means I cannot access the global variable shared_array. I am sure that the initialize function is executing, since putting a print statement inside it produces output in the terminal.
What am I doing incorrectly, and is there a way to solve this issue?
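For context, here is a minimal sketch of one way this pattern is usually made to work (not the asker's final code, and simplified by dropping the State offset). The key point: the initializer runs inside each worker process, so the global it defines exists there, not in the parent where partial() is built; the workers therefore have to read the global themselves.

import ctypes
import multiprocessing as mp
import numpy as np

def initialize(base):
    # Runs once in each worker; defines the global in the worker, not in the parent.
    global shared_array
    shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100)

def compute(i):
    # Reads the worker-side global directly, so no array is pickled per task.
    shared_array[i] = shared_array[i] + 2

if __name__ == '__main__':
    base = mp.Array(ctypes.c_float, 125 * 100 * 100)
    with mp.Pool(processes=4, initializer=initialize, initargs=(base,)) as pool:
        pool.map(compute, range(125))
    # The parent sees the writes through its own view of the same shared buffer.
    result = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100)
    print(result[0, 0, 0])  # 2.0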
I cannot use multiprocessing; I need shared memory among entirely separate Python processes on Windows, using Python 3. I've figured out how to do this using mmap, and it works great... when I use simple primitive types. However, I need to pass around more complex information. I've found ctypes.Structure, and it seems to be exactly what I need.
I want to create an array of ctypes.Structure instances, update an individual element within that array and write it back to memory, as well as read an individual element.
import ctypes
import mmap
class Person(ctypes.Structure):
    _fields_ = [
        ('name', ctypes.c_wchar * 10),
        ('age', ctypes.c_int)
    ]

if __name__ == '__main__':
    num_people = 5
    person = Person()
    people = Person * num_people
    mm_file = mmap.mmap(-1, ctypes.sizeof(people), access=mmap.ACCESS_WRITE, tagname="shmem")
Your people is not an array yet; it's still a class (the array type Person * num_people). To get an actual array backed by the mapped memory, instantiate that type with from_buffer(), just like you were doing before with c_int:
PeopleArray = Person * num_people
mm_file = mmap.mmap(-1, ctypes.sizeof(PeopleArray), ...)
people = PeopleArray.from_buffer(mm_file)
people[0].name = 'foo'
people[0].age = 27
people[1].name = 'bar'
people[1].age = 42
...
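For completeness, a hedged sketch of what the other process might look like, assuming the same Person definition and the "shmem" tag used above (tagname is Windows-only, and the writer's mapping must still be alive). It attaches to the existing mapping and reads or updates individual elements in place:

import ctypes
import mmap

class Person(ctypes.Structure):
    _fields_ = [
        ('name', ctypes.c_wchar * 10),
        ('age', ctypes.c_int)
    ]

num_people = 5
PeopleArray = Person * num_people

# Attach to the existing named mapping; the size must match the writer's mapping.
mm_file = mmap.mmap(-1, ctypes.sizeof(PeopleArray), access=mmap.ACCESS_WRITE, tagname="shmem")
people = PeopleArray.from_buffer(mm_file)

print(people[0].name, people[0].age)  # read an individual element
people[1].age = 43                    # update one field in place; visible to the other process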
I am trying to write a Python function that sums a list quickly using parallel computing. Initially I tried the Python threading library, but then I noticed that all threads run on the same CPU, so there is no speed gain; I therefore switched to multiprocessing. In the first version I made the list a global variable:
from multiprocessing import Pool

array = 100000000 * [1]

def sumPart(fromTo: tuple):
    return sum(array[fromTo[0]:fromTo[1]])

with Pool(2) as pool:
    print(sum(pool.map(sumPart, [(0, len(array)//2), (len(array)//2, len(array))])))
This worked well and returned the correct sum in about half the time of a serial computation.
But then I wanted to make it a function that accepts the array as an argument:
def parallelSum(theArray):
    def sumPartLocal(fromTo: tuple):
        return sum(theArray[fromTo[0]:fromTo[1]])
    with Pool(2) as pool:
        return sum(pool.map(sumPartLocal, [(0, len(theArray) // 2), (len(theArray) // 2, len(theArray))]))
Here I got an error:
AttributeError: Can't pickle local object 'parallelSum.<locals>.sumPartLocal'
What is the correct way to write this function?
When scheduling jobs to a Python Pool you need to ensure that both the function and its arguments can be serialized, as they will be transferred over a pipe.
Python uses the pickle protocol to serialize its objects. You can see what can be pickled in the module documentation. In your case, you are hitting this limitation:
functions defined at the top level of a module (using def, not lambda)
Under the hood, the Pool sends the function by reference: its module and qualified name, plus the arguments. A function nested inside parallelSum cannot be looked up that way, because it only exists within parallelSum's local scope, so the pickling step fails.
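You can reproduce the limitation with plain pickle, outside of multiprocessing (a tiny illustration, not from the original post):

import pickle

def top_level():
    return 42

def outer():
    def nested():
        return 42
    return nested

pickle.dumps(top_level)   # fine: pickled by module + qualified name
pickle.dumps(outer())     # AttributeError: Can't pickle local object 'outer.<locals>.nested'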
Move sumPartLocal outside parallelSum and everything will be fine.
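A hedged sketch of that fix (names are illustrative, and it relies on the fork start method, e.g. on Linux, so that workers inherit the module-level global; on platforms that spawn workers you would pass the array explicitly, as the next answer does):

from multiprocessing import Pool

theArray = []  # set by parallelSum before the Pool is created

def sumPartTopLevel(fromTo: tuple):
    return sum(theArray[fromTo[0]:fromTo[1]])

def parallelSum(arr):
    global theArray
    theArray = arr                       # workers forked after this point see it
    half = len(arr) // 2
    with Pool(2) as pool:
        return sum(pool.map(sumPartTopLevel, [(0, half), (half, len(arr))]))

if __name__ == '__main__':
    print(parallelSum(1000 * [1]))       # 1000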
I believe you are hitting this, or see the documentation
What you could do is define sumPartLocal at module level and pass theArray as the third component of your tuple, so it becomes fromTo[2] inside the sumPartLocal function.
Example:
from multiprocessing import Pool

def sumPartLocal(fromTo: tuple):
    return sum(fromTo[2][fromTo[0]:fromTo[1]])

def parallelSum(theArray):
    with Pool(2) as pool:
        return sum(pool.map(sumPartLocal, [
            (0, len(theArray) // 2, theArray),
            (len(theArray) // 2, len(theArray), theArray)
        ]))

if __name__ == '__main__':
    theArray = 100000000 * [1]
    s = parallelSum(theArray)
    print(s)
[EDIT 15-Dec-2017 based on comments]
To anyone who is thinking of multi-threading in Python: I strongly recommend reading up on the Global Interpreter Lock first.
There are also some good answers on this question here on SO.
I have a csv file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code works: it goes through the urls in the csv, scrapes the information and saves it in another csv file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. For each link it takes about 1s to crawl and save the info into the csv, which is too slow for the magnitude of the project. So I have incorporated the threading module, and to my surprise it doesn't speed things up at all; it still takes about 1s per link. Did I do something wrong? Is there another way to speed up the processing?
Without multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")
With multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()
You're not parallelizing this properly. What you actually want to do is have the work being done inside your for loop happen concurrently across many workers. Right now you're moving all the work into one background thread, which does the whole thing synchronously. That's not going to improve performance at all (it will just slightly hurt it, actually).
Here's an example that uses a ThreadPool to parallelize the network operation and parsing. It's not safe to try to write to the csv file across many threads at once, so instead we return the data that would have been written back to the parent, and have the parent write all the results to the file at the end.
import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)
    return placeHolder

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV

    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
Note that in Python, threads only actually speed up I/O operations - because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, because only one thread can do CPU-based operations at a time. So you still may not see the speed up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads; just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
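The single change described above, shown explicitly:

# from multiprocessing.dummy import Pool   # thread-based Pool (original)
from multiprocessing import Pool            # process-based Pool; the rest of the script stays the same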
Edit:
If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit - Pool.map converts the iterable you pass into it to a list prior to sending it off to your workers, which obviously isn't going to work very well with a 10,000,000 entry list; having that whole thing in memory is probably going to bog down your system. Same issue with storing all the results in a list. Instead, you should use Pool.imap:
imap(func, iterable[, chunksize])

A lazier version of map().

The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    chunksize = FILE_LINES // (NUM_WORKERS * 4)  # Try to get a good chunksize. You're probably going to have to tweak this, though; try smaller and larger values and see how performance changes.
    pool = Pool(NUM_WORKERS)

    with open(fileName, "rb") as f:
        result_iter = pool.imap(crawlToCSV, f, chunksize=chunksize)
        with open("Output.csv", "ab") as f:
            writeFile = csv.writer(f)
            for result in result_iter:  # lazily iterate over results.
                writeFile.writerow(result)
With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.
In order to get a Cassandra insert going faster I'm using multithreading. It's working OK, but if I add more threads it doesn't make any difference; I think I'm not generating more connections. I think maybe I should be using pool.execute(f, *args, **kwargs), but I don't know how to use it and the documentation is quite scanty. Here's my code so far:
import connect_to_ks_bp
from connect_to_ks_bp import ks_refs
import time
import pycassa
from datetime import datetime
import json
import threadpool

pool = threadpool.ThreadPool(20)
count = 1
bench = open("benchCassp20_100000.txt", "w")

def process_tasks(lines):
    # let threadpool format your requests into a list
    requests = threadpool.makeRequests(insert_into_cfs, lines)
    # insert the requests into the threadpool
    for req in requests:
        pool.putRequest(req)
    pool.wait()

def read(file):
    """read data from json and insert into keyspace"""
    json_data = open(file)
    lines = []
    for line in json_data:
        lines.append(line)
    print len(lines)
    process_tasks(lines)

def insert_into_cfs(line):
    global count
    count += 1
    if count > 5000:
        bench.write(str(datetime.now()) + "\n")
        count = 1
    #print count
    #print kspool.checkedout()
    """
    user_tweet_cf = pycassa.ColumnFamily(kspool, 'UserTweet')
    user_name_cf = pycassa.ColumnFamily(kspool, 'UserName')
    tweet_cf = pycassa.ColumnFamily(kspool, 'Tweet')
    user_follower_cf = pycassa.ColumnFamily(kspool, 'UserFollower')
    """
    tweet_data = json.loads(line)

    """Format the tweet time as an epoch seconds int value"""
    tweet_time = time.strptime(tweet_data['created_at'], "%a, %d %b %Y %H:%M:%S +0000")
    tweet_time = int(time.mktime(tweet_time))

    new_user_tweet(tweet_data['from_user_id'], tweet_time, tweet_data['id'])
    new_user_name(tweet_data['from_user_id'], tweet_data['from_user_name'])
    new_tweet(tweet_data['id'], tweet_data['text'], tweet_data['to_user_id'])
    if tweet_data['to_user_id'] != 0:
        new_user_follower(tweet_data['from_user_id'], tweet_data['to_user_id'])

"""The 4 functions below carry out the inserts into specific column families"""
def new_user_tweet(from_user_id, tweet_time, id):
    ks_refs.user_tweet_cf.insert(from_user_id, {(tweet_time): id})

def new_user_name(from_user_id, user_name):
    ks_refs.user_name_cf.insert(from_user_id, {'username': user_name})

def new_tweet(id, text, to_user_id):
    ks_refs.tweet_cf.insert(id, {
        'text': text,
        'to_user_id': to_user_id
    })

def new_user_follower(from_user_id, to_user_id):
    ks_refs.user_follower_cf.insert(from_user_id, {to_user_id: 0})

if __name__ == '__main__':
    read('tweets.json')
This is just another file:
import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

"""This is a static class I set up to hold the global database connection stuff;
I only want to connect once, and then the various insert functions will use these fields a lot."""
class ks_refs():
    pool = ConnectionPool('TweetsKS', use_threadlocal=True, max_overflow=-1)

    @classmethod
    def cf_connect(cls, column_family):
        cf = pycassa.ColumnFamily(cls.pool, column_family)
        return cf

ks_refs.user_name_cfo = ks_refs.cf_connect('UserName')
ks_refs.user_tweet_cfo = ks_refs.cf_connect('UserTweet')
ks_refs.tweet_cfo = ks_refs.cf_connect('Tweet')
ks_refs.user_follower_cfo = ks_refs.cf_connect('UserFollower')

# Trying out a batch mutator, which is supposed to increase performance
ks_refs.user_name_cf = ks_refs.user_name_cfo.batch(queue_size=10000)
ks_refs.user_tweet_cf = ks_refs.user_tweet_cfo.batch(queue_size=10000)
ks_refs.tweet_cf = ks_refs.tweet_cfo.batch(queue_size=10000)
ks_refs.user_follower_cf = ks_refs.user_follower_cfo.batch(queue_size=10000)
A few thoughts:
Batch sizes of 10,000 are way too large. Try 100.
Make your ConnectionPool size at least as large as the number of threads, using the pool_size parameter (see the sketch after these notes). The default is 5. Pool overflow should only be used when the number of active threads may vary over time, not when you have a fixed number of threads, because it results in a lot of unnecessary opening and closing of new connections, which is a fairly expensive process.
After you've resolved those issues, look into these:
I'm not familiar with the threadpool library that you're using. Make sure that, if you take the insertions to Cassandra out of the picture, you see an increase in performance when you increase the number of threads.
Python itself has a limit to how many threads may be useful due to the GIL. It shouldn't normally max out at 20, but it might if you're doing something CPU intensive or something that requires a lot of Python interpretation. The test that I described in my previous point will cover this as well. It may be the case that you should consider using the multiprocessing module, but you would need some code changes to handle that (namely, not sharing ConnectionPools, CFs, or hardly anything else between processes).
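Putting the first two suggestions together, a hedged sketch of what the connection and batch setup might look like (parameter values are illustrative, not tuned against your cluster):

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

NUM_THREADS = 20

# One connection per thread; overflow left at its default since the thread count is fixed.
pool = ConnectionPool('TweetsKS',
                      pool_size=NUM_THREADS,
                      use_threadlocal=True)

# A much smaller batch queue than 10,000.
tweet_cf = ColumnFamily(pool, 'Tweet').batch(queue_size=100)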