locks needed for multithreaded python scraping? - multithreading

I have a list of zipcodes that I want to pull business listings for using the Yelp Fusion API. Each zipcode will require at least one API call (often many more), so I want to be able to keep track of my API usage, as the daily limit is 25000. I have defined each zipcode as an instance of a user-defined Locale class. This Locale class has a class variable Locale.pulls, which acts as a global counter for the number of pulls.
I want to multithread this using multiprocessing.dummy (the thread-based interface of the multiprocessing module), but I am not sure if I need to use locks and, if so, how I would do so. The concern is race conditions, as I need to be sure each thread sees the current number of pulls held in the Locale.pulls class variable in the pseudocode below.
import multiprocessing.dummy as mt

class Locale():
    pulls = 0
    MAX_PULLS = 20000

    def __init__(self, x, y):
        # initialize the instance with the arguments needed to complete the API call
        pass

    def pull(self):
        if Locale.pulls > Locale.MAX_PULLS:
            return None
        else:
            # make the request, store the returned data and increment the counter
            self.data = self.call_yelp()
            Locale.pulls += 1

def main():
    # zipcodes below is a list of the arguments needed to initialize each zipcode as a Locale object
    pool = mt.Pool(len(zipcodes) // 100)  # let each thread work on roughly 100 zipcodes
    data = pool.map(Locale, zipcodes)

A simple solution would be to check that len(zipcodes) < MAX_PULLS before running the map().
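For completeness, here is a minimal sketch (my own, not from the question) of how the shared counter could be protected with a threading.Lock, since multiprocessing.dummy uses threads; call_yelp and the zipcode argument are assumed from the pseudocode above:

import multiprocessing.dummy as mt
import threading

class Locale():
    pulls = 0
    MAX_PULLS = 20000
    _lock = threading.Lock()  # guards the shared class-level counter

    def __init__(self, zipcode):
        self.zipcode = zipcode  # whatever is needed to build the API request
        self.data = None

    def pull(self):
        # Reserve one pull while holding the lock, so no two threads can
        # read and increment the counter at the same time.
        with Locale._lock:
            if Locale.pulls >= Locale.MAX_PULLS:
                return None
            Locale.pulls += 1
        # The network call itself happens outside the lock, so threads still overlap on I/O.
        self.data = self.call_yelp()
        return self.data

With CPython's GIL a bare Locale.pulls += 1 rarely misbehaves in practice, but it is not guaranteed to be atomic, so the lock is the safe choice.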

Related

Shared memory and how to access a global variable from within a class in Python, with multiprocessing?

I am currently developing some code that deals with big multidimensional arrays. Of course, Python gets very slow if you try to perform these computations in a serial manner. Therefore, I got into code parallelization, and one of the possible solutions I found has to do with the multiprocessing library.
What I have come up with so far is first dividing the big array into smaller chunks and then doing some operation on each of those chunks in a parallel fashion, using a Pool of workers from multiprocessing. For that to be efficient, and based on this answer, I believe that I should use a shared-memory array object defined as a global variable, to avoid copying it every time a process from the pool is called.
Here I add some minimal example of what I'm trying to do, to illustrate the issue:
import numpy as np
from functools import partial
import multiprocessing as mp
import ctypes

class Trials:
    # Perform computation along first dimension of shared array, representing the chunks
    def Compute(i, shared_array):
        shared_array[i] = shared_array[i] + 2

    # The function you actually call
    def DoSomething(self):
        # Initializer function for Pool, should define the global variable shared_array
        # I have also tried putting this function outside DoSomething, as a part of the class,
        # with the same results
        def initialize(base, State):
            global shared_array
            shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100) + State

        base = mp.Array(ctypes.c_float, 125*100*100)  # Create base array
        state = np.random.rand(125, 100, 100)  # Create seed

        # Initialize pool of workers and perform calculations
        with mp.Pool(processes=10,
                     initializer=initialize,
                     initargs=(base, state,)) as pool:
            run = partial(self.Compute,
                          shared_array=shared_array)  # Here the error says that shared_array is not defined
            pool.map(run, np.arange(125))
            pool.close()
            pool.join()
        print(shared_array)

if __name__ == '__main__':
    Trials = Trials()
    Trials.DoSomething()
The trouble I am encountering is that when I define the partial function, I get the following error:
NameError: name 'shared_array' is not defined
From what I understand, that means I cannot access the global variable shared_array. I am sure the initialize function is executing, because putting a print statement inside it produces output in the terminal.
What am I doing incorrectly, and is there any way to solve this issue?
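For illustration only, here is a minimal sketch of the pattern the question seems to be aiming for: the worker function reads the module-level global set by the pool initializer instead of receiving shared_array through partial (this is an assumed fix based on the description above, not code from the question):

import ctypes
import multiprocessing as mp
import numpy as np

def initialize(base):
    # Runs once in every worker process: attach a NumPy view to the shared buffer.
    global shared_array
    shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100)

def compute(i):
    # Only the index i is pickled per task; the big array is reached through the global.
    shared_array[i] = shared_array[i] + 2

if __name__ == '__main__':
    base = mp.Array(ctypes.c_float, 125 * 100 * 100)
    state = np.random.rand(125, 100, 100)
    # Seed the shared buffer once, in the parent, before any worker starts.
    np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100)[:] = state
    with mp.Pool(processes=10, initializer=initialize, initargs=(base,)) as pool:
        pool.map(compute, range(125))

The key difference from the code above is that nothing named shared_array is referenced in the parent process when building the task, which is what triggered the NameError.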

Python multiprocess: run several instances of a class, keep all child processes in memory

First, I'd like to thank the StackOverflow community for the tremendous help it provided me over the years, without me having to ask a single question.
I could not find anything that I can relate to my problem, though it is probably due to my lack of understanding of the subject, rather than the absence of a response on the website. My apologies in advance if this is a duplicate.
I am relatively new to multiprocess; some time ago I succeeded in using multiprocessing.pools in a very simple way, where I didn't need any feedback between the child processes.
Now I am facing a much more complicated problem, and I am just lost in the documentation about multiprocessing. I hence ask for your help, your kindness and your patience.
I am trying to build a parallel tempering monte-carlo algorithm, from a class.
The basic class very roughly goes as follows:
import numpy as np

class monte_carlo:

    def __init__(self):
        self.x = np.ones((1000, 3))
        self.E = np.mean(self.x)
        self.Elist = []

    def simulation(self, temperature):
        self.T = temperature
        for i in range(3000):
            self.MC_step()
            if i % 10 == 0:
                self.Elist.append(self.E)
        return

    def MC_step(self):
        x = self.x.copy()
        k = np.random.randint(1000)
        x[k] = (x[k] + np.random.uniform(-1, 1, 3))
        temp_E = np.mean(self.x)
        if np.random.random() < np.exp((self.E - temp_E) / self.T):
            self.E = temp_E
            self.x = x
        return
Obviously, I simplified a great deal (the actual class is 500 lines long!) and built fake functions for simplicity: __init__ takes a bunch of parameters as arguments, there are many more lists of measurements besides self.Elist, and also many arrays derived from self.x that I use to compute them. The key point is that each instance of the class contains a lot of information that I want to keep in memory, and that I don't want to copy over and over again, to avoid dramatic slowdowns. Otherwise I would just use the multiprocessing.pool module.
Now, the parallelization I want to do, in pseudo-code:
def proba(dE, pT):
    return np.exp(-dE/pT)

Tlist = [1.1, 1.2, 1.3]
N = len(Tlist)
G = []
for _ in range(N):
    G.append(monte_carlo())

for _ in range(5):
    for i in range(N):  # this loop should be run in multiprocess
        G[i].simulation(Tlist[i])

    for i in range(N//2):
        dE = G[i].E - G[i+1].E
        pT = G[i].T + G[i+1].T
        p = proba(dE, pT)  # (proba is a function, giving a probability depending on dE)
        if np.random.random() < p:
            T_temp = G[i].T
            G[i].T = G[i+1].T
            G[i+1].T = T_temp
Synthesis: I want to run several instances of my monte_carlo class in parallel child processes, with different values for a parameter T, then periodically pause everything to change the different T's, and resume the child processes/class instances from where they paused.
Doing this, I want each class instance/child process to stay independent from the others, to save its current state with all internal variables while it is paused, and to make as few copies as possible. This last point is critical, as the arrays inside the class are quite big (some are 1000x1000), and copying them would therefore very quickly become quite time-costly.
Thanks in advance, and sorry if I am not clear...
Edit:
I am using a remote machine with many (64) CPUs, running Debian GNU/Linux 10 (buster).
Edit2:
I made a mistake in my original post: in the end, the temperatures must be exchanged between the class-instances, and not inside the global Tlist.
Edit3: Charchit's answer works perfectly for the test code, on both my personal machine and the remote machine I usually use for running my code. I therefore mark it as the accepted answer.
However, I want to report here that when I insert the actual, more complicated code instead of the oversimplified monte_carlo class, the remote machine gives me some strange errors:
Unable to init server: Could not connect: Connection refused
(CMC_temper_all.py:55509): Gtk-WARNING **: ##:##:##:###: Locale not supported by C library.
Using the fallback 'C' locale.
Unable to init server: Could not connect: Connection refused
(CMC_temper_all.py:55509): Gdk-CRITICAL **: ##:##:##:###:
gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
(CMC_temper_all.py:55509): Gdk-CRITICAL **: ##:##:##:###: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
The "##:##:##:###" are (or seems like) IP adresses.
Without the call to set_start_method('spawn') this error shows only once, in the very beginning, while when I use this method, it seems to show at every occurrence of result.get()...
The strangest thing is that the code seems otherwise to work fine, does not crash, produces the datafiles I then ask it to, etc...
I think this would deserve to publish a new question, but I put it here nonetheless in case someone has a quick answer.
If not, I will resort to adding, one by one, the variables, methods, etc. that are present in my actual code but not in the test example, to try to find the origin of the bug. My best guess for now is that the memory space required by each child process with the actual code is too large for the remote machine to accept, due to some restrictions implemented by the admin.
What you are looking for is sharing state between processes. As per the documentation, you can either create shared memory, which is restrictive about the data it can store and is not thread-safe, but offers better speed and performance; or you can use server processes through managers. The latter is what we are going to use, since you want to share whole objects of user-defined datatypes. Keep in mind that using managers will impact the speed of your code, depending on the complexity of the arguments that you pass to, and receive from, the managed objects.
Managers, proxies and pickling
As mentioned, managers create server processes to store objects and allow access to them through proxies. I have answered a question with more details on how they work, and how to create a suitable proxy, here. We are going to use the same proxy defined in the linked answer, with some variations. Namely, I have replaced the factory functions inside __getattr__ with something that can be pickled using pickle. This means that you can run instance methods of managed objects created with this proxy without resorting to using multiprocess. The result is this modified proxy:
from multiprocessing.managers import NamespaceProxy, BaseManager
import types
import numpy as np

class A:
    def __init__(self, name, method):
        self.name = name
        self.method = method

    def get(self, *args, **kwargs):
        return self.method(self.name, args, kwargs)

class ObjProxy(NamespaceProxy):
    """Returns a proxy instance for any user defined data-type. The proxy instance will have the namespace and
    functions of the data-type (except private/protected callables/attributes). Furthermore, the proxy will be
    picklable and its state can be shared among different processes."""

    def __getattr__(self, name):
        result = super().__getattr__(name)
        if isinstance(result, types.MethodType):
            return A(name, self._callmethod).get
        return result
Solution
Now we only need to make sure that when we are creating objects of monte_carlo, we do so using managers and the above proxy. For that, we create a class constructor called create. All objects for monte_carlo should be created with this function. With that, the final code looks like this:
from multiprocessing import Pool
from multiprocessing.managers import NamespaceProxy, BaseManager
import types
import numpy as np

class A:
    def __init__(self, name, method):
        self.name = name
        self.method = method

    def get(self, *args, **kwargs):
        return self.method(self.name, args, kwargs)

class ObjProxy(NamespaceProxy):
    """Returns a proxy instance for any user defined data-type. The proxy instance will have the namespace and
    functions of the data-type (except private/protected callables/attributes). Furthermore, the proxy will be
    picklable and its state can be shared among different processes."""

    def __getattr__(self, name):
        result = super().__getattr__(name)
        if isinstance(result, types.MethodType):
            return A(name, self._callmethod).get
        return result

class monte_carlo:

    def __init__(self):
        self.x = np.ones((1000, 3))
        self.E = np.mean(self.x)
        self.Elist = []
        self.T = None

    def simulation(self, temperature):
        self.T = temperature
        for i in range(3000):
            self.MC_step()
            if i % 10 == 0:
                self.Elist.append(self.E)
        return

    def MC_step(self):
        x = self.x.copy()
        k = np.random.randint(1000)
        x[k] = (x[k] + np.random.uniform(-1, 1, 3))
        temp_E = np.mean(self.x)
        if np.random.random() < np.exp((self.E - temp_E) / self.T):
            self.E = temp_E
            self.x = x
        return

    @classmethod
    def create(cls, *args, **kwargs):
        # Register class
        class_str = cls.__name__
        BaseManager.register(class_str, cls, ObjProxy, exposed=tuple(dir(cls)))
        # Start a manager process
        manager = BaseManager()
        manager.start()
        # Create and return this proxy instance. Using this proxy allows sharing of state between processes.
        inst = eval("manager.{}(*args, **kwargs)".format(class_str))
        return inst

def proba(dE, pT):
    return np.exp(-dE/pT)

if __name__ == "__main__":
    Tlist = [1.1, 1.2, 1.3]
    N = len(Tlist)
    G = []

    # Create our managed instances
    for _ in range(N):
        G.append(monte_carlo.create())

    for _ in range(5):
        # Run simulations in the manager server
        results = []
        with Pool(8) as pool:
            for i in range(N):  # this loop should be run in multiprocess
                results.append(pool.apply_async(G[i].simulation, (Tlist[i], )))

            # Wait for the simulations to complete
            for result in results:
                result.get()

        for i in range(N // 2):
            dE = G[i].E - G[i + 1].E
            pT = G[i].T + G[i + 1].T
            p = proba(dE, pT)  # (proba is a function, giving a probability depending on dE)
            if np.random.random() < p:
                T_temp = Tlist[i]
                Tlist[i] = Tlist[i + 1]
                Tlist[i + 1] = T_temp

    print(Tlist)
This meets the criteria you wanted. It does not create any copies at all; rather, all arguments to the simulation method call are serialized inside the pool and sent to the manager server, where the object is actually stored. The method is executed there, and the results (if any) are serialized and returned to the main process. All of this using only the standard library!
Output
[1.2, 1.1, 1.3]
Edit
Since you are using Linux, I encourage you to use multiprocessing.set_start_method inside the if __name__ ... clause to set the start method to "spawn". Doing this will ensure that the child processes do not have access to variables defined inside the clause.
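A minimal sketch of where that call would go, assuming the driver code from the answer above:

import multiprocessing

if __name__ == "__main__":
    # Must run before any Pool or manager is created in this clause.
    multiprocessing.set_start_method("spawn")
    ...  # the Tlist / monte_carlo.create() / Pool code shown earlier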

How to use global variables in django

I use django_rest_framework. I need to parse a huge XML file and then find some data in it.
Parsing the XML on every GET call is a bad idea because it takes too long. So I try to parse it once every 3 minutes and save the parsed object to a global variable, but I'm not sure it works correctly.
Example:
class MyView(APIView):
    catalog = None
    parse_time = 0

    @classmethod
    def get_respect_catalog(cls):
        if time.time() - cls.parse_time > 300:
            cls.catalog = parse_xml()
            cls.parse_time = time.time()
        return cls.catalog

    def get(self, request):
        vals = self.get_respect_catalog().xpath('tag/text()')  # here I find some tag
        ...
        return response
I sent several requests, but many times the variable parse_time had the value 0, as if the class MyView were sometimes recreated and the class variables catalog and parse_time reset to their initial values.
I think this is because uwsgi has many workers, and therefore many interpreters. Maybe there is a way to use global variables in Django.
P.S. I know that for my case I should use a database, but I want to use global vars.
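For context only, one commonly suggested alternative (an assumption on my part, not from the question) is to put the parsed object behind Django's cache framework, so that all uwsgi workers read the same entry instead of each keeping its own class variable; this requires a shared cache backend such as Memcached or Redis, and the cached object must be picklable:

from django.core.cache import cache
from rest_framework.views import APIView

class MyView(APIView):

    def get_respect_catalog(self):
        # cache.get/cache.set go through the configured cache backend,
        # which is shared by every worker process.
        catalog = cache.get("parsed_catalog")
        if catalog is None:
            catalog = parse_xml()  # hypothetical parser from the question
            cache.set("parsed_catalog", catalog, timeout=300)  # re-parse at most every 5 minutes
        return catalog

The per-process class variables in the question are not wrong as such; they simply live separately in each uwsgi worker, which is why parse_time appears to reset.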

Using manual deepcopy on cython classes causes memory overflow. Why?

I am developing an intelligent agent for board games using the MCTS algorithm.
Monte Carlo tree search (MCTS) is a popular method in AI which is mostly used for games (like Go, chess, ...). In this method, an agent builds a tree based on the states that would result from choosing the moves allowed in the current state. The agent is allowed to search through the tree for a limited time; in this period, the agent expands the tree towards the nodes which are most promising (for winning the game).
For more information you can check this link:
1 - http://www.cameronius.com/research/mcts/about/index.html
In the root node of the tree there is a variable rootstate which holds the current state of the game. A deepcopy of rootstate is used to simulate the tree states (future states) as we go deeper into the tree.
I used this code for the deepcopy of the gamestate class, because deepcopy doesn't work well with Cython objects due to their issues with the pickle protocol:
cdef class gamestate:
    # ... other functions

    def __deepcopy__(self, memo_dictionary):
        res = gamestate(self.size)
        res.PLAYERS = self.PLAYERS
        res.size = int(self.size)
        res.board = np.array(self.board, dtype=np.int32)
        res.white_groups = deepcopy(self.white_groups)  # a module which checks if white player has won the game
        res.black_groups = deepcopy(self.black_groups)  # a module which checks if black player has won the game
        # the black_groups and white_groups are also cython objects which the same deepcopy function is implemented for them
        # .... etc
        return res
Whenever an MCTS iteration starts, a deepcopy of the state is stored in memory.
The problem is that at the beginning of the game the number of iterations per second is between 2000 and 3000, which is expected, but as the game tree expands, the iterations per second drop to 1. It gets even worse as each iteration takes more and more time to complete. When I checked the memory usage, I noticed that it increases from 0.6 percent to 90 percent each time I call the agent to search. I implemented the same algorithm in pure Python and it has no issues of this type, so I guess the __deepcopy__ function causes the problem. It was once suggested to me here that I write my own pickle protocol for Cython objects, but I am not very familiar with the pickle module.
Can anyone suggest a protocol to use for my Cython objects to get rid of this obstacle?
Edit 2:
I am adding some parts of the code which might help.
The code below is the __deepcopy__ of the unionfind class, which is used for white_groups and black_groups in gamestate:
cdef class unionfind:
    cdef public:
        dict parent
        dict rank
        dict groups
        list ignored

    def __init__(self):
        # initialize variables ...
        pass

    def __deepcopy__(self, memo_dictionary):
        res = unionfind()
        res.parent = self.parent
        res.rank = self.rank
        res.groups = self.groups
        res.ignored = self.ignored
        return res
This is the search function, which runs during the allowed time:
cdef class mctsagent:

    def search(self, time_budget):
        cdef int num_rollouts = 0
        while num_rollouts < time_budget:
            state_copy = deepcopy(self.rootstate)
            node, state = self.select_node(state_copy)  # expansion runs inside the select_node function
            turn = state.turn()
            outcome = self.roll_out(state)
            self.backup(node, turn, outcome)
            num_rollouts += 1
The issue is probably these lines:
res.white_groups = deepcopy(self.white_groups) # a module which checks if white player has won the game
res.black_groups = deepcopy(self.black_groups) # a module which checks if black player has won the game
What you should be doing is calling deepcopy with the second argument memo_dictionary. This is deepcopy's record of whether it has already copied an object. Without it, deepcopy ends up copying the same object multiple times (hence the huge memory use):
res.white_groups = deepcopy(self.white_groups, memo_dictionary) # a module which checks if white player has won the game
res.black_groups = deepcopy(self.black_groups, memo_dictionary) # a module which checks if black player has won the game
If the __deepcopy__() implementation needs to make a deep copy of a component, it should call the deepcopy() function with the component as first argument and the memo dictionary as second argument.
(edit: just seen that @Blckknght already pointed this out in the comments)
(edit2: unionfind looks to mainly contain Python objects. There probably isn't a huge value in it being a cdef class rather than a normal class. Also, your current __deepcopy__ for it doesn't actually make a copy of those dictionaries - you should be doing res.parent = deepcopy(self.parent, memo_dictionary) etc. If you just made it a normal class this would be handled automatically.)
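Putting those two points together, a rough sketch of what unionfind.__deepcopy__ might look like with the memo dictionary threaded through (field names as in the question; this is illustrative, not tested against the full project):

from copy import deepcopy

cdef class unionfind:
    cdef public:
        dict parent
        dict rank
        dict groups
        list ignored

    def __deepcopy__(self, memo_dictionary):
        res = unionfind()
        # Passing the memo dictionary lets deepcopy reuse objects it has
        # already copied instead of duplicating them on every call.
        res.parent = deepcopy(self.parent, memo_dictionary)
        res.rank = deepcopy(self.rank, memo_dictionary)
        res.groups = deepcopy(self.groups, memo_dictionary)
        res.ignored = deepcopy(self.ignored, memo_dictionary)
        return res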

GObject.timeout_add stops running unexpectedly

I am working on an Ubuntu Appindicator that displays the value of a JSON API call every X seconds.
The issue is that, randomly, it will stop calling self.loop without any error or warning. It can run for days or only for hours. I've set up debug statements (in development), and it always stops running after the loop function is called.
It's as if I were returning False or not returning from the function, even though the logic in this code should always return True.
Here is the documentation for GObject.timeout_add (for GTK2 but the principle stands).
I'm not sure if it's dependent on the PyGTK version. I've had it happen in Ubuntu 16.04 and Ubuntu 17.04.
Here is the full class. The point where the JSON API is called is at result = self.currency.query(). I am happy to give further feedback.
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import GObject

class QueryLoop():
    """QueryLoop accepts the indicator to which the result will be written and a currency to obtain the results from.
    To define a currency you only need to implement the query method and return the results in a pre-determined
    format so that it will be consistent."""

    def __init__(self, indicator, currency, timeout=5000):
        """
        Initialize the query loop with the indicator and the currency to get the data from.
        :param indicator: An instance of an indicator
        :param currency: An instance of a currency to get the information from
        :param timeout: The interval between requests to the currency API
        """
        self.indicator = indicator
        self.currency = currency
        self.timeout = timeout
        self.last_known = {"last": "0.00"}

    def loop(self):
        """Loop calls itself forever and ever, and will consult the currency for the most current value and update
        the indicator's label content."""
        result = self.currency.query()
        if result is not None:
            self.indicator.set_label("{} {}".format(result["last"], self.currency.get_ticker()))
            self.last_known = result
        else:
            self.indicator.set_label("Last Known: {} EUR (Error)".format(self.last_known["last"]))
        return True

    def start(self):
        """Starts the query loop; it does not do anything else. It exists merely for naming clarity
        when initializing the loop in the main() point of entry."""
        GObject.timeout_add(self.timeout, self.loop)
        self.loop()
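One thing worth checking (my own speculation, not a confirmed diagnosis) is whether an occasional exception in currency.query() is silently killing the source: if the callback raises, PyGObject prints the traceback (which can be easy to miss in a long-running appindicator) and the callback effectively returns no value, which GLib treats as False and removes the timeout. A defensive rewrite of the loop method above that always returns True would rule that out:

    def loop(self):
        """Same as above, but any unexpected failure is logged and the timeout source is kept alive."""
        try:
            result = self.currency.query()
            if result is not None:
                self.indicator.set_label("{} {}".format(result["last"], self.currency.get_ticker()))
                self.last_known = result
            else:
                self.indicator.set_label("Last Known: {} EUR (Error)".format(self.last_known["last"]))
        except Exception as exc:
            # Swallow the error so the callback still returns True and GLib keeps the timer.
            self.indicator.set_label("Last Known: {} EUR (Error)".format(self.last_known["last"]))
            print("QueryLoop.loop failed:", exc)
        return True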
