I am creating a simple web spider. All it does is accept a URL, download the HTML and extract the remaining URLs. It then repeats the process for each new URL. I'm also making sure I don't visit the same URL twice and I am limiting the number of concurrent downloads.
After every unique URL has been exhausted (could run for days, weeks or till after I'm dead and gone), I would like to perform an action, like updating the UI or simply exiting the application.
The problem is, I don't know how to detect when the last thread has finished running.
Has this threading problem been solved? Am I looking at the problem wrong?
One thought was to keep each thread alive until all of its children finished (join). The problem is that the number of threads grows exponentially. For such a long-running process, it would quickly exhaust OS resources.
I'm not sure what language we are talking about so I'll speak generically.
You need a data structure for each URL that keeps track of how many "children" pages get generated from it. Whenever a URL is being spidered, it will have a "parent" data structure. Whenever a new page is found, it is added to the parent's tree count. Whenever a page is spidered, the parent's tree count is decremented. This will need to be done in a synchronized manner since multiple threads will be updating it.
You may actually want to save the entire URL structure. The root URL "http://foo.x/" has links to "/1.html" and "/2.html" so it's children-count is 2. The root URL has a null parent and "1" and "2" have a parent of the root. When "1.html" is spidered then the root's children-count is decremented to 1. But if there are 3 links inside of "1.html" then the root's count gets incremented to 4. If you want to keep track of the tree then "1.html" children count goes to 3, etc.. Then when one of the children of "1.html" gets spidered, the count for "1.html" goes to 2 and the root URL's count goes to 3.
You certainly do not want to be keeping the threads around and then joining later as you mention -- your thread count will explode. You should use a thread pool and submit URLs to be spidered, each with their associated node in the URL tree, to the pool so they can be spidered by the same threads.
When a URL is spidered and its children count goes to 0, you know that you have spidered the whole tree below it, and the URL can be removed from the working list and moved to the done list. Again, these lists will need to be synchronized since multiple threads will be operating on them.
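If a sketch helps: the same bookkeeping can also be collapsed into a single counter of URLs that have been submitted but not yet finished, which is enough to detect when the whole crawl is done. This is a rough Python illustration only; download and extract_links stand in for whatever fetching and parsing code you already have.

import threading
from concurrent.futures import ThreadPoolExecutor

visited = set()                            # URLs already queued, so nothing is visited twice
visited_lock = threading.Lock()
pending = 0                                # URLs submitted but not yet fully spidered
done = threading.Condition()
pool = ThreadPoolExecutor(max_workers=8)   # limits concurrent downloads

def submit(url):
    global pending
    with visited_lock:
        if url in visited:
            return
        visited.add(url)
    with done:
        pending += 1
    pool.submit(spider, url)

def spider(url):
    global pending
    try:
        for link in extract_links(download(url)):   # hypothetical helpers: your fetch + parse
            submit(link)                            # children are counted before the parent finishes
    finally:
        with done:
            pending -= 1
            if pending == 0:                        # the last outstanding URL just finished
                done.notify_all()

def wait_until_finished():
    with done:
        while pending > 0:
            done.wait()
    # update the UI or exit the application here

The caller seeds it with submit(start_url) and then blocks in wait_until_finished(); because children are always counted before their parent's decrement, the counter can only reach zero when no work is left.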
Hope this helps somewhat.
Related
Over 2 years ago, Remy Lebeau gave me invaluable tips on threads in Delphi. His answers were very useful to me and I feel like I made great progress thanks to him. This post can be found here.
Today, I now face a "conceptual problem" about threads. This is not really about code, this is about the approach one should choose for a certain problem. I know we are not supposed to ask for personal opinions; I am merely asking whether, from a technical point of view, one of these approaches must be avoided or whether they are both viable.
My application has a list of unique product numbers (named SKU) in a database. Querying an API with these SKUs, I get back a JSON file containing details about these products. This JSON file is processed and the results are displayed on screen and saved in the database. So, at one step, a download process is involved and it is executed in a worker thread.
I see two different approaches possible for this whole procedure:
When the user clicks on the start button, a query is fired, building a list of SKUs based on the user criteria. A TStringList is then built and, for each element of the list, a thread is launched that downloads the JSON, sends the result back to the main thread and terminates.
When the user clicks on the start button, a query is fired, building a list of SKUs based on the user criteria. Instead of sending SKU numbers one after another to the worker thread, the whole list is sent, and the worker thread iterates through the list, sending back results for displaying and saving to the main thread (via a synchronize event). So we only have one worker thread working the whole list before terminating.
I have coded these two different approaches and they both work... with each their downsides that I have experienced.
I am not a professional developer, this is a hobby and, before working my way further down one path or the other for "polishing", I would like to know whether, from a technical point of view and according to your knowledge and experience, one of the approaches I depicted should be avoided, and why.
Thanks for your time
Mathias
Another thing to consider in this case is latency to your API that is producing the JSON. For example, if it takes 30 msec to go back and forth to the server, and 0.01 msec to create the JSON on the server, then querying a single JSON record per request, even if each request is in a different thread, does not make much sense. In that case, it would make sense to do fewer requests to the server, returning more data on each request, and partition the results up among different threads.
The other thing is that threads are not a solution to every problem. I would question why you need to break each SKU into a single thread. How long is each individual thread running, and how much processing is each thread doing? In general, creating lots of threads, each working for a fraction of a msec, does not make sense. You want the threads to be alive for as long as possible, processing as much data as they can for the job. You don't want the computer to be using as much time creating/destroying threads as actually doing useful work.
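To make the "fewer, bigger requests split across a few long-lived threads" idea concrete, here is a rough sketch (in Python rather than Delphi, purely for illustration; fetch_batch and handle_product are placeholders for your own API call and result processing):

from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 50                                   # SKUs requested per API call, amortizing the round trip

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def process_batch(batch):
    products = fetch_batch(batch)                 # hypothetical: one request returning many products
    return [handle_product(p) for p in products]  # hypothetical: parse/prepare a single product

def run(skus):
    # A handful of long-lived worker threads, each processing whole batches,
    # instead of one short-lived thread per SKU.
    with ThreadPoolExecutor(max_workers=4) as pool:
        batches = list(pool.map(process_batch, chunks(skus, BATCH_SIZE)))
    return [item for batch in batches for item in batch]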
I have this concurrent pattern that came up when trying to model my problem, and I don't know if there's a name for it. Having a design pattern reference or something like that could help me implement it more safely.
Concept:
The foreman (main thread) is asked to look for a series of objects in a big warehouse.
This warehouse has n floors. The foreman has a team of n workers (helper threads), each with a dedicated floor.
The foreman receives an object, and asks every worker to find it.
If a worker finds it on their floor, they return to the foreman with appropriate information. (location, status...)
The foreman then calls back all the other workers (since the item has been found there's no need for more searching), and moves on to the next object.
If everyone comes back saying "No it's not on my floor" we can act accordingly. (signal a missing product to management...)
The main problem I have is that I need to make sure threads don't waste calculation time when the item has already been found, and to ensure proper coordination.
I also can't give every thread the entire list of things to find, since this information is received item by item (e.g. via network).
Are you looking for the Observer pattern?
Once a worker finds the item and returns to the Foreman, the Foreman should notify all the workers that the item has been found, so all the threads stop searching and return.
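One simple way to get that notification is a shared flag the workers poll while they search. A minimal Python sketch of the idea (the floors are assumed here to be plain lists of (location, item) pairs, which is made up for illustration):

import threading

class Foreman:
    def __init__(self, floors):
        self.floors = floors                    # one searchable data set per worker/floor
        self.found = threading.Event()          # set as soon as any worker finds the item
        self.result = None
        self.result_lock = threading.Lock()

    def _search_floor(self, floor, item):
        for location, stored in floor:
            if self.found.is_set():             # someone else found it: stop wasting time
                return
            if stored == item:
                with self.result_lock:
                    self.result = location
                self.found.set()                # "call back" all the other workers
                return

    def find(self, item):
        self.found.clear()
        self.result = None
        workers = [threading.Thread(target=self._search_floor, args=(floor, item))
                   for floor in self.floors]
        for w in workers:
            w.start()
        for w in workers:
            w.join()                            # everyone has come back to the foreman
        return self.result                      # None means "not in the warehouse"

Spawning a fresh set of threads per item keeps the sketch short; with items arriving over the network you would more likely keep one long-lived worker per floor and feed them through queues, but the Event-based cancellation stays the same. (In CPython the parallel speedup also depends on the search releasing the GIL, e.g. by doing I/O.)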
I'm building a multithreaded web crawler.
I launch a thread that gets the first n href links and parses some data. Then it should add those links to a Visited list that other threads can access, and add the data to a global map that will be printed when the program is done. Then the thread launches n new threads, all doing the same thing.
How can I set up a global list of visited sites that all threads can access, and a global map that all threads can also write to?
You can't share data between processes. That doesn't mean that you can't share information.
The usual way is to use a special process (a server) in charge of this job: maintaining the state, in your case the list of visited links.
Another way is to use ETS (or Mnesia, the database built upon ETS), which is designed to share information between processes.
Just to clarify, Erlang/Elixir uses processes rather than threads.
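Since the question is Erlang/Elixir, the real code would be a registered server process or an ETS table; but just to picture the "one process owns the state, everyone else sends it messages" idea in a more familiar form, here is a rough Python multiprocessing sketch (the hard-coded URLs and single crawler are made up for illustration):

from multiprocessing import Process, Queue

def coordinator(requests, replies):
    visited = set()                       # only this process ever touches the set
    while True:
        url = requests.get()
        if url is None:                   # shutdown signal
            break
        replies.put(url not in visited)   # True -> not seen before, caller should crawl it
        visited.add(url)

def crawler(requests, replies):
    for url in ["http://a.example", "http://b.example", "http://a.example"]:  # made-up links
        requests.put(url)
        if replies.get():
            print("crawling", url)        # the repeated URL is reported as already visited

if __name__ == "__main__":
    requests, replies = Queue(), Queue()
    coord = Process(target=coordinator, args=(requests, replies))
    worker = Process(target=crawler, args=(requests, replies))
    coord.start()
    worker.start()
    worker.join()
    requests.put(None)
    coord.join()

With several crawler processes you would give each one its own reply queue (or tag the replies); a registered Erlang process or an ETS table gives you that addressing for free.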
Given a list of elements, a generic approach:
An empty list called processed is saved to ets, dets, mnesia or some DB.
The new list of elements is filtered against the processed list so the Task is not unnecessarily repeated.
For each element of the filtered list, a task is run (which in turn spawns a process) and does some work on each element that returns a map of the required data. See the Task module; Task.async/1 and Task.yield_many/2 could be useful.
Once all the tasks have returned or yielded,
all the maps, or parts of the data in the maps, are merged and can be persisted if/as required,
the elements whose tasks did not crash or time out are added to the processed list in the DB.
Tasks which crash or time out could be handled differently.
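Those steps, in a rough language-agnostic sketch (shown in Python purely for illustration; in Elixir you would use Task.async/1 and Task.yield_many/2, and work_on stands in for whatever each task does):

from concurrent.futures import ThreadPoolExecutor

def run_round(elements, processed, work_on):
    # processed: the set of elements already handled; persist it wherever suits you
    todo = [e for e in elements if e not in processed]   # filter so work is not repeated
    merged = {}
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(work_on, e): e for e in todo}
        for future, element in futures.items():
            try:
                merged.update(future.result())           # each task returns a map of data
                processed.add(element)                   # only successful elements are marked done
            except Exception:
                pass                                     # crashed tasks can be logged or retried instead
    return merged, processed

Timeouts are left out here; per-future timeouts (or Task.yield_many/2's timeout in Elixir) slot into the same place as the crash handling.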
I'm designing a large-scale project, and I think I see a way I could drastically improve performance by taking advantage of multiple cores. However, I have zero experience with multiprocessing, and I'm a little concerned that my ideas might not be good ones.
Idea
The program is a video game that procedurally generates massive amounts of content. Since there's far too much to generate all at once, the program instead tries to generate what it needs as or slightly before it needs it, and expends a large amount of effort trying to predict what it will need in the near future and how near that future is. The entire program, therefore, is built around a task scheduler, which gets passed function objects with bits of metadata attached to help determine what order they should be processed in and calls them in that order.
Motivation
It seems like it ought to be easy to make these functions execute concurrently in their own processes. But looking at the documentation for the multiprocessing modules makes me reconsider: there doesn't seem to be any simple way to share large data structures between threads. I can't help but imagine this is intentional.
Questions
So I suppose the fundamental questions I need to know the answers to are thus:
Is there any practical way to allow multiple threads to access the same list/dict/etc... for both reading and writing at the same time? Can I just launch multiple instances of my star generator, give it access to the dict that holds all the stars, and have new objects appear to just pop into existence in the dict from the perspective of other threads (that is, I wouldn't have to explicitly grab the star from the process that made it; I'd just pull it out of the dict as if the main thread had put it there itself).
If not, is there any practical way to allow multiple threads to read the same data structure at the same time, but feed their resultant data back to a main thread to be rolled into that same data structure safely?
Would this design work even if I ensured that no two concurrent functions tried to access the same data structure at the same time, either for reading or for writing?
Can data structures be inherently shared between processes at all, or do I always explicitly have to send data from one process to another as I would with processes communicating over a TCP stream? I know there are objects that abstract away that sort of thing, but I'm asking if it can be done away with entirely; have the object each thread is looking at actually be the same block of memory.
How flexible are the objects that the modules provide to abstract away the communication between processes? Can I use them as a drop-in replacement for data structures used in existing code and not notice any differences? If I do such a thing, would it cause an unmanageable amount of overhead?
Sorry for my naivete, but I don't have a formal computer science education (at least, not yet) and I've never worked with concurrent systems before. Is the idea I'm trying to implement here even remotely practical, or would any solution that allows me to transparently execute arbitrary functions concurrently cause so much overhead that I'd be better off doing everything in one thread?
Example
For maximum clarity, here's an example of how I imagine the system would work:
The UI module has been instructed by the player to move the view over to a certain area of space. It informs the content management module of this, and asks it to make sure that all of the stars the player can currently click on are fully generated and ready to be clicked on.
The content management module checks and sees that a couple of the stars the UI is saying the player could potentially try to interact with have not, in fact, had the details that would show upon click generated yet. It produces a number of Task objects containing the methods of those stars that, when called, will generate the necessary data. It also adds some metadata to these task objects, assuming (possibly based on further information collected from the UI module) that it will be 0.1 seconds before the player tries to click anything, and that stars whose icons are closest to the cursor have the greatest chance of being clicked on and should therefore be requested for a time slightly sooner than the stars further from the cursor. It then adds these objects to the scheduler queue.
The scheduler quickly sorts its queue by how soon each task needs to be done, then pops the first task object off the queue, makes a new process from the function it contains, and then thinks no more about that process, instead just popping another task off the queue and stuffing it into a process too, then the next one, then the next one...
Meanwhile, the new process executes, stores the data it generates on the star object it is a method of, and terminates when it gets to the return statement.
The UI then registers that the player has indeed clicked on a star now, and looks up the data it needs to display on the star object whose representative sprite has been clicked. If the data is there, it displays it; if it isn't, the UI displays a message asking the player to wait and continues repeatedly trying to access the necessary attributes of the star object until it succeeds.
Even though your problem seems very complicated, there is a very easy solution. You can hide away all the complicated stuff of sharing your objects across processes using a proxy.
The basic idea is that you create a manager that manages all your objects that should be shared across processes. This manager then creates its own process, where it waits for some other process to instruct it to change the object. But enough said. It looks like this:
import multiprocessing as m

manager = m.Manager()
starsdict = manager.dict()      # a proxy to a dict that lives in the manager's own process
process = m.Process(target=yourfunction, args=(starsdict,))   # yourfunction is your own worker
process.start()                 # start() launches a new process; run() would just call the target in this one
process.join()
The object stored in starsdict is not the real dict; instead it forwards all the changes and requests you make on it to its manager. This is called a "proxy"; it has almost exactly the same API as the object it mimics. These proxies are picklable, so you can pass them as arguments to functions in new processes (as shown above) or send them through queues.
You can read more about this in the documentation.
I don't know how proxies react if two processes are accessing them simultaneously. Since they're made for parallelism I guess they should be safe, even though I heard they're not. It would be best if you test this yourself or look for it in the documentation.
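For what it's worth, my understanding (worth verifying against the docs) is that a single proxy call is handled by the manager as one operation, but a read-then-write sequence spans several calls and is not atomic, so it's worth guarding such sequences with a shared lock. A small sketch, with made-up star data:

import multiprocessing as m

def add_star(stars, lock, name, data):
    # Check-then-insert spans two proxy calls, so hold the lock across the whole sequence.
    with lock:
        if name not in stars:
            stars[name] = data

if __name__ == "__main__":
    manager = m.Manager()
    stars = manager.dict()
    lock = manager.Lock()          # a lock proxy that can be shared just like the dict proxy
    workers = [m.Process(target=add_star, args=(stars, lock, "Sol", {"mass": 1.0}))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(dict(stars))             # only one worker ends up inserting: {'Sol': {'mass': 1.0}}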
A process in Erlang will either call link/1 or spawn_link to create a link with another process. In a recent application I am working on, I got curious about whether it's possible for a process to know, at a given instant, the number of other processes it's linked to. Is this possible? Is there a BIF?
Then, also, when a linked process dies, I guess that if it were possible to know the number of linked processes, this number would be decremented automatically by the run-time system. Such a mechanism would be ideal for dealing with parent-child relationships in Erlang concurrent programs, even in simple ones which do not involve supervisors.
Well, is it possible for an Erlang process to know, out of the box, perhaps via a BIF, the number of processes linked to it, such that whenever a linked process dies this value is decremented automatically under the hood :)?
To expand on this question a little bit, consider a gen_server which will handle thousands of messages via handle_info. In this part, its job is to dispatch child processes to handle the task as soon as it comes in. The aim of this is to make sure the server loop returns immediately to take up the next request. Now, the child process handles the task asynchronously and sends the reply back to the requestor before it dies. Please refer to this question and its answer before you continue. Now, what if, for every child process spawned off by the gen_server, a link is created, and I would like to use this link as a counter. I know, I know, everyone is going to be like "why not use the gen_server State to carry, say, a counter, and then increment or decrement it accordingly?" :) Somewhere in the gen_server, I have:
handle_info({Sender, Task}, State) ->
    spawn_link(?MODULE, child, [Sender, Task]),
    %% At this point, the number of links to the gen_server is incremented
    %% by the run-time system
    {noreply, State};
handle_info(_, State) -> {noreply, State}.
The child goes on to do this:
child(Sender, Task) ->
    Result = (catch execute_task(Task)),
    Sender ! Result,
    ok.
    %% At this point the child process exits,
    %% and I expect the link count to be decremented
Then finally, the gen_server has an exposed call like this:
get_no_of_links() -> gen_server:call(?MODULE, links).

handle_call(links, _, State) ->
    %% BIF to get the number of instantaneous links expected here
    Links = erlang:get_links(),  %% This is fake, do not do it at home :)
    {reply, Links, State};
handle_call(_, _, State) -> {reply, ok, State}.
Now, someone may ask themselves: really, why would anyone want to do this?
Usually, it's possible to keep an integer in the gen_server State and do the counting ourselves, or at least make the gen_server handle_info handle messages of the form {'EXIT', ChildPid, _Reason} and act accordingly. My thinking is that if it were possible to know the number of links, I would use this to know, at a given moment in time, how many child processes are still busy working; this in turn may actually assist in anticipating server load.
From the manual for process_info:
{links, Pids}:
Pids is a list of pids, with processes to which the process
has a link
3> process_info(self(), links).
{links,[<0.26.0>]}
4> spawn_link(fun() -> timer:sleep(100000) end).
<0.38.0>
5> process_info(self(), links).
{links,[<0.26.0>,<0.38.0>]}
I guess it could be used to count the number of linked processes.
Your process should run process_flag(trap_exit, true) and listen for messages of the form {'EXIT', Pid, Reason}, which will arrive whenever a linked process exits. If you don't trap exits, the default behaviour is for your process to exit when the other side of the link exits abnormally.
As for listening to when processes add links, you can use case process_info(self(), links) of {links, L} -> length(L) end or length(element(2, process_info(self(), links))), but you have to re-run this regularly, as there is no way for your process to be notified whenever a link is added.
A process following OTP guidelines never needs to know how many processes are linked to it.