I have a CLI application that I can run in multiple instances simultaneously. I need to associate a unique, sequential and reusable identifier with each instance. It should also be contextual/independent for each process type.
Example:
The first, second and third instances get ids 0, 1 and 2, respectively.
Now if the second instance dies and another instance comes up, that new instance should be given id 1 since it was "freed" by the dying instance.
If I run a different process type, I should be given id 0.
The obvious choice would be to use the processes' PIDs, but that would give me identifiers that are far too numerous and too sparse.
Is there something built-in in Unix/Linux or some service that gives me that?
I would prefer a system native or Node.js solution.
Background:
I'm using Graphite to generate stats of an application and I don't want to potentially create thousands of buckets of the same stats using the processes' PIDs. If there's an alternative solution to this problem, I would also be interested in knowing that.
Thank you!
Since I didn't find any system that meets my requirements, I have created an app myself and hosted it at GitHub: https://github.com/muzzley/process-id-dealer
It's a Node.js app that deals sequential and reusable process ids through an HTTP endpoint. Thus, it can be used by any other program.
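Stripped of the HTTP layer, the rule it implements is simply "hand out the smallest id not currently leased, tracked per process type". As a purely illustrative sketch (in Python for brevity; this is not the project's actual code):

leased = {}                               # process type -> set of ids currently in use

def acquire(ptype):
    ids = leased.setdefault(ptype, set())
    i = 0
    while i in ids:                       # smallest integer not leased for this type
        i += 1
    ids.add(i)
    return i

def release(ptype, i):
    leased[ptype].discard(i)              # the id becomes reusable

a = acquire("worker")     # 0
b = acquire("worker")     # 1
c = acquire("worker")     # 2
release("worker", b)      # the second instance dies
print(acquire("worker"))  # 1 again, as in the example above
print(acquire("other"))   # 0, independent per process type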
I'm looking at several examples from PETSc and petsc4py and looking at the PDF user manual of PETSc. The manual states:
For those not familiar with MPI, a communicator is a way of indicating a collection of processes that will be involved together in a calculation or communication. Communicators have the variable type MPI_Comm. In most cases users can employ the communicator PETSC_COMM_WORLD to indicate all processes in a given run and PETSC_COMM_SELF to indicate a single process.
I believe I understand that statement, but I'm unsure what the real consequences of actually using these communicators are. I'm unsure of what really happens when you do TSCreate(PETSC_COMM_WORLD,...) vs TSCreate(PETSC_COMM_SELF,...), or likewise for a distributed array. If you create a DMDA with PETSC_COMM_SELF, does this maybe mean that the DM object won't really be distributed across multiple processes? Or if you create a TS with PETSC_COMM_SELF and a DM with PETSC_COMM_WORLD, does this mean the solver can't actually access ghost nodes? Does it affect the results of DMCreateLocalVector and DMCreateGlobalVector?
The communicator for a solver decides which processes participate in the solver operations. For example, a TS with PETSC_COMM_SELF would run independently on each process, whereas one with PETSC_COMM_WORLD would evolve a single system across all processes. If you are using a DM with the solver, the communicators must be congruent.
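A quick way to see the difference is to create the same DMDA on both communicators and compare the resulting vector sizes. A minimal petsc4py sketch (the grid size and the mpiexec launch below are just examples):

from petsc4py import PETSc

rank = PETSc.COMM_WORLD.getRank()

# One grid shared by all ranks: each rank owns a slice of the 64 points.
da_world = PETSc.DMDA().create([64], dof=1, stencil_width=1, comm=PETSc.COMM_WORLD)
g = da_world.createGlobalVec()   # global size 64, local size roughly 64 / nprocs
l = da_world.createLocalVec()    # local slice plus ghost points from the neighbours

# A private grid per rank: every rank holds all 64 points, no ghost exchange.
da_self = PETSc.DMDA().create([64], dof=1, stencil_width=1, comm=PETSc.COMM_SELF)

# A solver that uses da_world should live on a congruent communicator.
ts = PETSc.TS().create(comm=da_world.getComm())

print(rank, g.getSize(), g.getLocalSize(), l.getSize())

Run with, e.g., mpiexec -n 4 python demo.py: the global vector from da_world is a single shared vector (local size about 16 per rank), while each rank's da_self is a complete, independent 64-point grid.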
I need to run different Python processes, in a certain order of priority.
Specifically, I have 3 processes, and I need them to work this way:
An object detection script, used to locate a person and their position. I need this one to run continuously at a high FPS;
another process that, once some conditions are met (when the person is present in the picture in the required position) starts taking screenshots of the image for a certain amount of time;
another script that analyzes the screenshots taken by the second one.
I wrote the 3 scripts already and they work fine, but the problem is that process 3 is particularly computationally demanding, and I don't want it to prevent processes 1 and 2 from running smoothly.
My idea is that I could give the highest priority to process 1, and send the screenshots taken by process 2 to a queue, or something like that.
When the person is not detected in the picture, I could run process 3 and empty the queue as the screenshots are analyzed. However, script 3 should still run with limited resources, so that the FPS of script 1 isn't affected too much and it can still detect if the person enters the picture again.
I'm afraid this might all be a little vague, but could you please suggest a way or tool I could use to manage the processes this way?
So far, I have tried simply saving the screenshots to a folder, but I don't know how to limit the resource usage of process 3.
I'm familiar with the basic usage of Docker, so I was thinking that maybe I could:
run the processes in different containers, limiting resources allocated to the 3rd one (?);
use a message broker (Kafka, RabbitMQ?) to store screenshots;
but again, I'm a newbie when it comes to this stuff (speaking of which, I hope I tagged this question correctly), so I don't know if it's an efficient way to do this (or if it can be done this way, for that matter).
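To make the queue idea more concrete, here is a rough sketch of what I have in mind using only the standard library (analyze() stands in for my third script, and the niceness value is just a guess):

import os
import multiprocessing as mp

def analyzer(queue):
    os.nice(19)               # lower this worker's CPU priority so script 1 keeps its FPS
    while True:
        path = queue.get()    # blocks until a screenshot path is enqueued by script 2
        if path is None:      # sentinel meaning "no more work"
            break
        analyze(path)         # placeholder for the actual analysis done by script 3

if __name__ == "__main__":
    q = mp.Queue()
    worker = mp.Process(target=analyzer, args=(q,))
    worker.start()
    # ... scripts 1 and 2 run here and call q.put("/path/to/screenshot.png") ...
    q.put(None)
    worker.join()

I just don't know whether lowering the priority like this is enough, or whether the Docker resource limits are the better tool.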
Let's say I have chrome running, which has 100 different processes, not all of which are direct children. What's the best way to programmatically get all of those processes, from either procfs or whatever syscall there may be (I believe getrusage only works on the calling process), given the PID of the main chrome parent in the hierarchy?
Also, is there any API that's equivalent to PSAPI in Windows which provides OpenProcess, GetProcessMemoryInfo etc, that allows you to iterate through memory efficiently, rather than parsing the procfs?
Most efficient way please. No calling other processes like ps, pstree, pgrep, etc.
Side context: This is mostly an educational exercise to find the most efficient way to do this. I started going down this path while trying to write a simple script in Node.js to get all the processes programmatically and then calculate the sum of the memory taken by the process tree, including each process in it.
I'm actually the author of a C++ library that is designed to do exactly that - pfs.
pfs attempts to make all the interesting information inside procfs accessible through a very simple API. If you find it lacking any useful information, please create an issue, and I'll try to add it.
Seeing that you require that information from Node.js, you might be able to use the library for "inspiration" or for research purposes (as in, understand where the information is located).
Regarding the process tree: procfs contains the parent PID of every process (you'll find it under stat and/or status). You can enumerate all the running processes, store them in a container, and then iterate over it to build or traverse the tree in whatever order you require.
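To show where that information lives, here is a small sketch in Python (purely for research purposes; PID 1234 below is made up) that walks /proc, builds the tree from the ppid field of stat and sums the resident set sizes from statm:

import os

def ppid_of(pid):
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm sits in parentheses and may contain spaces, so split after the last ')'
    fields = data[data.rindex(")") + 2:].split()
    return int(fields[1])              # field 4 of stat is the parent PID

def rss_bytes(pid):
    with open(f"/proc/{pid}/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGESIZE")

def process_tree(root):
    children = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            children.setdefault(ppid_of(int(entry)), []).append(int(entry))
        except (FileNotFoundError, ProcessLookupError):
            pass                       # the process exited while we were scanning
    tree, stack = [root], [root]
    while stack:
        kids = children.get(stack.pop(), [])
        tree.extend(kids)
        stack.extend(kids)
    return tree

tree = process_tree(1234)              # made-up PID of the main chrome process
print(sum(rss_bytes(p) for p in tree if os.path.exists(f"/proc/{p}")))

The same two files (stat/status for the parent PID, statm or the VmRSS line of status for memory) are what you would read from Node.js or C++ as well.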
I have a node program that does a lot of heavy synchronous work. The work that needs to be done could easily be split into several parts. I would like to utilize all processor cores on my machine for this. Is this possible?
From the docs on child processes and clusters I see no obvious solution. Child processes seem to be focused on running external programs, and clusters only work for incoming HTTP connections (or have I misunderstood that?).
I have a simple function var output = fn(input) and would just like to run it several times, spread all the calls across the cores on my machine and provide the result in a callback. Can that be done?
Yes, child processes and clusters are the way to do that. There are a couple of ways of implementing a solution to your problem.
Your server creates a queue and manages that queue. Whenever you need to call your function, you will drop it into the queue. You will then process the queue N items at a time, where N equals the number of your cores. When you start processing, you will spawn a child process, probably either using spawn or exec, with the argument being another standalone Node.js script, along with any additional parameters (it's just a command line call, basically). Inside that script you will do your work, and emit the result back to the server. The worker is then freed up.
You can create a dedicated server with cluster, where all it will do is run your function. With the cluster module, you can (once again) create N workers and delegate work to them.
Now this may seem like a lot of work, and it is. For that reason you should use an existing library, as this is, for the most part, a solved problem at this point. I really like Redis-based queues, so if you're interested in that, see this answer for some queue recommendations.
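To illustrate the shape of option 1 independently of the runtime: the pattern is just a queue drained by N workers, where N is the number of cores. Here it is sketched with Python's multiprocessing purely as an illustration; in Node you would build the same thing with child_process or cluster as described above:

import multiprocessing as mp

def fn(x):
    return x * x        # stand-in for the heavy synchronous work

if __name__ == "__main__":
    inputs = range(32)
    with mp.Pool(processes=mp.cpu_count()) as pool:      # N workers, N = number of cores
        for result in pool.imap_unordered(fn, inputs):   # results arrive as workers finish
            print(result)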
I'm designing a large-scale project, and I think I see a way I could drastically improve performance by taking advantage of multiple cores. However, I have zero experience with multiprocessing, and I'm a little concerned that my ideas might not be good ones.
Idea
The program is a video game that procedurally generates massive amounts of content. Since there's far too much to generate all at once, the program instead tries to generate what it needs as or slightly before it needs it, and expends a large amount of effort trying to predict what it will need in the near future and how near that future is. The entire program, therefore, is built around a task scheduler, which gets passed function objects with bits of metadata attached to help determine what order they should be processed in and calls them in that order.
Motivation
It seems to me like it ought to be easy to make these functions execute concurrently in their own processes. But looking at the documentation for the multiprocessing module makes me reconsider: there doesn't seem to be any simple way to share large data structures between threads. I can't help but imagine this is intentional.
Questions
So I suppose the fundamental questions I need to know the answers to are thus:
Is there any practical way to allow multiple threads to access the same list/dict/etc... for both reading and writing at the same time? Can I just launch multiple instances of my star generator, give it access to the dict that holds all the stars, and have new objects appear to just pop into existence in the dict from the perspective of other threads (that is, I wouldn't have to explicitly grab the star from the process that made it; I'd just pull it out of the dict as if the main thread had put it there itself).
If not, is there any practical way to allow multiple threads to read the same data structure at the same time, but feed their resultant data back to a main thread to be rolled into that same data structure safely?
Would this design work even if I ensured that no two concurrent functions tried to access the same data structure at the same time, either for reading or for writing?
Can data structures be inherently shared between processes at all, or do I always explicitly have to send data from one process to another as I would with processes communicating over a TCP stream? I know there are objects that abstract away that sort of thing, but I'm asking if it can be done away with entirely; have the object each thread is looking at actually be the same block of memory.
How flexible are the objects that the modules provide to abstract away the communication between processes? Can I use them as a drop-in replacement for data structures used in existing code and not notice any differences? If I do such a thing, would it cause an unmanageable amount of overhead?
Sorry for my naivete, but I don't have a formal computer science education (at least, not yet) and I've never worked with concurrent systems before. Is the idea I'm trying to implement here even remotely practical, or would any solution that allows me to transparently execute arbitrary functions concurrently cause so much overhead that I'd be better off doing everything in one thread?
Example
For maximum clarity, here's an example of how I imagine the system would work:
The UI module has been instructed by the player to move the view over to a certain area of space. It informs the content management module of this, and asks it to make sure that all of the stars the player can currently click on are fully generated and ready to be clicked on.
The content management module checks and sees that a couple of the stars the UI is saying the player could potentially try to interact with have not, in fact, had the details that would show upon click generated yet. It produces a number of Task objects containing the methods of those stars that, when called, will generate the necessary data. It also adds some metadata to these task objects, assuming (possibly based on further information collected from the UI module) that it will be 0.1 seconds before the player tries to click anything, and that stars whose icons are closest to the cursor have the greatest chance of being clicked on and should therefore be requested for a time slightly sooner than the stars further from the cursor. It then adds these objects to the scheduler queue.
The scheduler quickly sorts its queue by how soon each task needs to be done, then pops the first task object off the queue, makes a new process from the function it contains, and then thinks no more about that process, instead just popping another task off the queue and stuffing it into a process too, then the next one, then the next one...
Meanwhile, the new process executes, stores the data it generates on the star object it is a method of, and terminates when it gets to the return statement.
The UI then registers that the player has indeed clicked on a star now, and looks up the data it needs to display on the star object whose representative sprite has been clicked. If the data is there, it displays it; if it isn't, the UI displays a message asking the player to wait and continues repeatedly trying to access the necessary attributes of the star object until it succeeds.
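To be concrete about the scheduler itself: the deadline ordering I'm describing is essentially a priority queue. Single-threaded, and with illustrative names, it would look something like this:

import heapq, itertools, time

tiebreak = itertools.count()
queue = []                                  # heap of (due_time, tiebreak, task)

def schedule(task, seconds_from_now):
    heapq.heappush(queue, (time.monotonic() + seconds_from_now, next(tiebreak), task))

schedule(lambda: print("generate star A"), 0.10)   # closest to the cursor, due sooner
schedule(lambda: print("generate star B"), 0.15)

while queue:
    _, _, task = heapq.heappop(queue)       # soonest deadline first
    task()                                  # this is the call I'd like to farm out to another process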
Even though your problem seems very complicated, there is a very easy solution. You can hide away all the complicated details of sharing your objects across processes using a proxy.
The basic idea is that you create a manager that manages all the objects that should be shared across processes. This manager then creates its own process, where it waits for other processes to instruct it to change the object. But enough said. It looks like this:
import multiprocessing as m

manager = m.Manager()
starsdict = manager.dict()

process = m.Process(target=yourfunction, args=(starsdict,))  # yourfunction receives the proxy
process.start()  # start(), not run(); run() would execute in the current process
The object stored in starsdict is not the real dict. Instead, it forwards all changes and requests you make on it to its manager. This is called a "proxy"; it has almost exactly the same API as the object it mimics. These proxies are picklable, so you can pass them as arguments to functions in new processes (as shown above) or send them through queues.
You can read more about this in the documentation.
I don't know how proxies react if two processes are accessing them simultaneously. Since they're made for parallelism I guess they should be safe, even though I heard they're not. It would be best if you test this yourself or look for it in the documentation.
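One quick way to check is to have several processes write to the same proxied dict and see what the parent ends up with; a minimal sketch:

import multiprocessing as mp

def generate(shared, base):
    for i in range(base, base + 3):
        shared[i] = "star-%d" % i          # each write goes through the manager process

if __name__ == "__main__":
    with mp.Manager() as manager:
        stars = manager.dict()
        procs = [mp.Process(target=generate, args=(stars, base)) for base in (0, 10, 20)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(stars.copy())                # the parent sees every entry the workers added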