Concurrent file system access with golang in Kubernetes - multithreading

A prerequisite to my question is the assumption that one uses a Kubernetes cluster with multiple pods accessing file system storage, for example Azure Files (ReadWriteMany). According to Microsoft this relies on SMB3 and should be safe for concurrent use.
Within one pod, Go (with the Gin framework) is used as the programming language for a server that accesses the file system. Since requests are handled in parallel using goroutines, the file system access is concurrent.
Does SMB3 also ensure correctness for concurrent ReadWrites within one pod, or do I need to manually manage file system operations with something like a Mutex or a Worker Pool?
More specifically: I want to use git2go, which is a Go wrapper around libgit2, written in C. According to libgit2's threading.md it is not thread-safe:
Unless otherwise specified, libgit2 objects cannot be safely accessed by multiple threads simultaneously.
Since git2go does not specify otherwise, I assume that different references to the same repository on the file system cannot be used concurrently from different goroutines.
To summarise my question:
Do network communication protocols (like SMB3) also "make" file system operations thread-safe locally?
If not: how would I ensure thread-safe file system access in Go? lockedfile seems like a good option, but it lives in an internal package of the Go source tree.
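For illustration, here is the kind of manual serialization I have in mind — a sketch only, assuming git2go's git.OpenRepository API; the import version, the withRepo helper, and the repository path are placeholders:

// A sketch only: every libgit2 call in this process goes through one
// mutex, since libgit2 objects are not documented as thread-safe.
package main

import (
	"fmt"
	"sync"

	git "github.com/libgit2/git2go/v34"
)

var repoMu sync.Mutex // guards all libgit2 access in this process

func withRepo(path string, fn func(*git.Repository) error) error {
	repoMu.Lock()
	defer repoMu.Unlock()
	repo, err := git.OpenRepository(path)
	if err != nil {
		return err
	}
	defer repo.Free()
	return fn(repo)
}

func main() {
	err := withRepo("/mnt/azurefile/example.git", func(r *git.Repository) error {
		fmt.Println("bare repository:", r.IsBare())
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}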

Related

How do worker threads work in Node.js?

Node.js cannot have a built-in thread API like Java and .NET do. If threads are added, the nature of the language itself will change. It's not possible to add threads as a new set of available classes or functions.
Node.js 10.x added worker threads as an experimental feature, and they have been stable since 12.x. I have gone through a few blogs but did not understand much, maybe due to a lack of knowledge. How are they different from ordinary threads?
Worker threads in JavaScript are somewhat analogous to WebWorkers in the browser. They do not share direct access to any variables with the main thread or with each other, and the only way they communicate with the main thread is via messaging. This messaging is synchronized through the event loop. This avoids all the classic race conditions that multiple threads have trying to access the same variables, because two separate threads can't access the same variables in node.js. Each thread has its own set of variables, and the only way to influence another thread's variables is to send it a message and ask it to modify its own variables. Since that message is synchronized through that thread's event queue, there's no risk of classic race conditions in accessing variables.
Java threads, on the other hand, are similar to C++ or native threads in that they share access to the same variables, and the threads are freely timesliced, so right in the middle of functionA running in threadA, execution could be interrupted and functionB running in threadB could run. Since both can freely access the same variables, there are all sorts of race conditions possible unless one manually uses thread synchronization tools (such as mutexes) to coordinate and protect all access to shared variables. This type of programming is often the source of very hard-to-find and next-to-impossible-to-reliably-reproduce concurrency bugs. While powerful and useful for some system-level things or more real-time-ish code, it's very easy for anyone but a very senior and experienced developer to make costly concurrency mistakes. And it's very hard to devise a test that will tell you whether it's really stable under all types of load or not.
node.js attempts to avoid the classic concurrency bugs by separating the threads into their own variable spaces and forcing all communication between them to be synchronized via the event queue. This means that threadA/functionA is never arbitrarily interrupted by some other code in your process changing shared variables it was accessing while it wasn't looking.
node.js also has a backstop: it can run a child_process that can be written in any language and can use native threads if needed, or one can hook native code and real system-level threads right into node.js using the add-on SDK (which communicates with node.js JavaScript through the SDK interface). In fact, a number of node.js built-in libraries do exactly this to surface functionality that requires that level of access to the node.js environment. For example, the implementation of file access uses a pool of native threads to carry out file operations.
So, with all that said, there are still some types of race conditions that can occur, and these have to do with access to outside resources. For example, if two threads or processes are both trying to do their own thing and write to the same file, they can clearly conflict with each other and create problems.
So, code using Workers in node.js still has to be aware of concurrency issues when accessing outside resources. node.js protects the local variable environment for each Worker, but can't do anything about contention over outside resources. In that regard, node.js Workers have the same issues as Java threads, and the programmer has to code for that (exclusive file access, file locks, separate files for each Worker, using a database to manage the concurrency for storage, etc...).
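To make the share-nothing model concrete, here is a rough sketch in Go (the language of the question at the top of this page) of a worker that owns its own state and reacts only to messages. Note the difference: node.js Workers enforce this isolation, whereas in Go it is a discipline you follow by coordinating only over channels.

// Sketch: a goroutine that owns its own state and is influenced only by
// messages, mirroring how a node.js Worker talks to the main thread.
// Nothing here shares variables; all coordination goes over channels.
package main

import "fmt"

func worker(msgs <-chan string, replies chan<- string) {
	count := 0 // private to this goroutine; nobody else touches it
	for m := range msgs {
		count++
		replies <- fmt.Sprintf("handled %q (message #%d)", m, count)
	}
	close(replies)
}

func main() {
	msgs := make(chan string)
	replies := make(chan string)
	go worker(msgs, replies)

	go func() {
		msgs <- "hello"
		msgs <- "world"
		close(msgs)
	}()
	for r := range replies {
		fmt.Println(r)
	}
}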
This comes under the Node.js architecture. Whenever a request reaches Node it is passed to the event queue and then to the event loop. The event loop checks whether the request is blocking or non-blocking I/O (blocking I/O being operations that take time to complete, e.g. fetching data from somewhere else). The event loop passes blocking I/O to the thread pool, which is a collection of worker threads. The blocking operation gets attached to one of the worker threads, which performs it (e.g. fetching data from a database); after completion the result is sent back to the event loop and then on to execution.
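The thread-pool flow described above — blocking jobs queued to a fixed set of workers, with results returned when done — can be sketched in Go (the language of the original question) using channels; the job names are invented for illustration:

// Sketch of the thread-pool pattern: a fixed set of workers pulls
// blocking jobs off a queue and sends results back to the caller.
package main

import "fmt"

func worker(id int, jobs <-chan string, results chan<- string) {
	for j := range jobs {
		// stand-in for a blocking operation, e.g. a database fetch
		results <- fmt.Sprintf("worker %d finished %s", id, j)
	}
}

func main() {
	jobs := make(chan string, 8)
	results := make(chan string, 8)
	for i := 1; i <= 4; i++ { // a pool of 4 workers
		go worker(i, jobs, results)
	}
	queued := []string{"fetch-a", "fetch-b", "fetch-c"}
	for _, j := range queued {
		jobs <- j
	}
	close(jobs)
	for range queued { // collect one result per queued job
		fmt.Println(<-results)
	}
}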

Allowing multiple processes to connect to a singleton process and invoke method calls or access resources

Scenario:
I currently have a class MyLoadBalancerSingleton that manages access to a cluster of resources (resource1 and resource2). The class has methods create(count) and delete(count). When these methods get called, the load balancer queues up the requests and then processes them FIFO on the resources.
Naturally, there should be only one load balancer running; otherwise they'll each think they have complete control over the resources being managed.
Here is the problem:
Multiple users will simultaneously try to access the load balancer from a GUI. Each user will spawn their own GUI via python gui.py on the same machine (they will all ssh into the same machine). As such, each GUI will be running in its own process. The GUI will then attempt to communicate with the load balancer.
Is it possible to have those multiple GUI processes access only one loadbalancer process and call the load balancer's methods?
I looked into the multiprocessing library, and it appears that its workflow is the opposite of what I want. Using my example, it would be: the load balancer process spawns 2 GUI processes (children) and then shares the parent's resources with the children. In my example, both the GUI and the load balancer are top-level processes (no parent-child relationship).
I suspect that singleton is not the right word, as singletons only work within one process. Maybe run the load balancer as a daemon process and then have those GUI processes connect to it? I tried searching for IPC, but it just led me to the multiprocessing module, which is not what I want. The distributed, cluster-computing modules (dispy) aren't what I want either. This is strictly processes communicating with each other (IPC?) on the same machine.
Which brings me to my original question:
Is it possible to allow multiple processes to connect to a singleton process and invoke method calls or access its resources? All processes will be executing on the same machine.
Fictitious pseudocode:
LoadBalancer.py
class MyLoadBalancerSingleton(object):
    def __init__(self):
        # Singleton instance logic here
        # Resource logic here (sets up self.resource1, self.resource2)
        pass

    def create(self, count):
        self.resource1.create(count)
        self.resource2.create(count)

    def delete(self, count):
        self.resource1.delete(count)
        self.resource2.delete(count)
Gui.py
class GUI(object):
    def event_loop(self):
        # count = ask for user input
        # process = locate load balancer process
        # process.create(count)
        # process.delete(count)
        pass
Thank you for your time!
Yes, it's possible. I don't have a Python-specific example easily at hand, but you can do it. There are several kinds of IPC that allow multiple clients (GUIs, in your case) to connect to a single server (your singleton, which yes, would usually be run as a daemon). Here are a few of them:
Cross-platform: TCP sockets. You'd need your server to allow multiple connections on a single listening socket, and handle them as clients connect (and disconnect). The easiest approach to use across multiple machines, but also the least secure option (no ACLs, etc.).
Windows-specific: Named pipes. Windows' named pipes, unlike the similarly-named but much less capable feature of POSIX OSes, can allow multiple clients to connect at once. You'd need to create a multiple-instance pipe server. MSDN has good examples of this. I'm not sure what the best way to do it in Python would be, but I know that ActivePython has wrappers for the NT named pipe APIs. The clients only need to be able to open a file (of the form \\.\pipe\LoadBalancer). File APIs are used to read from, and write to, the pipes.
POSIX-only (Linux, BSD, OS X, etc.): Unix domain sockets. The POSIX equivalent of NT's named pipes, they use socket APIs but the endpoint is on the file system (like, /var/LoadBalanceSocket) instead of on an IP address/protocol/port tuple.
Various other things, using stuff like shared memory / memory-mapped files, RPC, COM (on Windows), Grand Central Dispatch (OS X), D-Bus (cross-platform but third-party), and so on. None of them, with the possible exception of GCD, is ideal for the simple case you're talking about here.
Of course, each of these approaches requires the server to handle multiple simultaneously-connected clients. The server will need to synchronize across them and avoid one client blocking the others from being served. You could use multithreading for quick responsiveness and minimal CPU cost while waiting, or polling for a quick-and-dirty solution that avoids multithreading synchronization (mutexes and the like). A sketch of the socket approach follows.
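As a rough sketch of the socket option (written in Go rather than Python, since I don't have a Python example at hand; the socket path and the line-based command protocol are invented for illustration):

// Sketch: one daemon owns the load balancer; any number of GUI processes
// connect over a Unix domain socket and send it line-based commands.
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"sync"
)

type LoadBalancer struct {
	mu sync.Mutex // serializes commands from all connected clients
}

func (lb *LoadBalancer) Handle(cmd string) string {
	lb.mu.Lock()
	defer lb.mu.Unlock()
	// resource1/resource2 create/delete logic would go here
	return "ok: " + cmd
}

func main() {
	const sock = "/tmp/loadbalancer.sock"
	os.Remove(sock) // clean up a stale socket from a previous run
	l, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	lb := &LoadBalancer{}
	for {
		conn, err := l.Accept()
		if err != nil {
			continue
		}
		go func(c net.Conn) { // one goroutine per connected GUI
			defer c.Close()
			in := bufio.NewScanner(c)
			for in.Scan() {
				fmt.Fprintln(c, lb.Handle(in.Text()))
			}
		}(conn)
	}
}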

Who can share shared memory in Linux?

I am working on hardening a sandbox for student code execution. I'm satisfied that students can't share data on the file system or with signals, because I've found express rules dictating those, and they execute as different unprivileged users. However, I am having a really hard time determining from the documentation who can see shared memory (or IPC more generally: queues or semaphores) when it is created. If you create shared memory, can anyone on the same machine open it, or is there a way to control that? Does the control lie with the program that creates the memory, or can the sysadmin limit it?
Any process in the same IPC namespace can see and (potentially) access IPC objects created by other processes in that namespace. Each IPC object has the same user/group/other rwx permissions as file system objects -- see the svipc(7) manual page.
You can create a new IPC namespace by using the clone(2) system call with the CLONE_NEWIPC flag. You can use the unshare(1) program to do a clone+exec of another program with this or certain other CLONE flags.
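For example, a minimal sketch in Go (assuming Linux and root/CAP_SYS_ADMIN, which CLONE_NEWIPC requires) of launching a program in a fresh IPC namespace:

// Sketch: run a command in its own IPC namespace (Linux only; needs
// root or CAP_SYS_ADMIN). Roughly equivalent to: unshare --ipc ipcs
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("ipcs") // stand-in for the sandboxed student program
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWIPC, // child sees a fresh, empty set of IPC objects
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}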

Thread Safe web apps - why does it matter?

Why does being thread-safe matter in a web app? Pylons (a Python web framework) uses a global application variable which is not thread-safe. Does this matter? Is it only a problem if I intend to use multi-threading? Or does it mean that one user might not have updated state if another user... I'm just confusing myself. What's so important about this?
Threading errors can lead to serious and subtle problems.
Say your system has 10 members. One more user signs up, and the application adds him to the roster and increments the count of members; "simultaneously", another user quits, and the application removes him from the roster and decrements the count of members.
If you don't handle threading properly, your member count (which should still be 10) could easily end up as 9, 10, or 11, and you'll never be able to reproduce the bug.
So be careful.
You should care about thread safety. E.g., in Java you write a servlet that provides some functionality. The container will deploy an instance of your servlet, and as HTTP requests arrive from clients over different TCP connections, each request is handled by a separate thread, which in turn calls your servlet. As a result, your servlet is called from multiple threads. So if it is not thread-safe, erroneous results will be returned to the user due to corruption of the shared data accessed by those threads.
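The same situation is easy to see in Go, where net/http calls your handler from a separate goroutine per request, just as a servlet container uses a thread per request. A minimal sketch (the hit counter is invented for illustration):

// Sketch: net/http invokes this handler from a separate goroutine per
// request. Without the mutex, hits++ would be a data race.
package main

import (
	"fmt"
	"net/http"
	"sync"
)

var (
	mu   sync.Mutex
	hits int // shared state touched by every request
)

func handler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	hits++
	n := hits
	mu.Unlock()
	fmt.Fprintf(w, "visit #%d\n", n)
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}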
It really depends on the application framework (which I know nothing about in this case) and how the web server handles it. Obviously, any good webserver is going to be responding to multiple requests simultaneously, so it will be operating with multiple threads. That web server may dispatch to a single instance of your application code for all of these requests, or it may spawn multiple instances of your web application and never use a given instance concurrently.
Even if the app server does use separate instances, your application will probably have some shared state--say, a database with a list of users. In that case, you need to make sure that state can be accessed safely from multiple threads/instances of your web app.
Then, of course, there is the case where you use threading explicitly in your application. In that case, the answer is obvious.
Your web application is almost always multithreaded, even if you don't use threads explicitly. So, to answer your question: it's very important.
How can this happen? Usually, Apache (or IIS) will serve several requests simultaneously, calling your Python programs multiple times from multiple threads. So you need to consider that your programs run in multiple threads concurrently and act accordingly.
(This was too long to add a comment to the other fine answers.)
Concurrency problems (read: multiple access to shared state) are a superset of threading problems. They can easily exist at an "above thread" level, such as the process/server level (the global variable in the case you mention is a process-unique value, which in turn can lead to an inconsistent view/state if there are multiple processes).
Care must be taken to analyze the data-consistency requirements and then implement the software to fulfill those requirements. I would always err on the side of safety, and only relax it in carefully analyzed areas where that is acceptable.
However, note that CPython runs only one thread context at a time for Python code execution (to get truly concurrent threads you need to write/use C extensions), so, while you can still get race conditions on shared data, you won't get (all) the same kinds of partial-write scenarios that plague C/C++ programs. But, once again: err on the side of a consistent view.
There are a number of existing methods of making access to a global atomic -- across threads or processes. Use them.

How can I manage use of a shared resource used by several Perl programs?

I am looking for a good way to manage access to an external FTP server from various programs on a single server.
Currently I am working with a lock file, so that only one process can use the FTP server at a time. What would be a good way to allow 2-3 parallel processes to access the FTP server simultaneously? Unfortunately the provider does not allow more sessions and locks my account for a day if too many processes access their server.
The platforms used are Solaris and Linux; all FTP access is encapsulated in a single library, so there is only one function I need to change. It would be nice if there were something on CPAN.
I'd look into perlipc(1) for SysV semaphores or modules like POSIX::RT::Semaphore for POSIX semaphores. I'd create a semaphore with a resource count of 2-3, and then have the different processes try to acquire it.
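As a sketch of the counting-semaphore idea — shown here in Go, where a buffered channel plays the semaphore's role within one process; the cross-process version is what the Perl modules above provide:

// Sketch: a counting semaphore capping concurrent FTP sessions at 3.
// withFTP's body stands in for the single library function mentioned
// in the question.
package main

import (
	"fmt"
	"sync"
)

var ftpSlots = make(chan struct{}, 3) // at most 3 sessions at once

func withFTP(job string) {
	ftpSlots <- struct{}{}        // acquire a slot (blocks while 3 are in use)
	defer func() { <-ftpSlots }() // release the slot when done
	fmt.Println("transferring", job)
}

func main() {
	var wg sync.WaitGroup
	for _, j := range []string{"a", "b", "c", "d", "e"} {
		wg.Add(1)
		go func(j string) { defer wg.Done(); withFTP(j) }(j)
	}
	wg.Wait()
}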
Instead of making a bunch of programs wait in line, could you create one local program that handled all the remote communication while the local programs talked to it? You effectively create a proxy and push that complexity away from your programs so you don't have to deal with it in every program.
I don't know the other constraints on your problem, but this has worked for me on similar issues.
