How to ensure two Docker containers don't process the same file?

So I have two services A and B in a swarm. A has 5 instances and so does B. They both access files from a common mount. If I put 100 files in this mount, how do I ensure that the files A and B pick (maybe 50 each) are mutually exclusive, i.e. no file gets processed twice? Additionally, how would I ensure this for two instances of the same service?

With the same methods that you would employ for any shared mount accessed by multiple processes on the same host or on different hosts. There is little that is unique about containers in this respect.
You need to apply locks to prevent simultaneous writes, and make sure you are always reading the latest data from the volume.
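As a minimal sketch of the claim-by-lock approach in Go (assuming the shared mount honors advisory locks; check this for your storage driver, NFS in particular), each instance tries to take a non-blocking flock on a file and only processes it on success. The /mnt/shared paths are illustrative, not from the question:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// tryClaim takes an exclusive, non-blocking advisory lock on path.
// On success it returns the open file so the lock stays held while
// the file is being processed.
func tryClaim(path string) (*os.File, bool) {
	f, err := os.Open(path)
	if err != nil {
		return nil, false
	}
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX|unix.LOCK_NB); err != nil {
		f.Close() // another instance already claimed this file
		return nil, false
	}
	return f, true
}

func main() {
	files, _ := filepath.Glob("/mnt/shared/incoming/*") // illustrative mount path
	for _, path := range files {
		f, ok := tryClaim(path)
		if !ok {
			continue // skip: claimed by another container
		}
		fmt.Println("processing", path)
		// ... process the file here ...

		// Move it out of 'incoming' so no instance re-processes it
		// after the lock is released.
		os.Rename(path, filepath.Join("/mnt/shared/done", filepath.Base(path)))
		f.Close() // releases the flock
	}
}
```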

Related

Concurrent file system access with golang in Kubernetes

A prerequisite to my question is the assumption that one uses a Kubernetes cluster with multiple pods accessing a file system storage, for example Azure Files (ReadWriteMany). According to Microsoft this relies on SMB3 and should be safe for concurrent use.
Within one pod, Go is used (with the Gin framework) as the programming language for a server that accesses the file system. Since requests are handled in parallel using goroutines, the file system access is concurrent.
Does SMB3 also ensure correctness for concurrent ReadWrites within one pod, or do I need to manually manage file system operations with something like a Mutex or a Worker Pool?
More specifically: I want to use git2go, which is a Go wrapper around libgit2, written in C. According to libgit2's threading.md it is not thread safe:
Unless otherwise specified, libgit2 objects cannot be safely accessed by multiple threads simultaneously.
Since git2go does not specify otherwise, I assume that different references to the same repository on the file system cannot be used concurrently from different goroutines.
To summarise my question:
Do network communication protocols (like SMB3) also "make" file system operations thread safe locally?
If not: how would I ensure thread-safe file system access in Go? lockedfile seems like a good option, but it lives in an internal package.
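For the in-process half of the question, one common approach (independent of what SMB3 guarantees) is to serialize access per repository with a mutex. A minimal sketch, assuming one lock per repository path is sufficient granularity; repoLocks and withRepo are illustrative names, and the git2go calls themselves are elided:

```go
package main

import "sync"

// repoLocks hands out one mutex per repository path, so goroutines
// serialize access to the same on-disk repo while still handling
// different repos in parallel.
var repoLocks sync.Map // map[string]*sync.Mutex

func lockFor(repoPath string) *sync.Mutex {
	m, _ := repoLocks.LoadOrStore(repoPath, &sync.Mutex{})
	return m.(*sync.Mutex)
}

// withRepo runs fn while holding the lock for repoPath. Open the
// repository, operate on it, and free it entirely inside fn.
func withRepo(repoPath string, fn func()) {
	mu := lockFor(repoPath)
	mu.Lock()
	defer mu.Unlock()
	fn()
}
```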

mmap file shared via nfs?

Scenario A:
To share a read/write block of memory between two processes running on the same host, Joe mmaps the same local file from both processes.
Scenario B:
To share a read/write block of memory between two processes running on two different hosts, Joe shares a file via nfs between the hosts, and then mmaps the shared file from both processes.
Has anyone tried Scenario B? What are the extra problems that arise in Scenario B that do not apply to Scenario A?
Mmap will not share data without some additional actions.
If you change data in the mmapped part of the file, the changes will be stored only in memory. They will not be flushed to the filesystem (local or remote) until msync or munmap or close, or even a decision by the OS kernel and its FS.
When using NFS, locking and storing of data will be slower than with a local FS. Timeouts for flushing and the duration of file operations will vary too.
On the sister site people say that NFS may have a poor caching policy, so there will be many more I/O requests to the NFS server compared with a local FS.
You will need byte-range locks for correct behavior. They are available in NFS >= v4.0.
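To make the flushing point concrete, here is a sketch using Go's golang.org/x/sys/unix wrappers (the file path is illustrative, and the file is assumed to be at least 4 KiB): a write into the mapping stays in local memory until msync pushes it out.

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Map a file, mutate it in memory, then force the changes out
	// with msync. Until the msync (or munmap/close), another host
	// reading over NFS may not see the update.
	f, err := os.OpenFile("/mnt/nfs/shared.bin", os.O_RDWR, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	const size = 4096 // assumes the file is at least this large
	data, err := unix.Mmap(int(f.Fd()), 0, size,
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(data)

	data[0] = 42 // this change lives only in this host's page cache ...

	// ... until explicitly flushed back to the (NFS) file.
	if err := unix.Msync(data, unix.MS_SYNC); err != nil {
		panic(err)
	}
}
```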
I'd say Scenario B has all kinds of problems (assuming it works as suggested in the comments). The most obvious is the standard concurrency issue: two processes sharing one resource with no form of locking. That could lead to problems... I'm not sure whether NFS has its own peculiar quirks in this regard or not.
Assuming you can get around the concurrency issues somehow, you are now reliant on maintaining a stable (and speedy) network connection. Obviously, if the network drops out, you might miss some changes. Whether this matters depends on your architecture.
My thought is that it sounds like an easy way to share a block of memory between machines, but I can't say I've heard of it being done, which makes me think it isn't so good. When I think of sharing data between processes, I think of DBs, messaging, or a dedicated server. In this case, if you made one process the master (to handle concurrency and own the data, i.e. whatever it says is the best copy), it might work...

Is it better to have many small Azure storage blob containers (each with some blobs) or one really large container with tons of blobs?

So the scenario is the following:
I have multiple instances of a web service that write a blob of data to Azure Storage. I need to be able to group blobs into a container (or a virtual directory) depending on when they were received. Once in a while (every day at worst) older blobs will get processed and then deleted.
I have two options:
Option 1
I make one container called "blobs" (for example) and then store all the blobs in that container. Each blob will use a directory-style name, with the directory name being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ..., "hr23min0/dataN.bin", etc. - a new directory every X minutes). The thing that processes these blobs will process hr0min0 blobs first, then hr0minX, and so on (and the blobs are still being written while being processed).
Option 2
I have many containers, each with a name based on the arrival time (so the first will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and all the blobs in a container are those that arrived at the named time. The thing that processes these blobs will process one container at a time.
So my question is: which option is better? Does option 2 give me better parallelization (since containers can be on different servers), or is option 1 better because many containers can cause other, unknown issues?
Everyone has given you excellent answers around accessing blobs directly. However, if you need to list blobs in a container, you will likely see better performance with the many-container model. I just talked with a company that's been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs. They're seeing a performance hit, as the time to retrieve a full listing keeps growing.
This might not apply to your scenario, but it's something to consider...
I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure blob storage is done at the blob level, not the container level. Reasons to spread out across different containers have more to do with access control (e.g. SAS) or total storage size.
See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
(Scroll down to "Partitions").
Quoting:
Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need them to (within the storage account space limit). The tradeoff is that we don't provide the ability to do atomic transactions across multiple blobs.
Theoretically speaking, there should be no difference between lots of containers or fewer containers with more blobs. The extra containers can be nice as additional security boundaries (for public anonymous access or different SAS signatures for instance). Extra containers can also make housekeeping a bit easier when pruning (deleting a single container versus targeting each blob). I tend to use more containers for these reasons (not for performance).
Theoretically, the performance impact should not exist. The blob itself (full URL) is the partition key in Windows Azure (has been for a long time). That is the smallest thing that will be load-balanced from a partition server. So, you could (and often will) have two different blobs in same container being served out by different servers.
Jeremy indicates there is a performance difference between more and fewer containers. I have not dug into those benchmarks enough to explain why that might be the case, but I would suspect other factors (like size, duration of test, etc.) to explain any discrepancies.
There is also one more factor that comes into this: price!
Currently the List and Create Container operations are the same price:
US$0.054 per 10,000 calls
Writing a blob is actually the same price as well.
So in an extreme case you can pay a lot more if you create and delete many containers (the delete itself is free; the creates are not).
You can see the calculator here:
https://azure.microsoft.com/en-us/pricing/calculator/
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning
Understanding how Azure Storage partitions your blob data is useful for enhancing performance. Azure Storage can serve data in a single partition more quickly than data that spans multiple partitions. By naming your blobs appropriately, you can improve the efficiency of read requests.
Blob storage uses a range-based partitioning scheme for scaling and load balancing. Each blob has a partition key comprised of the full blob name (account+container+blob). The partition key is used to partition blob data into ranges. The ranges are then load-balanced across Blob storage.
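Since the full blob name is the partition key, the naming scheme itself is the tuning knob. As a small illustration of the Option 1 layout (time-bucketed virtual directories in one container; the 30-minute window and the hrXminY prefix mirror the question's example, and nothing here is Azure-specific):

```go
package main

import (
	"fmt"
	"time"
)

// bucketName returns an Option-1 style blob name: a virtual directory
// per 30-minute window under one container. The window size and the
// "hrXminY/" prefix are illustrative, taken from the question.
func bucketName(t time.Time, file string) string {
	bucket := t.Minute() / 30 * 30
	return fmt.Sprintf("hr%dmin%d/%s", t.Hour(), bucket, file)
}

func main() {
	fmt.Println(bucketName(time.Now(), "data.bin")) // e.g. hr14min30/data.bin
}
```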

Process Many Files Concurrently — Copy Files Over or Read Through NFS?

I need to concurrently process a large number of files (thousands of different files, with an average size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be processed by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
After reading a file from the 'incoming' folder on the 1.5TB hard drive, every machine will process the information and then output the results back to the 'processed' folder on the same drive. The processed information for every file is roughly the same average size as the input (about 2MB per file).
What is the better thing to do:
(1) For every processing machine M, Copy all files that will be processed by M into its local hard drive, and then read & process the files locally on machine M.
(2) Instead of copying the files to every machine, every machine will access the 'incoming' folder directly (using NFS), and will read the files from there, and then process them locally.
Which idea is better? Are there any dos and don'ts when one is doing such a thing?
I am mostly curious whether it is a problem to have 30 or so machines reading (or writing) information on the same network drive at the same time.
(Note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...) Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)
I would definitely do #2 - and I would do it as follows:
Run Apache on your main server with all the files. (Or some other HTTP server, if you really want.) There are several reasons I'd do it this way:
HTTP is basically pure TCP (with some headers on it). Once the request is sent, it's a very "one-way" protocol: low overhead, not chatty, high performance and efficiency.
If you (for whatever reason) decided you needed to move or scale it out (using a cloud service, for example), HTTP would be a much better way to move the data around over the open Internet than NFS. You could use SSL (if needed). You could get through firewalls (if needed). Etc., etc.
Depending on the access pattern of your files, and assuming the whole file needs to be read, it's easier/faster just to do one network operation and pull the whole file in one go, rather than to constantly issue I/O requests over the network every time you read a smaller piece of the file.
It is easy to distribute and run an application that does all this and doesn't rely on the existence of network mounts or specific file paths. If you have the URL to the files, the client can do its job. It doesn't need established mounts or hard-coded directories, or to become root to set up such mounts.
If you have NFS connectivity problems, the whole system can go haywire when you try to access the mounts and they hang. With HTTP running in a user-space context, you just get a timeout error, and your application can take whatever action it chooses (like paging you, logging errors, etc.).
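A sketch of what the client side could look like under these assumptions (the file server URL is illustrative): one request pulls the whole file, and a dead server surfaces as a timeout error instead of a hung mount.

```go
package main

import (
	"io"
	"net/http"
	"os"
	"path"
	"time"
)

// fetch pulls one input file over plain HTTP in a single request:
// no mounts, no root, and a failure is just an error. The client
// timeout bounds the whole transfer.
func fetch(url, destDir string) error {
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	out, err := os.Create(path.Join(destDir, path.Base(url)))
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, resp.Body) // one whole-file transfer
	return err
}

func main() {
	if err := fetch("http://fileserver/incoming/data-0001.bin", "/tmp"); err != nil {
		os.Exit(1)
	}
}
```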

Is there any scenario where an application instance runs across multiple computers?

We know that a single application instance can use multiple cores and even multiple physical processors. With cloud and cluster computing, or other special scenarios, I don't know whether a single instance can run across multiple computers, or across multiple OS instances.
This is important to me because, besides being considered bad programming, I use some static (as in C#) global variables, and my program will probably behave unexpectedly if those variables become shared between computers.
Update: I'll be more specific. I'm writing a .NET application that has a variable counting the number of active IP connections, to prevent the number of connections from exceeding a per-computer limit. I am wondering whether, if I deploy that application on a cloud-computing host, I'll end up with one instance of the variable per computer.
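For concreteness, the pattern in question looks roughly like this, sketched in Go rather than C# (the semantics that matter here are the same): the counter lives in one process's memory, so each machine naturally gets its own copy.

```go
package main

import "sync/atomic"

// activeConns counts live connections in this process. Each process
// (and therefore each machine or VM instance) gets its own counter;
// nothing here is shared across computers. maxConns is illustrative.
var activeConns int64

const maxConns = 100

func tryAcquire() bool {
	if atomic.AddInt64(&activeConns, 1) > maxConns {
		atomic.AddInt64(&activeConns, -1) // over the limit: roll back
		return false
	}
	return true
}

func release() {
	atomic.AddInt64(&activeConns, -1)
}
```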
If you want to learn how to realize such a scenario (a single instance across multiple computers), I think you should read some articles about MPI.
it has become a de facto standard for communication among processes that model a parallel program running on a distributed memory system.
Regarding your worries: obviously, you'll need to somehow consciously change your program to run as one instance across several computers. Otherwise no sharing takes place, of course, and as Shy writes, there is nothing to worry about. This kind of thing wouldn't happen automatically.
What programming environment (language) are you using? It should define exactly what "static" means. Most likely it does not allow any sharing of information between different computers except through explicit channels such as MPI or RPC.
However, abstracted high-level environments may be designed to hide the concept of "running on multiple computers" from you entirely. It is conceivable to design a virtual machine that can run on a cluster and will actually share static variables between different physical machines - but in this case your program is running on the single virtual machine, and whether that runs on one or more physical machines should not make any difference.
If you can't even imagine a situation where this happens, why are you worrying about it?
Usually, this could happen in a scenario involving RPC of some kind.
Well yes, there's distcc. It's the GCC compiler, distributed.
