Immutable Global objects for multi-tenancy - nashorn

We run a multi-tenant environment where users can execute arbitrary scripts using Nashorn. Performance is extremely important for us, so we would rather not create a new SimpleScriptContext object (or even a new SimpleBindings object) on each script eval.
However, this leaves us open to people modifying the global context, which then must be reused for other users / executions. E.g. Math.min = function(a,b) { return 42; }.
Freezing the Math object is a partial solution, but it seems to slow down script execution, and we have to be extremely diligent to make sure we do this everywhere. Similarly, creating a new SimpleBindings object to replace the ENGINE_SCOPE bindings on each execution is a big performance hit.
Are there any options available to us to lock the Global state? Or anything that I haven't thought of?
Thanks!

Related

Azure durable entity or static variables?

Question: Is it thread-safe to use static variables (as a shared storage between orchestrations) or better to save/retrieve data to durable-entity?
There are couple of azure functions in the same namespace: hub-trigger, durable-entity, 2 orchestrations (main process and the one that monitors the whole process) and activity.
They all need some shared variables. In my case I need to know the number of main orchestration instances (start new or hold on). It's done in another orchestration (monitor)
I've tried both options and ask because I see different results.
Static variables: in my case there is a generic List, where SomeMyType holds the Id of the task, state, number of attempts, records it processed and other info.
When I need to start new orchestration and List.Add(), when I need to retrieve and modify it I use simple List.First(id_of_the_task). First() - I know for sure needed task is there.
With static variables I sometimes see that tasks become duplicated for some reason - I retrieve the task with List.First(id_of_the_task) - change something on result variable and that is it. Not a lot of code.
Durable-entity: the major difference is that I add List on a durable entity and each time I need to retrieve it I call for .CallEntityAsync("getTask") and .CallEntityAsync("saveTask") that might slow done the app.
With this approach more code and calls is required however it looks more stable, I don't see any duplicates.
Please, advice
Can't answer why you would see duplicates with the static variables approach without the code, may be because list is not thread safe and it may need ConcurrentBag but not sure. One issue with static variable is if the function app is not always on or if it can have multiple instances. Because when function unloads (or crashes) the state would be lost. Static variables are not shared across instances either so during high loads it wont work (if there can be many instances).
Durable entities seem better here. Yes they can be shared across many concurrent function instances and each entity can only execute one operation at a time so they are for sure a better option. The performance cost is a bit higher but they should not be slower than orchestrators since they perform a lot of common operations, writing to Table Storage, checking for events etc.
Can't say if its right for you but instead of List.First(id_of_the_task) you should just be able to access the orchestrators properties through the client which can hold custom data. Another idea depending on the usage is that you may be able to query the Table Storages directly with CloudTable class for the information about the running orchestrators.
Although not entirely related you can look at some settings for parallelism for durable functions Azure (Durable) Functions - Managing parallelism
Please ask any questions if I should clarify anything or if I misunderstood your question.

How best to instantiate execution contexts?

I'm building a library that others--i.e. those uninterested in internals--could use to pull data from our DBs. In the internals, I want a couple I/O calls to be performed in parallel for performance purposes. The trade-off here is that the client (who, again, might not care much about this whole threading thing) would need to provide an appropriate execution context. Therefore, I provide a suggestion to use a helpful execution context in a helper object:
object ThreadPoolHelper {
val cachedThreadPoolContext: ExecutionContext =
ExecutionContext.fromExecutor(Executors.newCachedThreadPool())
}
The question is (assuming that someday I also provide other options, like, say, a fixed thread pool for the clients to optionally use) am I fine just leaving this (these) as a val? Or am I better off making it lazy? Or a def?
One way or another, lazy is the way to go.
Making them lazy vals would be a good all-purpose choice, as each could be initialized as needed (as they are accessed). Then, you would never initialize more thread pools than are needed. scala.concurrent.ExecutionContext.Implicits.global is an implicit lazy val.
Technically, singleton objects (like ThreadPoolHelper) are lazy by default, so they will not be initialized until they are first accessed. A val would be fine if you only had one ExecutionContext in an object. However, multiple ExecutionContexts as vals in the same object wouldn't make as much sense, because accessing one would initialize them all--which would use more resources than needed.
A def would not make sense, because then you would be creating a new ExecutionContext on each call, and throwing it away when done. That could cause a lot of unwanted overhead, and default the purpose of having a thread pool in the first place.
Some ExecutionContexts that are little more custom than yours are singleton objects that extend ExecutionContext and implement their own custom behavior. These would also be lazy.

Using threadsafe initialization in a JRuby gem

Wanting to be sure we're using the correct synchronization (and no more than necessary) when writing threadsafe code in JRuby; specifically, in a Puma instantiated Rails app.
UPDATE: Extensively re-edited this question, to be very clear and use latest code we are implementing. This code uses the atomic gem written by #headius (Charles Nutter) for JRuby, but not sure it is totally necessary, or in which ways it's necessary, for what we're trying to do here.
Here's what we've got, is this overkill (meaning, are we over/uber-engineering this), or perhaps incorrect?
ourgem.rb:
require 'atomic' # gem from #headius
SUPPORTED_SERVICES = %w(serviceABC anotherSvc andSoOnSvc).freeze
module Foo
def self.included(cls)
cls.extend(ClassMethods)
cls.send :__setup
end
module ClassMethods
def get(service_name, method_name, *args)
__cached_client(service_name).send(method_name.to_sym, *args)
# we also capture exceptions here, but leaving those out for brevity
end
private
def __client(service_name)
# obtain and return a client handle for the given service_name
# we definitely want to cache the value returned from this method
# **AND**
# it is a requirement that this method ONLY be called *once PER service_name*.
end
def __cached_client(service_name)
##_clients.value[service_name]
end
def __setup
##_clients = Atomic.new({})
##_clients.update do |current_service|
SUPPORTED_SERVICES.inject(Atomic.new({}).value) do |memo, service_name|
if current_services[service_name]
current_services[service_name]
else
memo.merge({service_name => __client(service_name)})
end
end
end
end
end
end
client.rb:
require 'ourgem'
class GetStuffFromServiceABC
include Foo
def self.get_some_stuff
result = get('serviceABC', 'method_bar', 'arg1', 'arg2', 'arg3')
puts result
end
end
Summary of the above: we have ##_clients (a mutable class variable holding a Hash of clients) which we only want to populate ONCE for all available services, which are keyed on service_name.
Since the hash is in a class variable (and hence threadsafe?), are we guaranteed that the call to __client will not get run more than once per service name (even if Puma is instantiating multiple threads with this class to service all the requests from different users)? If the class variable is threadsafe (in that way), then perhaps the Atomic.new({}) is unnecessary?
Also, should we be using an Atomic.new(ThreadSafe::Hash) instead? Or again, is that not necessary?
If not (meaning: you think we do need the Atomic.news at least, and perhaps also the ThreadSafe::Hash), then why couldn't a second (or third, etc.) thread interrupt between the Atomic.new(nil) and the ##_clients.update do ... meaning the Atomic.news from EACH thread will EACH create two (separate) objects?
Thanks for any thread-safety advice, we don't see any questions on SO that directly address this issue.
Just a friendly piece of advice, before I attempt to tackle the issues you raise here:
This question, and the accompanying code, strongly suggests that you don't (yet) have a solid grasp of the issues involved in writing multi-threaded code. I encourage you to think twice before deciding to write a multi-threaded app for production use. Why do you actually want to use Puma? Is it for performance? Will your app handle many long-running, I/O-bound requests (like uploading/downloading large files) at the same time? Or (like many apps) will it primarily handle short, CPU-bound requests?
If the answer is "short/CPU-bound", then you have little to gain from using Puma. Multiple single-threaded server processes would be better. Memory consumption will be higher, but you will keep your sanity. Writing correct multi-threaded code is devilishly hard, and even experts make mistakes. If your business success, job security, etc. depends on that multi-threaded code working and working right, you are going to cause yourself a lot of unnecessary pain and mental anguish.
That aside, let me try to unravel some of the issues raised in your question. There is so much to say that it's hard to know where to start. You may want to pour yourself a cold or hot beverage of your choice before sitting down to read this treatise:
When you talk about writing "thread-safe" code, you need to be clear about what you mean. In most cases, "thread-safe" code means code which doesn't concurrently modify mutable data in a way which could cause data corruption. (What a mouthful!) That could mean that the code doesn't allow concurrent modification of mutable data at all (using locks), or that it does allow concurrent modification, but makes sure that it doesn't corrupt data (probably using atomic operations and a touch of black magic).
Note that when your threads are only reading data, not modifying it, or when working with shared stateless objects, there is no question of "thread safety".
Another definition of "thread-safe", which probably applies better to your situation, has to do with operations which affect the outside world (basically I/O). You may want some operations to only happen once, or to happen in a specific order. If the code which performs those operations runs on multiple threads, they could happen more times than desired, or in a different order than desired, unless you do something to prevent that.
It appears that your __setup method is only called when ourgem.rb is first loaded. As far as I know, even if multiple threads require the same file at the same time, MRI will only ever let a single thread load the file. I don't know whether JRuby is the same. But in any case, if your source files are being loaded more than once, that is symptomatic of a deeper problem. They should only be loaded once, on a single thread. If your app handles requests on multiple threads, those threads should be started up after the application has loaded, not before. This is the only sane way to do things.
Assuming that everything is sane, ourgem.rb will be loaded using a single thread. That means __setup will only ever be called by a single thread. In that case, there is no question of thread safety at all to worry about (as far as initialization of your "client cache" goes).
Even if __setup was to be called concurrently by multiple threads, your atomic code won't do what you think it does. First of all, you use Atomic.new({}).value. This wraps a Hash in an atomic reference, then unwraps it so you just get back the Hash. It's a no-op. You could just write {} instead.
Second, your Atomic#update call will not prevent the initialization code from running more than once. To understand this, you need to know what Atomic actually does.
Let me pull out the old, tired "increment a shared counter" example. Imagine the following code is running on 2 threads:
i += 1
We all know what can go wrong here. You may end up with the following sequence of events:
Thread A reads i and increments it.
Thread B reads i and increments it.
Thread A writes its incremented value back to i.
Thread B writes its incremented value back to i.
So we lose an update, right? But what if we store the counter value in an atomic reference, and use Atomic#update? Then it would be like this:
Thread A reads i and increments it.
Thread B reads i and increments it.
Thread A tries to write its incremented value back to i, and succeeds.
Thread B tries to write its incremented value back to i, and fails, because the value has already changed.
Thread B reads i again and increments it.
Thread B tries to write its incremented value back to i again, and succeeds this time.
Do you get the idea? Atomic never stops 2 threads from running the same code at the same time. What it does do, is force some threads to retry the #update block when necessary, to avoid lost updates.
If your goal is to ensure that your initialization code will only ever run once, using Atomic is a very inappropriate choice. If anything, it could make it run more times, rather than less (due to retries).
So, that is that. But if you're still with me here, I am actually more concerned about whether your "client" objects are themselves thread-safe. Do they have any mutable state? Since you are caching them, it seems that initializing them must be slow. Be that as it may, if you use locks to make them thread-safe, you may not be gaining anything from caching and sharing them between threads. Your "multi-threaded" server may be reduced to what is effectively an unnecessarily complicated, single-threaded server.
If the client objects have no mutable state, good for you. You can be "free and easy" and share them between threads with no problems. If they do have mutable state, but initializing them is slow, then I would recommend caching one object per thread, so they are never shared. Thread[] is your friend there.

Is calling a lua function(as a callback) from another thread safe enough?

Actually I am using visual C++ to try to bind lua functions as callbacks for socket events(in another thread). I initialize the lua stuff in one thread and the socket is in another thread, so every time the socket sends/receives a message, it will call the lua function and the lua function determines what it should do according to the 'tag' within the message.
So my questions are:
Since I pass the same Lua state to lua functions, is that safe? Doesn't it need some kinda protection? The lua functions are called from another thead so I guess they might be called simultaneously.
If it is not safe, what's the solution for this case?
It is not safe to call back asynchronously into a Lua state.
There are many approaches to dealing with this. The most popular involve some kind of polling.
A recent generic synchronization library is DarkSideSync
A popular Lua binding to libev is lua-ev
This SO answer recommends Lua Lanes with LuaSocket.
It is not safe to call function within one Lua state simultaneously in multiple threads.
I was dealing with the same problem, since in my application all basics such as communication are handled by C++ and all the business logic is implemented in Lua. What I do is create a pool of Lua states that are all created and initialised on an incremental basis (once there's not enough states, create one and initialise with common functions / objects). It works like this:
Once a connection thread needs to call a Lua function, it checks out an instance of Lua state, initialises specific globals (I call it a thread / connection context) in a separate (proxy) global table that prevents polluting the original global, but is indexed by the original global
Call a Lua function
Check the Lua state back in to the pool, where it is restored to the "ready" state (dispose of the proxy global table)
I think this approach would be well suited for your case as well. The pool checks each state (on an interval basis) when it was last checked out. When the time difference is big enough, it destroys the state to preserve resources and adjust the number of active states to current server load. The state that is checked out is the most recently used among the available states.
There are some things you need to consider when implementing such a pool:
Each state needs to be populated with the same variables and global functions, which increases memory consumption.
Implementing an upper limit for state count in the pool
Ensuring all the globals in each state are in a consistent state, if they happen to change (here I would recommend prepopulating only static globals, while populating dynamic ones when checking out a state)
Dynamic loading of functions. In my case there are many thousands of functions / procedures that can be called in Lua. Having them constantly loaded in all states would be a huge waste. So instead I keep them byte code compiled on the C++ side and have them loaded when needed. It turns out not to impact performance that much in my case, but your mileage may vary. One thing to keep in mind is to load them only once. Say you invoke a script that needs to call another dynamically loaded function in a loop. Then you should load the function as a local once before the loop. Doing it otherwise would be a huge performance hit.
Of course this is just one idea, but one that turned out to be best suited for me.
It's not safe, as the others mentioned
Depends on your usecase
Simplest solution is using a global lock using the lua_lock and lua_unlock macros. That would use a single Lua state, locked by a single mutex. For a low number of callbacks it might suffice, but for higher traffic it probably won't due to the overhead incurred.
Once you need better performance, the Lua state pool as mentioned by W.B. is a nice way to handle this. Trickiest part here I find synchronizing the global data across the multiple states.
DarkSideSync, mentioned by Doug, is useful in cases where the main application loop resides on the Lua side. I specifically wrote it for that purpose. In your case this doesn't seem a fit. Having said that; depending on your needs, you might consider changing your application so the main loop does reside on the Lua side. If you only handle sockets, then you can use LuaSocket and no synchronization is required at all. But obviously that depends on what else the application does.

Is it ok to create shared variables inside a thread?

I think this might be a fairly easy question.
I found a lot of examples using threads and shared variables but in no example a shared variable was created inside a thread. I want to make sure I don't do something that seems to work and will break some time in the future.
The reason I need this is I have a shared hash that maps keys to array refs. Those refs are created/filled by one thread and read/modified by another (proper synchronization is assumed). In order to store those array refs I have to make them shared too. Otherwise I get the error Invalid value for shared scalar.
Following is an example:
my %hash :shared;
my $t1 = threads->create(
sub { my #ar :shared = (1,2,3); $hash{foo} = \#ar });
$t1->join;
my $t2 = threads->create(
sub { print Dumper(\%hash) });
$t2->join;
This works as expected: The second thread sees the changes the first made. But does this really hold under all circumstances?
Some clarifications (regarding Ian's answer):
I have one thread A reading from a pipe and waiting for input. If there is any, thread A will write this input in a shared hash (it maps scalars to hashes... those are the hashes that need to be declared shared as well) and continues to listen on the pipe. Another thread B gets notified (via cond_wait/cond_signal) when there is something to do, works on the stuff in the shared hash and deletes the appropriate entries upon completion. Meanwhile A can add new stuff to the hash.
So regarding Ian's question
[...] Hence most people create all their shared variables before starting any sub-threads.
Therefore even if shared variables can be created in a thread, how useful would it be?
The shared hash is a dynamically growing and shrinking data structure that represents scheduled work that hasn't yet been worked on. Therefore it makes no sense to create the complete data structure at the start of the program.
Also the program has to be in (at least) two threads because reading from the pipe blocks of course. Furthermore I don't see any way to make this happen without sharing variables.
The reason for a shared variable is to share. Therefore it is likely that you will wish to have more than one thread access the variable.
If you create your shared variable in a sub-thread, how will you stop other threads accessing it before it has been created? Hence most people create all their shared variables before starting any sub-threads.
Therefore even if shared variables can be created in a thread, how useful would it be?
(PS, I don’t know if there is anything in perl that prevents shared variables being created in a thread.)
PS A good design will lead to very few (if any) shared variables
This task seems like a good fit for the core module Thread::Queue. You would create the queue before starting your threads, push items on with the reader, and pop them off with the processing thread. You can use the blocking dequeue method to have the processing thread wait for input, avoiding the need for signals.
I don't feel good answering my own question but I think the answers so far don't really answer it. If something better comes along, I'd be happy to accept that. Eric's answer helped though.
I now think there is no problem with sharing variables inside threads. The reasoning is: Threads::Queue's enqueue() method shares anthing it enqueues. It does so with shared_clone. Since enqueuing should be good from any thread, sharing should too.

Resources