In Spectre variant 2, the branch target buffer (BTB) can be poisoned from another process. If the branch predictor also uses the virtual address to index branches, why can't we train the branch predictor the same way the BTB is trained in a Spectre v1 attack?
Simple answers:
We don't know the virtual addresses used by the other process.
Since it's another process, we can't directly inject CPU instructions into it to access the memory we need.
In more detail:
Sure, Spectre V1 and V2 exploit the same vulnerability, the branch predictor. The basic difference is that V1 works within a single process, while V2 works across processes.
A typical example of V1 is a JavaScript virtual machine. We can't access the process's memory from within the VM, because the VM checks the bounds of each memory access. But using Spectre V1 we can first train the branch predictor for the VM and then speculatively access memory within the current process.
Things get more complicated if we want to access another process, i.e. Spectre V2. First, we have to train the branch predictor of the remote process. There are many ways to do this, including the one you've mentioned: training on the local virtual address (if we know it).
Then, since we can't just ask the remote process "please read this and that secret", we have to use the technique called "return-oriented programming" to speculatively execute the pieces of code we need inside the remote process.
That is the main difference. But as I said, your proposal is valid and will definitely work on some CPUs and under some circumstances.
A single GPU such as the P100 has 56 SMs (streaming multiprocessors), and different SMs may have little correlation with each other. I would like to know how application performance varies with the number of SMs used. Is there any way to disable some SMs for a given GPU? I know CPUs offer corresponding mechanisms, but I haven't found a good one for GPUs yet. Thanks!
There are no CUDA-provided methods to disable an SM (streaming multiprocessor). With varying degrees of difficulty and side effects, some possibilities exist to approximate this using indirect methods:
Use CUDA MPS, and launch an application that fully "occupies" one or more SMs, by carefully controlling the number of blocks launched and the resource utilization of those blocks. With CUDA MPS, another application can run on the same GPU, and the kernels can run concurrently, assuming sufficient care is taken to allow it. This might allow for no direct modification of the application code under test (but an additional application launch is needed, as well as MPS). The kernel duration will need to be "long" so as to occupy the SMs while the application under test is running.
In your application code, effectively re-create the behavior listed in item 1 above by launching the "dummy" kernel from the same application as the code under test, and have the dummy kernel "occupy" one or more SMs. The application under test can then launch the desired kernel. This should allow for kernel concurrency without MPS.
In your application code, for the kernel under test itself, modify the kernel block scheduling behavior, probably using the smid special register via inline PTX, to cause the application kernel itself to only utilize certain SMs, effectively reducing the total number in use.
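As a sketch of the third approach (I'm using PyCUDA here for brevity; the kernel name, the SM limit, and the "work" of recording the SM id are illustrative only, and a real kernel would also need to redistribute the work skipped by blocks that exit early, e.g. via an atomic work queue): each block reads the %smid register via inline PTX and exits immediately if it landed on an SM you want to treat as disabled.
```python
# Minimal sketch, assuming PyCUDA and an NVIDIA GPU are available.
import numpy as np
import pycuda.autoinit                      # creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

ALLOWED_SMS = 8  # hypothetical: treat SM ids >= 8 as "disabled"

mod = SourceModule(r"""
__global__ void record_smid(int *block_smid, int allowed_sms)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));   // which SM is this block running on?
    if (smid >= allowed_sms) return;           // blocks on "disabled" SMs exit at once
    block_smid[blockIdx.x] = (int)smid;        // the real kernel work would go here
}
""")

record_smid = mod.get_function("record_smid")
block_smid = np.full(1024, -1, dtype=np.int32)
record_smid(drv.InOut(block_smid), np.int32(ALLOWED_SMS),
            block=(128, 1, 1), grid=(int(block_smid.size), 1))

# Entries still at -1 belong to blocks that landed on a "disabled" SM.
print("SM ids that did work:", sorted({int(s) for s in block_smid[block_smid >= 0]}))
```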
I am working on a multithreaded implementation of DDPG (Deep Deterministic Policy Gradient) in TensorFlow/Keras on the CPU. Multiple worker agents (each with its own copy of the environment) are being used to update a global network via worker gradients.
My question is: how do you properly update the global network via its workers and then safely run the updated global network periodically in its own copy of the environment? The intent is that I want to periodically observe/store how the global network performs over time, while the workers continue to learn and update the global network by interacting with their own copies of the environment. According to this Stack Overflow question regarding locking in TensorFlow, "Reads to the variables are not performed under the lock, so it is possible to see intermediate states and partially-applied updates". I would like to avoid the global network being partially updated (mid-update) while it interacts with its environment.
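Here is a minimal sketch of the locking pattern I have in mind (plain threading.Lock, with a NumPy vector standing in for the Keras global network; the gradient math and all names are placeholders, not my actual code): workers apply updates only while holding the lock, and the evaluator copies the weights under the same lock so it never sees a half-applied update.
```python
import threading
import numpy as np

global_weights = np.zeros(10)          # stands in for the global network's variables
weights_lock = threading.Lock()

def worker(worker_id, steps=100):
    rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        grad = rng.normal(size=global_weights.shape)   # placeholder for DDPG gradients
        with weights_lock:                             # apply the update atomically
            global_weights[:] -= 0.01 * grad

def evaluate_snapshot():
    with weights_lock:                                 # never read a half-applied update
        snapshot = global_weights.copy()
    # run the snapshot in the evaluation environment outside the lock
    return float(np.linalg.norm(snapshot))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
print("eval metric:", evaluate_snapshot())
for t in threads:
    t.join()
```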
I'm asking this because it's completely new to me (it's more of a Linux networking question than a TF question, but maybe someone has already done it).
Since my GPU is not able to handle the input data I need, I had to pull in resources from my CPU; however, there are times when even the CPU and GPU together cannot cope with all of the operations. I could use the processor of another computer that is on the same network as mine, but I don't know how I should code that (I have access to that computer, but I'm not very good with Linux in that area :p).
I was looking at the TF web page, but it only covers the case where the resources are local. There I found the usual with tf.device('/cpu:0'): ... to use when my GPU was not able to cope with all of the work. I think it could be something like with tf.device('other_computer/cpu:0'):, but then I would probably have to change the line sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)), and at the same time I would need access to the other computer, but I don't know how to do that.
Anyway, if someone has done this before, I would be thankful to hear about it. Any reference I could use is welcome.
Thanks
TensorFlow supports distributed computation, using the CPUs (and potentially GPUs) of multiple computers connected by a network. Your intuition that the with tf.device(): block will be useful is correct. This is a broad topic, but there are three main steps to setting up a distributed TensorFlow computation:
Create a tf.train.Server, which maps the machines in your cluster to TensorFlow jobs, which are lists of tasks. You'll need to create a tf.train.Server on each computer that you want to use, and configure it with the same cluster definition.
Build a TensorFlow graph. Use with tf.device("/job:foo/task:17"): to define a block of nodes that will be placed in the 17th task of the job called "foo" that you defined in step 1. There are convenience methods for applying device mapping policies, such as tf.train.replica_device_setter(), which help to simplify this task for common training topologies.
Create a tf.Session that connects to the local server. If you created your server as server = tf.train.Server(...), you'd create your session as sess = tf.Session(server.target, ...).
There's a longer tutorial about distributed TensorFlow available here, and some example code for training the Inception image recognition model using a cluster here.
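Putting the three steps together, a minimal two-machine sketch (hostnames and port 2222 below are placeholders; this uses the TF 1.x API referenced above) would look something like this:
```python
import tensorflow as tf

# Step 1: the same cluster definition is used on every machine; each machine
# starts a server for its own task (on machine-b you would pass task_index=1).
cluster = tf.train.ClusterSpec({"worker": ["machine-a:2222", "machine-b:2222"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Step 2: pin parts of the graph to specific tasks.
with tf.device("/job:worker/task:0"):
    a = tf.constant(3.0)
with tf.device("/job:worker/task:1"):      # this op runs on machine-b
    b = a * 2.0

# Step 3: connect a session to the local server and run the graph.
with tf.Session(server.target,
                config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(b))
```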
My problem is that I have code that needs a freshly rebooted node. I have many long-running Jenkins test jobs that need to be executed on rebooted nodes.
My existing solution is to define multiple "proxy" machines in Jenkins with the same label (TestLable) and one executor per machine. I bind all the test jobs to that label (TestLable). In the test execution script I detect the Jenkins machine (via the Jenkins NODE_NAME environment variable) and use that to know which physical machine the tests should use.
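For illustration, the lookup in the test execution script is essentially the following sketch (the proxy names and physical hosts below are made up):
```python
# Map the Jenkins proxy node the job landed on (NODE_NAME) to the physical
# test machine it represents. The mapping is an example only.
import os

PROXY_TO_PHYSICAL = {
    "TestProxy01": "test-rig-01.example.com",
    "TestProxy02": "test-rig-02.example.com",
}

node_name = os.environ["NODE_NAME"]            # set by Jenkins for every build
physical_host = PROXY_TO_PHYSICAL[node_name]
print(f"Running tests against {physical_host} (reserved via proxy {node_name})")
```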
Does anybody know of a better solution?
The above works, but I need to define a high number of "nodes/machines" that may not be needed. What I would like is a plugin that could grant a token to a Jenkins job. That way a job would not be executed until both a Jenkins executor and a token were free. The token should be a string so that my test jobs could use it to know which external node to use.
We have written our own scheduler that allocates resources before starting Jenkins nodes. There may be a better solution, but this mostly works for us. I've yet to come across an off-the-shelf scheduler that can deal with complicated allocation of different hardware resources. We have n box types, allocated to n build types.
Some of our build types cannot share a box without destroying all persistent data, which we may need to keep because it takes a long time to gather. Some jobs require combinations of these hardware types. We store the details in a DB and then use business logic to determine how hardware is allocated. We've often found that particular job types need additional business logic or extra data fields to account for their specific requirements.
So the best way may be to write your own scheduler, in the language of your choice, that takes your particular needs into account.
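As a rough illustration of the "details in a DB plus business logic" idea (the schema, box types, and function names below are invented; a real scheduler also needs retries, timeouts, and per-job compatibility rules):
```python
# Sketch of DB-backed allocation: one transaction claims a free box of the
# requested type, so two jobs cannot grab the same box.
import sqlite3

# isolation_level=None -> we manage transactions explicitly with BEGIN/COMMIT
db = sqlite3.connect("scheduler.db", isolation_level=None)
db.execute("""CREATE TABLE IF NOT EXISTS boxes (
                  name TEXT PRIMARY KEY,
                  box_type TEXT NOT NULL,
                  claimed_by TEXT)""")

def claim_box(box_type, job_id):
    """Atomically claim one free box of the given type; return its name or None."""
    db.execute("BEGIN IMMEDIATE")              # take a write lock for the whole claim
    row = db.execute(
        "SELECT name FROM boxes WHERE box_type = ? AND claimed_by IS NULL LIMIT 1",
        (box_type,)).fetchone()
    if row is None:
        db.execute("ROLLBACK")
        return None
    db.execute("UPDATE boxes SET claimed_by = ? WHERE name = ?", (job_id, row[0]))
    db.execute("COMMIT")
    return row[0]

def release_box(name):
    db.execute("UPDATE boxes SET claimed_by = NULL WHERE name = ?", (name,))
```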
The SPOJ is a website that lists programming puzzles, then allows users to write code to solve those puzzles and upload their source code to the server. The server then compiles that source code (or interprets it if it's an interpreted language), runs a battery of unit tests against the code, and verifies that it correctly solves the problem.
What's the best way to implement something like this: how do you sandbox the user's code so that it cannot compromise the server? Should you use SELinux, chroot, or virtualization? All three, plus something else I haven't thought of?
How does the application reliably communicate results outside of the jail while also ensuring that the results are not compromised? How would you prevent, for instance, an application from writing huge chunks of nonsense data to disk, or carrying out other malicious activities?
I'm genuinely curious, as this just seems like a very risky sort of application to run.
A chroot jail executed from a limited user account sounds like the best starting point (i.e. NOT root, and not the same user that runs your webserver).
To prevent huge chunks of nonsense data being written to disk, you could use disk quotas or a separate volume that you don't mind filling up (assuming you're not testing in parallel under the same user, or you'll end up dealing with annoying race conditions).
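A small sketch of how those pieces can be combined from the judging process itself (the uid/gid, jail path, and limit values are placeholders; Python's resource and subprocess modules are just one way to do it, and a real judge layers SELinux/containers and quotas on top):
```python
# Run an untrusted submission as a limited user inside a chroot jail, with
# per-process limits so it cannot hog the CPU, fill the disk, or fork-bomb.
import os
import resource
import subprocess

JAIL_DIR = "/srv/judge/jail"     # placeholder chroot directory
SANDBOX_UID = 12345              # placeholder unprivileged account
SANDBOX_GID = 12345

def drop_privileges_and_limit():
    # Runs in the child between fork() and exec(); the parent must start with
    # enough privilege (root) for chroot() and setuid() to succeed.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                    # 5 s of CPU
    resource.setrlimit(resource.RLIMIT_FSIZE, (8 * 1024 * 1024,) * 2)  # 8 MB max file size
    resource.setrlimit(resource.RLIMIT_NPROC, (32, 32))                # no fork bombs
    os.chroot(JAIL_DIR)          # jail the process, then drop privileges
    os.chdir("/")
    os.setgid(SANDBOX_GID)       # drop group first, then user
    os.setuid(SANDBOX_UID)

result = subprocess.run(
    ["./submission"],            # the compiled user program inside the jail
    preexec_fn=drop_privileges_and_limit,
    capture_output=True,
    timeout=10,                  # wall-clock backstop enforced by the parent
)
print(result.returncode, result.stdout[:1000])
```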
If you wanted to do something more scalable and secure, you could use dynamically provisioned virtualized hosts with your own server/client solution for communication: you have a pool of 'agents' that receive instructions to copy and compile from X repository or share, then execute a battery of tests and log the output back via the same server/client protocol. Your host process can watch for excessive disk usage and report warnings if required, the agents may or may not execute the code under a chroot jail, and if you're super paranoid you would destroy the agent after each run and spin up a new VM when the next sample is ready for testing. If you're doing this at large scale in the cloud (e.g. 100+ agents running on EC2), you only ever have enough agents spun up to accommodate demand, which reduces your costs. Again, if you're going for scale you can use something like Amazon SQS to buffer requests, or if you're doing an experimental sample project you could do something much simpler (just think of distributed parallel processing systems, e.g. SETI@home).