Is it possible to get a stream for an assembly generated in memory using CodeDOM?

In some of my tests I need to generate some assemblies and then "decompile" them using ICSharpCode.Decompiler which uses Mono.Cecil to inspect assemblies.
For performance reasons I'd like to generate the assembly in memory to avoid disk I/O.
Below is the code I intend to use:
var cdp = CodeDomProvider.CreateProvider(CodeDomProvider.GetLanguageFromExtension("cs"));
var p = new CompilerParameters { GenerateInMemory = true };
var cr = cdp.CompileAssemblyFromSource(p, sources);
if (cr.Errors.Count > 0)
{
throw new Exception(cr.Errors[0].ErrorText);
}
// !!! I'd like to avoid building / loading the assembly to / from disk
// var assembly = AssemblyDefinition.ReadAssembly(assemblyPath, readerParameters);
// Instead, I'd like to do something like:
Stream assemblyStream = GetAssemblyContentAsStream(cr.CompiledAssembly);
var assembly = AssemblyDefinition.ReadAssembly(assemblyStream, readerParameters);
var dc = new DecompilerContext(assembly.MainModule);
var astBuilder = new AstBuilder(dc);
astBuilder.AddType(typeToBeConverted);
var output = new StringWriter();
astBuilder.GenerateCode(new PlainTextOutput(output));
So the question is: is it possible to implement GetAssemblyContentAsStream()?

For performance reasons I'd like to generate the assembly in memory to avoid disk I/O.
This is one of the Great Myths of programming. Modern operating systems are way too smart to make disk I/O a bottleneck. Your program doesn't actually write the assembly to disk, it writes it to the file system cache. RAM. Writing to disk happens later, much later, in a kernel thread that runs in the background and has no effect on perf.
Very similarly, thinking of "memory" as RAM is a myth as well. Programs allocate virtual memory. It is not memory at all, it is space allocated in the paging file. On disk. It is the job of the operating system to make that space available to a program when it needs it. Mapping the paging file allocation to RAM. Writing to the disk happens later, much later, in a kernel thread that runs when another process needs RAM.
See the similarity? There is no difference. The only side effect you could ever observe is on a machine with a very restricted amount of RAM, where there simply isn't enough space available in the file system cache and the program must wait until the data is written to a file. Such a machine will also have great difficulty making RAM available to your program: it needs to throw out pages of RAM used by other processes (or your own), writing them to disk. In other words, your program will be slow to get started instead of slow to finish the job. The net difference is close to zero.
The size of an assembly never puts a strain on the amount of RAM available to the file system cache on a modern machine. An easy gigabyte at a minimum. So just don't bother, you'll never actually see a performance improvement.

Related

Hey could someone help me understand sync syscall usage?

Like the title says, I don't really understand the use of this syscall. I was writing a program that writes some data to a file, and the tutorial I followed told me to use the sys_sync syscall. But my problem is: why and when should we use it? Isn't the data already written to the file?
The manual says:
sync - Synchronize cached writes to persistent storage
So it is written to the file cache in memory, not on disk.
You rarely have to use sync unless you are writing really important data and need to make sure that data is on disk before you go on. One example of systems that use sync a lot are databases (such as MySQL or PostgreSQL).
So in other words, the data is logically in your file, just not yet on disk. If you lose power you could lose it; and if you have a lot of RAM and many writes in a row, the kernel may keep those writes in the cache for a long while, increasing the risk of data loss.
But how can a file be not on the disk? I understand the concept of cache but if I wrote in the disk why would it be in a different place?
First, when you write to a file, you send the data to the kernel; you don't send it directly to the disk. A kernel driver is then responsible for writing the data to disk. In my days on Apple II and Amiga computers, I would actually read/write the disk directly. At least the Amiga had DMA, so you could set up a buffer, tell the disk I/O to do a read or a write, and it would send you an interrupt when done. On the Apple II, you had to write loops in assembly language with precise timings to read/write data on floppy disks... A different era!
You could, of course, still access the disk directly today (but with a kernel like Linux, you'd have to make sure the kernel actually lets you do that...).
Cache is primarily used for speed. It is very slow to write to disk (as far as a human is concerned, it looks extremely fast, but compared to how much data the CPU can push to the drive, it's still slow).
So what happens is that the kernel has a task whose job is to write data to disk. That task wakes up as soon as dirty data appears in the cache and ends once it has all been transferred to disk. It runs in parallel with your program, and you can have one such task per drive (which is especially useful with a setup such as RAID 1).
If your application fills up the cache, then a further write will block until some of the cache can be replaced.
and the tutorial I've seen told me to use sys_sync syscall
Well that sounds silly, unless you're doing filesystem write benchmarking or something.
If you have one really critical file that you want to make sure is "durable" w.r.t. power outages before you do something else (like sending a network packet to acknowledge a complete transfer), use fsync(fd) to sync just that one file's data and metadata.
(In asm, call number SYS_fsync from sys/syscall.h, with the file descriptor as the first register arg.)
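For illustration, a minimal C sketch of that pattern (the filename and record are made up, and error handling is reduced to perror/exit):

/* Write one critical record and block until it is durable. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("important.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    const char msg[] = "critical record\n";
    if (write(fd, msg, sizeof msg - 1) < 0) { perror("write"); exit(1); }

    /* fsync() blocks until this one file's data and metadata reach the
       storage device; sync() operates on every dirty buffer system-wide. */
    if (fsync(fd) < 0) { perror("fsync"); exit(1); }

    close(fd);
    return 0;
}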
But my problem is why and when should we use this?
Generally never use the sync system call in programs you're writing.
There are interactive use-cases where you'd normally use the wrapper command of the same name, sync(1). e.g. with removable media, to get the kernel started doing write-back now, so unmount will take less time once you finish typing it. Or for some benchmarking use-cases.
The system shutdown scripts may run sync after unmounting filesystems (and remounting / read-only), before making a reboot(2) system call.
Re: why sync(2) exists
No, your data isn't already on disk right after echo foo > bar.txt.
Most OSes, including Linux, do write-back caching, not write-through, for file writes.
You don't want write() system calls to wait for an actual magnetic disk when there's free RAM, because the traditional way to do I/O is synchronous so simple single-threaded programs wouldn't be able to do anything else (like reading more data or computing anything) while waiting for write() to return. Blocking for ~10 ms on every write system call would be disastrous; that's as long as a whole scheduler timeslice. (It would still be bad even with SSDs, but of course OSes were designed before SSDs were a thing.) Even just queueing up the DMA would be slow, especially for small file writes that aren't a whole number of aligned sectors, so even letting the disk's own write-back write caching work wouldn't be good enough.
Therefore, file writes do create "dirty" pages of kernel buffers that haven't yet been sent to the disk. Sometimes we can even avoid the I/O entirely, e.g. for tmp files that get deleted before anything triggers write-back. On Linux, dirty data becomes eligible for write-back after dirty_expire_centisecs (default 3000, i.e. 30 seconds), and the flusher threads wake every dirty_writeback_centisecs (default 500, i.e. 5 seconds), unless the kernel is running low on free pages. (The heuristics for what "low" means use other tunable values.)
If you really want writes to flush to disk immediately and wait for data to be on disk, mount with -o sync. Or for one program, have it use open(O_SYNC) or O_DSYNC (for just the data, not metadata like timestamps).
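As a sketch of the per-file option (the helper name is made up for illustration):

#include <fcntl.h>

/* Every write() on this descriptor blocks until data and metadata are
   on stable storage; O_DSYNC would wait for the data only. */
int open_sync_log(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
}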
See Are file reads served from dirtied pages in the page cache?
There are other advantages to write-back, including delayed allocation even at the filesystem level. The FS can wait until it knows how big the file will be before deciding where to put it, allowing better decisions that reduce fragmentation: e.g. a small file can go into a gap that would have been a bad place to start a potentially large file. (It just has to reserve space to make sure it can put it somewhere.) XFS was one of the first filesystems to do "lazy" delayed allocation, and ext4 has also had the feature for a while.
https://en.wikipedia.org/wiki/XFS#Delayed_allocation
https://en.wikipedia.org/wiki/Allocate-on-flush
https://lwn.net/Articles/323169/

Why memory reordering is not a problem on single core/processor machines?

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed to execute out of order, it only does so when it doesn't change the result. If it were allowed to change results, virtually every sequence of instructions could fail, and it would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, the store may not have been committed yet.
(1) Something like Hyper-Threading, with two hardware threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, with all its side effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. on a single-core machine or with green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure it's globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 platforms, which had cache-control instructions designed in from the start, do require the OS to be aware of caching: i.e. to make sure the cache is invalidated before reading in new contents from disk, and at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
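To make the question's flag pattern safe on a multi-core machine, portable code states the ordering it needs instead of relying on any one ISA. A minimal sketch using C11 atomics and <threads.h> (an illustration, not the asker's code):

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x = 0;
atomic_int f = 0;

int writer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 42, memory_order_relaxed);
    /* Release: the store to x can't be reordered after this store. */
    atomic_store_explicit(&f, 1, memory_order_release);
    return 0;
}

int reader(void *arg)
{
    (void)arg;
    /* Acquire: the load of x below can't be reordered before this. */
    while (atomic_load_explicit(&f, memory_order_acquire) == 0)
        ;
    printf("%d\n", atomic_load_explicit(&x, memory_order_relaxed)); /* always 42 */
    return 0;
}

int main(void)
{
    thrd_t t1, t2;
    thrd_create(&t1, reader, NULL);
    thrd_create(&t2, writer, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}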
so what makes sure the program order is preserved?
Hardware dependency tracking: loads snoop the store buffer to check for older stores to the location being loaded. This makes sure loads take their data from the last program-order write to any given memory location (see footnote 1).
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. It would be insane and unusable (and kill performance) if you had to put memory barriers after every store just for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address of an older store isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-address part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases where a narrow reload takes part of a wider store, possibly unaligned and maybe spanning a cache-line boundary...

Using memory mapped file with OpenCL

I access a file on disk using memory-mapped I/O (the mmap call on Linux).
Is it possible to pass this virtual memory buffer to OpenCL using CL_MEM_USE_HOST_PTR (for reading only)? And could this result in performance gains?
I want to avoid copying an entire file into host memory, and instead let the OpenCL kernel control which parts of the file get loaded/buffered by the operating system.
I think this should work - you shouldn't end up with errors, crashes, or incorrect results; whether or not it brings performance gains probably depends on hardware, driver/CL implementation, and access patterns. I would not be surprised if it didn't make much of a difference in many cases. I could imagine the GPU driver prefaulting and wiring down all the pages in order to map it into the GPU's address space.
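For what it's worth, here is a minimal C sketch of the idea; wrap_file_readonly is a made-up helper, the cl_context is assumed to already exist, and error handling is omitted for brevity:

#include <CL/cl.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

cl_mem wrap_file_readonly(cl_context context, const char *path, size_t *size_out)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* The kernel pages the file in on demand; no explicit read() copies. */
    void *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close */

    /* CL_MEM_USE_HOST_PTR asks the implementation to use our mapping
       directly; whether it copies anyway is implementation-defined. */
    cl_int err;
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                (size_t)st.st_size, base, &err);
    *size_out = (size_t)st.st_size;
    return buf;
}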

Necessity to bring program to main memory for execution?

Why is it necessary to bring a program into main memory from secondary memory for execution?
Why can't we execute a program from secondary memory directly?
It may not be possible currently, but could some future mechanism let us execute programs directly from secondary memory?
Almost all modern CPUs execute instructions by fetching them from an address in main memory identified by the instruction pointer register, loading the referenced memory through one or more cache levels before the portion of the CPU that executes the instruction even starts its work. Designing a CPU that could, for example, fetch instructions directly from a disk or network stream would be a rather large project, and performance would likely be pathetic. There's a reason you have a main memory that operates orders of magnitude faster than disk/network access, and caches between that and the actual execution cores that are orders of magnitude faster even than the main memory...
Usually, some parts of a program need to be accessed multiple times during its execution. Reading from secondary memory every single time that particular data is needed would obviously take a lot of time.
It is better to load the program into a faster memory, i.e. main memory, so that whenever a part of the program is required it can be accessed much faster. Similarly, the most frequently used variables are stored in cache memory for even faster access. It's all about speed.
If we could somehow develop affordable secondary memory as fast as main memory, we could do without copying the whole program into main memory. However, we would still need some memory to store temporaries during program execution.
The term "main memory" is used to distinguish it from external mass storage devices such as hard drives. Another term for main memory is RAM. The computer can manipulate only data that is in main memory. So, every program you execute and every file you access must be copied from a storage device into main memory. The amount of main memory on a computer is crucial because it determines how many programs can be executed at one time and how much data can be readily available to a program.

How does the CPU read from the disk?

I'm a bit confused about the whole idea of I/O; I want to know how the CPU reads from the disk (a SATA disk, for example).
When a program using read()/write() is compiled with a reference to a specific file, and the CPU encounters this reference, does it read from the disk directly (via memory-mapped I/O ports)? Or does it write to RAM and then write back to disk?
I'd suggest reading:
http://www.makelinux.net/books/ulk3/understandlk-CHP-13-SECT-1
With a supplement of:
http://en.wikipedia.org/wiki/Direct_memory_access
With regards to buffering in RAM: most programming languages and operating systems buffer at least part of I/O operations (read and write) to memory. This is usually done asynchronously: i.e. a buffer is created, filled, and then processed. For a read, the CPU would (working with the disk controller) create IO instructions to fetch data and a place to put it in memory, fill that space, and then present its contents to the program making the request. For a write request, this would be queuing write operations and their associated data and then sending them off to the IO controller and eventually the disk to be executed. Buffering can happen in multiple places: on the CPU's caches, in RAM, (sometimes) on the disk controller, or on the hard disk itself. How much buffering is done, and exactly how the abstract sequence of operations I've mentioned is handled, differs depending on your hardware architecture, OS, and task.
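As a rough illustration from the program's side (the path is just an example): the process only ever sees data the kernel has already copied into a buffer in RAM; the driver/DMA work happens below this interface.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY); /* any readable file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n;
    /* Each read() is served from the page cache when possible; only on
       a miss does the kernel program the controller/DMA and block us. */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}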
"Main memory is the only large storage area (millions to billions of bytes) that the processors can access directly,"
as "Operating System Concepts" puts it.
So if you want to run a program or manipulate some data, they (program and data) must be in main memory.
