Increase 4k buffer size for libfuse readdirplus? - fuse

Is there a way to modify or increase the size value that is passed to
the readdir/readdirplus functions?
My implementation uses the low-level API.
With directories that are rather complex, deeply nested, or contain a
large number of subdirectories, I experience a performance impact that
seems to be due to the number of repeated calls to readdir/readdirplus.
It seems a buffer larger than 4096 bytes (which is what is passed in
now) would help tremendously.
I've modified the max_read, max_readahead, and max_write values but have
not seen any effect.
Thank you in advance.
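
For reference, this is roughly the shape of a low-level readdirplus handler and where that size limit bites. It is a simplified sketch, not my actual code; next_dirent() and fill_entry_param() are hypothetical placeholders for my real iteration code, and it assumes the libfuse 3 low-level API:

    #define FUSE_USE_VERSION 34
    #include <fuse_lowlevel.h>
    #include <errno.h>
    #include <stdlib.h>

    /* Hypothetical stand-ins for the real directory iteration code. */
    struct my_dirent { const char *name; off_t off; };
    struct my_dirent *next_dirent(fuse_ino_t parent, off_t off);
    void fill_entry_param(const struct my_dirent *d, struct fuse_entry_param *e);

    /* The kernel supplies `size` (the ~4096 bytes in question); the reply
     * buffer cannot exceed it, so large directories take many round trips. */
    static void my_readdirplus(fuse_req_t req, fuse_ino_t ino, size_t size,
                               off_t off, struct fuse_file_info *fi)
    {
        (void)fi;
        char *buf = malloc(size);
        if (!buf) {
            fuse_reply_err(req, ENOMEM);
            return;
        }

        size_t used = 0;
        for (struct my_dirent *d = next_dirent(ino, off); d != NULL;
             d = next_dirent(ino, d->off)) {
            struct fuse_entry_param e;
            fill_entry_param(d, &e);

            size_t entsize = fuse_add_direntry_plus(req, buf + used, size - used,
                                                    d->name, &e, d->off);
            if (entsize > size - used)   /* entry doesn't fit: stop here and */
                break;                   /* let the kernel issue another call */
            used += entsize;
        }

        fuse_reply_buf(req, buf, used);  /* at most `size` bytes per request */
        free(buf);
    }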

Related

What's the difference between Buffer.allocUnsafe() and Buffer.allocUnsafeSlow() in NodeJS?

I've read the docs on the subject.
They say:
When using Buffer.allocUnsafe() to allocate new Buffer instances,
allocations under 4KB are sliced from a single pre-allocated Buffer.
This allows applications to avoid the garbage collection overhead of
creating many individually allocated Buffer instances. This approach
improves both performance and memory usage by eliminating the need to
track and clean up as many individual ArrayBuffer objects.
However, in the case where a developer may need to retain a small
chunk of memory from a pool for an indeterminate amount of time, it
may be appropriate to create an un-pooled Buffer instance using
Buffer.allocUnsafeSlow() and then copying out the relevant bits.
I've also found this explanation:
Buffer.allocUnsafeSlow() is different from the Buffer.allocUnsafe() method.
With allocUnsafe(), if the buffer size is less than 4KB, the required
buffer is automatically sliced out of a pre-allocated buffer, i.e. it does
not allocate a new buffer. This saves memory by not allocating many small
Buffer instances individually. But if the developer needs to hold on to
some of that memory for an indeterminate amount of time, the
allocUnsafeSlow() method can be used instead.
It is still hard for me to understand, though. Could you explain it in more detail, with some examples for both, please?
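
To check whether I understand the pooling part, here is the idea sketched as a rough C analogy. The names and sizes are made up for illustration; this is only an analogy for what Buffer.allocUnsafe()/Buffer.allocUnsafeSlow() appear to do, not Node's actual implementation:

    #include <stdio.h>
    #include <stdlib.h>

    /* Rough analogy for Buffer.allocUnsafe(): one pre-allocated pool that
     * small requests are sliced from. */
    #define POOL_SIZE   8192
    #define SMALL_LIMIT 4096   /* requests under this come from the pool */

    static char   pool[POOL_SIZE];
    static size_t pool_used = 0;

    /* "alloc_unsafe": slice from the shared pool when the request is small
     * and still fits; otherwise fall back to a dedicated allocation. */
    void *alloc_unsafe(size_t n)
    {
        if (n < SMALL_LIMIT && pool_used + n <= POOL_SIZE) {
            void *p = pool + pool_used;   /* no init, no per-slice bookkeeping */
            pool_used += n;
            return p;
        }
        return malloc(n);
    }

    /* "alloc_unsafe_slow": always a dedicated allocation, so keeping it
     * around does not tie up a shared pool. */
    void *alloc_unsafe_slow(size_t n)
    {
        return malloc(n);
    }

    int main(void)
    {
        char *a = alloc_unsafe(100);      /* slice 1 of the shared pool */
        char *b = alloc_unsafe(200);      /* slice 2 of the same pool */
        char *c = alloc_unsafe_slow(100); /* its own allocation */

        /* The analogy: `a` and `b` share one backing allocation, so the pool
         * can only be reclaimed as a whole once no slice is referenced;
         * `c` only ever ties up its own 100 bytes. */
        printf("pool used: %zu of %d bytes\n", pool_used, POOL_SIZE);
        (void)a; (void)b;
        free(c);
        return 0;
    }

So, as I read it, allocUnsafe() is for short-lived buffers, and allocUnsafeSlow() (plus copying out) is for a small piece you intend to keep for a long time. Is that right?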

How to choose chunk size when reading a large file?

I know that reading a file with a chunk size that is a multiple of the filesystem block size is better.
1) Why is that the case? I mean, let's say the block size is 8KB and I read 9KB. That means it has to go and fetch 16KB and then throw away the extra 7KB.
Yes, it did do some extra work, but does that make much of a difference unless your block size is really huge?
I mean, yes, if I am reading a 1TB file then this definitely makes a difference.
The other reason I can think of is that a block refers to a group of sectors on the hard disk (please correct me). So it could be pointing to 8 or 16 or 32 sectors, or just one. So your hard disk would essentially have to do more work if a block maps to a lot more sectors? Am I right?
2) So let's say the block size is 8KB. Do I now read 16KB at a time? 1MB? 1GB? What should I use as a chunk size?
I know available memory is a limitation, but apart from that, what other factors affect my choice?
Thanks a lot in advance for all the answers.
Theoretically, the fastest I/O occurs when the buffer is page-aligned
and its size is a multiple of the system block size.
If the file were stored contiguously on the hard disk, the fastest I/O
throughput would be attained by reading cylinder by cylinder. (There
might not even be any rotational latency then, since when you read a
whole track you don't need to start at the beginning; you can start in
the middle and wrap around.) Unfortunately, nowadays it would be nearly
impossible to do that, since the hard disk firmware hides the physical
layout of the sectors and may use replacement sectors, requiring seeks
even while reading a single track. The OS file system may also try to
spread the file blocks all over the disk (or at least, all over a
cylinder group), to avoid having to do long seeks over big files when
accessing small files.
So instead of considering physical tracks, you may try to take into
account the hard disk's buffer size. Most hard disks have a buffer size
of 8 MB, some 16 MB. So reading the file in chunks of up to 1 MB or
2 MB should let the hard disk firmware optimize the throughput without
stalling its buffer.
But then, if there are a lot of layers above, e.g. a RAID, all bets are
off.
Really, the best you can do is to benchmark your particular
circumstances.
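
To make the "benchmark it" advice concrete, this is the kind of loop to time with different chunk sizes. It is only a sketch; the 1 MB chunk and the 4096-byte alignment are starting points to vary, not recommendations:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK_SIZE (1 << 20)   /* 1 MB; vary this and measure */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Page-aligned buffer whose size is a multiple of any common block size. */
        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK_SIZE) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, CHUNK_SIZE)) > 0)
            total += n;             /* real code would process the chunk here */
        if (n < 0) perror("read");

        printf("read %lld bytes in %d-byte chunks\n", total, CHUNK_SIZE);
        free(buf);
        close(fd);
        return 0;
    }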

1GB Vector, will Vector.Unboxed give trouble, will Vector.Storable give trouble?

We need to store 1GB of contiguous bytes in memory for long periods of time (weeks to months), and are trying to choose a Vector/Array library. I have two concerns that I can't find the answer to.
Vector.Unboxed seems to store the underlying bytes on the heap, which can be moved around at will by the GC. Periodically moving 1GB of data is something I would like to avoid.
Vector.Storable solves this problem by storing the underlying bytes in the C heap. But everything I've read seems to indicate that this is really only meant for communicating with other languages (primarily C). Is there some reason that I should avoid using Vector.Storable for internal Haskell usage?
I'm open to a third option if it makes sense!
My first thought was the mmap package, which allows you to "memory-map" a file into memory, using the virtual memory system to manage paging. I don't know if this is appropriate for your use case (in particular, I don't know if you're loading or computing this 1GB of data), but it may be worth looking at.
In particular, I think this prevents the GC from moving the data around (since it's not on the Haskell heap; it's managed by the OS virtual memory subsystem). On the other hand, this interface handles only raw bytes; you couldn't have, say, an array of Customer objects or something.
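
Stripped of the Haskell wrapper, the mmap package is (as far as I know) built on the ordinary mmap(2) call, so the approach it gives you amounts to roughly this, shown here as a C sketch rather than Haskell; the file name is made up, and in Haskell you'd go through one of the package's functions (e.g. mmapFileByteString) instead:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical 1GB data file, assumed to already exist on disk. */
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
        if (st.st_size == 0) { close(fd); return 0; }

        /* The mapping lives in the process address space but is backed and
         * paged by the OS, not managed (or moved) by a language runtime's GC. */
        const unsigned char *data =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        printf("first byte: %u (of %lld mapped bytes)\n",
               data[0], (long long)st.st_size);

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }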

Block based storage

I would like to store a number of entries in a file (optimized for reading), and a good data structure for that seems to be a B+ tree. It offers O(log(n)/log(b)) access time, where b is the number of entries in one block.
There are many papers etc. describing B+ trees, but I still have some trouble understanding block-based storage systems in general. Maybe someone can point me in the right direction or answer a couple of questions:
Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) puts me at an offset that is a multiple of the device's block size on disk?
Is it right that I should only use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n, p being integers) to ensure that I always read full blocks?
Is it better to read() BLOCK_SIZE bytes into an array or to mmap() them instead? Or is there only a difference if I mmap many blocks and access only a few? Which is better?
Should I try to avoid keys spanning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes, by adding padding bytes between the data too?
Many thanks,
Christoph
In general, file systems create new files at the beginning of a new block because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything less than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are often 512 or 1024 bytes, pages usually 4096 bytes).
One exception to this rule that comes to mind would be ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.
It does not really matter, because the operating system will read data in units of full pages (normally 4kB) into the page cache anyway. Reading one byte will transfer 4kB and return a byte, reading another byte will serve you from the page cache (if it's the same page or one that was within the readahead range).
read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, and not otherwise.
The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.
Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).
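
To make the pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) pattern from the question concrete, a whole-block read looks roughly like this; it is a sketch, and BLOCK_SIZE and the file name are assumptions:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096   /* assumed; match it to the device/page size */

    /* Read `n` whole blocks starting at block index `p` into `buf`.
     * Offsets and lengths are always multiples of BLOCK_SIZE. */
    static ssize_t read_blocks(int fd, void *buf, size_t n, off_t p)
    {
        return pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE);
    }

    int main(void)
    {
        int fd = open("tree.db", O_RDONLY);   /* hypothetical B+ tree file */
        if (fd < 0) { perror("open"); return 1; }

        char *block = malloc(BLOCK_SIZE);
        ssize_t n = read_blocks(fd, block, 1, 0);   /* block 0, e.g. the root node */
        if (n < 0) { perror("pread"); return 1; }

        printf("read %zd bytes of block 0\n", n);
        free(block);
        close(fd);
        return 0;
    }

As noted above, though, the page cache will read whole pages regardless, so this buys correct alignment and simple bookkeeping rather than fewer device reads.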
Filesystems which support delayed allocation don't place new files anywhere on disc right away. Lots of newer filesystems support packing very small files into their own pages or sharing them with metadata (for example, reiser puts very tiny files into the inode?). But for larger files, mostly, yes.
You can do this, but the OS page cache will always read an entire block in, and just copy the bits you requested into your app's memory.
It depends on whether you're using direct IO or non-direct IO.
If you're using direct IO, which bypasses the OS's cache, you don't use mmap. Most databases do not use mmap and use direct IO.
Direct IO means that the pages don't go through the OS's page cache, they don't get cached at all by the OS and don't push other blocks out of the OS cache. It also means that all reads and writes need to be done on block boundaries. Block boundaries can sometimes be determined by a statfs call on the filesystem.
Most databases seem to take the view that they should manage their own page cache themselves, and use the OS only for physical reads/writes. Therefore they typically use direct and synchronous IO.
Linus Torvalds famously disagrees with this approach. I think the vendors really do it to achieve better consistency of behaviour across different OSs.
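
For completeness, the direct IO route looks roughly like this on Linux. This is only a sketch: the 4096-byte figure and the file name are assumptions, and the real alignment requirement depends on the device and filesystem:

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096

    int main(void)
    {
        /* O_DIRECT bypasses the OS page cache entirely. */
        int fd = open("tree.db", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* With direct IO the buffer, offset, and length must all be aligned
         * to the (logical) block size, hence posix_memalign. */
        void *buf;
        if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        ssize_t n = pread(fd, buf, BLOCK_SIZE, 0 * BLOCK_SIZE);
        if (n < 0) perror("pread");
        else printf("read %zd bytes, bypassing the OS cache\n", n);

        free(buf);
        close(fd);
        return 0;
    }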
Yes. Doing otherwise would cause unnecessary complications in FS design.
And the options (as an alternative to "only") are ...?
On Windows, memory-mapped files work faster than the file API (ReadFile). I guess on Linux it's the same, but you can conduct your own measurements.

Memcached chunk limit

Why is there a hardcoded chunk limit (.5 meg after compression) in memcached? Has anyone recompiled theirs to up it? I know I should not be sending big chunks like that around, but these extra heavy chunks happen for me from time to time and wreak havoc.
This question used to be in the official FAQ
What are some limits in memcached I might hit? (Wayback Machine)
To quote:
The simple limits you will probably see with memcache are the key and
item size limits. Keys are restricted to 250 characters. Stored data
cannot exceed 1 megabyte in size, since that is the largest typical
slab size.
The FAQ has now been revised and there are now two separate questions covering this:
What is the maximum key length? (250 bytes)
The maximum size of a key is 250 characters. Note this value will be
less if you are using client "prefixes" or similar features, since the
prefix is tacked onto the front of the original key. Shorter keys are
generally better since they save memory and use less bandwidth.
Why are items limited to 1 megabyte in size?
Ahh, this is a popular question!
Short answer: Because of how the memory allocator's algorithm works.
Long answer: Memcached's memory storage engine (which will be
pluggable/adjusted in the future...), uses a slabs approach to memory
management. Memory is broken up into slab chunks of varying sizes,
starting at a minimum size and growing by a fixed factor up to the
largest possible value.
Say the minimum value is 400 bytes, the maximum value is 1 megabyte,
and the growth factor is 1.20:
slab 1 - 400 bytes
slab 2 - 480 bytes
slab 3 - 576 bytes
... etc.
The larger the slab, the more of a gap there is between it and the
previous slab. So the larger the maximum value the less efficient the
memory storage is. Memcached also has to pre-allocate some memory for
every slab that exists, so setting a smaller growth factor with a
larger max value will require even more overhead.
There are other reasons why you wouldn't want to do that... If we're
talking about a web page and you're attempting to store/load values
that large, you're probably doing something wrong. At that size it'll
take a noticeable amount of time to load and unpack the data structure
into memory, and your site will likely not perform very well.
If you really do want to store items larger than 1MB, you can
recompile memcached with an edited slabs.c:POWER_BLOCK value, or use
the inefficient malloc/free backend. Other suggestions include a
database, MogileFS, etc.
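
To see how quickly those gaps grow, the progression from the 400-byte / 1.20 example above can be computed out. This is only a quick sketch; real memcached also rounds chunk sizes for alignment, so its table differs slightly:

    #include <stdio.h>

    int main(void)
    {
        /* Reproduce the answer's example: 400-byte minimum, 1.20 growth
         * factor, 1 MB maximum item size. */
        double size = 400.0;
        double factor = 1.20;
        int max = 1024 * 1024;

        for (int slab = 1; size <= max; slab++, size *= factor)
            printf("slab %2d - %8.0f bytes (gap to next class: %.0f)\n",
                   slab, size, size * factor - size);
        return 0;
    }

The first few lines match the example (400, 480, 576, ...), and the per-class gap grows from 80 bytes to well over 100 KB near the 1 MB cap, which is the wasted space the answer is describing.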
