Should access to files stored in an hsqldb database be serialized? - multithreading

Given:
One can access an HSQLDB database concurrently using connections pooled with the help of the Apache Commons DBCP package.
I store files in a CACHED table in an embedded HSQLDB database.
It is known that files on a conventional hard drive (as opposed to a solid-state drive) should not be accessed from multiple threads, because we are likely to get performance degradation rather than a boost. This is because of the time it takes to move the mechanical read head back and forth between the files on each thread context switch.
Question:
Does this rule hold for files managed in an HSQLDB database? The file sizes may range from several KB to several MB.

HSQLDB accesses two files for data storage during operations: one file for all CACHED table data, and another file for all the lobs. It manages access to these files internally.
With multiple threads, there is a possibility of reduced access speed in the following circumstances:
Simultaneous read and write access to large tables.
Simultaneous read and write access to lobs larger than 500KB.

Related

What is best between multiple small h5 files or one huge one?

I'm working with huge satellite data that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple threads.
[settings: Python, Ubuntu 18.04]
I can't find any answer on which is best, in terms of data access and storage, between:
registering all the data in one huge HDF5 file (over 20 GB)
splitting it into multiple (over 16,000) small HDF5 files (approx. 1.4 MB each).
Is there any problem with multiple threads accessing one file? And in the other case, is there an impact from having that many files?
I would go for multiple files if I were you (but read till the end).
Intuitively, you could load at least some files into memory, speeding up the process a little bit (it is unlikely you would be able to do so with 20GB; if you are, then you definitely should, as RAM access is much faster).
You could cache those examples (inside a custom torch.utils.data.Dataset instance) during the first pass and retrieve the cached examples (stored in a list, or preferably some more memory-efficient data structure with better cache locality) instead of reading from disk - a similar approach to the one in Tensorflow's tf.data.Dataset object and its cache method (see the sketch below).
On the other hand, this approach is more cumbersome and harder to implement correctly, though if you are only reading the file with multiple threads you should be fine and there shouldn't be any locks on this operation.
Remember to measure your approach with PyTorch's profiler (torch.utils.bottleneck) to pinpoint exact problems and verify solutions.
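A minimal sketch of that caching idea, assuming one small HDF5 file per tile with a single dataset named "tiles" inside it (the file layout and names are illustrative assumptions, not part of the question):

import h5py
import torch
from torch.utils.data import Dataset

class CachedTileDataset(Dataset):
    # Caches each tile in memory after its first read, so later epochs
    # skip the disk entirely. Note: with DataLoader num_workers > 0,
    # each worker process keeps its own copy of the cache.
    def __init__(self, h5_paths):
        self.h5_paths = list(h5_paths)
        self._cache = {}  # index -> tensor, filled lazily

    def __len__(self):
        return len(self.h5_paths)

    def __getitem__(self, idx):
        if idx not in self._cache:
            with h5py.File(self.h5_paths[idx], "r") as f:
                self._cache[idx] = torch.from_numpy(f["tiles"][...])
        return self._cache[idx]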

Would a short-lived temp file avoid actual disk io?

I have a Linux server on AWS that is hosting a Postgres database server and a Node.js API server. It has 240GB RAM, and a major portion is dedicated to the database. Approx 32GB of memory is left to the OS to share among the app server, various OS-level caches, and whatever need arises. There is no explicit fsync. The database + indexes add up to about 500-600 GB.
Within the database some operations may spill over to disk for want of enough work memory, e.g., a sort operation spilling to a temp file. Thus, within a span of say 1-2 seconds, a temp file may be created, written to, read back and then deleted. The temp file size, I am guessing, should seldom exceed 100MB.
Question - will the temp file in this scenario necessarily cause disk io? If not, how does one assess the likelihood? What parameters majorly influence this?

Storing media files in Cassandra

I tried to store audio/video files in the database.
Is Cassandra able to do that? If yes, how do we store the media files in Cassandra?
How about storing the metadata and the original audio files in Cassandra?
Yes, Cassandra is definitely able to store files in its database, as "blobs", strings of bytes.
However, it is not ideal for this use case:
First, you are limited in blob size. The hard limit is 2GB, so large videos are out of the question. But worse, the documentation from DataStax (the commercial company behind Cassandra's development) suggests that even 1 MB (!) is too large - see https://docs.datastax.com/en/cql/3.1/cql/cql_reference/blob_r.html.
One of the reasons why huge blobs are a problem is that Cassandra offers no API for fetching parts of them - you need to read (and write) a blob in one CQL operation, which opens up all sorts of problems. So if you want to store large files in Cassandra, you'll probably want to split them up into many small blobs - not one large blob.
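For illustration, here is a hedged sketch of that chunking approach using the DataStax Python driver; the keyspace, table and column names and the 512KB chunk size are assumptions for the example, not something prescribed above:

import uuid
from cassandra.cluster import Cluster

CHUNK_SIZE = 512 * 1024  # keep each blob well below the ~1 MB guidance

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("media_keyspace")  # hypothetical keyspace

# Assumed table: media_chunks (file_id uuid, chunk_no int, data blob,
#                              PRIMARY KEY (file_id, chunk_no))
insert = session.prepare(
    "INSERT INTO media_chunks (file_id, chunk_no, data) VALUES (?, ?, ?)"
)

def store_file(path):
    # Split the file into many small blobs, one row per chunk.
    file_id = uuid.uuid4()
    with open(path, "rb") as f:
        chunk_no = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            session.execute(insert, (file_id, chunk_no, chunk))
            chunk_no += 1
    return file_id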
The next problem is that some of Cassandra's implementation is inefficient when the database contains files (even if split up into a bunch of smaller blobs). One of the problems is the compaction algorithm, which ends up copying all the data over and over (a logarithmic number of times) on disk; an implementation optimized for storing files would keep the file data and the metadata separately, and only "compact" the metadata. Unfortunately, neither Cassandra nor Scylla implements such a file format yet.
All-in-all, you're probably better off storing your metadata in Cassandra but the actual file content in a different object-store implementation.

What is the max Excel File Size (in MB) which can be imported by SSIS

I want to know the max Excel file size which we can load into a DB using a simple ETL SSIS package. If the file size depends upon system configuration or resources, then how can we calculate it? In my case I am trying to load an Excel file of 500+ MB.
My package hangs even while trying to map columns.
Thanks.
The only real limitation is the size of the memory (RAM) on the machine where the package is running, as SSIS loads the data into memory.
Thus, if you only have 2GB of RAM, I wouldn't try to load files bigger than 1GB. (You must leave RAM for SQL Server to operate, and don't forget about all your other applications.)
Also remember if you're not pipelining your data flows properly, and you have blocking parts like Aggregate or SQL Command objects, then you are going to be loading way more into memory than you should be.
The file size is not as important if you have no blocking parts. SSIS won't load the entire object into memory, and you can specify how much it uses. But if there are blocking parts, then it will need the entire object in memory.
Note that another big memory hog could be Lookup tasks with Full Caching - these can take up large amounts of memory if you are loading big tables.
Hope this helps.

What is the fastest way to put a large amount of data on a local file system onto a distributed store?

I have a single local directory on the order of 1 terabyte. It is made up of millions of very small text documents. If I were to iterate through each file sequentially for my ETL, it would take days. What would be the fastest way for me to perform ETL on this data, ultimately loading it onto a distributed store like hdfs or a redis cluster?
Generically: try to use several/many parallel asynchronous streams, one per file. How many will depend on several factors (number of destination endpoints, disk IO for traversing/reading data, network buffers, errors and latency...)
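As a hedged illustration of the parallel-streams idea, here is a sketch that pushes small text files into Redis with a thread pool; the Redis host, the key scheme, and the worker count are assumptions, and an HDFS client could be slotted in the same way:

import os
from concurrent.futures import ThreadPoolExecutor
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis endpoint

def upload_file(path):
    # One stream per file: read the document and push it as a single value.
    with open(path, "rb") as f:
        r.set("doc:" + os.path.basename(path), f.read())

def upload_tree(root, workers=32):  # worker count is a tuning assumption
    paths = (os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(root)
             for name in names)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Drain the iterator so any upload errors surface here.
        list(pool.map(upload_file, paths))

upload_tree("/data/documents")  # hypothetical source directory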
