Linux: huge files vs huge number of files

I am writing software in C, on Linux running on AWS, that has to handle 240 terabytes of data, in 72 million files.
The data will be spread across 24 or more nodes, so there will only be 10 terabytes on each node, and 3 million files per node.
Because I have to append data to each of these three million files every 60 seconds, the easiest and fastest thing to do would be to keep all of these files open at the same time.
I can't store the data in a database, because the performance in reading/writing the data will be too slow. I need to be able to read the data back very quickly.
My questions:
1) Is it even possible to keep 3 million files open?
2) If it is possible, how much memory would it consume?
3) If it is possible, would performance be terrible?
4) If it is not possible, I will need to combine all of the individual files into a couple of dozen large files. Is there a maximum file size in Linux?
5) If it is not possible, what technique should I use to append data every 60 seconds and keep track of it?

The following is a very coarse description of an architecture that can work for your problem, assuming that the maximum number of file descriptors is irrelevant when you have enough instances.
First, take a look at this:
https://aws.amazon.com/blogs/aws/amazon-elastic-file-system-shared-file-storage-for-amazon-ec2/
https://aws.amazon.com/efs/
EFS provides shared storage that you can mount as a filesystem.
You can store ALL your files in a single EFS storage unit. You will then need a set of N worker machines, each keeping as many file handles open as its limit allows. Use a Redis queue to distribute the updates: each worker dequeues a set of updates from Redis, opens the necessary files, and performs the updates.
Again: the maximum number of open file handles will not be a problem, because if you hit the maximum, you only need to add worker machines until you reach the performance you need.
This is scalable, though I'm not sure it is the cheapest way to solve your problem.
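As a rough illustration of the worker step described above, here is a minimal Python sketch (the question's software is in C, so treat this as pseudocode for the idea). It assumes a batch of updates has already been dequeued; the names `apply_updates` and `open_file_limits` are made up for this example, and the Redis part is left out entirely. Each file is opened only for as long as it takes to append to it, so the per-process descriptor limit is never approached.

```python
import os
import resource
from collections import defaultdict


def open_file_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def apply_updates(base_dir, updates):
    """Append each payload to its file, holding only one file open at a time.

    `updates` is an iterable of (relative_name, payload_bytes) pairs and stands
    in for a batch dequeued from the queue (hypothetical format).
    Returns the number of distinct files touched.
    """
    # Group payloads per file so each file is opened once per batch.
    grouped = defaultdict(list)
    for name, payload in updates:
        grouped[name].append(payload)

    for name, payloads in grouped.items():
        path = os.path.join(base_dir, name)
        # Append mode creates the file if missing and always writes at the end.
        with open(path, "ab") as fh:
            fh.write(b"".join(payloads))
    return len(grouped)
```

Calling `open_file_limits()` on a worker shows how many descriptors the process may hold at once, which is the number that decides whether keeping files open is an option at all.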

Related

Correct Parquet file size when storing in S3?

I've been reading a few questions on this topic, as well as several forums, and they all seem to say that each of the resulting .parquet files coming out of Spark should be either 64MB or 1GB in size. I still can't work out which scenarios call for each of those file sizes, or the reasons behind them, apart from HDFS splitting files into 64MB blocks.
My current testing scenario is the following.
dataset
  .coalesce(n) // 'n' is 4 or 48; reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)
I'm currently handling a total of 2.5GB to 3GB of daily data, which will be split and saved into daily buckets per year. 'n' is 4 or 48 just for testing purposes: since I know the size of my test set in advance, I try to get as close to 64MB or 1GB per file as I can. I haven't implemented code to buffer the data until I reach the exact size I need prior to saving.
So my question here is...
Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?
Also, what would be the optimal size for daily datasets of around 10GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead scheduling work, starting processing, committing tasks
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on s3 is very slow
impossible to ask for larger block sizes than the file length
easier to save if your s3 client doesn't do incremental writes in blocks. (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload true.)
Personally (this is opinion, partly benchmark-driven, though not with your queries):
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
With Hadoop 3.1, use the zero-rename committers. Otherwise, switch to v2.
if your FS connector supports this, make sure random IO is turned on (hadoop-2.8 + spark.hadoop.fs.s3a.experimental.fadvise random)
save to larger files via .repartition().
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB
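To make the "n is 4 or 48" arithmetic from the question concrete, here is a small Python sketch that estimates how many output files to aim for given a dataset size and a target file size. The function name `partitions_for` is made up for this example, and the estimate is based on raw input size only: it ignores Parquet compression and the extra splitting that partitionBy causes, so treat the result as a starting point for coalesce or repartition rather than an exact answer.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def partitions_for(total_bytes, target_file_bytes):
    """Estimate the number of output files needed so that each file is
    at most roughly `target_file_bytes` (integer ceiling division)."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


# Assumed daily volume from the question: about 3 GB.
files_at_64mb = partitions_for(3 * GIB, 64 * MIB)
files_at_1gb = partitions_for(3 * GIB, 1 * GIB)
```

With 3 GB of input, a 64 MB target gives 48 files, which matches the larger value of n used in the question.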

Faster way to split way big file in to smaller files?

I have a small file which is about 6.5 GB and I tried to split it into files of size 5MB each using split -d --line-bytes=5MB. It took me over 6 minutes to split this file.
I have files over 1TB.
Is there a faster way to do this?
Faster than a tool specifically designed to do this kind of job? Doesn't sound likely in the general case. However, there are a few things you may be able to do:
Save the output files to a different physical storage unit. This avoids reading and writing data to the same disk at the same time, allowing more uninterrupted processing.
If the record size is static you can use --bytes to avoid the processing overhead of dealing with full lines.
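To show what splitting on a fixed byte count amounts to, here is a minimal Python sketch. It is only an illustration of why the fixed-size case is cheap (no scanning for line ends, just block reads and writes); I would not expect it to beat split itself. The function name and the output naming scheme are made up for this example.

```python
import os


def split_by_bytes(src_path, out_dir, chunk_bytes, prefix="x"):
    """Split `src_path` into files of `chunk_bytes` each (the last may be
    shorter) and return the list of output paths in order."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(src_path, "rb") as src:
        index = 0
        while True:
            # No line handling at all: read one block, write one block.
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            out_path = os.path.join(out_dir, f"{prefix}{index:05d}")
            with open(out_path, "wb") as out:
                out.write(chunk)
            paths.append(out_path)
            index += 1
    return paths
```

Pointing `out_dir` at a different physical disk than the source applies the first suggestion above as well.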

What is the fastest way to put a large amount of data on a local file system onto a distributed store?

I have a single local directory on the order of 1 terabyte. It is made up of millions of very small text documents. If I were to iterate through each file sequentially for my ETL, it would take days. What would be the fastest way for me to perform ETL on this data, ultimately loading it onto a distributed store like hdfs or a redis cluster?
Generically: try to use several/many parallel asynchronous streams, one per file. How many will depend on several factors (number of destination endpoints, disk IO for traversing/reading data, network buffers, errors and latency...)
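A minimal sketch of the parallel-streams idea in Python, using a thread pool with one task per file. The `sink` callable is a placeholder for whatever writes to HDFS or Redis (it is an assumption of this example and must be safe to call from several threads), and `max_workers` is the knob to tune against the factors listed above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def iter_files(root):
    """Yield the path of every regular file under `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def load_one(path, sink):
    """Read one document and hand it to the sink; return its size in bytes."""
    with open(path, "rb") as fh:
        data = fh.read()
    sink(path, data)
    return len(data)


def load_all(root, sink, max_workers=32):
    """Load every file under `root` through `sink` using parallel workers.

    Returns the total number of bytes loaded.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sizes = pool.map(lambda p: load_one(p, sink), iter_files(root))
        return sum(sizes)
```

Note that `pool.map` submits every file up front, so with millions of files a real loader would probably feed the pool in batches instead.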

cassandra read performance for large number of keys

Here is the situation:
I am trying to fetch around 10k keys from a CF.
Size of cluster : 10 nodes
Data on node : 250 GB
Heap allotted : 12 GB
Snitch used : property snitch with 2 racks in same Data center.
no. of sstables for cf per node : around 8 to 10
I am using the supercolumn approach. Each row contains around 300 supercolumns, which in turn contain 5-10 columns each. I am firing a multiget with 10k row keys and 1 supercolumn.
The first time I fire the call, it takes around 30 to 50 seconds to return the result. After that, Cassandra serves the data from the key cache and returns the result in 2-4 seconds.
So Cassandra read performance is hampering our project. I am using phpcassa. Is there any way I can tweak the Cassandra servers so that I get results faster?
Does the super column approach affect read performance?
Use of super columns is best suited for use cases where the number of sub-columns is a relatively small number. Read more here:
http://www.datastax.com/docs/0.8/ddl/column_family
Just in case you haven't done this already, since you're using phpcassa library, make sure that you've compiled the Thrift library. Per the "INSTALLING" text file in the phpcassa library folder:
Using the C Extension
The C extension is crucial for phpcassa's performance.
You need to configure and make to be able to use the C extension.
cd thrift/ext/thrift_protocol
phpize
./configure
make
sudo make install
Add the following line to your php.ini file:
extension=thrift_protocol.so
After doing a lot of R&D on this, we figured out that there is no way to get this working optimally.
When Cassandra fetches these 10k rows for the first time, it is going to take time, and there is no way to optimize that.
1) In practice, however, people are more likely to access the same records, so we take maximum advantage of the key cache. The default key cache setting is 2 MB; we could afford to increase it to 128 MB with no memory problems.
After loading the data, run the expected queries to warm up the key cache.
2) The JVM works optimally at 8-10 GB (I don't have numbers to prove it; this is just an observation).
3) Most important: if you are using physical machines (not cloud or virtual machines), check which disk scheduler you are using. Set it to NOOP, which is good for Cassandra because it reads all keys from one section, reducing disk head movement.
The above changes helped bring the query time down to within acceptable limits.
In addition, if you have CFs that are small but frequently accessed, enable row caching for them.
Hope the above info is useful.
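For point 3, a small Python sketch for checking which scheduler a disk currently uses. On Linux the file /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets; the helper names and the device name are placeholders for this example. As far as I know, newer multi-queue kernels call the equivalent of NOOP "none", so check what your kernel actually offers.

```python
def active_scheduler(sysfs_text):
    """Return the scheduler marked active (in square brackets) in the text
    of a queue/scheduler file, or None if nothing is marked."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_scheduler(device):
    """Read the active I/O scheduler for a block device such as 'sda'
    (the device name is whatever your machine uses)."""
    with open(f"/sys/block/{device}/queue/scheduler") as fh:
        return active_scheduler(fh.read())
```

Changing the scheduler is done by writing the chosen name into the same file as root.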

Rackspace cloud files: how to size containers to optimize performance?

Rackspace cloud files uses a flat storage system using 'containers' to store files. According to Rackspace there is no limit to the number of files per container.
My question is whether there is a best/most efficient number of files per container to optimize write/fetch performance.
If I have tens of thousands of files to store, should they all go in a single giant container or partitioned into many smaller containers? And if so, what is the optimal container size?
FYI:
[Snippets taken from rackspace support]
long story short, the containers are databases, and the more rows in a table, the more time it takes to write them on standard hardware. When a write hasn't been committed to disk, it sits in a queue, and it is subject to data loss. It's something we noticed with large containers, and the more objects, the more likely it was, so we instituted the limits to protect the data.
because of the rate limits, your data is safe, it just slows down the writes a bit
the limits start as low as 50,000 objects, and at that level it limits you to 100 writes per second
by 1,000,000 objects in a container, it's 25 per second
and at 5 million and above, you're down to 4 writes per second
We apologize for the limitations, and will be updating our documentation to more clearly express this.
-This has recently hurt us quite badly. Thought I'd share until they get their API docs up to date, so others can plan around this issue.
We recommend no more than 1 million objects per container. The system will return a maximum of 10,000 object names per list request by default.
Update 9/20/2013 from Cloud Files development: The 1 million object per container recommendation is no longer accurate since Cloud Files switched to all SSD container servers. Also, the list is limited to 10,000 containers at a time.
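If you do decide to partition objects across containers to stay under some per-container object count, one simple way is to pick the container from a hash of the object name. This is only a sketch: the function names and the "files-0000" naming scheme are made up for this example, nothing here calls the Cloud Files API, and the right threshold (if any still applies after the SSD change) is something to confirm with Rackspace.

```python
import hashlib


def containers_needed(total_objects, max_per_container):
    """Number of containers needed to keep each one at or under
    `max_per_container` objects (integer ceiling division)."""
    if max_per_container <= 0:
        raise ValueError("max_per_container must be positive")
    return max(1, (total_objects + max_per_container - 1) // max_per_container)


def container_for(object_name, num_containers, prefix="files"):
    """Map an object name to a container name deterministically, so the
    same object always lands in (and is fetched from) the same container."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    index = int(digest, 16) % num_containers
    return f"{prefix}-{index:04d}"
```

Because the mapping depends on `num_containers`, changing the number of containers later moves most objects, so it is worth choosing that number with some headroom.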
