I have a large GZIP-ed file. I want to read a few bytes from a specific offset of uncompressed data.
For example, I have a file whose original size is 10 GB; gzipped, it is 1 GB. I want to read a few bytes at the 5 GB offset of the uncompressed data from that 1 GB gzipped file.
You will need to decompress all of the first 5 GB in order to get just those bytes; a gzip (DEFLATE) stream cannot be decoded starting from an arbitrary point in the middle.
If you are frequently accessing just a few bytes from the same large gzip file, then it can be indexed for more rapid random access. You would read the entire file once to build the index. See zran.h and zran.c.
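The sequential-read requirement is easy to see with plain java.util.zip (this is just an illustration of the point above, not the zran approach): GZIPInputStream has no notion of seeking, so even skip() decompresses every byte before the target offset. A minimal Scala sketch, with path, offset and length as placeholders:

    import java.io.{BufferedInputStream, EOFException, FileInputStream}
    import java.util.zip.GZIPInputStream

    // Read `len` bytes at `offset` (an offset into the *uncompressed* data).
    // skip() still inflates everything before `offset` -- there is no shortcut
    // without an index like the one zran builds.
    def readAtUncompressedOffset(path: String, offset: Long, len: Int): Array[Byte] = {
      val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
      try {
        var remaining = offset
        while (remaining > 0) {
          val skipped = in.skip(remaining)
          if (skipped <= 0) throw new EOFException("offset is past the end of the data")
          remaining -= skipped
        }
        val buf = new Array[Byte](len)
        var read = 0
        while (read < len) {
          val n = in.read(buf, read, len - read)
          if (n < 0) throw new EOFException("unexpected end of stream")
          read += n
        }
        buf
      } finally in.close()
    }

With a zran-style index, that linear scan is paid only once, when the index is built; afterwards extraction can start from the nearest saved access point instead of the beginning of the file.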
Related
I have 1B+ gzip files (about 50 kB each) that I want to upload to S3. Since I pay for each write operation, transferring them one by one becomes a huge cost problem. The files are also very similar, so I want to combine them into larger archives, which should improve compression efficiency as well.
I'm a newbie when it comes to writing shell scripts, but I'm looking for a way to:
Find all .gz files,
Decompress the first 1,000 of them,
Recompress them together into a single archive,
Delete that batch of 1,000,
Move on to the next 1,000 files.
I'd appreciate any help in thinking about this more creatively. The only approach I can think of is decompressing all of them and then recompressing them in 1,000-file chunks, but that isn't possible because I don't have the disk space to decompress everything at once.
Test with a few files to see how much additional space is used when decompressing them. Try to free up more space (for example, move 90% of the files to another host).
When the files are similar, the compression ratio of even 10% of the files recompressed together will be high.
I guess that 10 chunks would fit, but it would be tight every time you want to decompress one, so I would go for 100 chunks.
But first think about what you want to do with the data in the future.
Never use it? Delete it.
Perhaps once, in the far future? Glacier.
Often? Use smaller chunks so you can find the right file more easily.
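To make the batching idea concrete, here is a rough sketch (in Scala rather than shell, and assuming Apache Commons Compress is on the classpath) of packing 1,000 .gz files at a time into one .tar.gz, decompressing the members first so that similar files share one compression context. The directory names and batch size are placeholders, and the originals should only be deleted once the batch archive is safely written (and uploaded):

    import java.io._
    import java.nio.file.{Files, Paths}
    import java.util.zip.{GZIPInputStream, GZIPOutputStream}
    import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveOutputStream}
    import scala.jdk.CollectionConverters._

    // Decompress each small .gz member and append it to a single .tar.gz,
    // so the compressor sees all of the similar content in one stream.
    def packBatch(gzFiles: Seq[File], target: File): Unit = {
      val tar = new TarArchiveOutputStream(
        new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(target))))
      tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX) // allow long file names
      try {
        for (f <- gzFiles) {
          val raw = readAll(new GZIPInputStream(new FileInputStream(f)))
          val entry = new TarArchiveEntry(f.getName.stripSuffix(".gz"))
          entry.setSize(raw.length.toLong)
          tar.putArchiveEntry(entry)
          tar.write(raw)
          tar.closeArchiveEntry()
        }
      } finally tar.close()
    }

    def readAll(in: InputStream): Array[Byte] =
      try {
        val out = new ByteArrayOutputStream()
        val buf = new Array[Byte](8192)
        var n = in.read(buf)
        while (n >= 0) { out.write(buf, 0, n); n = in.read(buf) }
        out.toByteArray
      } finally in.close()

    // Walk the tree, pack 1,000 files per archive, delete the originals afterwards.
    val allGz = Files.walk(Paths.get("/data/gz")).iterator().asScala
      .filter(_.toString.endsWith(".gz")).map(_.toFile).toSeq
    allGz.grouped(1000).zipWithIndex.foreach { case (batch, i) =>
      packBatch(batch, new File(f"/data/batches/batch-$i%06d.tar.gz"))
      batch.foreach(_.delete())
    }

Since each batch only needs its own 1,000 files decompressed at a time, the extra disk space required stays small.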
I've been working on a project that uses apache-poi to read .PPT files and change some attributes of the SlideShowDocInfoAtom record in the ppt file.
I can read the file using HSLFSlideShow; however, with a large ppt file (e.g. over 1 GB) and my application's JVM max heap size restricted to 2 GB, poi throws an OutOfMemoryError.
After reading the source code, I know it creates a byte array when reading one of the streams of the file. In a 1 GB file, the PowerPoint Document stream can itself be close to 1 GB, which consumes 1 GB of memory for that byte array and ends up crashing the JVM.
So, is there any way I can read a large ppt file without enlarging the JVM heap size? I only want to read some document info from the file; I don't really want to read large blocks of it, such as audio or video, into memory.
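There is no complete answer here, but one direction worth trying (a sketch only, not validated against a 1 GB deck): open the underlying OLE2 container with POI's NIO-backed POIFSFileSystem, which reads blocks from the file on demand instead of copying every stream into a byte array the way HSLFSlideShow does, and then touch only the small streams you need. The file name below is a placeholder:

    import java.io.File
    import org.apache.poi.poifs.filesystem.{DocumentEntry, POIFSFileSystem}
    import scala.jdk.CollectionConverters._

    // Open read-only, backed by the file itself rather than an in-memory copy.
    val fs = new POIFSFileSystem(new File("big.ppt"), true)
    try {
      fs.getRoot.getEntries.asScala.foreach {
        case d: DocumentEntry => println(s"${d.getName}: ${d.getSize} bytes")
        case other            => println(s"${other.getName} (directory)")
      }
      // The record tree (including SlideShowDocInfoAtom) lives inside the
      // "PowerPoint Document" stream; it can be read incrementally with
      // fs.createDocumentInputStream("PowerPoint Document") instead of
      // materialising the whole stream as one byte array.
    } finally fs.close()

Whether that is enough to locate and change SlideShowDocInfoAtom without loading the big stream depends on where the record sits, but enumerating and streaming this way at least avoids the up-front 1 GB allocation.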
I'm trying to understand the Cassandra read path, and I can't work out why we need a compression offset map.
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
The partition index resides on disk and stores an index of all partition keys mapped to their offset.
The compression offset map stores pointers to the exact location on disk that the desired partition data will be found.
Why do we need both of them? Why can't the partition index store pointers to the exact location on disk?
I'm sorry for the clumsy title, but that's what Stack Overflow made me use; I couldn't use "Why do we need a compression offset map if we have a partition index?"
The file is compressed in chunks: by default, 64 KB of data is compressed, then the next 64 KB, and so on. The offsets written in the index file are those of the uncompressed data; this is because, while writing, Cassandra knows how many uncompressed bytes it has written so far and uses that count to mark where each new partition starts. The compression offset map relates compressed chunk offsets to their uncompressed positions, so Cassandra knows which chunk to start decompressing to reach the partition at a given uncompressed offset taken from the index.
If a partition sits in the middle of a 64 KB compressed chunk, you need to decompress that entire chunk; you cannot start reading in the middle of it because of how the compression algorithms work. This is why, in some situations, it makes sense to decrease the chunk size: it reduces the overhead of reading a tiny partition.
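A toy Scala illustration of how the two structures cooperate (made-up numbers, not Cassandra's actual code): the partition index yields an uncompressed offset, and the compression offset map translates it into the compressed chunk that has to be decompressed.

    val chunkLength = 64 * 1024                 // uncompressed bytes per chunk (chunk_length_in_kb)
    val chunkOffsets: Array[Long] =             // where each chunk starts in the compressed file
      Array(0L, 21300L, 45120L, 70899L)         // made-up values

    // uncompressedOffset comes from the partition index
    def locate(uncompressedOffset: Long): (Long, Long) = {
      val chunk  = (uncompressedOffset / chunkLength).toInt
      val within = uncompressedOffset % chunkLength
      (chunkOffsets(chunk), within)             // decompress this whole chunk, then skip `within` bytes
    }

    val (diskOffset, skip) = locate(150000L)    // -> chunk 2: seek to 45120, inflate 64 KB, skip 18928 bytes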
I want to use Append blobs in Azure storage.
When I'm uploading a blob, I have to choose the block size.
What should I consider when choosing the block size?
I see no difference when I upload a file that is bigger than the block size.
How to choose the right block size?
Based on your description, I did some research; you can refer to it for a better understanding of how an append blob's blocks work.
I checked CloudAppendBlob.AppendText and CloudAppendBlob.AppendFromFile. If the file or text content is smaller than 4 MB, it is uploaded as a new individual block. I used CloudAppendBlob.AppendText to append text content (smaller than 4 MB) three times, and the network traces showed one new block per call.
For content larger than 4 MB, the client SDK divides the content into 4 MB pieces and uploads each piece as its own block. I uploaded a file of about 48.8 MB, and the network traces showed the same behavior.
As Gaurav Mantri mentioned, you could choose a small block size for a slow network. Small blocks also give you better performance on write requests, but when you read the data it then spans many separate blocks, which slows down your read requests. So it depends on the write/read ratio your application expects: for optimal reads, I recommend batching writes to be as close to 4 MB as possible, which makes individual write requests slower but reads much faster.
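To make the "batch writes near 4 MB" advice concrete, here is a small sketch against the older com.microsoft.azure.storage Java SDK (the one whose CloudAppendBlob methods are mentioned above); the connection string, container and blob names are placeholders:

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
    import com.microsoft.azure.storage.CloudStorageAccount

    val FourMB    = 4 * 1024 * 1024
    val account   = CloudStorageAccount.parse(sys.env("AZURE_STORAGE_CONNECTION_STRING"))
    val container = account.createCloudBlobClient().getContainerReference("logs")
    val blob      = container.getAppendBlobReference("app.log")
    if (!blob.exists()) blob.createOrReplace()

    // Accumulate small writes and flush them as one near-4 MB block per appendBlock() call.
    val buffer = new ByteArrayOutputStream(FourMB)

    def write(chunk: Array[Byte]): Unit = {
      // assumes each individual chunk is itself smaller than 4 MB
      if (buffer.size() + chunk.length > FourMB) flush()
      buffer.write(chunk)
    }

    def flush(): Unit = if (buffer.size() > 0) {
      blob.appendBlock(new ByteArrayInputStream(buffer.toByteArray), buffer.size().toLong)
      buffer.reset()
    }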
A few things to consider when deciding on the block size:
In the case of an Append Blob, the maximum size of a block is 4 MB, so you can't go beyond that number.
Also, a maximum of 50,000 blocks can be appended, so you need to divide the blob size by 50,000 to find the minimum block size. For example, if you're uploading a 100 MB file and choose 100-byte blocks, you would end up with 1,048,576 (100x1024x1024/100) blocks, which is far above the 50,000-block limit and is therefore not allowed (there's a quick check of this arithmetic after the list).
Most importantly, I believe it depends on your Internet speed. If you have a really good connection, you can go up to the 4 MB block size; for slower connections you can reduce it. For example, I usually use a 256-512 KB block size because my Internet connection is not that good.
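A quick check of that arithmetic, using the same numbers as in the list above:

    val blobSize  = 100L * 1024 * 1024                    // 100 MB
    val blockSize = 100L                                  // 100-byte blocks
    val blocks    = blobSize / blockSize                  // 1,048,576 -> far above the 50,000-block limit
    val minBlock  = math.ceil(blobSize / 50000.0).toLong  // ~2,098 bytes is the smallest block that fits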
I need to read data from binary files. The files are small, on the order of 1 MB, so it's probably not efficient to use binaryFiles() and process them file by file (too much overhead).
I can join them into one big file and then use binaryRecords(), but the record size is just 512 bytes, so I'd like to concatenate several records together in order to produce chunks of tens of megabytes in size. The binary file format allows this.
How can I achieve this?
More generally: is this the right approach to the problem?
Thanks!
As of Spark 2.1, binaryFiles() will coalesce multiple small input files into a partition (default is 128 MB per partition), so using binaryFiles() to read small files should be much more efficient now.
See also https://stackoverflow.com/a/51460293/215945 for more details about binaryFiles() and how to adjust the default 128 MB size (if desired).
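For reference, a sketch of that route in Scala; the input path is a placeholder, and spark.files.maxPartitionBytes is (as far as I can tell) the setting the linked answer describes for changing the 128 MB default:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("read-small-binaries")
      .config("spark.files.maxPartitionBytes", 64L * 1024 * 1024) // pack ~64 MB of small files per partition
      .getOrCreate()
    val sc = spark.sparkContext

    // RDD[(path, PortableDataStream)]; each file is split into the 512-byte records
    // mentioned in the question.
    val records = sc.binaryFiles("hdfs:///data/blobs/*.bin")
      .flatMap { case (_, stream) => stream.toArray().grouped(512) }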
I'm not sure, but this approach might help: N is the number of your small files; parallelize their paths into N partitions and read each file inside its partition with ordinary I/O, for example:

    // paths: the N small-file paths, one per partition
    // (they must be readable from the executors, e.g. on a shared filesystem)
    val rdd = sc.parallelize(paths, paths.length)
      .mapPartitions(_.map(p => java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(p))))