I'm trying to write a collection of YARA signatures that will tag zip files based on artifacts of their creation.
I understand the EOCD has a magic number of 0x06054b50, and that it is located at the end of the archive structure. It has a variable-length comment field with a max length of 0xFFFF, so the EOCD could be up to 0xFFFF + 22 bytes long. However, there could be data after the zip structure that could throw off any offset-dependent scanning.
Is there any way to locate the record without scanning the whole file for the magic bytes? How do you validate that the magic bytes aren't there by coincidence if there can be data after the EOCD?
This is typically done by scanning backwards from the end of the file until you find the EOCD signature. Yes, it is possible to find the same signature embedded in the comment, so you need to check other parts of the EOCD record to see if they are consistent with the file you are reading.
For example, if the fixed part of the EOCD record doesn't end exactly at the end of the file, the comment length field in the EOCD cannot be zero: it should match the number of bytes remaining in the file after it.
Similarly, if this is a single-disk archive, the offset of the start of the central directory needs to point somewhere within the zip archive. If you follow that offset, you should find the signature of a central directory file header (0x02014b50).
And so on.
Note that I've ignored the complications of the Zip64 records and encryption records, but the principle is the same. You need to check the fields in the records are consistent with the file being read.
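For illustration, here is a minimal Python sketch of that backward scan with the consistency checks described above. It is a sketch only: field offsets follow the ZIP application note, error handling is omitted, and the central-directory check assumes the archive starts at file offset 0.

    import struct

    EOCD_SIG = b"PK\x05\x06"       # 0x06054b50, little-endian on disk
    CDFH_SIG = 0x02014b50          # central directory file header signature
    EOCD_FIXED_SIZE = 22           # EOCD size without the comment

    def find_eocd(path):
        with open(path, "rb") as f:
            f.seek(0, 2)
            file_size = f.tell()
            # The EOCD can start at most 22 + 0xFFFF bytes before the end.
            max_scan = min(file_size, EOCD_FIXED_SIZE + 0xFFFF)
            f.seek(file_size - max_scan)
            tail = f.read(max_scan)

            pos = len(tail) - EOCD_FIXED_SIZE
            while pos >= 0:
                if tail[pos:pos + 4] == EOCD_SIG:
                    (_, disk_no, cd_disk, entries_disk, entries_total,
                     cd_size, cd_offset, comment_len) = struct.unpack(
                        "<IHHHHIIH", tail[pos:pos + EOCD_FIXED_SIZE])
                    eocd_start = file_size - max_scan + pos
                    # Check 1: the comment length must account for every
                    # byte between the fixed EOCD and the end of the file.
                    if eocd_start + EOCD_FIXED_SIZE + comment_len != file_size:
                        pos -= 1
                        continue
                    # Check 2 (single-disk archive): the central directory
                    # must fit before the EOCD and start with its signature.
                    if cd_offset + cd_size > eocd_start:
                        pos -= 1
                        continue
                    f.seek(cd_offset)
                    if struct.unpack("<I", f.read(4))[0] != CDFH_SIG:
                        pos -= 1
                        continue
                    return eocd_start
                pos -= 1
        return None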
I want to insert a string in any position (beginning, end, middle) without overlapping/overwriting the existing data.
I've tried using
fs.createWriteStream(file, {start: 0, flags: 'r+'}), but this overwrites the existing data rather than inserting the new data before it.
I've seen solutions that read the data into buffers and then rewrite it into the file, which won't work for me because I need to handle very large data and buffers have their limits.
Any thoughts on this?
The usual operating systems (Windows, Unix, Mac) do not have file systems that support inserting data into a file without rewriting everything that comes after. So, you cannot directly do what you're asking in a regular file.
Some of the technical choices you have are:
You rewrite all the data that comes after the insertion point to new locations in the file, essentially moving it up, and then write your new content at the desired position. Note that it's a bit tricky to do this without overwriting data you need to keep: you essentially have to start at the end of the file, read a block, write it to a new, higher location, seek back a block, and repeat until you get to your insert location (see the sketch after these options).
You create your own little file system where a logical file consists of multiple files linked together by some sort of index. Then, to insert some data, you split a file into two (which involves rewriting some data), and you can then insert at the end of one of the split files. This can get very complicated very quickly.
You keep your data in a database where it's easier to insert new data elements and keep some sort of index that establishes their order that you can query by. Databases are in the business of managing how to store data on disk while offering you views of the data that are not directly linked to how it's stored on the disk.
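A minimal sketch of the first option (the backwards block shift), written here in Python for brevity; the same open-in-place, seek, read, write pattern maps directly onto Node's fs.read/fs.write on an 'r+' file descriptor. The helper name and block size are arbitrary choices, not an existing API.

    import os

    def insert_into_file(path, offset, data, block_size=1 << 20):
        """Insert `data` at `offset` by shifting everything after it
        toward the end of the file, working backwards block by block."""
        with open(path, "r+b") as f:
            f.seek(0, os.SEEK_END)
            old_size = f.tell()
            remaining = old_size - offset      # bytes that have to move
            shift = len(data)

            # Walk backwards so each block is copied before its old
            # location can be overwritten.
            while remaining > 0:
                chunk = min(block_size, remaining)
                src = offset + remaining - chunk
                f.seek(src)
                block = f.read(chunk)
                f.seek(src + shift)
                f.write(block)
                remaining -= chunk

            # The tail has moved; the gap at `offset` now takes the new data.
            f.seek(offset)
            f.write(data)

Only one block is ever held in memory, so the file size doesn't matter, but the whole tail still gets rewritten, which is why this is slow for insertions near the start of a large file.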
Related answer: What is the optimal way of merge few lines or few words in the large file using NodeJS?
I have a problem where I am trying to split a file into n-character-length records for a distributed system. I have the functionality for breaking up a record and mapping it to the proper names at the record level, but I need to go from the file sitting on the system to breaking the file up and passing it out to the nodes in n-length pieces to be split and processed.
I have looked into the specs for the SparkContext object and there is a method to pull in a file from the Hadoop environment and load it as an RDD of fixed-length byte-array records. The function is binaryRecords.
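For what it's worth, a rough PySpark sketch of how binaryRecords is typically called, assuming genuinely fixed-width records; the paths, record length, and field slicing below are placeholders, not part of the question.

    from pyspark import SparkContext

    RECORD_LENGTH = 100  # placeholder: n bytes per fixed-width record

    sc = SparkContext(appName="fixed-width-records")

    # Each RDD element is a bytes object of exactly RECORD_LENGTH bytes.
    records = sc.binaryRecords("hdfs:///data/input.dat", RECORD_LENGTH)

    def parse_record(raw):
        # placeholder field layout: slice the fixed-width fields apart
        return (raw[0:10].decode("ascii").strip(),
                raw[10:RECORD_LENGTH].decode("ascii").strip())

    # The per-record mapping logic then runs in parallel on the executors.
    parsed = records.map(parse_record)
    parsed.saveAsTextFile("hdfs:///data/output")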
Say I have a really large zip file (80GB) containing one massive CSV file (> 200GB).
Is it possible to fetch a subsection of the 80GB file data, modify the central directory, and extract just that bit of data?
Background on my problem:
I have a cyclic process that does a summing on a certain column of a large zipped CSV file stashed in the cloud.
What I do today is stream the file to my disk, extract it, and then stream the extracted file line by line. This makes it a very disk-bound operation. Disk IS the bottleneck for sure.
Sure, I can leverage other cloud services to get what I need faster but that is not free.
I'm curious if I can see speed gains by just taking 1 GB subsections of the zip until there's nothing left to read.
What I know:
The Zip file is stored using the deflate compression algorithm (always)
In the API I use to get the file from the cloud, I can specify a byte range to filter to. This means I can seek through the bytes of a file without hitting disk!
According to the zip file specs, there are three major parts to a zip file, in order:
1: A header describing the file and its attributes
2: The raw file data in deflated format
3: The central directory, listing which files start and stop at which bytes
What I don't know:
How the deflate algorithm works exactly. Does it jumble the file up or does it just compress things in order of the original file? If it does jumble, this approach may not be possible.
Has anyone built a tool like this already?
You can always decompress starting from the beginning, going as far as you like, keeping only the last, say, 1 GB once you get to where you want. You cannot just start decompressing somewhere in the middle. At least not with a normal .zip file that has not been specially prepared for random access.
The central directory has nothing to do with random access of a single entry. All it can do is tell you where an entry starts and how long it is (both compressed and uncompressed).
I would recommend that you reprocess the .zip file into a .zip file with many (~200) entries, each on the order of 1 GB uncompressed. The resulting .zip file will be very close to the same size, but you can then use the central directory to pick one of the 200 entries, randomly access it, and decompress just that one.
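A rough Python sketch of that repacking step, and of the later random access to a single entry, using the standard zipfile module. Names and chunk size are placeholders, and each ~1 GB chunk is held in memory for simplicity.

    import zipfile

    CHUNK = 1 << 30  # ~1 GB of uncompressed data per entry (placeholder)

    def repack(src_zip, dst_zip, member):
        """Re-store one huge member of src_zip as many ~1 GB entries."""
        with zipfile.ZipFile(src_zip) as zin, \
             zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as zout:
            with zin.open(member) as f:
                part = 0
                while True:
                    data = f.read(CHUNK)
                    if not data:
                        break
                    zout.writestr("%s.part%04d" % (member, part), data)
                    part += 1

    def read_part(dst_zip, member, part):
        """Decompress just one entry, located via the central directory."""
        with zipfile.ZipFile(dst_zip) as z:
            with z.open("%s.part%04d" % (member, part)) as f:
                return f.read()

Note that the splits land on arbitrary byte boundaries, so a CSV row can straddle two parts; the consumer has to stitch the last partial line of one part onto the start of the next.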
I have a question about the torrent file.
I know that it contains a list of servers (users) that I need to connect to in order to download parts of the whole file.
My question is: is that all the torrent contains, or are there other important details?
thanks!
A torrent file is a specially formatted binary file. It always contains a list of files and integrity metadata about all the pieces, and optionally contains a list of trackers.
A torrent file is a bencoded dictionary with the following keys:
announce - the URL of the tracker
info - this maps to a dictionary whose keys are dependent on whether one or more files are being shared:
name - suggested file/directory name where the file(s) is/are to be saved
piece length - number of bytes per piece. This is commonly 2^18 B = 256 KiB = 262,144 B.
pieces - a hash list, i.e. a concatenation of each piece's SHA-1 hash. As SHA-1 returns a 160-bit hash, pieces will be a string whose length is a multiple of 160 bits (20 bytes).
length - size of the file in bytes (only when one file is being shared)
files - a list of dictionaries each corresponding to a file (only when multiple files are being shared). Each dictionary has the following keys:
path - a list of strings corresponding to subdirectory names, the last of which is the actual file name
length - size of the file in bytes.
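As a rough illustration (separate from the key listing above), here is a minimal Python sketch that decodes just enough bencoding to print those keys from a .torrent file; the file name is a placeholder and error handling is omitted.

    def bdecode(data, i=0):
        """Minimal bencode decoder: returns (value, next_index)."""
        c = data[i:i + 1]
        if c == b"i":                          # integer: i<digits>e
            end = data.index(b"e", i)
            return int(data[i + 1:end]), end + 1
        if c == b"l":                          # list: l<items>e
            i += 1
            items = []
            while data[i:i + 1] != b"e":
                item, i = bdecode(data, i)
                items.append(item)
            return items, i + 1
        if c == b"d":                          # dictionary: d<key><value>...e
            i += 1
            d = {}
            while data[i:i + 1] != b"e":
                key, i = bdecode(data, i)
                d[key], i = bdecode(data, i)
            return d, i + 1
        colon = data.index(b":", i)            # byte string: <length>:<bytes>
        length = int(data[i:colon])
        return data[colon + 1:colon + 1 + length], colon + 1 + length

    with open("example.torrent", "rb") as f:   # placeholder file name
        meta, _ = bdecode(f.read())

    info = meta[b"info"]
    print("announce:    ", meta.get(b"announce", b"").decode())
    print("name:        ", info[b"name"].decode())
    print("piece length:", info[b"piece length"])
    print("pieces:      ", len(info[b"pieces"]) // 20, "SHA-1 hashes")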
I have a web server which saves cache files and keeps them for 7 days. The file names are md5 hashes, i.e. exactly 32 hex characters long, and are being kept in a tree structure that looks like this:
00/
  00/
    00000ae9355e59a3d8a314a5470753d8
    .
    .
  01/
    .
    .
You get the idea.
My problem is that deleting old files is taking a really long time. I have a daily cron job that runs
find cache/ -mtime +7 -type f -delete
which takes more than half a day to complete. I worry about scalability and the effect this has on the performance of the server. Additionally, the cache directory is now a black hole in my system, trapping the occasional innocent du or find.
The standard solution to LRU cache is some sort of a heap. Is there a way to scale this to the filesystem level?
Is there some other way to implement this in a way which makes it easier to manage?
Here are ideas I considered:
Create 7 top directories, one for each week day, and empty one directory every day. This increases the seek time for a cache file 7-fold, makes it really complicated when a file is overwritten, and I'm not sure what it will do to the deletion time.
Save the files as blobs in a MySQL table with indexes on name and date. This seemed promising, but in practice it's always been much slower than FS. Maybe I'm not doing it right.
Any ideas?
When you store a file, make a symbolic link to a second directory structure that is organized by date, not by name.
Retrieve your files using the "name" structure, delete them using the "date" structure.
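A minimal sketch of that two-tree scheme in Python; the directory layout and helper names are hypothetical.

    import datetime
    import os

    NAME_ROOT = "cache/by-name"   # existing hash-based tree (placeholder)
    DATE_ROOT = "cache/by-date"   # parallel tree organized by date

    def store(md5hex, data):
        # Real file lives in the name tree: cache/by-name/00/00/<hash>
        name_dir = os.path.join(NAME_ROOT, md5hex[:2], md5hex[2:4])
        os.makedirs(name_dir, exist_ok=True)
        name_path = os.path.join(name_dir, md5hex)
        with open(name_path, "wb") as f:
            f.write(data)

        # Symlink goes into today's date directory: cache/by-date/<YYYY-MM-DD>/<hash>
        date_dir = os.path.join(DATE_ROOT, datetime.date.today().isoformat())
        os.makedirs(date_dir, exist_ok=True)
        link_path = os.path.join(date_dir, md5hex)
        if not os.path.lexists(link_path):
            os.symlink(os.path.abspath(name_path), link_path)

    def purge(day):
        """Delete everything stored on `day` (a datetime.date), plus the links."""
        date_dir = os.path.join(DATE_ROOT, day.isoformat())
        if not os.path.isdir(date_dir):
            return
        for link in os.listdir(date_dir):
            link_path = os.path.join(date_dir, link)
            target = os.path.realpath(link_path)
            if os.path.isfile(target):
                os.remove(target)        # the cached file itself
            os.remove(link_path)         # the symlink
        os.rmdir(date_dir)

The nightly job then only has to call purge() for the date seven days ago, instead of statting every file in the cache.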
Assuming this is ext2/3, have you tried enabling indexed directories? When you have a large number of files in any particular directory, lookups will be painfully slow whenever you delete something.
Use tune2fs -O dir_index to enable the dir_index option.
When mounting the file system, make sure to use the noatime option, which stops the OS from updating access-time information for the directories (it still needs to modify them when their contents change).
Looking at the original post it seems as though you only have 2 levels of indirection to the files, which means that you can have a huge number of files in the leaf directories. When there are more than a million entries in these you will find that searches and changes are terribly slow. An alternative is to use a deeper hierarchy of directories, reducing the number of items in any particular directory, therefore reducing the cost of search and updates to the particular individual directory.
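For example, a hypothetical helper that splits the hash across four levels instead of two, so each directory stays small:

    import os

    def cache_path(root, md5hex, levels=4, chars_per_level=2):
        # e.g. 00000ae9... -> root/00/00/0a/e9/00000ae9355e59a3d8a314a5470753d8
        parts = [md5hex[i * chars_per_level:(i + 1) * chars_per_level]
                 for i in range(levels)]
        return os.path.join(root, *parts, md5hex)

    print(cache_path("cache", "00000ae9355e59a3d8a314a5470753d8"))
    # cache/00/00/0a/e9/00000ae9355e59a3d8a314a5470753d8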
Reiserfs is relatively efficient at handling small files. Did you try different Linux file systems? I'm not sure about delete performance - you can consider formatting (mkfs) as a substitute for individual file deletion. For example, you can create a different file system (cache1, cache2, ...) for each weekday.
How about this:
Have another folder called, say, "ToDelete"
When you add a new item, get today's date and look for a subfolder in "ToDelete" that has a name indicative of the current date
If it's not there, create it
Add a symbolic link to the item you've created in today's folder
Create a cron job that goes to the folder in "ToDelete" for the expired date and deletes all the items that are linked from it.
Delete the folder which contained all the links.
How about having a table in your database that uses the hash as the key, with another field holding the location of the file. That way the file can be stored in a date-related fashion for fast deletion, and the database can be used to find that file's location from the hash quickly.
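A sketch of that idea with SQLite standing in for the database; the table and column names are made up.

    import sqlite3

    con = sqlite3.connect("cache-index.db")
    con.execute("""CREATE TABLE IF NOT EXISTS cache_files (
                       hash       TEXT PRIMARY KEY,   -- md5 of the content
                       path       TEXT NOT NULL,      -- where the file lives on disk
                       created_at INTEGER NOT NULL    -- unix timestamp
                   )""")
    con.execute("CREATE INDEX IF NOT EXISTS idx_created ON cache_files(created_at)")

    def remember(md5hex, path, now):
        con.execute("INSERT OR REPLACE INTO cache_files VALUES (?, ?, ?)",
                    (md5hex, path, now))
        con.commit()

    def lookup(md5hex):
        row = con.execute("SELECT path FROM cache_files WHERE hash = ?",
                          (md5hex,)).fetchone()
        return row[0] if row else None

    def expired(cutoff):
        # Paths older than `cutoff`, ready to be unlinked and purged.
        return [r[0] for r in con.execute(
            "SELECT path FROM cache_files WHERE created_at < ?", (cutoff,))]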