Changing the head of a large Fortran binary file without dealing with the whole body - io

I have a large binary file (~ GB size) generated from a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the remainder of the file without knowing its detailed structure? Or, even better, can I avoid re-writing the whole file altogether and just make changes to the original file in place? (Not sure if it matters, but the length of the header will change.)

Since you are changing the length of the header, I think you have to write a new, revised file. You could avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four-byte words, if the file size is a multiple of four bytes) until you reach EOF, copying them to the new file. But if the file was originally created with sequential access and you want to access it that way in the future, you will have to handle the record-length information for the header record(s), including altering the value(s) to be consistent with the changed length of the record(s). This record-length information is typically a four-byte integer at the beginning and end of each record, but the details depend on the compiler.
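
To make the copy concrete, here is a rough Node.js/TypeScript sketch of rewriting the header record and copying the body byte-for-byte. It assumes a gfortran-style unformatted sequential file (little-endian four-byte record markers), that the header is exactly the first record, and that the file names old.dat / new.dat and the header payload are placeholders:

    import * as fs from "node:fs";

    // Read the leading record marker of the old header to learn where the body starts.
    const oldFd = fs.openSync("old.dat", "r");
    const marker = Buffer.alloc(4);
    fs.readSync(oldFd, marker, 0, 4, 0);
    const oldHeaderLen = marker.readUInt32LE(0);       // payload length of the old header record
    const bodyStart = 4 + oldHeaderLen + 4;            // leading marker + payload + trailing marker
    fs.closeSync(oldFd);

    // Write the new header as a properly framed record, then append the untouched body.
    const newHeader = Buffer.from("replacement header payload");   // placeholder contents
    const lenBuf = Buffer.alloc(4);
    lenBuf.writeUInt32LE(newHeader.length, 0);
    const out = fs.createWriteStream("new.dat");
    out.write(lenBuf);        // leading record marker for the new header
    out.write(newHeader);
    out.write(lenBuf);        // trailing record marker
    fs.createReadStream("old.dat", { start: bodyStart }).pipe(out);  // body records keep their own markers

Because the body is copied verbatim, its record markers remain self-consistent; only the header record needs fresh markers.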

Related

Read a huge log file line by line and random access in node?

I'm working on a log file reader which will parse files and display fields from lines in a nicely formatted table in a node/electron app.
If the files were small, I could just read them line by line, parse them, store the fields extracted from each line in a data structure, and allow clients to scroll back and forth through the whole file.
Since these files can be several gigabytes in size, I need to do something more complex.
My current thinking is to either:
Read the whole file via the readline package.
Keep track of line ending offsets
Once the file is read, parse the bottom (hence most recent) 50 or so lines so I can extract the relevant data and display it
If client wants to scroll beyond my 50 lines, use the offsets to go to previous lines (via fs.read(..)).
Another method:
Use fs.read() to go straight to the end
Work backwards until newline characters are found
If client wants to scroll around the file, figure out the line offsets on demand
This doesn't even take into account building tail -f style functionality.
I'll have to account for at least ASCII and UTF-8 encodings, as well as Windows- vs. Linux-style line endings.
This is a LOT of low level work.
Is there a library which already provides functionality?
If I do this myself, any major caveats I haven't mentioned here? I haven't done low-level, random-access programming for 20 years.
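
For what it's worth, the second approach (seek to the end and walk backwards) is only a few dozen lines. A rough TypeScript sketch, assuming ASCII/UTF-8 input and treating the file name, chunk size and line count as placeholders:

    import * as fs from "node:fs";

    function readLastLines(path: string, wanted: number, chunkSize = 64 * 1024): string[] {
      const fd = fs.openSync(path, "r");
      const size = fs.fstatSync(fd).size;
      let pos = size;
      let tail = Buffer.alloc(0);
      let lines: string[] = [];

      // Pull fixed-size chunks from the end until enough line breaks have shown up.
      while (pos > 0 && lines.length <= wanted) {
        const readSize = Math.min(chunkSize, pos);
        pos -= readSize;
        const chunk = Buffer.alloc(readSize);
        fs.readSync(fd, chunk, 0, readSize, pos);
        tail = Buffer.concat([chunk, tail]);
        lines = tail.toString("utf8").split(/\r?\n/);   // handles both \n and \r\n endings
      }
      fs.closeSync(fd);
      return lines.slice(-wanted);
    }

    console.log(readLastLines("app.log", 50));

Note that the first chunk read can begin mid-way through a multi-byte UTF-8 character, so a real implementation would trim to a character boundary before decoding.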

ZIP file format. How to read file properly?

I'm currently working on a Node.js project. I want to be able to read, modify and write a ZIP file without saving it to the FS (we receive it over TCP and send it back after modifications are made), and so far this looks possible because of the simple ZIP file structure. Currently I refer to this documentation.
So a ZIP file has a simple structure:
File header 1
File data 1
File data descriptor 1
File header 2
File data 2
File data descriptor 2
...
[other not important yet]
First we need to read file header 1, which contains the field compressed size, and that looks like the perfect way to read file data 1 by its length. But it's actually not. This field may contain 0 or 0xFFFFFFFF, and those values don't describe the data's actual length. In that case we have to read the file data without knowing its length. But how?
The compression/decompression algorithm descriptions look pretty complex to me, and I plan to use zlib for the compression itself anyway. So if something useful is described there, then I missed it.
Can someone explain the proper way to read those files?
P.S. Please avoid suggesting npm modules. I want not only to solve the problem but also to understand how things work.
Note - I'm assuming you want to read and process the zip file as it comes off the socket, rather than reading the complete zip file into memory before processing. Both options are valid.
I'd initially ignore the use cases where the compressed size has a value of 0 or 0xFFFFFFFF. The former is only present in zip files created in streaming mode, the latter in zip files larger than 4 GiB.
Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support the 0/0xFFFFFFFF use cases depends on the nature of the zip files you intend to process.
When the compression method is deflate (8), use zlib for compression/decompression. You also need to support the stored (0) compression method; it gets used for very small files where compression isn't appropriate.
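
As a rough illustration of the happy path only, here is a TypeScript sketch that reads the first entry out of a Buffer, assuming the compressed size is neither 0 nor 0xFFFFFFFF; the field offsets follow the local file header layout in APPNOTE.TXT, and the function name is made up for the example:

    import * as zlib from "node:zlib";

    function readFirstEntry(zip: Buffer): { name: string; data: Buffer } {
      if (zip.readUInt32LE(0) !== 0x04034b50) throw new Error("not a local file header");
      const method = zip.readUInt16LE(8);             // 0 = stored, 8 = deflate
      const compressedSize = zip.readUInt32LE(18);    // assumed to be the real length here
      const nameLen = zip.readUInt16LE(26);
      const extraLen = zip.readUInt16LE(28);
      const name = zip.toString("utf8", 30, 30 + nameLen);
      const dataStart = 30 + nameLen + extraLen;
      const raw = zip.subarray(dataStart, dataStart + compressedSize);
      const data = method === 8 ? zlib.inflateRawSync(raw) : Buffer.from(raw);  // stored entries are copied as-is
      return { name, data };
    }

The entry data is raw deflate with no zlib wrapper, which is why the sketch uses inflateRawSync rather than inflateSync.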

Reading a variable length record mainframe file

I have a mainframe data file in binary format, with variable-length records. No copybook works in this case, nor do I know where each record ends. How do I read such a file?
Assuming you're reading this file in a COBOL program running on the mainframe, this is really no problem. COBOL doesn't write null-delimited output. It writes variable-length records with the length embedded in the first two bytes of a 4-byte prefix area called a (R)ecord (D)escriptor (W)ord, which is NOT included in the record-layout copybook. To read such a record back into another COBOL program, you just need a properly coded copybook.
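
If you instead end up processing the file off-host, the same framing is easy to walk. A minimal TypeScript sketch, assuming the dataset was transferred in binary with the RDWs preserved, that the two-byte length is big-endian and includes the 4-byte RDW itself, and that mainframe.dat is a placeholder name (the record payload is still EBCDIC and still needs the layout to interpret):

    import * as fs from "node:fs";

    const buf = fs.readFileSync("mainframe.dat");
    let pos = 0;
    while (pos + 4 <= buf.length) {
      const recLen = buf.readUInt16BE(pos);          // record length, including the RDW
      if (recLen < 4) throw new Error("bad RDW");    // guard against a corrupt prefix
      const record = buf.subarray(pos + 4, pos + recLen);
      console.log(`record of ${record.length} bytes`);
      pos += recLen;                                 // the next RDW follows immediately
    }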

How to find the position of Central Directory in a Zip file?

I am trying to find the position of the first Central Directory file header in a Zip file.
I'm reading these:
http://en.wikipedia.org/wiki/Zip_(file_format)
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
As I see it, I can only scan through the Zip data, identify each section by its header, and continue until I hit the Central Directory header. I would obviously read the File Headers before that and use the "compressed size" field to skip the actual data, rather than for-looping through every byte in the file...
If I do it like that, then I practically already know all the files and folders inside the Zip file, in which case I don't see much use for the Central Directory anymore.
To my understanding, the purpose of the Central Directory is to list file metadata and the position of the actual data in the Zip file, so you wouldn't need to scan the whole file?
After reading about the End of Central Directory record, Wikipedia says:
This ordering allows a zip file to be created in one pass, but it is usually decompressed by first reading the central directory at the end.
How would I find the End of Central Directory record easily? We need to remember that it can have an arbitrarily sized comment, so I may not know how many bytes from the end of the data stream it is located. Do I just scan for it?
P.S. I'm writing a Zip file reader.
Start at the end and scan towards the beginning, looking for the end-of-central-directory signature and counting the number of bytes you have scanned. When you find a candidate, read the two-byte comment length (L) at offset 20 of the record. Check whether L + 22 (the 22 fixed bytes of the record plus the comment) matches your current count, i.e. whether the comment would run exactly to the end of the file. Then check that the start of the central directory (pointed to by the four-byte value at offset 16 of the record) has an appropriate signature.
If you assume the bytes are fairly random when the signature check happens on a wild guess (e.g. a guess that lands inside a data segment), the probability of getting all the signature bits correct is pretty low. You could refine this and work out the chance of landing in a data segment and the chance of hitting a legitimate header (as a function of the number of such headers), but it already sounds like a low likelihood to me. You can increase your confidence by then checking the signature of the first file record listed, but be sure to handle the boundary case of an empty zip file.
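
A compact TypeScript sketch of that scan over an in-memory Buffer, with the signatures and field offsets taken from APPNOTE.TXT and the function name made up for the example:

    function findEndOfCentralDirectory(zip: Buffer): number {
      // The record is 22 fixed bytes plus a comment of at most 65535 bytes.
      const earliest = Math.max(0, zip.length - 22 - 0xffff);
      for (let pos = zip.length - 22; pos >= earliest; pos--) {
        if (zip.readUInt32LE(pos) !== 0x06054b50) continue;          // candidate signature
        const commentLen = zip.readUInt16LE(pos + 20);
        if (pos + 22 + commentLen !== zip.length) continue;           // comment must run exactly to EOF
        const totalEntries = zip.readUInt16LE(pos + 10);
        if (totalEntries === 0) return pos;                           // empty archive: no directory to verify
        const cdOffset = zip.readUInt32LE(pos + 16);                  // start of the central directory
        if (cdOffset + 4 > pos) continue;                             // offset must land before the record
        if (zip.readUInt32LE(cdOffset) !== 0x02014b50) continue;      // central directory file header signature
        return pos;
      }
      throw new Error("end of central directory record not found");
    }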
I ended up looping through the bytes starting from the end. The loop stops if it finds a matching byte sequence, if the index drops below zero, or if it has already gone through 64K bytes.
Just cross your fingers and hope that there isn't an entry whose CRC, timestamp, datestamp, or any other run of four bytes happens to be 06054B50.

potential dangers of merging two files of unknown size?

I have a binary file that I need to insert a header at the beginning of. I was thinking of opening a new file, writing the header data, and then copying the data from the binary file into this new file. Since the binary file is about 1 megabyte, are there any dangers in building this file using fwrite? One specific concern would be something like unintentionally overwriting data, similar to what happens when using gets and the input is longer than the buffer.
There's no risk. Allocate a buffer of a given size, read that many bytes into it from the source file, and write the buffer back out to the destination file. The operations (file read / file write) all take a maximum number of bytes, so as long as your buffer is the size you claim it is, it won't be overrun.
Also, the approach you describe is pretty much the only way to do it. I've never heard of a filesystem that has an "insert these bytes at the beginning of this file" operation.
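
The question is about C's fwrite, but the shape of the loop is the same in any language. A rough sketch in Node.js/TypeScript terms, with placeholder file names and a made-up header, just to show that every read and write is bounded by the buffer size:

    import * as fs from "node:fs";

    const src = fs.openSync("data.bin", "r");
    const dst = fs.openSync("with-header.bin", "w");
    fs.writeSync(dst, Buffer.from("HDR v1\n"));        // the new header goes out first
    const buf = Buffer.alloc(64 * 1024);               // fixed-size buffer; nothing can overrun it
    let n: number;
    while ((n = fs.readSync(src, buf, 0, buf.length, null)) > 0) {
      fs.writeSync(dst, buf, 0, n);                    // write only the bytes actually read
    }
    fs.closeSync(src);
    fs.closeSync(dst);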
