how to merge 2 big files [closed] - linux

Suppose I have 2 files of 100 GB each, and I want to merge them into one and then delete them. In Linux
we can use
cat file1 file2 > final_file
But that requires reading two big files and then writing an even bigger file. Is it possible to just append one file to the other, so that no I/O is required? Since a file's metadata contains its location on disk and its length, I am wondering whether it is possible to change the metadata to do the merge, so that no I/O happens.

Can you merge two files without writing one file onto the other?
Only in obscure theory. Since disk storage is always based on blocks, and filesystems therefore store things on block boundaries, you could only append one file to another without rewriting if the first file ended perfectly on a block boundary. There are some rare filesystem configurations that use tail packing, but that would only help if the first file were already using the tail block of the previous file.
Unless that perfect scenario occurs, or your filesystem is able to mark a partial block in the middle of a file (I've never heard of this), this won't work. Just to kick the edge case around, there's also no way, outside of changing the kernel interface, to make such a call (re: Link to a specific inode).
Can we make this better than doubling the size of both files?
Yes, we can use the append (>>) operation instead.
cat file2 >> file1
That will still result in using all the space consumed by file2 twice over until we can delete it.
Can we avoid using extra space?
No. Unless somebody comes back with something I don't know, you're basically out of luck there. It's possible to truncate a file, forgetting about the end of it, but there is no way to forget about the start unless we go back to modifying inodes directly and altering the kernel interface to the filesystem, since that's definitely not a POSIX operation.
What about writing a little bit at a time, then deleting what we wrote?
No again. Since we can't chop the start of a file off, we'd have to rewrite everything from the point of interest all the way to the end of the file. This would be very costly for IO and only useful after we've already read half the file.
What about sparse files?
Maybe! Sparse files allow us to store a long run of zeroes without using up nearly that much space. If we were to read file2 in large chunks starting at the end, we could write those blocks to the end of file1. file1 would immediately look (and read) as if it were as large as both files combined, but it would be corrupted until we were done, because everything we hadn't yet written would read as zeroes.
Explaining all this is another answer in itself, but if you can do a sparse allocation, you would be able to use only your chunk read size plus a little extra in disk space to perform this operation. For a reference on sparse blocks in the middle of files, see http://lwn.net/Articles/357767/ or search for the term SEEK_HOLE.
Why is this "maybe" instead of "yes"? Two reasons: you'd have to write your own tool (at least we're on the right site for that), and sparse files are not universally respected by filesystems and other processes alike. Fortunately you probably won't have to worry about other processes respecting your file, but you will have to worry about setting the right flags and making sure your filesystem is amenable. Last of all, you'll still be reading and rewriting the length of file2, which isn't what you want. This method does mean you can append using only a small amount of extra disk space, though, rather than using at least twice file2's size.
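If you do want to experiment with that approach, here is a minimal sketch (not a tested tool) of the chunked, back-to-front copy described above, written in Python. It assumes a filesystem that honors sparse files; the chunk size and function name are invented for illustration, and the destination file is corrupt until the loop finishes.
import os

CHUNK = 64 * 1024 * 1024  # 64 MiB staging buffer; the size is an arbitrary choice

def sparse_append(dst_path, src_path):
    # Append src to dst while reclaiming src's space chunk by chunk.
    # dst temporarily becomes a sparse file full of holes, so it is not
    # readable as real data until the copy completes.
    dst_size = os.path.getsize(dst_path)
    src_size = os.path.getsize(src_path)
    with open(dst_path, "r+b") as dst, open(src_path, "r+b") as src:
        # Declare the final size up front; the unwritten middle is a hole
        # and costs (almost) no space on a sparse-aware filesystem.
        dst.truncate(dst_size + src_size)
        offset = src_size
        while offset > 0:
            length = min(CHUNK, offset)
            offset -= length
            src.seek(offset)
            data = src.read(length)
            dst.seek(dst_size + offset)
            dst.write(data)
            src.truncate(offset)  # give the copied tail back to the filesystem
    os.remove(src_path)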

You can do it like this:
cat file2 >> file1
file1 will then contain the full content.

No, it is not possible to merge (on Linux) two big files by working on their meta-data.
You might consider some kind of database for your work.
As Alexandre noticed, you can append one big file to another, but this still requires a lot of data copying.

Related

how to check compression type without decompressing?

I wrote code in nodejs to decompress different file types (like tar, tar.gz, etc.).
I do not have the filename available to me.
Currently I use brute force to decompress. The first one that succeeds wins.
I want to improve this by knowing the compression type beforehand.
Is there a way to do this?
Your "brute force" approach would actually work very well, since the software would determine incredibly quickly, usually within the first few bytes, that it had been handed the wrong thing. Except for the one that will work.
You can see this answer for a list of prefix bytes for common formats. You would also need to detect the tar format within a compressed format, which is not detailed there. Even if you find a matching prefix, you still need to proceed to decompress and decode to test the hypothesis, which is essentially your brute force method.
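The question's code is in Node, but the check itself is just a prefix comparison. A rough sketch of that lookup in Python (the table only covers a handful of common formats, and a plain tar needs its own check at offset 257):
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"PK\x03\x04": "zip",
    b"\x28\xb5\x2f\xfd": "zstd",
}

def sniff(path):
    with open(path, "rb") as f:
        head = f.read(8)
        for prefix, name in MAGIC.items():
            if head.startswith(prefix):
                return name
        f.seek(257)               # uncompressed tar: the "ustar" magic lives here
        if f.read(5) == b"ustar":
            return "tar"
    return "unknown"
As the answer says, a gzip or xz hit still does not tell you whether a tar stream is inside; you only learn that by decompressing.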

Something like .htaccess in Linux

I have a directory with a lot of files (over 4,000,000). All filenames follow this same pattern:
PREFIX-XXXXXX-YY.ext
where
XXXXXX contains letters and digits
YY contains digits
ext is the file extension (.txt, .jpg)
The directory's file structure takes up 12 MB, so listing/searching this directory takes a long time. I divided all the contents of this directory into subdirectories based on the filename, specifically the first letter of XXXXXX from the pattern above.
e.g.
main_directory/A/PREFIX-AXXXXX-YY.ext
main_directory/B/PREFIX-BXXXXX-YY.ext
main_directory/1/PREFIX-1XXXXX-YY.ext
Is there an easy way in Linux to make a rule so that when I type a command, for example
test:/home/usr/admin # ls main_directory/PREFIX-AXXXXX-*
I get a list of filenames from the main_directory/A/ directory? This rule MUST work only for main_directory.
You can't have this at the file-system layer, not without creating links and circling back to your original problem. I can think of two easy ways out.
Take 1: scripting
You could write a short script to rewrite the names for you.
Suppose you had a rewrite script that took PREFIX-AXXXXX-* and output main_directory/A/PREFIX-AXXXXX-*. You could then change your ls line to:
$ ls `rewrite PREFIX-AXXXXX-*`
This can be easily accomplished with sed, awk or any other on-the-fly text transformation tool.
Shell programs are composable for a reason! :)
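For instance, a tiny Python version of that hypothetical rewrite helper might look like this, assuming the PREFIX-XXXXXX-YY.ext layout and the main_directory name from the question (nothing here is an existing tool):
#!/usr/bin/env python3
# rewrite: map PREFIX-AXXXXX-* style patterns to main_directory/A/PREFIX-AXXXXX-*
import sys

MAIN = "main_directory"

for pattern in sys.argv[1:]:
    try:
        bucket = pattern.split("-", 2)[1][0]   # first character of the XXXXXX field
    except IndexError:
        print(pattern)                         # pass through anything unexpected
        continue
    print(f"{MAIN}/{bucket}/{pattern}")
Quote the pattern when calling it, e.g. ls $(rewrite 'PREFIX-AXXXXX-*'), so the shell does not expand the glob before the script sees it.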
Take 2: embed a faster file-system
You could do away with the restructuring and rewriting names by using a faster file-system, mounted in your main directory. XFS sounds good for this. It should remove your performance concerns without further ado.
This requires a deeper understanding of what's going on to be effective for day-to-day usage, however.
Edit: Here's an article on how to create virtual user-space file-systems.
Edit 2: actually no, I don't think XFS would cut it. Maybe another file-system, though.

How to edit multi-gigabyte text files? Vim doesn't work =( [closed]

Are there any editors that can edit multi-gigabyte text files, perhaps by only loading small portions into memory at once? It doesn't seem like Vim can handle it =(
Ctrl-C will stop the file load. If the file is small enough, you may have been lucky enough to have loaded all the contents and just killed any post-load steps. Verify that the whole file has been loaded when using this tip.
Vim can handle large files pretty well. I just edited a 3.4GB file, deleting lines, etc. Three things to keep in mind:
Press Ctrl-C: Vim tries to read in the whole file initially, to do things like syntax highlighting and number of lines in file, etc. Ctrl-C will cancel this enumeration (and the syntax highlighting), and it will only load what's needed to display on your screen.
Readonly: Vim will likely start in read-only mode when the file is too big for it to make a working copy to perform the edits on. I had to w! to save the file, and that's when it took the most time.
Go to line: Typing :115355 will take you directly to line 115355, which is much faster than paging through such large files. Vim seems to start scanning from the beginning every time it loads a buffer of lines, and holding down Ctrl-F to scan through the file seems to get really slow near the end of it.
Note - If your Vim instance is in read-only mode because you hit Ctrl-C, it is possible that Vim did not load the entire file into the buffer. If that happens, saving it will only save what is in the buffer, not the entire file. You might quickly check with a G to skip to the end to make sure all the lines in your file are there.
If you are on *nix (and assuming you have to modify only parts of the file, and rarely), you may split the files (using the split command), edit them individually (using awk, sed, or something similar) and concatenate them after you are done.
cat file2 file3 >> file1
It may be plugins that are causing it to choke. (syntax highlighting, folds etc.)
You can run vim without plugins.
vim -u "NONE" hugefile.log
It's minimalist but it will at least give you the vi motions you are used to.
syntax off
is another obvious one. Prune your install down and source what you need. You'll find out what it's capable of and if you need to accomplish a task via other means.
A slight improvement on the answer given by #Al pachio with the split + vim solution: you can read the files in with a glob, effectively using the file chunks as buffers, e.g.
$ split -l 5000 myBigFile
xaa
xab
xac
...
$ vim xa*
#edit the files
:wn #write and skip forward
:n! #skip forward and don't save
:wN #write and skip back
:N! #skip back and don't save
You might want to check out this VIM plugin which disables certain vim features in the interest of speed when loading large files.
I've tried to do that, mostly with files around 1 GB when I needed to make some small change to an SQL dump. I'm on Windows, which makes it a major pain. It's seriously difficult.
The obvious question is "why do you need to?" I can tell you from experience having to try this more than once, you probably really want to try to find another way.
So how do you do it? There are a few ways I've done it. Sometimes I can get vim or nano to open the file, and I can use them. That's a real pain, but it works.
When that doesn't work (as in your case) you only have a few options. You can write a little program to make the changes you need (for example, search & replaces). You could use a command-line program that may be able to do it (maybe it could be accomplished with sed/awk/grep/etc.).
If those don't work, you can always split the file into chunks (something like split being the obvious choice, but you could use head/tail to get the part you want) and then edit the part(s) that need it, and recombine later.
Trust me though, try to find another way.
I think it is reasonably common for hex editors to handle huge files. On Windows, I use HxD, which claims to handle files up to 8 EB (8 billion gigabytes).
I'm using vim 7.3.3 on Win7 x64 with the LargeFile plugin by Charles Campbell to handle multi-gigabyte plain text files. It works really well.
I hope you come right.
Wow, never managed to get vim to choke, even with a GB or two. I've heard that UltraEdit (on Windows) and BBEdit (on Macs) are better suited to even larger files, but I have no personal experience.
In the past I've opened files of up to 3 GB with this tool: http://csved.sjfrancke.nl/
Personally, I like UltraEdit. Here is their little spiel on large files.
I've used FAR Commander's built-in editor/viewer for super-large log files.
I have used TextPad for large log files; it doesn't have an upper limit.
The only thing I've been able to use for something like that is my favorite Mac hex editor, 0XED. However, that was with files that I considered large at tens of megabytes. I'm not sure how far it will go. I'm pretty sure it only loads parts of the file into memory at once, though.
In the past I've successfully used a split/edit/join approach when files get very large. For this to work you have to know roughly where the to-be-edited text is in the original file.

Combining resources into a single binary file

How does one combine several resources for an application (images, sounds, scripts, xmls, etc.) into a single/multiple binary file so that they're protected from user's hands? What are the typical steps (organizing, loading, encryption, etc...)?
This is particularly common in game development, yet a lot of the game frameworks and engines out there don't provide an easy way to do this, nor describe a general approach. I've been meaning to learn how to do it, but I don't know where to begin. Could anyone point me in the right direction?
There are lots of ways to do this. m_pGladiator has some good ideas, especially with serialization. I would like to make a few other comments.
First, if you are going to pack a bunch of resources into a single file (I call these packfiles), then I think that you should work to avoid loading the whole file and then deserializing out of that file into memory. The simple reason is that it uses more memory. That's really not a problem on PCs, I guess, but it's good practice, and it's essential when working on the console. While we don't (currently) serialize objects as m_pGladiator has suggested, we are moving towards that.
There are two types of packfiles that you might have. One would be a file where you want arbitrary access to the contents of the files. A second type might be a collection of files where you need all of those files when loading a level. A basic example might be:
An audio packfile might contain all the audio for your game. You might only need to load certain kinds of audio for the menus or interface screens and different sets of audio for the levels. This might fall into the first category above.
A type that falls into the second category might be all models/textures/etc. for a level. You basically want to load the entire contents of this file into the game at load time because you will (likely) need all of its contents while a player is playing that level or section.
Many of the packfiles that we build fall into the second category. We basically package up the level contents and then compress them with something like zlib. When we load one of these at game time, we read a small amount of the file, uncompress what we've read into a memory buffer, and then repeat until the full file has been read into memory. The buffer we read into is relatively small, while the final destination buffer is large enough to hold the largest set of uncompressed data that we need. This method is tricky, but again, it saves on RAM, it's an interesting exercise to get working, and you feel all nice and warm inside because you are being a good steward of system resources. Once the packfile has been completely uncompressed into its destination buffer, we run a final pass on the buffer to fix up pointer locations, etc. This method only works when you write out your packfile as structures that the game knows. In other words, our packfile-writing tools share structs (or classes) with the game code. We are basically writing out and compressing exact representations of data structures.
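That streaming loop is straightforward to sketch with zlib in Python; this is only an illustration of the idea above, not the studio's code, and the buffer size is a placeholder:
import zlib

READ_CHUNK = 256 * 1024  # small staging buffer

def load_packfile(path, dest):
    # Stream-decompress a zlib-compressed packfile into a preallocated
    # bytearray, so the only extra RAM used is the small staging buffer.
    decomp = zlib.decompressobj()
    written = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(READ_CHUNK)
            if not chunk:
                break
            out = decomp.decompress(chunk)
            dest[written:written + len(out)] = out
            written += len(out)
    tail = decomp.flush()
    dest[written:written + len(tail)] = tail
    return written + len(tail)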
If you simply want to cut down on the number of files that you are shipping and installing on a user's machine, you can do so with something like the first kind of packfile that I describe. Maybe you have 1000s of textures and would simply like to cut down on the sheer number of files that you have to zip up and package. You can write a small utility that will basically read the files that you want to package together and then write a header containing the files and their offsets in the packfile, and then you can write the contents of the files, one at a time, one after the other, in your large binary file. At game time, you can simply load the header of this packfile and store the filenames and offsets in a hash. When you need to read a file, you can hash the filename and see if it exists in your packfile, and if so, you can read the contents directly from the packfile by seeking to the offset and then reading from that location in the packfile. Again, this method is basically a way to pack data together without regard to encryption, etc. It's simply an organizational method.
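A toy version of that first kind of packfile might look like the sketch below. The layout (a 4-byte little-endian header length, a JSON index of name to offset and size, then the raw contents) is an assumption made up for this example, not an established format:
import json, os, struct

def write_packfile(pack_path, file_paths):
    index, blobs, offset = {}, [], 0
    for p in file_paths:
        with open(p, "rb") as fh:
            data = fh.read()
        index[os.path.basename(p)] = [offset, len(data)]
        blobs.append(data)
        offset += len(data)
    header = json.dumps(index).encode()
    with open(pack_path, "wb") as out:
        out.write(struct.pack("<I", len(header)))  # header length
        out.write(header)                          # name -> [offset, size]
        for blob in blobs:
            out.write(blob)                        # file contents, back to back

def read_from_packfile(pack_path, name):
    with open(pack_path, "rb") as f:
        (hlen,) = struct.unpack("<I", f.read(4))
        index = json.loads(f.read(hlen))
        offset, size = index[name]
        f.seek(4 + hlen + offset)                  # data starts right after the header
        return f.read(size)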
But again, I do want to stress that if you are going a route like I or m_pGladiator suggest, I would work hard not to have to pull the whole file into RAM and then deserialize to another location in RAM. That's a waste of resources (that you perhaps have plenty of). I would say that you can do this to get it working, and then once it's working, you can work on a method that only reads part of the file at a time and then decompresses to your destination buffer. You must use a compression scheme that will work like this, though. zlib and LZW both do (I believe). I'm not sure about an MD5 algorithm.
Hope that this helps.
Do as Java does: pack it all into a zip, and use a filesystem-like API to read directly from there.
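In Python, for example, the standard zipfile module already gives you that filesystem-like access (the archive and entry names below are placeholders):
import zipfile

with zipfile.ZipFile("assets.zip") as pack:
    settings = pack.read("config/settings.xml")       # whole entry as bytes
    with pack.open("textures/player.png") as img:      # file-like, read on demand
        magic = img.read(8)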
Personally, I have never used the already available tools to do that. If you want to prevent your game from being hacked easily, then you have to develop your own resource manipulation engine.
First of all, read about serializing objects. When you load a resource from a file (graphic, sound or whatever), it is stored in some object instance in memory. A game usually uses dozens of graphical and sound objects. You have to make a tool which loads them all and stores them in collections in memory. Then serialize those collections into a binary file and you have every resource there.
Then you can use an encryption algorithm, for example AES, to encrypt this file (MD5 is a hash function rather than a cipher, so it won't do the encryption itself).
Also, you can use zlib or other compression library to make this big binary file a bit smaller.
In the game, you load the binary file, decrypt it, and unpack it. Then deserialize the object collections and you have all the resources back in memory.
Of course you can make this more comprehensive by storing the resources for different levels in different binary files, and so on - there are plenty of variants, depending on what you want. You can also compress first and then encrypt, or combine the steps in other ways.
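A stripped-down sketch of that pipeline (serialize the loaded collections, compress, write) is below; it uses pickle and zlib for illustration and deliberately leaves the encryption stage as the point where you would plug in your cipher of choice:
import pickle, zlib

def pack_resources(resources, out_path):
    # resources: e.g. {"sprites/hero.png": b"...", "audio/theme.ogg": b"..."}
    blob = zlib.compress(pickle.dumps(resources))
    # encrypt `blob` here before writing, if you want that extra step
    with open(out_path, "wb") as f:
        f.write(blob)

def unpack_resources(path):
    with open(path, "rb") as f:
        blob = f.read()
    # decrypt `blob` here first if it was encrypted
    return pickle.loads(zlib.decompress(blob))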
Short answer: yes.
In Mac OS 6, 7 and 8 there was a substantial API devoted to this exact task. Look up the "Resource Manager" if you are interested. Edit: The ROOT physics analysis package does something similar.
Not that I know of a good tool right now. What platform(s) do you want it to work on?
Edited to add: All of the two or three tools of this sort that I am aware of share a similar structure:
The file starts with a header and index
There are a series of blocks, some of which may have their own headers and indices, and some of which are leaves
Each leaf is a simple serialization of the data to be stored.
The whole file (or sometimes individual blocks) may be compressed.
Not terribly hard to implement your own, but I'd look for a good existing one that meets your needs first.
For future people, like me, who are wondering about this same topic, check out the two following links:
http://www.sfml-dev.org/wiki/en/tutorials/formatdat
http://archive.gamedev.net/reference/programming/features/pak/

Will random data appended to a JPG make it unusable?

So, to simplify my life I want to be able to append from 1 to 7 additional characters on the end of some jpg images my program is processing*. These are dummy padding (fillers, etc - probably all 0x00) just to make the file size a multiple of 8 bytes for block encryption.
Having tried this out with a few programs, it appears they are fine with the additional characters, which occur after the FF D9 that specifies the end of the image - so it appears that the file format is well defined enough that the 'corruption' I'm adding at the end shouldn't matter.
I can always post process the files later if needed, but my preference is to do the simplest thing possible - which is to let them remain (I'm decrypting other file types and they won't mind, so having a special case is annoying).
I figure with all the talk of the Steganography hullabaloo years ago, someone has some input here...
(encryption processes 8-byte blocks; I don't want to save the pre-encryption file size, so I append 0x00 to the input data and leave them there after decoding)
No, you can add bytes to the end of a jpg file without making it unusable. The header of the jpg file tells how to read it, so the program reading it will stop at the end of the jpg data.
In fact, people have hidden zip files inside jpg files by appending the zip data to the end of the jpg data. Because of the way these formats are structured, the resulting file is valid in either format.
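For what it's worth, the padding step from the question plus a sanity check on the FF D9 end-of-image marker can be sketched like this (the helper name and filename are invented for illustration):
def pad_for_block_cipher(path, block=8):
    # Pad the file with 0x00 to a multiple of `block` bytes. The bytes land
    # after the FF D9 end-of-image marker, which decoders treat as the end.
    with open(path, "r+b") as f:
        data = f.read()
        if not data.rstrip(b"\x00").endswith(b"\xff\xd9"):
            raise ValueError("no JPEG end-of-image marker found")
        f.write(b"\x00" * ((-len(data)) % block))

pad_for_block_cipher("photo.jpg")  # placeholder filename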
You can, but the results may be unpredictable.
Even though there is enough information in the format to tell the client to ignore the extra data, it is likely not a case the programmer tested for.
A paranoid program might look at the size, notice the discrepancy and decide it won't process your file because clearly it doesn't fully understand it. This is particularly likely when reading data from the web when random bytes in a file could be considered a security risk.
You can embed your data in the XMP tag within a JPEG (or EXIF or IPTC fields for that matter).
XMP is XML, so you have a fair bit of flexibility there to do your own custom stuff.
It's probably not the simplest thing possible but putting your data here will maintain the integrity of the JPEG and require no "post processing".
Your data will then show up in other imaging software such as Photoshop, which may not be ideal.
As others have stated, you have no control how programs process image files and therefore some programs may find the images valid others may not.
However, there is a bigger issue here. Judging by your question, I'm deducing you're practicing "security through obscurity." It's widely considered a very bad practice. Use Google to find a plethora of articles about the topic.
