how to open a large (100GB) .txt file? [duplicate] - linux

This question already has answers here:
Working with huge files in VIM
(10 answers)
Closed 9 years ago.
I have a .txt file of ~100GB. Is there a text editor that I can use to open this? If so, how will this actually be stored in memory? I only have 16GB of RAM.
I'm also exploring other options such as splitting the file into 2 or more pieces. Any suggestions on how to do this efficiently on the command line in linux?
Thanks

Take a look at the head and tail utilities if you are using the command line. Often I will use
tail -n <number of lines> <file> | more
To split the file, look at split.
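A minimal sketch of that workflow, assuming GNU coreutils on Linux. The function names peek and split_by_lines are made up for illustration. head, tail and split all stream the file, so none of them needs the whole 100GB in RAM.

```shell
# peek FILE [N]
# Print the first and last N lines (default 10) of FILE.
peek() {
    file=$1
    n=${2:-10}
    head -n "$n" "$file"
    echo '...'
    tail -n "$n" "$file"
}

# split_by_lines FILE LINES PREFIX
# Write FILE out as pieces of at most LINES lines each,
# named PREFIXaa, PREFIXab, PREFIXac, ...
split_by_lines() {
    split -l "$2" "$1" "$3"
}
```

For example, `peek huge.txt 20` prints the first and last 20 lines, and `split_by_lines huge.txt 10000000 part_` writes part_aa, part_ab and so on. The pieces need as much free disk space as the original file.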

Related

How to list the first or last 10 lines from a file without decompressing it in linux [closed]

I have a .bz2 file. I want to list the first or last 10 lines without decompressing it, as it is too big. I tried head -10 and tail -10 but I see gibberish. I also need to compare two compressed files to check whether they are similar. How can I achieve this without decompressing the files?
EDIT: Similar means identical (have the same content).
bzip2 is a block-based compression algorithm, so in theory you could find just the particular blocks you want and decompress only those, but this would be complicated (e.g. what if the last ten lines you ultimately want to see actually span two or more compressed blocks?).
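To answer your immediate question, you can do the following. It decompresses the data as a stream (the tail case has to decompress the entire file, while in the head case bzcat is normally stopped by a broken pipe once head has printed its lines), so it is in a sense wasteful, but it never stores the decompressed file anywhere, so you don't run into storage capacity issues: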
bzcat file.bz2 | head -10
bzcat file.bz2 | tail -10
If your distribution doesn't include bzcat (which would be a bit unusual in my experience), bzcat is equivalent to bzip2 -d -c.
However, if your ultimate goal is to compare two compressed files (that may have been compressed at different levels, and so comparing the actual compressed files directly doesn't work), you can do this (assuming bash as your shell):
cmp <(bzcat file1.bz2) <(bzcat file2.bz2)
This will decompress both files and compare the uncompressed data byte-by-byte without ever storing either of the decompressed files anywhere.
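If your shell has no process substitution (plain sh, for example), comparing checksums of the decompressed streams is one alternative. This is only a sketch: the function name same_content is made up, and it assumes sha256sum from GNU coreutils is installed. Nothing decompressed is written to disk here either.

```shell
# same_content FILE1 FILE2 [DECOMPRESSOR...]
# Succeeds (exit status 0) when both files decompress to identical data.
# The decompressor defaults to "bzip2 -dc", which is what bzcat runs.
same_content() {
    f1=$1
    f2=$2
    shift 2
    [ "$#" -gt 0 ] || set -- bzip2 -dc
    # Refuse unreadable inputs, otherwise two failed reads would
    # both hash as empty streams and look "identical".
    [ -r "$f1" ] && [ -r "$f2" ] || return 2
    sum1=$("$@" "$f1" | sha256sum) || return 2
    sum2=$("$@" "$f2" | sha256sum) || return 2
    [ "$sum1" = "$sum2" ]
}
```

Usage would look like `same_content file1.bz2 file2.bz2 && echo identical`. Unlike cmp, this cannot tell you where the files differ, and it always reads both files to the end.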
The plain standard bunzip2 command can't do this. However, the man page says that bzip2 works in blocks of up to 900 KB of uncompressed data, and mentions bzip2recover, a program that writes each block of a .bz2 file out as its own small .bz2 file, which can then be decompressed individually.
Using that knowledge, you should be able to put together something that cuts off the first and last 900 KB (or so) from the desired file, and then uses bzip2recover to pull the complete blocks out of those pieces so they can be decompressed.

how to merge 2 big files [closed]

Suppose I have 2 files of 100 GB each, and I want to merge them into one and then delete the originals. In Linux we can use
cat file1 file2 > final_file
But that needs to read 2 big files and then write an even bigger one. Is it possible to just append one file to the other, so that no IO is required? Since the metadata of a file contains its location and length, I am wondering whether it is possible to change the metadata to do the merge, so that no IO happens.
Can you merge two files without writing one file onto the other?
Only in obscure theory. Since disk storage is always based on blocks and filesystems therefore store things on block boundaries, you could only append one file to another without rewriting if the first file ended perfectly on a block boundary. There are some rare filesystem configurations that use tail packing, but that would only help if the first file were already using the tail block of the previous file.
Unless that perfect scenario occurs or your filesystem is able to mark a partial block in the middle of the file (I've never heard of this), this won't work. Just to kick the edge case around, there's also no way outside of changing the kernel interface to make such a call (re: Link to a specific inode)
Can we make this better than doubling the size of both files?
Yes, we can use the append (>>) operation instead.
cat file2 >> file1
That will still result in using all the space consumed by file2 twice over until we can delete it.
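A small sketch of the append-then-delete step, assuming GNU stat and truncate are available. The function name append_then_remove is made up. The idea is to delete file2 only after the append has succeeded, and to roll file1 back to its old length if the append fails part way (a full disk, for instance).

```shell
# append_then_remove DST SRC
# Append SRC to the existing file DST, then delete SRC.
# If the append fails, DST is cut back to its original size and SRC is kept.
append_then_remove() {
    dst=$1
    src=$2
    orig_size=$(stat -c %s "$dst") || return 1
    if cat -- "$src" >> "$dst"; then
        rm -- "$src"
    else
        truncate -s "$orig_size" "$dst"
        return 1
    fi
}
```

Called as `append_then_remove file1 file2`, this still needs free space equal to the size of file2 while the copy is running.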
Can we avoid using extra space?
No. Unless somebody comes back with something I don't know, you're basically out of luck there. It's possible to truncate a file, forgetting about the existence of the end of it, but there is no way to forget about the existence of the start unless we get back to modifying inodes directly and having to alter the kernel interface to the filesystem, since that's definitely not a POSIX operation.
What about writing a little bit at a time, then deleting what we wrote?
No again. Since we can't chop the start of a file off, we'd have to rewrite everything from the point of interest all the way to the end of the file. This would be very costly for IO and only useful after we've already read half the file.
What about sparse files?
Maybe! Sparse files allow us to store a long string of zeroes without using up nearly that much space. If we were to read file2 in large chunks starting at the end, we could write those blocks to the end of file1. file1 would immediately look (and read) as if it were the same size as both, but it would be corrupted until we were done because everything we hadn't written would be full of zeroes.
Explaining all this is another answer in itself, but if you can do a sparse allocation, you would be able to use only your chunk read size + a little bit extra in disk space to perform this operation. For a reference talking about sparse blocks in the middle of files, see http://lwn.net/Articles/357767/ or do a search involving the term SEEK_HOLE.
Why is this "maybe" instead of "yes"? Two parts: you'd have to write your own tool (at least we're on the right site for that), and sparse files are not universally respected by file systems and other processes alike. Fortunately you probably won't have to worry about other processes respecting your file, but you will have to worry about setting the right flags and making sure your filesystem is amenable. Last of all, you'll still be reading and re-writing the length of file2, which isn't what you want. This method does mean you can append with just a small amount of disk space, though, rather than using at least 2*file2 amount of space.
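Here is a rough sketch of such a tool as a shell function, not a tested production utility. It assumes GNU coreutils (dd with oflag=seek_bytes, iflag=fullblock and status=none, plus stat and truncate) and a filesystem that supports holes. The name append_in_place and the 64 MiB default chunk are made up. It copies file2 chunk by chunk from the end, writes each chunk at its final offset in file1, and truncates file2 after every step, so the extra disk space in use stays at roughly one chunk.

```shell
# append_in_place DST SRC [CHUNK_BYTES]
# Move the contents of SRC onto the end of DST, working backwards from the
# end of SRC. WARNING: if this is interrupted, DST contains a hole of zeroes
# and SRC has lost its tail, so neither file is usable on its own.
append_in_place() {
    dst=$1
    src=$2
    chunk=${3:-67108864}                 # 64 MiB per step by default
    dst_size=$(stat -c %s "$dst") || return 1
    src_size=$(stat -c %s "$src") || return 1
    if [ "$src_size" -eq 0 ]; then
        rm -- "$src"
        return
    fi
    k=$(( (src_size - 1) / chunk ))      # index of the last (maybe partial) chunk
    while [ "$k" -ge 0 ]; do
        offset=$(( k * chunk ))
        # Copy chunk k of SRC to byte position dst_size + offset in DST.
        # conv=notrunc keeps the chunks already written further along.
        dd if="$src" of="$dst" bs="$chunk" skip="$k" count=1 \
           seek=$(( dst_size + offset )) oflag=seek_bytes \
           iflag=fullblock conv=notrunc status=none || return 1
        # Give the space of the copied chunk back to the filesystem.
        truncate -s "$offset" "$src" || return 1
        k=$(( k - 1 ))
    done
    rm -- "$src"
}
```

`append_in_place file1 file2` would then leave a single file1 holding both contents. All of file2 is still read and written once, as noted above. Only the peak disk usage improves.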
You can do it like this:
cat file2 >> file1
file1 will then hold the full content.
No, it is not possible to merge (on Linux) two big files by working on their meta-data.
You might consider some kind of database for your work.
As Alexandre noticed, you can append one big file to another, but this still requires a lot of data copying.

Where can I find a huge amount of text files? [duplicate]

Possible Duplicate:
Looking for dataset to test FULLTEXT style searches on
I recently started a data mining project, for which I need 100 GB of plain text for testing. I have spent the whole day searching the net without luck. Could someone please help me out with links where I can download such text files?
What type of text are you searching for? Conversational, articles, books - or a good spread of everything?
Project Gutenberg might be a good start:
http://www.gutenberg.org/
Wikipedia also allows you to download an archive of articles:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
You should use http://dumps.wikimedia.org/

Text Editor for gigabyte sized files [duplicate]

Possible Duplicate:
Text editor to open big (giant, huge, large) text files
I saw text editor to open big text files but that question referred to megabyte sized files. I work with 7GB csv files and find that even vim and gedit take a long time to open up.
What text editor do you use for gigabyte sized files?
Appreciate any advice I can get.
I don't know about others, but I use Vim (on Windows) for editing GB-sized files and it works every time. http://vim.sourceforge.net/
You can use Total Commander.

How to edit multi-gigabyte text files? Vim doesn't work =( [closed]

Are there any editors that can edit multi-gigabyte text files, perhaps by only loading small portions into memory at once? It doesn't seem like Vim can handle it =(
Ctrl-C will stop file load. If the file is small enough you may have been lucky to have loaded all the contents and just killed any post load steps. Verify that the whole file has been loaded when using this tip.
Vim can handle large files pretty well. I just edited a 3.4GB file, deleting lines, etc. Three things to keep in mind:
Press Ctrl-C: Vim tries to read in the whole file initially, to do things like syntax highlighting and number of lines in file, etc. Ctrl-C will cancel this enumeration (and the syntax highlighting), and it will only load what's needed to display on your screen.
Readonly: Vim will likely start read-only when the file is too big for it to make a hidden dot-file copy to perform the edits on. I had to use :w! to save the file, and that's when it took the most time.
Go to line: Typing :115355 will take you directly to line 115355, which is a much faster way to move around in those large files. Vim seems to start scanning from the beginning every time it loads a buffer of lines, and holding down Ctrl-F to scan through the file seems to get really slow near the end of it.
Note - If your Vim instance is in readonly because you hit Ctrl-C, it is possible that Vim did not load the entire file into the buffer. If that happens, saving it will only save what is in the buffer, not the entire file. You might quickly check with a G to skip to the end to make sure all the lines in your file are there.
If you are on *nix, and assuming you only have to modify parts of the file (and rarely), you may split the file (using the split command), edit the pieces individually (using awk, sed, or something similar) and concatenate them after you are done.
cat file2 file3 >> file1
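The whole split, edit, concatenate cycle could be sketched like this. The function name edit_one_chunk is made up, and the chunk name (xab here) is whatever split happened to produce for the part you care about. It assumes you have enough free disk space next to the file for a full second copy.

```shell
# edit_one_chunk FILE LINES CHUNK SED_SCRIPT
# Split FILE into pieces of LINES lines inside a scratch directory created
# next to it, run SED_SCRIPT on the piece named CHUNK (for example "xab"),
# then rebuild FILE from the pieces. On an early failure the scratch
# directory is left behind for inspection.
edit_one_chunk() {
    file=$1
    lines=$2
    chunk=$3
    script=$4
    workdir=$(mktemp -d "$file.chunks.XXXXXX") || return 1
    split -l "$lines" "$file" "$workdir/x" || return 1
    sed -e "$script" "$workdir/$chunk" > "$workdir/$chunk.new" || return 1
    mv -- "$workdir/$chunk.new" "$workdir/$chunk" || return 1
    # The shell expands x* in sorted order, which matches split's naming.
    cat "$workdir"/x* > "$file.new" && mv -- "$file.new" "$file"
    status=$?
    rm -r -- "$workdir"
    return "$status"
}
```

For instance, `edit_one_chunk hugefile.log 5000 xab 's/foo/bar/'` would only touch lines 5001 to 10000. Instead of sed you could just as well open the single piece in an editor before concatenating.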
It may be plugins that are causing it to choke. (syntax highlighting, folds etc.)
You can run vim without plugins.
vim -u "NONE" hugefile.log
It's minimalist but it will at least give you the vi motions you are used to.
:syntax off
is another obvious one. Prune your install down and source what you need. You'll find out what it's capable of and if you need to accomplish a task via other means.
A slight improvement on the answer given by @Al pachio: with the split + vim solution you can read the files in with a glob, effectively using the file chunks as a buffer, e.g.
$ split -l 5000 myBigFile
xaa
xab
xac
...
$ vim xa*
#edit the files
:wn #write and skip forward
:n! #skip forward and don't save
:wN #write and skip back
:N! #skip back and don't save
You might want to check out this VIM plugin which disables certain vim features in the interest of speed when loading large files.
I've tried to do that, mostly with files around 1 GB when I needed to make some small change to an SQL dump. I'm on Windows, which makes it a major pain. It's seriously difficult.
The obvious question is "why do you need to?" I can tell you from experience having to try this more than once, you probably really want to try to find another way.
So how do you do it? There are a few ways I've done it. Sometimes I can get vim or nano to open the file, and I can use them. That's a really tough pain, but it works.
When that doesn't work (as in your case) you only have a few options. You can write a little program to make the changes you need (for example, search & replaces). You could use a command line program that may be able to do it (maybe it could be accomplished with sed/awk/grep/etc?)
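For the search & replace case, a streaming sed run is one way to do it. This is a sketch with a made-up function name (stream_edit). sed reads the file line by line, so memory use stays small, but you do need free disk space for a second copy of the file while it runs.

```shell
# stream_edit FILE SED_SCRIPT
# Apply SED_SCRIPT to FILE by streaming it into a temporary copy and
# swapping the copy into place only if everything succeeded.
stream_edit() {
    sed -e "$2" "$1" > "$1.tmp" && mv -- "$1.tmp" "$1" || {
        rm -f -- "$1.tmp"
        return 1
    }
}
```

Something like `stream_edit dump.sql 's/old_name/new_name/g'` would cover the small SQL dump change described above. The names in that command are placeholders.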
If those don't work, you can always split the file into chunks (something like split being the obvious choice, but you could use head/tail to get the part you want) and then edit the part(s) that need it, and recombine later.
Trust me though, try to find another way.
I think it is reasonably common for hex editors to handle huge files. On Windows, I use HxD, which claims to handle files up to 8 EB (8 billion gigabytes).
I'm using vim 7.3.3 on Win7 x64 with the LargeFile plugin by Charles Campbell to handle multi-gigabyte plain text files. It works really well.
I hope you come right.
Wow, never managed to get vim to choke, even with a GB or two. I've heard that UltraEdit (on Windows) and BBEdit (on Macs) are even more suitable for even-larger files, but I have no personal experience.
In the past I have opened files of up to 3 GB with this tool: http://csved.sjfrancke.nl/
Personally, I like UltraEdit. Here is their little spiel on large files.
I've used FAR Commander's built-in editor/viewer for super-large log files.
I have used TextPad for large log files; it doesn't have an upper limit.
The only thing I've been able to use for something like that is my favorite Mac hex editor, 0XED. However, that was with files that I considered large at tens of megabytes. I'm not sure how far it will go. I'm pretty sure it only loads parts of the file into memory at once, though.
In the past I've successfully used a split/edit/join approach when files get very large. For this to work you have to know about where the to-be-edited text is, in the original file.
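A sketch of how that could look with line numbers, assuming GNU head and tail on Linux. The function names show_lines and splice_lines are made up. The idea is to find the line range first, pull only that range out, edit it as a small file, and then stitch it back between the untouched head and tail of the original.

```shell
# show_lines FILE START END
# Print lines START..END of FILE and stop reading right after END.
show_lines() {
    sed -n "${2},${3}p;${3}q" "$1"
}

# splice_lines FILE START END REPLACEMENT
# Rebuild FILE with lines START..END replaced by the contents of the
# file REPLACEMENT. Needs room for a second copy of FILE.
splice_lines() {
    file=$1
    start=$2
    end=$3
    repl=$4
    {
        head -n "$(( start - 1 ))" "$file"   # everything before the range
        cat -- "$repl"                       # the edited text
        tail -n "+$(( end + 1 ))" "$file"    # everything after the range
    } > "$file.new" && mv -- "$file.new" "$file"
}
```

A typical session might be: `grep -n 'some pattern' big.log | head` to find the line numbers, `show_lines big.log 115350 115360 > part.txt`, edit part.txt in any editor, then `splice_lines big.log 115350 115360 part.txt`.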

Resources