This question already has answers here:
Working with huge files in VIM
(10 answers)
Closed 9 years ago.
I have a .txt file of ~100GB. Is there a text editor that I can use to open this? If so, how will this actually be stored in memory? I only have 16GB of RAM.
I'm also exploring other options such as splitting the file into 2 or more pieces. Any suggestions on how to do this efficiently on the command line in linux?
Thanks
Take a look at the utilities HEAD and TAIL if using the command line. Often I will use
tail -<number of lines> | more
And to split the file look at SPLIT.
I'm trying to compare two huge text files (from 100MB to 500MB each one) in order to extract the lines that differs between the files and write these differing lines on another text files.
I found on the Net the An O(ND) Difference Algorithm for C# but when implemented the result is an OutOfMemory Exception.
Could you know an exit way from this blind alley?
Thank you very much.
Antonio
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Looking for dataset to test FULLTEXT style searches on
I am recently on to a project of Data Mining, for which I need 100 GB of plain text for testing. I am tired of searching the net the whole day. Someone please help me out by providing the links, where can I download such text files?
What type of text are you searching for? Conversational, articles, books - or a good spread of everything?
Project Gutenberg might be a good start:
http://www.gutenberg.org/
Wikipedia also allows you to download an archive of articles:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
you should use http://dumps.wikimedia.org/
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
Improve this question
Are there any editors that can edit multi-gigabyte text files, perhaps by only loading small portions into memory at once? It doesn't seem like Vim can handle it =(
Ctrl-C will stop file load. If the file is small enough you may have been lucky to have loaded all the contents and just killed any post load steps. Verify that the whole file has been loaded when using this tip.
Vim can handle large files pretty well. I just edited a 3.4GB file, deleting lines, etc. Three things to keep in mind:
Press Ctrl-C: Vim tries to read in the whole file initially, to do things like syntax highlighting and number of lines in file, etc. Ctrl-C will cancel this enumeration (and the syntax highlighting), and it will only load what's needed to display on your screen.
Readonly: Vim will likely start read-only when the file is too big for it to make a . file copy to perform the edits on. I had to w! to save the file, and that's when it took the most time.
Go to line: Typing :115355 will take you directly to line 115355, which is much faster going in those large files. Vim seems to start scanning from the beginning every time it loads a buffer of lines, and holding down Ctrl-F to scan through the file seems to get really slow near the end of it.
Note - If your Vim instance is in readonly because you hit Ctrl-C, it is possible that Vim did not load the entire file into the buffer. If that happens, saving it will only save what is in the buffer, not the entire file. You might quickly check with a G to skip to the end to make sure all the lines in your file are there.
If you are on *nix (and assuming you have to modify only parts of file (and rarely)), you may split the files (using the split command), edit them individually (using awk, sed, or something similar) and concatenate them after you are done.
cat file2 file3 >> file1
It may be plugins that are causing it to choke. (syntax highlighting, folds etc.)
You can run vim without plugins.
vim -u "NONE" hugefile.log
It's minimalist but it will at least give you the vi motions you are used to.
syntax off
is another obvious one. Prune your install down and source what you need. You'll find out what it's capable of and if you need to accomplish a task via other means.
A slight improvement on the answer given by #Al pachio with the split + vim solution you can read the files in with a glob, effectively using file chunks as a buffer e.g
$ split -l 5000 myBigFile
xaa
xab
xac
...
$ vim xa*
#edit the files
:nw #skip forward and write
:n! #skip forward and don't save
:Nw #skip back and write
:N! #skip back and don't save
You might want to check out this VIM plugin which disables certain vim features in the interest of speed when loading large files.
I've tried to do that, mostly with files around 1 GB when I needed to make some small change to an SQL dump. I'm on Windows, which makes it a major pain. It's seriously difficult.
The obvious question is "why do you need to?" I can tell you from experience having to try this more than once, you probably really want to try to find another way.
So how do you do it? There are a few ways I've done it. Sometimes I can get vim or nano to open the file, and I can use them. That's a really tough pain, but it works.
When that doesn't work (as in your case) you only have a few options. You can write a little program to make the changes you need (for example, search & replaces). You could use a command line program that may be able to do it (maybe it could be accomplished with sed/awk/grep/etc?)
If those don't work, you can always split the file into chunks (something like split being the obvious choice, but you could use head/tail to get the part you want) and then edit the part(s) that need it, and recombine later.
Trust me though, try to find another way.
I think it is reasonably common for hex editors to handle huge files. On Windows, I use HxD, which claims to handle files up to 8 EB (8 billion gigabytes).
I'm using vim 7.3.3 on Win7 x64 with the LargeFile plugin by Charles Campbell to handle multi-gigabyte plain text files. It works really well.
I hope you come right.
Wow, never managed to get vim to choke, even with a GB or two. I've heard that UltraEdit (on Windows) and BBEdit (on Macs) are even more suitable for even-larger files, but I have no personal experience.
In the past I opened up to a 3 gig file with this tool http://csved.sjfrancke.nl/
Personally, I like UltraEdit. Here is their little spiel on large files.
I've used FAR Commander's built-in editor/viewer for super-large log files.
I have used TextPad for large log files it doesn't have an upper limit.
The only thing I've been able to use for something like that is my favorite Mac hex editor, 0XED. However, that was with files that I considered large at tens of megabytes. I'm not sure how far it will go. I'm pretty sure it only loads parts of the file into memory at once, though.
In the past I've successfully used a split/edit/join approach when files get very large. For this to work you have to know about where the to-be-edited text is, in the original file.
Does my question make sense? Using either Vim or Emacs, you come to understand that the interface exposes the code's representation of the state of the file you are editing in the buffer, the file is the on-disk storage you can fill a buffer from or write a buffer to. All these things a programmer would know, but when just editing text, why is it exposed? Any newer editor just tells you "Here is a file. Edit it."
Yes, I understand the technical meanings, but that isn't my question. This is a question not even about if it is a good idea to do it or not. Vim and Emacs are our two oldest editors in common use today, and they share this behavior. I know of no new editor that does the same. When did editors stop doing this and why?
For starters, Emacs uses plenty of buffers that aren't associated with any file. Any time you open a directory, read your mail, open a terminal, compile a program, launch an interactive Python session, or connect to a database, you get a buffer. Hence, Emacs's basic unit of work is a buffer and not a file, and the same logic holds for Vim.
New applications that only edit files make no distinction because every screen or window or tab directly represents a file. More capable applications like Emacs and Vim are a lot more flexible in that respect.
OK, here's my weird philosophical answer :
because late binding between the buffer in the editor and the actual concrete thing you're working on, gives the editing environment more flexibility and power.
Think this is out of date? One place where the idea is back with a vengeance is in the browser, where you don't have 1-1 correspondence between tabs and web-pages. Instead, inside each tab you can navigate forwards and backwards between multiple pages. No-one would try to make an MDI type interface to the web, where each page had it's own inner window. It would be impossibly fiddly to use. It just wouldn't scale.
Personally, I think IDEs are getting way too complicated these days, and the static binding between documents and buffers is one reason for this. I expect at some point there'll be a breakthrough as they move to the browser-like tabbed-buffer model where :
a) you'll be able to hyperlink between multiple files within the same buffer/tab (and there'll be a back-button etc.)
b) the generic buffers will be able to hold any type of data : source-code, command-line, dynamically generated graphic output, project outline etc.
In other words, much of the Vim / Emacs model, except tweaked to be more in-line with discoveries that browsers are making.
Because several buffers can show you different view of the same file. I do not know of other editors but this is true of Emacs. And what do you mean exactly with Old?
When applications started becoming used heavily by non-geeks who didn't want to trouble themselves with irrelevant detail.
I think the new editors quit doing it for the reasons you stated, that it is an abstraction that just gets in the way. Also most modern editors have unlimited undo, so the idea of the "buffer" is sort of implicit.
I guess I'm just an old fogey (in the die-hard vim camp), but the other editing packages I use, such as MS Word or Open Office, preserve the distinction between the copy of the file that I'm editing and the last saved version. That is utterly invaluable -- I don't want the editor to trample over my last good version until I'm ready for it to do so. Indeed, there's a decent chance (say one in a thousand) that I'll create a new file with the buffer I'm editing on.
On the other hand, the ability to create a file image by reading multiple files (either several copies of the same file, or copies of several different files) is also useful. There are faintly similar facilities in other documents.
So, I could be missing the point - I don't know which editors you are referring to as removing the distinction. But I think that all editors preserve the distinction between the copy of the file that you're editing and the last version saved to disk.
Because developers of those editors didn't care to hide implementation details from users.