Edit a very large SQL dump/text file (on Linux)

I have to import a large MySQL dump (up to 10G). However, the dump comes with the database structure predefined, including index definitions. I want to speed up the inserts by removing the index and table definitions.
That means I have to remove/edit the first few lines of a 10G text file. What is the most efficient way to do this on Linux?
Programs that require loading the entire file into RAM would be overkill for me.

Rather than removing the first few lines, try editing them into whitespace.
The hexedit program can do this: it reads files in chunks, so to it, opening a 10GB file is no different from opening a 100KB one.
$ hexedit largefile.sql.dump
tab (switch to ASCII side)
space (repeat as needed until your header is gone)
F2 (save)/Ctrl-X (save and exit)/Ctrl-C (exit without saving)
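The same blank-out-the-header idea can be scripted with dd, so nothing after the header moves and only the header bytes are rewritten. A small sketch (the sample dump and the 26-byte header length are made up for illustration):

```shell
# Create a tiny stand-in dump: a header line to neutralize, then data.
printf 'CREATE INDEX idx ON t (a);\nINSERT INTO t VALUES (1);\n' > dump.sql

header_len=26   # bytes of the header line, newline excluded

# Overwrite exactly header_len bytes at offset 0 with spaces;
# conv=notrunc leaves everything after them untouched on disk.
head -c "$header_len" /dev/zero | tr '\0' ' ' |
  dd of=dump.sql conv=notrunc bs=1 count="$header_len" 2>/dev/null

head -n 1 dump.sql   # the first line is now all spaces
```

Because only `header_len` bytes are rewritten, this takes the same time on a 10G dump as on a 10K one.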

joe is an editor that works well with large files. I just used it to edit a ~5G SQL dump file. It took about a minute to open the file and a few minutes to save it, with very little use of swap (on a system with 4G RAM).

sed 's/OLD_TEXT/NEW_TEXT/g' < oldfile > newfile
or
cat file | sed 's/OLD_TEXT/NEW_TEXT/g' > newfile
Both stream the input line by line, so memory use stays small, but they do rewrite the entire file to newfile.

Perl can read the file line by line:
perl -pi.bak -e 's/^create index/--create index/' largefile.sql
(the filename argument here is an example; -i edits "in place" but still rewrites the whole file, keeping the original as largefile.sql.bak)

Related

Insert text in .txt file using cmd

So, I want to insert text in a .txt file, but when I try
type file1.txt >> file2.txt
and sort it using Cygwin with sort file1 | uniq >> sorted, it will place it at the end of the file. But I want to write it to the start of the file. I don't know if this is possible in cmd, and if it's not, I can also do it in a Linux terminal.
Is there a special flag or operator I need to use?
Thanks in advance, Davin
Edit: the file itself (the file I'm writing to) is about 5GB big, so I would have to write 5GB to a file every time I wanted to change anything.
It is not possible to write to the start of a file. You can only replace the file's content entirely or append to its end. So if you need to put the sorted output in front of the existing sorted file, you have to do it like this:
mv sorted sorted.old
sort file1 | uniq > sorted
cat sorted.old >> sorted
rm sorted.old
This is not a limitation of the shell but of the file APIs of pretty much every existing operating system. The size of a file can only be changed at the end: you can grow it (all content stays as it is, but there is now empty space after it), or you can truncate it (content is cut off at the end). You can copy data around within a file, but there is no system function to do it for you; you have to do it yourself, and that is almost as inefficient as the solution shown above.
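The grow/truncate behaviour described above can be seen directly with the truncate utility; a minimal sketch:

```shell
printf 'hello' > f.txt
truncate -s 10 f.txt   # grow at the end: "hello" stays, zero bytes follow
truncate -s 2  f.txt   # shrink at the end: content is cut off, "he" remains
cat f.txt              # prints "he"
```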

Adding content to the middle of a file... without reading it to the end

I have read various questions/answers here on unix.stackexchange.com about how to add or remove lines to/from a file without creating a temporary file.
https://unix.stackexchange.com/questions/11067/is-there-a-way-to-modify-a-file-in-place?lq=1
It appears all these answers require reading at least to the end of the file, which can be time-consuming for a large input. Is there a way around this? I would expect a file system to be implemented like a linked list, so there should be a way to reach the required "lines" and then just add stuff (a node, in linked-list terms). How do I go about doing this?
Am I correct in thinking so? Or am I missing something?
PS: I need to do this in C and cannot use any shell commands.
As of Linux 4.1, fallocate(2) supports the FALLOC_FL_INSERT_RANGE flag, which allows one to insert a hole of a given length in the middle of a file without rewriting the following data. However, it is rather limited: the hole must be inserted at a filesystem block boundary, and the size of the inserted hole must be a multiple of the filesystem block size. Additionally, in 4.1, this feature was only supported by the XFS filesystem, with Ext4 support added in 4.2.
For all other cases, it is still necessary to rewrite the rest of the file, as indicated in other answers.
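From the shell, the same flag is exposed by the util-linux fallocate(1) tool as --insert-range. A hedged sketch (the 4096-byte block size is an assumption, and the call fails cleanly on filesystems that don't support the flag):

```shell
f=big.dump
blk=4096                                                 # assumed filesystem block size
dd if=/dev/zero of="$f" bs="$blk" count=4 2>/dev/null    # 16 KiB stand-in file

# Try to shift everything from offset $blk onward down by one block,
# creating a hole there without rewriting the tail of the file.
if fallocate --insert-range --offset "$blk" --length "$blk" "$f" 2>/dev/null; then
    echo "hole inserted; new size: $(wc -c < "$f") bytes"
else
    echo "insert-range not supported here; the tail must be rewritten instead"
fi
```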
The short answer is that yes, it is possible to modify the contents of a file in place, but no, it is not possible to remove or add content in the middle of the file.
UNIX filesystems are implemented using an inode pointer structure which points to whole blocks of data. Each line of a text file does not "know" about its relationship to the previous or next line, they are simply adjacent to each other within the block. To add content between those two lines would require all of the following content to be shifted further "down" within the block, pushing some data into the next block, which in turn would have to shift into the following block, etc.
In C you can fopen a file for update, read its contents, and overwrite some of them, but I do not believe there is (even theoretically) any way to insert new data in the middle, or to delete data (except by overwriting it with nulls).
You can modify a file in place, for example using dd.
$ echo Hello world, how are you today? > helloworld.txt
$ cat helloworld.txt
Hello world, how are you today?
$ echo -n Earth | dd of=helloworld.txt conv=notrunc bs=1 seek=6
$ cat helloworld.txt
Hello Earth, how are you today?
The problem is that if your change also changes the length, it will not quite work correctly:
$ echo -n Joe | dd of=helloworld.txt conv=notrunc bs=1 seek=6
Hello Joeth, how are you today?
$ echo -n Santa Claus | dd of=helloworld.txt conv=notrunc bs=1 seek=6
Hello Santa Clausare you today?
When you change the length, you have to rewrite the file, if not completely then at least from the point of the change onward.
In C this is the same as with dd. You open the file, you seek, and you write.
You can open a file in read/write mode, read it (or seek to the position you want, if you know it), and then write into the file; but you overwrite the data that is there (this is not an insertion).
Then you can choose either to truncate the file from the last point you wrote, or to keep all the remaining data after that point without ever reading it.
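Putting those two steps together in shell terms (seek-and-overwrite with dd, then cut off the stale tail), continuing the earlier example:

```shell
printf 'Hello world, how are you today?\n' > hello.txt
# Overwrite 4 bytes at offset 6: "worl" becomes "Joe!"
printf 'Joe!' | dd of=hello.txt conv=notrunc bs=1 seek=6 2>/dev/null
# The replacement is shorter than the original text, so drop everything
# after it instead of rewriting the rest of the line.
truncate -s 10 hello.txt
cat hello.txt   # Hello Joe!
```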

How to edit 300 GB text file (genomics data)?

I have a 300 GB text file that contains genomics data with over 250k records. There are some records with bad data, and our genomics program 'Popoolation' allows us to comment out the "bad" records with an asterisk. Our problem is that we cannot find a text editor that will load the data so that we can comment out the bad records. Any suggestions? We have both Windows and Linux boxes.
UPDATE: More information
The program Popoolation (https://code.google.com/p/popoolation/) crashes when it reaches a "bad" record, giving us the line number, which we can then comment out. Specifically, we get a message from Perl that says "F#€%& Scaffolding". The manual suggests we can just use an asterisk to comment out the bad line. Sadly, we will have to repeat this process many times...
One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once? This could be very useful, given that we will have to repeat the process an unknown number of times.
Based on your update:
One more thought... Is there an approach that would allow us to add
the asterisk to the line without opening the entire text file at once.
This could be very useful given that we will have to repeat the
process an unknown number of times.
Here you have an approach: If you know the line number, you can add an asterisk in the beginning of that line saying:
sed 'LINE_NUMBER s/^/*/' file
See an example:
$ cat file
aa
bb
cc
dd
ee
$ sed '3 s/^/*/' file
aa
bb
*cc
dd
ee
If you add -i, the file will be updated in place:
$ sed -i '3 s/^/*/' file
$ cat file
aa
bb
*cc
dd
ee
That said, I always think it is better to redirect to another file:
sed '3 s/^/*/' file > new_file
so that you keep your original file intact and save the updated version in new_file.
If you are required to have a person mark these records manually with a text editor, for whatever reason, you should probably use split to split the file up into manageable pieces.
split -a4 -d -l100000 hugefile.txt part.
This will split the file up into pieces with 100000 lines each. The names of the files will be part.0000, part.0001, etc. Then, after all the files have been edited, you can combine them back together with cat:
cat part.* > new_hugefile.txt
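A quick round trip on a small stand-in file shows that split and cat reassemble the data byte for byte (sizes shrunk here for illustration):

```shell
seq 1 250 > hugefile.txt
split -a4 -d -l100 hugefile.txt part.   # creates part.0000, part.0001, part.0002
cat part.* > new_hugefile.txt           # the glob sorts the numeric suffixes in order
cmp hugefile.txt new_hugefile.txt && echo "round trip OK"
```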
The simplest solution is to use a stream-oriented editor such as sed. All you need is to be able to write one or more regular expression(s) that will identify all (and only) the bad records. Since you haven't provided any details on how to identify the bad records, this is the only possible answer.
A basic pattern in R is to read the data in chunks, edit, and write out
fin = file("fin.txt", "r")
fout = file("fout.txt", "w")
while (length(txt <- readLines(fin, n=1000000))) {
## txt now holds up to 1000000 lines; add an asterisk to problem lines
## bad = <create logical vector indicating bad lines here>
## txt[bad] = paste0("*", txt[bad])
writeLines(txt, fout)
}
close(fin); close(fout)
While not ideal, this works on Windows (implied by the mention of Notepad++) and in a language you are presumably familiar with (R). Using sed (definitely the appropriate tool in the long run) would require installing additional software and coming up to speed with sed.

Empty a file while in use in linux

I'm trying to empty a file on Linux while it is in use; it's a log file, so it is continuously written to.
Right now I've used:
echo -n > filename
or
cat /dev/null > filename
but both of these produce an empty file with a newline character (or strange characters that I see as ^#^#^#^#^#^#^#^#^#^#^#^.. in vi), and I have to remove the first line manually with vi and dd and then save.
If I don't use vi and dd, I'm not able to manipulate the file with grep, but I need an automatic procedure that I can write in a shell script.
Ideas?
This should be enough to empty a file:
> file
However, the other methods you said you tried should also work. If you're seeing weird characters, then they are being written to the file by something else - most probably whatever process is logging there.
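For the record, the bare redirection really does leave a zero-byte file, with no stray newline:

```shell
printf 'some log lines\n' > emptied.log
: > emptied.log        # the ":" makes the truncation portable to plain sh
wc -c < emptied.log    # 0
```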
What's going on is fairly simple: you are emptying out the file.
Why is it full of ^#s, then, you ask? Well, in a very real sense, it is not. It does not contain those weird characters. It has a "hole".
The program that is writing to the file is writing a file that was opened with O_WRONLY (or perhaps O_RDWR) but not O_APPEND. This program has written, say, 65536 bytes into the file at the point when you empty out the file with cp /dev/null filename or : > filename or some similar command.
Now the program goes to write another chunk of data (say, 4096 or 8192 bytes). Where will that data be written? The answer is: "at the current seek offset on the underlying file descriptor". If the program used O_APPEND the write would be, in effect, preceded by an lseek call that did a "seek to current end-of-file, i.e., current length of file". When you truncate the file that "current end of file" would become zero (the file becoming empty) so the seek would move the write offset to position 0 and the write would go there. But the program did not use O_APPEND, so there is no pre-write "reposition" operation, and the data bytes are written at the current offset (which, again, we've claimed to be 65536 above).
You now have a file that has no data in byte offsets 0 through 65535 inclusive, followed by some data in byte offsets 65536 through 73727 (assuming the write writes 8192 bytes). That "missing" data is the "hole" in the file. When some other program goes to read the file, the OS pretends there is data there: all-zero-byte data.
If the program doing the write operations does not do them on block boundaries, the OS will in fact allocate some extra data (to fit the write into whole blocks) and zero it out. Those zero bytes are not part of the "hole" (they're real zero bytes in the file) but to ordinary programs that do not peek behind the curtain at the Wizard of Oz, the "hole" zero-bytes and the "non-hole" zero bytes are indistinguishable.
What you need to do is to modify the program to use O_APPEND, or to use library routines like syslog that know how to cooperate with log-rotation operations, or perhaps both.
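The file-descriptor mechanics described above can be reproduced in a few lines of shell; fd 3 here stands in for the logger's descriptor, opened without O_APPEND:

```shell
exec 3> app.log     # open once, like a long-running logger (no O_APPEND)
printf 'AAAA' >&3   # fd 3's offset is now 4
: > app.log         # "empty" the file: its length drops to 0
printf 'BBBB' >&3   # written at offset 4; bytes 0-3 become a hole
exec 3>&-
od -An -c app.log   # four NUL bytes, then B B B B
```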
[Edit to add: not sure why this suddenly showed up on the front page and I answered a question from 2011...]
Another way is the following:
cp /dev/null the_file
The advantage of this technique is that it is a single command, so in case it needs sudo access only one sudo call is required.
Why not just :>filename?
(: is a bash builtin having the same effect as /bin/true, and both commands don't echo anything)
Proof that it works:
fg#erwin ~ $ du t.txt
4 t.txt
fg#erwin ~ $ :>t.txt
fg#erwin ~ $ du t.txt
0 t.txt
If it's a log file then the proper way to do this is to use logrotate. As you mentioned doing it manually does not work.
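As a sketch, a hypothetical /etc/logrotate.d/myapp entry (the path and rotation policy are placeholders); the copytruncate directive is what empties the live file in place, so the writing process keeps its open descriptor:

```
/var/log/myapp.log {
    daily
    rotate 7
    compress
    copytruncate
}
```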
I don't have a Linux shell here to try it, but have you tried this?
echo "" > file
(note that this leaves a single newline in the file, i.e. one empty line, rather than a zero-byte file)

What's the fastest / most efficient find/replace app on *nix

I've got a large SQL dump (250MB+) and I need to replace www.mysite with dev.mysite. I have tried nano and vi for the find/replace, but both choke: nano can't even open it, and vi has been running the find/replace for an hour now.
Anyone know of a tool on *nix or windows systems that does fast Find/Replace on large files?
sed -i 's/www\.mysite/dev.mysite/g' dump.sql
(requires temporary storage space equal to the size of the input)
Search-and-replace on a SQL dump is not a good idea:
They aren't plain text files.
SQL syntax errors are easily introduced.
They sometimes contain very long lines.
What you should do is load it into a non-production database server, run the appropriate UPDATE statements, then dump it again. You can use the REPLACE function in MySQL for this.
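For example, with the dump loaded into a scratch database, a single UPDATE per affected column does the substitution (the table and column names here are hypothetical):

```sql
UPDATE site_options
SET option_value = REPLACE(option_value, 'www.mysite', 'dev.mysite')
WHERE option_value LIKE '%www.mysite%';
```

Then mysqldump the scratch database to get a clean, consistent dump.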
You need sed.
Example:
sed -e "s/www\.mysite/dev.mysite/g" your_large_sql
Alternatively, import the SQL into a database, then use REPLACE to substitute the matched strings.
