Replacing a fixed amount of text in a large file - Linux

I'm trying to replace a small amount of text on a specific line of a large log file (totaling roughly 40 million lines):
sed -i '20000000s/.\{5\}$/zzzzz/' log_file
The purpose of this is to "mark" a line with an expected unique string, for later testing.
The above command works fine, but sed's (and perl's) in-place editing creates a temporary file, which is costly.
Is there a way to replace a fixed number of characters (i.e. 5 chars with 5 other chars) in a file without having to create a temp file, or a very large buffer, which would wind up becoming a temp file itself?

You could use dd to replace some bytes in place:
dd if=/dev/zero of=path/to/file bs=1 count=10 conv=notrunc seek=1000
would write 10 zero bytes (0x00) right after the first 1000 bytes of the file (note that it is seek, not skip, that positions the write within the output file). You can put whatever you want to write into a file and pass its path as the if parameter; then set the count parameter to the size of that replacement file so all of it gets read.
The conv=notrunc parameter tells dd not to truncate the rest of the file.
This should work well for any single-byte file encoding.
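Applied to the question above, a possible sketch (untested; it assumes a single-byte encoding, that the file has at least 20,000,000 lines, and that the target line ends in a plain newline and has at least 5 characters) would first compute the byte offset of the end of that line and then overwrite its last 5 characters in place:
# byte count up to and including line 20000000 (its trailing newline included)
offset=$(head -n 20000000 log_file | wc -c)
# overwrite the 5 characters just before that newline, without truncating the file
printf 'zzzzz' | dd of=log_file bs=1 seek=$((offset - 6)) count=5 conv=notrunc
This still streams through the first 20,000,000 lines once to find the offset, but it never rewrites the rest of the file and needs no temp file.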

ex is a scriptable file editor, so it will work in-place:
ex log_file << 'END_OF_COMMANDS'
20000000s/.\{5\}$/zzzzz/
w
q
END_OF_COMMANDS

Related

Linux Sed won't work on large files (3GB)

I have a large file (>3GB) which has a single line containing content similar to the following:
{"id","1","name":"one"},{"id","2","name":"two"},{"id","3","name":"three"},
I am using sed to search and replace }, with },\n so that each dictionary is on its own line.
sed 's/},/},\\n/g' FILE >> NEW_FILE
This doesn't work all the time, and I had to split the file into smaller 1GB chunks, replace the text in each smaller file, then combine them.
Is there any other way to do the search and replace on this large file?
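One approach worth sketching (not from the original thread, and untested): rather than making the tool hold the whole multi-gigabyte line at once, read the input in "},"-delimited chunks so memory use stays bounded by the size of one chunk. For example, with perl:
# Read in "},"-delimited chunks instead of whole lines, so only one chunk is
# ever in memory; append a newline after each chunk's trailing "},".
perl -pe 'BEGIN { $/ = "}," } s/\},\z/},\n/' FILE > NEW_FILE
Any final piece of input that does not end in "}," is passed through unchanged.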

Ignore spaces, tabs and new line in SED

I tried to replace a string in a file that contains tabs and line breaks.
The command in the shell script looked something like this:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -i 's/'"$STRING_OLD"'/'"$STRING_NEW"'/' $FILE
If I manually remove the line breaks and the tabs and leave only the spaces, then the replacement succeeds. But if I leave the line breaks in, sed is unable to locate $STRING_OLD and does not replace it with the new string.
thanks in advance
Kobi
sed reads lines one at a time, and usually lines are also processed one at a time, as they are read. However, sed does have facilities for reading additional lines and operating on the combined result. There are several ways that could be applied to your problem, such as:
FILE="/Somewhere"
STRING_OLD="line 1[ \t\r\n]*line 2"
sed -n "1h;2,\$H;\${g;s/$STRING_OLD/$STRING_NEW/g;p}" "$FILE"
That does more or less what you describe doing manually: it concatenates all the lines of the file (but keeps the newlines), and then performs the substitution on the overall buffer, all at once. That does assume, however, either that the file is short (POSIX does not require this to work if the overall file length exceeds 8192 bytes) or that you are using a sed that does not have buffer-size limitations, such as GNU sed. Since you tagged Linux, I'm supposing that GNU sed can be assumed.
In detail:
the -n option turns off line echoing, because we save everything up and print the modified text in one chunk at the end.
there are multiple sed commands, separated by semicolons, and with literal $ characters escaped (for the shell):
1h: when processing the first line of input, replace the "hold space" with the contents of the pattern space (i.e. the first line, excluding newline)
2,\$H: when processing any line from the second through the last, append a newline to the hold space, then the contents of the pattern space
\${g;s/$STRING_OLD/$STRING_NEW/g;p}: when processing the last line, perform this group of commands: copy the hold space into the pattern space; perform the substitution, globally; print the resulting contents of the pattern space.
That's one of the simpler approaches, but if you need to accommodate seds that are not as capable as GNU's with regard to buffer capacity then there are other ways to go about it. Those start to get ugly, though.
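As a quick illustration (the file name and contents here are made up, and this relies on GNU sed accepting \t and \n escapes inside a bracket expression), the same command run on a small three-line file:
$ printf 'line 1\n\tline 2\nline 3\n' > demo.txt
$ STRING_OLD='line 1[ \t\r\n]*line 2'
$ STRING_NEW='NEW TEXT'
$ sed -n "1h;2,\$H;\${g;s/$STRING_OLD/$STRING_NEW/g;p}" demo.txt
NEW TEXT
line 3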

How to edit 300 GB text file (genomics data)?

I have a 300 GB text file that contains genomics data with over 250k records. There are some records with bad data and our genomics program 'Popoolution' allows us to comment out the "bad" records with an asterisk. Our problem is that we cannot find a text editor that will load the data so that we can comment out the bad records. Any suggestions? We have both Windows and Linux boxes.
UPDATE: More information
The program Popoolution (https://code.google.com/p/popoolation/) crashes when it reaches a "bad" record giving us the line number that we can then comment out. Specifically, we get a message from Perl that says "F#€%& Scaffolding". The manual suggests we can just use an asterisk to comment out the bad line. Sadly, we will have to repeat this process many times...
One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once? This could be very useful given that we will have to repeat the process an unknown number of times.
Based on your update:
One more thought... Is there an approach that would allow us to add
the asterisk to the line without opening the entire text file at once.
This could be very useful given that we will have to repeat the
process an unknown number of times.
Here is an approach: if you know the line number, you can add an asterisk at the beginning of that line with:
sed 'LINE_NUMBER s/^/*/' file
See an example:
$ cat file
aa
bb
cc
dd
ee
$ sed '3 s/^/*/' file
aa
bb
*cc
dd
ee
If you add -i, the file will be updated:
$ sed -i '3 s/^/*/' file
$ cat file
aa
bb
*cc
dd
ee
That said, I always think it's better to redirect to another file:
sed '3 s/^/*/' file > new_file
so that you keep your original file intact and save the updated version in new_file.
If you are required to have a person mark these records manually with a text editor, for whatever reason, you should probably use split to split the file up into manageable pieces.
split -a4 -d -l100000 hugefile.txt part.
This will split the file up into pieces with 100000 lines each. The names of the files will be part.0000, part.0001, etc. Then, after all the files have been edited, you can combine them back together with cat:
cat part.* > new_hugefile.txt
The simplest solution is to use a stream-oriented editor such as sed. All you need is to be able to write one or more regular expression(s) that will identify all (and only) the bad records. Since you haven't provided any details on how to identify the bad records, this is the only possible answer.
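For instance (BAD_RECORD_REGEX and marked_hugefile.txt below are just placeholders for illustration), a single streaming pass could prepend the asterisk to every matching line:
sed '/BAD_RECORD_REGEX/ s/^/*/' hugefile.txt > marked_hugefile.txt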
A basic pattern in R is to read the data in chunks, edit them, and write them out:
fin = file("fin.txt", "r")
fout = file("fout.txt", "w")
while (length(txt <- readLines(fin, n=1000000))) {
    ## txt now holds up to 1000000 lines; add an asterisk to problem lines
    ## bad = <create logical vector indicating bad lines here>
    ## txt[bad] = paste0("*", txt[bad])
    writeLines(txt, fout)
}
close(fin); close(fout)
While not ideal, this works on Windows (implied by the mention of Notepad++) and in a language that you are presumably familiar with (R). Using sed (definitely the appropriate tool in the long run) would require installing additional software and coming up to speed with sed.

Empty a file while in use in linux

I'm trying to empty a file in Linux while it is in use; it's a log file, so it is continuously being written to.
Right now I've used:
echo -n > filename
or
cat /dev/null > filename
but all of these produce an empty file with a newline character (or strange characters that I can see as ^#^#^#^#^#^#^#^#^#^#^#^.. in vi), and I have to remove the first line manually in vi with dd and then save.
If I don't use vi and dd I'm not able to manipulate the file with grep, but I need an automatic procedure that I can write in a shell script.
Ideas?
This should be enough to empty a file:
> file
However, the other methods you said you tried should also work. If you're seeing weird characters, then they are being written to the file by something else - most probably whatever process is logging there.
What's going on is fairly simple: you are emptying out the file.
Why is it full of ^#s, then, you ask? Well, in a very real sense, it is not. It does not contain those weird characters. It has a "hole".
The program that is writing to the file opened it with O_WRONLY (or perhaps O_RDWR) but not O_APPEND. This program has written, say, 65536 bytes into the file at the point when you empty out the file with cp /dev/null filename or : > filename or some similar command.
Now the program goes to write another chunk of data (say, 4096 or 8192 bytes). Where will that data be written? The answer is: "at the current seek offset on the underlying file descriptor". If the program had used O_APPEND, the write would, in effect, be preceded by an lseek call that did a "seek to current end-of-file, i.e., current length of file". When you truncate the file, that "current end of file" becomes zero (the file becoming empty), so the seek would move the write offset to position 0 and the write would go there. But the program did not use O_APPEND, so there is no pre-write "reposition" operation, and the data bytes are written at the current offset (which, again, we've claimed to be 65536 above).
You now have a file that has no data in byte offsets 0 through 65535 inclusive, followed by some data in byte offsets 65536 through 73727 (assuming the write writes 8192 bytes). That "missing" data is the "hole" in the file. When some other program goes to read the file, the OS pretends there is data there: all-zero-byte data.
If the program doing the write operations does not do them on block boundaries, the OS will in fact allocate some extra data (to fit the write into whole blocks) and zero it out. Those zero bytes are not part of the "hole" (they're real zero bytes in the file) but to ordinary programs that do not peek behind the curtain at the Wizard of Oz, the "hole" zero-bytes and the "non-hole" zero bytes are indistinguishable.
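A quick way to see what such a hole looks like (the file name here is made up, and this assumes a filesystem that supports sparse files) is to create one with dd and compare the logical size with the actual disk usage:
$ dd if=/dev/zero of=hole_demo bs=1 count=0 seek=65536   # extend the file to 64 KiB without writing any data
$ printf 'next log line\n' >> hole_demo                  # real data lands after the hole
$ ls -l hole_demo                                        # logical size: a bit over 64 KiB
$ du -k hole_demo                                        # allocated blocks: only those holding the appended data
Programs that read hole_demo see zero bytes for the first 65536 offsets, even though no data was ever written there.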
What you need to do is to modify the program to use O_APPEND, or to use library routines like syslog that know how to cooperate with log-rotation operations, or perhaps both.
[Edit to add: not sure why this suddenly showed up on the front page and I answered a question from 2011...]
Another way is the following:
cp /dev/null the_file
The advantage of this technique is that it is a single command, so in case it needs sudo access only one sudo call is required.
Why not just :>filename?
(: is a bash builtin that has the same effect as /bin/true; neither command echoes anything)
Proof that it works:
fg#erwin ~ $ du t.txt
4 t.txt
fg#erwin ~ $ :>t.txt
fg#erwin ~ $ du t.txt
0 t.txt
If it's a log file, then the proper way to do this is to use logrotate. As you mentioned, doing it manually does not work.
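A minimal sketch of a logrotate configuration (the path and schedule are only examples); copytruncate copies the log and then empties it in place, so the writing process can keep its file descriptor open:
/var/log/myapp/app.log {
    daily
    rotate 7
    compress
    copytruncate
    missingok
}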
I don't have a Linux shell here to try it, but have you tried this?
echo "" > file

vim -- How to read range of lines from a file into current buffer

I want to read lines n1 through n2 from file foo.c into the current buffer.
I tried: 147,227r /path/to/foo/foo.c
But I get: "E16: Invalid range", though I am certain that foo.c contains more than 1000 lines.
:r! sed -n 147,227p /path/to/foo/foo.c
You can do it in pure Vimscript, without having to use an external tool like sed:
:put =readfile('/path/to/foo/foo.c')[146:226]
Note that we must subtract one from the line numbers because list indexes start at 0 while line numbers start at 1.
Disadvantages: This solution is 7 characters longer than the accepted answer. It will temporarily read the entire file into memory, which could be a concern if that file is huge.
The {range} refers to the destination in the current file, not the range of lines in the source file.
After some experimentation, it seems
:147,227r /path/to/foo/foo.c
means: insert the contents of /path/to/foo/foo.c after line 227 of the current file, i.e. it ignores the 147.
Other solutions posted are great for specific line numbers. It's often the case that you want to read from the top or bottom of another file, and in that case reading the output of head or tail is very fast. For example:
:r !head -20 xyz.xml
will read the first 20 lines of xyz.xml into the current buffer at the cursor position, and
:r !tail -10 xyz.xml
will read the last 10 lines of xyz.xml into the current buffer at the cursor position.
The head and tail commands are extremely fast, so even combining them can be much faster than other approaches for very large files:
:r !head -700030 xyz.xml | tail -30
will read lines 700001 through 700030 of xyz.xml into the current buffer.
This operation should complete instantly even for fairly large files.
I just had to do this in a code project of mine and did it this way:
In buffer with /path/to/foo/foo.c open:
:147,227w export.txt
In buffer I'm working with:
:r export.txt
Much easier in my book... It requires having both files open, but if I'm importing a set of lines, I usually have them both open anyway. This method is more general and easier for me to remember, especially if I'm trying to export/import a trickier set of lines using :g/<search_criteria>/.w >> export.txt or some other more complicated way of selecting lines.
You will need to:
:r /path/to/foo/foo.c
:228,$d
:1,146d
Three steps, but it will get it done...
A range permits a command to be applied to a group of lines in the current buffer.
So the range given to the read instruction specifies where to insert the content in the current file, not which lines of the source file you want to read.
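For example (with the same hypothetical path), both of the following read the entire file; the range only controls where it lands in the current buffer:
:0r /path/to/foo/foo.c
:5r /path/to/foo/foo.c
The first inserts the whole of foo.c at the very top of the buffer, the second inserts it after line 5; neither limits which lines of foo.c are read.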
