Fastest way to replace string in first row of huge file in linux command line? - linux

I have a huge plain text file (~500Gb) on a linux machine. I want to replace a string in the header line (the first row of the file), but all the methods I know seem slow and inefficient.
example file:
foo apple cat
1 2 2
2 3 4
3 4 6
...
expected file output:
bar apple cat
1 2 2
2 3 4
3 4 6
...
sed:
sed -i '1s/foo/bar/g' file
-i can change the file in place, but this command generates a temp file on disk and uses it to replace the original one; rewriting the whole file wastes a lot of I/O time.
vim:
ex -c '1s/foo/bar/g' -c 'wq' file
ex doesn't generate a temp file, but it loads the whole file into memory, which also wastes a lot of time.
Is there a better solution that only reads the first row into memory and writes it back to the original file? I know that the linux head command can extract the first row very fast.

Could you please try the following awk command and let me know if it helps you; I couldn't test it as I don't have a file as huge as 500 GB. Note that it is not doing in-place substitution on Input_file: it writes to an explicit temp file and then renames it over the original.
awk 'FNR==1{$1="bar";print;next} 1' Input_file > temp_file && mv temp_file Input_file
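If the replacement text is exactly the same length in bytes as the original (as foo → bar is), a hedged alternative is to overwrite just those bytes in place with dd, so the other ~500 GB are never rewritten; conv=notrunc stops dd from truncating the rest of the file:
# overwrite the first 3 bytes of Input_file in place; only valid when old and new strings have the same length
printf 'bar' | dd of=Input_file conv=notrunc
If the new string were a different length, every byte after it would have to shift, so a full rewrite (as in the awk command above) is unavoidable in that case.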

Related

Split large gz files while preserving rows

I have a large .gz file (2.1G) that I am trying to load into R, but it is large enough that I have to split it into pieces and load each one individually before recombining them. However, I am having difficulty splitting the file in a way that preserves the structure of the data. The file itself, with the exception of the first two rows, is a 56318 x 9592 matrix with non-homogeneous entries.
I'm using Ubuntu 16.04. First, I tried using the split command from terminal as suggested by this link (https://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts?rq=1)
$ split --lines=10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
Doing this, though, creates far more files than I would have expected (since my matrix has 57000 rows, I was hoping for 6 output files, each 10000 rows in size). When reading one of these into R and checking the dimensions, I see that each is a 62x9592 matrix, indicating that the columns have all been preserved but I'm getting significantly fewer rows than I had hoped. Further, when reading it in, I get an error about an unexpected end of file. My thought is that it's not being read in the way I want it to be.
I found two possible alternatives here - https://superuser.com/questions/381394/unix-split-a-huge-gz-file-by-line
In particular, I've tried piping different arguments using gunzip, and then passing the output through to split (with the assumption that perhaps the file being compressed is what led to inconsistent end lines). I tried
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
but, doing this, I ended up with the exact same splits that I had previously. I have the same problem replacing "zcat" with "gunzip -c", which should have sent the uncompressed output to the split command.
Another answer on that link suggested piping to head or tail with something like zcat, for example
$ zcat originalFile.gct.gz | head -n 10000 >> "originalFile.gct.gz.1"
With zcat, this works perfectly, and it's exactly what I want. The dimension for this ends up being 10000x9592, so this is the ideal solution. One thing that I'll note is that this output is an ASCII text file rather than a compressed file, and I'm perfectly OK with that.
However, I want to be able to do this until the end of the file, making an additional output file for each 10000 rows. For this particular case it's not a huge deal to just make the six files by hand, but I have tens of files like this, some of which are >10gb. My question, then, is how I can use the split command so that it takes 10000 lines of the unzipped file at a time and writes them out, automatically updating the suffix with each new file. Basically, I want the output that I got from using "head", but with "split" so that I can do it over the entire file.
Here is the solution that ended up working for me
$ zcat originalFile.gct.gz | split -l 10000 - "originalFile.gct.gz-"
As Guido mentioned in the comment, my original command
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
was discarding the output of zcat, and split was once again reading from the compressed data. By including the "-" after the split argument, I was able to pass the standard output from zcat into split, and now the piping works as I was expecting it to.
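If you also want each chunk compressed as it is written, and your split comes from GNU coreutils (an assumption on my part), its --filter option can pipe each chunk through gzip; the single quotes matter so that $FILE is expanded by split's own shell rather than yours:
$ zcat originalFile.gct.gz | split -l 10000 --filter='gzip > $FILE.gz' - "originalFile.gct.gz.part-"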
When you want to control your splitting better, you can use awk.
You mentioned that the first two rows were special.
Try something like
zcat originalFile.gct.gz |
awk 'BEGIN {j=1} NR<3 {next} {print > ("originalFile.gct.part" j); i++} i%10000==0 {j++}'
When you want your output files compressed, modify the awk command: let it print the names of the completed files and use xargs to gzip them.
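A minimal sketch of that gzip step, assuming the part files produced by the awk command above:
printf '%s\n' originalFile.gct.part* | xargs gzip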
If splitting based on the content of the file works for you, try:
zcat originalFile.gct.gz | awk -F$',' '{print $0 | "gzip > /tmp/file_"$1".gct.gz";}'
An example line of my file was:
2014,daniel,2,1,2,3
So I was splitting the file by year (the first column) using the variable $1, getting an output of:
/tmp/file_2014.gct.gz
/tmp/file_2015.gct.gz
/tmp/file_2016.gct.gz
/tmp/file_2017.gct.gz
/tmp/file_2018.gct.gz

Unix Command Operations [duplicate]

This question already has answers here:
Filtering Rows Based On Number of Columns with AWK
(3 answers)
Closed 6 years ago.
Let's say there is a file in linux whose lines are space-separated.
e.g.
This is linux file
This is linux text
This is linux file 1
This is linux file 3
Now I want to print only those rows which have a 5th column present. In this example my output should be the 3rd and 4th lines (with 1 and 3 as the 5th column).
What is the best way to do it?
This can be done with awk and its NF (number of fields) variable, as per the following transcript:
pax$ cat inputFile
This is linux file
This is linux text
This is linux file 1
This is linux file 3
pax$ awk 'NF >= 5 {print}' inputFile
This is linux file 1
This is linux file 3
This works because the basic form of an awk command is pattern { action }.
The pattern selects those lines (and sometimes things that aren't lines, such as with BEGIN and END patterns) which meet certain criteria, and the action dictates what to do.
In this case, it selects lines that have five or more fields and simply prints them.
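Since print is awk's default action, the pattern alone is enough, so the command above can be shortened to:
awk 'NF >= 5' inputFile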
In addition to awk, you can also do it very simply in bash (or any of the shells) by reading each line into at least five fields and then checking that the fifth field is populated. Something like this will work (it reads from the filename given as the first argument, or from stdin if no name is given):
#!/bin/bash
fn="${1:-/dev/stdin}"
while read -r f1 f2 f3 f4 f5; do
[ -n "$f5" ] && printf "%s %s %s %s %s\n" "$f1" "$f2" "$f3" "$f4" "$f5"
done <"$fn"
Example
Using your data, the snippet above produces the following output:
$ bash prn5flds.sh dat/5fields.txt
This is linux file 1
This is linux file 3
(note: depending on your shell, read may or may not support the -r option; if it doesn't, simply omit it)

How to copy data from file to another file starting from specific line

I have two files, data.txt and results.txt. Assuming there are 5 lines in data.txt, I want to copy all of these lines and paste them into results.txt starting from line number 4.
Here is a sample below:
Data.txt file:
stack
ping
dns
ip
remote
Results.txt file:
# here are some text
# please do not edit these lines
# blah blah..
this is the 4th line that data should go on.
I've tried sed with various combinations but I couldn't make it work, I'm not sure if it fit for that purpose as well.
sed -n '4p' /path/to/file/data.txt > /path/to/file/results.txt
The above code copies line 4 only. That isn't what I'm trying to achieve. As I said above, I need to copy all lines from data.txt and paste them in results.txt but it has to start from line 4 without modifying or overriding the first 3 lines.
Any help is greatly appreciated.
EDIT:
I want to overwrite the copied data starting from line number 4 in the file results.txt. So, I want to leave the first 3 lines without modifications and overwrite the rest of the file with the data copied from data.txt.
Here's a way that works well from cron. Less chance of losing data or corrupting the file:
# preserve first lines of results
head -3 results.txt > results.TMP
# append new data
cat data.txt >> results.TMP
# rename output file atomically in case of system crash
mv results.TMP results.txt
You can use process substitution to give cat a fifo to read from. Note, though, that redirecting straight back onto result.txt truncates it before head has read the first 3 lines, so write to a temporary file and rename it:
cat <(head -3 result.txt) data.txt > result.tmp && mv result.tmp result.txt
head -n 3 /path/to/file/results.txt > /path/to/file/results.tmp
cat /path/to/file/data.txt >> /path/to/file/results.tmp
mv /path/to/file/results.tmp /path/to/file/results.txt
(This goes through a temporary file because redirecting head's output straight back into results.txt would truncate it before head reads it.)
if you can use awk:
awk 'NR!=FNR || NR<4' Result.txt Data.txt
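That prints the merged result to standard output; to write it back, a small sketch using the same temp-file-and-rename idea as the first answer (keeping this answer's file names, with Result.tmp as an arbitrary temp name):
awk 'NR!=FNR || NR<4' Result.txt Data.txt > Result.tmp && mv Result.tmp Result.txt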

text file contains lines of bizarre characters - want to fix

I'm an inexperienced programmer grappling with a new problem in a large text file which contains data I am trying to process. Here's a screen capture of what I'm looking at (using 'less' - I am on a linux server):
https://drive.google.com/file/d/0B4VAqfRxlxGpaW53THBNeGh5N2c/view?usp=sharing
Bioinformaticians will recognize this file as a "fastq" file containing DNA sequence data. The top half of the screenshot contains data in its expected format (which I admit contains some "bizarre" characters, but that is not the issue). However, the bottom half (with many characters shaded in white) is completely messed up. If I were to scroll down the file, it eventually returns to normal text after about 500 lines. I want to fix it because it is breaking downstream operations I am trying to perform (which complain about precisely this position in the file).
Is there a way to grep for and remove the shaded lines? Or can I fix this problem by somehow changing the encoding on the offending lines?
Thanks
If you are lucky, you can use
strings file > file2
Oh well, try it another way.
Determine the line length of the correct lines (I think the first two lines are different).
head -1 file | wc -c
head -2 file | tail -1 | wc -c
Hmm, wc also counts the line ending, so subtract 1 from both lengths.
Then try to read the file one line at a time. Use a case statement so you do not have to write a lot of else-if constructions comparing the length to the expected lengths. In the code below I will accept the lengths 20, 100 and 330.
Redirect everything to another file outside the loop (a redirection inside the loop would overwrite the output file on every line).
while IFS= read -r line; do
    case ${#line} in
        20|100|330) printf '%s\n' "$line" ;;
    esac
done < file > file2
A totally different approach would be to filter out the wrong lines with sed, awk or grep, but that requires knowing which characters you will and won't accept.
Yes, if you are lucky, all the ugly lines will have a character in common, like '<' or maybe '#'. In that case you can use egrep:
egrep -v "<|#" file > file2
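If the bad lines don't share an obvious character, another hedged option (assuming the clean lines contain only printable ASCII) is to drop every line that contains a non-printable byte, treating the file as plain bytes in the C locale:
LC_ALL=C grep -av '[^[:print:][:space:]]' file > file2
The -a flag makes grep treat the file as text even if some bytes look binary.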
BASED ON INSPECTION OF THE SNAP
sed -r 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file
to make the actual changes in the file and make a backup file with a .bak extension do
sed -r -i.bak 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file

Editing large text files on linux ( 5 - 10gb)

Basically, I need a file of a specified format and large size (around 10gb). To get this, I am copying the contents of my original file into the same file, multiple times, to increase its size. I don't care about the contents of the file as long as they have the required format.
Initially, I tried to do this using gedit, which failed miserably after a few hundred MBs. I'm looking for an editor that will help me do this, or maybe a suggestion on alternate ways.
You could make 2 files and repeatedly append them to each other:
cp file1 file2
# each pass appends the files to each other, so the sizes grow roughly
# exponentially; stop as soon as file1 is big enough (GNU stat assumed)
for x in $(seq 1 200); do
    cat file1 >> file2
    cat file2 >> file1
    [ "$(stat -c %s file1)" -ge $((10 * 1024 * 1024 * 1024)) ] && break
done
In Windows, from the command line:
copy file1.txt+file2.txt file3.txt
concats 1 and 2, places in 3 - repeat or add +args until you get the size you need.
For Unix,
cat file1.txt file2.txt >> file3.txt
concats 1 and 2, places in 3 - repeat or add more input files until you get the size you need.
There are probably many other ways to do this in Unix.
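One more Unix sketch, assuming GNU stat for the size check (seed.txt and big.txt are placeholder names): append a single seed file to the output in a loop until it reaches the size you need.
# grow big.txt to roughly 10 GB by repeatedly appending seed.txt
target=$((10 * 1024 * 1024 * 1024))
: > big.txt
while [ "$(stat -c %s big.txt)" -lt "$target" ]; do
    cat seed.txt >> big.txt
done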
