remove lines from text file that contain specific text - linux

I'm trying to remove lines that contain 0/0 or ./. in column 71 "FORMAT.1.GT" from a tab delimited text file.
I've tried the following code but it doesn't work. What is the correct way of accomplishing this? Thank you
my $cmd6 = `fgrep -v "0/0" | fgrep -v "./." $Variantlinestsvfile > $MDLtsvfile`; print "$cmd6";

You can either call a one-liner as borodin and zdim said. Which one is right for you is still not clear because you don't tell whether 71st column means the 71st tab-separated field of a line or the 71st character of that line. Consider
12345\t6789
Now what is the 2nd column? Is it the character 2 or the field 6789? Borodin's answer assumes it's 6789 while zdim assumes it's 2. Both showed a solution for either case but these solutions are stand-alone solutions. Programs of its own to be run from the commandline.
If you want to integrate that into your Perl script you could do it like this:
Replace this line:
my $cmd6 = `fgrep -v "0/0" | fgrep -v "./." $Variantlinestsvfile > $MDLtsvfile`; print "$cmd6";
with this snippet:
open( my $fh_in, '<', $Variantlinestsvfile ) or die "cannot open $Variantlinestsvfile: $!\n";
open( my $fh_out, '>', $MDLtsvfile ) or die "cannot open $MDLtsvfile: $!\n";
while( my $line = <$fh_in> ) {
# character-based:
print $fh_out $line unless (substr($line, 70, 3) =~ m{(?:0/0|\./\.)});
# tab/field-based:
my #fields = split(/\s+/, $line);
print $fh_out $line unless ($fields[70] =~ m|([0.])/\1|);
}
close($fh_in);
close($fh_out);
Use either the character-based line or the tab/field-based lines. Not both!
Borodin and zdim condensed this snippet to a one-liner, but you must not call that from a Perl script.

Since you need the exact position and know string lenghts substr can find it
perl -ne 'print if not substr($_, 70, 3) =~ m{(?:0/0|\./\.)}' filename
This prints lines only when a three-character long string starting at 71st column does not match either of 0/0 and ./.
The {} delimiters around the regex allow us to use / and | inside without escaping. The ?: is there so that the () are used only for grouping, and not capturing. It will work fine also without ?: which is there only for efficiency's sake.

perl -ane 'print unless $F[70] =~ m|([0.])/\1|' myfile > newfile

The problem with your command is that you are attempting to capture the output of a command which produces no output - all the matches are redirected to a file, so that's where all the output is going.
Anyway, calling grep from Perl is just wacky. Reading the file in Perl itself is the way to go.
If you do want a single shell command,
grep -Ev $'^([^\t]*\t){70}(\./\.|0/0)\t' file
would do what you are asking more precisely and elegantly. But you can use that regex straight off in your Perl program just as well.

Try it!
awk '{ if ($71 != "./." && $71 != ".0.") print ; }' old_file.txt > new_file.txt

Related

Append output of a command to each line in large file

I need to add a random guid to each line in a large text file. I need that guid to be different for each line.
This works except that the guid is the same for every line:
sed -e "s/$/$(uuidgen -r)/" text1.log > text2.log
Here is a way to do it using awk:
awk -v cmd='uuidgen' 'NF{cmd | getline u; print $0, u > "test2.log"; close(cmd)}' test1.log
Condition NF (or NF > 0) ensures we do it only for non-empty lines.
Since we are calling close(cmd) each time there will be a new call to uuidgen for every record.
However since uuidgen is called for every non-empty line, it might be slow for huge files.
That's because the command substitution will get evaluated before the commands gets started.
The shell will first execute uuidgen -r, and replace the command substitution be it's result, let's say 0e4e5a48-82d1-43ea-94b6-c5de7573bdf8. The shell will then execute sed like this:
sed -e "s/$/0e4e5a48-82d1-43ea-94b6-c5de7573bdf8/" text1.log > text2.log
You can use a while loop in the shell to achieve your goal:
while read -r line ; do
echo "$line $(uuidgen -r)"
done < file > file_out
Rather than run a whole new uuidgen process for each and every line, I generated a new UUID for each line in Perl which is just a function call:
#!/usr/bin/perl
use strict;
use warnings;
use UUID::Tiny ':std';
my $filename = 'data.txt';
open(my $fh,'<',$filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
chomp $row;
my $uuid = create_uuid(UUID_V4);
my $str = uuid_to_string($uuid);
print "$row $str\n";
}
To test, I generated a 1,000,000 line CSV as shown here.
It takes 10 seconds to add the UUID to the end of each line of the 1,000,000 record file on my iMac.

rearranging column based on condition

I have a *.csv file. with value as below
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
or sometimes the value is as below
"8801989353984","KSDP05"
"8801957608165","ASDP11"
"8801991455848","CSDP10"
"8801981363116","CSDP07"
"8801921247870","KSDP07"
"8801965386240","CSDP06"
"8801956293036","KSDP10"
"8801984383904","KSDP11"
"8801944211742","ASDP09"
I just want to put the numeric value (e.g. 8801989353984) always in 1st column. Is it possible using BASH script?
Sed is also your friend here
Input
cat 41189347
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
Script
sed -E 's/^("[[:alpha:]]+.*"),("[[:digit:]]+")$/\2,\1/' 41189347
Output
"8801942183589","ASDP02"
"8801939151023","ASDP06"
"8801963981740","CSDP04"
"8801946305047","ASDP09"
"8801941195677","ASDP12"
"8801922826186","ASDP05"
"8801983008938","CSDP08"
"8801944346555","ASDP04"
"8801910831518","CSDP11"
awk to the rescue!
$ awk -F, -v OFS=, '$1~/[A-Z]/{t=$2;$2=$1;$1=t}1' file
if first field has alpha chars, swap first and second columns and print.
Bash can do the work but awk might be a better choice for rearrange your file:
sample.csv:
"ASDP02","8801942183589"
"8801944211742","ASDP09"
command:
awk -F, 'BEGIN{OFS=","}{$1=$1;if(substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2)){print $1,$2}else{print $2,$1}}' sample.csv
substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2) checks the column is numeric or not. If it is, print the original line otherwise switch column1 and column2
Output:
"8801942183589","ASDP02"
"8801944211742","ASDP09"
You can create a pure bash script to generate other file which has the structure you need:
#!/bin/bash
csv_file="/path/to/your/csvfile"
output_file="/path/to/output_file"
#Optional
rm -rf "${output_file}"
readarray -t LINES < <(cat < "${csv_file}" 2> /dev/null)
for item in "${LINES[#]}"; do
if [[ $item =~ ^\"([0-9A-Z]+)\"\,\"([0-9]+)\" ]]; then
echo "\"${BASH_REMATCH[2]}\",\"${BASH_REMATCH[1]}\"" >> "${output_file}"
else
echo "$item" >> "${output_file}"
fi
done
This works even if your file is "mixed" I mean with some lines in the right format and other lines in the bad format.
The following commands assume that the cells in the CSV files do not contain newlines and commas. Otherwise, you should write a more complicated script in Perl, PHP, or other programming language capable of parsing CSV files properly. But Bash, definitely, is not appropriate for this task.
Perl
perl -F, -nle '#F = reverse #F if $F[0] =~ /^"\d+"$/;
print join(",", #F)' file
Beware, If the cells contain newlines, or commas, use Perl's Text::CSV module, for instance. Although it is a simple task in Perl, it goes beyond the scope of the current question.
The command splits the input lines by commas (-F,) and stores the result into #F array, for each line. The items in the array are reversed, if the first field $F[0] matches the regular expression. You can also swap the items this way: ($F[0], $F[1]) = ($F[1], $F[0]).
Finally, the joins the array items with commas, and prints to the standard output.
If you want to edit the file in-place, use -i option: perl -i.backup -F, ....
AWK
awk -F, -vOFS=, '/^"[0-9]+",/ {print; next}
{ t = $1; $1 = $2; $2 = t; print }' file
The input and output field separators are set to , with -F, and -vOFS=,.
If the line matches the pattern /^"[0-9]+",/ (the line begins with a "numeric" CSV column), the script prints the record and advances to the next record. Otherwise the next block is executed.
In the next block, it swaps the first two columns and prints the result to the standard output.
If you want to edit the file in-place, see answers to this question.

Use sed or awk to replace line after match

I'm trying to create a little script that basically uses dig +short to find the IP of a website, and then pipe that to sed/awk/grep to replace a line. This is what the current file looks like:
#Server
123.455.1.456
246.523.56.235
So, basically, I want to search for the '#Server' line in a text file, and then replace the two lines underneath it with an IP address acquired from dig.
I understand some of the syntax of sed, but I'm really having trouble figuring out how to replace two lines underneath a match. Any help is much appreciated.
Based on the OP, it's not 100% clear exactly what needs to replaced where, but here's a a one-liner for the general case, using GNU sed and bash. Replace the two lines after "3" with standard input:
echo Hoot Gibson | sed -e '/3/{r /dev/stdin' -e ';p;N;N;d;}' <(seq 7)
Outputs:
1
2
3
Hoot Gibson
6
7
Note: sed's r command is opaquely documented (in Linux anyway). For more about r, see:
"5.9. The 'r' command isn't inserting the file into the text" in this sed FAQ.
here's how in awk:
newip=12.34.56.78
awk -v newip=$newip '{
if($1 == "#Server"){
l = NR;
print $0
}
else if(l>0 && NR == l+1){
print newip
}
else if(l==0 || NR != l+2){
print $0
}
}' file > file.tmp
mv -f file.tmp file
explanation:
pass $newip to awk
if the first field of the current line is #Server, let l = current line number.
else if the current line is one past #Server, print the new ip.
else if the current row is not two past #Server, print the line.
overwrite original file with modified version.

How to remove all lines from a text file starting at first empty line?

What is the best way to remove all lines from a text file starting at first empty line in Bash? External tools (awk, sed...) can be used!
Example
1: ABC
2: DEF
3:
4: GHI
Line 3 and 4 should be removed and the remaining content should be saved in a new file.
With GNU sed:
sed '/^$/Q' "input_file.txt" > "output_file.txt"
With AWK:
$ awk '/^$/{exit} 1' test.txt > output.txt
Contents of output.txt
$ cat output.txt
ABC
DEF
Walkthrough: For lines that matches ^$ (start-of-line, end-of-line), exit (the whole script). For all lines, print the whole line -- of course, we won't get to this part after a line has made us exit.
Bet there are some more clever ways to do this, but here's one using bash's 'read' builtin. The question asks us to keep lines before the blank in one file and send lines after the blank to another file. You could send some of standard out one place and some another if you are willing to use 'exec' and reroute stdout mid-script, but I'm going to take a simpler approach and use a command line argument to let me know where the post-blank data should go:
#!/bin/bash
# script takes as argument the name of the file to send data once a blank line
# found
found_blank=0
while read stuff; do
if [ -z $stuff ] ; then
found_blank=1
fi
if [ $found_blank ] ; then
echo $stuff > $1
else
echo $stuff
fi
done
run it like this:
$ ./delete_from_empty.sh rest_of_stuff < demo
output is:
ABC
DEF
and 'rest_of_stuff' has
GHI
if you want the before-blank lines to go somewhere else besides stdout, simply redirect:
$ ./delete_from_empty.sh after_blank < input_file > before_blank
and you'll end up with two new files: after_blank and before_blank.
Perl version
perl -e '
open $fh, ">","stuff";
open $efh, ">", "rest_of_stuff";
while(<>){
if ($_ !~ /\w+/){
$fh=$efh;
}
print $fh $_;
}
' demo
This creates two output files and iterates over the demo data. When it hits a blank line, it flips the output from one file to the other.
Creates
stuff:
ABC
DEF
rest_of_stuff:
<blank line>
GHI
Another awk would be:
awk -vRS= '1;{exit}' file
By setting the record separator RS to be an empty string, we define the records as paragraphs separated by a sequence of empty lines. It is now easily to adapt this to select the nth block as:
awk -vRS= '(FNR==n){print;exit}' file
There is a problem with this method when processing files with a DOS line-ending (CRLF). There will be no empty lines as there will always be a CR in the line. But this problem applies to all presented methods.

Bash: How to keep lines in a file that have fields that match lines in another file?

I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
file B I managed to use cut and sed and other things to basically get it down to one field that is a list.
So The goal is to keep all lines in file A in the 4th field (it says KEYFIELD) if the field for that line matches one of the lines in file B. (Does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, it'd be ok)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: Ok I give up. I just force killed it.
Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
array[$0]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
gsub(/[^[:alnum:]]/,"",$12)
array[$12]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of using the basic shell tools. Assuming about 40 characters per line, File A has 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. that's 480 BILLION lines you're parsing through. Unix tools are surprisingly quick, but even something fast done 480 billion times will add up.
You would be better off using a full programming scripting language like Perl or Python. You put all lines in File B in a hash. You take each line in File A, check to see if that fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something -- off the top of my head. You didn't give us much in the way of spects, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
chomp $line;
$index{$line} = $line; #Or however you do it...
}
close $file_b;
#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
chomp $line;
my #fields = split /\s+/, $line;
if (exists $index{$field[3]}) {
say "Line: $line";
}
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.
A solution, which was faster for me, was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable over iterating the larger file multiple times.
while read line; do
grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while read line; do
grep -F "$line" fileA
done < fileBcutdown | sort -u | outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
use the below command:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA

Resources