I have values in file A like
301310
304790
500011
600462
607348
614269
I want to search for all these values in a text file B, which has lines like this:
1.35|10|5|11|1p36.31|GPR153|P|G protein-coupled receptor 153||614269|REc|||| | |4(Gpr153)|
1.36|3|24|06|1p36.31|HES2|P|Hairy/enhancer of split, Drosophila, homolog of, 2||609970|REc|||| | ||
1.37|3|24|06|1p36.31|HES3|P|Hairy/enhancer of split, Drosophila, homolog of, 3||609971|REc|||| | ||
1.38|3|24|06|1p36.33|HES4|P|Hairy/enhancer of split, Drosophila, homolog of, 4||608060|REc|||| | ||
1.39|3|24|06|1p36.32|HES5|P|Hairy/enhancer of split, Drosophila, homolog of, 5||607348|REc|||| | ||
I want to find all lines which contain any of the terms from file A and print them to an output file.
I tried the following command, but it is not working:
grep -w -F -f fileA fileB > fileC
When you posted your question, you had your fileA like this:
301310
304790
500011
I edited the post for formatting, and removed the empty lines figuring it was just an editing typo on your part. If your input file actually does have empty lines in it, you'll want to remove them, as grep will interpret them as "match an empty pattern", which matches every line. It would equate to:
grep "" fileB
So, just wipe the empty lines from fileA and your posted grep command should work just fine.
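For example, one way to strip the empty lines (a quick sketch; the in-place variant assumes GNU sed):
grep -v '^$' fileA > fileA.clean
or, editing fileA in place:
sed -i '/^$/d' fileA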
If your grep isn't working, then maybe a Perlish solution?
Although I will point out - given your data examples, no matches are found, which could be part of the problem.
#!/usr/bin/perl
use strict;
use warnings;
open ( my $file1, '<', "fileA" ) or die $!;
my @search_for = <$file1>;
chomp @search_for; # strip newlines so they don't end up inside the patterns
close ( $file1 );
my $search_regex = join ( "|", map {quotemeta} @search_for );
$search_regex = qr/$search_regex/;
open ( my $file2, '<', "fileB" ) or die $!;
while ( <$file2> ) {
print if m/$search_regex/;
}
close ( $file2 );
If you want to ensure you don't have blank lines:
my $search_regex = join ( "|", grep { m/^\w+$/ } map {quotemeta} @search_for );
$search_regex = qr/$search_regex/;
The following command worked
grep -w -F -f fileA fileB > fileC
Initially it did not work because there was a blank line in fileA; that's why it was printing everything from fileB. Once I removed the blank line from fileA, I got my required lines.
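For anyone hitting the same thing: a quick way to check a patterns file for blank lines before running grep is
grep -c '^$' fileA
A non-zero count means grep -f would treat those blanks as match-everything patterns.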
I have a file with the following structure:
# #################################################################
# TEXT: MORE TEXT
# TEXT: MORE TEXT
# #################################################################
___________________________________________________________________
ITEM 1
___________________________________________________________________
PROPERTY1: VALUE1_1
PROPERTY222: VALUE2_1
PROPERTY33: VALUE3_1
PROPERTY4444: VALUE4_1
PROPERTY55: VALUE5_1
Description1: Some text goes here
Description2: Some text goes here
___________________________________________________________________
ITEM 2
___________________________________________________________________
PROPERTY1: VALUE1_2
PROPERTY222: VALUE2_2
PROPERTY33: VALUE3_2
PROPERTY4444: VALUE4_2
PROPERTY55: VALUE5_2
Description1: Some text goes here
Description2: Some text goes here
I want to add another item to the file, using sed or awk:
sed -i -r "\$a$PROPERTY1: VALUE1_3" file.txt
sed -i -r "\$a$PROPERTY2222: VALUE2_3" file.txt
etc. So my next item looks like this:
___________________________________________________________________
ITEM 3
___________________________________________________________________
PROPERTY1: VALUE1_3
PROPERTY222: VALUE2_3
PROPERTY33: VALUE3_3
PROPERTY4444: VALUE4_3
PROPERTY55: VALUE5_3
Description1: Some text goes here
Description2: Some text goes here
The column of values is jagged. How do I align my values to the left, like for the previous items? I can see 2 solutions here:
To align the values while inserting them into the file.
To insert the values into the file the way I did it and align them next.
The command
sed -i -r "s|.*:.*|&|g" file.txt
catches the properties and values I want to align, but I haven't been able to align them properly, e.g. with
awk '/^.*:.*$/{ printf "%-40s %-70s\n", $1, $2 }' file.txt
It prints out the file, but it includes the description values and tags, and cuts off values that include spaces or dashes. It's just a big mess.
I've tried more commands based on what I've found on Stack Overflow and some blogs, but nothing does what I need.
Note: values of the description tags are not jagged; this is because I write them to the file in a separate way.
What is wrong with my commands? How do I achieve what I need?
If your file contains no tabs yet, try this:
sed -r 's/: +/:\t/' file.txt | expand -t 20
When this works, redirect the output to a tmpfile and move the tmpfile to file.txt.
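For example (untested; \t in the replacement and expand -t both assume GNU tools):
sed -r 's/: +/:\t/' file.txt | expand -t 20 > tmpfile && mv tmpfile file.txt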
You can use gensub and thoughtful field separators to take care of this:
for i in {1..5}; do
echo $(( 10 ** i )): $i;
done | awk -F ':::' '/^[^:]+:.+/{
$0 = gensub(/: +/, ":::", 1); # replace the first ": " with the ::: separator
key = $1 ":";
printf "%-40s %s\n", key, $2;
}'
The relevant part is where we swap out ": +" for just ":::" and then do a printf to bring it back together.
You could use \t to insert tabs (rather than spaces, which is why you get 'jagged' values)
instead of
sed -i -r "\$a$PROPERTY1: VALUE1_3" file.txt
use
sed -i -r "\$a$PROPERTY1:\t\tVALUE1_3" file.txt
All you need to do is remember the existing indentation when inserting the new line, e.g.:
echo 'PROPERTY732: VALUE9_8_7' |
awk -v prop="PROPERTY1" -v val="VALUE1_3" '
match($0,/^PROPERTY[^[:space:]]+[[:space:]]+/) { wid=RLENGTH }
{ print }
END { printf "%-*s%s\n", wid, prop":", val }
'
PROPERTY732: VALUE9_8_7
PROPERTY1: VALUE1_3
but it's not clear that adding 1 line at a time makes sense or where all of the other text you're adding is coming from.
The above will work with any awk on any UNIX system.
If your "properties" don't actually start with the word PROPERTY then you just need to edit your question to show more realistic sample input/output and tell/show us how to distinguish a PROPERTY line from a Description line and, again, the solution will be trivial with awk.
I have a problem with some CSV files coming from a piece of software, which I want to import into PostgreSQL (the COPY ... FROM ... CSV command). The problem is that the last columns are sometimes missing, like this (letters for headers, numbers for values, _ for the TAB delimiter):
a_b_c_d
1_2_3_4
5_6_7 <- last column missing
8_9_0_1
2_6_7 <- last column missing
The result of COPY in_my_table FROM 'file.csv' is:
ERROR: missing data for column "d"
Sample of a correct file for import :
a_b_c_d
1_2_3_4
5_6_7_ <- null column but not missing
8_9_0_1
2_6_7_ <- null column but not missing
My question: are there any bash / Linux shell commands to add the TAB delimiters and produce a correct / complete / padded CSV file with all columns?
Thanks for the help.
Ok, so in fact I found this:
awk -F'\t' -v OFS='\t' 'NF=50' input.csv > output.csv
where 50 is the number of tabs plus 1, i.e. the total number of columns.
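On the sample above, with 4 columns, that would be (a sketch; works in GNU awk and mawk):
awk -F'\t' -v OFS='\t' 'NF=4' input.csv > output.csv
Assigning to NF forces awk to rebuild the record with OFS between all 4 fields, so short rows get padded with empty trailing fields, and since the assignment evaluates to 4 (true), every line is printed.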
I don't know much about Linux, but this could easily be done in PostgreSQL via a simple command like
copy tableName from '/filepath/name.csv' delimiter '_' csv WITH NULL AS 'null';
You can use a combination of sed and regular expressions:
sed -r 's/^[0-9](_[0-9]){2}$/&_/' file.csv
You only need to replace _ by your delimiter (\t).
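With tabs, that might look like (untested; \t in a sed pattern is a GNU extension):
sed -r 's/^[0-9](\t[0-9]){2}$/&\t/' file.csv
Note the pattern assumes single-character fields like the sample; for real data you would want something more general, such as ^([^\t]*\t){2}[^\t]*$.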
Awk is good for this.
awk -F"\t" '{ # Tell awk we are working with tabs
if ($4 =="") # If the last field is empty
print $0"\t" # print the whole line with a tab
else
print $0 # Otherwise just print the line
}' your.csv > your.fixed.csv
Perl has a CSV module, which might be handy to fix even more complicated CSV errors. On my Ubuntu test system it is part of the package libtext-csv-perl.
This fixes your problem:
#! /usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new ({ binary => 1, eol => $/, sep_char => '_' });
open my $broken, '<', 'broken.csv' or die $!;
open my $fixed, '>', 'fixed.csv' or die $!;
while (my $row = $csv->getline ($broken)) {
$#{$row} = 3; # force exactly 4 fields; missing ones become undef and print as empty
$csv->print ($fixed, $row);
}
Change sep_char to "\t", if you have a tabulator delimited file and keep in mind that Perl treats "\t" and '\t' differently.
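A two-line illustration of that quoting difference (the variable names are just for show):
my $tab = "\t"; # an actual tab character
my $not_a_tab = '\t'; # a literal backslash followed by the letter t
So for a tab-delimited file you would pass sep_char => "\t".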
I'm trying to remove lines that contain 0/0 or ./. in column 71 "FORMAT.1.GT" from a tab-delimited text file.
I've tried the following code but it doesn't work. What is the correct way of accomplishing this? Thank you
my $cmd6 = `fgrep -v "0/0" | fgrep -v "./." $Variantlinestsvfile > $MDLtsvfile`; print "$cmd6";
You can call either one-liner, as Borodin and zdim said. Which one is right for you is still not clear, because you don't say whether "71st column" means the 71st tab-separated field of a line or the 71st character of that line. Consider
12345\t6789
Now what is the 2nd column? Is it the character 2 or the field 6789? Borodin's answer assumes it's 6789, while zdim's assumes it's 2. Both showed a solution for their reading, but these are stand-alone solutions: programs of their own, to be run from the command line.
If you want to integrate that into your Perl script you could do it like this:
Replace this line:
my $cmd6 = `fgrep -v "0/0" | fgrep -v "./." $Variantlinestsvfile > $MDLtsvfile`; print "$cmd6";
with this snippet:
open( my $fh_in, '<', $Variantlinestsvfile ) or die "cannot open $Variantlinestsvfile: $!\n";
open( my $fh_out, '>', $MDLtsvfile ) or die "cannot open $MDLtsvfile: $!\n";
while( my $line = <$fh_in> ) {
# character-based:
print $fh_out $line unless (substr($line, 70, 3) =~ m{(?:0/0|\./\.)});
# tab/field-based:
my @fields = split(/\s+/, $line);
print $fh_out $line unless ($fields[70] =~ m|([0.])/\1|);
}
close($fh_in);
close($fh_out);
Use either the character-based line or the tab/field-based lines. Not both!
Borodin and zdim condensed this snippet to a one-liner, but you must not call that from a Perl script.
Since you need the exact position and know the string lengths, substr can find it:
perl -ne 'print if not substr($_, 70, 3) =~ m{(?:0/0|\./\.)}' filename
This prints lines only when the three-character string starting at the 71st column does not match either 0/0 or ./.
The {} delimiters around the regex allow us to use / and | inside without escaping. The ?: is there so that the () are used only for grouping, not capturing. It will also work fine without the ?:, which is there only for efficiency's sake.
perl -ane 'print unless $F[70] =~ m|([0.])/\1|' myfile > newfile
The problem with your command is that you are attempting to capture the output of a command which produces no output - all the matches are redirected to a file, so that's where all the output is going.
Anyway, calling grep from Perl is just wacky. Reading the file in Perl itself is the way to go.
If you do want a single shell command,
grep -Ev $'^([^\t]*\t){70}(\./\.|0/0)\t' file
would do what you are asking more precisely and elegantly. But you can use that regex straight off in your Perl program just as well.
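For instance, the same regex inlined into a Perl one-liner might look like this (untested):
perl -ne 'print unless /^([^\t]*\t){70}(\.\/\.|0\/0)\t/' file > newfile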
Try it!
awk '{ if ($71 != "./." && $71 != "0/0") print }' old_file.txt > new_file.txt
Searched for similar problems and could not find anything that suits my needs exactly:
I have a very large HTML file scraped from multiple websites and I would like to replace all
class="key->from 2nd file"
with
style="xxxx"
At the moment I use sed - it works well but only with small files
while read key; do
    sed -i "s/class=\"$key\"/style=\"xxxx\"/g" file_to_process
done < keys
When I try to process something larger, it takes ages.
Example:
keys - Count: 1233 lines
file_to_process - Count: 1946 lines
It takes about 40 s to complete only 1/10 of the processing I need:
real 0m40.901s
user 0m8.181s
sys 0m15.253s
Untested since you didn't provide any sample input and expected output:
awk '
NR==FNR { keys = keys sep $0; sep = "|"; next }
{ gsub("class=\"(" keys ")\"","style=\"xxxx\"") }
1' keys file_to_process > tmp$$ &&
mv tmp$$ file_to_process
I think it's time to Perl (untested):
my $keyfilename = 'somekeyfile'; # or pick up from script arguments
open KEYFILE, '<', $keyfilename or die("Could not open key file $keyfilename\n");
my %keys = map { chomp; $_ => 1 } <KEYFILE>; # construct a hash for lookup speed (chomp strips the newlines)
close KEYFILE;
my $htmlfilename = 'somehtmlfile'; # or pick up from script arguments
open HTMLFILE, '<', $htmlfilename or die("Could not open html file $htmlfilename\n");
my $newchunk = qq/style="xxxx"/;
for my $line (<HTMLFILE>) {
    my $newline = $line;
    while ($line =~ m/(class="([^"]+)")/g) { # /g so the scan advances past each match
        if (defined($keys{$2})) {
            $newline =~ s/\Q$1\E/$newchunk/g; # \Q...\E so metacharacters in the key stay literal
        }
    }
    print $newline;
}
close HTMLFILE;
This uses a hash for key lookups, which should be reasonably fast, and it only does the substitution when the line contains a class attribute whose key is in the hash.
Try to generate a very long sed script with all the substitution commands from the keys file, something like:
s/class=\"key1\"/style=\"xxxx\"/g; s/class=\"key2\"/style=\"xxxx\"/g ...
and use this file.
This way you will read the input file only once.
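A sketch of generating such a script from the keys file (untested; it assumes the keys contain no characters that are special to sed):
sed 's|.*|s/class="&"/style="xxxx"/g|' keys > keys.sed
sed -f keys.sed file_to_process > output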
Here's one way using GNU awk:
awk 'FNR==NR { array[$0]++; next } { for (i in array) { a = "class=\"" i "\""; gsub(a, "style=\"xxxx\"") } }1' keys.txt file.txt
Note that the keys in keys.txt are taken as the whole line, including whitespace. If leading and trailing whitespace could be a problem, use $1 instead of $0. Unfortunately I cannot test this properly without some sample data. HTH.
First convert your keys file into a sed or-pattern which looks like this: key1|key2|key3|.... This can be done using the tr command. Once you have this pattern, you can use it in a single sed command.
Try the following:
sed -i -r "s/class=\"($(tr '\n' '|' < keys | sed 's/|$//'))\"/style=\"xxxx\"/g" file
I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut and sed and other things to basically get it down to a list with one field per line.
So the goal is to keep every line in file A whose 4th field (labelled KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, it'd be OK.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: OK, I give up. I just force-killed it.
Is there a better way to do this? File A is 13.7 MB and file B, after cutting it down, is 32.6 MB, for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
array[$0]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
gsub(/[^[:alnum:]]/,"",$12)
array[$12]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of the basic shell tools. Assuming about 40 characters per line, file A has about 400,000 lines in it and file B has about 1,200,000 lines in it. You're basically running grep for each line of file A and having grep plow through 1,200,000 lines with each execution. That's 480 BILLION line comparisons you're doing. Unix tools are surprisingly quick, but even something fast done 480 billion times will add up.
You would be better off using a full scripting language like Perl or Python. You put all the lines of file B in a hash, then take each line of file A and check whether its fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something -- off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
chomp $line;
$index{$line} = $line; #Or however you do it...
}
close $file_b;
#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
chomp $line;
my @fields = split /\s+/, $line;
if (exists $index{$fields[3]}) {
say "Line: $line";
}
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium-sized pattern files (< 1 MB). I guess it tries every pattern for each line of the input stream.
A solution which was faster for me was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable to iterating multiple times over the larger one.
while read line; do
grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while read line; do
grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
Use the below command:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA