I have a file which contains some strings. For example, in the file my_file.txt, I have the strings
foo_eusa.r1
foo_chnc.r5
foo_deu.r10
.
.
.
Now I want to check whether a substring exists in the file; if it exists I want to modify the entire string, and if it does not I will add it.
For example, I have a new string foo_eusa.r4 and I want to search for all occurrences of the substring foo_eusa in the file. If it exists (above it does, as foo_eusa.r1), then replace r1 with r4 so the string becomes foo_eusa.r4 instead of foo_eusa.r1 in all occurrences. If foo_eusa does not exist, then the new string foo_eusa.r4 is to be added.
I tried checking with grep -q, but it gives only the first match, and I also could not find a way to replace the substrings.
Perl to the rescue:
perl -i~ -pe 's/^foo_eusa\..*/foo_eusa.r4/ and $changed = 1;
END { print "foo_eusa.r4\n" unless $changed }' -- file
If you need to process more than one file, it's a bit more complex:
perl -i~ -ne '
  s/^foo_eusa\..*/foo_eusa.r4/ and $changed = 1;
  print;
  if (eof) {
    print "foo_eusa.r4\n" unless $changed;
    $changed = 0;
  }' -- file*
-i~ modifies the file "in place", leaving a backup with the ~ extension
-p reads the input line by line and outputs each line after processing
-n is like -p, but it doesn't output the line unless told to print
$changed is used as a flag. When the substitution is triggered, it's set to 1. If it's not 1 at the END (i.e. when the processing is finished), the string is added to the output. In case of several files, the flag must be handled for each file, so eof is used to indicate the end of file.
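For comparison, here is a rough awk sketch of the same search-or-append idea; awk has no portable in-place option, so this writes to a temporary file and moves it back:
awk '
  /^foo_eusa\./ { $0 = "foo_eusa.r4"; changed = 1 }   # rewrite any matching line
  { print }                                           # pass every line through
  END { if (!changed) print "foo_eusa.r4" }           # append if nothing matched
' my_file.txt > my_file.txt.new && mv my_file.txt.new my_file.txt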
I have a directory that includes a lot of java files, and in each file I have a class variable:
String system = "x";
I want to be able to create a bash script, executed in the same directory, which will go through only the java files in the directory and replace this instance of x with y. Here x and y are each a single word. This may not be the only instance of the word x in the Java file, but it will definitely be the first.
I want to be able to execute my script in the command line similar to:
changesystem.sh -x -y
This way I can specify what the x should be, and the y I wish to replace it with. I found a way to find and print the line number at which the first instance of a pattern is found:
awk '$0 ~ /String system/ {print NR}' file
I then found how to replace a substring on a given line using:
awk 'NR==line_number { sub("x", "y") }'
However, I have not found a way to combine them. Maybe there is also an easier way? Or even, a better and more efficient way?
Any help/advice will be greatly appreciated
You may create a changesystem.sh file with the following GNU awk script:
#!/bin/bash
for f in *.java; do
awk -i inplace -v repl="$1" '
  !x && /^\s*String\s+system\s*=\s*".*";\s*$/ {
    lwsp = gensub(/\S.*/, "", 1);
    print lwsp "String system = \"" repl "\";";
    x = 1; next;
  }1' "$f";
done;
Or, with any awk:
#!/bin/bash
for f in *.java; do
awk -v repl="$1" '
  !x && /^[[:space:]]*String[[:space:]]+system[[:space:]]*=[[:space:]]*".*";[[:space:]]*$/ {
    lwsp = $0; sub(/[^[:space:]].*/, "", lwsp);
    print lwsp "String system = \"" repl "\";";
    x = 1; next
  }1' "$f" > tmp && mv tmp "$f";
done;
Then, make the file executable:
chmod +x changesystem.sh
Then, run it like
./changesystem.sh 'new_value'
Notes:
for f in *.java; do ... done iterates over all *.java files in the current directory
-i inplace - a GNU awk feature to edit the file in place (not available in non-GNU awk)
-v repl="$1" passes the first argument of the script to the awk command
!x && /^\s*String\s+system\s*=\s*".*";\s*$/ - if x is false and the record starts with any amount of whitespace (\s* or [[:space:]]*), then String, one or more whitespace characters, system, then = enclosed in zero or more whitespace characters, then a " char, then any text, ending with "; and zero or more whitespace characters, then
lwsp=gensub(/\S.*/, "", 1); puts the leading whitespace in the lwsp variable (it removes all text starting with the first non-whitespace char from the line matched)
lwsp=$0; sub(/[^[:space:]].*/, "", lwsp); - same as above, just in a different way since gensub is not supported in non-GNU awk and sub modifies the given input string (here, lwsp)
{print "String system = \""repl"\";";x=1;next}1 - prints the String system = " + the replacement string + ";, assigns 1 to x, and moves to the next line, else, just prints the line as is.
You don't need to pre-compute the line number. The whole job can be done by one not-too-complicated sed command. You probably do want to script it, though. For example:
#!/bin/bash
[[ $# -eq 3 ]] || {
echo "usage: $0 <context regex> <target regex> <replacement text>" 1>&2
exit 1
}
sed -si -e "/$1/ { s/\\<$2\\>/$3/; t1; p; d; :1; n; b1; }" ./*.java
That assumes that the files to modify are java source files in the current working directory, and I'm sure you understand the (loose) argument check and usage message.
As for the sed command itself,
the -s option instructs sed to treat each argument as a separate stream, instead of operating as if by concatenating all the inputs into one long stream.
the -i option instructs sed to modify the designated files in-place.
the sed expression takes the default action for each line (printing it verbatim) unless the line matches the "context" pattern given by the first script argument.
for lines that do match the context pattern,
s/\\<$2\\>/$3/ - attempt to perform the wanted substitution
the \< and \> match word start and end boundaries, respectively, so that the specified pattern will not match a partial word (though it can match multiple complete words if the target pattern allows)
t1 - if a substitution was made, then branch to label 1, otherwise
p; d - print the current line and immediately start the next cycle
:1; n; b1 - label 1 (reachable only by branching): print the current line and read the next one, then loop back to label 1. This prints the remainder of the file without any more tests or substitutions.
Example usage:
/path/to/replace_first.sh 'String system' x y
It is worth noting that this does expose the user to some details of sed's interpretation of regular expressions and replacement text, though that does not manifest for the example usage.
Note that that could be simplified by removing the context pattern bit if you are sure you want to modify the overall first appearance of the target in each file. You could also hard-code the context, the target pattern, and/or the replacement text. If you hard-code all three then the script would no longer need any argument handling or checking.
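For example, a sketch of a simplified two-argument variant with no context pattern (the script name is made up); it replaces the first appearance of the target anywhere in each file:
#!/bin/bash
[[ $# -eq 2 ]] || {
echo "usage: $0 <target regex> <replacement text>" 1>&2
exit 1
}
sed -si -e "s/\\<$1\\>/$2/; t1; p; d; :1; n; b1;" ./*.java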
I have a file with one column containing 2059 ID numbers.
I want to add a second column with the word 'pop1' for all the 2059 ID numbers.
The second column will just mean that the ID number belongs to population 1.
How can I do this is linux using awk or sed?
The file currently has one column which looks like this
45958
480585
308494
I want it to look like:
45958 pop1
480585 pop1
308494 pop1
Maybe not the most elegant solution, and it doesn't use sed or awk, but this is what I would do:
while read -r line; do echo "$line pop1" >> newfile; done < test
This command will append stuff in the file 'newfile', so be sure that it's empty or it doesn't exist before executing the command.
Here is the resource I used, on reading a file line by line : https://www.cyberciti.biz/faq/unix-howto-read-line-by-line-from-file/
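Since the question mentions awk or sed, either of these one-liners would also do the job (sketches, using the same file names as above):
awk '{print $0, "pop1"}' test > newfile
sed 's/$/ pop1/' test > newfile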
A Perl solution.
$ perl -lpi -e '$_ .= " pop1"' your-file-name
Command line options:
-l : remove newline from input and replace it on output
-p : put each line of input into $_ and print $_ at the end of each iteration
-i : in-place editing (overwrite the input file)
-e : run this code for each line of the input
The code ($_ .= " pop1") just appends your string to the input record.
I've one string like this:
myString='value1|value57|value31|value21'
and I've a file, called values_to_remove.txt containing a list of values, one per line, in this way
values_to_remove.txt
value1
value31
In bash, how can I remove the values contained in "values_to_remove.txt" from the string, taking into account that the values are separated by pipes and, of course, that if I remove a value I also have to remove the preceding or following pipe, if any?
I've achieved this in Python and called the Python script from bash, but I need to do this directly in bash with a one-line command rather than a small script; otherwise I can just keep using my little Python script.
This is the Python code:
myString = 'value1|value2|value3|value4'
arrString = myString.split("|")
with open("myfile.txt", encoding="utf-8") as file:
    for l in file:
        l = l.strip()
        if l in arrString:
            arrString.remove(l)
myNewString = "|".join(arrString)
Note that the values separated by pipes can be any strings.
Thank you
You may use this awk:
awk -v str="$myString" 'BEGIN {
n = split(str, a, /\|/)
}
{
val[$1]
}
END {
for (i=1; i<=n; i++)
if (!(a[i] in val))
s = (s == "" ? "" : s "|") a[i]
print s
}' values_to_remove.txt
value57|value21
This awk command first uses the split function to split the input string on |.
It stores all the values to be removed in another array, val.
In the END block it loops through the split array and builds a string from the values that are not found in the to-be-removed array.
Here is a bash solution (the if statement is a runtime optimization to skip the replacement in case of no match, thanks @Inian):
for val in value1 value31; do
if [[ "$mystring" =~ \|$val|$val\| ]]; then
mystring=${mystring/$BASH_REMATCH/}
fi
done
This uses pure bash to find the first match of a regular expression matching either |value or value| and removes it. Note you cannot match both at the same time, because then you would delete too many separators. If there is a chance there are no separators, you need to add ? after each pipe (maybe just after the second one is enough).
You can also avoid regular expressions and just attempt to delete both a prior and a posterior pipe:
for val in value1 value31; do
mystring=${mystring/|$val/};
mystring=${mystring/$val|/};
done
All of these can be written on one line if you really need to:
for val in value1 value31; do [[ "$mystring" =~ \|$val|$val\| ]]; mystring=${mystring/$BASH_REMATCH/}; done
A pure bash solution:
#!/usr/bin/env bash
# Define the location of the values-to-be-removed file
: ${PATH_TO_FILE:=${1:-"./values_to_remove.txt"}}
# Define the string we will be working with
: ${MY_STRING:=${2:-"value1|value57|value31|value21"}}
# Process all entries in PATH_TO_FILE, one by one
while read -r substring || [[ -n "$substring" ]]; do
# Remove "substring|" from the beginning of MY_STRING
MY_STRING=${MY_STRING#${substring}|}
# Remove "|substring" from the rest of MY_STRING
MY_STRING=${MY_STRING//|${substring}}
done < "${PATH_TO_FILE}"
# Return the results
echo ${MY_STRING}
Why do we...
Use ${VAR_NAME:=${1:-"DEFAULT_VALUE"}} notation - To allow the user to customise script's inputs either via environment variables or script arguments. Basically, this notation says:
If VAR_NAME environment variable exists, then use it;
If VAR_NAME doesn't exist, then set VAR_NAME to the value of the first argument to the script;
If the first argument doesn't exist either, then set VAR_NAME to the DEFAULT_VALUE; see the small sketch after this list.
Use read -r substring || [[ -n "$substring" ]] to read the file? – read lets us read the content of the ./values_to_remove.txt file, line by line. The [[ -n "$substring" ]] bit is there to catch the last line of the file if it doesn't end with a newline.
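A minimal sketch of the default-value notation in isolation (the variable name, default value and script name are all made up):
#!/usr/bin/env bash
# demo.sh - hypothetical illustration of the ${VAR:=${1:-default}} pattern
: "${GREETING:=${1:-hello}}"
echo "$GREETING"
Then GREETING=hi ./demo.sh prints hi (the environment variable wins), ./demo.sh bonjour prints bonjour (falls back to the first argument), and plain ./demo.sh prints hello (falls back to the default).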
References:
Assign a default value in bash
Return default value in bash
Bash substring removal
Bash search and replace
What is the best way to remove all lines from a text file starting at first empty line in Bash? External tools (awk, sed...) can be used!
Example
1: ABC
2: DEF
3:
4: GHI
Line 3 and 4 should be removed and the remaining content should be saved in a new file.
With GNU sed:
sed '/^$/Q' "input_file.txt" > "output_file.txt"
With AWK:
$ awk '/^$/{exit} 1' test.txt > output.txt
Contents of output.txt
$ cat output.txt
ABC
DEF
Walkthrough: For lines that match ^$ (start-of-line, end-of-line), exit (the whole script). For all lines, print the whole line; of course, we won't get to this part after a line has made us exit.
I bet there are more clever ways to do this, but here's one using bash's 'read' builtin. The question asks us to keep the lines before the blank in one place and send the lines after the blank to another file. You could send some of standard output one place and some another if you are willing to use 'exec' and reroute stdout mid-script, but I'll take a simpler approach and use a command-line argument to tell the script where the post-blank data should go:
#!/bin/bash
# script takes as argument the name of the file to receive the data found
# after the first blank line; that file should be empty or absent beforehand,
# since the script appends to it
found_blank=0
while read -r stuff; do
  if [ -z "$stuff" ] ; then
    # remember we hit the blank line, and drop the blank line itself
    found_blank=1
    continue
  fi
  if [ "$found_blank" -eq 1 ] ; then
    echo "$stuff" >> "$1"
  else
    echo "$stuff"
  fi
done
run it like this:
$ ./delete_from_empty.sh rest_of_stuff < demo
output is:
ABC
DEF
and 'rest_of_stuff' has
GHI
if you want the before-blank lines to go somewhere else besides stdout, simply redirect:
$ ./delete_from_empty.sh after_blank < input_file > before_blank
and you'll end up with two new files: after_blank and before_blank.
Perl version
perl -e '
  open $fh, ">", "stuff";
  open $efh, ">", "rest_of_stuff";
  while (<>) {
    if ($_ !~ /\w+/) {
      $fh = $efh;
    }
    print $fh $_;
  }
' demo
This creates two output files and iterates over the demo data. When it hits a blank line, it flips the output from one file to the other.
Creates
stuff:
ABC
DEF
rest_of_stuff:
<blank line>
GHI
Another awk would be:
awk -vRS= '1;{exit}' file
By setting the record separator RS to an empty string, we define the records as paragraphs separated by sequences of empty lines. It is now easy to adapt this to select the nth block:
awk -vRS= '(FNR==n){print;exit}' file
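For example, to print the second block, n can be passed on the command line:
awk -vRS= -v n=2 '(FNR==n){print;exit}' file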
There is a problem with this method when processing files with DOS line endings (CRLF). There will be no empty lines, as every line will still contain a CR. But this problem applies to all the methods presented here.
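If CRLF input is a possibility, one workaround (a sketch) is to strip the carriage returns first and then apply any of the methods above:
tr -d '\r' < file | awk -vRS= '1;{exit}' > output.txt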
I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut and sed and other things to basically get it down to one field, which is a list.
So the goal is to keep all lines in file A whose 4th field (it says KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, that would be OK.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: OK, I give up. I just force-killed it.
Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
array[$0]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
gsub(/[^[:alnum:]]/,"",$12)
array[$12]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of the basic shell tools. Assuming about 40 characters per line, File A has 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 BILLION line comparisons. Unix tools are surprisingly quick, but even something fast done 480 billion times adds up.
You would be better off using a full programming scripting language like Perl or Python. You put all lines in File B in a hash. You take each line in File A, check to see if that fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Here's something off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
    chomp $line;
    $index{$line} = $line;    # Or however you do it...
}
close $file_b;

#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
    chomp $line;
    my @fields = split /\s+/, $line;
    if (exists $index{$fields[3]}) {
        say "Line: $line";
    }
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.
A solution which was faster for me was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable to iterating multiple times over the larger one.
while read -r line; do
  grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while read -r line; do
  grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
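If the lines in fileBcutdown are fixed strings rather than regular expressions, telling grep so usually speeds things up considerably while preserving the order of fileA (a sketch):
grep -F -f fileBcutdown fileA > outputfile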
Use the command below:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA
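Note that this compares the whole fourth field, so if that field carries a version suffix as in the sample line from file A above (ENST00000449339.1), a variant that strips the suffix first may be needed; a sketch:
awk 'FNR==NR{a[$0];next} {k=$4; sub(/\.[0-9]+$/, "", k); if (k in a) print}' <your filtered fileB with single field> fileA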