How can I add a string with an integer in bash? [duplicate]

This question already has answers here:
Are shell scripts sensitive to encoding and line endings?
(14 answers)
Closed 2 years ago.
I want to read a txt file which contains some integers in this form:
2:10
4:4
10:15
22:5
Then I want to find the sum of each column. First, I thought of splitting each line, treating every line as a string:
columnA="$(cut -d':' -f1 <<<$line)"
columnB="$(cut -d':' -f2 <<<$line)"
columnA contains the elements of the first column and columnB the elements of the second one. Then I created a variable sumA=0 and tried to compute the sum of each column like this:
sumA=$((columnA+sumA))
I am getting the result I want, but with this message as well:
")syntax error: operand expected (error token is "
Same for the second column:
sumB=$((columnB+sumB))
This time I am getting this error and I don't get the result I want:
")syntax error: invalid arithmetic operator (error token is "
This is the code in general:
sumA=0
sumB=0
while IFS= read -r line
do
columnA="$(cut -d':' -f1 <<<$line)"
sumA=$((columnA+sumA))
columnB="$(cut -d':' -f2 <<<$line)"
sumB=$((columnB+sumB))
done < "random.txt"
echo $sumA
echo $sumB
Any thoughts?

It could be simplified just to
awk -F: '{sumA+=$1; sumB+=$2} END {printf "%s\n%s\n", sumA, sumB}' random.txt
From the manual:
$ man awk
...
-F fs
--field-separator fs
Use fs for the input field separator (the value of the FS predefined variable).
...

Instead of using cut, use built-in bash parameter expansion: "${variable}"
ColumnA=0
ColumnB=0
while read l
do
ColumnA=$((${l//:*}+$ColumnA))
ColumnB=$((${l//*:}+$ColumnB))
done < random.txt
echo $ColumnA $ColumnB
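Another pure-bash variant, as a minimal sketch, lets read do the splitting on the colon; the carriage-return strip covers the Windows-line-endings case that the duplicate link at the top points to, which is what typically produces the quoted "operand expected" errors:
sumA=0
sumB=0
while IFS=: read -r a b; do
    b=${b%$'\r'}            # drop a trailing carriage return, in case the file has CRLF line endings
    sumA=$((sumA + a))
    sumB=$((sumB + b))
done < random.txt
echo "$sumA"
echo "$sumB"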

AWK - string containing required fields

I thought it would be easy to define a string such as "1 2 3" and use it within AWK (GAWK) to extract the required fields; how wrong I have been.
I have tried creating AWK arrays, BASH arrays, splitting, string substitution etc., but could not find any method to use the resulting 'chunks' (i.e. the column/field numbers) in a print statement.
I believe Akshay Hegde has provided an excellent solution with the get_cols function, here
but it was over 8 years ago, and I am really struggling to work out 'how it works', namely what this line is doing:
s = length(s) ? s OFS $(C[i]) : $(C[i])
I am unable to post a comment asking for clarification due to my lack of reputation (and it is an old post).
Is someone able to explain how the solution works?
NB I don't think I need the sub as I am using the following to clean up (replace all non-numeric characters with a comma, i.e. a separator, and sort numerically):
Columns=$(echo $Input_string | sed 's/[^0-9]\+/,/g')
Columns=$(echo $Columns | xargs -n1 | sort -n | xargs)
(using this string, the awk would be executed as awk -v cols=$Columns -f test.awk infile in the given solution)
Given the informative answer from @Ed Morton, with a nice worked example, I have attempted to remove the need for a function (and also an additional awk program file). The intention is to have this within a shell script, and I would rather it be self-contained; it is also further investigation into 'how it works'.
Fields="1 2 3"
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print "s="s " arr1="Column[1]" arr2="Column[2]" arr3="Column[3]}'
The results have surprised me (taking note of my Comment to Ed)
s=1 2 3 arr1=1 arr2=2 arr3=3
The above clearly shows the split has worked into the array, but I thought s would include $ for each ternary operator concatenation, i.e. "$1 $2 $3"
Moreover, I was hoping to append the actual file to the above command, which I have found allows me to use echo $string | awk '{program}' file.name
NB it is a little insulting that my question has been marked as -1 indicating little research effort, as I have spent days trying to work this out.
Taking all the information above, I think s results in "1 2 3", but the print doesn't accept this in the same way as it does when it is called from a function; it simply tries to 'print 1 2 3' in relation to the file, which seems to be how all my efforts have ended up.
This really confuses me, as Ed's 'diagonal' example works from command line, indicating that concept of 'print s' is absolutely fine when used with a file name input.
Can anyone suggest how this (example below) can work?
I don't know if using echo pipe and appending the file name is strictly allowed, but it appears to work (?!?!?!)
(failed result)
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print s}' myfile.txt
This appears to go through myfile.txt and output all lines containing many comma-separated values, i.e. the whole file (I haven't included the values; this is just for illustration)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
what this is doing; s = length(s) ? s OFS $(C[i]) : $(C[i])
You have encountered a ternary operator; it has the following syntax:
condition ? valueiftrue : valueiffalse
The length function, when given a single argument, returns the number of characters. In GNU AWK the integer 0 is treated as false and other integers as true, so here it acts as an is-not-empty check. When s is not empty (it might also not be initialized yet, in which case GNU AWK assumes an empty string), it is concatenated with the output field separator (OFS, a space by default) and the C[i]-th field value, and the result is assigned back to s; when s is empty, it is set to the C[i]-th field value alone. Used multiple times, this builds a string of values separated by OFS. Consider the following simple example: say you want to get the diagonal of a 2D matrix, stored in file.txt with the following content
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
then you might do
awk '{s = length(s) ? s OFS $(NR) : $(NR)}END{print s}' file.txt
which will get output
1 7 13 19 25
Explanation: NR is the row number, so for the 1st row $(NR) is the 1st field, for the 2nd row it is the 2nd field, for the 3rd the 3rd field, and so on.
(tested in GNU Awk 5.0.1)
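Applying the same pattern to the original goal (picking the fields listed in a shell variable out of a file), a minimal self-contained sketch, assuming the field numbers sit space-separated in $Fields and the data file is called infile:
Fields="1 2 3"
awk -v cols="$Fields" '
  BEGIN { n = split(cols, C, " ") }   # split the requested field numbers into array C
  {
    s = ""                            # reset the output string for every input line
    for (i = 1; i <= n; i++)
      s = length(s) ? s OFS $(C[i]) : $(C[i])
    print s
  }' infile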

How to substitute a substring delimited by separator in a specific position with sed [duplicate]

This question already has answers here:
Sed replace at second occurrence
(3 answers)
Closed 1 year ago.
I have a record like this:
aaa|11|bcvdgsf|11|eyetwrt|11|kkjdksjk
I would like to substitute the second occurrence of "11" with another substring, for example XX. So the output would be
aaa|11|bcvdgsf|XX|eyetwrt|11|kkjdksjk
I tried to use this command:
#echo "z|11|a|11|b|11|" | sed 's/\(|11\{2\}\)/\|XX/'
but the record does not change.
You could use this to say "replace the second instance of 11 delimited by word boundaries on both sides with XX:"
$ sed 's/\b11\b/XX/2' <<< 'aaa|11|bcvdgsf|11|eyetwrt|11|kkjdksjk'
aaa|11|bcvdgsf|XX|eyetwrt|11|kkjdksjk
This requires GNU sed for \b support.
If only whole field has to be matched:
$ cat ip.txt
z|11|a|11|b|11|
aaa|11|bcvdgsf|11|eyetwrt|11|kkjdksjk
11|11.2|abc|11|cca
11||11
11|11|ac
a|11|asd|11
$ awk 'BEGIN{FS=OFS="|"}
{c=0; for(i=1; i<=NF; i++) if($i=="11" && ++c==2) $i="XX"}
1' ip.txt
z|11|a|XX|b|11|
aaa|11|bcvdgsf|XX|eyetwrt|11|kkjdksjk
11|11.2|abc|XX|cca
11||XX
11|XX|ac
a|11|asd|XX
FS=OFS="|" use | as input and output field separator
c=0 initialize counter for every line
for(i=1; i<=NF; i++) to loop over all input fields
$i=="11" && ++c==2 if field content is exactly 11, increment counter and check if it is the second match
$i="XX" change field content as needed
1 idiomatic way to print $0
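The same idea with the occurrence number and the search value exposed as variables, as a sketch (the occ and val names are mine, not part of the original answer):
awk -v occ=2 -v val='11' 'BEGIN{FS=OFS="|"}
  {c=0; for(i=1; i<=NF; i++) if($i==val && ++c==occ) $i="XX"}
  1' ip.txt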
Similar logic with perl using lookarounds to match field boundary:
perl -lpe '$c=0; s/(?<![^|])11(?![^|])/++$c==2 ? "XX" : $&/ge'
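For example, applied to the record from the question:
$ perl -lpe '$c=0; s/(?<![^|])11(?![^|])/++$c==2 ? "XX" : $&/ge' <<< 'aaa|11|bcvdgsf|11|eyetwrt|11|kkjdksjk'
aaa|11|bcvdgsf|XX|eyetwrt|11|kkjdksjk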
Please try my (working) solution:
echo "z|11|a|11|b|11|" | sed -r 's/^([a-z]+\|11\|[a-z]+\|)(11)(.+)$/\1XX\3/'

Can't input date variable in bash

I have a directory /user/reports under which there are many files; one of them is:
report.active_user.30092018.77325.csv
I need the number after the date as output, i.e. 77325 from the above file name.
I created the command below to find that value from the file name:
ls /user/reports | awk -F. '/report.active_user.30092018/ {print $(NF-1)}'
Now I want the current date to be passed into the above command as a variable, to get the result:
ls /user/reports | awk -F. '/report.active_user.$(date +'%d%m%Y')/ {print $(NF-1)}'
But I am not getting the required output.
I tried this bash script:
#!/usr/bin/env bash
_date=`date +%d%m%Y`
active=$(ls /user/reports | awk -F. '/report.active_user.${_date}/ {print $(NF-1)}')
echo $active
But the output is still blank.
Please help with proper syntax.
As @Cyrus said, you must use double quotes in your variable assignment, because single quotes are only for literal strings and do not expand variables.
Bad use case
number=10
string='I m sentence with or without var $number'
echo $string
Correct use case
number=10
string_with_number="I m sentence with var $number"
echo $string_with_number
You can use single quotes as long as they do not enclose the whole string:
number=10
string_with_number='I m sentence with var '$number
echo $string_with_number
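Applied to the original command, a minimal sketch (keeping the question's /user/reports path and filename pattern, and passing the date into awk with -v instead of embedding it in the quoted program):
_date=$(date +%d%m%Y)
# match lines like report.active_user.<today> and print the field before the extension
active=$(ls /user/reports | awk -F. -v d="$_date" '$0 ~ ("report.active_user." d) {print $(NF-1)}')
echo "$active"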
Don't parse ls
You don't need awk for this: you can manage with the shell's capabilities
for file in report.active_user."$(date "+%d%m%Y")"*; do
tmp=${file%.*} # remove the extension
number=${tmp##*.} # remove the prefix up to and including the last dot
echo "$number"
done
See https://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given in a separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
this code works well for the multiline sequences, but the IDs have to be inserted into the code one at a time.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.
I guess the right approach would be to modify the RS variable in code (2), but all of my attempts have failed so far. Can anybody please help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
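The one-liner above, spread out with comments (same logic, just reformatted for readability):
awk -F'>' '
    NR==FNR { ids[$0]; next }     # first file (id.txt): remember every wanted ID
    NF>1    { f = ($2 in ids) }   # a ">ID" header line: set the flag if the ID is wanted
    f                             # print the current line while the flag is on
' id.txt seq.fasta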
The following awk may help you with the same:
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, e.g. DM_0000000004, using the script below.
seqnm=$1
idx0_idx1=`grep -n $seqnm multi-fasta.idx`
idx0=`echo $idx0_idx1 | cut -d ":" -f 1`
idx0plus1=`expr $idx0 + 1`
idx1=`echo $idx0_idx1 | cut -d ":" -f 2`
idx2=`head -n $idx0plus1 multi-fasta.idx | tail -1 | cut -d ":" -f 1`
idx2minus1=`expr $idx2 - 1`
sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > ${seqnm}.fasta
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
At last, using sed, we can extract the lines between idx1 and idx2 minus 1, which are the title and the sequence of the wanted entry (if every sequence had a fixed number of lines, grep -A would work instead).
The advantage of this ugly-hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.
What bothers me is that this process is slow. For my 25G multi-fasta file, such an extraction takes tens of seconds. However, it's much faster than using samtools faidx.
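For comparison, the samtools faidx route mentioned above looks like this, as a sketch; it assumes samtools is installed and that the header lines carry only the ID (note that IDs containing colons, like the ones in the question, can clash with samtools' name:start-end region syntax):
samtools faidx multi-fasta.fa                                         # build the .fai index once
samtools faidx multi-fasta.fa DM_0000000004 > DM_0000000004.fasta     # extract one sequence by name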

Print a string separated by dots in reverse order [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
I need to print a string in reverse order in bash shell.
[deeps#host1:~]echo maps.google.com|rev
moc.elgoog.spam
[deeps#host1:~]
But I need it as "com.google.maps". I need it in general for any string separated by period (.).
It should print in reverse order. How do I do that?
I need the solution in Perl as well.
Split by . then reverse the results, then join them up again.
Perl Command Switches:
-l means "remove/add the newline character from input/output."
-p means "execute the program for each line of input, and print $_ after each execution."
-e means "the following argument is the code of the program to run."
perldoc perlrun for more details.
echo maps.google.com | perl -lpe '$_ = join ".", reverse split /\./;'
output
com.google.maps
It also works if you have a data file with lots of rows.
input
maps.google.com
translate.google.com
mail.google.com
run
perl -lpe '$_ = join ".", reverse split /\./;' input
output
com.google.maps
com.google.translate
com.google.mail
Using a bunch of utils:
$ tr '.' $'\n' <<< 'maps.google.com' | tac | paste -s -d '.'
com.google.maps
This replaces all periods with newlines (tr), then reverses the order of the lines (tac), then pastes the lines serially (paste -s), with the period as the delimiter (-d '.').
Considerably uglier (or just wordier?) in pure Bash:
# Read string into array 'arr', split at periods by setting IFS
IFS=. read -a arr <<< 'maps.google.com'
# Loop over array from the end
for (( i = $(( ${#arr[@]}-1 )); i >= 0; --i )); do
# Append element plus a period to result string
res+=${arr[i]}.
done
# Print result string minus the last period
echo "${res%.}"
$ echo maps.google.com | awk -F. '{for (i=NF;i>0;i--) printf "%s%s",$i,(i==1?"\n":".")}'
com.google.maps
How it works
-F.
This tells awk to use a period as the field separator
for (i=NF;i>0;i--) printf "%s%s",$i,(i==1?"\n":".")
This loops over all fields, starting with the last and ending with the first and printing them, followed by a period (except for the first field which is followed by a newline).
The one tricky part above is (i==1?"\n":"."). This is called a ternary statement. The part before the ? is a logical condition. If the condition is true then the value after the question mark, but before the :, is used. If it is false, then the value after the : is used. In this case, that means that, when we are on the first field i==1, then the statement returns a newline, \n. If we are on any other field, it returns a period, .. We use this to put a period after all the fields except for the first (which, in the output, is printed last). After it, we put a newline.
For more on ternary statements, see the GNU docs.
Solution in Perl:
$str = "maps.google.com";
#arr =split('\.',$str);
print join(".",reverse #arr);
output:
com.google.maps
Split the string on "." and reverse the array. Join the reversed array using ".".
