how to add numbers in C-shell - linux

I have a question about C-shell. In my script, I want to automatically add all the numbers and get the total number. How to implement such function in C-shell?
My script is shown below:
#!/bin/csh -f
set log_list = $1
echo "Search begins now"
foreach subdir(`cat $log_list`)
grep "feature identified" "$subdir" -A1 | grep "ne=" | awk '{print $7}'
echo "done"
end
For this script, it will grep the log file "log_list" for keyword "feature identified" and the next line containing keyword "ne=". I care about the number after "ne=", for example ne=140.
Then the grep output will be like this:
ne=100
ne=115
ne=120
...
There are more than 1K lines of such numbers. Of course I can redirect the grep data to a new file(in Linux). Then copy all the data into Excel spreadsheet to add them up. But I want to do this in the script. And it will make thing easier.
The final result should be like this:
total_ne=335
Do you know how to do this in the C-shell? Thanks!

Related

Using awk to delete multiple lines using argument passed on via function

My input.csv file is semicolon separated, with the first line being a header for attributes. The first column contains customer numbers. The function is being called through a script that I activate from the terminal.
I want to delete all lines containing the customer numbers that are entered as arguments for the script. EDIT: And then export the file as a different file, while keeping the original intact.
bash deleteCustomers.sh 1 3 5
Currently only the last argument is filtered from the csv file. I understand that this is happening because the output file gets overwritten each time the loop runs, restoring all previously deleted arguments.
How can I match all the lines to be deleted, and then delete them (or print everything BUT those lines), and then output it to one file containing ALL edits?
delete_customers () {
echo "These customers will be deleted: "$#""
for i in "$#";
do
awk -F ";" -v customerNR=$i -v input="$inputFile" '($1 != customerNR) NR > 1 { print }' "input.csv" > output.csv
done
}
delete_customers "$#"
Here's some sample input (first piece of code is the first line in the csv file). In the output CSV file I want the same formatting, with the lines for some customers completely deleted.
Klantnummer;Nationaliteit;Geslacht;Title;Voornaam;MiddleInitial;Achternaam;Adres;Stad;Provincie;Provincie-voluit;Postcode;Land;Land-voluit;email;gebruikersnaam;wachtwoord;Collectief ;label;ingangsdatum;pakket;aanvullende verzekering;status;saldo;geboortedatum
1;Dutch;female;Ms.;Josanne;S;van der Rijst;Bliek 189;Hellevoetsluis;ZH;Zuid-Holland;3225 XC;NL;Netherlands;JosannevanderRijst#dayrep.com;Sourawaspen;Lae0phaxee;Klant;CZ;11-7-2010;best;tand1;verleden;-137;30-12-1995
2;Dutch;female;Mrs.;Inci;K;du Bois;Castorweg 173;Hengelo;OV;Overijssel;7557 KL;NL;Netherlands;InciduBois#gustr.com;Hisfireeness;jee0zeiChoh;Klant;CZ;30-8-2015;goed ;geen;verleden;188;1-8-1960
3;Dutch;female;Mrs.;Lusanne;G;Hijlkema;Plutostraat 198;Den Haag;ZH;Zuid-Holland;2516 AL;NL;Netherlands;LusanneHijlkema#dayrep.com;Digum1969;eiTeThun6th;Klant;Achmea;12-2-2010;best;mix;huidig;-335;9-3-1973
4;Dutch;female;Dr.;Husna;M;Hoegee;Tiendweg 89;Ameide;ZH;Zuid-Holland;4233 VW;NL;Netherlands;HusnaHoegee#fleckens.hu;Hatimon;goe5OhS4t;Klant;VGZ;9-8-2015;goed ;gezin;huidig;144;12-8-1962
5;Dutch;male;Mr.;Sieds;D;Verspeek;Willem Albert Scholtenstraat 38;Groningen;GR;Groningen;9711 XA;NL;Netherlands;SiedsVerspeek#armyspy.com;Thade1947;Taexiet9zo;Intern;CZ;17-2-2004;beter;geen;verleden;-49;12-10-1961
6;Dutch;female;Ms.;Nazmiye;R;van Spronsen;Noorderbreedte 180;Amsterdam;NH;Noord-Holland;1034 PK;NL;Netherlands;NazmiyevanSpronsen#jourrapide.com;Whinsed;Oz9ailei;Intern;VGZ;17-6-2003;beter;mix;huidig;178;8-3-1974
7;Dutch;female;Ms.;Livia;X;Breukers;Everlaan 182;Veenendaal;UT;Utrecht;3903
Try this in loop..
awk -v variable=$var '$1 != variable' input.csv
awk - to make decision based on columns
-v - to use a variable into a awk command
variable - store the value for awk to process
$var - to search for a specific string in run-time
!= - to check if not exist
input.csv - your input file
It's awk's behavior, when you use -v it can will work with variable on run-time and provide an output that doesn't contain the value you passed. This way, you get all the values that are not matching to your variable. Hope this is helpful. :)
Thanks
This bash script should work:
!/bin/bash
FILTER="!/(^"$(echo "$#" | sed -e "s/ /\|^/g")")/ {print}"
awk "$FILTER" input.csv > output.csv
The idea is to build an awk relevant FILTER and then use it.
Assuming the call parameters are: 1 2 3, the filter will be: !/(^1|^2|^3)/ {print}
!: to invert matching
^: Beginning of the line
The input data are in the input.csv file and output result will be in the output.csv file.

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given by separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
this code works well for the multiline sequences but IDs have to be inserted separately to the code.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.
I guess that the good thing would be to modify the RS variable in the code (2) but all of my attempts failed so far. Can, please, anybody help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
Following awk may help you on same.
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, eg. DM_0000000004, using the scripts below.
seqnm=$1
idx0_idx1=`grep -n $seqnm multi-fasta.idx`
idx0=`echo $idx0_idx1 | cut -d ":" -f 1`
idx0plus1=`expr $idx0 + 1`
idx1=`echo $idx0_idx1 | cut -d ":" -f 2`
idx2=`head -n $idx0plus1 multi-fasta.idx | tail -1 | cut -d ":" -f 1`
idx2minus1=`expr $idx2 - 1`
sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > ${seqnm}.fasta
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
At last, using sed, we can extract the lines between idx1 and idx2 minus 1, which are the title and the sequence, in which case you can use grep -A.
The advantage of this ugly-hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.
What bothers me is this process is slow. For my 25G multi-fasta file, such extraction takes tens of seconds. However, it's much faster than using samtools faidx .

Paste output into a CSV file in bash with paste command

I am using the paste command as below:
message+=$(paste <(echo "${arr[0]}") <(echo "$starttimeF") <(echo "$ttime") <(echo "$time") | column -t)
echo "$message"
It gives me the following output:
ASPPPLD121 11:45:00 00:00:16 00:02:23
FPASDDF123 11:45:00 00:00:16 00:02:23
ZWASD77F0D 09:04:58 02:40:18 03:51:10
DDPADSDSD5 11:29:41 00:15:35 01:17:33
How do I redirect this to a CSV or an EXCEL FILE?
OR
How do I put it into HTML table?
You have three questions in one. The easiest to solve is the CSV file output:
echo $message | awk 'BEGIN{OFS=","}{$1=$1}1' > mycsvfile.csv
This will also open in excel since excel can read csv files, so maybe that solves the second question?
Awk is opening the file and reading line by line. We set the OFS (Output Field Seperator) to a comma. Then we just set one of the fields equal to itself, which is a just a trick to get awk to process the record, and let it print out the results to the CSV file.
I suppose you could use awk to print out a HTML table too, but it seems like a gross way of doing things:
echo $message | awk 'BEGIN{print "<table>"} {print "<tr>";for(i=1;i<=NF;i++)print "<td>" $i"</td>";print "</tr>"} END{print "</table>"}'
html output awk script stolen from here

Linux Bash Script for calculating average of multiple files

I am writing a scipt which will take parameter of the folder which it will do the job. The aim is to calculate average number of reviews and print the result next to the name of the file. I wrote the script for only one file it works okay but I couldn't find any solutions to do it on multiple files. I should get an output like ;
% ./averagereviews.sh path_to_folder
hotel_11212 3.51
hotel_2121 2.62
hotel_31212 2.43
...
I done this task for only one hotel and the code is like this;
grep "<Overall>" $1 | sed 's/<Overall>//g'| awk '{SUM += $1} END {print SUM/NR}'
This simply searches the word "" in the file and gets the number next to it, then adds these numbers and divides the sum with NR to find average.
when I run it the output is the average value for the given hotel
./averagereviews.sh hotel_190158.dat
4.00578
But I should do this to multiple .dat files in a folder with printing the name of hotel. How can I do that ?
You could "cheat"
> cat averagereviews.sh
#!/bin/bash
SUM=0
data_files=$(ls $1/dataFile*.dat)
cat $data_files | grep "<Overall>" | sed -e 's/<Overall>//g' | awk '{SUM += $1} END {print SUM/NR}'
and run (from wherever, with whichever paths you need)
> ~/tools/averagereviews.sh /tmp/data/
Simply, I'm cating all the files first, and apply your command to the rest - having it behave like the pipe is a single file.

Extracting part of a string to a variable in bash

noob here, sorry if a repost. I am extracting a string from a file, and end up with a line, something like:
abcdefg:12345:67890:abcde:12345:abcde
Let's say it's in a variable named testString
the length of the values between the colons is not constant, but I want to save the number, as a string is fine, to a variable, between the 2nd and 3rd colons. so in this case I'd end up with my new variable, let's call it extractedNum, being 67890 . I assume I have to use sed but have never used it and trying to get my head around it...
Can anyone help? Cheers
On a side-note, I am using find to extract the entire line from a string, by searching for the 1st string of characters, in this case the abcdefg part.
Pure Bash using an array:
testString="abcdefg:12345:67890:abcde:12345:abcde"
IFS=':'
array=( $testString )
echo "value = ${array[2]}"
The output:
value = 67890
Here's another pure bash way. Works fine when your input is reasonably consistent and you don't need much flexibility in which section you pick out.
extractedNum="${testString#*:}" # Remove through first :
extractedNum="${extractedNum#*:}" # Remove through second :
extractedNum="${extractedNum%%:*}" # Remove from next : to end of string
You could also filter the file while reading it, in a while loop for example:
while IFS=' ' read -r col line ; do
# col has the column you wanted, line has the whole line
# # #
done < <(sed -e 's/\([^:]*:\)\{2\}\([^:]*\).*/\2 &/' "yourfile")
The sed command is picking out the 2nd column and delimiting that value from the entire line with a space. If you don't need the entire line, just remove the space+& from the replacement and drop the line variable from the read. You can pick any column by changing the number in the \{2\} bit. (Put the command in double quotes if you want to use a variable there.)
You can use cut for this kind of stuff. Here you go:
VAR=$(echo abcdefg:12345:67890:abcde:12345:abcde |cut -d":" -f3); echo $VAR
For the fun of it, this is how I would (not) do this with sed, but I'm sure there's easier ways. I guess that'd be a question of my own to future readers ;)
echo abcdefg:12345:67890:abcde:12345:abcde |sed -e "s/[^:]*:[^:]*:\([^:]*\):.*/\1/"
this should work for you: the key part is awk -F: '$0=$3'
NewVar=$(getTheLineSomehow...|awk -F: '$0=$3')
example:
kent$ newVar=$(echo "abcdefg:12345:67890:abcde:12345:abcde"|awk -F: '$0=$3')
kent$ echo $newVar
67890
if your text was stored in var testString, you could:
kent$ echo $testString
abcdefg:12345:67890:abcde:12345:abcde
kent$ newVar=$(awk -F: '$0=$3' <<<"$testString")
kent$ echo $newVar
67890

Resources