I have two scripts, A and B.
I want to execute them and read two values, V and VALS, respectively.
V is just a floating point number, let's say 0.5
VALS has the following format:
1 10
2 20
3 60
4 45
and so on.
What I'm trying to do is to get a new variable where the second column of VALS (10, 20, ...) is divided by V.
As I understand this can be implemented with a mix of xargs and cut but I'm not really familiar with these tools.
#!/bin/bash
V=`./A`
VALS=`./B`
RESULT=#magic happens
The final result with the previous data should be:
1 20
2 40
3 120
4 90
Bash's builtin arithmetic expansion only works for integers. You can use awk for data extraction and floating point numbers.
V=`./A`
# No VALS needed
RESULT=($(./B | awk "{print \$2 / $V}"))
Note the escaped dollar sign in \$2.
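To reproduce the two-column output shown in the question, one option is to hand V to awk with -v and keep the awk program in single quotes, so nothing inside it needs escaping. This is a minimal sketch: script_a and script_b are hypothetical stand-ins for ./A and ./B so that the example runs on its own.

```shell
#!/bin/bash
# Hypothetical stand-ins for ./A and ./B (assumed output taken from the question).
script_a() { echo 0.5; }
script_b() { printf '%s\n' '1 10' '2 20' '3 60' '4 45'; }

V=$(script_a)
# -v passes the shell value to awk as the awk variable v.
RESULT=$(script_b | awk -v v="$V" '{ print $1, $2 / v }')
echo "$RESULT"
```

Here RESULT is a single multi-line string rather than an array, which keeps each row's two columns together.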
I pipe different values to bc.
If the value is a number, it works fine. If it's a string with lowercase letters, it returns 0, which makes sense to me, but if it's uppercase letters, bc converts it to a run of 9s as long as the input:
echo 1 | bc
1
echo aaa | bc
0
echo AAA | bc
999
echo FO | bc
99
echo null | bc
0
echo NULL | bc
9999
Why does bc have this behavior? What's the best way to work with unexpected string values?
According to https://www.gnu.org/software/bc/manual/html_mono/bc.html
(emphasis by me):
A simple expression is just a constant. bc converts constants into internal decimal numbers using the current input base, specified by the variable ibase. (There is an exception in functions.) The legal values of ibase are 2 through 16. Assigning a value outside this range to ibase will result in a value of 2 or 16. Input numbers may contain the characters 0-9 and A-F. (Note: They must be capitals. Lower case letters are variable names.) Single digit numbers always have the value of the digit regardless of the value of ibase. (i.e. A = 10.) For multi-digit numbers, bc changes all input digits greater or equal to ibase to the value of ibase-1. This makes the number FFF always be the largest 3 digit number of the input base.
So, assuming that your ibase is 10 your observation is explained.
It is unrelated to "unexpected string values" or "the length of the input characters". bc treats them as (somewhat odd) attempts to provide numeric values, and converts and uses them according to the quoted rule.
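As for handling unexpected values, one possible approach is to validate the input in the shell before it ever reaches bc. The following is only a sketch: to_number is a hypothetical helper name, and the pattern assumes that plain decimal numbers (optional sign, optional fraction) are the only input you want to accept.

```shell
#!/bin/bash
# to_number is a hypothetical helper, not part of bc.
# It forwards its argument to bc only if it looks like a plain decimal number.
to_number() {
    local re='^-?([0-9]+(\.[0-9]*)?|\.[0-9]+)$'
    if [[ $1 =~ $re ]]; then
        echo "$1" | bc
    else
        echo "not a number: $1" >&2
        return 1
    fi
}

to_number 1
to_number NULL || echo "rejected"
```

With this check, strings such as NULL or aaa are rejected with a non-zero status instead of being read as numbers or variable names.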
I have
while read field1 field2 field3 field4
do
trimmed=$(echo "$field2" | sed 's/ *$//')
echo "$trimmed","$field3" >> new.csv
done < "$FEEDS"/"$DLFILE"
Now the problem is with read I can't make it split fields csv style, can I? See the input csv format below.
I need to get columns 3 and 4 out, stripping the padding from col 2, and I don't need the quotes.
Csv format with col numbers:
12 24(")25(,)26(")/27(Field2values) 42(")/43(,)/44(Field3 decimal values)
"Field1_constant_value","Field2values ",Field3,Field4
Field1 is constant and irrelevant. Data is quoted, goes from 2-23 inside the quotes.
Field2 is fixed width, cols 27-41 inside the quotes, with the data at the left and padded by spaces on the right.
Field3 is a decimal number with 1,2, or 3 digits before the decimal and 2 after, no padding. Starts at col 74.
Field4 is a date and I don't much care about it right now.
Yes, you can use read; all you have to do is set the shell variable IFS (Internal Field Separator) so that read splits lines on your own delimiter instead of its default value (whitespace).
Considering an input file "a.csv", with the given contents:
1,2,3,4
2,3,4,5
6,3,2,1
You can do this:
while IFS=, read -r f1 f2 f3 f4; do
echo "fields[$f1 $f2 $f3 $f4]"
done < a.csv
And the output is:
fields[1 2 3 4]
fields[2 3 4 5]
fields[6 3 2 1]
A couple of good starting points for you are here: http://backreference.org/2010/04/17/csv-parsing-with-awk/
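Applied to the quoted, padded layout from the question, the same idea might look like the sketch below. The sample line is made up to match the described format, and the approach assumes that no quoted field contains a comma (for that case, see the awk article linked above).

```shell
#!/bin/bash
# Made-up sample line in the layout described in the question.
line='"CONSTANT","ABC123         ",12.34,2010-04-17'

# IFS=, applies only to this read; -r keeps backslashes literal.
IFS=, read -r f1 f2 f3 f4 <<< "$line"

f2=${f2//\"/}              # drop the double quotes
f2=${f2%"${f2##*[! ]}"}    # drop the trailing padding spaces
echo "$f2,$f3"
```

In a loop over the real file, the read line would become the loop condition (while IFS=, read -r f1 f2 f3 f4; do ... done < file), and the two parameter expansions replace the call to sed.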
How would I print out even numbers between two numbers?
I have a script where a user enters two values, and those two values are placed into their respective array elements. How would I print the even numbers between the two values?
See man seq. You can use
seq first incr last
for example
seq 4 2 18
to print even numbers from 4 to 18 (inclusive)
If you have bash 4 or later, brace expansion accepts a step:
printf '%s\n' {4..18..2}
Or a C-style for loop:
for ((i=4;i<=18;i+=2)); do echo "$i"; done
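All of the examples above start from an even number. With user-supplied bounds the lower one may be odd, so a small adjustment is needed first. A minimal sketch, where print_evens is a hypothetical helper name and both arguments are assumed to be integers:

```shell
#!/bin/bash
# print_evens is a hypothetical helper: prints the even numbers from $1 to $2, inclusive.
print_evens() {
    local first=$1 last=$2
    # If the lower bound is odd, start at the next even number.
    if (( first % 2 != 0 )); then
        (( first += 1 ))
    fi
    seq "$first" 2 "$last"
}

print_evens 3 11
```

The call print_evens 3 11 starts at 4 and stops at 10, because seq never goes past the upper bound.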
I am trying to count how many times each ASCII printable character is present in a file. I thought a good way to do this might be to list the printable characters in a { } enclosed list, and use grep on each item within the braces. An example code is below. I would like to expand the char list to include all 64 ASCII printable characters. I cannot figure out how to get the code to read and use each characters between the braces separately. I would really like to output a file in the format "character\tcharacterCount". Any suggestions?
char={" ",!,\",#,"\$"}
cat PHRED_scores.txt | grep -e "$char" | wc -m
Below command will display the special characters present in the file and their total count.
grep -oP '[ !\\$#]' file | sort | uniq -c
Explanation:
-o - print only the matched part of each line.
-P - use Perl-compatible regular expressions.
[ !\\$#] - the special characters are listed in a character class. You have to escape \ so that it means a literal \
sort - sorts the output so that identical characters are adjacent.
uniq -c - collapses each run of identical lines into one and prefixes it with its count.
There is a way to avoid listing every character individually to match the ASCII character set. Regular expressions, as used by grep, provide bracket expressions with ranges and named character classes, so numerous characters can be matched without listing each one. Some examples are:
[a-z] match all lowercase characters
[A-Z] match all uppercase characters
[0-9] match all digits
[[:print:]] all printable characters
So with very little effort, you can match all upper and lowercase characters and all digits with:
[a-zA-Z0-9]
You can then add the additional printable characters, but you must take care to escape or avoid those with special meaning inside a bracket expression. An example (not intended to be all-inclusive) is
[a-zA-Z0-9:;~!#$%&*()_+=-]
where the hyphen is placed last so that it is taken literally rather than as a range. Or you can use the predefined class:
[[:print:]]
You can add as required. To solve your problem, as Avinash provided sort | uniq -c can provide the individual count. Adding an additional call to wc -m will provide the total. With that, it is not difficult to develop a script that will take the filename as an argument and give the total and individual character counts you require. Something similar to the following will work:
#!/bin/bash
cclass='[[:print:]]' # character class to count
echo -n "Total character count: "
grep -o "$cclass" "$1" | tr -d '\n' | wc -m # obtain the total character count
echo -e " Individual frequency:"
grep -o "$cclass" "$1" | sort | uniq -c # obtain the individual frequency
exit 0
Sample output:
Total character count: 455
Individual frequency:
6 =
10 _
7 -
4 ,
12 ;
1 /
4 .
6 "
9 (
9 )
2 {
2 }
2 *
5 \
2 #
4 %
4 0
3 a
17 b
11 c
1 C
24 d
4 D
28 e
1 E
...
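The question asked for output in the form "character\tcharacterCount". One way to get that directly is to do the counting in awk instead of reformatting the output of uniq -c. This is a sketch under a few assumptions: count_chars is a hypothetical helper name, it reads from stdin, and "printable" is taken to mean the ASCII range from space to tilde.

```shell
#!/bin/bash
# count_chars is a hypothetical helper: reads text on stdin and prints
# "character<TAB>count" for every printable ASCII character it sees.
count_chars() {
    LC_ALL=C awk '
        {
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                if (c >= " " && c <= "~") count[c]++   # space through tilde
            }
        }
        END { for (c in count) printf "%s\t%d\n", c, count[c] }
    ' | LC_ALL=C sort
}

printf '%s\n' 'aab!' 'b b' | count_chars
```

For a file, the call would be count_chars < PHRED_scores.txt > counts.tsv. LC_ALL=C makes both the comparison and the sort byte-based, so the rows come out in ASCII order.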
I have a file 1.blast with coordinate information like this
1 gnl|BL_ORD_ID|0 100.00 33 0 0 1 3
27620 gnl|BL_ORD_ID|0 95.65 46 2 0 1 46
35296 gnl|BL_ORD_ID|0 90.91 44 4 0 3 46
35973 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
41219 gnl|BL_ORD_ID|0 100.00 27 0 0 1 27
46914 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
and a file 1.fasta with sequence information like this
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
...
>100000
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTG
I am looking for a script that takes the first column of 1.blast, extracts the matching sequence IDs (first column, $1) and their sequences from 1.fasta, and then removes from each sequence the positions between $7 and $8. For the first two matches the output would be
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
GTAGATAGAGATAGAGAGAGAGAGGGGGGAGA
...
(please notice that the first three entries from >1 are not in this sequence)
The IDs are consecutive, meaning I can extract the required information like this:
awk '{print 2*$1-1, 2*$1, $7, $8}' 1.blast
This gives me then a matrix that contains in the first column the right sequence identifier row, in the second column the right sequence row (= one after the ID row) and then the two coordinates that should be excluded. So basically a matrix that contains all required information which elements from 1.fasta shall be extracted
Unfortunately I do not have much experience with scripting, so I am now a bit lost: how do I feed the values into, e.g., a suitable sed command?
I can get specific rows like this:
sed -n 3,4p 1.fasta
and the string that I want to remove e.g. via
sed -n 5p 1.fasta | awk '{print substr($0,2,5)}'
But my problem is now, how can I pipe the information from the first awk call into the other commands so that they extract the right rows and remove from the sequence rows then the given coordinates. So, substr isn't the right command, I would need a command remstr(string,start,stop) that removes everything between these two positions from a given string, but I think that I could do in an own script. Especially the correct piping is a problem here for me.
If you do bioinformatics and work with DNA sequences (or even more complicated things like sequence annotations), I would recommend having a look at Bioperl. This obviously requires knowledge of Perl, but has quite a lot of functionality.
In your case you would want to generate Bio::Seq objects from your fasta-file using the Bio::SeqIO module.
Then, you would need to read the fasta-entry-numbers and positions wanted into a hash. With the fasta-name as the key and the value being an array of two values for each subsequence you want to extract. If there can be more than one such subsequence per fasta-entry, you would have to create an array of arrays as the value entry for each key.
With this data structure, you could then go ahead and extract the sequences using the subseq method from Bio::Seq.
I hope this is a way to go for you, although I'm sure that this is also feasible with pure bash.
This isn't an answer, it is an attempt to clarify your problem; please let me know if I have gotten the nature of your task correct.
foreach row in blast:
get the proper (blast[$1]) sequence from fasta
drop bases (blast[$7..$8]) from sequence
print blast[$1], shortened_sequence
If I've got your task correct, you are being hobbled by your programming language (bash) and the peculiar format of your data (a record split across rows). Perl or Python would be far more suitable to the task; indeed Perl was written in part because multiple file access in awk of the time was really difficult if not impossible.
You've come pretty far with the tools you know, but it looks like you are hitting the limits of their convenient expressibility.
As both thunk and msw have pointed out, more suitable tools are available for this kind of task, but here is a script that can teach you something about how to handle it with awk:
Content of script.awk:
## Process first file from arguments.
FNR == NR {
## Save ID and the range of characters to remove from sequence.
blast[ $1 ] = $(NF-1) " " $NF
next
}
## Process second file. For each FASTA id...
$1 ~ /^>/ {
## Get number.
id = substr( $1, 2 )
## Read next line (the sequence).
getline sequence
## if the ID is one found in the other file, get ranges and
## extract those characters from sequence.
if ( id in blast ) {
split( blast[id], ranges )
sequence = substr( sequence, 1, ranges[1] - 1 ) substr( sequence, ranges[2] + 1 )
## Print both lines with the shortened sequence.
printf "%s\n%s\n", $0, sequence
}
}
Assuming the 1.blast from the question and a customized 1.fasta to test it:
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
>27620
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA
Run the script like:
awk -f script.awk 1.blast 1.fasta
That yields:
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
TTTGCGA
Of course I'm assuming some things, the most important being that fasta sequences are not longer than one line.
Updated the answer:
awk '
NR==FNR && NF {
id=substr($1,2)
getline seq
a[id]=seq
next
}
($1 in a) && NF {
## Remove positions $7..$8 by joining what precedes and what follows them.
a[$1] = substr(a[$1], 1, $7 - 1) substr(a[$1], $8 + 1)
print ">"$1"\n"a[$1]
} ' 1.fasta 1.blast
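The question also asked for something like remstr(string,start,stop). The removal step from script.awk can be isolated into such a helper. This is only a sketch: remstr is the name suggested in the question, not a built-in of awk or bash, and positions are assumed to be 1-based and inclusive, as in the blast file.

```shell
#!/bin/bash
# remstr is a hypothetical helper: removes characters start..stop
# (1-based, inclusive) from a string and prints what is left.
remstr() {
    awk -v s="$1" -v start="$2" -v stop="$3" '
        BEGIN { print substr(s, 1, start - 1) substr(s, stop + 1) }'
}

remstr TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG 1 3
```

The call removes the first three bases of sequence >1, which matches the expected output in the question.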