Adding integers from multiple lines in a file - Linux

I have a file called names.txt that contains information like this:
900608999 Hunter Price 60 70
900708988 Rachel Reed 70 80
I need a bash script that reads the test scores from test 1 (the 4th column), so we would add Hunter's 60 and Rachel's 70, find the average, and print it, then do the same for test 2 (70 and 80). I believe you have to use a for loop, but I am having trouble piecing it together. What I have so far is the basic for loop layout I was planning to use, which simply returns all pieces of the file in an unorganized manner:
for x in $(cat names.txt)
do
echo $x
done

Will this do?
awk '{sum4 += $4; sum5 += $5} END {print sum4/NR, sum5/NR}' names.txt
Output:
65 75
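If you specifically want the shell-loop approach you started with, here is a minimal sketch of my own using a while read loop instead of for. It assumes the file contains exactly the space-separated five-column lines shown, and note that bash arithmetic truncates the averages to integers:
#!/bin/bash
# Sum the two score columns of names.txt line by line.
sum1=0 sum2=0 count=0
while read -r id first last test1 test2; do
    sum1=$((sum1 + test1))
    sum2=$((sum2 + test2))
    count=$((count + 1))
done < names.txt

# Integer averages (use awk or bc if you need decimals).
echo "Test 1 average: $((sum1 / count))"
echo "Test 2 average: $((sum2 / count))"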

Related

How do I use grep to get numbers larger than 50 from a txt file

I am relatively new to grep and Unix. I am trying to get the names of people who have won more than 50 races from a txt file. So far the code I have used is cat file.txt|grep -E "[5-9][0-9]$", but this only gives me numbers from 50-99. How could I extend it to cover 50-200? Thank you!!
driver races wins
Some_Man 90 160
Some_Man 10 80
The above is similar to the format of the data, although the real file is not neatly tabulated.
Do you have to use grep? You could use awk like this:
awk '{ if ($[replace with the field number] > 50) print $2 }' < file.txt
assuming your fields are delimited by spaces; otherwise you can use the -F flag to specify the delimiter.
If you must use grep, then it is a regular expression like the one you already wrote. To cover 50 to 200 you would do:
cat file.txt | grep -E "\b([5-9][0-9]|1[0-9][0-9]|200)$"
Input:
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
4 [Finland] Kimi_Raikkonen 326 21
5 [Germany] Nico_Rosberg 200 23
Awk would be a better candidate for this:
awk '$4>=50 && $4<=200 { print $0 }' file
Check whether the fourth space-delimited field ($4 - change this to whatever the field number actually is) is both greater than or equal to 50 and less than or equal to 200, and print the line ($0) if the condition is met.
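For instance, applied to the sample input above and filtering on the Wins column (which is $5 there), this would give something like the following (my reading of that sample, not output supplied by the poster):
$ awk '$5>=50 && $5<=200 { print $0 }' file
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53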

AWK field contains number range

I'm trying to use awk to output lines from a semi-colon (;) delimited text file in which the third field contains a number from a certain range. e.g.
[root@example ~]# cat foo.csv
john doe; lawyer; section 4 stand 356; area 5
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
I want awk to give me all lines in which the third field contains a number between 200 and 300 regardless of the other text in the field.
You may use a regular expression, like this:
awk -F\; '$3 ~ /\y2[0-9][0-9]\y/' foo.csv
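Note that \y is a GNU awk word-boundary escape; a rough POSIX-portable equivalent (my own approximation) anchors the number against non-digits or the ends of the field instead:
awk -F';' '$3 ~ /(^|[^0-9])2[0-9][0-9]([^0-9]|$)/' foo.csv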
A better version that allows you to simply pass the boundaries at the command line without changing the regular expression could look like the following:
(Since it is a more complex script, I recommend saving it to a file.)
filter.awk
BEGIN { FS=";" }
{
    # Split the 3rd field by sequences of non-numeric characters
    # and store the pieces in 'a'. 'a' will contain the numbers
    # of the 3rd field (plus optional empty strings if $3 does
    # not start or end with a number).
    split($3, a, "[^0-9]+")

    # Iterate through 'a' and check whether a number is within the range.
    for (i in a) {
        if (a[i] != "" && a[i] >= low && a[i] < high) {
            print
            next
        }
    }
}
Call it like this:
awk -v high=300 -v low=200 -f filter.awk foo.csv
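For the sample foo.csv shown above, with low=200 and high=300, this should select the same two lines as the grep alternative below (a quick sanity check on my part, assuming the data exactly as shown):
$ awk -v high=300 -v low=200 -f filter.awk foo.csv
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6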
grep alternative:
grep '^[^;]*;[^;]*;[^;]*\b2[0-9][0-9]\b' foo.csv
The output:
chris thomas; carpenter; stand 289 section 2; area 5
tom sawyer; politician; stan 210 section 4; area 6
If 300 should be an inclusive boundary, you may use the following:
grep '^[^;]*;[^;]*;[^;]*\b\(2[0-9][0-9]\|300\)\b' foo.csv

Get a list of lines from a file

I have a huge file (millions of lines) and I want to get a random sample from it. I've generated a list of unique random numbers, and now I want to get all the lines whose line numbers match the random numbers I generated.
Sorting the random numbers is not a problem, so I was thinking I could take the difference between consecutive numbers and just jump forward by that difference through the file.
I think I should use sed or awk.
Why don't you directly use shuf to get random lines:
shuf -n NUMBER_OF_LINES file
Example
$ seq 100 >a # the file "a" contains the numbers 1 to 100, one per line
$ shuf -n 4 a
54
46
30
53
$ shuf -n 4 a
50
37
63
21
Update
Can I somehow store which line numbers shuf chose? – Pio
As I did in How to efficiently get 10% of random lines out of the large file in Linux?, you can do something like this:
shuf -i 1-1000 -n 5 > rand_numbers # store the list of numbers
awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' rand_numbers a # print those lines from file "a"
You can use awk and shuf:
shuf file.txt > shuf.txt
awk '!a[$0]++' shuf.txt > uniqed.txt
This awk idiom is a simple way of removing duplicate lines.
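If you do want to stay with your original idea of jumping to specific line numbers, a small sketch of my own (assuming the chosen numbers sit one per line in a file called rand_numbers, as in the update above) is to turn them into a sed print script:
# Turn "12", "47", ... into "12p", "47p", ... and hand that to sed as its script.
sed -n "$(sed 's/$/p/' rand_numbers)" file.txt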

Extract rows and substrings from one file conditional on information of another file

I have a file 1.blast with coordinate information like this
1 gnl|BL_ORD_ID|0 100.00 33 0 0 1 3
27620 gnl|BL_ORD_ID|0 95.65 46 2 0 1 46
35296 gnl|BL_ORD_ID|0 90.91 44 4 0 3 46
35973 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
41219 gnl|BL_ORD_ID|0 100.00 27 0 0 1 27
46914 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
and a file 1.fasta with sequence information like this
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
...
>100000
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTG
I am now looking for a script that takes the first column from 1.blast, extracts those sequence IDs (= first column, $1) plus their sequences from the 1.fasta file, and then removes from each sequence the positions between $7 and $8. That means for the first two matches the output would be
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
GTAGATAGAGATAGAGAGAGAGAGGGGGGAGA
...
(please notice that the first three characters of >1 are not in this sequence)
The IDs are consecutive, meaning I can extract the required information like this:
awk '{print 2*$1-1, 2*$1, $7, $8}' 1.blast
This gives me a matrix whose first column is the row number of the sequence identifier, whose second column is the row number of the sequence itself (= the row after the ID row), followed by the two coordinates that should be excluded. So it is basically a matrix containing all the information about which elements of 1.fasta shall be extracted.
Unfortunately I do not have much experience with scripting, hence I am a bit lost: how do I feed these values, e.g., into a suitable sed command?
I can get specific rows like this:
sed -n 3,4p 1.fasta
and the string that I want to remove e.g. via
sed -n 5p 1.fasta | awk '{print substr($0,2,5)}'
But my problem now is how to pipe the information from the first awk call into the other commands so that they extract the right rows and then remove the given coordinates from the sequence rows. So substr isn't the right command; I would need something like remstr(string,start,stop) that removes everything between those two positions from a given string, but I think I could write that myself. The correct piping is the real problem for me here.
If you do bioinformatics and work with DNA sequences (or even more complicated things like sequence annotations), I would recommend having a look at Bioperl. This obviously requires knowledge of Perl, but has quite a lot of functionality.
In your case you would want to generate Bio::Seq objects from your fasta-file using the Bio::SeqIO module.
Then you would need to read the wanted fasta entry numbers and positions into a hash, with the fasta name as the key and an array of two values for each subsequence you want to extract as the value. If there can be more than one such subsequence per fasta entry, you would have to create an array of arrays as the value for each key.
With this data structure, you could then go ahead and extract the sequences using the subseq method from Bio::Seq.
I hope this is a way to go for you, although I'm sure that this is also feasible with pure bash.
This isn't an answer, it is an attempt to clarify your problem; please let me know if I have gotten the nature of your task correct.
foreach row in blast:
get the proper (blast[$1]) sequence from fasta
drop bases (blast[$7..$8]) from sequence
print blast[$1], shortened_sequence
If I've got your task correct, you are being hobbled by your programming language (bash) and the peculiar format of your data (a record split across rows). Perl or Python would be far more suitable to the task; indeed Perl was written in part because multiple file access in awk of the time was really difficult if not impossible.
You've come pretty far with the tools you know, but it looks like you are hitting the limits of their convenient expressibility.
As both thunk and msw have pointed out, more suitable tools are available for this kind of task, but here you have a script that can teach you something about how to handle it with awk:
Content of script.awk:
## Process first file from arguments.
FNR == NR {
    ## Save ID and the range of characters to remove from the sequence.
    blast[ $1 ] = $(NF-1) " " $NF
    next
}

## Process second file. For each FASTA id...
$1 ~ /^>/ {
    ## Get the number.
    id = substr( $1, 2 )

    ## Read next line (the sequence).
    getline sequence

    ## If the ID is one found in the other file, get the ranges and
    ## extract those characters from the sequence.
    if ( id in blast ) {
        split( blast[id], ranges )
        sequence = substr( sequence, 1, ranges[1] - 1 ) substr( sequence, ranges[2] + 1 )

        ## Print both lines with the shortened sequence.
        printf "%s\n%s\n", $0, sequence
    }
}
Assuming the 1.blast of the question and a customized 1.fasta to test it:
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
>27620
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA
Run the script like:
awk -f script.awk 1.blast 1.fasta
That yields:
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
TTTGCGA
Of course I'm assuming some things, the most important being that the FASTA sequences are not longer than one line.
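If your real FASTA file wraps sequences over several lines, one way around that assumption (my own pre-processing step, not part of the answer; the 1.oneline.fasta name is just an example) is to join each record onto a single line first and feed the result to script.awk:
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' 1.fasta > 1.oneline.fasta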
Updated the answer:
awk '
    ## First file (1.fasta): remember each sequence by its ID.
    NR==FNR && NF {
        id = substr($1, 2)
        getline seq
        a[id] = seq
        next
    }
    ## Second file (1.blast): cut positions $7 through $8 (1-based,
    ## inclusive) out of the stored sequence and print the result.
    ($1 in a) && NF {
        seq = a[$1]
        a[$1] = substr(seq, 1, $7 - 1) substr(seq, $8 + 1)
        print ">" $1 "\n" a[$1]
    } ' 1.fasta 1.blast

Print header information using awk every 20 lines

I have a big data project that has thousands of entries. The data has roughly 20 columns including cylinders, gas mileage, make, model etc. I'm using awk to output all the data. I have to organize the data into a nice table.
I'm using a script like this:
#!/bin/bash
while read x
do
echo $x | awk -F ',' ' { print $1":"$2":"$4":"$7":"$8":"$10":"$11":"$12":"$22":"$24 } '
done
There will be title headings where the colons are. I need to repeat those headings every 20 lines, and there must be a line break after each block of 20 lines and its header. Also, the last line should output the number of entries.
I'm stuck on those last three things.
There's no point using the while read loop, and in fact it complicates things since it makes it difficult for awk to keep a count of the line numbers. Try:
awk -F, 'NR % 20 == 1 { print "header columns" }
{ print $1,$2,$4,$7,$8,$10,$11,$12,$22,$24 }' OFS=: input-file
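If you also need the blank line and the final entry count, a sketch along the same lines (one reading of the blank-line requirement; "header columns" and input-file are placeholders for your real titles and file name):
awk -F, '
    NR % 20 == 1 {                 # before rows 1, 21, 41, ...
        if (NR > 1) print ""       # blank line between 20-row blocks
        print "header columns"     # replace with the real column titles
    }
    { print $1,$2,$4,$7,$8,$10,$11,$12,$22,$24 }
    END { print NR " entries" }    # total number of data rows
' OFS=: input-file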
