How can I use multiple operations in awk to edit a text file - linux

I have a text file like this small example:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 150 151 2 BA
chr10:103909786-103910082 152 153 1 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 294 295 4 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 2932 2933 2 CA
chr10:104573088-104576021 58 59 1 BA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
In this file there are 5 tab-separated columns. The first column is considered the ID; for example, in the first row the whole "chr10:103909786-103910082" is the ID.
1- In the 1st step I would like to filter the rows based on the 4th column: if the number in the 4th column is less than 10 and the group in the 5th column of that row is BA, the row is filtered out; likewise, if the number in the 4th column is less than 5 and the group in the 5th column is CA, the row is filtered out.
2- In the 2nd step I summarize the 4th column per ID and per group. The 1st column contains repeated values that represent the same ID, and each ID has both BA and CA rows in the 5th column. For each ID I add up all the values in the 4th column that belong to that ID and are classified as CA to get one CA value, and likewise all the values classified as BA to get one BA value.
3- In the 3rd step I take the ratio CA/BA as the final value for each ID, so in the output every ID appears only once. The expected output for the small example would look like this:
1- after filtration:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
2- after summarizing each group (CA and BA):
chr10:103909786-103910082 147 148 35 BA
chr10:103909786-103910082 274 275 35 CA
chr10:104573088-104576021 2925 2926 144 CA
chr10:104573088-104576021 819 820 45 BA
3- the final output (this ratio is computed using the values in the 4th column):
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
In the above lines, 1 = 35/35 and 3.2 = 144/45.
I am trying to do that in awk:
awk -F "\t" '{ (if($4 < -10 & $5==BA)), (if($4 < -5 & $5==CA)) ; print $2 = BA/CA} file.txt > out.txt
I tried to follow the steps mentioned above in the code, but did not succeed. Do you know how to solve the problem?

If the records with the same ID are always consecutive, you can do this:
awk 'ID != $1 {
         if (ID) {
             print ID, a["CA"]/a["BA"]
             a["CA"] = a["BA"] = 0
         }
         ID = $1
     }
     $5 == "BA" && $4 >= 10 || $5 == "CA" && $4 >= 5 { a[$5] += $4 }
     END { print ID, a["CA"]/a["BA"] }' file.txt
The first block tests whether the ID has changed; if so, it prints the previous ID and its CA/BA ratio, then resets the sums.
The second block accumulates the 4th-column sums per group, but only for records that pass the filter; unwanted records are skipped.
The END block prints the result for the last ID.
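For the sample data above this produces the expected ratios. Saved as a standalone script (the name ratio.awk here is just an example) it can be run like so, adding -F '\t' if you want to force tab as the field separator:
$ awk -F '\t' -f ratio.awk file.txt
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2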

Related

Extract columns from multiple text files with bash or awk or sed?

I am trying to extract column1 and column4 from multiple text files.
file1.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 2 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 32 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 2 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
file2.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 5 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 40 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 6 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
Output format (save the output as merged.txt in another directory): in the output file, column1 (#rname) appears only once because it is the same for every file, but there is one column4 (numreads) per input file, and each of these columns should be renamed according to its file name.
Output file looks like:
#rname file1_numreads file2_numreads
CFLAU10s46802|kraken:taxid|33189 2 5
CFLAU10s46898|kraken:taxid|33189 32 40
CFLAU10s46988|kraken:taxid|33189 2 6
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 2 88
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 6 113
Your suggestions would be appreciated.
Here is something I put together. awk gurus might have a simpler, shorter version, but I am still learning awk.
Create a file script.awk and make it executable. Put this in it:
#!/usr/bin/awk -f
BEGIN { FS="\t" }

# process files, ignoring comments
!/^#/ {
    # keep the first column values.
    # Only add a new value if it is not already in the array.
    if (!($1 in firstcolumns)) {
        firstcolumns[$1] = $1
    }

    # extract the 4th column of file1, put it in the array (column 1).1
    if (FILENAME == ARGV[1]) {
        results[$1 ".1"] = $4
    }

    # extract the 4th column of file2, put it in the array (column 1).2
    if (FILENAME == ARGV[2]) {
        results[$1 ".2"] = $4
    }
}

# print the results
END {
    # for each first column value...
    for (key in firstcolumns) {
        # Print the first column, then (column 1).1, then (column 1).2
        print key "\t" results[key ".1"] "\t" results[key ".2"]
    }
}
Call it like this: ./script.awk file1.txt file2.txt.
Since awk parses the files line by line, I keep the possible values of the first column in an array (firstcolumns).
For each line, if the 4th column comes from file1.txt (ARGV[1]), I store it in the results array under (firstcolumn).1.
For each line, if the 4th column comes from file2.txt (ARGV[2]), I store it in the results array under (firstcolumn).2.
In the END block, I loop through the possible firstcolumn values and print the values (firstcolumn).1 and (firstcolumn).2, separated by "\t" for tabs.
Results:
$ ./so.awk file1.txt file2.txt
AUZW01004514.1 C4 C4
CFLAU10s46988|kraken:taxid|33189 2 6
CFLAU10s46802|kraken:taxid|33189 2 5
AUZW01004739.1 C4 C4
CFLAU10s46898|kraken:taxid|33189 32 40
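A more general variant (just a sketch, assuming tab-delimited input with a single header line per file; it handles any number of files on the command line, keeps the input order, and leaves out the header row with the file names) avoids hard-coding ARGV indices:
#!/usr/bin/awk -f
# Sketch: collect column 4 from every input file, keyed by column 1.
BEGIN { FS = OFS = "\t" }
FNR == 1 { nfile++; next }              # a new file starts: bump the file counter, skip its header
{
    if (!($1 in seen)) {                # remember each rname once, in first-seen order
        seen[$1] = 1
        order[++n] = $1
    }
    val[$1, nfile] = $4                 # numreads for this rname in the current file
}
END {
    for (i = 1; i <= n; i++) {
        line = order[i]
        for (f = 1; f <= nfile; f++)
            line = line OFS val[order[i], f]
        print line
    }
}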

Criteria Range is ignoring second criteria in Current Region

When extracting and moving data, the first column of the criteria works, but the second criterion is not being applied: it returns the movement for all stores that sold that item.
List of column headers:
R2=Left Len, S2=Store
A2=Left Len, B2=UPC, C2=Store, D2=Movement
The file is just short of 900k rows of data in total.
I believe it to be an issue with Current Region.
I also need this to return zero if there is no movement for that store. This will be repeated 39 more times to the right in order to get results for each location.
The ultimate goal is to find the zero movers that need to be addressed, so the rows of UPCs need to stay aligned with the criteria.
Any help would be greatly appreciated.
Using Windows 7,
Office 2016
Sub Find_Fill_Data()
    Range("U2:X" & Range("X" & Rows.Count).End(xlUp).Row).ClearContents
    Range("A2:D" & Range("D" & Rows.Count).End(xlUp).Row).AdvancedFilter _
        Action:=xlFilterCopy, _
        CriteriaRange:=Range("R2").CurrentRegion, _
        CopyToRange:=Range("U2"), _
        Unique:=False
    Range("Q4").Select
End Sub
Left Len Item 5 7 8 9
1070002152 MILK DUDS THEATER BOX 123 254 181 196
1070002385 WHOPPERS MALT BALLS 19 0 28 42
1070002440 WHOPPERS MALT BALLS 92 188 79 133
1070002660 WHOPPERS MALT BALLS 22 21 11 22
1070006080 CANDY BAR 575 463 446 303
1070006611 WHOPPER ROBIN EGGS 22 28 25 0
1070008807 CANDY 132 57 59 0
1070008813 THEATER BOX 331 127 101 171
1070013272 J/RANCHER CRNCH CHEW ASST 61 0 0 0
1070050180 WHOPPERS MALT BALLS CARTN 119 24 99 99

Problems combining awk scripts

I am trying to use awk to parse a tab-delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have the smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and I can sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything as written, but the first part removes the second duplicate and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with something like [remove the first or 2nd duplicate, depending on which count is larger].
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 { print; next }                  # keep the header line
     { s = $2 + $3 + $4 + $5 }                # total of the four count columns for this row
     s >= sum[$1] {                           # keep the row with the larger total per gene id
         sum[$1] = s
         if (!($1 in rows))
             a[++n] = $1                      # remember first-seen order of gene ids
         rows[$1] = $0
     }
     END {
         for (i = 1; i <= n; i++)
             print rows[a[i]]                 # print the surviving rows in original order
     }' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204
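For comparison, the two-pass FNR==NR idea from the question can be completed roughly like this (just a sketch, reading the file twice; it keeps the header line and, for each duplicated gene id, only the row with the larger total, the first one on ties):
awk 'FNR == NR { s = $2 + $3 + $4 + $5; if (s > max[$1]) max[$1] = s; next }   # pass 1: best total per id
     FNR == 1 || ($2 + $3 + $4 + $5 == max[$1] && !seen[$1]++)                 # pass 2: header + winning rows
' file file | column -t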

merging two files based on two columns

I have a question very similar to a previous post:
Merging two files by a single column in unix
but I want to merge my data based on two columns (the orders are the same, so no need to sort).
Example:
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
And the output to look like
subjectid SubID2 prob_disease name age
12 121 0.009 Jane 16
24 241 0.738 Kristen 90
15 151 0.392 Clarke 78
23 231 1.2E-5 Joanna 31
When I use join it only considers the first column (subjectid) and repeats the subID2 column.
Is there a way of doing this with join, or some other way? Thank you.
The join command doesn't have an option to use more than one field as the join criterion. Hence, you will have to add some intelligence into the mix. Assuming your files have a FIXED number of fields on each line, you can use something like this:
join f1 f2 | awk '{print $1, $2, $6, $3, $4}'
provided that the field counts are as given in your examples. Otherwise, you need to adjust the print in the awk command by adding or removing some fields.
If the orders are identical, you could still merge by a single column and specify the format of which columns to output, like:
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
as described in join(1).
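If you would rather avoid join entirely, the same merge can be done in awk with a composite key built from the first two columns; here is a minimal sketch, assuming both files are whitespace-delimited (file_a and file_b as above, with file_b holding prob_disease) and that the header lines should be carried through:
awk 'NR == FNR { prob[$1, $2] = $3; next }   # file_b: remember prob_disease per (subjectid, subID2)
     { print $1, $2, prob[$1, $2], $3, $4 }  # file_a: subjectid, subID2, prob_disease, name, age
' file_b file_a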

Excel date/product count to specified limit

Column A "Sales Dates", Column B "=A2-A1" for "Date Diff", Column C "Customer Name", Column D "Item", Column E "Items Ordered Count"
My issue is that I have to do a running 30-day total for each customer to see that specific items are not being ordered above "x" number within any 30-day period.
Does anyone have any ideas?
I may not be fully understanding your question, but I don't think you can do what you ask in Excel. This might be a situation where a database that can do SQL comes in handy.
The best I can come up with in Excel is a Pivot Table, with customers as rows, dates as columns (grouped by month), and the sum of Items Ordered in the data area. Then conditionally format the data area to highlight values > your limit.
Perhaps if you provide some sample data & output I can come up with something more like what you need.
The formula would look something like this:
{=SUM(IF((A$2:A2>=A2-29)*(D$2:D2=D2),E$2:E2,0))}
It should be entered into cell F2 and copied down to the last row of your data. I pasted in a test spreadsheet below so you can see where things go (sorry for the formatting--hopefully it will look better if you paste it into Excel).
IMPORTANT: This is an array formula, so after you type in the formula (and don't type in the braces {} when you do), you must press Ctrl-Shift-Enter instead of just Enter (see this link for more details).
What does the formula do? It does two loops:
First, it loops through all the Sales Dates from the beginning of the log to the current row and checks if each date is between the date of the current row and 29 days earlier (which makes a 30-day window). (By "current row" I mean the row where the formula is located.)
Second, it loops through all the Items from the beginning of the log to the current row and checks if there is a match with the Item of the current row.
For any row where both checks are true (the "*" in the formula does an "and" operation), Items Ordered Count is added to the sum, otherwise zero is added to the sum. So, when it's finished, you have a count for each row of how many orders there were in the past 30 days for that item.
HTH,
-Dan
Sales Dates Date Diff Customer Name Item Items Ordered Count 30-Day Count
1/1/2009 0 dfsadf 11336 70 70
1/2/2009 1 asdfd 10218 121 121
1/3/2009 1 fsdfjkfl 10942 101 101
1/6/2009 3 slkdjflsk 13710 80 80
1/7/2009 1 slkdjls 10480 127 127
1/9/2009 2 sdjjf 11336 143 213
1/11/2009 2 woieuriwe 11501 84 84
1/14/2009 3 owqieyurtn 10191 78 78
1/15/2009 1 weisd 10480 113 240
1/16/2009 1 woieuriwe 12024 133 133
1/17/2009 1 vkcjl 13818 125 125
1/20/2009 3 sdflkj 11336 128 341
1/23/2009 3 jnbkdl 10480 141 381
1/25/2009 2 pqcvnlz 10480 137 518
1/27/2009 2 hwodkjgfh 12878 80 80
1/28/2009 1 zjdnfg;pwlkd 10942 123 224
1/31/2009 3 zlkdjnf;psod 13173 93 93
2/2/2009 2 zlknpdodfg 11336 119 390
2/4/2009 2 zjhdfpwskjh 12004 57 57
2/5/2009 1 asdfd 10218 121 121
2/8/2009 3 fsdfjkfl 10942 101 224
2/11/2009 3 slkdjflsk 13710 80 80
2/14/2009 3 slkdjls 10480 127 405
2/16/2009 2 sdjjf 11336 143 390
2/18/2009 2 woieuriwe 11501 84 84
2/21/2009 3 owqieyurtn 10191 78 78
2/24/2009 3 weisd 10480 113 240
2/25/2009 1 woieuriwe 12024 133 133
2/27/2009 2 vkcjl 13818 125 125
2/28/2009 1 sdflkj 11336 128 390
