Compare specific parts of two columns in a text file in Linux - linux

I have a text file with several columns separated by tab character as below:
1 ATGCCCAGA AS:i:10 XS:i:10
2 ATGCTTGA AS:i:10 XS:i:5
3 ATGGGGGA AS:i:10 XS:i:1
4 ATCCCCGA AS:i:20 XS:i:20
I now want to compare the last two columns AS:i:(n1) and XS:i:(n2) to obtain only lines with n1 different to n2. So, my desired output would be:
2 ATGCTTGA AS:i:10 XS:i:5
3 ATGGGGGA AS:i:10 XS:i:1
Could you suggest some ways to compare n1 and n2 and print the output? Thanks in advance.

As Shawn says, you could do this in awk... or perl... or sed.
An AWK example might be
awk '{split($3,a,":");split($4,b,":");if(a[3]!=b[3]) print $0}' infile.txt
If you are familiar with awk, this should be fairly self-explanatory.
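For readers less familiar with awk, the one-liner can be spelled out as below. This is only a sketch: it assumes the file is tab-separated as described in the question, and the temporary file and the `+ 0` numeric coercion are additions for the demo, not part of the original answer.

```shell
# Recreate the sample from the question in a temporary, tab-separated file.
infile=$(mktemp)
printf '1\tATGCCCAGA\tAS:i:10\tXS:i:10\n' >  "$infile"
printf '2\tATGCTTGA\tAS:i:10\tXS:i:5\n'   >> "$infile"
printf '3\tATGGGGGA\tAS:i:10\tXS:i:1\n'   >> "$infile"
printf '4\tATCCCCGA\tAS:i:20\tXS:i:20\n'  >> "$infile"

result=$(awk -F'\t' '{
    split($3, a, ":")            # a[1]="AS", a[2]="i", a[3]=n1
    split($4, b, ":")            # b[1]="XS", b[2]="i", b[3]=n2
    if (a[3] + 0 != b[3] + 0)    # "+ 0" forces a numeric comparison
        print                    # print the whole line when n1 and n2 differ
}' "$infile")

printf '%s\n' "$result"
rm -f "$infile"
```

On the sample this keeps only lines 2 and 3, where AS and XS carry different values.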

Related

Bash script: filter columns based on a character

My text file should be of two columns separated by a tab-space (represented by \t) as shown below. However, there are a few corrupted values where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the second value that follows the space in column 1. For example, in C\sx\t3 I can discard the x after the space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the cols based on \t into independent columns and then cut the first column based on \s and join them again. However, it did not work.
Here is the snippet:
col1=(cut -d$'\t' -f1 $file | cut -d' ' -f1)
col2=(cut -d$'\t' -f1 $file)
myArr=()
for((idx=0;idx<${#col1[#]};idx++));do
echo "#{col1[$idx]} #{col2[$idx]}"
# I will append to myArr here
done
The output appends the list from col2 to col1, giving A B C D E 1 2 3 4 5. On top of this, my file is huge (5,300,000 rows), so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
And another sed solution:
Search and replace any literal space followed by any number of non-TAB-characters with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there just make it 's/ +[^\t]+//' ...
Assuming that when you say a space you mean a blank character then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
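To see what each part of that awk command does, here is a small sketch run against the sample from the question. The temporary file and variable names are made up for the demo. Because awk streams the file in a single pass, no shell loop or array is needed, which matters for a file with millions of rows.

```shell
# Recreate the sample: tab-separated, with a stray " x" / " y" in column 1.
file=$(mktemp)
printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5\n' > "$file"

# FS=OFS="\t": split and re-join on tabs only, so the space stays inside $1.
# sub(/ .*/, "", $1): delete from the first space to the end of column 1.
# The trailing 1 is an always-true pattern that prints every record.
cleaned=$(awk 'BEGIN { FS = OFS = "\t" } { sub(/ .*/, "", $1) } 1' "$file")

printf '%s\n' "$cleaned"
rm -f "$file"
```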
Solution using Perl regular expressions (for me they are easier than sed's, and more portable, as there are a few different versions of sed)
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5
This code captures the run of non-whitespace characters at the start of the line and the run of non-whitespace characters after \t
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.
Take advantage of FS and OFS and let them do all the hard work for you:
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
If there's a chance of leading or trailing spaces and tabs, then perhaps:
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'

Filtering by author and counting all numbers in a txt file - Linux terminal, bash

I need help with two things
1) file.txt is a list of films in which the author, year of publication and title are on separate lines, e.g.
author1
year1
title1
author2
year2
title2
author3
year3
title3
author4
year4
title4
I need to show only book titles whose author is "Joanne Rowling"
2) one.txt contains numbers and letters, for example:
dada4dawdaw54 232dawdawdaw 53 34dadasd
77dkwkdw
65 23 laka 23
I need to sum all of the numbers and get the total; here it should be 565
I tried something like that:
awk '{for(i=1;i<=NF;i++)s+=$i}END{print s}' plik2.txt
but it doesn't make sense
For the 1st question, the solution of okulkarni is great.
For the 2nd question, one solution is
sed 's/[^0-9]/ /g' one.txt | awk '{for(i=1;i<=NF;i++) sum+= $i} END { print sum}'
The sed command converts all non-numeric characters into spaces, while the awk command sums the numbers, line by line.
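As a quick check, the pipeline can be run on the three sample lines from the question. The sketch below uses a temporary file instead of one.txt, and the `+ 0` in the END block is an addition so that empty input prints 0 rather than a blank line. The plain awk attempt in the question gives the wrong total because awk converts a field such as 232dawdawdaw using only its leading numeric prefix, and a field such as dada4dawdaw54 becomes 0.

```shell
# The three sample lines from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
dada4dawdaw54 232dawdawdaw 53 34dadasd
77dkwkdw
65 23 laka 23
EOF

# sed turns every non-digit into a space, so each run of digits becomes its own field;
# awk then adds up all fields on all lines and prints the total at the end.
total=$(sed 's/[^0-9]/ /g' "$tmp" |
    awk '{ for (i = 1; i <= NF; i++) sum += $i } END { print sum + 0 }')

echo "$total"
rm -f "$tmp"
```

For these lines the numbers found are 4, 54, 232, 53, 34, 77, 65, 23 and 23, which add up to 565.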
For the first question, you just need to use grep. Specifically, you can do grep -A 2 "Joanne Rowling" file.txt. This will show all lines with "Joanne Rowling" and the two lines immediately after.
For the second question, you can also use grep by doing grep -Eo '[0-9]+' one.txt | paste -sd+ - | bc. This will put a + between every number found by grep and then add them up using bc.
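Note that grep -A 2 prints the author and year lines as well as the title, and GNU grep adds a -- separator between non-adjacent matches. If only the titles are wanted, one possible sketch is to treat the file as 3-line records in awk. It assumes every record has exactly three lines (author, year, title); the file contents below are illustrative, not the asker's data.

```shell
# Hypothetical file contents in the author / year / title layout from the question.
file=$(mktemp)
cat > "$file" <<'EOF'
Joanne Rowling
1997
Harry Potter and the Philosopher's Stone
John Ronald Reuel Tolkien
1937
The Hobbit
Joanne Rowling
1998
Harry Potter and the Chamber of Secrets
EOF

# Line 1 of each 3-line record is the author: remember whether it matches.
# Line 3 of each record is the title: print it only for a matching author.
titles=$(awk -v author="Joanne Rowling" '
    NR % 3 == 1 { hit = ($0 == author) }
    NR % 3 == 0 && hit
' "$file")

printf '%s\n' "$titles"
rm -f "$file"
```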

Extract substring from first column

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...
If your actual Input_file is the same as the sample shown, the following code may help.
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3
Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
The parentheses store whatever they enclose in groups.
The first group is everything after name=" up to the next ". [^"] means "not a double quote".
The second group is simply "one or more digits at the end of the line, preceded by a space".
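The two groups can be seen at work on the sample data with the sketch below. It uses -E, which both GNU and BSD sed accept for extended regular expressions, in place of the GNU-specific -r; the temporary file and variable names are made up for the demo.

```shell
# The sample lines from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
EOF

# Group 1: the characters after name=" up to the next double quote.
# Group 2: the digits at the end of the line, preceded by a space.
# The whole line is replaced by "group 1, space, group 2".
names=$(sed -E 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/' "$input")

printf '%s\n' "$names"
rm -f "$input"
```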

script to compare two large 900 x 900 comma delimited files

I have tried awk but haven't been able to perform a diff for every cell, one at a time, on both files.
If you just want a rough answer, possibly the simplest thing is to do something like:
tr , \\n < file1 > /tmp/output
tr , \\n < file2 | diff - /tmp/output
That will convert each file to one column and run diff. You can compute the cells that differ from the line numbers of the output.
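The line-number arithmetic can be sketched as follows. This is an illustration under the assumption that every row has the same number of cells; it is shown on two small 3 x 5 files, and cols would be set to 900 for the real data. It uses paste and awk rather than parsing diff output, and all file and variable names are made up for the demo.

```shell
# Small demo files; for the real data set cols=900.
cols=5
dir=$(mktemp -d)
printf '1,2,3,4,5\n6,7,8,9,10\n11,12,13,14,15\n' > "$dir/file1"
printf '1,2,3,4,5\n2,7,1,9,12\n1,1,1,1,12\n'     > "$dir/file2"

# One cell per line for each file.
tr , '\n' < "$dir/file1" > "$dir/cells1"
tr , '\n' < "$dir/file2" > "$dir/cells2"

# paste puts cell N of both files on line N, separated by a tab;
# awk converts the line number N back into a row and a column.
report=$(paste "$dir/cells1" "$dir/cells2" |
    awk -F'\t' -v cols="$cols" '$1 != $2 {
        printf "row %d col %d: %s -> %s\n", int((NR - 1) / cols) + 1, (NR - 1) % cols + 1, $1, $2
    }')

printf '%s\n' "$report"
rm -rf "$dir"
```

Cells that look like numbers are compared numerically by awk, so 1 and 1.0 would count as equal here.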
Simplest way with awk, without accounting for newlines inside fields, quoted commas, etc. (a regex RS needs an awk that supports it, such as gawk or mawk).
Print the cells that are the same
awk 'BEGIN{RS=",|"RS}a[FNR]==$0;{a[NR]=$0}' file{,2}
Print differences
awk 'BEGIN{RS=",|"RS}FNR!=NR&&a[FNR]!=$0;{a[NR]=$0}' file{,2}
Print which cells are the same and which are different
awk 'BEGIN{RS=",|"RS}FNR!=NR{print "cell"FNR (a[FNR]==$0?"":" not")" the same"}{a[NR]=$0}' file{,2}
Input
file
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
file2
1,2,3,4,5
2,7,1,9,12
1,1,1,1,12
Output
same
1
2
3
4
5
7
9
Different
2
1
12
1
1
1
1
12
Same different
cell1 the same
cell2 the same
cell3 the same
cell4 the same
cell5 the same
cell6 not the same
cell7 the same
cell8 not the same
cell9 the same
cell10 not the same
cell11 not the same
cell12 not the same
cell13 not the same
cell14 not the same
cell15 not the same

Grep find lines that have 4,5,6,7 and 9 in zip code column

I'm using grep to display all lines that have ONLY 4,5,6,7 and 9 in the zipcode column.
How do I display only the lines of the file that contain the numbers 4,5,6,7 and 9 in the zipcode field?
A sample row is:
15 m jagger mick 41 4th 95115
Thanks
I am going to assume you meant "How do I use grep to..."
If all of the lines in the file have a 5 digit zip at the end of each line, then:
egrep "[45679]{5}$" filename
Should give you what you want.
If there might be whitespace between the zip and the end of the line, then:
egrep "[45679]{5}[[:space:]]*$" filename
would be more robust.
If the problem is more general than that, please describe it more accurately.
The following regex should give the desired result, provided the zip code is the last field on the line:
egrep "[[:space:]][45679]+$" file
If by "grep" you mean, "the correct tool", then the solution you seek is:
awk '$7 ~ /^[45679]+$/' input
This will print all lines of input in which the 7th field consists only of the characters 4,5,6,7, and 9. If you want to specify 'the last column' rather than the 7th, try
awk '$NF ~ /^[45679]+$/' input
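A small sketch of the last-column variant follows. The first row is the sample from the question; the other two rows are hypothetical, added so that something matches. It uses + rather than * in the regex so that an empty or missing field cannot match.

```shell
# First row is the sample from the question; rows 16 and 17 are made-up test data.
rows=$(mktemp)
cat > "$rows" <<'EOF'
15 m jagger mick 41 4th 95115
16 f doe jane 7 elm 94567
17 m roe richard 3 oak 45679
EOF

# $NF is the last whitespace-separated field; the anchors ^ and $ make sure
# the whole field, not just part of it, consists of the digits 4, 5, 6, 7 and 9.
matches=$(awk '$NF ~ /^[45679]+$/' "$rows")

printf '%s\n' "$matches"
rm -f "$rows"
```

The row ending in 95115 is dropped because its zip code contains a 1.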
