remove portion of a column in a .tab in unix - linux

Can someone please help! I'm trying to delete the last portion (everything following "_c0") in the second column of the following list in the bash shell, e.g. the "_seq1" in this particular list. I do not want to change any other info in the remaining columns.
Thanks!
XP_003962102 comp1000054_c0_seq1 24.07 54 41 0 164 3
XP_003962102 comp1000054_c0_seq1 24.07 54 41 0 164 3
XP_003962102 comp1000054_c0_seq1 24.07 54 41 0 164 3
XP_003962102 comp1000054_c0_seq1 24.07 54 41 0 164 3

Here you go, a simple substitution using sed:
sed -e 's/_seq1//' file

Using sed:
sed -i.bak 's/^\(.*_c0\)[^ ]*\( .*\)$/\1\2/' file
OR using awk:
awk '{sub(/_c0[^ ]*/, "_c0", $2)} 1' file
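Since the title mentions a .tab file, note that assigning to $2 makes awk rebuild the line with its output field separator, which defaults to a single space. The following is a minimal sketch, assuming the columns really are tab-separated; the shortened sample row and the variable names input and result are made up for illustration.

```shell
# One sample row, tab-separated (an assumption; shortened from the question's data)
input=$(printf 'XP_003962102\tcomp1000054_c0_seq1\t24.07\t54')

# Split and rejoin on tabs, and cut everything after "_c0" in column 2 only
result=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = OFS = "\t" } { sub(/_c0.*/, "_c0", $2) } 1')

printf '%s\n' "$result"
```

Setting FS and OFS to the same character keeps the other columns and their separators unchanged.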

Related

Remove \r character from String pattern matched in AWK

I'm quite new to AWK so apologies for the basic question. I've found many references for removing windows end-line characters from files but none that match a regular expression and subsequently remove the windows end line characters.
I have a file named infile.txt that contains a line like so:
...
DATAFILE data5v.dat
...
Within a shell script I want to capture the filename argument data5v.dat from this infile.txt and remove any carriage return character, \r, IF present. The carriage return may not always be present. So I have to match a word and then remove the \r subsequently.
I have tried the following but it is not working how I expect:
FILENAME=$(awk '/DATAFILE/ { print gsub("\r", "", $2) }' $INFILE)
Can I store the string returned from matching my regex /DATAFILE/ in a variable within my AWK statement to subsequently apply gsub?
File names can contain spaces, including \rs, blanks and tabs, so to do this robustly you can't remove all \rs with gsub() and you can't rely on there being any field, e.g. $2, that contains the whole file name.
If your input fields are tab-separated you need:
awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' file
or this otherwise:
awk '/DATAFILE/ { sub(/[^[:space:]]+[[:space:]]+/,""); sub(/\r$/,""); print }' file
The above assumes your file names don't start with spaces and don't contain newlines.
To test any solution for robustness try:
printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '...' | cat -TEv
and make sure that the output looks like it does below:
$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^\t]+\t/,""); sub(/\r$/,""); print }' | cat -TEv
foo ^M^Ibar$
$ printf 'DATAFILE\tfoo \r\tbar\r\n' | awk '/DATAFILE/ { sub(/[^[:space:]]+[[:space:]]+/,""); sub(/\r$/,""); print }' | cat -TEv
foo ^M^Ibar$
Note the blank, ^M (CR), and ^I (tab) in the middle of the file name as they should be but no ^M at the end of the line.
If your version of cat doesn't support -T or -E then do whatever you normally do to look for non-printing chars, e.g. od -c or vi the output.
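For the original goal of capturing the name in a shell variable, a minimal sketch could look like the following. The sample file contents, the variable names infile and datafile, and the assumption that the keyword is the first whitespace-separated field are all illustrative, not taken from the question.

```shell
# Build a small CRLF-terminated sample file (contents are an assumption)
infile=$(mktemp)
printf 'HEADER x\r\nDATAFILE data5v.dat\r\nFOOTER y\r\n' > "$infile"

# On the DATAFILE line: drop a trailing CR if present, then drop the
# leading keyword and the whitespace after it; print the rest and stop.
datafile=$(awk '$1 == "DATAFILE" {
  sub(/\r$/, "")
  sub(/^[ \t]*[^ \t]+[ \t]+/, "")
  print
  exit
}' "$infile")

printf '%s\n' "$datafile"
rm -f "$infile"
```

Because only the trailing \r is removed, carriage returns inside the name would be preserved.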
With GNU awk, would you please try the following:
FILENAME=$(awk -v RS='\r?\n' '/DATAFILE/ {print $2}' "$INFILE")
echo "$FILENAME"
It assigns the record separator RS to a sequence of zero or one \r followed by \n.
As a side note, uppercase names are not recommended for your own variables because they may conflict with environment variables or names reserved by the shell.
Awk simply applies each line of script to each input line. You can easily remove the carriage return and then apply some other logic to the input line. For example,
FILENAME=$(awk '/\r/ { sub(/\r/, "") }
/DATAFILE/ { print $2 }' "$INFILE")
See also "When to wrap quotes around a shell variable".
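Who says you need GNU awk? mawk handles it too (gecho and odview below appear to be the poster's own wrappers around GNU echo and od):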
gecho -ne "test\r\nabc\n\rdef\n" \
\
| mawk NF=NF FS='\r' OFS='' | odview
0000000 1953719668 1667391754 1717920778 10
t e s t \n a b c \n d e f \n
164 145 163 164 012 141 142 143 012 144 145 146 012
t e s t nl a b c nl d e f nl
116 101 115 116 10 97 98 99 10 100 101 102 10
74 65 73 74 0a 61 62 63 0a 64 65 66 0a
0000015
gawk in POSIX mode (-P) is also fine with it:
gecho -ne "test\r\nabc\n\rdef\n" \
\
| gawk -Pe NF=NF FS='\r' OFS='' | odview
0000000 1953719668 1667391754 1717920778 10
t e s t \n a b c \n d e f \n
164 145 163 164 012 141 142 143 012 144 145 146 012
t e s t nl a b c nl d e f nl
116 101 115 116 10 97 98 99 10 100 101 102 10
74 65 73 74 0a 61 62 63 0a 64 65 66 0a
0000015

Compare two files and write the unmatched numbers in a new file

I have two files where ifile1.txt is a subset of ifile2.txt.
ifile1.txt  ifile2.txt
2           2
23          23
43          33
51          43
76          50
81          51
100         72
            76
            81
            89
            100
Desired output
ofile.txt
33
50
72
89
I was trying with
diff ifile1.txt ifile2.txt > ofile.txt
but it is giving different format of output.
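The comm command is made for this, but it expects both files to be sorted in the locale's collating order, not numerically. Your files are sorted numerically (100 comes after 81), so sort them first (the <( ) syntax needs bash, ksh or zsh):
comm -13 <(sort ifile1.txt) <(sort ifile2.txt) | sort -n > ofile.txt
-1 means omit the lines unique to the first file, and -3 means omit the lines that are in both files, so this shows just the lines that are unique to the second file. The final sort -n puts the result back in numeric order.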
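When sorting the inputs for comm is inconvenient, an order-independent alternative is a two-file awk lookup. This is a different technique from comm, sketched here under the assumption that each line holds one value; the temporary directory and the variable names tmpdir and unmatched are made up for illustration.

```shell
# Recreate the two sample files from the question in a scratch directory
tmpdir=$(mktemp -d)
printf '%s\n' 2 23 43 51 76 81 100 > "$tmpdir/ifile1.txt"
printf '%s\n' 2 23 33 43 50 51 72 76 81 89 100 > "$tmpdir/ifile2.txt"

# First file: remember every line. Second file: print lines not seen before.
awk 'NR == FNR { seen[$0]; next } !($0 in seen)' \
  "$tmpdir/ifile1.txt" "$tmpdir/ifile2.txt" > "$tmpdir/ofile.txt"

unmatched=$(cat "$tmpdir/ofile.txt")
printf '%s\n' "$unmatched"
rm -rf "$tmpdir"
```

The NR == FNR test is only true while awk reads the first file, so it assumes that file is not empty.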
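This will do the job, although every change command that diff prints (such as 2a3) leaves an empty line in the output:
diff ifile1.txt ifile2.txt | awk '{print $2}'
To drop those empty lines, you could try:
diff ifile1.txt ifile2.txt | awk '{print $2}' | grep -v '^$' > ofile.txt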

Ignore first few lines and last few lines in a file Linux

I have a file like this and would like to print every line ($0) except the first two and the last three, on Linux. I tried the awk command below without luck. I suppose I am doing something wrong, but with my minimal experience in computer science I cannot figure out what it is.
awk '{if(NR>2){c++}else if(FNR<=c-3){print $0}}' samplefile.out > sampleout.txt
entry0 45
entry0 42
entry1 41
entry2 78
entry3 89
entry4 68
entryn 58
entryn 33
etnryn 52
Thanks
awk cannot look ahead so you'll have to save the lines.
awk 'NR>2{if(z!="")print z;z=y;y=x;x=$0}' file
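Only three lines are held at any time, so the memory overhead is practically zero. Note that the z!="" test also drops empty lines from the output.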
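You can do it with a combination of tail and head. tail -n +3 starts the output at line 3, and head -n -3 drops the last three lines (the negative count is a GNU head feature):
tail -n +3 samplefile.out | head -n -3 > sampleout.txt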
Try this:
awk 'NR>2{a[++j]=$0}END{for (i=1;i<=j-3;i++){print a[i]}}' samplefile.out
There is no way to know how many lines the file has without reading it first, so the lines have to be saved as they are read.
If the file is very large, the tail + head combination may be a better choice because it avoids holding the whole file in memory.
You may also try this, which uses an array as a small ring buffer holding only the last few lines:
$ cat file
entry0 45
entry0 42
entry1 41
entry2 78
entry3 89
entry4 68
entryn 58
entryn 33
etnryn 52
$ awk 'NR>first+last{print A[NR%last]}{A[NR%last]=$0}' first=2 last=3 file
entry1 41
entry2 78
entry3 89
entry4 68
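Where negative counts for head are not available, a two-pass variant is another option: count the lines first, then let awk print only the wanted range. This is a minimal sketch, not one of the answers above; the temporary file and the names sample, total and trimmed are assumptions for illustration.

```shell
# Recreate the sample file from the question
sample=$(mktemp)
printf '%s\n' 'entry0 45' 'entry0 42' 'entry1 41' 'entry2 78' 'entry3 89' \
  'entry4 68' 'entryn 58' 'entryn 33' 'etnryn 52' > "$sample"

# Pass 1: count lines; the arithmetic expansion strips any padding from wc
total=$(( $(wc -l < "$sample") ))

# Pass 2: print lines after the first two and before the last three
trimmed=$(awk -v total="$total" -v first=2 -v last=3 \
  'NR > first && NR <= total - last' "$sample")

printf '%s\n' "$trimmed"
rm -f "$sample"
```

The file is read twice, so this only suits regular files, not data arriving on a pipe.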

(Unix) Changing A Row To A Column In A Text File

I currently have a text file that has the following data in row format:
TIME (HR) 0 6 12 18 24 36 48 60 72 84 96 108 120
I would like to "flip" this row into a column so that it reads:
TIME (HR)
0
6
12
18
24
etc...
Is there a way to do this with sed/awk?
grep could do:
grep -Po '.*\)|\d+' file
this line works too:
grep -Po '.*?(?= \d)|\d+' file
test:
kent$ cat f
TIME (HR) 0 6 12 18 24 36 48 60 72 84 96 108 120
kent$ grep -Po '.*\)|\d+' f
TIME (HR)
0
6
12
18
24
36
48
60
72
84
96
108
120
$ awk -v RS=' ' '{ORS=(NR<2?" ":"\n")}1' file
TIME (HR)
0
6
12
18
24
Through awk,
awk '{print $1,$2;for(i=3;i<=NF;i++) print $i}' file
Through perl,
perl -pe 's/(^\S+\s+\S+)(*SKIP)(*F)| /\n/g' file
Another perl one:
perl -pe 's/\s+(?=\d+)/\n/g'
Test:
$ echo 'TIME (HR) 0 6 12 18 24 36 48 60 72 84 96 108 120' | perl -pe 's/ (?=\d+)/\n/g'
TIME (HR)
0
6
12
18
24
36
48
60
72
84
96
108
120
Other great solutions (from the comments by @AvinashRaj):
perl -pe 's/\s+(?!\()/\n/g'
perl -pe 's/ (?=\b)/\n/g'
sed 's/ \([0-9]\)/\
\1/g' YourFile
This is the POSIX version (so use --posix with GNU sed).
It changes any space followed by a digit into a newline. The digit is captured and put back with \1 because sed regexes have no lookahead.
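The capture-and-restore substitution can be checked on a shortened copy of the sample row. This is only a sketch; the shortened row and the variable names row and column are assumptions for illustration.

```shell
# Shortened sample row (an assumption; the question's row has more values)
row='TIME (HR) 0 6 12 18 24'

# Replace each "space + digit" with "newline + that same digit".
# The space before "(" is left alone because "(" is not a digit.
column=$(printf '%s\n' "$row" | sed 's/ \([0-9]\)/\
\1/g')

printf '%s\n' "$column"
```

A backslash followed by a literal newline in the replacement is the portable way to insert a newline with sed.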

Cutting Element in Unix Based on Column Value

Without a shell script, in a single line. What command can help you cut from a row based on the column value
For example:
In
118 Balboni,Steve 23
11 Baker,Doug 0
120 Armas,Tony 13
133 Allanson,Andy 5
158 Baines,Harold 13
33 Bando,Chris 1
44 Adduci,James 1
50 Aguayo,Luis 3
5 Allen,Rod 0
94 Anderson,Brady 1
If the 3rd row is not zero, how do I remove the row entirely in one statement? Is this possible in unix?
Assuming that the question is really asking about 'if the third column is non-zero, do not print it' or (equivalently) 'only print the row if the third column is 0':
Using awk:
awk '$3 == 0' data
(If the third column is zero, print the input; otherwise, ignore it. You could add { print } after the 0 to make the action explicit.)
Using perl:
perl -nae 'print if $F[2] == 0' data
Using sed:
sed -n '/ 0$/p' data
Using grep:
grep '[^0-9]0$' data
This does the replacement in place:
perl -i -F -pane 'undef $_ if($F[2]!=0)' your_file
tested:
> cat temp
118 Balboni,Steve 23
11 Baker,Doug 0
120 Armas,Tony 13
133 Allanson,Andy 5
158 Baines,Harold 13
33 Bando,Chris 1
44 Adduci,James 1
50 Aguayo,Luis 3
5 Allen,Rod 0
94 Anderson,Brady 1
>
>
> perl -i -F -pane 'undef $_ if($F[2]!=0)' temp
> cat temp
11 Baker,Doug 0
5 Allen,Rod 0
>
If you wish to print lines that have no third column as well as those in which the 3rd column is explicitly 0 (ie, if you consider a blank field to be zero), try:
awk '!$3'
If you do not want to print lines with only 2 columns, try:
awk 'NF>2 && !$3'
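The difference between the last two commands shows up on a line that has no third column. Here is a minimal sketch; the row "7 Nofield,Sam" is a made-up two-column line added for illustration, and the names data, loose and strict are assumptions.

```shell
# Three rows from the question plus one hypothetical row with no third column
data=$(printf '%s\n' '118 Balboni,Steve 23' '11 Baker,Doug 0' \
  '7 Nofield,Sam' '5 Allen,Rod 0')

# !$3 is true when the third field is 0 or missing
loose=$(printf '%s\n' "$data" | awk '!$3')

# Requiring NF > 2 additionally excludes rows without a third field
strict=$(printf '%s\n' "$data" | awk 'NF > 2 && !$3')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```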
