deleting part after a specific character and that character - linux

I'm working within my command line/bash on a large file with millions of rows. I'm analyzing the data with a software that requires the rsIDs to be less than 40 characters.
awk 'length($2)>40' 1000G_All_chr_merged.bim > IDtoolong.bim
head IDtoolong.bim
1 rs540674385;rs540674385;rs540674385;rs576523156 0 4439107 AAG AAGGAGG
1 rs561687032;rs546685337;rs528205989;rs370782231 0 4804685 GCACACA GCA
1 rs561021122;rs542858700;rs527502051;rs560257256;rs545143128 0 6210427 AGG GGAAT
1 rs529037702;rs561824298;rs539915961;rs528175459 0 12122415 CCCATCCAT AT
1 rs571308260;rs549871057;rs537509991;rs587738155 0 12611561 CAAA CAAAA
1 rs553093917;rs553093917;rs534535365;rs570185860 0 16657917 AAAT AAATAAT
How can I run through the second column and delete the first semicolon, ;, and anything after that?
I tried this:
awk '{sub(/;.*/,"", $2)}' 1000G_All_chr_merged.bim > adjusted_IDlength.bim
I also tried using sed, but found myself ruining the file at one point. Any help is appreciated!

I'm guessing that by "ruining the file" you mean changing the white space between fields. If that's the problem, the following won't do that:
$ sed 's/;[^[:space:]]*//' file
1 rs540674385 0 4439107 AAG AAGGAGG
1 rs561687032 0 4804685 GCACACA GCA
1 rs561021122 0 6210427 AGG GGAAT
1 rs529037702 0 12122415 CCCATCCAT AT
1 rs571308260 0 12611561 CAAA CAAAA
1 rs553093917 0 16657917 AAAT AAATAAT
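If you would rather stay with awk, the attempt in the question is close: it only lacks a print action. A minimal sketch, assuming the fields are separated by single spaces as in the sample shown (sample.bim is a hypothetical file name used for illustration). Note that assigning to $2 makes awk rebuild the line with OFS, so tabs or runs of spaces would be collapsed unless OFS is set to match:

```shell
# Create a small sample (hypothetical file name) with two of the rows shown above.
cat > sample.bim <<'EOF'
1 rs540674385;rs540674385;rs540674385;rs576523156 0 4439107 AAG AAGGAGG
1 rs561021122;rs542858700;rs527502051;rs560257256;rs545143128 0 6210427 AGG GGAAT
EOF

# Strip the first ';' and everything after it in column 2 only;
# the trailing 1 is an always-true pattern, so every line is printed.
result=$(awk '{ sub(/;.*/, "", $2) } 1' sample.bim)
printf '%s\n' "$result"
```

The sed command above remains the safer choice when the original spacing has to be preserved exactly.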

Related

How to replace a number to another number in a specific column using awk

This is probably basic but I am completely new to command-line and using awk.
I have a file like this:
1 RQ22067-0 -9
2 RQ34365-4 1
3 RQ34616-4 1
4 RQ34720-1 0
5 RQ14799-8 0
6 RQ14754-1 0
7 RQ22101-7 0
8 RQ22073-1 0
9 RQ30201-1 0
I want the 0s to change to 1 in column 3, and any occurrence of 1 or 2 to become 2 in column 3. So essentially I am only changing numbers in column 3, and I am not changing the -9.
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
I have tried using (see below) but it has not worked
>> awk '{gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
>> awk '{gsub("1","2",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
Thank you.
With this code in your question:
awk '{gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
awk '{gsub("1","2",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
you're running both commands on the same input file and writing their output to the same output file, so only the output of the 2nd script will be present in the output, and
you're trying to change 0 to 1 first and THEN change 1 to 2, so the $3s that start out as 0 would end up as 2; you need to change the order of the operations.
This is what you should be doing, using your existing code:
awk '{gsub("1","2",$3); gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
For example:
$ awk '{gsub("1","2",$3); gsub("0","1",$3)}1' file
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
The gsub() should also just be sub()s as you only want to perform each substitution once, and you don't need to enclose the numbers in quotes so you could just do:
awk '{sub(1,2,$3); sub(0,1,$3)}1' file
You can check the value of column 3 and then update the field value.
Check for 1 as the first rule because if the first check is for 0, the value will be set to 1 and the next check will set the value to 2 resulting in all 2's.
awk '
{
if($3==1) $3 = 2
if($3==0) $3 = 1
}
1' file
Output
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
With your shown samples and ternary operators, try the following code. A simple explanation: check whether the 3rd field is 1 and, if so, set it to 2; otherwise check whether it is 0 and, if so, set it to 1; otherwise keep it as it is. Finally, print the line.
awk '{$3=$3==1?2:($3==0?1:$3)} 1' Input_file
Generic solution: here is a generic solution that uses 3 awk variables. fieldNumber lists the field numbers to check (comma-separated), existValue lists the values to match in those fields, and newValue lists the replacement for each of those values, in the same order.
awk -v fieldNumber="3" -v existValue="1,0" -v newValue="2,1" '
BEGIN{
num=split(fieldNumber,arr1,",")
num1=split(existValue,arr2,",")
num2=split(newValue,arr3,",")
for(i=1;i<=num1;i++){
value[arr2[i]]=arr3[i]
}
}
{
for(i=1;i<=num;i++){
if($arr1[i] in value){
$arr1[i]=value[$arr1[i]]
}
}
}
1
' Input_file
This might work for you (GNU sed):
sed -E 's/\S+/\n&\n/3;h;y/01/12/;G;s/.*\n(.*)\n.*\n(.*)\n.*\n.*/\2\1/' file
Surround 3rd column by newlines.
Make a copy.
Replace all 0's by 1's and all 1's by 2's.
Append the original.
Pattern match on newlines and replace the 3rd column in the original by the 3rd column in the amended line.
Also with awk:
awk 'NR > 1 {s=$3;sub(/1/,"2",s);sub(/0/,"1",s);$3=s} 1' file
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
The substitutions are made with sub() on a copy of $3, and then the changed copy is assigned back to $3.
When you don't like the simple
sed 's/1$/2/; s/0$/1/' file
you might want to play with
sed -E 's/(.*)([01])$/echo "\1$((\2+1))"/e' file

How can I replace a specific character in a file where its position changes in bash command line or script?

I have the following file:
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
The character I need to change is the "3" that follows 043857.82219 (it was bolded and italicized in the original post). The value of this character is dynamic, but always a single digit. I have tried a few things using sed, but I can't come up with a way to account for the character changing position due to additional characters being added before that position.
This character is always at the same position from the END of the line, but not from the beginning. Meaning, the content to the left of this character may change and it may be longer, but this is always the 11th character and 6th digit from the end. It is easy to devise a way to cut it, or find it using tail, but I can't devise a way to replace it.
To be clear, the single digit character in question will always be replaced with another single digit character.
With GNU awk
$ cat file
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
$ gawk -i inplace -v new=9 'NF {$(NF-5) = new} 1' file
$ cat file
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 9 1 1 1 1 1
Where:
NF {$(NF-5) = new} means, when the line is not empty, replace the 6th-last field with the new value (9).
1 means print every record.
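The -i inplace option is specific to GNU awk. A minimal sketch of the same replacement for other awks, writing to a temporary file and moving it over the original (file and file.tmp are illustrative names, and new=9 is just an example value). As with any field assignment in awk, runs of whitespace between fields are collapsed to single spaces in the rewritten line:

```shell
# Sample input (illustrative file name), same line as in the question.
cat > file <<'EOF'
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
EOF

# Replace the 6th-last field on non-empty lines, then swap the temp file in
# only if awk succeeded.
awk -v new=9 'NF { $(NF-5) = new } 1' file > file.tmp && mv file.tmp file

result=$(cat file)
printf '%s\n' "$result"
```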
awk '{ $(NF-5) = ($(NF - 5) + 8) % 10; print }'
Given your input data, it produces:
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 1 1 1 1 1 1
The 3 has been mapped via 11 to 1 — pick your poison on how you assign the new value, but the magic is $(NF - 5) to pick up the fifth column before the last one (or sixth from end).
Would you try the following:
replace="x" # or whatever you want to replace
sed 's/\(.\)\(.\{10\}\)$/'"$replace"'\2/' file
The left portion of the sed command \(.\)\(.\{10\}\)$ matches a character, followed by ten characters, then anchored by the end of line.
Then the 1st character is replaced with the specified character and the following ten characters are reused.
I'm gonna assume that the number that you're looking for is the same distance from the end, regardless of what comes before it:
rev ~/test.txt | awk '{$6 = <value to replace>} 1' | rev
Using the bash shell which should be the last option.
rep=10
read -ra var <<< '2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1'
# replace the 6th-last word, then print the rebuilt line
var[-6]=$rep; printf '%s\n' "${var[*]}"
If it is in a file.
rep=10
read -ra var < file.txt
# replace the 6th-last word, then print the rebuilt line
var[-6]=$rep; printf '%s\n' "${var[*]}"
Not the shortest and fastest way but it can be done...

How to reorder columns of hundreds of tab-delimited files in linux?

I have a large number of tab-delimited files (a couple of hundred), but the order of the columns differs across the files (the same columns, in different positions). Hence, I need to reorder the columns in all the files and write them back in tab-delimited format.
I would like to write a shell script that takes a specified order of columns and reorder all the columns in all the files and write it back. Can someone help me with it?
Here is how the header of my files looks like:
file1)
sLS72 chrX
A B E C F H
2 1 4 5 7 8
0 0 0 0 0 0
and the header of my second file:
S721 chrX
A E B F H C
12 11 2 3 4 1
0 0 0 0 0 0
here is the order of the columns that I want to achieve:
Order=[A ,B ,C ,E,F,H]
and here is the expected outputs for each file based on this ordering:
sLS72 chrX
A B C E F H
2 1 5 4 7 8
0 0 0 0 0 0
file 2:
S721 chrX
A B C E F H
12 2 1 11 3 4
0 0 0 0 0 0
I was trying to use awk:
awk -F'\t' '{s2=$A; $3=$B; $4=$C; $5=$E; $1=s}1' OFS='\t' in file
but the point is that, first, the order of the columns differs between files, and second, the column names start on the second line of the file. In other words, the first line is a header that I don't want to change, and the second line holds the column names, so I want to order all files based on that. It's kind of tricky.
$ awk -v order="A B C E F H" '
BEGIN {n=split(order,ho)}
FNR==1 {print; next}
FNR==2 {for(i=1;i<=NF;i++) hn[$i]=i}
{for(i=1;i<=n;i++) printf "%s",$hn[ho[i]] (i==n?ORS:OFS)}' file1 > tmp && mv tmp file1
sLS72 chrX
A B C E F H
2 1 5 4 7 8
0 0 0 0 0 0
if working on multiple files at the same time, change it to
$ awk -v ...
{... printf "%s",$hn[ho[i]] (i==n?ORS:OFS) > (FILENAME"_reordered") }' dir/files*
and do a mass rename afterwards. The alternative is to run the original script in a loop for each file.
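The loop alternative mentioned above could look like the following sketch. It assumes the files are tab-delimited as described in the question, so FS and OFS are set to a tab (the original command relies on awk's default whitespace splitting); file1.txt and file2.txt are hypothetical names, and each file is rewritten through a temporary file:

```shell
# Two small tab-delimited samples (hypothetical names) mirroring the question.
printf 'sLS72 chrX\nA\tB\tE\tC\tF\tH\n2\t1\t4\t5\t7\t8\n' > file1.txt
printf 'S721 chrX\nA\tE\tB\tF\tH\tC\n12\t11\t2\t3\t4\t1\n' > file2.txt

for f in file1.txt file2.txt; do
    awk -F'\t' -v OFS='\t' -v order="A B C E F H" '
        # Split the wanted order on spaces explicitly, because FS is a tab here.
        BEGIN  { n = split(order, ho, " ") }
        FNR==1 { print; next }                          # line 1: header, unchanged
        FNR==2 { for (i = 1; i <= NF; i++) hn[$i] = i } # line 2: column name -> position
        { for (i = 1; i <= n; i++) printf "%s%s", $hn[ho[i]], (i == n ? ORS : OFS) }
    ' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

# Show the first data row of each reordered file.
row1=$(sed -n '3p' file1.txt)
row2=$(sed -n '3p' file2.txt)
printf '%s\n%s\n' "$row1" "$row2"
```

Running awk once per file keeps the name-to-position map from one file from leaking into the next.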

Counting number of rows depending on more than 1 column condition

I have a data file like this
H1 H2 H3 E1 E2 E3 C1 C2 C3
0 0 0 0 0 0 0 0 1
1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 1 0 1
Now I want to count the rows where H1, H2, H3 have the same pattern as E1, E2, E3. For example, I want to count the number of times H1,H2,H3 and E1,E2,E3 are both 010 or both 000.
I tried to use this code, but it doesn't really work:
awk -F "" '!($1==0 && $2==1 && $3==0 && $4==0 && $5==1 && $6==0)' file | wc -l
Something like
>>> awk '$1$2$3 == $4$5$6' input | wc -l
2
What it does?
$1$2$3 == $4$5$6 Checks if the string formed by columns 1 2 and 3 is equal to the columns formed by 4 5 and 6. When it is true, awk takes the default action of printing the entire line and the wc takes care of counting those lines.
Or, if you want complete awk solution, you can write
>>> awk '$1$2$3 == $4$5$6{count++} END{print count}' input
2
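If only particular patterns should count, such as the 010 and 000 mentioned in the question, the same comparison can be narrowed. A sketch, assuming whitespace-separated columns as in the sample (input is an illustrative file name); count + 0 makes awk print 0 rather than an empty line when nothing matches:

```shell
# Sample data from the question (illustrative file name).
cat > input <<'EOF'
H1 H2 H3 E1 E2 E3 C1 C2 C3
0 0 0 0 0 0 0 0 1
1 0 0 0 1 0 0 0 1
0 1 0 0 1 0 1 0 1
EOF

count=$(awk '
    { h = $1 $2 $3; e = $4 $5 $6 }                    # concatenate the H and E columns
    h == e && (h == "010" || h == "000") { count++ }  # same pattern, and one we care about
    END { print count + 0 }
' input)
printf '%s\n' "$count"
```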

bash: grep exact matches based on the first column

I have a .txt file like below:
9342432_A1 9342432 1 0 0 0
4392483_A2 4392483 2 0 0 0
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0
For example, I want to generate a subset with the IDs 4324321_A3 and 9342432 (based on the first column!).
I tried the following command to find the exact matches:
grep -E '4324321_A3|9342432'
But when I use this line, I end up with a dataset like this:
9342432_A1 9342432 1 0 0 0
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0
The problem is that the line that matches a part of the ID (9342432_A1) shouldn't be there.
Can anyone help me with this?
I would like to end up with this:
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0
It matches
9342432_A1 9342432 1 0 0 0
because it contains 9342432, both as part of the first column and in the second column.
You need to update the command so that grep only matches those words at the start of the line, that is, use ^word. The anchor alone is not enough here, because 9342432_A1 also starts with 9342432, so add -w to match whole words only:
$ grep -wE '^4324321_A3|^9342432' file
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0
Thanks to -w, this also would not match a line like
4324321_A3something 4324321 1 0 0 0
Your regex doesn't check if the ID is at the start of the line. Simply include a ^ at the beginning of your regular expression to tell it to match only IDs at the start of the line, and then group the alternatives using ():
grep -E '^(4324321_A3|9342432)\b' <file>
\b is a word-boundary anchor, which forces it to match only whole words.
When you need to match on a specific field (or column) of your files, it could be better to use a tool like awk instead of grep. You can write something like this:
awk '$1 == "STRING_TO_MATCH"' txtfile.txt
This also works on a column other than the first (just use $2 for the second column, $3 for the third, and so on).
awk accepts regular expressions, just as grep does.
Regards.
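For the two IDs in the question, the awk comparison could be extended as in the sketch below. It assumes the wanted IDs are kept one per line in a separate file (ids.txt and data.txt are hypothetical names); the first file is loaded into an array, and lines of the second are printed only when column 1 is exactly one of those IDs:

```shell
# Sample data from the question and a list of wanted IDs (hypothetical file names).
cat > data.txt <<'EOF'
9342432_A1 9342432 1 0 0 0
4392483_A2 4392483 2 0 0 0
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0
EOF
printf '%s\n' 4324321_A3 9342432 > ids.txt

# NR == FNR is true only while awk reads the first file (ids.txt):
# remember each ID, then print data lines whose first column is a stored ID.
subset=$(awk 'NR == FNR { ids[$1]; next } $1 in ids' ids.txt data.txt)
printf '%s\n' "$subset"
```

Because the comparison is an exact string match on the whole field, 9342432_A1 is not selected even though it begins with 9342432.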
In your grep pattern, include ^ at the beginning and a space after the ID.
Add a start of line anchor at the beginning and a word boundary at the end of each pattern
grep -E '^4324321_A3\b|^9342432\b'
