first column copy under empty line - linux

Salam
Following is the required output:
RXOTG-136 VENEN6 0
VENEN6 1
VENEN7 0
VENEN7 1
RXOTG-137 TIVIK6 0
TIVIK6 1
RXOTG-138 KESTA1 0
KESTA1 1
KESTA2 0
KESTA2 1
KESTA3 0
KESTA3 1
RXOTG-139 KESTA4 0
KESTA4 1
For which I used the following command:
awk 'NF==1{a=$1; next}{ print val}'
but the output I am getting is
RXOTG-136 VENEN6 0
RXOTG-136 VENEN6 1
RXOTG-136 VENEN7 0
RXOTG-136 VENEN7 1
RXOTG-137 TIVIK6 0
RXOTG-137 TIVIK6 1
RXOTG-138 KESTA1 0
RXOTG-138 KESTA1 1
RXOTG-138 KESTA2 0
RXOTG-138 KESTA2 1
RXOTG-138 KESTA3 0
RXOTG-138 KESTA3 1
RXOTG-139 KESTA4 0
RXOTG-139 KESTA4 1

awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1' file
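You need to store the first field of each three-field line and prepend it to the two-field lines that follow.
The trailing 1 is an always-true pattern, so it prints every line.
Assigning to $1 makes awk rebuild the line with single spaces between fields, so the alignment changes; you can pipe the result through column -t to format it: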
awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1' file | column -t
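As a minimal sketch of the fill-down idea above, the snippet below wraps the same awk program in a shell function. The function name fill_down and the variable name key are only illustrative, and the sketch assumes that rows missing the label always have exactly two fields.

```shell
# fill_down: hypothetical helper name; reads stdin, writes stdout.
fill_down() {
    awk 'NF==3 { key = $1 }          # full row: remember its first field
         NF==2 { $1 = key OFS $1 }   # short row: prepend the remembered field
         1                           # always-true pattern: print the line'
}

# Sample rows from the question; the short rows start with blanks.
printf '%s\n' 'RXOTG-136 VENEN6 0' '          VENEN6 1' '          VENEN7 0' | fill_down
```

Because assigning to $1 rebuilds the record, the leading blanks of the short rows disappear and every output line uses single spaces.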

The following simple awk may help; it keys on whether the line starts with a space instead of counting fields.
awk '!/^ /{val=$1} /^ /{$1=val OFS $1} 1' Input_file | column -t

This might work for you (GNU sed):
sed -r '1h;1b;s/^/\n/;G;:a;/\n\s(.*\n)(.)(.*\S+\s+\S+$)/s//\2\n\1\3/;ta;s/\n//;s/\n.*//;h' file
Print the first line after making a copy in the hold space. For all subsequent lines, prepend a newline and append the previous line. Copy a character at a time from the previous line to the front of the current line until either there are no more spaces at the front of the current line or there are only two fields in the previous line. Remove the first introduced newline and remove the remains of the previous line. Copy the current line to the hold space, ready for the next time and print the current line.

Related

How to replace a number to another number in a specific column using awk

This is probably basic but I am completely new to command-line and using awk.
I have a file like this:
1 RQ22067-0 -9
2 RQ34365-4 1
3 RQ34616-4 1
4 RQ34720-1 0
5 RQ14799-8 0
6 RQ14754-1 0
7 RQ22101-7 0
8 RQ22073-1 0
9 RQ30201-1 0
I want the 0s to change to 1 in column 3, and any occurrence of 1 to change to 2 in column 3. So essentially I am only changing numbers in column 3, and I am not changing the -9.
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
I have tried using (see below) but it has not worked
awk '{gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
awk '{gsub("1","2",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
Thank you.
With this code in your question:
awk '{gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
awk '{gsub("1","2",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
there are two problems. First, you're running both commands on the same input file and writing their output to the same output file, so only the output of the 2nd script will be present in the output. Second, you're trying to change 0 to 1 first and THEN change 1 to 2, so the $3s that start out as 0 would end up as 2; you need to change the order of the operations.
This is what you should be doing, using your existing code:
awk '{gsub("1","2",$3); gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
For example:
$ awk '{gsub("1","2",$3); gsub("0","1",$3)}1' file
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
The gsub() should also just be sub()s as you only want to perform each substitution once, and you don't need to enclose the numbers in quotes so you could just do:
awk '{sub(1,2,$3); sub(0,1,$3)}1' file
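To see why the order matters, here is a small sketch that runs both orders on two rows taken from the question. The variable names data, wrong and right are only illustrative.

```shell
# Two rows from the question: one with 0 and one with 1 in column 3.
data='4 RQ34720-1 0
2 RQ34365-4 1'

# Wrong order: 0 becomes 1, and then that new 1 becomes 2.
wrong=$(printf '%s\n' "$data" | awk '{ sub(/0/, "1", $3); sub(/1/, "2", $3) } 1')

# Right order: the original 1s are changed first, then the 0s.
right=$(printf '%s\n' "$data" | awk '{ sub(/1/, "2", $3); sub(/0/, "1", $3) } 1')

printf 'wrong order:\n%s\nright order:\n%s\n' "$wrong" "$right"
```

In the wrong order both rows end with 2; in the right order the first row ends with 1 and the second with 2.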
You can check the value of column 3 and then update the field value.
Check for 1 as the first rule because if the first check is for 0, the value will be set to 1 and the next check will set the value to 2 resulting in all 2's.
awk '
{
if($3==1) $3 = 2
if($3==0) $3 = 1
}
1' file
Output
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
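One practical difference between comparing the whole field and substituting with a regex shows up on multi-digit values. The sketch below uses a made-up row whose third column is 10 (this row is an assumption, it is not in the question's data) and an else-if variant of the comparison so that the order of the checks no longer matters.

```shell
row='7 RQ00000-0 10'   # hypothetical row, not from the question

# Regex substitution touches digits inside the value: 10 -> 20 -> 21.
by_sub=$(printf '%s\n' "$row" | awk '{ sub(/1/, "2", $3); sub(/0/, "1", $3) } 1')

# Whole-value comparison leaves 10 alone because it is neither 1 nor 0.
by_cmp=$(printf '%s\n' "$row" | awk '{ if ($3 == 1) $3 = 2; else if ($3 == 0) $3 = 1 } 1')

printf '%s\n%s\n' "$by_sub" "$by_cmp"
```

If column 3 can only ever hold -9, 0, 1 or 2, both approaches give the same result.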
With your shown samples, try the following code, which uses ternary operators. In short: if the 3rd field is 1, set it to 2; otherwise, if it is 0, set it to 1; otherwise keep it as it is. Finally, print the line.
awk '{$3=$3==1?2:($3==0?1:$3)} 1' Input_file
Generic solution: this version takes three awk variables, each a comma-separated list. fieldNumber holds the field numbers to check, existValue holds the values to match, and newValue holds the replacement for each matched value, in the same order as existValue.
awk -v fieldNumber="3" -v existValue="1,0" -v newValue="2,1" '
BEGIN{
num=split(fieldNumber,arr1,",")
num1=split(existValue,arr2,",")
num2=split(newValue,arr3,",")
for(i=1;i<=num1;i++){
value[arr2[i]]=arr3[i]
}
}
{
for(i=1;i<=num;i++){
if($arr1[i] in value){
$arr1[i]=value[$arr1[i]]
}
}
}
1
' Input_file
This might work for you (GNU sed):
sed -E 's/\S+/\n&\n/3;h;y/01/12/;G;s/.*\n(.*)\n.*\n(.*)\n.*\n.*/\2\1/' file
Surround 3rd column by newlines.
Make a copy.
Replace all 0's by 1's and all 1's by 2's.
Append the original.
Pattern match on newlines and replace the 3rd column in the original by the 3rd column in the amended line.
Also with awk:
awk 'NR > 1 {s=$3;sub(/1/,"2",s);sub(/0/,"1",s);$3=s} 1' file
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
the substitutions are made with sub() on a copy of $3 and then the copy with the changes is assigned to $3.
When you don't like the simple
sed 's/1$/2/; s/0$/1/' file
you might want to play with
sed -E 's/(.*)([01])$/echo "\1$((\2+1))"/e' file

GNU Awk - don't modify whitespaces

I am using GNU Awk to replace a single character in a file. The file is a single line with varying whitespacing between "fields". After passing through gawk all the extra whitespacing is removed and I end up with single spaces. This is completely unintended and I need it to ignore these spaces and only change the one character I have targeted. I have tried several variations, but I cannot seem to get gawk to ignore these extra spaces.
Since I know this will come up, I read from the end of the line for replacement because the whitespacing is arbitrary/inconsistent in the source file.
Command:
gawk -i inplace -v new=3 'NF {$(NF-5) = new} 1' ~/scripts/tmp_beta_weather_file
Original file example:
2020-07-01 18:29:51.00 C M -11.4 28.9 29 9 23 5.5 000 0 0 00020 044013.77074 1 1 1 3 0 0
Result after command above:
2020-07-01 18:30:51.00 C M -11.8 28.8 29 5 23 5.5 000 0 0 00020 044013.77143 3 1 1 3 0 0
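It might be easier with sed:
sed -E 's/([^ ]+)(( [^ ]+){5})$/3\2/' file
Test it first, then add -i for an in-place edit. Note that this pattern expects the last five fields to be separated by single spaces.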
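If the trailing fields may themselves be separated by runs of spaces or tabs, the same idea can be written with POSIX character classes. This is only a sketch: the function name replace_sixth_from_last is made up, and it assumes the new value contains nothing special in a sed replacement (no &, backslash or slash).

```shell
# replace_sixth_from_last NEW: hypothetical helper; reads stdin, writes stdout.
# Replaces the field that awk would call $(NF-5) and keeps all whitespace as is.
replace_sixth_from_last() {
    sed -E 's/[^[:space:]]+(([[:space:]]+[^[:space:]]+){5}[[:space:]]*)$/'"$1"'\1/'
}

# Made-up line with uneven spacing; the sixth field from the end is the first "1".
printf '%s\n' 'C  M   -11.4 1  1 1   3 0  0' | replace_sixth_from_last 3
# prints: C  M   -11.4 3  1 1   3 0  0
```

The capture group holds the last five fields together with the whitespace in front of each one, so putting it back unchanged is what preserves the original spacing.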

Replace two columns linux

I want to replace the second column of my first file
file 1:
2 rs58086319 0 983550 T C
2 rs56809628 0 983571 T C
2 rs7608441 0 983572 A G
2 rs114910509 0 983579 A G
2 var_chr2_983614 0 983614 T C
2 var_chr2_983624 0 983624 A G
2 rs115188027 0 983632 A C
2 var_chr2_983636 0 983636 T C
2 var_chr2_983650 0 983650 A G
2 var_chr2_983660 0 983660 T C
with the first column of my second file
file 2:
2_983550_T_C
2_983571_T_C
2_983572_A_G
2_983579_A_G
2_983614_T_C
2_983624_A_G
2_983632_A_C
2_983636_T_C
2_983650_A_G
2_983660_T_C
I've tried join and awk but somehow it doesn't seem to work. I suspect the fact that there's '_' on my second file.
Thank you
I'm a bit puzzled why you need a second file. All information of file2 seems to be encoded in file1. You could just do something like this :
awk '{$2=$1"_"$4"_"$5"_"$6}1' file1
Your file2 has only one column, so with awk:
awk -v f='file2' '{getline $2 <f}1' file1
If the separator of file2 is "_"
awk -v f='file2' '{getline a <f;split(a,b,"_");$2=b[1]}1' file1
EDIT: If you want to use _ as the field separator for file2, the following may help.
awk 'FNR==NR{a[FNR]=$1;next} (FNR in a){$2=a[FNR]} 1' FS="_" file2 FS=" " file1 | column -t
Following awk may help you here.
awk 'FNR==NR{a[FNR]=$0;next} (FNR in a){$2=a[FNR]} 1' file2 file1 | column -t
I would go with paste and awk, e.g.:
paste file1 file2 | awk '{ $2 = $NF } NF--' OFS='\t'
Output:
2 2_983550_T_C 0 983550 T C
2 2_983571_T_C 0 983571 T C
2 2_983572_A_G 0 983572 A G
2 2_983579_A_G 0 983579 A G
2 2_983614_T_C 0 983614 T C
2 2_983624_A_G 0 983624 A G
2 2_983632_A_C 0 983632 A C
2 2_983636_T_C 0 983636 T C
2 2_983650_A_G 0 983650 A G
2 2_983660_T_C 0 983660 T C
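For readers unfamiliar with the FNR==NR idiom used in the two-file awk answers above, here is a self-contained sketch. It creates two throwaway files with mktemp and fills them with the first two rows of each file from the question; the variable names f1, f2, id and merged are only illustrative.

```shell
# Throwaway sample files; contents are the first two rows from the question.
f1=$(mktemp)
f2=$(mktemp)
printf '%s\n' '2 rs58086319 0 983550 T C' '2 rs56809628 0 983571 T C' > "$f1"
printf '%s\n' '2_983550_T_C' '2_983571_T_C' > "$f2"

# FNR==NR is true only while awk reads the first file it is given (file2 here):
# store each of its lines under its line number and skip to the next line.
# For the second file (file1), replace field 2 with the stored line of the
# same line number, then print.
merged=$(awk 'FNR==NR { id[FNR] = $0; next }
              (FNR in id) { $2 = id[FNR] }
              1' "$f2" "$f1")

printf '%s\n' "$merged"
rm -f "$f1" "$f2"
```

This pairs the files purely by line number, so it relies on both files listing the variants in the same order.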

How to split a column which has multiple dots using Linux command line

I have a file which looks like this:
chr10:100013403..100013414,- 0 0 0 0
chr10:100027943..100027958,- 0 0 0 0
chr10:100076685..100076699,+ 0 0 0 0
I want output to be like:
chr10 100013403 100013414 - 0 0 0 0
chr10 100027943 100027958 - 0 0 0 0
chr10 100076685 100076699 + 0 0 0 0
So, I want the first column to be tab separated at field delimiter = : , ..
I have used awk -F":|," '$1=$1' OFS="\t" file to separate first column. But, I am still struggling with .. characters.
I tried awk -F":|,|.." '$1=$1' OFS="\t" file but this doesn't work.
.. should be escaped.
awk -F':|,|\\.\\.' '$1=$1' OFS="\t" file
It is important to remember that when you assign a string constant as the value of FS, it undergoes normal awk string processing. For example, with Unix awk and gawk, the assignment FS = "\.." assigns the character string .. to FS (the backslash is stripped). This creates a regexp meaning “fields are separated by occurrences of any two characters.” If instead you want fields to be separated by a literal period followed by any single character, use FS = "\\..".
https://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html
If your Input_file is the same as the shown sample, then the following may help too.
awk '{gsub(/:|\.+|\,/,"\t");} 1' Input_file
Here I am using awk's gsub function to globally replace each colon, each run of dots (\.+) and each comma with a TAB; the 1 then prints the line, whether or not it was edited. I hope this helps.
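As a sketch of the field-separator approach, the function below splits on a colon, a comma, two literal dots, or a single space, and rebuilds the line with tabs. The name split_region is made up, and including the space in the separator is an assumption about the desired output (it puts the trailing counts in their own tab-separated columns). Writing the dots as [.][.] avoids the backslash-escaping issue described above.

```shell
# split_region: hypothetical helper; reads stdin, writes tab-separated stdout.
split_region() {
    # [.][.] matches two literal dots without needing any backslashes.
    # Assigning $1 to itself makes awk rebuild the record using OFS.
    awk -F':|,|[.][.]| ' -v OFS='\t' '{ $1 = $1 } 1'
}

printf '%s\n' 'chr10:100013403..100013414,- 0 0 0 0' | split_region
```

Using the action { $1 = $1 } followed by 1 prints every line, whereas $1=$1 used as a bare pattern would skip a line whose first field is empty or 0.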

How to delete the first column ( which is in fact row names) from a data file in linux?

I have data file with many thousands columns and rows. I want to delete the first column which is in fact the row counter. I used this command in linux:
cut -d " " -f 2- input.txt > output.txt
but nothing changed in my output. Does anybody know why it does not work and what I should do?
This is what my input file looks like:
col1 col2 col3 col4 ...
1 0 0 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 0 0
5 0 1 1 1
6 1 1 1 0
7 1 0 0 0
8 0 0 0 0
9 1 0 0 0
10 1 1 1 1
11 0 0 0 1
.
.
.
I want my output look like this:
col1 col2 col3 col4 ...
0 0 0 1
0 1 0 1
0 1 0 0
0 0 0 0
0 1 1 1
1 1 1 0
1 0 0 0
0 0 0 0
1 0 0 0
1 1 1 1
0 0 0 1
.
.
.
I also tried the sed command:
sed '1d' input.file > output.file
But it deletes the first row not the first column.
Could anybody guide me?
The idiomatic use of cut would be
cut -f2- input > output
if your delimiter is a tab ("\t").
Or, simply with awk magic (will work for both space and tab delimiter)
awk '{$1=""}1' input | awk '{$1=$1}1' > output
The first awk empties field 1 but leaves its delimiter behind; the second awk removes that delimiter by rebuilding the line. The default output delimiter is a space; if you want to change it to a tab, add -vOFS="\t" to the second awk.
UPDATED
Based on your updated input, the problem is the leading spaces, which cut treats as delimiters of extra (empty) columns. One way to address this is to remove them before feeding the lines to cut:
sed 's/^ *//' input | cut -d" " -f2- > output
or use the awk alternative above which will work in this case as well.
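The effect of the leading spaces can be reproduced on a single row. In this sketch the helper name strip_first_column is made up, and the sample row assumes three leading spaces.

```shell
line='   1 0 0 0 1'   # a data row with leading spaces

# cut treats every single space as a delimiter, so the leading spaces
# produce empty fields and -f2- only drops the first empty one.
printf '%s\n' "$line" | cut -d' ' -f2-

# Stripping the leading spaces first lets -f2- drop the row counter.
strip_first_column() {   # hypothetical helper name
    sed 's/^ *//' | cut -d' ' -f2-
}
printf '%s\n' "$line" | strip_first_column
```

The first command prints the row with two leading spaces still in place and the counter intact; the second prints 0 0 0 1.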
@karakfa I had CSV files, so I used "," as the separator (you can replace it with yours):
cut -d"," -f2- input.csv > output.csv
Then, I used a loop to go over all files inside the directory
# files are in the directory tmp/
for f in tmp/*
do
name=$(basename "$f")
echo "processing file : $name"
# keep all columns except the first one of each csv file
cut -d"," -f2- "$f" > "new/$name"
# files using the same names are stored in directory new/
done
You can use cut command with --complement option:
cut -f1 -d" " --complement input.file > output.file
This will output all columns except the first one.
As @karakfa notes, it looks like it's the leading whitespace which is causing your issues.
Here's a sed oneliner to do the job (that will account for spaces or tabs):
sed -i.bak "s|^[ \t]\+[0-9]\+[ \t]\+||" input.txt
Explanation:
-i edit existing file in place
.bak backup original file and add .bak file extension (can use whatever you like)
s substitute
| separator (easiest character to read as sed separator IMO)
^ start match at start of the line
[ \t] match space or tab
\+ match one or more times (escape required so sed does not interpret '+' literally)
[0-9] match any digit 0 - 9
As noted; the input.txt file will be edited in place. The original content of input.txt will be saved as input.txt.bak. Use just -i instead if you don't want a backup of the original file.
Also, if you know that they are definitely leading spaces (not tabs), you could shorten it to this:
sed -i.bak "s|^ \+[0-9]\+[ \t]\+||" input.txt
You can also achieve this with grep:
grep -E -o '[[:digit:]]([[:space:]][[:digit:]]){3}$' input.txt
Which assumes single character digit and space columns. To accommodate a variable number of spaces and digits you can do:
grep -E -o '[[:digit:]]+([[:space:]]+[[:digit:]]+){3}$' input.txt
If your grep supports the -P flag (--perl-regexp) you can do:
grep -P -o '\d+(\s+\d+){3}$' input.txt
And here are a few options if you are using GNU sed:
sed 's/^\s\+\w\+\s\+//' input.txt
sed 's/^\s\+\S\+\s\+//' input.txt
sed 's/^\s\+[0-9]\+\s\+//' input.txt
sed 's/^\s\+[[:digit:]]\+\s\+//' input.txt
Note that the grep regexes are matching the parts that we want to keep while the sed regexes are matching the parts we want to remove.
