in unix system, replace string in file, with variouse position - string

I work on unix server.
I have many csv files containing, among other info, date fields.
I have to replace some of this date fields with another value, for example 20110915 to 20110815. Their position is variable from a file to another.
The problem is that the substitution is position field specific. For example, if my file has row like this:
blablabla;12;0.2121;20110915;20110915;19951231;popopo;other text;321;20101010
I have to replace only first date fields and not other, transforming row in:
blablabla;12;0.2121;20110815;20110915;19951231;popopo;other text;321;20101010
Is there a way to restrict the replace in file, using some constraints?
Thanks

You can try awk:
awk 'BEGIN {FS=";";OFS=";"} {if($4=="20110915")$4="20110815"; print}' input.csv
How it works:
FS and OFS define the input and output field separators. It compares the fourth field ($4) against 20110915. If it matches, it is changed to 20110815. The line is then printed.

Here is an alternative using gsub in awk:
awk 'BEGIN {FS=";";OFS=";"} {gsub(/20110915/,20110815,$4); print}' input.csv
Here is a method, if you have to substitute in a range of fields/columns (e.g. 4-4):
awk 'BEGIN {FS=";";OFS=";"} {for(i=4;i<=4;i++){gsub(/20110915/,20110815,$i)}; print}' input.csv

Related

Replacing a string in the beginning of some rows in two columns with another string in linux

I have a tab separated text file. In column 1 and 2 there are family and individual ids that start with a character followed by number as follow:
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
NA1008 NA1008
NA1009 NA1009
I would like to replace NA with HG in both the columns. I am very new to linux and tried the following code and some others:
awk '{sub("NA","HG",$2)';print}' input file > output file
Any help is highly appreciated.
Converting my comment to answer now, use gsub in spite of sub here. Because it will globally substitute NA to HG here.
awk 'BEGIN{FS=OFS="\t"} {gsub("NA","HG");print}' inputfile > outputfile
OR use following in case you have several fields and you want to perform substitution only in 1st and 2nd fields.
awk 'BEGIN{FS=OFS="\t"} {sub("NA","HG",$1);sub("NA","HG",$2);print}' inputfile > outputfile
Change sub to gsub in 2nd code in case multiple occurrences of NA needs to be changed within field itself.
The $2 in your call to sub only replaces the first occurrence of NA in the second field.
Note that while sed is more typical for such scenarios:
sed 's/NA/HG/g' inputfile > outputfile
you can still use awk:
awk '{gsub("NA","HG")}1' inputfile > outputfile
See the online demo.
Since there is no input variable in gsub (that performs multiple search and replaces) the default $0 is used, i.e. the whole record, the current line, and the code above is equal to awk '{gsub("NA","HG",$0)}1' inputfile > outputfile.
The 1 at the end triggers printing the current record, it is a shorter variant of print.
Notice /^NA/ position at the beginning of field:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
HG1008 HG1008
HG1009 HG1009
and save it:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile
If you have a tab as separator:
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile

Convert data to be single quoted

I have a task where I want to convert the below text to single quote text
The data in the file is:
(A,1)
(DC,2)
(EFG,3)
The output should be like:
('A'1)
('DC',2)
('EFG'3)
I used awk -F print '{$2}' > file.txt
Could you please try following.
awk 'BEGIN{s1="\047";FS=OFS=","} {sub(/^\(/,"&" s1 );$1=$1 s1} 1' Input_file
Why OP's attempt didn't work: Since OP has not defined F(field separator) and simply trying to print 2nd column which will be NULL.

gsub in awk with variable

I want to replace the ">" with variable names staring with ">" and ends with ".". But the following code is not printing the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta..
Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk has knowledge of the filename it currently operates on through the built-in variable FILENAME; I strip the .fasta extension using gensub, and store it in the variable stub. The I invoke gsub to replace ">" with ">" and the content of my variable stub. After that I print it.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
Could you please try following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all lines(except line that are starting from >).
' sample01.fasta ##Mentioning Input_file name here.

Reformat data using awk

I have a dataset that contains rows of UUIDs followed by locations and transaction IDs. The UUIDs are separated by a semi-colon (';') and the transactions are separated by tabs, like the following:
01234;LOC_1=ABC LOC_1=BCD LOC_2=CDE
56789;LOC_2=DEF LOC_3=EFG
I know all of the location codes in advance. What I want to do is transform this data into a format I can load into SQL/Postgres for analysis, like this:
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
I'm pretty sure I can do this easily using awk (or similar) by looking up location IDs from a file (ex. LOC_1) and matching any instance of the location ID and printing that out next to the UUID. I haven't been able to get it right yet, and any help is much appreciated!
My locations file is named location and my dataset is data. Note that I can edit the original file or write the results to a new file, either is fine.
awk without using split: use semicolon or tab as the field separator
awk -F'[;\t]' -v OFS=';' '{for (i=2; i<=NF; i++) print $1,$i}' file
I don't think you need to match against a known list of locations; you should be able to just print each line as you go:
$ awk '{print $1; split($1,a,";"); for (i=2; i<=NF; ++i) print a[1] ";" $i}' file
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
You comment on knowing the locations and the mapping file makes me suspicious what your example seems to have done isn't exactly what is being asked - but it seems like you're wanting to reformat each set of tab delimited LOC= values into a row with their UUID in front.
If so, this will do the trick:
awk ' BEGIN {OFS=FS=";"} {split($2,locs,"\t"); for (n in locs) { print $1,locs[n]}}'
Given:
$ cat -A data.txt
01234;LOC_1=ABC^ILOC_1=BCD^ILOC_2=CDE$
56789;LOC_2=DEF^ILOC_3=EFG$
Then:
$ awk ' BEGIN {OFS=FS=";"} {split($2,locs,"\t"); for (n in locs) { print $1,locs[n]}}' data.txt
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
The BEGIN {OFS=FS=";"} block sets the input and output delimiter to ;.
For each row, we then split the second field into an array named locs, splitting on tab, via - split($2,locs,"\t")
And then loop through locs printing the UUID and each loc value - for (n in locs) { print $1,locs[n]}
How about without loop or without split one as follows.(considering that Input_file is same as shown samples only)
awk 'BEGIN{FS=OFS=";"}{gsub(/[[:space:]]+/,"\n"$1 OFS)} 1' Input_file
This might work for you (GNU sed):
sed -r 's/((.*;)\S+)\s+(\S+)/\1\n\2\3/;P;D' file
Repeatedly replace the white space between locations with a newline, followed by the UUID and a ;, printing/deleting each line as it appears.

Use sed to find and replace a number following by its successor in bash

I have a string that contains multiple occurrences of number ranges, which are separated by a comma, e.g.,
2-12,59-89,90-102,103-492,593-3990,3991-4930
Now I would like to remove all directly neighbouring ranges and remove them from the string, i.e., remove anything that is of the form -(x),(x+1), to get something like this:
2-12,59-492,593-4930
Can anyone think of a method to accomplish this? I can honestly not post anything that I have tried, because all my tries were highly unsuccessful. To me it seems like it is not possible to actually find anything of the form -(x),(x+1) using sed, since that would require doing operations or comparisons of a found number by another number that has to be part of the command that is currently searching for numbers.
If everybody agrees that sed is NOT the correct tool for doing this, I will do it another way, but I am still interested if it's possible.
with awk
awk -F, -v RS="-" -v ORS="-" '$2!=$1+1' file
with appropriate separator setting, print the record when second field is not +1.
RS is the record separator and ORS is the outpout record separator.
test:
> awk -F, -v RS="-" -v ORS="-"
'$2!=$1+1' <<< "2-12,59-89,90-102,103-492,593-3990,3991-4930"
2-12,59-492,593-4930
awk solution:
awk -F'-' '{ r=$1;
for (i=2; i<=NF; i++) {
split($i, a, ",");
r=sprintf("%s%s", r, a[2]-a[1]==1? "" : FS $i)
}
print r
}' file
-F'-' - treat -(hyphen) as field separator
r - resulting string
split($i, a, ",") - split adjacent range boundaries into array a by separator ,
a[2]-a[1]==1 - crucial condition, reflects (x),(x+1)
The output:
2-12,59-492,593-4930
This might work for you (GNU sed):
sed -r ' s/^/\n/;:a;ta;s/\n([^-]*-)([0-9]*)(.*,)/\1\n\2\n\2\n\3/;Td;:b;s/(\n.*\n.*)9(_*\n)/\1_\2/;tb;s/(\n.*\n)(_*\n)/\10\2/;s/$/\n0123456789/;s/(\n.*\n[0-9]*)([0-8])(_*\n.*)\n.*\2(.).*/\1\4\3/;:z;tz;s/(\n.*\n[^_]*)_([^\n]*\n)/\10\2/;tz;:c;tc;s/([0-9]*-)\n(.*)\n(.*)\n,(\3)-/\n\1/;ta;s/\n(.*)\n.*\n,/\1,\n/;ta;:d;s/\n//g' file
This proof-of-concept sed solution, iteratively increments and compares the end of one range with the start of another. If the comparison is true it removes both and repeats, otherwise it moves on to the next range and repeats until all ranges have been compared.

Resources