Convert specific field to upper case by other field using sed - linux

Using sed, I need to convert the second field to upper case when the city field is miami or chicago.
My current code (shown after the sample input below) converts all the names to upper case without filtering by city.
id,name,country,sex,year,price,team,city,x,y,z
266,Aaron Russell,USA,m,1989,50,12,miami,0,0,1
179872,Abbos Rakhmonov,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,Amar Music,CRO,m,1991,110,24,miami,0,0,0
3931111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
98402,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0
sed 's/^\(\([^,]*,\)\{1\}\)\([^,]*\)/\1\U\3/'
My output:
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONV,UZB,m,1979,0,25,chicago,0,0,0
3662,ABBY ERCEG,NZL,m,1977,67,20,toronto,0,0,0,
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0,
393115111,ANIRBAN LAHIRI,IND,m,1987,105,27,boston,0,0,0
998460252,ANISSA KHELFAOUI,ALG,f,1967,45,2,toronto,0,0,0
Expected output (using sed only):
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONV,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0
393115111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
998460252,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0

Easier IMHO with awk:
awk 'BEGIN{city=8; FS=OFS=","}
$city=="miami" || $city=="chicago" {$2=toupper($2)} 1' file
Prints:
id,name,country,sex,year,price,team,city,x,y,z
266,AARON RUSSELL,USA,m,1989,50,12,miami,0,0,1
179872,ABBOS RAKHMONOV,UZB,m,1979,0,25,chicago,0,0,0
3662,Abby Erceg,NZL,m,1977,67,20,toronto,0,0,0
5554573,AMAR MUSIC,CRO,m,1991,110,24,miami,0,0,0
3931111,Anirban Lahiri,IND,m,1987,105,27,boston,0,0,0
98402,Anissa Khelfaoui,ALG,f,1967,45,2,toronto,0,0,0

sed -E '/^([^,]*,){7}(miami|chicago),/{s/(^[^,]*,)([^,]+)/\1\U\2/}'
This relies on counting commas in each line: seven to reach field 8 (the city) and one to reach field 2 (the name). If a single field in the CSV contained quoted or escaped commas, this would break.
Note that \U syntax is specific to GNU sed, and not portable.
This task would probably be more clearly expressed in awk, and gawk can also handle quoted commas, using FPAT.
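For illustration only, here is a rough sketch of that FPAT idea (GNU awk; it assumes fields are either unquoted or wrapped entirely in double quotes, with no escaped quotes inside). A field is defined as either a run of non-commas or a double-quoted string, so a quoted name such as "Russell, Aaron" stays one field:
gawk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","}
      $8=="miami" || $8=="chicago" {$2=toupper($2)} 1' file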

Related

awk command to split filename based on substring

I have a directory in which the file names are like:
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
I would like to divide them into 4 variables as below:
v1=Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
V2=Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
V3=Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
V4=Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
If the number of files increases, the extra files should go into one of the above variables in the same fashion. I'm looking for an awk one-liner to achieve this.
I would do it using GNU AWK the following way. Let file.txt content be
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
then
awk '{arr[NR%4]=arr[NR%4] "," $0}END{print substr(arr[1],2);print substr(arr[2],2);print substr(arr[3],2);print substr(arr[0],2)}' file.txt
output
Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
Explanation: I store lines in the array arr and decide where to put a given line based on the line number (NR) modulo (%) four (4). I concatenate what is currently stored (an empty string if nothing so far) with , and the content of the current line ($0); this results in a leading , which I remove using the substr function, i.e. by starting at the 2nd character.
(tested in GNU Awk 5.0.1)
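If the four groups are ultimately needed in shell variables, as asked, one possible follow-up (a sketch assuming bash 4+ for mapfile; the names v1..v4 follow the question) is:
mapfile -t groups < <(awk '{arr[NR%4]=arr[NR%4] "," $0}
  END{print substr(arr[1],2); print substr(arr[2],2); print substr(arr[3],2); print substr(arr[0],2)}' file.txt)
v1=${groups[0]} v2=${groups[1]} v3=${groups[2]} v4=${groups[3]}
echo "$v1"   # Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9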

replace sub-string with last special character, being (3rd part) of comma separated string

I have a string with comma separated values, like:
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0-,,,
As you can see, the 3rd comma-separated value sometimes has a special character, like a dash (-), at the end. I want to use sed, or preferably a perl command, to replace this string in place (with the -i option, so the existing file is edited) with the same string at the same position (i.e. the 3rd comma-separated value) but without the special character (like the dash (-)) at the end of the string. So the result for the above example string should be:
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0,,,
Since there are multiple lines like the above in a file, I am using a while loop in a shell/bash script to iterate over and manipulate all lines of the file. I have assigned the above string values to variables, so as to replace them using perl. My while loop is:
while read mystr
do
myNEWstr=$(echo $mystr | sed s/[_.-]$// | sed s/[__]$// | sed s/[_.-]$//)
perl -pi -e "s/\b$mystr\b/$myNEWstr/g" myFinalFile.txt
done < myInputFile.txt
where:
$mystr is "SOME-STRING_A_-BLAHBLAH_1-4MP0-"
$myNEWstr ends up as "SOME-STRING_A_-BLAHBLAH_1-4MP0"
Note that myInputFile.txt contains the 3rd comma-separated values of myFinalFile.txt. Those exact string values ($mystr) are checked for special characters at the end, like an underscore, dash, dot, or double underscore; if found, they are removed to form the new string ($myNEWstr). Finally, that new string ($myNEWstr) replaces the old one in myFinalFile.txt, so that the result looks like the example final string shown above, i.e. with the 3rd comma-separated value WITHOUT the special character at the end (a dash (-) in the above example).
Thank you.
You could use the following regex:
s/^([^,]*,[^,]*,[^,]*)-,/$1,/
This defines CSV fields as series of characters other than a comma (empty fields are allowed). We are looking for a dash at the very end of the third CSV field. The regex captures everything up to there, and then replaces it while omitting the dash.
$ cat t.txt
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0-,,,
$ perl -p -e 's/^([^,]*,[^,]*,[^,]*)-,/$1,/' t.txt
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0,,,
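If the goal is simply to fix every line of the file in place, the same substitution can be applied directly with perl -i, without the shell loop. This is only a sketch of that idea, extended (as an assumption on my part) to strip any trailing run of dash, underscore or dot from the 3rd field, as the question describes:
perl -pi -e 's/^([^,]*,[^,]*,[^,]*?)[-_.]+,/$1,/' myFinalFile.txt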

Show rows of a file which have a regular expression more than 'n' number of times

I have a file abc.txt in the below format:
a:,b:,c:,d:,e:,f:,g:
a:0;b:,c:3,d:,e:,f:,g:1
a:9,b:8,c:6,d:5,e:2,f:,g:
a:0;b:,c:2,d:1,e:,f:,g:
Now, in unix, I want to get only those rows where the regular expression :[0-9] (a colon followed by a digit) occurs more than 2 times.
In other words, show rows where at least 3 attributes have numerical values present.
The output should be only the 2nd and 3rd rows:
a:0;b:,c:3,d:,e:,f:,g:1
a:9,b:8,c:6,d:5,e:2,f:,g:
With basic grep:
grep '\(:[[:digit:]].*\)\{3,\}' file
:[[:digit:]].* matches a colon followed by a digit and zero or more arbitrary characters. This expression is put into a subpattern: \(...\). The expression \{3,\} means that the previous expression has to occur 3 or more times.
With extended posix regular expressions this can be written a little simpler, without the need to escape ( and {:
grep -E '(:[[:digit:]].*){3,}' file
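If the threshold needs to vary, the repetition count can come from a shell variable (n is just an illustrative name, not part of the original answer):
n=3
grep -E "(:[[:digit:]].*){$n,}" file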
$ awk -F':[0-9]' 'NF>3' file
a:0;b:,c:3,d:,e:,f:,g:1
a:9,b:8,c:6,d:5,e:2,f:,g:
a:0;b:,c:2,d:1,e:,f:,g:
perl -nE '/:[0-9](?{$count++})(?!)/; print if $count > 2; $count=0' input
perl -ne 'print if /(.*?\:\d.*?){2,}/' yourfile
This matches rows that contain a character:number pattern two or more times.
https://regex101.com/r/tRWtbY/1

How to use awk command to remove word "a" not character 'a' in a text file?

I tried to use awk '{$0 = tolower($0);gsub(/a|an|is|the/, "", $0);}' words.txt
but it also replaced the a in words like day. I only want to delete the word a.
for example:
input: The day is sunny the the the Sunny is is
expected output: day sunny
Using GNU awk and built-in variable RT:
$ echo this is a test and nothing more |
awk '
BEGIN {
    RS="[ \n]+"
    a["a"]
    a["an"]
    a["is"]
    a["the"]
}
(tolower($0) in a==0) {
    printf "%s%s",$0, RT
}'
this test and nothing more
However, post some sample data with expected output for more specific answers.
You need to define word boundaries to eliminate partial matches:
$ echo "This is a sunny day, that is it." |
awk '{$0=tolower($0); gsub(/\y(is|it|a|this)\y/,"")}1'
will print
sunny day, that .
you can eliminate punctuation signs as well by either adding them to field delimiters or to the gsub words.
The following awk may help you with the same.
1st condition: Considering you only want to remove the words a, the and is here; you could edit my code and add more words as per your need.
awk '{
  for(i=1;i<=NF;i++){
    if(tolower($i)=="a" || tolower($i)=="the" || tolower($i)=="is"){
      $i=""
    }
  }
}
1' Input_file
2nd condition: In case you want to remove the words a, the and is and also remove duplicate fields from the lines, the following may help (based on the example output shown in the comments above):
awk '{
  for(i=1;i<=NF;i++){
    if(tolower($i)=="a" || tolower($i)=="the" || tolower($i)=="is" || ++a[tolower($i)]>1){
      $i=""
    }
  }
}
1' Input_file
NOTE: Since I am nullifying the fields, I am assuming you are fine with a little extra spacing left in the line.
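If the leftover blanks matter, one possible tweak (my addition, not part of the answer above) is to squeeze them after the loop:
awk '{
  for(i=1;i<=NF;i++){
    if(tolower($i)=="a" || tolower($i)=="the" || tolower($i)=="is"){
      $i=""
    }
  }
  gsub(/ +/," ")              # collapse the runs of blanks left by the emptied fields
  sub(/^ /,""); sub(/ $/,"")  # trim a single leading/trailing blank if present
}
1' Input_file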
You need an expression where the word is delimited by something (you need to decide what delimits your words; for example, do numbers delimit the word or are they part of it, as in a4?). So the expression could be, for example, /[^[:alnum:]](a|an|is|the)[^[:alnum:]]/.
Note however that these expressions will match the word AND the delimiters. Use capture feature to deal with this problem.
It looks like your "words.txt" contains just one word per line, so the expression should be anchored to the beginning and end of the line, like /^a$/.
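For that one-word-per-line case, a minimal sketch (my addition, assuming the stop words from the question) would be:
awk 'tolower($0) !~ /^(a|an|is|the)$/' words.txt   # keep every line that is not one of the stop words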

bash Changing every other comma to point

I am working with a set of data which is written in the Swedish format; a comma is used instead of a point for decimal numbers in Sweden.
My data set is like this:
1,188,1,250,0,757,0,946,8,960
1,257,1,300,0,802,1,002,9,485
1,328,1,350,0,846,1,058,10,021
1,381,1,400,0,880,1,100,10,418
I want to change every other comma to a point and have output like this:
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418
Any idea how to do that with simple shell scripting? It is fine if I do it in multiple steps, i.e. if I first change the first instance of the comma, then the third instance, and so on.
Thank you very much for your help.
Using sed
sed 's/,\([^,]*\(,\|$\)\)/.\1/g' file
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418
For reference, here is a possible way to achieve the conversion using awk:
awk -F, '{for(i=1;i<=NF;i=i+2) {printf $i "." $(i+1); if(i<NF-2) printf FS }; printf "\n" }' file
The for loop iterates over the fields two at a time (the separator is set by the -F, option) and prints the current field and the next one separated by a dot.
The comma separator, represented by FS, is printed after each pair except at the end of the line.
As a Perl one-liner, using split and array manipulation:
perl -F, -e '@a = @b = (); while (@b = splice @F, 0, 2) {
push @a, join ".", @b} print join ",", @a' file
Output:
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418
Many sed dialects allow you to specify which instance of a pattern to replace by specifying a numeric option to s///.
sed -e 's/,/./9' -e 's/,/./7' -e 's/,/./5' -e 's/,/./3' -e 's/,/./'
ISTR some sed dialects would allow you to simplify this to
sed 's/,/./1,2'
but this is not supported on my Debian.
Demo: http://ideone.com/6s2lAl
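Another option, offered only as a sketch of an alternative and assuming the data itself never contains a dot: a GNU sed loop that converts one odd-numbered comma per pass until none are left, so it works for any number of pairs:
sed ':a; s/^\(\([^,.]*\.[^,]*,\)*[^,.]*\),/\1./; ta' file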
