How to convert a specific column to title case - linux

I have come up with this code:
cut -d';' -f4 columns.csv | sed 's/.*/\L&/; s/[a-z]*/\u&/g'
which actually does the job for the fourth column, but along the way I have lost the other columns.
I have unsuccessfully tried:
cut -d';' -f4 columns.csv | sed -i 's/.*/\L&/; s/[a-z]*/\u&/g'
So, how could I apply the change to that specific column in the file and keep other columns as they are?
Let's say that columns.csv content is:
TEXT;more text;SoMe MoRe TeXt;THE FOURTH COLUMN;something else
Then, expected output should be:
TEXT;more text;SoMe MoRe TeXt;The Fourth Column;something else

GNU sed:
sed -ri 's/;/&\r/3;:1;s/\r([^; ]+\s*)/\L\u\1\r/;t1;s/\r//' columns.csv
update:
sed -i 's/; */&\n/3;:1;s/\n\([^; ]\+ *\)/\L\u\1\n/;t1;s/\n//' columns.csv
Place an anchor \r (\n in the updated version) at the beginning of field 4. Each substitution title-cases one whole word and moves the anchor to the beginning of the next one. The jump to label :1 via t1 is taken as long as the substitution command finds a match. Finally, the anchor is removed.
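To see the anchor technique in action without modifying the file, drop -i and feed the sample line on stdin; a quick sketch, assuming GNU sed (it relies on the GNU extensions \+, \L and \u):

```shell
# Sample line from the question; run the substitution without -i so the
# result is printed instead of written back to columns.csv.
line='TEXT;more text;SoMe MoRe TeXt;THE FOURTH COLUMN;something else'
out=$(printf '%s\n' "$line" |
  sed 's/; */&\n/3;:1;s/\n\([^; ]\+ *\)/\L\u\1\n/;t1;s/\n//')
echo "$out"
```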

Not a short simple awk, but should work:
awk -F";" '{t=split($4,a," ");$4="";for(i=1;i<=t;i++) {a[i]=substr(a[i],1,1) tolower(substr(a[i],2));$4=$4 sprintf("%s ",a[i])}$4=substr($4,1,length($4)-1)}1' OFS=";" file
TEXT;more text;SoMe MoRe TeXt;The Fourth Column;something else
A shorter version:
awk -F";" '{t=split($4,a," ");$4="";for(i=1;i<=t;i++) {a[i]=substr(a[i],1,1) tolower(substr(a[i],2));$4=$4 a[i](t==i?"":" ")}}1' OFS=";" file

With perl:
$ perl -F';' -lane '$F[3] =~ s/[a-z]+/\L\u$&/gi; print join ";", @F' columns.csv
TEXT;more text;SoMe MoRe TeXt;The Fourth Column;something else
-F';' use ; to split the input line
$F[3] =~ s/[a-z]+/\L\u$&/gi change case only for the 4th column
print join ";", #F print the modified fields
Unicode version:
perl -Mopen=locale -Mutf8 -F';' -lane '$F[3]=~s/\p{L}+/\L\u$&/gi;
print join ";", #F'

Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=OFS=";" }
{
    title = ""
    numWords = split($4,words,/ /)
    for (wordNr=1; wordNr<=numWords; wordNr++) {
        word = words[wordNr]
        word = toupper(substr(word,1,1)) tolower(substr(word,2))
        title = (wordNr>1 ? title " " : "") word
    }
    $4 = title
    print
}
$ awk -f tst.awk file
TEXT;more text;SoMe MoRe TeXt;The Fourth Column;something else
True capitalization in a title is much more complicated than that though.

This might work for you (GNU sed):
sed -E 's/[^;]*/\n&\n/4;h;s/\S*/\L\u&/g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file
Delimit the fourth field by newlines and make a copy.
Uppercase the first character of each word.
Append the amended line to the original.
Using pattern matching, replace the original fourth field by the amended one.
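Running those steps on the sample line (reading stdin rather than editing a file) is a quick way to check them; GNU sed assumed, since the script uses \S, \L and \u:

```shell
# Wrap field 4 in newlines, title-case a copy of the whole line, then
# splice the amended field back between the original prefix and suffix.
line='TEXT;more text;SoMe MoRe TeXt;THE FOURTH COLUMN;something else'
out=$(printf '%s\n' "$line" |
  sed -E 's/[^;]*/\n&\n/4;h;s/\S*/\L\u&/g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/')
echo "$out"
```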

Related

Select subdomains using print command

cat a.txt
a.b.c.d.e.google.com
x.y.z.google.com
rev a.txt | awk -F. '{print $2,$3}' | rev
This is showing:
e google
z google
But I want this output
a.b.c.d.e.google
b.c.d.e.google
c.d.e.google
e.google
x.y.z.google
y.z.google
z.google
With your shown samples, please try the following awk code. Written and tested in GNU awk, it should work in any awk.
awk '
BEGIN{
  FS=OFS="."
}
{
  nf=NF
  for(i=1;i<(nf-1);i++){
    print
    $1=""
    sub(/^[[:space:]]*\./,"")
  }
}
' Input_file
Here is one more awk solution:
awk -F. '{while (!/^[^.]+\.[^.]+$/) {print; sub(/^[^.]+\./, "")}}' file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using sed
$ sed -En 'p;:a;s/[^.]+\.(.*([^.]+\.){2}[[:alpha:]]+$)/\1/p;ta' input_file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using bash:
IFS=.
while read -ra a; do
for ((i=${#a[@]}; i>2; i--)); do
echo "${a[*]: -i}"
done
done < a.txt
Gives:
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
(I assume the lack of d.e.google.com in your expected output is typo?)
For a shorter and arguably simpler solution, you could use Perl.
To auto-split the line on the dot character into the @F array, and then print the range you want:
perl -F'\.' -le 'print join(".", @F[0..$#F-1])' a.txt
-F'\.' will auto-split each input line into the @F array. It will split on the given regular expression, so the dot needs to be escaped to be taken literally.
$#F is the index of the last element of the array. So @F[0..$#F-1] is the range of elements from the first one ($F[0]) to the penultimate one. If you wanted to leave out both "google" and "com", you would use @F[0..$#F-2] etc.

grep with two or more words, one line by file with many files

Hi everyone. I have
file 1.log:
text1 value11 text
text text
text2 value12 text
file 2.log:
text1 value21 text
text text
text2 value22 text
I want:
value11;value12
value21;value22
For now I grep the values into separate files and paste them together later into another file, but I think this is not a very elegant solution because I need to read all the files more than once. So I tried to use grep to extract all the data in a single cat | grep line, but the result is not what I expected.
I use:
cat *.log | grep -oP "(?<=text1 ).*?(?= )|(?<=text2 ).*?(?= )" | tr '\n' '; '
or
cat *.log | grep -oP "(?<=text1 ).*?(?= )|(?<=text2 ).*?(?= )" | xargs
but I get in each case:
value11;value12;value21;value22
value11 value12 value21 value22
Thank you so much.
Try:
$ awk -v RS='[[:space:]]+' '$0=="text1" || $0=="text2"{getline; printf "%s%s",sep,$0; sep=";"} ENDFILE{if(sep)print""; sep=""}' *.log
value11;value12
value21;value22
For those who prefer their commands spread over multiple lines:
awk -v RS='[[:space:]]+' '
$0=="text1" || $0=="text2" {
getline
printf "%s%s",sep,$0
sep=";"
}
ENDFILE {
if(sep)print""
sep=""
}' *.log
How it works
-v RS='[[:space:]]+'
This tells awk to treat any sequence of whitespace (newlines, blanks, tabs, etc) as a record separator.
$0=="text1" || $0=="text2"{getline; printf "%s%s",sep,$0; sep=";"}
This tells awk to look for records that match either text1 or text2. For those records, and those records only, the commands in curly braces are executed. Those commands are:
getline tells awk to read in the next record.
printf "%s%s",sep,$0 tells awk to print the variable sep followed by the word in the record.
After we print the first match, the command sep=";" is executed which tells awk to set the value of sep to a semicolon.
As we start each file, sep is empty. This means that the first match from any file is printed with no separator preceding it. All subsequent matches from the same file will have a ; to separate them.
ENDFILE{if(sep)print""; sep=""}
After the end of each file is reached, we print a newline if sep is not empty and then we set sep back to an empty string.
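The per-file separator logic above can be sketched without GNU awk's getline and ENDFILE, keying off the first word directly and resetting sep at FNR==1. This is a portable variant of the same idea, not the answer's exact command; the log contents are recreated from the question:

```shell
# Recreate the sample logs in a scratch directory (names as in the question).
dir=$(mktemp -d)
printf 'text1 value11 text\ntext text\ntext2 value12 text\n' > "$dir/1.log"
printf 'text1 value21 text\ntext text\ntext2 value22 text\n' > "$dir/2.log"

# Key off the first word, print the second field, and reset the separator
# (plus emit a newline) at the start of each new file.
out=$(awk '
  FNR==1 { if (NR>1) print ""; sep="" }
  $1=="text1" || $1=="text2" { printf "%s%s", sep, $2; sep=";" }
  END { print "" }' "$dir/1.log" "$dir/2.log")
echo "$out"
rm -r "$dir"
```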
Alternative: Printing the second word if the first word ends with a number
In an alternative interpretation of the question (hat tip: David C. Rankin), we want to print the second word on any line for which the first word ends with a number. In that case, try:
$ awk '$1~/[0-9]$/{printf "%s%s",sep,$2; sep=";"} ENDFILE{if(sep)print""; sep=""}' *.log
value11;value12
value21;value22
In the above, $1~/[0-9]$/ selects the lines for which the first word ends with a number and printf "%s%s",sep,$2 prints the second field on that line.
Discussion
The original command was:
$ cat *.log | grep -oP "(?<=text1 ).*?(?= )|(?<=text2 ).*?(?= )" | tr '\n' '; '
value11;value12;value21;value22;
Note that, when using most unix commands, cat is rarely ever needed. In this case, for example, grep accepts a list of files. So, we could easily do without the extra cat process and get the same output:
$ grep -hoP "(?<=text1 ).*?(?= )|(?<=text2 ).*?(?= )" *.log | tr '\n' '; '
value11;value12;value21;value22;
I agree with @John1024, and how you approach this problem will really depend on what the actual text is you are looking for. If, for instance, your lines of concern start with text{1,2,...} and what you want in the second field can be anything, then his approach is optimal. However, if the values in the first field can vary and what you are really interested in is records with valueXX in the second field, then an approach keying off the second field may be what you are looking for.
Taking your second field as an example: if the text you are interested in is of the form valueXX (where XX is two or more digits at the end of the field), you can process only those records whose second field matches, then use a simple conditional testing whether FNR == 1 to control the ';' delimiter output, and ENDFILE to control the newline, similar to:
awk '$2 ~ /^value[0-9][0-9][0-9]*$/ {
printf "%s%s", (FNR == 1) ? "" : ";", $2
}
ENDFILE {
print ""
}' file1.log file2.log
Example Use/Output
$ awk '$2 ~ /^value[0-9][0-9][0-9]*$/ {
printf "%s%s", (FNR == 1) ? "" : ";", $2
}
ENDFILE {
print ""
}' file1.log file2.log
value11;value12
value21;value22
Look things over and consider your actual input files and then either one of these two approaches should get you there.
If I understood you correctly, you want the values but are searching for text[12], i.e. you want to get the word after the matching search word, not the matching search word itself:
$ awk -v s="^text[12]$" '              # set the search regex *
FNR==1 {                               # at the beginning of each file
    b=b (b==""?"":"\n")                # terminate the current buffer with a newline
}
{
    for(i=1;i<NF;i++)                  # iterate all but the last word
        if($i~s)                       # if the current word matches the search pattern
            b=b (b~/^$|\n$/?"":";") $(i+1)  # append the following word to the buffer
}
END {                                  # after searching all files
    print b                            # output the buffer
}' *.log
Output:
value11;value12
value21;value22
* regex could be for example ^(text1|text2)$, too.

Split or join lines in Linux using sed

I have file that contains below information
$ cat test.txt
Studentename:Ram
rollno:12
subjects:6
Highest:95
Lowest:65
Studentename:Krish
rollno:13
subjects:6
Highest:90
Lowest:45
Studentename:Sam
rollno:14
subjects:6
Highest:75
Lowest:65
I am trying to place the info for a single student on a single line.
i.e. my output should be:
Studentename:Ram rollno:12 subjects:6 Highest:95 Lowest:65
Studentename:Krish rollno:13 subjects:6 Highest:90 Lowest:45
Studentename:Sam rollno:14 subjects:6 Highest:75 Lowest:65
Below is the command I wrote
cat test.txt | tr "\n" " " | sed 's/Lowest:[0-9]\+/Lowest:[0:9]\n/g'
The above command breaks the line at the regex Lowest:[0-9], but it doesn't print the matched pattern. Instead it prints the literal text Lowest:[0-9].
Please help
Try:
$ sed '/^Studente/{:a; N; /Lowest/!ba; s/\n/ /g}' test.txt
Studentename:Ram rollno:12 subjects:6 Highest:95 Lowest:65
Studentename:Krish rollno:13 subjects:6 Highest:90 Lowest:45
Studentename:Sam rollno:14 subjects:6 Highest:75 Lowest:65
How it works
/^Studente/{...} tells sed to perform the commands inside the curly braces only on lines that start with Studente. Those commands are:
:a
This defines a label a.
N
This reads in the next line and appends it to the pattern space.
/Lowest/!ba
If the current pattern space does not contain Lowest, this tells sed to branch back to label a.
In more detail, /Lowest/ is true if the line contains Lowest. In sed, ! is negation, so /Lowest/! is true if the line does not contain Lowest. In ba, the b stands for the branch command and a is the label to branch to.
s/\n/ /g
This tells sed to replace all newlines with spaces.
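Piping one sample record through the command shows the N loop at work (the same holds for each student block):

```shell
# Accumulate lines from Studentename down to Lowest, then flatten.
out=$(printf '%s\n' 'Studentename:Ram' 'rollno:12' 'subjects:6' 'Highest:95' 'Lowest:65' |
  sed '/^Studente/{:a; N; /Lowest/!ba; s/\n/ /g}')
echo "$out"
```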
Try this using awk:
awk '{if ($1 !~ /^Lowest/) {printf "%s ", $0} else {print}}' file.txt
Or shorter but more obfuscated:
awk '$1!~/^Lowest/{printf"%s ",$0;next}1' file.txt
Or correcting your command:
tr "\n" " " < file.txt | sed 's/Lowest:[0-9]\+/&\n/g'
Explanation: & is what was matched in the left-hand part of the substitution.
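A tiny illustration of &: here the matched digits are wrapped in brackets instead of being replaced, since & stands for whatever the pattern matched:

```shell
# & re-inserts the matched text into the replacement.
out=$(echo 'Lowest:65' | sed 's/[0-9]\+/[&]/')
echo "$out"
```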
Another possible GNU sed that doesn't assume Lowest is the last item:
sed ':a; N; /\nStudent/{P; D}; s/\n/ /; ba' test.txt
This might work for you (GNU sed):
sed '/^Studentename:/{:a;x;s/\n/ /gp;d};H;$ba;d' file
Use the hold space to gather up the fields and then remove the newlines to produce a record.

replace text between two tabs - sed

I have the following input files:
text1 text2 text3 text4
abc1 abc2 abc3 abc4
and I am trying to find the second string between the two tabs (e.g. text2, abc2) and replace it with another word.
I have tried with
sed s'/\t*\t/sample/1'
but it only deletes the tab and does not replace the word.
I appreciate any help!
I would suggest using awk here:
awk 'BEGIN { FS = OFS = "\t" } { $2 = "sample" } 1' file
Set the input and output field separators to a tab and change the second field. The 1 at the end is always true, so awk does the default action, { print }.
Use this sed:
sed 's/\t[^\t]*\t/\tsample\t/'
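For example, on a tab-separated line (GNU sed understands \t; with other seds you may need a literal tab):

```shell
# Replace the text between the first two tabs with "sample".
out=$(printf 'text1\ttext2\ttext3\ttext4\n' |
  sed 's/\t[^\t]*\t/\tsample\t/')
echo "$out"
```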
An alternative in gawk, since you tagged awk ---
gawk -- 'BEGIN {FS="\t"; OFS="\t"} {$2="sample"; print}'
For example,
echo -e 'a\tb\tc\td' | gawk -- 'BEGIN {FS="\t"; OFS="\t"} {$2="sample"; print}'
prints
a sample c d
The FS breaks input at tabs, OFS separates output fields using tabs, and $2="sample" changes only the second field, leaving the rest of the fields unchanged.
Try this
sed -e 's/\([a-zA-Z0-9]*\) \([a-zA-Z0-9]*\) \([a-zA-Z0-9]*\) \([a-zA-Z0-9]*\)/\1 sample \3 \4/'
In GNU sed v4.2.2 I had to use -r:
sed -r 's/^([^\t]*\t)[^\t]*/\1sample/'
The ^([^\t]*\t) is the first field and the first tab, and the [^\t]* is the text of the second field. The \1 restores the first field and the sample is whatever you want :) .
For example,
echo -e 'a\tb\tc\td' | sed -r 's/^([^\t]*\t)[^\t]*/\1sample/'
prints
a sample c d
This also works for other than four columns. For example
$ echo -e 'a\tb\tc' | sed -r 's/^([^\t]*\t)[^\t]*/\1sample/'
a sample c
$ echo -e 'a\tb\tc\td\te' | sed -r 's/^([^\t]*\t)[^\t]*/\1sample/'
a sample c d e

Replacing newline character [duplicate]

This question already has answers here:
How can I replace each newline (\n) with a space using sed?
(43 answers)
Closed 8 years ago.
I have an XML file which has occasional lines that are split into 2: the first line ending with &#13;. I want to concatenate any such lines and remove the &#13;, perhaps replacing it with a space.
e.g.
<message>hi I am&#13;
here </message>
needs to become
<message>hi I am here </message>
I've tried:
sed -i 's/&#13;\/n/ /g' filename
with no luck.
Any help is much appreciated!
Here is a GNU sed version:
sed ':a;$bc;N;ba;:c;s/&#13;\n/ /g' file
Explanation:
sed '
:a # Create a label a
$bc # If end of file then branch to label c
N # Append the next line to pattern space
ba # branch back to label a to repeat until end of file
:c # Another label c
s/&#13;\n/ /g # When end of file is reached perform this substitution
' file
give this gawk one-liner a try:
awk -v RS="" 'gsub(/
\n/," ")+7' file
tested here with your example:
kent$ echo "<message>hi I am
here </message>"|awk -v RS="" 'gsub(/
\n/," ")+7'
<message>hi I am here </message>
You can use this awk:
awk -F"
" '/
$/ {a=$1; next} a{print a, $0; a=""; next} 1' file
Explanation
-F"
" set 
 as delimiter, so that the first field will be always the desired part of the string.
/&#13;$/ {a=$1; next} if the line ends with &#13;, store it in a and jump to the next line.
a{print a, $0; a=""; next} if a is set, print it together with current line. Then unset a for future loops. Finally jump to next line.
1 as true, prints current line.
Sample
$ cat a
yeah
<message>hi I am&#13;
here </message>
hello
bye
$ awk -F"
" '/
$/ {a=$1; next} a{print a, $0; a=""; next} 1' a
yeah
<message>hi I am here </message>
hello
bye
This will work for you:
sed -i '{:q;N;s/&#13;.*\n/ /g;t q}' <filename>
However, replacing newlines with sed is always a bash (read: bad) idea. The chances of making an error are high.
So another but simpler solution:
tr -s '\&\#13\;\n' ' ' < <filename>
tr replaces every character from the match set with a space, so without -s it would have printed
<message>hi I am      here </message>
-s from man page:
-s, --squeeze-repeats
replace each input sequence of a repeated character that is listed in SET1 with a single occurrence of that character.
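Putting it together on the sample input: the set here is the individual characters & # 1 3 ; and newline (tr works on characters, not strings, so stray 1s, 3s, etc. elsewhere in the text would also be translated; a caveat worth keeping in mind). Note the trailing newline also becomes a space:

```shell
# Translate each of & # 1 3 ; and newline to a space, squeezing runs.
out=$(printf '<message>hi I am&#13;\nhere </message>\n' | tr -s '\&\#13\;\n' ' ')
echo "$out"
```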