From command line, how to change to uppercase each first word of a line in a text file?
Example input:
hello world
tell me who you are!
Example output:
HELLO world
TELL me who you are!
There are no empty lines, it's ASCII, and each line starts with an alphabetic word followed by a tab.
Tools to use: anything that works on command line on macOS (bash 3.2, BSD sed, awk, tr, perl 5, python 2.7, swift 4, etc.).
With bash 4+, you can use bash case conversion and a while loop to accomplish what you intend, e.g.
$ while read -r a b; do echo "${a^^} $b"; done < file
HELLO world
HOW are you?
The parameter expansion ${var^^} converts all chars in var to uppercase, ${var^} converts the first letter.
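A quick demo of the two expansions (requires bash >= 4, so not macOS's stock bash 3.2):

```shell
# ${a^} uppercases only the first character; ${a^^} uppercases all of them
bash -c 'a=hello; echo "${a^} ${a^^}"'
# prints: Hello HELLO
```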
Bash 3.2 - 'tr'
For earlier bash, you can use the same setup with tr with a herestring to handle the case conversion:
$ while read -r a b; do echo "$(tr '[a-z]' '[A-Z]' <<<"$a") $b"; done <file
HELLO world
HOW are you?
Preserving \t Characters
To preserve the tab-separated words, you have to prevent word-splitting during the read. Unfortunately, the -d option to read doesn't allow termination on a set of characters. A way around checking for both space- and tab-delimited words is to read the entire line with word-splitting disabled (IFS=) and then scan forward through the line until the first literal $' ' or $'\t' is found (the $'...' literals are bash-only, not POSIX shell). A simple implementation would be:
while IFS= read -r line; do
    word=
    ct=${#line}
    for ((i = 0; i < ${#line}; i++)); do
        ## check against literal 'space' or 'tab'
        if [ "${line:$i:1}" = $' ' ] || [ "${line:$i:1}" = $'\t' ]; then
            ct=$i
            break
        fi
        word="${word}${line:$i:1}"
    done
    word="$(tr '[a-z]' '[A-Z]' <<<"$word")"
    echo "${word}${line:$ct}"
done <file
Output with tab-separated words
HELLO world
HOW are you?
Using an awk one-liner:
awk -F$'\t' -v OFS=$'\t' '{ $1 = toupper($1) }1' file
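For instance, piping a tab-separated sample through the one-liner:

```shell
# First field of each line comes out uppercased; tabs are preserved
printf 'hello\tworld\ntell\tme who you are!\n' |
    awk -F'\t' -v OFS='\t' '{ $1 = toupper($1) } 1'
```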
Using GNU sed:
sed 's/^\S*/\U&/g' file
where \S matches a non-whitespace character and \U& uppercases the matched pattern
UPDATE: BSD sed does not support most of those special sequences (\S, \U), so it is still doable but requires a much longer expression:
sed -f script file
where the script contains
{
h
s/ .*//
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.*\)\n[^ ]* \(.*\)/\1 \2/
}
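The same script can be run inline; a sketch against a single space-separated line (works with both GNU and BSD sed, assuming there is a space after the first word):

```shell
# Save the line in the hold space, uppercase the first word, then stitch
# the uppercased word back onto the rest of the original line
echo 'hello world' | sed '
h
s/ .*//
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.*\)\n[^ ]* \(.*\)/\1 \2/'
# prints: HELLO world
```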
Related
I have a string like setSuperValue('sdfsdfd') and I need to get the 'sdfsdfd' value from this line. What is the way to do this?
First I find line by setSuperValue and then get only string with my target content - setSuperValue('sdfsdfd'). How do I build a regexp to get sdfsdfd from this line?
This should help you
grep setSuperValue myfile.txt | grep -o "'.*'" | tr -d "'"
The grep -o will return all text that starts with a single ' and ends with another ', including both quotes. Then use tr to get rid of the quotes.
You could also use cut:
grep setSuperValue myfile.txt | cut -d"'" -f2
Or awk:
grep setSuperValue myfile.txt | awk -F "'" '{print $2}'
This will split the line where the single quotes are and return the second value, that is what you are looking for.
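For example, on a single matching line:

```shell
# Split on single quotes; the second field is the value between them
echo "setSuperValue('sdfsdfd')" | cut -d"'" -f2
# prints: sdfsdfd
```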
Generally, to locate a string in multiple lines of data, external utilities will be much faster than looping over lines in Bash.
In your specific case, a single sed command will do what you want:
sed -n -r "s/^.*setSuperValue\('([^']+)'\).*$/\1/p" file
Extended (-r) regular expression ^.*setSuperValue\('([^']+)'\).*$ matches any line containing setSuperValue('...') as a whole, captures whatever ... is in capture group \1, replaces the input line with that, and prints (p) the result.
Due to option -n, nothing else is printed.
Move the opening and closing ' inside (...) to include them in the captured value.
Note: If the input file contains multiple setSuperValue('...') lines, the command will print every match; either way, the command will process all lines.
To only print the 1st match and stop processing immediately after, modify the command as follows:
sed -n -r "/^.*setSuperValue\('([^']+)'\).*$/ {s//\1/;p;q}" file
/.../ only matches lines containing setSuperValue('...'), causing the following {...} to be executed only for matching lines.
s// - i.e., not specifying a regex - implicitly performs substitution based on the same regex that matched the line at hand; p prints the result, and q quits processing altogether, meaning that processing stops once the first match is found.
If you have already located a line of interest through other methods and are looking for a pure Bash method of extracting a substring based on a regex, use =~, Bash's regex-matching operator, which supports extended regular expressions and capture groups through the special ${BASH_REMATCH[#]} array variable:
$ sampleLine="... setSuperValue('sdfsdfd') ..."
$ [[ $sampleLine =~ "setSuperValue('"([^\']+)"')" ]] && echo "${BASH_REMATCH[1]}"
sdfsdfd
Note the careful quoting of the parts of the regex that should be taken literally, and how ${BASH_REMATCH[1]} refers to the first (and only) captured group.
You can parse the value from each line using expr's pattern matching:
#!/bin/bash
while read -r line; do
value=$(expr "$line" : ".*setSuperValue('\(.*\)')")
if [ "x$value" != "x" ]; then
printf "value : %s\n" "$value"
fi
done <"$1"
Test Input
$ cat dat/supervalue.txt
setSuperValue('sdfsdfd')
something else
setSuperValue('sdfsdfd')
something else
setSuperValue('sdfsdfd')
something else
Example Output
$ bash parsevalue.sh dat/supervalue.txt
value : sdfsdfd
value : sdfsdfd
value : sdfsdfd
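If you do want to avoid external tools entirely, parameter expansion alone can do it; a minimal sketch, assuming the line really contains ('...'):

```shell
# Builtins only: strip everything through ('  then everything from ') on
line="setSuperValue('sdfsdfd')"
value=${line#*"('"}
value=${value%"')"*}
echo "$value"
# prints: sdfsdfd
```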
I have a log file in format like this :
pseudo=thierry33 pseudoConcat=thierry33
pseudo=i love you pseudoConcat=i love you
I want to return all the strings which are between pseudo and pseudoConcat, my desired output is :
thierry33
i love you
How can I do this using sed or awk? I've been trying for a few days in vain.
Thanks.
With sed:
sed -r 's/pseudo=(.*[^ ]) +pseudoConcat.*/\1/'
Explanation:
use GNU option -r to allow +, () without backslashes
capture string after pseudo= with ()
string should end with a non-space [^ ]
before spaces and pseudoConcat +pseudoConcat
use 1st captured group \1 as a replacement
With GNU grep:
grep -oP '(?<=pseudo=).*?(?= *pseudoConcat)' file
Output without trailing spaces:
thierry33
i love you
With bash:
re='pseudo=(.*[^ ]) *pseudoConcat'
while read -r line; do [[ $line =~ $re ]] && echo "${BASH_REMATCH[1]}"; done < file
Note that bash's =~ uses POSIX extended regular expressions, which have no lazy quantifier (.*?), so the greedy form with a trailing [^ ] is used instead.
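An awk alternative, as a sketch assuming each line looks exactly like the samples (starts with pseudo= and contains a single pseudoConcat field):

```shell
# Strip the leading pseudo= and everything from pseudoConcat= onward
awk '{ sub(/^pseudo=/, ""); sub(/ *pseudoConcat=.*/, ""); print }' file
```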
I have a file that have a list of integers:
12542
58696
78845
87855
...
I want to change them into:
"12542", "58696", "78845", "87855", "..."
(no comma at the end)
I believe I need to use sed but couldn't figure out how. Appreciate your help.
You could do a sed multiline trick, but the easy way is to take advantage of shell expansion:
echo $(sed '$ ! s/.*/"&",/; $ s/.*/"&"/' foo.txt)
Run echo $(cat file) to see why this works. The trick, in a nutshell, is that the result of cat is parsed into tokens and interpreted as individual arguments to echo, which prints them separated by spaces.
The sed expression reads
$ ! s/.*/"&",/
$ s/.*/"&"/
...which means: For all but the last line ($ !) replace the line with "line",, and for the last line, with "line".
EDIT: In the event that the file contains not just lines of integers as in the OP's case (i.e. the file can contain characters the shell would expand), the following works:
EDIT2: Nicer code for the general case.
sed -n 's/.*/"&"/; $! s/$/,/; 1 h; 1 ! H; $ { x; s/\n/ /g; p; }' foo.txt
Explanation: Written in a more readable fashion, the sed script is
s/.*/"&"/
$! s/$/,/
1 h
1! H
$ {
x
s/\n/ /g
p
}
What this means is:
s/.*/"&"/
Wrap every line in double quotes.
$! s/$/,/
If it isn't the last line, append a comma
1 h
1! H
If it is the first line, overwrite the hold buffer with the result of the previous transformation(s), otherwise append it to the hold buffer.
$ {
x
s/\n/ /g
p
}
If it is the last line -- at this point the hold buffer contains the whole line wrapped in double quotes with commas where appropriate -- swap the hold buffer with the pattern space, replace newlines with spaces, and print the result.
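If your system has paste, a shorter pipeline in the same spirit (a sketch: quote each line, join on commas, then widen the separator):

```shell
# Wrap each line in quotes, join the lines with commas, add a space after each comma
sed 's/.*/"&"/' foo.txt | paste -sd, - | sed 's/,/, /g'
```

Note this assumes the values themselves contain no commas, as is the case for integers.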
Here is another solution:
sed 's/.*/ "&"/' input-file | tr '\n' ',' | rev | cut -c 2- | rev | sed 's/^.//'
First, wrap each input line in quotes (with a leading space):
sed 's/.*/ "&"/' input-file
Then convert the newlines to commas:
tr '\n' ','
The last commands, rev, cut, and sed, format the output according to the requirement.
Where,
rev reverses the string.
cut removes the trailing comma from the output.
sed removes the first character (the leading space) from the string.
Output:
"12542", "58696", "78845", "87855"
With perl without any pipes/forks :
perl -0ne 'print join(", ", map { "\042$_\042" } split), "\n"' file
OUTPUT:
"12542", "58696", "78845", "87855"
Here's a pure Bash (Bash≥4) possibility that reads the whole file in memory, so it won't be good for huge files:
mapfile -t ary < file
((${#ary[@]})) && printf '"%s"' "${ary[0]}"
((${#ary[@]}>1)) && printf ', "%s"' "${ary[@]:1}"
printf '\n'
For huge files, this awk seems ok (and will be rather fast):
awk '{if(NR>1) printf ", ";printf("\"%s\"",$0)} END {print ""}' file
One way, using sed:
sed ':a; N; $!ba; s/\n/", "/g; s/.*/"&"/' file
Results:
"12542", "58696", "78845", "87855", "..."
You can write the column oriented values in a row with no comma following the last as follows:
cnt=0
while read -r line || test -n "$line"; do
    if [ "$cnt" -eq 0 ]; then
        printf "\"%s\"" "$line"
    else
        printf ", \"%s\"" "$line"
    fi
    cnt=$((cnt + 1))
done < "$1"
printf "\n"
output:
$ bash col2row.sh dat/ncol.txt
"12542", "58696", "78845", "87855"
A simplified awk solution:
awk '{ printf sep "\"%s\"", $0; sep=", " }' file
Takes advantage of uninitialized variables defaulting to an empty string in a string context (sep).
sep "\"%s\"" synthesizes the format string to use with printf by concatenating sep with \"%s\". The resulting format string is applied to $0, each input line.
Since sep is only initialized after the first input record, , is effectively only inserted between output elements.
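For example (note that, as given, the one-liner prints no trailing newline; add END { print "" } if you need one):

```shell
# sep is empty for the first record, ", " for every record after it
printf '12542\n58696\n' | awk '{ printf sep "\"%s\"", $0; sep=", " }'
# prints: "12542", "58696"
```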
Bash scripting: how can I get a simple while loop to go through a file with the content below and strip out all characters from T onward (including T) using sed?
"2012-05-04T10:16:04Z"
"2012-04-05T15:27:40Z"
"2012-03-05T14:58:27Z"
"2011-11-29T15:04:09Z"
"2011-11-16T12:12:00Z"
Thanks
A simple awk command to do this:
awk -F '["T]' '{print $2}' file
2012-05-04
2012-04-05
2012-03-05
2011-11-29
2011-11-16
With sed:
sed 's/"\|T.*//g' file
"matches double quotes \| or T.* starts from the first T match all the characters upto the last. Replacing the matched characters with an empty string will give you the desired output.
Example:
$ echo '"2012-05-04T10:16:04Z"' | sed 's/"\|T.*//g'
2012-05-04
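Since \| is a GNU extension, BSD sed needs the two removals split into separate substitutions; a portable sketch:

```shell
# Delete all double quotes, then everything from the first T onward
echo '"2012-05-04T10:16:04Z"' | sed 's/"//g; s/T.*//'
# prints: 2012-05-04
```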
With bash builtins:
while IFS='T"' read -r a a b; do echo "$a"; done < filename
Output:
2012-05-04
2012-04-05
2012-03-05
2011-11-29
2011-11-16
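For a single value, parameter expansion alone also works, with no sed at all; a sketch:

```shell
line='"2012-05-04T10:16:04Z"'
line=${line%%T*}    # drop everything from the first T onward
echo "${line#\"}"   # drop the leading double quote
# prints: 2012-05-04
```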
I'd like to remove any word which contains a non-alpha char from a text file, e.g.
"ok 0bad ba1d bad3 4bad4 5bad5bad5"
should become
"ok"
I've tried using
echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g'
The following sed command does the job:
sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
It removes all words containing at least one non-alphabetic character. It is better to use POSIX character classes like [:alpha:], because for instance they won't consider the French name "François" as being faulty (i.e. containing a non-alphabetic character).
Explanation
We remove all patterns starting with an arbitrary number of spaces, followed by an arbitrary (possibly nil) number of alphabetic characters, followed by at least one non-space, non-alphabetic character, continuing to the end of the word (i.e. until the next space). Please note that you may want to swap [:space:] for [:blank:]; the difference is that [:blank:] matches only space and tab, while [:space:] also matches vertical whitespace such as newline.
Test
$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
ok
Using awk:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
awk '{ofs=""; for (i=1; i<=NF; i++) if ($i ~ /^[[:alpha:]]+$/)
{printf "%s%s", ofs, $i; ofs=OFS} print ""}' <<< "$s"
ok
This awk command loops through all fields; each field that matches the regex /^[[:alpha:]]+$/ is printed. The ofs variable is empty before the first kept word and set to OFS afterwards, so the surviving words are separated by single spaces.
Using grep + tr together:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
r=$(grep -o '[^ ]\+' <<< "$s"|grep '^[[:alpha:]]\+$'|tr '\n' ' ')
echo "$r"
ok
First grep -o breaks the string into individual words. The 2nd grep only keeps words consisting of alphabetic characters. And finally tr translates \n to space.
If you're not concerned about losing different numbers of spaces between each word, you could use something like this in Perl:
perl -ane 'print join(" ", grep { !/[^[:alpha:]]/ } @F), "\n"'
the -a switch enables auto-split mode, which splits the text on any number of spaces and stores the fields in the array @F. grep filters out the elements of that array that contain any non-alphabetical characters. The resulting array is joined on a single space.
This might work for you (GNU sed):
sed -r 's/\b([[:alpha:]]+\b ?)|\S+\b ?/\1/g;s/ $//' file
This uses a back reference within alternation to save the required string.
st="ok 0bad ba1d bad3 4bad4 5bad5bad5"
for word in $st;
do
if [[ $word =~ ^[a-zA-Z]+$ ]];
then
echo $word;
fi;
done