Split string at special character in bash

I'm reading filenames from a text file line by line in a bash script. However, the lines look like this:
/path/to/myfile1.txt 1
/path/to/myfile2.txt 2
/path/to/myfile3.txt 3
...
/path/to/myfile20.txt 20
So there is a second column containing an integer, separated by a space. I only need the part of the string before the space.
I found only solutions using a "for-loop". But I need a function that explicitly looks for the " "-character (space) in my string and splits it at that point.
In principle I need the equivalent of Matlab's strsplit(str,delimiter).

If you are already reading the file with something like
while read -r line; do
(and you should be), then pass two arguments to read instead:
while read -r filename somenumber; do
read will split the line on whitespace and assign the first field to filename and any remaining field(s) to somenumber.
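A minimal sketch of the complete loop, assuming the list of names lives in a file called files.txt (that name is made up here):
while read -r filename somenumber; do
    printf '%s\n' "$filename"    # e.g. /path/to/myfile1.txt; "$somenumber" holds the trailing integer
done < files.txt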

Three (of many) solutions:
# Using awk
echo "$string" | awk '{ print $1 }'
# Using cut
echo "$string" | cut -d' ' -f1
# Using sed
echo "$string" | sed 's/\s.*$//g'

If you need to iterate through each line of the file anyway, you can cut off everything after the space with bash:
while read -r line ; do
# bash string manipulation removes the first space
# and everything that follows it
echo "${line// *}"
done < file

This should work too:
line="${line% *}"
This cuts the string at its last occurrence of a space, so it will work even if the path contains spaces (as long as the line ends with a space followed by the number).
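A small illustration with a made-up path that contains a space:
line='/path/with space/myfile1.txt 1'
echo "${line% *}"    # prints /path/with space/myfile1.txt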

while read -r line
do
{ rev | cut -d' ' -f2- | rev >> result.txt; } <<< $line
done < input.txt
This solution will work even if you have spaces in your filenames.
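For instance, with a made-up line whose path contains a space:
line='/path/to/my file.txt 4'
rev <<< "$line" | cut -d' ' -f2- | rev    # prints /path/to/my file.txt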

Related

String split and extract the last field in bash

I have a text file FILENAME. I want to split the first comma-separated field of each line at - and extract the last element. Here "$(echo $line | cut -d, -f1 | cut -d- -f4)" alone is not giving me the right result.
FILENAME:
TWEH-201902_Pau_EX_21-1195060301,15cef8a046fe449081d6fa061b5b45cb.final.cram
TWEH-201902_Pau_EX_22-1195060302,25037f17ba7143c78e4c5a475ee98e25.final.cram
TWEH-201902_Pau_T-1383-1195060311,267364a6767240afab2b646deec17a34.final.cram
code I tried:
while read line; do \
DNA="$(echo $line | cut -d, -f1 | cut -d- -f4)";
echo $DNA
done < ${FILENAME}
Result I want
1195060301
1195060302
1195060311
Would you please try the following:
while IFS=, read -r f1 _; do # set the field separator to ",", assign the 1st field to f1 and the rest to _
dna=${f1##*-} # removes everything before the rightmost "-" from "$f1"
echo "$dna"
done < "$FILENAME"
Well, I had to make do with two lines of code. Maybe someone has a better approach.
while read line; do \
DNA="$(echo $line| cut -d, -f1| rev)"
DNA="$(echo $DNA| cut -d- -f1 | rev)"
echo $DNA
done < ${FILENAME}
I do not know the constraints on your input file, but if what you are looking for is a 10-digit number, and there is only ever one such number per line, this should do nicely:
grep -Eo '[0-9]{10,}' input.txt
1195060301
1195060302
1195060311
This essentially says: show me every run of 10 or more digits in this file.
input.txt
TWEH-201902_Pau_EX_21-1195060301,15cef8a046fe449081d6fa061b5b45cb.final.cram
TWEH-201902_Pau_EX_22-1195060302,25037f17ba7143c78e4c5a475ee98e25.final.cram
TWEH-201902_Pau_T-1383-1195060311,267364a6767240afab2b646deec17a34.final.cram
A sed approach:
sed -nE 's/.*-([[:digit:]]+)\,.*/\1/p' input_file
sed options:
-n: Do not print the whole file back; print only what is explicitly requested with the p command.
-E: Use extended regex, which does not require escaping its grammar.
The sed extended regex:
's/.*-([[:digit:]]+)\,.*/\1/p': Capture one or more digits in group 1, preceded by anything and a dash and followed by a comma and anything, then print only the captured group.
Using awk:
awk -F[,] '{ split($1,arr,"-");print arr[length(arr)] }' FILENAME
Using , as the separator, take the first delimited "piece" of data and further split it into an array arr using - as the delimiter and awk's split function. We then print the last element of arr.
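Run against the sample FILENAME above, it prints the three requested numbers:
$ awk -F[,] '{ split($1,arr,"-");print arr[length(arr)] }' FILENAME
1195060301
1195060302
1195060311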

echo without trimming the space in awk command

I have a file consisting of multiple rows like this
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN|0000000010000.00|6761857316|508998|6011|GL
I have to split column 11 into 4 different columns based on character counts.
This is the 11th column; note that it also contains extra spaces.
SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN
This is what I have done:
ls *.txt *.TXT| while read line
do
subName="$(cut -d'.' -f1 <<<"$line")"
awk -F"|" '{ "echo -n "$11" | cut -c1-23" | getline ton;
"echo -n "$11" | cut -c24-36" | getline city;
"echo -n "$11" | cut -c37-38" | getline state;
"echo -n "$11" | cut -c39-40" | getline country;
$11=ton"|"city"|"state"|"country; print $0
}' OFS="|" $line > $subName$output
done
But when echoing the 11th column, it trims the extra spaces, which leads to a mismatch in the character count. Is there any way to echo without trimming spaces?
Actual output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR MHIN|||0000000010000.00|6761857316|508998|6011|GL
Expected Output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR|MH|IN|0000000010000.00|6761857316|508998|6011|GL
The least annoying way to code this that I've found so far is:
perl -F'\|' -lane '$F[10] = join "|", unpack "a23 A13 a2 a2", $F[10]; print join "|", @F'
It's fairly straightforward:
Iterate over lines of input; split each line on | and put the fields in @F.
For the 11th field ($F[10]), split it into fixed-width subfields using unpack, trimming trailing spaces from the second subfield (A instead of a).
Reassemble subfields by joining with |.
Reassemble the whole line by joining with | and printing it.
I haven't benchmarked it in any way, but it's likely much faster than the original code that spawns multiple shell and cut processes per input line because it's all done in one process.
A complete solution would wrap it in a shell loop:
for file in *.txt *.TXT; do
outfile="${file%.*}$output"
perl -F'\|' -lane '...' "$file" > "$outfile"
done
Or if you don't need to trim the .txt part (and you don't have too many files to fit on the command line):
perl -i.out -F'\|' -lane '...' *.txt *.TXT
This simply places the output for each input file foo.txt in foo.txt.out.
A pure-bash implementation of all this logic
#!/usr/bin/env bash
shopt -s nocaseglob extglob
for f in *.txt; do
subName=${f%.*}
while IFS='|' read -r -a fields; do
location=${fields[10]}
ton=${location:0:23}; ton=${ton%%+([[:space:]])}
city=${location:23:13}; city=${city%%+([[:space:]])}
state=${location:36:2}
country=${location:38:2}
fields[10]="$ton|$city|$state|$country"
printf -v out '%s|' "${fields[@]}"
printf '%s\n' "${out:0:$(( ${#out} - 1 ))}"
done <"$f" >"$subName.out"
done
It's slower (if I did this well, by about a factor of 10) than pure awk would be, but much faster than the awk/shell combination proposed in the question.
Going into the constructs used:
All the ${varname%...} and related constructs are parameter expansion. The specific ${varname%pattern} construct removes the shortest possible match for pattern from the end of the value in varname, or the longest match if % is replaced with %%.
Using extglob enables extended globbing syntax, such as +([[:space:]]), which is equivalent to the regex syntax [[:space:]]+.
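A quick illustration of that trailing-space trim on its own (the value mimics the space-padded 23-character "ton" slice from the sample record):
shopt -s extglob
ton='SHOP NO.5,6,7 RUNWAL GR   '
echo "${ton%%+([[:space:]])}"    # prints SHOP NO.5,6,7 RUNWAL GR with no trailing spaces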

What is the best way to do this string transformation in shell?

I am trying to write a shell script that will need to transform input of the following form:
foo/bar/baz/qux.txt
bar/baz/quz.txt
baz/quz/foo.txt
Into:
baz-qux
quz
foo
I.e. split on '/', drop the first 2 segments, drop the '.txt' and substitute remaining slashes for hyphens.
The substitution seems straightforward enough using tr:
paths=$(cat <<- EOF
foo/bar/baz/qux.txt
bar/baz/quz.txt
baz/quz/foo.txt
EOF
)
echo $paths | tr '/' '-' | tr '.txt' ' '
I've tried various forms of
cut -d '/' -f x
To get the necessary segments but am coming up short.
I'm a ruby guy so tempted to reach for my hammer and just use ruby:
lines.each { |s| s.split('/')[2..-1].join('-').split('.')[0] }
But deploying ruby for this one operation seems like it might be overkill. And I would like to improve my shell skills anyway so was wondering if there is a more elegant way anyone would recommend to do this in shell?
Thanks for any help
It can be done using bash parameter expansions:
for name in foo/bar/baz/qux.txt bar/baz/quz.txt baz/quz/foo.txt; do
new=${name#*/} # drop the shortest prefix match for */, thus everything up to first /
new=${new#*/} # repeat, dropping the second segment
new=${new%.txt} # drop shortest suffix match for .txt
new=${new//\//-} # convert any remaining slashes
echo "$new"
done
Gives:
baz-qux
quz
foo
These are all bash shell built-in constructs, so no external processes like cut, sed or tr required.
You can do everything in one sed command
sed -E 's|([^/]*/){,2}||; s|/|-|g; s|\.txt$||' file
Replace \.txt$ with \.[^.]*$ to remove all extensions instead of only .txt.
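One portability note: some sed implementations may not accept the empty lower bound in {,2}; spelling it out as {0,2} is the safer ERE form and behaves the same here:
sed -E 's|([^/]*/){0,2}||; s|/|-|g; s|\.txt$||' file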
You can try something like this: cut -d/ -f3- | cut -d. -f1 | tr / -
Explanation:
cut -d/ -f3- - split on '/', and keep the third field and everything after it (baz/qux.txt)
cut -d. -f1 - split on '.', keep the first value (drops the file extension) (baz/qux)
tr / - - Transform any remaining '/' into '-'.
(baz-qux)
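Putting the whole pipeline together on the sample paths (printf is just one way to feed the lines in):
printf '%s\n' foo/bar/baz/qux.txt bar/baz/quz.txt baz/quz/foo.txt | cut -d/ -f3- | cut -d. -f1 | tr / -
baz-qux
quz
foo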
Try Perl
$ cat mark_smith.txt
foo/bar/baz/qux.txt
bar/baz/quz.txt
baz/quz/foo.txt
$ perl -F"/" -lane ' #a=#F[2..$#F]; #b=map{s/.txt//g;$_} #a; print join("-",#b) ' mark_smith.txt
baz-qux
quz
foo
$
assuming . is only in the filenames
$ awk -F[/.] '{n=NF; p=$(n-1)} n>4{p=$(n-2)"-"p} {print p}' file
baz-qux
quz
foo
awk '{gsub(/^.{8}|.txt$/,"")sub(/\//,"-")}1' file
baz-qux
quz
foo

Concatenate the result of echo and a command output

I have the following code:
names=$(ls *$1*.txt)
head -q -n 1 $names | cut -d "_" -f 2
where the first line finds and stores all names matching the command line input into a variable called names, and the second grabs the first line in each file (element of the variable names) and outputs the second part of the line based on the "_" delim.
This is all good, however I would like to prepend the filename (stored as lines in the variable names) to the output of cut. I have tried:
names=$(ls *$1*.txt)
head -q -n 1 $names | echo -n "$names" cut -d "_" -f 2
however this only prints out the filenames
I have tried
names=$(ls *$1*.txt)
head -q -n 1 $names | echo -n "$names"; cut -d "_" -f 2
and again I only print out the filenames.
The desired output is:
$
filename1.txt <second character>
where there is a single whitespace between the filename and the result of cut.
Thank you.
Best approach, using awk
You can do this all in one invocation of awk:
awk -F_ 'NR==1{print FILENAME, $2; exit}' *"$1"*.txt
On the first line of the first file, this prints the filename and the value of the second column, then exits.
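If you want one output line per matching file rather than only the first file (an assumption about the goal, since the question loops over every name), keying on FNR instead of NR does that:
awk -F_ 'FNR==1{print FILENAME, $2}' *"$1"*.txt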
Pure bash solution
I would always recommend against parsing ls - instead I would use a loop:
You can avoid the use of awk to read the first line of the file by using bash built-in functionality:
for i in *"$1"*.txt; do
IFS=_ read -ra arr <"$i"
echo "$i ${arr[1]}"
break
done
Here we read the first line of the file into an array, splitting it into pieces on the _.
Maybe something like this will satisfy your need, BUT THIS IS BAD CODING (parsing ls like this is fragile; see the loop above for a safer pattern):
#!/bin/bash
names=$(ls *$1*.txt)
for f in $names
do
pattern=`head -q -n 1 $f | cut -d "_" -f 2`
echo "$f $pattern"
done
If I didn't misunderstand your goal, this also works.
I've always done it this way, I just found out that this is a deprecated way to do it.
#!/bin/bash
names=$(ls *"$1"*.txt)
for e in $names;
do echo $e `echo "$e" | cut -c2-2`;
done

Extract string between two characters in bash

I have a string formatted as below
Walk Off the Earth - Somebody That I Used to Know
[playing] #36/37 1:04/4:05 (26%)
volume: n/a repeat: off random: on single: off consume: off
Now, from the above string I need to extract 36 from #36/37.
First thing I did was to extract #36/37 from second line using
echo "above mentioned string" | awk 'NR==2 {print $2}'
Now, I want to extract 36 from the above extracted part for that I did
echo "#36/37" | sed -e 's/\//#/g' | awk -F "#" '{print $2}'
which gave me 36 as my outptut.
But I feel that using both sed and awk just to extract text from #36/37 is a bit of an overkill. So, is there any better or shorter way to achieve this?
Split the field on the pound and slash characters into an array and retrieve the required element.
awk 'NR==2 {split($2, arr, "[#/]"); print arr[2]}'
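For example, piping the full status text from the question through it:
printf '%s\n' 'Walk Off the Earth - Somebody That I Used to Know' '[playing] #36/37 1:04/4:05 (26%)' 'volume: n/a repeat: off random: on single: off consume: off' | awk 'NR==2 {split($2, arr, "[#/]"); print arr[2]}'
36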
This answer takes advantage of bash's built-in extended regular-expression syntax using the =~ test operator. (I say test, but don't expect it to work with the test command. It only works with the [[ keyword.)
mini:~ michael$ cat foo
Walk Off the Earth - Somebody That I Used to Know
[playing] #36/37 1:04/4:05 (26%)
volume: n/a repeat: off random: on single: off consume: off
mini:~ michael$ [[ $(<foo) =~ \#[[:digit:]]{2} ]] && echo "${BASH_REMATCH[0]#\#}"
36
When you boil it down, this is simply a regular expression that matches a pound sign followed by two digits, saves the match in the zeroth element of the BASH_REMATCH array, and then strips the leading # with parameter expansion.
One way using sed, assuming infile has the content from the question. On the second line, match any characters up to #, then save the following digits in group 1 and substitute the complete line with that group (\1). The -n switch avoids printing anything unless explicitly requested with a p instruction.
sed -ne '2 { s/^[^#]*#\([0-9]*\).*$/\1/; p; q }' infile
Output:
36
This might work for you:
sed 's/.*#\([0-9]*\)\/[0-9]*.*/\1/p;d' file
36
sed -n '2s/.*\#\([0-9]*\)\/.*/\1/p'
This suppresses everything but the second line, then prints the digits between # and /.
input | while read playing numbers rest
do
if [[ $playing = "[playing]" ]]; then
t="${numbers:1}"
echo "${t%/*}"
fi
done
Bash's default splitting is on whitespace, so the second field (numbers) is just the #36/37 token. The rest uses bash parameter expansion operators to get at the portion of interest: remove the first character, then remove the suffix starting with "/".
This would solve your problem.
awk -F'[#/]' 'NR==2{print $2}'
I've written a script which outputs the string between a first and a last character. To solve your problem, you can use the following command combined with this script.
echo '[playing] #36/37 1:04/4:05 (26%)' | cut -d' ' -f2 | ./cut_between.sh -f '#' -l '/'
You can download this script on GitHub.
You can do it without any external program with BASH-internal string operations like this:
string="[playing] #36/37 1:04/4:05 (26%)"
part=${string##*#};number=${part%%/*}
echo "$number"
