Can't get IFS to work when converting array to string - linux

Below is a bash shell script that takes in a CSV file and spits out rows formatted the way I want (the script does more than this, but I've only kept the array-affecting parts below).
FILENAME=$1
cat $FILENAME | while read LINE
do
OIFS=$IFS;
IFS=","
columns=( $LINE )
date=${columns[4]//\"/}
columns[13]=${columns[13]//\"/}
columns[4]=$(date -d $date +%s)
newline=${columns[*]}
echo $newline
IFS=$OIFS;
done
I'm using GNU bash v 4.1.2(1)-release for CentOS 6.3. I've tried putting quotes like
newline="${columns[*]}"
Still no luck.
Following is sample data line
112110120001299169,112110119001295978,11,"121.119.163.146.1322221980963094","2012/11/01"
It seems like it should be outputting the array into a comma delimited string. Instead, the string is space delimited. Anyone know the reason why?
I suspect it has something to do with the fact that if I echo $IFS in the script it appears to be an empty string, but when I echo "${IFS}" it's the comma I expect.
Edit: Solution
I found the solution. When echoing out $newline, I have to use quotes around it, i.e.
echo "$newline"
Otherwise, it uses the default blanks. I believe it has something to do with bash only substituting IFS when you force it to with the quotes.

I'm not clear on why, but bash only seems to use the first character of IFS as the joining delimiter when ${array[*]} is expanded inside double-quotes:
$ columns=(a b "c d e" f)
$ IFS=,
$ echo ${columns[*]}
a b c d e f
$ echo "${columns[*]}"
a,b,c d e,f
$ newline=${columns[*]}; echo "$newline"
a b c d e f
$ newline="${columns[*]}"; echo "$newline"
a,b,c d e,f
Fortunately, the solution is simple: use double-quotes (newline="${columns[*]}")
(BTW, my testing was all on bash v3 and v2, as I don't have v4 handy; so it might be different for you.) (UPDATE: tested on bash v4.2.10, same results.)
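Putting it together, a corrected version of the original loop might look like this (a sketch, untested; it assumes GNU date and that the real data has at least 14 comma-separated fields, as the OP's full script implies - the posted sample line has only 5):
#!/bin/bash
filename=$1
while IFS=, read -r -a columns; do
    date=${columns[4]//\"/}
    columns[13]=${columns[13]//\"/}
    columns[4]=$(date -d "$date" +%s)
    oifs=$IFS
    IFS=,
    newline="${columns[*]}"   # double-quoted [*] joins with the first char of IFS
    IFS=$oifs
    echo "$newline"           # keep the quotes, or the result gets re-split on whitespace
done < "$filename"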

Edit: Thanks to @GordonDavidson, removed erroneous comments about how IFS works in bash.
awk has a very nice pair of variables for this, namely FS and OFS (e.g. FS=","; OFS="|"), which perform exactly this transformation. To call external programs and fill variables you'll have to construct something like awk -F, '{"date -d "$5" +%s" | getline columns[5]}' or similar. Not quite as intuitive as the shell's c[4]=$(date ...), but awk is a very good tool to learn for data manipulations like the ones you have outlined in your question.
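For example, just re-delimiting a CSV with FS/OFS is a one-liner (a minimal illustration; the $1 = $1 assignment forces awk to rebuild the record using OFS):
awk 'BEGIN { FS = ","; OFS = "|" } { $1 = $1; print }' myFile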
Something like
#!/bin/awk -f
BEGIN { FS = ","; OFS = "," }
{
    # columns=( $LINE )  -- note: awk arrays are 1-based, bash arrays 0-based
    n = split($0, columns)
    # date=${columns[4]//\"/}
    myDate = columns[5]; gsub(/\"/, "", myDate)
    # columns[13]=${columns[13]//\"/}
    gsub(/\"/, "", columns[14])
    # columns[4]=$(date -d $date +%s)
    cmd = "date -d '" myDate "' +%s"
    cmd | getline columns[5]
    close(cmd)
    # newline=${columns[*]}; echo "$newline"
    line = columns[1]
    for (i = 2; i <= n; i++) line = line OFS columns[i]
    print line
}
used like
chmod +x myAwkScript
cat myFile | ./myAwkScript
should achieve the same result.
Sorry but I don't have the time, OR the sample data to test this right now.
Feel free to reply with error messages that you get, and I'll see if I can help.
You might also consider updating your posting with 1 line of sample data, and a date value you want to process.
IHTH

Related

Extract all Strings between two characters to array using Bash

I searched for hours but can't find a solution to extract all strings between two characters into an array using Bash.
I found
sed -n 's/.*\[\(.*\)\].*/\1/p'
but this only shows me the last entry.
My String looks like:
var="[a1] [b1] [123] [Text text] [0x0]"
I want a Array like this:
arr[0]="a1"
arr[1]="b1"
arr[2]="123"
arr[3]="Text text"
arr[4]="0x0"
So I'm searching for strings between [ and ] and want to load them into an array, without the [ and ].
Thank you for helping!
There's no simple way to do it. I would use a loop to extract them one at a time:
var="[a1] [b1] [123] [Text text] [0x0]"
regex='\[([^]]*)\](.*)'
while [[ $var =~ $regex ]]; do
arr+=("${BASH_REMATCH[1]}")
var=${BASH_REMATCH[2]}
done
In the regular expression, \[([^]]*)\] captures everything after the first [ up to (but not including) the next ]. (.*) captures everything after that for the next iteration.
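With the sample string, the loop should leave arr populated like this:
$ declare -p arr
declare -a arr=([0]="a1" [1]="b1" [2]="123" [3]="Text text" [4]="0x0")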
You can use declare -n in bash 4.3 or later to make this look a little less intimidating.
declare -n m1=BASH_REMATCH[1] m2=BASH_REMATCH[2]
regex='\[([^]]*)\](.*)'
var="[a1] [b1] [123] [Text text] [0x0]"
while [[ $var =~ $regex ]]; do
arr+=("$m1")
var=$m2
done
$ IFS=, arr=($(sed 's/\] \[/","/g;s/\]/"/;s/\[/"/' <<< "$var")); echo "${arr[3]}"
"Text text"
There are a lot of suggestions that may work for you here already, but may not depending on your data. For example, substituting your current field separator of ] [ for a comma works unless you have commas embedded in your fields. Which your sample data does not have, but one never knows. :)
An ideal solution would be to use something as a field separator that is guaranteed never to be part of your field, like a null. But that's hard to do in a portable way (i.e. without knowing what tools are available). So a less extreme stance might be to use a newline as a separator:
var="[a1] [b1] [123] [Text text] [0x0]"
mapfile -t arr < <(sed $'s/^\[//;s/] \[/\\\n/g;s/]$//' <<<"$var")
declare -p arr
which would result in:
declare -a arr='([0]="a1" [1]="b1" [2]="123" [3]="Text text" [4]="0x0")'
This is functionally equivalent to the awk solution that Inian provided. Note that mapfile requires bash version 4 or above.
That said, you could also do this exclusively within bash, without relying on any external tools like sed:
arr=( $var )
last=0
for i in "${!arr[@]}"; do
if [[ ${arr[$i]} != \[* ]]; then
arr[$last]="${arr[$last]} ${arr[$i]}"
unset arr[$i]
continue
fi
last=$i
done
for i in "${!arr[@]}"; do
arr[$i]="${arr[$i]:1:$((${#arr[$i]}-2))}"
done
At this point, declare -p arr results in:
declare -a arr='([0]="a1" [1]="b1" [2]="123" [3]="Text text" [5]="0x0")'
This sucks your $var into the array $arr[] with fields separated by whitespace, then it collapses the fields based on whether they begin with a square bracket. It then goes through the fields and replaces them with the substring that eliminates the first and last character. It may be a little less resilient and harder to read, but it's all within bash. :)
With GNU awk for multi-char RS and RT and newer versions of bash for mapfile:
$ mapfile -t arr < <(echo "$var" | awk -v RS='[^][]+' 'NR%2{print RT}')
$ declare -p arr
declare -a arr=([0]="a1" [1]="b1" [2]="123" [3]="Text text" [4]="0x0")

Linux Bash: Use awk(substr) to get parameters from file input

I have a .txt-file like this:
'SMb_TSS0303' '171765' '171864' '-' 'NC_003078' 'SMb20154'
'SMb_TSS0302' '171758' '171857' '-' 'NC_003078' 'SMb20154'
I want to extract the following as parameters:
-'SMb'
-'171765'
-'171864'
-'-' (minus)
-> need them without quotes
I am trying to do this in a shell script:
#!/bin/sh
file=$1
cat "$1"|while read line; do
echo "$line"
parent=$(awk {'print substr($line,$0,5)'})
echo "$parent"
done
This echoes 'SMb.
As far as I understood awk substr, I thought it would work like this:
substr(s, a, b) => returns b characters from string s, starting at position a
Firstly, I do not get why I can extract 'SMb with 0-5; secondly, I can't extract any other parameter I need, because moving the start does not work.
E.g. $1,6 gives an empty echo. I would expect Mb_TSS.
Desired final output:
#!/bin/sh
file=$1
cat "$1"|while read line; do
parent=$(awk {'print substr($line,$0,5)'})
start=$(awk {'print substr($line,?,?)'})
end=$(awk {'print substr($line,?,?)'})
strand=$(awk {'print substr($line,?,?)'})
done
echo "$parent" -> echos SMb
echo "$start" -> echos 171765
echo "$end" -> echos 171864
echo "$strand" -> echos -
My assumption is that the items in the lines are seen as single strings or something? Maybe I am also handling the file parsing wrongly, but everything I tried did not work.
Really unclear exactly what you're trying to do. But I can at least help you with the awk syntax:
while read -r line
do
parent=$(echo $line | awk '{print substr($1,2,3)}')
start=$(echo $line | awk '{print substr($2,2,6)}')
echo $parent
echo $start
done < file
This outputs:
SMb
171765
SMb
171758
You should be able to figure out how to get the rest of the fields.
This is quite an inefficient way to do this but based on the information in the question I'm unable to provide a better answer at the moment.
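For what it's worth, all four values could be pulled in a single awk invocation per file instead of one per field per line; a sketch, assuming whitespace-separated fields as in the sample:
awk '{
    parent = substr($1, 2, 3)   # skip the leading quote: SMb
    start  = substr($2, 2, 6)   # 171765
    end    = substr($3, 2, 6)   # 171864
    strand = substr($4, 2, 1)   # - or +
    print parent, start, end, strand
}' file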
The question was originally tagged python, so let me propose a python solution:
with open("input.txt") as f:
for l in txt:
data = [x.strip("'").partition("_")[0] for x in l.split()[:4]]
print("\n".join(data))
It opens the file, splits each line the way awk would, considers only the first 4 fields, and strips off the quotes to create the list, then displays it separated by newlines.
That prints:
SMb
171765
171864
-
SMb
171758
171857
-

How can I display unique words contained in a Bash string?

I have a string that has duplicate words. I would like to display only the unique words. The string is:
variable="alpha bravo charlie alpha delta echo charlie"
I know several tools that can do this together. This is what I figured out:
echo $variable | tr " " "\n" | sort -u | tr "\n" " "
What is a more effective way to do this?
Use a Bash Substitution Expansion
The following shell parameter expansion will substitute spaces with newlines, and then pass the results into the sort utility to return only the unique words.
$ echo -e "${variable// /\\n}" | sort -u
alpha
bravo
charlie
delta
echo
This has the side-effect of sorting your words, as the sort and uniq utilities both require input to be sorted in order to detect duplicates. If that's not what you want, I also posted a Ruby solution that preserves the original word order.
Rejoining Words
If, as one commenter pointed out, you're trying to reassemble your unique words back into a single line, you can use command substitution to do this. For example:
$ echo $(echo -e "${variable// /\\n}" | sort -u)
alpha bravo charlie delta echo
The lack of quotes around the command substitution is intentional. If you quote it, the newlines will be preserved because Bash won't do word-splitting. Unquoted, the shell returns the results as a single line, however unintuitive that may seem.
You may use xargs:
echo "$variable" | xargs -n 1 | sort -u | xargs
Note: This solution assumes that all unique words should be output in the order they're encountered in the input. By contrast, the OP's own solution attempt outputs a sorted list of unique words.
A simple Awk-only solution (POSIX-compliant) that is efficient by avoiding a pipeline (which invariably involves subshells).
awk -v RS=' ' '{ if (!seen[$1]++) { printf "%s%s",sep,$1; sep=" " } }' <<<"$variable"
# The above prints without a trailing \n, as in the OP's own solution.
# To add a trailing newline, append `END { print }` to the end
# of the Awk script.
Note how $variable is double-quoted to prevent accidental shell expansions, notably pathname expansion (globbing), and how it is provided to Awk via a here-string (<<<).
-v RS=' ' tells Awk to split the input into records by a single space.
Note that the last word will have the input line's trailing newline included, which is why we don't use $0 - the entire record - but $1, the record's first field, which has the newline stripped due to Awk's default field-splitting behavior.
seen[$1]++ is a common Awk idiom that either creates an entry for $1, the input word, in associative array seen, if it doesn't exist yet, or increments its occurrence count.
!seen[$1]++ therefore only returns true for the first occurrence of a given word (where seen[$1] is implicitly zero/the empty string; the ++ is a post-increment, and therefore doesn't take effect until after the condition is evaluated).
{printf "%s%s",sep,$1; sep=" "} prints the word at hand $1, preceded by separator sep, which is implicitly the empty string for the first word, but a single space for subsequent words, due to setting sep to " " immediately after.
Here's a more flexible variant that handles any run of whitespace between input words; it works with GNU Awk and Mawk[1]:
awk -v RS='[[:space:]]+' '{if (!seen[$0]++){printf "%s%s",sep,$0; sep=" "}}' <<<"$variable"
-v RS='[[:space:]]+' tells Awk to split the input into records by any mix of spaces, tabs, and newlines.
[1] Unfortunately, BSD/OSX Awk (in strict compliance with the POSIX spec), doesn't support using regular expressions or even multi-character literals as RS, the input record separator.
Preserve Input Order with a Ruby One-Liner
I posted a Bash-specific answer already, but if you want to return only unique words while preserving the word order of the original string, then you can use the following Ruby one-liner:
$ echo "$variable" | ruby -ne 'puts $_.split.uniq'
alpha
bravo
charlie
delta
echo
This will split the input string on whitespace, and then return unique elements from the resulting array.
Unlike the sort or uniq utilities, Ruby doesn't need the words to be sorted to detect duplicates. This may be a better solution if you don't want your results to be sorted, although given your input sample it makes no practical difference for the posted example.
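To see the difference, consider an input whose first-seen order isn't alphabetical (a made-up example):
$ echo "zulu alpha zulu bravo" | ruby -ne 'puts $_.split.uniq'
zulu
alpha
bravo
Here sort -u would print alpha, bravo, zulu instead.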
Rejoining Words
If, as one commenter pointed out, you're then trying to reassemble the words back into a single line after deduplication, you can do that too. For that, we just append the Array#join method:
$ echo "$variable" | ruby -ne 'puts $_.split.uniq.join(" ")'
alpha bravo charlie delta echo
You can use awk:
$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) printf $i" "}}'
alpha bravo charlie delta echo
If you do not want the trailing space and want a trailing newline, you can do:
$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) j=(j=="") ? $i : j" "$i}} END{print j}'
alpha bravo charlie delta echo
Using associative arrays in BASH 4+ you can simplify this:
variable="alpha bravo charlie alpha delta echo charlie"
# declare an associative array
declare -A unq
# read sentence into an indexed array
read -ra arr <<< "$variable"
# iterate each word and populate associative array with word as key
for w in "${arr[@]}"; do
unq["$w"]=1
done
# print unique results
printf "%s\n" "${!unq[@]}"
delta
bravo
echo
alpha
charlie
## if you want results in same order as original string
for w in "${arr[@]}"; do
[[ ${unq["$w"]} ]] && echo "$w" && unset unq["$w"]
done
alpha
bravo
charlie
delta
echo
pure, ugly bash:
for x in $variable; do
    if [ "$(eval echo $(echo \$un__$x))" = "" ]; then
        echo -n "$x "
        eval un__$x=1
        __usv="$__usv un__$x"
fi
done
unset $__usv

Adding spaces after each character in a string

I have a string variable in my script, made up of the 9 permission characters from ls -l
eg:
rwxr-xr--
I want to manipulate it so that it displays like this:
r w x r - x r - -
IE every three characters is tab separated and all others are separated by a space. The closest I've come is using a printf
printf "%c %c %c\t%c %c %c\t%c %c %c\t/\n" "$output"{1..9}
This only prints the first character but formatted correctly
I'm sure there's a way to do it using "sed" that I can't think of
Any advice?
Using the Posix-specified utilities fold and paste, split the string into individual characters, and then interleave a series of delimiters:
fold -w1 <<<"$str" | paste -sd'  \t'
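For the sample string, this should produce:
$ str="rwxr-xr--"
$ fold -w1 <<<"$str" | paste -sd'  \t'
r w x	r - x	r - -
(the delimiter list '  \t' cycles space, space, tab, which is what gives the groups of three).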
$ sed -r 's/(.)(.)(.)/\1 \2 \3\t/g' <<< "$output"
r w x r - x r - -
Sadly, this leaves a trailing tab in the output. If you don't want that, use:
$ sed -r 's/(.)(.)(.)/\1 \2 \3\t/g; s/\t$//' <<< "$str"
r w x r - x r - -
Why do you need to parse them? You can access every element of the string directly by copying the element you need. It's very easy and needs no external utility, for example:
DATA="rwxr-xr--"
while [ $i -lt ${#DATA} ]; do
echo ${DATA:$i:1}
i=$(( i+1 ))
done
With awk:
$ echo "rwxr-xr--" | awk '{gsub(/./,"& ");gsub(/. . . /,"&\t")}1'
r w x r - x r - -
> echo "rwxr-xr--" | sed 's/\(.\{3,3\}\)/\1\t/g;s/\([^\t]\)/\1 /g;s/\s*$//g'
r w x r - x r - -
( Evidently I didn't put much thought into my sed command. John Kugelman's version is obviously much clearer and more concise. )
Edit: I wholeheartedly agree with triplee's comment though. Don't waste your time trying to parse ls output. I did that for a long time before I figured out you can get exactly what you want (and only what you want) much easier by using stat. For example:
> stat -c %a foo.bar # Equivalent to stat --format %a
0754
The -c %a tells stat to output the access rights of the specified file, in octal. And that's all it prints out, thus eliminating the need to do wacky stuff like ls foo.bar | awk '{print $1}', etc.
So for instance you could do stuff like:
GROUP_READ_PERMS=040
perms=0$(stat -c %a foo.bar)  # leading 0 so the shell treats the value as octal in arithmetic
if (( (perms & GROUP_READ_PERMS) != 0 )); then
... # Do some stuff
fi
Sure as heck beats parsing strings like "rwxr-xr--"
sed 's/.../& /2g;s/./& /g' YourFile
in 2 simple steps
A version which includes a pure bash version for short strings, and sed for longer strings, and preserves newlines (adding a space after them too)
if [ "${OS-}" = "Windows_NT" ]; then
threshold=1000
else
threshold=100
fi
function escape()
{
local out=''
local -i i=0
local str="${1}"
if [ "${#str}" -gt "${threshold}" ]; then
# Faster after sed is started
sed '# Read all lines into one buffer
:combine
$bdone
N
bcombine
:done
s/./& /g' <<< "${str}"
else
# Slower, but no process to load, so faster for short strings. On windows
# this can be a big deal
while (( i < ${#str} )); do
out+="${str:$i:1} "
i+=1
done
echo "$out"
fi
}
Explanation of the sed script: "If this is the last line, jump to :done; otherwise append the Next line into the buffer and jump back to :combine." After :done comes a simple sed replacement expression. The entire string (newlines and all) is in one buffer, so the replacement works on newlines too (which are lost in some of the awk -F examples).
Plus this is Linux, Mac, and Git for Windows compatible.
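Used like:
$ escape "rwxr-xr--"
r w x r - x r - -
(note that every character, including the last, gets a trailing space).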
Setting awk -F '' makes each character its own field; then you can loop through and print each field.
Example:
ls -l | sed -n 2p | awk -F '' '{for(i=1;i<=NF;i++){printf " %s ",$i;}}'; echo ""
This part seems like the answer to your question:
awk -F '' '{for(i=1;i<=NF;i++){printf " %s ",$i;}}'
I realize this doesn't provide the three-character grouping you wanted, though. Hmmm...

Get the index of a specific string from a dynamically generated output

I tried a lot of things, but now I am at my wit's end.
My problem is I need the index of a specific string from my dynamically generated output.
In example, I want the index from the string 'cookie' in this output:
1337 cat dog table cookie 42
So in this example I would need this result:
5
One problem is that I need that number for a later executed awk command. Another problem is that this generated output has a flexible length, and there is no fixed pattern like . or - that you could sed on.
Cheers
Just create an array mapping the string value to its index and then print the entry:
$ cat file
1337 cat dog table cookie 42
$ awk -v v="cookie" '{v2i[v]=0; for (i=1;i<=NF;i++) v2i[$i]=i; print v2i[v]}' file
5
The above will print 0 if the string doesn't exist as a field on the given line.
By the way, you say you need that number output from the above "for a later executed awk command". Wild idea - why not do both steps in one awk command?
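For example, if the follow-up step is to print that column from the rest of the file, both steps could live in one script (a hypothetical sketch, since the later awk command wasn't shown):
awk -v v="cookie" '
NR==1 { for (i=1;i<=NF;i++) if ($i==v) c=i; next }   # find the column on line 1
c     { print $c }                                   # use it on every later line
' file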
Ugly, but possible:
echo '1337 cat dog table cookie 42' \
| tr ' ' '\n' \
| grep -Fn cookie \
| cut -f1 -d:
Here is a way to find the position of a word in a string using GNU awk (needed because of the multi-character RS), storing it in a variable.
pat="cookie"
pos=$(echo "1337 cat dog table cookie 42" | awk '{print NF+1;exit}' RS="$pat")
echo "$pos"
5
If you do not have GNU awk:
pat="cookie"
pos=$(echo "1337 cat dog table cookie 42" | awk '{for (i=1;i<=NF;i++) if ($i~p) print i}' p="$pat")
echo "$pos"
5
Here is pure bash way of doing it with arrays, no sed or awk or GNUs required ;-)
# Load up array, you would use your own command in place of echo
array=($(echo 1337 cat dog table cookie 42))
# Show what we have
echo ${array[*]}
1337 cat dog table cookie 42
# Find which element contains our pattern
for ((i=0;i<${#array[@]};i++)); do [ ${array[$i]} == "cookie" ] && echo $(($i+1)); done
5
Of course, you could set a variable to use later instead of echoing $i+1. You may also want some error checking in case pattern isn't found, but you get the idea!
Here is another answer, not using arrays, or "sed" or "awk" or "tr", just based on the bash IFS separating the values for you:
#!/bin/bash
output="cat dog mouse cookie 42" # Or output=$(yourProgram)
f=0 # f will be your answer
i=0 # i counts the fields
for x in $output; do \
((i++)); [[ "$x" = "cookie" ]] && f=$i; \
done
echo $f
Result:
4
Or you can put it all on one line, if you remove the backslashes, like this:
#!/bin/bash
output="cat dog mouse cookie 42" # Or output=$(yourProgram)
f=0;i=0;for x in $output; do ((i++)); [[ "$x" = "cookie" ]] && f=$i; done
echo $f
Explanation:
The "[[a=b]] && c" part is just shorthand for
if [a=b]; then
c
fi
It relies on short-circuit evaluation of logicals. Basically, we are asking the shell whether the statements "a equals b" and "c" are both true. If a is not equal to b, the shell already knows they can't both be true, so it doesn't need to evaluate c, and f doesn't get the value of i. If, on the other hand, a is equal to b, the shell must still evaluate "c" to see if it is also true; when it does so, f gets the value of i.
Pat="cookie"
YourInput | sed -n "/${Pat}/ {s/.*/ & /;s/ ${Pat} .*/I/;s/[[:blank:]\{1,\}[^[:blank:]\{1,\}/I/g
s/I\{9\}/9/;s/I\{8\}/8/;s/I\{7\}/7/;s/IIIIII/6/;s/IIIII/5/;s/IIII/4/;s/III/3/;s/II/2/;s/I/1/
p;q;}
$ s/.*/0/p"
If there are more than 9 columns, a more complex sed could be written, or the string of I's could be passed through wc -c instead.
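A sketch of the wc idea, sidestepping the unary digits entirely: delete everything from the pattern onward and count the words that remain (assumes the pattern really occurs in the line):
Pat="cookie"
pos=$(( $(echo "1337 cat dog table cookie 42" | sed "s/${Pat}.*//" | wc -w) + 1 ))
echo "$pos"
5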
