Linux Bash: Use awk(substr) to get parameters from file input

I have a .txt file like this:
'SMb_TSS0303' '171765' '171864' '-' 'NC_003078' 'SMb20154'
'SMb_TSS0302' '171758' '171857' '-' 'NC_003078' 'SMb20154'
I want to extract the following as parameters:
-'SMb'
-'171765'
-'171864'
-'-' (minus)
-> need them without quotes
I am trying to do this in a shell script:
#!/bin/sh
file=$1
cat "$1"|while read line; do
echo "$line"
parent=$(awk {'print substr($line,$0,5)'})
echo "$parent"
done
echos 'SMb
As far as I understood awk's substr, I thought it would work like this:
substr(s, a, b) => returns b number of chars from string s, starting at position a
Firstly, I do not get why I can extract 'SMb with 0,5; secondly, I can't extract any other parameter I need, because moving the start position does not work.
E.g. $1,6 gives an empty echo. I would expect Mb_TSS
Desired final output:
#!/bin/sh
file=$1
cat "$1"|while read line; do
parent=$(awk {'print substr($line,$0,5)'})
start=$(awk {'print substr($line,?,?')})
end=$(awk {'print substr($line,?,?')})
strand=$(awk {'print substr($line,?,?')})
done
echo "$parent" -> echos SMb
echo "$start" -> echos 171765
echo "$end" -> echos 171864
echo "$strand" -> echos -
My assumption is that the items in the lines are treated as single strings or something? Maybe I am also handling the file parsing wrongly, but everything I have tried does not work.

Really unclear exactly what you're trying to do. But I can at least help you with the awk syntax:
while read -r line
do
    parent=$(echo "$line" | awk '{print substr($1,2,3)}')
    start=$(echo "$line" | awk '{print substr($2,2,6)}')
    echo "$parent"
    echo "$start"
done < file
This outputs:
SMb
171765
SMb
171758
You should be able to figure out how to get the rest of the fields.
This is quite an inefficient way to do this but based on the information in the question I'm unable to provide a better answer at the moment.
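For completeness, the whole extraction can also be done in one awk pass instead of one awk call per field. This is a sketch, assuming the quoting shown in the question; the file name input.txt is made up for the demo:

```shell
# Recreate one sample line from the question (hypothetical file name).
printf "%s\n" "'SMb_TSS0303' '171765' '171864' '-' 'NC_003078' 'SMb20154'" > input.txt

# Strip the single quotes, cut field 1 at the underscore, print the 4 wanted values.
awk -v q="'" '{
    gsub(q, "")          # remove every single quote; $0 is re-split afterwards
    sub(/_.*/, "", $1)   # keep only the part of field 1 before the underscore
    print $1, $2, $3, $4
}' input.txt
```

For the first sample line this prints SMb 171765 171864 - on a single line; printing each value on its own line is a matter of changing the final print.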

The question was originally tagged python, so let me propose a Python solution:
with open("input.txt") as f:
    for l in f:
        data = [x.strip("'").partition("_")[0] for x in l.split()[:4]]
        print("\n".join(data))
It opens the file, splits each line the way awk would, considers only the first 4 fields, and strips off the quotes to create the list. Then it displays the list separated by newlines.
that prints:
SMb
171765
171864
-
SMb
171758
171857
-

Related

How to search the full string in file which is passed as argument in shell script?

I am passing an argument, and I have to match that argument in a file and extract the corresponding information. Could you please tell me how I can get it?
Example:
I have below details in file-
iMedical_Refined_load_Procs_task_id=970113
HV_Rawlayer_Execution_Process=988835
iMedical_HV_Refined_Load=988836
DHS_RawLayer_Execution_Process=988833
iMedical_DHS_Refined_Load=988834
If I am passing 'hv' as argument so it should to pick 'iMedical_HV_Refined_Load' and give the result - '988836'
If I am passing 'dhs' so it should pick - 'iMedical_DHS_Refined_Load' and give the result = '988834'
I tried the logic below, but it's not giving the correct result. What changes do I need to make?
echo $1 | tr a-z A-Z
g=${1^^}
echo $g
echo $1
val=$(awk -F= -v s="$g" '$g ~ s{print $2}' /medaff/Scripts/Aggrify/sltconfig.cfg)
echo "TASK ID is $val"
Assuming your matching criterion is the first string after the delimiter _ and the output needed is the number after the = char, then you can try this sed:
$ sed -n "/_$1/I{s/[^=]*=\(.*\)/\1/p}" input_file
$ read -r input
hv
$ sed -n "/_$input/I{s/[^=]*=\(.*\)/\1/p}" input_file
988836
$ read -r input
dhs
$ sed -n "/_$input/I{s/[^=]*=\(.*\)/\1/p}" input_file
988834
If I'm reading it right, 2 quick versions -
$: cat 1
awk -F= -v s="_${1^^}_" '$1~s{print $2}' file
$: cat 2
sed -En "/_${1^^}_/{s/^.*=//;p;}" file
Both basically the same logic.
In pure bash -
$: cat 3
while IFS='=' read key val; do [[ "$key" =~ "_${1^^}_" ]] && echo "$val"; done < file
That's a lot less efficient, though.
If you know for sure there will be only one hit, all these could be improved a bit by short-circuit exits, but on such a small sample it won't matter at all. If you have a larger dataset to read, then I strongly suggest you formalize your specs better than "in this set I should get...".
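To illustrate the short-circuit idea mentioned above, here is a sketch; the here-doc just recreates the sample data from the question, and bash 4's ${var^^} uppercasing is assumed:

```shell
# Recreate the sample config file from the question.
cat > input_file <<'EOF'
iMedical_Refined_load_Procs_task_id=970113
HV_Rawlayer_Execution_Process=988835
iMedical_HV_Refined_Load=988836
DHS_RawLayer_Execution_Process=988833
iMedical_DHS_Refined_Load=988834
EOF

arg=hv                      # stands in for "$1"
# exit on the first hit, so the rest of the file is never read
awk -F= -v s="_${arg^^}_" '$1 ~ s { print $2; exit }' input_file
```

For arg=hv this prints 988836 and stops scanning immediately, which is the short-circuit improvement.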

Bash: How to count the number of occurrences of a string within a file?

I have a file that looks something like this:
dog
cat
dog
dog
fish
cat
I'd like to write some kind of code in Bash to make the file formatted like:
dog:1
cat:1
dog:2
dog:3
fish:1
cat:2
Any idea on how to do this? The file is very large (> 30K lines), so the code should be somewhat fast.
I am thinking some kind of loop...
Like this:
while read line; do
echo "$line" >> temp.txt
val=$(grep $line temp.txt)
echo "$val" >> temp2.txt
done < file.txt
And then paste -d ':' file1.txt temp2.txt
However, I am concerned that this would be really slow, as you're going line-by-line. What do other people think?
You may use this simple awk to do this job for you:
awk '{print $0 ":" ++freq[$0]}' file
dog:1
cat:1
dog:2
dog:3
fish:1
cat:2
Here's what I came up with:
declare -A arr; while read -r line; do ((arr[$line]++)); echo "$line:${arr[$line]}" >> output_file; done < input_file
First, declare the associative array arr. Then read every line in a while loop and increment the value in the array keyed by the line just read. Then echo out the line, followed by the value in the hash table. Lastly, append to the output file.
Awk and sed are very powerful, but they're not bash; here is a bash variant. Note that the count has to be printed as each item is seen in order to get the running totals:
raw=( $(cat file) ) # read file into an array, one word per element
declare -A index # init associative array
for item in "${raw[@]}"; { ((index[$item]++)); echo "$item:${index[$item]}"; } # count and print in one pass

awk usage in a variable

The actlist file contains around 15 records. I want to print/store each row in a variable to perform further actions. The script runs, but echo $j displays a blank value. What is the issue?
my script:
#!/usr/bin/sh
acList=/root/john/actlist
Rowcount=`wc -l $acList | awk -F " " '{print $1}'`
for ((i=1; i<=Rowcount; i++)); do
j=`awk 'FNR == $i{print}' $acList`
echo $j
done
file: actlist
cat > actlist
5663233332 2223 2
5656556655 5545 5
4454222121 5555 5
.
.
.
The issue happens to be related to quotes and to the way the shell interpolates variables.
More specifically, when you write
j=`awk "FNR == $i{print}" $acList`
the AWK code must be enclosed into double quotes. This is necessary if you want the shell to be able to substitute the $i with the actual value stored in the i variable.
On the other hand, if you write
j=`awk 'FNR == $i{print}' $acList`
i.e. with single quotes, the $i will be interpreted as a literal string.
Hence the fixed code will read:
#!/usr/bin/sh
acList=/root/john/actlist
Rowcount=`wc -l $acList | awk -F " " '{print $1}'`
for ((i=1; i<=Rowcount; i++)); do
j=`awk "FNR == $i{print}" $acList`
echo $j
done
Remember: it is always the shell that does variable interpolation before calling other commands.
Having said that, there are some places, in supplied code, where some improvements could be devised. But that's another story.
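One of those improvements is worth sketching (not part of the original answer): a while read loop scans the file a single time, instead of re-running awk over the whole file once per line as the for loop does. The demo file below is made up:

```shell
acList=./actlist            # hypothetical path for the demo
printf '%s\n' '5663233332 2223 2' '5656556655 5545 5' > "$acList"

# one pass over the file; each iteration puts one record in $j
while IFS= read -r j; do
    echo "$j"
done < "$acList"
```

This reads N lines with N iterations, whereas the for/awk version reads N lines N times (once per awk invocation), i.e. quadratic work.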
Unfortunately, all your script does is print the contents of the input file, so we can't help you figure out the right approach to whatever it is you REALLY want to do without more information on what that is, but chances are this is the right starting point:
acList=/root/john/actlist
awk '
{ print }
' "$acList"
I think you would probably be better off with this for parsing your file:
#!/bin/bash
while read a b c; do
echo $a, $b, $c
done < "$actlist"
Output:
5663233332, 2223, 2
5656556655, 5545, 5
4454222121, 5555, 5
Updated
Whilst the above demonstrates the concept I was suggesting, as @EdMorton rightly says in the comments section, the following code would be more robust for a production environment.
#!/bin/bash
while IFS=' ' read -r a b c; do
echo "$a, $b, $c"
done < "$actlist"

Adding spaces after each character in a string

I have a string variable in my script, made up of the 9 permission characters from ls -l
eg:
rwxr-xr--
I want to manipulate it so that it displays like this:
r w x r - x r - -
i.e. every group of three characters is tab-separated and all others are separated by a space. The closest I've come is using printf:
printf "%c %c %c\t%c %c %c\t%c %c %c\t/\n" "$output"{1..9}
This only prints the first character but formatted correctly
I'm sure there's a way to do it using "sed" that I can't think of
Any advice?
Using the POSIX-specified utilities fold and paste, split the string into individual characters, and then interleave a cycling list of delimiters (two spaces, then a tab):
fold -w1 <<<"$str" | paste -sd'  \t' -
$ sed -r 's/(.)(.)(.)/\1 \2 \3\t/g' <<< "$output"
r w x r - x r - -
Sadly, this leaves a trailing tab in the output. If you don't want that, use:
$ sed -r 's/(.)(.)(.)/\1 \2 \3\t/g; s/\t$//' <<< "$str"
r w x r - x r - -
Why do you need to parse them? You can access every element of the string by copying the needed element. It's very easy and needs no external utility, for example:
DATA="rwxr-xr--"
i=0
while [ $i -lt ${#DATA} ]; do
echo ${DATA:$i:1}
i=$(( i+1 ))
done
With awk:
$ echo "rwxr-xr--" | awk '{gsub(/./,"& ");gsub(/. . . /,"&\t")}1'
r w x r - x r - -
> echo "rwxr-xr--" | sed 's/\(.\{3,3\}\)/\1\t/g;s/\([^\t]\)/\1 /g;s/\s*$//g'
r w x r - x r - -
( Evidently I didn't put much thought into my sed command. John Kugelman's version is obviously much clearer and more concise. )
Edit: I wholeheartedly agree with triplee's comment though. Don't waste your time trying to parse ls output. I did that for a long time before I figured out you can get exactly what you want (and only what you want) much easier by using stat. For example:
> stat -c %a foo.bar # Equivalent to stat --format %a
0754
The -c %a tells stat to output the access rights of the specified file, in octal. And that's all it prints out, thus eliminating the need to do wacky stuff like ls foo.bar | awk '{print $1}', etc.
So for instance you could do stuff like:
GROUP_READ_PERMS=040
perms=$(stat -c %a foo.bar)
if (( (8#$perms & GROUP_READ_PERMS) != 0 )); then   # 8# forces octal interpretation of $perms
... # Do some stuff
fi
Sure as heck beats parsing strings like "rwxr-xr--"
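One end-to-end sketch of that check, since the arithmetic has a subtle trap (GNU stat -c is assumed, and the mktemp file exists only for the demo): stat prints the mode in octal digits, but bash arithmetic reads a bare variable like 754 as decimal, so the 8# prefix is needed to interpret it as octal.

```shell
tmp=$(mktemp)                 # throwaway demo file
chmod 754 "$tmp"
perms=$(stat -c %a "$tmp")    # "754": octal digits, but just a string here
GROUP_READ_PERMS=040          # a literal with a leading 0 is octal inside (( ))
# 8# forces bash to read $perms as octal; without it 754 would be decimal
if (( (8#$perms & GROUP_READ_PERMS) != 0 )); then
    echo "group-readable"
fi
rm -f "$tmp"
```

On a file with mode 754 this prints group-readable, because the group triad (5 = r-x) has the read bit set.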
sed 's/.../& /2g;s/./& /g' YourFile
in 2 simple steps
A version which uses pure bash for short strings and sed for longer strings, and preserves newlines (adding a space after them too):
if [ "${OS-}" = "Windows_NT" ]; then
threshold=1000
else
threshold=100
fi
function escape()
{
local out=''
local -i i=0
local str="${1}"
if [ "${#str}" -gt "${threshold}" ]; then
# Faster after sed is started
sed '# Read all lines into one buffer
:combine
$bdone
N
bcombine
:done
s/./& /g' <<< "${str}"
else
# Slower, but no process to load, so faster for short strings. On windows
# this can be a big deal
while (( i < ${#str} )); do
out+="${str:$i:1} "
i+=1
done
echo "$out"
fi
}
Explanation of the sed: "If this is the last line, jump to :done, else append the Next line into the buffer and jump to :combine." After :done is a simple sed replacement expression. The entire string (newlines and all) is in one buffer so that the replacement works on newlines too (which are lost in some of the awk -F examples).
Plus this is Linux, Mac, and Git for Windows compatible.
Setting awk -F '' makes each character its own field; then you can loop through and print each field.
Example:
ls -l | sed -n 2p | awk -F '' '{for(i=1;i<=NF;i++){printf " %s ",$i;}}'; echo ""
This part seems like the answer to your question:
awk -F '' '{for(i=1;i<=NF;i++){printf " %s ",$i;}}'
I realize this doesn't provide the grouping into threes that you wanted, though. hmmm...
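The grouping can be recovered with a plain substr loop, which also avoids relying on the empty-FS behavior (a sketch, POSIX awk only):

```shell
echo "rwxr-xr--" | awk '{
    n = length($0)
    for (i = 1; i <= n; i++) {
        printf "%s", substr($0, i, 1)               # one character at a time
        if (i < n) printf "%s", (i % 3 == 0 ? "\t" : " ")  # tab every 3rd char
    }
    print ""
}'
```

This emits a space between characters within a triad and a tab between triads, matching the requested format without a trailing delimiter.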

Can't get IFS to work when converting array to string

Below is a bash shell script that takes in a csv file and spits out rows formatted the way I want (there are more changes, but I kept only the array-affecting ones below).
FILENAME=$1
cat $FILENAME | while read LINE
do
OIFS=$IFS;
IFS=","
columns=( $LINE )
date=${columns[4]//\"/}
columns[13]=${columns[13]//\"/}
columns[4]=$(date -d $date +%s)
newline=${columns[*]}
echo $newline
IFS=$OIFS;
done
I'm using GNU bash v 4.1.2(1)-release for CentOS 6.3. I've tried putting quotes like
newline="${columns[*]}"
Still no luck.
Following is sample data line
112110120001299169,112110119001295978,11,"121.119.163.146.1322221980963094","2012/11/01"
It seems like it should be outputting the array into a comma delimited string. Instead, the string is space delimited. Anyone know the reason why?
I suspect it has something to do with the fact that if I echo out $IFS in script it's an empty string, but when I echo out "${IFS}" it's then the comma I expect.
Edit: Solution
I found the solution. When echoing out $newline, I have to use quotes around it, i.e.
echo "$newline"
Otherwise, it uses the default blanks. I believe it has something to do with bash only subbing in for the IFS when you force it to with the quotes.
I'm not clear on why, but bash only seems to use the first character of IFS as a delimiter when expanding ${array[*]} when it's in double-quotes:
$ columns=(a b "c d e" f)
$ IFS=,
$ echo ${columns[*]}
a b c d e f
$ echo "${columns[*]}"
a,b,c d e,f
$ newline=${columns[*]}; echo "$newline"
a b c d e f
$ newline="${columns[*]}"; echo "$newline"
a,b,c d e,f
Fortunately, the solution is simple: use double-quotes (newline="${columns[*]}")
(BTW, my testing was all on bash v3 and v2, as I don't have v4 handy; so it might be different for you.) (UPDATE: tested on bash v4.2.10, same results.)
Edit: Thanks to @GordonDavidson, removed erroneous comments about how IFS works in bash.
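If changing IFS feels fragile, a printf -v join avoids it entirely (a sketch, not from the original answer, reusing the same test array as above):

```shell
columns=(a b "c d e" f)
printf -v newline '%s,' "${columns[@]}"   # join all elements, trailing comma included
newline=${newline%,}                      # trim the trailing comma
echo "$newline"
```

printf -v writes straight into the variable, so no subshell or IFS juggling is needed, and the delimiter can be any string, not just a single character.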
awk has a very nice pair of variables, namely FS="," and OFS="|", that perform this transformation. You'll have to construct awk -F, '{"date -d "$date" +%s" | getline columns[4]}' or similar to call external programs and fill variables. Not quite as intuitive as the shell's c[4]=$(date ...), but awk is a very good tool to learn for data manipulations like those outlined in your question.
Something like
#!/bin/awk -f
{
# columns=( $LINE ) with IFS=","
n = split($0, columns, ",")
# date=${columns[4]//\"/}  (awk arrays are 1-indexed, so bash's [4] is awk's [5])
myDate = columns[5]; gsub(/"/, "", myDate)
# columns[13]=${columns[13]//\"/}
gsub(/"/, "", columns[14])
# columns[4]=$(date -d $date +%s)
("date -d '" myDate "' +%s") | getline columns[5]
# newline=${columns[*]}; echo $newline
out = columns[1]
for (i = 2; i <= n; i++) out = out "," columns[i]
print out
}
used like
cat myFile | ./myAwkScript (after chmod +x myAwkScript)
should achieve the same result.
Sorry but I don't have the time, OR the sample data to test this right now.
Feel free to reply with error messages that you get, and I'll see if I can help.
You might also consider updating your posting with 1 line of sample data, and a date value you want to process.
IHTH
