Bash Scripting, Reading From A File [duplicate] - linux

This question already has answers here:
While loop stops reading after the first line in Bash
(5 answers)
Closed 2 years ago.
I'm trying to select the lines that start with F from my .txt file and then find the average of their numbers. Here's my code; I don't know why it's not adding up.
#!/bin/bash
function F1()
{
    count=1;
    total=0;
    file='users.txt'
    while read line; do
        if (grep \^"F");
        then
            for i in $( awk '{ print $3; }')
            do
                total=$(echo $total+$i )
                var=$((var+1))
            done
        fi
        echo "scale=2; $total / $count"
        echo $line
    done < $file
}
F1
my output

If you are using Awk anyway, use its capabilities.
awk '/^F/ { sum+=$3; count++ } END { print (count ? sum/count : 0) }' users.txt
This is a standard Awk idiom which will typically be an exercise within the first hour of any slow-paced basic beginner Awk tutorial, or within ten minutes of more directed learning.
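For instance, with a hypothetical users.txt whose third field holds the number to average, such as
F alice 20
M bob 30
F carol 40
the one-liner looks only at the two F lines and prints 30.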
Your shell script had at least the following errors:
grep without a file name argument reads standard input, so it will consume the rest of your input lines in one go. Running it in a subshell (i.e. inside parentheses) does nothing useful, and costs you a process.
The shell does not perform any arithmetic unless you separately request it; variables are simply strings. If you want to add two numbers, you need to explicitly use an arithmetic evaluation like ((total+=i)), as you already did in another place.
read without any options will mangle backslashes in your input; if you don't specifically require this behavior, you should always use read -r.
All except one of those semicolons are useless.
You should generally quote all your variables; see When to wrap quotes around a shell variable
If you are using Awk anyway, you should probably use it for as much as possible. The shell is very slow and inefficient when it comes to processing a file line by line, and horribly clunky when it comes to integer arithmetic (let alone dividing numbers which are not even multiples of each other).
This is probably not complete; also try http://shellcheck.net/ before asking for human assistance.
On Stack Overflow, try to reduce your problem so that it really only asks a single question. There are duplicate questions for all of these issues, but I'll just mark this as a duplicate of the one you are actually asking about.
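For what it's worth, a pure-shell rewrite with the fixes above applied might look something like this (a sketch only; it leans on bc for the decimal math, and the Awk one-liner above remains the better tool):
#!/bin/bash
# A sketch of the same calculation in plain shell, assuming (like the question)
# that users.txt has the number to average in its third field.
total=0
count=0
while read -r first _ third _; do
    if [[ $first == F* ]]; then
        total=$(echo "$total + $third" | bc)   # bc handles non-integer values
        ((count++))
    fi
done < users.txt
if ((count > 0)); then
    echo "scale=2; $total / $count" | bc
else
    echo "no lines starting with F"
fi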

You don't perform an addition for the variable total. On the first iteration, the line
total=$(echo $total+$i )
(assuming that i, for instance, is 4711) is expanded to
total=0+4711
So the variable total is set to a 6-character string, not a number. To actually add here, you would have to write
((total = total + i))
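You can see the difference directly in an interactive shell:
total=0; i=4711
total=$(echo $total+$i); echo "$total"    # prints 0+4711, a 6-character string
total=0
((total = total + i)); echo "$total"      # prints 4711, an actual sum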

Related

When processing string with Bash, how to treat comma differently depending on whether it's surrounded by some specific characters?

I would like to transform a MySQL script into a JSON file and was asked to use Bash for it.
By writing a simple shell script:
#!/bin/bash
# I know this script just outputs each entry with its value, because I haven't gone any further
for filename in $dir/home/*.sql
do
    cat $filename | while read line
    do
        names=${line%values*}
        names=${names#*(}
        names=${names%)*}
        values=${line#*values(}
        values=${values%)*}
        while [[ $names != $currentname ]]
        do
            currentname=${names%%,*}
            currentvalue=${values%%,*}
            echo $currentname
            echo $currentvalue
            names=${names#*,}
            values=${values#*,}
        done
    done
done
I have been basically able to fulfill the requirement. However, there is one more problem.
Some of the string entries have commas among their characters.
This causes my script to treat those commas as value separators, so a string containing a comma ends up split into two different strings.
It would be an easy task to solve this with a programming language like C++, but I have been asked to do it only with a bash shell script, which I am not familiar with. So now I am stuck with no clue. Maybe a regular expression would be the cure? If there are other approaches, please also help.
FYI, here is an example of the problem:
Input:
values(100, 'A100', 'A,100');
Expected output:
100
'A100'
'A,100'
Actual current output:
100
'A100'
'A
100'
Something like this may help:
data="values(100, 'A100', 'A,100');"
json=${data//values(}
json=${json//);}
json=${json//, /$'\n'}
echo "$json"
Expected output:
100
'A100'
'A,100'
Typically in shell you would match it with a regex:
echo "values(100, 'A100', 'A,100');" | sed 's/values(//; s/\(, \|);\)/\n/g'
but this does not solve the problem at all.
The best and only solution is to write a real parser for the real MySQL language to 'handle' '' ' ' 'all\tcorner\'cases' properly. Read the input char by char, store state (e.g. whether you are inside a quotation or not), and handle '\'' and other sequences like \n for the purpose of extracting the field. You might interest yourself in the MySQL internal lexer (it's big!) and the lex and yacc programs.
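To make the "store state" idea concrete, here is a minimal character-by-character sketch for a single line; it tracks quoting but deliberately ignores escaped quotes and the other corner cases a real parser would have to handle:
line="values(100, 'A100', 'A,100');"
body=${line#*\(}          # strip everything up to and including the opening parenthesis
body=${body%\)*}          # strip the trailing ");"
in_quote=0
field=""
for ((i = 0; i < ${#body}; i++)); do
    c=${body:i:1}
    if [[ $c == "'" ]]; then
        in_quote=$((1 - in_quote))     # toggle quote state
        field+=$c
    elif [[ $c == "," && $in_quote -eq 0 ]]; then
        echo "${field# }"              # a comma outside quotes ends the field
        field=""
    else
        field+=$c
    fi
done
echo "${field# }"
This prints 100, 'A100' and 'A,100' on separate lines for the sample input.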
Check your scripts with http://shellcheck.net . Read https://mywiki.wooledge.org/BashFAQ/001 . Quote variable expansions. Don't be nominated for the useless cat award.
and was asked to use Bash for it.
Bash is a shell: its primary role is to run other programs and connect them with each other. It is not a full-blown programming language, and writing program-like logic in it is going to be very hard, or it just ends up calling external programs, because that is what it is for. Write the parser in another language and use bash to run it. If you're comfortable in C++, write the parser in C++, then compile and execute it from a bash script.
A common arrangement is to use regex for this, yes; for example, this is a requirement for parsing CSV files. But you can parse the line piece by piece like in your attempt.
However, you have a number of quoting errors which would prevent your code from working even if you figured out a way to parse the input the way you want to. (And of course, get rid of the Useless use of cat?)
while read -r line; do
    case $line in
        *values\(*\)\; );;
        *) continue;;
    esac
    line=${line#values\(}
    line=${line%\)\;}
    while [ "$line" ]; do
        case $line in
            \'*)
                line=${line#\'}
                tail=${line#*\'}
                value=\'${line%"$tail"}
                line=${tail#,}
                line=${line# };;
            *)
                value=${line%%,*}
                line=${line#*,}
                line=${line# };;
        esac
        echo "$value"
    done
done <"$filename"
This is probably not really the way to go, just a hint if you really want to try to tackle this in Bash. I would write a simple parser in Python if I wanted to cover all bases.

Understanding the dollar in shell with pwd and awk [duplicate]

This question already has answers here:
Backticks vs braces in Bash
(3 answers)
Brackets ${}, $(), $[] difference and usage in bash
(1 answer)
Closed 4 years ago.
I have two questions and could use some help understanding them.
What is the difference between ${} and $()? I understand that () means running a command in a separate shell, and placing $ in front of it means passing the value to a variable. Can someone help me understand this? Please correct me if I am wrong.
If we can use for ((i=0;i<10;i++)); do echo $i; done and it works fine, then why can't I use it as while ((i=0;i<10;i++)); do echo $i; done? What is the difference in the execution cycle of the two?
The syntax is token-level, so the meaning of the dollar sign depends on the token it's in. The expression $(command) is a modern synonym for `command` which stands for command substitution; it means run command and put its output here. So
echo "Today is $(date). A fine day."
will run the date command and include its output in the argument to echo. The parentheses are unrelated to the syntax for running a command in a subshell, although they have something in common (the command substitution also runs in a separate subshell).
By contrast, ${variable} is just a disambiguation mechanism, so you can say ${var}text when you mean the contents of the variable var, followed by text (as opposed to $vartext which means the contents of the variable vartext).
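A quick illustration of the difference:
var=foo
echo "${var}text"   # prints: footext
echo "$vartext"     # prints an empty line, because it expands the (unset) variable vartext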
The while loop expects a single command which should evaluate to true or false (or actually a list of commands, where the last one's truth value is examined; thanks Jonathan Leffler for pointing this out); when it's false, the loop is no longer executed. The for loop iterates over a list of items and binds each to a loop variable in turn; the syntax you refer to is one (rather generalized) way to express a loop over a range of arithmetic values.
A for loop like that can be rephrased as a while loop. The expression
for ((init; check; step)); do
body
done
is equivalent to
init
while check; do
body
step
done
It makes sense to keep all the loop control in one place for legibility; but as you can see when it's expressed like this, the for loop does quite a bit more than the while loop.
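Applied to the concrete loop from the question, that rewrite becomes:
i=0                      # init
while ((i < 10)); do     # check
    echo "$i"            # body
    ((i++))              # step
done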
Of course, this syntax is Bash-specific; classic Bourne shell only has
for variable in token1 token2 ...; do
(Somewhat more elegantly, you could avoid the echo in the first example as long as you are sure that your argument string doesn't contain any % format codes:
date +'Today is %c. A fine day.'
Avoiding a process where you can is an important consideration, even though it doesn't make a lot of difference in this isolated example.)
$() means: "first evaluate this, and then evaluate the rest of the line".
Ex:
echo $(pwd)/myFile.txt
will be interpreted as
echo /my/path/myFile.txt
On the other hand ${} expands a variable.
Ex:
MY_VAR=toto
echo ${MY_VAR}/myFile.txt
will be interpreted as
echo toto/myFile.txt
Why can't I use it as bash$ while ((i=0;i<10;i++)); do echo $i; done
I'm afraid the answer is just that the bash syntax for while just isn't the same as the syntax for for.
Your understanding is right. For detailed info on ${} see bash ref - parameter expansion
'for' and 'while' have different syntax and offer different styles of programmer control for an iteration. Most non-asm languages offer a similar syntax.
With while, you would probably write something like i=0; while [ $i -lt 10 ]; do echo $i; i=$(( i + 1 )); done and, in essence, manage everything about the iteration yourself.

concatenate two strings and one variable using bash

I need to generate a filename from three parts: two strings and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I get the files below instead:
_1.fastq.gze_11
_1.fastq.gze_12
The string after the variable seems to delete the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided. Refer: Dangerous Backticks
while read -r f
do
    echo "fastq/${f}_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command on as many files as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
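With the two sample lines shown in the question, that should print:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze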
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo fastq/${f}_1.fastq.gz
See this answer for some details about the general concept, as well.
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the CSV file ends its lines with just a carriage return (ASCII/Unicode 13), the end of Sample_11 might "rewind" the line to the start and overwrite.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
    echo "fastq/${f}_1.fastq.gze"
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...
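If carriage-return-only line endings do turn out to be the culprit, a minimal sketch of the conversion mentioned above would be:
# Convert lone carriage returns to line feeds before building the names
# (assumes files.csv is the input file from the question).
tr '\015' '\012' < files.csv | while read -r f; do
    echo "fastq/${f}_1.fastq.gze"
done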

Print previous line if condition is met

I would like to grep a word and then find the second column in the line and check if it is bigger than a value. If yes, I want to print the previous line.
Ex:
Input file
AAAAAAAAAAAAA
BB 2
CCCCCCCCCCCCC
BB 0.1
Output
AAAAAAAAAAAAA
Now, I want to search for BB and if the second column (2 or 0.1) in that line is bigger than 1, I want to print the previous line.
Can somebody help me with grep and awk? Any other suggestions are also welcome. Thanks.
This can be a way:
$ awk '$1=="BB" && $2>1 {print f} {f=$0}' file
AAAAAAAAAAAAA
Explanation
$1=="BB" && $2>1 {print f} if the 1st field is exactly BB and 2nd field is bigger than 1, then print f, a stored value.
{f=$0} stores the current line in f, so that it is accessible when reading the next line.
Another option: reverse the file and print the next line if the condition matches:
tac file | awk '$1 == "BB" && $2 > 1 {getline; print}' | tac
Concerning generality
I think it needs to be mentioned that the most general solution to this class of problem involves two passes:
the first pass to add a decimal row number ($REC) to the front of each line, effectively grouping lines into records by $REC
the second pass to trigger on the first instance of each new value of $REC as a record boundary (resetting $CURREC), thereafter rolling along in the native AWK idiom concerning the records to follow matching $CURREC.
In the intermediate file, some sequence of decimal digits followed by a separator (for human reasons, typically an added tab or space) is parsed (aka conceptually snipped off) as out-of-band with respect to the baseline file.
Command line paste monster
Even confined to the command line, it's an easy matter to ensure that the intermediate file never hits disk. You just need to use an advanced shell such as ZSH (my own favourite) which supports process substitution:
paste <( <input.txt awk "BEGIN { R=0; N=0; } /Header pattern/ { N=1; } { R=R+N; N=0; print R; }" ) input.txt | awk -f yourscript.awk
Let's render that one-liner more suitable for exposition:
P="/Header pattern/"
X="BEGIN { R=0; N=0; } $P { N=1; } { R=R+N; N=0; print R; }"
paste <( <input.txt awk "$X" ) input.txt | awk -f yourscript.awk
This starts three processes: the trivial inline AWK script, paste, and the AWK script you really wanted to run in the first place.
Behind the scenes, the <() command line construct creates a named pipe and passes the pipe name to paste as the name of its first input file. For paste's second input file, we give it the name of our original input file (this file is thus read sequentially, in parallel, by two different processes, which will consume between them at most one read from disk, if the input file is cold).
The magic named pipe in the middle is an in-memory FIFO that ancient Unix probably managed at about 16 kB of average size (intermittently pausing the paste process if the yourscript.awk process is sluggish in draining this FIFO back down).
Perhaps modern Unix throws a bigger buffer in there because it can, but it's certainly not a scarce resource you should be concerned about, until you write your first truly advanced command line with process redirection involving these by the hundreds or thousands :-)
Additional performance considerations
On modern CPUs, all three of these processes could easily find themselves running on separate cores.
The first two of these processes border on the truly trivial: an AWK script with a single pattern match and some minor bookkeeping, paste called with two arguments. yourscript.awk will be hard pressed to run faster than these.
What, your development machine has no lightly loaded cores to render this master shell-master solution pattern almost free in the execution domain?
Ring, ring.
Hello?
Hey, it's for you. 2018 just called, and wants its problem back.
2020 is officially the reprieve of MTV: That's the way we like it, magic pipes for nothing and cores for free. Not to name out loud any particular TLA chip vendor who is rocking the space these days.
As a final performance consideration, if you don't want the overhead of parsing actual record numbers:
X="BEGIN { N=0; } $P { N=1; } { print N; N=0; }"
Now your in-FIFO intermediate file is annotated with just an additional two characters prepended to each line ('0' or '1' and the default separator character added by paste), with '1' demarking first line in record.
Named FIFOs
Under the hood, these are no different than the magic FIFOs instantiated by Unix when you write any normal pipe command:
cat file | proc1 | proc2 | proc3
Three unnamed pipes (and a whole process devoted to cat you didn't even need).
It's almost unfortunate that the truly exceptional convenience of the default stdin/stdout streams as premanaged by the shell obscures the reality that paste $magictemppipe1 $magictemppipe2 bears no additional performance considerations worth thinking about, in 99% of all cases.
"Use the <() Y-joint, Luke."
Your instinctive reflex toward natural semantic decomposition in the problem domain will herewith benefit immensely.
If anyone had had the wits to name the shell construct <() as the YODA operator in the first place, I suspect it would have been pressed into universal service at least a solid decade ago.
Combining sed & awk you get this:
sed 'N;s/\n/ /' < file |awk '$3>1{print $1}'
sed 'N;s/\n/ /': combines the 1st and 2nd lines, replacing the newline character with a space
awk '$3>1{print $1}': prints $1 (the 1st column) if $3 (the 3rd column) is greater than 1
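For the sample input above, the sed stage joins the lines pairwise into
AAAAAAAAAAAAA BB 2
CCCCCCCCCCCCC BB 0.1
and the awk stage then prints AAAAAAAAAAAAA, the only $1 whose $3 exceeds 1. Note that this relies on the header and BB lines strictly alternating.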

Resources