Awk: loop & save different lines to different files? - linux

I'm looping over a series of large files with a shell script:
i=0
while read line
do
# get first char of line
first=`echo "$line" | head -c 1`
# make output filename
name="$first"
if [ "$first" = "," ]; then
name='comma'
fi
if [ "$first" = "." ]; then
name='period'
fi
# save line to new file
echo "$line" >> "$2/$name.txt"
# show live counter and inc
echo -en "\rLines:\t$i"
((i++))
done <$file
The first character in each line will either be alphanumeric, or one of the above defined characters (which is why I'm renaming them for use in the output file name).
It's way too slow.
5,000 lines takes 128 seconds.
At this rate I've got a solid month of processing.
Will awk be faster here?
If so, how do I fit the logic into awk?

This can certainly be done more efficiently in bash.
To give you an example: echo foo | head does a fork() call, creates a subshell, sets up a pipeline, starts the external head program... and there's no reason for it at all.
If you want the first character of a line, without any inefficient mucking with subprocesses, it's as simple as this:
c=${line:0:1}
I would also seriously consider sorting your input, so you can only re-open the output file when a new first character is seen, rather than every time through the loop.
That is -- preprocess with sort (as by replacing <$file with < <(sort "$file")) and do the following each time through the loop, reopening the output file only conditionally:
if [[ $name != "$current_name" ]] ; then
current_name="$name"
exec 4>>"$2/$name" # open the output file on FD 4
fi
...and then append to the open file descriptor:
printf '%s\n' "$line" >&4
(not using echo because it can behave undesirably if your line is, say, -e or -n).
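A quick illustration of the problem (the line content here is a contrived example):
line='-n'
echo "$line" # bash's echo treats -n as an option and prints nothing
printf '%s\n' "$line" # always prints -n literally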
Alternately, if the number of possible output files is small, you can just open them all on different FDs up-front (substituting other, higher numbers where I chose 4), and conditionally output to one of those pre-opened files. Opening and closing files is expensive -- each close() forces a flush to disk -- so this should be a substantial help.
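For instance, a minimal sketch of the pre-opened approach, assuming for illustration that only comma, period, and a single catch-all bucket are needed (the FD numbers and the other.txt name are arbitrary):
exec 4>>"$2/comma.txt" 5>>"$2/period.txt" 6>>"$2/other.txt"
while IFS= read -r line; do
case ${line:0:1} in
,) printf '%s\n' "$line" >&4 ;;
.) printf '%s\n' "$line" >&5 ;;
*) printf '%s\n' "$line" >&6 ;;
esac
done < "$file"
exec 4>&- 5>&- 6>&- # close the FDs once, at the end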

A few things to speed it up:
Don't use echo/head to get the first character. You're spawning at least two additional processes per line. Instead, use bash's parameter expansion facilities to get the first character.
Use if-elif to avoid checking $first against all the possibilities each time. Even better, if you are using bash 4.0 or later, use an associative array to store the output file names, rather than checking against $first in a big if-statement for each line.
If you don't have a version of bash that supports associative arrays, replace your if statements with the following.
if [[ "$first" = "," ]]; then
name='comma'
elif [[ "$first" = "." ]]; then
name='period'
else
name="$first"
fi
Better still, the following is suggested. Note that $REPLY is the default variable used by read when no name is given (just FYI).
declare -A output
output[","]=comma
output["."]=period
output["?"]=question_mark
output["!"]=exclamation_mark
output["-"]=hyphen
output["'"]=apostrophe
i=0
while read
do
# get first char of line
first=${REPLY:0:1}
# make output filename
name=${output[$first]:-$first}
# save line to new file
echo "$REPLY" >> "$name.txt"
# show live counter and inc
echo -en "\r$i"
((i++))
done <$file

#!/usr/bin/awk -f
BEGIN {
punctlist = ", . ? ! - '"
pnamelist = "comma period question_mark exclamation_mark hyphen apostrophe"
pcount = split(punctlist, puncts)
ncount = split(pnamelist, pnames)
if (pcount != ncount) {print "error: counts don't match, pcount:", pcount, "ncount:", ncount; exit}
for (i = 1; i <= pcount; i++) {
punct_lookup[puncts[i]] = pnames[i]
}
}
{
# look up the name for the first char; alphanumeric chars map to themselves
first = substr($0, 1, 1)
name = (first in punct_lookup) ? punct_lookup[first] : first
print > (name ".txt")
printf "\r%6d", i++
}
END {
printf "\n"
}
The BEGIN block builds an associative array so you can do punct_lookup[","] and get "comma".
The main block simply does the lookups for the filenames and outputs the line to the file. In AWK, > truncates the file the first time and appends subsequently. If you have existing files that you don't want truncated, then change it to >> (but don't use >> otherwise).
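A tiny demonstration of that semantic (out.txt is just a placeholder name):
awk 'BEGIN { print "one" > "out.txt"; print "two" > "out.txt" }'
Running this leaves out.txt containing both lines: the first print truncated the file, the second appended to it.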

Yet another take:
declare -i i=0
declare -A names
while read line; do
first=${line:0:1}
if [[ -z ${names[$first]} ]]; then
case $first in
,) names[$first]="$2/comma.txt" ;;
.) names[$first]="$2/period.txt" ;;
*) names[$first]="$2/$first.txt" ;;
esac
fi
printf "%s\n" "$line" >> "${names[$first]}"
printf "\rLine $((++i))"
done < "$file"
and
awk -v dir="$2" '
{
first = substr($0,1,1)
if (! (first in names)) {
if (first == ",") names[first] = dir "/comma.txt"
else if (first == ".") names[first] = dir "/period.txt"
else names[first] = dir "/" first ".txt"
}
print > names[first]
printf("\rLine %d", NR)
}
' "$file"

Related

Linux script reading an ini file and splitting into variables by a specified character

I'm stuck on the following task: let's pretend we have an .ini file in a folder. The file contains lines like this:
eno1=10.0.0.254/24
eno2=172.16.4.129/25
eno3=192.168.2.1/25
tun0=10.10.10.1/32
I had to choose the biggest subnet mask. So my attempt was:
declare -A data
for f in datadir/name
do
while read line
do
r=(${line//=/ })
let data[${r[0]}]=${r[1]}
done < $f
done
This is how far I got. (Yeah, I know the file named name is not an .ini file but a .txt, since I had problems even creating an ini file; the teacher didn't even give us a file like that for our exam.)
It splits the line at the =, but it won't read the IP number because of the (first) . character.
(Invalid arithmetic operator is the error message I got.)
If someone could help me and explain how I can write a script for tasks like this, I would be really thankful!
Both previously presented solutions work (and do what they're designed to do); I thought I'd add something left-field, as the specifications are fairly loose.
$ cat freasy
eno1=10.0.0.254/24
eno2=172.16.4.129/25
eno3=192.168.2.1/25
tun0=10.10.10.1/32
I'd argue that the biggest subnet mask is the one with the lowest numerical value (holds the most hosts).
$ sort -t/ -k2,2nr freasy| tail -n1
eno1=10.0.0.254/24
Don't use let. It's for arithmetic.
$ help let
let: let arg [arg ...]
Evaluate arithmetic expressions.
Evaluate each ARG as an arithmetic expression.
Just use straight assignment:
declare -A data
for f in datadir/name
do
while read line
do
r=(${line//=/ })
data[${r[0]}]=${r[1]}
done < $f
done
Result:
$ declare -p data
declare -A data=([tun0]="10.10.10.1/32" [eno1]="10.0.0.254/24" [eno2]="172.16.4.129/25" [eno3]="192.168.2.1/25" )
awk provides a simple solution to find the max value following the '/' that will be orders of magnitude faster than a bash script or Unix pipeline using:
awk -F"=|/" '$3 > max { max = $3 } END { print max }' file
Example Use/Output
$ awk -F"=|/" '$3 > max { max = $3 } END { print max }' file
32
The awk command above separates the fields using either '=' or '/' as the field separator, then keeps the maximum of the 3rd field ($3) and outputs that value in the END {...} rule.
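To see the resulting field split on one sample line:
$ echo 'eno1=10.0.0.254/24' | awk -F'=|/' '{ print $1, $2, $3 }'
eno1 10.0.0.254 24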
Bash Solution
If you did want a bash script solution, then you can isolate the wanted parts of each line using [[ .. =~ .. ]] to populate the BASH_REMATCH array and then compare ${BASH_REMATCH[3]} against a max variable. The [[ .. ]] expression with =~ considers everything on the right side an Extended Regular Expression and will isolate each grouping ((...)) as an element in the array BASH_REMATCH, e.g.
#!/bin/bash
[ -z "$1" ] && { printf "filename required\n" >&2; exit 1; }
declare -i max=0
while read -r line; do
[[ $line =~ ^(.*)=(.*)/(.*)$ ]]
((${BASH_REMATCH[3]} > max)) && max=${BASH_REMATCH[3]}
done < "$1"
printf "max: %s\n" "$max"
Using Only POSIX Parameter Expansions
Using parameter expansion with substring removal supported by POSIX shell (Bourne shell, dash, etc..), you could do:
#!/bin/sh
[ -z "$1" ] && { printf "filename required\n" >&2; exit 1; }
max=0
while read line; do
[ "${line##*/}" -gt "$max" ] && max="${line##*/}"
done < "$1"
printf "max: %s\n" "$max"
Example Use/Output
After making yourscript.sh executable with chmod +x yourscript.sh, you would do:
$ ./yourscript.sh file
max: 32
(same output for both shell script solutions)
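For clarity, ${line##*/} removes the longest prefix matching */, i.e. everything up to and including the last slash:
$ line='eno2=172.16.4.129/25'
$ echo "${line##*/}"
25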
Let me know if you have further questions.

How do I indirectly assign a variable in bash to take multi-line data from both Standard In, a File, and the output of execution

I have found many snippets here and in other places that answer parts of this question. I have even managed to do this in many steps in an inefficient manner. If it is possible, I would really like to find single lines of execution that will perform this task, rather than having to assign to a variable and copy it a few times to perform the task.
e.g.
executeToVar ()
{
# Takes Arg1: NAME OF VARIABLE TO STORE IN
# All Remaining Arguments Are Executed
local STORE_INvar="${1}" ; shift
eval ${STORE_INvar}=\""$( "$@" 2>&1 )"\"
}
Overall this does work, i.e. $ executeToVar SOME_VAR ls -l * will actually fill SOME_VAR with the output of executing the ls -l * command taken from the rest of the arguments. However, if the command outputs empty lines at the end (e.g. echo -e -n '\n\n123\n456\n789\n\n', which should have two newlines at the start and two at the end), these are stripped by bash's command substitution. I have seen in other posts similar to this that it has been solved by adding a token 'x' to the end of the stream, e.g. turning the sub-execution into something like:
eval ${STORE_INvar}=\""$( "$@" 2>&1 ; echo -n x )"\" # <-- ( Add echo -n x )
# and then if it wasn't an indirect reference to a var:
STORE_INvar=${STORE_INvar%x}
# However no matter how much I play with:
eval "${STORE_INvar}"=\""${STORE_INvar%x}"\"
# I am unable to indirectly remove the x from the end.
Anyway, I also need two other variants of this: one that assigns the STDIN stream to the var, and one that assigns the contents of a file to the var, which I assume will be variations of this involving $( cat ${1} ), or maybe $( cat ${1:--} ) to give me a '-' (stdin) if no filename. But none of that will work until I can sort out the removal of the x that is needed to ensure accurate assignment of multi-line variables.
I have also tried (but to no avail):
IFS='' read -d '' "${STORE_INvar}" <<<"$( $@ ; echo -n x )"
eval \"'${STORE_INvar}=${!STORE_INvar%x}'\"
This is close to optimal -- but drop the eval.
executeToVar() { local varName=$1; shift; printf -v "$varName" %s "$("$@")"; }
The one problem this formulation still has is that $() strips trailing newlines. If you want to prevent that, you need to add your own trailing character inside the subshell, and strip it off yourself.
executeToVar() {
local varName=$1; shift;
local val="$(printf %s x; "$@"; printf %s x)"; val=${val#x}
printf -v "$varName" %s "${val%x}"
}
If you want to read all content from stdin into a variable, this is particularly easy:
# This requires bash 4.1 for automatic fd allocation
readToVar() {
if [[ $2 && $2 != "-" ]]; then
exec {read_in_fd}<"$2" # copy from named file
else
exec {read_in_fd}<&0 # copy from stdin
fi
IFS= read -r -d '' "$1" <&$read_in_fd # read from the FD
exec {read_in_fd}<&- # close that FD
}
...used as:
readToVar var < <( : "run something here to read its output byte-for-byte" )
...or...
readToVar var filename
Testing these:
bash3-3.2$ executeToVar var printf '\n\n123\n456\n789\n\n'
bash3-3.2$ declare -p var
declare -- var="
123
456
789
"
...and...
bash4-4.3$ readToVar var2 < <(printf '\n\n123\n456\n789\n\n')
bash4-4.3$ declare -p var2
declare -- var2="
123
456
789
"
What's wrong with storing it in a file?
$ stuffToFile filename $(stuff)
where "stuffToFile" tests for a. > 1 argument, b. input on a pipe
$ ... commands ... | stuffToFile filename
and
$ stuffToFile filename < another_file
where "stoffToFile" is a function:
function stuffToFile
{
[[ -f $1 ]] || { echo $1 is not a file; return 1; }
[[ $# -lt 2 ]] && { cat - > $1; return; }
echo "$*" > $1
}
so, if "stuff" has leading and trailing blank lines, then you must:
$ stuff | stuffToFile filename

Splitting out timestamp/key/value pairs from bash

Hi, I have this file full of data; the timestamps are at the beginning of each line. I need to break the file down and print each line individually. How can I accomplish this using only bash and (if needed) standard UNIX tools (sed, awk, etc.)?
The timestamp field looks like 08:30:00:324810: (another example: 17:30:00:324810:). The number of fields following the timestamp varies; there could be 1 to x fields. So I need to find the timestamp format and then insert a line break before each one.
08:30:00:324810: usg_07Y BidYield=1.99788141 Bid=99.20312500 08:30:00:325271: usg_07Y
AskYield=1.98578274 Ask=99.28125000 08:30:00:325535: usg_10Y Ask=0.00000000 08:30:01:324881:
usg_07Y BidYield=2.02938740 AskYield=1.97127853 Bid=99.00000000 Ask=99.37500000 08:30:01:377021:
usg_05Y Bid=0.00000000 Ask=0.00000000
Thanking you in advance,
Matt
It is fairly trivial. Read the file into an array, find the timestamp, output a newline before it:
#!/bin/bash
set -f # inhibit globbing (filename expansion)
declare -i cnt=0 # simple counter
a=( $(<"$1") ) # read file into array
for i in "${a[@]}"; do # for each word in file
if [ "$cnt" -gt 0 ]; then # test counter > 0
# if last char ':', then output newline before word
[ ${i:(-1):1} = ':' ] && printf "\n%s" "${i}" || printf " %s" "$i"
else
printf "%s" "$i" # if first word, just print.
fi
((cnt++))
done
printf "\n"
Use/output:
$ bash parsedtstamp.sh filename.txt
08:30:00:324810: usg_07Y BidYield=1.99788141 Bid=99.20312500
08:30:00:325271: usg_07Y AskYield=1.98578274 Ask=99.28125000
08:30:00:325535: usg_10Y Ask=0.00000000
08:30:01:324881: usg_07Y BidYield=2.02938740 AskYield=1.97127853 Bid=99.00000000 Ask=99.37500000
08:30:01:377021: usg_05Y Bid=0.00000000 Ask=0.00000000
I added a counter var to only output the newline if not the first word.
Alternate version that avoids temporary array storage (for large files)
While there is no limit on array size in Bash, if you find yourself parsing million-line files, it is probably better to avoid storing all lines in memory. This can be accomplished by simply processing the lines as they are read from the file. It is just a way of doing the same thing without using an array for intermediate storage:
#!/bin/bash
set -f # inhibit globbing (filename expansion)
declare -i cnt=0 # simple counter
# read each line in file
while read -r line_entries || [ -n "$line_entries" ]; do
for i in $line_entries; do # for each word in line (no quotes for word splitting)
if [ "$cnt" -gt 0 ]; then # test counter > 0
# if last char ':', then output newline before word
if [ ${i:(-1):1} = ':' ]; then
printf "\n%s" "${i}"
else
printf " %s" "$i"
fi
else
printf "%s" "$i" # if first word, just print.
fi
((cnt++)) # increment counter
done
done <"$1"
printf "\n"
An awk way
awk -vORS="" '{for(i=1;i<=NF;i++)if($i~/:$/&&x++)$i="\n"$i}$NF=$NF" "
END{print "\n"}' file
Sets the output record separator to nothing.
Loops through the fields.
If a field's last char is :, it adds a newline before that field.
Adds a space after the last field so it doesn't run into the following field (since ORS is empty).
Prints a newline at the end.

Bash reading txt file and storing in array

I'm writing my first Bash script. I have some experience with C and C#, so I think the logic of the program is correct; it's just that the syntax is complicated because apparently there are many different ways to write the same thing!
Here is the script; it simply checks whether the argument (string) is contained in a certain file. If so, it stores each line of the file in an array and writes an item of the array to a file. I'm sure there must be easier ways to achieve this, but I want to get some practice with bash loops.
#!/bin/bash
NOME=$1
c=0
#IF NAME IS FOUND IN THE PHONEBOOK THEN STORE EACH LINE OF THE FILE INTO ARRAY
#ONCE THE ARRAY IS DONE GET THE INDEX OF MATCHING NAME AND RETURN ARRAY[INDEX+1]
if grep "$NOME" /root/phonebook.txt ; then
echo "CREATING ARRAY"
while read line
do
myArray[$c]=$line # store line
c=$(expr $c + 1) # increase counter by 1
done < /root/phonebook.txt
else
echo "Name not found"
fi
c=0
for i in myArray;
do
if myArray[$i]="$NOME" ; then
echo ${myArray[i+1]} >> /root/numbertocall.txt
fi
done
This code returns only the second item of myArray (myArray[2]), i.e. the second line of the file. Why?
The first part (where you build the array) looks ok, but the second part has a couple of serious errors:
for i in myArray; -- this executes the loop once, with $i set to "myArray". In this case, you want $i to iterate over the indexes of myArray, so you need to use
for i in "${!myArray[@]}"
or
for ((i=0; i<${#a[@]}; i++))
(although I generally prefer the first, since it'll work with noncontiguous and associative arrays).
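To illustrate with a deliberately sparse array (made-up values):
$ a=([0]=x [5]=y)
$ for i in "${!a[@]}"; do echo "$i=${a[$i]}"; done
0=x
5=y
The counting loop would run only two iterations here (${#a[@]} is 2) and miss index 5 entirely.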
Also, you don't need the ; unless do is on the same line (in shell, ; is mostly equivalent to a line break so having a semicolon at the end of a line is redundant).
if myArray[$i]="$NOME" ; then -- the if statement takes a command, and will therefore treat myArray[$i]="$NOME" as an assignment command, which is not at all what you wanted. In order to compare strings, you could use the test command or its synonym [
if [ "${myArray[i]}" = "$NOME" ]; then
or a bash conditional expression
if [[ "${myArray[i]}" = "$NOME" ]]; then
The two are very similar, but the conditional expression has much cleaner syntax (e.g. in a test command, > redirects output, while \> is a string comparison; in [[ ]] a plain > is a comparison).
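For example (the two strings are arbitrary):
[ "apple" \> "banana" ] || echo "apple sorts first" # escaped > is a string comparison in test
[[ "apple" > "banana" ]] || echo "apple sorts first" # plain > is a comparison inside [[ ]]
Both lines print the message, since "apple" does not sort after "banana".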
In either case, you need to use an appropriate $ expression for myArray, or it'll be interpreted as a literal. On the other hand, you don't need a $ before the i in "${myArray[i]}" because it's in a numeric expression context and therefore will be expanded automatically.
Finally, note that the spaces between elements are absolutely required -- in shell, spaces are very important delimiters, not just there for readability like they usually are in c.
1.-This is what you wrote with small adjustments
#!/bin/bash
NOME=$1
#IF NAME IS FOUND IN THE PHONE-BOOK **THEN** READ THE PHONE BOOK LINES INTO AN ARRAY VARIABLE
#ONCE THE ARRAY IS COMPLETED, GET THE INDEX OF MATCHING LINE AND RETURN ARRAY[INDEX+1]
c=0
if grep "$NOME" /root/phonebook.txt ; then
echo "CREATING ARRAY...."
while IFS= read -r line # IFS= in case you want to preserve leading and trailing spaces
do
myArray[c]=$line # put line in the array
c=$((c+1)) # increase counter by 1
done < /root/phonebook.txt
for i in "${!myArray[@]}"; do
if [[ "${myArray[i]}" = "$NOME" ]]; then
echo "${myArray[i+1]}" >> /root/numbertocall.txt
fi
done
else
echo "Name not found"
fi
2.-But you can also read the array and stop looping like this:
#!/bin/bash
NOME=$1
c=0
if grep "$NOME" /root/phonebook.txt ; then
echo "CREATING ARRAY...."
readarray -t myArray < /root/phonebook.txt
for i in "${!myArray[@]}"; do
if [[ "${myArray[i]}" = "$NOME" ]]; then
echo "${myArray[i+1]}" >> /root/numbertocall.txt
break # stop looping
fi
done
else
echo "Name not found"
fi
exit 0
3.- The following improves things. Supposing (a) $NOME matches the whole line that contains it, and (b) there is always one line after a found $NOME, this will work; if not (i.e. if $NOME can be the last line in the phone book), then you need to make small adjustments.
#!/bin/bash
PHONEBOOK="/root/phonebook.txt"
NUMBERTOCALL="/root/numbertocall.txt"
NOME="$1"
myline=""
myline=$(grep -A1 "$NOME" "$PHONEBOOK" | sed '1d')
if [ -z "$myline" ]; then
echo "Name not found :-("
else
echo -n "$NOME FOUND.... "
echo "$myline" >> "$NUMBERTOCALL"
echo " .... AND SAVED! :-)"
fi
exit 0

Script is re-reading arguments

When I supply the script with the argument: hi[123].txt it will do exactly what I want.
But if I specify the wildcard character ( hi*.txt ) it will be re-reading some files.
I was wondering how to modify this script to fix that silly problem:
#!/bin/sh
count="0"
total="0"
FILE="$1" #FILE specification is now $1 Specification..
for FILE in $@
do
#if the file is not readable then say so
if [ ! -r $FILE ];
then
echo "File: $FILE not readable"
exit 0
fi
# Start processing readable files
while read line
do
if [[ "$line" =~ ^Total ]];
then
tmp=$(echo $line | cut -d':' -f2)
total=$(expr $total + $tmp)
echo "$FILE (s) have a total of:$tmp "
count=$(expr $count + 1)
fi
done < $FILE
done
echo " Total is: $total"
echo " Number of files read is:$count"
This seems redundant:
FILE="$1" #FILE specification is now $1 Specification..
for FILE in $#
...
The initial assignment is promptly overwritten.
On the whole this seems to be a task better suited to a line processing language like awk or perl.
Consider something along the lines of this awk script:
BEGIN{
TOTAL=0;
COUNT=0;
FS=':';
}
/^Total/{
TOTAL += $2;
COUNT++;
printf("File '%s' has a total of %i",FILENAME,TOTAL);
}
END{
printf("Total is %i",TOTAL);
printf("Number of files read is%i",COUNT);
}
I don't know what is wrong with it, but one little point I noticed:
Change for FILE in $@ into for FILE in "$@", because if filenames have embedded spaces, you will then be on the safe side: it will expand into "$1" "$2" ... instead of $1 $2 ... (and note: everywhere you use $FILE, remember to quote it too).
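A quick illustration with a filename containing a space (the names are made up):
$ set -- "a b.txt" "c.txt"
$ for f in $@; do echo "[$f]"; done
[a]
[b.txt]
[c.txt]
$ for f in "$@"; do echo "[$f]"; done
[a b.txt]
[c.txt]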
As others have said, you don't need to initialize FILE before you enter the loop; it will be set to each of the expanded positional parameters by the for loop automatically.
However, I would go with an awk script like this:
awk -F: '
/^Total/ {
total += $2
# count++ not needed. see below
print FILENAME "(s) have a total of: " $2
}
END {
print "Total is: " total
print "Number of files read is: " (ARGC-1)
}' foo*.txt
Note that when a file contains multiple "^Total" lines, you would indeed say you read more files than you actually read if you rely on count to tell you the number of files read.
On error, exit with a non-zero status. Also on error, report errors to standard error, not standard output - though that may be a bit advanced for you as yet.
echo "$0: file $FILE not readable" 1>&2
The 1 is theoretically unnecessary (though I remember problems with a shell implementation on Windows if it was omitted). Echoing the script name '$0' at the start of the error message is a good idea too - it makes error tracking easier later when your script is used in other contexts.
I believe this Perl one-liner does the job you are after.
perl -na -F: -e '$sum += $F[1] if m/^Total:/; END { print $sum; }' "$@"
I understand that you are learning shell programming, but one of the important things with shell programming is knowing which programs to use.
How about this solution:
for FILE in `/bin/ls $@`
do
. . .
This will effectively eliminate duplicates because /bin/ls hi1.txt hi1.txt hi1.txt should only show hi1.txt once.
Though I'm not sure why it's re-reading files. The wildcard expansion should only include each file once. Do you have some files matched by hi*.txt that are links to files matched by hi[123].txt?
