Directing awk output to variable - linux

New guy here with a problem that will hopefully have an easy solution, but I just can't seem to manage.
So, I have a large list of files that I need to process with the same command-line program, and I'm trying to write a small shell script to automate this. I wrote something that reads the input file names from a text file and repeats the command for each of them. So far so good.

My problem is with naming the output. Each file is named in the general format "lane_number_bla_bla_bla", and the files are processed in pairs: a "lane_1_bla_bla_bla_001" and a "lane_1_bla_bla_bla_002" need to be combined into a single output file. For this, I'm trying to use awk to read the sample number from the .txt list of input files and splice it into the output file name. Here's the code I came up with (note that the echo statement before the command is there just for testing and is removed when it comes time to run the actual program; also, this is not the actual command, which is rather more complicated, but the principle still applies):
echo "Which input1 should I use?"
read text
input1=$text
echo "Which input2 should I use?"
read text
input2=$text
echo "How many lines?"
read text
n=$text
for i in $(seq 1 $n)
do
awkinput1=$(awk NR==$i $input1)
awkinput2=$(awk NR==$i $input2)
num=$(awk 'NR==$i{print $2 }' FS="_" $input1)
lane=$(awk 'NR==$i{print $1 }' FS="_" $input1)
echo "command $awkinput1.in > $awkinput1.out && command $awkinput2.in > $awkinput2.out && command cat $awkinput1.out $awkinput2.in > $num-$lane-CAT.out &"
if (( $i % 10 == 0 )); then wait; fi # Limit to 10 concurrent subshells.
done
When I run this, both $awkinput fields get replaced properly in the command line with the appropriate filename, but the $num and $lane fields print nothing.
So, what am I doing wrong? I'm sure it's pretty simple, but I tried quite a lot of different ways to format the relevant awk command, and nothing seems to work. I'm doing this on a remote linux server using SSH protocol, if it makes a difference.
Thanks a lot!

The shell does not expand $i inside single quotes ('), so the quoted string must be terminated before $i.
FS must be set before the lines are parsed, for example in a BEGIN block.
The following code will work:
num=$(awk 'BEGIN{FS="_"} NR=='$i'{print $2 }' $input1)
lane=$(awk 'BEGIN{FS="_"} NR=='$i'{print $1 }' $input1)
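A cleaner way to pass a shell variable into an awk program, avoiding the quote splicing above entirely, is awk's -v option; a minimal equivalent sketch:
num=$(awk -v i="$i" 'BEGIN{FS="_"} NR==i{print $2}' "$input1")
lane=$(awk -v i="$i" 'BEGIN{FS="_"} NR==i{print $1}' "$input1")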
The code below will be more efficient, since it reads each input file only once instead of rescanning it for every line:
while read in1 ; do
read in2 <&3
num=$(awk 'BEGIN{FS="_"} {print $2 }' <<<"$in1")
lane=$(awk 'BEGIN{FS="_"} {print $1 }' <<<"$in1")
...
done <$input1 3<$input2
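If spawning awk twice per line is still too slow, the same fields can be extracted in pure bash; a minimal sketch under the same assumptions (underscore-separated names, lane in field 1, sample number in field 2):
while read in1 ; do
read in2 <&3
IFS=_ read -r lane num _ <<<"$in1"  # field 1 -> lane, field 2 -> num, rest discarded
...
done <$input1 3<$input2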

Related

Separating lines of a huge file into two files depending on the date

I'm gathering tons of data in a stream on an Ubuntu machine. The data is stored in daily packages (each day_file contains somewhere between 1 and 5 GB). I'm not an experienced linux/bash/awk user, but the data looks something like this (all lines start with a date):
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
Now to the problem: the stream is cut around midnight local time (for a few reasons it can't be cut at exactly 00:00:00 GMT). This means that rows from two dates are stored in the same file, and I want to separate them into the correct date files. I wrote the following script to separate the rows. It works, but it takes several hours to run, and I think there must be a faster way of doing this?
#!/bin/bash
dateDiff (){
    line_str="$1"
    dte1="2020-09-01"
    dte2=${line_str:0:10}
    if [[ "$dte1" == "$dte2" ]]; then
        echo $line_str >> correct_date.txt;
    else
        echo $line_str >> wrong_date.txt;
    fi
}
IFS=$'\n'
for line in $(cat massive_file.txt)
do
    dateDiff "$line"
done
unset IFS
Using this awk script I'm able to process a 10 GB file in approximately 1 minute on my machine.
awk '{ if ($0 ~ /^2020-08-31/) { print $0 > "correct.txt" } else { print $0 > "wrong.txt" } }' input_file_name.txt
Each line is checked against a regular expression containing your date; the whole line is then printed to a file based on the regexp match.
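The same logic can be written a bit more compactly, since awk's pattern-action pairs already provide the if/else; a minimal equivalent sketch:
awk '/^2020-08-31/ { print > "correct.txt"; next } { print > "wrong.txt" }' input_file_name.txt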
Using awk with T as your field separator, the first field, $1, will be the date. Then you can output each record to a file named for the date.
$ cat file
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
2020-09-01T00:00:00Z !In a unreadable way
$ awk -FT '{ print > ($1 ".txt") }' file
$ ls 20*.txt
2020-08-31.txt 2020-09-01.txt
$ cat 2020-09-01.txt
2020-09-01T00:00:00Z !In a unreadable way
$ cat 2020-08-31.txt
2020-08-31T23:59:59Z !RANDOM numbers and letters
2020-08-31T23:59:59Z $Enconding the data
Some notes:
Using a bash loop to read logs would be very slow.
Using awk, sed, grep or similar is very good, but you will still have to read and write whole files line by line, and this has a performance ceiling.
For your specific case, you only need to identify the split points, of which there can be 3, not just 2 (the previous, current and next day's logs can co-exist in one file), with something like grep -nm1 "^$day", and then split the log file with a combination of head and tail, as sketched below. Then append or prepend the pieces to the existing files. This would be a very fast solution, because you would write the files in bulk, not line by line.
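A minimal sketch of that idea (assuming the file holds exactly two days and $day is the later one; grep -nm1 reports the line number of the first $day row and stops):
split=$(grep -nm1 "^$day" day_file | cut -d: -f1)  # first line of the later day
head -n "$((split - 1))" day_file >> previous_day.log
tail -n "+$split" day_file >> "${day}.log"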
Here is a simple solution with grep. You only need to test the first 10 characters of each log line, and for this job grep is faster than awk.
Assuming that you store logs in a destination directory, every incoming file should pass through something like this script. The order of processing is important: you have to follow the date order of the files (note that I append to existing files). This is just a demo solution for guidance.
#!/bin/bash
[[ -f "$1" ]] && f="$1" || { echo "Nothing to do"; exit 1; }
dest_dir=archive/
suffix="_file.log"
curr=${f:0:10}
prev=$( date -d "$curr -1 day" "+%Y-%m-%d" )
next=$( date -d "$curr +1 day" "+%Y-%m-%d" )
for d in $prev $curr $next; do
grep "^$d" "$f" >> "${dest_dir}${d}${suffix}"
done
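A hypothetical invocation, for clarity (the script name and file names are placeholders; the script assumes each incoming file name starts with its YYYY-MM-DD date, since curr is taken from the first 10 characters):
# process incoming day files in date order
for f in 2020-08-31_incoming.log 2020-09-01_incoming.log; do
    ./archive.sh "$f"
done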

Linux Scripting with Spaces in Filenames

I am currently working with vendor-provided software that sends attachment files to another script, which text-extracts from the listed file. The process fails when we receive files from an outside source whose names contain spaces, because the vendor-supplied software does not surround the filename with quotes: when the text-extraction script is run, it receives a filename that splits apart on the spaces and causes an error in the extractor script. The vendor-provided software is not editable by us.
This whole process is designed to be an automated transfer, so having this wrench that could be randomly thrown into the gears is an issue.
What we're trying to do, is handle the spaced name in our text extractor script, since that is the piece we have some control over. After a quick Google, it seems like changing the IFS value for the script would be the quick solution, but unfortunately, that script would take effect after the extensions have already mutilated the incoming data.
The script I'm using takes in a -e value, a -i value, and a -o value. These values are sent from the vendor supplied script, which I have no editing control over.
#!/bin/bash
usage() { echo "Usage: $0 -i input -o output -e encoding" 1>&2; exit 1; }
while getopts ":o:i:e:" o; do
    case "${o}" in
        i)
            inputfile=${OPTARG}
            ;;
        o)
            outputfile=${OPTARG}
            ;;
        e)
            encoding=${OPTARG}
            ;;
        *)
            usage
            ;;
    esac
done
shift $((OPTIND-1))
...
...
<Uses the inputfile, outputfile, and encoding variables>
I admit there may be pieces of this I don't fully understand, and it could be a simple fix, but my end goal is to extract the -o, -i, and -e values so that each contains its full value, regardless of the spaces within it. I can handle quoting in the script once I can extract the filename value.
The script fragment that you have posted does not have any issues with spaces in the arguments.
The following, for example, does not need quoting (since it's an assignment):
inputfile=${OPTARG}
All other uses of $inputfile in the script should be double quoted.
What matters is how this script is called.
This would fail and would assign only hello to the variable inputfile:
$ ./script.sh -i hello world.txt
The string world.txt would prompt the getopts function to stop processing the command line and the script would continue with the shift (world.txt would be left in $1 afterwards).
The following would correctly assign the string hello world.txt to inputfile:
$ ./script.sh -i "hello world.txt"
as would
$ ./script.sh -i hello\ world.txt
The following script uses awk to split the arguments while keeping the spaces in the file names. The arguments can be in any order. It does not handle multiple consecutive spaces in an argument; it collapses them to one.
#!/bin/bash
IFS=' '
str=$(printf "%s" "$*")
istr=$(echo "${str}" | awk 'BEGIN {FS="-i"} {print $2}' | awk 'BEGIN {FS="-o"} {print $1}' | awk 'BEGIN {FS="-e"} {print $1}')
estr=$(echo "${str}" | awk 'BEGIN {FS="-e"} {print $2}' | awk 'BEGIN {FS="-o"} {print $1}' | awk 'BEGIN {FS="-i"} {print $1}')
ostr=$(echo "${str}" | awk 'BEGIN {FS="-o"} {print $2}' | awk 'BEGIN {FS="-e"} {print $1}' | awk 'BEGIN {FS="-i"} {print $1}')
inputfile=""${istr}""
outputfile=""${ostr}""
encoding=""${estr}""
# call the jar
There was an issue when calling the jar where Java threw a MalformedURLException on a filename with a space.
So after reading through the commentary, we decided that although it may not be the right answer for every scenario, the right answer for this specific scenario was to extract the pieces manually.
Because we are building this for a pre-built script that passes to it, and we aren't updating that script any time soon, we can accept with certainty that this script will always receive -i, -o, and -e flags with spaces between them, which causes all the pieces passed in to be stored as separate words in $*.
And we can assume that the text after a flag is the response to the flag, until another flag is referenced. This leaves us 3 scenarios:
The variable contains one of the flags
The variable contains the first piece of a parameter immediately after the flag
The variable contains part 2+ of a parameter, and the space in the name was interpreted as a split, and needs to be reinserted.
One of the other issues I kept running into was getting string literals to compare against variables in my if statements. To resolve that, I pre-stored all relevant data in array variables, so I could test $variable == $otherVariable.
Although I don't expect it to change, we also handled what to do if the three flags appear in a different order than we anticipate (our assumption is that they are listed as i,o,e, but we can't see exactly what is passed). The parameters are dumped into an array in the order they were read in, and a parallel array tracks whether the items in slots 0,1,2 relate to i,o,e.
The final result still has one flaw: if there is more than one consecutive space in the filename, the whitespace is trimmed before processing, and I can only account for one space. But seeing as we processed over 4000 files before encountering one with a space, I find it unlikely, given the naming conventions, that we would encounter something with more than one space.
At that point, we would have to be stepping in for a rare intervention anyways.
Final code change is as follows:
#!/bin/bash
IFS='|'
position=-1
ioeArray=("" "" "")
previous=""
flagArr=("-i" "-o" "-e" " ")
ioePattern=(0 1 2)
#echo "for loop:"
for i in $*; do
    #printf "%s\n" "$i"
    if [ "$i" == "${flagArr[0]}" ] || [ "$i" == "${flagArr[1]}" ] || [ "$i" == "${flagArr[2]}" ]; then
        ((position += 1));
        previous=$i;
        case "$i" in
            "${flagArr[0]}")
                ioePattern[$position]=0
                ;;
            "${flagArr[1]}")
                ioePattern[$position]=1
                ;;
            "${flagArr[2]}")
                ioePattern[$position]=2
                ;;
        esac
        continue;
    fi
    if [[ $previous == "-"* ]]; then
        ioeArray[$position]=${ioeArray[$position]}$i;
    else
        ioeArray[$position]=${ioeArray[$position]}" "$i;
    fi
    previous=$i;
done
echo "extracting (${ioeArray[${ioePattern[0]}]}) to (${ioeArray[${ioePattern[1]}]}) with (${ioeArray[${ioePattern[2]}]}) encoding."
inputfile="${ioeArray[${ioePattern[0]}]}";
outputfile="${ioeArray[${ioePattern[1]}]}";
encoding="${ioeArray[${ioePattern[2]}]}";

Writing variables to file with bash

I'm trying to configure a file with a bash script, but the variables in the bash script are not written to the file the way they are written in the script.
Ex:
#!/bin/bash
printf "%s" "template("$DATE\t$HOST\t$PRIORITY\t$MSG\n")" >> /file.txt
exit 0
This writes template('tttn') to the file instead of template("$DATE\t$HOST\t$PRIORITY\t$MSG\n.
How do I write the script so that the result in the configured file is template("$DATE\t$HOST\t$PRIORITY\t$MSG\n?
Is it possible to write a variable to the file exactly as it looks in the script?
Enclose the strings you want to write within single quotes to avoid variable replacement.
> FOO=bar
> echo "$FOO"
bar
> echo '$FOO'
$FOO
>
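Applied to the question's line, a minimal sketch (the single quotes keep $, \t and \n literal, and the %s\n format supplies the real newline):
printf '%s\n' 'template("$DATE\t$HOST\t$PRIORITY\t$MSG\n")' >> /file.txt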
Using printf in any shell script is uncommon, just use echo with the -e option.
It allows you to use ANSI C metacharacters, like \t or \n. The \n at the end however isn't necessary, as echo will add one itself.
echo -e "template(${DATE}\t${HOST}\t${PRIORITY}\t${MSG})" >> file.txt
The problem with what you've written is that ANSI C metacharacters like \t can only be used in the first parameter to printf.
So it would have to be something like:
printf 'template(%s\t%s\t%s\t%s)\n' ${DATE} ${HOST} ${PRIORITY} ${MSG} >> file.txt
But I hope we both agree, that this is very hard on the eyes.
There are several escaping issues, and the power of printf has not been used; try:
printf 'template(%s\t%s\t%s\t%s)\n' "${DATE}" "${HOST}" "${PRIORITY}" "${MSG}" >> file.txt
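A quick demonstration with hypothetical sample values, showing that the quoting keeps values containing spaces intact:
$ DATE=2020-09-01 HOST=web01 PRIORITY=info MSG='disk full'
$ printf 'template(%s\t%s\t%s\t%s)\n' "${DATE}" "${HOST}" "${PRIORITY}" "${MSG}"
template(2020-09-01	web01	info	disk full)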
Reasons for this separate answer:
The accepted answer does not fit the title of the question.
The post with the right answer contains wrong claims about echo vs printf as of this post, and is not robust against whitespace in the values.
The edit queue is full at the moment.

formatting issue in printf script

I have a file stvv.txt containing some names.
For example, stvv.txt is as follows:
hello
world
I want to generate another file by using these names and adding some extra text to them. I have written a script as follows:
for i in `cat stvv.txt`;
do printf 'if(!strcmp("$i",optarg))' > my_file;
done
output
if(!strcmp("$i",optarg))
desired output
if(!strcmp("hello",optarg))
if(!strcmp("world",optarg))
how can I get the correct result?
This is a working solution.
1. Everything inside single quotes is treated as a literal string. 2. When using printf, do not surround the variable with quotes (in this example).
The code below should fix it:
for i in `cat stvv.txt`; do
    printf 'if(!strcmp('$i',optarg))\n' >> my_file;
done
Basically, break the printf statement into three parts:
1: the string 'if(!strcmp('
2: $i (no quotes)
3: the string ',optarg))\n'
hope that helps!
To insert a string into a printf format, use %s in the format string:
$ for line in $(cat stvv.txt); do printf 'if(!strcmp("%s",optarg))\n' "$line"; done
if(!strcmp("hello",optarg))
if(!strcmp("world",optarg))
The code $(cat stvv.txt) will perform word splitting and pathname expansion on the contents of stvv.txt. You probably don't want that. It is generally safer to use a while read ... done <stvv.txt loop such as this one:
$ while read -r line; do printf 'if(!strcmp("%s",optarg))\n' "$line"; done <stvv.txt
if(!strcmp("hello",optarg))
if(!strcmp("world",optarg))
Aside on cat
If you are using bash, then $(cat stvv.txt) could be replaced with the more efficient $(<stvv.txt). This question, however, is tagged shell not bash. The cat form is POSIX and therefore portable to all POSIX shells while the bash form is not.

Looping through lines in a file in bash, without using stdin

I am foxed by the following situation.
I have a file list.txt that I want to run through line by line, in a loop, in bash. A typical line in list.txt has spaces in it. The problem is that the loop contains a "read" command. I want to write this loop in bash rather than something like perl. I can't do it :-(
Here's how I would usually write a loop to read from a file line by line:
while read p; do
echo $p
echo "Hit enter for the next one."
read x
done < list.txt
This doesn't work though, because of course "read x" will be reading from list.txt rather than the keyboard.
And this doesn't work either:
for i in `cat list.txt`; do
echo $i
echo "Hit enter for the next one."
read x
done
because the lines in list.txt have spaces in.
I have two proposed solutions, both of which stink:
1) I could edit list.txt, and globally replace all spaces with "THERE_SHOULD_BE_A_SPACE_HERE" . I could then use something like sed, within my loop, to replace THERE_SHOULD_BE_A_SPACE_HERE with a space and I'd be all set. I don't like this for the stupid reason that it will fail if any of the lines in list.txt contain the phrase THERE_SHOULD_BE_A_SPACE_HERE (so malicious users can mess me up).
2) I could use the while loop with stdin and then in each loop I could actually launch e.g. a new terminal, which would be unaffected by the goings-on involving stdin in the original shell. I tried this and I did get it to work, but it was ugly: I want to wrap all this up in a shell script and I don't want that shell script to be randomly opening new windows. What would be nice, and what might somehow be the answer to this question, would be if I could figure out how to somehow invoke a new shell in the command and feed commands to it without feeding stdin to it, but I can't get it to work. For example this doesn't work and I don't really know why:
while read p; do
bash -c "echo $p; echo ""Press enter for the next one.""; read x;";
done < list.txt
This attempt seems to fail because "read x", despite being in a different shell somehow, is still seemingly reading from list.txt. But I feel like I might be close with this one -- who knows.
Help!
You must open the file on a different file descriptor:
while read p <&3; do
echo "$p"
echo 'Hit enter for the next one'
read x
done 3< list.txt
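An equivalent sketch using bash's read -u instead of the <&3 redirection (same idea: the list arrives on fd 3, leaving fd 0 free for the keyboard):
while read -u 3 p; do
    echo "$p"
    echo 'Hit enter for the next one'
    read x
done 3< list.txt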
I would probably count the lines in the file and iterate over them using e.g. sed. It is also possible to read from stdin indefinitely by changing the while condition to while true; and to stop reading with Ctrl+C.
line=0 lines=$(sed -n '$=' in.file)
while [ $line -lt $lines ]
do
    let line++
    sed -n "${line}p" in.file
    echo "Hit enter for the next one (${line} of ${lines})."
    read -s x
done
AWK is also a great tool for this. A simple way to iterate through the input would be:
awk '{ print $0; printf "%s", "Hit enter for the next"; getline < "-" }' file
As an alternative, you can read from stderr, which by default is connected to the tty as well. The following also includes a test for that assumption:
(
    tty -s <&2 || exit 1
    while read -r line; do
        echo "$line"
        echo 'Hit enter'
        read x <&2
    done < file
)
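The same trick can target the controlling terminal explicitly via /dev/tty, which works regardless of where stderr points; a minimal sketch (an addition of mine, not taken from the answers above):
while read -r p; do
    echo "$p"
    echo 'Hit enter for the next one'
    read x </dev/tty
done < list.txt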
