I'm having a very odd problem with a bash script, written on an Ubuntu 17.04 machine.
I have a txt file that contains information about people in this fashion:
number name surname city state
With this information I have to create an organization system that works by state. For example, with a list like this
123 alan smith new_york NEW_YORK
123 bob smith buffalo NEW_YORK
123 charles smith los_angeles CALIFORNIA
123 dean smith mobile ALABAMA
the outcome at the end of the computation should be three new files named NEW_YORK, CALIFORNIA and ALABAMA that contain people who live there.
The script takes the list of names as a parameter. Inside a for loop I implemented an if statement (whose condition is a test of existence for the file, just in case more than one person lives in a certain state) that, oddly, doesn't operate unless I press Enter while the program is running. The outcome is right, I get the files with the right people in them, but it baffles me that I have to press Enter to make the code work; it doesn't make sense to me.
Here's my code:
#!/bin/bash
clear
#finding how many file lines and adding 1 to use the value as a counter later
fileLines=`wc -l addresses | cut -f1 --delimiter=" "`
(( fileLines = fileLines+1 ))
for (( i=1; i<$fileLines; i++ ))
do
    #if the file named as the last column already exists do not create new one
    test -e `head -n$i | tail -n1 | cut -f5 --delimiter=" "`
    if [ $? = 0 ]
    then
        head -n$i $1 | tail -n1 >> `head -n$i $1 | tail -n1 | cut -f5 --delimiter=" "`
    else
        head -n$i $1 | tail -n1 > `head -n$i $1 | tail -n1 | cut -f5 --delimiter=" "`
    fi
done
echo "cancel created files? y/n"
read key
if [ $key = y ]
then
    rm `ls | grep [A-Z]$`
    echo "done"
    read
else
    echo "done"
    read
fi
clear
What am I doing wrong here? And moreover, why doesn't it tell me that something's wrong (clearly there is)?
The immediate problem (pointed out by @that other guy) is that in the line:
test -e `head -n$i | tail -n1 | cut -f5 --delimiter=" "`
The head command isn't given a filename to read from, so it's reading from stdin (i.e. you). But I'd change the whole script drastically, because you're doing it in a very inefficient way. If you have, say, a 1000-line file, you run head to read the first line (3 times, actually), then the first two lines (three times), then the first three... by the time you're done, head has read the first line of the file 3000 times, and then tail has discarded it 2997 of those times. You only really needed to read it once.
When iterating through a file like this, you're much better off just reading the file line-by-line, with something like this:
while IFS= read -r line; do
# process $line here
done <"$1"
But in this case, there's an even better tool. awk is really good at processing files like this, and it can handle the task really simply:
awk '{ if($5!="") { print $0 >>$5 }}' "$1"
(Note: I also put in the if to make sure there is a fifth field/ignore blank lines. Without that check it would've just been awk '{ print $0 >>$5 }' "$1").
Also, the command:
rm `ls | grep [A-Z]$`
...is a really weird and fragile way to do this. Parsing the output of ls is generally a bad idea, and again there's a much simpler way to do it:
rm *[A-Z]
Finally, I recommend running your script through shellcheck.net, since it'll point out some other problems (e.g. unquoted variable references).
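For example, it will flag the unquoted $i and $1 in the head calls; quoting them keeps arguments with spaces from being split:
head -n"$i" "$1" | tail -n1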
STEP1
Like I said in the title,
I would like to save the output of
tail -f example | grep "DESIRED"
to a different file.
I have tried
tail -f example | grep "DESIRED" | tee -a different
tail -f example | grep "DESIRED" >> different
but none of them work.
I have searched similar questions and read several experts suggesting buffering options,
but I cannot use them...
Is there any other way I can do it?
STEP2
Once the above is done, I would like to make "different" (the filename from above) time-varying. I want to change its name every 30 minutes.
For example:
20221203133000
20221203140000
20221203143000
...
I have tried
tail -f example | grep "DESIRED" | tee -a $(date +%Y%m%d%H)$([ $(date +%M) -lt 30 ] && echo 00 || echo 30)00
The problem is that, since I have not even solved the first step, I could not test the second step. But I think this command will only create one file, named after the time I run the command... Could I please get some advice?
The code below should do what you want.
Some explanation: since you want bash to execute some "code" (in your case, dumping to a different file name), you need two things running in parallel: the tail + grep, and the code that decides where to dump.
To connect the two processes I use a named fifo (created with mkfifo): whatever tail + grep writes into it (using > tmp_fifo) is read back in the while loop (using < tmp_fifo). Once inside the while loop, you are free to output to whatever file name you want.
Note: without --line-buffered (as in your question) grep will still work, it will just wait until it has more data (probably around 8 kB) before dumping it to the file. So if "example" does not generate a lot of data, nothing will be written until the buffer fills.
rm -f tmp_fifo
mkfifo tmp_fifo
# writer: tail + grep push matching lines into the fifo
(tail -f input | grep --line-buffered TEXT_TO_CHECK > tmp_fifo &)
# reader: decide, line by line, which file to dump to
while read -r LINE; do
    CURRENT_NAME=$(date +%Y%m%d%H)
    # or any other code that determines to what file to dump ...
    echo "$LINE" >> "${CURRENT_NAME}"
done < tmp_fifo
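If you also want the 30-minute file names from STEP2, the bucket computation from your own attempt can simply move inside the loop; a minimal sketch (same date arithmetic as your command, otherwise untested):
while read -r LINE; do
    # round the current minute down to 00 or 30, e.g. 20221203133000
    if [ "$(date +%M)" -lt 30 ]; then
        CURRENT_NAME="$(date +%Y%m%d%H)0000"
    else
        CURRENT_NAME="$(date +%Y%m%d%H)3000"
    fi
    echo "$LINE" >> "$CURRENT_NAME"
done < tmp_fifo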
I am trying to count the number of line matches in a very LARGE file and store them in variables using only the BASH shell commands.
Currently, I am scanning the very large file twice, using a separate grep statement each time, like so:
$ cat test.txt
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example' | wc --lines ; ) ; ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ; ## second run
and I end up with this:
$ echo $FIRST
3
$ echo $SECOND
2
Ultimately, I want to scan the large file only once. And I have never used Awk and would rather not use it!
The | tee option is new to me. It seems that passing the results into two separate grep statements might mean we only have to scan the large file once.
Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.
I have tried multiple ways using something like these below:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST=$( grep 'first example' | wc --lines ;);) \
>(SECOND=$(grep 'second example' | wc --lines ;);) \
>/dev/null ;
and using read:
FIRST=''; SECOND='';
cat test.txt \
|tee >(grep 'first example' | wc --lines | (read FIRST); ); \
>(grep 'second example' | wc --lines | (read SECOND); ); \
> /dev/null ;
cat test.txt \
| tee <( read FIRST < <(grep 'first example' | wc --lines )) \
<( read SECOND < <(grep 'sedond example' | wc --lines )) \
> /dev/null ;
and with curly brackets:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST={$( grep 'first example' | wc --lines ;)} ) \
>(SECOND={$(grep 'second example' | wc --lines ;)} ) \
>/dev/null ;
but none of these allow me to save the line count into variables FIRST and SECOND.
Is this even possible to do?
tee isn't saving any work. Each grep is still going to do a full scan of the file. Either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact tee actually just adds a fourth program that loops over the whole file.
The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value, it's just the value is almost immediately lost. Why? Because the variable is in a subshell, not the parent shell.
Every command in a | pipeline executes in a different process and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.
As a rule of thumb, you can write variable=$(foo | bar | baz) where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz where it's on the inside. It won't work and you'll be sad.
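You can see the problem in a tiny experiment; read does run, but its variable vanishes along with the subshell:
unset var
printf 'hello\n' | read var   # read runs in a subshell on the right side of the pipe
echo "var=[$var]"             # prints var=[] in the parent shell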
But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.
Two greps
Getting rid of cat yields:
first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)
This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.
One grep
You could improve this by using a single grep call that searches for both strings. It could work if you don't actually need the individual counts but just want the total:
total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)
Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:
matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)
Pure bash
You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.
First, let's start with a while read loop to process the file line by line:
while IFS= read -r line; do
...
done < test.txt
You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:
first=0
second=0
while IFS= read -r line; do
[[ $line == *'first example'* ]] && ((++first))
[[ $line == *'second example'* ]] && ((++second))
done < test.txt
echo "$first" ## should display 3
echo "$second" ## should display 2
Another language
If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.
If you're going to be doing things like this, I'd really recommend getting familiar with awk; it's not scary, and IMO it's much easier to do complex things like this with it vs. the weird pipefitting you're looking at. Here's a simple awk program that'll count occurrences of both patterns at once:
awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt
Explanation: /first example/ {first++} means for each line that matches the regex pattern "first example", increment the first variable. /second example/ {second++} does the same for the second pattern. Then END {print first, second} means at the end, it should print the two variables. Simple.
But there is one tricky thing: splitting the two numbers it prints into two different variables. You could do this with read:
bothcounts=$(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
read first second <<<"$bothcounts"
(Note: I recommend using lower- or mixed-case variable names, to avoid conflicts with the many all-caps names that have special functions.)
Another option is to skip the bothcounts variable by using process substitution to feed the output from awk directly into read:
read first second < <(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
">" is about redirect to file/device, not to the next command in pipe. So tee will just allow you to redirect pipe to multiple files, not to multiple commands.
So just try this:
FIRST=$(grep 'first example' test.txt| wc --lines)
SECOND=$(grep 'second example' test.txt| wc --lines)
It's possible to get the matches and count them in a single pass, then get the count of each from the result.
matches="$(grep -e 'first example' -e 'second example' --only-matching test.txt | sort | uniq -c | tr -s ' ')"
FIRST=$(grep -e 'first example' <<<"$matches" | cut -d ' ' -f 2)
echo $FIRST
Result:
3
Using awk is the best option I think.
So I'm currently trying to grep a single result from a random file in a specific directory. The grepping works just fine and the expected output file is populated as expected, but for some reason, even after the output file has been filled, the process won't stop. This is the grep command where the program seems to be getting stuck.
searchFILE(){
case $2 in
pref)
echo "Populating output file: $3-$1.data.out"
dataOutputFile="$3-$1.data.out"
zgrep -a "\"someParameter\"\:\"$1\"" /folder/anotherFolder/filetemplate.log.* | zgrep -a "\"parameter2\"\:\"$3\"" | head -1 > $dataOutputFile
;;
*)
echo "Unrecognized command"
;;
esac
echo "Query finished"
}
What is currently happening is that the output file is being populated as expected with the head pipe, but for some reason I'm not getting the "Query finished" message, and the process seems not to stop at all.
grep does not know that head -n1 is no longer reading from the pipe until it attempts to write to the pipe, which it will only do if another match is found. There is no direct communication between the processes. It will eventually stop, but only once all the data is read, a second match is found and write fails with EPIPE, or some other error occurs.
You can watch this happen in a simple pipeline like this:
cat /dev/urandom | grep -ao "12[0-9]" | head -n1
With a sufficiently rare pattern, you will observe a delay between output and exit.
One solution is to change your stop condition. Instead of waiting for SIGPIPE as your pipeline does, wait for grep to match once using the -m1 option:
cat /dev/urandom | grep -ao -m1 "12[0-9]"
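Applied to the function in the question, that would mean dropping | head -1 and putting -m1 on the last stage; roughly like this (an untested sketch of the same idea; the second stage can be a plain grep, since the first zgrep already decompressed the data):
zgrep -a "\"someParameter\"\:\"$1\"" /folder/anotherFolder/filetemplate.log.* | grep -a -m1 "\"parameter2\"\:\"$3\"" > "$dataOutputFile"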
I saw better performance results with the zcat myZippedFile | grep whatever paradigm...
The first difference you need to try is piping through | head -z --lines=1
The reason is null-terminated lines instead of newlines (just in case).
My example script below worked (drop the case statement to make it simpler). If I hold onto $1 $2 inside functions, things go wrong. I use parameterised $names and only use $1 $2 $# once, because it also goes wrong for me if I don't, and in any case you can then shift over $# and catch the arguments. The $# in the script itself is not the same as inside bash functions.
grep searching for 2 or more parameters in any order means using grep twice; in your case zgrep | grep. The second grep is a normal grep! You only need the first grep to be zgrep to do the unzip. Your question is simpler if you drop the case statement, as bash case scares people off: bash was always an ugly lady that works well for short scripts.
zgrep searches text or compressed text, but Linux-style and Windows-style newlines are not the same, so use dos2unix to convert files so that newlines work. I use a compressed file simply because it is strange and rare to see zgrep, so it is demonstrated in a shell script with a compressed file! It works for me. I changed a few things, like >> and "sort -u", but you can obviously change them back.
#!/usr/bin/env bash
# Search for egA AND egB using option go
# COMMAND LINE: ./zgrp egA go egB
A="$1"
cOPT="$2" # expecting case go
B="$3"
LOG="./filetemplate.log" # use parameters for long names.
# Generate some data with gzip and delete the temporary file.
echo "\"pramA\":\"$A\" \"pramB\":\"$B\"" >> $B$A.tmp
rm -f ${LOG}.A; tar czf ${LOG}.A $B$A.tmp
rm -f $B$A.tmp
# Use parameterised $names, not $1 etc., because you may want to shift etc.
searchFILE()
{
outFile="$B-$A.data.out"
case $cOPT in
go) # This is zgrep | grep NOT zgrep | zgrep
zgrep -a "\"pramA\":\"$A\"" ${LOG}.* | grep -a "\"pramB\":\"$B\"" | head -z --lines=1 >> $outFile
sort -u $outFile > ${outFile}.sorted # sort unique on your output.
;;
*) echo -e "ERROR second argument must be go.\n Usage: ./zgrp egA go egB"
exit 9
;;
esac
echo -e "\n ============ Done: $0 $# Fin. ============="
}
searchFILE "$#"
cat ${outFile}.sorted
I'm writing a Bash script where I need to look through the output of a command and do certain actions based on that output. For clarity, this command will output a few million lines of text and it may take roughly an hour or so to do so.
Currently, I'm executing the command and piping it into a while loop that reads a line at a time and then looks for certain criteria. If the criterion is met, I update a .dat file and reprint the screen. Below is a snippet of the script.
eval "$command"| while read line ; do
if grep -Fq "Specific :: Criterion"; then
#pull the sixth word from the line which will have the data I need
temp=$(echo "$line" | awk '{ printf $6 }')
#sanity check the data
echo "\$line = $line"
echo "\$temp = $temp"
#then push $temp through a case statement that does what I need it to do.
fi
done
So here's the problem: the sanity check on the data is showing weird results. It is printing lines that don't contain the grep criteria.
To make sure that my grep statement is working properly, I grep the log file that contains a record of the text that is output by the command and it outputs only the lines that contain the specified criteria.
I'm still fairly new to Bash so I'm not sure what's going on. Could it be that the command is force feeding the while loop a new $line before it can process the $line that met the grep criteria?
Any ideas would be much appreciated!
How does grep know what line looks like? In your loop, grep is given no input of its own, so it reads from the same stdin as read and swallows lines from the pipe, which is why the printed $line and the matched text get out of sync. Feed it the line explicitly:
if ( printf '%s\n' "$line" | grep -Fq "Specific :: Criterion"); then
But I can't help feeling like you are overcomplicating this a lot.
function process() {
echo "I can do anything I want"
echo " per element $1"
echo " that I want here"
}
export -f process
$command | grep -F "Specific :: Criterion" | awk '{print $6}' | xargs -I % -n 1 bash -c "process %";
Run the command, filter only the matching lines, and pull the sixth element. Then if you need to run arbitrary code on it, send it to a function (exported so it is visible in subprocesses) via xargs.
What are you applying the grep on?
Modify
if grep -Fq "Specific :: Criterion"; then
as below
if ( echo "$line" | grep -Fq "Specific :: Criterion" ); then
I'm encountering a very strange thing while writing a shell script.
Initially I have the following script.
OUTPUT_FOLDER=/home/user/output
HOST_FILE=/home/user/host.txt
LOG_DIR=/home/user/log
SLEEPTIME=60
mkdir -p $OUTPUT_FOLDER
while true
do
rm -f $OUTPUT_FOLDER/*
DATE=`date +"%Y%m%d"`
DATETIME=`date +"%Y%m%d_%H%M%S"`
pssh -h $HOST_FILE -o $OUTPUT_FOLDER
while read host
do
numline=0
while read line
do
if [[ $numline == 0 ]]
then
FIELD1=`echo $line | cut -d':' -f2 | cut =d' ' -f2`
fi
((numline+=1))
done < ${OUTPUT_FOLDER}/${host}
echo "$DATETIME,$FIELD1" >> ${LOG_DIR}/${DATE}.log
done < $HOST_FILE
sleep $SLEEPTIME
done
When I run this script, every 60 seconds I see the $DATETIME,$FIELD1 values in my log file.
What is very strange is that every 30 seconds or so, after the first minute has passed, I see 20160920_120232,, meaning there is output where there shouldn't be, and I guess that because I deleted the contents of my output folder, FIELD1 is empty.
What is even stranger is that while trying to debug this, I added a few more echo statements printing to my log file, and then deleted those lines. However, they continued to be printed every 30 seconds or so after the first minute had passed.
What is stranger still is that I then commented out everything inside the while true block, that is,
OUTPUT_FOLDER=/home/user/output
HOST_FILE=/home/user/host.txt
LOG_DIR=/home/user/log
SLEEPTIME=60
mkdir -p $OUTPUT_FOLDER
while true
do
: << 'END'
rm -f $OUTPUT_FOLDER/*
DATE=`date +"%Y%m%d"`
DATETIME=`date +"%Y%m%d_%H%M%S"`
pssh -h $HOST_FILE -o $OUTPUT_FOLDER
while read host
do
numline=0
while read line
do
if [[ $numline == 0 ]]
then
FIELD1=`echo $line | cut -d':' -f2 | cut =d' ' -f2`
fi
((numline+=1))
done < ${OUTPUT_FOLDER}/${host}
echo "$DATETIME,$FIELD1" >> ${LOG_DIR}/${DATE}.log
done < $HOST_FILE
sleep $SLEEPTIME
END
done
Even with this script, where I'm expecting nothing to be printed to my log file, I see my previous echo statements that I have deleted, and the lines with the empty FIELD1. I have checked that I'm running the correct version of the script each time.
What is going on?
I'm not sure whether it is the actual reason for the screw-up, but you have an incorrect usage of cut in line number 22 which could have mangled FIELD1 inadvertently.
FIELD1=`echo $line | cut -d':' -f2 | cut =d' ' -f2`
# ^ incorrect usage of the delimiter '-d' flag
which should have been used as
FIELD1=`echo $line | cut -d':' -f2 | cut -d' ' -f2`
# ^ Proper usage of the '-d' flag with space delimiter
I tried checking the exit status of each piped command to see whether the last command could have succeeded, and below is my observation.
echo $line | cut -d':' -f2 | cut =d' ' -f2;echo "${PIPESTATUS[@]}"
cut: =d : No such file or directory # --> Error from the failed 'cut' command
0 0 1 # --> Return codes of each of the command being piped
Also, if you were a bash purist, you would have avoided the legacy `` style command substitution and used the $(cmd) syntax, like:
FIELD1=$(echo "$line" | cut -d':' -f2 | cut -d' ' -f2)
Another, simpler way would have been to avoid the use of cut and echo altogether and do pure bash string manipulation.
Assuming your $line contains a :-delimited string, e.g. "What Hell:Hello World", and you are trying to extract the World part as far as I can see, just do
FIELD1="${line##* }" # Strips-off characters until last occurrence of whitespace
I tried to reproduce it and couldn't. There is no reason for this script (when commented out as you have) to produce any output.
There are a couple of possibilities:
As others pointed out, there may be some other script writing to your log file
You may have left a few instances of this script running on other terminals
You may be running this script from crontab once every 5 mins
In order to rule out these possibilities, do a "ps -aef" and check for all the processes that could resemble the name of your script. To catch things started from crontab, you may have to watch the output of "ps -aef" a bit longer (or just check the crontab entries).
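For example, a quick way to look for stray copies (assuming your script is called myscript.sh; adjust the name to yours):
ps -ef | grep '[m]yscript.sh'   # the [m] trick keeps grep from matching itself
crontab -l | grep myscript.sh   # check whether cron keeps starting it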