Filter lines by number of fields - linux

I am filtering very large text files in Linux (usually > 1 GB) to keep only the lines I am interested in. I use this command:
cat ./my/file.txt | LC_ALL=C fgrep -f ./my/patterns.txt | $decoder > ./path/to/result.txt
$decoder is the path to a program I was given to decode these files. The problem is that it only accepts lines with 7 fields, that is, 7 strings separated by spaces (e.g. "11 22 33 44 55 66 77"). Whenever a line with more or fewer fields is passed to this program, it crashes and I get a broken pipe error message.
To fix it, I wrote a super simple script in Bash:
while read line ; do
    if [[ $( echo $line | awk '{ print NF }') == 7 ]]; then
        echo $line;
    fi;
done
But the problem is that now it takes ages to finish. Before it took seconds, and now it takes ~30 minutes.
Does anyone know a better/faster way to do this? Thank you in advance.

Well, perhaps you can insert awk in between instead. No need to rely on Bash:
LC_ALL=C fgrep -f ./my/patterns.txt ./my/file.txt | awk 'NF == 7' | "$decoder" > ./path/to/result.txt
Perhaps awk can go first; performance may be better that way:
awk 'NF == 7' ./my/file.txt | LC_ALL=C fgrep -f ./my/patterns.txt | "$decoder" > ./path/to/result.txt
You could also merge fgrep and awk into a single awk command, although I'm not sure how that would interact with anything that requires LC_ALL=C, or whether it would give better performance.
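For what it's worth, a rough sketch of that merged version might look like the following (a sketch only: it assumes patterns.txt contains fixed strings, one per line, and uses index() for literal fgrep-style matching; looping over the patterns for every line may well be slower than fgrep when the pattern list is large):
awk 'NR == FNR { pats[$0]; next }
     NF == 7 { for (p in pats) if (index($0, p)) { print; next } }' ./my/patterns.txt ./my/file.txt | "$decoder" > ./path/to/result.txt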

Related

Saving values in BASH shell variables while using |tee

I am trying to count the number of line matches in a very LARGE file and store the counts in variables, using only BASH shell commands.
Currently, I am scanning the large file twice, using a separate grep statement each time, like so:
$ cat test.txt
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example' | wc --lines ; ) ; ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ; ## second run
and I end up with this:
$ echo $FIRST
3
$ echo $SECOND
2
My hope is to scan the large file only once. And I have never used Awk and would rather not use it!
The |tee option is new to me. It seems that passing the results into two separate grep statements may mean that we only have to scan the large file once.
Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.
I have tried multiple ways using something like these below:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST=$( grep 'first example' | wc --lines ;);) \
>(SECOND=$(grep 'second example' | wc --lines ;);) \
>/dev/null ;
and using read:
FIRST=''; SECOND='';
cat test.txt \
|tee >(grep 'first example' | wc --lines | (read FIRST); ); \
>(grep 'second example' | wc --lines | (read SECOND); ); \
> /dev/null ;
cat test.txt \
| tee <( read FIRST < <(grep 'first example' | wc --lines )) \
<( read SECOND < <(grep 'second example' | wc --lines )) \
> /dev/null ;
and with curly brackets:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST={$( grep 'first example' | wc --lines ;)} ) \
>(SECOND={$(grep 'second example' | wc --lines ;)} ) \
>/dev/null ;
but none of these allow me to save the line count into variables FIRST and SECOND.
Is this even possible to do?
tee isn't saving any work. Each grep is still going to do a full scan of the file. Either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact tee actually just adds a fourth program that loops over the whole file.
The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value; it's just that the value is almost immediately lost. Why? Because the assignment happens in a subshell, not in the parent shell.
Every command in a | pipeline executes in a different process and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.
As a rule of thumb, you can write variable=$(foo | bar | baz) where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz where it's on the inside. It won't work and you'll be sad.
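A minimal demonstration of the pitfall, assuming a default bash setup (no lastpipe):
count=0
printf 'a\nb\nc\n' | while read -r line; do
    count=$((count + 1))
done
echo "$count"   ## prints 0: the loop ran in a subshell, so its increments are lost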
But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.
Two greps
Getting rid of cat yields:
first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)
This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.
One grep
You could improve this by using a single grep call that searches for both strings. It could work if you don't actually need the individual counts but just want the total:
total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)
Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:
matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)
Pure bash
You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.
First, let's start with a while read loop to process the file line by line:
while IFS= read -r line; do
    ...
done < test.txt
You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:
first=0
second=0
while IFS= read -r line; do
    [[ $line == *'first example'* ]] && ((++first))
    [[ $line == *'second example'* ]] && ((++second))
done < test.txt
echo "$first" ## should display 3
echo "$second" ## should display 2
Another language
If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.
If you're going to be doing things like this, I'd really recommend getting familiar with awk; it's not scary, and IMO it's much easier to do complex things like this with it vs. the weird pipefitting you're looking at. Here's a simple awk program that'll count occurrences of both patterns at once:
awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt
Explanation: /first example/ {first++} means: for each line that matches the regex pattern "first example", increment the first variable. /second example/ {second++} does the same for the second pattern. Then END {print first, second} means: at the end, print the two variables. Simple.
But there is one tricky thing: splitting the two numbers it prints into two different variables. You could do this with read:
bothcounts=$(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
read first second <<<"$bothcounts"
(Note: I recommend using lower- or mixed-case variable names, to avoid conflicts with the many all-caps names that have special functions.)
Another option is to skip the bothcounts variable by using process substitution to feed the output from awk directly into read:
read first second < <(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
">" is about redirect to file/device, not to the next command in pipe. So tee will just allow you to redirect pipe to multiple files, not to multiple commands.
So just try this:
FIRST=$(grep 'first example' test.txt | wc --lines)
SECOND=$(grep 'second example' test.txt | wc --lines)
It's possible to get the matches and count them in a single pass, then get the count of each from the result.
matches="$(grep -e 'first example' -e 'second example' --only-matching test.txt | sort | uniq -c | tr -s ' ')"
FIRST=$(grep -e 'first example' <<<"$matches" | cut -d ' ' -f 2)
echo $FIRST
Result:
3
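Presumably SECOND can be extracted from the same $matches variable in exactly the same way:
SECOND=$(grep -e 'second example' <<<"$matches" | cut -d ' ' -f 2)
echo $SECOND
which should print 2 for the sample file.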
Using awk is the best option I think.

bash script: calculate sum size of files

I'm working on Linux and need to calculate the sum size of some files in a directory.
I've written a bash script named cal.sh as below:
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
echo $line
done<`ls -l | grep opencv | awk '{print $5}'`
However, when I executed this script ./cal.sh, I got an error:
./cal.sh: line 6: `ls -l | grep opencv | awk '{print $5}'`: ambiguous redirect
And if I execute it with sh cal.sh, it seems to work, but I get a weird message at the end of the output:
25
31
385758: File name too long
Why does sh cal.sh seem to work? Where does File name too long come from?
Alternatively, you can do:
du -cb *opencv* | awk 'END{print $1}'
Option -b displays each file's size in bytes and -c prints the total size.
Ultimately, as other answers will point out, it's not a good idea to parse the output of ls because it may vary between systems. But it's worth knowing why the script doesn't work.
The ambiguous redirect error is because you need quotes around your ls command i.e.:
while IFS='' read -r line || [[ -n "$line" ]]; do
echo $line
done < "`ls -l | grep opencv | awk '{print $5}'`"
But this still doesn't do what you want. The "<" operator is expecting a filename, which is being defined here as the output of the ls command. But you don't want to read a file, you want to read the output of ls. For that you can use the "<<<" operator, also known as a "here string" i.e.:
while IFS='' read -r line || [[ -n "$line" ]]; do
echo $line
done <<< "`ls -l | grep opencv | awk '{print $5}'`"
This works as expected, but has some drawbacks. When using a "here string", the command must first run to completion and its entire output must be buffered before the loop starts. This can be a problem if the command takes a long time to execute or has a large output.
IMHO the best and most standard method of iterating a commands output line by line is the following:
ls -l | grep opencv | awk '{print $5}' | while read -r line; do
    echo "line: $line"
done
I would recommend against using that pipeline to get the sizes of the files you want - in general parsing ls is something that you should avoid. Instead, you can just use *opencv* to get the files and stat to print the size:
stat -c %s *opencv*
The format specifier %s prints the size of each file in bytes.
You can pipe this to awk to get the sum:
stat -c %s *opencv* | awk '{ sum += $0 } END { if (sum) print sum }'
The if is there to ensure that no input => no output.

Concatenation of huge number of selective files from a directory in Shell

I have more than 50000 files in a directory, such as file1.txt, file2.txt, ..., file50000.txt. I would like to concatenate some of these files, whose numbers are listed in the following text file (need.txt).
need.txt
1
4
35
45
71
.
.
.
I tried the following. Though it works, I am looking for a simpler and shorter way.
n1=1
n2=$(wc -l < need.txt)
while [ $n1 -le $n2 ]
do
    f1=$(awk -v n="$n1" 'NR==n {print $1}' need.txt)
    cat file$f1.txt >> out.txt
    (( n1++ ))
done
This might also work for you:
sed 's/.*/file&.txt/' < need.txt | xargs cat > out.txt
Something like this should work for you:
sed -e 's/.*/file&.txt/' need.txt | xargs cat > out.txt
It uses sed to translate each line into the appropriate file name and then hands the filenames to xargs to hand them to cat.
Using awk it could be done this way:
awk 'NR==FNR{ARGV[ARGC]="file"$1".txt"; ARGC++; next} {print}' need.txt > out.txt
Which adds each file to the ARGV array of files to process and then prints every line it sees.
It is possible to do it without any sed or awk command, directly using Bash built-ins and cat (of course).
for i in $(cat need.txt); do cat file${i}.txt >> out.txt; done
And, as you wanted, it is quite simple.
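If word splitting or glob characters in need.txt were ever a concern, a read loop (just a sketch of the same idea) would avoid them:
while IFS= read -r i; do
    cat "file${i}.txt"
done < need.txt > out.txt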

Shell script to get count of a variable from a single line output

How can I get the count of the # character from the following output? I used the tr command and extracted it, but I am curious to know what the best way to do it is, i.e. other ways of doing the same thing.
{running_device,[test#01,test#02]},
My solution was:
echo '{running_device,[test#01,test#02]},' | tr ',' '\n' | grep '#' | wc -l
I think it is simpler to use:
echo '{running_device,[test#01,test#02]},' | tr -cd '#' | wc -c
This yields 2 for me (tested on Mac OS X 10.7.5). The -c option to tr means 'complement' (of the set of specified characters) and -d means 'delete', so that deletes every non-# character, and wc counts what's provided (no newline, so the line count is 0, but the character count is 2).
Nothing wrong with your approach. Here are a couple of other approaches:
echo $(echo '{running_device,[test#01,test#02]},' | awk -F"#" '{print NF - 1}')
or
echo $((`echo '{running_device,[test#01,test#02]}' | sed 's+[^#]++g' | wc -c` - 1 ))
The only concern I would have is if you are running this command in a loop (e.g. once for every line in a large file). In that case, execution time could become an issue, since stringing together shell utilities incurs the overhead of launching processes, which can be sloooow. If so, I would suggest writing a pure awk version to process the entire file.
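For example, a sketch of such a pure awk version (lines.txt is a made-up input file name): gsub() returns the number of replacements it makes, which doubles as a per-line count of # characters.
awk '{ print gsub(/#/, "") }' lines.txt
Accumulating into an END block would give one grand total instead of one count per line:
awk '{ total += gsub(/#/, "") } END { print total }' lines.txt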
Use GNU Grep to Avoid Character Translation
Here's another way to do this that I personally find more intuitive: extract just the matching characters with grep, then count grep's output lines. For example:
echo '{running_device,[test#01,test#02]},' |
grep --fixed-strings --only-matching '#' |
wc -l
yields 2 as the result.

Simpler way of extracting text from file

I've put together a shell script to generate panoramas using the command-line tools used by Hugin. One interesting thing about several of those tools is that they allow multi-core usage, but this option has to be flagged within the command.
What I've come up with so far:
#get the last field of each line in the file, initialize the line counter
results=$(more /proc/cpuinfo | awk '{print ($NF)}')
count=0
#loop through the results till the 12th line for cpu core count
for result in $results; do
    if [ $count == 12 ]; then
        echo "Core Count: $result"
    fi
    count=$((count+1))
done
Is there a simpler way to do this?
result=$(awk 'NR==12{print $NF}' /proc/cpuinfo)
To answer your question about getting the first/last so many lines, you could use head and tail, e.g.:
cat /proc/cpuinfo | awk '{print ($NF)}' | head -12 | tail -1
But instead of searching for the 12th line, how about searching semantically for any line containing "cores"? For example, some machines may have multiple CPUs, so you may want to sum the results:
cat /proc/cpuinfo | grep "cores" | awk '{s+=$NF} END {print s}'
count=$(getconf _NPROCESSORS_ONLN)
see getconf(1) and sysconf(3) constants.
According to the Linux manpage, _SC_NPROCESSORS_ONLN "may not be standard". My guess is this requires glibc or even a Linux system specifically. If that doesn't work, I'd probably take looking at /sys/class/cpuid (perhaps there's something better?) over parsing /proc/cpuinfo. None of the above are completely portable.
There are many ways:
head -n 12 /proc/cpuinfo | tail -1 | awk -F: '{print $2}'
grep 'cpu cores' /proc/cpuinfo | head -1 | awk -F: '{print $2}'
and so on.
But I must note that this only takes information from the first section of /proc/cpuinfo, and I am not sure that is what you need.
And what if cpuinfo changes its format? ;) Maybe something like this would be better:
cat /proc/cpuinfo | sed -n 's/cpu cores\s\+:\s\+\(.*\)/\1/p' | tail -n 1
And make sure to sum the cores. Mine has got like 12 or 16 of them ;)
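A minimal sketch of that summing step, reusing the same sed expression (GNU sed assumed) and piping into awk:
sed -n 's/cpu cores\s\+:\s\+\(.*\)/\1/p' /proc/cpuinfo | awk '{ s += $1 } END { print s }'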
I'm unsure what you are trying to do and why what ormaaj said above wouldn't work either. My instinct, based on your description, would have been something much simpler along the lines of:
grep processor /proc/cpuinfo | wc -l
