Length comparison of one specific field in linux - linux

I was trying to check the length of second field of a TSV file (hundreds of thousands of lines). However, it runs very very slowly. I guess it should be something wrong with "echo", but not sure how to do.
Input file:
prob name
1.0 Claire
1.0 Mark
... ...
0.9 GFGKHJGJGHKGDFUFULFD
So I need to print out what went wrong in the name. I tested with a little example using "head -100" and it worked. But just can't cope with original file.
This is what I ran:
for title in `cat filename | cut -f2`;do
length=`echo -n $line | wc -m`
if [ "$length" -gt 10 ];then
echo $line
fi
done

awk to rescue:
awk 'length($2)>10' file
This will print all lines having the second field length longer than 10 characters.
Note that it doesn't require any block statement {...} because if the condition is met, awk will by default print the line.

Try this probably:
cat file.tsv | awk '{if (length($2) > 10) print $0;}'
This should be a bit faster since the whole processing is done by the single awk process, while your solution starts 2 processes per loop iteration to make that comparison.

We can use awk if that helps.
awk '{if(length($2) > 10){print}}' filename
$2 here is 2nd field in filename which runs for every line. It would be faster.

Related

How to monitor CPU usage automatically and return results when it reaches a threshold

I am new to shell script , i want to write a script to monitor CPU usage and if the CPU usage reaches a threshold it should print the CPU usage by top command ,here is my script , which is giving me error bad number and also not storing any value in the log files
while sleep 1;do if [ "$(top -n1 | grep -i ^cpu | awk '{print $2}')">>sy.log - ge "$Threshold" ]; then echo "$(top -n1)">>sys.log;fi;done
Your script HAS to be indented and stored to a file, especially if you are new to shell !
#!/bin/sh
while sleep 1
do
if [ "$(top -n1 | grep -i ^cpu | awk '{print $2}')">>sy.log - ge "$Threshold" ]
then
echo "$(top -n1)" >> sys.log
fi
done
Your condition looks a bit odd. It may work, but it looks really complex. Store intermediate results in variables, and evaluate them.
Then, you will immediately see the syntax error on the “-ge”.
You HAVE to store logfiles within an absolute path for security reasons. Use variables to simplify the reading.
#!/bin/sh
LOGFILE=/absolute_path/sy.log
WHOLEFILE=/absolute_path/sys.log
Thresold=80
while sleep 1
do
TOP="$(top -n1)"
CPU="$(echo $TOP | grep -i ^cpu | awk '{print $2}')"
echo $CPU >> $LOGFILE
if [ "$CPU" -ge "$Threshold" ] ; then
echo "$TOP" >> $WHOLEFILE
fi
done
You have a couple of errors.
If you write output to sy.log with a redirection then that output is no longer available to the shell. You can work around this with tee.
The dash before -ge must not be followed by a space.
Also, a few stylistic remarks:
grep x | awk '{y}' is a useless use of grep; this can usefully and more economically (as well as more elegantly) be rewritten as awk '/x/{y}'
echo "$(command)" is a useless use of echo -- not a deal-breaker, but you simply want command; there is no need to capture what it prints to standard output just so you can print that text to standard output.
If you are going to capture the output of top -n 1 anyway, there is no need really to run it twice.
Further notes:
If you know the capitalization of the field you want to extract, maybe you don't need to search case-insensitively. (I could not find a version of top which prints a CPU prefix with the load in the second field -- it the expression really correct?)
The shell only supports integer arithmetic. Is this a bug? Maybe you want to use Awk (which has floating-point support) to perform the comparison? This also allows for a moderately tricky refactoring. We make Awk output an exit code of 1 if the comparison fails, and use that as the condition for the if.
#!/bin/sh
while sleep 1
do
if top=$(top -n 1 |
awk -v thres="$Threshold" '1; # print every line
tolower($1) ~ /^cpu/ { print $2 >>"sy.log";
exitcode = ($2 >= thres ? 0 : 1) }
END { exit exitcode }')
then
echo "$top" >>sys.log
fi
done
Do you really mean to have two log files with nearly the same name, or is that a typo? Including a time stamp in the log might be useful both for troubleshooting and for actually using the log files.

Linux Scripting with Spaces in Filenames

I am currently working with a vendor-provided software that is trying to handle sending attachment files to another script that will text-extract from the listed file. The script fails when we receive files from an outside source that contain spaces, as the vendor-supplied software does not surround the filename in quotes - meaning when the text-extraction script is run, it receives a filename that will split apart on the space and cause an error on the extractor script. The vendor-provided software is not editable by us.
This whole process is designed to be an automated transfer, so having this wrench that could be randomly thrown into the gears is an issue.
What we're trying to do, is handle the spaced name in our text extractor script, since that is the piece we have some control over. After a quick Google, it seems like changing the IFS value for the script would be the quick solution, but unfortunately, that script would take effect after the extensions have already mutilated the incoming data.
The script I'm using takes in a -e value, a -i value, and a -o value. These values are sent from the vendor supplied script, which I have no editing control over.
#!/bin/bash
usage() { echo "Usage: $0 -i input -o output -e encoding" 1>&2; exit 1; }
while getopts ":o:i:e:" o; do
case "${o}" in
i)
inputfile=${OPTARG}
;;
o)
outputfile=${OPTARG}
;;
e)
encoding=${OPTARG}
;;
*)
usage
;;
esac
done
shift $((OPTIND-1))
...
...
<Uses the inputfile, outputfile, and encoding variables>
I admit, there may be pieces to this I don't fully understand, and it could be a simple fix, but my end goal is to be able to extract -o, -i, and -e that all contain 1 value, regardless of the spaces within each section. I can handle quoting the script after I can extract the filename value
The script fragment that you have posted does not have any issues with spaces in the arguments.
The following, for example, does not need quoting (since it's an assignment):
inputfile=${OPTARG}
All other uses of $inputfile in the script should be double quoted.
What matters is how this script is called.
This would fail and would assign only hello to the variable inputfile:
$ ./script.sh -i hello world.txt
The string world.txt would prompt the getopts function to stop processing the command line and the script would continue with the shift (world.txt would be left in $1 afterwards).
The following would correctly assign the string hello world.txt to inputfile:
$ ./script.sh -i "hello world.txt"
as would
$ ./script.sh -i hello\ world.txt
The following script uses awk to split the arguments while including spaces in the file names. The arguments can be in any order. It does not handle multiple consecutive spaces in an argument, it collapses them to one.
#!/bin/bash
IFS=' '
str=$(printf "%s" "$*")
istr=$(echo "${str}" | awk 'BEGIN {FS="-i"} {print $2}' | awk 'BEGIN {FS="-o"} {print $1}' | awk 'BEGIN {FS="-e"} {print $1}')
estr=$(echo "${str}" | awk 'BEGIN {FS="-e"} {print $2}' | awk 'BEGIN {FS="-o"} {print $1}' | awk 'BEGIN {FS="-i"} {print $1}')
ostr=$(echo "${str}" | awk 'BEGIN {FS="-o"} {print $2}' | awk 'BEGIN {FS="-e"} {print $1}' | awk 'BEGIN {FS="-i"} {print $1}')
inputfile=""${istr}""
outputfile=""${ostr}""
encoding=""${estr}""
# call the jar
There was an issue when calling the jar where Java threw a MalformedUrlException on a filename with a space.
So after reading through the commentary, we decided that although it may not be the right answer for every scenario, the right answer for this specific scenario was to extract the pieces manually.
Because we are building this for a pre-built script passing to it, and we aren't updating that script any time soon, we can accept with certainty that this script will always receive a -i, -o, and -e flag, and there will be spaces between them, which causes all the pieces passed in to be stored in different variables in $*.
And we can assume that the text after a flag is the response to the flag, until another flag is referenced. This leaves us 3 scenarios:
The variable contains one of the flags
The variable contains the first piece of a parameter immediately after the flag
The variable contains part 2+ of a parameter, and the space in the name was interpreted as a split, and needs to be reinserted.
One of the other issues I kept running into was trying to get string literals to equate to variables in my IF statements. To resolve that issue, I pre-stored all relevant data in array variables, so I could test $variable == $otherVariable.
Although I don't expect it to change, we also handled what to do if the three flags appear in a different order than we anticipate (Our assumption was that they list as i,o,e... but we can't see excatly what is passed). The parameters are dumped into an array in the order they were read in, and a parallel array tracks whether the items in slots 0,1,2 relate to i,o,e.
The final result still has one flaw: if there is more than one consecutive space in the filename, the whitespace is trimmed before processing, and I can only account for one space. But saying as we processed over 4000 files before encountering one with a space, I find it unlikely with the naming conventions that we would encounter something with more than one space.
At that point, we would have to be stepping in for a rare intervention anyways.
Final code change is as follows:
#!/bin/bash
IFS='|'
position=-1
ioeArray=("" "" "")
previous=""
flagArr=("-i" "-o" "-e" " ")
ioePattern=(0 1 2)
#echo "for loop:"
for i in $*; do
#printf "%s\n" "$i"
if [ "$i" == "${flagArr[0]}" ] || [ "$i" == "${flagArr[1]}" ] || [ "$i" == "${flagArr[2]}" ]; then
((position += 1));
previous=$i;
case "$i" in
"${flagArr[0]}")
ioePattern[$position]=0
;;
"${flagArr[1]}")
ioePattern[$position]=1
;;
"${flagArr[2]}")
ioePattern[$position]=2
;;
esac
continue;
fi
if [[ $previous == "-"* ]]; then
ioeArray[$position]=${ioeArray[$position]}$i;
else
ioeArray[$position]=${ioeArray[$position]}" "$i;
fi
previous=$i;
done
echo "extracting (${ioeArray[${ioePattern[0]}]}) to (${ioeArray[${ioePattern[1]}]}) with (${ioeArray[${ioePattern[2]}]}) encoding."
inputfile=""${ioeArray[${ioePattern[0]}]}"";
outputfile=""${ioeArray[${ioePattern[1]}]}"";
encoding=""${ioeArray[${ioePattern[2]}]}"";

how to calculate percentage in shell

I would like to calculate percentage in shell. But I can't do it. My script is
#!/bin/bash
#n1=$(wc -l < input.txt) #input.txt is a text file with 10000 lines
n1=10000
n2=$(awk '{printf "%.2f", $n1*0.05/100}')
echo 0.05% of $n1 is $n2
It is neither showing any value nor terminating when executing this script.
awk will give you an n1 illegal field name if you do that, as it's inside single quotes.
Also, to avoid awk keep reading stdin you should pass /dev/null as file. Then:
n2=$(awk -v n1="$n1" 'BEGIN {printf "%.2f", n1*0.05/100}' /dev/null)
Rather than start a new process to count the records in the file and then passing that to awk, I would suggest you let awk count the records itself which it does anyway in the variable NR. So, your entire script would become:
percentage=$(awk 'END{print NR*0.05/100}' input.txt)

Error with a script in bash

I have a little error with a script I wrote in bash and I can't figure out what's I'm doing wrong
note that I'm using this script for thousands of calculations and this error happened only a few times (like 20 or so), but it still happened
What the script does is this: basically it takes in input a web page that I got from a site with the utility w3m and it counts all the occurrences of the words in it... After it orders them from the most common to the ones that occur only once
this is the code:
#!/bin/bash
# counts the numbers of words from specific sites #
# writes in a file the occurrences ordered from the most common #
touch check # file used to analyze the occurrences
touch distribution # final file ordered
page=$1 # the web page that needs to be analyzed
occurrences=$2 # temporary file for the occurrences
dictionary=$3 # dictionary used for another purpose (ignore this)
# write the words one by column
cat $page | tr -c [:alnum:] "\n" | sed '/^$/d' > check
# lopp to analyze the words
cat check | while read words
do
word=${words}
strlen=${#word}
# ignores blacklisted words or small ones
if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
then
# if the word isn't in the file
if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
then
echo "$word: 1" | cat >> $occurrences
# else if it is already in the file, it calculates the occurrences
else
old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
### HERE IS THE ERROR, EITHER THE LET OR THE SED ###
let "new=old+1"
sed -i "s/^$word: $old$/$word: $new/g" $occurrences
fi
fi
done
# orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution
# ignore this, not important
grep -w "1" distribution | awk -F ":" '{print $1}' > temp_dictionary
for line in `cat temp_dictionary`
do
if ! grep -Fxq $line $dictionary
then
echo $line >> $dictionary
fi
done
rm check
rm temp_dictionary
this is the error: (I'm translating it, so it could be different in english)
./wordOccurrences line:30 let:x // where x is a number, usually 9 or 10 (but also 11, 13, etc)
1: syntax error in the espression (the error token is 1)
sed: expression -e #1, character y: command 's' not terminated // where y is another number (this one is also usually 9 or 10) with y being different from x
EDIT:
Talking with kev it looks like it's a newline problem
I added an echo between let and sed to print the sed and it worked perfectly for like 5 to 10 minutes until that error. Usually the sed without error looked like this:
s/^CONSULENTI: 6$/CONSULENTI: 7/g
but when I got the error it was like this:
s/^00145: 1
1$/00145: 4/g
how to fix this?
If you get a new line in $old, it means awk prints two lines so there is a duplicate in $occurences.
The script seems complicated to count words, and not efficient because it launches many processes and process file in a loop ;
maybe you can do something similar with
sort | uniq -c
You should also consider that your case-insensitivity is not consistent throughout the program. I created a page with just "foooo" in it and ran the program, then created one with "Foooo" in it and ran the program again. The 'old=`awk...' line sets 'old' to the empty string because awk is matching case sensitively. This results in the occurrences file not being updated. The subsequent sed and possibly some of the greps are also case sensitive.
This may not be the only error since it doesn't explain the error message you saw, but it is an indication that the same word with different capitalization will be handled erroneously by your script.
The following would separate the words, lowercase them, and then remove the ones smaller than three characters:
tr -cs '[:alnum:]' '\n' <foo | tr '[:upper:]' '[:lower:]' | egrep -v '^.{0,2}$'
Using this at the front of your script would mean that the rest of the script would not have to be case insensitive to be correct.

Count the number of occurrences in a string. Linux

Okay so what I am trying to figure out is how do I count the number of periods in a string and then cut everything up to that point but minus 2. Meaning like this:
string="aaa.bbb.ccc.ddd.google.com"
number_of_periods="5"
number_of_periods=`expr $number_of_periods-2`
string=`echo $string | cut -d"." -f$number_of_periods`
echo $string
result: "aaa.bbb.ccc.ddd"
The way that I was thinking of doing it was sending the string to a text file and then just greping for the number of times like this:
grep -c "." infile
The reason I don't want to do that is because I want to avoid creating another text file for I do not have permission to do so. It would also be simpler for the code I am trying to build right now.
EDIT
I don't think I made it clear but I want to make finding the number of periods more dynamic because the address I will be looking at will change as the script moves forward.
If you don't need to count the dots, but just remove the penultimate dot and everything afterwards, you can use Bash's built-in string manuipulation.
${string%substring}
Deletes shortest match of $substring from back of $string.
Example:
$ string="aaa.bbb.ccc.ddd.google.com"
$ echo ${string%.*.*}
aaa.bbb.ccc.ddd
Nice and simple and no need for sed, awk or cut!
What about this:
echo "aaa.bbb.ccc.ddd.google.com"|awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
(further shortened by helpful comment from #steve)
gives:
aaa.bbb.ccc.ddd
The awk command:
awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
works by separating the input line into fields (FS) by ., then joining them as output (OFS) with ., but the number of fields (NF) has been reduced by 2. The final 1 in the command is responsible for the print.
This will reduce a given input line by eliminating the last two period separated items.
This approach is "shell-agnostic" :)
Perhaps this will help:
#!/bin/sh
input="aaa.bbb.ccc.ddd.google.com"
number_of_fields=$(echo $input | tr "." "\n" | wc -l)
interesting_fields=$(($number_of_fields-2))
echo $input | cut -d. -f-${interesting_fields}
grep -o "\." <<<"aaa.bbb.ccc.ddd.google.com" | wc -l
5

Resources