Linux Scripting with Spaces in Filenames

I am currently working with a vendor-provided software that is trying to handle sending attachment files to another script that will text-extract from the listed file. The script fails when we receive files from an outside source that contain spaces, as the vendor-supplied software does not surround the filename in quotes - meaning when the text-extraction script is run, it receives a filename that will split apart on the space and cause an error on the extractor script. The vendor-provided software is not editable by us.
This whole process is designed to be an automated transfer, so having this wrench that could be randomly thrown into the gears is an issue.
What we're trying to do is handle the spaced name in our text-extractor script, since that is the piece we have some control over. After a quick Google, it seems like changing the IFS value for the script would be the quick solution, but unfortunately that change would take effect only after the extensions have already mutilated the incoming data.
The script I'm using takes in a -e value, a -i value, and a -o value. These values are sent from the vendor supplied script, which I have no editing control over.
#!/bin/bash
usage() { echo "Usage: $0 -i input -o output -e encoding" 1>&2; exit 1; }
while getopts ":o:i:e:" o; do
case "${o}" in
i)
inputfile=${OPTARG}
;;
o)
outputfile=${OPTARG}
;;
e)
encoding=${OPTARG}
;;
*)
usage
;;
esac
done
shift $((OPTIND-1))
...
...
<Uses the inputfile, outputfile, and encoding variables>
I admit there may be pieces to this I don't fully understand, and it could be a simple fix, but my end goal is to extract -o, -i, and -e so that each holds exactly one value, regardless of any spaces within it. I can handle the quoting inside the script once I can extract the filename value.

The script fragment that you have posted does not have any issues with spaces in the arguments.
The following, for example, does not need quoting (since it's an assignment):
inputfile=${OPTARG}
All other uses of $inputfile in the script should be double quoted.
What matters is how this script is called.
This would fail and would assign only hello to the variable inputfile:
$ ./script.sh -i hello world.txt
The string world.txt would prompt the getopts function to stop processing the command line and the script would continue with the shift (world.txt would be left in $1 afterwards).
The following would correctly assign the string hello world.txt to inputfile:
$ ./script.sh -i "hello world.txt"
as would
$ ./script.sh -i hello\ world.txt
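The difference can be checked with a minimal sketch. It assumes a hypothetical function parse_args that wraps the same getopts loop as the question; the variable leftover is also an illustration and counts the arguments getopts did not consume.

```shell
#!/bin/bash
# parse_args is a hypothetical wrapper around the getopts loop from the question.
parse_args() {
    local OPTIND=1 o
    inputfile=""
    while getopts ":o:i:e:" o; do
        case "$o" in
            i) inputfile=$OPTARG ;;
            *) ;;
        esac
    done
    shift $((OPTIND - 1))
    leftover=$#   # number of arguments getopts did not consume
}

# Quoted: the file name arrives as one argument.
parse_args -i "hello world.txt"
quoted_result=$inputfile
quoted_leftover=$leftover

# Unquoted: the calling shell splits the name before the script sees it.
parse_args -i hello world.txt
unquoted_result=$inputfile
unquoted_leftover=$leftover

printf 'quoted:   [%s] leftover=%s\n' "$quoted_result" "$quoted_leftover"
printf 'unquoted: [%s] leftover=%s\n' "$unquoted_result" "$unquoted_leftover"
```

In the unquoted call the second half of the name is left behind as a positional parameter, which is why the fix normally belongs in the caller rather than in this script.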

The following script uses awk to split the arguments while preserving spaces in the file names. The arguments can be in any order. It does not preserve multiple consecutive spaces in an argument; it collapses them to one.
#!/bin/bash
IFS=' '
str=$(printf "%s" "$*")
istr=$(echo "${str}" | awk 'BEGIN {FS="-i"} {print $2}' | awk 'BEGIN {FS="-o"} {print $1}' | awk 'BEGIN {FS="-e"} {print $1}')
estr=$(echo "${str}" | awk 'BEGIN {FS="-e"} {print $2}' | awk 'BEGIN {FS="-o"} {print $1}' | awk 'BEGIN {FS="-i"} {print $1}')
ostr=$(echo "${str}" | awk 'BEGIN {FS="-o"} {print $2}' | awk 'BEGIN {FS="-e"} {print $1}' | awk 'BEGIN {FS="-i"} {print $1}')
inputfile="${istr}"
outputfile="${ostr}"
encoding="${estr}"
# call the jar
There was an issue when calling the jar where Java threw a MalformedUrlException on a filename with a space.

So after reading through the commentary, we decided that although it may not be the right answer for every scenario, the right answer for this specific scenario was to extract the pieces manually.
Because we are building this for a pre-built script passing to it, and we aren't updating that script any time soon, we can accept with certainty that this script will always receive a -i, -o, and -e flag, and there will be spaces between them, which causes all the pieces passed in to be stored in different variables in $*.
And we can assume that the text after a flag is the response to the flag, until another flag is referenced. This leaves us 3 scenarios:
The variable contains one of the flags
The variable contains the first piece of a parameter immediately after the flag
The variable contains part 2+ of a parameter, and the space in the name was interpreted as a split, and needs to be reinserted.
One of the other issues I kept running into was trying to get string literals to equate to variables in my IF statements. To resolve that issue, I pre-stored all relevant data in array variables, so I could test $variable == $otherVariable.
Although I don't expect it to change, we also handled what to do if the three flags appear in a different order than we anticipate (our assumption was that they are listed as i, o, e, but we can't see exactly what is passed). The parameters are dumped into an array in the order they were read in, and a parallel array tracks whether the items in slots 0, 1, 2 relate to i, o, e.
The final result still has one flaw: if there is more than one consecutive space in the filename, the whitespace is trimmed before processing, and I can only account for one space. But seeing as we processed over 4000 files before encountering one with a space, I find it unlikely, given the naming conventions, that we would encounter something with more than one space.
At that point, we would have to be stepping in for a rare intervention anyways.
Final code change is as follows:
#!/bin/bash
IFS='|'
position=-1
ioeArray=("" "" "")
previous=""
flagArr=("-i" "-o" "-e" " ")
ioePattern=(0 1 2)
#echo "for loop:"
for i in $*; do
#printf "%s\n" "$i"
if [ "$i" == "${flagArr[0]}" ] || [ "$i" == "${flagArr[1]}" ] || [ "$i" == "${flagArr[2]}" ]; then
((position += 1));
previous=$i;
case "$i" in
"${flagArr[0]}")
ioePattern[$position]=0
;;
"${flagArr[1]}")
ioePattern[$position]=1
;;
"${flagArr[2]}")
ioePattern[$position]=2
;;
esac
continue;
fi
if [[ $previous == "-"* ]]; then
ioeArray[$position]=${ioeArray[$position]}$i;
else
ioeArray[$position]=${ioeArray[$position]}" "$i;
fi
previous=$i;
done
echo "extracting (${ioeArray[${ioePattern[0]}]}) to (${ioeArray[${ioePattern[1]}]}) with (${ioeArray[${ioePattern[2]}]}) encoding."
inputfile="${ioeArray[${ioePattern[0]}]}"
outputfile="${ioeArray[${ioePattern[1]}]}"
encoding="${ioeArray[${ioePattern[2]}]}"
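The same reassembly idea can be sketched more compactly by looping over "$@" and remembering the most recent flag. This is only an illustration under the assumptions stated above (exactly the flags -i, -o, and -e, each followed by its value); rejoin_args is a hypothetical name, and the sketch shares the same limitation that runs of spaces are reduced to one. A piece of a file name that is exactly -i, -o, or -e would also be mistaken for a flag.

```shell
#!/bin/bash
# rejoin_args is a hypothetical helper, not part of the vendor software.
# It glues the words that follow each flag back together with single spaces.
rejoin_args() {
    inputfile="" outputfile="" encoding=""
    local current="" word
    for word in "$@"; do
        case "$word" in
            -i|-o|-e) current=$word; continue ;;
        esac
        case "$current" in
            -i) inputfile="${inputfile:+$inputfile }$word" ;;
            -o) outputfile="${outputfile:+$outputfile }$word" ;;
            -e) encoding="${encoding:+$encoding }$word" ;;
        esac
    done
}

# Simulates what the script receives when the caller does not quote the names.
rejoin_args -i hello world.txt -o out file.txt -e UTF-8
printf 'input=[%s] output=[%s] encoding=[%s]\n' "$inputfile" "$outputfile" "$encoding"
```

Because the order of the flags is tracked through the current variable, no parallel array is needed to cope with a different flag order.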

Related

How to monitor CPU usage automatically and return results when it reaches a threshold

I am new to shell scripting. I want to write a script that monitors CPU usage and, if the usage reaches a threshold, prints the output of the top command. Here is my script, which gives me a "bad number" error and also does not store any value in the log files:
while sleep 1;do if [ "$(top -n1 | grep -i ^cpu | awk '{print $2}')">>sy.log - ge "$Threshold" ]; then echo "$(top -n1)">>sys.log;fi;done

Your script HAS to be indented and stored in a file, especially if you are new to shell!
#!/bin/sh
while sleep 1
do
    if [ "$(top -n1 | grep -i ^cpu | awk '{print $2}')">>sy.log - ge "$Threshold" ]
    then
        echo "$(top -n1)" >> sys.log
    fi
done
Your condition looks a bit odd. It may work, but it looks really complex. Store intermediate results in variables, and evaluate them.
Then, you will immediately see the syntax error on the “-ge”.
You HAVE to store logfiles within an absolute path for security reasons. Use variables to simplify the reading.
#!/bin/sh
LOGFILE=/absolute_path/sy.log
WHOLEFILE=/absolute_path/sys.log
Threshold=80
while sleep 1
do
    TOP="$(top -n1)"
    # Quote "$TOP" so the line breaks survive and grep can anchor on ^cpu
    CPU="$(echo "$TOP" | grep -i ^cpu | awk '{print $2}')"
    echo "$CPU" >> "$LOGFILE"
    if [ "$CPU" -ge "$Threshold" ] ; then
        echo "$TOP" >> "$WHOLEFILE"
    fi
done

You have a couple of errors.
If you write output to sy.log with a redirection then that output is no longer available to the shell. You can work around this with tee.
The dash before -ge must not be followed by a space.
Also, a few stylistic remarks:
grep x | awk '{y}' is a useless use of grep; this can usefully and more economically (as well as more elegantly) be rewritten as awk '/x/{y}'
echo "$(command)" is a useless use of echo -- not a deal-breaker, but you simply want command; there is no need to capture what it prints to standard output just so you can print that text to standard output.
If you are going to capture the output of top -n 1 anyway, there is no need really to run it twice.
Further notes:
If you know the capitalization of the field you want to extract, maybe you don't need to search case-insensitively. (I could not find a version of top which prints a CPU prefix with the load in the second field -- is the expression really correct?)
The shell only supports integer arithmetic. Is this a bug? Maybe you want to use Awk (which has floating-point support) to perform the comparison? This also allows for a moderately tricky refactoring. We make Awk output an exit code of 1 if the comparison fails, and use that as the condition for the if.
#!/bin/sh
while sleep 1
do
if top=$(top -n 1 |
awk -v thres="$Threshold" '1; # print every line
tolower($1) ~ /^cpu/ { print $2 >>"sy.log";
exitcode = ($2 >= thres ? 0 : 1) }
END { exit exitcode }')
then
echo "$top" >>sys.log
fi
done
Do you really mean to have two log files with nearly the same name, or is that a typo? Including a time stamp in the log might be useful both for troubleshooting and for actually using the log files.
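The floating-point comparison on its own can be sketched as below, without depending on any particular top output format. over_threshold is a hypothetical helper name; it relies only on awk's exit status, the same mechanism as the script above.

```shell
#!/bin/sh
# over_threshold is a hypothetical helper: it succeeds when $1 >= $2.
# awk compares the values as floating-point numbers; exit status 0 means "true".
over_threshold() {
    awk -v value="$1" -v limit="$2" \
        'BEGIN { exit ((value + 0 >= limit + 0) ? 0 : 1) }'
}

if over_threshold 80.5 80; then status_high=yes; else status_high=no; fi
if over_threshold 12.3 80; then status_low=yes; else status_low=no; fi

echo "80.5 vs 80: $status_high"
echo "12.3 vs 80: $status_low"
```

Adding 0 to each value forces a numeric comparison, so a reading such as 80.5 is handled where the shell's integer-only -ge test would report an error.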

Bash Issue: AWK

I came back to work from a break to see that my Bash script wasn't working like it used to. The tidbit of code below would grab and filter what's in a file. Here are the contents of said file:
# A colon, ':', is used as the field terminator. A new line terminates
# the entry. Lines beginning with a pound sign, '#', are comments.
#
# Entries are of the form:
# $ORACLE_SID:$ORACLE_HOME:<N|Y>:
#
# The first and second fields are the system identifier and home
# directory of the database respectively. The third filed indicates
# to the dbstart utility that the database should , "Y", or should not,
# "N", be brought up at system boot time.
#
# Multiple entries with the same $ORACLE_SID are not allowed.
#
#
OEM:/software/oracle/agent/agent12c/core/12.1.0.3.0:N
*:/software/oracle/agent/agent11g:N
dev068:/software/oracle/ora-10.02.00.04.11:Y
dev299:/software/oracle/ora-10.02.00.04.11:Y
xtst036:/software/oracle/ora-10.02.00.04.11:Y
xtst161:/software/oracle/ora-10.02.00.04.11:Y
dev360:/software/oracle/ora-11.02.00.04.02:Y
dev361:/software/oracle/ora-11.02.00.04.02:Y
xtst215:/software/oracle/ora-11.02.00.04.02:Y
xtst216:/software/oracle/ora-11.02.00.04.02:Y
dev298:/software/oracle/ora-11.02.00.04.03:Y
xtst160:/software/oracle/ora-11.02.00.04.03:Y
What the code used to produce and throw into an array:
dev068
dev299
xtst036
xtst161
dev360
dev361
xtst215
xtst216
dev298
xtst160
It would look at the file (oratab), find the database names (e.g. xtst160), and put them into an array. I then used this array for other tasks later in the script. Here's the relevant Bash script code:
# Collect the databases using a mixture of AWK and regex, and throw it into an array.
printf "\n2) Collecting databases on %s:\n" $HOSTNAME
declare -a arr_dbs=(`awk -F: -v key='/software/oracle/ora' '$2 ~ key{print $ddma_input}' /etc/oratab`)
# Loop through and print the array of databases.
for i in ${arr_dbs[@]}
do
printf "%s " $i
done
It doesn't seem anyone has modified the code or that the oratab file format has changed. So I'm not 100% sure what's going on now. Instead of grabbing the few characters, it's grabbing the entire line:
dev068:/software/oracle/ora-10.02.00.04.11:Y
I'm trying to understand Bash and regex more but I'm stumped. Definitely not my forte. A broken down explanation of the awk line would be greatly appreciated.

I found the error. We changed the amount of arguments being passed in and the order they are received.

Printing $1 instead of $ddma_input would resolve the issue as well.
declare -a arr_dbs=(`awk -F ":" -v key='/software/oracle/ora' '$2 ~ key{print $1}' /etc/oratab`)
# Loop through and print the array of databases.
for i in ${arr_dbs[@]}
do
printf "%s " $i
done
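Why the command as posted prints whole lines can be sketched as follows. Inside the single-quoted awk program, ddma_input is an awk variable rather than a shell variable, and nothing assigns it. An unset awk variable evaluates to 0 in a numeric context, so $ddma_input means $0, which is the entire line. The two sample lines are copied from the oratab shown in the question.

```shell
#!/bin/bash
sample='dev068:/software/oracle/ora-10.02.00.04.11:Y
OEM:/software/oracle/agent/agent12c/core/12.1.0.3.0:N'

# ddma_input is unset inside awk, so $ddma_input is $0 (the whole line).
whole=$(printf '%s\n' "$sample" |
    awk -F: -v key='/software/oracle/ora' '$2 ~ key {print $ddma_input}')

# $1 is the first colon-separated field, i.e. the database name.
name=$(printf '%s\n' "$sample" |
    awk -F: -v key='/software/oracle/ora' '$2 ~ key {print $1}')

printf 'whole=[%s]\nname=[%s]\n' "$whole" "$name"
```

Reading the awk line piece by piece: -F: splits each line on colons, -v key=... passes the pattern in as an awk variable, $2 ~ key keeps only lines whose second field matches it, and the print action chooses what to output for those lines.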

You could easily implement this whole thing in native bash with no external tools at all:
arr_dbs=( )
while IFS= read -r line; do
case $line in
"#"*) continue ;;
*:/software/oracle/ora*:*) arr_dbs+=( "${line%%:*}" ) ;;
esac
done </etc/oratab
printf ' %s\n' "${arr_dbs[@]}"
This actually avoids some bugs you had in your original implementation. Let's say you had a line like the following:
*:/software/oracle/ora-default:Y
If you aren't careful with how you handle that *, it'll be replaced with a list of filenames in the current directory by the shell whenever expansion occurs.
What does "whenever expansion occurs" mean in this context? Well:
# this will expand a * into a list of filenames during the assignment to the array
arr=( $(echo "*") ) # vs the correct read -a arr < <(echo "*")
# this will expand a * into a list of filenames while generating items to iterate over
for i in ${arr[@]} # vs the correct for i in "${arr[@]}"
# this will expand a * into a list of filenames while building the argument list for echo
i="*"
echo $i # vs the correct printf '%s\n' "$i"
Note the use of printf over echo -- see the APPLICATION USAGE section of the POSIX specification of echo.
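The expansion hazard can be reproduced with a small sketch. It assumes a scratch directory created with mktemp that holds two hypothetical files, so the glob has something known to match.

```shell
#!/bin/bash
# Work in a scratch directory so the glob has known files to match.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt

entry='*'

unquoted=( $entry )     # the glob expands: the array holds the two file names
quoted=( "$entry" )     # the array holds a single literal asterisk

printf 'unquoted has %d items, quoted has %d item\n' \
    "${#unquoted[@]}" "${#quoted[@]}"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

The quoted form is the behaviour wanted when an oratab entry such as the * line is stored in an array.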

If condition giving error in shell script when checking two strings

In following shell script I want to perform two different tasks depending on file type,
but it is giving an error: "[==c]: command not found"
echo "enter file name"
read num
var_check= echo $str |awk -F . '{if (NF>1) {print $NF}}'
if ["$var_check"=="c"];then
echo "Some task for c"
elif ["$var_check"=="cpp"];then
echo "Some task for cpp"
else
echo "Wrong file extension"
fi

You wrote:
if ["$var_check"=="c"];then
The [ command is a command; its name must be surrounded by spaces (put simplistically).
if [ "$var_check" == "c" ]; then
The last argument, ], must also be preceded by a space. The operands within must also be space separated; they need to be separate arguments. The rules for the [[ ... ]] operator are a bit different, but using spaces helps people read the code even there. What you wrote is a bit like expecting:
ls"-l"/dev/tty
to work; it won't.
You also need to double check whether your test or [ operator supports ==; the normal form is =.
The line:
var_check= echo $str |awk -F . '{if (NF>1) {print $NF}}'
This runs the echo command with var_check set as an environment variable, which is unlikely to be what you wanted. You almost certainly intended to write:
var_check=$(echo $str |awk -F . '{if (NF>1) {print $NF}}')
This runs the echo and awk commands and captures the output in var_check. Use the $(...) notation in preference to the older but more complex to use `...` notation. In simple cases, they look the same; when you nest them, the $(...) notation is far, far simpler to understand and use.
Also, looking on the larger scale (3 lines instead of just 1 line):
echo "enter file name"
read num
var_check=$(echo $str |awk -F . '{if (NF>1) {print $NF}}')
You read the file name into variable num; you then echo $str instead of $num. If you've already got $str set somewhere earlier in the script (in unshown code), what you've got may be fine. Taken as a standalone fragment, it isn't right.
You could also simplify the awk a little:
var_check=$(echo $str |awk -F . 'NF > 1 {print $NF}')
This would work the same as what you wrote, but uses fewer parentheses and braces.
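As an alternative sketch, and not what the answer above uses, the extension can be taken with shell parameter expansion and dispatched with a case statement, which avoids both awk and the [ spacing pitfalls. The function name classify is hypothetical.

```shell
#!/bin/sh
# classify is a hypothetical helper that prints the task chosen for a file name.
classify() {
    name=$1
    case "$name" in
        *.*) ext=${name##*.} ;;   # text after the last dot
        *)   ext="" ;;            # no dot, so no extension
    esac
    case "$ext" in
        c)   echo "Some task for c" ;;
        cpp) echo "Some task for cpp" ;;
        *)   echo "Wrong file extension" ;;
    esac
}

classify main.c
classify "my program.cpp"
classify README
```

The first case mirrors the NF > 1 check in the awk version: a name without a dot yields an empty extension instead of the whole name.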

Directing awk output to variable

New guy here with a problem that will hopefully have an easy solution, but I just can't seem to manage.
So, I have a large list of files that I need to process using the same command line program, and I'm trying to write a small shell script to automate this. I wrote something that will read the input file name from a text file, and repeat the command for each of those files. So far so good. My problem though is with naming the output. Each file is named in the general format "lane_number_bla_bla_bla", and they are processed in pairs. So, there will be a "lane_1_bla_bla_bla_001" and "lane_1_bla_bla_bla_002" that need to combine into a single output file. For this, I'm trying to use awk to read the sample number from the .txt list of input files and parse it into the output file number. Here's the code I came up with (note that the echo statement before the command is there just for testing; it's removed when it comes to run the actual program; also this is not the actual command which is rather more complicated, but the principle still applies):
echo "Which input1 should I use?"
read text
input1=$text
echo "Which input2 should I use?"
read text
input2=$text
echo "How many lines?"
read text
n=$text
for i in $(seq 1 $n)
do
awkinput1=$(awk NR==$i $input1)
awkinput2=$(awk NR==$i $input2)
num=$(awk 'NR==$i{print $2 }' FS="_" $input1)
lane=$(awk 'NR==$i{print $1 }' FS="_" $input1)
echo "command $awkinput1.in > $awkinput1.out && command $awkinput2.in > $awkinput2.out && command cat $awkinput1.out $awkinput2.in > $num-$lane-CAT.out &"
if (( $i % 10 == 0 )); then wait; fi # Limit to 10 concurrent subshells.
done
When I run this, both $awkinput fields get replaced properly in the command line by the appropriate filename, but not the $num and $lane fields, which print nothing.
So, what am I doing wrong? I'm sure it's pretty simple, but I tried quite a lot of different ways to format the relevant awk command, and nothing seems to work. I'm doing this on a remote linux server using SSH protocol, if it makes a difference.
Thanks a lot!

The shell does not expand $i inside single quotes ('), so the quoted string must be terminated before $i.
FS should be set before parsing lines.
Following code will work.
num=$(awk 'BEGIN{FS="_"} NR=='$i'{print $2 }' $input1)
lane=$(awk 'BEGIN{FS="_"} NR=='$i'{print $1 }' $input1)
Code below will be more efficient:
while read in1 ; do
read in2 <&3
num=$(awk 'BEGIN{FS="_"} {print $2 }' <<<"$in1")
lane=$(awk 'BEGIN{FS="_"} {print $1 }' <<<"$in1")
...
done <$input1 3<$input2
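Another way to get the shell value into awk, sketched here as an alternative to closing and reopening the quotes, is awk's -v option. The temporary file and its two lines are hypothetical stand-ins for the input list, following the naming pattern described in the question.

```shell
#!/bin/bash
# A hypothetical two-line list standing in for the input file from the question.
list=$(mktemp)
printf '%s\n' 'lane_1_bla_bla_bla_001' 'lane_2_bla_bla_bla_001' > "$list"

i=2
# -v copies the shell value into the awk variable n; -F_ sets the field separator.
lane=$(awk -F_ -v n="$i" 'NR == n { print $1 }' "$list")
num=$(awk -F_ -v n="$i" 'NR == n { print $2 }' "$list")

printf 'lane=%s num=%s\n' "$lane" "$num"
rm -f "$list"
```

Keeping the awk program entirely inside single quotes means the shell never touches it, so $1 and $2 are always interpreted by awk.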

Error with a script in bash

I have a small error in a script I wrote in bash, and I can't figure out what I'm doing wrong.
Note that I'm using this script for thousands of calculations and this error happened only a few times (around 20), but it still happened.
What the script does is this: it takes as input a web page that I fetched from a site with the w3m utility and counts all the occurrences of the words in it. Afterwards it orders them from the most common to the ones that occur only once.
This is the code:
#!/bin/bash
# counts the numbers of words from specific sites #
# writes in a file the occurrences ordered from the most common #
touch check # file used to analyze the occurrences
touch distribution # final file ordered
page=$1 # the web page that needs to be analyzed
occurrences=$2 # temporary file for the occurrences
dictionary=$3 # dictionary used for another purpose (ignore this)
# write the words one by column
cat $page | tr -c [:alnum:] "\n" | sed '/^$/d' > check
# loop to analyze the words
cat check | while read words
do
word=${words}
strlen=${#word}
# ignores blacklisted words or small ones
if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
then
# if the word isn't in the file
if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
then
echo "$word: 1" | cat >> $occurrences
# else if it is already in the file, it calculates the occurrences
else
old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
### HERE IS THE ERROR, EITHER THE LET OR THE SED ###
let "new=old+1"
sed -i "s/^$word: $old$/$word: $new/g" $occurrences
fi
fi
done
# orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution
# ignore this, not important
grep -w "1" distribution | awk -F ":" '{print $1}' > temp_dictionary
for line in `cat temp_dictionary`
do
if ! grep -Fxq $line $dictionary
then
echo $line >> $dictionary
fi
done
rm check
rm temp_dictionary
This is the error (I'm translating it, so it could be different in English):
./wordOccurrences line:30 let:x // where x is a number, usually 9 or 10 (but also 11, 13, etc)
1: syntax error in the expression (the error token is 1)
sed: expression -e #1, character y: command 's' not terminated // where y is another number (this one is also usually 9 or 10) with y being different from x
EDIT:
Talking with kev it looks like it's a newline problem
I added an echo between let and sed to print the sed and it worked perfectly for like 5 to 10 minutes until that error. Usually the sed without error looked like this:
s/^CONSULENTI: 6$/CONSULENTI: 7/g
but when I got the error it was like this:
s/^00145: 1
1$/00145: 4/g
how to fix this?

If you get a newline in $old, it means awk printed two lines, so there is a duplicate entry in $occurrences.
The script seems complicated for counting words, and it is inefficient because it launches many processes and rescans the file inside a loop.
Maybe you can do something similar with
sort | uniq -c

You should also consider that your case-insensitivity is not consistent throughout the program. I created a page with just "foooo" in it and ran the program, then created one with "Foooo" in it and ran the program again. The 'old=`awk...' line sets 'old' to the empty string because awk is matching case sensitively. This results in the occurrences file not being updated. The subsequent sed and possibly some of the greps are also case sensitive.
This may not be the only error since it doesn't explain the error message you saw, but it is an indication that the same word with different capitalization will be handled erroneously by your script.
The following would separate the words, lowercase them, and then remove the ones smaller than three characters:
tr -cs '[:alnum:]' '\n' <foo | tr '[:upper:]' '[:lower:]' | egrep -v '^.{0,2}$'
Using this at the front of your script would mean that the rest of the script would not have to be case insensitive to be correct.
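Combined with the sort | uniq -c suggestion, the whole count can be sketched as one pipeline. count_words is a hypothetical function name, grep -E stands in for egrep, and the blacklist from the original script is left out of this sketch.

```shell
#!/bin/bash
# count_words is a hypothetical helper; it reads text on standard input and
# prints "count word" lines, most frequent first.
count_words() {
    tr -cs '[:alnum:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        grep -Ev '^.{0,2}$' |
        sort |
        uniq -c |
        sort -rn |
        awk '{ print $1, $2 }'   # normalise the padding that uniq -c adds
}

result=$(printf 'Foooo bar foooo it BAR foooo\n' | count_words)
printf '%s\n' "$result"
```

Each word is counted in a single pass, so there is no per-word grep, awk, or sed -i against the occurrences file, and the duplicate-entry problem behind the newline in $old cannot arise.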
