Perl line runs 30 times quicker with double quotes than with single quotes - linux

We have a task to change some strings in binary files to lowercase (from mixed/upper/whatever). The relevant strings are references to other files (it's in connection with an upgrade where we are also moving from Windows to Linux as a server environment, so case suddenly matters). We have written a script which uses a Perl loop to do this. We have a directory containing around 300 files (total size of the directory is around 150M), so it's a fair amount of data but nothing huge.
The following perl code takes about 6 minutes to do the job:
for file_ref in `ls -1F $forms6_convert_dir/ | grep -v "/" | sed 's/\(.*\)\..*/\1/'`
do
(( updated++ ))
write_line "Converting case of string: $file_ref "
perl -i -pe "s{(?i)$file_ref}{$file_ref}g" $forms6_convert_dir/*
done
while the following perl code takes over 3 hours!
for file_ref in `ls -1F $forms6_convert_dir/ | grep -v "/" | sed 's/\(.*\)\..*/\1/'`
do
(( updated++ ))
write_line "Converting case of string: $file_ref "
perl -i -pe 's{(?i)$file_ref}{$file_ref}g' $forms6_convert_dir/*
done
Can anyone explain why? Is it that $file_ref is being left as the literal string $file_ref instead of being substituted with its value in the single-quoted version? If so, what is it replacing in that version? What we want is to replace every occurrence of any filename with itself in lowercase. If we run strings on the files before and after and search for the filenames, both versions appear to have made the same changes. However, if we run diff on the files produced by the two loops (diff firstloop/file1 secondloop/file1), it reports that they differ.
This is running from within a bash script on linux.

The shell doesn't do variable substitution for single quoted strings. So, the second one is a different program.
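A quick way to see what each version actually hands to Perl (the value myform is just a made-up example):

file_ref=myform
echo "s{(?i)$file_ref}{$file_ref}g"   # double quotes: the shell expands it -> s{(?i)myform}{myform}g
echo 's{(?i)$file_ref}{$file_ref}g'   # single quotes: passed literally    -> s{(?i)$file_ref}{$file_ref}g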

As the other answers said, the shell doesn't substitute variables inside single quotes, so the second version is executing the literal Perl statement s{(?i)$file_ref}{$file_ref}g for every line in every file.
As you said in a comment, if $ is the end-of-line metacharacter, $file_ref could never match anything. $ matches before the newline at end-of-line, so the next character would have to be a newline. Therefore, Perl doesn't interpret $ as the metacharacter; it interprets it as the beginning of a variable interpolation.
In Perl, the variable $file_ref is undef, which is treated as the empty string when interpolated. So you're really executing s{(?i)}{}g, which says to replace the empty string with the empty string, and do that for all occurrences in a case-insensitive manner. Well, there's an empty string between every pair of characters, plus one at the beginning and end of each line. Perl is finding each one and replacing it with the empty string. This is a no-op, but it's an expensive one, hence the 3-hour run time.
You must be mistaken about both versions making the same changes. As I just explained, the single-quoted version is just an expensive no-op; it doesn't make any changes at all to the file contents (it just makes a fresh copy of each file). The files you ran it on must have already been converted to lower case.

With double quotes you are using the shell variable; with single quotes, Perl tries to use a (nonexistent) Perl variable of that name.
You might wish to consider writing the whole lot in either Perl or Bash to speed things up. Both languages can read files and do pattern matching. In Perl you can change to lower-case using the lc built-in function, and in Bash 4 you can use ${file,,}.
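For example, both lower-casing facilities side by side (the value is made up):

name="MyForm.FMB"
echo "${name,,}"                          # Bash 4+: myform.fmb
perl -e 'print lc("MyForm.FMB"), "\n"'    # Perl:    myform.fmb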

Related

removing prepositions from a text file in linux

What I want to do is remove all prepositions from a text file on CentOS, things like 'on of to the in at ....'. Here is my script:
!/bin/bash
list='i me my myself we our ours ourselves you your yours yourself ..... '
cat Hamlet.txt | for item in $list
do
sed 's/$item//g'
done > newHam.txt
but at the end, when I open newHam.txt, nothing has changed; it's the same as Hamlet.txt. I don't know whether this is a good approach or not. Any suggestion? Any approach??
Assuming your sed understands \< and \> for word boundaries,
sed 's/\<\(i\|me\|my\|myself\|we\|our\|ours\|ourselves\|you\|your\|yours\|yourself\)\> \?//g' Hamlet.txt >newHam.txt
You want to make sure you include word boundaries; your original attempt would replace e.g. i everywhere, turning "in the input" into "n the nput".
If you already have the words in a string, you can interpolate it in Bash with
sed "s/\\<\\(${list// /\\|}\\)\\> \\?//g" Hamlet.txt >newHam.txt
but the ${variable//pattern/substitution} parameter expansion is not portable to e.g. /bin/sh. Notice also how double quotes instead of single are necessary for the shell to be allowed to perform variable substitutions within the script, and how all literal backslashes need to be escaped with another backslash within double quotes.
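As a quick sanity check of what that expansion produces, with an abbreviated list (illustration only):

list='i me my myself'
echo "${list// /\\|}"    # should print: i\|me\|my\|myself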
Unfortunately, many details of sed are poorly standardized. Ironically, switching to a tool which isn't standard at all might be the most portable solution.
perl -pe 'BEGIN {
@list = qw(i me my myself we our ours ourselves you your yours yourself .....);
$re = join("|", @list); }
s/\b($re)\b ?//go' Hamlet.txt >newHam.txt
If you want this as a standalone script,
#!/usr/bin/perl
BEGIN {
@list = qw(i me my myself we our ours ourselves you your yours yourself .....);
$re = join("|", @list);
}
while (<>) {
s/\b($re)\b ?//go;
print
}
These words are pronouns, not prepositions.
Finally, take care to fix the shebang of your script; the first line of the script needs to start with exactly the two characters #! because that's what makes it a shebang. You'll also want to avoid the useless cat in the future.
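Putting those fixes together with the sed command shown earlier, a rewritten version of the script might look like this (a sketch only; the word list is abbreviated just as in the question):

#!/bin/bash
list='i me my myself we our ours ourselves you your yours yourself'
sed "s/\\<\\(${list// /\\|}\\)\\> \\?//g" Hamlet.txt > newHam.txt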

ksh shell script to find first occurence of _ in string and remove everything until that

I'm new to shell scripting and I'm using the ksh shell. Could you please help me with this?
My string is like errorfile101_ApplicationData_2_333.txt. I want to remove everything up to and including the first occurrence of _.
My output should be ApplicationData_2_333.txt
This is an easy one, assuming you can assign your string to a variable, i.e.
str="errorfile101_ApplicationData_2_333.txt"
echo ${str#*_}
output
ApplicationData_2_333.txt
The # operator in ${str#*_} removes the shortest match of the following pattern from the left (the beginning) of the variable's value.
There is also ##, which removes the longest match from the left, which would give you
333.txt
There are also similar operators for removing from the right side of the string: % (shortest match from the right) and %% (longest match from the right).
All versions of ksh (and bash, and other shells) support these operators. (sorry if this is the wrong term).
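Here are all four operators applied to the same value, with the results shown as comments:

str="errorfile101_ApplicationData_2_333.txt"
echo "${str#*_}"     # ApplicationData_2_333.txt       (shortest match removed from the left)
echo "${str##*_}"    # 333.txt                          (longest match removed from the left)
echo "${str%_*}"     # errorfile101_ApplicationData_2   (shortest match removed from the right)
echo "${str%%_*}"    # errorfile101                     (longest match removed from the right)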
ksh93 and later (and bash, zsh, and probably others) also support a sed-like pattern match/substitution expansion:
echo ${str/*_/xx}
#----------|--|>replacement
#----------> pattern to match
output
xx333.txt
which works a bit like a sed substitution: the pattern *_ is matched greedily (the longest possible string), so everything up to the last underscore is replaced.
IHTH
You can use the cut command:
echo "errorfile101_ApplicationData_2_333.txt" | cut -d"_" -f2-

How to extract string in shell script

I have file names like Tarun_Verma_25_02_2016_10_00_10.csv. How can I extract a string like 25_02_2016_10_00_10 from it in a shell script?
It is not guaranteed how many numeric parts there will be after "firstName"_"lastName".
A one-line solution would be preferred.
with sed
$ echo Tarun_Verma_25_02_2016_10_00_10.csv | sed -r 's/[^0-9]*([0-9][^.]*)\..*/\1/'
25_02_2016_10_00_10
This extracts everything from the first digit up to (but not including) the first dot that follows.
If you want some control over which parts you pick out (assuming the format is always like <firstname>_<lastname>_<day>_<month>_<year>_<hour>_<minute>_<second>.csv) awk would be pretty handy
echo "Tarun_Verma_25_02_2016_10_00_10.csv" | awk -F"[_.]" 'BEGIN{OFS="_"}{print $3,$4,$5,$6,$7,$8}'
Here awk splits on both underscore and period, sets the Output Field Separator to an underscore, and then prints the parts of the file name that you are interested in.
ksh93 supports the syntax bash calls extglobs out-of-the-box. Thus, in ksh93, you can do the following:
f='Tarun_Verma_25_02_2016_10_00_10.csv'
f=${f##+([![:digit:]])} # trim everything before the first digit
f=${f%%+([![:digit:]])} # trim everything after the last digit
echo "$f"
To do the same in bash, you'll want to run the following command first
shopt -s extglob
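Putting the pieces together in bash, a minimal sketch using the same filename:

shopt -s extglob
f='Tarun_Verma_25_02_2016_10_00_10.csv'
f=${f##+([![:digit:]])}   # -> 25_02_2016_10_00_10.csv
f=${f%%+([![:digit:]])}   # -> 25_02_2016_10_00_10
echo "$f"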
Since this uses shell-native string manipulation, it runs much more quickly than invoking an external command (sed, awk, etc) when processing only a single line of input. (When using ksh93 rather than bash, it's quite speedy even for large inputs).

convert this linux statement into a statement which is supported by windows command prompt

This is my statement, which works in a Unix environment:
"cat document.xml | grep \'<w:t\' | sed \'s/<[^<]*>//g\' | grep -v \'^[[:space:]]*$\'"
But I want to execute that statement in the Windows command prompt.
How do I do that? And what are the Windows commands similar to cat, grep, and sed?
Please tell me the exact code for Windows that corresponds to the command above.
The double quotes around the pipeline in your question are a syntax error, and the backslashed single quotes should apparently really not have backslashes, but I assume it's just an artefact of a slightly imprecise presentation.
Here's what the code does.
cat document.xml |
This is a useless use of cat but its purpose is to feed the contents of this file into the pipeline.
grep '<w:t' |
This looks for lines containing the literal string <w:t (probably the start of a tag in the XML format in the file). The single quotes quote the string so that it is not interpreted by the shell (otherwise the < would be interpreted as a redirection operator); they are consumed by the shell, and not passed through to grep.
sed 's/<[^<]*>//g' |
This replaces every open/close angle-bracket pair, i.e. every tag, with an empty string. The regular expression [^<]* matches zero or more occurrences of any character except <. If the XML is well-formed, the brackets should always occur in pairs, and so we effectively remove all XML tags.
grep -v '^[[:space:]]*$'
This removes any line which is empty or consists entirely of whitespace.
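For instance, on a couple of made-up input lines the three filters combine like this (output shown as a comment):

printf '%s\n' '<w:p>' '  <w:t>Hello world</w:t>' '</w:p>' |
  grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
# prints "  Hello world" - the tags are gone and the markup-only lines are dropped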
Because sed is a superset of grep, the program could easily be rephrased as a single sed script. Perhaps the easiest solution for your immediate problem would be to obtain a copy of sed for your platform.
sed -e '/<w:t/!d' -e 's/<[^<]*>//g' -e '/[^[:space:]]/!d' document.xml
I understand quoting rules on Windows may be different; try with double quotes instead of single, or put the script in a file and use sed -f file document.xml where file contains the script itself, like this:
/<w:t/!d
s/<[^<]*>//g
/[^[:space:]]/!d
This is a rather crude way to extract the CDATA from an XML document, anyway; perhaps some XML processor would be the proper way forward. E.g. xmlstarlet appears to be available for Windows. It works even if the XML input doesn't have the beginning and ending <w:t> tags on the same line, with nothing else on it. (In fact, parsing XML with line-oriented tools is a massive antipattern.)
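For example, if document.xml is the main document part of a .docx file, the w: prefix is normally bound to the WordprocessingML namespace; something along these lines might work (the namespace URI is an assumption about your particular file):

xmlstarlet sel -N w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" -t -v "//w:t" -n document.xml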
May try with "powershell" ?
It is included since Win8 I think,
for sure on W10 it is.
I've just tested a "cat" command and it works.
"grep" don't but may be adapt like this :
PowerShell equivalent to grep -f
and
https://communary.wordpress.com/2014/11/10/grep-the-powershell-way/
The equivalent of grep on windows would be findstr and the equivalent of cat would be type.
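For example, the grep stage of the pipeline might translate roughly like this (there is no built-in sed equivalent, so the tag stripping still needs an extra tool such as a Windows port of sed or PowerShell):

type document.xml | findstr /C:"<w:t"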

How to delete every character after a space?

I am making a script that creates a backup of my /home/ directory called backup.sh. When the backup completes I want the script to spit out the size of the backup in megabytes. Here are the lines I am having trouble with:
# creates an approximate size of the file + the file location
backup_text=$(du $new_backup)
# take off the file name at the end and add an 'M' to specify Megabytes
backup_text=${backup_text%[:blank:]*}M
# print string to console
echo $backup_text
Here is the output I keep getting:
20 /backups/Thu_Aug_22_15:52M
As you can see, the backup size is 20M, which is correct, but the /backups/... part remains. What did I do wrong in my script?
Sorry, probably a noob question, just starting scripting =)
Replacement-Pattern Expansion
There are a number of ways to deal with this with Bash pattern matching, but I'd use replacement expansion with extglob. For example:
$ shopt -s extglob
$ backup_text='foo bar'
$ echo ${backup_text/+([[:blank:]])*/}
foo
Double the brackets around your character class:
backup_text=${backup_text%[[:blank:]]*}M
The character class [:blank:] is only recognized inside a bracket expression [...], where it counts as a single "character", so you need both pairs of brackets.
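A quick check of the corrected expansion (du separates the size and the path with a tab):

backup_text=$(printf '20\t/backups/Thu_Aug_22_15:52')
echo "${backup_text%[[:blank:]]*}M"   # prints 20M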
Use the -h option to get the size in a human-readable format instead of appending 'M' yourself.
Also, you don't need any magic to cut out the filename. Just get the first word!
And do not forget the quotes around "$new_backup". This is important because word splitting will break the command if $new_backup contains a space.
sizeStr=$(du -h "$new_backup" | awk '{print $1}')
