Fast directory clean-up with Perl - linux

I need to clean up a directory containing millions of log files on my web server, and I've found this great article on how to do it. There are, however, a couple of interesting things in that one-liner that I'd like to understand.
Here's the Perl code I am interested in:
for(<*>){((stat)[9]<(unlink))}
It is run with perl -e 'code'.
So, here are my questions:
the for(<*>) construction - I assume it iterates through the files in the current directory. But where does it store the iterator?
the stat and unlink functions expect at least one argument, I assume... But where is it?
why is the result of calling (stat)[9] compared to the result of calling (unlink)? And what does that result in?
Sorry, I am a no-perl-ish guy, thus I do not understand all those Perl abbreviations. That's why I am asking this question.
Thanks!

That one liner takes many shortcuts:
The <*> is a special case of the diamond operator. You can't access an iterator object as in other languages. Here, it calls the glob function. In list context it returns a list of all the results (which are either the lines of a file or, as in your case, the contents of a directory). The return value is passed to for, which iterates over the list and aliases each value to $_. $_ is the "default variable" for many functions…
Which brings us here. Many core functions default to $_ with no argument. So do unlink and stat.
(stat)[9] means: execute stat in list context and select the 10th result (indices start at zero; this is the modification time). Compare that to an array access like $foo[9].

The code
for(<*>){((stat)[9]<(unlink))}
is equivalent to:
for my $file (<*>) {
my $mtime = (stat($file))[9];
$mtime < unlink($file);
}
<*> can also be replaced with glob "*" which might be more readable.
The code will delete all files in the current directory. It will not delete directories.
Note that the comparison in the last statement of the loop is completely redundant; only the unlink call has any effect. If use warnings is in effect, it will give the warning:
Useless use of numeric lt (<) in void context
For this code to make sense, I would expect a comparison that actually matters, like comparing $mtime to some time to know which logs are old, e.g.:
if ($mtime < $oldtime) {
unlink $file or die "Cannot unlink $file: $!";
}
Note also that it might be prudent to check for failure when deleting files.
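For readers more at home outside Perl, here is a rough Python sketch of the "delete only old files, and notice failures" idea from the answer above. The function name delete_older_than and its parameters are made up for illustration. Unlike <*>, os.scandir also lists dotfiles, so this is an analogue rather than an exact translation.

```python
import os
import time


def delete_older_than(directory, cutoff):
    """Delete regular files in `directory` whose mtime is earlier than `cutoff`.

    `cutoff` is a Unix timestamp. Subdirectories and symlinks are skipped.
    Returns the sorted names of the files that were deleted.
    """
    deleted = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                os.unlink(entry.path)  # raises OSError if the delete fails
                deleted.append(entry.name)
    return sorted(deleted)


# Hypothetical usage (not executed here): remove logs older than 30 days.
#   delete_older_than("/path/to/logs", time.time() - 30 * 24 * 3600)
```

Because os.unlink raises on failure, a file that cannot be removed stops the loop instead of being silently ignored, which mirrors the "or die" in the Perl version.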

the for(<*>) construction - I assume it iterates through the files in the current directory. But where does it store the iterator?
for-loops can be used to iterate over arrays/lists, so if <*> produces a list, then your code is just a run of the mill for loop. As it turns out <*> is another way to spell glob(), which is sort of like a regex for retrieving file names, and glob() returns a list in list context--which is the context a for loop provides. See: http://perldoc.perl.org/functions/glob.html.
Note that the single quotes keep the shell from expanding the *, which would prevent perl from ever seeing it.

Related

Rename all functions in a bash script on 'importing' through source command

I was trying to mimic a module system in my bash scripts separate files containing utility functions for string, date, system, etc. which I can import in my main script and do my work. The only thing I miss is, I want to create a namespace style separation on import. So for example, if I import the string-utils file, all functions defined in it should work only when I prepend a string. at the beginning of each function name in my main script. I realize I could just name all my functions as for example string.contains rather than only contains in the utility file itself, but it's not clean and I don't want to do that. I want to somehow declare the function prefix at the time of import. Something like import * from utils/string as string. Is there a way I can do that? Thanks!
From a practical viewpoint, you are trying to eat your soup with a fork. Bash is not meant for namespaces. If you insist on doing something similar, I suggest that the author of your sourced file cooperate in establishing your idea of a "namespace". For instance, if your sourced file starts with
# This is file function_library.sh
: ${module:=''} # Default: No namespace
fun1() { # Define function without namespace
....
}
# Rename the function to be in namespace, based on the answer in
# https://stackoverflow.com/questions/1203583/how-do-i-rename-a-bash-function
eval "$(echo "${module}_fun1()"; declare -f fun1 | tail -n +2)"
and so on. When sourcing the file, you do a
module=my_name # set namespace
. function_library.sh
This is not only cumbersome, it is also odd that the importer defines the namespace of a library. A more common concept in namespaces is that the imported module itself defines which namespace it is in, which would make the awkward renaming unnecessary.
You will need to modify the content of the files where your functions are defined. You could do this on the fly with a text processor like awk, but it depends on how your functions are declared. For example, if they are always declared starting on a separate line with the syntax:
function foo {
...
}
Then, you can preprocess the file with:
$ source <(awk -v prefix="foo." '$1 == "function" {$2 = prefix $2} 1' foo.sh)
If your function declarations are less regular you will have to use a more complicated preprocessor. And, more importantly, this is very fragile. If the functions declared in the sourced file are themselves called in the sourced file, or their name used in one way or another (aliases, arrays of function names, whatever), things will get much more complex: you will need to catch all these references and update them too.
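To make the fragility concrete, here is the same preprocessing idea sketched in Python instead of awk. It assumes declarations always look like "function name {" at the start of a line; prefix_functions and the sample script are hypothetical names used only for this illustration.

```python
import re

# Matches "function <name>" at the start of a line (optionally indented).
_DECLARATION = re.compile(r"^([ \t]*function[ \t]+)([A-Za-z_][A-Za-z0-9_]*)",
                          re.MULTILINE)


def prefix_functions(script_text, prefix):
    """Return script_text with `prefix` prepended to each declared function name."""
    return _DECLARATION.sub(lambda m: m.group(1) + prefix + m.group(2),
                            script_text)


sample = "function foo {\n  echo hi\n}\nfunction bar {\n  foo\n}\n"
print(prefix_functions(sample, "string."))
```

In the printed result both declarations are renamed, but the call to foo inside bar is left untouched, so bar would now call a function that no longer exists under that name. That is exactly the kind of reference a more complete preprocessor would have to catch.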

Having some issues with Perl Splitting and Merging Functions

First and foremost, I'm not familiar with Perl at all. I've been studying C++ primarily for the last half year. I'm in a class now that is teaching Linux commands, and we have short topics on languages used in Linux, including Perl, which is totally throwing me for a loop (no pun intended). I have a text file that contains a bunch of random numbers separated by spaces and tabs, maybe even newlines, that gets read into the program via a filehandle. I'm supposed to write 2 lines of code, inside a foreach loop, that split the lines of numbers and merge them into one array. I'm not looking for an answer, just a nudge in the right direction. I've been trying different things for multiple hours and feel totally silly that I can't get it; I'm totally lost with the syntax. It's just a bit odd not working inside a compiler, out of my comfort zone with C++. I really appreciate it. I've included a few photos. Basically, the code we are writing is just to store the numbers, and the rest of the program will determine the smallest number and the sum of all numbers. Mine is currently incorrect because I'm not sure what to do. In the output photo, it displays all the numbers read in from the text file, so you can see them.
Several things to fix here. First of all, please don't post screenshots of your sample data or code, as it makes it impossible to copy and paste to test your code or data. Post your code/data by indenting it with four spaces and a newline preceding the code block.
Add use strict; in your script. This should be lesson 0 in your class. After that add my to all variable declarations.
To populate @all_numbers with contents of each line's numbers, without using push, you can use something like this:
foreach my $line (@output_lines)
{
my @numbers = split /\s/, $line;
@all_numbers = (@all_numbers, @numbers);
}
You say you're "not looking for an answer," so here's your nudge:
You're almost there. You split each line well (using split/\s/) and store the numeric values in @all_numbers. However, notice that each time around in the loop, you replace (using the assignment, @all_numbers = ...) the whole contents of @all_numbers with the numbers you found in the current line. Effectively, you're throwing away everything you've stored from the previous lines.
Instead, you want to add to @all_numbers, not replace @all_numbers. Have a look at the push() function for how to do this.
NB: Your split() call is fine, but it's more customary to use split(' ', $line) in this case. (See split(): you can use a single space, ' ', instead of the pattern, /\s/, when you want to split on any whitespace.)
I assume you need to store all the split elements in one array, so you are looking for the push function.
foreach my $line (@input_lines)
{
push(@all_numbers,split(/\s/,$line));
}
Your problem is that, in every iteration, the split values overwrite the array instead of being appended to it. For example,
@array = qw(one two three);
@array = qw(five four seven);
print "@array";
The output is five four seven, not one two three five four seven, because the array is reinitialized with the new values. To add the new values at the beginning or end of the array instead, use unshift or push,
for example
@array = qw(one two three);
push(@array,qw(five four seven));
Another way:
my @all_numbers = map { split ' ', $_ } @output_lines;
See http://perldoc.perl.org/functions/map.html
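Since the asker comes from another language, here is the same append-versus-overwrite idea as a small Python sketch. The function name collect_numbers and the sample lines are invented for illustration; str.split() with no argument behaves roughly like Perl's split ' ', in that it splits on runs of whitespace and drops empty fields.

```python
def collect_numbers(lines):
    """Split every line on whitespace and gather all fields into one list."""
    all_numbers = []
    for line in lines:
        # extend() appends to the existing list; writing
        # `all_numbers = line.split()` here would discard earlier lines.
        all_numbers.extend(line.split())
    return all_numbers


sample_lines = ["3 14\t15", "  92 65", "", "35"]
print(collect_numbers(sample_lines))
```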

need guidance with basic function creation in MATLAB

I have to write a MATLAB function with the following description:
function counts = letterStatistics(filename, allowedChar, N)
This function is supposed to open a text file specified by filename and read its entire contents. The contents will be parsed such that any character that isn’t in allowedChar is removed. Finally it will return a count of all N-symbol combinations in the parsed text. This function should be stored in a file name “letterStatistics.m” and I made a list of some commands and things of how the function should be organized according to my professors' lecture notes:
1. Begin the function by setting the default value of N to 1 in case:
a. The user specifies a 0 or negative value of N.
b. The user doesn’t pass the argument N into the function, i.e., counts = letterStatistics(filename, allowedChar)
2. Using the fopen function, open the file filename for reading in text mode.
3. Using the function fscanf, read in all the contents of the opened file into a string variable.
4. I know there exists a MATLAB function to turn all letters in a string to lower case. Since my analysis will disregard case, I have to use this function on the string of text.
5. Parse this string variable as follows (use logical indexing or regular expressions – do not use for loops):
a. We want to remove all newline characters without this occurring:
e.g.
In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.
In my younger and more vulnerableyears my father gave me some advicethat I’ve been turning over in my mindever since.
Replace all newline characters (special character \n) with a single space: ' '.
b. We will treat hyphenated words as two separate words, hence do the same for hyphens '-'.
c. Remove any character that is not in allowedChar. Hint: use regexprep with an empty string '' as an argument for replace.
d. Any sequence of two or more blank spaces should be replaced by a single blank space.
6. Use the provided permsRep function, to create a matrix of all possible N-symbol combinations of the symbols in allowedChar.
7. Using the strfind function, count all the N-symbol combinations in the parsed text into an array counts. Do not loop through each character in your parsed text as you would in a C program.
8. Close the opened file using fclose.
HERE IS MY QUESTION: As you can see, I have made this list of what the function is, what it should do, and which commands it should use (fclose etc.). The trouble is that I know closing the file involves fclose, but beyond that I'm not sure how to carry out #8. The same goes for creating the function as a whole: I have a vague idea of which commands to use, but I can't produce the actual code. How should I begin? Any guidance or hints would be seriously appreciated, because I have programmer's block and am unable to start!
I think that you are new to MATLAB, so the documentation may be complicated. The root of the problem, I guess, is a basic understanding of file I/O (input/output). When you open a file using fopen, MATLAB returns a handle to that file, which is generally called a file ID. When you call fclose, MATLAB needs to know which file you want to close, so you have to call fclose with the correct file ID.
fid = fopen('test.txt', 'w'); % open for writing; fopen returns -1 on failure
fprintf(fid,'This is a test.\n');
fclose(fid);
fid = 0; % Optional, this will make it clear that the file is not open,
% but it is not necessary since matlab will send a not open message anyway
Regarding the function creation the syntax is something like this:
function out = myFcn(x,y)
z = x*y;
fprintf('z=%.0f\n',z); % Print value of z in the command window
out = z>0;
This is a function that checks whether the product of two numbers is positive (that is, both are nonzero and have the same sign) and returns true if it is. If not, it returns false. This may not be the most useful test, but it works as an example, I guess.
Please comment if this is not what you want to know.
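As a language-neutral outline of the whole pipeline, here is a rough sketch in Python rather than MATLAB. It is not the assignment's solution: itertools.product stands in for the provided permsRep, the helper count_overlapping stands in for counting strfind results, and all names are assumptions made for illustration. It also assumes allowed_chars is non-empty.

```python
import itertools
import re


def count_overlapping(text, pattern):
    """Count occurrences of `pattern` in `text`, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def letter_statistics(filename, allowed_chars, n=1):
    """Return a dict mapping each n-symbol combination to its count in the file."""
    if n < 1:  # step 1: fall back to the default for zero or negative n
        n = 1
    # Steps 2, 3, 4 and 8: open, read everything, lower-case; `with` closes the file.
    with open(filename, "r", encoding="utf-8") as handle:
        text = handle.read().lower()
    text = re.sub(r"[\n-]", " ", text)  # steps 5a and 5b: newlines and hyphens
    text = re.sub("[^" + re.escape(allowed_chars) + "]", "", text)  # step 5c
    text = re.sub(r" {2,}", " ", text)  # step 5d: collapse runs of spaces
    # Steps 6 and 7: enumerate all combinations, then count each one.
    combos = ("".join(p) for p in itertools.product(allowed_chars, repeat=n))
    return {combo: count_overlapping(text, combo) for combo in combos}
```

The point of the sketch is the ordering: the file is opened and fully read first, the text is cleaned with whole-string operations instead of per-character loops, and the file is closed as soon as reading is done rather than at the very end.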

How to write a self reproducing code (prints the source on exec)?

I have seen a lot of C/C++ based solutions to this problem where we have to write a program that upon execution prints its own source.
some solutions --
http://www.cprogramming.com/challenges/solutions/self_print.html
Quine Page solution in many languages
There are many more solutions on the net, each different from the others. I wonder how one approaches such a problem and what goes on in the mind of the person who solves it. Please lend me some insights into this problem. While solutions in interpreted languages like Perl, PHP, Ruby, etc. might be easy, I would like to know how one goes about designing a solution in compiled languages.
Aside from cheating¹ there is no difference between compiled and interpreted languages.
The generic approach to quines is quite easy. First, whatever the program looks like, at some point it has to print something:
print ...
However, what should it print? Itself. So it needs to print the "print" command:
print "print ..."
What should it print next? Well, in the mean time the program grew, so it needs to print the string starting with "print", too:
print "print \"print ...\""
Now the program grew again, so there's again more to print:
print "print \"print \\\"...\\\"\""
And so on.
With every added code there's more code to print.
This approach is getting nowhere,
but it reveals an interesting pattern:
The string "print \"" is repeated over and over again.
It would be nice to put the repeating part
into a variable:
a = "print \""
print a
However, the program just changed,
so we need to adjust a:
a = "a = ...\nprint a"
print a
When we now try to fill in the "...",
we run into the same problems as before.
Ultimately, we want to write something like this:
a = "a = " + (quoted contents of a) + "\nprint a"
print a
But that is not possible,
because even if we had such a function quoted() for quoting,
there's still the problem that we define a in terms of itself:
a = "a = " + quoted(a) + "\nprint a"
print a
So the only thing we can do is putting a place holder into a:
a = "a = #\nprint a"
print a
And that's the whole trick!
Anything else is now clear.
Simply replace the place holder
with the quoted contents of a:
a = "a = #\nprint a"
print a.replace("#", quoted(a))
Since we have changed the code,
we need to adjust the string:
a = "a = #\nprint a.replace(\"#\", quoted(a))"
print a.replace("#", quoted(a))
And that's it!
All quines in all languages work that way
(except the cheating ones).
Well, you should ensure that you replace only
the first occurrence of the place holder.
And if you use a second place holder,
you can avoid needing to quote the string.
But those are minor issues
and easy to solve.
In fact, the realizations of quoted() and replace()
are the only details in which the various quines really differ.
¹ by making the program read its source file
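The placeholder trick can be checked mechanically. The following Python 3 sketch builds the source text of a two-line quine from a template, runs it, and compares what it prints with the source. Here %r plays the role of both the place holder and quoted(), and the % operator plays the role of replace(); the helper run_and_capture is a name invented for this demonstration and is not part of the quine itself.

```python
import contextlib
import io

# The template is the whole program, with %r standing in for its own quoted
# text; %% is an escaped percent sign that becomes % after substitution.
template = 'a = %r\nprint(a %% a)'

# Filling the placeholder with the quoted template yields the complete
# source text of a two-line quine.
source = template % template


def run_and_capture(code):
    """Execute Python source text and return everything it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


print(source)
# print() appends one trailing newline, hence the "\n" in the comparison.
print(run_and_capture(source) == source + "\n")
```

The first print shows the quine itself; saved to a file and run on its own, those two lines print themselves.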
There are a couple of different strategies to writing quines. The obvious one is to just write code that opens the code and prints it out. But the more interesting ones involve language features that allow for self-embedding, like the %s-style printf feature in many languages. You have to figure out how to embed something so that it ends up resolving to the request to be embedded. I suspect, like palindromes, a lot of trial and error is involved.
The usual approach (when you can't cheat*) is to write something that encodes its source in a string constant, then prints out that constant twice: Once as a string literal, and once as code. That gets around the "every time I write a line of code, I have to write another to print it out!" problem.
'Cheating' includes:
- Using an interpreted language and simply loading the source and printing it
- 0-byte long files, which are valid in some languages, such as C.
For fun, I came up with one in Scheme, which I was pretty proud of for about 5 minutes until I discovered it had been discovered before. Anyway, there's a slight modification to the "rules" of the game to better account for the duality of data and code in Lisp: instead of printing out the source of the program, it's an S-expression that returns itself:
((lambda (x) (list x `',x)) '(lambda (x) (list x `',x)))
The one on Wikipedia has the same concept, but with a slightly different (more verbose) mechanism for quoting. I like mine better though.
One idea is to think about encoding, and how to give something a double meaning so that it can be output in a couple of forms. There is also the caveat that this type of problem comes with restrictions to make it harder: without any rule other than that the program must output itself, the empty program is a solution.
How about actually reading and printing your source code? It's not difficult at all!! Here's one in PHP:
<?php
{
header("Content-Type: text/plain");
$f=fopen("5.php","r");
while(!feof($f))
{
echo fgetc($f);
}
fclose($f);
}
?>
In Python 2, you can write:
s='c=chr(39);print"s="+c+s+c+";"+s';c=chr(39);print"s="+c+s+c+";"+s
inspired from this self printing pseudo-code:
Print the following line twice, the second time with quotes.
"Print the following line twice, the second time with quotes."
I've done an AS3 example for those interested in this
var program = "var program = #; function main(){trace(program.replace('#',
String.fromCharCode(34) + program + String.fromCharCode(34)))} main()";
function main(){
trace(program.replace('#', String.fromCharCode(34) + program + String.fromCharCode(34)))
}
main()
In bash it is really easy
touch test; chmod oug+x test; ./test
Empty file, Empty output
In ruby:
puts File.read(__FILE__)

How to make this Groovy string search code more efficient?

I'm using the following groovy code to search a file for a string, an account number. The file I'm reading is about 30MB and contains 80,000-120,000 lines. Is there a more efficient way to find a record in a file that contains the given AcctNum? I'm a novice, so I don't know which area to investigate, the toList() or the for-loop. Thanks!
AcctNum = 1234567890
if (testfile.exists())
{
lines = testfile.readLines()
words = lines.toList()
for (word in words)
{
if (word.contains(AcctNum)) { done = true; match = 'YES' ; break }
chunks += 1
if (done) { break }
}
}
Sad to say, I don't even have Groovy installed on my current laptop - but I wouldn't expect you to have to call toList() at all. I'd also hope you could express the condition in a closure, but I'll have to refer to Groovy in Action to check...
Having said that, do you really need it split into lines? Could you just read the whole thing using getText() and then just use a single call to contains()?
EDIT: Okay, if you need to find the actual line containing the record, you do need to call readLines() but I don't think you need to call toList() afterwards. You should be able to just use:
for (line in lines)
{
if (line.contains(AcctNum))
{
// Grab the results you need here
break;
}
}
When you say efficient, you usually have to decide which direction you mean: whether it should run quickly, or use as few resources (memory, ...) as possible. Often the two goals pull in opposite directions and you have to pick a trade-off.
If you want the search to be memory-friendly, I'd suggest reading the file line by line instead of reading it all at once, which I suspect is what readLines does (I could be wrong there, but in other languages something like readLines reads the whole file into an array of strings).
If you want it to run quickly I'd suggest, as already mentioned, reading in the whole file at once and looking for the given pattern. Instead of just checking with contains you could use indexOf to get the position and then read the record as needed from that position.
I should have explained it better, if I find a record with the AcctNum, I extract out other information on the record...so I thought I needed to split the file into multiple lines.
If you control the format of the file you are reading, the solution is to add an index.
In fact, this is how databases are able to locate records so quickly.
But for 30MB of data, I think a modern computer with a decent hard drive should do the trick without overcomplicating the program.
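The memory-friendly variant suggested above (read line by line and stop at the first match) can be sketched as follows. This is Python rather than Groovy, and find_record is a hypothetical name; the account number is converted to a string before searching, since a substring test needs text on both sides.

```python
def find_record(path, account_number):
    """Return the first line containing account_number, or None if absent."""
    needle = str(account_number)
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:  # the file object yields one line at a time
            if needle in line:
                return line.rstrip("\n")
    return None
```

Only one line is held in memory at a time, and the loop ends as soon as the record is found, so the matching line is available for extracting the other fields.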

Resources