Removing columns and sorting by Name in finger command - linux

When I use the finger command, it displays Login, Name, Tty, Idle, Login Time, Office, Office Phone, and Host. I just need the information in the Login, Name, Idle, and Login Time columns.
I tried using awk and sed, but they resulted in chart being all over the place (example below).
$ finger | sed -r 's/\S+//3'
Login Name Idle Login Time Office Office Phone Host
user1 Full Name pts/1 20 Feb 3 19:34 (--------------------)
user2 FirstName LastName pts/2 Feb 3 17:04 (--------------)
user3 Name NameName pts/3 1:11 Feb 2 11:37 (-------------------------------)
user4 F Last pts/4 1:09 Feb 13 18:14 (-------------------)
How do I go about removing specific columns while keeping the structure intact?

The problem here is that you cannot extract particular fields based on whitespace separator, because on certain rows the columns might be blank and contain only whitespace, especially the Idle column, which will be blank for sessions with limited idle time. (An additional problem is that the real name field may contain a variable number of spaces.)
So you may have to resort to cut -b ... using hard-coded byte offsets. The following seems to work on my system, as finger seems to use a fixed format output, truncating real names etc as needed, so the byte offsets do not change if the length of the GECOS (real name) field of logged in users is changed.
finger | cut -b 1-20,30-48
Note that it will be inherently fragile if the format of the finger command output were to change in future. You might be able to produce something slightly more robust using regular expression parsing, for example parsing the column headings (first line of finger output) to obtain the byte offsets rather than hard-coding them, but it will still be somewhat fragile. A more robust solution would involve writing your own code to obtain information from the same sources that finger uses, and use that in place of finger. The existing code of an open-source implementation of finger might be a suitable starting point, and then you can adapt it to remove the columns that are not of interest.
Update: building a patched version of finger.
Save this patch as /tmp/patch. It it just a quick-and-dirty patch to suppress certain fields from being printed; they are still calculated.
--- sprint.c~ 2020-06-13 12:27:12.000000000 +0100
+++ sprint.c 2020-06-13 12:32:23.363138500 +0100
## -89,7 +89,7 ##
if (maxlname + maxrname < space-2) { maxlname++; maxrname++; }
(void)xprintf("%-*s %-*s %s\n", maxlname, "Login", maxrname,
- "Name", " Tty Idle Login Time Office Office Phone");
+ "Name", " Idle Login Time");
for (cnt = 0; cnt < entries; ++cnt) {
pn = list[cnt];
for (w = pn->whead; w != NULL; w = w->next) {
## -100,12 +100,6 ##
(void)xprintf(" * * No logins ");
goto office;
}
- (void)xputc(w->info == LOGGEDIN && !w->writable ?
- '*' : ' ');
- if (*w->tty)
- (void)xprintf("%-7.7s ", w->tty);
- else
- (void)xprintf(" ");
if (w->info == LOGGEDIN) {
stimeprint(w);
(void)xprintf(" ");
## -118,17 +112,6 ##
else
(void)xprintf(" %.5s", p + 11);
office:
- if (w->host[0] != '\0') {
- xprintf(" (%s)", w->host);
- } else {
- if (pn->office)
- (void)xprintf(" %-10.10s", pn->office);
- else if (pn->officephone)
- (void)xprintf(" %-10.10s", " ");
- if (pn->officephone)
- (void)xprintf(" %-.14s",
- prphone(pn->officephone));
- }
xputc('\n');
}
}
Then obtain the source code, patch it and build it. (Change destdir as required.)
apt-get source finger
cd bsd-finger-0.17/
pushd finger
patch -p0 < /tmp/patch
popd
destdir=/tmp/finger
mkdir -p $destdir/man/man8 $destdir/sbin $destdir/bin
./configure --prefix=$destdir
make
make install
And run it...
$destdir/bin/finger

Basically, to treat columns, awk is the way to go,
ex: remove third column
finger | awk '{$3="";print}'

Another way: If you found this informations, they have to be wrote somewhere in the system. Using who, awk and cut :
The informations can be gathered by getent passwd.
Created a test user with adduser :
# adduser foobar
Adding user `foobar' ...
Adding new group `foobar' (1001) ...
Adding new user `foobar' (1001) with group `foobar' ...
Creating home directory `/home/foobar' ...
Copying files from `/etc/skel' ...
New password:
Retype new password:
passwd: password updated successfully
Changing the user information for foobar
Enter the new value, or press ENTER for the default
Full Name []: Jean-Charles De la tour
Room Number []: 42
Work Phone []: +33140000000
Home Phone []: +33141000000
Other []: sysadmin
Is the information correct? [Y/n] Y
And the new line in /etc/passwd file:
foobar:x:1001:1001:Jean-Charles De la tour,42,+33140000000,+33141000000,sysadmin:/home/foobar:/bin/bash
So it's easy to retrieve in formations from this:
for u in $(who | cut -d' ' -f1); do # iterate over connected users
getent passwd | awk -F'[:,]' -v OFS='\n' -v u="$u" '$1==u{print "user: "$1, "full name: "$5, "room: "$6, "work phone : "$7, "home phone: "$8, "other: "$9}'
done
Just make sure you have , in $5 column.
Output
user: foobar
full name: Jean-Charles De la tour
room: 42
work phone : +33140000000
home phone: +33141000000
other: sysadmin

Related

String variable overwrites instead of concatenating in for-loop

Context
I am trying to write a little awk program to analyze my PokerStars hand history. Hand histories are stored in text files and have the following format:
PokerStars Hand #225343166937: Hold'em No Limit ($0.01/$0.02 USD) - 2021/03/30 16:14:07 ET
Table 'Pippa V' 6-max Seat #2 is the button
Seat 2: user1 ($2.12 in chips)
Seat 3: user2 ($2.28 in chips)
Seat 4: me ($2 in chips)
Seat 5: user3 ($1.95 in chips)
Seat 6: user4 ($2.06 in chips)
user2: posts small blind $0.01
me: posts big blind $0.02
*** HOLE CARDS ***
Dealt to me [7d 9c]
user3: folds
user4: folds
user1: raises $0.04 to $0.06
user2: folds
me: folds
Uncalled bet ($0.04) returned to user1
user1 collected $0.05 from pot
user1: doesn't show hand
*** SUMMARY ***
Total pot $0.05 | Rake $0
Seat 2: user1 (button) collected ($0.05)
Seat 3: user2 (small blind) folded before Flop
Seat 4: me (big blind) folded before Flop
Seat 5: user3 folded before Flop (didn't bet)
Seat 6: user4 folded before Flop (didn't bet)
PokerStars Hand #225343172788: Hold'em No Limit ($0.01/$0.02 USD) - 2021/03/30 16:14:17 ET
Table 'Pippa V' 6-max Seat #3 is the button
Seat 2: user1 ($2.15 in chips)
...
(Usernames have been changed to respect the players' privacy)
Each record (=hand) is seperated by three line breaks. I came as far as to seperate the hands into records, then loop over each line to save the relevant data into variables and print them. My little awk program looks like this:
BEGIN{
RS="\n\r\n\r\n\r\n";
FS="\n";
OFS=",";
print "Hand ID,Game Type,Time,Holecards";
}
{
for (i = 1; i <= NF; i++)
{
if ($i ~ /^PokerStars Hand/)
{
split($i, aHand, " ");
handID = aHand[3];
gameType = aHand[5]" "aHand[6]" "aHand[7]" "aHand[8];
dateTime = aHand[10]" "aHand[11]" "aHand[12];
}
if ($i ~ /^Dealt to /)
{
split($i, aHoleCards, " ");
holeCards = aHoleCards[4]" "aHoleCards[5];
}
}
print(handID, gameType, dateTime, holeCards);
#printf("%s, %s, %s, %s\n", handID, gameType, dateTime, holeCards); # Same problem here
}
The problem
The output I am expecting to get (for the first hand) is:
Hand ID,Game Type,Time,Holecards
#225343172788:,No Limit ($0.01/$0.02 USD),2021/03/30 16:14:17 ET,[7d 9c]
However, the output is different. For the first record, the variables handID, gameType, and dateTime seem to be empty whereas the holeCards get printed. The other variables then show up on the second line but get somehow "overwritten" by the holeCards variable of the second record:
Hand ID,Game Type,Time,Holecards
,,,[7d 9c]
,[Kd As]72788:,No Limit ($0.01/$0.02 USD),2021/03/30 16:14:17 ET
I hope my description isn't too confusing. I'm very confused myself with the result. I tried using printf instead of print but the result is the same. I suspect I'm mussing something simple here.

Is there any command to do fuzzy matching in Linux based on multiple columns

I have two csv file.
File 1
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0
File 2
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018
What I want to do is to use the crosswalk file, File 2, to back out those observations' PID in File 1 based on Columns FNAME,MNAME,LNAME,GENDER, and DOB. Because the corresponding information in observations of File 1 is not complete, I'm thinking of using fuzzy matching to back out their PID as many as possible (of course the level accuracy should be taken into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0" and Marc,M,Robert,M,,201211.0 either "S0" or "S1".
Since I want to compensate File 1's PID as many as possible while keeping high accuracy, I consider three steps. First, use command to make sure that if and only if those information in FNAME,MNAME,LNAME,GENDER, and DOB are all completely matched, observations in File 1 can be assigned a PID. The output should be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,
Next, write another command to guarantee that while DOB information are completely same, use fuzzy matching for FNAME,MNAME,LNAME,GENDER to back out File 1's observations' PID, which is not identified in the first step. So the output through these two steps is supposed to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
In the final step, use a new command to do fuzzy matching for all related columns, namely FNAME,MNAME,LNAME,GENDER, and DOB to compensate the remained observations' PID. So the final output is expected to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
I need to keep the order of File 1's observations so it must be kind of leftouter join. Because my original data size is about 100Gb, I want to use Linux to deal with my issue.
But I have no idea how to complete the last two steps through awk or any other command in Linux. Is there anyone who can give me a favor? Thank you.
Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes the file2's field values per field and attaches the PID to the value, like field[2]["66M"]="S2" and for each record in file1 counts the amounts of PID matches and prints the one with the biggest count:
BEGIN {
FS=OFS=","
PROCINFO["sorted_in"]="#val_num_desc"
}
NR==FNR { # file2
for(i=1;i<=6;i++) # fields 1-6
if($i!="") {
field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
}
next
}
{ # file1
for(i=1;i<=6;i++) { # fields 1-6
if($i in field[i]) { # if value matches
split(field[i][$i],t,FS) # get PIDs
for(j in t) { # and
matches[t[j]]++ # increase PID counts
}
} else { # if no value match
for(j in field[i]) # for all field values
if($i~j || j~$i) # "go fuzzy" :D
matches[field[i][j]]+=0.5 # fuzzy is half a match
}
}
for(i in matches) { # the best match first
print $0,i
delete matches
break # we only want the best match
}
}
Output:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
The "fuzzy match" here is naivistic if($i~j || j~$i) but feel free to replace it with any approximate matching algorithm, for example there are a few implementations of the Levenshtein distance algorithms floating in the internets. Rosetta seems to have one.
You didn't mention how big file2 is but if it's way beyond your memory capasity, you may want to consider spliting the files somehow.
Update: A version that maps file1 fields to file2 fields (as mentioned in comments):
BEGIN {
FS=OFS=","
PROCINFO["sorted_in"]="#val_num_desc"
map[1]=1 # map file1 fields to file2 fields
map[2]=3
map[3]=4
map[4]=2
map[5]=5
map[7]=6
}
NR==FNR { # file2
for(i=1;i<=6;i++) # fields 1-6
if($i!="") {
field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
}
next
}
{ # file1
for(i in map) {
if($i in field[map[i]]) { # if value matches
split(field[map[i]][$i],t,FS) # get PIDs
for(j in t) { # and
matches[t[j]]++ # increase PID counts
}
} else { # if no value match
for(j in field[map[i]]) # for all field values
if($i~j || j~$i) # "go fuzzy" :D
matches[field[map[i]][j]]+=0.5 # fuzzy is half a match
}
}
for(i in matches) { # the best match first
print $0,i
delete matches
break # we only want the best match
}
}

Force lshosts command to return megabytes for "maxmem" and "maxswp" parameters

When I type "lshosts" I am given:
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
server1 X86_64 Intel_EM 60.0 12 191.9G 159.7G Yes ()
server2 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
server3 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
I am trying to return maxmem and maxswp as megabytes, not gigabytes when lshosts is called. I am trying to send Xilinx ISE jobs to my LSF, however the software expects integer, megabyte values for maxmem and maxswp. By doing debugging, it appears that the software grabs these parameters using the lshosts command.
I have already checked in my lsf.conf file that:
LSF_UNIT_FOR_LIMTS=MB
I have tried searching the IBM Knowledge Base, but to no avail.
Do you use a specific command to specify maxmem and maxswp units within the lsf.conf, lsf.shared, or other config files?
Or does LSF force return the most practical unit?
Any way to override this?
LSF_UNIT_FOR_LIMITS should work, if you completely drained the cluster of all running, pending, and finished jobs. According to the docs, MB is the default, so I'm surprised.
That said, you can use something like this to transform the results:
$ cat to_mb.awk
function to_mb(s) {
e = index("KMG", substr(s, length(s)))
m = substr(s, 0, length(s) - 1)
return m * 10^((e-2) * 3)
}
{ print $1 " " to_mb($6) " " to_mb($7) }
$ lshosts | tail -n +2 | awk -f to_mb.awk
server1 191900 159700
server2 191900 191200
server3 191900 191200
The to_mb function should also handle 'K' or 'M' units, should those pop up.
If LSF_UNIT_FOR_LIMITS is defined in lsf.conf, lshosts will always print the output as a floating point number, and in some versions of LSF the parameter is defined as 'KB' in lsf.conf upon installation.
Try searching for any definitions of the parameter in lsf.conf and commenting them all out so that the parameter is left undefined, I think in that case it defaults to printing it out as an integer in megabytes.
(Don't ask me why it works this way)

awk-insert row with specific text within specific position

I have a file where the first couple of rows start with # mark, then follow the classical netlist, where also can be there rows begin with # mark. I need to insert one row with text protect between block of first rows begining on # and first row of classical netlist. In the end of file i need insert row with word unprotect. It will be good to save this modified text to new file with specific name because of the original file protected.
Sample file:
// Generated for: spectre
// Design library name: Kovi
// Design cell name: T_Line
// Design view name: schematic
simulator lang=spectre
global 0
parameters frequency=3.8G Zo=250
// Library name: Kovi
// Cell name: T_Line
// View name: schematic
T8 (7 0 6 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T7 (net034 0 net062 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T5 (net021 0 4 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T4 (net019 0 2 0) tline z0=Zo f=3.8G nl=0.5 vel=1
How about sed
sed -e '/^#/,/^#/!iprotect'$'\n''$aunprotect'$'\n' input_file > new_file
Inserts 'protect' on a line by itself after the first block of commented lines, then adds 'unprotect' at the end.
Note: Because I use $'\n' in place of literal newline bash is assumed as the shell.
Since you awk'd the post
awk 'BEGIN{ protected=""} { if($0 !~ /#/ && !protected){ protected="1"; print "protect";} print $0}END{print "unprotect";}' input_file > output_file
As soon a row is detected without # as the first non-whitespace character, it will output a line with protect. At the end it will output a line for unprotect.
Test file
#
#
#
#Preceded by a tab
begin protect
#
before unprotect
Result
#
#
#
#Preceded by tab
protect
begin protect
#
before unprotect
unprotect
Edit:
Removed the [:space:]* as it seems that is already handled by default.
Support //
If you wanted to support both # and // in the same script, the regex portion would change to /#|\//. The special character / has to be escaped by using \.
This would check for at least one /.
Adding a quantifier {2} will match // exactly: /#|\/{2}/

Search in directory of files based on keywords from another file

Perl Newbie here and looking for some help.
I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type.
For example:
Keywords.txt
Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk
For each file in the directory, I have to:
lookup the keywords.txt
search based on Attribute type
something like the below.
IF attribute_type = boolean THEN
search for attribute;
set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
extract string where attribute is Found
ELSIF attribute_type = chunk THEN
extract the complete chunk of paragraph where attribute is found.
This is what I have so far and I'm sure there is a more efficient way to do this.
I'm hoping someone can guide me in the right direction to do the above.
Thanks & regards,
SiMa
# Reads attributes from config file
# First set boolean attributes. IF keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.
use strict;
use warnings;
# open Doc
open(DOC_FILE,'Final_CLP.txt');
while(<DOC_FILE>) {
chomp;
# open the file
open(FILE,'attribute_config.txt');
while (<FILE>) {
chomp;
($attribute,$attribute_type) = split("\t");
$is_boolean = ($attribute_type eq "boolean") ? "N" : "Y";
# For each boolean attribute, check if the keyword exists
# in the file and return Y or N
if ($is_boolean eq "Y") {
print "Yes\n";
# search for keyword in doc and assign values
}
print "Attribute: $attribute\n";
print "Attribute_Type: $attribute_type\n";
print "is_boolean: $is_boolean\n";
print "-----------\n";
}
close(FILE);
}
close(DOC_FILE);
exit;
It is a good idea to start your specs/question with a story ("I have a ..."). But
such a story - whether true or made up, because you can't disclose the truth -
should give
a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used)terms
So I'd start with: I'm working in a prison and have to scan the emails
of the inmates for
names (like "Al Capone") mentioned anywhere in the text; the director
wants to read those mails in toto
order lines (like "weapon: AK 4711 quantity: 14"); the ordnance
officer wants those info to calculate the amount of ammunition and
rack space needed
paragraphs containing 'family'-keywords like "wife", "child", ...;
the parson wants to prepare her sermons efficiently
Taken for itself, each of the terms "keyword" (~running text) and
"attribute" (~structured text) of may be 'clear', but if both are applied
to "the X I have to search for", things get mushy. Instead of general ("chunk")
and technical ("string") terms, you should use 'real-world' (line) and
specific (paragraph) words. Samples of your input:
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
and your expected output:
--- Robin.txt ----
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife:
knife: Bowie quantity: 8
machine gun:
stinger rocket:
weapon:
weapon: AK 4711 quantity: 14
social relations paragaphs:
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Pseudo code should begin at the top level. If you start with
for each file in folder
load search list
process current file('s content) using search list
it's obvious that
load search list
for each file in folder
process current file using search list
would be much better.
Based on this story, examples, and top level plan, I would try to come
up with proof of concept code for a simplified version of the "process
current file('s content) using search list" task:
given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
print boolean item text
if found anywhere in whole text
print "Yes"
else
print "No"
print "order line:"
for each line item
print line item text
if found anywhere in whole text
print whole line
print "social relations paragaphs:"
for each paragraph
for each social relation item
if found
print paragraph
no need to check for other items
first implementation attempt:
use Modern::Perl;
#use English qw(-no_match_vars);
use English;
exit step_00();
sub step_00 {
# given file/text to search in
my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
EOT
# print file name
say "--- Robin.txt ---";
# print "keywords:"
say "keywords:";
# for each boolean item
for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
# print boolean item text
printf " %s: ", $bi;
# if found anywhere in whole text
if ($whole_text =~ /$bi/) {
# print "Yes"
say "Yes";
# else
} else {
# print "No"
say "No";
}
}
# print "order line:"
say "order lines:";
# for each line item
for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
# print line item text
# if found anywhere in whole text
if ($whole_text =~ /^$li.*$/m) {
# print whole line
say " ", $MATCH;
}
}
# print "social relations paragaphs:"
say "social relations paragaphs:";
# for each paragraph
for my $para (split /\n\n/, $whole_text) {
# for each social relation item
for my $sr ("wife", "son", "husband") {
# if found
if ($para =~ /$sr/) {
## if ($para =~ /\b$sr\b/) {
# print paragraph
say $para;
# no need to check for other items
last;
}
}
}
return 0;
}
output:
perl 16953439.pl
--- Robin.txt ---
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife: Bowie quantity: 8
weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Such (premature) code helps you to
clarify your specs (Should not-found keywords go into the output?
Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line
search be done on the array of lines of thw whole text?)
identify topics for further research/rtfm (eg. regex (prison!))
plan your next steps (folder loop, read input file)
(in addition, people in the know will point out all my bad practices,
so you can avoid them from the start)
Good luck!

Resources