String variable overwrites instead of concatenating in for-loop - linux

Context
I am trying to write a little awk program to analyze my PokerStars hand history. Hand histories are stored in text files and have the following format:
PokerStars Hand #225343166937: Hold'em No Limit ($0.01/$0.02 USD) - 2021/03/30 16:14:07 ET
Table 'Pippa V' 6-max Seat #2 is the button
Seat 2: user1 ($2.12 in chips)
Seat 3: user2 ($2.28 in chips)
Seat 4: me ($2 in chips)
Seat 5: user3 ($1.95 in chips)
Seat 6: user4 ($2.06 in chips)
user2: posts small blind $0.01
me: posts big blind $0.02
*** HOLE CARDS ***
Dealt to me [7d 9c]
user3: folds
user4: folds
user1: raises $0.04 to $0.06
user2: folds
me: folds
Uncalled bet ($0.04) returned to user1
user1 collected $0.05 from pot
user1: doesn't show hand
*** SUMMARY ***
Total pot $0.05 | Rake $0
Seat 2: user1 (button) collected ($0.05)
Seat 3: user2 (small blind) folded before Flop
Seat 4: me (big blind) folded before Flop
Seat 5: user3 folded before Flop (didn't bet)
Seat 6: user4 folded before Flop (didn't bet)
PokerStars Hand #225343172788: Hold'em No Limit ($0.01/$0.02 USD) - 2021/03/30 16:14:17 ET
Table 'Pippa V' 6-max Seat #3 is the button
Seat 2: user1 ($2.15 in chips)
...
(Usernames have been changed to respect the players' privacy)
Each record (= hand) is separated by three line breaks. I got as far as separating the hands into records, then looping over each line to save the relevant data into variables and print them. My little awk program looks like this:
BEGIN {
    RS = "\n\r\n\r\n\r\n";
    FS = "\n";
    OFS = ",";
    print "Hand ID,Game Type,Time,Holecards";
}
{
    for (i = 1; i <= NF; i++)
    {
        if ($i ~ /^PokerStars Hand/)
        {
            split($i, aHand, " ");
            handID = aHand[3];
            gameType = aHand[5] " " aHand[6] " " aHand[7] " " aHand[8];
            dateTime = aHand[10] " " aHand[11] " " aHand[12];
        }
        if ($i ~ /^Dealt to /)
        {
            split($i, aHoleCards, " ");
            holeCards = aHoleCards[4] " " aHoleCards[5];
        }
    }
    print(handID, gameType, dateTime, holeCards);
    #printf("%s, %s, %s, %s\n", handID, gameType, dateTime, holeCards); # Same problem here
}
The problem
The output I am expecting to get (for the first hand) is:
Hand ID,Game Type,Time,Holecards
#225343166937:,No Limit ($0.01/$0.02 USD),2021/03/30 16:14:07 ET,[7d 9c]
However, the output is different. For the first record, the variables handID, gameType, and dateTime seem to be empty, whereas the holeCards get printed. The other variables then show up on the second line but somehow get "overwritten" by the holeCards variable of the second record:
Hand ID,Game Type,Time,Holecards
,,,[7d 9c]
,[Kd As]72788:,No Limit ($0.01/$0.02 USD),2021/03/30 16:14:17 ET
I hope my description isn't too confusing; I'm very confused by the result myself. I tried using printf instead of print, but the result is the same. I suspect I'm missing something simple here.
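For what it's worth, the symptom described above (fields printing as empty, later output jumping back over the start of the line) is exactly what stray carriage returns produce. The RS above already contains \r, so the file apparently uses \r\n line endings; with FS="\n" every field then keeps a trailing \r, and printing that \r sends the terminal cursor back to column 1. A minimal sketch of the usual workaround, assuming GNU awk (which accepts regular expressions for RS and FS):
BEGIN {
    RS = "\r?\n\r?\n\r?\n\r?\n";  # record separator tolerant of CRLF blank lines
    FS = "\r?\n";                 # fields no longer keep a trailing \r
    OFS = ",";
    print "Hand ID,Game Type,Time,Holecards";
}
Alternatively, strip the carriage returns before awk ever sees them, e.g. tr -d '\r' < history.txt | awk -f hands.awk (history.txt and hands.awk are placeholder names).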

Related

How to resolve pandas length error for rows/columns

I have raised the SO question here and was lucky to get an answer from Scott Boston.
However, I am raising another question about the error ValueError: Columns must be same length as key. I am reading a text file whose rows/columns are not all the same length. I tried googling but did not find an answer, as I don't want the longer rows to be skipped.
Error
b'Skipping line 2: expected 13 fields, saw 14\nSkipping line 5: expected 13 fields, saw 14\nSkipping line 6: expected 13 fields, saw 16\nSkipping line 7: expected 13 fields, saw 14\nSkipping line 8: expected 13 fields, saw 15\nSkipping line 9: expected 13 fields, saw 14\nSkipping line 20: expected 13 fields, saw 19\nSkipping line 21: expected 13 fields, saw 16\nSkipping line 23: expected 13 fields, saw 14\nSkipping line 24: expected 13 fields, saw 16\nSkipping line 27: expected 13 fields, saw 14\n'
My pandas dataframe generator
#!/usr/bin/python3
import pandas as pd
#
cvc_file = pd.read_csv('kids_cvc',header=None,error_bad_lines=False)
cvc_file[['cols', 0]] = cvc_file[0].str.split(':', expand=True) #Split first column on ':'
df = cvc_file.set_index('cols').transpose() #set_index and transpose
print(df)
Result
$ ./read_cvc.py
b'Skipping line 2: expected 13 fields, saw 14\nSkipping line 5: expected 13 fields, saw 14\nSkipping line 6: expected 13 fields, saw 16\nSkipping line 7: expected 13 fields, saw 14\nSkipping line 8: expected 13 fields, saw 15\nSkipping line 9: expected 13 fields, saw 14\nSkipping line 20: expected 13 fields, saw 19\nSkipping line 21: expected 13 fields, saw 16\nSkipping line 23: expected 13 fields, saw 14\nSkipping line 24: expected 13 fields, saw 16\nSkipping line 27: expected 13 fields, saw 14\n'
cols ab ad an ed eg et en eck ell it id ig im ish ob og ock ut ub ug um un ud uck ush
0 cab bad ban bed beg bet den beck bell bit bid big dim fish cob bog dock but cub bug bum bun bud buck gush
1 dab dad can fed keg get hen deck cell fit did dig him dish gob cog lock cut hub dug gum fun cud duck hush
2 gab had fan led leg jet men neck dell hit hid fig rim wish job dog rock gut nub hug hum gun dud luck lush
3 jab lad man red peg let pen peck jell kit kid gig brim swish lob fog sock hut rub jug mum nun mud muck mush
4 lab mad pan wed NaN met ten check sell lit lid jig grim NaN mob hog tock jut sub lug sum pun spud puck rush
5 nab pad ran bled NaN net then fleck tell pit rid pig skim NaN rob jog block nut tub mug chum run stud suck blush
File contents
$ cat kids_cvc
ab: cab, dab, gab, jab, lab, nab, tab, blab, crab, grab, scab, stab, slab
at: bat, cat, fat, hat, mat, pat, rat, sat, vat, brat, chat, flat, gnat, spat
ad: bad, dad, had, lad, mad, pad, sad, tad, glad
an: ban, can, fan, man, pan, ran, tan, van, clan, plan, scan, than
ag: bag, gag, hag, lag, nag, rag, sag, tag, wag, brag, drag, flag, snag, stag
ap: cap, gap, lap, map, nap, rap, sap, tap, yap, zap, chap, clap, flap, slap, snap, trap
am: bam, dam, ham, jam, ram, yam, clam, cram, scam, slam, spam, swam, tram, wham
ack: back, hack, jack, lack, pack, rack, sack, tack, black, crack, shack, snack, stack, quack, track
ash: bash, cash, dash, gash, hash, lash, mash, rash, sash, clash, crash, flash, slash, smash
ed: bed, fed, led, red, wed, bled, bred, fled, pled, sled, shed
eg: beg, keg, leg, peg
et: bet, get, jet, let, met, net, pet, set, vet, wet, yet, fret
en: den, hen, men, pen, ten, then, when
eck: beck, deck, neck, peck, check, fleck, speck, wreck
ell: bell, cell, dell, jell, sell, tell, well, yell, dwell, shell, smell, spell, swell
it: bit, fit, hit, kit, lit, pit, sit, wit, knit, quit, slit, spit
id: bid, did, hid, kid, lid, rid, skid, slid
ig: big, dig, fig, gig, jig, pig, rig, wig, zig, twig
im: dim, him, rim, brim, grim, skim, slim, swim, trim, whim
ip: dip, hip, lip, nip, rip, sip, tip, zip, chip, clip, drip, flip, grip, ship, skip, slip, snip, trip, whip
ick: kick, lick, nick, pick, sick, tick, wick, brick, chick, click, flick, quick, slick, stick, thick, trick
ish: fish, dish, wish, swish
in: bin, din, fin, pin, sin, tin, win, chin, grin, shin, skin, spin, thin, twin
ot: cot, dot, got, hot, jot, lot, not, pot, rot, tot, blot, knot, plot, shot, slot, spot
ob: cob, gob, job, lob, mob, rob, sob, blob, glob, knob, slob, snob
og: bog, cog, dog, fog, hog, jog, log, blog, clog, frog
op: cop, hop, mop, pop, top, chop, crop, drop, flop, glop, plop, shop, slop, stop
ock: dock, lock, rock, sock, tock, block, clock, flock, rock, shock, smock, stock
ut: but, cut, gut, hut, jut, nut, rut, shut
ub: cub, hub, nub, rub, sub, tub, grub, snub, stub
ug: bug, dug, hug, jug, lug, mug, pug, rug, tug, drug, plug, slug, snug
um: bum, gum, hum, mum, sum, chum, drum, glum, plum, scum, slum
un: bun, fun, gun, nun, pun, run, sun, spun, stun
ud: bud, cud, dud, mud, spud, stud, thud
uck: buck, duck, luck, muck, puck, suck, tuck, yuck, chuck, cluck, pluck, stuck, truck
ush: gush, hush, lush, mush, rush, blush, brush, crush, flush, slush
Note:
It treats the first row, which has 13 values, as the master and skips every row that has more than 13 columns.
I couldn't figure out a pandas way to extend the columns, but converting the rows to a dictionary made things easier.
ss = '''
ab: cab, dab, gab, jab, lab, nab, tab, blab, crab, grab, scab, stab, slab
at: bat, cat, fat, hat, mat, pat, rat, sat, vat, brat, chat, flat, gnat, spat
ad: bad, dad, had, lad, mad, pad, sad, tad, glad
.......
un: bun, fun, gun, nun, pun, run, sun, spun, stun
ud: bud, cud, dud, mud, spud, stud, thud
uck: buck, duck, luck, muck, puck, suck, tuck, yuck, chuck, cluck, pluck, stuck, truck
ush: gush, hush, lush, mush, rush, blush, brush, crush, flush, slush
'''.strip()

with open('kids.cvc', 'w') as f:
    f.write(ss)  # write data file

######################################
import pandas as pd

dd = {}
maxcnt = 0
with open('kids.cvc') as f:
    lines = f.readlines()

for line in lines:
    line = line.strip()             # remove \n
    len1 = len(line)                # words have leading space
    line = line.replace(' ', '')
    cnt = len1 - len(line)          # get word (space) count
    if cnt > maxcnt: maxcnt = cnt   # max word count
    rec = line.split(':')           # header : words
    dd[rec[0]] = rec[1].split(',')  # split words

for k in dd:
    dd[k] = dd[k] + [''] * (maxcnt - len(dd[k]))  # add extra values to match max column count

df = pd.DataFrame(dd)  # convert dictionary to dataframe
print(df.to_string(index=False))
Output
ab at ad an ag ap am ack ash ed eg et en eck ell it id ig im ip ick ish in ot ob og op ock ut ub ug um un ud uck ush
cab bat bad ban bag cap bam back bash bed beg bet den beck bell bit bid big dim dip kick fish bin cot cob bog cop dock but cub bug bum bun bud buck gush
dab cat dad can gag gap dam hack cash fed keg get hen deck cell fit did dig him hip lick dish din dot gob cog hop lock cut hub dug gum fun cud duck hush
gab fat had fan hag lap ham jack dash led leg jet men neck dell hit hid fig rim lip nick wish fin got job dog mop rock gut nub hug hum gun dud luck lush
jab hat lad man lag map jam lack gash red peg let pen peck jell kit kid gig brim nip pick swish pin hot lob fog pop sock hut rub jug mum nun mud muck mush
lab mat mad pan nag nap ram pack hash wed met ten check sell lit lid jig grim rip sick sin jot mob hog top tock jut sub lug sum pun spud puck rush
nab pat pad ran rag rap yam rack lash bled net then fleck tell pit rid pig skim sip tick tin lot rob jog chop block nut tub mug chum run stud suck blush
tab rat sad tan sag sap clam sack mash bred pet when speck well sit skid rig slim tip wick win not sob log crop clock rut grub pug drum sun thud tuck brush
blab sat tad van tag tap cram tack rash fled set wreck yell wit slid wig swim zip brick chin pot blob blog drop flock shut snub rug glum spun yuck crush
crab vat glad clan wag yap scam black sash pled vet dwell knit zig trim chip chick grin rot glob clog flop rock stub tug plum stun chuck flush
grab brat plan brag zap slam crack clash sled wet shell quit twig whim clip click shin tot knob frog glop shock drug scum cluck slush
scab chat scan drag chap spam shack crash shed yet smell slit drip flick skin blot slob plop smock plug slum pluck
stab flat than flag clap swam snack flash fret spell spit flip quick spin knot snob shop stock slug stuck
slab gnat snag flap tram stack slash swell grip slick thin plot slop snug truck
spat stag slap wham quack smash ship stick twin shot stop
snap track skip thick slot
trap slip trick spot
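Incidentally, a pandas-native way to avoid the manual padding (a sketch of an alternative, not part of the answer above): build the frame from a dict of Series, since pandas pads ragged Series columns with NaN automatically:
import pandas as pd

dd = {}
with open('kids.cvc') as f:          # same file as above
    for line in f:
        key, _, words = line.partition(':')
        dd[key.strip()] = pd.Series([w.strip() for w in words.split(',')])

df = pd.DataFrame(dd)                # ragged columns are padded with NaN
print(df.to_string(index=False))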

Removing columns and sorting by Name in finger command

When I use the finger command, it displays Login, Name, Tty, Idle, Login Time, Office, Office Phone, and Host. I just need the information in the Login, Name, Idle, and Login Time columns.
I tried using awk and sed, but the resulting chart was all over the place (example below).
$ finger | sed -r 's/\S+//3'
Login Name Idle Login Time Office Office Phone Host
user1 Full Name pts/1 20 Feb 3 19:34 (--------------------)
user2 FirstName LastName pts/2 Feb 3 17:04 (--------------)
user3 Name NameName pts/3 1:11 Feb 2 11:37 (-------------------------------)
user4 F Last pts/4 1:09 Feb 13 18:14 (-------------------)
How do I go about removing specific columns while keeping the structure intact?
The problem here is that you cannot extract particular fields based on a whitespace separator, because on certain rows some columns may be blank and contain only whitespace, especially the Idle column, which will be blank for sessions with little idle time. (An additional problem is that the real name field may contain a variable number of spaces.)
So you may have to resort to cut -b ... using hard-coded byte offsets. The following seems to work on my system, as finger appears to use fixed-format output, truncating real names etc. as needed, so the byte offsets do not change when the length of the GECOS (real name) field of logged-in users changes.
finger | cut -b 1-20,30-48
Note that it will be inherently fragile if the format of the finger command output were to change in future. You might be able to produce something slightly more robust using regular expression parsing, for example parsing the column headings (first line of finger output) to obtain the byte offsets rather than hard-coding them, but it will still be somewhat fragile. A more robust solution would involve writing your own code to obtain information from the same sources that finger uses, and use that in place of finger. The existing code of an open-source implementation of finger might be a suitable starting point, and then you can adapt it to remove the columns that are not of interest.
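A minimal sketch of that header-parsing idea (my illustration, assuming the labels Login, Tty, Idle and Office appear verbatim on the first line of finger output):
finger | awk '
NR == 1 {
    # derive the byte offsets from the header labels instead of hard-coding them
    tty    = index($0, "Tty")
    idle   = index($0, "Idle")
    office = index($0, "Office")
}
{
    # keep Login + Name (before Tty) and Idle + Login Time (before Office)
    print substr($0, 1, tty - 1) substr($0, idle, office - idle)
}'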
Update: building a patched version of finger.
Save this patch as /tmp/patch. It is just a quick-and-dirty patch to suppress certain fields from being printed; they are still calculated.
--- sprint.c~ 2020-06-13 12:27:12.000000000 +0100
+++ sprint.c 2020-06-13 12:32:23.363138500 +0100
@@ -89,7 +89,7 @@
if (maxlname + maxrname < space-2) { maxlname++; maxrname++; }
(void)xprintf("%-*s %-*s %s\n", maxlname, "Login", maxrname,
- "Name", " Tty Idle Login Time Office Office Phone");
+ "Name", " Idle Login Time");
for (cnt = 0; cnt < entries; ++cnt) {
pn = list[cnt];
for (w = pn->whead; w != NULL; w = w->next) {
@@ -100,12 +100,6 @@
(void)xprintf(" * * No logins ");
goto office;
}
- (void)xputc(w->info == LOGGEDIN && !w->writable ?
- '*' : ' ');
- if (*w->tty)
- (void)xprintf("%-7.7s ", w->tty);
- else
- (void)xprintf(" ");
if (w->info == LOGGEDIN) {
stimeprint(w);
(void)xprintf(" ");
@@ -118,17 +112,6 @@
else
(void)xprintf(" %.5s", p + 11);
office:
- if (w->host[0] != '\0') {
- xprintf(" (%s)", w->host);
- } else {
- if (pn->office)
- (void)xprintf(" %-10.10s", pn->office);
- else if (pn->officephone)
- (void)xprintf(" %-10.10s", " ");
- if (pn->officephone)
- (void)xprintf(" %-.14s",
- prphone(pn->officephone));
- }
xputc('\n');
}
}
Then obtain the source code, patch it and build it. (Change destdir as required.)
apt-get source finger
cd bsd-finger-0.17/
pushd finger
patch -p0 < /tmp/patch
popd
destdir=/tmp/finger
mkdir -p $destdir/man/man8 $destdir/sbin $destdir/bin
./configure --prefix=$destdir
make
make install
And run it...
$destdir/bin/finger
Basically, to manipulate columns, awk is the way to go.
For example, to remove the third column:
finger | awk '{$3="";print}'
Note that assigning to a field makes awk rebuild the line with single spaces between fields, so the original column alignment is lost.
Another way: if finger can display this information, it has to be stored somewhere in the system. Using who, awk and cut:
The information can be gathered from getent passwd.
Created a test user with adduser :
# adduser foobar
Adding user `foobar' ...
Adding new group `foobar' (1001) ...
Adding new user `foobar' (1001) with group `foobar' ...
Creating home directory `/home/foobar' ...
Copying files from `/etc/skel' ...
New password:
Retype new password:
passwd: password updated successfully
Changing the user information for foobar
Enter the new value, or press ENTER for the default
Full Name []: Jean-Charles De la tour
Room Number []: 42
Work Phone []: +33140000000
Home Phone []: +33141000000
Other []: sysadmin
Is the information correct? [Y/n] Y
And the new line in /etc/passwd file:
foobar:x:1001:1001:Jean-Charles De la tour,42,+33140000000,+33141000000,sysadmin:/home/foobar:/bin/bash
So it's easy to retrieve information from this:
for u in $(who | cut -d' ' -f1); do  # iterate over connected users
    getent passwd | awk -F'[:,]' -v OFS='\n' -v u="$u" '$1==u{print "user: "$1, "full name: "$5, "room: "$6, "work phone : "$7, "home phone: "$8, "other: "$9}'
done
Just make sure you have commas in the $5 (GECOS) column.
Output
user: foobar
full name: Jean-Charles De la tour
room: 42
work phone : +33140000000
home phone: +33141000000
other: sysadmin

Is there any command to do fuzzy matching in Linux based on multiple columns

I have two csv file.
File 1
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0
File 2
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018
What I want to do is use the crosswalk file, File 2, to back out the PIDs of the observations in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in File 1's observations is incomplete, I'm thinking of using fuzzy matching to back out as many PIDs as possible (of course, the level of accuracy should be taken into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0", and Marc,M,Robert,M,,201211.0 either "S0" or "S1".
Since I want to fill in as many of File 1's PIDs as possible while keeping high accuracy, I am considering three steps. First, use a command to ensure that an observation in File 1 is assigned a PID if and only if the information in FNAME, MNAME, LNAME, GENDER, and DOB all matches completely (a sketch of this exact-match step follows the listing below). The output should be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,
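(A minimal sketch of this first, exact-match step as an awk join keyed on fields 2-6; file1 and file2 are placeholder names for the two CSVs:
awk -F, -v OFS=, '
NR == FNR { pid[$2 FS $3 FS $4 FS $5 FS $6] = $1; next }  # hash file2: key -> PID
FNR == 1  { print $0, "PID"; next }                       # file1 header row
{ k = $2 FS $3 FS $4 FS $5 FS $6; print $0, (k in pid ? pid[k] : "") }
' file2 file1
The order of File 1 is preserved, and unmatched rows get an empty PID column.)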
Next, write another command that, where the DOB information is exactly the same, uses fuzzy matching on FNAME, MNAME, LNAME, and GENDER to back out the PIDs of File 1's observations that were not identified in the first step. So the output after these two steps is supposed to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
In the final step, use a new command to do fuzzy matching on all the related columns, namely FNAME, MNAME, LNAME, GENDER, and DOB, to fill in the remaining observations' PIDs. So the final output is expected to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
I need to keep the order of File 1's observations, so it must be a kind of left outer join. Because my original data size is about 100 GB, I want to use Linux tools to deal with this.
But I have no idea how to complete the last two steps with awk or any other Linux command. Could anyone help me out? Thank you.
Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's field values per field and attaches the PID to each value, like field[2]["66M"]="S2"; then, for each record in file1, it counts the matches per PID and prints the PID with the biggest count:
BEGIN {
    FS = OFS = ","
    PROCINFO["sorted_in"] = "@val_num_desc"
}
NR==FNR {                                  # file2
    for (i = 1; i <= 6; i++)               # fields 1-6
        if ($i != "") {
            field[i][$i] = field[i][$i] (field[i][$i] == "" ? "" : OFS) $1  # attach PID to value
        }
    next
}
{                                          # file1
    for (i = 1; i <= 6; i++) {             # fields 1-6
        if ($i in field[i]) {              # if value matches
            split(field[i][$i], t, FS)     # get PIDs
            for (j in t) {                 # and
                matches[t[j]]++            # increase PID counts
            }
        } else {                           # if no value match
            for (j in field[i])            # for all field values
                if ($i ~ j || j ~ $i)      # "go fuzzy" :D
                    matches[field[i][j]] += 0.5  # fuzzy is half a match
        }
    }
    for (i in matches) {                   # the best match first
        print $0, i
        delete matches
        break                              # we only want the best match
    }
}
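To run it (a usage sketch; fuzzy.awk, file1.csv and file2.csv are placeholder names), pass file2 first so its values are hashed before file1 is scanned:
gawk -f fuzzy.awk file2.csv file1.csv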
Output:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
The "fuzzy match" here is naivistic if($i~j || j~$i) but feel free to replace it with any approximate matching algorithm, for example there are a few implementations of the Levenshtein distance algorithms floating in the internets. Rosetta seems to have one.
You didn't mention how big file2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.
Update: A version that maps file1 fields to file2 fields (as mentioned in comments):
BEGIN {
    FS = OFS = ","
    PROCINFO["sorted_in"] = "@val_num_desc"
    map[1] = 1                             # map file1 fields to file2 fields
    map[2] = 3
    map[3] = 4
    map[4] = 2
    map[5] = 5
    map[7] = 6
}
NR==FNR {                                  # file2
    for (i = 1; i <= 6; i++)               # fields 1-6
        if ($i != "") {
            field[i][$i] = field[i][$i] (field[i][$i] == "" ? "" : OFS) $1  # attach PID to value
        }
    next
}
{                                          # file1
    for (i in map) {
        if ($i in field[map[i]]) {         # if value matches
            split(field[map[i]][$i], t, FS)  # get PIDs
            for (j in t) {                 # and
                matches[t[j]]++            # increase PID counts
            }
        } else {                           # if no value match
            for (j in field[map[i]])       # for all field values
                if ($i ~ j || j ~ $i)      # "go fuzzy" :D
                    matches[field[map[i]][j]] += 0.5  # fuzzy is half a match
        }
    }
    for (i in matches) {                   # the best match first
        print $0, i
        delete matches
        break                              # we only want the best match
    }
}

Adding an integer after each printed line from dictionaries

I am learning how to program in Python 3 and I am working on a project that lets you buy a ticket to a movie. After that, you can see your shopping cart with all the tickets that you have bought.
Now I want to number each printed line.
For example: 1. Movie1, 2. Movie2, etc.
Here is my code that I use to print the films:
if choice == 3:
    #try:
    print("If you want to see which movies are available, type exit.")
    bilet = str(input("Which movie do you want to see?: ").title())
    pret = films[bilet]["price"]
    cumperi = input("Do you want to add {}$ to your shopping cart (y/n)?".format(bilet)).strip().lower()
    if cumperi == "y":
        bani[0] -= pret
        cos.append(bilet)
if choice == 4:
    print(*cos, sep="\n")
You can use an integer variable and increase its value whenever you print an item.
For example, set count = 0 and, each time you handle an item, do count += 1.
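A minimal sketch of that idea applied to the cart listing above (assuming cos is the list of bought tickets, as in the question); Python's enumerate does the counting for you:
if choice == 4:
    # print the cart numbered: "1. Movie1", "2. Movie2", ...
    for count, bilet in enumerate(cos, start=1):
        print("{}. {}".format(count, bilet))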

Search in directory of files based on keywords from another file

Perl Newbie here and looking for some help.
I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type.
For example:
Keywords.txt
Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk
For each file in the directory, I have to:
look up the keywords.txt
search based on Attribute type
something like the below.
IF attribute_type = boolean THEN
search for attribute;
set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
extract string where attribute is Found
ELSIF attribute_type = chunk THEN
extract the complete chunk of paragraph where attribute is found.
This is what I have so far and I'm sure there is a more efficient way to do this.
I'm hoping someone can guide me in the right direction to do the above.
Thanks & regards,
SiMa
# Reads attributes from config file.
# First set boolean attributes: if keyword is found in text,
# variable flag is set to Y, else N.
# End code: for each text file in directory, loop.
# Run the below for each document.
use strict;
use warnings;

# open doc
open(DOC_FILE, '<', 'Final_CLP.txt') or die "Cannot open doc: $!";
while (my $doc_line = <DOC_FILE>) {
    chomp $doc_line;
    # open the config file
    open(FILE, '<', 'attribute_config.txt') or die "Cannot open config: $!";
    while (<FILE>) {
        chomp;
        my ($attribute, $attribute_type) = split("\t");
        my $is_boolean = ($attribute_type eq "boolean") ? "Y" : "N";
        # For each boolean attribute, check if the keyword exists
        # in the file and return Y or N
        if ($is_boolean eq "Y") {
            print "Yes\n";
            # search for keyword in doc and assign values
        }
        print "Attribute: $attribute\n";
        print "Attribute_Type: $attribute_type\n";
        print "is_boolean: $is_boolean\n";
        print "-----------\n";
    }
    close(FILE);
}
close(DOC_FILE);
exit;
It is a good idea to start your specs/question with a story ("I have a ..."). But
such a story - whether true or made up, because you can't disclose the truth -
should give
a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used) terms
So I'd start with: I'm working in a prison and have to scan the emails
of the inmates for
names (like "Al Capone") mentioned anywhere in the text; the director
wants to read those mails in toto
order lines (like "weapon: AK 4711 quantity: 14"); the ordnance
officer wants that info to calculate the amount of ammunition and
rack space needed
paragraphs containing 'family'-keywords like "wife", "child", ...;
the parson wants to prepare her sermons efficiently
Taken for itself, each of the terms "keyword" (~running text) and
"attribute" (~structured text) of may be 'clear', but if both are applied
to "the X I have to search for", things get mushy. Instead of general ("chunk")
and technical ("string") terms, you should use 'real-world' (line) and
specific (paragraph) words. Samples of your input:
From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin
and your expected output:
--- Robin.txt ----
keywords:
 Al Capone: Yes
 Billy the Kid: No
 Scarface: Yes
order lines:
 knife:
 knife: Bowie quantity: 8
 machine gun:
 stinger rocket:
 weapon:
 weapon: AK 4711 quantity: 14
social relations paragraphs:
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Pseudo code should begin at the top level. If you start with
for each file in folder
    load search list
    process current file('s content) using search list
it's obvious that
load search list
for each file in folder
    process current file using search list
would be much better.
Based on this story, examples, and top level plan, I would try to come
up with proof of concept code for a simplified version of the "process
current file('s content) using search list" task:
given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
    print boolean item text
    if found anywhere in whole text
        print "Yes"
    else
        print "No"
print "order lines:"
for each line item
    print line item text
    if found anywhere in whole text
        print whole line
print "social relations paragraphs:"
for each paragraph
    for each social relation item
        if found
            print paragraph
            no need to check for other items
first implementation attempt:
use Modern::Perl;
#use English qw(-no_match_vars);
use English;

exit step_00();

sub step_00 {
    # given file/text to search in
    my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin
EOT
    # print file name
    say "--- Robin.txt ---";
    # print "keywords:"
    say "keywords:";
    # for each boolean item
    for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
        # print boolean item text
        printf " %s: ", $bi;
        # if found anywhere in whole text
        if ($whole_text =~ /$bi/) {
            # print "Yes"
            say "Yes";
        # else
        } else {
            # print "No"
            say "No";
        }
    }
    # print "order lines:"
    say "order lines:";
    # for each line item
    for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
        # print line item text
        # if found anywhere in whole text
        if ($whole_text =~ /^$li.*$/m) {
            # print whole line
            say " ", $MATCH;
        }
    }
    # print "social relations paragraphs:"
    say "social relations paragraphs:";
    # for each paragraph
    for my $para (split /\n\n/, $whole_text) {
        # for each social relation item
        for my $sr ("wife", "son", "husband") {
            # if found
            if ($para =~ /$sr/) {
            ## if ($para =~ /\b$sr\b/) {
                # print paragraph
                say $para;
                # no need to check for other items
                last;
            }
        }
    }
    return 0;
}
output:
perl 16953439.pl
--- Robin.txt ---
keywords:
 Al Capone: Yes
 Billy the Kid: No
 Scarface: Yes
order lines:
 knife: Bowie quantity: 8
 weapon: AK 4711 quantity: 14
social relations paragraphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Such (premature) code helps you to
clarify your specs (Should not-found keywords go into the output?
Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line
search be done on the array of lines of the whole text?)
identify topics for further research/RTFM (e.g. regex (prison!))
plan your next steps (folder loop, read input file)
(in addition, people in the know will point out all my bad practices,
so you can avoid them from the start)
Good luck!

Resources