Search in directory of files based on keywords from another file - linux

Perl Newbie here and looking for some help.
I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type.
For example:
Keywords.txt
Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk
For each file in the directory, I have to:
lookup the keywords.txt
search based on Attribute type
something like the below.
IF attribute_type = boolean THEN
search for attribute;
set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
extract string where attribute is Found
ELSIF attribute_type = chunk THEN
extract the complete chunk of paragraph where attribute is found.
This is what I have so far and I'm sure there is a more efficient way to do this.
I'm hoping someone can guide me in the right direction to do the above.
Thanks & regards,
SiMa
# Reads attributes from config file
# First set boolean attributes. IF keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.
use strict;
use warnings;
# open Doc
open(DOC_FILE,'Final_CLP.txt');
while(<DOC_FILE>) {
chomp;
# open the file
open(FILE,'attribute_config.txt');
while (<FILE>) {
chomp;
my ($attribute,$attribute_type) = split(/\t/);
my $is_boolean = ($attribute_type eq "boolean") ? "Y" : "N";
# For each boolean attribute, check if the keyword exists
# in the file and return Y or N
if ($is_boolean eq "Y") {
print "Yes\n";
# search for keyword in doc and assign values
}
print "Attribute: $attribute\n";
print "Attribute_Type: $attribute_type\n";
print "is_boolean: $is_boolean\n";
print "-----------\n";
}
close(FILE);
}
close(DOC_FILE);
exit;

It is a good idea to start your specs/question with a story ("I have a ..."). But
such a story - whether true or made up, because you can't disclose the truth -
should give
a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used) terms
So I'd start with: I'm working in a prison and have to scan the emails
of the inmates for
names (like "Al Capone") mentioned anywhere in the text; the director
wants to read those mails in toto
order lines (like "weapon: AK 4711 quantity: 14"); the ordnance
officer wants this info to calculate the amount of ammunition and
rack space needed
paragraphs containing 'family'-keywords like "wife", "child", ...;
the parson wants to prepare her sermons efficiently
Taken by itself, each of the terms "keyword" (~running text) and
"attribute" (~structured text) may be 'clear', but if both are applied
to "the X I have to search for", things get mushy. Instead of general ("chunk")
and technical ("string") terms, you should use 'real-world' (line) and
specific (paragraph) words. Samples of your input:
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
and your expected output:
--- Robin.txt ----
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife:
knife: Bowie quantity: 8
machine gun:
stinger rocket:
weapon:
weapon: AK 4711 quantity: 14
social relations paragaphs:
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Pseudo code should begin at the top level. If you start with
for each file in folder
load search list
process current file('s content) using search list
it's obvious that
load search list
for each file in folder
process current file using search list
would be much better.
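The second ordering can be sketched in Python as follows. This is only an illustration of the plan, not the asker's program: the function names, the "*.txt" glob and the whitespace-separated "name type" format of the keywords file are assumptions, and process() is a placeholder that handles boolean items only.

```python
from pathlib import Path


def load_search_list(path):
    """Parse 'name<whitespace>type' lines into (name, type) pairs (assumed format)."""
    pairs = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            name, kind = line.rsplit(None, 1)  # the type is the last word on the line
            pairs.append((name, kind))
    return pairs


def process(text, search_list):
    """Placeholder for the per-file work: report boolean items only."""
    return {name: name in text for name, kind in search_list if kind == "boolean"}


def scan_folder(folder, keywords_path):
    search_list = load_search_list(keywords_path)  # loaded once, outside the loop
    return {
        path.name: process(path.read_text(), search_list)
        for path in sorted(Path(folder).glob("*.txt"))
    }
```

Calling scan_folder("mails", "keywords.cfg") (both names made up) would return one dictionary of Yes/No flags per file, and the search list is read exactly once.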
Based on this story, examples, and top level plan, I would try to come
up with proof of concept code for a simplified version of the "process
current file('s content) using search list" task:
given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
print boolean item text
if found anywhere in whole text
print "Yes"
else
print "No"
print "order line:"
for each line item
print line item text
if found anywhere in whole text
print whole line
print "social relations paragaphs:"
for each paragraph
for each social relation item
if found
print paragraph
no need to check for other items
first implementation attempt:
use Modern::Perl;
#use English qw(-no_match_vars);
use English;
exit step_00();
sub step_00 {
# given file/text to search in
my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
EOT
# print file name
say "--- Robin.txt ---";
# print "keywords:"
say "keywords:";
# for each boolean item
for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
# print boolean item text
printf " %s: ", $bi;
# if found anywhere in whole text
if ($whole_text =~ /$bi/) {
# print "Yes"
say "Yes";
# else
} else {
# print "No"
say "No";
}
}
# print "order line:"
say "order lines:";
# for each line item
for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
# print line item text
# if found anywhere in whole text
if ($whole_text =~ /^$li.*$/m) {
# print whole line
say " ", $MATCH;
}
}
# print "social relations paragaphs:"
say "social relations paragaphs:";
# for each paragraph
for my $para (split /\n\n/, $whole_text) {
# for each social relation item
for my $sr ("wife", "son", "husband") {
# if found
if ($para =~ /$sr/) {
## if ($para =~ /\b$sr\b/) {
# print paragraph
say $para;
# no need to check for other items
last;
}
}
}
return 0;
}
output:
perl 16953439.pl
--- Robin.txt ---
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife: Bowie quantity: 8
weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Such (premature) code helps you to
clarify your specs (Should not-found keywords go into the output?
Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line
search be done on the array of lines of the whole text?)
identify topics for further research/rtfm (eg. regex (prison!))
plan your next steps (folder loop, read input file)
(in addition, people in the know will point out all my bad practices,
so you can avoid them from the start)
Good luck!
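The "regex (prison!)" remark refers to "son" matching inside "prison", which is why the first paragraph shows up in the output. A small Python sketch of the paragraph search with word boundaries, assuming paragraphs are separated by blank lines (the function name and the shortened sample text are made up for illustration):

```python
import re


def find_paragraphs(text, words):
    """Return the paragraphs that contain any of the words as whole words."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    return [p for p in re.split(r"\n\s*\n", text) if pattern.search(p)]


text = (
    "tell Al Capone to send a car to the prison gate on sunday.\n"
    "\n"
    "Tell my wife in Folsom to send some money to my son in\n"
    "Alcatraz.\n"
)
hits = find_paragraphs(text, ["wife", "son", "husband"])
print(hits)
```

With \b around the alternatives, only the paragraph that really mentions "wife" and "son" is returned. A plain substring search for "son" would also pick up the "prison" paragraph.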

Related

Replace $$ or more with single space using Regex in python

In the following list of strings I want to replace every run of two or more $ characters with a single space.
E.g. if I have $$ it should become one space, and if there are $$$$ or more, that should also become only one space.
I am using the following regex, but I'm not sure if it serves the purpose:
regex_pattern = r"['$$']{2,}?"
Following is the test string list:
['1', 'Patna City $$$$ $$$$$$$$View Details', 'Serial No:$$$$5$$$$ $$$$Deed No:$$$$5$$$$ $$$$Token No:$$$$7$$$$ $$$$Reg Year:2020', 'Anil Kumar Singh Alias Anil Kumar$$$$$$$$Executant$$$$$$$$Late. Harinandan Singh$$$$$$$$$$$$Md. Shahzad Ahmad$$$$$$$$Claimant$$$$$$$$Late. Md. Serajuddin', 'Anil Kumar Singh Alias Anil Kumar', 'Executant', 'Late. Harinandan Singh', 'Md. Shahzad Ahmad', 'Claimant', 'Late. Md. Serajuddin', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000', 'Circle:Patna City Mauja: $$$$ $$$$Khata : na$$$$ $$$$Plot :2497 Area(in Decimal):1.5002 Land Type :Res. Branch Road Land Value :1520000 MVR Value :1000000']
About
I am using the following regex but i'm not sure if it serves the
purpose
The pattern ['$$']{2,}? can be written as ['$]{2,}? and matches 2 or more characters, each being either ' or $, in a non-greedy way.
Your pattern currently gets the right matches, as the data contains no parts like '' or $'
As the pattern is non-greedy, each match is only 2 characters long, so it will not match all 3 characters in $$$
You could write the pattern to match 2 or more dollar signs without making it non-greedy, so that an odd number of $ is also matched in full:
regex_pattern = r"\${2,}"
In the replacement use a space.
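A quick way to see the difference between the two patterns is to run both on a string with an odd number of dollar signs. This is only a sketch; the helper name and the sample strings are made up:

```python
import re


def collapse_dollars(text):
    """Replace every run of two or more $ with one space (greedy quantifier)."""
    return re.sub(r"\${2,}", " ", text)


odd_run = "a$$$b"
# The non-greedy class pattern stops after two characters and leaves one $ behind.
lazy_result = re.sub(r"['$]{2,}?", " ", odd_run)
greedy_result = collapse_dollars(odd_run)
print(repr(lazy_result), repr(greedy_result))
```

A single $ is never touched by either pattern, so a string such as "price $5" stays as it is.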
Is this what you need?:
import re
# data is the list of test strings from the question
data = [re.sub(r'\${2,}', ' ', d) for d in data]

String variable overwrites instead of concatenating in for-loop

Context
I am trying to write a little awk program to analyze my PokerStars hand history. Hand histories are stored in text files and have the following format:
PokerStars Hand #225343166937: Hold'em No Limit ($0.01/$0.02 USD) - 2021/03/30 16:14:07 ET
Table 'Pippa V' 6-max Seat #2 is the button
Seat 2: user1 ($2.12 in chips)
Seat 3: user2 ($2.28 in chips)
Seat 4: me ($2 in chips)
Seat 5: user3 ($1.95 in chips)
Seat 6: user4 ($2.06 in chips)
user2: posts small blind $0.01
me: posts big blind $0.02
*** HOLE CARDS ***
Dealt to me [7d 9c]
user3: folds
user4: folds
user1: raises $0.04 to $0.06
user2: folds
me: folds
Uncalled bet ($0.04) returned to user1
user1 collected $0.05 from pot
user1: doesn't show hand
*** SUMMARY ***
Total pot $0.05 | Rake $0
Seat 2: user1 (button) collected ($0.05)
Seat 3: user2 (small blind) folded before Flop
Seat 4: me (big blind) folded before Flop
Seat 5: user3 folded before Flop (didn't bet)
Seat 6: user4 folded before Flop (didn't bet)
PokerStars Hand #225343172788: Hold'em No Limit ($0.01/$0.02 USD) - 2021/03/30 16:14:17 ET
Table 'Pippa V' 6-max Seat #3 is the button
Seat 2: user1 ($2.15 in chips)
...
(Usernames have been changed to respect the players' privacy)
Each record (=hand) is separated by three line breaks. I got as far as separating the hands into records, then looping over each line to save the relevant data into variables and print them. My little awk program looks like this:
BEGIN{
    RS="\n\r\n\r\n\r\n";
    FS="\n";
    OFS=",";
    print "Hand ID,Game Type,Time,Holecards";
}
{
    for (i = 1; i <= NF; i++)
    {
        if ($i ~ /^PokerStars Hand/)
        {
            split($i, aHand, " ");
            handID = aHand[3];
            gameType = aHand[5]" "aHand[6]" "aHand[7]" "aHand[8];
            dateTime = aHand[10]" "aHand[11]" "aHand[12];
        }
        if ($i ~ /^Dealt to /)
        {
            split($i, aHoleCards, " ");
            holeCards = aHoleCards[4]" "aHoleCards[5];
        }
    }
    print(handID, gameType, dateTime, holeCards);
    #printf("%s, %s, %s, %s\n", handID, gameType, dateTime, holeCards); # Same problem here
}
The problem
The output I am expecting to get (for the first hand) is:
Hand ID,Game Type,Time,Holecards
#225343172788:,No Limit ($0.01/$0.02 USD),2021/03/30 16:14:17 ET,[7d 9c]
However, the output is different. For the first record, the variables handID, gameType, and dateTime seem to be empty whereas the holeCards get printed. The other variables then show up on the second line but get somehow "overwritten" by the holeCards variable of the second record:
Hand ID,Game Type,Time,Holecards
,,,[7d 9c]
,[Kd As]72788:,No Limit ($0.01/$0.02 USD),2021/03/30 16:14:17 ET
I hope my description isn't too confusing. I'm very confused myself with the result. I tried using printf instead of print but the result is the same. I suspect I'm missing something simple here.
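One plausible explanation, assuming the history files use Windows-style CRLF line endings (the \r in RS hints at that): splitting on "\n" leaves a carriage return at the end of every field, and when such a field is printed the terminal jumps back to the start of the line, so later text appears to overwrite earlier text. The Python sketch below shows the stray "\r" and a parser that avoids it. The function name is made up, and the field positions simply mirror the awk script.

```python
def parse_hand(record):
    """Extract hand ID, game type, time and hole cards from one hand record."""
    hand_id = game_type = date_time = hole_cards = ""
    for line in record.splitlines():  # splitlines() drops "\n" as well as "\r\n"
        parts = line.split()
        if line.startswith("PokerStars Hand"):
            hand_id = parts[2]
            game_type = " ".join(parts[4:8])
            date_time = " ".join(parts[9:12])
        elif line.startswith("Dealt to "):
            hole_cards = " ".join(parts[3:5])
    return hand_id, game_type, date_time, hole_cards


sample = (
    "PokerStars Hand #225343166937: Hold'em No Limit ($0.01/$0.02 USD)"
    " - 2021/03/30 16:14:07 ET\r\n"
    "Table 'Pippa V' 6-max Seat #2 is the button\r\n"
    "Dealt to me [7d 9c]\r\n"
)
# Splitting on "\n" and then on single spaces keeps the carriage return.
last_word = sample.split("\n")[0].split(" ")[-1]
print(repr(last_word))
print(",".join(parse_hand(sample)))
```

If that is the cause, removing the carriage returns before the fields are split should make the columns line up again.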

Removing columns and sorting by Name in finger command

When I use the finger command, it displays Login, Name, Tty, Idle, Login Time, Office, Office Phone, and Host. I just need the information in the Login, Name, Idle, and Login Time columns.
I tried using awk and sed, but they resulted in chart being all over the place (example below).
$ finger | sed -r 's/\S+//3'
Login Name Idle Login Time Office Office Phone Host
user1 Full Name pts/1 20 Feb 3 19:34 (--------------------)
user2 FirstName LastName pts/2 Feb 3 17:04 (--------------)
user3 Name NameName pts/3 1:11 Feb 2 11:37 (-------------------------------)
user4 F Last pts/4 1:09 Feb 13 18:14 (-------------------)
How do I go about removing specific columns while keeping the structure intact?
The problem here is that you cannot extract particular fields based on whitespace separator, because on certain rows the columns might be blank and contain only whitespace, especially the Idle column, which will be blank for sessions with limited idle time. (An additional problem is that the real name field may contain a variable number of spaces.)
So you may have to resort to cut -b ... using hard-coded byte offsets. The following seems to work on my system, as finger seems to use a fixed format output, truncating real names etc as needed, so the byte offsets do not change if the length of the GECOS (real name) field of logged in users is changed.
finger | cut -b 1-20,30-48
Note that it will be inherently fragile if the format of the finger command output were to change in future. You might be able to produce something slightly more robust using regular expression parsing, for example parsing the column headings (first line of finger output) to obtain the byte offsets rather than hard-coding them, but it will still be somewhat fragile. A more robust solution would involve writing your own code to obtain information from the same sources that finger uses, and use that in place of finger. The existing code of an open-source implementation of finger might be a suitable starting point, and then you can adapt it to remove the columns that are not of interest.
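The idea of reading the offsets from the column headings instead of hard-coding them could look roughly like this in Python. It is a sketch under the assumption that every data row is aligned exactly under its heading; the function names and the sample rows are made up, and real finger output may not line up this neatly.

```python
def column_spans(header, names):
    """Map each column name to its (start, end) character offsets in the header."""
    starts = []
    pos = 0
    for name in names:
        pos = header.index(name, pos)  # left to right, so "Login" and "Login Time" differ
        starts.append(pos)
        pos += len(name)
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    return dict(zip(names, zip(starts, ends)))


def keep_columns(lines, names, wanted):
    """Slice every line at the header's offsets and keep only the wanted columns."""
    spans = column_spans(lines[0], names)
    return ["".join(line[slice(*spans[n])] for n in wanted).rstrip() for line in lines]


# Made-up, perfectly aligned sample data.
row = "{:<7}{:<12}{:<7}{:<6}{:<14}{}"
names = ["Login", "Name", "Tty", "Idle", "Login Time", "Office"]
lines = [
    row.format(*names),
    row.format("user1", "Full Name", "pts/1", "20", "Feb  3 19:34", "(host-a)"),
    row.format("user2", "First Last", "pts/2", "", "Feb  3 17:04", "(host-b)"),
]
trimmed = keep_columns(lines, names, ["Login", "Name", "Idle", "Login Time"])
print("\n".join(trimmed))
```

Because the slices are taken by position, a blank Idle value or a real name containing spaces does not shift the remaining columns.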
Update: building a patched version of finger.
Save this patch as /tmp/patch. It is just a quick-and-dirty patch to suppress certain fields from being printed; they are still calculated.
--- sprint.c~ 2020-06-13 12:27:12.000000000 +0100
+++ sprint.c 2020-06-13 12:32:23.363138500 +0100
@@ -89,7 +89,7 @@
if (maxlname + maxrname < space-2) { maxlname++; maxrname++; }
(void)xprintf("%-*s %-*s %s\n", maxlname, "Login", maxrname,
- "Name", " Tty Idle Login Time Office Office Phone");
+ "Name", " Idle Login Time");
for (cnt = 0; cnt < entries; ++cnt) {
pn = list[cnt];
for (w = pn->whead; w != NULL; w = w->next) {
@@ -100,12 +100,6 @@
(void)xprintf(" * * No logins ");
goto office;
}
- (void)xputc(w->info == LOGGEDIN && !w->writable ?
- '*' : ' ');
- if (*w->tty)
- (void)xprintf("%-7.7s ", w->tty);
- else
- (void)xprintf(" ");
if (w->info == LOGGEDIN) {
stimeprint(w);
(void)xprintf(" ");
@@ -118,17 +112,6 @@
else
(void)xprintf(" %.5s", p + 11);
office:
- if (w->host[0] != '\0') {
- xprintf(" (%s)", w->host);
- } else {
- if (pn->office)
- (void)xprintf(" %-10.10s", pn->office);
- else if (pn->officephone)
- (void)xprintf(" %-10.10s", " ");
- if (pn->officephone)
- (void)xprintf(" %-.14s",
- prphone(pn->officephone));
- }
xputc('\n');
}
}
Then obtain the source code, patch it and build it. (Change destdir as required.)
apt-get source finger
cd bsd-finger-0.17/
pushd finger
patch -p0 < /tmp/patch
popd
destdir=/tmp/finger
mkdir -p $destdir/man/man8 $destdir/sbin $destdir/bin
./configure --prefix=$destdir
make
make install
And run it...
$destdir/bin/finger
Basically, awk is the way to go for working with columns.
For example, to remove the third column:
finger | awk '{$3="";print}'
Another way: the information finger displays has to be stored somewhere in the system, so you can collect it yourself using who, awk and cut.
The information can be gathered with getent passwd.
Created a test user with adduser :
# adduser foobar
Adding user `foobar' ...
Adding new group `foobar' (1001) ...
Adding new user `foobar' (1001) with group `foobar' ...
Creating home directory `/home/foobar' ...
Copying files from `/etc/skel' ...
New password:
Retype new password:
passwd: password updated successfully
Changing the user information for foobar
Enter the new value, or press ENTER for the default
Full Name []: Jean-Charles De la tour
Room Number []: 42
Work Phone []: +33140000000
Home Phone []: +33141000000
Other []: sysadmin
Is the information correct? [Y/n] Y
And the new line in /etc/passwd file:
foobar:x:1001:1001:Jean-Charles De la tour,42,+33140000000,+33141000000,sysadmin:/home/foobar:/bin/bash
So it's easy to retrieve the information from this:
for u in $(who | cut -d' ' -f1); do # iterate over connected users
getent passwd | awk -F'[:,]' -v OFS='\n' -v u="$u" '$1==u{print "user: "$1, "full name: "$5, "room: "$6, "work phone : "$7, "home phone: "$8, "other: "$9}'
done
Just make sure the fifth passwd field (the GECOS field, which starts at $5 here) really is comma-separated.
Output
user: foobar
full name: Jean-Charles De la tour
room: 42
work phone : +33140000000
home phone: +33141000000
other: sysadmin
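The same splitting can be written in Python. This is a minimal sketch that parses one passwd-style line; the function name is made up and the five-part GECOS layout is assumed to be the one adduser produced above.

```python
def parse_passwd_line(line):
    """Split one passwd entry and break its GECOS field into named parts."""
    fields = line.rstrip("\n").split(":")
    # Pad with empty strings so entries with a short GECOS field still work.
    gecos = (fields[4].split(",") + [""] * 5)[:5]
    keys = ("full name", "room", "work phone", "home phone", "other")
    return {"user": fields[0], **dict(zip(keys, gecos))}


entry = parse_passwd_line(
    "foobar:x:1001:1001:Jean-Charles De la tour,42,+33140000000,"
    "+33141000000,sysadmin:/home/foobar:/bin/bash"
)
for key, value in entry.items():
    print(f"{key}: {value}")
```

An entry whose GECOS field holds only a name yields empty strings for the remaining parts instead of shifting the home directory and shell into them.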

Is there any command to do fuzzy matching in Linux based on multiple columns

I have two csv files.
File 1
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0
File 2
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018
What I want to do is to use the crosswalk file, File 2, to back out those observations' PID in File 1 based on Columns FNAME,MNAME,LNAME,GENDER, and DOB. Because the corresponding information in observations of File 1 is not complete, I'm thinking of using fuzzy matching to back out their PID as many as possible (of course the level accuracy should be taken into account). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0" and Marc,M,Robert,M,,201211.0 either "S0" or "S1".
Since I want to compensate File 1's PID as many as possible while keeping high accuracy, I consider three steps. First, use command to make sure that if and only if those information in FNAME,MNAME,LNAME,GENDER, and DOB are all completely matched, observations in File 1 can be assigned a PID. The output should be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,
Next, write another command to guarantee that while DOB information are completely same, use fuzzy matching for FNAME,MNAME,LNAME,GENDER to back out File 1's observations' PID, which is not identified in the first step. So the output through these two steps is supposed to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
In the final step, use a new command to do fuzzy matching for all related columns, namely FNAME,MNAME,LNAME,GENDER, and DOB to compensate the remained observations' PID. So the final output is expected to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
I need to keep the order of File 1's observations, so it must be a kind of left outer join. Because my original data size is about 100Gb, I want to use Linux tools to deal with my issue.
But I have no idea how to complete the last two steps through awk or any other command in Linux. Could anyone help me out? Thank you.
Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's field values per field and attaches the PID to the value, like field[2]["66M"]="S2", and for each record in file1 it counts the number of PID matches and prints the PID with the biggest count:
BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
}
NR==FNR { # file2
    for(i=1;i<=6;i++) # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
        }
    next
}
{ # file1
    for(i=1;i<=6;i++) { # fields 1-6
        if($i in field[i]) { # if value matches
            split(field[i][$i],t,FS) # get PIDs
            for(j in t) { # and
                matches[t[j]]++ # increase PID counts
            }
        } else { # if no value match
            for(j in field[i]) # for all field values
                if($i~j || j~$i) # "go fuzzy" :D
                    matches[field[i][j]]+=0.5 # fuzzy is half a match
        }
    }
    for(i in matches) { # the best match first
        print $0,i
        delete matches
        break # we only want the best match
    }
}
Output:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
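For readers who find the nested awk arrays hard to follow, here is roughly the same voting idea in Python. It is a sketch and not a port: the function names are made up, the columns are matched by header name, empty values are skipped instead of being treated as fuzzy matches, and ties go to the alphabetically first PID.

```python
import csv
import io
from collections import defaultdict

KEY_FIELDS = ["FNAME", "MNAME", "LNAME", "GENDER", "DOB"]


def build_index(rows):
    """index[field][value] is the set of PIDs that have this value in this field."""
    index = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for field in KEY_FIELDS:
            if row[field]:
                index[field][row[field]].add(row["PID"])
    return index


def best_pid(record, index):
    """Score 1 per exact field match and 0.5 per substring match; return the top PID."""
    scores = defaultdict(float)
    for field in KEY_FIELDS:
        value = record.get(field, "")
        if not value:
            continue
        if value in index[field]:
            for pid in index[field][value]:
                scores[pid] += 1.0
        else:
            for known, pids in index[field].items():
                if value in known or known in value:  # naive "fuzzy" test
                    for pid in pids:
                        scores[pid] += 0.5
    return max(sorted(scores), key=scores.get) if scores else ""


file2 = io.StringIO(
    "PID,FNAME,MNAME,LNAME,GENDER,DOB\n"
    "S2,66M,J,Rock,F,1995\nS3,David,HM,Lee,M,1990\nS0,Marc,HM,Robert,M,2000\n"
    "S1,Marc,MS,Robert,M,2000\nS6,Paul,,Row,M,2008\nS7,Sam,O,Baby,F,2018\n"
)
index = build_index(csv.DictReader(file2))
record = {"FNAME": "Marc", "MNAME": "MS", "LNAME": "Robert", "GENDER": "M", "DOB": "2000"}
print(best_pid(record, index))
```

For a 100Gb file1 only the index built from file2 has to fit in memory, since file1 can be streamed one record at a time.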
The "fuzzy match" here is naive, if($i~j || j~$i), but feel free to replace it with any approximate matching algorithm; for example, there are a few implementations of the Levenshtein distance algorithm floating around the internet. Rosetta seems to have one.
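For reference, a Levenshtein distance function is short enough to write by hand. The sketch below is in Python rather than awk, and the max_distance threshold in the helper is an arbitrary assumption that would need tuning against real data.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping only one row in memory."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution, free if equal
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(a, b, max_distance=1):
    """Hypothetical helper: treat values within max_distance edits as a match."""
    return levenshtein(a, b) <= max_distance


print(levenshtein("HM", "H M"), levenshtein("ABC", "ACB"))
```

With this measure "HM" and "H M" are one edit apart, while "ABC" and "ACB" are two edits apart, so the threshold decides which of them still counts as the same middle name.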
You didn't mention how big file2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.
Update: A version that maps file1 fields to file2 fields (as mentioned in comments):
BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
    map[1]=1 # map file1 fields to file2 fields
    map[2]=3
    map[3]=4
    map[4]=2
    map[5]=5
    map[7]=6
}
NR==FNR { # file2
    for(i=1;i<=6;i++) # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1 # attach PID to value
        }
    next
}
{ # file1
    for(i in map) {
        if($i in field[map[i]]) { # if value matches
            split(field[map[i]][$i],t,FS) # get PIDs
            for(j in t) { # and
                matches[t[j]]++ # increase PID counts
            }
        } else { # if no value match
            for(j in field[map[i]]) # for all field values
                if($i~j || j~$i) # "go fuzzy" :D
                    matches[field[map[i]][j]]+=0.5 # fuzzy is half a match
        }
    }
    for(i in matches) { # the best match first
        print $0,i
        delete matches
        break # we only want the best match
    }
}

lists and reading from file comma separated elements

Here is my code
#course registration
list_courses=[]
for line in open("courses.txt",'r').readlines():
    list_courses.append(line.strip())
print ("Gathering course information from file: \n",list_courses)
close("courses.txt")
list_student=[]
for line in open("students.txt",'r').readlines():
    list_student.append(line.strip())
print("Here is student info: \n",list_student)
close("students.txt")
This gives me errors when I try to close the files. How do I close them? I am basically reading the contents of each file and storing them in a list, and later on I want to close the open files. That is where I get the error.
I edited the code as per suggestions below.
The new code is
list_courses=[]
with open("courses.txt",'r') as myfile1:
    list_courses=myfile1.readlines()
list_courses=[x.strip() for x in list_courses]
print ("Gathering course information from file: \n",list_courses)
list_student=[]
with open("students.txt",'r') as myfile1:
    list_student=myfile1.readlines()
list_student=[x.strip() for x in list_student]
print("Here is student info: \n",list_student)
The information in courses.txt is
cs101,C programming
cs102,Digital logic and design
cs103,Electrical engineering
cs231,IT networks
cs232,IT Workshop
cs233,IT programming
cs301,Compilers and automata
cs302,Operating Systems
cs303,Networks
cs401,Game Theory
cs402,Systems Programming
cs403,Automata
ec101,Digitization
ec102,Analog cicuit design
ec103,IP Telephony
ec201,Wireless Network
ec202,Microwave engineering
ec203,Antenna
ec301,Maths2
ec302,Theory of Circuits
ec303,PCB design
ec401,PLC programming
ec402,Scada
ec403,VLSI
When I run the code I get output
Gathering course information from file:
['cs101,C programming', 'cs102,Digital logic and design', 'cs103,Electrical engineering', 'cs231,IT networks', 'cs232,IT Workshop', 'cs233,IT programming', 'cs301,Compilers and automata', 'cs302,Operating Systems', 'cs303,Networks', 'cs401,Game Theory', 'cs402,Systems Programming', 'cs403,Automata', 'ec101,Digitization', 'ec102,Analog cicuit design', 'ec103,IP Telephony', 'ec201,Wireless Network', 'ec202,Microwave engineering', 'ec203,Antenna', 'ec301,Maths2', 'ec302,Theory of Circuits', 'ec303,PCB design', 'ec401,PLC programming', 'ec402,Scada', 'ec403,VLSI']
Instead, what I want is for cs101 from the first line of courses.txt to go into list_courses[0] and for list_courses[1] to hold C programming, i.e.
list_courses[0]=cs101
list_courses[1]=C programming
The methods I tried read each line and store the whole line as one list element, but in courses.txt a comma separates two values on each line, and the comma-separated values should become separate list elements.
This will work for you:
course_codes=[]
course_names=[]
with open("courses.txt","r") as f:
    for line in f:
        # split on the first comma only; strip() removes the trailing newline
        eachl=line.strip().split(",",1)
        if len(eachl)==2: # skip blank or malformed lines
            course_codes.append(eachl[0])
            course_names.append(eachl[1])
You will get two lists, one containing the course codes and another one with the course names.
The course_codes list will look like:
['cs101', 'cs102', 'cs103', 'cs231', 'cs232', 'cs233', 'cs301', 'cs302', 'cs303', 'cs401', 'cs402', 'cs403', 'ec101', 'ec102', 'ec103', 'ec201', 'ec202', 'ec203', 'ec301', 'ec302', 'ec303', 'ec401', 'ec402', 'ec403']
The course_names list will look like:
['C programming', 'Digital logic and design', 'Electrical engineering', 'IT networks', 'IT Workshop', 'IT programming', 'Compilers and automata', 'Operating Systems', 'Networks', 'Game Theory', 'Systems Programming', 'Automata', 'Digitization', 'Analog cicuit design', 'IP Telephony', 'Wireless Network', 'Microwave engineering', 'Antenna', 'Maths2', 'Theory of Circuits', 'PCB design', 'PLC programming', 'Scada', 'VLSI']
Why don't you try
codes = []
names = []
with open("courses.txt", "r") as f:
    for line in f:
        a, b = line.rstrip("\n").split(",", 1)
        codes.append(a)
        names.append(b)
This will produce two lists, codes and names. The with block closes the file automatically, so no f.close() is needed.
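If a single flat list is wanted, as the question describes (list_courses[0] holding cs101 and list_courses[1] holding C programming), the csv module can do the splitting. A minimal sketch follows; the function name is made up and io.StringIO stands in for the real courses.txt so the example is self-contained.

```python
import csv
import io


def read_flat(handle):
    """Return every comma-separated value of every line as one flat list."""
    flat = []
    for row in csv.reader(handle):
        flat.extend(value.strip() for value in row)
    return flat


# Stand-in for: with open("courses.txt", newline="") as handle: ...
sample = io.StringIO("cs101,C programming\ncs102,Digital logic and design\n")
list_courses = read_flat(sample)
print(list_courses)
```

Whether a flat list is the best structure is a separate question; a dictionary mapping each course code to its name may be easier to work with later.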

Resources