Python3: grouping list of objects by words in description - python-3.x

I have a standard list of objects, where each object is defined as
class MyRecord(object):
    def __init__(self, name, date, category, memo):
        self.name = name
        self.date = date
        self.category = category
        self.memo = memo.strip().split()
When I create an object, the input memo is usually a long sentence, for example "Hello world this is a new funny-memo", which the init function then turns into a list ['Hello', 'world', 'this', 'is', 'a', 'new', 'funny-memo'].
Given, say, 10000 such records in the list (with different memos), I want to group them (as fast as possible) in the following way:
'Hello' : [all the records whose memo contains the word 'Hello']
'world' : [all the records whose memo contains the word 'world']
'is' : [all the records whose memo contains the word 'is']
I know how to use group-by to group the records by, for example, name, date, or category (since each is a single value), but I can't work out how to group them in the way described above.

If you want to group them really fast, you should do it once and never recalculate. To achieve this, you can borrow an approach used for caching: group the objects as they are created.
class MyRecord():
    __groups = dict()
    def __init__(self, name, date, category, memo):
        self.name = name
        self.date = date
        self.category = category
        self.memo = memo.strip().split()
        for word in self.memo:
            self.__groups.setdefault(word, set()).add(self)
    @classmethod
    def get_groups(cls):
        return cls.__groups
records = list()
for line in [
        'Hello world this is a new funny-memo',
        'Hello world this was a new funny-memo',
        'Hey world this is a new funny-memo']:
    records.append(MyRecord(1, 1, 1, line))
print({key: len(val) for key, val in MyRecord.get_groups().items()})
Output:
{'Hello': 2, 'world': 3, 'this': 3, 'is': 2, 'a': 3, 'new': 3, 'funny-memo': 3, 'was': 1, 'Hey': 1}
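If the records already exist in a plain list, the same word index can be built in a single pass without keeping state on the class. This is only a minimal sketch, assuming each record exposes memo as a list of words as in the question; group_by_word is a hypothetical helper name, not something from the original answer.

```python
from collections import defaultdict


class MyRecord:
    def __init__(self, name, date, category, memo):
        self.name = name
        self.date = date
        self.category = category
        self.memo = memo.strip().split()


def group_by_word(records):
    """Map each memo word to the list of records containing it (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        # set() so a record whose memo repeats a word is listed only once for it
        for word in set(record.memo):
            groups[word].append(record)
    return groups


records = [
    MyRecord(1, 1, 1, 'Hello world this is a new funny-memo'),
    MyRecord(2, 1, 1, 'Hello world this was a new funny-memo'),
    MyRecord(3, 1, 1, 'Hey world this is a new funny-memo'),
]
groups = group_by_word(records)
print(len(groups['Hello']), len(groups['world']), len(groups['was']))
```

The index costs one dictionary update per word per record, so it is worth building once and reusing for every lookup.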

Related

Python nested dictionary Issue when iterating

I have 5 lists of words, which act as the values in a dictionary whose keys are the document IDs.
For each document, I would like to apply some calculations and display the values and results of the calculation in a nested dictionary.
So far so good: I managed to do everything, but I am failing at the easiest part.
When showing the resulting nested dictionary, it seems to keep only the last element of each of the 5 lists, and therefore does not show all the elements.
Could anybody explain where I am going wrong?
This is the original dictionary data_docs:
{'doc01': ['simpl', 'hello', 'world', 'test', 'python', 'code'],
'doc02': ['today', 'wonder', 'day'],
'doc03': ['studi', 'pac', 'today'],
'doc04': ['write', 'need', 'cup', 'coffe'],
'doc05': ['finish', 'pac', 'use', 'python']}
This is the result I am getting (missing 'simpl','hello', 'world', 'test', 'python' in doc01 as example):
{'doc01': {'code': 0.6989700043360189},
'doc02': {'day': 0.6989700043360189},
'doc03': {'today': 0.3979400086720376},
'doc04': {'coffe': 0.6989700043360189},
'doc05': {'python': 0.3979400086720376}}
And this is the code:
def tfidf (data, idf_score): #function, 2 dictionaries as parameters
    tfidf = {} #dict for output
    for word, val in data.items(): #for each word and value in data_docs(first dict)
        for v in val: #for each value in each list
            a = val.count(v) #count the number of times that appears in that list
            scores = {v :a * idf_score[v]} # dictionary that will act as value in the nested
        tfidf[word] = scores #final dictionary, the key is doc01,doc02... and the value the above dict
    return tfidf
tfidf(data_docs, idf_score)
Thanks,
Did you mean to do this?
def tfidf(data, idf_score): # function, 2 dictionaries as parameters
    tfidf = {} # dict for output
    for word, val in data.items(): # for each word and value in data_docs(first dict)
        scores = {} # <---- a new dict for each outer iteration
        for v in val: # for each value in each list
            a = val.count(v) # count the number of times that appears in that list
            scores[v] = a * idf_score[v] # <---- keep adding items to the dictionary
        tfidf[word] = scores # final dictionary, the key is doc01,doc02... and the value the above dict
    return tfidf
... see my changes with <----- arrow :)
Returns (here with every idf_score value set to 1):
{'doc01': {'simpl': 1,
'hello': 1,
'world': 1,
'test': 1,
'python': 1,
'code': 1},
'doc02': {'today': 1, 'wonder': 1, 'day': 1},
'doc03': {'studi': 1, 'pac': 1, 'today': 1},
'doc04': {'write': 1, 'need': 1, 'cup': 1, 'coffe': 1},
'doc05': {'finish': 1, 'pac': 1, 'use': 1, 'python': 1}}
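Calling val.count(v) inside the loop rescans the whole list for every word. A possible alternative, sketched here rather than taken from the original answer, is to count each document once with collections.Counter and build the inner dictionary with a comprehension. The sample data_docs and idf_score values below are made up purely for illustration.

```python
from collections import Counter


def tfidf(data, idf_score):
    """Return {doc_id: {word: term_count * idf}} for every word in every document."""
    result = {}
    for doc_id, words in data.items():
        counts = Counter(words)  # one pass per document instead of list.count per word
        result[doc_id] = {word: n * idf_score[word] for word, n in counts.items()}
    return result


# Hypothetical inputs for illustration only; the idf values are invented.
data_docs = {'doc02': ['today', 'wonder', 'day', 'today']}
idf_score = {'today': 0.5, 'wonder': 1.0, 'day': 1.0}
print(tfidf(data_docs, idf_score))
```

Because the inner dictionary is created fresh for each document, no scores can be overwritten from one iteration to the next.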

sorting a list of string based on a dictionary

I have a dictionary containing the high-level job titles and their order, for example:
{'ceo':0,'founder':1,'chairman':2}
I also have a list of job titles:
['ceo', 'manager','founder','partner', 'chairman']
What I want is this:
['ceo','founder', 'chairman', 'manager','partner']
Try:
order = {"ceo": 0, "founder": 1, "chairman": 2}
lst = ["ceo", "manager", "founder", "partner", "chairman"]
out = sorted(lst, key=lambda v: order.get(v, float("inf")))
print(out)
Prints:
['ceo', 'founder', 'chairman', 'manager', 'partner']
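With float("inf") as the fallback, titles missing from the dictionary keep their original relative order, because sorted is stable. If they should instead come out alphabetically, a tuple key is one option. The sketch below assumes that alphabetical order is what is wanted for the unranked titles, which the question does not state; the input list is deliberately shuffled to show the effect.

```python
order = {"ceo": 0, "founder": 1, "chairman": 2}
lst = ["ceo", "partner", "founder", "manager", "chairman"]

# Titles found in `order` sort by their rank; all others share rank len(order)
# and are then ordered alphabetically by the second tuple element.
out = sorted(lst, key=lambda title: (order.get(title, len(order)), title))
print(out)
```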

List of Starters in dictionaries

You are given a method called solve, which takes as a parameter a list of strings called items.
You have to print the list of items for each starting letter, in sorted order of the letters.
Example Input:
noodles, rice, banan, sweets, ramen, souffle, apricot, apple, bread
Output:
a : apple apricot
b : banana bread
n : noodles
r : ramen rice
s : souffle sweets
import collections
def solve(items):
    result = {}
    for word in items:
        char = word[0]
        if char in result:
            result[char].append(word)
        else:
            result[char] = [word]
    od = collections.OrderedDict(sorted(result.items()))
    for key, value in od.items():
        print ("%s : %s"%(key,value))
But I'm getting the values in brackets, not in the desired output format.
Alternatively, you can leverage Python's collections.defaultdict, as this will simplify the code logic.
You could easily convert this into the function, maybe as an exercise. If you have any questions, please ask.
from collections import defaultdict
inputs = "noodles, rice, banan, sweets, ramen, souffle, apricot, apple, bread"
groups = defaultdict(list)
lst = inputs.split(', ')
#print(lst)
for item in lst:
    groups[item[0]].append(item)
for k, val in sorted(groups.items()):
    print(k, ": ", *val) # *val to expand the list content
Output:
a : apricot apple
b : banan bread
n : noodles
r : rice ramen
s : sweets souffle
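The output above keeps the words in input order within each letter, while the expected output in the question lists them alphabetically. One way the exercise could be finished, offered as a sketch rather than the definitive solution, is to wrap the grouping in solve and sort each group before joining it. Returning the lines in addition to printing them is an addition made here only so the result is easy to check.

```python
from collections import defaultdict


def solve(items):
    """Print each starting letter with its words, both in sorted order."""
    groups = defaultdict(list)
    for word in items:
        groups[word[0]].append(word)
    lines = []
    for letter in sorted(groups):
        # sorted() orders the words in a group; join() renders them without brackets
        lines.append(f"{letter} : {' '.join(sorted(groups[letter]))}")
    print("\n".join(lines))
    return lines  # returned as well, purely to make the sketch easy to check


result = solve(["noodles", "rice", "banana", "sweets", "ramen",
                "souffle", "apricot", "apple", "bread"])
```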
You are not performing any comparisons to find maxEnglish or maxTotalMarks. print('Max Marks:',d['name']) only prints the correct result because Dwight is the last entry in the ordered dictionary and you are printing the last item's name.
One way to tackle this is to keep variables that track the maximum scores; as you iterate through the dictionary, compare each value against the stored maximum to determine whether the current value is greater than all the values you have seen so far. Something like this:
def solve(stats):
    maxEnglishMarks = -1
    maxTotalMarks = -1
    maxEnglishStudentName = maxTotalStudentName = None
    for stat in stats:
        totalMarks = sum([i for i in stat.values() if str(i).isnumeric() == True])
        if totalMarks > maxTotalMarks:
            maxTotalStudentName = stat['name']
            maxTotalMarks = totalMarks
        if stat['English'] > maxEnglishMarks:
            maxEnglishStudentName = stat['name']
            maxEnglishMarks = stat['English']
    print('Max English:', maxEnglishStudentName)
    print('Max Marks:', maxTotalStudentName)
stats = [
    {'name': 'Jim', 'English': 92, 'Math': 80, 'Physics': 70},
    {'name': 'Pam', 'French': 72, 'English': 80, 'Biology': 65},
    {'name': 'Dwight', 'Farming': 95, 'English': 85, 'Chemistry': 97}
]
solve(stats)
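The running-maximum loop can also be expressed with the built-in max and a key function. This is a sketch of that alternative, not the answer's own method; total_marks is a hypothetical helper that sums every numeric value in a record.

```python
def total_marks(stat):
    """Sum every numeric value in a student's record (the name is skipped)."""
    return sum(v for v in stat.values() if isinstance(v, (int, float)))


stats = [
    {'name': 'Jim', 'English': 92, 'Math': 80, 'Physics': 70},
    {'name': 'Pam', 'French': 72, 'English': 80, 'Biology': 65},
    {'name': 'Dwight', 'Farming': 95, 'English': 85, 'Chemistry': 97},
]

# max() does the comparisons; the key decides what is being compared
best_english = max(stats, key=lambda stat: stat['English'])
best_total = max(stats, key=total_marks)
print('Max English:', best_english['name'])
print('Max Marks:', best_total['name'])
```

On a tie, max returns the first record it encounters with the highest value.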

How can give a specific dictionary, which is in a list, a name?

I have this function that creates a dictionary for one student
It's been days of me looking over the web and trying things out, but the only change in output that I've made is putting an empty list (without a name) into the json file. A [] outputted to the file.
def add_student_to_database(fname, lname, test1, test2, test3):
    fullname= '%s %s' % (fname, lname)
    all_students = []
    def lettergrade(test1,test2,test3):
        overall = ( int(test1+test2+test3) )/3
        if overall >= 93:
            letter = 'A'
        elif overall >= 90:
            letter = 'A-'
        elif overall >= 87:
            letter = 'B+'
        elif overall >= 83:
            letter = 'B'
        elif overall >= 80:
            letter = 'B-'
        elif overall >= 77:
            letter = 'C+'
        elif overall >= 70:
            letter = 'C'
        elif overall >= 60:
            letter = 'D'
        elif overall < 60:
            letter = 'F'
        return letter
    student = {
        "First name": fname,
        "Last name": lname,
        "Test 1": test1,
        "Test 2": test2,
        "Test 3": test3,
        "Grade": lettergrade(test1,test2,test3)
    }
    all_students.append(student)
    with open('students.json','a+')as json_file:
        json.dump(all_students,json_file, indent= 4)
I expect to get:
'all_students': [
{'John Doe':
'tests':{
'test 1': 100,
'test 2': 100,
'test 3': 100
}
{'Will Smith':
'tests': {}(repeat for a bunch of students)
]
Instead, when it does run well, I get
{
'first name': 'John',
'Last name': 'Doe',
'Test 1': 100,
'Test 2': 100,
'Test 3': 100
}
I want to name the list "all_students" and each individual student's dictionary named by the variable fullname.
I tried starting all over again with the original code that I had (the one posted here) and its throwing this error:
Traceback (most recent call last):
File "./grades.py", line 12, in <module>
class STUDENTS(object):
File "./grades.py", line 81, in STUDENTS
add_student_to_database(fn,ln,t1,t2,t3)
File "./grades.py", line 54, in add_student_to_database
"Grade": lettergrade(test1,test2,test3)
NameError: name 'student' is not defined
Which I managed to fix but forgot how I did it. So, can you help me with all of this please?
I tested your code (by substituting a print statement for the final two lines) and it outputs what I expected, which is a single dictionary contained within a list.
[{'First name': 'Chris', 'Last name': 'Sullivan', 'Test 1': 86, 'Test 2': 99, 'Test 3': 88, 'Grade': 'A-'}]
Also, I don't think you want the '+' in the call to open, as I don't think you are going to do anything but append to the file.
Finally, I don't think the all_students list is ever going to have more than one element, as it is initialized every time you run add_student_to_database. To build up the list you would have to declare it outside the function, build it into a class, or use a callback function.
Here's a class that is hopefully close to what you're after.
import json
class all_students():
    def __init__(self):
        """ Set up an empty dictionary to hold the student information, and
            another dictionary to contain the thresholds for each grade level.
            The keys for the dictionary must be in descending order.
        """
        self.all_students = {}
        # thresholds follow the if/elif chain in the question
        self.grades = {93: 'A', 90: 'A-', 87: 'B+', 83: 'B', 80: 'B-', 77: 'C+', 70: 'C', 60: 'D', 0: 'F'}
    def lettergrade(self, tests):
        """ Returns letter grade when passed a tuple of individual test scores.
            The first argument is a tuple containing the test scores (e.g. (91, 66, 82))
            The average is calculated by dividing the sum of the elements in the
            tuple by the number of elements.
            Then the grades dictionary is searched for the first threshold that
            the average meets or exceeds. When that is found, the grade is returned.
        """
        overall = int(sum(tests)/len(tests))
        for score, grade in self.grades.items():
            if overall >= score:
                return grade
    def add_student_to_database(self, fname, lname, *tests):
        """ Adds (or replaces) a student grade entry to the all_students dictionary. The new entry
            is represented by a dictionary containing the individual test scores and the letter grade
            The test scores are passed as individual arguments. *tests gathers
            those positional arguments into a tuple, e.g. (91, 66, 82)
            A new dictionary containing the first & last names plus the letter
            grade is added to the all_students dictionary, with the key equal
            to the student's full name.
            Finally, that dictionary is updated with the individual test scores.
            That update uses "Test 1", "Test 2", etc. as the key. The enumerate
            function provides the test number in the same order as it is stored
            in the tuple (Note the use of an f-string to format the key). This
            function will return the index (position) of each element to the
            variable i, and the value of the test score in variable g. The
            indices start at 0 so we add 1 to start with Test 1, not Test 0.
        """
        fullname = f"{fname} {lname}"
        self.all_students[fullname] = {
            'First name': fname,
            'Last name': lname,
            'Grade': self.lettergrade(tests)
        }
        self.all_students[fullname].update(dict((f'Test {i+1}', g) for i, g in enumerate(tests)))
    def show_students(self):
        """ Prints the names and letter grade for each student. Note the
            use of an f-string to format the output. Also see the use of
            the items() method to return the key/value pairs to the variables
            fullname/grades respectively for each iteration of the for loop.
        """
        for fullname, grades in self.all_students.items():
            print(f"{fullname}: Grade is {grades['Grade']}")
    def write_file(self, fname='students.json'):
        """ Writes student info to json file fname & prints summary
            This is basically the same as your original.
        """
        with open(fname, 'a') as json_file:
            json.dump(self.all_students, json_file, indent=4)
        self.show_students()
# The following is run as a test when this file is run (e.g. Python programname.py)
if __name__ == '__main__':
    students = all_students()
    students.add_student_to_database('John', 'Doe', 80, 88, 92)
    students.add_student_to_database('John', 'Public', 95, 91, 80)
    students.write_file()
It uses a dictionary of dictionaries, with the outer dictionary keyed by the student's full name, and the inner dictionary similar to what you already had. I decided to allow an arbitrary number of test scores, so it works for any count from 1 to n. It should probably check that the number of test scores is greater than zero, and that all scores fall between 0 and 100. Each call to add_student_to_database will overwrite the previous entry for the same student.
Here's the json file it produces.
{
    "John Doe": {
        "First name": "John",
        "Last name": "Doe",
        "Grade": "B",
        "Test 1": 80,
        "Test 2": 88,
        "Test 3": 92
    },
    "John Public": {
        "First name": "John",
        "Last name": "Public",
        "Grade": "B+",
        "Test 1": 95,
        "Test 2": 91,
        "Test 3": 80
    }
}
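The checks suggested above (at least one score, every score between 0 and 100) could look roughly like the following. This is only a sketch: validate_scores is a hypothetical helper that is not part of the class, and raising ValueError is an assumption about how bad input should be reported.

```python
def validate_scores(tests):
    """Raise ValueError unless there is at least one score and all are in 0..100.

    Hypothetical helper; not part of the class above.
    """
    if not tests:
        raise ValueError("at least one test score is required")
    for score in tests:
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range 0-100: {score}")
    return tests


print(validate_scores((80, 88, 92)))
```

It could be called on the tests tuple at the top of add_student_to_database, before the letter grade is computed.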

Python string duplicates

I have a list
a=['apple', 'elephant', 'ball', 'country', 'lotus', 'potato']
I am trying to find the longest element in the list that has no duplicate letters.
For example, the script should return "country", as none of its letters repeats.
Please help.
You could also use collections.Counter for this:
from collections import Counter
a = ['apple', 'elephant', 'ball', 'country', 'lotus', 'potato']
a = set(a)
no_dups = []
for word in a:
    counts = Counter(word)
    if all(v == 1 for v in counts.values()):
        no_dups.append(word)
print(max(no_dups, key = len))
Which follows this procedure:
Converts a to a set, since we only need to look at a word once, just in case a contains duplicates.
Creates a Counter() object of each word.
Only appends words that have a count of 1 for each letter, using all().
Get longest word from this resultant list, using max().
Note: This does not handle ties, you may need to do further work to handle this.
def has_dup(x):
    unique = set(x) # pick unique letters
    return any([x.count(e) != 1 for e in unique]) # find if any letter appear more than once
def main():
    a = ['apple', 'elephant', 'ball', 'country', 'lotus', 'potato']
    a = [e for e in a if not has_dup(e)] # filter out duplicates
    chosen = max(a, key=len) # choose with max length
    print(chosen)
if __name__ == '__main__':
    main()
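Both answers count letters explicitly. A shorter check, sketched here as an alternative, relies on the fact that a word has no repeated letters exactly when its set of letters is as long as the word itself. longest_without_repeats is a hypothetical name, and returning None when no word qualifies is an assumption.

```python
def longest_without_repeats(words):
    """Return the longest word whose letters are all distinct, or None if there is none."""
    candidates = [w for w in words if len(set(w)) == len(w)]
    # default=None avoids a ValueError when no word qualifies
    return max(candidates, key=len, default=None)


a = ['apple', 'elephant', 'ball', 'country', 'lotus', 'potato']
print(longest_without_repeats(a))
```

As with the answers above, ties are not handled specially: max returns the first of the longest candidates.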
