I'm a very new Python user. My project is to take a very long (20k lines) file that lists movies and the actors in them and refine it. I'm trying to find out which of the movies listed has the highest number of actors.
I'm not sure how to do multiple counts of a single file.
The file alternates movie title lines with actor lines (each title appears once per actor), and it repeats like that with different movie titles for 20k lines. The first part of the project is to build a list containing every movie's full cast list, which is what the code below does. Now I'm trying to get the program to count how many actors are in each movie and print which one has the highest number of actors.
lines_seen = list()
fhand = open...
# opens but I don't want to show address
actors = list()
titles = list()
is_Actor = True
for line in fhand:
    line = line.rstrip()
    if (is_Actor):
        titles.append(line)
        if line not in lines_seen:
            lines_seen.append("The title of the movie is:")
            lines_seen.append(line)
            print(" ")
            print("The title of the movie is '", line, "'")
            print("The actors in the movie are:")
    elif not (is_Actor):
        lines_seen.append(line)
        print(line)
        actors.append(line)
    is_Actor = not(is_Actor)
fhand.close()
Here's what I've done so far:
actors = dict()
is_Title = True
for line in fhand:
    words = line.split()
    if (is_Title):
        if line not in actors:
            actors[line] = 1
        else:
            actors[line] = actors[line] + 1
    is_Title = not is_Title
Now I'm trying to get it to return the highest value. I've googled it and it tells me to use max() but that returns a value of 97 when I know the highest value is 207. What do I do from here?
Recommendation #1: Make yourself a small chunk of data that you can experiment with and read/print results. It will be 55x (my estimate) easier to troubleshoot than 20k lines. Maybe 2 movies, 1 with 2 actors, 1 with 1 actor.
Are you familiar with python dictionaries? It seems what you want to do is associate a list of actors with a movie title. Then you can inspect the sizes of the lists in the dictionary to find the one with the highest length.
In basic Python, you should ...
make an empty dictionary outside of your loop to hold the results, as you are doing with actors, etc.
start reading the file. Your data seems to follow a predictable pattern where the title is followed by a single actor name, so if you want to keep your current reading construct (an alternative would be to read two lines per pass through a different loop), you need to hold onto the movie title until the next iteration to get the actor. In pseudocode, you could modify your loop to something like:
title = None
is_actor = False
for line in fhand:
    if not is_actor:  # you have a title...
        title = line
    else:  # you have an actor
        # get the list from the dictionary for the current title, or make a new list if no entry yet
        # add the actor to the list
        # put the list back into the dictionary
    is_actor = not is_actor
Then inspect your dictionary and manipulate it as needed
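Putting the pieces together, here is a minimal runnable sketch of that idea (assuming the file really does alternate title lines with actor lines, and using "movies.txt" as a stand-in for your actual filename). Note that max() on a dictionary compares keys by default, so you have to pass a key function to compare cast sizes instead:
casts = dict()   # movie title -> list of actor names
title = None
is_actor = False
with open("movies.txt") as fhand:   # placeholder filename
    for line in fhand:
        line = line.rstrip()
        if not is_actor:                 # this line is a title
            title = line
        else:                            # this line is an actor for the current title
            cast = casts.get(title, [])  # existing list, or a new empty one
            cast.append(line)
            casts[title] = cast          # put the list back into the dictionary
        is_actor = not is_actor

biggest = max(casts, key=lambda t: len(casts[t]))   # compare cast sizes, not titles
print(biggest, "has", len(casts[biggest]), "actors")
On the two-movie test file suggested in Recommendation #1 this should print the movie with two actors, which makes it easy to confirm the logic before running it on the full 20k-line file.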
For a primer on dictionaries (and other introductory concepts) I strongly recommend Think Python. See the whole chapter on dictionaries.
Related
I have a text file to import into a dictionary, but I have an issue getting the program to identify the correct lines as item 1 and item 2 of the list in the dictionary.
The format of the text file is like this (there are no empty lines between the lines of a record; only at the end of each record is there a blank line):
ProductA
2020-08-03 16:26:21
This painting was done by XNB.
The artist seeks to portray the tragedies caused by event XYZ.
The painting weighs 2kg.
####blank line#####
ProductB
2020-08-03 16:26:21
This painting is done by ONN.
It was stolen during world war 2.
Decades later, it was discovered in the black market of country XYZ.
It was bought for 2 million dollars by ABC.
###blank line###
Desired outcome in dictionary:
{ 'ProductA' : ['2020-08-03 16:26:21', 'This painting was done by XNB. The artist seeks to portray the tragedies caused by event XYZ. The painting weighs 2kg.'], 'ProductB': ['2020-08-03 16:26:21', 'This painting is done by ONN. It was stolen during world war 2. Decades later, it was discovered in the black market of country XYZ. It was bought for 2 million dollars by ABC.']}
where item_2 is a single string combined from line 3 onwards until the record ends at a blank line.
Problem: I don't know how to code the logic in such a way that the program properly assigns each piece to where I want it.
header = ""
header = True
for line in records:
data = line.splitlines()
if line!= '\n': # check for line break which indicate new record
if Header: #
#code which will assign 1st line of each record as key to dictionary
else:
# This is where I need help.
# Code which will assign 2nd line as item_1 and then assign 3rd lines onwards till the end of record as item_2 in a single string.
# items_2 may have different number of lines being combined into 1 string for each record.
# I try to form a rough idea how the logic might be in code below but I feel that something is missing and I got a bit confused.
for line in list: # result in TypeError, 'type' object is not iterable.
dict[line[1]] = dict[header].append(line[1].strip("\n"))
# Since the outer if has already done its job of identifying 1st line of record. The line of code seeks to assign the next line (line 2 in text file) which I think would be interpreted by the program as line[1] to item 2.
dict[line[2:]] = dict[header].append(line[2:].strip("\n"))
# Assign 3rd line of text file onwards as a single string which is item_2 in the list of value for dictionary.
else:
#code which reset boolean for header
Try this:
with open('data.txt') as fp:
    data = fp.read().split('\n\n')   # records are separated by a blank line

res = {}
for x in data:
    k, v = x.strip().split('\n', 1)  # first line is the key, the rest is the value block
    v = v.split('\n')
    res[k] = [v[0], ' '.join(v[1:])] # timestamp, then the remaining lines joined into one string
print(res)
Output:
{'ProductA': ['2020-08-03 16:26:21', 'This painting was done by XNB. The artist seeks to portray the tragedies caused by event XYZ. The painting weighs 2kg.'], 'ProductB': ['2020-08-03 16:26:21', 'This painting is done by ONN. It was stolen during world war 2. Decades later, it was discovered in the black market of country XYZ. It was bought for 2 million dollars by ABC.']}
I have a file with lines like:
1. 'abc0123,spja,40'
2. 'sed0898,spja,15'
3. 'sed0898,spja,10'
4. 'abc0123,udbs,10'
5. 'bem0334,dim,18'
6. 'bem0334,dim,0'
7. 'bem0334,spja,30'
etc. The first word before the comma is the student login, the second is the exam subject, and the third is the points scored. One row represents one attempt at an exam. I need to return only the students who passed the exams they attempted; the order of the lines doesn't matter. In the case above, the students who passed are bem0334 and sed0898. To pass, a student must have 15 or more points. So I started by saving the lines into a list of strings, but I don't know how to test whether a student has passed all of his exams.
def vrat_uspesne(soubor_vysledky):
    f = open(soubor_vysledky, "r")
    studens = []
    exams = []
    tmp = ""
    for line in f:
        spliter = line.split(',')
        exams.append(line.rstrip('\n'))
        student.append(spliter[0])
    student = set(student)
    student = list(student)
    return student
You appear to have a typo in that code snippet (studens vs student).
The general approach I would suggest is to map lines to data structs, then group the data by student login using a dictionary.
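A minimal sketch of that approach, assuming the same comma-separated format and assuming a student passes an exam when at least one of their attempts scores 15 or more points (which matches the example, where bem0334 passes dim despite a 0-point attempt):
def vrat_uspesne(soubor_vysledky):
    # best score per (login, subject) pair
    best = {}
    with open(soubor_vysledky, "r") as f:
        for line in f:
            login, subject, points = line.strip().split(',')
            key = (login, subject)
            best[key] = max(best.get(key, 0), int(points))

    # a student is successful only if every subject they attempted has a best score >= 15
    logins = {login for login, subject in best}
    return [login for login in logins
            if all(score >= 15
                   for (l, subject), score in best.items() if l == login)]
Grouping attempts by (login, subject) first means the pass check for each student is a single all() over their best scores.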
import pandas as pd
import nltk
import os

directory = os.listdir(r"C:\...")
x = []
num = 0
for i in directory:
    x.append(pd.read_fwf("C:\\..." + i))
    x[num] = x[num].to_string()
So, once I have a dictionary x = [ ] populated by the read_fwf for each file in my directory:
I want to know how to make it so every single character is lowercase. I am having trouble understanding the syntax and how it is applied to a dictionary.
I want to define a filter that I can use to count occurrences of a list of words in this newly defined dictionary, e.g.,
list = [bus, car, train, aeroplane, tram, ...]
Edit: Quick unrelated question:
Is pd.read_fwf the best way to read .txt files? If not, what else could I use?
Any help is very much appreciated. Thanks
Edit 2: Sample data and output that I want:
Sample:
The Horncastle boar's head is an early seventh-century Anglo-Saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. It was discovered in 2002 by a metal detectorist searching
in the town of Horncastle, Lincolnshire. It was reported as found
treasure and acquired for £15,000 by the City and County Museum, where
it is on permanent display.
Required output - changes everything in uppercase to lowercase:
the horncastle boar's head is an early seventh-century anglo-saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. it was discovered in 2002 by a metal detectorist searching
in the town of horncastle, lincolnshire. it was reported as found
treasure and acquired for £15,000 by the city and county museum, where
it is on permanent display.
You shouldn't need to use pandas or dictionaries at all. Just use Python's built-in open() function:
# Open a file in read mode with a context manager
with open(r'C:\path\to\you\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()

# Use the string's lower() method to make everything lowercase
text = text.lower()
print(text)

# Split text by whitespace into list of words
word_list = text.split()

# Get the number of elements in the list (the word count)
word_count = len(word_list)
print(word_count)
If you want, you can do it in the reverse order:
# Open a file in read mode with a context manager
with open(r'C:\path\to\you\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()

# Split text by whitespace into list of words
word_list = text.split()

# Use a list comprehension to create a new list with the lower() method applied to each word
lowercase_word_list = [word.lower() for word in word_list]
print(lowercase_word_list)
Using a context manager for this is good since it automatically closes the file for you as soon as it goes out of scope (de-tabbed from the with statement block). Otherwise you would have to call open() yourself and remember to call file.close() when you're done.
I think there are some other benefits to using context managers, but someone please correct me if I'm wrong.
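For the word-count filter mentioned in the question, here is a minimal sketch that builds on the lowercased text variable from the snippet above; the target word list is just the question's example, so treat both as assumptions:
from collections import Counter

# Count how often each word appears, stripping simple punctuation from the ends
word_counts = Counter(word.strip(".,!?'\"") for word in text.lower().split())

# Keep only the words of interest
targets = ["bus", "car", "train", "aeroplane", "tram"]
filtered_counts = {word: word_counts[word] for word in targets}
print(filtered_counts)
Counter returns 0 for words it never saw, so any target word missing from the text simply shows up with a count of 0.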
I think what you are looking for is dictionary comprehension:
# Python 3
new_dict = {key: val.lower() for key, val in old_dict.items()}
# Python 2
new_dict = {key: val.lower() for key, val in old_dict.iteritems()}
items()/iteritems() gives you the (key, value) pairs represented in the dictionary (e.g. [('somekey', 'SomeValue'), ('somekey2', 'SomeValue2')])
The comprehension iterates over each of these pairs, creating a new dictionary in the process. In the key: val.lower() section, you can do whatever manipulation you want to create the new dictionary.
add name, where name is a string denoting a contact name. This must be stored as a new contact in the application.
find partial, where partial is a string denoting a partial name to search the application for. It must count the number of contacts starting with partial and print the count on a new line.
Given sequential add and find operations, perform each operation in order.
Input:
4
add hack
add hackerrank
find hac
find hak
Sample Output
2
0
We perform the following sequence of operations:
1. Add a contact named hack.
2. Add a contact named hackerrank.
3. Find and print the number of contact names beginning with hac. There are currently two contact names in the application and both of them start with hac, so we print 2 on a new line.
4. Find and print the number of contact names beginning with hak. There are currently two contact names in the application but neither of them starts with hak, so we print 0 on a new line.
I solved it, but it is taking a long time for a large number of strings. My code is:
addlist = []
findlist = []
n = int(input().strip())
for a0 in range(n):
    op, contact = input().strip().split(' ')
    if (op == 'add'):
        addlist.append(contact)
    else:
        findlist.append(contact)

for item in findlist:
    count = 0
    count = [count + 1 for item2 in addlist if item in item2 if item == item2[0:len(item)]]
    print(sum(count))
Is there any other way to avoid the long computation time?
As far as optimizing goes, I broke your code apart a bit for readability and removed a redundant if statement. I'm not sure if it's possible to optimize any further.
addlist = []
findlist = []
n = int(input().strip())
for a0 in range(n):
    op, contact = input().strip().split(' ')
    if (op == 'add'):
        addlist.append(contact)
    else:
        findlist.append(contact)

for item in findlist:
    count = 0
    for item2 in addlist:
        if item == item2[0:len(item)]:
            count += 1
    print(count)
I tested 10562 entries at once and it processed instantly, so if it lags for you it can be blamed on your processor.
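If the straightforward scan is still too slow on very large inputs, one common alternative is to count every prefix of each contact at insertion time, so each find becomes a single dictionary lookup. A minimal sketch of that idea (it prints each find result as soon as it is read, rather than collecting the finds into a list first):
from collections import defaultdict

n = int(input().strip())
prefix_counts = defaultdict(int)

for _ in range(n):
    op, contact = input().strip().split(' ')
    if op == 'add':
        # record every prefix of the new contact once, up front
        for i in range(1, len(contact) + 1):
            prefix_counts[contact[:i]] += 1
    else:
        # each find is now a constant-time dictionary lookup
        print(prefix_counts[contact])
This trades extra memory (one counter per distinct prefix) for much faster finds.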
I have a homework question which asks:
Write a function print_word_counts(filename) that takes the name of a
file as a parameter and prints an alphabetically ordered list of all
words in the document converted to lower case plus their occurrence
counts (this is how many times each word appears in the file).
I am able to get an unordered set of each word with its occurrence count; however, when I sort it and put each word on a new line, the count disappears.
import re

def print_word_counts(filename):
    input_file = open(filename, 'r')
    source_string = input_file.read().lower()
    input_file.close()
    words = re.findall('[a-zA-Z]+', source_string)
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    sorted_count = sorted(counts)
    print("\n".join(sorted_count))
When I run this code I get:
a
aborigines
absence
absolutely
accept
after
and so on.
What I need is:
a: 4
aborigines: 1
absence: 1
absolutely: 1
accept: 1
after: 1
I'm not sure how to sort it and keep the values.
It's a homework question, so I can't give you the full answer, but here's enough to get you started. Your mistake is in this line
sorted_count = sorted(counts)
Firstly, you can't sort a dictionary itself. Secondly, what this does is take the keys of the dictionary, sort them, and return a list.
You can just print the value of counts, or, if you really need them in sorted order, consider changing the dictionary items into a list, then sorting them.
lst = list(counts.items())
# sort and return lst