Convert everything in a dictionary to lower case, then filter on it?

import pandas as pd
import nltk
import os
directory = os.listdir(r"C:\...")
x = []
num = 0
for i in directory:
    x.append(pd.read_fwf("C:\\..." + i))
    x[num] = x[num].to_string()
    num += 1
So, once I have a dictionary x = [ ] populated by the read_fwf for each file in my directory:
I want to know how to make it so every single character is lowercase. I am having trouble understanding the syntax and how it is applied to a dictionary.
I want to define a filter that I can use to count for a list of words in this newly defined dictionary, e.g.,
list = [bus, car, train, aeroplane, tram, ...]
Edit: Quick unrelated question:
Is pd.read_fwf the best way to read .txt files? If not, what else could I use?
Any help is very much appreciated. Thanks
Edit 2: Sample data and output that I want:
Sample:
The Horncastle boar's head is an early seventh-century Anglo-Saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. It was discovered in 2002 by a metal detectorist searching
in the town of Horncastle, Lincolnshire. It was reported as found
treasure and acquired for £15,000 by the City and County Museum, where
it is on permanent display.
Required output - changes everything in uppercase to lowercase:
the horncastle boar's head is an early seventh-century anglo-saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. it was discovered in 2002 by a metal detectorist searching
in the town of horncastle, lincolnshire. it was reported as found
treasure and acquired for £15,000 by the city and county museum, where
it is on permanent display.

You shouldn't need to use pandas or dictionaries at all. Just use Python's built-in open() function:
# Open a file in read mode with a context manager
with open(r'C:\path\to\you\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()
# Use the string's lower() method to make everything lowercase
text = text.lower()
print(text)
# Split text by whitespace into list of words
word_list = text.split()
# Get the number of elements in the list (the word count)
word_count = len(word_list)
print(word_count)
If you want, you can do it in the reverse order:
# Open a file in read mode with a context manager
with open(r'C:\path\to\you\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()
# Split text by whitespace into list of words
word_list = text.split()
# Use list comprehension to create a new list with the lower() method applied to each word.
lowercase_word_list = [word.lower() for word in word_list]
print(lowercase_word_list)
Using a context manager for this is good since it automatically closes the file for you as soon as execution leaves the with block (the point where the code is de-indented). Otherwise you would have to call open() and file.close() yourself.
I think there are some other benefits to using context managers (for example, the file is closed even if an exception is raised inside the block), but someone please correct me if I'm wrong.
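For the second part of the question (counting words from a given list), one possible sketch builds on the lowercased text with collections.Counter. The function name count_target_words, the sample sentence and the target list are made up for illustration, and stripping surrounding punctuation is an assumption about how words such as "horncastle," should be treated.

```python
import string
from collections import Counter

def count_target_words(text, target_words):
    """Return {word: count} for each target word (case-insensitive match)."""
    # Lowercase, split on whitespace, and strip surrounding punctuation
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    counts = Counter(words)
    # A Counter returns 0 for words that never occur
    return {word: counts[word.lower()] for word in target_words}

# Illustrative sample text and target list
sample = "It was discovered in 2002. It was reported as found treasure."
print(count_target_words(sample, ["it", "was", "bus"]))
```

To cover every file in a directory, the same function could be called once per file and the results summed.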

I think what you are looking for is dictionary comprehension:
# Python 3
new_dict = {key: val.lower() for key, val in old_dict.items()}
# Python 2
new_dict = {key: val.lower() for key, val in old_dict.iteritems()}
items() (iteritems() in Python 2) lets you iterate over the (key, value) pairs in the dictionary as tuples (e.g. ('somekey', 'SomeValue'), ('somekey2', 'SomeValue2')). In Python 3, items() returns a view object rather than a list.
The comprehension iterates over each of these pairs, creating a new dictionary in the process. In the key: val.lower() section, you can do whatever manipulation you want to create the new dictionary.
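As a small sketch of that idea, assuming a dictionary that maps file names to their text (the file names and contents here are invented), a second comprehension with an if clause can act as the filter:

```python
# Hypothetical mapping of file names to file contents
old_dict = {
    "a.txt": "The Horncastle Boar's Head",
    "b.txt": "Anglo-Saxon ORNAMENT",
}

# Lowercase every value, keeping the keys unchanged
new_dict = {key: val.lower() for key, val in old_dict.items()}

# Filter: keep only the entries whose text contains a given word
matches = {key: val for key, val in new_dict.items() if "boar's" in val.split()}

print(new_dict)
print(matches)
```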

Related

How to filter a certain type of python list

I have a list of strings. Each string has the same length/number of characters in the format
xyzw01.ext or xyzv02.ext, etc.
For example
list 1: ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
list 2: ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
I would like from these lists to build new lists with only the strings with highest number.
So from list 1 I would like to get
['ADEJ01.ext','ABCJ02.ext','CDEJ03.ext']
while for list 2 I would like to get the same list since all numbers are 01.
Is there a "simple" way of achieving this?
You can use defaultdict and max
from collections import defaultdict
def fun(lst):
    res = defaultdict(list)
    for x in lst:
        res[x[:4]].append(x)
    return [max(res[x], key=lambda x: x[4:6]) for x in res]
lst = ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
lst2 = ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
print(fun(lst))
print(fun(lst2))
Output:
['ABCJ02.ext', 'CDEJ03.ext', 'ADEJ01.ext']
['ABCJ01.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'RPNJ01.ext', 'PLEJ01.ext']
The easiest way is probably to use an intermediate data structure, like a dict - sort the list items into buckets based on the first part of their names, and then take the maximum number for each bucket. We can just use the built-in max() without a key, since as-given lexicographic sorting works to find the largest. If that's not sufficient, you could use more regex to take the number out of the item and use it as the key instead.
import re
def filter_list(lst):
    prefixes = {}
    for item in lst:
        # use regex to isolate the non-numeric characters at the start of the string
        prefix = re.match(r'^([^0-9]*)', item).group(1)
        # make a bucket based on each prefix, and put the item in it
        prefixes.setdefault(prefix, [])
        prefixes[prefix].append(item)
    # make a list comprehension taking the maximum item from each bucket
    return [max(value) for value in prefixes.values()]
>>> a = ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
>>> b = ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
>>> filter_list(a)
['ABCJ02.ext', 'CDEJ03.ext', 'ADEJ01.ext']
>>> filter_list(b)
['ABCJ01.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'RPNJ01.ext', 'PLEJ01.ext']
In python 3.7+, this should preserve the order of list from the first occurrence of each prefix (i.e. CDEJ03.ext will precede ADEJ01.ext in the output because CDEJ02.ext precedes it in the input).
To get the output in the same order as the winning items appear in the original list, you'd want to sort the result by each item's position in the input, perhaps with a pattern like sorted(result, key=lst.index). Reassigning an existing key does not change a dict's insertion order, so replacing .setdefault() with an explicit assignment would not be enough.
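If the numbers should be compared as integers and the result should follow the order of the input list, a sketch along these lines might work. The function name keep_highest is made up, and the regex assumes every name is a non-digit prefix followed by digits and then the extension.

```python
import re
from collections import defaultdict

def keep_highest(lst):
    """Keep, for each prefix, the item with the largest number."""
    buckets = defaultdict(list)
    for item in lst:
        # Assumed format: non-digit prefix, then digits, then the extension
        prefix, number = re.match(r'^(\D*)(\d+)', item).groups()
        buckets[prefix].append((int(number), item))
    # max() compares the (number, item) tuples by number first
    winners = [max(bucket)[1] for bucket in buckets.values()]
    # Order the winners by where they appear in the input list
    return sorted(winners, key=lst.index)

a = ['ABCJ01.ext', 'CDEJ02.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'ABCJ02.ext', 'CDEJ03.ext']
print(keep_highest(a))
```

Calling lst.index for every winner is quadratic in the worst case, which is fine for short lists like these.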

iterating through values in a dictionary to invert key and values

I am trying to invert an italian-english dictionary using the code that follows.
Some terms have one translation, while others have multiple possibilities. If an entry has multiple translations I iterate through each word, adding it to english-italian dict (if not already present).
If there is a single translation it should not iterate, but as I have written the code, it does. Also only the last translation in the term with multiple translations is added to the dictionary. I cannot figure out how to rewrite the code to resolve what should be a really simple task
from collections import defaultdict
def invertdict():
    source_dict ={'paramezzale (s.m.)': ['hog', 'keelson', 'inner keel'], 'vento (s.m.)': 'wind'}
    english_dict = defaultdict(list)
    for parola, words in source_dict.items():
        if len(words) > 1: # more than one translation ?
            for word in words: # if true, iterate through each word
                word = str(word).strip(' ')
                print(word)
        else: # only one translation, don't iterate!!
            word = str(words).strip(' ')
            print(word)
        if word in english_dict.keys(): # check to see if the term already exists
            if english_dict[word] != parola: # check that the italian is not present
                #english_dict[word] = [english_dict[word], parola]
                english_dict[word].append(parola).strip('')
        else:
            english_dict[word] = parola.strip(' ')
    print(len(english_dict))
    for key,value in english_dict.items():
        print(key, value)
When this code is run, I get :
hog
keelson
inner keel
w
i
n
d
2
inner keel paramezzale (s.m.)
d vento (s.m.)
instead of
hog: paramezzale, keelson: paramezzale, inner keel: paramezzale, wind: vento
It would be easier to use lists everywhere in the dictionary, like:
source_dict = {'many translations': ['a', 'b'], 'one translation': ['c']}
Then you need 2 nested loops. Right now you're not always running the inner loop.
for italian_word, english_words in source_dict.items():
    for english_word in english_words:
        # print, add to english dict, etc.
If you can't change the source_dict format, you need to check the type explicitly. I would wrap the single item in a list.
for italian_word, item in source_dict.items():
    if not isinstance(item, list):
        item = [item]
Full code:
from collections import defaultdict
source_dict ={'paramezzale (s.m.)': ['hog', 'keelson', 'inner keel'], 'vento (s.m.)': ['wind']}
english_dict = defaultdict(list)
for parola, words in source_dict.items():
    for word in words:
        word = str(word).strip(' ')
        # add to the list if not already present
        # english_dict is a defaultdict(list) so we can use .append directly
        if parola not in english_dict[word]:
            english_dict[word].append(parola)
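If the source format cannot be changed, the type check and the nested loops can be combined. The following is only a sketch: the function name invert_dict is made up, and it assumes every value is either a single string or a list of strings.

```python
from collections import defaultdict

def invert_dict(source_dict):
    """Map each translation to the list of source terms that use it."""
    inverted = defaultdict(list)
    for term, translations in source_dict.items():
        # Wrap a lone string in a list so both cases share one loop
        if isinstance(translations, str):
            translations = [translations]
        for word in translations:
            word = word.strip()
            # Add the term only if it is not already present
            if term not in inverted[word]:
                inverted[word].append(term)
    return dict(inverted)

source = {'paramezzale (s.m.)': ['hog', 'keelson', 'inner keel'],
          'vento (s.m.)': 'wind'}
for english, italian in invert_dict(source).items():
    print(english, italian)
```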

List, tuples or dictionary, differences and usage, How can I store info in python

I'm very new to Python (I usually write PHP). I want to understand how to store information in an associative array, and it would be wonderful if you could explain the difference between "tuples", "arrays", "dictionaries" and "lists" (I have tried reading different sources but I'm still not getting it).
So This is my code:
#!/usr/bin/python3.4
import csv
import string
nidless_keys = dict()
nidless_keys = ['test_string1','test_string2'] #this contain the string to
# be searched in linesreader
data = {'type':[],'id':[]} #here I want to store my information
with open('path/to/csv/file.csv',newline="") as csvfile:
    linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
    for row in linesreader: #every line in this csv have a url like
        #www.test.com/?test_string1&id=123456
        current_row_string = str(row)
        for needle in nidless_keys:
            current_needle = str(needle)
            if current_needle in current_row_string:
                data[current_needle[current_row_string[-8:]]) += 1 # also I
                #need to count per every id how much rows there are.
In conclusion:
my_data_stored = [current_needle][current_row_string[-8]]
current_row_string[-8] refers to the URL; its last 8 characters are an ID.
So the array should look like this at the end of the script:
test_string1 = 123456 = 20
             = 256468 = 15
test_string2 = 123155 = 10
Edit 1:
Which type do I need here to store the information?
Can you tell me how to resolve this script?
It seems you want to count how many times an ID in combination with a test string occurs.
There can be multiple ID/count combinations associated with every test string.
This suggests that you should use a dictionary indexed by the test strings to store the results. In that dictionary I would suggest to store collections.Counter objects.
With a plain dictionary, you would have to add a special case that inserts an empty Counter whenever a key isn't found in the results dictionary. This is a common problem, so the collections module has a specialized form of dictionary called defaultdict that creates the missing value for you.
import collections
import csv
# Using a tuple for the keys so it cannot be accidentally modified
keys = ('test_string1', 'test_string2')
result = collections.defaultdict(collections.Counter)
with open('path/to/csv/file.csv',newline="") as csvfile:
    linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
    for row in linesreader:
        # csv.reader yields a list of fields; join them back into one string
        line = ','.join(row)
        for key in keys:
            if key in line:
                id = line[-6:] # ID's are six digits in your example.
                # The first index is into the dict, the second into the Counter.
                result[key][id] += 1
There is an even easier way, by using regular expressions.
Since you seem to treat every row in a CSV file as a string, there is little need to use the CSV reader, so I'll just read the whole file as text.
import re
with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()
pattern = r'\?(.*)&id=(\d+)'
The pattern is a regular expression. This is a large topic in and of itself, so I'll only cover briefly what it does. (You might also want to check out the relevant HOWTO) At first glance it looks like complete gibberish, but it is actually a complete language.
It looks for two things in a line: anything between ? and &id=, and a sequence of digits after &id=.
I'll be using IPython to give an example.
(If you don't know it, check out IPython. It is great for trying things out and seeing if they work.)
In [1]: import re
In [2]: pattern = r'\?(.*)&id=(\d+)'
In [3]: text = """www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=234567
....: www.test.com/?foo&id=234567
....: www.test.com/?foo&id=123456
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234"""
The text variable points to the string which is a mock-up for the contents of your CSV file.
I am assuming that:
every URL is on its own line
ID's are a sequence of digits.
If these assumptions are wrong, this won't work.
Using findall to extract every match of the pattern from the text.
In [4]: re.findall(pattern, text)
Out[4]:
[('test_string1', '123456'),
('test_string1', '123456'),
('test_string1', '234567'),
('foo', '234567'),
('foo', '123456'),
('foo', '1234'),
('foo', '1234'),
('foo', '1234')]
The findall function returns a list of 2-tuples (that is key, ID pairs). Now we just need to count those.
In [5]: import collections
In [6]: result = collections.defaultdict(collections.Counter)
In [7]: intermediate = re.findall(pattern, text)
Now we fill the result dict from the list of matches that is the intermediate result.
In [8]: for key, id in intermediate:
....:     result[key][id] += 1
....:
In [9]: print(result)
defaultdict(<class 'collections.Counter'>, {'foo': Counter({'1234': 3, '123456': 1, '234567': 1}), 'test_string1': Counter({'123456': 2, '234567': 1})})
So the complete code would be:
import collections
import re
with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()
result = collections.defaultdict(collections.Counter)
pattern = r'\?(.*)&id=(\d+)'
intermediate = re.findall(pattern, text)
for key, id in intermediate:
    result[key][id] += 1
This approach has two advantages.
You don't have to know the keys in advance.
ID's are not limited to six digits.
A brief summary of the python data types you mentioned:
A dictionary is an associative array, aka hashtable.
A list is a sequence of values.
An array (from the array module) is essentially the same as a list, but limited to basic datatypes. My impression is that they only exist for performance reasons; I don't think I've ever used one. If performance is that critical to you, you probably don't want to use Python in the first place.
A tuple is a fixed-length sequence of values (whereas lists and arrays can grow).
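As a rough illustration of the four types summarized above (a sketch only; the variable names and values are invented for the example):

```python
from array import array

# Dictionary: maps keys to values (an associative array)
counts = {"test_string1": 20}
counts["test_string2"] = 10

# List: an ordered sequence that can grow and hold mixed types
items = [1, "a", True]
items.append(2.5)

# array.array: a sequence restricted to one basic type (here signed ints)
numbers = array("i", [1, 2, 3])
numbers.append(4)

# Tuple: a fixed-length sequence that cannot be changed after creation
pair = ("test_string1", "123456")

# Tuples are hashable, so they can serve as dictionary keys
per_pair = {pair: 20}

print(counts, items, numbers.tolist(), per_pair[pair])
```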
Let's take them one by one.
Lists:
A list is a very basic data structure, written much like arrays in other languages:
['a','b','c']
This is a list in Python, but it looks very similar to an array.
However, there is a big difference between the way lists are used in Python and the usual arrays.
Lists are heterogeneous in nature. This means that we can store different kinds of data inside one list at the same time, like:
ls = [1,2,'a','g',True]
As you can see, we have various kinds of data within a single list, and it is still a valid list.
One important thing about lists is that we can access their items using zero-based indices. So we can write:
print(ls[0], ls[3])
output: 1 g
Dictionary:
This data structure is similar to a hash map. It contains (key, value) pairs. An empty dictionary looks like:
dc = {}
Now, to store key, value pairs, e.g., ('potato', 3) and ('tomato', 5), we can write:
dc['potato'] = 3
dc['tomato'] = 5
and we saved the data in the dictionary dc.
The important thing is that we can even store another data structure element like a list within a dictionary like:
dc['list1'] = ls , where ls is the list defined above.
This shows the power of using dictionary.
In your case, you have defined a dictionary like this:
data = {'type':[],'id':[]}
This means that your dictionary will consist of only two keys and each key corresponds to a list, which are empty for now.
Talking a bit about your script, the expression:
current_row_string[-8:]
doesn't make sense. The index should have been -6 instead of -8, which would give you the ID part of the current row.
This part is the ID and should be stored in a variable, say:
id = current_row_string[-6:]
Further steps can be taken as shown in the answer given by Roland.

String to dictionary word count and display

I have a homework question which asks:
Write a function print_word_counts(filename) that takes the name of a
file as a parameter and prints an alphabetically ordered list of all
words in the document converted to lower case plus their occurrence
counts (this is how many times each word appears in the file).
I am able to get an out-of-order set of each word with its occurrence count; however, when I sort it and put each word on a new line, the count disappears.
import re
def print_word_counts(filename):
    input_file = open(filename, 'r')
    source_string = input_file.read().lower()
    input_file.close()
    words = re.findall('[a-zA-Z]+', source_string)
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    sorted_count = sorted(counts)
    print("\n".join(sorted_count))
When I run this code I get:
a
aborigines
absence
absolutely
accept
after
and so on.
What I need is:
a: 4
aborigines: 1
absence: 1
absolutely: 1
accept: 1
after: 1
I'm not sure how to sort it and keep the values.
It's a homework question, so I can't give you the full answer, but here's enough to get you started. Your mistake is in this line
sorted_count = sorted(counts)
Firstly, you can't sort a dictionary itself. Secondly, what this does is take the keys of the dictionary, sort them, and return a list, so the counts are dropped.
You can just print the value of counts, or, if you really need them in sorted order, consider changing the dictionary items into a list, then sorting them.
lst = list(counts.items())
#sort and return lst
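To illustrate just the sorting step (not the whole assignment), here is a sketch with a small hand-made dictionary; the helper name format_counts and the sample data are invented for the example. Sorting the (word, count) pairs orders them by word, and each pair keeps its count.

```python
def format_counts(counts):
    """Return 'word: count' lines in alphabetical order of the words."""
    # sorted() on (key, value) tuples compares the keys first
    return [f"{word}: {count}" for word, count in sorted(counts.items())]

# Illustrative data, not read from a file
counts = {"after": 1, "a": 4, "absence": 1}
print("\n".join(format_counts(counts)))
```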

In python, how do you loop a dictionary but keep the changes made to it

So basically I'm new to Python and programming in general. Say you have a dictionary and you ask the user whether they want to add or delete terms in it. I know how to add or delete a term in a dictionary, but how do I "save" that data for the next time the program starts? Basically, if the user added a word to the dictionary and I then asked them, using a while loop, whether they wanted to return to the main menu, how would I make it so the word they added is now permanently in the dictionary when they return to the menu and start the program over?
Here is what I had. Mind you I'm a beginner and so if it looks weird, then sorry...lol....nothing serious:
loop=None
while True:
    #The initial dictionary
    things={"house":"a place where you live",
            "computer":"you use to do lots of stuff",
            "iPod":"mp3 player",
            "TV":"watch shows on it",
            "bed":"where you sleep",
            "wii":"a game system",
            "pizza":"food"}
    #Menu
    print("""
Welcome to the Dictionary of Things
Choose your preference:
0-Quit
1-Look up a Term
2-Add a Term
3-Redefine a Term
4-Delete a Term
""")
    choice=input("\nWhat do you want to do?: ")
    elif choice=="2": #Adds a term for the user
        term=input("What term do you want to add? ")
        if term not in things:
            definition=input("Whats the definition? ")
            things[term]=definition #adds the term to the dictionary
            print(term,"has been added to the dictionary")
        menu=input("""
Would you like to go back to the menu?
Yes(Y) or No(N): """)
        if menu=="Y":
            loop=None ----->#Ok so if they want to go back to the menu the program should remember what they added
        elif menu=="N":
            break
Update:
Your problem is that you redefine the dictionary at the start of each loop iteration. Move the initial definition of the dictionary to before the while loop, and you are in business.
Dictionaries and lists are mutable objects. Hence, if one is modified in a function, it stays modified in the caller too:
def main_function():
    # do something
    mydict = {'a': 2, 'b': 3}
    otherfunction(mydict)
    print(mydict)
def otherfunction(thedict):
    thedict['c'] = 5
If you now run main_function, it will print out a dictionary that includes 'c'.
As misha already said, pickle is a good idea, but an easier way is to use the shelve module, which uses pickle internally and does exactly what you ask for.
From the docs:
import shelve
d = shelve.open(filename) # open
d[key] = data # store data at key (overwrites old data if
# using an existing key)
data = d[key] # retrieve a COPY of data at key (raise KeyError if no
# such key)
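Applied to the dictionary of terms, a sketch might look like the following. The helper names add_term and load_terms are made up, and the temporary directory is only there to keep the example self-contained; a real program would presumably use a fixed file name so the data survives between runs.

```python
import os
import shelve
import tempfile

def add_term(path, term, definition):
    """Store one term in the shelf at path; it stays on disk afterwards."""
    with shelve.open(path) as db:
        db[term] = definition

def load_terms(path):
    """Read every stored term back into an ordinary dictionary."""
    with shelve.open(path) as db:
        return dict(db)

# Hypothetical location, chosen only for this example
path = os.path.join(tempfile.mkdtemp(), "things")
add_term(path, "pizza", "food")
add_term(path, "bed", "where you sleep")
print(load_terms(path))
```

Each call opens and closes the shelf, so the terms are written to disk even if the program stops right after adding one.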
I think it might help to be more specific about the structure of your program. It sounds like you want to persist a dictionary as an external file, to be loaded/reloaded on subsequent runs of your app. In this case you could use the pickle library like so:
import pickle
dictionary = {"foo": "bar", "spam": "egg"}
# save it to a file...
with open("myfile.dct", "wb") as outf:
    pickle.dump(dictionary, outf)
# load it in again:
reloaded = {}
with open("myfile.dct", "rb") as inf:
    reloaded = pickle.load(inf)
