converting a file into dict - python-3.x

my_file = """The Itsy Bitsy Spider went up the water spout.
Down came the rain & washed the spider out.
Out came the sun & dried up all the rain,
And the Itsy Bitsy Spider went up the spout again."""
Expected output:
{'the': ['itsy', 'water', 'rain', 'spider', 'sun', 'rain', 'itsy', 'spout'], 'itsy': ['bitsy', 'bitsy'], 'bitsy': ['spider', 'spider'], 'spider': ['went', 'out', 'went'], 'went': ['up', 'up'], 'up': ['the', 'all', 'the'], 'water': ['spout'], 'spout': ['down', 'again'], 'down': ['came'], 'came': ['the', 'the'], 'rain': ['washed', 'and'], 'washed': ['the'], 'out': ['out', 'came'], 'sun': ['dried'], 'dried': ['up'], 'all': ['the'], 'and': ['the'], 'again': []}
My code:
import string
words_set = {}
for line in my_file:
    lower_text = line.lower()
    for word in lower_text.split():
        word = word.strip(string.punctuation + string.digits)
        if word:
            if word in words_set:
                words_set[word] = words_set[word] + 1
            else:
                words_set[word] = 1

You can reproduce your expected results with a few concepts:
Given
import string
import itertools as it
import collections as ct
data = """\
The Itsy Bitsy Spider went up the water spout.
Down came the rain & washed the spider out.
Out came the sun & dried up all the rain,
And the Itsy Bitsy Spider went up the spout again.
"""
Code
def clean_string(s: str) -> str:
    """Return a lowercased string without punctuation or newlines."""
    table = str.maketrans("", "", string.punctuation)
    return s.lower().translate(table).replace("  ", " ").replace("\n", " ")

def get_neighbors(words: list) -> dict:
    """Return a dict of right-hand, neighboring words."""
    dd = ct.defaultdict(list)
    for word, nxt in it.zip_longest(words, words[1:], fillvalue=""):
        dd[word].append(nxt)
    return dict(dd)
Demo
words = clean_string(data).split()
get_neighbors(words)
Results
{'the': ['itsy', 'water', 'rain', 'spider', 'sun', 'rain', 'itsy', 'spout'],
'itsy': ['bitsy', 'bitsy'],
'bitsy': ['spider', 'spider'],
'spider': ['went', 'out', 'went'],
'went': ['up', 'up'],
'up': ['the', 'all', 'the'],
'water': ['spout'],
'spout': ['down', 'again'],
'down': ['came'],
'came': ['the', 'the'],
'rain': ['washed', 'and'],
'washed': ['the'],
'out': ['out', 'came'],
'sun': ['dried'],
'dried': ['up'],
'all': ['the'],
'and': ['the'],
'again': ['']}
Details
clean_string
You can remove punctuation in any number of ways. Here a translation table deletes every character in string.punctuation; str.replace() then collapses the double spaces left behind (for example where "&" was removed) and turns newlines into spaces.
get_neighbors
A defaultdict makes a dict of lists. A new list value is made if a key is missing.
We make the dict by iterating two juxtaposed word lists, one ahead of the other.
These lists are zipped by the longest list, filling the shorter list with an empty string.
dict(dd) ensures a plain dict is returned.
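Note that the expected output in the question ends with 'again': [], while zip_longest with fillvalue="" gives 'again': ['']. A minimal sketch of one way to get the empty list instead, assuming it is acceptable to pre-create a key for every word and pair words with plain zip (the function name get_neighbors_exact and the short sample text are made up for illustration):

```python
import string


def get_neighbors_exact(words):
    """Map each word to the list of words that directly follow it."""
    # Pre-create every key so the final word ends up with an empty list.
    neighbors = {word: [] for word in words}
    # zip stops at the shorter sequence, so the last word gets no neighbor.
    for word, nxt in zip(words, words[1:]):
        neighbors[word].append(nxt)
    return neighbors


# Hypothetical short sample, cleaned the same way as above.
text = "The Itsy Bitsy Spider went up. Up the spout again."
table = str.maketrans("", "", string.punctuation)
words = text.lower().translate(table).split()
neighbors = get_neighbors_exact(words)
print(neighbors["the"])    # ['itsy', 'spout']
print(neighbors["again"])  # []
```

Because str.split() with no arguments already splits on any run of whitespace, this sketch skips the str.replace() calls.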
If you solely wish to count words:
Demo
ct.Counter(words)
Results
Counter({'the': 8,
'itsy': 2,
'bitsy': 2,
'spider': 3,
'went': 2,
'up': 3,
'water': 1,
'spout': 2,
'down': 1,
'came': 2,
'rain': 2,
'washed': 1,
'out': 2,
'sun': 1,
'dried': 1,
'all': 1,
'and': 1,
'again': 1})

Related

Python nested dictionary Issue when iterating

I have 5 lists of words, which act as values in a dictionary whose keys are the document IDs.
For each document, I would like to apply some calculations and display the values and the results of the calculations in a nested dictionary.
So far so good: I managed to do everything else, but I am failing at the easiest part.
The resulting nested dictionary seems to contain only the last element of each of the 5 lists, so it does not show all the elements.
Could anybody explain where I am going wrong?
This is the original dictionary data_docs:
{'doc01': ['simpl', 'hello', 'world', 'test', 'python', 'code'],
'doc02': ['today', 'wonder', 'day'],
'doc03': ['studi', 'pac', 'today'],
'doc04': ['write', 'need', 'cup', 'coffe'],
'doc05': ['finish', 'pac', 'use', 'python']}
This is the result I am getting (for example, doc01 is missing 'simpl', 'hello', 'world', 'test' and 'python'):
{'doc01': {'code': 0.6989700043360189},
'doc02': {'day': 0.6989700043360189},
'doc03': {'today': 0.3979400086720376},
'doc04': {'coffe': 0.6989700043360189},
'doc05': {'python': 0.3979400086720376}}
And this is the code:
def tfidf (data, idf_score): #function, 2 dictionaries as parameters
    tfidf = {} #dict for output
    for word, val in data.items(): #for each word and value in data_docs(first dict)
        for v in val: #for each value in each list
            a = val.count(v) #count the number of times that appears in that list
            scores = {v :a * idf_score[v]} # dictionary that will act as value in the nested
        tfidf[word] = scores #final dictionary, the key is doc01,doc02... and the value the above dict
    return tfidf
tfidf(data_docs, idf_score)
Thanks,
Did you mean to do this?
def tfidf(data, idf_score): # function, 2 dictionaries as parameters
    tfidf = {} # dict for output
    for word, val in data.items(): # for each word and value in data_docs(first dict)
        scores = {} # <---- a new dict for each outer iteration
        for v in val: # for each value in each list
            a = val.count(v) # count the number of times that appears in that list
            scores[v] = a * idf_score[v] # <---- keep adding items to the dictionary
        tfidf[word] = scores # final dictionary, the key is doc01,doc02... and the value the above dict
    return tfidf
See my changes, marked with the <---- arrows :)
Returns:
{'doc01': {'simpl': 1,
'hello': 1,
'world': 1,
'test': 1,
'python': 1,
'code': 1},
'doc02': {'today': 1, 'wonder': 1, 'day': 1},
'doc03': {'studi': 1, 'pac': 1, 'today': 1},
'doc04': {'write': 1, 'need': 1, 'cup': 1, 'coffe': 1},
'doc05': {'finish': 1, 'pac': 1, 'use': 1, 'python': 1}}
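The same nested dictionary can also be built with a dict comprehension and collections.Counter, which counts each term once instead of calling list.count() for every element. This is only a sketch: the idf_score values below are made up for illustration, since the question does not show the real ones.

```python
from collections import Counter


def tfidf(data, idf_score):
    """Return {doc_id: {term: term_count * idf}} for every document."""
    return {
        doc: {term: count * idf_score[term]
              for term, count in Counter(terms).items()}
        for doc, terms in data.items()
    }


# Two documents from the question; the idf values are hypothetical.
data_docs = {'doc02': ['today', 'wonder', 'day'],
             'doc03': ['studi', 'pac', 'today']}
idf_score = {'today': 0.4, 'wonder': 0.7, 'day': 0.7,
             'studi': 0.7, 'pac': 0.4}

result = tfidf(data_docs, idf_score)
print(result['doc02'])  # {'today': 0.4, 'wonder': 0.7, 'day': 0.7}
```

Each document gets its own inner dict, so no entry overwrites another.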

Python Pandas How to get rid of groupings with only 1 row?

In my dataset, I am trying to get the margin between two values. The code below runs perfectly if the fourth race is not included. After grouping on a column, a group sometimes contains only 1 value, so there is no other value to compute a margin from. I want to ignore such groups. Here is my current code:
import pandas as pd
data = {'Name': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'RaceNumber': [1, 1, 2, 2, 3, 3, 4],
        'PlaceWon': ['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'],
        'TimeRanInSec': [100, 98, 66, 60, 75, 70, 75]}
df = pd.DataFrame(data)
print(df)
def winning_margin(times):
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin)
winning_margins.columns = ['margin']
winners = df.loc[df.PlaceWon == 'First', :]
winners = winners.join(winning_margins, on='RaceNumber')
avg_margins = winners[['Name', 'margin']].groupby('Name').mean()
avg_margins
How about returning NaN if times does not have enough elements:
import numpy as np
def winning_margin(times):
    if len(times) <= 1: # New code
        return np.nan # New code (np.nan, since the np.NaN alias was removed in NumPy 2.0)
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner
Your code runs with this change and seems to produce sensible results. You can also drop the NaNs afterwards if you want, e.g. in this line:
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin).dropna() # note the addition of .dropna()
You could get the winner and margin in one step:
def get_margin(x):
    if len(x) < 2:
        return np.nan
    i = x['TimeRanInSec'].idxmin()
    nl = x['TimeRanInSec'].nsmallest(2)
    margin = nl.max()-nl.min()
    return [x['Name'].loc[i], margin]
Then:
df.groupby('RaceNumber').apply(get_margin).dropna()
RaceNumber
1 [B, 2]
2 [C, 6]
3 [C, 5]
(Note that in the sample data the 'First' indicator corresponds to the slower time.)
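Another option, closer to the question's title, is to drop the single-row groups before computing anything, using GroupBy.filter. The following is a minimal sketch on the question's sample data; the variable names multi_runner_races and margins are made up for illustration.

```python
import pandas as pd

data = {'Name': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'RaceNumber': [1, 1, 2, 2, 3, 3, 4],
        'PlaceWon': ['First', 'Second', 'First', 'Second',
                     'First', 'Second', 'First'],
        'TimeRanInSec': [100, 98, 66, 60, 75, 70, 75]}
df = pd.DataFrame(data)

# Keep only the rows whose RaceNumber group has more than one row.
multi_runner_races = df.groupby('RaceNumber').filter(lambda g: len(g) > 1)

# Margin = second-smallest time minus smallest time, per race.
margins = (multi_runner_races
           .groupby('RaceNumber')['TimeRanInSec']
           .agg(lambda s: s.nsmallest(2).iloc[1] - s.min()))
print(margins)
```

With race 4 filtered out up front, the aggregation never sees a group with fewer than two times, so no NaN handling is needed.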

How to specifically name each value in a list of tuples in python?

Tuple_list prints out something like "[(2000, 1, 1, 1, 135), (2000, 1, 1, 2, 136), etc...]" and I can't figure out how to assign "year, month, day, hour, height" to every tuple in the list.
def read_file(filename):
    with open(filename, "r") as read:
        pre_list = list()
        for line in read.readlines():
            remove_symbols = line.strip()
            make_list = remove_symbols.replace(" ", ", ")
            pre_list += make_list.split('()')
        Tuple_list = [tuple(map(int, each.split(', '))) for each in pre_list]
        for n in Tuple_list:
            year, month, day, hour, height = Tuple_list[n][0], Tuple_list[
                n][1], Tuple_list[n][2], Tuple_list[n][3], Tuple_list[n][4]
            print(month)
        return Tuple_list
swag = read_file("VIK_sealevel_2000.txt")
Maybe named tuples are what you are looking for.
In [1]: from collections import namedtuple
In [2]: Measure = namedtuple('Measure', ['year', 'month', 'day', 'hour', 'height'])
In [3]: m1 = Measure(2005,1,1,1,4444)
In [4]: m1
Out[4]: Measure(year=2005, month=1, day=1, hour=1, height=4444)
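To apply this to a whole list of plain tuples, each tuple can be converted with Measure._make, or simply unpacked in the for statement. A short sketch, assuming every tuple has exactly five integer fields as in the question (the two sample rows are the ones quoted there):

```python
from collections import namedtuple

Measure = namedtuple('Measure', ['year', 'month', 'day', 'hour', 'height'])

# Sample rows in the shape the question describes.
tuple_list = [(2000, 1, 1, 1, 135), (2000, 1, 1, 2, 136)]

# Option 1: convert every tuple to a Measure and access fields by name.
measures = [Measure._make(row) for row in tuple_list]
for m in measures:
    print(m.hour, m.height)

# Option 2: unpack each tuple directly in the loop header.
heights = []
for year, month, day, hour, height in tuple_list:
    heights.append(height)
print(heights)  # [135, 136]
```

Iterating over the list yields the tuples themselves, so there is no need to index the list with the loop variable.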

join strings within a list of lists by 4

Background
I have a list of lists as seen below
l = [['NAME',':', 'Mickey', 'Mouse', 'was', 'here', 'and', 'Micky', 'mouse', 'went', 'out'],
['Donal', 'duck', 'was','Date', 'of', 'Service', 'for', 'Donald', 'D', 'Duck', 'was', 'yesterday'],
['I', 'like','Pluto', 'the', 'carton','Dog', 'bc','he', 'is','fun']]
Goal
Join l by every 4 elements (when possible)
Problem
But sometimes 4 elements won't cleanly join as 4 as seen in my desired output
Desired Output
desired_l = [['NAME : Mickey Mouse', 'was here and Micky', 'mouse went out'],
['Donal duck was Date', 'of Service for Donald', 'D Duck was yesterday'],
['I like Pluto the', 'carton Dog bc he', 'is fun']]
Question
How do I achieve desired_l?
itertools has some nifty functions; zip_longest, used in the grouper recipe from the itertools documentation, can do just this.
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
[[' '.join(filter(None, x)) for x in list(grouper(sentence, 4, fillvalue=''))] for sentence in l]
Result:
[['NAME : Mickey Mouse', 'was here and Micky', 'mouse went out'],
['Donal duck was Date', 'of Service for Donald', 'D Duck was yesterday'],
['I like Pluto the', 'carton Dog bc he', 'is fun']]
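Since the inner lists support len() and slicing, plain slices in steps of 4 give the same result without any fill values to strip out. A minimal sketch (the helper name join_every is made up for illustration):

```python
def join_every(words, n):
    """Join consecutive runs of up to n words into single strings."""
    # A slice past the end of the list is simply shorter, so the
    # last chunk needs no padding.
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


l = [['NAME', ':', 'Mickey', 'Mouse', 'was', 'here', 'and', 'Micky', 'mouse', 'went', 'out'],
     ['I', 'like', 'Pluto', 'the', 'carton', 'Dog', 'bc', 'he', 'is', 'fun']]

joined = [join_every(sentence, 4) for sentence in l]
print(joined[1])  # ['I like Pluto the', 'carton Dog bc he', 'is fun']
```

The grouper recipe remains the better fit for arbitrary iterables such as generators, which cannot be sliced.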

Using python need to get the substrings

After executing the code, I need to print the values [1, 12, 123, 2, 23, 3, 13], but I am getting [1, 12, 123, 2, 23, 3]. The value 13 is missing. Can anyone tell me the reason and how to fix it?
def get_all_substrings(string):
    length = len(string)
    list = []
    for i in range(length):
        for j in range(i,length):
            list.append(string[i:j+1])
    return list
values = get_all_substrings('123')
results = list(map(int, values))
print(results)
count = 0
for i in results:
    if i > 1 :
        if (i % 2) != 0:
            count += 1
print(count)
This is a pretty straightforward issue in the nested for loops within get_all_substrings(). Let's walk through it!
You are iterating over each index of your string 123:
for i in range(length): # we know length to be 3, so range is 0, 1, 2
You then iterate over each subsequent index, starting from the current i:
    for j in range(i,length):
Finally you append the substring from position i up to (but not including) j+1, using the slice operator:
        list.append(string[i:j+1])
But what exactly is happening? Well, we can step through further!
The first value of i is 0, so let's skip the first for and go to the second:
for j in range(0, 3): # i.e. the whole string!
    # you would eventually execute all of the following
    list.append(string[0:0 + 1]) # '1'
    list.append(string[0:1 + 1]) # '12'
    list.append(string[0:2 + 1]) # '123'
    # but wait... where is '13'? (this is your hint!)
The next value of i is 1:
for j in range(1, 3):
    # you would eventually execute all of the following
    list.append(string[1:1 + 1]) # '2'
    list.append(string[1:2 + 1]) # '23'
    # notice how we are only grabbing values of position i or more?
Finally, i is 2:
for j in range(2, 3): # i.e. only the last character
    # you would execute just the following
    list.append(string[2:2 + 1]) # '3'
I've shown you what is happening (as you asked in your question): a slice can only return contiguous characters, so '13' is never produced. I leave it to you to devise your own solution. A couple of notes:
You need to look at all index combinations from position i, not only contiguous ranges
Don't name objects after their type (i.e. don't name a list object list)
I would try something like this, using itertools and the powerset() recipe:
from itertools import chain, combinations
def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
output = list(map(''.join, powerset('123')))
output.pop(0) # drop the empty string produced by the empty combination
Here is another option, using combinations:
from itertools import combinations
def get_sub_ints(raw):
    return [''.join(sub) for i in range(1, len(raw) + 1) for sub in combinations(raw, i)]
if __name__ == '__main__':
    print(get_sub_ints('123'))
>>> ['1', '2', '3', '12', '13', '23', '123']
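To connect this back to the second half of the question's code, the subsequences can be converted to integers and the odd values greater than 1 counted in one pass. This is a sketch built on the combinations approach; the names get_subsequences and odd_count are made up for illustration, and the values come out ordered by length rather than in the order listed in the question.

```python
from itertools import combinations


def get_subsequences(raw):
    """Return every non-empty subsequence of raw, shortest first."""
    return [''.join(sub)
            for r in range(1, len(raw) + 1)
            for sub in combinations(raw, r)]


values = [int(s) for s in get_subsequences('123')]
print(values)  # [1, 2, 3, 12, 13, 23, 123]

# Count the odd values greater than 1, as in the question's loop.
odd_count = sum(1 for v in values if v > 1 and v % 2 != 0)
print(odd_count)  # 4
```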
