Create dictionary with count of values from list - python-3.x

I'm trying to figure out how to create a dictionary with the school as the key and its wins-losses-draws record as the value, based on each item in the list. For example, calling my_dict['Clemson'] would return the string "1-1-1".
team_score_list =[['Georgia', 'draw'], ['Duke', 'loss'], ['Virginia Tech', 'win'], ['Virginia', 'loss'], ['Clemson', 'loss'], ['Clemson', 'win'], ['Clemson', 'draw']]
The output for the above list should be the following dictionary:
{'Georgia': '0-0-1', 'Duke': '0-1-0', 'Virginia Tech': '1-0-0', 'Virginia': '0-1-0', 'Clemson': '1-1-1'}
For context, the original data comes from a CSV, where each line is in the form of Date,Opponent,Location,Points For,Points Against.
For example: 2016-12-31,Kentucky,Neutral,33,18.
I've managed to wrangle the data into the above list (albeit probably not in the most efficient manner), but I'm not sure how to get it into the format above.
Any help would be greatly appreciated!

Not beautiful but this should work.
team_score_list = [
["Georgia", "draw"],
["Duke", "loss"],
["Virginia Tech", "win"],
["Virginia", "loss"],
["Clemson", "loss"],
["Clemson", "win"],
["Clemson", "draw"],
]
def gen_dict_lst(team_score_list):
    """Generates dict of list based on team record"""
    team_score_dict = {}
    for team_record in team_score_list:
        if team_record[0] not in team_score_dict.keys():
            team_score_dict[team_record[0]] = [0, 0, 0]
        if team_record[1] == "win":
            team_score_dict[team_record[0]][0] += 1
        elif team_record[1] == "loss":
            team_score_dict[team_record[0]][1] += 1
        elif team_record[1] == "draw":
            team_score_dict[team_record[0]][2] += 1
    return team_score_dict
def convert_format(score_dict):
    """formats list to string for output validation"""
    output_dict = {}
    for key, value in score_dict.items():
        new_val = []
        for index, x in enumerate(value):
            if index == 2:
                new_val.append(str(x))
            else:
                new_val.append(str(x) + "-")
        new_str = "".join(new_val)
        output_dict[key] = new_str
    return output_dict
score_dict = gen_dict_lst(team_score_list)
out_dict = convert_format(score_dict)
print(out_dict)
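For comparison, here is a more compact sketch of the same result. It swaps in a different technique, collections.Counter keyed by (team, outcome) pairs, and the helper name build_records is made up for illustration. It assumes every outcome is one of "win", "loss" or "draw".

```python
from collections import Counter

team_score_list = [
    ["Georgia", "draw"],
    ["Duke", "loss"],
    ["Virginia Tech", "win"],
    ["Virginia", "loss"],
    ["Clemson", "loss"],
    ["Clemson", "win"],
    ["Clemson", "draw"],
]

def build_records(results):
    """Return {team: "wins-losses-draws"} for a list of [team, outcome] pairs."""
    # Count every (team, outcome) pair; pairs that never occur count as 0.
    counts = Counter((team, outcome) for team, outcome in results)
    records = {}
    for team, _ in results:
        if team not in records:
            records[team] = "-".join(
                str(counts[(team, outcome)]) for outcome in ("win", "loss", "draw")
            )
    return records

records = build_records(team_score_list)
print(records["Clemson"])
```

Because a Counter returns 0 for missing keys, no explicit initialisation to [0, 0, 0] is needed.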

You can first create a dictionary and then insert/increment the win, loss and draw counts while iterating over the list. Here I use variable names that match the strings used for win, loss and draw, and increase the corresponding value in the dictionary by looking the name up with globals()['str'] (taken from another answer).
dct={}
for i in team_score_list:
    draw=2
    win=0
    loss=1
    if i[0] in dct:
        dct[i[0]][globals()[i[1]]]+=1
    else:
        dct[i[0]]=[0,0,0]
        dct[i[0]][globals()[i[1]]]=1
You can then convert your list to string by using '-'.join(...) to get it in a format you want in the dictionary.
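A possible variant of this idea that avoids globals() is to keep the outcome-to-position mapping in an ordinary dictionary and finish with '-'.join(...). This is only a sketch: the names OUTCOME_INDEX and tally and the shortened sample list are invented for illustration.

```python
# Hypothetical mapping from outcome string to its position in [wins, losses, draws].
OUTCOME_INDEX = {"win": 0, "loss": 1, "draw": 2}

def tally(results):
    """Count outcomes per team and format each record as "wins-losses-draws"."""
    table = {}
    for team, outcome in results:
        # setdefault creates the [0, 0, 0] list the first time a team is seen.
        table.setdefault(team, [0, 0, 0])[OUTCOME_INDEX[outcome]] += 1
    return {team: "-".join(map(str, counts)) for team, counts in table.items()}

sample = [["Clemson", "loss"], ["Clemson", "win"], ["Clemson", "draw"], ["Duke", "loss"]]
formatted = tally(sample)
print(formatted)
```

An unknown outcome string raises a KeyError here, which may be preferable to silently reading an unrelated global variable.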

I now get what you mean:
You could do
a = dict()
# order in the string: wins-losses-draws, at positions 0, 2 and 4
f = lambda x,s: str(int(m[x]=='1' or j==s))
for (i,j) in team_score_list:
    m = a.get(i,'0-0-0')
    a[i] = f"{f(0,'win')}-{f(2,'loss')}-{f(4,'draw')}"
{'Georgia': '0-0-1',
'Duke': '0-1-0',
'Virginia Tech': '1-0-0',
'Virginia': '0-1-0',
'Clemson': '1-1-1'}
Note that this answer only works for this example, because each count is capped at 1. If you had a lot of data, it would be better to keep a list of counts and join at the end. E.g.:
b = dict()
g = lambda x,s: str(int(m[x]) + (j==s))
for (i,j) in team_score_list:
    m = b.get(i,[0,0,0])
    b[i] =[g(0,"win"),g(1,"loss"),g(2,"draw")]
{key:'-'.join(val) for key,val in b.items()}
{'Georgia': '0-0-1',
'Duke': '0-1-0',
'Virginia Tech': '1-0-0',
'Virginia': '0-1-0',
'Clemson': '1-1-1'}

Related

How to create a dataframe from extracted hashtags?

I have used below code to extract hashtags from tweets.
def find_tags(row_string):
    tags = [x for x in row_string if x.startswith('#')]
    return tags
df['split'] = df['text'].str.split(' ')
df['hashtags'] = df['split'].apply(lambda row : find_tags(row))
df['hashtags'] = df['hashtags'].apply(lambda x : str(x).replace('\\n', ',').replace('\\', '').replace("'", ""))
df.drop('split', axis=1, inplace=True)
df
However, when I count them using the code below, the output counts each character instead of each hashtag.
from collections import Counter
d = Counter(df.hashtags.sum())
data = pd.DataFrame([d]).T
data
Output I am getting is:
I think the problem lies with the code that I am using to extract hashtags. But I don't know how to solve this issue.
Change find_tags so that it splits the text and applies replace inside the list comprehension, and to count the values use Series.explode with Series.value_counts:
def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]
df['hashtags'] = df['text'].apply(find_tags)
and then:
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
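The character-level counts most likely come from the hashtags column holding strings rather than lists: summing strings concatenates them, and a Counter over a string counts characters. The pandas-free sketch below illustrates that difference with the standard library only; the sample tweets are invented.

```python
from collections import Counter

tweets = ["love #python and #pandas", "more #python today"]  # hypothetical sample data

def find_tags(text):
    """Return the whitespace-separated words that start with '#'."""
    return [word for word in text.split() if word.startswith("#")]

# Mimics the original code: each list of tags is turned into a string first.
tags_as_strings = [str(find_tags(t)) for t in tweets]
char_counts = Counter("".join(tags_as_strings))  # counts single characters

# Keeping real lists and flattening them counts whole hashtags instead.
tags_as_lists = [find_tags(t) for t in tweets]
tag_counts = Counter(tag for tags in tags_as_lists for tag in tags)
print(tag_counts.most_common())
```

Flattening the lists here plays the same role as Series.explode in the pandas version.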

Get dict values based on a filtered list with a varying number of returned elements

I am filtering elements from a list that include '"' by the following code:
def sizes():
    new_list = [x for x in mid_item_size_one if '"' in x]
    return new_list
This will return any element with '"' as desired. Example strings below.
Random Text 0.5" Random Text
0.25" Random Text
1.5" x 0.5" Random Text
I call .split on each string and then apply the function above, which returns:
['0.5"']
['0.25"']
['1.5"', '0.5"']
I now need to lookup each of the elements in a dictionary and return the value from the key:value pair as new individual variables so I will be able to add them to a new string that contains a number of other variables. See example of desired Result below:
val_1 = '0.5"'
val_1 = '0.25"'
val_1 = '1.5"' and val_2 = '0.5"'
Random Text val_1 Random Text
Random Text val_1 Random Text
Random Text val_1 Random Text val_2
I already have my function to lookup/retrieve the value from dictionary however since I started retrieving the values via filter, I haven't been able to figure out how to retrieve the dict value.
def item_size_one_final(size_dict):
    for x in sizes():
        for key in size_dict:
            if key in sizes():
                return size_dict[key]
        return "Hmmmm"
    return "Not Working"
The above for loops result in ['Hmmmm'] on all of it. Does anyone have any suggestions for how to do this?
@Riedler - Sure, hopefully this example helps.
size_dict = {
    '1/4"' : '8mm - 1/4"',
    '1.5"' : '40mm - 11/2"',
    '0.5"' : '15mm - 1/2"',}
Raw Data Input:
0.5" Pipe
1/4" Flange
1.5" x 0.5" Reducer
My company uses SAP with set item codes and item description formats so I am taking those three descriptions and putting them in our format. This:
def sizes():
    new_list = [x for x in mid_item_size_one if '"' in x]
    return new_list
With a step in between to split each description into a list, this is returned:
['0.5"']
['1/4"']
['1.5"', '0.5"']
From this point, I need to run these elements through my dictionary and get the value (key:value pair). My current for loop doesn't work and I'm not sure why or what I can alter to correct it.
The final result should be:
var_1 = 15mm - 1/2"
var_1 = 8mm - 1/4"
var_1 = 40mm - 11/2", var_2 = 15mm - 1/2"
If I understand correctly, your mistake is that you loop over the whole dictionary for every value in the list of sizes, which is unnecessary.
The point of a dictionary is that when you have a value that matches a key, you can look up the corresponding value directly.
Also, there is no need to call sizes twice to get the same values; save the result first and reuse it.
def item_size_one_final(size_dict):
    sizes_lst = sizes()
    res = []
    for x in sizes_lst:
        if x in size_dict:
            res.append(size_dict[x])
    return res
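Put together with the sample data from the question, the direct lookup could look like the sketch below. The function name lookup_sizes and the " x " separator used when printing are assumptions made for illustration; the dictionary is copied from the question.

```python
size_dict = {
    '1/4"': '8mm - 1/4"',
    '1.5"': '40mm - 11/2"',
    '0.5"': '15mm - 1/2"',
}

def lookup_sizes(description, table):
    """Return the dictionary values for every inch size found in description."""
    # Keep only the whitespace-separated tokens that contain an inch mark.
    tokens = [tok for tok in description.split() if '"' in tok]
    # One direct lookup per token; tokens without a dictionary entry are skipped.
    return [table[tok] for tok in tokens if tok in table]

for text in ['0.5" Pipe', '1/4" Flange', '1.5" x 0.5" Reducer']:
    print(" x ".join(lookup_sizes(text, size_dict)))
```

Returning a list rather than separate val_1/val_2 variables keeps the code the same whether a description contains one size or two.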

How to concatenate data frames from two different dictionaries into a new data frame in python?

This is my sample code
dataset_current=dataset_seq['Motor_Current_Average']
dataset_consistency=dataset_seq['Consistency_Average']
#technique with non-overlapping the values(for current)
dataset_slide=dataset_current.tolist()
from window_slider import Slider
import numpy
list = numpy.array(dataset_slide)
bucket_size = 336
overlap_count = 0
slider = Slider(bucket_size,overlap_count)
slider.fit(list)
empty_dictionary = {}
count = 0
while True:
    count += 1
    window_data = slider.slide()
    empty_dictionary['df_current%s'%count] = window_data
    empty_dictionary['df_current%s'%count] =pd.DataFrame(empty_dictionary['df_current%s'%count])
    empty_dictionary['df_current%s'%count]= empty_dictionary['df_current%s'%count].rename(columns={0: 'Motor_Current_Average'})
    if slider.reached_end_of_list(): break
locals().update(empty_dictionary)
#technique with non-overlapping the values(for consistency)
dataset_slide_consistency=dataset_consistency.tolist()
list = numpy.array(dataset_slide_consistency)
slider_consistency = Slider(bucket_size,overlap_count)
slider_consistency.fit(list)
empty_dictionary_consistency = {}
count_consistency = 0
while True:
    count_consistency += 1
    window_data_consistency = slider_consistency.slide()
    empty_dictionary_consistency['df_consistency%s'%count_consistency] = window_data_consistency
    empty_dictionary_consistency['df_consistency%s'%count_consistency] =pd.DataFrame(empty_dictionary_consistency['df_consistency%s'%count_consistency])
    empty_dictionary_consistency['df_consistency%s'%count_consistency]= empty_dictionary_consistency['df_consistency%s'%count_consistency].rename(columns={0: 'Consistency_Average'})
    if slider_consistency.reached_end_of_list(): break
locals().update(empty_dictionary_consistency)
import pandas as pd
output_current ={}
increment = 0
while True:
    increment +=1
    output_current['dataframe%s'%increment] = pd.concat([empty_dictionary_consistency['df_consistency%s'%count_consistency],empty_dictionary['df_current%s'%count]],axis=1)
My question: I have two dictionaries, "empty_dictionary_consistency" and "empty_dictionary", each containing 79 data frames. I want to build a new data frame for each pair, concatenating df1 from empty_dictionary_consistency with df1 from empty_dictionary, and so on up to df79 from empty_dictionary_consistency with df79 from empty_dictionary. I tried using a while loop to increment the index, but it does not show any output.
output_current ={}
increment = 0
while True:
    increment +=1
    output_current['dataframe%s'%increment] = pd.concat([empty_dictionary_consistency['df_consistency%s'%count_consistency],empty_dictionary['df_current%s'%count]],axis=1)
Can anyone help me with this? How can I do it?
I am not near my computer now, so I can not test the code, but it seems that the problem is in indices. In the last loop, on every iteration you increment a variable called 'increment', but you still use indices from previous loops for dictionaries that you want to concatenate. Try to change variables that you use for indexing all dictionaries to 'increment'.
And one more thing - I can't see when this loop is going to finish?
UPD
I mean this:
length = len(empty_dictionary_consistency)
increment = 0
while increment < length:
    increment +=1
    output_current['dataframe%s'%increment] = pd.concat([empty_dictionary_consistency['df_consistency%s'%increment],empty_dictionary['df_current%s'%increment]],axis=1)
While iterating over your dictionaries, use the variable you increment as the index into all three dictionaries. And since you do not use a Slider object in this loop, you have to stop it yourself once the first dictionary is exhausted.

Retrieving dict value via hardcoded key, works. Retrieving via computed key doesn't. Why?

I'm generating a common list of IDs by comparing two sets of IDs (the ID sets are from a dictionary, {ID: XML "RECORD" element}). Once I have the common list, I want to iterate over it and retrieve the value corresponding to the ID from a dictionary (which I'll write to disc).
When I compute the common ID list using my diff_comm_checker function, I'm unable to retrieve the dict value the ID corresponds to. It doesn't however fail with a KeyError. I can also print the ID out.
When I hard code the ID in as the common_id value, I can retrieve the dict value.
I.e.
common_ids = diff_comm_checker( list_1, list_2, "text")
# does nothing - no failures
common_ids = ['0603599998140032MB']
#gives me:
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE788>
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE3E0>
So I suspected there was some difference between the strings. I checked both the function output and compared it against the hard-coded values using:
print [(_id, type(_id), repr(_id)) for _id in common_ids][0]
I get exactly the same for both:
>>> ('0603599998140032MB', <type 'str'>, "'0603599998140032MB'")
I have also followed the advice of another question and used difflib.ndiff:
common_ids1 = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "text")
common_ids = ['0603599998140032MB']
print "\n".join(difflib.ndiff(common_ids1, common_ids))
>>> 0603599998140032MB
So again, doesn't appear that there's any difference between the two.
Here's a full, working example of the code:
from StringIO import StringIO
import xml.etree.cElementTree as ET
from itertools import chain, islice
def diff_comm_checker(list_1, list_2, text):
    """Checks 2 lists. If no difference, pass. Else return common set between two lists"""
    symm_diff = set(list_1).symmetric_difference(list_2)
    if not symm_diff:
        pass
    else:
        mismatches_in1_not2 = set(list_1).difference( set(list_2) )
        mismatches_in2_not1 = set(list_2).difference( set(list_1) )
        if mismatches_in1_not2:
            mismatch_logger(
                mismatches_in1_not2,"{}\n1: {}\n2: {}".format(text, list_1, list_2), 1, 2)
        if mismatches_in2_not1:
            mismatch_logger(
                mismatches_in2_not1,"{}\n2: {}\n1: {}".format(text, list_1, list_2), 2, 1)
    set_common = set(list_1).intersection( set(list_2) )
    if set_common:
        return sorted(set_common)
    else:
        return "no common set: {}\n".format(text)
def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))
def get_elements_iteratively(file):
    """Create unique ID out of image number and case number, return it along with corresponding xml element"""
    tag = "RECORD"
    tree = ET.iterparse(StringIO(file), events=("start","end"))
    context = iter(tree)
    _, root = next(context)
    for event, record in context:
        if event == 'end' and record.tag == tag:
            xml_element_2 = ''
            xml_element_1 = ''
            for child in record.getchildren():
                if child.tag == "IMAGE_NUMBER":
                    xml_element_1 = child.text
                if child.tag == "CASE_NUM":
                    xml_element_2 = child.text
            r_id = "{}{}".format(xml_element_1, xml_element_2)
            record.set("R", r_id)
            yield (r_id, record)
            root.clear()
def get_chunks(file, chunk_size):
    """Breaks XML into chunks, yields dict containing unique IDs and corresponding xml elements"""
    iterable = get_elements_iteratively(file)
    for chunk in chunks(iterable, chunk_size):
        ids_records = {}
        for k in chunk:
            ids_records[k[0]]=k[1]
        yield ids_records
def create_new_xml(xml_list):
    chunk = 5000
    chunk_rec_ids_1 = get_chunks(xml_list[0], chunk)
    chunk_rec_ids_2 = get_chunks(xml_list[1], chunk)
    to_write = [chunk_rec_ids_1, chunk_rec_ids_2]
    ######################################################################################
    ### WHAT'S GOING HERE ??? WHAT'S THE DIFFERENCE BETWEEN THE OUTPUTS OF THESE TWO ? ###
    common_ids = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
    #common_ids = ['0603599998140032MB']
    ######################################################################################
    for _id in common_ids:
        print _id
        for gen_obj in to_write:
            for kv_pair in gen_obj:
                if kv_pair[_id]:
                    print _id, kv_pair[_id].attrib, kv_pair[_id]
if __name__ == '__main__':
    xml_1 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    xml_2 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    create_new_xml([xml_1, xml_2])
The problem is not in the type or value of common_ids returned from diff_comm_checker. The problem is that calling diff_comm_checker, or more precisely constructing its arguments, destroys the contents of to_write.
If you try this you will see what I mean
common_ids = ['0603599998140032MB']
diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
This will give the erroneous behavior without using the return value from diff_comm_checker()
This is because to_write holds generators, and building the arguments for diff_comm_checker exhausts them. The generators are then finished/empty when they are used in the if-statement in the loop. You can create a list from a generator by using list:
chunk_rec_ids_1 = list(get_chunks(xml_list[0], chunk))
chunk_rec_ids_2 = list(get_chunks(xml_list[1], chunk))
But this may have other implications (memory usage...)
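The effect can be reproduced in isolation. The sketch below is written for Python 3 (the question's code is Python 2) and uses an invented generator, numbered_chunks, as a stand-in for get_chunks.

```python
def numbered_chunks():
    """Yield two small dicts, standing in for the chunk dictionaries."""
    yield {"a": 1}
    yield {"b": 2}

gen = numbered_chunks()
first_pass = [list(d.keys()) for d in gen]   # consumes the generator
second_pass = [list(d.keys()) for d in gen]  # nothing is left to iterate over

# Materialising the generator once allows any number of passes.
materialized = list(numbered_chunks())
pass_one = [list(d) for d in materialized]
pass_two = [list(d) for d in materialized]
print(first_pass, second_pass, pass_one == pass_two)
```

The second pass over the generator raises no error, it simply yields nothing, which matches the "does nothing, no failures" behaviour described in the question.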
Also, what is the intention of this construct in diff_comm_checker?
if not symm_diff:
    pass
In my opinion nothing happens here regardless of whether symm_diff is empty or not.

how to maintain the keys in TransformDict in python

https://github.com/fluentpython/example-code/blob/master/03-dict-set/transformdict.py
I see the demo:
'''Dictionary that calls a transformation function when looking
up keys, but preserves the original keys.
>>> d = TransformDict(str.lower)
>>> d['Foo'] = 5
>>> d['foo'] == d['FOO'] == d['Foo'] == 5
True
>>> set(d.keys())
{'Foo'}
'''
But I don't understand how the object maintains the original keys.
Thanks.
What I really want to ask is how the keys method works.
It keeps two dictionaries, one for the original keys and one for the values; see getitem on line 51:
def getitem(self, key):
    'D.getitem(key) -> (stored key, value)'
    transformed = self._transform(key)
    original = self._original[transformed] # original keys!
    value = self._data[transformed] # values!
    return original, value
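To make the two-dictionary idea concrete, here is a deliberately simplified sketch. MiniTransformDict is a made-up class for illustration and not the implementation from the linked file; in particular, keeping the first-inserted spelling of a key is an assumption of this sketch.

```python
class MiniTransformDict:
    """Simplified illustration of a dict that looks up by transformed key."""

    def __init__(self, transform):
        self._transform = transform
        self._original = {}  # transformed key -> key as first inserted
        self._data = {}      # transformed key -> value

    def __setitem__(self, key, value):
        transformed = self._transform(key)
        # Remember the original spelling only the first time it is seen.
        self._original.setdefault(transformed, key)
        self._data[transformed] = value

    def __getitem__(self, key):
        return self._data[self._transform(key)]

    def keys(self):
        # keys() reports the original spellings, not the transformed ones.
        return list(self._original.values())

d = MiniTransformDict(str.lower)
d["Foo"] = 5
d["FOO"] = 6
print(d["foo"], d.keys())
```

Both dictionaries share the transformed key, so a lookup normalises the key once and can then reach either the stored value or the original spelling.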
