Loading dictionary stored as .npz fails - python-3.x

I have a dictionary that I stored as a single object using np.savez. When I open it with np.load as follows:
my_dic = np.load('/home/values.npz')
my_dic.files
['scores']
However, when I try:
my_dic['scores']  # len(my_dic['scores']) == 1, but it contains 3000 keys and 3000 values
it outputs all the keys and values as one object.
Is there any way to access the keys and values individually? Something like:
for k, value in my_dic['scores'].items():
    # do something
Thank you

Sounds like you did:
In [80]: np.savez('test.npz', score={'a':1, 'b':2})
In [81]: d = np.load('test.npz')
In [83]: d.files
Out[83]: ['score']
In [84]: d['score']
Out[84]: array({'a': 1, 'b': 2}, dtype=object)
This is a one-item array with object dtype. Extract the dictionary with item():
In [85]: d['score'].item()
Out[85]: {'a': 1, 'b': 2}
If instead I save the dictionary with kwargs syntax:
In [86]: np.savez('test.npz', **{'a':1, 'b':2})
In [87]: d = np.load('test.npz')
In [88]: d.files
Out[88]: ['a', 'b']
Now each dictionary key is a file in the archive:
In [89]: d['a']
Out[89]: array(1)
In [90]: d['b']
Out[90]: array(2)
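One caveat worth adding for newer NumPy versions: object arrays are stored via pickle, and np.load refuses to unpickle by default since NumPy 1.16.4, so the dictionary-as-object round trip needs allow_pickle=True. A minimal sketch (the file name test.npz is just an example):

```python
import numpy as np

# Save a dict as a single object inside the archive (the question's pattern)
np.savez('test.npz', score={'a': 1, 'b': 2})

# Object arrays are pickled under the hood, so modern NumPy (>= 1.16.4)
# requires allow_pickle=True to load them back
d = np.load('test.npz', allow_pickle=True)
restored = d['score'].item()  # unwrap the 0-d object array
print(restored)  # {'a': 1, 'b': 2}
```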

Following the indications given by @hpaulj,
I did the following to solve the problem:
x = list(my_dic['scores'].item())  # gives me the keys
keys = []
values = []
for i in np.arange(len(x)):
    value = my_dic['scores'].item()[x[i]]
    values.append(value)
    keys.append(x[i])
final_dic = dict(zip(keys, values))
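For what it's worth, the loop can be collapsed: .item() already returns the original dictionary, so there is no need to rebuild it key by key. A sketch with toy data (the archive name and its contents are illustrative):

```python
import numpy as np

np.savez('scores.npz', scores={'k1': 0.5, 'k2': 0.9})
my_dic = np.load('scores.npz', allow_pickle=True)

# .item() unwraps the 0-d object array and gives back the dict itself
final_dic = my_dic['scores'].item()
for k, value in final_dic.items():
    print(k, value)
```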

Related

column comprehension robust to missing values

I have only been able to create a two-column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like is to use this same basic format to initiate the dataframe with three columns: 'id', 'id2' and 'value'. I have a separately defined dict, called id_lookup, that contains the necessary look-up info.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible encounters. For my purposes, simply putting it all together and placing 'N/A' or something similar for those errors will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)

# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'

# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)

# Print the updated DataFrame
print(df)
which gives:
id value id2
0 1 10 A
1 2 20 B
2 3 30 C
3 4 40 N/A
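An alternative sketch that avoids the row loop entirely: Series.map looks every id up in the dict and leaves unmatched ids as NaN, which fillna can then replace (same toy data as above):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}

# map() translates via the dict; ids missing from id_lookup become NaN
df['id2'] = df['id'].map(id_lookup).fillna('N/A')
print(df)
```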

Python: Convert 2d list to dictionary with indexes as values

I have a 2d list with arbitrary strings like this:
lst = [['a', 'xyz' , 'tps'], ['rtr' , 'xyz']]
I want to create a dictionary out of this:
{'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
How do I do this? This answer covers a 1-D list with non-repeated values, but I have a 2-D list and values can repeat. Is there a generic way of doing this?
Maybe you could use two for-loops:
lst = [['a', 'xyz', 'tps'], ['rtr', 'xyz']]
d = {}
overall_idx = 0
for sub_lst in lst:
    for word in sub_lst:
        if word not in d:
            d[word] = overall_idx
            # Move the increment here if you only want to count
            # words that have not been seen before:
            # overall_idx += 1
        overall_idx += 1
print(d)
Output:
{'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
You could first convert the list of lists to a list using a 'double' list comprehension.
Next, get rid of all the duplicates using a dictionary comprehension, we could use set for that but would lose the order.
Finally use another dictionary comprehension to get the desired result.
lst = [['a', 'xyz' , 'tps'], ['rtr' , 'xyz']]
# flatten list of lists to a list
flat_list = [item for sublist in lst for item in sublist]
# remove duplicates
ordered_set = {x:0 for x in flat_list}.keys()
# create required output
the_dictionary = {v:i for i, v in enumerate(ordered_set)}
print(the_dictionary)
""" OUTPUT
{'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
"""
also, with collections and itertools:
import itertools
from collections import OrderedDict

lst = [['a', 'xyz', 'tps'], ['rtr', 'xyz']]
lstkeys = list(OrderedDict(zip(itertools.chain(*lst), itertools.repeat(None))))
lstdict = {lstkeys[i]: i for i in range(len(lstkeys))}
lstdict
output:
{'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
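On Python 3.7+ plain dicts preserve insertion order, so OrderedDict is no longer needed and the flatten, dedupe, and number steps from the answers above can be compressed into one comprehension:

```python
lst = [['a', 'xyz', 'tps'], ['rtr', 'xyz']]

# dict.fromkeys dedupes while keeping first-seen order;
# enumerate then assigns consecutive indexes
d = {word: i for i, word in enumerate(dict.fromkeys(w for row in lst for w in row))}
print(d)  # {'a': 0, 'xyz': 1, 'tps': 2, 'rtr': 3}
```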

Convert string to set array when loading csv file in pandas DataFrame

I'm trying to convert a pandas column from string to set so I can perform set operations (-) and methods (.union) between two DataFrames on their set_array columns. The data comes from two csv files, each with a set_array column. However, once I run pd.read_csv, the column's type becomes str, which prevents me from doing set operations and methods.
csv1:
set_array
0 {985,784}
1 {887}
2 set()
3 {123,469,789}
4 set()
After loading csv1 into a DataFrame using df = pd.read_csv(csv1), the data type becomes str, and when I try to call the first index using df['set_array'].values[0], I get the following:
'{985, 784}'
However, if I were to create my own DataFrame with a set column using df1 = pd.DataFrame({'set_array':[{985, 784},{887},{},{123, 469, 789},{}]}), and call the first index again using df['set_array'].values[0], I get the following (Desired output):
{985, 784} <-without the ''
Here is what I tried so far:
1) df.replace('set()', '') <-removes the set() portion from df
2) df['set_array'] = df['set_array'].apply(set) <-does not work
3) df['set_array'] = df['set_array'].apply(lambda x: {x}) <-does not work
4) df['set_array'].astype(int) <-convert to int first then convert to set, does not work
5) df['set_array'].astype(set) <-does not work
6) df['set_array'].to_numpy() <-convert to array, does not work
I'm also thinking to change the column to set at the pd.read_csv stage as a potential solution.
Is there any way to load csv using pandas and keep the set data type, or just simply convert the column from str to set so it looks like the desired output above?
Thanks!!
I agree with Cainã that dealing with the root input data cause would be the best approach here. But, if that's not possible, then something like this would be a lot more predictable than using eval if this is for some kind of production environment:
import re

def parse_set_string(s):
    if s == 'set()':
        return None  # or return set() if you prefer
    else:
        string_nums_only = re.sub('[^0-9,]', '', s)
        split_nums = string_nums_only.split(',')
        return set(map(int, split_nums))

df.set_array.apply(parse_set_string)
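For example, run against a small frame shaped like the question's column (the sample values here are made up):

```python
import re
import pandas as pd

def parse_set_string(s):
    if s == 'set()':
        return None  # or return set() if you prefer
    else:
        string_nums_only = re.sub('[^0-9,]', '', s)
        split_nums = string_nums_only.split(',')
        return set(map(int, split_nums))

df = pd.DataFrame({'set_array': ['{985,784}', 'set()', '{123,469,789}']})
parsed = df.set_array.apply(parse_set_string)
print(parsed.tolist())  # sets and None, not strings
```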
We've seen this problem before when columns originally contained lists or numpy arrays. csv is a 2d format - rows and columns - so to_csv can only save these embedded objects as strings. What does the file look like?
read_csv by default just loads the strings. To confuse things further, the pandas display does not quote strings, so the str of a set looks the same as the set itself.
With lists, it's enough to do an eval (or ast.literal_eval). With ndarray the string has to be edited first.
Make a dataframe and fill it with some objects:
In [107]: df = pandas.DataFrame([None,None,None])
In [108]: df
Out[108]:
0
0 None
1 None
2 None
In [109]: df[0][0]
In [110]: df[0][0]=[1,2,3]
In [111]: df[0][1]=np.array([1,2,3])
In [112]: df[0][2]={1,2,3}
In [113]: df
Out[113]:
0
0 [1, 2, 3]
1 [1, 2, 3]
2 {1, 2, 3}
The numpy equivalent:
In [114]: df.to_numpy()
Out[114]:
array([[list([1, 2, 3])],
[array([1, 2, 3])],
[{1, 2, 3}]], dtype=object)
Write it to a file:
In [115]: df.to_csv('test.pd')
In [116]: cat test.pd
,0
0,"[1, 2, 3]"
1,[1 2 3]
2,"{1, 2, 3}"
Read it
In [117]: df1 = pandas.read_csv('test.pd')
In [118]: df1
Out[118]:
Unnamed: 0 0
0 0 [1, 2, 3]
1 1 [1 2 3]
2 2 {1, 2, 3}
Ignoring the indexing that I should have suppressed, it looks a lot like the original df. But it contains strings, not list, array, or set.
In [119]: df1.to_numpy()
Out[119]:
array([[0, '[1, 2, 3]'],
[1, '[1 2 3]'],
[2, '{1, 2, 3}']], dtype=object)
Changing the frame to contain sets of differing sizes:
In [120]: df[0][1]=set()
In [122]: df[0][0]=set([1])
In [123]: df
Out[123]:
0
0 {1}
1 {}
2 {1, 2, 3}
In [124]: df.to_csv('test.pd')
In [125]: cat test.pd
,0
0,{1}
1,set()
2,"{1, 2, 3}"
In [136]: df2 = pandas.read_csv('test.pd', index_col=0)
In [137]: df2
Out[137]:
0
0 {1}
1 set()
2 {1, 2, 3}
Looks like eval can convert the empty set as well as the others:
In [138]: df3 = df2['0'].apply(eval)
In [139]: df3
Out[139]:
0 {1}
1 {}
2 {1, 2, 3}
Name: 0, dtype: object
In [140]: df2.to_numpy()
Out[140]:
array([['{1}'],
['set()'],
['{1, 2, 3}']], dtype=object)
In [141]: df3.to_numpy()
Out[141]: array([{1}, set(), {1, 2, 3}], dtype=object)
The problem with your DataFrame is that set_array contains
the text representation of two different things:
set literals,
Python code (set()).
To cope with this case:
import ast.
Define the following conversion function:
def mySetConv(txt):
    return set() if txt == 'set()' else ast.literal_eval(txt)
Apply it:
df.set_array = df.set_array.apply(mySetConv)
To check the result, you can run:
for it in df.set_array:
    print(it, type(it))
getting:
{784, 985} <class 'set'>
{887} <class 'set'>
set() <class 'set'>
{789, 123, 469} <class 'set'>
set() <class 'set'>
If your source file had {} instead of set(), you
could run:
df.set_array = df.set_array.apply(ast.literal_eval)
Just a single line of code. (Note, though, that ast.literal_eval('{}') yields an empty dict, not an empty set, so truly empty entries would still need special handling.)
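Building on the same idea, the conversion can also happen at load time through read_csv's converters argument, so the column never exists as plain strings. A sketch with inline data standing in for the question's csv file:

```python
import ast
import io
import pandas as pd

def mySetConv(txt):
    return set() if txt == 'set()' else ast.literal_eval(txt)

# io.StringIO stands in for the real csv file here
csv_text = 'set_array\n"{985,784}"\nset()\n"{123,469,789}"\n'
df = pd.read_csv(io.StringIO(csv_text), converters={'set_array': mySetConv})
print(df.set_array.tolist())  # real sets, not strings
```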

sort values of lists inside dictionary based on length of characters

d = {'A': ['A11117',
           '33465'
           '17160144',
           'A11-33465',
           '3040',
           'A11-33465 W1',
           'nor'],
     'B': ['maD', 'vern', 'first', 'A2lRights']}
I have a dictionary d and I would like to sort the values based on length of characters. For instance, for key A the value A11-33465 W1 would be first because it contains 12 characters followed by 'A11-33465' because it contains 9 characters etc. I would like this output:
d = {'A': ['A11-33465 W1',
           'A11-33465',
           '17160144',
           'A11117',
           '33465',
           '3040',
           'nor'],
     'B': ['A2lRights',
           'first',
           'vern',
           'maD']}
(I understand that dictionaries are not able to be sorted but I have examples below that didn't work for me but the answer contains a dictionary that was sorted)
I have tried the following
python sorting dictionary by length of values
print(' '.join(sorted(d, key=lambda k: len(d[k]), reverse=True)))
Sort a dictionary by length of the value
sorted_items = sorted(d.items(), key = lambda item : len(item[1]))
newd = dict(sorted_items[-2:])
How do I sort a dictionary by value?
import operator
sorted_x = sorted(d.items(), key=operator.itemgetter(1))
But they both do not give me what I am looking for.
How do I get my desired output?
You are not sorting the dict, you are sorting the lists inside it. The simplest approach is a loop that sorts the lists in-place:
for k, lst in d.items():
    lst.sort(key=len, reverse=True)
This will turn d into:
{'A': ['3346517160144', 'A11-33465 W1', 'A11-33465', 'A11117', '3040', 'nor'],
 'B': ['A2lRights', 'first', 'vern', 'maD']}
(The value '3346517160144' appears because your dictionary literal is missing a comma between '33465' and '17160144', so Python concatenates the two adjacent string literals.)
If you want to keep the original data intact, use a comprehension like:
sorted_d = {k: sorted(lst, key=len, reverse=True) for k, lst in d.items()}

testing if the values of a dictionary are non zero with all() function

I use Python 3
I want to check if all of my tested values in the nested dictionary are non 0.
So here is the simplified example dict:
d = {'a': {'1990': 10, '1991': 0, '1992': 30},
     'b': {'1990': 15, '1991': 40, '1992': 0}}
and I want to test whether, for both dicts 'a' and 'b', the values of the keys '1990' and '1991' are not zero:
for i in d:
    for k in range(2):
        year = 1990
        year = year + k
        if all((d[i][str(year)]) != 0):
            print(d[i])
So it should only return b, because a['1991'] = 0.
But this is the first time I have worked with the all() function, and I get the error: TypeError: 'bool' object is not iterable
The error is in the if all() line.
Thank you very much!
This can be done a bit more generally with a list comprehension where you iterate over the items in dict d. A simple comprehension to iterate over the keys and values in our dictionary looks like this:
>>> [k for k, v in d.items()]
['a', 'b']
In the above k will contain the keys and v the values. The comprehension also has an if clause. With that you can filter out the items you don't want. So we define years = ('1990', '1991'). Now we can do another comprehension to test our year values.
To iterate over only 'a', we could do this:
>>> [d['a'][y] for y in years]
[10, 0]
>>> all([d['a'][y] for y in years])
False
Gluing the whole thing together:
>>> d={'a' :{ '1990': 10, '1991':0, '1992':30},'b':{ '1990':15, '1991':40, '1992':0}}
>>> years = ('1990', '1991')
>>> [k for k, v in d.items() if all([v[y] for y in years])]
['b']
See the python docs for more information on list comprehensions.
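A variant of the final comprehension, in case you want the surviving sub-dicts rather than just their keys:

```python
d = {'a': {'1990': 10, '1991': 0, '1992': 30},
     'b': {'1990': 15, '1991': 40, '1992': 0}}
years = ('1990', '1991')

# Keep only entries whose values for every tested year are non-zero
filtered = {k: v for k, v in d.items() if all(v[y] != 0 for y in years)}
print(filtered)  # {'b': {'1990': 15, '1991': 40, '1992': 0}}
```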
