pandas dictionary to list of dictionary key/values - python-3.x

I am trying to build a list of dictionaries from a DataFrame with a list comprehension, and this is what I have after a few attempts. Is there pandas functionality I am missing that would avoid the for loop?
import numpy as np
import pandas as pd

file = ['a.txt','a.txt','b.txt','c.txt']
year = ['2016','2017','2016','2018']
paper = ['Biology','Biology','Math','English']
name = ['Ann,Matt','Maya','Rob',np.nan]
df = pd.DataFrame({
    'file':file,
    'year':year,
    'paper':paper,
    'name':name
})
df
# Nested dict keyed by row index: {index: {column: value}}
dfd = df.to_dict('index')
dfd
>>>
{0: {'file': 'a.txt', 'year': '2016', 'paper': 'Biology', 'name': 'Ann,Matt'},
1: {'file': 'a.txt', 'year': '2017', 'paper': 'Biology', 'name': 'Maya'},
2: {'file': 'b.txt', 'year': '2016', 'paper': 'Math', 'name': 'Rob'},
3: {'file': 'c.txt', 'year': '2018', 'paper': 'English', 'name': nan}}
Tried:
d = []
for i in dfd.items():
    d.append(i)
>>>
[(0,
{'file': 'a.txt', 'year': '2016', 'paper': 'Biology', 'name': 'Ann,Matt'}),
(1, {'file': 'a.txt', 'year': '2017', 'paper': 'Biology', 'name': 'Maya'}),
(2, {'file': 'b.txt', 'year': '2016', 'paper': 'Math', 'name': 'Rob'}),
(3, {'file': 'c.txt', 'year': '2018', 'paper': 'English', 'name': nan})]
That result is a list of (index, dict) tuples. I am trying to get a plain list of dictionaries like this:
[{'file': 'a.txt', 'year': '2016', 'paper': 'Biology', 'name': 'Ann,Matt'},
{'file': 'a.txt', 'year': '2017', 'paper': 'Biology', 'name': 'Maya'},
{'file': 'b.txt', 'year': '2016', 'paper': 'Math', 'name': 'Rob'},
{'file': 'c.txt', 'year': '2018', 'paper': 'English', 'name': nan}]

You almost had it. dfd.items() iterates over the keys and values of your dfd dict at once, as (key, value) tuples. Ignore the key part of each tuple and keep only the value in a list comprehension, like this:
d = [v for k,v in dfd.items()]
Just tested that with the data and it gives the output you want
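As a small sketch of the same idea without pandas installed, the dict below is typed out by hand to mirror the dfd from the question (nan is written as float("nan"), which is an assumption about how you would reproduce it). It also shows that list(dfd.values()) gives the same list as the comprehension. If you still have the DataFrame, df.to_dict('records') should return this list of row dictionaries directly.

```python
# Hand-written copy of the dict produced by df.to_dict('index') in the question.
nan = float("nan")
dfd = {
    0: {'file': 'a.txt', 'year': '2016', 'paper': 'Biology', 'name': 'Ann,Matt'},
    1: {'file': 'a.txt', 'year': '2017', 'paper': 'Biology', 'name': 'Maya'},
    2: {'file': 'b.txt', 'year': '2016', 'paper': 'Math', 'name': 'Rob'},
    3: {'file': 'c.txt', 'year': '2018', 'paper': 'English', 'name': nan},
}

# The comprehension from the answer: drop the key, keep the row dict.
by_comprehension = [v for k, v in dfd.items()]

# Equivalent shortcut: take the values directly.
by_values = list(dfd.values())

print(by_values[0])
# {'file': 'a.txt', 'year': '2016', 'paper': 'Biology', 'name': 'Ann,Matt'}
```

Both lists hold the very same row dictionaries, in index order.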

Related

ValueError: Invalid broadcasting comparison with block values - how to resolve it in pythonic way

Hi, I have two data frames and am trying to compare their values, but I get a ValueError about broadcasting:
dict_1 = {'a': {0: [{'value': 'A123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}]},
'b': {0: [{'value': 'B5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}]},
"c" : "hello"}
dict_2 = {'a': {0: np.nan},
'b': {0: [{'value': 'B4785',
'label': 'Innovation'},
{'value': 'B4635', 'label': 'Businees Value'},
{'value': 'B1234', 'label': 'Requirements'},
{'value': 'B9853', 'label': 'Stories'}]},
'c': "hello"
}
df1 = pd.DataFrame(dict_1)
df2 = pd.DataFrame(dict_2)
Here I want to compare individual rows rather than two complete dataframes (in my real scenario the shape of df1 is (500, 2) and the shape of df2 is (1, 2)). So I used the code below to extract the values that differ between the rows.
df1[~(df1[['a', 'b', 'c']] == df2[['a', 'b', 'c']].iloc[0])]
The desired result should be:
Here df2, which has one row, should be compared with every row of df1 (in my scenario df1 has more than one row). Where the values are identical the result should be nan; otherwise I should get the corresponding values of df1.
You can use mask to replace the matching (True) cells with np.nan. If df1 and df2 each have a single row:
condition = df1 == df2
df1.mask(condition, other=np.nan)
Output:
Now if df1 has more than one row, you can pass mask a condition built with apply, which compares each row of df1 to the first row of df2 and returns True or False values. Comparing the two frames directly raises a shape-mismatch error.
dict_1 = {'a':
{0: [{'value': 'A123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}],
1: [{'value': 'B123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}]
},
'b': {0: [{'value': 'B5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}],
1: [{'value': 'C5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}],
},
"c" : {0: "hello", 1: "hola"}}
# New df1 with two rows
df1 = pd.DataFrame(dict_1)
condition = df1.apply(lambda x: x==df2.iloc[0], axis=1)
df1.mask(condition, other=np.nan)
Output

create nested object from records oriented dictionary

I have the following data frame:
[{'Name': 'foo', 'Description': 'foobar', 'Value': '5'}, {'Name': 'baz', 'Description': 'foobaz', 'Value': '4'}, {'Name': 'bar', 'Description': 'foofoo', 'Value': '8'}]
And I'd like to create two nested categories. One category for Name, Description keys and another category for Value key. Example of output for one object:
{'details': {'Name': 'foo', 'Description': 'foobar'}, 'stats': { 'Value': '5' }}
So far I have only managed this by joining each item "manually". I'm pretty sure this is not the right solution.
Here is one solution:
data = [{'Name': 'foo', 'Description': 'foobar', 'Value': '5'}, {'Name': 'baz', 'Description': 'foobaz', 'Value': '4'}, {'Name': 'bar', 'Description': 'foofoo', 'Value': '8'}]
df = pd.DataFrame(data)
m = df.to_dict('records')
stats = [{'stats':i.popitem()} for i in m]
details = [{'details':i} for i in m]
g = list(zip(details,stats))
print(*g)
({'details': {'Name': 'foo', 'Description': 'foobar'}}, {'stats': ('Value', '5')}) ({'details': {'Name': 'baz', 'Description': 'foobaz'}}, {'stats': ('Value', '4')}) ({'details': {'Name': 'bar', 'Description': 'foofoo'}}, {'stats': ('Value', '8')})
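The key function here is popitem(), which destructively removes and returns the last (key, value) pair from the dictionary. That is why each stats entry holds a tuple and why the dictionaries in m are left with only Name and Description.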
Using list comprehension:
from json import dump
# df is the DataFrame built from the records in the question
result = [{
    'details': {col: row[col] for col in ['Name', 'Description']},
    'stats': {col: row[col] for col in ['Value']}
} for row in df.to_dict(orient='records')]
# Write to file
with open('result.json', 'w') as f:
    dump(result, f)
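If the records are already a plain list of dictionaries, the same nesting can be sketched without pandas. The helper name nest and the groups mapping below are hypothetical, introduced only for this illustration; the input records are left unchanged.

```python
data = [
    {'Name': 'foo', 'Description': 'foobar', 'Value': '5'},
    {'Name': 'baz', 'Description': 'foobaz', 'Value': '4'},
    {'Name': 'bar', 'Description': 'foofoo', 'Value': '8'},
]

# Hypothetical mapping: output category -> keys copied into that category.
groups = {'details': ['Name', 'Description'], 'stats': ['Value']}

def nest(record, groups):
    """Build one nested dict per record without modifying the record."""
    return {
        category: {key: record[key] for key in keys}
        for category, keys in groups.items()
    }

result = [nest(row, groups) for row in data]
print(result[0])
# {'details': {'Name': 'foo', 'Description': 'foobar'}, 'stats': {'Value': '5'}}
```

Keeping the category layout in one mapping means adding a third category only requires a new entry in groups.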

Python list of dictionaries search with multiple input

Sorry for such a basic question, but I'm at a dead end right now (first time using Python). How do I search a Python list of dictionaries by multiple attributes?
My current code can only search by one attribute.
people = [{'name': 'Alex', 'age': '19', 'grade': 80},
          {'name': 'Brian', 'age': '17', 'grade': 90},
          {'name': 'Junior', 'age': '17', 'grade': 90},
          {'name': 'Zoey', 'age': '19', 'grade': 95},
          {'name': 'joe', 'age': '18', 'grade': 90}]
entry=input("Check the name you want to search the grade :")
list(filter(lambda person: person['name'] == entry, people))
I want it to search by multiple attributes, so if I input either '17' or 90, the expected output is
[{'name': 'Brian', 'age': '17', 'grade': 90},
{'name': 'Junior', 'age': '17', 'grade': 90}]
You could just use two conditions connected by an or (converting with str so that strings are not compared with numbers):
list(filter(lambda person: str(person['age']) == entry or str(person['grade']) == entry, people))
At some point, a comprehension will be more readable:
[p for p in people if str(p['age']) == entry or str(p['grade']) == entry]
And if you want to add more search keys, you can further DRY this out, using any:
keys = ('name', 'grade', 'age')
filtered = [p for p in people if any(str(p[k]) == entry for k in keys)]
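As a runnable sketch of the any-based version, here it is wrapped in a function. The name search and its default keys are hypothetical choices for this example. Since input() always returns a string, every field is converted with str before comparing.

```python
people = [
    {'name': 'Alex', 'age': '19', 'grade': 80},
    {'name': 'Brian', 'age': '17', 'grade': 90},
    {'name': 'Junior', 'age': '17', 'grade': 90},
    {'name': 'Zoey', 'age': '19', 'grade': 95},
    {'name': 'joe', 'age': '18', 'grade': 90},
]

def search(people, entry, keys=('name', 'age', 'grade')):
    """Return every person for whom at least one of `keys` equals `entry` as text."""
    return [p for p in people if any(str(p[k]) == entry for k in keys)]

print(search(people, '17'))
# [{'name': 'Brian', 'age': '17', 'grade': 90}, {'name': 'Junior', 'age': '17', 'grade': 90}]
```

Note that searching for '90' with this data also returns joe, because his grade is 90 as well.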

Pick two items from a list based on a condition

Here is the simplified version of the problem ;)
Given following list,
my_list = [{'name': 'apple', 'type': 'fruit'},
{'name': 'orange', 'type': 'fruit'},
{'name': 'mango', 'type': 'fruit'},
{'name': 'tomato', 'type': 'vegetable'},
{'name': 'potato', 'type': 'vegetable'},
{'name': 'leek', 'type': 'vegetable'}]
How to pick only two items from the list for a particular type to achieve following?
filtered = [{'name': 'apple', 'type': 'fruit'},
{'name': 'orange', 'type': 'fruit'},
{'name': 'tomato', 'type': 'vegetable'},
{'name': 'leek', 'type': 'vegetable'}]
You can use itertools.groupby to group the elements of your list by type and then grab only the first 2 elements from each group. The list is sorted by type first because groupby only groups consecutive items.
>>> from itertools import groupby
>>> from pprint import pprint
>>> f = lambda k: k['type']
>>> n = 2
>>> res = [grp for _,grps in groupby(sorted(my_list, key=f), f) for grp in list(grps)[:n]]
>>> pprint(res)
[{'name': 'apple', 'type': 'fruit'},
{'name': 'orange', 'type': 'fruit'},
{'name': 'tomato', 'type': 'vegetable'},
{'name': 'potato', 'type': 'vegetable'}]
You can use groupby and then pick the first 2 of each group (without sorting, this relies on my_list already being grouped by type):
from itertools import groupby
a = [list(j)[:2] for i, j in groupby(my_list, key = lambda x: x['type'])]
print(a)
[[{'name': 'apple', 'type': 'fruit'}, {'name': 'orange', 'type': 'fruit'}],
[{'name': 'tomato', 'type': 'vegetable'},
{'name': 'potato', 'type': 'vegetable'}]]
sum(a,[])
Out[299]:
[{'name': 'apple', 'type': 'fruit'},
{'name': 'orange', 'type': 'fruit'},
{'name': 'tomato', 'type': 'vegetable'},
{'name': 'potato', 'type': 'vegetable'}]
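If the list is not already grouped by type and you want to keep the original order, one alternative sketch is to count how many items of each type have been taken so far. This is a different technique from groupby; the function name take_per_type and its parameters are hypothetical.

```python
from collections import defaultdict

my_list = [
    {'name': 'apple', 'type': 'fruit'},
    {'name': 'orange', 'type': 'fruit'},
    {'name': 'mango', 'type': 'fruit'},
    {'name': 'tomato', 'type': 'vegetable'},
    {'name': 'potato', 'type': 'vegetable'},
    {'name': 'leek', 'type': 'vegetable'},
]

def take_per_type(items, n=2, key='type'):
    """Keep at most n items per value of `key`, preserving input order."""
    taken = defaultdict(int)  # how many items of each type were kept so far
    picked = []
    for item in items:
        if taken[item[key]] < n:
            picked.append(item)
            taken[item[key]] += 1
    return picked

filtered = take_per_type(my_list)
print([item['name'] for item in filtered])
# ['apple', 'orange', 'tomato', 'potato']
```

This makes a single pass over the list and needs no sorting, so interleaved types are handled as well.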

How to extract multiple data points from multiple strings in Python?

I have a dataset that consists of thousands of entries such as the following:
[{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2015',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '392168030'},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2014',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '384356146'},
....17020-ish rows later.....
{'country': {'id': 'XH', 'value': 'IDA blend'},
'date': '1960',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '163861743'},
...]
I want to create a DataFrame using pandas such that y-axis = 'id' and x-axis = 'date', with 'value' being the stored value. I can't figure out the best way to approach this...
EDIT:
Imagine a sheet with just numbers ('value' from the dataset). The x-axis columns would be the extracted date and the y-axis rows would be the country id ('id'). The final object would be a dataset that is y*x in size. The numbers would all be of type 'float'.
EDIT 2:
The dataset represents ~304 countries from 1960 - 2016, so there are approximately 304 * 56 = 17024 entries in the dataset. I need to store the 'value' (where for entry 2, value = 392168030) with respect to each country and date.
EDIT 3:
Using the above data, an example output data set would be structured thusly:
      2016   2015        2014        ...   1960
1A    None   392168030   384356146   ...   w
...
XH    x      y           z           ...   163861743
First extract the information from the original dataset:
dataset = [{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2015',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '392168030'},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2014',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '384356146'},
{'country': {'id': 'XH', 'value': 'IDA blend'},
'date': '1960',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '163861743'}]
df = [[entry['country']['id'], entry['date'], entry['value']] for entry in dataset]
df = pd.DataFrame(df, columns=['id','date','value'])
Then pivot the dataframe:
df = df.pivot(index='id',columns='date',values='value')
The output:
date 1960 2014 2015 2016
id
1A None 384356146 392168030 None
XH 163861743 None None None
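The question asks for float values, while the pivot above keeps the original strings. As a rough sketch of the same reshaping in plain Python with that conversion added, the loop below builds a nested dict keyed by country id and then by date. The name table is hypothetical; passing it to pd.DataFrame.from_dict(table, orient='index') should give a frame with ids as rows and dates as columns.

```python
dataset = [
    {'country': {'id': '1A', 'value': 'Arab World'}, 'date': '2016', 'decimal': '0',
     'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'}, 'value': None},
    {'country': {'id': '1A', 'value': 'Arab World'}, 'date': '2015', 'decimal': '0',
     'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'}, 'value': '392168030'},
    {'country': {'id': '1A', 'value': 'Arab World'}, 'date': '2014', 'decimal': '0',
     'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'}, 'value': '384356146'},
    {'country': {'id': 'XH', 'value': 'IDA blend'}, 'date': '1960', 'decimal': '0',
     'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'}, 'value': '163861743'},
]

# table[country_id][date] = value as float, or None when the value is missing
table = {}
for entry in dataset:
    raw = entry['value']
    value = float(raw) if raw is not None else None
    table.setdefault(entry['country']['id'], {})[entry['date']] = value

print(table['1A'])
# {'2016': None, '2015': 392168030.0, '2014': 384356146.0}
```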
I had to make a guess about how the "thousands of entries" might look but I came up with this possible solution.
entry1 = {
'country': {'id': '1A', 'value': 'Arab World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None
}
entry2 = {
'country': {'id': '1B', 'value': 'Another World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None
}
entries = [entry1, entry2]
import pandas as pd

countries_index = []
date_cols = []
# Collect the date (column values) and country id (row labels) of each entry
for entry in entries:
    date_cols.append(entry['date'])
    countries_index.append(entry['country']['id'])
df = pd.DataFrame(date_cols, columns=['date'], index=countries_index)
This creates a data frame, df, which looks like this:
date
1A 2016
1B 2016

Resources