How to extract multiple data points from multiple strings in Python? - python-3.x

I have a dataset that consists of thousands of entries such as the following:
[{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2015',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '392168030'},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2014',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '384356146'},
....17020-ish rows later.....
{'country': {'id': 'XH', 'value': 'IDA blend'},
'date': '1960',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '163861743'},
...]
I want to create a DataFrame using pandas such that y-axis = 'id' and x-axis = 'date', with 'value' being the stored value. I can't figure out the best way to approach this...
EDIT:
Imagine a sheet with just numbers ('value' from the dataset). The x-axis columns would be the extracted date and the y-axis rows would be the country id ('id'). The final object would be a dataset that is y*x in size. The numbers would all be of type 'float'.
EDIT 2:
The dataset represents ~304 countries from 1960 - 2016, so there are approximately 304 * 56 = 17024 entries in the dataset. I need to store the 'value' (where for entry 2, value = 392168030) with respect to each country and date.
EDIT 3:
Using the above data, an example output data set would be structured thusly:
2016 . 2015 . 2014 . ... 1960
1A . None . 392168030 384356146 . ... w
...
XH . x y z 163861743

First, extract the relevant fields from the original dataset:
dataset = [{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2015',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '392168030'},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2014',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '384356146'},
{'country': {'id': 'XH', 'value': 'IDA blend'},
'date': '1960',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '163861743'}]
import pandas as pd
# One [id, date, value] row per entry
df = [[entry['country']['id'], entry['date'], entry['value']] for entry in dataset]
df = pd.DataFrame(df, columns=['id','date','value'])
Then pivot the DataFrame:
df = df.pivot(index='id',columns='date',values='value')
The output:
date 1960 2014 2015 2016
id
1A None 384356146 392168030 None
XH 163861743 None None None
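Since the question asks for float values with the most recent year first, one way to get there is sketched below. It assumes the same record layout as above (trimmed to the fields that are used). The names `records`, `rows`, and `table` are illustrative and not part of the original answer. `pd.to_numeric` converts the numeric strings to numbers and turns `None` into NaN before the pivot.

```python
import pandas as pd

# Trimmed copies of the records from the question (only the fields used here).
records = [
    {'country': {'id': '1A'}, 'date': '2016', 'value': None},
    {'country': {'id': '1A'}, 'date': '2015', 'value': '392168030'},
    {'country': {'id': '1A'}, 'date': '2014', 'value': '384356146'},
    {'country': {'id': 'XH'}, 'date': '1960', 'value': '163861743'},
]

rows = [(r['country']['id'], r['date'], r['value']) for r in records]
df = pd.DataFrame(rows, columns=['id', 'date', 'value'])

# Numeric strings become numbers; None becomes NaN, so the column is float.
df['value'] = pd.to_numeric(df['value'])

table = df.pivot(index='id', columns='date', values='value')

# Order the year columns from newest to oldest, as in the requested layout.
table = table[sorted(table.columns, reverse=True)]
print(table)
```

Country/year combinations with no entry in the data show up as NaN after the pivot, just like entries whose value was `None`.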

I had to make a guess about how the "thousands of entries" might look but I came up with this possible solution.
entry1 = {
'country': {'id': '1A', 'value': 'Arab World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None
}
entry2 = {
'country': {'id': '1B', 'value': 'Another World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None
}
entries = [entry1, entry2]
countries_index = []
date_cols = []
# Collect the date and the country id of every entry
for entry in entries:
    date_cols.append(entry['date'])
    countries_index.append(entry['country']['id'])
import pandas as pd
df = pd.DataFrame(date_cols, columns=['date'], index=countries_index)
This creates a DataFrame, df, which looks like this:
date
1A 2016
1B 2016

Related

ValueError: Invalid broadcasting comparison with block values - how to resolve it in pythonic way

I have two DataFrames and am trying to compare their values, but I get a broadcasting ValueError:
import numpy as np
import pandas as pd
dict_1 = {'a': {0: [{'value': 'A123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}]},
'b': {0: [{'value': 'B5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}]},
"c" : "hello"}
dict_2 = {'a': {0: np.nan},
'b': {0: [{'value': 'B4785',
'label': 'Innovation'},
{'value': 'B4635', 'label': 'Businees Value'},
{'value': 'B1234', 'label': 'Requirements'},
{'value': 'B9853', 'label': 'Stories'}]},
'c': "hello"
}
df1 = pd.DataFrame(dict_1)
df2 = pd.DataFrame(dict_2)
Here I want to compare individual rows rather than two complete DataFrames (in my real scenario, df1 has shape (500, 2) and df2 has shape (1, 2)). So I used the code below to extract the values that differ between the rows.
df1[~(df1[['a', 'b', 'c']] == df2[['a', 'b', 'c']].iloc[0])]
The desired result should be:
Here df2, which has one row, should be compared with every row of df1 (in my scenario df1 has more than one row). Where the values are identical the result should be NaN; otherwise I should get the corresponding values of df1.
You can use mask to replace the matching cells with np.nan. If df1 and df2 each have a single row:
condition = df1 == df2
df1.mask(condition, other=np.nan)
Output:
If the two DataFrames have different numbers of rows (below, df1 has two rows), a direct == raises a shape error. In that case you can pass mask a boolean DataFrame built with apply, which compares each row of df1 with the first row of df2.
dict_1 = {'a':
{0: [{'value': 'A123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}],
1: [{'value': 'B123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}]
},
'b': {0: [{'value': 'B5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}],
1: [{'value': 'C5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}],
},
"c" : {0: "hello", 1: "hola"}}
# New df1 with two rows
df1 = pd.DataFrame(dict_1)
condition = df1.apply(lambda x: x==df2.iloc[0], axis=1)
df1.mask(condition, other=np.nan)
Output
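For reference, here is a minimal sketch of the same row-against-every-row comparison on a small frame with scalar cells. The data is made up for illustration and is an assumption, not the asker's data; cells holding lists, as in the question, may compare differently. It uses `DataFrame.eq` with `axis='columns'`, which aligns the single row's labels with the columns of df1, as an alternative to the `apply` call above.

```python
import numpy as np
import pandas as pd

# Hypothetical scalar-valued data standing in for the question's frames.
df1 = pd.DataFrame({
    'a': [1, 2, 3],
    'b': ['x', 'y', 'x'],
    'c': ['hello', 'hola', 'hello'],
})
df2 = pd.DataFrame({'a': [1], 'b': ['x'], 'c': ['hello']})

# Compare every row of df1 with the single row of df2, column by column.
condition = df1.eq(df2.iloc[0], axis='columns')

# Blank out the cells that match; keep df1's value where they differ.
result = df1.mask(condition, other=np.nan)
print(result)
```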

Remove some same-index elements from two lists based on one of them

Let's suppose that I have these two lists:
a = [{'id': 3}, {'id': 7}, None, {'id': 1}, {'id': 6}, None]
b = ['5', '5', '3', '5', '3', '5']
I want to filter both lists at the same indices, based only on a: specifically, I want to drop every position where a is None.
So finally I want to have this:
[{'id': 3}, {'id': 7}, {'id': 1}, {'id': 6}]
['5', '5', '5', '3']
I have written this code for this:
a_temp = []
b_temp = []
for index, el in enumerate(a):
    if el:
        a_temp.append(a[index])
        b_temp.append(b[index])
a = a_temp[:]
b = b_temp[:]
I am wondering though if there is any more pythonic way to do this?
This solution:
uses zip() to group corresponding elements of a and b together;
builds 2-tuples of corresponding elements, keeping only those whose element from a is truthy (which drops the None entries);
uses the zip(*iterable) idiom to transpose the result, separating the sequence of 2-tuples into two tuples, which are assigned to new_a and new_b.
a = [{'id': 3}, {'id': 7}, None, {'id': 1}, {'id': 6}, None]
b = ['5', '5', '3', '5', '3', '5']
new_a, new_b = zip(*((x, y) for x, y in zip(a, b) if x))
# new_a = ({'id': 3}, {'id': 7}, {'id': 1}, {'id': 6})
# new_b = ('5', '5', '5', '3')
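A variant that may be worth considering is sketched below. It tests `is not None` explicitly, so a falsy but valid element such as an empty dict would be kept, and it returns lists rather than tuples. It also avoids the unpacking error that `zip(*...)` raises when every element of a is None. The name `pairs` is illustrative.

```python
a = [{'id': 3}, {'id': 7}, None, {'id': 1}, {'id': 6}, None]
b = ['5', '5', '3', '5', '3', '5']

# Keep only the positions where the element of a is not None.
pairs = [(x, y) for x, y in zip(a, b) if x is not None]

# Split the surviving pairs back into two lists.
new_a = [x for x, _ in pairs]
new_b = [y for _, y in pairs]

print(new_a)
print(new_b)
```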
If you just want a simple solution, please try:
a = [{'id': 3}, {'id': 7}, None, {'id': 1}, {'id': 6}, None]
b = ['5', '5', '3', '5', '3', '5']
n = []
for i in range(len(b)):
    if a[i] is None:
        n.append(i)
for i in sorted(n, reverse=True):
    a.pop(i)
    b.pop(i)
a
[{'id': 3}, {'id': 7}, {'id': 1}, {'id': 6}]
b
['5', '5', '5', '3']

How to convert list of list into dictionary object?

I have a list of lists that looks like this:
[[['1',
'1#1`']],
[['2', '2#2.com']],
[['3', '3#3.com']],
[['4', '4#4.com']],
[['5', '5#5.com']],
[['6', '6#6']],
[['7', '7#7']],
[['8', '8#8']],
[['8.5', '8.5#8.5']],
[['9', '9#9']],
[['10', '10#10']],
[['11', '11#11']],
[['12', '12#12']],
[['13', '13#13.com']],
[['14', '14#14.com']],
[['15', '15#15.com']],
[['16', '16#16.com']],
[['17', '17#17.com']],
[['18#18.com', '18']],
[['19', '19#19.com']]]
Is there any way I can clean up the list by turning it into a list of dictionaries, like so:
[{id:1,email:1#1},{id:2,email:2#2.com}]
Ideally, if an email appears in the id spot, the two values should be swapped.
You can use a list comprehension:
In [1]: mylist = [[['1',
...: '1#1`']],
...: [['2', '2#2.com']],
...: [['3', '3#3.com']],
...: [['4', '4#4.com']],
...: [['5', '5#5.com']],
...: [['6', '6#6']],
...: [['7', '7#7']],
...: [['8', '8#8']],
...: [['8.5', '8.5#8.5']],
...: [['9', '9#9']],
...: [['10', '10#10']],
...: [['11', '11#11']],
...: [['12', '12#12']],
...: [['13', '13#13.com']],
...: [['14', '14#14.com']],
...: [['15', '15#15.com']],
...: [['16', '16#16.com']],
...: [['17', '17#17.com']],
...: [['18#18.com', '18']],
...: [['19', '19#19.com']]]
In [2]: [{'id': i, 'email': e} for i, e in (pair[0] if '#' not in pair[0][0] else reversed(pair[0]) for pair in mylist)]
Out[2]:
[{'id': '1', 'email': '1#1`'},
{'id': '2', 'email': '2#2.com'},
{'id': '3', 'email': '3#3.com'},
{'id': '4', 'email': '4#4.com'},
{'id': '5', 'email': '5#5.com'},
{'id': '6', 'email': '6#6'},
{'id': '7', 'email': '7#7'},
{'id': '8', 'email': '8#8'},
{'id': '8.5', 'email': '8.5#8.5'},
{'id': '9', 'email': '9#9'},
{'id': '10', 'email': '10#10'},
{'id': '11', 'email': '11#11'},
{'id': '12', 'email': '12#12'},
{'id': '13', 'email': '13#13.com'},
{'id': '14', 'email': '14#14.com'},
{'id': '15', 'email': '15#15.com'},
{'id': '16', 'email': '16#16.com'},
{'id': '17', 'email': '17#17.com'},
{'id': '18', 'email': '18#18.com'},
{'id': '19', 'email': '19#19.com'}]
If you have arbitrary nesting, you can try this:
def flatten(lst):
    for sub in lst:
        if isinstance(sub, list):
            yield from flatten(sub)
        else:
            yield sub
[{'id': i, 'email': e} for i, e in (pair if '#' not in pair[0] else reversed(pair) for pair in zip(*[flatten(mylist)]*2))]
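If the one-liners are hard to read, the swap rule can be pulled out into a small helper. This is only a sketch: `to_record` and `sample` are hypothetical names, the sample is a two-item excerpt in the same shape as the question's list, and the rule "the element containing '#' is the email" is the same assumption the comprehensions above make.

```python
def to_record(pair):
    """Build {'id': ..., 'email': ...}, swapping the pair if the email comes first."""
    first, second = pair
    if '#' in first:
        # The email landed in the id spot, so swap the two values.
        first, second = second, first
    return {'id': first, 'email': second}

# Two items in the same shape as the question's list (each pair wrapped in a list).
sample = [[['1', '1#1']], [['18#18.com', '18']]]
records = [to_record(inner[0]) for inner in sample]
print(records)
```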

Pick two items from a list based on a condition

Here is the simplified version of the problem ;)
Given following list,
my_list = [{'name': 'apple', 'type': 'fruit'},
{'name': 'orange', 'type': 'fruit'},
{'name': 'mango', 'type': 'fruit'},
{'name': 'tomato', 'type': 'vegetable'},
{'name': 'potato', 'type': 'vegetable'},
{'name': 'leek', 'type': 'vegetable'}]
How to pick only two items from the list for a particular type to achieve following?
filtered = [{'name': 'apple', 'type': 'fruit'},
{'name': 'orange', 'type': 'fruit'},
{'name': 'tomato', 'type': 'vegetable'},
{'name': 'leek', 'type': 'vegetable'}]
You can use itertools.groupby to group the elements of your list by type and then grab only the first 2 elements from each group.
>>> from itertools import groupby
>>> from pprint import pprint
>>> f = lambda k: k['type']
>>> n = 2
>>> res = [grp for _,grps in groupby(sorted(my_list, key=f), f) for grp in list(grps)[:n]]
>>> pprint(res)
[{'name': 'apple', 'type': 'fruit'},
{'name': 'orange', 'type': 'fruit'},
{'name': 'tomato', 'type': 'vegetable'},
{'name': 'potato', 'type': 'vegetable'}]
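You can use groupby (my_list is already ordered by type, which groupby needs in order to form one group per type), pick the first 2 items of each group, and then flatten the result: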
from itertools import groupby
a = [list(j)[:2] for i, j in groupby(my_list, key = lambda x: x['type'])]
print(a)
[[{'name': 'apple', 'type': 'fruit'}, {'name': 'orange', 'type': 'fruit'}],
[{'name': 'tomato', 'type': 'vegetable'},
{'name': 'potato', 'type': 'vegetable'}]]
sum(a,[])
Out[299]:
[{'name': 'apple', 'type': 'fruit'},
{'name': 'orange', 'type': 'fruit'},
{'name': 'tomato', 'type': 'vegetable'},
{'name': 'potato', 'type': 'vegetable'}]
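An alternative that does not depend on the list being sorted or grouped by type is to count how many items of each type have been kept so far. The sketch below uses `collections.defaultdict`. The function name `take_per_type` and its parameter `n` are illustrative. Like the answers above, it keeps the first two items of each type in their original order.

```python
from collections import defaultdict

my_list = [{'name': 'apple', 'type': 'fruit'},
           {'name': 'orange', 'type': 'fruit'},
           {'name': 'mango', 'type': 'fruit'},
           {'name': 'tomato', 'type': 'vegetable'},
           {'name': 'potato', 'type': 'vegetable'},
           {'name': 'leek', 'type': 'vegetable'}]

def take_per_type(items, n=2):
    """Keep at most n items of each type, preserving the input order."""
    seen = defaultdict(int)  # type -> number of items kept so far
    picked = []
    for item in items:
        if seen[item['type']] < n:
            picked.append(item)
            seen[item['type']] += 1
    return picked

filtered = take_per_type(my_list)
print(filtered)
```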

pandas dictionary to list of dictionary key/values

I am trying to build a list of dictionaries from a DataFrame with a list comprehension, and this is what I got after a few attempts. Is there pandas functionality I am missing that would avoid the for loops?
file = ['a.txt','a.txt','b.txt','c.txt']
year = ['2016','2017','2016','2018']
paper = ['Biology','Biology','Math','English']
name = ['Ann,Matt','Maya','Rob',np.nan]
df = pd.DataFrame({
'file':file,
'year':year,
'paper':paper,
'name':name
})
df
dfd = df.to_dict('index')
dfd
>>>
{0: {'file': 'a.txt', 'year': '2016', 'paper': 'Biology', 'name': 'Ann,Matt'},
1: {'file': 'a.txt', 'year': '2017', 'paper': 'Biology', 'name': 'Maya'},
2: {'file': 'b.txt', 'year': '2016', 'paper': 'Math', 'name': 'Rob'},
3: {'file': 'c.txt', 'year': '2018', 'paper': 'English', 'name': nan}}
Tried:
d = []
for i in dfd.items():
    d.append(i)
>>>
[(0,
{'file': 'a.txt', 'year': '2016', 'paper': 'Biology', 'name': 'Ann,Matt'}),
(1, {'file': 'a.txt', 'year': '2017', 'paper': 'Biology', 'name': 'Maya'}),
(2, {'file': 'b.txt', 'year': '2016', 'paper': 'Math', 'name': 'Rob'}),
(3, {'file': 'c.txt', 'year': '2018', 'paper': 'English', 'name': nan})]
I am trying to get it like this instead; my attempt above gives tuples:
[{'file': 'a.txt', 'year': '2016', 'paper': 'Biology', 'name': 'Ann,Matt'},
{'file': 'a.txt', 'year': '2017', 'paper': 'Biology', 'name': 'Maya'},
{'file': 'b.txt', 'year': '2016', 'paper': 'Math', 'name': 'Rob'},
{'file': 'c.txt', 'year': '2018', 'paper': 'English', 'name': nan}]
You almost had it. dfd.items() iterates over the keys and values of your dfd dict together, so you can ignore the key part of each tuple and collect only the value in a list comprehension like this:
d = [v for k,v in dfd.items()]
I tested that with the data, and it gives the output you want.
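On the question of built-in pandas functionality: `DataFrame.to_dict` accepts an `orient` argument, and `orient='records'` returns one dictionary per row directly, so the intermediate `dfd` dict is not needed. A minimal sketch using the question's data (the variable name `records` is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'file': ['a.txt', 'a.txt', 'b.txt', 'c.txt'],
    'year': ['2016', '2017', '2016', '2018'],
    'paper': ['Biology', 'Biology', 'Math', 'English'],
    'name': ['Ann,Matt', 'Maya', 'Rob', np.nan],
})

# One dict per row, keyed by column name.
records = df.to_dict('records')
print(records)
```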
