Variable tsv_data has the following structure:
[
{'id':1,'name':'bob','type':'blue','size':2},
{'id':2,'name':'bob','type':'blue','size':3},
{'id':3,'name':'bob','type':'blue','size':4},
{'id':4,'name':'bob','type':'red','size':2},
{'id':5,'name':'sarah','type':'blue','size':2},
{'id':6,'name':'sarah','type':'blue','size':3},
{'id':7,'name':'sarah','type':'green','size':2},
{'id':8,'name':'jack','type':'blue','size':5},
]
Which I would like to restructure into:
[
    {'name':'bob', 'children':[
        {'name':'blue','children':[
            {'id':1, 'size':2},
            {'id':2, 'size':3},
            {'id':3, 'size':4}
        ]},
        {'name':'red','children':[
            {'id':4, 'size':2}
        ]}
    ]},
    {'name':'sarah', 'children':[
        {'name':'blue','children':[
            {'id':5, 'size':2},
            {'id':6, 'size':3},
        ]},
        {'name':'green','children':[
            {'id':7, 'size':2}
        ]}
    ]},
    {'name':'jack', 'children':[
        {'name':'blue', 'children':[
            {'id':8, 'size':5}
        ]}
    ]}
]
What is obstructing my progress is not knowing how many items will be in the children list for each major category. In a similar vein, we also don't know which categories will be present. It could be blue or green or red -- all three or in any combination (like only red and green or only green).
Question
How might we devise a fool-proof way to compile the flat list of dicts contained in tsv_data into a multi-tier hierarchical data structure like the one above?
Given your major categories as a list:
categories = ['name', 'type']
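You can first transform the input data into a nested dict of lists, because accessing children by key is easier and more efficient there than in your desired output format, a nested list of dicts. Note that record.pop removes the category keys from the records in tsv_data in place:
tree = {}
for record in tsv_data:
    node = tree
    # Walk (or create) one dict level per category except the last
    for category in categories[:-1]:
        node = node.setdefault(record.pop(category), {})
    # The last category maps to a list of the remaining fields
    node.setdefault(record.pop(categories[-1]), []).append(record)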
tree would become:
{'bob': {'blue': [{'id': 1, 'size': 2}, {'id': 2, 'size': 3}, {'id': 3, 'size': 4}], 'red': [{'id': 4, 'size': 2}]}, 'sarah': {'blue': [{'id': 5, 'size': 2}, {'id': 6, 'size': 3}], 'green': [{'id': 7, 'size': 2}]}, 'jack': {'blue': [{'id': 8, 'size': 5}]}}
You can then transform the nested dict to your desired output structure with a recursive function:
def transform(node):
    if isinstance(node, dict):
        return [
            {'name': name, 'children': transform(child)}
            for name, child in node.items()
        ]
    return node
so that transform(tree) would return:
[{'name': 'bob', 'children': [{'name': 'blue', 'children': [{'id': 1, 'size': 2}, {'id': 2, 'size': 3}, {'id': 3, 'size': 4}]}, {'name': 'red', 'children': [{'id': 4, 'size': 2}]}]}, {'name': 'sarah', 'children': [{'name': 'blue', 'children': [{'id': 5, 'size': 2}, {'id': 6, 'size': 3}]}, {'name': 'green', 'children': [{'id': 7, 'size': 2}]}]}, {'name': 'jack', 'children': [{'name': 'blue', 'children': [{'id': 8, 'size': 5}]}]}]
Demo: https://replit.com/#blhsing/NotableCourageousTranslations
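The two steps above can also be combined into one sketch that leaves the input untouched. This is only an illustration: the helper name build_tree is hypothetical, the sample records are a shortened version of the question's data, and instead of popping keys it copies the non-category fields into a new dict for each leaf.

```python
def build_tree(records, categories):
    """Group records into nested dicts keyed by each category (hypothetical helper)."""
    tree = {}
    for record in records:
        # Copy only the non-category fields, so the input record is not mutated
        leaf = {k: v for k, v in record.items() if k not in categories}
        node = tree
        for category in categories[:-1]:
            node = node.setdefault(record[category], {})
        node.setdefault(record[categories[-1]], []).append(leaf)
    return tree


def transform(node):
    """Turn nested dicts into the name/children list format; leaf lists pass through."""
    if isinstance(node, dict):
        return [
            {'name': name, 'children': transform(child)}
            for name, child in node.items()
        ]
    return node


# Shortened sample of the question's data
tsv_data = [
    {'id': 1, 'name': 'bob', 'type': 'blue', 'size': 2},
    {'id': 4, 'name': 'bob', 'type': 'red', 'size': 2},
    {'id': 8, 'name': 'jack', 'type': 'blue', 'size': 5},
]
result = transform(build_tree(tsv_data, ['name', 'type']))
print(result)
```

Because the categories are passed as a list, adding a third grouping level should only require adding another key to that list.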
I need it to stop duplicating the last dictionary and instead add each new dictionary to the list.
I even tried using update to update the dictionaries, but that did not work.
import random
# This line creates a set with 6 random numbers.
# We use a range of 22 or similar; otherwise the players will not get enough matching numbers to build a solution in a learning environment.
lottery_numbers = set(random.sample(range(22), 6))
# Here are your players; they all decided to get their numbers randomly. Find out who has the most numbers matching lottery_numbers!
players = [
    {'name': 'Rolf', 'numbers': set(random.sample(range(22), 6))},
    {'name': 'Charlie', 'numbers': set(random.sample(range(22), 6))},
    {'name': 'Anna', 'numbers': set(random.sample(range(22), 6))},
    {'name': 'Jen', 'numbers': set(random.sample(range(22), 6))}
]
num_player = [1000]
dicc_A = {}
print("Lottery numbers ", lottery_numbers)
print("")
for a in players:
    print("Line by line",a)
    print("")
    for i,j in a.items():
        if i == "name":
            dicc_A["Name"] = j
            print("name Dicc: ", dicc_A)
        if i == "numbers":
            dicc_A["Num"] = j.intersection(lottery_numbers)
            print(" xxxxxxxxxx NUMPLAYER before APPEND inside FOR ",num_player)
            print("******number Dicc: ", dicc_A)
            num_player.append(dicc_A)
            print("")
            print(" ///////// NUMPLAYER after APPEND inside FOR ",num_player)
This is the output:
Lottery numbers {5, 9, 13, 14, 19, 20}
Line by line {'name': 'Rolf', 'numbers': {2, 3, 5, 11, 12, 19}}
name Dicc: {'Name': 'Rolf'}
xxxxxxxxxx NUMPLAYER before APPEND inside FOR [1000]
******number Dicc: {'Name': 'Rolf', 'Num': {19, 5}}
///////// NUMPLAYER after APPEND inside FOR [1000, {'Name': 'Rolf', 'Num': {19, 5}}]
Line by line {'name': 'Charlie', 'numbers': {0, 4, 7, 8, 17, 20}}
name Dicc: {'Name': 'Charlie', 'Num': {19, 5}}
xxxxxxxxxx NUMPLAYER before APPEND inside FOR [1000, {'Name': 'Charlie', 'Num': {20}}]
******number Dicc: {'Name': 'Charlie', 'Num': {20}}
///////// NUMPLAYER after APPEND inside FOR [1000, {'Name': 'Charlie', 'Num': {20}}, {'Name': 'Charlie', 'Num': {20}}]
Line by line {'name': 'Anna', 'numbers': {4, 5, 6, 10, 16, 17}}
name Dicc: {'Name': 'Anna', 'Num': {20}}
xxxxxxxxxx NUMPLAYER before APPEND inside FOR [1000, {'Name': 'Anna', 'Num': {5}}, {'Name': 'Anna', 'Num': {5}}]
******number Dicc: {'Name': 'Anna', 'Num': {5}}
///////// NUMPLAYER after APPEND inside FOR [1000, {'Name': 'Anna', 'Num': {5}}, {'Name': 'Anna', 'Num': {5}}, {'Name': 'Anna', 'Num': {5}}]
Line by line {'name': 'Jen', 'numbers': {2, 4, 5, 8, 9, 10}}
name Dicc: {'Name': 'Jen', 'Num': {5}}
xxxxxxxxxx NUMPLAYER before APPEND inside FOR [1000, {'Name': 'Jen', 'Num': {9, 5}}, {'Name': 'Jen', 'Num': {9, 5}}, {'Name': 'Jen', 'Num': {9, 5}}]
******number Dicc: {'Name': 'Jen', 'Num': {9, 5}}
///////// NUMPLAYER after APPEND inside FOR [1000, {'Name': 'Jen', 'Num': {9, 5}}, {'Name': 'Jen', 'Num': {9, 5}}, {'Name': 'Jen', 'Num': {9, 5}}, {'Name': 'Jen', 'Num': {9, 5}}]
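Oh sorry, now I see: appending dicc_A adds a reference to the same dictionary every time, so every entry in the list changes whenever dicc_A is updated. To keep each player's result, you need to append a copy of the dictionary.
So the solution was simply this: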
import random
# This line creates a set with 6 random numbers.
# We use a range of 22 or similar; otherwise the players will not get enough matching numbers to build a solution in a learning environment.
lottery_numbers = set(random.sample(range(22), 6))
# Here are your players; they all decided to get their numbers randomly. Find out who has the most numbers matching lottery_numbers!
players = [
    {'name': 'Rolf', 'numbers': set(random.sample(range(22), 6))},
    {'name': 'Charlie', 'numbers': set(random.sample(range(22), 6))},
    {'name': 'Anna', 'numbers': set(random.sample(range(22), 6))},
    {'name': 'Jen', 'numbers': set(random.sample(range(22), 6))}
]
num_player = [1000]
dicc_A = {}
print("Lottery numbers ", lottery_numbers)
print("")
for a in players:
    print("Line by line",a)
    print("")
    for i,j in a.items():
        if i == "name":
            dicc_A["Name"] = j
            print("name Dicc: ", dicc_A)
        if i == "numbers":
            dicc_A["Num"] = j.intersection(lottery_numbers)
            print(" xxxxxxxxxx NUMPLAYER before APPEND inside FOR ",num_player)
            print("******number Dicc: ", dicc_A)
            # Append a shallow copy so later updates to dicc_A do not change this entry
            dictionary_copy = dicc_A.copy()
            num_player.append(dictionary_copy)
            print("")
            print(" ///////// NUMPLAYER after APPEND inside FOR ",num_player)
            print("")
            print("")
Output:
Lottery numbers {2, 7, 12, 17, 20, 21}
Line by line {'name': 'Rolf', 'numbers': {3, 4, 9, 12, 13, 14}}
name Dicc: {'Name': 'Rolf'}
xxxxxxxxxx NUMPLAYER before APPEND inside FOR [1000]
******number Dicc: {'Name': 'Rolf', 'Num': {12}}
///////// NUMPLAYER after APPEND inside FOR [1000, {'Name': 'Rolf', 'Num': {12}}]
Line by line {'name': 'Charlie', 'numbers': {4, 8, 10, 12, 13, 18}}
name Dicc: {'Name': 'Charlie', 'Num': {12}}
xxxxxxxxxx NUMPLAYER before APPEND inside FOR [1000, {'Name': 'Rolf', 'Num': {12}}]
******number Dicc: {'Name': 'Charlie', 'Num': {12}}
///////// NUMPLAYER after APPEND inside FOR [1000, {'Name': 'Rolf', 'Num': {12}}, {'Name': 'Charlie', 'Num': {12}}]
Line by line {'name': 'Anna', 'numbers': {10, 14, 16, 17, 20, 21}}
name Dicc: {'Name': 'Anna', 'Num': {12}}
xxxxxxxxxx NUMPLAYER before APPEND inside FOR [1000, {'Name': 'Rolf', 'Num': {12}}, {'Name': 'Charlie', 'Num': {12}}]
******number Dicc: {'Name': 'Anna', 'Num': {17, 20, 21}}
///////// NUMPLAYER after APPEND inside FOR [1000, {'Name': 'Rolf', 'Num': {12}}, {'Name': 'Charlie', 'Num': {12}}, {'Name': 'Anna', 'Num': {17, 20, 21}}]
Line by line {'name': 'Jen', 'numbers': {3, 6, 8, 9, 13, 14}}
name Dicc: {'Name': 'Jen', 'Num': {17, 20, 21}}
xxxxxxxxxx NUMPLAYER before APPEND inside FOR [1000, {'Name': 'Rolf', 'Num': {12}}, {'Name': 'Charlie', 'Num': {12}}, {'Name': 'Anna', 'Num': {17, 20, 21}}]
******number Dicc: {'Name': 'Jen', 'Num': set()}
///////// NUMPLAYER after APPEND inside FOR [1000, {'Name': 'Rolf', 'Num': {12}}, {'Name': 'Charlie', 'Num': {12}}, {'Name': 'Anna', 'Num': {17, 20, 21}}, {'Name': 'Jen', 'Num': set()}]
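The difference between the two versions can be reduced to a few lines. The sketch below is only an illustration with made-up variable names (shared, aliased, fresh); it also shows an alternative to .copy(), which is to create a new dictionary on every iteration.

```python
# Version 1: one dict object is reused, so the list holds the same object twice
shared = {}
aliased = []
for name in ["Rolf", "Charlie"]:
    shared["Name"] = name
    aliased.append(shared)  # appends a reference, not a snapshot

# Version 2: a new dict is created on each iteration, so the entries stay independent
fresh = []
for name in ["Rolf", "Charlie"]:
    entry = {"Name": name}
    fresh.append(entry)

print(aliased)                # both entries show the last name written
print(aliased[0] is aliased[1])
print(fresh)
```

Creating the dictionary inside the loop avoids the need for a copy and also prevents stale keys from a previous player from leaking into the next one.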
I have two DataFrames and am trying to compare their values, but I get a ValueError about broadcasting:
import numpy as np
import pandas as pd
dict_1 = {'a': {0: [{'value': 'A123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}]},
'b': {0: [{'value': 'B5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}]},
"c" : "hello"}
dict_2 = {'a': {0: np.nan},
'b': {0: [{'value': 'B4785',
'label': 'Innovation'},
{'value': 'B4635', 'label': 'Businees Value'},
{'value': 'B1234', 'label': 'Requirements'},
{'value': 'B9853', 'label': 'Stories'}]},
'c': "hello"
}
df1 = pd.DataFrame(dict_1)
df2 = pd.DataFrame(dict_2)
Here I want to compare individual rows rather than two complete DataFrames (in my scenario df1 has shape (500, 2) and df2 has shape (1, 2)). So I used the code below to extract the values that differ between the rows.
df1[~(df1[['a', 'b', 'c']] == df2[['a', 'b', 'c']].iloc[0])]
The desired result should be as follows. The single row of df2 should be compared with every row of df1 (in my scenario df1 has more than one row). Where the values are identical the result should be NaN; otherwise I should get the corresponding value from df1.
You can use mask to replace the cells where the comparison is True with np.nan. If df1 and df2 each have a single row:
condition = df1 == df2
df1.mask(condition, other=np.nan)
If df1 has more than one row, you can pass mask a condition built with apply, which compares each row of df1 to the first row of df2 and returns True or False values. Comparing the two DataFrames directly would fail because their shapes differ.
dict_1 = {'a':
{0: [{'value': 'A123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}],
1: [{'value': 'B123',
'label': 'Professional'},
{'value': 'B141', 'label': 'Passion'}]
},
'b': {0: [{'value': 'B5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}],
1: [{'value': 'C5529',
'label': 'Innovation'},
{'value': 'B3134', 'label': 'Businees Value'},
{'value': 'B3856',
'label': 'Electrofication'},
{'value': 'B3859', 'label': 'Insurance'},
{'value': 'B3856', 'label': 'Requirements'},
{'value': 'B3345', 'label': 'Stories'}],
},
"c" : {0: "hello", 1: "hola"}}
# New df1 with two rows
df1 = pd.DataFrame(dict_1)
condition = df1.apply(lambda x: x==df2.iloc[0], axis=1)
df1.mask(condition, other=np.nan)
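When the cells hold plain scalars rather than lists, the same row-against-all-rows comparison can probably be written without apply. The sketch below is an assumption-laden illustration: the small frames are invented for the example, and it uses DataFrame.eq with axis="columns" to align the reference row with the column labels. It is not a drop-in replacement for the list-valued data above.

```python
import pandas as pd

# Invented example data with scalar cells (the question's cells hold lists)
df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": ["hello", "hola"]})
df2 = pd.DataFrame({"a": [1], "b": ["z"], "c": ["hello"]})

reference = df2.iloc[0]                        # the single row to compare against
matches = df1.eq(reference, axis="columns")    # True where a cell equals the reference
result = df1.mask(matches)                     # matching cells become NaN
print(result)
```

Cells that differ from the reference row keep their df1 value, which is the behaviour the question asks for.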
[{'id': 6, 'name': 'Jorge'}, {'id': 6, 'name': 'Matthews'}, {'id': 6, 'name': 'Matthews'}, {'id': 7, 'name': 'Christine'}, {'id': 7, 'name': 'Smith'}, {'id': 7, 'name': 'Chris'}]
I want to collect the names that share the same id into one list per id, like this:
[{'id': 6, 'name': ['Jorge','Matthews','Matthews']}, {'id': 7, 'name': ['Christine','Smith','Chris']}]
L = [{'id': 6, 'name': 'Jorge'}, {'id': 6, 'name': 'Matthews'}, {'id': 6, 'name': 'Matthews'}, {'id': 7, 'name': 'Christine'}, {'id': 7, 'name': 'Smith'}, {'id': 7, 'name': 'Chris'}]
temp = {}
for d in L:
    if d['id'] not in temp:
        temp[d['id']] = []
    temp[d['id']].append(d['name'])
answer = []
for k in sorted(temp):
    answer.append({'id':k, 'name':temp[k]})
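You can use itertools.groupby to group all the ids and then just extract the name for each element in the group. groupby only merges consecutive items with the same key, which is why the list is sorted by id first: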
In [1]:
import itertools as it
import operator as op
L = [{'id': 6, 'name': 'Jorge'}, ...]
_id = op.itemgetter('id')
[{'id':k, 'name':[e['name'] for e in g]} for k, g in it.groupby(sorted(L, key=_id), key=_id)]
Out[1]:
[{'id': 6, 'name': ['Jorge', 'Matthews', 'Matthews']},
{'id': 7, 'name': ['Christine', 'Smith', 'Chris']}]
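A third variant, shown here only as a sketch, uses collections.defaultdict so that no explicit membership check or sorting of the input is needed. The names grouped and answer are illustrative.

```python
from collections import defaultdict

L = [{'id': 6, 'name': 'Jorge'}, {'id': 6, 'name': 'Matthews'},
     {'id': 6, 'name': 'Matthews'}, {'id': 7, 'name': 'Christine'},
     {'id': 7, 'name': 'Smith'}, {'id': 7, 'name': 'Chris'}]

# Missing keys start out as an empty list automatically
grouped = defaultdict(list)
for d in L:
    grouped[d['id']].append(d['name'])

# Sort the ids so the output order does not depend on the input order
answer = [{'id': k, 'name': grouped[k]} for k in sorted(grouped)]
print(answer)
```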
I have a dataset that consists of thousands of entries such as the following:
[{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2015',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '392168030'},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2014',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '384356146'},
....17020-ish rows later.....
{'country': {'id': 'XH', 'value': 'IDA blend'},
'date': '1960',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '163861743'},
...]
I want to create a DataFrame using pandas such that y-axis = 'id' and x-axis = 'date', with 'value' being the stored value. I can't figure out the best way to approach this...
EDIT:
Imagine a sheet with just numbers ('value' from the dataset). The x-axis columns would be the extracted date and the y-axis rows would be the country id ('id'). The final object would be a dataset that is y*x in size. The numbers would all be of type 'float'.
EDIT 2:
The dataset represents ~304 countries from 1960 - 2016, so there are approximately 304 * 56 = 17024 entries in the dataset. I need to store the 'value' (where for entry 2, value = 392168030) with respect to each country and date.
EDIT 3:
Using the above data, an example output data set would be structured thusly:
      2016   2015        2014        ...   1960
1A    None   392168030   384356146   ...   w
...
XH    x      y           z           ...   163861743
First, extract the relevant fields from the original dataset:
import pandas as pd
dataset = [{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2016',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': None},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2015',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '392168030'},
{'country': {'id': '1A', 'value': 'Arab World'},
'date': '2014',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '384356146'},
{'country': {'id': 'XH', 'value': 'IDA blend'},
'date': '1960',
'decimal': '0',
'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
'value': '163861743'}]
# Keep only the country id, the date, and the value of each entry
rows = [[entry['country']['id'], entry['date'], entry['value']] for entry in dataset]
df = pd.DataFrame(rows, columns=['id','date','value'])
Then pivot the DataFrame:
df = df.pivot(index='id',columns='date',values='value')
The output:
date       1960       2014       2015  2016
id
1A         None  384356146  392168030  None
XH    163861743       None       None  None
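The question also asks for float values and for the newest year first. A possible extension of the pivot above is sketched here under a few assumptions: the records are trimmed to the three fields that are used, pd.to_numeric is used to turn the string values into numbers (None becomes NaN), and the variable name table is illustrative.

```python
import pandas as pd

# Trimmed records: only the fields used below are kept
dataset = [
    {'country': {'id': '1A'}, 'date': '2016', 'value': None},
    {'country': {'id': '1A'}, 'date': '2015', 'value': '392168030'},
    {'country': {'id': '1A'}, 'date': '2014', 'value': '384356146'},
    {'country': {'id': 'XH'}, 'date': '1960', 'value': '163861743'},
]

rows = [(e['country']['id'], e['date'], e['value']) for e in dataset]
df = pd.DataFrame(rows, columns=['id', 'date', 'value'])

# Convert numeric strings to numbers; missing values become NaN (float column)
df['value'] = pd.to_numeric(df['value'])

table = df.pivot(index='id', columns='date', values='value')

# Order the date columns from newest to oldest, as in the question's sketch
table = table[sorted(table.columns, reverse=True)]
print(table)
```

Note that pivot raises an error if the same id and date pair occurs more than once, so duplicated entries would need to be aggregated first.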
I had to make a guess about how the "thousands of entries" might look but I came up with this possible solution.
entry1 = {
    'country': {'id': '1A', 'value': 'Arab World'},
    'date': '2016',
    'decimal': '0',
    'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
    'value': None
}
entry2 = {
    'country': {'id': '1B', 'value': 'Another World'},
    'date': '2016',
    'decimal': '0',
    'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
    'value': None
}
entries = [entry1, entry2]
import pandas as pd
countries_index = []
date_cols = []
# Collect the date and the country id of every entry
for entry in entries:
    date_cols.append(entry['date'])
    countries_index.append(entry['country']['id'])
df = pd.DataFrame(date_cols, columns=['date'], index=countries_index)
This creates a DataFrame, df, which looks like this:
    date
1A  2016
1B  2016