Flatten nested dict values in a list in Python 3

I've got this data structure coming from the Vimeo API:
{'duration': 720,
 'language': 'sv',
 'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
 'name': 'INCIDENT BY A BANK',
 'user': {
     'link': 'https://vimeo.com/neweuropefilmsales',
     'location': 'Warsaw, Poland',
     'name': 'New Europe Film Sales'
 }
}
I want to transform it into
[720, "sv", "http..", "incident..", "http..", "Warsaw", "New Europe.."]
so that I can load it into a Google spreadsheet. I also need to keep the order of the values consistent.
PS. I've seen similar questions, but the answers are not in Python 3.
Thanks

I'm going to use the csv module to create a CSV file like you've described out of your data.
First, we should use a header row for your file, so the order doesn't matter, only dict keys do:
import csv
# This defines the order the columns will show up in the final file
fieldnames = [
    'name', 'link', 'duration', 'language',
    'user_name', 'user_link', 'user_location',
]
# Open the file with Python
with open('my_file.csv', 'w', newline='') as my_file:
    # Attach a CSV writer to the file with the desired fieldnames
    writer = csv.DictWriter(my_file, fieldnames)
    # Write the header row
    writer.writeheader()
Notice the DictWriter: it lets us write dicts by key instead of by position (dicts did not preserve insertion order before Python 3.6, and the ordering is only guaranteed by the language from 3.7). With the default dialect, the above code will end up with a file like this:
name,link,duration,language,user_name,user_link,user_location
Which we can then add rows to, but let's convert your data first, so the keys match the above field names:
data = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
for key, value in data['user'].items():
    data['user_{}'.format(key)] = value
del data['user']
This ends up with the data dictionary like this:
data = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user_link': 'https://vimeo.com/neweuropefilmsales',
'user_location': 'Warsaw, Poland',
'user_name': 'New Europe Film Sales',
}
We can now simply insert this as a whole row to the CSV writer, and everything else is done automatically:
# Using the same writer from above (still inside the with block, while the file is open), insert the data
writer.writerow(data)
That's it, now just import this into your Google spreadsheets :)
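The manual `user_` loop above handles exactly one level of nesting. A minimal sketch of a more general helper follows, assuming nested keys should be joined with an underscore. The names `flatten_keys` and `sep` are hypothetical and not part of the csv module, and the example writes to an in-memory buffer instead of a real file so it can be run as-is:

```python
import csv
import io


def flatten_keys(nested, parent="", sep="_"):
    """Return a flat dict whose keys are the joined paths of the nested keys."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along
            flat.update(flatten_keys(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


data = {
    "duration": 720,
    "language": "sv",
    "name": "INCIDENT BY A BANK",
    "user": {"name": "New Europe Film Sales", "location": "Warsaw, Poland"},
}

fieldnames = ["name", "duration", "language", "user_name", "user_location"]
buffer = io.StringIO()  # stand-in for a real file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames)
writer.writeheader()
writer.writerow(flatten_keys(data))
csv_text = buffer.getvalue()
print(csv_text)
```

Since the column order comes from `fieldnames`, the order of keys in the API response does not matter here.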

This is a simple solution using recursion:
dictionary = {
    'duration': 720,
    'language': 'sv',
    'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
    'name': 'INCIDENT BY A BANK',
    'user': {
        'link': 'https://vimeo.com/neweuropefilmsales',
        'location': 'Warsaw, Poland',
        'name': 'New Europe Film Sales'
    }
}
def flatten(current, result=None):
    # Create a fresh list per top-level call; a mutable default ([]) would be
    # shared between calls and keep growing.
    if result is None:
        result = []
    if isinstance(current, dict):
        for key in current:
            flatten(current[key], result)
    else:
        result.append(current)
    return result
result = flatten(dictionary)
print(result)
Explanation: flatten() calls itself until it reaches a value that is not itself a dictionary (the else branch of if isinstance(current, dict):). That value is appended to the result list. This works for any depth of nested dictionaries.
See: How would I flatten a nested dictionary in Python 3?
I used the same solution, but I've changed the result collection to be a list.
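Because the question asks for a stable value order, one option is to visit the keys in sorted order rather than in whatever order the API returned them. The following is a sketch under that assumption (the recursive answer above iterates in dict order instead). `flatten_values` is a hypothetical name, and the generator form avoids passing an accumulator list around:

```python
def flatten_values(current):
    """Yield the leaf values of a nested dict, visiting keys in sorted order."""
    if isinstance(current, dict):
        for key in sorted(current):
            yield from flatten_values(current[key])
    else:
        yield current


video = {
    "name": "INCIDENT BY A BANK",
    "duration": 720,
    "user": {"name": "New Europe Film Sales", "location": "Warsaw, Poland"},
}
row = list(flatten_values(video))
print(row)
```

Sorting assumes all keys at one level are mutually comparable (for example, all strings), which holds for typical JSON responses.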

Related

Auto-extracting columns from nested dictionaries in pandas

I have a column, loaded from a jsonl file, that holds multiple nested dictionaries, as below:
`df['referenced_tweets'][0]`
producing (shortened output)
'id': '1392893055112400898',
'public_metrics': {'retweet_count': 0,
'reply_count': 1,
'like_count': 2,
'quote_count': 0},
'conversation_id': '1392893055112400898',
'created_at': '2021-05-13T17:22:37.000Z',
'reply_settings': 'everyone',
'entities': {'annotations': [{'start': 65,
'end': 77,
'probability': 0.9719000000000001,
'type': 'Person',
'normalized_text': 'Jill McMillan'}],
'mentions': [{'start': 23,
'end': 36,
'username': 'usasklibrary',
'protected': False,
'description': 'The official account of the University Library at USask.',
'created_at': '2019-06-04T17:19:12.000Z',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'url': '*removed*',
'expanded_url': 'http://library.usask.ca',
'display_url': 'library.usask.ca'}]}},
'name': 'University Library',
'url': '....',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1278828446026629120/G1w7t-HK_normal.jpg',
'verified': False,
'id': '1135959197902921728',
'public_metrics': {'followers_count': 365,
'following_count': 119,
'tweet_count': 556,
'listed_count': 9}}]},
'text': 'Wonderful session with #usasklibrary Graduate Writing Specialist Jill McMillan who is walking SURE students through the process of organizing/analyzing a literature review! So grateful to the library -- our largest SURE: Student Undergraduate Research Experience partner!',
...
My intention is to create a function that automatically extracts specific fields (e.g. text, type) for the entire dataframe (not just one row). So I wrote the function:
### x = df['referenced_tweets']
def extract_TextType(x):
    dic = {}
    for i in x:
        if i != " ":
            new_df= pd.DataFrame.from_dict(i)
            dic['refd_text']=new_df['text']
            dic['refd_type'] = new_df['type']
        else:
            print('none')
    return dic
However running the function:
df['referenced_tweets'].apply(extract_TextType)
produces an error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
The whole point is to extract these two nested columns (texts & type) from the original "referenced tweets" column and match them to the original rows.
What am I doing wrong pls?
A couple of things to consider here. referenced_tweets holds a list, so the line new_df = pd.DataFrame.from_dict(i) is most likely not parsing it the way you expect.
Also, because that list may contain multiple tweets, you are right to iterate over it, but you don't need to put it into a df to do so. Since you are using .apply(), your function also creates a new dictionary in each cell. If that's what you want, that is fine. If you really just want a new dataframe, you can adapt the following. I don't have access to referenced_tweets, so I'm using entities as an example.
Here's my example:
ents = df[df.entities.notnull()]['entities']
dict_hold_list = []
for ent in ents:
    # print(ent['hashtags'])
    for htag in ent['hashtags']:
        # print(htag['text'])
        # print(htag['indices'])
        dict_hold_list.append({'text': htag['text'], 'indices': htag['indices']})
df_hashtags = pd.DataFrame(dict_hold_list)
Because you have not provided a good working json or dataframe, I can't test this, but your solution could look like this:
refs = df[df.referenced_tweets.notnull()]['referenced_tweets']
dict_hold_list = []
for ref in refs:
    # print(ref)
    for r in ref:
        # print(r['text'])
        # print(r['type'])
        dict_hold_list.append({'text': r['text'], 'type': r['type']})
df_ref_tweets = pd.DataFrame(dict_hold_list)
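The loop above drops the link back to the original rows, which the question asks to keep. A sketch that records it is shown below in plain Python, assuming each cell is either a list of dicts with `text` and `type` keys or a missing value. The sample `cells` data and the name `extract_text_type` are made up for illustration:

```python
def extract_text_type(cells):
    """Collect one record per referenced tweet, tagged with its source row position."""
    records = []
    for row_position, cell in enumerate(cells):
        if not isinstance(cell, list):  # skip None/NaN or other non-list cells
            continue
        for ref in cell:
            records.append({
                "row": row_position,
                "text": ref.get("text"),
                "type": ref.get("type"),
            })
    return records


# Hypothetical stand-in for the values of df['referenced_tweets']
cells = [
    [{"text": "first tweet", "type": "replied_to"}],
    None,
    [{"text": "second", "type": "quoted"}, {"text": "third", "type": "retweeted"}],
]
records = extract_text_type(cells)
print(records)
```

The resulting list of records can be passed to pd.DataFrame as in the answer. Note that `row` here is a position, not an index label. With a real Series, iterating over `Series.items()` instead of `enumerate` would yield the index labels.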

How to append the repeated values to same key in dictionary list

I have the list of dictionaries below, which contains repeated ids. I need to merge the names of entries that share an id into a single entry, and keep the remaining entries as they are.
veh_entry=[{'name': 'scott', 'id': '17'},{'name': 'thomas', 'id': '18'}, {'name': 'tony', 'id': '17'}]
I tried the approach below, but it does not work as expected:
add=[]
test={}
for item in veh_entry:
    if item['id'] not in add:
        test['name']=item['name']
        add.append(item['id'])
    else:
        test['name']=(test['name']+ ','+item['name'])
The expected result is:
[{'name': 'scott, tony', 'id':'17'},{'name': 'thomas', 'id': '18'}]
The basic logic is to compare the ids of different items in the list; if the ids match, join the names and remove the repeated item from the list.
final_result = []
# Note: popping from veh_entry while indexing into it works for this input,
# but can raise IndexError when the duplicate is not the last item.
for i in range(len(veh_entry)-1):
    for j in range(i+1,len(veh_entry)):
        a = dict()
        if veh_entry[i]['id'] == veh_entry[j]['id']:
            a['name'] = veh_entry[i]['name'] +','+veh_entry[j]['name']
            a['id'] = veh_entry[i]['id']
            veh_entry.pop(j)
            final_result.append(a)
        else:
            final_result.append(veh_entry[j])
print(final_result)
Output:- [{'name': 'thomas', 'id': '18'}, {'name': 'scott,tony', 'id': '17'}]
You can go with this:
def remove_dup(arr):
    names=[] # using list here
    ids=[] # instead of dict
    for x in arr:
        id,name=x['id'],x['name']
        if id not in ids: # search for existing id
            ids.append(id)
            names.append(name)
        else:
            names[ids.index(id)]+=", "+name
    return [{'name':name,'id':id} for name,id in zip(names,ids)]
veh_entry=[{'name': 'scott', 'id': '17'},{'name': 'thomas', 'id': '18'}, {'name': 'tony', 'id': '17'}]
print(remove_dup(veh_entry))
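As an alternative to the two parallel lists, the same grouping can be sketched with a single dict keyed by id. This relies on dicts preserving insertion order (Python 3.7+), and `merge_by_id` is a hypothetical name:

```python
def merge_by_id(entries):
    """Join the names of entries that share an id, keeping first-seen id order."""
    names_by_id = {}
    for entry in entries:
        # setdefault creates the list the first time an id is seen
        names_by_id.setdefault(entry["id"], []).append(entry["name"])
    return [
        {"name": ", ".join(names), "id": entry_id}
        for entry_id, names in names_by_id.items()
    ]


veh_entry = [
    {"name": "scott", "id": "17"},
    {"name": "thomas", "id": "18"},
    {"name": "tony", "id": "17"},
]
merged = merge_by_id(veh_entry)
print(merged)
```

The dict lookup avoids the repeated `ids.index(...)` scan, and the input list is left unmodified.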

How to create a nested dict from list with blank keys across the board?

I know that dict.fromkeys, used as rt_dict = dict.fromkeys(['name', 'description', 'model'], ''), gets me halfway there, but how do I adjust it to get a result like this:
{'name': '', 'description': {'year': '', 'make': ''}, 'model': ''}
All keys without nested dictionaries should have blank values. All values of the nested dictionaries should be blank IF they do not have nested dictionaries.
It's not clear what your input looks like, but assuming a list like the one below, this will work.
keys = ['name', {'description': ['year', 'make']}, 'model']
result = {}
for key in keys:
    if isinstance(key, dict):
        # Single-entry dict: its key names the nested dict, its value lists the nested keys
        result[next(iter(key))] = dict.fromkeys(next(iter(key.values())), '')
    else:
        result[key] = ''
Output:
{'name': '', 'description': {'year': '', 'make': ''}, 'model': ''}
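The loop above handles one level of nesting. If the input may nest deeper, a recursive variant is one way to go. This is a sketch that assumes the same input convention (plain strings for blank keys, `{key: [sub-keys]}` dicts for nested ones). `blank_dict` and the deeper `spec` are made-up names and data:

```python
def blank_dict(spec):
    """Build a dict with '' leaves from a list of keys and {key: sub_spec} dicts."""
    result = {}
    for item in spec:
        if isinstance(item, dict):
            # Each entry maps a key to the spec of its nested dict
            for key, sub_spec in item.items():
                result[key] = blank_dict(sub_spec)
        else:
            result[item] = ""
    return result


spec = ["name", {"description": ["year", {"make": ["brand", "country"]}]}, "model"]
nested = blank_dict(spec)
print(nested)
```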

Extracting Rows by specific keyword in Python (Without using Pandas)

My csv file looks like this:-
ID,Product,Price
1,Milk,20
2,Bottle,200
3,Mobile,258963
4,Milk,24
5,Mobile,10000
My code for extracting rows is as follows:
def search_data():
fin = open('Products/data.csv')
word = input() # "Milk"
found = {}
for line in fin:
if word in line:
found[word]=line
return found
search_data()
When I run the code above, I get this output:
{'Milk': '1,Milk ,20\n'}
What I want is that, if I search for "Milk", I get all the rows that have "Milk" as the Product.
Note: this should be done in plain Python only, without Pandas.
The expected output should look like this:
[{"ID": "1", "Product": "Milk ", "Price": "20"},{"ID": "4", "Product": "Milk ", "Price": "24"}]
Can anyone tell me what I am doing wrong?
In your script, every assignment found[word] = line overwrites the value that was stored before it. A better approach is to load all the data and then filter it.
If file.csv contains:
ID Product Price
1 Milk 20
2 Bottle 200
3 Mobile 10,000
4 Milk 24
5 Mobile 15,000
Then this script:
# load data:
with open('file.csv', 'r') as f_in:
    lines = [line.split() for line in map(str.strip, f_in) if line]
data = [dict(zip(lines[0], l)) for l in lines[1:]]
# print only items with 'Product': 'Milk'
print([i for i in data if i['Product'] == 'Milk'])
Prints only items with Product == Milk:
[{'ID': '1', 'Product': 'Milk', 'Price': '20'}, {'ID': '4', 'Product': 'Milk', 'Price': '24'}]
EDIT: If your data are separated by commas (,), you can use csv module to read it:
File.csv contains:
ID,Product,Price
1,Milk ,20
2,Bottle,200
3,Mobile,258963
4,Milk ,24
5,Mobile,10000
Then the script:
import csv
# load data:
with open('file.csv', 'r') as f_in:
    csvreader = csv.reader(f_in, delimiter=',', quotechar='"')
    lines = [line for line in csvreader if line]
data = [dict(zip(lines[0], l)) for l in lines[1:]]
# print only items with 'Product': 'Milk'
print([i for i in data if i['Product'].strip() == 'Milk'])
Prints:
[{'ID': '1', 'Product': 'Milk ', 'Price': '20'}, {'ID': '4', 'Product': 'Milk ', 'Price': '24'}]
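The header-zipping step can also be left to `csv.DictReader`, which uses the first row as keys. The sketch below assumes the comma-separated layout shown above and reads from an in-memory string so it runs without a file on disk. `search_rows` and `CSV_TEXT` are hypothetical names:

```python
import csv
import io

CSV_TEXT = """ID,Product,Price
1,Milk ,20
2,Bottle,200
3,Mobile,258963
4,Milk ,24
5,Mobile,10000
"""


def search_rows(csv_file, product):
    """Return every row whose Product column equals `product`, ignoring padding."""
    reader = csv.DictReader(csv_file)  # first line supplies the keys
    return [row for row in reader if row["Product"].strip() == product]


matches = search_rows(io.StringIO(CSV_TEXT), "Milk")
print(matches)
```

With a real file, the `io.StringIO(...)` argument would be replaced by a file object opened with `newline=''`, as the csv documentation recommends.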

How do I pass a list as a parameter in a user-defined function?

How do I pass a list as a parameter in a function?
I am trying to form a user-defined function called 'get_all_latitude' where it will extract the latitude according to its listing id from a dataset. An excerpt of the dataset (it is a list of dictionaries) is as follows:
{
'listing_id': '1133718',
'survey_id': '1280',
'host_id': '6219420',
'room_type': 'Shared room',
'country': '',
'city': 'Singapore',
'borough': '',
'neighborhood': 'MK03',
'reviews': 9.0,
'overall_satisfaction': 4.5,
'accommodates': '12',
'bedrooms': '1.0',
'bathrooms': '',
'price': 74.0,
'minstay': '',
'last_modified': '2017-05-17 09:10:25.431659',
'latitude': 1.293354,
'longitude': 103.769226,
'location': '0101000020E6100000E84EB0FF3AF159409C69C2F693B1F43F'
}
This is my progress thus far:
def get_all_latitude(data, list_id):
    new_list = []
    for row in data:
        if row['listing_id'] == list_id:
            new_list.append(row['latitude'])
    return new_list
This works if I pass a single listing id as the 2nd argument (e.g. get_all_latitude(airbnb_data, '1133718')), but I am wondering how I can get it to work with a list (e.g. get_all_latitude(airbnb_data, ['10350448','13507262','13642646'])), as I do not know how to write it so that it handles each element of the list.
Try this:
def get_all_latitude(data, list_id):
    new_list = []
    for row in data:
        if row['listing_id'] in list_id:
            new_list.append(row['latitude'])
    return new_list
Or, if you want a separate list of latitudes for each listing id:
def get_all_latitude(data, list_ids):
    # One empty list per requested listing id
    new_lists = {list_id: [] for list_id in list_ids}
    for row in data:
        if row['listing_id'] in new_lists:
            new_lists[row['listing_id']].append(row['latitude'])
    return new_lists
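If the function should keep accepting a single id as well as a list, one possible sketch is to normalise the argument first. The sample rows below are illustrative: only the first latitude comes from the question, and the other values are made up:

```python
def get_all_latitude(data, list_ids):
    """Return the latitudes of rows whose listing_id is one of list_ids."""
    if isinstance(list_ids, str):
        list_ids = [list_ids]  # allow a single id for backward compatibility
    wanted = set(list_ids)  # set membership is a constant-time check
    return [row["latitude"] for row in data if row["listing_id"] in wanted]


# Hypothetical excerpt of the dataset (a list of dictionaries)
airbnb_data = [
    {"listing_id": "1133718", "latitude": 1.293354},
    {"listing_id": "10350448", "latitude": 1.3},
    {"listing_id": "13507262", "latitude": 1.31},
]
print(get_all_latitude(airbnb_data, "1133718"))
print(get_all_latitude(airbnb_data, ["10350448", "13507262", "13642646"]))
```

Ids that do not occur in the data (such as '13642646' here) are simply skipped.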
