Auto-extracting columns from nested dictionaries in pandas - python-3.x

So I have a column of nested dictionaries in a jsonl file, as below:
`df['referenced_tweets'][0]`
producing (shortened output)
'id': '1392893055112400898',
'public_metrics': {'retweet_count': 0,
'reply_count': 1,
'like_count': 2,
'quote_count': 0},
'conversation_id': '1392893055112400898',
'created_at': '2021-05-13T17:22:37.000Z',
'reply_settings': 'everyone',
'entities': {'annotations': [{'start': 65,
'end': 77,
'probability': 0.9719000000000001,
'type': 'Person',
'normalized_text': 'Jill McMillan'}],
'mentions': [{'start': 23,
'end': 36,
'username': 'usasklibrary',
'protected': False,
'description': 'The official account of the University Library at USask.',
'created_at': '2019-06-04T17:19:12.000Z',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'url': '*removed*',
'expanded_url': 'http://library.usask.ca',
'display_url': 'library.usask.ca'}]}},
'name': 'University Library',
'url': '....',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1278828446026629120/G1w7t-HK_normal.jpg',
'verified': False,
'id': '1135959197902921728',
'public_metrics': {'followers_count': 365,
'following_count': 119,
'tweet_count': 556,
'listed_count': 9}}]},
'text': 'Wonderful session with #usasklibrary Graduate Writing Specialist Jill McMillan who is walking SURE students through the process of organizing/analyzing a literature review! So grateful to the library -- our largest SURE: Student Undergraduate Research Experience partner!',
...
My intention is to create a function that auto-extracts specific columns (e.g. text, type) across the entire dataframe (not just one row). So I wrote the function:
# x = df['referenced_tweets']
def extract_TextType(x):
    dic = {}
    for i in x:
        if i != " ":
            new_df = pd.DataFrame.from_dict(i)
            dic['refd_text'] = new_df['text']
            dic['refd_type'] = new_df['type']
        else:
            print('none')
    return dic
However running the function:
df['referenced_tweets'].apply(extract_TextType)
produces an error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
The whole point is to extract these two nested columns (texts & type) from the original "referenced tweets" column and match them to the original rows.
What am I doing wrong pls?
P.S.
The original df is shotgrabbed below:

A couple of things to consider here. referenced_tweets holds a list, so the line new_df = pd.DataFrame.from_dict(i) is most likely not parsing it the way you expect.
Also, because there may be multiple tweets in that list, you are correctly iterating over it, but you don't need to put it into a dataframe to do so. Using .apply() this way will also create a new dictionary in each cell. If that's what you want, that is fine. If you really just want a new dataframe, you can adapt the following. I don't have access to referenced_tweets, so I'm using entities as an example.
Here's my example:
ents = df[df.entities.notnull()]['entities']
dict_hold_list = []
for ent in ents:
    # print(ent['hashtags'])
    for htag in ent['hashtags']:
        # print(htag['text'])
        # print(htag['indices'])
        dict_hold_list.append({'text': htag['text'], 'indices': htag['indices']})
df_hashtags = pd.DataFrame(dict_hold_list)
Because you have not provided a complete working json or dataframe, I can't test this, but your solution could look like this:
refs = df[df.referenced_tweets.notnull()]['referenced_tweets']
dict_hold_list = []
for ref in refs:
    # print(ref)
    for r in ref:
        # print(r['text'])
        # print(r['type'])
        dict_hold_list.append({'text': r['text'], 'type': r['type']})
df_ref_tweets = pd.DataFrame(dict_hold_list)
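Since the whole point was matching the extracted values back to the original rows, here is one sketch of that (assuming each cell holds a list of dicts with 'text' and 'type' keys; the toy dataframe and the choice of taking the first referenced tweet per row are my own assumptions):

```python
import pandas as pd

def extract_text_type(cell):
    # cell is expected to be a list of dicts; take the first entry's
    # 'text' and 'type' so the result aligns one-to-one with the row
    if isinstance(cell, list) and len(cell) > 0:
        first = cell[0]
        return pd.Series({'refd_text': first.get('text'),
                          'refd_type': first.get('type')})
    return pd.Series({'refd_text': None, 'refd_type': None})

# toy frame standing in for the real data
df = pd.DataFrame({'referenced_tweets': [
    [{'type': 'replied_to', 'text': 'hello'}],
    None,
    [{'type': 'quoted', 'text': 'world'}],
]})

# apply returns a DataFrame because the function returns a Series,
# and join realigns it with the original rows by index
extracted = df['referenced_tweets'].apply(extract_text_type)
result = df.join(extracted)
print(result[['refd_text', 'refd_type']])
```

Because the helper returns a pd.Series, .apply() produces a two-column frame that shares the original index, so the join keeps every row matched up, including rows with no referenced tweets.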

Related

'dict' object has not attribute 'pk' when using Django bulk_update()

I have the following code:
obj = Products.objects.filter(dump__product_name='ABC', dump__product_color='black').values()
new_price = [100, 200, 300]
for item in range(len(obj)):
    obj[item]['price'] -= new_price[item]
Products.objects.filter(dump__product_name='ABC', dump__product_color='black').bulk_update(obj, ['price'])
But I am getting the error: Exception inside application: 'dict' object has no attribute 'pk'
The value of obj looks like this:
<QuerySet [{'id': 1, 'product_name': 'Acer - Laptop', 'price': 350},
{'id': 1, 'product_name': 'Dell - Laptop', 'price': 450},
{'id': 1, 'product_name': 'Samsung- Laptop', 'price': 650}]>
I am unable to figure out what's wrong with the code. Any help would be much appreciated. Thanks a lot in advance
You should not use .values() since that will create dictionaries instead of model objects, and thus does not offer all the functionality the model provides.
You can work with:
obj = list(Products.objects.filter(
    dump__product_name='ABC', dump__product_color='black'
))
new_price = [100, 200, 300]
for item, prc in zip(obj, new_price):
    item.price -= prc
Products.objects.bulk_update(obj, ['price'])
The QuerySet.values method returns a sequence of dicts, but QuerySet.bulk_update takes a sequence of model instances, not dicts. Instead, iterate over the QuerySet returned by the filter method to get model instances you can modify. You also don't have to filter again when you perform bulk_update, because it operates on the instances' primary keys:
items = list(Products.objects.filter(dump__product_name='ABC', dump__product_color='black'))
new_prices = [100, 200, 300]
for item, new_price in zip(items, new_prices):
    item.price -= new_price
Products.objects.bulk_update(items, ['price'])

extract dictionary elements from nested list in python

I have a question.
I have a nested list that looks like this.
x= [[{'screen_name': 'BreitbartNews',
'name': 'Breitbart News',
'id': 457984599,
'id_str': '457984599',
'indices': [126, 140]}],
[],
[],
[{'screen_name': 'BreitbartNews',
'name': 'Breitbart News',
'id': 457984599,
'id_str': '457984599',
'indices': [98, 112]}],
[{'screen_name': 'BreitbartNews',
'name': 'Breitbart News',
'id': 457984599,
'id_str': '457984599',
'indices': [82, 96]}]]
There are some empty lists inside the main list.
What I am trying to do is extract each screen_name and append it to a new list, including entries for the empty sublists (maybe noting them as 'null').
y = []
for i in x:
    for j in i:
        if len(j) == 0:
            n = 'null'
        else:
            n = j['screen_name']
        y.append(n)
I don't know why the code above outputs a list,
['BreitbartNews',
'BreitbartNews',
'BreitbartNews',
'BreitbartNews',
'BreitbartNews']
which doesn't reflect the empty sublists.
Can anyone help me how I can refine my code to make it right?
You are checking the lengths of the wrong lists. The empty lists show up as the i values, so for them the inner loop never runs at all.
The correct code would be
y = []
for i in x:
    if len(i) == 0:
        n = 'null'
    else:
        n = i[0]['screen_name']
    y.append(n)
It may help to print(i) in each iteration to better understand what is actually happening.
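If a sublist could ever contain more than one mention, a small variation (a sketch; the second sample sublist and 'OtherAccount' are made up for illustration) collects every screen_name instead of only the first:

```python
x = [
    [{'screen_name': 'BreitbartNews'}],
    [],
    [{'screen_name': 'BreitbartNews'}, {'screen_name': 'OtherAccount'}],
]

y = []
for sub in x:
    if not sub:                     # empty sublist -> placeholder
        y.append('null')
    else:                           # one entry per mention in the sublist
        y.extend(d['screen_name'] for d in sub)

print(y)  # ['BreitbartNews', 'null', 'BreitbartNews', 'OtherAccount']
```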

How do I turn an RDD into a dictionary in pyspark?

So I have an RDD that I need to turn into a dictionary. However, I'm getting a few errors and I'm stuck.
First thing I do is pull in my csv file:
dataset = spark.read.csv('/user/myuser/testing_directory/output_csv', inferSchema = True, header = True)
Then I collect the data into an RDD:
pre_experian_rdd = dataset.collect()
So my data looks like so:
Row(name='BETTING Golf Course', address='1234 main st', city_name='GARDEN HOUSE', state='OH', zipcode=45209)
I need to keep the same key:value structure for the entire row because I need to make an api call. So it would need to be: `{name: value, address: value, city_name: value, state: value, zipcode: value}`
But when I do collectAsMap() I get the following error:
dictionary update sequence element #0 has length 5; 2 is required
I need the headers in there to represent the key:value
Can someone provide some insight in what I'm doing wrong, please?
Here is a snippit of my code:
dataset = spark.sparkContext.textFile('/user/myuser/testing_directory/output_csv')
pre_experian_rdd = dataset.collectAsMap()
Error message:
An error was encountered:
dictionary update sequence element #0 has length 36; 2 is required
Traceback (most recent call last):
File "/app/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/pyspark/rdd.py", line 1587, in collectAsMap
return dict(self.collect())
ValueError: dictionary update sequence element #0 has length 36; 2 is required
When I load it as a CSV, I have to do a few transformations:
dataset = spark.read.csv('/user/myuser/testing_directory/output_csv', inferSchema = True, header = True)
ex_rdd = dataset.collect()
So my rdd looks something like this:
[Row(name='BEAR LAKE GOLF COURSE & RESORT', address='PO BOX 331', city_name='GARDEN CITY', state='UT', zipcode=84028), Row(name='CHRISTENSEN & PETERSON, INC.', address='39 N MAIN ST', city_name='RICHFIELD', state='UT', zipcode=84701), Row(name='ALEXANDERS PRECISION MACHINING', address='15731 CHEMICAL LANE', city_name='HUNTINGTON BEACH', state='CA', zipcode=92649), Row(name='JOSEPH & JANET COLOMBO', address='1003 W COLLEGE', city_name='BOZEMAN', state='MT', zipcode=59715)
If I do ex_rdd.collectAsMap() I get the following error:
dictionary update sequence element #0 has length 5; 2 is required
To get around this I have to do the following:
df_dict = [row.asDict() for row in dataset.collect()]
[{'name': 'BEAR RESORT', 'address': 'POP 331', 'city_name': 'GARDEN LAKE', 'state': 'UT', 'zipcode': 12345}, {'name': 'CHRISTENSEN INC.', 'address': '12345 MAIN AVE', 'city_name': 'FAIRFIELD', 'state': 'UT', 'zipcode': 12345}, {'name': 'PRECISE MARCHING', 'address': '1234 TESTING LANE', 'city_name': 'HUNTINGTON BEACH', 'state': 'CA', 'zipcode': 92649}]
The issue with this is that it's still a list and I need a dictionary.
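One way to collapse that list of row dicts into a single dictionary (a sketch in plain Python; keying by name is an assumption, and any unique column would work) is a dict comprehension:

```python
# df_dict stands in for [row.asDict() for row in dataset.collect()]
df_dict = [
    {'name': 'BEAR RESORT', 'address': 'POP 331', 'state': 'UT'},
    {'name': 'CHRISTENSEN INC.', 'address': '12345 MAIN AVE', 'state': 'UT'},
]

# key each row's dict by one of its fields so lookups by that
# field become direct dictionary access
by_name = {row['name']: row for row in df_dict}

print(by_name['BEAR RESORT']['address'])  # POP 331
```

Each inner dict keeps the header:value structure needed for the api call, while the outer dict lets you look rows up directly instead of scanning a list.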

Create a dictionary with twitter hashtag counts

I read in a file of tweets I downloaded from a shared drive:
lst = list()
with open('cwctweets.txt', 'r', encoding='utf8') as infile:
    txt = infile.readlines()
Turned it into a list of 10 dictionaries:
for line in txt:
    dct = dict(line)
    lst.append(dct)
Each dictionary has I think 15 tweets, except the first one, lst[0], which has 100.
What I am trying to do is create a dictionary that contains the hashtags as keys, and the counts of the hashtags as the values.
All the dictionaries (0-9) look like this:
lst[0].keys()
dict_keys(['search_metadata', 'statuses'])
And I'm only focusing on 'statuses':
lst[0]['statuses'][1].keys()
dict_keys(['geo', 'entities', 'in_reply_to_user_id_str', 'favorite_count', 'retweeted', 'id', 'place', 'source', 'text', 'in_reply_to_user_id', 'favorited', 'id_str', 'lang', 'truncated', 'contributors', 'created_at', 'metadata', 'retweet_count', 'in_reply_to_status_id_str', 'coordinates', 'in_reply_to_screen_name', 'user', 'in_reply_to_status_id'])
Here is where I find hashtags:
lst[0]['statuses'][1]['entities'].keys()
dict_keys(['user_mentions', 'hashtags', 'urls', 'symbols'])
So I can do this to print out the hashtags:
for a in lst:
    for b in a['statuses']:
        print(b['entities']['hashtags'])
And my output looks like this:
[{'indices': [47, 56], 'text': 'WorldCup'},
{'indices': [57, 63], 'text': 'CWC15'}, {'indices':
[64, 72], 'text': 'IndvsSA'}]
[{'indices': [107, 113], 'text': 'CWC15'},
{'indices': [114, 122], 'text': 'NZvsENG'},
{'indices': [123, 134], 'text': 'Contenders'}]
...
But when I try this to create a dictionary with hashtags as keys and hashtag counts as values:
dct1 = dict()
for a in lst:
    for b in a['statuses']:
        if b['entities']['hashtags'] not in dct1:
            dct1[b] = 1
        else:
            dct1[b] += 1
This is the error I get:
TypeError Traceback (most recent call last)
<ipython-input-129-cc2e453c6f6d> in <module>()
2 for a in lst:
3 for b in a['statuses']:
----> 4 if b['entities']['hashtags'] not in dct1:
5 dct1[b] = 1
6 else:
TypeError: unhashable type: 'list'
Now I'm not sure why it isn't working if I can just print out the hashtags in a similar manner, any help, please?
The unhashable type error appears when an unhashable type, such as a list, is used to access a dictionary. The reason for this is that lists cannot be used as dictionary keys.
The line if b['entities']['hashtags'] not in dct1: checks whether a given key is in the dictionary.
Print the value of b['entities']['hashtags']. If it is surrounded by [ and ], it is a list.
From your code above, the hashtags key of b['entities'] contains a list of hashtags. You will need to pick a value out of each hashtag entry (for example its text) and use that as the key you check and count in your dictionary.
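Concretely, since hashtags is a list, you can loop over it and count each tag's text value. Here is a sketch using collections.Counter on data shaped like the printed output above (the sample tweets are stand-ins):

```python
from collections import Counter

# stand-in for the real data: each status has a list of hashtag
# dicts under entities -> hashtags, as in the printed output
lst = [
    {'statuses': [
        {'entities': {'hashtags': [{'text': 'WorldCup'}, {'text': 'CWC15'}]}},
        {'entities': {'hashtags': [{'text': 'CWC15'}]}},
    ]},
]

counts = Counter()
for a in lst:
    for b in a['statuses']:
        for tag in b['entities']['hashtags']:   # iterate the list itself
            counts[tag['text']] += 1            # count each hashtag string

print(dict(counts))  # {'WorldCup': 1, 'CWC15': 2}
```

The key difference from the failing code is the extra inner loop: strings like 'CWC15' are hashable and so can be dictionary keys, while the list that holds them cannot.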

flatten nested dict values in a list in Python 3

I've got this data structure coming from Vimeo API
{'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
I want to transform it into
[720, "sv", "http..", "incident..", "http..", "Warsaw", "New Europe.."]
to load it into a Google spreadsheet. I also need to maintain a consistent value order.
PS. I see similar questions but answers are not in Python 3
Thanks
I'm going to use the csv module to create a CSV file like you've described out of your data.
First, we should use a header row for your file, so the order doesn't matter, only dict keys do:
import csv

# This defines the order they'll show up in final file
fieldnames = [
    'name', 'link', 'duration', 'language',
    'user_name', 'user_link', 'user_location',
]

# Open the file with Python
with open('my_file.csv', 'w', newline='') as my_file:
    # Attach a CSV writer to the file with the desired fieldnames
    writer = csv.DictWriter(my_file, fieldnames)
    # Write the header row
    writer.writeheader()
Notice the DictWriter; this will allow us to write dicts based on their keys instead of their order (dicts don't guarantee insertion order before Python 3.7). The above code will end up with a file like this:
name,link,duration,language,user_name,user_link,user_location
Which we can then add rows to, but let's convert your data first, so the keys match the above field names:
data = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
for key, value in data['user'].items():
    data['user_{}'.format(key)] = value
del data['user']
This ends up with the data dictionary like this:
data = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user_link': 'https://vimeo.com/neweuropefilmsales',
'user_location': 'Warsaw, Poland',
'user_name': 'New Europe Film Sales',
}
We can now simply insert this as a whole row to the CSV writer, and everything else is done automatically:
# Using the same writer from above, insert the data from above
writer.writerow(data)
That's it, now just import this into your Google spreadsheets :)
This is a simple solution using recursion:
dictionary = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
def flatten(current, result=None):
    # a mutable default argument ([]) would be shared across calls,
    # so create a fresh list on each top-level call instead
    if result is None:
        result = []
    if isinstance(current, dict):
        for key in current:
            flatten(current[key], result)
    else:
        result.append(current)
    return result

result = flatten(dictionary)
print(result)
result = flatten(dictionary)
print(result)
Explanation: We call flatten() until we reach a value of the dictionary, that is not a dictionary itself (if isinstance(current, dict):). If we reach this value, we append it to our result list. It will work for any number of nested dictionaries.
See: How would I flatten a nested dictionary in Python 3?
I used the same solution, but I've changed the result collection to be a list.
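Since the question also asks for a consistent value order, one further sketch (the dotted-path convention and the helper name are my own) is to flatten against an explicit key list instead of relying on dict iteration order:

```python
data = {
    'duration': 720,
    'language': 'sv',
    'user': {'location': 'Warsaw, Poland', 'name': 'New Europe Film Sales'},
}

# explicit column order; dot-separated paths reach into nested dicts
order = ['duration', 'language', 'user.location', 'user.name']

def get_path(d, path):
    # walk nested dicts following the dot-separated keys
    for key in path.split('.'):
        d = d[key]
    return d

row = [get_path(data, p) for p in order]
print(row)  # [720, 'sv', 'Warsaw, Poland', 'New Europe Film Sales']
```

This way the spreadsheet columns are pinned by the order list rather than by whatever order the API returns the keys in.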
