How do I turn an RDD into a dictionary in pyspark? - apache-spark

So I have an RDD that I need to turn into a dictionary. However, I'm getting a few errors and I'm stuck.
First thing I do is pull in my csv file:
dataset = spark.read.csv('/user/myuser/testing_directory/output_csv', inferSchema = True, header = True)
Then I collect the data into an RDD:
pre_experian_rdd = dataset.collect()
So my data looks like so:
Row(name='BETTING Golf Course', address='1234 main st', city_name='GARDEN HOUSE', state='OH', zipcode=45209)
I need to keep the same key:value structure for the entire row because I need to make an API call. So it would need to be: `{name: value, address: value, city_name: value, state: value, zipcode: value}`
But when I do collectAsMap() I get the following error:
dictionary update sequence element #0 has length 5; 2 is required
I need the headers in there to represent the key:value
Can someone provide some insight in what I'm doing wrong, please?
Here is a snippet of my code:
dataset = spark.sparkContext.textFile('/user/myuser/testing_directory/output_csv')
pre_experian_rdd = dataset.collectAsMap()
Error message:
An error was encountered:
dictionary update sequence element #0 has length 36; 2 is required
Traceback (most recent call last):
File "/app/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/pyspark/rdd.py", line 1587, in collectAsMap
return dict(self.collect())
ValueError: dictionary update sequence element #0 has length 36; 2 is required
When I load it as a CSV, I do a few transformations:
dataset = spark.read.csv('/user/myuser/testing_directory/output_csv', inferSchema = True, header = True)
ex_rdd = dataset.collect()
So my rdd looks something like this:
[Row(name='BEAR LAKE GOLF COURSE & RESORT', address='PO BOX 331', city_name='GARDEN CITY', state='UT', zipcode=84028), Row(name='CHRISTENSEN & PETERSON, INC.', address='39 N MAIN ST', city_name='RICHFIELD', state='UT', zipcode=84701), Row(name='ALEXANDERS PRECISION MACHINING', address='15731 CHEMICAL LANE', city_name='HUNTINGTON BEACH', state='CA', zipcode=92649), Row(name='JOSEPH & JANET COLOMBO', address='1003 W COLLEGE', city_name='BOZEMAN', state='MT', zipcode=59715)]
If I do ex_rdd.collectAsMap() I get the following error:
dictionary update sequence element #0 has length 5; 2 is required
To get around this I have to do the following:
df_dict = [row.asDict() for row in dataset.collect()]
[{'name': 'BEAR RESORT', 'address': 'POP 331', 'city_name': 'GARDEN LAKE', 'state': 'UT', 'zipcode': 12345}, {'name': 'CHRISTENSEN INC.', 'address': '12345 MAIN AVE', 'city_name': 'FAIRFIELD', 'state': 'UT', 'zipcode': 12345}, {'name': 'PRECISE MARCHING', 'address': '1234 TESTING LANE', 'city_name': 'HUNTINGTON BEACH', 'state': 'CA', 'zipcode': 92649}]
The issue with this is that it's still a list and I need a dictionary.
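For context, `RDD.collectAsMap` only works on an RDD of two-element (key, value) tuples, which is why a five-field Row produces "has length 5; 2 is required". `asDict()` already gives one dict per row; to collapse the list into a single dict, a key column has to be chosen. A minimal sketch with plain dicts standing in for the collected Rows (sample values taken from the question, and `name` assumed unique):

```python
# Plain dicts standing in for Row.asDict() results
rows = [
    {'name': 'BEAR LAKE GOLF COURSE', 'address': 'PO BOX 331',
     'city_name': 'GARDEN CITY', 'state': 'UT', 'zipcode': 84028},
    {'name': 'CHRISTENSEN & PETERSON, INC.', 'address': '39 N MAIN ST',
     'city_name': 'RICHFIELD', 'state': 'UT', 'zipcode': 84701},
]

# Collapse the list into one dict, keyed on a column assumed unique ('name')
by_name = {row['name']: row for row in rows}

# Each value is still a full key:value mapping, ready for an API payload
payload = by_name['BEAR LAKE GOLF COURSE']
```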

Related

Auto-extracting columns from nested dictionaries in pandas

So I have multiple nested dictionaries in a jsonl file column, as below:
`df['referenced_tweets'][0]`
producing (shortened output)
'id': '1392893055112400898',
'public_metrics': {'retweet_count': 0,
'reply_count': 1,
'like_count': 2,
'quote_count': 0},
'conversation_id': '1392893055112400898',
'created_at': '2021-05-13T17:22:37.000Z',
'reply_settings': 'everyone',
'entities': {'annotations': [{'start': 65,
'end': 77,
'probability': 0.9719000000000001,
'type': 'Person',
'normalized_text': 'Jill McMillan'}],
'mentions': [{'start': 23,
'end': 36,
'username': 'usasklibrary',
'protected': False,
'description': 'The official account of the University Library at USask.',
'created_at': '2019-06-04T17:19:12.000Z',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'url': '*removed*',
'expanded_url': 'http://library.usask.ca',
'display_url': 'library.usask.ca'}]}},
'name': 'University Library',
'url': '....',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1278828446026629120/G1w7t-HK_normal.jpg',
'verified': False,
'id': '1135959197902921728',
'public_metrics': {'followers_count': 365,
'following_count': 119,
'tweet_count': 556,
'listed_count': 9}}]},
'text': 'Wonderful session with #usasklibrary Graduate Writing Specialist Jill McMillan who is walking SURE students through the process of organizing/analyzing a literature review! So grateful to the library -- our largest SURE: Student Undergraduate Research Experience partner!',
...
My intention is to create a function that would auto extract specific columns (e.g. text,type) in the entire dataframe (not just a row). So I wrote the function:
### x = df['referenced_tweets']
def extract_TextType(x):
    dic = {}
    for i in x:
        if i != " ":
            new_df = pd.DataFrame.from_dict(i)
            dic['refd_text'] = new_df['text']
            dic['refd_type'] = new_df['type']
        else:
            print('none')
    return dic
However running the function:
df['referenced_tweets'].apply(extract_TextType)
produces an error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
The whole point is to extract these two nested columns (texts & type) from the original "referenced tweets" column and match them to the original rows.
What am I doing wrong, please?
P.S.
A screenshot of the original df was attached.
A couple of things to consider here. referenced_tweets holds a list, so the line new_df = pd.DataFrame.from_dict(i) is most likely not parsing it the way you expect.
Also, because there may be multiple tweets in that list, you are correctly iterating over it, but you don't need to put it into a df to do so. Since you are using .apply(), this will also create a new dictionary in each cell; if that's what you want, that is OK. If you really just want a new dataframe, you can adapt the following. I don't have access to referenced_tweets, so I'm using entities as an example.
Here's my example:
ents = df[df.entities.notnull()]['entities']
dict_hold_list = []
for ent in ents:
    # print(ent['hashtags'])
    for htag in ent['hashtags']:
        # print(htag['text'])
        # print(htag['indices'])
        dict_hold_list.append({'text': htag['text'], 'indices': htag['indices']})
df_hashtags = pd.DataFrame(dict_hold_list)
Because you have not provided a working json or dataframe, I can't test this, but your solution could look like this:
refs = df[df.referenced_tweets.notnull()]['referenced_tweets']
dict_hold_list = []
for ref in refs:
    # print(ref)
    for r in ref:
        # print(r['text'])
        # print(r['type'])
        dict_hold_list.append({'text': r['text'], 'type': r['type']})
df_ref_tweets = pd.DataFrame(dict_hold_list)

Create one nested object with two objects from dictionary

I'm not sure if the title of my question is the right description to the issue I'm facing.
I'm reading the following table of data from a spreadsheet and passing it as a dataframe:
Name Description Value
foo foobar 5
baz foobaz 4
bar foofoo 8
I need to transform this table of data to json following a specific schema.
I'm trying to get the following output:
{'global': {'Name': 'bar', 'Description': 'foofoo', 'spec': {'Value': '8'}}
So far I'm able to get the global and spec objects but I'm not sure how I should combine them to get the expected output above.
I wrote this:
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        # 'global' is a reserved keyword in Python, so a different name is used
        global_ = row.to_dict()
        spec = row.to_dict()
        del global_['Value']
        del spec['Name']
        del spec['Description']
        print("global:", global_)
        print("spec:", spec)
with the following output:
global: {'Name': 'bar', 'Description': 'foofoo'}
spec: {'Value': '8'}
How can I combine these two objects to get to the desired output?
This should give you that output (note that `global` is a reserved word in Python, so the variable is renamed here):
global_['spec'] = spec
combined = {'global': global_}
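A self-contained sketch of that combination step, using the sample row from the question (`global_` is used instead of `global`, which is a reserved word in Python):

```python
# Sample row from the question, already split into the two dicts
global_ = {'Name': 'bar', 'Description': 'foofoo'}
spec = {'Value': '8'}

# Nest spec inside, then wrap everything under the 'global' key
global_['spec'] = spec
combined = {'global': global_}
```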
Try this and see if it works faster; the slow speed might be due to iterrows. I suggest moving the iteration to a dictionary after exporting it from the dataframe.
Name Description Value
0 foo foobar 5
1 baz foobaz 4
2 bar foofoo 8
# Export the dataframe to a dictionary, using the 'index' option
M = df.to_dict('index')
r = {}
q = []
# Iterate through the dictionary items (key, value pairs)
for i, j in M.items():
    # Assign the value to the key 'global'
    r['global'] = j
    # popitem() takes out the last item and removes it from the
    # parent dictionary; this nests the 'Value' pair under a new
    # 'spec' key inside 'global'
    r['global']['spec'] = dict([j.popitem()])
    # Wrapping in dict() ensures dictionaries already appended are
    # not overridden; copy.deepcopy would be safer still
    q.append(dict(r))
{'global': {'Name': 'foo', 'Description': 'foobar', 'spec': {'Value': 5}}}
{'global': {'Name': 'baz', 'Description': 'foobaz', 'spec': {'Value': 4}}}
{'global': {'Name': 'bar', 'Description': 'foofoo', 'spec': {'Value': 8}}}
See: dict.popitem

Python extract unknown string from dataframe column

New to python - using v3. I have a dataframe column that looks like
object
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Training New"}},"objectType":"Activity"}
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Influx"}},"objectType":"Activity"}
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Social"}},"objectType":"Activity"}
{"id":"http://Demo/2.18","definition":{"name":{"en-US":"Personal"}},"objectType":"Activity"}
I need to extract the activity, which starts in a variable place and is of variable length. I do not know what the activities are. All the questions I've found are to extract a specific string or pattern, not an unknown one. If I use the code below
dataExtract['activity'] = dataExtract['object'].str.find('en-US":"')
Will give me the start index and this
dataExtract['activity'] = dataExtract['object'].str.rfind('"}}')
Will give me the end index. So I have tried combining these
dataExtract['activity'] = dataExtract['object'].str[dataExtract['object'].str.find('en-US":"'):dataExtract['object'].str.rfind('"}}')]
But that just generates "NaN", which is clearly wrong. What syntax should I use, or is there a better way to do it? Thanks
I suggest converting the values to nested dictionaries and then extracting by nested keys:
#if necessary
#import ast
#dataExtract['object'] = dataExtract['object'].apply(ast.literal_eval)
dataExtract['activity'] = dataExtract['object'].apply(lambda x: x['definition']['name']['en-US'])
print (dataExtract)
object activity
0 {'id': 'http://Demo/1.7', 'definition': {'name... Time Training New
1 {'id': 'http://Demo/1.7', 'definition': {'name... Time Influx
2 {'id': 'http://Demo/1.7', 'definition': {'name... Social
3 {'id': 'http://Demo/2.18', 'definition': {'nam... Personal
Details:
print (dataExtract['object'].apply(lambda x: x['definition']))
0 {'name': {'en-US': 'Time Training New'}}
1 {'name': {'en-US': 'Time Influx'}}
2 {'name': {'en-US': 'Social'}}
3 {'name': {'en-US': 'Personal'}}
Name: object, dtype: object
print (dataExtract['object'].apply(lambda x: x['definition']['name']))
0 {'en-US': 'Time Training New'}
1 {'en-US': 'Time Influx'}
2 {'en-US': 'Social'}
3 {'en-US': 'Personal'}
Name: object, dtype: object

Create a dictionary with twitter hashtag counts

I read in a file of tweets I downloaded from a shared drive:
lst = list()
with open('cwctweets.txt', 'r', encoding='utf8') as infile:
    txt = infile.readlines()
Turned it into a list of 10 dictionaries:
import json

for line in txt:
    # each line is a JSON string; dict() cannot parse a string, json.loads can
    dct = json.loads(line)
    lst.append(dct)
Each dictionary has I think 15 tweets, except the first one, lst[0], which has 100.
What I am trying to do is create a dictionary that contains the hashtags as keys, and the counts of the hashtags as the values.
All the dictionaries (0-9) look like this:
lst[0].keys()
dict_keys(['search_metadata', 'statuses'])
And I'm only focusing on 'statuses':
lst[0]['statuses'][1].keys()
dict_keys(['geo', 'entities', 'in_reply_to_user_id_str', 'favorite_count', 'retweeted', 'id', 'place', 'source', 'text', 'in_reply_to_user_id', 'favorited', 'id_str', 'lang', 'truncated', 'contributors', 'created_at', 'metadata', 'retweet_count', 'in_reply_to_status_id_str', 'coordinates', 'in_reply_to_screen_name', 'user', 'in_reply_to_status_id'])
Here is where I find hashtags:
lst[0]['statuses'][1]['entities'].keys()
dict_keys(['user_mentions', 'hashtags', 'urls', 'symbols'])
So I can do this to print out the hashtags:
for a in lst:
    for b in a['statuses']:
        print(b['entities']['hashtags'])
And my output looks like this:
[{'indices': [47, 56], 'text': 'WorldCup'},
 {'indices': [57, 63], 'text': 'CWC15'},
 {'indices': [64, 72], 'text': 'IndvsSA'}]
[{'indices': [107, 113], 'text': 'CWC15'},
 {'indices': [114, 122], 'text': 'NZvsENG'},
 {'indices': [123, 134], 'text': 'Contenders'}]
...
But when I try this to create a dictionary with hashtags as keys and hashtag counts as values:
dct1 = dict()
for a in lst:
    for b in a['statuses']:
        if b['entities']['hashtags'] not in dct1:
            dct1[b] = 1
        else:
            dct1[b] += 1
This is the error I get:
TypeError Traceback (most recent call last)
<ipython-input-129-cc2e453c6f6d> in <module>()
2 for a in lst:
3 for b in a['statuses']:
----> 4 if b['entities']['hashtags'] not in dct1:
5 dct1[b] = 1
6 else:
TypeError: unhashable type: 'list'
Now I'm not sure why it isn't working, since I can print out the hashtags in a similar manner. Any help, please?
The unhashable type error appears when an unhashable type, such as a list, is used as a dictionary key; lists cannot be keys.
The line if b['entities']['hashtags'] not in dct1: checks whether a given key is in a dictionary, and here that "key" is a list.
Print the value of b['entities']['hashtags']; if it is surrounded by [ and ], it is a list.
From your output above, the hashtags key of b['entities'] contains a list of hashtag dicts. You will need to iterate over that list and use each hashtag's text value, which is a hashable string, as the dictionary key.
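A minimal sketch of that approach, counting the 'text' field of each hashtag with collections.Counter (the sample data is made up to mirror the structure shown in the question):

```python
from collections import Counter

# Made-up sample mirroring the structure in the question
lst = [
    {'statuses': [
        {'entities': {'hashtags': [
            {'indices': [47, 56], 'text': 'WorldCup'},
            {'indices': [57, 63], 'text': 'CWC15'},
        ]}},
        {'entities': {'hashtags': [
            {'indices': [107, 113], 'text': 'CWC15'},
        ]}},
    ]},
]

counts = Counter()
for a in lst:
    for b in a['statuses']:
        for tag in b['entities']['hashtags']:
            # the hashtag's text (a string) is hashable, so it can be a key
            counts[tag['text']] += 1
```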

flatten nested dict values in a list in Python 3

I've got this data structure coming from Vimeo API
{'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
I want to transform it into
[720, "sv", "http..", "incident..", "http..", "Warsaw", "New Europe.."]
to load it into a Google spreadsheet. I also need to maintain a consistent value order.
P.S. I see similar questions, but the answers are not in Python 3.
Thanks
I'm going to use the csv module to create a CSV file like you've described out of your data.
First, we should use a header row for your file, so the order doesn't matter, only dict keys do:
import csv

# This defines the order they'll show up in the final file
fieldnames = [
    'name', 'link', 'duration', 'language',
    'user_name', 'user_link', 'user_location',
]

# Open the file with Python
with open('my_file.csv', 'w', newline='') as my_file:
    # Attach a CSV writer to the file with the desired fieldnames
    writer = csv.DictWriter(my_file, fieldnames)
    # Write the header row
    writer.writeheader()
Notice the DictWriter: this allows us to write dicts based on their keys instead of their order (dicts are unordered pre-3.6). The above code will end up with a file like this:
name,link,duration,language,user_name,user_link,user_location
Which we can then add rows to, but let's convert your data first, so the keys match the above field names:
data = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
for key, value in data['user'].items():
    data['user_{}'.format(key)] = value
del data['user']
This ends up with the data dictionary like this:
data = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user_link': 'https://vimeo.com/neweuropefilmsales',
'user_location': 'Warsaw, Poland',
'user_name': 'New Europe Film Sales',
}
We can now simply insert this as a whole row to the CSV writer, and everything else is done automatically:
# Using the same writer from above, insert the data from above
writer.writerow(data)
That's it, now just import this into your Google spreadsheets :)
This is a simple solution using recursion:
dictionary = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
def flatten(current, result=None):
    # avoid a mutable default argument, which would persist across calls
    if result is None:
        result = []
    if isinstance(current, dict):
        for key in current:
            flatten(current[key], result)
    else:
        result.append(current)
    return result
result = flatten(dictionary)
print(result)
Explanation: We call flatten() until we reach a value of the dictionary, that is not a dictionary itself (if isinstance(current, dict):). If we reach this value, we append it to our result list. It will work for any number of nested dictionaries.
See: How would I flatten a nested dictionary in Python 3?
I used the same solution, but I've changed the result collection to be a list.
