Python extract unknown string from dataframe column - python-3.x

New to Python (using v3). I have a dataframe column that looks like this:
object
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Training New"}},"objectType":"Activity"}
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Influx"}},"objectType":"Activity"}
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Social"}},"objectType":"Activity"}
{"id":"http://Demo/2.18","definition":{"name":{"en-US":"Personal"}},"objectType":"Activity"}
I need to extract the activity, which starts at a variable position and has variable length, and I don't know in advance what the activities are. All the questions I've found cover extracting a specific string or pattern, not an unknown one. The code below
dataExtract['activity'] = dataExtract['object'].str.find('en-US":"')
gives me the start index, and this
dataExtract['activity'] = dataExtract['object'].str.rfind('"}}')
gives me the end index. So I have tried combining them:
dataExtract['activity'] = dataExtract['object'].str[dataExtract['object'].str.find('en-US":"'):dataExtract['object'].str.rfind('"}}')]
But that just produces NaN, which is clearly wrong. What syntax should I use, or is there a better way to do this? Thanks
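For reference, `Series.str[start:stop]` only accepts scalar integers, not per-row Series of indices, which is why the combined attempt yields NaN. A regex-based sketch using `str.extract` (assuming the value always sits between `en-US":"` and the next double quote) sidesteps the index arithmetic entirely:

```python
import pandas as pd

# sample rows copied from the question
dataExtract = pd.DataFrame({'object': [
    '{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Training New"}},"objectType":"Activity"}',
    '{"id":"http://Demo/2.18","definition":{"name":{"en-US":"Personal"}},"objectType":"Activity"}',
]})

# capture everything between "en-US":" and the next double quote
dataExtract['activity'] = dataExtract['object'].str.extract(r'"en-US":"([^"]*)"', expand=False)
print(dataExtract['activity'].tolist())  # → ['Time Training New', 'Personal']
```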

I suggest converting the values to nested dictionaries and then extracting by the nested keys:
#if necessary, parse the strings into dicts first
#import ast
#dataExtract['object'] = dataExtract['object'].apply(ast.literal_eval)
dataExtract['activity'] = dataExtract['object'].apply(lambda x: x['definition']['name']['en-US'])
print(dataExtract)
object activity
0 {'id': 'http://Demo/1.7', 'definition': {'name... Time Training New
1 {'id': 'http://Demo/1.7', 'definition': {'name... Time Influx
2 {'id': 'http://Demo/1.7', 'definition': {'name... Social
3 {'id': 'http://Demo/2.18', 'definition': {'nam... Personal
Details:
print (dataExtract['object'].apply(lambda x: x['definition']))
0 {'name': {'en-US': 'Time Training New'}}
1 {'name': {'en-US': 'Time Influx'}}
2 {'name': {'en-US': 'Social'}}
3 {'name': {'en-US': 'Personal'}}
Name: object, dtype: object
print (dataExtract['object'].apply(lambda x: x['definition']['name']))
0 {'en-US': 'Time Training New'}
1 {'en-US': 'Time Influx'}
2 {'en-US': 'Social'}
3 {'en-US': 'Personal'}
Name: object, dtype: object
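Since the strings in the question are valid JSON, `json.loads` is an alternative to `ast.literal_eval` for the parsing step; a minimal sketch with two of the sample rows:

```python
import json
import pandas as pd

dataExtract = pd.DataFrame({'object': [
    '{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Training New"}},"objectType":"Activity"}',
    '{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Social"}},"objectType":"Activity"}',
]})

# parse each JSON string into a nested dict, then drill down to the activity name
dataExtract['object'] = dataExtract['object'].apply(json.loads)
dataExtract['activity'] = dataExtract['object'].apply(lambda x: x['definition']['name']['en-US'])
print(dataExtract['activity'].tolist())  # → ['Time Training New', 'Social']
```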

Related

Change a dataframe column value based on the current value?

I have a pandas dataframe with several columns, and one of them contains string values. I need to change these strings to an acceptable value based on the current value. The dataframe is relatively large (40,000 x 32).
I've made a small function that takes the string to be changed as a parameter and looks up what it should be changed to:
df = pd.DataFrame({
    'A': ['Script','Scrpt','MyScript','Sunday','Monday','qwerty'],
    'B': ['Song','Blues','Rock','Classic','Whatever','Something']})

def lut(txt):
    my_lut = {'Script': ['Script', 'Scrpt', 'MyScript'],
              'Weekday': ['Sunday', 'Monday', 'Tuesday']}
    for key, value in my_lut.items():
        if txt in value:
            return key  # a break after return would be unreachable
    return 'Unknown'
The desired output should be:
A B
0 Script Song
1 Script Blues
2 Script Rock
3 Weekday Classic
4 Weekday Whatever
5 Unknown Something
I can't figure out how to apply this to the dataframe.
I've struggled with this for some time now, so any input will be appreciated.
Regards,
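For what it's worth, the posted `lut` function can be applied element-wise with `Series.apply`; a minimal sketch using the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Script', 'Scrpt', 'MyScript', 'Sunday', 'Monday', 'qwerty'],
    'B': ['Song', 'Blues', 'Rock', 'Classic', 'Whatever', 'Something']})

def lut(txt):
    my_lut = {'Script': ['Script', 'Scrpt', 'MyScript'],
              'Weekday': ['Sunday', 'Monday', 'Tuesday']}
    for key, value in my_lut.items():
        if txt in value:
            return key
    return 'Unknown'

# apply the lookup to every value in column A
df['A'] = df['A'].apply(lut)
print(df['A'].tolist())  # → ['Script', 'Script', 'Script', 'Weekday', 'Weekday', 'Unknown']
```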
Check this out:
import pandas as pd

df = pd.DataFrame({
    'A': ['Script','Scrpt','MyScript','Sunday','sdfsd','qwerty'],
    'B': ['Song','Blues','Rock','Classic','Whatever','Something']})

dic = {'Weekday': ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
       'Script': ['Script', 'Scrpt', 'MyScript']}

for k, v in dic.items():
    for item in v:
        df.loc[df.A == item, 'A'] = k
df.loc[~df.A.isin(dic.keys()), 'A'] = "Unknown"
Output: column A becomes ['Script', 'Script', 'Script', 'Weekday', 'Unknown', 'Unknown'].
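An alternative sketch (not the approach above): invert the lookup dict once into a value-to-category map and use `Series.map`, which avoids the nested loops:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Script', 'Scrpt', 'MyScript', 'Sunday', 'sdfsd', 'qwerty'],
    'B': ['Song', 'Blues', 'Rock', 'Classic', 'Whatever', 'Something']})

dic = {'Weekday': ['Sunday', 'Monday', 'Tuesday'],
       'Script': ['Script', 'Scrpt', 'MyScript']}

# invert {category: [values]} into {value: category} for one vectorized map
inverse = {item: k for k, v in dic.items() for item in v}
df['A'] = df['A'].map(inverse).fillna('Unknown')
print(df['A'].tolist())  # → ['Script', 'Script', 'Script', 'Weekday', 'Unknown', 'Unknown']
```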

How do I turn an RDD into a dictionary in pyspark?

So I have an RDD that I need to turn into a dictionary. However, I'm getting a few errors and I'm stuck.
First thing I do is pull in my csv file:
dataset = spark.read.csv('/user/myuser/testing_directory/output_csv', inferSchema = True, header = True)
Then I collect the data into an RDD:
pre_experian_rdd = dataset.collect()
So my data looks like so:
Row(name='BETTING Golf Course', address='1234 main st', city_name='GARDEN HOUSE', state='OH', zipcode=45209)
I need to keep the same structure with the key:value pairs for the entire row because I need to make an API call. So it would need to be: `{name: value, address: value, city_name: value, state: value, zipcode: value}`
But when I do collectAsMap() I get the following error:
dictionary update sequence element #0 has length 5; 2 is required
I need the headers in there to represent the key:value
Can someone provide some insight in what I'm doing wrong, please?
Here is a snippet of my code:
dataset = spark.sparkContext.textFile('/user/myuser/testing_directory/output_csv')
pre_experian_rdd = dataset.collectAsMap()
Error message:
An error was encountered:
dictionary update sequence element #0 has length 36; 2 is required
Traceback (most recent call last):
File "/app/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/pyspark/rdd.py", line 1587, in collectAsMap
return dict(self.collect())
ValueError: dictionary update sequence element #0 has length 36; 2 is required
When I load it as a CSV, I do a few transformations:
dataset = spark.read.csv('/user/myuser/testing_directory/output_csv', inferSchema = True, header = True)
ex_rdd = dataset.collect()
So my rdd looks something like this:
[Row(name='BEAR LAKE GOLF COURSE & RESORT', address='PO BOX 331', city_name='GARDEN CITY', state='UT', zipcode=84028), Row(name='CHRISTENSEN & PETERSON, INC.', address='39 N MAIN ST', city_name='RICHFIELD', state='UT', zipcode=84701), Row(name='ALEXANDERS PRECISION MACHINING', address='15731 CHEMICAL LANE', city_name='HUNTINGTON BEACH', state='CA', zipcode=92649), Row(name='JOSEPH & JANET COLOMBO', address='1003 W COLLEGE', city_name='BOZEMAN', state='MT', zipcode=59715)]
If I do ex_rdd.collectAsMap() I get the following error:
dictionary update sequence element #0 has length 5; 2 is required
To get around this I have to do the following:
df_dict = [row.asDict() for row in dataset.collect()]
[{'name': 'BEAR RESORT', 'address': 'POP 331', 'city_name': 'GARDEN LAKE', 'state': 'UT', 'zipcode': 12345}, {'name': 'CHRISTENSEN INC.', 'address': '12345 MAIN AVE', 'city_name': 'FAIRFIELD', 'state': 'UT', 'zipcode': 12345}, {'name': 'PRECISE MARCHING', 'address': '1234 TESTING LANE', 'city_name': 'HUNTINGTON BEACH', 'state': 'CA', 'zipcode': 92649}]
The issue with this is that it's still a list and I need a dictionary.
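One way to go from that list to a single dictionary (a sketch; it assumes `name` is unique enough to serve as the key, and uses plain dicts in place of the `Row.asDict()` output since Spark is not available here):

```python
# stand-ins for the dicts produced by [row.asDict() for row in dataset.collect()]
df_dict = [
    {'name': 'BEAR RESORT', 'address': 'POP 331', 'city_name': 'GARDEN LAKE',
     'state': 'UT', 'zipcode': 12345},
    {'name': 'CHRISTENSEN INC.', 'address': '12345 MAIN AVE', 'city_name': 'FAIRFIELD',
     'state': 'UT', 'zipcode': 12345},
]

# key each row dict by its 'name' field to get one top-level dictionary
by_name = {row['name']: row for row in df_dict}
print(by_name['BEAR RESORT']['city_name'])  # → GARDEN LAKE
```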

Create one nested object with two objects from dictionary

I'm not sure if the title of my question is the right description to the issue I'm facing.
I'm reading the following table of data from a spreadsheet and passing it as a dataframe:
Name Description Value
foo foobar 5
baz foobaz 4
bar foofoo 8
I need to transform this table of data to json following a specific schema.
I'm trying to get the following output:
{'global': {'Name': 'bar', 'Description': 'foofoo', 'spec': {'Value': '8'}}}
So far I'm able to get the global and spec objects but I'm not sure how I should combine them to get the expected output above.
I wrote this:
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        # 'global' is a reserved word in Python and cannot be a variable
        # name, so use 'global_' instead
        global_ = row.to_dict()
        spec = row.to_dict()
        del global_['Value']
        del spec['Name']
        del spec['Description']
        print("global:", global_)
        print("spec:", spec)
with the following output:
global: {'Name': 'bar', 'Description': 'foofoo'}
spec: {'Value': '8'}
How can I combine these two objects to get to the desired output?
This should give you that output (again using global_, since global is a reserved word):
global_['spec'] = spec
combined = {'global': global_}
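Put together as a runnable sketch with the values from the question (using `global_` because `global` is a reserved word):

```python
global_ = {'Name': 'bar', 'Description': 'foofoo'}
spec = {'Value': '8'}

# nest spec inside the global dict, then wrap it under the 'global' key
global_['spec'] = spec
combined = {'global': global_}
print(combined)
# → {'global': {'Name': 'bar', 'Description': 'foofoo', 'spec': {'Value': '8'}}}
```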
Try this and see if it works faster; slowness is often due to iterrows. I suggest exporting the dataframe to a dictionary first and iterating over that instead.
Name Description Value
0 foo foobar 5
1 baz foobaz 4
2 bar foofoo 8
#export the dataframe to a dictionary, using the 'index' orientation
M = df.to_dict('index')
r = {}
q = []
#iterate through the dictionary items (key, value pairs)
for i, j in M.items():
    #assign the row dict to the 'global' key
    r['global'] = j
    #popitem() works similarly to pop on a list: it takes out the last
    #item and removes it from the parent dictionary; this nests the
    #'Value' entry under a new 'spec' key inside 'global'
    r['global']['spec'] = dict([j.popitem()])
    #wrap in dict() so dictionaries already appended are not overridden;
    #you could use copy or deepcopy to ensure the same state
    q.append(dict(r))
{'global': {'Name': 'foo', 'Description': 'foobar', 'spec': {'Value': 5}}}
{'global': {'Name': 'baz', 'Description': 'foobaz', 'spec': {'Value': 4}}}
{'global': {'Name': 'bar', 'Description': 'foofoo', 'spec': {'Value': 8}}}
dict popitem

Create a dictionary with twitter hashtag counts

I read in a file of tweets I downloaded from a shared drive:
lst = list()
with open('cwctweets.txt', 'r', encoding='utf8') as infile:
    txt = infile.readlines()
Turned it into a list of 10 dictionaries:
import json  # each line needs parsing; dict(line) on a string raises an error
for line in txt:
    dct = json.loads(line)  # assuming each line is a JSON document
    lst.append(dct)
Each dictionary has I think 15 tweets, except the first one, lst[0], which has 100.
What I am trying to do is create a dictionary that contains the hashtags as keys, and the counts of the hashtags as the values.
All the dictionaries (0-9) look like this:
lst[0].keys()
dict_keys(['search_metadata', 'statuses'])
And I'm only focusing on 'statuses':
lst[0]['statuses'][1].keys()
dict_keys(['geo', 'entities', 'in_reply_to_user_id_str', 'favorite_count', 'retweeted', 'id', 'place', 'source', 'text', 'in_reply_to_user_id', 'favorited', 'id_str', 'lang', 'truncated', 'contributors', 'created_at', 'metadata', 'retweet_count', 'in_reply_to_status_id_str', 'coordinates', 'in_reply_to_screen_name', 'user', 'in_reply_to_status_id'])
Here is where I find hashtags:
lst[0]['statuses'][1]['entities'].keys()
dict_keys(['user_mentions', 'hashtags', 'urls', 'symbols'])
So I can do this to print out the hashtags:
for a in lst:
    for b in a['statuses']:
        print(b['entities']['hashtags'])
And my output looks like this:
[{'indices': [47, 56], 'text': 'WorldCup'},
{'indices': [57, 63], 'text': 'CWC15'}, {'indices':
[64, 72], 'text': 'IndvsSA'}]
[{'indices': [107, 113], 'text': 'CWC15'},
{'indices': [114, 122], 'text': 'NZvsENG'},
{'indices': [123, 134], 'text': 'Contenders'}]
...
But when I try this to create a dictionary with hashtags as keys and hashtag counts as values:
dct1 = dict()
for a in lst:
    for b in a['statuses']:
        if b['entities']['hashtags'] not in dct1:
            dct1[b] = 1
        else:
            dct1[b] += 1
This is the error I get:
TypeError Traceback (most recent call last)
<ipython-input-129-cc2e453c6f6d> in <module>()
2 for a in lst:
3 for b in a['statuses']:
----> 4 if b['entities']['hashtags'] not in dct1:
5 dct1[b] = 1
6 else:
TypeError: unhashable type: 'list'
Now I'm not sure why it isn't working if I can just print out the hashtags in a similar manner, any help, please?
The unhashable type error appears when a type such as a list is used as a dictionary key; lists cannot be keys because they are mutable.
The line if b['entities']['hashtags'] not in dct1: checks whether a given key is in the dictionary, so the whole hashtags value is being used as a key here.
Print the value of b['entities']['hashtags']; if it is surrounded by [ and ], it is a list.
From your code it appears that the hashtags key of b['entities'] contains a list of hashtag dicts. You will need to iterate over that list and use each hashtag's text value as the dictionary key (note your code also inserts b itself, with dct1[b] = 1, rather than the hashtag).
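A sketch of the counting itself, using `collections.Counter` and a hypothetical minimal stand-in for the parsed tweet structure from the question:

```python
from collections import Counter

# hypothetical stand-in for lst: one dict with a 'statuses' list of tweets
lst = [{'statuses': [
    {'entities': {'hashtags': [{'indices': [47, 56], 'text': 'WorldCup'},
                               {'indices': [57, 63], 'text': 'CWC15'}]}},
    {'entities': {'hashtags': [{'indices': [107, 113], 'text': 'CWC15'}]}},
]}]

# count each hashtag's text rather than using the list itself as a key
counts = Counter()
for a in lst:
    for b in a['statuses']:
        for tag in b['entities']['hashtags']:
            counts[tag['text']] += 1
print(dict(counts))  # → {'WorldCup': 1, 'CWC15': 2}
```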

Python - unable to count occurences of values in defined ranges in dataframe

I'm trying to write code that analyses the values in a dataframe: if a value falls in a class, the total count of those values is assigned to a key in a dictionary. But the code is not working for me. I'm trying to create logarithmic classes and count the total number of values that fall in each.
def bins(df):
    """Counts the values assigned to each bin"""
    bins_dict = {500: 0, 5000: 0, 50000: 0, 500000: 0}
    # note: iterating a DataFrame yields column names; pass the column
    # itself, e.g. bins(df['Area']), so that i is a numeric value
    for i in df:
        if 100 < i <= 1000:
            bins_dict[500] += 1
        elif 1000 < i <= 10000:
            bins_dict[5000] += 1
    print(bins_dict)
However, this is returning the original dictionary.
I've also tried modifying the dataframe using
def transform(df, range):  # note: this parameter shadows the builtin range
    for i in df:
        for j in range:
            b = 10**j
            while j == 1:
                while i > 100:
                    if i >= b:
                        j += 1
                    elif i < b:
                        b = b / 2
                    i = b * int(i / b)  # print(i = ...) was invalid syntax
                    print(i)
This code is returning the original dataframe.
My dataframe consists of only one column with values ranging between 100 and 10000000
Data Sample:
Area
0 1815
1 907
2 1815
3 907
4 907
Expected output
dict={500:3, 5000:2, 50000:0}
If i can get a dataframe output directly that would be helpful too
PS. I am very new to programming and I only know python
You need to use pandas for it:
import pandas as pd
df = pd.DataFrame()
df['Area'] = [1815, 907, 1815, 907, 907]
# create new column to categorize your data
df['bins'] = pd.cut(df['Area'], [0,1000,10000,100000], labels=['500', '5000', '50000'])
# converting into dictionary
dic = dict(df['bins'].value_counts())
print(dic)
Output:
{'500': 3, '5000': 2, '50000': 0}
