I have data like the following:
NAME ETHNICITY_RECAT TOTAL_LENGTH 3LETTER_SUBSTRINGS
joseph fr 14 jos, ose, sep, eph
ann en 16 ann
anne ir 14 ann, nne
tom en 18 tom
tommy fr 16 tom, omm, mmy
ann ir 19 ann
... more rows
The 3LETTER_SUBSTRINGS values are strings that capture all the 3-letter substrings of the NAME variable. I would like to aggregate them into a single list per group, with each comma-separated item appended as its own list element, as follows:
ETHNICITY_RECAT TOTAL_LENGTH 3LETTER_SUBSTRINGS
min max mean <lambda>
fr 2 26 13.22 [jos, ose, sep, eph, tom, omm, mmy, ...]
en 3 24 11.92 [ann, tom, ...]
ir 4 23 12.03 [ann, nne, ann, ...]
I kind of "did" it using the following code:
aggregations = {
    'TOTAL_LENGTH': [min, max, 'mean'],
    '3LETTER_SUBSTRINGS': lambda x: list(x),
}
self.df_agg = self.df.groupby('ETHNICITY_RECAT', as_index=False).agg(aggregations)
The problem is that the whole string "ann, nne" is treated as a single list item in the final list, instead of each 3-letter substring being its own item, such as "ann" and "nne".
I would like to see the highest-frequency substrings, but right now I get counts of the whole strings (instead of the individual 3-letter substrings) when I run the following code:
from collections import Counter
x = self.df_agg_eth[self.df_agg_eth['ETHNICITY_RECAT']=='en']['3LETTER_SUBSTRINGS']['<lambda>']
x_list = x[0]
c = Counter(x_list)
I get this:
[('jos, ose, sep, eph', 19), ('ann, nee', 5), ...]
Instead of what I want:
[('jos', 19), ('ose', 19), ('sep', 23), ('eph', 19), ('ann', 15), ('nee', 5), ...]
I tried:
'3LETTER_SUBSTRINGS': lambda x: list(i) for i in x.split(', '),
But it says invalid syntax.
The first thing you want to do is convert the string into a list; then it's just a groupby with agg:
df['3LETTER_SUBSTRINGS'] = df['3LETTER_SUBSTRINGS'].str.split(', ')
df.groupby('ETHNICITY_RECAT').agg({'TOTAL_LENGTH': ['min', 'max', 'mean'],
                                   '3LETTER_SUBSTRINGS': 'sum'})
Output:
TOTAL_LENGTH 3LETTER_SUBSTRINGS
min max mean sum
ETHNICITY_RECAT
en 16 18 17.0 [ann, tom]
fr 14 16 15.0 [jos, ose, sep, eph, tom, omm, mmy]
ir 14 19 16.5 [ann, nne, ann]
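Since the end goal is the frequency of the individual substrings, you can feed the flattened lists straight into a Counter. A minimal sketch, assuming the grouped result above is stored in a variable named df_agg:
from collections import Counter

df_agg = df.groupby('ETHNICITY_RECAT').agg({'TOTAL_LENGTH': ['min', 'max', 'mean'],
                                            '3LETTER_SUBSTRINGS': 'sum'})
# the aggregated columns form a MultiIndex, so select the ('3LETTER_SUBSTRINGS', 'sum') cell
c = Counter(df_agg.loc['en', ('3LETTER_SUBSTRINGS', 'sum')])
print(c.most_common())   # e.g. [('ann', 1), ('tom', 1)] on the sample data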
I think most of your code is alright; you just misinterpreted the problem, and it has nothing to do with string conversion. Once each cell of the 3LETTER_SUBSTRING column holds a list/tuple, the lambda x: list(x) aggregation creates a list of those lists, so there is no split(",") to perform and no need to cast to string and back to a table.
Instead, you just need to un-nest (flatten) those lists when you build your aggregated list. Here's a small reproducible example (note that I focused on the tuple/aggregation issue, as I'm sure you will quickly work out the rest of the code):
import pandas as pd
# Create some data
names = [("joseph","fr"),("ann","en"),("anne","ir"),("tom","en"),("tommy","fr"),("ann","fr")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity"])
df["3LETTER_SUBSTRING"] = df["NAMES"].apply(lambda name: [name[i:i+3] for i in range(len(name) - 2)])
print(df)
# Aggregate the 3LETTER per ethnicity, and unnest the result in a new table for each ethnicity:
df.groupby('ethnicity').agg({
    "3LETTER_SUBSTRING": lambda x: [z for y in x for z in y]
})
Using the Counter you specified, I got:
dfg = df.groupby('ethnicity', as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x: [z for y in x for z in y]
})
from collections import Counter
print(Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0]))
# Counter({'ann': 1, 'tom': 1})
To get it as a list of tuples, just use a dict-style method such as c.items() (wrapped in list()), or Counter.most_common().
UPDATE: using the pre-formatted substring strings as in the question:
import pandas as pd
# Create some data
names = [("joseph","fr","jos, ose, sep, eph"),("ann","en","ann"),("anne","ir","ann, nne"),("tom","en","tom"),("tommy","fr","tom, omm, mmy"),("ann","fr","ann")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity","3LETTER_SUBSTRING"])
def transform_3_letter_to_table(x):
    """
    Update this function with regard to your data format
    """
    return x.split(", ")
df["3LETTER_SUBSTRING"] = df["3LETTER_SUBSTRING"].apply(transform_3_letter_to_table)
print(df)
# Applying aggregation
dfg = df.groupby('ethnicity', as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x: [z for y in x for z in y]
})
print(dfg)
# test on some data
from collections import Counter
c = Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0])
print(c)
print(list(c.items()))
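Since you ultimately want the highest-frequency substrings, Counter.most_common() returns them already sorted by count, for example:
# top 5 most frequent 3-letter substrings in the 'en' group
print(c.most_common(5))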
Background
1) I have the following code to create a df
import pandas as pd
word_list = ['crayons', 'cars', 'camels']
l = ['there are many different crayons in the bright blue box',
'i like a lot of sports cars because they go really fast',
'the middle east has many camels to ride and have fun']
df = pd.DataFrame(l, columns=['Text'])
df
Text
0 there are many different crayons in the bright blue box
1 i like a lot of sports cars because they go really fast
2 the middle east has many camels to ride and have fun
2) And I have the following code to create a function
def find_next_words(row, word_list):
    sentence = row[0]
    # trigger words are the elements in the word_list
    trigger_words = []
    next_words = []
    last_words = []
    for keyword in word_list:
        words = sentence.split()
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                # get the 3 words that follow the trigger word
                next_words.append(words[index + 1:index + 4])
                # get the 3 words that come before the trigger word
                # DOES NOT WORK...PRODUCES EMPTY LIST
                last_words.append(words[index - 1:index - 4])
    return pd.Series([trigger_words, last_words, next_words], index=['TriggerWords', 'LastWords', 'NextWords'])
3) This function uses the words in word_list to find the 3 words that come before and after each "trigger word" in the sentence
4) I then use the following code
df = df.join(df.apply(lambda x: find_next_words(x, word_list), axis=1))
5) And it produces the following df, which is close to what I want
Text TriggerWords LastWords NextWords
0 there are many different crayons [crayons] [[]] [[in, the, bright]]
1 i like a lot of sports cars [cars] [[]] [[because, they, go]]
2 the middle east has many camels [camels] [[]] [[to, ride, and]]
Problem
6) However, the LastWords column is an empty list of lists [[]]. I think the problem is this line of code, last_words.append(words[index - 1:index - 4]), taken from the find_next_words function above.
7) This is a bit confusing to me because the NextWords column uses very similar code, next_words.append(words[index + 1:index + 4]), from the same function, and it works.
Question
8) How do I fix my code so it does not produce the empty list of lists [[]] and instead gives me the 3 words that come before each word in the word_list?
I think it should be words[max(index - 3, 0):index] in the code. The original slice words[index - 1:index - 4] is always empty because its stop comes before its start, and clamping the start with max() keeps a negative start index from referring to the end of the list when the trigger word appears within the first few words.
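As a quick check (a minimal sketch, separate from the original function), with the first sentence the trigger word 'crayons' sits at index 4 and the corrected slice picks up the three words immediately before it:
words = 'there are many different crayons in the bright blue box'.split()
index = words.index('crayons')                 # index == 4
next_words = words[index + 1:index + 4]        # ['in', 'the', 'bright']
last_words = words[max(index - 3, 0):index]    # ['are', 'many', 'different']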
I'm doing a bunch of operations on pandas DataFrames, for example finding the max, min and average of columns and returning the result in a new column. Now I'm trying to wrap these operations into a function and use max() and/or min() as arguments to that function.
Below is a snippet that describes what I'm trying to do in a very simplified way; in its current state it produces the desired output, but it does not have the desired functionality and flexibility.
The setup:
import pandas as pd

# Sample dataframe
df = pd.DataFrame({'col_A': [1, 20, 6, 1, 3]})

def findValue(function, df, colname):
    print(function)  # just a placeholder
    df[colname] = df.max()[0]
    return df
df2 = findValue(function='max', df=df, colname='col_B')
print(df)
Output 1:
col_A col_B
0 1 20
1 20 20
2 6 20
3 1 20
4 3 20
A naive attempt:
# Sample dataframe
df = pd.DataFrame({'col_A':[1,20,6,1,3]})
# The function I would like to use in another function is max()
# My function
def findValue(function, df, colname):
    df[colname] = df.function()[0]
    return df
df2 = findValue(function=max(), df=df , colname='col_B')
print(df)
Output 2:
Traceback (most recent call last):
File "<ipython-input-7-85964ff29e69>", line 1, in <module>
df2 = findValue(function=max(), df=df , colname='col_B')
TypeError: max expected 1 arguments, got 0
How can I change the above snippet so that I can change function = max() to function = min() or any other function in the arguments of findValue()? Or even define a list of functions to be used in a similar manner?
Thank you for any suggestions!
You are very, very close. You pretty much just need to remove the parentheses when passing in the function, so that you pass the function object itself instead of calling it. Here's a simplified example that loops over a tuple of functions and appears to do what you want:
def findValue(func, x, y):
    return func(x, y)

for calc in (max, min):
    result = findValue(func=calc, x=1, y=10)
    print(result)
Output:
10
1
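Applied to the DataFrame from the question, the same idea looks roughly like this (a minimal sketch, assuming the goal is to fill the new column with the single min/max of col_A, as in the first snippet):
import pandas as pd

df = pd.DataFrame({'col_A': [1, 20, 6, 1, 3]})

def findValue(function, df, colname):
    # 'function' is any callable that takes an iterable, e.g. max, min or sum
    df[colname] = function(df['col_A'])
    return df

df = findValue(function=max, df=df, colname='col_B')   # col_B is 20 in every row
df = findValue(function=min, df=df, colname='col_C')   # col_C is 1 in every row
print(df)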
I have a CSV file with numerous URLs, which I read into a pandas DataFrame for convenience (I need to do some statistical work later, and pandas is just handy). It looks a little like this:
import pandas as pd
csv = [{"URLs" : "www.mercedes-benz.de", "electric" : 1}, {"URLs" : "www.audi.de", "electric" : 0}]
df = pd.DataFrame(csv)
My task is to check whether the websites contain certain strings, and to add an extra column with 1 if so and 0 otherwise. For example, I want to check whether www.mercedes-benz.de contains the string car.
import requests
page_content = requests.get("www.mercedes-benz.de")
if "car" in page_content.text:
    print('1')
else:
    print('0')
How do I iterate/loop through df['URLs'] and store the information in the pandas DataFrame?
I think you need to loop over the data with DataFrame.iterrows and then create the new values with loc:
for i, row in df.iterrows():
    page_content = requests.get(row['URLs'])
    if "car" in page_content.text:
        df.loc[i, 'car'] = '1'
    else:
        df.loc[i, 'car'] = '0'
print (df)
URLs electric car
0 http://www.mercedes-benz.de 1 1
1 http://www.audi.de 0 1
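An alternative sketch that avoids the explicit loop; note that requests needs the full http:// form of the URL (as in the output above), and the helper name contains_car is made up for illustration:
import requests

def contains_car(url):
    # prepend the scheme if the CSV stores bare hostnames like "www.audi.de"
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    try:
        return '1' if "car" in requests.get(url, timeout=10).text else '0'
    except requests.RequestException:
        return '0'   # treat unreachable pages as not containing the string

df['car'] = df['URLs'].apply(contains_car)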
I have a unique question, and I am primarily hoping to identify ways to speed up this code a little. I have a set of strings stored in a dataframe, each of which has several names in it and I know the number of names before this step, like so:
print df
description num_people people
'Harry ran with sally' 2 []
'Joe was swinging with sally' 2 []
'Lola Dances alone' 1 []
I am using a dictionary with the keys that I am looking to find in description, like so:
my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Cupid':'1982'}
and then using iterrows to search each string for matches like so:
import re

for index, row in df.iterrows():
    row.people = [key for key in my_dict if re.findall(key, row.description)]
and when run it ends up with
print df
description num_people people
'Harry ran with sally' 2 ['Harry','Sally']
'Joe was swinging with sally' 2 ['Joe','Sally']
'Lola Dances alone' 1 ['Lola']
The problem I see is that this code is still fairly slow, and I have a large number of descriptions and over 1000 keys. Is there a faster way of performing this operation, perhaps one that makes use of the number of people found?
Faster solution:
# strip the quotes at the start and end of the text, then split each description into a list of words
splited = df.description.str.strip("'").str.split()
# filtering: keep only the words that are keys of my_dict
df['people'] = splited.apply(lambda x: [i for i in x if i in my_dict.keys()])
print (df)
description num_people people
0 'Harry ran with Sally' 2 [Harry, Sally]
1 'Joe was swinging with Sally' 2 [Joe, Sally]
2 'Lola Dances alone' 1 [Lola]
Timings:
#[30000 rows x 3 columns]
In [198]: %timeit (orig(my_dict, df))
1 loop, best of 3: 3.63 s per loop
In [199]: %timeit (new(my_dict, df1))
10 loops, best of 3: 78.2 ms per loop
df['people'] = [[],[],[]]
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Lola':'1982'}
def orig(my_dict, df):
    for index, row in df.iterrows():
        df.at[index, 'people'] = [key for key in my_dict if re.findall(key, row.description)]
    return df

def new(my_dict, df):
    df.description = df.description.str.strip("'")
    splited = df.description.str.split()
    df.people = splited.apply(lambda x: [i for i in x if i in my_dict.keys()])
    return df
print (orig(my_dict, df))
print (new(my_dict, df1))
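If the names can also appear as arbitrary substrings rather than whole space-separated words (which is what re.findall in the original question allowed), another option worth trying is a single alternation pattern over all keys with Series.str.findall; a rough sketch, not benchmarked here:
import re

# one pattern matching any of the dictionary keys
pattern = '|'.join(re.escape(key) for key in my_dict)
# flags=re.IGNORECASE is optional, in case the descriptions are lower-cased
df['people'] = df.description.str.findall(pattern, flags=re.IGNORECASE)
Note that with IGNORECASE the matched strings keep the casing found in the text, so you may need to normalize them before looking them up in my_dict.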