Searching many strings for many dictionary keys, quickly

I have a unique question, and I am primarily hoping to identify ways to speed up this code a little. I have a set of strings stored in a dataframe, each of which has several names in it, and I know the number of names before this step, like so:
print df
description num_people people
'Harry ran with Sally' 2 []
'Joe was swinging with Sally' 2 []
'Lola Dances alone' 1 []
I am using a dictionary with the keys that I am looking to find in description, like so:
my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Cupid':'1982'}
and then using iterrows to search each string for matches like so:
for index, row in df.iterrows():
    row.people = [key for key in my_dict if re.findall(key, row.description)]
and when run it ends up with
print df
description num_people people
'Harry ran with Sally' 2 ['Harry','Sally']
'Joe was swinging with Sally' 2 ['Joe','Sally']
'Lola Dances alone' 1 ['Lola']
The problem that I see, is that this code is still fairly slow to get the job done, and I have a large number of descriptions and over 1000 keys. Is there a faster way of performing this operation, like maybe using the number of people found?

Faster solution:
#strip the quote characters at the start and end of the text, split into lists of words
splited = df.description.str.strip("'").str.split()
#filter each word list against the dictionary keys
df['people'] = splited.apply(lambda x: [i for i in x if i in my_dict])
print (df)
description num_people people
0 'Harry ran with Sally' 2 [Harry, Sally]
1 'Joe was swinging with Sally' 2 [Joe, Sally]
2 'Lola Dances alone' 1 [Lola]
Timings:
df['people'] = [[], [], []]
df = pd.concat([df]*10000).reset_index(drop=True)
#[30000 rows x 3 columns]
df1 = df.copy()
my_dict = {'Harry':'1283', 'Joe':'1828', 'Sally':'1298', 'Lola':'1982'}

def orig(my_dict, df):
    for index, row in df.iterrows():
        df.at[index, 'people'] = [key for key in my_dict if re.findall(key, row.description)]
    return df

def new(my_dict, df):
    df.description = df.description.str.strip("'")
    splited = df.description.str.split()
    df.people = splited.apply(lambda x: [i for i in x if i in my_dict])
    return df

print (orig(my_dict, df))
print (new(my_dict, df1))

In [198]: %timeit (orig(my_dict, df))
1 loop, best of 3: 3.63 s per loop
In [199]: %timeit (new(my_dict, df1))
10 loops, best of 3: 78.2 ms per loop
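Two further tweaks worth considering (my own suggestions, not part of the original answer): build a set from the dictionary keys once so each membership test is a single hash lookup, and split on non-word characters so a trailing comma or period does not hide a name. A minimal sketch:
import re
import pandas as pd

df = pd.DataFrame({'description': ["Harry ran with Sally.",
                                   "Joe was swinging with Sally",
                                   "Lola Dances alone"]})
my_dict = {'Harry': '1283', 'Joe': '1828', 'Sally': '1298', 'Lola': '1982'}

names = set(my_dict)  # one-time set of keys
#\W+ splits on any run of non-word characters, so punctuation is dropped
df['people'] = [[w for w in re.split(r'\W+', text) if w in names]
                for text in df['description']]
print (df)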

Related

Format Data using pandas groupBy such that it groups by one column

I have data in the below format in a csv:
However, the format I require is below:
I have written the code below, but somehow the groupby is not working for me.
def grouping():
    df = pd.read_csv("final_data_6.csv")
    df['n'] = df.apply(lambda x: (x['data'], x['Period']), axis=1)
    df.groupby(['data','Period'])['n'].apply(list).reset_index()
    df.to_csv("final_data_9.csv", encoding="utf-8", index=False)
Use GroupBy.agg to create dictionaries filled with lists. Note that the grouped result also has to be assigned back to df, which the original code never does:
def grouping():
    df = pd.read_csv("final_data_6.csv")
    df['n'] = [x for x in zip(df['positions'], df['Period'])]
    df = df.groupby('data')['n'].agg(lambda x: {'entities': list(x)}).reset_index(name='entity')
    df.to_csv("final_data_9.csv", encoding="utf-8", index=False)
Sample data test:
print (df)
data positions Period
0 abc 37,41 disease
1 abc 10,16 drugs
2 def 4,14 indication
3 def 78,86 intervention
df['n'] = [x for x in zip(df['positions'], df['Period'])]
print (df)
data positions Period n
0 abc 37,41 disease (37,41, disease)
1 abc 10,16 drugs (10,16, drugs)
2 def 4,14 indication (4,14, indication)
3 def 78,86 intervention (78,86, intervention)
df=df.groupby('data')['n'].agg(lambda x:{'entities':list(x)}).reset_index(name='entity')
print (df)
data entity
0 abc {'entities': [('37,41', 'disease'), ('10,16', ...
1 def {'entities': [('4,14', 'indication'), ('78,86'...
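For reference, here is the whole pipeline as one self-contained snippet on the sample data above (the read_csv/to_csv steps are swapped for an inline frame, which is my assumption for demonstration):
import pandas as pd

#inline stand-in for final_data_6.csv
df = pd.DataFrame({'data': ['abc', 'abc', 'def', 'def'],
                   'positions': ['37,41', '10,16', '4,14', '78,86'],
                   'Period': ['disease', 'drugs', 'indication', 'intervention']})
df['n'] = [x for x in zip(df['positions'], df['Period'])]
df = df.groupby('data')['n'].agg(lambda x: {'entities': list(x)}).reset_index(name='entity')
print (df)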

How do I add a dynamic list of variables to the command pd.concat

I am using python3 and pandas to create a script that will:
Be dynamic across different dataset lengths (rows) and unique values - completed
Take unique values from column A and create separate dataframes as variables for each unique entry - completed
Add totals to the bottom of each dataframe - completed
Concatenate the separate dataframes back together - incomplete
The issue is that I am unable to formulate a way to create a list of the variables in use and pass them as arguments to pd.concat.
The sample dataset: the dataset may have more or fewer unique BrandFlavors, which is why the script must be flexible and dynamic.
Script:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
colarr = df.columns.values
arr = df[colarr[0]].unique()
for i in range(len(arr)):
    globals()['var%s' % i] = df.loc[df[colarr[0]] == arr[i]]

for i in range(len(arr)):
    if globals()['var%s' % i].empty:
        ''
    else:
        globals()['var%s' % i] = globals()['var%s' % i].append({'BrandFlavor':'Total',
            'This':globals()['var%s' % i]['This'].sum(),
            'Last':globals()['var%s' % i]['Last'].sum(),
            'Diff':globals()['var%s' % i]['Diff'].sum(),
            '% Chg':globals()['var%s' % i]['Diff'].sum()/globals()['var%s' % i]['Last'].sum() * 100}, ignore_index=True)
        globals()['var%s' % i]['% Chg'].fillna(0, inplace=True)
        globals()['var%s' % i].fillna(' ', inplace=True)
I have tried this below, however the list is a series of strings:
vararr = []
count = 0
for x in range(len(arr)):
    vararr.append('var' + str(count))
    count = count + 1
df = pd.concat([vararr])
pd.concat does not recognize a string. I tried to build a class with an arg defined but had the same issue.
The desired outcome would be a code snippet that generates a list of the variables matching the ones created by the first globals() loop above, which could be referenced as pd.concat([list, of, vars, here]). It must be dynamic. Thank you
Just fixing the issue at hand: you shouldn't use globals() to make variables; that is not considered good practice. Your code should work with some minor modifications.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
def good_dfs(dataframe):
    if dataframe.empty:
        # an empty group has no totals to add
        return dataframe
    this = dataframe.This.sum()
    last = dataframe.Last.sum()
    diff = dataframe.Diff.sum()
    data = {
        'BrandFlavor': 'Total',
        'This': this,
        'Last': last,
        'Diff': diff,
        'Pct Change': diff / last * 100
    }
    # append returns a new frame rather than modifying in place,
    # so the result has to be assigned back
    dataframe = dataframe.append(data, ignore_index=True)
    dataframe['Pct Change'].fillna(0.0, inplace=True)
    dataframe.fillna(' ', inplace=True)
    return dataframe
colarr = df.columns.values
arr = df[colarr[0]].unique()
dfs = []
for i in range(len(arr)):
    temp = df.loc[df[colarr[0]] == arr[i]]
    dfs.append(temp)
final_dfs = [good_dfs(d) for d in dfs]
final_df = pd.concat(final_dfs)
Although I will say, there are far easier ways to accomplish what you want without doing all of this; see the sketch below, though a full treatment could be a separate question.
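A minimal sketch of one such easier way (assuming the BrandFlavor/This/Last/Diff columns shown above, with a hypothetical inline frame standing in for testfile.xlsx): build each group's Total row while iterating groupby, and concatenate everything once:
import pandas as pd

# hypothetical data standing in for testfile.xlsx
df = pd.DataFrame({'BrandFlavor': ['Cherry', 'Cherry', 'Grape'],
                   'This': [10, 20, 5],
                   'Last': [8, 16, 10],
                   'Diff': [2, 4, -5]})

pieces = []
for brand, grp in df.groupby('BrandFlavor', sort=False):
    # one-row frame holding this group's totals
    total = pd.DataFrame([{'BrandFlavor': 'Total',
                           'This': grp['This'].sum(),
                           'Last': grp['Last'].sum(),
                           'Diff': grp['Diff'].sum(),
                           '% Chg': grp['Diff'].sum() / grp['Last'].sum() * 100}])
    pieces.append(pd.concat([grp, total], ignore_index=True))

final_df = pd.concat(pieces, ignore_index=True)
final_df['% Chg'] = final_df['% Chg'].fillna(0)  # data rows have no % Chg
print(final_df)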

Use lambda, apply, and join function on a pandas dataframe

Goal
Apply deid_notes function to df
Background
I have a df that resembles this sample df
import pandas as pd
df = pd.DataFrame({'Text' : ['there are many different types of crayons',
                             'i like a lot of sports cars',
                             'the middle east has many camels '],
                   'P_ID': [1,2,3],
                   'Word' : ['crayons', 'cars', 'camels'],
                   'P_Name' : ['John', 'Mary', 'Jacob'],
                   'N_ID' : ['A1', 'A2', 'A3']
                   })
#rearrange columns
df = df[['Text','N_ID', 'P_ID', 'P_Name', 'Word']]
df
Text N_ID P_ID P_Name Word
0 many types of crayons A1 1 John crayons
1 i like sports cars A2 2 Mary cars
2 has many camels A3 3 Jacob camels
I use the following function to deidentify certain words within the Text column using NeuroNER http://neuroner.com/
def deid_notes(text):
    #use predict function from neuroNER to tag words to be deidentified
    ner_list = n1.predict(text)
    #n1.predict won't work in this toy example because the neuroNER package needs to be installed (and installation is difficult)
    #but the output resembles this: [{'start': 1, 'end': 11, 'id': 1, 'tagged word': 'crayon'}]
    #use start and end positions of tagged words to deidentify and replace with **BLOCK**
    if len(ner_list) > 0:
        parts_to_take = ([(0, ner_list[0]['start'])]
                         + [(first["end"] + 1, second["start"]) for first, second in zip(ner_list, ner_list[1:])]
                         + [(ner_list[-1]['end'], len(text) - 1)])
        parts = [text[start:end] for start, end in parts_to_take]
        deid = '**BLOCK**'.join(parts)
    #if n1.predict does not identify any words to be deidentified, place NaN
    else:
        deid = 'NaN'
    return pd.Series(deid, index=['Deid'])
Problem
I apply the deid_notes function to my df using the following code
fx = lambda x: deid_notes(x.Text,axis=1)
df.join(df.apply(fx))
But I get the following error
AttributeError: ("'Series' object has no attribute 'Text'", 'occurred at index Text')
Question
How do I get the deid_notes function to work on my df?
Assuming you are returning a pandas Series as output from the deid_notes function, with text as the only input argument: pass the axis=1 argument to apply instead of to deid_notes. For example:
# Dummy function
def deid_notes(text):
    deid = 'prediction to: ' + text
    return pd.Series(deid, index=['Deid'])

fx = lambda x: deid_notes(x.Text)
df.join(df.apply(fx, axis=1))
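Putting the pieces together on a one-row version of the sample frame, the join should behave like this (a sketch using the dummy function, not real NeuroNER output):
import pandas as pd

df = pd.DataFrame({'Text': ['there are many different types of crayons'],
                   'N_ID': ['A1'], 'P_ID': [1], 'P_Name': ['John'], 'Word': ['crayons']})

def deid_notes(text):
    # dummy stand-in mirroring the answer above
    return pd.Series('prediction to: ' + text, index=['Deid'])

# axis=1 passes each row (a Series with a .Text attribute) to the lambda
out = df.join(df.apply(lambda row: deid_notes(row.Text), axis=1))
print(out['Deid'].iloc[0])
# prediction to: there are many different types of crayons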

not produce empty list of lists in pandas

Background
1) I have the following code to create a df
import pandas as pd
word_list = ['crayons', 'cars', 'camels']
l = ['there are many different crayons in the bright blue box',
'i like a lot of sports cars because they go really fast',
'the middle east has many camels to ride and have fun']
df = pd.DataFrame(l, columns=['Text'])
df
Text
0 there are many different crayons in the bright blue box
1 i like a lot of sports cars because they go really fast
2 the middle east has many camels to ride and have fun
2) And I have the following code to create a function
def find_next_words(row, word_list):
    sentence = row[0]
    # trigger words are the elements in the word_list
    trigger_words = []
    next_words = []
    last_words = []
    for keyword in word_list:
        words = sentence.split()
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                #get the 3 words that follow trigger word
                next_words.append(words[index + 1:index + 4])
                #get the 3 words that come before trigger word
                #DOES NOT WORK...PRODUCES EMPTY LIST
                last_words.append(words[index - 1:index - 4])
    return pd.Series([trigger_words, last_words, next_words], index = ['TriggerWords','LastWords', 'NextWords'])
3) This function uses the words in the word_list from above to find the 3 words that come before and after the "trigger_words" in the word_list
4) I then use the following code
df = df.join(df.apply(lambda x: find_next_words(x, word_list), axis=1))
5) And it produce the following df which is close to what I want
Text TriggerWords LastWords NextWords
0 there are many different crayons [crayons] [[]] [[in, the, bright]]
1 i like a lot of sports cars [cars] [[]] [[because, they, go]]
2 the middle east has many camels [camels] [[]] [[to, ride, and]]
Problem
6) However, the LastWords column is an empty list of lists [[]]. I think the problem is this line of code, last_words.append(words[index - 1:index - 4]), from the find_next_words function above.
7) This is a bit confusing to me because the NextWords column uses very similar code, next_words.append(words[index + 1:index + 4]), from the same function, and that works.
Question
8) How do I fix my code so it does not produce the empty list of lists [[]] and instead it gives me the 3 words that come before the words in the word_list?
I think it should be words[max(index - 3, 0):index] in the code. The original slice is empty because its start (index - 1) is greater than its end (index - 4), and the max(..., 0) clamp stops the start from going negative and wrapping around when the trigger word sits near the beginning of the sentence.
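A quick check of that slice on the first sample sentence (my own verification sketch):
words = 'there are many different crayons in the bright blue box'.split()
index = words.index('crayons')           # 4
print(words[index + 1:index + 4])        # ['in', 'the', 'bright']
print(words[max(index - 3, 0):index])    # ['are', 'many', 'different']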

Web scraping multiple pages with python 3?

I got a csv-file with numerous URLs. I read it into a pandas dataframe for convenience. I need to do some statistical work later, and pandas is just handy. It looks a little like this:
import pandas as pd
csv = [{"URLs" : "http://www.mercedes-benz.de", "electric" : 1}, {"URLs" : "http://www.audi.de", "electric" : 0}]
df = pd.DataFrame(csv)
My task is to check whether the websites contain certain strings, and to add an extra column with 1 if so, else 0. For example: I want to check whether www.mercedes-benz.de contains the string car.
import requests
page_content = requests.get("http://www.mercedes-benz.de")
if "car" in page_content.text:
    print ('1')
else:
    print ('0')
How do I iterate/loop through pd.URLs and store the information in the pandas dataframe?
I think you need to loop over the data with DataFrame.iterrows and then create the new values with loc:
for i, row in df.iterrows():
    page_content = requests.get(row['URLs'])
    if "car" in page_content.text:
        df.loc[i, 'car'] = '1'
    else:
        df.loc[i, 'car'] = '0'
print (df)
URLs electric car
0 http://www.mercedes-benz.de 1 1
1 http://www.audi.de 0 1
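If the URL list is long, it may be worth wrapping the request so that one unreachable site does not stop the loop (a sketch; the timeout and exception handling are my additions):
import requests

def contains_car(url):
    # treat unreachable or erroring pages as 'not found'
    try:
        page = requests.get(url, timeout=10)
        return '1' if 'car' in page.text else '0'
    except requests.RequestException:
        return '0'

df['car'] = df['URLs'].apply(contains_car)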
