Remove custom stop words from pandas dataframe not working - python-3.x

I am trying to remove a custom list of stop words, but it's not working.
desc = pd.DataFrame(description, columns =['description'])
print(desc)
Which gives the following results:
description
188693 The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff
11443 According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa...
... ...
2732 The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit...
[9875 rows x 1 columns]
I found the following code here, but it doesn't seem to work:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
desc.assign(new_desc=desc.replace(dict(string={pat: ''}), regex=True))
Which produces the following results:
description new_desc
188693 The Kentucky Cannabis Company and Bluegrass He... The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff Ohio County Sheriff
11443 According to new reports from federal authorit... According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou... KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa... The crew of Insight, WCNY's weekly public affa...
... ... ...
2732 The Arkansas Supreme Court on Thursday cleared... The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ... Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres... Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan... Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit... SAN TAN VALLEY — Two men have been charged wit...
9875 rows × 2 columns
As you can see, the stop words weren't removed. Any help you can provide would be greatly appreciated.

Handle the case difference and simplify the pattern:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join(remove_words)
desc['new_desc'] = desc.description.str.lower().replace(pat,'', regex=True)
description new_desc
0 The Kentucky Cannabis Company and Bluegrass He... the kentucky company and bluegrass he...
1 Ohio County Sheriff ohio county sheriff
2 According to new reports from federal authorit... according to new reports from federal authorit...
3 KANSAS CITY, Mo. (AP)The Chiefs will be mariju... kansas city, mo. (ap)the chiefs will be witho...
4 The crew of Insight, WCNY's weekly public affa... the crew of insight, wcny's weekly public affa...
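Note that this pattern drops the \b word boundaries and lowercases the whole text. If you would rather keep the original casing and only remove whole words, a variant (a sketch, untested against this exact frame) is a case-insensitive replace with the boundaries kept:
import re
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
# \b keeps e.g. "thc" inside a longer word from being removed;
# re.IGNORECASE matches "Cannabis" without lowercasing the text itself
pat = r'\b(?:' + '|'.join(map(re.escape, remove_words)) + r')\b'
desc['new_desc'] = desc['description'].str.replace(pat, '', regex=True, flags=re.IGNORECASE)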

Related

VBA for merging range of cells with reference to a cell

I want to merge ranges of cells based on a unique cell value, and I need VBA for this.
Sample Data
Name            Phone No.  Car Company  Car Model  Car Expiry  DL No.
Charlie Andrew  98765                                          CA123
Charlie Andrew  98765      Mercedes     D201
Charlie Andrew                          D201       Jun-50
Charlie Andrew  98765      Volkswagon                          CA123
Charlie Andrew             Volkswagon   POLO
Charlie Andrew                          POLO       MAR-25
Charlie Andrew  98765                              Jun-50
Charlie Andrew  12345      BMW          520D
Charlie Andrew                          520D       MAY-40      CA456
Stephen Logan   556644     GM MOTORS
Stephen Logan              GM MOTORS    2255H
Stephen Logan                           2255H      APR-30
Stephen Logan   556644                                         SL987
Desired Result
Name            Phone No.  Car Company           Car Model   Car Expiry      DL No.
Charlie Andrew  987654     Mercedes; Volkswagon  D201; POLO  Jun-50; mar-25  CA123
Charlie Andrew  12345      BMW                   520D        MAY-40          CA456
Stephen Logan   556644     GM MOTORS             2255H       APRIL-30        SL987
Please note that DL No. should not be merged, as it is a unique value.
Thanks in advance.
I tried various VBA macros but didn't get the desired result.
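Although the question asks for VBA, the intended group-and-concatenate logic may be easier to see as a pandas sketch (illustrative only; `df` stands for the sample above, and it assumes the grouping keys Name and DL No. are filled on every row, which the sparse sample would first need):
import pandas as pd

def join_unique(values):
    # join distinct non-blank values with "; ", keeping first-seen order
    vals = [str(v) for v in values if pd.notna(v) and str(v).strip() != '']
    return '; '.join(dict.fromkeys(vals))

# group on Name and DL No. (DL No. stays unmerged, as requested) and
# concatenate the remaining columns
merged = (df.groupby(['Name', 'DL No.'], as_index=False)
            .agg({'Phone No.': join_unique,
                  'Car Company': join_unique,
                  'Car Model': join_unique,
                  'Car Expiry': join_unique}))
print(merged)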

Check the amount of time between different rows of data (time and date) for each employee

I have a df with these columns: ['Name', 'Department', 'Date', 'Time', 'Activity'],
so, for example, it looks like this:
Acosta, Hirto 225 West 28th Street 9/18/2019 07:25:00 Punch In
Acosta, Hirto 225 West 28th Street 9/18/2019 11:57:00 Punch Out
Acosta, Hirto 225 West 28th Street 9/18/2019 12:28:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 06:57:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 12:00:00 Punch Out
Adams, Juan 225 West 28th Street 9/16/2019 12:28:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 15:30:00 Punch Out
Adams, Juan 225 West 28th Street 9/18/2019 07:04:00 Punch In
Adams, Juan 225 West 28th Street 9/18/2019 11:57:00 Punch Out
I need to calculate the time between the punch in and the punch out on the same day for the same employee.
So far I have only managed to clean the data, like:
self.raw_data['Time'] = pd.to_datetime(self.raw_data['Time'], format='%H:%M').dt.time
sorted_db = self.raw_data.sort_values(['Name', 'Date'])
sorted_db = sorted_db[['Name', 'Department', 'Date', 'Time', 'Activity']]
Any suggestions will be appreciated.
So I found the answer to my problem and wanted to share it.
First I separate the "Punch In" and the "Punch Out" into two columns:
def process_info(self):
    # filter and organize the data
    self.raw_data['in'] = self.raw_data[self.raw_data['Activity'].str.contains('In')]['Time']
    self.raw_data['pre_out'] = self.raw_data[self.raw_data['Activity'].str.contains('Out')]['Time']
Then I sort the information by date and name:
sorted_data = self.raw_data.sort_values(['Date', 'Name'])
After that I use the shift function to move the time up one row into an 'out' column, so it sits in parallel with the 'in':
sorted_data['out'] = sorted_data.shift(-1)['Time']
And finally I drop the extra 'out' rows that were created in the first step, by keeping only the rows where 'pre_out' is null:
filtered_data = sorted_data[sorted_data['pre_out'].isnull()]
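For reference, a more direct sketch of the same idea (assuming Date and Time are still strings and Activity is exactly 'Punch In' / 'Punch Out'): pair each punch-in with the following punch-out for the same employee and day, then subtract.
import pandas as pd

df['Timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df = df.sort_values(['Name', 'Timestamp'])

# the next row's activity/time within the same employee and day
df['next_activity'] = df.groupby(['Name', 'Date'])['Activity'].shift(-1)
df['next_time'] = df.groupby(['Name', 'Date'])['Timestamp'].shift(-1)

# keep only Punch In rows that are directly followed by a Punch Out
worked = df[(df['Activity'] == 'Punch In') & (df['next_activity'] == 'Punch Out')].copy()
worked['duration'] = worked['next_time'] - worked['Timestamp']
print(worked[['Name', 'Date', 'Timestamp', 'next_time', 'duration']])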

How to identify where each person has lived in different cities and at what time?

Here is a small set of the dataset that I am currently working on.
FirstName LastName cities occupation time
---------------------------------------------------------------
Alice Oumi Queens software engineer 1/1/2019
Alice Oumi New York software engineer 12/3/2018
Sam Charles Santa Clara Engineer 2/5/2017
Sam Charles Santa Monica Engineer 8/9/2018
Sam Charles Santa Clara Engineer 12/12/2019
Alice Oumi New York software engineer 1/2/2017
As you see above, the same person could be living in the same place but for different durations of time. I want to clean this dataset to show which places Alice and Sam lived in. For example, instead of having 2 rows of Alice living in New York, I only need to have one. Something similar to the following table:
FirstName LastName cities FirstTime SecondTime
---------------------------------------------------------------
Alice Oumi Queens 1/1/2019 NA
Alice Oumi New York 1/2/2017 12/3/2018
Sam Charles Santa Clara 2/5/2017 12/12/2019
Sam Charles Santa Monica 8/9/2018 NA
I am kind of new to Python and trying to learn. I have tried for loops using iterrows(), but that didn't work.
What can I use to achieve this table?
Thank you so much in advance
You can do that as follows:
# number the times a person lived in the same city (with the same occupation)
df['sequence']= df.groupby(['FirstName', 'LastName', 'cities', 'occupation']).cumcount()+1
# now create the "pivot" table
result= df.set_index(['FirstName', 'LastName', 'cities', 'occupation', 'sequence']).unstack()
# rename the columns
result.columns= ['FirstTime', 'SecondTime']
# reset the index (it was just needed for "pivoting")
result.reset_index(inplace=True)
The result looks like:
Out[483]:
FirstName LastName cities occupation FirstTime SecondTime
0 Alice Oumi New York software engineer 12/3/2018 1/2/2017
1 Alice Oumi Queens software engineer 1/1/2019 NaN
2 Sam Charles Santa Clara Engineer 2/5/2017 12/12/2019
3 Sam Charles Santa Monica Engineer 8/9/2018 NaN
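Note that for New York the output above has FirstTime = 12/3/2018 and SecondTime = 1/2/2017, i.e. the visits are numbered in their original row order rather than chronologically. A small variant (a sketch, not tested against the original frame) sorts by the parsed date before numbering, so FirstTime is always the earlier date:
import pandas as pd

df['time'] = pd.to_datetime(df['time'])   # parse the dates so they sort correctly
df = df.sort_values('time')               # chronological order before numbering
df['sequence'] = df.groupby(['FirstName', 'LastName', 'cities', 'occupation']).cumcount() + 1
result = (df.set_index(['FirstName', 'LastName', 'cities', 'occupation', 'sequence'])['time']
            .unstack())
result.columns = ['FirstTime', 'SecondTime']   # assumes at most two visits per city
result = result.reset_index()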

Getting wrong coordinates for amenities using overpass API

I am using the overpass API to get amenities around (within a radius of 200 m of) a particular location. I am able to receive the results, but for the last 8 records (from 20 to 28) I get wrong coordinates. Can somebody help; is there something wrong with the query?
import pandas as pd
from pandas.io.json import json_normalize
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 500)
# bounding box coordinates
unique_nodes = []
way_nodes = []
# query overpass within OpenStreetMap
import overpass
api = overpass.API()
data = api.get('(node["amenity"](around:200,51.896723,-8.482056);way["amenity"](around:200,51.896723,-8.482056);relation["amenity"](around:200,51.896723,-8.482056);>;);')
#df = [f for f in data.features if f.geometry['type'] == "LineString"]
df_1 = json_normalize(data['features'])
df_1 = df_1[df_1['properties.amenity'].notnull()]
df_1 = df_1[df_1['properties.name'].notnull()]
print(df_1[['properties.amenity', 'geometry.coordinates', 'properties.name']])
The results are shown below:
properties.amenity geometry.coordinates properties.name
0 college [-8.4816139, 51.8950368] Crawford College of Art and Design
1 taxi [-8.4797104, 51.8974271] Cork Taxi Co-op
2 restaurant [-8.4829027, 51.8971138] Café Paradiso
3 car_sharing [-8.4792851, 51.8964633] Wandesford
4 parking [-8.481515, 51.8961668] Saint Finbarre's
5 pub [-8.483117, 51.897028] Reidy's Vault
6 pub [-8.4801976, 51.8973964] Costigans
8 social_facility [-8.4813453, 51.8976471] Penny Dinners
9 bicycle_rental [-8.4846434, 51.8971537] Dyke Parade
10 bicycle_rental [-8.4822285, 51.8971562] St. Finbarre's Bridge
11 bicycle_rental [-8.4800657, 51.8964889] Wandesford Quay
12 pharmacy [-8.4802256, 51.8975298] Santry's
13 pub [-8.4812941, 51.8980821] Porterhouse
14 fast_food [-8.4812901, 51.897902] Holy Smoke
17 restaurant [-8.4797886, 51.8975706] Feed your Senses
20 school [-8.4806982, 51.8982846] Presentation Brothers College
21 university [-8.4806982, 51.8982846] UCC Lee Maltings
22 university [-8.4806982, 51.8982846] Tyndall National Institute
23 school [-8.4806982, 51.8982846] Saint Maries of the Isle National School
24 school [-8.4806982, 51.8982846] Saint Aloysius School
25 parking [-8.4806982, 51.8982846] Lancaster Lodge
26 theatre [-8.4806982, 51.8982846] The Kino
27 university [-8.4806982, 51.8982846] University College Cork
28 hospital [-8.4806982, 51.8982846] Mercy University Hospital
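A first step in narrowing this down might be to check what kind of OSM element the suspect rows actually are; this is only a hedged diagnostic sketch reusing the `data` object from the query above, since ways and relations have no single node coordinate of their own:
# print the raw element id and geometry type of the suspect features
for f in data['features'][20:29]:
    print(f.get('id'), f['geometry']['type'], f['properties'].get('amenity'))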

Multi Criterion Max If Statement

My dataset looks like this...
State Close Date Probability Highest Prob/State
WA 12/31/2016 50% FALSE
WA 12/19/2016 80% FALSE
WA 10/15/2016 80% TRUE
My objective is to build a formula to populate the right-most column. The formula should assess Close Dates and Probabilities within each state. First, it should select the highest probability, then it should select the nearest close date if there is a tie on probability (as in the example). For that record, it should read "TRUE".
I assume this would include a MAX IF statement but haven't been able to get it to work.
Here is a more robust set of data I'm working with. It may actually be easier to first find the highest probability within each Region then select the minimum (oldest) date if there is a tie on probability. This too will serve my purposes.
Region Forecast Close Date Probability (%)
Okeechobee FL 6/27/2016 90
Okeechobee West FL 7/1/2016 40
Albany GA 3/11/2016 100
Emerald Coast FL 6/30/2016 60
Emerald Coast FL 10/1/2016 40
Cullman_Hartselle TN 4/30/2016 10
North MS 10/1/2016 25
Roanoke VA 8/31/2016 25
Roanoke VA 8/1/2016 40
Gardena CA 6/1/2016 80
Gardena CA 6/1/2016 80
Lomita-Harbor City 6/30/2016 60
Lomita-Harbor City 6/30/2016 0
Lomita-Harbor City 6/30/2016 40
Eastern NC 6/30/2016 60
Northwest NC 9/16/2016 10
Fort Collins_Greeley CO 3/1/2016 100
Northwest OK 6/30/2016 100
Southwest MO 7/29/2016 90
Northern NH-VT 3/1/2016 20
South DE 12/1/2016 0
South DE 12/1/2016 20
Kingston NY 12/30/2016 5
Longview WA 11/30/2016 5
North DE 12/1/2016 20
North DE 12/1/2016 0
Salt Lake City UT 8/31/2016 20
Idaho Panhandle 8/26/2016 0
Bridgeton_Salem NJ 7/1/2016 25
Bridgeton_Salem NJ 7/1/2016 65
Layton_Ogden UT 3/25/2016 5
Central OR 6/30/2016 10
The following Array formula should work:
=(ABS(B2-$F$2)=MIN(IF(($A$2:$A$33=A2)*(C2=MAX(IF($A$2:$A$33=A2,$C$2:$C$33))),ABS($B$2:$B$33-$F$2))))*(C2=MAX(IF($A$2:$A$33=A2,$C$2:$C$33)))>0
Being an array formula, it must be confirmed with Ctrl+Shift+Enter when leaving edit mode. If done properly, Excel will put {} around the formula.
Edit
Added @tigeravatar's suggestion to avoid volatile functions.
I think this is OK now, but it needs to be checked against the more complete set of data provided by the OP.
It counts:
(1) Any rows with same state but higher probability
(2) Any rows with same state and probability, in the future (or present) and nearer to today's date
(3) Any rows with same state and probability, in the past and nearer to today's date.
If all these are zero, you should have the right one.
=COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,">"&$C2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,"<"&$G$2+IF($B2>=$G$2,DATEDIF($G$2,$B2,"d"),DATEDIF($B2,$G$2,"d")),$B$2:$B$100,">="&$G$2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,">"&$G$2-IF($B2>=$G$2,DATEDIF($G$2,$B2,"d"),DATEDIF($B2,$G$2,"d")),$B$2:$B$100,"<"&$G$2)
=0
If the dates are all in the future, it can be simplified a lot:-
=COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,">"&$C2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,"<"&$G$2+DATEDIF($G$2,$B2,"d"))
=0
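For comparison, the same "highest probability per state, then the close date nearest to today breaks the tie" logic can be sketched in pandas; the small frame and the reference date below are made-up stand-ins for the worksheet data and $G$2:
import pandas as pd

df = pd.DataFrame({
    'State': ['WA', 'WA', 'WA'],
    'Close Date': pd.to_datetime(['12/31/2016', '12/19/2016', '10/15/2016']),
    'Probability': [0.50, 0.80, 0.80],
})
today = pd.Timestamp('2016-10-01')   # stands in for the date in $G$2

# highest probability per state; the close date nearest to today breaks ties
df['max_prob'] = df.groupby('State')['Probability'].transform('max')
df['days_out'] = (df['Close Date'] - today).abs()
# nearest date among the max-probability rows only (other rows stay NaT)
df['nearest'] = df[df['Probability'] == df['max_prob']].groupby('State')['days_out'].transform('min')
df['Highest Prob/State'] = (df['Probability'] == df['max_prob']) & (df['days_out'] == df['nearest'])
print(df)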
