Extract Datetime information from a string in a DataFrame column - python-3.x

I have an Edition column whose data follows an uneven pattern: in some rows the date follows ',' and in others it follows ',–'.
df.head()
17 Paperback,– 1 Nov 2016
18 Mass Market Paperback,– 1 Jan 1991
19 Paperback,– 2016
20 Hardcover,– 24 Nov 2018
21 Paperback,– Import, 4 Oct 2018
How can I extract the date into a separate column? I tried using str.split() but can't find a specific pattern to split on. Is there a method that would do it?

obj = df['Edition']
obj.str.split(r'((?:\d+\s+\w+\s+)?\d{4}$)', expand=True)
or
obj.str.split('[,–]+').str[0]
obj.str.split('[,–]+').str[-1] # date

Try using dateutil:
from dateutil.parser import parse
# fuzzy_with_tokens=True returns (datetime, skipped_tokens); [0] keeps the datetime
df['Dt']=[parse(i, fuzzy_with_tokens=True)[0] for i in df['Edition']]
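A regex-based alternative, sketched under the assumption that the date always sits at the end of the string and is either a bare four-digit year or a "day, three-letter month, year" triple. The pattern and the column names Date and Date_parsed are illustrative and do not come from the answers above. str.extract puts the captured text in its own column, and errors='coerce' turns the year-only row into NaT instead of guessing a day and month.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"Edition": [
    "Paperback,– 1 Nov 2016",
    "Mass Market Paperback,– 1 Jan 1991",
    "Paperback,– 2016",
    "Hardcover,– 24 Nov 2018",
    "Paperback,– Import, 4 Oct 2018",
]})

# Optional "day month" prefix, then a four-digit year anchored at the end.
pattern = r"((?:\d{1,2}\s+[A-Za-z]{3}\s+)?\d{4})\s*$"
df["Date"] = df["Edition"].str.extract(pattern, expand=False)

# Full dates parse; a year-only value does not fit the format and becomes NaT.
df["Date_parsed"] = pd.to_datetime(df["Date"], format="%d %b %Y", errors="coerce")
print(df[["Date", "Date_parsed"]])
```

If year-only rows should be kept as dates, they would need a second pass with their own format; that choice is left open here.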

Related

How to replace values between columns based on condition in pandas?

I have a dataframe of the form
ID Effective_Date Paid_Off_Time
xqd27070601 09 August 2016 10 July 2016
xqd21601070 09 September 2016 10 July 2016
xqd26010760 10 July 2016 09 November 2016
EDIT
Originally, the dates shown are strings. Their format can be any of these: 9/18/2016 16:56, 09 August 2016, 9/18/2016. Should we consider converting them to timestamps for easier comparison?
What I want
If Effective_Date > Paid_Off_Time, replace the value of Effective_Date with Paid_Off_Time and the value of Paid_Off_Time with Effective_Date.
Basically, switch the values between the two columns, because the date was inserted in the wrong column.
I have thought about using np.where, but I am wondering, isn't there a less verbose, cleaner solution?
# build a new DataFrame to hold the corrected columns
testDf = pd.DataFrame(columns=['Effective_Date','Paid_Off_Time'])
# Effective_Date takes the earlier of the two dates
testDf['Effective_Date'] = np.where(myDataFrame.Effective_Date < myDataFrame.Paid_Off_Time,myDataFrame.Effective_Date,myDataFrame.Paid_Off_Time)
# Paid_Off_Time takes the later of the two dates
testDf['Paid_Off_Time'] = np.where(myDataFrame.Paid_Off_Time < myDataFrame.Effective_Date,myDataFrame.Effective_Date,myDataFrame.Paid_Off_Time)
myDataFrame['Effective_Date'] = testDf['Effective_Date']
myDataFrame['Paid_Off_Time'] = testDf['Paid_Off_Time']
Convert the dates to datetime:
df=df.assign(Effective_Date=pd.to_datetime(df['Effective_Date'], format='%d %B %Y'),Paid_Off_Time=pd.to_datetime(df['Paid_Off_Time'], format='%d %B %Y'))
Build a mask for the rows that meet the condition:
m=df.Effective_Date>df.Paid_Off_Time
Swap the values where the condition is met:
df.loc[m, ['Effective_Date','Paid_Off_Time']]=df.loc[m, ['Paid_Off_Time','Effective_Date']].values
print(df)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
I am sharing a piece of my project code in which I did something similar; I hope this kind of implementation helps.
# adjust the format string to match your own data
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format= '%d/%m/%Y')
df['Paid_Off_Time'] = pd.to_datetime(df['Paid_Off_Time'], format= '%d/%m/%Y')
for i in df.index:
    if df.at[i, 'Effective_Date'] > df.at[i, 'Paid_Off_Time']:
        # swap the two values; .at writes to the frame itself and avoids chained assignment
        k = df.at[i, 'Effective_Date']
        df.at[i, 'Effective_Date'] = df.at[i, 'Paid_Off_Time']
        df.at[i, 'Paid_Off_Time'] = k
You can try sorting the values in NumPy to improve performance:
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'])
df['Paid_Off_Time'] = pd.to_datetime(df['Paid_Off_Time'])
c = ['Effective_Date','Paid_Off_Time']
# sort each row so the earlier date always lands in Effective_Date
data = np.sort(df[c].to_numpy(), axis=1)
df[c] = pd.DataFrame(data, columns=c, index=df.index)
print (df)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
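The question's EDIT mentions that the raw strings arrive in several layouts (9/18/2016 16:56, 09 August 2016, 9/18/2016). Here is a minimal sketch of one way to handle that, assuming those three layouts are the only ones and that slash dates are month-first. parse_mixed and FORMATS are hypothetical names, and the sample rows are a made-up mix of the question's dates in different layouts. Each string is tried against the candidate formats, then the row-wise sort from the last answer puts the earlier date first.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Candidate layouts, assumed from the examples in the question's EDIT.
FORMATS = ("%m/%d/%Y %H:%M", "%m/%d/%Y", "%d %B %Y")

def parse_mixed(text):
    """Return a Timestamp for the first format that fits, else NaT."""
    for fmt in FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(text.strip(), fmt))
        except ValueError:
            continue
    return pd.NaT

# Illustrative data: same IDs as the question, layouts deliberately mixed.
df = pd.DataFrame({
    "ID": ["xqd27070601", "xqd21601070", "xqd26010760"],
    "Effective_Date": ["09 August 2016", "9/9/2016 16:56", "10 July 2016"],
    "Paid_Off_Time": ["7/10/2016", "10 July 2016", "09 November 2016"],
})

cols = ["Effective_Date", "Paid_Off_Time"]
for col in cols:
    df[col] = pd.to_datetime(df[col].map(parse_mixed))

# Row-wise sort: the earlier date ends up in Effective_Date.
df[cols] = np.sort(df[cols].to_numpy(), axis=1)
print(df)
```

Rows that fit none of the formats come back as NaT, so they can be inspected separately rather than silently misparsed.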

Splitting datetime value out of text string with uneven length

System: WIN10
IDE: MS Visual Studio Code
Language: Python version 3.7.3
Library: pandas version 1.0.1
Data source: supplied in the example below
Dataset: supplied in the example below
Ask:
I need to split the date and time string out of a column from a data frame that has rows of uneven delimiters i.e. some with three and some with four commas.
I am trying to figure out how to strip the date and time values: 'Nov 11 2013 12:00AM', and 'Apr 11 2013 12:00AM' respectively off the back of these two records in one column into a new column given the second row in the example below has fewer commas.
Code:
df['sample field'].head(2) 
4457-I need, this, date, Nov 11 2013 12:00AM ,
2359-I need this, date, Apr 11 2013 12:00AM , 
The method below expands the data into separate columns, but the column that holds the date is staggered from row to row, so it does not work for me. I need the date and time (or even just the date) in one column so that I can use the values in further analysis (for example, time series).
Code:
df['sample field'].str.split(",", expand=True)
Data
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df
Use str.extract with a regex expression:
df['Date']= df.Text.str.extract(r'([A-Za-z]+\s+\d+\s+\d+\s+\d+:[0-9A-Z]+(?=\s+\,+))')
df
# To format the result back to text: df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%b %d %Y %I:%M%p')
# Either the line below or pd.to_datetime(df['Date'], format='%b %d %Y %I:%M%p') could work.
# Just remember that because your time is 12AM, you need the 12-hour clock directive %I, not %H.
# A time of 00:00 is likely to be truncated in the display; with, say, 11:00AM the time will appear.
df['Date']=pd.to_datetime(df['Date'])
IIUC you need str.extract with a regular expression.
print(df)
0
0 4457-I need, this, date, Nov 11 2013 12:00AM
1 2359-I need this, date, Apr 11 2013 12:00AM
df['date'] = df[0].str.extract(r'(\w{3}\s\d.*\d{4}\s\d{2}:\d{2}\w{2})')
# %I (12-hour clock) is needed for %p to take effect; with %H, 12:00AM would be read as noon
df['date'] = pd.to_datetime(df['date'] ,format='%b %d %Y %I:%M%p')
print(df)
0 date
0 4457-I need, this, date, Nov 11 2013 12:00AM 2013-11-11
1 2359-I need this, date, Apr 11 2013 12:00AM 2013-04-11
I'll use @wwnde's data:
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df['Date'] = df.Text.str.strip(',').str.split(',').str[-1].str.strip()
# %I rather than %H, so that 12:00AM is parsed as midnight instead of noon
df['Date_formatted'] = pd.to_datetime(df.Date, format = '%b %d %Y %I:%M%p')
Text Date Date_formatted
0 4457-I need, this, date, Nov 11 2013 12:00AM , Nov 11 2013 12:00AM 2013-11-11
1 2359-I need this, date, Apr 11 2013 12:00AM , Apr 11 2013 12:00AM 2013-04-11
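The %I-versus-%H point raised in the first answer is easy to check with the standard library alone. The sketch below parses one of the question's strings with both directives. In strptime, the AM/PM marker only affects the hour when the hour is read with %I, so with %H the value 12:00AM comes out as noon. The variable names are illustrative.

```python
from datetime import datetime

raw = "Nov 11 2013 12:00AM"

# %H reads "12" on a 24-hour clock; the AM marker is then ignored.
with_24h = datetime.strptime(raw, "%b %d %Y %H:%M%p")

# %I reads "12" on a 12-hour clock, so 12:00AM becomes midnight.
with_12h = datetime.strptime(raw, "%b %d %Y %I:%M%p")

print(with_24h)  # 2013-11-11 12:00:00
print(with_12h)  # 2013-11-11 00:00:00
```

For a time series built from these values, the twelve-hour difference would shift every midnight record to midday, so the directive is worth getting right before any further analysis.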

TypeError: while doing web scraping

I was scraping data and want to build two columns, title and date, but a TypeError occurs:
TypeError: from_dict() got an unexpected keyword argument 'columns'
CODE :
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://timesofindia.indiatimes.com/topic/Hiv'
while True:
    response=requests.get(url)
    soup = BeautifulSoup(response.content,'html.parser')
    content = soup.find_all('div',{'class': 'content'})
    for contents in content:
        title_tag = contents.find('span',{'class':'title'})
        title= title_tag.text[1:-1] if title_tag else 'N/A'
        date_tag = contents.find('span',{'class':'meta'})
        date = date_tag.text if date_tag else 'N/A'
        hiv={title : date}
        print(' title : ', title ,' \n date : ' ,date )
    url_tag = soup.find('div',{'class':'pagination'})
    if url_tag.get('href'):
        url = 'https://timesofindia.indiatimes.com/' + url_tag.get('href')
        print(url)
    else:
        break
hiv1 = pd.DataFrame.from_dict(hiv , orient = 'index' , columns = ['title' ,'date'])
pandas is updated to version 0.23.4, but the error still occurs.
The first thing I noticed is that the construction of the dictionary is off. I'm assuming you want one dictionary holding every title: date pair. As you have it now, hiv is reassigned on each pass of the loop, so only the last pair is kept.
Once that is fixed, the keys become the index of the DataFrame and the values form the single column. So technically there is only one column. You can get two columns by resetting the index, which moves the index into a column; I name that column 'title'.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://timesofindia.indiatimes.com/topic/Hiv'
response=requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
content = soup.find_all('div',{'class': 'content'})
hiv = {}
for contents in content:
    title_tag = contents.find('span',{'class':'title'})
    title= title_tag.text[1:-1] if title_tag else 'N/A'
    date_tag = contents.find('span',{'class':'meta'})
    date = date_tag.text if date_tag else 'N/A'
    hiv.update({title : date})
    print(' title : ', title ,' \n date : ' ,date )
hiv1 = pd.DataFrame.from_dict(hiv , orient = 'index' , columns = ['date'])
hiv1 = hiv1.rename_axis('title').reset_index()
Output:
print (hiv1)
title date
0 I told my boyfriend I was HIV positive and thi... 01 Dec 2018
1 Pay attention to these 7 very common HIV sympt... 30 Nov 2018
2 Transfusion of HIV blood: Panel seeks time til... 2019-01-06T03:54:33Z
3 No. of pregnant women testing HIV+ dips; still... 01 Dec 2018
4 Busted:5 HIV AIDS myths 30 Nov 2018
5 Myths and taboos related to AIDS 01 Dec 2018
6 N/A N/A
7 Mumbai: Free HIV tests at six railway stations... 23 Nov 2018
8 HIV blood tranfusion: Tamil Nadu govt assures ... 2019-01-05T09:05:27Z
9 Autopsy performed on HIV+ve donor’s body at GRH 2019-01-03T07:45:03Z
10 Madras HC directs to videograph HIV+ve donor’s... 2019-01-01T01:23:34Z
11 HIV +ve Tamil Nadu teen who attempted suicide ... 2018-12-31T03:37:56Z
12 Another woman claims she got HIV-infected blood 2018-12-31T06:34:32Z
13 Another woman says she got HIV from donor blood 29 Dec 2018
14 HIV case: Five-member panel begins inquiry in ... 29 Dec 2018
15 Pregnant woman turns HIV positive after blood ... 26 Dec 2018
16 Pregnant woman contracts HIV after blood trans... 26 Dec 2018
17 Man attacks niece born with HIV for sleeping i... 16 Dec 2018
18 Health ministry implements HIV AIDS Act 2017: ... 11 Sep 2018
19 When meds don’t heal: HIV+ kids fight daily wa... 03 Sep 2018
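I'm not quite sure why you're getting the error, though. It doesn't make sense if you are really running an updated pandas, so it may be worth printing pd.__version__ from the same interpreter to confirm which version is actually imported. Maybe uninstall pandas and then reinstall it with pip?
Otherwise, I guess you could just do it in two lines and name the columns after converting to a DataFrame: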
hiv1 = pd.DataFrame.from_dict(hiv, orient = 'index').reset_index()
hiv1.columns = ['title','date']
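Another option that sidesteps from_dict entirely is to collect (title, date) tuples in a list and hand them to the DataFrame constructor. This is a sketch rather than the answerer's method: the records list stands in for what the scraping loop would append, three pairs are copied from the output above, and the second N/A pair is added here only to show that repeated titles survive, which they would not as dictionary keys.

```python
import pandas as pd

# Stand-in for pairs appended inside the scraping loop.
records = [
    ("Busted:5 HIV AIDS myths", "30 Nov 2018"),
    ("Myths and taboos related to AIDS", "01 Dec 2018"),
    ("N/A", "N/A"),
    ("N/A", "N/A"),  # illustrative duplicate; a dict would keep only one
]

# Each tuple becomes a row; the column names are given directly.
hiv1 = pd.DataFrame(records, columns=["title", "date"])
print(hiv1)
```

In the loop itself this would mean replacing the dictionary update with records.append((title, date)).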

Using Excel, how do I extract the date from a string?

DEC 4 DEC 5 SPORT LIFE $31.49
DEC 25 DEC 28 BESTBUY EGIFTCARD 877-850-1977 $35.00
I want to have column 2 as:
DEC 4
DEC 25
Column 3 as:
DEC 5
DEC 28
Edit: I have uploaded the image. I want to somehow achieve column B and C.
Notice that sometimes date is DEC 4 or could be DEC 14.
(Ignore the period. I did that to quickly not have auto format)
And yes, date is always first 2 groups.
Put this in B1 and copy over one column and down the length of the data:
=TRIM(MID(SUBSTITUTE($A1," ",REPT(" ",99)),(COLUMN(A:A)-1)*198+1,198))
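The formula pads every space out to 99 spaces, so that each 198-character window holds roughly two words, and then trims the padding away. For readers who would rather do the same split outside Excel, here is a rough Python equivalent. It reproduces the result, not the padding trick, and it assumes (as the question states) that the dates are always the first two groups of two words. split_leading_dates is a hypothetical helper name.

```python
def split_leading_dates(line):
    """Return (first date, second date, remainder) from one statement line."""
    tokens = line.split()
    first = " ".join(tokens[0:2])   # e.g. "DEC 4"
    second = " ".join(tokens[2:4])  # e.g. "DEC 5"
    rest = " ".join(tokens[4:])     # description and amount
    return first, second, rest

rows = [
    "DEC 4 DEC 5 SPORT LIFE $31.49",
    "DEC 25 DEC 28 BESTBUY EGIFTCARD 877-850-1977 $35.00",
]
for row in rows:
    print(split_leading_dates(row))
```

Because the split is on whitespace, one-digit and two-digit days (DEC 4 versus DEC 14) are handled the same way.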

Parse CSV file with some dynamic columns

I have a CSV file that I receive once a week that is in the following format:
"Item","Supplier Item","Description","1","2","3","4","5","6","7","8" ...Linefeed
"","","","Past Due","Day 13-OCT-2014","Buffer 14-OCT-2014","Week 20-OCT-2014","Week 27-OCT-2014", ...LineFeed
"Part1","P1","Big Part","0","0","0","100","50", ...LineFeed
"Part4","P4","Red Part","0","0","0","35","40", ...LineFeed
"Part92","P92","White Part","0","0","0","10","20", ...LineFeed
...
An explanation of the data: row 2 is dynamic and gives the dates on which parts are due. Row 3 begins the part numbers, each with a description and the number of parts due on a particular date. Looking at the data above, row 3, column 7 shows that Part1 has 100 parts due in the week of 20 OCT 2014 and 50 due in the week of 27 OCT 2014.
How can I parse this csv to show the data like this:
Item    Supplier Item  Description  Past Due  Due Date     Amount Due
Part1   P1             Big Part     0         20 OCT 2014  100
Part1   P1             Big Part     0         27 OCT 2014  50
Part4   P4             Red Part     0         20 OCT 2014  35
Part4   P4             Red Part     0         27 OCT 2014  40
....
Is there a way to manipulate the format in Excel to rearrange the data like I need or what is the best method to resolve this?
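One possible approach outside Excel is to unpivot the file with a short script. The sketch below uses only the Python standard library and makes several assumptions that should be checked against the real file: the sample is trimmed to the columns visible in the question, column 4 is always "Past Due", every later label in row 2 ends in a DD-MON-YYYY date, and zero amounts are dropped (which is what the desired output appears to do). unpivot and SAMPLE are hypothetical names.

```python
import csv
import io
from datetime import datetime

# Trimmed sample: only the columns visible in the question.
SAMPLE = '''"Item","Supplier Item","Description","1","2","3","4","5"
"","","","Past Due","Day 13-OCT-2014","Buffer 14-OCT-2014","Week 20-OCT-2014","Week 27-OCT-2014"
"Part1","P1","Big Part","0","0","0","100","50"
"Part4","P4","Red Part","0","0","0","35","40"
"Part92","P92","White Part","0","0","0","10","20"
'''

def unpivot(text):
    """Turn one row per part into one row per (part, due date)."""
    rows = list(csv.reader(io.StringIO(text)))
    labels, parts = rows[1], rows[2:]  # row 2 holds the dynamic date labels
    out = []
    for row in parts:
        item, supplier_item, description = row[0], row[1], row[2]
        past_due = int(row[3])  # assumed to be the "Past Due" column
        for label, amount in zip(labels[4:], row[4:]):
            if int(amount) == 0:
                continue  # assumption: rows with nothing due are not wanted
            # The date is the last word of labels such as "Week 20-OCT-2014".
            due = datetime.strptime(label.split()[-1], "%d-%b-%Y").date()
            out.append((item, supplier_item, description, past_due, due, int(amount)))
    return out

records = unpivot(SAMPLE)
for record in records:
    print(record)
```

For the weekly file, the same function could be fed the file's text instead of SAMPLE, and the resulting rows written back out with csv.writer for use in Excel.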

Resources