Remove '-' and two following characters using regex? - python-3.x

I have a set of data from a laboratory where the date column looks like this:
12-15.11.12
19-22.11.12
26-29.11.12
03-06.12.12
10-13.12.12
17-20.12.12
19-23.12.12
27-30.12.12
02-05.01.13
I only want the first value (the day of sampling) so I can convert it into a pandas datetime series and continue working with the data.
I know I can delete it manually in Excel, but I would like to do it in code. So my goal is, for example:
12-15.11.12 -> 12.11.2012, where '-15' gets deleted.

You can use re.sub with the pattern -\d+, which matches the hyphen and the digits that follow it:
import re
data = '''\
12-15.11.12
19-22.11.12
26-29.11.12
03-06.12.12
10-13.12.12
17-20.12.12
19-23.12.12
27-30.12.12
02-05.01.13'''
data = re.sub(r'-\d+', '', data)
print(data)
Prints:
12.11.12
19.11.12
26.11.12
03.12.12
10.12.12
17.12.12
19.12.12
27.12.12
02.01.13

import re
dates = [
"12-15.11.12",
"19-22.11.12",
"26-29.11.12",
"03-06.12.12",
"10-13.12.12",
"17-20.12.12",
"19-23.12.12",
"27-30.12.12",
"02-05.01.13"
]
cleaned_dates = []
for date in dates:
    # Keep the first number (group 1) and drop the '-NN' that follows it
    date = re.sub(r"(\d+)-\d+", r"\1", date)
    cleaned_dates.append(date)
print(cleaned_dates)
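To get from the cleaned string to an actual date, as the question asks, one option is to apply the same substitution and then parse what remains. This is a minimal sketch using only the standard library. The helper name first_sampling_day is hypothetical, and it assumes every value follows the DD-DD.MM.YY layout shown in the question.

```python
import re
from datetime import date, datetime


def first_sampling_day(value: str) -> date:
    """Return the first day of a 'DD-DD.MM.YY' range (hypothetical helper)."""
    # Drop the '-DD' part, e.g. '12-15.11.12' -> '12.11.12'
    cleaned = re.sub(r"-\d+", "", value, count=1)
    # %y parses a two-digit year; values 00-68 map to 2000-2068
    return datetime.strptime(cleaned, "%d.%m.%y").date()


samples = ["12-15.11.12", "27-30.12.12", "02-05.01.13"]
days = [first_sampling_day(s) for s in samples]
print([d.strftime("%d.%m.%Y") for d in days])
```

In pandas, the same two steps are usually written as Series.str.replace(r'-\d+', '', regex=True) followed by pd.to_datetime(..., format='%d.%m.%y').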

Related

How do you take one dataframe that covers multiple years, and break it into a separate DF for each year

I already looked on SE and couldn't find an answer to my question. I am still new to this.
I am trying to take a purchasing csv file and break it into separate dataframes for each year.
For example, if I have a listing with full dates in MM/DD/YYYY format, I am trying to separate them into dataframes for each year. Like Ord2015, Ord2014, etc...
I tried to convert the full date into just the year, and also attempted to use slicing to look only at the last four characters of the date, to no avail.
Here is my current (incomplete) attempt:
import pandas as pd
import csv
import numpy as np
import datetime as dt
import re
purch1 = pd.read_csv('purchases.csv')
#Remove unneeded fluff
del_colmn = ['pid', 'notes', 'warehouse_id', 'env_notes', 'budget_notes']
purch1 = purch1.drop(del_colmn, 1)
#break down by year only
purch1.sort_values(by=['order_date'])
Ord2015 = ()
Ord2014 = ()
for purch in purch1:
    Order2015.add(purch1['order_date'] == 2015)
Per req by #anon01... here are the results of the code you had me run. I only used a sample of four as that was all I was initially playing with... The record has almost 20k lines, so I only pulled aside a few to play with.
'{"pid":{"0":75,"2":95,"3":117,"1":82},"env_id":{"0":12454,"2":12532,"3":12623,"1":12511},"ord_date":{"0":"10\/2\/2014","2":"11\/22\/2014","3":"2\/17\/2015","1":"11\/8\/2014"},"cost_center":{"0":"Ops","2":"Cons","3":"Net","1":"Net"},"dept":{"0":"Ops","2":"Cons","3":"Ops","1":"Ops"},"signing_mgr":{"0":"M. Dodd","2":"L. Price","3":"M. Dodd","1":"M. Dodd"},"check_num":{"0":null,"2":null,"3":null,"1":82301.0},"rec_date":{"0":"10\/11\/2014","2":"12\/2\/2014","3":"3\/1\/2015","1":"11\/20\/2014"},"model":{"0":null,"2":null,"3":null,"1":null},"notes":{"0":"Shipped to east WH","2":"Rec'd by L.Price","3":"Shipped to Client (1190)","1":"Rec'd by K. Wilson"},"env_notes":{"0":"appr by K.Polt","2":"appr by S. Crane","3":"appr by K.Polt","1":"appr by K.Polt"},"budget_notes":{"0":null,"2":"OOB expense","3":"Bill to client","1":null},"cost_year":{"0":2014.0,"2":2015.0,"3":null,"1":2014.0}}'
You can pass parse_dates to read_csv to convert the column to datetimes, then build a dictionary of DataFrames, dfs, keyed by year, and select each year by its key:
purch1 = pd.read_csv('purchases.csv', parse_dates=['ord_date'])
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))
Ord2015 = dfs[2015]
Ord2014 = dfs[2014]
It is possible, though not recommended, to create a separate DataFrame variable for each year group:
for i, g in purch1.groupby(purch1['ord_date'].dt.year):
    globals()['Ord' + str(i)] = g

How can I do something similar to VLOOKUP in Excel with urlparse?

I need to compare two sets of data from csv's, one (csv1) with a column 'listing_url' the other (csv2) with columns 'parsed_url' and 'url_code'. I would like to use the result set from urlparse on csv1 (specifically the netloc) to compare to csv2 'parsed_url' and output matching value from 'url_code' to csv.
from urllib.parse import urlparse
import re, pandas as pd
scr = pd.read_csv('csv2',squeeze=True,usecols=['parsed_url','url_code'])[['parsed_url','url_code']]
data = pd.read_csv('csv1')
L = data.values.T[0].tolist()
T = pd.Series([scr])
for i in L:
    n = urlparse(i)
    nf = pd.Series([(n.netloc)])
I'm stuck trying to convert the data into objects I can use map with, if that's even the best thing to use, I don't know.
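One way to think about the VLOOKUP part is as a dictionary lookup keyed on the netloc. The sketch below is only an illustration under stated assumptions: the in-memory lists stand in for the two CSV files, the URLs and codes are made up, and lookup_code is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the 'listing_url' column of csv1 ...
listing_urls = [
    "https://www.example.com/item/1",
    "http://shop.example.org/listing?id=7",
    "https://unknown.example.net/x",
]
# ... and for the 'parsed_url' / 'url_code' columns of csv2
code_rows = [
    {"parsed_url": "www.example.com", "url_code": "EX1"},
    {"parsed_url": "shop.example.org", "url_code": "SH2"},
]

# Build the lookup table once: netloc -> url_code
code_by_host = {row["parsed_url"]: row["url_code"] for row in code_rows}


def lookup_code(url, default=None):
    """Return the url_code whose parsed_url equals the URL's netloc."""
    return code_by_host.get(urlparse(url).netloc, default)


matches = [(url, lookup_code(url)) for url in listing_urls]
for url, code in matches:
    print(url, code)
```

URLs with no matching host come back as None here, which plays the role of Excel's #N/A. With pandas, the same dictionary could be passed to Series.map after extracting the netloc from each URL.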

Trying to Pass date to pandas search from input prompt

I am trying to figure out how to pass a date inputted at a prompt by the user to pandas to search by date. I have both the search and the input prompt working separately but not together. I will show you what I mean. And maybe someone can tell me how to properly pass the date to pandas for the search.
This is how I successfully use pandas to extract rows from an Excel sheet where the value in column emr_first_access_date is greater than or equal to '2019-09-08':
import pandas as pd
HISorigFile = "C:\\folder\\inputfile1.xlsx"
#opens excel worksheet
df = pd.read_excel(HISorigFile, sheet_name='Non Live', skiprows=8)
#locates the columns I want to write to file including date column emr_first_access_date if greater than or equal to '2019-09-08'
data = df.loc[df['emr_first_access_date'] >= '2019-09-08', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
#sorts the data
datasort = data.sort_values("emr_first_access_date",ascending=False)
#this creates the file (data already sorted) in panda with date and time.
datasort.to_excel(r'C:\\folder\sitesTestedInLastWeek.xlsx', index=False, header=True)
However, the date above is hardcoded of course. So, I need the user running this script to input the date. I created a very basic working input prompt with the following:
import datetime
#prompts for input date
TestedDateBegin = input('Enter beginning date to search for sites tested in YYYY-MM-DD format')
year, month, day = map(int, TestedDateBegin.split('-'))
date1 = datetime.date(year, month, day)
Obviously I want to pass TestedDateBegin to pandas, changing the pertinent code line:
data = df.loc[df['emr_first_access_date'] >= '2019-09-08', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
to something like:
data = df.loc[df['emr_first_access_date'] >= 'TestedDateBegin', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
Obviously this doesn't work. But how do I proceed? I am very new to programming, so I am not always clear on how to proceed. Does the date entered in TestedDateBegin need to be added to a return? Or should it be put in a single-item list? What is the right approach? Thx!
This is resolved.
I had to remove the single quotes around TestedDateBegin because Python, of course, interpreted that as a string literal and not as a variable. Living and learning. :-)
data = df.loc[df['emr_first_access_date'] >= TestedDateBegin, ['site_name','subs_num','emr_id', 'emr_first_access_date']]
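Because the comparison relies on the user typing a well-formed date, it may be worth validating the input before it reaches the filter. This is a minimal sketch, not part of the original solution. parse_search_date is a hypothetical helper, and the literal strings stand in for whatever input() returns.

```python
from datetime import datetime


def parse_search_date(text: str) -> str:
    """Validate YYYY-MM-DD input and return it normalised (hypothetical helper)."""
    # strptime raises ValueError for malformed or impossible dates
    parsed = datetime.strptime(text.strip(), "%Y-%m-%d")
    return parsed.strftime("%Y-%m-%d")


# In the real script this string would come from input(...)
tested_date_begin = parse_search_date(" 2019-09-08 ")
print(tested_date_begin)

try:
    parse_search_date("2019-02-30")  # February has no 30th day
except ValueError as exc:
    print("Rejected:", exc)
```

The returned string can then be used on the right-hand side of the >= comparison in place of the hardcoded date.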

Sort by datetime in python3

Looking for help on how to sort a Python 3 dictionary by a datetime object (as shown below, a value in the dictionary) using the timestamp below.
datetime: "2018-05-08T14:06:54-04:00"
Any help would be appreciated, spent a bit of time on this and know that to create the object I can do:
format = "%Y-%m-%dT%H:%M:%S"
# Make strptime obj from string minus the crap at the end
strpTime = datetime.datetime.strptime(ts[:-6], format)
# Create string of the pieces I want from obj
convertedTime = strpTime.strftime("%B %d %Y, %-I:%M %p")  # %M is minutes; %m would print the month
But I'm unsure how to go about comparing that to the other values where it accounts for both day and time correctly, and cleanly.
Again, any nudges in the right direction would be greatly appreciated!
Thanks ahead of time.
Datetime instances support the usual ordering operators (< etc), so you should order in the datetime domain directly, not with strings.
Use a callable to convert your strings to timezone-aware datetime instances:
from datetime import datetime
def key(s):
    fmt = "%Y-%m-%dT%H:%M:%S%z"
    s = ''.join(s.rsplit(':', 1)) # remove colon from offset
    return datetime.strptime(s, fmt)
This key func can be used to correctly sort values:
>>> data = {'s1': "2018-05-08T14:06:54-04:00", 's2': "2018-05-08T14:05:54-04:00"}
>>> sorted(data.values(), key=key)
['2018-05-08T14:05:54-04:00', '2018-05-08T14:06:54-04:00']
>>> sorted(data.items(), key=lambda item: key(item[1]))
[('s2', '2018-05-08T14:05:54-04:00'), ('s1', '2018-05-08T14:06:54-04:00')]
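On Python 3.7 and later, datetime.fromisoformat accepts this timestamp layout directly, including the colon in the offset, so it can serve as the key function. The sketch below extends the sample data with a hypothetical third entry, s3, in a different offset to show that the ordering follows the actual instant rather than the text.

```python
from datetime import datetime

data = {
    "s1": "2018-05-08T14:06:54-04:00",
    "s2": "2018-05-08T14:05:54-04:00",
    "s3": "2018-05-08T15:00:00-01:00",  # 16:00 UTC, earlier than s1 and s2
}

# Sort the (key, timestamp) pairs by the timezone-aware datetime they represent
by_time = sorted(data.items(), key=lambda item: datetime.fromisoformat(item[1]))
ordered_names = [name for name, _ in by_time]
print(ordered_names)
```

A plain string sort would put s3 last because its text starts with a later hour, which is why comparing in the datetime domain matters when offsets differ.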

Slicing files in Python with conditions

Suppose I have a .txt file that looks like this:
0 day0 event_data0
1 day1 event_data1
2 day2 event_data2
3 day3 event_data3
4 day4 event_data4
........
n dayn event_datan
#where:
#n is the event index
#dayn is the day when the event happened. year-month-day format
#event_datan is what happened at the event.
From this file, I need to create a new one with all the events that happened between two specific dates, such as after September 7th, 2003 and before Christmas 2006.
Could someone help me with this problem? Much appreciated!
Looks like the datetime module is what you'll want. Iterate through the file line by line until the timedelta between the current line's date and your beginning threshold date (Sept 7, 2003 in your example) is positive; stop iterating when you breach Christmas 2006. Load the lines into either a pandas dataframe or numpy array.
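The line-by-line idea can be sketched as a small generator. This is one possible reading under stated assumptions: fields are whitespace-separated, the date is the second field in year-month-day format, and both boundaries are exclusive. The name events_between and the sample lines are hypothetical. Unlike the description above, it scans every line instead of stopping early, so it does not rely on the file being sorted by date.

```python
from datetime import date, datetime


def events_between(lines, start, end):
    """Yield lines whose second field is a date strictly between start and end."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            day = datetime.strptime(fields[1], "%Y-%m-%d").date()
        except ValueError:
            continue  # second field is not a valid date
        if start < day < end:
            yield line


sample = [
    "0 2003-09-07 opening\n",
    "1 2004-01-15 first_event\n",
    "2 2006-12-24 last_event\n",
    "3 2006-12-25 christmas\n",
]
selected = list(events_between(sample, date(2003, 9, 7), date(2006, 12, 25)))
print("".join(selected), end="")
```

To work on real files, an open file object can be passed as lines and the result written out with writelines on the output file.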
Lucas, you can try this:
import re
from datetime import datetime as dt
__date_start__ = dt.strptime('2003-09-07', "%Y-%m-%d").date()
__date_end__ = dt.strptime('2006-12-25', "%Y-%m-%d").date()
# Opening events.txt in 'w' mode replaces any previous output file
with open('file.txt', 'r') as src, open('events.txt', 'w') as dst:
    for line in src:  # iterate over lines, not single characters
        match = re.search(r'\d{4}-\d{2}-\d{2}', line)
        if match is None:
            continue  # no date on this line
        date_converted = dt.strptime(match.group(0), '%Y-%m-%d').date()
        if __date_start__ < date_converted < __date_end__:
            dst.write(line)
Change the __date_start__ and __date_end__ values to the interval you want. The code searches each line for a date in yyyy-mm-dd format, checks whether it falls strictly between the start and end dates, and, if so, writes the line to events.txt.
I assume your file is tab-delimited, so you can use the pandas package to read it. Just add a first row with the column names (index, date, event), separated by tabs, to your .txt file and then read in the data.
import pandas
df = pandas.read_csv('txt_file.txt', sep='\t', index_col=0)
#index_col=0 just sets your first column as index
After you've done so, follow the steps from this link. That will essentially answer your question on how to select events between two dates by simply using this package. That way you can return a new data frame only with those events you need.
You have not said whether you want this specifically for "after September the 7th 2003 and before Christmas 2006", or whether the two dates can vary.
If it is specifically for those two dates, then in my opinion you can get the result with the re module:
import re
c = r"([0-9]{1,2}\s+)(2003-09-07).+(2006-12-25)\s+\w+"
with open("event.txt", "r") as f:
    file_data = f.readlines()
regex_search = re.search(c, str(file_data))
print(regex_search.group())
You can also use conditions with group(), or you can use the findall() method. Note that this pattern only matches when both boundary dates appear in the file, in that order; otherwise re.search returns None and the call to group() fails.

Resources