Slicing files in Python with conditions - python-3.x

Suppose I have a .txt file that looks like this:
0 day0 event_data0
1 day1 event_data1
2 day2 event_data2
3 day3 event_data3
4 day4 event_data4
........
n dayn event_datan
#where:
#n is the event index
#dayn is the day when the event happened. year-month-day format
#event_datan is what happened at the event.
From this file, I need to create a new one containing all the events that happened between two specific dates, for example after September the 7th 2003 and before Christmas 2006.
Could someone help me with this problem? Much appreciated!

Looks like the datetime module is what you'll want. Iterate through the file line by line, keeping a line once its date is past your beginning threshold (Sept 7, 2003 in your example), and stop iterating when you pass Christmas 2006. You can then load the kept lines into either a pandas DataFrame or a NumPy array.
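A minimal sketch of that approach, assuming the file is whitespace-delimited, sorted by date, and named events_in.txt (the file names are assumptions, not from the question):
from datetime import date, datetime

start = date(2003, 9, 7)
end = date(2006, 12, 25)

kept_lines = []
with open("events_in.txt") as src:      # assumed input file name
    for line in src:
        # columns: index, date (YYYY-MM-DD), event data
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue
        event_date = datetime.strptime(parts[1], "%Y-%m-%d").date()
        if event_date <= start:
            continue                    # before the window
        if event_date >= end:
            break                       # past the window; assumes file is sorted by date
        kept_lines.append(line)

with open("events_out.txt", "w") as dst:  # assumed output file name
    dst.writelines(kept_lines)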

Lucas, you can try this:
import re
import os
from datetime import datetime as dt

date_start = dt.strptime('2003-09-07', '%Y-%m-%d').date()
date_end = dt.strptime('2006-12-25', '%Y-%m-%d').date()

# remove any output file left over from a previous run
if os.path.exists('events.txt'):
    os.remove('events.txt')

with open('file.txt', 'r') as f:
    for line in f:  # iterate line by line, not character by character
        match = re.search(r'\d{4}-\d{2}-\d{2}', line)
        if match:
            date_converted = dt.strptime(match.group(0), '%Y-%m-%d').date()
            if date_start < date_converted < date_end:
                with open('events.txt', 'a') as out:
                    out.write(line)
Change date_start and date_end to your desired interval. The code searches each line for a date matching the yyyy-mm-dd format, compares it against the start and end dates, and, if it falls inside the range, appends that line to events.txt.

I assume your file is tab-delimited, so you can use the pandas package to read it. Just add a first row with the column names (index, date, event) to your .txt file, separated by tabs, and then read in the data.
import pandas
df = pandas.read_csv('txt_file.txt', sep='\t', index_col=0)
# index_col=0 just sets your first column as the index
After you've done so, follow the steps from this link. That essentially answers your question of how to select events between two dates using this package, returning a new data frame with only the events you need.
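For instance, a short sketch of that date filtering (the column name 'date' and file name follow the assumptions above):
import pandas as pd

# read the tab-delimited file; the first column becomes the index
df = pd.read_csv('txt_file.txt', sep='\t', index_col=0)

# parse the date column and keep only rows strictly between the two dates
df['date'] = pd.to_datetime(df['date'])
mask = (df['date'] > '2003-09-07') & (df['date'] < '2006-12-25')
events = df.loc[mask]
events.to_csv('events.txt', sep='\t')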

You have not said whether you want specifically "after September the 7th 2003 and before Christmas 2006", or whether you have other options for these two dates.
If it is specifically those two dates, then in my opinion you can get the result with the re module:
import re

c = r"([0-9]{1,2}\s+)(2003-09-07).+(2006-12-25)\s+\w+"
with open("event.txt", "r") as f:
    file_data = f.readlines()
regex_search = re.search(c, str(file_data))
print(regex_search.group())
You can also use conditions together with group(), or you can use the findall() method.
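As a rough, hedged sketch of the findall() variant (the file name event.txt follows the answer above; the generic yyyy-mm-dd pattern is an assumption, not taken from the original answer):
import re

pattern = r"\d{4}-\d{2}-\d{2}"  # any date written as yyyy-mm-dd
with open("event.txt", "r") as f:
    text = f.read()

# findall() returns every matching date string in the file, in order of appearance
all_dates = re.findall(pattern, text)
print(all_dates)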

Related

Python - import csv file and group numbers by column

My problem is simple: I have this csv file.
I use Python 3. This file represents the number of new covid cases, broken down by country, for every day. What I want is the global number of cases day by day, regardless of the country of origin. What is the fastest and simplest way to do this?
Thank you
You could make a dictionary with dates as the key and cases as the value.
from datetime import datetime

cases_by_day = {}
with open("cases.csv") as f:
    f.readline()  # skip the header row
    for line in f:
        elements = line.split(",")
        date = datetime.strptime(elements[0], "%d/%m/%Y")
        # accumulate cases across all countries for the same date
        cases_by_day[date] = cases_by_day.get(date, 0) + int(elements[3])
This is easily expandable to add deaths and other data per day.
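For instance, a minimal sketch of that expansion, assuming the daily death count sits in the fifth column (the column index is an assumption, since the csv layout is not shown):
from datetime import datetime

deaths_by_day = {}
with open("cases.csv") as f:
    f.readline()  # skip the header row
    for line in f:
        elements = line.split(",")
        date = datetime.strptime(elements[0], "%d/%m/%Y")
        # elements[4] is assumed to hold the daily death count
        deaths_by_day[date] = deaths_by_day.get(date, 0) + int(elements[4])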

How do you take one dataframe that covers multiple years, and break it into a separate DF for each year

I already looked on SE and couldn't find an answer to my question. I am still new to this.
I am trying to take a purchasing csv file and break it into separate dataframes for each year.
For example, if I have a listing with full dates in MM/DD/YYYY format, I am trying to separate them into dataframes for each year. Like Ord2015, Ord2014, etc...
I tried to convert the full date into just the year, and also attempted to use slicing to only look at the last four characters of the date, to no avail.
Here is my current (incomplete) attempt:
import pandas as pd
import csv
import numpy as np
import datetime as dt
import re
purch1 = pd.read_csv('purchases.csv')
#Remove unneeded fluff
del_colmn = ['pid', 'notes', 'warehouse_id', 'env_notes', 'budget_notes']
purch1 = purch1.drop(del_colmn, 1)
#break down by year only
purch1.sort_values(by=['order_date'])
Ord2015 = ()
Ord2014 = ()
for purch in purch1:
    Order2015.add(purch1['order_date'] == 2015)
Per request by @anon01... here are the results of the code you had me run. I only used a sample of four rows, as that was all I was initially playing with. The record has almost 20k lines, so I only pulled aside a few to play with.
'{"pid":{"0":75,"2":95,"3":117,"1":82},"env_id":{"0":12454,"2":12532,"3":12623,"1":12511},"ord_date":{"0":"10\/2\/2014","2":"11\/22\/2014","3":"2\/17\/2015","1":"11\/8\/2014"},"cost_center":{"0":"Ops","2":"Cons","3":"Net","1":"Net"},"dept":{"0":"Ops","2":"Cons","3":"Ops","1":"Ops"},"signing_mgr":{"0":"M. Dodd","2":"L. Price","3":"M. Dodd","1":"M. Dodd"},"check_num":{"0":null,"2":null,"3":null,"1":82301.0},"rec_date":{"0":"10\/11\/2014","2":"12\/2\/2014","3":"3\/1\/2015","1":"11\/20\/2014"},"model":{"0":null,"2":null,"3":null,"1":null},"notes":{"0":"Shipped to east WH","2":"Rec'd by L.Price","3":"Shipped to Client (1190)","1":"Rec'd by K. Wilson"},"env_notes":{"0":"appr by K.Polt","2":"appr by S. Crane","3":"appr by K.Polt","1":"appr by K.Polt"},"budget_notes":{"0":null,"2":"OOB expense","3":"Bill to client","1":null},"cost_year":{"0":2014.0,"2":2015.0,"3":null,"1":2014.0}}'
You can add parse_dates to read_csv to convert the column to datetimes, and then create a dictionary of DataFrames, dfs; to select a year, use its key:
purch1 = pd.read_csv('purchases.csv', parse_dates=['ord_date'])
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))
Ord2015 = dfs[2015]
Ord2016 = dfs[2016]
It is not recommended, but it is possible to create the DataFrames as global variables, one per year group:
for i, g in purch1.groupby(purch1['ord_date'].dt.year):
    globals()['Ord' + str(i)] = g
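As a small usage sketch (the output file names are assumptions): once dfs exists, you can also loop over it, for example to write one csv per year:
import pandas as pd

purch1 = pd.read_csv('purchases.csv', parse_dates=['ord_date'])
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))

# write each year's purchases to its own file, e.g. purchases_2014.csv
for year, group in dfs.items():
    group.to_csv('purchases_' + str(year) + '.csv', index=False)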

Adding a number in one column to a date in another in a pandas dataframe

My first python project that didn't print 'Hello World' - so be gentle. Tried answers from similar questions but they don't seem to work.
I'm working with an Excel file, parsing as pandas dataframe.
I have a calculated column that works out the number of days to be added to a date later. The 'days to add' column is built as below, with 'choices' being a list of integers. This seems to work fine.
choices = [0,0,925,778,567,608, 638,730]
df['Days_to_add'] = np.select(conditions, choices, default=0)
I now want to add this to an existing date column, to return a new column with the new date. So far I've tried this, but Jupyter says it's deprecated and will return a TypeError in a future version:
df["Estimated Start"] = pd.to_timedelta(df["Date1"]) + df['Days_to_add']
Also tried this:
df['Estimated_Start'] = df.Max_Dec_Date + pd.DateOffset(df['Days_to_add'])
And something else that told me to use timedelta index, and something else that pointed to timedelta range. I think the problem is something to do with trying to add an integer to a series?
No success with any of it. Help?
Date is not TimeDelta, but DateTime,
so the addition should go like this:
df["Estimated Start"] = pd.to_datetime(df["Date1"]) + pd.to_timedelta(df['Days_to_add'], unit='D')

Trying to Pass date to pandas search from input prompt

I am trying to figure out how to pass a date inputted at a prompt by the user to pandas to search by date. I have both the search and the input prompt working separately but not together. I will show you what I mean. And maybe someone can tell me how to properly pass the date to pandas for the search.
This is how I successfully use pandas to extract rows in an excel sheet if any cell in column emr_first_access_date is greater than or equal to '2019-09-08'
I do this successfully with the following code:
import pandas as pd
HISorigFile = "C:\\folder\\inputfile1.xlsx"
#opens excel worksheet
df = pd.read_excel(HISorigFile, sheet_name='Non Live', skiprows=8)
#locates the columns I want to write to file including date column emr_first_access_date if greater than or equal to '2019-09-08'
data = df.loc[df['emr_first_access_date'] >= '2019-09-08', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
#sorts the data
datasort = data.sort_values("emr_first_access_date",ascending=False)
#this creates the file (data already sorted) in panda with date and time.
datasort.to_excel(r'C:\\folder\sitesTestedInLastWeek.xlsx', index=False, header=True)
However, the date above is hardcoded of course. So, I need the user running this script to input the date. I created a very basic working input prompt with the following:
import datetime
#prompts for input date
TestedDateBegin = input('Enter beginning date to search for sites tested in YYYY-MM-DD format')
year, month, day = map(int, TestedDateBegin.split('-'))
date1 = datetime.date(year, month, day)
Obviously I want to pass TestedDateBegin to pandas, changing the pertinent code line:
data = df.loc[df['emr_first_access_date'] >= '2019-09-08', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
to something like:
data = df.loc[df['emr_first_access_date'] >= 'TestedDateBegin', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
Obviously this doesn't work. But how do I proceed? I am very new to programming, so I am not always clear how to proceed. Does the date inputted in TestedDateBegin need to be added to a return? Or should it be put in a single-item list? What is the right approach? Thanks!
This is resolved.
I had to remove the single quotes around TestedDateBegin as python, of course, interpreted that as a string and not a variable. Living and learning. :-)
data = df.loc[df['emr_first_access_date'] >= TestedDateBegin, ['site_name','subs_num','emr_id', 'emr_first_access_date']]
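As a hedged side note beyond the original resolution: parsing the input into a real timestamp before comparing can be a bit more robust than comparing against the raw string, for example:
import pandas as pd

TestedDateBegin = input('Enter beginning date to search for sites tested in YYYY-MM-DD format: ')

# parse the user's text into a Timestamp; df is the DataFrame read with pd.read_excel above
begin = pd.to_datetime(TestedDateBegin, format='%Y-%m-%d')
data = df.loc[df['emr_first_access_date'] >= begin,
              ['site_name', 'subs_num', 'emr_id', 'emr_first_access_date']]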

Problems with graphing excel data off an internet source with dates

This is my first post on Stack Overflow and I'm pretty new to programming, especially Python. I'm in engineering and am learning Python to complement that going forward, mostly for math and graphing applications.
Basically my question is how to download csv data from a source (in my case stock data from Google) and plot only certain columns against the date. In my case I want the date against the close value.
Right now the error message I'm getting is: time data '5-Jul-17' does not match format '%d-%m-%Y'.
Previously I was also getting an error that the tuple data does not match.
The opened csv data, viewed in Excel, has 7 columns (Date, Open, High, Low, Close, AdjClose, Volume), and the date is organized as 2017-05-30.
I'm sure there are other errors as well unfortunately
I would really be grateful for any help on this,
thank you in advance!
--edit--
Upon fiddling some more, I don't think names and dtypes are necessary; when I check the matrix dimensions without those identifiers I get (250L, 6L), which seems right. Now my main problem is converting the dates to something usable. My error now is that strptime only accepts strings, so I'm not sure what to use. (See updated code below.)
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

def graph_data(stock):
    # getting the data off google finance
    data = np.genfromtxt('urlgoeshere'+stock+'forthecsvdata', delimiter=',',
                         skip_header=1)
    # checking format of matrix
    print data.shape  # returns (250L, 6L)
    time_format = '%d-%m-%Y'
    # I only want the 1st column (dates) and 5th column (close), all rows
    date = data[:,0][:,]
    close = data[:,4][:,]
    dates = [datetime.strptime(date, time_format)]
    # plotting section
    plt.plot_date(dates, close, '-')
    plt.legend()
    plt.show()

graph_data('stockhere')
Assuming the dates in the csv file are in the format '5-Jul-17', the proper format string to use is %d-%b-%y.
In [6]: datetime.strptime('5-Jul-17','%d-%m-%Y')
ValueError: time data '5-Jul-17' does not match format '%d-%m-%Y'
In [7]: datetime.strptime('5-Jul-17','%d-%b-%y')
Out[7]: datetime.datetime(2017, 7, 5, 0, 0)
See the Python documentation on strptime() behavior.
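As a rough sketch of applying that format string to the whole date column before plotting (the file name stock.csv and the column positions are assumptions, not from the original answer):
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

time_format = '%d-%b-%y'  # matches dates like '5-Jul-17'

# read everything as strings so the date column survives, then convert per column
raw = np.genfromtxt('stock.csv', delimiter=',', skip_header=1, dtype=str)
dates = [datetime.strptime(d, time_format) for d in raw[:, 0]]
closes = raw[:, 4].astype(float)

plt.plot_date(dates, closes, '-')
plt.show()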
