Python - import csv file and group numbers by column - python-3.x

My problem is simple: I have this csv file.
I use Python 3. This file represents the number of new covid cases per country for each day. What I want is the global number of cases day by day, regardless of the country of origin. What is the fastest and simplest way to do this?
Thank you

You could make a dictionary with dates as the key and cases as the value.
from datetime import datetime
cases_by_day = {}
with open("cases.csv") as f:
    f.readline()  # skip the header row
    for line in f:
        elements = line.split(",")
        # column 0 holds the date, column 3 the number of new cases
        date = datetime.strptime(elements[0], "%d/%m/%Y")
        cases_by_day[date] = cases_by_day.get(date, 0) + int(elements[3])
This is easily expandable to add deaths and other data per day.
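As a rough sketch of that expansion, the same idea can be written with csv.DictReader so that several columns are summed per day. The column names date, cases and deaths, the helper name totals_by_day and the sample rows are assumptions made for illustration, since the layout of the real file is not shown.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

def totals_by_day(f, date_col="date", value_cols=("cases", "deaths")):
    """Sum each value column per date, across all countries."""
    # every new date starts with a fresh {"cases": 0, "deaths": 0} dict
    totals = defaultdict(lambda: dict.fromkeys(value_cols, 0))
    for row in csv.DictReader(f):
        day = datetime.strptime(row[date_col], "%d/%m/%Y").date()
        for col in value_cols:
            totals[day][col] += int(row[col])
    return dict(totals)

# Hypothetical sample data standing in for the real file
sample = io.StringIO(
    "date,country,cases,deaths\n"
    "01/03/2020,France,10,1\n"
    "01/03/2020,Italy,20,2\n"
    "02/03/2020,France,5,0\n"
)
result = totals_by_day(sample)
```

With a real file, the io.StringIO object would be replaced by open("cases.csv", newline=""), and the column names adjusted to match the header row.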

Related

How do you take one dataframe that covers multiple years, and break it into a separate DF for each year

I already looked on SE and couldn't find an answer to my question. I am still new to this.
I am trying to take a purchasing csv file and break it into separate dataframes for each year.
For example, if I have a listing with full dates in MM/DD/YYYY format, I am trying to separate them into dataframes for each year. Like Ord2015, Ord2014, etc...
I tried to convert the full date into just the year, and also attempted to use slicing to look only at the last four characters of the date, to no avail.
Here is my current (incomplete) attempt:
import pandas as pd
import csv
import numpy as np
import datetime as dt
import re
purch1 = pd.read_csv('purchases.csv')
#Remove unneeded fluff
del_colmn = ['pid', 'notes', 'warehouse_id', 'env_notes', 'budget_notes']
purch1 = purch1.drop(del_colmn, 1)
#break down by year only
purch1.sort_values(by=['order_date'])
Ord2015 = ()
Ord2014 = ()
for purch in purch1:
    Order2015.add(purch1['order_date'] == 2015)
Per request by @anon01, here are the results of the code you had me run. I only used a sample of four rows, as that was all I was initially playing with. The file has almost 20k lines, so I only pulled a few aside to play with.
'{"pid":{"0":75,"2":95,"3":117,"1":82},"env_id":{"0":12454,"2":12532,"3":12623,"1":12511},"ord_date":{"0":"10\/2\/2014","2":"11\/22\/2014","3":"2\/17\/2015","1":"11\/8\/2014"},"cost_center":{"0":"Ops","2":"Cons","3":"Net","1":"Net"},"dept":{"0":"Ops","2":"Cons","3":"Ops","1":"Ops"},"signing_mgr":{"0":"M. Dodd","2":"L. Price","3":"M. Dodd","1":"M. Dodd"},"check_num":{"0":null,"2":null,"3":null,"1":82301.0},"rec_date":{"0":"10\/11\/2014","2":"12\/2\/2014","3":"3\/1\/2015","1":"11\/20\/2014"},"model":{"0":null,"2":null,"3":null,"1":null},"notes":{"0":"Shipped to east WH","2":"Rec'd by L.Price","3":"Shipped to Client (1190)","1":"Rec'd by K. Wilson"},"env_notes":{"0":"appr by K.Polt","2":"appr by S. Crane","3":"appr by K.Polt","1":"appr by K.Polt"},"budget_notes":{"0":null,"2":"OOB expense","3":"Bill to client","1":null},"cost_year":{"0":2014.0,"2":2015.0,"3":null,"1":2014.0}}'
You can pass parse_dates to read_csv to convert the column to datetimes, and then create a dictionary of DataFrames, dfs, whose keys are the years:
purch1 = pd.read_csv('purchases.csv', parse_dates=['ord_date'])
dfs = dict(tuple(purch1.groupby(purch1['ord_date'].dt.year)))
Ord2015 = dfs[2015]
Ord2014 = dfs[2014]
It is not recommended, but it is also possible to create one DataFrame variable per year group:
for i, g in purch1.groupby(purch1['ord_date'].dt.year):
    globals()['Ord' + str(i)] = g

How to calculate average datetime timestamps in python3

I have code whose run time I timestamp, and I want to measure the average time it took to run on multiple computers, but I just can't figure out how to use the datetime module in Python.
Here is how my procedure looks:
1) I have the code which simply writes into a text file the log, where the timestamp looks like
t1=datetime.datetime.now()
...
t2=datetime.datetime.now()
stamp= t2-t1
And that stamp variable is just written in say log.txt so in the log file it looks like 0:07:23.160896 so it seems like it's %H:%M:%S.%f format.
2) Then I run a second python script which reads in the log.txt file and it reads the 0:07:23.160896 value as a string.
The problem is I don't know how to work with this value, because if I import it as a datetime it will also append an imaginary year, month and day to it, which I don't want. I simply want to work with hours, minutes, seconds and microseconds to add them up or take an average.
For example I can just open it in Libreoffice and add the 0:07:23.160896 to 0:00:48.065130 which will give 0:08:11.226026 and then just divide by 2 which will give 0:04:05.613013, and I just can't possibly do that in python or I dont know how to do it.
I have tried everything, but neither datetime.datetime nor datetime.timedelta seems to allow simple multiplication and division like that. If I just do y=datetime.datetime.strptime('0:07:23.160896','%H:%M:%S.%f') it gives 1900-01-01 00:07:23.160896 and I can't just take y*2; it doesn't allow arithmetic operations. Plus, if I convert it into a timedelta it will also multiply the year, which is ridiculous. I simply want to add, subtract and multiply time.
Please help me find a way to do this, and not just for 2 variables but possibly even a way to calculate the average of an entire list of timestamps like average(['0:07:23.160896' , '0:00:48.065130', '0:00:14.517086',...]) way.
I simply want a way to calculate the average of many timestamps and output it in the same format, just as you can select a column in LibreOffice and apply the AVERAGE() function to get the average timestamp in that column.
As you have done, you first read the string into a datetime-object using strptime: t = datetime.datetime.strptime(single_time,'%H:%M:%S.%f')
After that, convert the time part of your datestring into a timedelta, so you can easily calculate with times: tdelta = datetime.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second, microseconds=t.microsecond)
Now you can do arithmetic on the timedelta objects and, at the end of the calculations, convert the result back into a string with str(), for example str(taverage).
import datetime
times = ['0:07:23.160896', '0:00:48.065130', '0:12:22.324251']
# convert each time string into a timedelta and accumulate the sum
tsum = datetime.timedelta()
count = 0
for single_time in times:
    t = datetime.datetime.strptime(single_time,'%H:%M:%S.%f')
    tdelta = datetime.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second, microseconds=t.microsecond)
    tsum = tsum + tdelta
    count = count + 1
# timedelta supports division by a number
taverage = tsum / count
average_time = str(taverage)
print(average_time)
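The same calculation can be wrapped in a reusable function that builds each timedelta directly from the string, without the detour through strptime. This is only a sketch: parse_duration and average_duration are hypothetical helper names, and it assumes every logged value is shorter than one day, so that it has the plain H:MM:SS.ffffff form (str() of a longer timedelta adds a "1 day, " prefix that this parser does not handle).

```python
from datetime import timedelta

def parse_duration(text):
    """Parse 'H:MM:SS' or 'H:MM:SS.ffffff' into a timedelta."""
    hours, minutes, seconds = text.strip().split(":")
    whole, _, frac = seconds.partition(".")
    # pad or trim the fractional part to exactly six digits (microseconds)
    micro = int(frac.ljust(6, "0")[:6]) if frac else 0
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(whole), microseconds=micro)

def average_duration(texts):
    """Average a non-empty list of duration strings."""
    durations = [parse_duration(t) for t in texts]
    # sum() needs a timedelta start value, otherwise it starts from int 0
    return sum(durations, timedelta()) / len(durations)

avg = average_duration(["0:07:23.160896", "0:00:48.065130"])
print(avg)  # 0:04:05.613013, the same value as the LibreOffice example
```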

Trying to Pass date to pandas search from input prompt

I am trying to figure out how to pass a date inputted at a prompt by the user to pandas to search by date. I have both the search and the input prompt working separately but not together. I will show you what I mean. And maybe someone can tell me how to properly pass the date to pandas for the search.
This is how I successfully use pandas to extract rows in an excel sheet if any cell in column emr_first_access_date is greater than or equal to '2019-09-08'
I do this successfully with the following code:
import pandas as pd
HISorigFile = "C:\\folder\\inputfile1.xlsx"
#opens excel worksheet
df = pd.read_excel(HISorigFile, sheet_name='Non Live', skiprows=8)
#locates the columns I want to write to file including date column emr_first_access_date if greater than or equal to '2019-09-08'
data = df.loc[df['emr_first_access_date'] >= '2019-09-08', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
#sorts the data
datasort = data.sort_values("emr_first_access_date",ascending=False)
#this creates the file (data already sorted) in panda with date and time.
datasort.to_excel(r'C:\\folder\sitesTestedInLastWeek.xlsx', index=False, header=True)
However, the date above is hardcoded of course. So, I need the user running this script to input the date. I created a very basic working input prompt with the following:
import datetime
#prompts for input date
TestedDateBegin = input('Enter beginning date to search for sites tested in YYYY-MM-DD format')
year, month, day = map(int, TestedDateBegin.split('-'))
date1 = datetime.date(year, month, day)
Obviously I want to pass TestedDateBegin to pandas, changing the pertinent code line:
data = df.loc[df['emr_first_access_date'] >= '2019-09-08', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
to something like:
data = df.loc[df['emr_first_access_date'] >= 'TestedDateBegin', ['site_name','subs_num','emr_id', 'emr_first_access_date']]
Obviously this doesn't work. But how do I proceed? I am very new to programming, so I am not always clear on how to proceed. Does the date inputted in TestedDateBegin need to be added to a return? Or should it be put in a single item list? What is the right approach? Thx!
This is resolved.
I had to remove the single quotes around TestedDateBegin, as Python, of course, interpreted that as a string literal and not as a variable. Living and learning. :-)
data = df.loc[df['emr_first_access_date'] >= TestedDateBegin, ['site_name','subs_num','emr_id', 'emr_first_access_date']]
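One possible refinement, sketched here with the standard library only, is to validate the typed text before it reaches the comparison, so that a mistyped date fails immediately with a clear error. The function name parse_user_date is hypothetical and not part of the original script.

```python
from datetime import datetime

def parse_user_date(text):
    """Return the date as an ISO 'YYYY-MM-DD' string, or raise ValueError."""
    # strptime rejects impossible dates such as 2019-02-30
    return datetime.strptime(text.strip(), "%Y-%m-%d").date().isoformat()

# In the real script the argument would come from input(...)
cutoff = parse_user_date(" 2019-9-8 ")
print(cutoff)  # 2019-09-08
```

The normalized string in cutoff could then take the place of TestedDateBegin in the df.loc comparison above.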

Slicing files in python with conditions

Suppose I have a .txt file that looks like this:
0 day0 event_data0
1 day1 event_data1
2 day2 event_data2
3 day3 event_data3
4 day4 event_data4
........
n dayn event_datan
#where:
#n is the event index
#dayn is the day when the event happened. year-month-day format
#event_datan is what happened at the event.
From this file, I need to create a new one with all the events that happened between two specific dates, for example after September the 7th 2003 and before Christmas 2006.
Could someone help me with this problem? Much appreciated!
Looks like the datetime module is what you'll want. Iterate through the file line by line until the timedelta between the current line's date and your beginning threshold date (Sept 7, 2003 in your example) is positive; stop iterating when you breach Christmas 2006. Load the lines into either a pandas dataframe or numpy array.
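A minimal sketch of that line-by-line approach follows. It assumes the fields are separated by whitespace and that the date is the second field; the function name events_between and the sample lines are made up for illustration. Unlike the description above, it does not stop early at the end date, so it also works when the file is not sorted by date.

```python
from datetime import date, datetime

def events_between(lines, start, end):
    """Yield the lines whose date field lies strictly between start and end."""
    for line in lines:
        fields = line.split(maxsplit=2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            day = datetime.strptime(fields[1], "%Y-%m-%d").date()
        except ValueError:
            continue  # second field is not a year-month-day date
        if start < day < end:
            yield line

# Hypothetical sample lines in the "index date event" layout
sample = [
    "0 2003-09-01 opening\n",
    "1 2004-05-20 meeting\n",
    "2 2006-12-24 party\n",
    "3 2007-01-02 review\n",
]
kept = list(events_between(sample, date(2003, 9, 7), date(2006, 12, 25)))
```

Because an open file is itself an iterable of lines, the result can be written out with dst.writelines(events_between(src, start, end)), where src and dst are the opened input and output files.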
Lucas, you can try this:
import re
from datetime import datetime as dt
__date_start__ = dt.strptime('2003-09-07', "%Y-%m-%d").date()
__date_end__ = dt.strptime('2006-12-25', "%Y-%m-%d").date()
# mode 'w' empties any previous events.txt, so no separate delete is needed
with open('file.txt', 'r') as src, open('events.txt', 'w') as dst:
    # iterate over the file line by line
    for line in src:
        # look for a yyyy-mm-dd date anywhere in the line
        match = re.search(r'\d{4}-\d{2}-\d{2}', line)
        if match:
            date_converted = dt.strptime(match.group(0), '%Y-%m-%d').date()
            if __date_start__ < date_converted < __date_end__:
                dst.write(line)
Change the __date_start__ and __date_end__ values to your desired interval. The code searches each line for a date in yyyy-mm-dd format, compares it against the start and end dates and, if it falls strictly between them, writes the line to events.txt.
I assume your file is tab-delimited, so you can use the pandas package to read it. Just add a first row with the column names (index, date, event), separated by tabs, to your .txt file and then read in the data.
import pandas
df = pandas.read_csv('txt_file.txt', sep='\t', index_col=0)
#index_col=0 just sets your first column as index
After you've done so, follow the steps from this link. That will essentially answer your question on how to select events between two dates by simply using this package. That way you can return a new data frame only with those events you need.
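For reference, one common way to select rows between two dates in pandas is a boolean mask on a datetime column. The sketch below assumes the header row described above (index, date, event); the sample rows are invented, and parse_dates is added so that the date column is compared as dates and not as text.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample with the header row described above
sample = io.StringIO(
    "index\tdate\tevent\n"
    "0\t2003-09-01\topening\n"
    "1\t2004-05-20\tmeeting\n"
    "2\t2006-12-24\tparty\n"
    "3\t2007-01-02\treview\n"
)
df = pd.read_csv(sample, sep="\t", index_col=0, parse_dates=["date"])

# strictly after the start date and strictly before the end date
mask = (df["date"] > "2003-09-07") & (df["date"] < "2006-12-25")
selected = df.loc[mask]
selected_events = selected["event"].tolist()
```

The filtered frame can then be saved in the same layout with selected.to_csv('events.txt', sep='\t').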
You have not said whether you want exactly "after September the 7th 2003 and before Christmas 2006", or whether other values are possible for these two dates.
If it is specifically for "after September the 7th 2003 and before Christmas 2006", then in my opinion you can get the result with the re module. Note that this only matches if both dates appear literally in the file:
import re
c=r"([0-9]{1,2}\s+)(2003-09-07).+(2006-12-25)\s+\w+"
with open("event.txt","r") as f:
    file_data=f.readlines()
regex_search=re.search(c,str(file_data))
print(regex_search.group())
You can also use conditions with group(), or you can use the findall() method.

How do I find out which person stayed the most nights? Name and total number of days?

How do I find out which person stayed the most nights? I need the name and the total number of days. (date format MM/DD)
For example,
the text file contains
Robin 01/11 01/15
Mike 02/10 02/12
John 01/15 02/15
output expected
('john', 30 )
my code
from datetime import datetime
def longest_stay(fpath):
    with open(fpath,'r') as f_handle:
        stay=[]
        for line in f_handle:
            name, a_date, d_date = line.strip().split()
            diff = datetime.strptime(d_date, "%m/%d") - datetime.strptime(a_date, "%m/%d")
            stay.append(abs(diff.days+1))
        return name,max(stay)
It always returns the first name.
This can also be implemented using pandas. I think it will be much simpler that way.
One issue is how you want to handle the case where several people stayed for the maximum number of nights. I have addressed that in the following code.
import pandas as pd
def longest_stay(fpath):
    # Read the space-separated text file into a DataFrame with named columns
    data = pd.read_csv(fpath, sep=" ", header=None, names=['Name', 'a_date', 'd_date'])
    # Calculate the nights for each customer
    arrive = pd.to_datetime(data['a_date'], format="%m/%d")
    depart = pd.to_datetime(data['d_date'], format="%m/%d")
    data['nights'] = (depart - arrive).dt.days
    # Keep the name and nights of every customer who matches the maximum
    longest = data.loc[data['nights'] == data['nights'].max(), ['Name', 'nights']]
    # In case several stayed for the longest time, return a list of tuples
    return list(longest.itertuples(index=False, name=None))
Your code fails because it does not store the name along with the stay: you only store the days as you go, so name ends up set to the last name in the file, and that is the name you always see.
You also add + 1, which does not seem correct: you should not include the last day, as the person does not stay that night. Your code would actually output ('John', 32): the correct name by chance, because it is the last one in your sample file, and a day count that is off by 1.
Just keep track of the best result, which includes the name and the day count, as you go, using the days stayed as the measure, and return it at the end:
from datetime import datetime
from csv import reader
def longest_stay(fpath):
    with open(fpath,'r') as f_handle:
        mx, best = None, None
        for name, a_date, d_date in reader(f_handle, delimiter=" "):
            days = (datetime.strptime(d_date, "%m/%d") - datetime.strptime(a_date, "%m/%d")).days
            # first iteration, or we found a longer stay
            if best is None or mx < days:
                mx = days
                best = name, days
    return best
Output:
In [13]: cat test.txt
Robin 01/11 01/15
Mike 02/10 02/12
John 01/15 02/15
In [14]: longest_stay("test.txt")
# 31 days not including the last day as a stay
Out[14]: ('John', 31)
You only need to use abs if the dates are not always in start-end order, but be aware that you could get the wrong output using the absolute value if your dates had years.
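To illustrate why abs can mislead, here is a small sketch for MM/DD dates where a stay may run past New Year. It rests on two assumptions that are not in the original question: a departure that is earlier in the calendar than the arrival is taken to fall in the following year, and the reference year (a hypothetical parameter, 2023 by default) is arbitrary.

```python
from datetime import datetime

def nights(a_date, d_date, year=2023):
    """Nights between two MM/DD dates, allowing a stay across New Year."""
    arrive = datetime.strptime(f"{a_date}/{year}", "%m/%d/%Y")
    depart = datetime.strptime(f"{d_date}/{year}", "%m/%d/%Y")
    if depart < arrive:
        # assumption: the guest left in the following year
        depart = datetime.strptime(f"{d_date}/{year + 1}", "%m/%d/%Y")
    return (depart - arrive).days

print(nights("01/15", "02/15"))  # 31
print(nights("12/30", "01/02"))  # 3, where abs() on same-year dates would give 362
```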
