Initial "ImportDate" datatype Initial Pandas Dataframe interested in "ImportDate
Problem statement -
I want to extract the rows where "ImportDate" runs up to "1-1-2019", e.g. from start_date to 1-1-2019. I tried converting the column from "object" to "datetime64[ns]" and wrote the code as
df[df['ImportDate'].between(4/26/2018, 1/1/2019)]
But it resulted in an error while extracting the data:
"'>=' not supported between instances of 'str' and 'float'"
Can anyone help me deal with this problem?
My guess is that the inputs to the between function are not dates. You should try converting them:
df[df['ImportDate'].between(pd.to_datetime("4/26/2018"), pd.to_datetime("1/1/2019"))]
Or create date objects directly: datetime.date(2019, 1, 1) (do not forget to import datetime).
As stated, it would be easier to check if you could provide a sample of your data.
Is the column you say is a datetime really a datetime? From the error you posted, it looks like it is not. Please check once more with df.dtypes. If it is not a datetime object, convert it to datetime with, for example, df['ImportDate'] = pd.to_datetime(df['ImportDate'], format='%d/%m/%y') (you will have to tweak the parameters to suit your data). Then you can do df[df['ImportDate'].between(start_date, end_date)]
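Putting both answers together, a minimal sketch with made-up sample dates (the real column comes from the asker's data):
import pandas as pd
# Toy stand-in for the real DataFrame
df = pd.DataFrame({'ImportDate': ['26/04/2018', '15/08/2018', '01/01/2019', '05/03/2019']})
# Convert the object column to datetime64[ns]; adjust `format` to match the data
df['ImportDate'] = pd.to_datetime(df['ImportDate'], format='%d/%m/%Y')
start_date = pd.to_datetime('2018-04-26')
end_date = pd.to_datetime('2019-01-01')
# between() is inclusive on both ends by default
print(df[df['ImportDate'].between(start_date, end_date)])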
Problem
I'm trying to accurately represent a date from Google Sheets in a DataFrame. I know that the "base" dates in Google Sheets are integers counting days since 1/1/1900. Testing this is clear: I have a Sheet with the date 5/2/2019. Using the Python API, I download this Sheet with the parameter valueRenderOption='UNFORMATTED_VALUE' to ensure I'm getting raw values, and do a simple conversion to a DataFrame. The value shows up as 43587, and if I put that back into a Sheet and set the format to date, it appears as 5/2/2019. Sanity check complete.
The problem arises when I try to convert that date in the DataFrame to an actual datetime: it shows up as offset by two days, and I'm not sure why.
Attempts
In a DataFrame df, with datetime column timestamp, I do the following:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='d', origin='1900-01-01')
and I get a date of 2019-05-04, which is two days later than I would expect. I searched for this on SO and found a similar issue, but the accepted answer actually contains the exact same problem (albeit with no mention of it): a two-day offset.
This can be "solved" by setting the origin two days back, to 1899-12-30, though that feels almost like a cover, and not necessarily fixing the underlying issue (and could perhaps leads to further date inconsistencies down the road as more time has passed?).
Here's code for a toy DataFrame so that you don't have to type it out, if you want to experiment:
import pandas as pd
df = pd.DataFrame([{'timestamp': 43587}])
Question
I imagine this is on the Pandas side of things, but I'm not sure. Some internal conversion that happens differently than how they do it at Google? Does anyone have an idea of what's at play here, and if setting the origin date two days earlier is actually a solution?
I have been banging my head against this as well, and think that I finally figured it out. While for the Date() function, Sheets uses 1900-1-1 as the base, for the date format and for the TO_DATE() function, the origin date is 1899-12-30.
You can see this in Sheets by either
entering 0 in a cell, and then formatting to a date → 12/30/1899
entering =TO_DATE(0), which will result in 12/30/1899
One origin story for this odd choice is here in a very old MSDN forum. I have no idea of its veracity.
At any rate, that explains the two-day discrepancy and then the solution becomes
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='d', origin='1899-12-30')
which worked for me.
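As a quick check that the corrected origin reproduces the Sheets value, this re-runs the conversion on the serial number from the question:
import pandas as pd
# 43587 is the raw value Sheets formats as 5/2/2019
pd.to_datetime(43587, unit='d', origin='1899-12-30')
# Timestamp('2019-05-02 00:00:00')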
My dataframe looks like the following. I am trying to aggregate (sum) my Amount column based on the Date and Group columns in the pandas dataframe. I was able to successfully aggregate the column; however, I am not sure how to pass the result to fbprophet to predict future values based on the grouped Date and Group. Below is the code for the aggregation.
Note: I am a beginner in Python; please provide an explanation with the code.
Data Frame
import pandas as pd
data = {'Date':['2017-01-01', '2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01',
'2017-02-01', '2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01'],'Group':['A','A','B','B','C','C','D','D','A','A','B','B','C','C','D','D'],
'Amount':['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']}
df = pd.DataFrame(data)
Code tried so far:
grouped = df.groupby(['Group','Date'])[['Amount']].sum()
You're suffering from a few problems.
numeric
The 3rd line of your data initialization should pass in float rather than str.
Elide the quotes.
Or, this will fix it:
'Amount':[float(n) for n in ['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']]}
We do this because you really don't want .sum() to put together '12.1' and '13.2' and come up with '12.113.2'. You'd much prefer 25.3.
index
The grouped object you computed looks good on the face of it, but if you inspect the .dtypes attribute you'll see it's only offering the Amount column to Facebook Prophet.
To remedy that, use .reset_index():
>>> grouped.reset_index(inplace=True)
>>> grouped.dtypes
Group object
Date object
Amount float64
dtype: object
But now we see one last fly in the ointment.
dates
Having opaque categories of 'A' or 'B' is fine, but for Date we probably want to know that February or March comes a certain number of days after January, rather than leaving opaque str labels in that column. We might have done the type conversion back when we presented the data input, but it's fine to clean it up at this stage, too:
import datetime as dt

def to_timestamp(day: str):
    return dt.datetime.strptime(day, '%Y-%m-%d')

grouped['Date'] = grouped.Date.apply(to_timestamp)
Having successfully wrangled the shape and types of your data, you should now be in a good position to let libraries further analyze it.
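From there, one way to hand the result to fbprophet is to fit one model per group. A sketch, assuming fbprophet is installed (and noting that the two months of toy data above are far too little history for a meaningful forecast):
from fbprophet import Prophet

forecasts = {}
for group, sub in grouped.groupby('Group'):
    # Prophet expects exactly two columns: 'ds' (datestamp) and 'y' (value)
    ts = sub.rename(columns={'Date': 'ds', 'Amount': 'y'})[['ds', 'y']]
    m = Prophet()
    m.fit(ts)
    # forecast three more months at month-start frequency
    future = m.make_future_dataframe(periods=3, freq='MS')
    forecasts[group] = m.predict(future)[['ds', 'yhat']]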
I have a dataset saved in an xls file.
In this dataset there are 4 columns that represent dates, in the format dd/mm/yyyy.
My problem is that when I read it in Python using pandas and the read_excel function, all the columns are read as strings except one, which is read as datetime64[ns], even if I specify dtype={'column': str}. Why?
Dates in Excel are frequently stored as numbers, which allows you to do things like subtract them, even though they might be displayed as human-readable dates like dd/mm/yyyy. Pandas is handily taking those numbers and interpreting them as dates, which lets you deal with them more flexibly.
To turn them into strings, you can use the converters argument of pd.read_excel like so:
df = pd.read_excel(filename, converters={'name_of_date_column': lambda dt: dt.strftime('%d/%m/%Y')})
The strftime method lets you format dates however you like. Specifying a converter for your column lets you apply the function to the data as you read it in.
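Alternatively (an equivalent approach, not the only one), you can let pandas parse the column and convert it back to strings after reading. The file and column names here are hypothetical:
import pandas as pd

df = pd.read_excel('data.xls')  # hypothetical filename
# turn the parsed datetime64[ns] column back into dd/mm/yyyy strings
df['date_column'] = df['date_column'].dt.strftime('%d/%m/%Y')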
import pandas as pd
batch=pd.read_excel('batch.xlsx')
stock_report=pd.read_excel('Stock_Report.xlsx')
Result_stock=pd.merge(stock_report,batch[['Batch','Cost price']], on='Batch').fillna(0)
Result_stock2=pd.merge(Result_stock,batch[['Item number',' Batch MRP']], on='Item number').fillna(0)
Result_stock2['Total']=Result_stock2['Posted quantity']*Result_stock2['Cost price']
I need to change the value of the Total column in Result_stock2 wherever it is 0, replacing it with the product of two other column values.
You need to learn some formatting. Please format your code so we can read it.
If I understood what you mean and your script is working fine so far, you should simply add:
Result_stock2.loc[Result_stock2['Total']==0,'Total']=(****OPERATION YOU NEED****)
An example for 'OPERATION':
Result_stock2.loc[Result_stock2['Total']==0,'Posted quantity']*(Result_stock2.loc[Result_stock2['Total']==0,'Cost price']-5)
It's not beautiful code, but it will do what you need.
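Putting the pieces together, a minimal self-contained sketch (toy numbers standing in for the merged frame, and dropping the '- 5' from the example above):
import pandas as pd

# toy stand-in for Result_stock2
Result_stock2 = pd.DataFrame({
    'Posted quantity': [10, 5, 8],
    'Cost price': [2.0, 4.0, 3.0],
    'Total': [20.0, 0.0, 24.0],
})

mask = Result_stock2['Total'] == 0
# overwrite only the zero rows with quantity * cost price
Result_stock2.loc[mask, 'Total'] = (
    Result_stock2.loc[mask, 'Posted quantity']
    * Result_stock2.loc[mask, 'Cost price']
)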
Wondering if someone can help me here. When I take a regular Python list containing strings and check whether a pandas series (pla.id) has a value that matches a value in that list, it works.
Which is great and all, but I wonder how it's able to compare strings to ints... is there documentation somewhere that states it will convert under the hood before comparing those values?
I wasn't able to find anything on the .isin() page of the pandas documentation.
Also super interesting: when I try pandas indexing, it fails due to a type comparison.
So my two questions:
Does the pandas Series.isin(some_list_of_strings) method automatically convert the values in the series (which are int values) to strings before doing the value comparison?
If so, why doesn't pandas indexing, i.e. df[df.series == 'some value'], do the same thing? What is the thinking behind this? If I wanted to accomplish the same thing, I would have to do df[df.series.astype(str) == ''] or df[df.series.astype(str).isin(some_list_of_strings)] to access the matching values in the df.
After some digging I think this might be due to the pandas object datatype, but I have no understanding of why this works. Also, this doesn't explain why the below works, since it is an int dtype.
Thanks in advance!
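For reference, a toy reproduction of the explicit-cast workaround mentioned above (the frame and column names are made up):
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4]})  # int column
wanted = ['2', '4']                      # list of strings

# casting explicitly makes the comparison unambiguous
print(df[df['id'].astype(str).isin(wanted)])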