datetime instead of str in read_excel with pandas

I have a dataset saved in an xls file.
In this dataset there are 4 columns that represent dates, in the format dd/mm/yyyy.
My problem is that when I read it in Python using pandas and the read_excel function, all the columns are read as strings except one, which is read as datetime64[ns], even if I specify dtype={column: str}. Why?

Dates in Excel are frequently stored as numbers, which allows you to do things like subtract them, even though they might be displayed as human-readable dates like dd/mm/yyyy. Pandas is handily taking those numbers and interpreting them as dates, which lets you deal with them more flexibly.
To turn them into strings, you can use the converters argument of pd.read_excel like so:
df = pd.read_excel(filename, converters={'name_of_date_column': lambda dt: dt.strftime('%d/%m/%Y')})
The strftime method lets you format dates however you like. Specifying a converter for your column lets you apply the function to the data as you read it in.
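As a minimal sketch (the file name 'data.xls' and the column name 'order_date' are placeholders for your own), guarding against cells that pandas has already read as plain strings:

import pandas as pd

# 'data.xls' and 'order_date' are hypothetical; substitute your file and column names.
df = pd.read_excel(
    'data.xls',
    converters={
        # Format real date/datetime cells; pass through cells that are already strings.
        'order_date': lambda v: v.strftime('%d/%m/%Y') if hasattr(v, 'strftime') else v
    },
)
print(df['order_date'].head())

The same converter can be repeated for each of the four date columns.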

Related

Change data in Pandas dataframe by column

I have some data I imported from an Excel spreadsheet as a CSV. I created a dataframe using Pandas and want to change a specific column. The column contains strings such as "5.15.1.0.0". I want to change these strings to floats like 5.15100.
So far I've tried using the method "replace" to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15.1.0.0)
This however does not work. When I reprint the dataframe after the replace methods are called, it shows the same dataframe with no changes made. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the "." but I'd prefer not to do it this way as some of the strings represent numbers of different lengths and decimal place values.
Add the parameter inplace, which defaults to False. Setting it to True changes the dataframe in place, and the resulting column can then be type cast.
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
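A minimal sketch of the full flow on toy data (the second version string is made up for illustration); assigning the result back, rather than using inplace=True, sidesteps chained-assignment surprises in newer pandas:

import pandas as pd

df = pd.DataFrame({'Fix versions': ['5.15.1.0.0', '1.2.0.0.0']})  # toy data

# Map each version string to its numeric spelling, then cast the column to float.
df['Fix versions'] = df['Fix versions'].replace({'5.15.1.0.0': '5.15100',
                                                 '1.2.0.0.0': '1.2000'})
df['Fix versions'] = df['Fix versions'].astype(float)
print(df['Fix versions'])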

Problem and a Solution for pd.DataFrame values changing to NaN while changing index/row default names

I had a dataframe like the one in image-1 (the input dataframe), whose rows/indices I want to rename with dates (dtype='datetime64[ns]') in YYYY-MM-DD format.
So I used the index-renaming option shown in image-2 below, where each row gets the last date of every 6th month, incrementing to the end. It did rename the rows but ended up turning all the data values into NaNs. I tried the transpose of the dataframe; same result.
I then tried a few other things, shown in image-3, all unfruitful and mostly ending in TypeError: 'DatetimeIndex' object is not callable.
As the final solution, I ended up creating a dataframe of all the dates (image-4), merging the two dataframes by columns (image-5), and then setting the very first column as the row names (image-6).
The dates have a weird format when converted to a list, and I wonder why that is (image-7). How do we get exactly the year-month-date? I tried different combinations but got no fruitful results. strftime is the way to go here, but how?
The reason I went down this strftime route is that I wanted to output a list of dates in a sensible YYYY-MM-DD format and then use something like df.rename(index=list_dates) to replace the default 0 1 2 with dates as the new index names.
So I have a solution, but is it an economical solution, or are there better solutions available?
This is an attempt to share my solution for those who can use it, and to learn new solutions from the wizards here.
Best regards,
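A minimal sketch of the strftime approach on toy data (the start date and 6-month frequency are illustrative): build the labels with date_range, format them with strftime, and assign them straight to df.index. Assigning labels does not realign the data, so the values do not turn into NaNs the way they do when you reindex against unknown labels.

import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]})    # toy stand-in for the real data

# Last day of every 6th month, one label per row (illustrative start date).
dates = pd.date_range(start='2000-06-30', periods=len(df), freq='6M')  # newer pandas spells this '6ME'
labels = dates.strftime('%Y-%m-%d')                    # plain 'YYYY-MM-DD' strings

df.index = labels                                      # assign labels; the values stay put
# Equivalent with rename, which expects a mapping rather than a list:
# df = df.rename(index=dict(enumerate(labels)))
print(df)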

Pandas read_csv - dealing with columns that have ID numbers with consecutive '$' and '#' symbols, along with letters and digits

I'm trying to read a csv file with a column of data that has a scrambled ID number that includes the occasional consecutive $$ along with #, numbers, and letters.
SCRAMBLE_ID
AL9LLL677
AL9$AM657
$L9$$4440
#L9$306A1
etc.
I tried the following:
df = pd.read_csv('MASTER~1.CSV',
                 dtype={'SCRAMBLE_ID': str})
which rendered the third entry as L9$4440 (the L9 appears in a serif font, italicized, and the first and second $ vanish).
Faced with an entire column of ID numbers configured in this manner, what is the best way of dealing with such data? I can imagine:
PRIOR TO pd.read_csv: replacing the offending symbols with substitutes that don't create this problem (and what would those be), OR,
is there a way of preserving the IDs as is but making them into a data type that ignores these symbols while keeping them present?
Thank you. I've attached a screenshot of the .csv side by side with resulting df (Jupyter notebook) below.
[screenshot: the csv column side by side with the resulting pandas df, showing the $$ values]
I cannot replicate this using the same values as you in a mock CSV file.
Are you sure that the formatting based on the $ symbol is not occurring wherever you are rendering your dataframe values? Have you checked whether the data in the dataframe is what you expect, or are you only rendering it externally?
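One quick way to check, as a sketch: print the raw Python strings instead of relying on the notebook's rendered table. If the $ characters are present here, the data is intact and only the display (for example, MathJax treating $...$ as math in Jupyter) is mangling it.

import pandas as pd

df = pd.read_csv('MASTER~1.CSV', dtype={'SCRAMBLE_ID': str})

# repr() makes quoting and special characters explicit.
print(df['SCRAMBLE_ID'].head().map(repr).to_list())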

What to do if some datetime values 'don't match format'?

I have a big-data challenge involving a large number of CSV files. The second column contains datetimes, and I just want to read the dates.
I used
dt1=list1[1][1]
dt_obj1=datetime.datetime.strptime(dt1, '%Y-%m-%d %H:%M:%S')
and after this
first_date = dt_obj1.date()
and it worked well.
The problem is that there are a few entries (just 10 out of over a million) where there is only a date instead of a datetime, so they don't match the format.
Do you have any ideas how I can read just the date in these entries (or ignore them)?
You can use the dateutil library. The advantage of using this library is that you don't have to worry about the format; its parser automatically selects the format matching your data.
from dateutil.parser import parse
dt_1 = parse("Sat Oct 11 17:13:46 UTC 2003")
You can always use try/except to control how the value is read. Suppose you have all the possible formats in a formats list; then you can do
formats = ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d']  # e.g. the full datetimes plus the date-only entries
dt_obj1 = None
for fmt in formats:
    try:
        dt_obj1 = datetime.datetime.strptime(dt1, fmt)
        break
    except ValueError:
        pass
This ensures you only break out of the loop when a format matches; otherwise you keep trying the remaining formats.
Otherwise you can use the external dateutil library's parser.parse function, which can parse most datetime strings without being told the format:
from dateutil import parser
print(parser.parse("1990-01-21 14:12:11"))
print(parser.parse("1990-01-21"))
#1990-01-21 14:12:11
#1990-01-21 00:00:00

How to perform group by on multiple columns in a pandas dataframe and predict future values using fbProphet in python?

My dataframe looks like the following. I am trying to aggregate (sum) my Amount column based on the Date and Group columns of a pandas dataframe. I was able to aggregate the column successfully. However, I am not sure how to pass the grouped data to fbprophet to predict future values for each Date and Group. Below is the code for the aggregation.
Note: I am a beginner in Python, so please provide an explanation with the code.
Data Frame
import pandas as pd
data = {'Date': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
                 '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01'],
        'Group': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
        'Amount': ['12.1', '13.2', '15.1', '10.7', '12.9', '9.0', '5.6', '6.7', '4.3', '2.3', '4.0', '5.6', '7.8', '2.3', '5.6', '8.9']}
df = pd.DataFrame(data)
Code tried so far:
grouped = df.groupby(['Group','Date'])[['Amount']].sum()
You're suffering from a few problems.
numeric
The 3rd line of your data initialization should pass in float rather than str.
Elide the quotes.
Or, this will fix it:
'Amount':[float(n) for n in ['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']]}
We do this because you really don't want .sum() to put together 12.1 and 13.2
and come up with '12.113.2'.
You'd much prefer 25.3.
index
The grouped object you computed looks good on the face of it,
but if you inspect the .dtypes attribute you'll see
it's only offering the Amount column to Facebook Prophet.
To remedy that, use .reset_index():
>>> grouped.reset_index(inplace=True)
>>> grouped.dtypes
Group object
Date object
Amount float64
dtype: object
But now we see one last fly in the ointment.
dates
Having opaque categories of 'A' or 'B' is fine,
but for Date we probably want to know that February or March
comes a certain number of days after January,
rather than leaving opaque str labels in that column.
We might have done the type conversion back when we presented the data input,
but it's fine to clean it up at this stage, too:
import datetime as dt
def to_timestamp(day: str):
    return dt.datetime.strptime(day, '%Y-%m-%d')
grouped['Date'] = grouped.Date.apply(to_timestamp)
Having successfully wrangled the shape and types of your data,
you should now be in a good position to let libraries further analyze it.
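A hedged sketch of the last step the question asks about: fitting one Prophet model per Group on the cleaned-up grouped frame. Prophet requires exactly the column names ds (datetime) and y (numeric); the 3-month horizon is arbitrary, and with only two months of toy data per group the forecasts will not be meaningful, but the wiring is the point.

from fbprophet import Prophet   # on newer releases: from prophet import Prophet

forecasts = {}
for name, g in grouped.groupby('Group'):
    # Prophet expects exactly two columns: 'ds' (datetime) and 'y' (numeric).
    train = g.rename(columns={'Date': 'ds', 'Amount': 'y'})[['ds', 'y']]
    m = Prophet()
    m.fit(train)
    future = m.make_future_dataframe(periods=3, freq='MS')   # 3 future month starts
    forecasts[name] = m.predict(future)[['ds', 'yhat']]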
