From DataFrame to Datestamp (python-3.x)

Recently I came across a really weird CSV file with 2 columns (with headers), one for dates and one for prices. The date format was "dd.mm.yyyy".
d = {'Date': ['31.12.1991', '02.01.1992', '03.01.1992', '06.01.1992'],
     'Prices': [9.62, 9.5, 9.73, 9.45]}
df = pd.DataFrame(data=d)
prices = pd.DataFrame(df['Prices'])
date = pd.DataFrame(df['Date'])
date = date.to_string(header=True)
date = df.to_datetime(utc=True, infer_datetime_format=True)
frame = date.join(prices)
print(df)
I tried to make it work by isolating the date column, transforming it into a string with the to_string() function, and then converting it back to dates with to_datetime(), but it was no use.
Any suggestions?
Thanks in advance

An interesting way to generalize this for the whole dataframe
Note: this uses errors='ignore' in order to skip columns that are not suitable for parsing as dates. The trade-off is that if a column is intended to be parsed as dates but contains a bad date value, this approach will leave that column unaltered, so make sure you don't have bad date values. (Recent pandas releases deprecate errors='ignore' in pd.to_datetime.)
df.assign(
    **df.select_dtypes(exclude=[np.number]).apply(
        pd.to_datetime, errors='ignore', dayfirst=True
    )
)
        Date  Prices
0 1991-12-31    9.62
1 1992-01-02    9.50
2 1992-01-03    9.73
3 1992-01-06    9.45
Another example
df = pd.DataFrame(dict(
    A=1, B='B', C='6.7.2018', D=1-1j,
    E='1.2.2017', F=pd.Timestamp('2016-08-08')
), [0])
df
A B C D E F
0 1 B 6.7.2018 (1-1j) 1.2.2017 2016-08-08
df.assign(
    **df.select_dtypes(exclude=[np.number]).apply(
        pd.to_datetime, errors='ignore', dayfirst=True
    )
)
A B C D E F
0 1 B 2018-07-06 (1-1j) 2017-02-01 2016-08-08
Setup
borrowed from jezrael
d = {'Date': ['31.12.1991', '02.01.1992', '03.01.1992', '06.01.1992'],
     'Prices': [9.62, 9.5, 9.73, 9.45]}
df = pd.DataFrame(data=d)

You could try to parse the dates when you read in the file. You can specify that the format has the day first instead of the month.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=['Date'], dayfirst=True)
print(df)
# Date Prices
#0 1991-12-31 9.62
#1 1992-01-02 9.50
#2 1992-01-03 9.73
#3 1992-01-06 9.45
df.dtypes
#Date datetime64[ns]
#Prices float64
#dtype: object
However, your data really needs to be clean and properly formatted for this to work. From the read_csv documentation on parse_dates:
If a column or index contains an unparseable date, the entire column
or index will be returned unaltered as an object data type. For
non-standard datetime parsing, use pd.to_datetime after pd.read_csv
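For instance, a minimal sketch of that two-step approach on the sample test.csv below (the explicit format string matching "dd.mm.yyyy" is an assumption about the data):
import pandas as pd

# Read the dates as plain strings first, then parse them explicitly.
df = pd.read_csv('test.csv')
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')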
Sample Data: test.csv
Date,Prices
31.12.1991,9.62
02.01.1992,9.5
03.01.1992,9.73
06.01.1992,9.45

I believe you need:
d = {'Date': ['31.12.1991', '02.01.1992', '03.01.1992', '06.01.1992'],
     'Prices': [9.62, 9.5, 9.73, 9.45]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
print(df)

        Date  Prices
0 1991-12-31    9.62
1 1992-01-02    9.50
2 1992-01-03    9.73
3 1992-01-06    9.45
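As a side note (not from the original answer): when the format is fixed and known, passing an explicit format string is usually faster and stricter than dayfirst inference:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')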

Related

Transform a specific nested list to a pandas dataframe

My nested list looks like:
[['NP-00002',
Motor1 0.126878
Lpi 0.099597
dtype: float64],
['NP-00067',
Health 0.253135
Travel 0.157896
dtype: float64],
['LE-00035',
Train 0.134382
Property 0.126089
dtype: float64],
['NP-00009',
Start 0.171959
Casco 0.163557
dtype: float64]]
I would like my data to be in 3 columns in a pandas dataframe (with the dtype: float64 line dropped). I had problems with whitespace separation, and also with .astype(str).
Example for the 1st item in the nested list (2 rows of output):
1st column 2nd column 3rd column
NP-00002 Motor1 0.126878
NP-00002 Lpi 0.099597
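The inner elements are pandas Series (the dtype: float64 lines in the repr give that away). A minimal reconstruction of the input for reference, assuming lst is the nested list used below:
import pandas as pd

# Each entry pairs an identifier with a float Series, matching the repr above.
lst = [
    ['NP-00002', pd.Series({'Motor1': 0.126878, 'Lpi': 0.099597})],
    ['NP-00067', pd.Series({'Health': 0.253135, 'Travel': 0.157896})],
    ['LE-00035', pd.Series({'Train': 0.134382, 'Property': 0.126089})],
    ['NP-00009', pd.Series({'Start': 0.171959, 'Casco': 0.163557})],
]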
Use pd.concat:
df = (pd.concat(dict(lst))
        .rename_axis(['Type', 'Property'])
        .rename('Value')
        .reset_index())
print(df)
# Output
Type Property Value
0 NP-00002 Motor1 0.126878
1 NP-00002 Lpi 0.099597
2 NP-00067 Health 0.253135
3 NP-00067 Travel 0.157896
4 LE-00035 Train 0.134382
5 LE-00035 Property 0.126089
6 NP-00009 Start 0.171959
7 NP-00009 Casco 0.163557
In reality, I found out that I had problems with extra spaces that I did not see in the pandas dataframe. The way I solved it was not that elegant, but it works.
list_output = pd.DataFrame(n_largest, columns=["Policyholder", "Recommendation"])
list_output["Recommendation"] = list_output["Recommendation"].astype(str)
list_output["Recommendation"] = list_output["Recommendation"].str.replace('\n',' ', regex=True)
list_output["Recommendation"] = list_output["Recommendation"].str.replace('dtype: float64',' ', regex=True)
list_output["Recommendation"] = list_output["Recommendation"].replace(r'\s+', ' ', regex=True)
output = pd.concat([list_output["Policyholder"],list_output["Recommendation"].str.split(' ', expand=True)], axis=1)
So in the end my output looks a bit different, which is still fine:
Policyholder Property1 Value1 Property2 Value2
0 NP-00002 Motor1 0.126878 Lpi 0.099597
1 NP-00067 Health 0.253135 Travel 0.157896
Thank you for all the help!

pandas to_datetime() function is not converting the date 08-12-1600 in a dataframe

raw_data = {'Event': ['A', 'B', 'C', 'D', 'E'],
            'dates': ['08-12-1600', '26-09-1400', '04-11-1991', '25-03-1991', '10-05-1991']}
df_1 = pd.DataFrame(raw_data, columns=['Event', 'dates'])
df_1['dates'] = pd.to_datetime(df_1['dates'])
The above code gives an error because of the date 08-12-1600; if that date is removed, it works fine. What could be the reason for it? The error is:
Out of bounds nanosecond timestamp: 1600-08-12 00:00:00
That is because the provided dates are outside the range of Timestamp.
pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
Details are in the pandas documentation under "Timestamp limitations".
If we need the dates even when they are out of range, we can convert them to periods using the code below:
raw_data = {'Event': ['A', 'B', 'C', 'D', 'E'],
            'dates': ['08-12-1600', '26-09-1400', '04-11-1991', '25-03-1991', '10-05-1991']}
df_1 = pd.DataFrame(raw_data, columns=['Event', 'dates'])

def conv(x):
    # Dates are "dd-mm-yyyy"; build a day-frequency Period, which is not
    # limited to the nanosecond Timestamp range.
    day, month, year = tuple(x.split('-'))
    return pd.Period(year=int(year), month=int(month), day=int(day), freq="D")

df_1['dates'] = df_1.dates.apply(conv)
df_1
Output
  Event      dates
0     A 1600-12-08
1     B 1400-09-26
2     C 1991-11-04
3     D 1991-03-25
4     E 1991-05-10
If we can ignore the dates outside the range, we can coerce them to NaT (note dayfirst=True, so the day-month order matches the data; without it the ambiguous dates would be parsed month-first):
df_1['dates'] = pd.to_datetime(df_1.dates, errors='coerce', dayfirst=True)
df_1
Output
  Event      dates
0     A        NaT
1     B        NaT
2     C 1991-11-04
3     D 1991-03-25
4     E 1991-05-10
Bonus Fact
Why can a Timestamp only hold values for the roughly 584 years from 1677 to 2262?
Timestamps provide nanosecond precision and are stored in a signed 64-bit integer, which holds about 292 years' worth of nanoseconds on either side of the 1970 epoch, roughly 584 years in total.
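A quick back-of-the-envelope check of that figure:
# A signed 64-bit integer holds about 2**63 nanoseconds in each direction.
years = 2**63 / (1e9 * 60 * 60 * 24 * 365.25)
print(years)      # ~292.3 years on each side of the 1970 epoch
print(2 * years)  # ~584.5 years in total, matching 1677-2262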

Why is call to sum() on a data frame generating wrong numbers?

I want to sum the numerical values in each row (storeA to storeD) for the month of June and place them in an appended column 'Sum'. But the resulting sums are huge and clearly wrong. How do I get correct sums?
This code was run using Python 3.6:
import pandas as pd
import numpy as np
data = np.array([
    ['', 'week', 'storeA', 'storeB', 'storeC', 'storeD'],
    [0, "2014-05-04", 2643, 8257, 3893, 6231],
    [1, "2014-05-11", 6444, 5736, 5634, 7092],
    [2, "2014-05-18", 9646, 2552, 4253, 5447],
    [3, "2014-05-25", 5960, 10740, 8264, 6063],
    [4, "2014-06-04", 5960, 10740, 8264, 6063],
    [5, "2014-06-12", 7412, 7374, 3208, 3985]
])
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])
print(df)
# get rows of table which match Year,Month for last month
df2 = df[df['week'].str.contains("2014-06")].copy()
print(df2)
# generate col summing up each row
col_list = list(df2)
print(col_list)
col_list.remove('week')
print(col_list)
df2['Sum'] = df2[col_list].sum(axis=1)
print(df2)
Output of Sum column for rows 4 and 5:
Row4 - 5.960107e+16
Row5 - 7.412737e+15
Use astype to convert those strings to ints, and sum works properly:
df2['Sum'] = df2[col_list].astype(int).sum(axis=1)
Output:
         week storeA storeB storeC storeD    Sum
4  2014-06-04   5960  10740   8264   6063  31027
5  2014-06-12   7412   7374   3208   3985  21979
What was happening: you were summing (concatenating) strings.
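If some cells might not parse cleanly as integers, a hedged alternative (not from the original answer) is pd.to_numeric with coercion, which turns bad cells into NaN instead of raising:
df2['Sum'] = df2[col_list].apply(pd.to_numeric, errors='coerce').sum(axis=1)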
Because of the way your array is defined, with mixed strings and numbers, everything is coerced to string. Take a look at this:
df.dtypes
week object
storeA object
storeB object
storeC object
storeD object
dtype: object
You have columns of strings, and sum on string dataframes results in concatenation.
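A tiny demonstration of that concatenation behaviour:
pd.DataFrame({'a': ['1', '2'], 'b': ['30', '40']}).sum(axis=1)
# 0    130
# 1    240
# dtype: object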
The solution is to convert these to integers first -
df2[col_list] = df2[col_list].astype(int)
Your code then works.
df2[col_list].sum(axis=1)
4 31027
5 21979
dtype: int64
Alternatively, declare data as an object array -
data = np.array([[...], [...], ...], dtype=object)
df = pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:])
Next, perform a soft conversion using infer_objects (new in v0.22):
df = df.infer_objects()
df.dtypes
week object
storeA int64
storeB int64
storeC int64
storeD int64
dtype: object
Works like a charm.

Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame which contains two string columns, df['month'] and df['year']. I want to create a new column df['date'] by combining the month and year columns. I have done that successfully using the line below -
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y')
whereby for df['month'] = '08' and df['year'] = '1968'
we get df['date'] = 1968-08-01
This is exactly what I wanted.
Problem at hand: my DataFrame has more than 200,000 rows, and I notice that for a few rows I also get a full Timestamp like the one below, which I want to avoid -
1972-03-01 00:00:00
I solved this issue by using the .dt accessor, with which I explicitly extracted only the date using the code below -
df['date'] = pd.to_datetime((df['month'] + df['year']), format='%m%Y')  # Line 1
df['date'] = df['date'].dt.date                                         # Line 2
The problem was solved, except that Line 2 took 5 times longer than Line 1.
Question: is there any way to tweak Line 1 into giving just the dates and not the Timestamp? I am sure this simple problem cannot require such an inefficient solution. Can I solve it in a more time- and resource-efficient manner?
AFAIK we don't have a date dtype in Pandas; we only have datetime, so there will always be a time part.
Even though Pandas displays 1968-08-01, the value has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])
In [33]: df
Out[33]:
Date
0 1968-08-01
1 2017-08-01
In [34]: df['Date'].dt.time
Out[34]:
0 00:00:00
1 00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
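A quick demonstration of that string approach (note the result is plain strings, i.e. object dtype, so no datetime parsing is involved):
df = pd.DataFrame({'month': ['08', '03'], 'year': ['1968', '1972']})
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
print(df['date'])
# 0    1968-08-01
# 1    1972-03-01
# Name: date, dtype: object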
UPDATE: be aware that .dt.date gives you Python date objects in an object-dtype column, not a datetime64 column:
In [53]: df.dtypes
Out[53]:
Date datetime64[ns]
dtype: object
In [54]: df['new'] = df['Date'].dt.date
In [55]: df
Out[55]:
Date new
0 1968-08-01 1968-08-01
1 2017-08-01 2017-08-01
In [56]: df.dtypes
Out[56]:
Date datetime64[ns]
new object # <--- NOTE !!!
dtype: object

How can I count categorical columns by month in Pandas?

I have time series data with a column which can take a value A, B, or C.
An example of my data looks like this:
date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
I want to group my data by month, storing the combined count of A and B in a column a_or_b_count and the count of C in c_count.
I've tried several things, but the closest I've been able to do is to preprocess the data with the following function:
def preprocess(df):
    # Strip the BOM character and parse the stringified date, removing
    # everything more granular than a day.
    df['date'] = pd.to_datetime(df['date'].apply(lambda t: t.replace('\ufeff', '')),
                                format="%Y-%m-%d")
    # Set the date column as the index and drop the now-redundant column, in place.
    df = df.set_index(df.date)
    df.drop('date', inplace=True, axis=1)
    # Group all events by (year, month) and count the category values.
    counted_events = df.groupby([(df.index.year), (df.index.month)], as_index=True).category.value_counts()
    counted_events.index.names = ["year", "month", "category"]
    return counted_events
which gives me the following:
year  month  category
2017  1      A           2
             B           1
      2      C           3
             A           1
The process to sum up all A's and B's would be quite manual since category becomes a part of the index in this case.
I'm an absolute pandas menace, so I'm likely making this much harder than it actually is. Can anyone give tips for how to achieve this grouping in pandas?
I tried this, so I'm posting it, though I like @Scott Boston's solution better; here I combined the A and B values earlier.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
            a_or_b_count  c_count
Year Month
2017 1               3.0      0.0
     2               1.0      3.0
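For comparison, a sketch of the same idea using pd.Grouper instead of grouping by year and month separately (assuming df holds the sample data from the question; the column renames are assumptions):
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].replace({'A': 'AB', 'B': 'AB'})
out = (df.groupby(pd.Grouper(key='date', freq='M'))['category']
         .value_counts()
         .unstack(fill_value=0)
         .rename(columns={'AB': 'a_or_b_count', 'C': 'c_count'}))
This indexes the result by month-end timestamps rather than a (Year, Month) MultiIndex.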
