Pandas [.dt] property vs to_datetime - python-3.x

This question is intended to build a clear understanding of the subtle differences between the .dt accessor and pd.to_datetime.
I want to understand which method is suited/preferred, whether one can be used as the de facto choice, and what other differences exist between the two.
import pandas as pd

values = {'date_time': ['20190902093000', '20190913093000', '20190921200000']}
df = pd.DataFrame(values, columns=['date_time'])
df['date_time'] = pd.to_datetime(df['date_time'], format='%Y%m%d%H%M%S')
>>> df
date_time
0 2019-09-02 09:30:00
1 2019-09-13 09:30:00
2 2019-09-21 20:00:00
Using .dt
df['date'] = df['date_time'].dt.date
>>> df
date_time date
0 2019-09-02 09:30:00 2019-09-02
1 2019-09-13 09:30:00 2019-09-13
2 2019-09-21 20:00:00 2019-09-21
>>> df.dtypes
date_time datetime64[ns]
date object
dtype: object
>>> df.date.values
array([datetime.date(2019, 9, 2), datetime.date(2019, 9, 13),
datetime.date(2019, 9, 21)], dtype=object)
Using .dt.date, even though the elements are individually datetime.date objects, the column is inferred as object in the DataFrame. Sometimes that is what you want, but mostly it causes problems down the line and an implicit conversion becomes inevitable.
Using pd.to_datetime
df['date_to_datetime'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
>>> df.dtypes
date_time datetime64[ns]
date object
date_to_datetime datetime64[ns]
>>> df.date_to_datetime.values
array(['2019-09-02T00:00:00.000000000', '2019-09-13T00:00:00.000000000',
'2019-09-21T00:00:00.000000000'], dtype='datetime64[ns]')
Using pd.to_datetime natively returns a datetime64[ns] array, and the same dtype is inferred in the DataFrame, which in my experience is consistent and widely used when dealing with dates in pandas.
I am aware that a native Date dtype does not exist in pandas and that dates are wrapped in datetime64[ns].

The two concepts are quite different.
pandas.to_datetime() is a function that can take a variety of inputs and convert them to a pandas datetime index. For example:
dates = pd.to_datetime([1610290846000000000, '2020-01-11', 'Jan 12 2020 2pm'])
print(dates)
# DatetimeIndex(['2021-01-10 15:00:46', '2020-01-11 00:00:00',
# '2020-01-12 14:00:00'],
# dtype='datetime64[ns]', freq=None)
pandas.Series.dt is an interface on a pandas series that gives you convenient access to operations on data stored as a pandas datetime. For example:
x = pd.Series(dates)
print(x.dt.date)
# 0 2021-01-10
# 1 2020-01-11
# 2 2020-01-12
# dtype: object
print(x.dt.hour)
# 0 15
# 1 0
# 2 14
# dtype: int64
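As a footnote on the question's concern about .dt.date producing an object column: if the goal is only to drop the time-of-day while keeping the datetime64[ns] dtype, Series.dt.normalize() (or .dt.floor('D')) is one option. A minimal sketch:
import pandas as pd
s = pd.Series(pd.to_datetime(['2019-09-02 09:30:00', '2019-09-13 09:30:00']))
print(s.dt.date.dtype)         # object  (elements are datetime.date)
print(s.dt.normalize().dtype)  # datetime64[ns]  (time zeroed out, dtype preserved)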

Related

Anyone know how to write in a short way the following code [duplicate]

I created a DataFrame from a list of lists:
table = [
['a', '1.2', '4.2' ],
['b', '70', '0.03'],
['x', '5', '0' ],
]
df = pd.DataFrame(table)
How do I convert the columns to specific types? In this case, I want to convert columns 2 and 3 into floats.
Is there a way to specify the types while converting the list to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the dtype for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns, and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.
You have four main options for converting types in pandas:
to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).
Read on for more detailed explanations and usage of each of these methods.
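Before the detailed sections, here is a minimal sketch (reusing the table from the question) of how the four behave on string data; note that only the first two actually parse the strings into numbers:
import pandas as pd
table = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(table, columns=['one', 'two', 'three'])
# 1. to_numeric - parses the strings and picks a suitable numeric dtype
print(df[['two', 'three']].apply(pd.to_numeric).dtypes)   # two/three -> float64
# 2. astype - explicit about the target dtype
print(df.astype({'two': float, 'three': float}).dtypes)   # two/three -> float64
# 3. infer_objects - soft conversion only, so numeric strings stay object
print(df.infer_objects().dtypes)                          # all columns remain object
# 4. convert_dtypes - NA-aware dtypes; the strings become 'string', not numbers
print(df.convert_dtypes().dtypes)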
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().
This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame.
>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0 8
1 6
2 7.5
3 3
4 0.9
dtype: object
>>> pd.to_numeric(s) # convert everything to float values
0 8.0
1 6.0
2 7.5
3 3.0
4 0.9
dtype: float64
As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:
# convert Series
my_series = pd.to_numeric(my_series)
# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
You can also use it to convert multiple columns of a DataFrame via the apply() method:
# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame
# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
As long as your values can all be converted, that's probably all you need.
Error handling
But what if some values can't be converted to a numeric type?
to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here's an example using a Series of strings s which has the object dtype:
>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0 1
1 2
2 4.7
3 pandas
4 10
dtype: object
The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':
>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string
Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:
>>> pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 4.7
3 NaN
4 10.0
dtype: float64
The third option for errors is just to ignore the operation if an invalid value is encountered:
>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched
This last option is particularly useful when you want to convert your entire DataFrame but don't know which of its columns can be converted reliably to a numeric type. In that case, just write:
df.apply(pd.to_numeric, errors='ignore')
The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
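A minimal sketch of that selective behaviour (assuming a toy frame with one numeric-string column and one non-numeric column; note that errors='ignore' is deprecated in recent pandas releases, so newer versions may emit a FutureWarning):
import pandas as pd
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['2019-01-01', 'not a number', 'spam']})
converted = df.apply(pd.to_numeric, errors='ignore')
print(converted.dtypes)
# a     int64   <- parsed successfully
# b    object   <- could not be parsed, left untouched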
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).
That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?
to_numeric() gives you the option to downcast to either 'integer', 'signed', 'unsigned', 'float'. Here's an example for a simple series s of integer type:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
Downcasting to 'integer' uses the smallest possible integer that can hold the values:
>>> pd.to_numeric(s, downcast='integer')
0 1
1 2
2 -7
dtype: int8
Downcasting to 'float' similarly picks a smaller than normal floating type:
>>> pd.to_numeric(s, downcast='float')
0 1.0
1 2.0
2 -7.0
dtype: float32
2. astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).
Call the method on the object you want to convert and astype() will try and convert it for you:
# convert all DataFrame columns to the int64 dtype
df = df.astype(int)
# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})
# convert Series to float16 type
s = s.astype(np.float16)
# convert Series to Python strings
s = s.astype(str)
# convert Series to categorical type - see docs for more details
s = s.astype('category')
Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.
As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
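For example, a minimal sketch of the NaN case, together with the nullable Int64 dtype as one way around it (assuming pandas >= 0.24 for the nullable integer type):
import numpy as np
import pandas as pd
s = pd.Series([1.0, 2.0, np.nan])
# s.astype(int)             # raises - NaN cannot be represented by a plain integer dtype
print(s.astype('Int64'))    # nullable integer dtype keeps the missing value as <NA>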
Be careful
astype() is powerful, but it will sometimes convert values "incorrectly". For example:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
These are small integers, so how about converting to an unsigned 8-bit type to save memory?
>>> s.astype(np.uint8)
0 1
1 2
2 249
dtype: uint8
The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2**8 - 7)!
Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
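A small sketch of that safer behaviour: when the values do not fit the requested downcast, to_numeric simply keeps the original dtype instead of wrapping around:
import pandas as pd
s = pd.Series([1, 2, -7])
print(pd.to_numeric(s, downcast='unsigned').dtype)        # int64 - unsafe, so no downcast
print(pd.to_numeric(s.abs(), downcast='unsigned').dtype)  # uint8 - safe, so it is applied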
3. infer_objects()
Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).
For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:
>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a object
b object
dtype: object
Using infer_objects(), you can change the type of column 'a' to int64:
>>> df = df.infer_objects()
>>> df.dtypes
a int64
b object
dtype: object
Column 'b' has been left alone since its values were strings, not integers. If you wanted to force both columns to an integer type, you could use df.astype(int) instead.
4. convert_dtypes()
Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.
Here "best possible" means the type most suited to hold the values. For example, if all of the values are integers (or missing values), the result is a pandas nullable integer type: an object column of Python integer objects is converted to Int64, and a column of NumPy int32 values becomes the pandas dtype Int32.
With our object DataFrame df, we get the following result:
>>> df.convert_dtypes().dtypes
a Int64
b string
dtype: object
Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).
Column 'b' contained string objects, so was changed to pandas' string dtype.
By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:
>>> df.convert_dtypes(infer_objects=False).dtypes
a object
b string
dtype: object
Now column 'a' remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype) but didn't infer exactly what dtype of integer it should have so did not convert it. Column 'b' was again converted to 'string' dtype as it was recognised as holding 'string' values.
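As a small illustration of the pd.NA support mentioned above, here is a sketch with a missing value in an object column of integers:
import pandas as pd
s = pd.Series([1, 2, None], dtype='object')
print(s.convert_dtypes())
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64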
Use this:
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]:
one two three
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
df.dtypes
Out[17]:
one object
two object
three object
df[['two', 'three']] = df[['two', 'three']].astype(float)
df.dtypes
Out[19]:
one object
two float64
three float64
The code below will change the datatype of a column.
df[['col.name1', 'col.name2'...]] = df[['col.name1', 'col.name2'..]].astype('data_type')
In place of 'data_type', you can give any datatype you want, like str, float, int, etc.
When I've only needed to specify specific columns, and I want to be explicit, I've used (per pandas.DataFrame.astype):
dataframe = dataframe.astype({'col_name_1':'int','col_name_2':'float64', etc. ...})
So, using the original question, but providing column names to it...
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col_name_1', 'col_name_2', 'col_name_3'])
df = df.astype({'col_name_2':'float64', 'col_name_3':'float64'})
pandas >= 1.0
Here's a chart that summarises some of the most important conversions in pandas.
Conversions to string are trivial (.astype(str)) and are not shown in the chart.
"Hard" versus "Soft" conversions
Note that "conversions" in this context could either refer to converting text data into their actual data type (hard conversion), or inferring more appropriate data types for data in object columns (soft conversion). To illustrate the difference, take a look at
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [4, 5, 6]}, dtype=object)
df.dtypes
a object
b object
dtype: object
# Actually converts string to numeric - hard conversion
df.apply(pd.to_numeric).dtypes
a int64
b int64
dtype: object
# Infers better data types for object data - soft conversion
df.infer_objects().dtypes
a object # no change
b int64
dtype: object
# Same as infer_objects, but converts to equivalent ExtensionType
df.convert_dtypes().dtypes
Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.
# df is the DataFrame, and column_list is a list of columns as strings (e.g ["col1","col2","col3"])
# dependencies: pandas
def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')
So, for your example:
import pandas as pd
def coerce_df_columns_to_numeric(df, column_list):
    df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1','col2','col3'])
coerce_df_columns_to_numeric(df, ['col2','col3'])
df = df.astype({"columnname": str})
#e.g - for changing the column type to string
#df is your dataframe
Create two dataframes, each with different data types for their columns, and then append them together (note that DataFrame.append was removed in pandas 2.0; pd.concat is the replacement):
d1 = pd.DataFrame(columns=[ 'float_column' ], dtype=float)
d1 = d1.append(pd.DataFrame(columns=[ 'string_column' ], dtype=str))
Results
In [8]: d1.dtypes
Out[8]:
float_column float64
string_column object
dtype: object
After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.
df.info() gives us the initial datatype of the temp column, which is float64:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 132 non-null object
1 temp 132 non-null float64
Now, use this code to change the datatype to int64:
df['temp'] = df['temp'].astype('int64')
If you do df.info() again, you will see:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 132 non-null object
1 temp 132 non-null int64
This shows you have successfully changed the datatype of column temp. Happy coding!
Starting pandas 1.0.0, we have pandas.DataFrame.convert_dtypes. You can even control what types to convert!
In [40]: df = pd.DataFrame(
...: {
...: "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
...: "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
...: "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
...: "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
...: "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
...: "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
...: }
...: )
In [41]: dff = df.copy()
In [42]: df
Out[42]:
a b c d e f
0 1 x True h 10.0 NaN
1 2 y False i NaN 100.5
2 3 z NaN NaN 20.0 200.0
In [43]: df.dtypes
Out[43]:
a int32
b object
c object
d object
e float64
f float64
dtype: object
In [44]: df = df.convert_dtypes()
In [45]: df.dtypes
Out[45]:
a Int32
b string
c boolean
d string
e Int64
f float64
dtype: object
In [46]: dff = dff.convert_dtypes(convert_boolean = False)
In [47]: dff.dtypes
Out[47]:
a Int32
b string
c object
d string
e Int64
f float64
dtype: object
In case you have various object columns, like this DataFrame of 74 object columns and 2 int columns, where each value has letters representing units:
import pandas as pd
import numpy as np
dataurl = 'https://raw.githubusercontent.com/RubenGavidia/Pandas_Portfolio.py/main/Wes_Mckinney.py/nutrition.csv'
nutrition = pd.read_csv(dataurl,index_col=[0])
nutrition.head(3)
Output:
name serving_size calories total_fat saturated_fat cholesterol sodium choline folate folic_acid ... fat saturated_fatty_acids monounsaturated_fatty_acids polyunsaturated_fatty_acids fatty_acids_total_trans alcohol ash caffeine theobromine water
0 Cornstarch 100 g 381 0.1g NaN 0 9.00 mg 0.4 mg 0.00 mcg 0.00 mcg ... 0.05 g 0.009 g 0.016 g 0.025 g 0.00 mg 0.0 g 0.09 g 0.00 mg 0.00 mg 8.32 g
1 Nuts, pecans 100 g 691 72g 6.2g 0 0.00 mg 40.5 mg 22.00 mcg 0.00 mcg ... 71.97 g 6.180 g 40.801 g 21.614 g 0.00 mg 0.0 g 1.49 g 0.00 mg 0.00 mg 3.52 g
2 Eggplant, raw 100 g 25 0.2g NaN 0 2.00 mg 6.9 mg 22.00 mcg 0.00 mcg ... 0.18 g 0.034 g 0.016 g 0.076 g 0.00 mg 0.0 g 0.66 g 0.00 mg 0.00 mg 92.30 g
3 rows × 76 columns
nutrition.dtypes
name object
serving_size object
calories int64
total_fat object
saturated_fat object
...
alcohol object
ash object
caffeine object
theobromine object
water object
Length: 76, dtype: object
nutrition.dtypes.value_counts()
object 74
int64 2
dtype: int64
A good way to convert all columns to numeric is to use regular expressions to replace the units with nothing and astype(float) to change the columns' data type to float:
nutrition.index = pd.RangeIndex(start = 0, stop = 8789, step= 1)
nutrition.set_index('name',inplace = True)
nutrition.replace('[a-zA-Z]','', regex= True, inplace=True)
nutrition=nutrition.astype(float)
nutrition.head(3)
Output:
serving_size calories total_fat saturated_fat cholesterol sodium choline folate folic_acid niacin ... fat saturated_fatty_acids monounsaturated_fatty_acids polyunsaturated_fatty_acids fatty_acids_total_trans alcohol ash caffeine theobromine water
name
Cornstarch 100.0 381.0 0.1 NaN 0.0 9.0 0.4 0.0 0.0 0.000 ... 0.05 0.009 0.016 0.025 0.0 0.0 0.09 0.0 0.0 8.32
Nuts, pecans 100.0 691.0 72.0 6.2 0.0 0.0 40.5 22.0 0.0 1.167 ... 71.97 6.180 40.801 21.614 0.0 0.0 1.49 0.0 0.0 3.52
Eggplant, raw 100.0 25.0 0.2 NaN 0.0 2.0 6.9 22.0 0.0 0.649 ... 0.18 0.034 0.016 0.076 0.0 0.0 0.66 0.0 0.0 92.30
3 rows × 75 columns
nutrition.dtypes
serving_size float64
calories float64
total_fat float64
saturated_fat float64
cholesterol float64
...
alcohol float64
ash float64
caffeine float64
theobromine float64
water float64
Length: 75, dtype: object
nutrition.dtypes.value_counts()
float64 75
dtype: int64
Now the dataset is clean and you can do numeric operations on this DataFrame using only regex and astype().
If you want to collect the units and append them to the headers, like cholesterol_mg, you can use this code:
nutrition.index = pd.RangeIndex(start = 0, stop = 8789, step= 1)
nutrition.set_index('name',inplace = True)
nutrition.astype(str).replace('[^a-zA-Z]','', regex= True)
units = nutrition.astype(str).replace('[^a-zA-Z]','', regex= True)
units = units.mode()
units = units.replace('', np.nan).dropna(axis=1)
mapper = { k: k + "_" + units[k].at[0] for k in units}
nutrition.rename(columns=mapper, inplace=True)
nutrition.replace('[a-zA-Z]','', regex= True, inplace=True)
nutrition=nutrition.astype(float)
Is there a way to specify the types while converting to DataFrame?
Yes. The other answers convert the dtypes after creating the DataFrame, but we can specify the types at creation. Use either DataFrame.from_records or read_csv(dtype=...) depending on the input format.
The latter is sometimes necessary to avoid memory errors with big data.
1. DataFrame.from_records
Create the DataFrame from a structured array of the desired column types:
import numpy as np
import pandas as pd

x = [['foo', '1.2', '70'], ['bar', '4.2', '5']]
df = pd.DataFrame.from_records(np.array(
    [tuple(row) for row in x],  # pass a list-of-tuples (x can be a list-of-lists or 2D array)
    'object, float, int'        # define the column types
))
Output:
>>> df.dtypes
# f0 object
# f1 float64
# f2 int64
# dtype: object
2. read_csv(dtype=...)
If you're reading the data from a file, use the dtype parameter of read_csv to set the column types at load time.
For example, here we read 30M rows with rating as 8-bit integers and genre as categorical:
import io
import pandas as pd

lines = '''
foo,biography,5
bar,crime,4
baz,fantasy,3
qux,history,2
quux,horror,1
'''
columns = ['name', 'genre', 'rating']
csv = io.StringIO(lines * 6_000_000) # 30M lines
df = pd.read_csv(csv, names=columns, dtype={'rating': 'int8', 'genre': 'category'})
In this case, we halve the memory usage upon load:
>>> df.info(memory_usage='deep')
# memory usage: 1.8 GB
>>> pd.read_csv(io.StringIO(lines * 6_000_000)).info(memory_usage='deep')
# memory usage: 3.7 GB
This is one way to avoid memory errors with big data. It's not always possible to change the dtypes after loading since we might not have enough memory to load the default-typed data in the first place.
I thought I had the same problem, but actually I have a slight difference that makes the problem easier to solve. For others looking at this question, it's worth checking the format of your input list. In my case the numbers are initially floats, not strings as in the question:
a = [['a', 1.2, 4.2], ['b', 70, 0.03], ['x', 5, 0]]
But by processing the list too much before creating the dataframe, I lose the types and everything becomes a string.
Creating the data frame via a NumPy array:
df = pd.DataFrame(np.array(a))
df
Out[5]:
0 1 2
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
df[1].dtype
Out[7]: dtype('O')
gives the same data frame as in the question, where the entries in columns 1 and 2 are considered as strings. However doing
df = pd.DataFrame(a)
df
Out[10]:
0 1 2
0 a 1.2 4.20
1 b 70.0 0.03
2 x 5.0 0.00
df[1].dtype
Out[11]: dtype('float64')
does actually give a data frame with the columns in the correct format.
I had the same issue.
I could not find any solution that was satisfying. My solution was simply to convert those floats to str and remove the '.0' this way.
In my case, I just apply it on the first column:
firstCol = list(df.columns)[0]
df[firstCol] = df[firstCol].fillna('').astype(str).apply(lambda x: x.replace('.0', ''))
If you want to convert one column from string format, I suggest using this code:
import pandas as pd
#My Test Data
data = {'Product': ['A','B', 'C','D'],
'Price': ['210','250', '320','280']}
data
#Create Data Frame from my data
df = pd.DataFrame(data)
#Convert to number
df['Price'] = pd.to_numeric(df['Price'])
df
Total = sum(df['Price'])
Total
Otherwise, if you are going to convert a number of column values to numbers, I suggest first filtering your values and saving them in an empty array, and after that converting them to numbers. I hope this code solves your problem.
Convert string representation of long numbers to integers
By default, astype(int) converts to int32 on some platforms (e.g. Windows), which wouldn't work (OverflowError) if a number is particularly long (such as a phone number); try 'int64' (or even float) instead:
df['long_num'] = df['long_num'].astype('int64')
On a side note, if you get SettingWithCopyWarning, then make a copy of your frame and do whatever you were doing again. For example, if you were converting col1 and col2 to float dtype, then do:
df = df.copy()
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
# or use assign
df = df.assign(**{k: df[k].astype(float) for k in ['col1', 'col2']})
Convert integers to timedelta
Also, the long string/integer may be a datetime or a timedelta, in which case use to_datetime or to_timedelta to convert to datetime/timedelta dtype:
df = pd.DataFrame({'long_int': ['1018880886000000000', '1590305014000000000', '1101470895000000000', '1586646272000000000', '1460958607000000000']})
df['datetime'] = pd.to_datetime(df['long_int'].astype('int64'))
# or
df['datetime'] = pd.to_datetime(df['long_int'].astype(float))
df['timedelta'] = pd.to_timedelta(df['long_int'].astype('int64'))
Convert timedelta to numbers
To perform the reverse operation (convert datetime/timedelta to numbers), view it as 'int64'. This could be useful if you were building a machine learning model that somehow needs to include time (or datetime) as a numeric value. Just make sure that if the original data are strings, then they must be converted to timedelta or datetime before any conversion to numbers.
df = pd.DataFrame({'Time diff': ['2 days 4:00:00', '3 days', '4 days', '5 days', '6 days']})
df['Time diff in nanoseconds'] = pd.to_timedelta(df['Time diff']).view('int64')
df['Time diff in seconds'] = pd.to_timedelta(df['Time diff']).view('int64') // 10**9
df['Time diff in hours'] = pd.to_timedelta(df['Time diff']).view('int64') // (3600*10**9)
Convert datetime to numbers
For datetime, the numeric view of a datetime is the time difference between that datetime and the UNIX epoch (1970-01-01).
df = pd.DataFrame({'Date': ['2002-04-15', '2020-05-24', '2004-11-26', '2020-04-11', '2016-04-18']})
df['Time_since_unix_epoch'] = pd.to_datetime(df['Date'], format='%Y-%m-%d').view('int64')
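Note that Series.view is deprecated in recent pandas releases; on newer versions, .astype('int64') gives the same nanosecond counts. A minimal sketch:
import pandas as pd
dates = pd.to_datetime(pd.Series(['2002-04-15', '2020-05-24']), format='%Y-%m-%d')
print(dates.astype('int64'))            # nanoseconds since 1970-01-01
print(dates.astype('int64') // 10**9)   # seconds since 1970-01-01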
astype is faster than to_numeric
df = pd.DataFrame(np.random.default_rng().choice(1000, size=(10000, 50)).astype(str))
df = pd.concat([df, pd.DataFrame(np.random.rand(10000, 50).astype(str), columns=range(50, 100))], axis=1)
%timeit df.astype(dict.fromkeys(df.columns[:50], int) | dict.fromkeys(df.columns[50:], float))
# 488 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.apply(pd.to_numeric)
# 686 ms ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

How to set datetime format for pandas dataframe column labels?

IPNI_RNC PATHID 2020-11-11 00:00:00 2020-11-12 00:00:00 2020-11-13 00:00:00 2020-11-14 00:00:00 2020-11-15 00:00:00 2020-11-16 00:00:00 2020-11-17 00:00:00 Last Day Violation Count
Above are the column labels after reading the Excel file. There are 10 columns in the df variable after reading the Excel file, and 7 of the column labels are dates.
My input data set is an Excel file which changes every day, and I want to update it automatically. In Excel, some column labels are dates like 11-Nov-2020, 12-Nov-2020, but after reading the Excel file they become 2020-11-11 00:00:00, 2020-11-12 00:00:00. I want to keep the column labels as 11-Nov-2020, 12-Nov-2020 while reading the Excel file with pd.read_excel if possible, or else I need to convert them later.
I am very new to Python. Looking forward to your support.
Thanks to everyone who has already come forward to help.
You can of course use the standard Python methods to parse the date values, but I would not recommend it, because that way you end up with Python datetime objects rather than the pandas representation of dates. That means it consumes more space, is probably not as efficient, and you can't use the pandas methods to access e.g. the year. I'll show you what I mean below.
In case you want to avoid the naming issue of your column names, you might want to prevent pandas from automatically assigning the names, read the first line as data, and fix the names yourself automatically (see the section below about how you can do that).
The type conversion part:
# create a test setup with a small dataframe
import pandas as pd
from datetime import date, datetime, timedelta
df= pd.DataFrame(dict(id=range(10), date_string=[str(datetime.now()+ timedelta(days=d)) for d in range(10)]))
# test the python way (plain Python datetime objects):
df['date_val_python'] = df['date_string'].map(lambda dt: datetime.fromisoformat(dt))
# use the pandas way: (btw. if you want to explicitly
# specify the format, you can use the format= keyword)
df['date_val_pandas']= pd.to_datetime(df['date_string'])
df.dtypes
The output is:
id int64
date_string object
date_val_python object
date_val_pandas datetime64[ns]
dtype: object
As you can see, date_val_python has type object; this is because it contains Python objects of class datetime, while date_val_pandas uses the internal datetime representation of pandas. You can now try:
df['date_val_pandas'].dt.year
# this will return a series with the year part of the date
df['date_val_python'].dt.year
# this will result in the following error:
AttributeError: Can only use .dt accessor with datetimelike values
See the pandas doc for to_datetime for more details.
The column naming part:
# read your dataframe as usual
df= pd.read_excel('c:/scratch/tmp/dates.xlsx')
rename_dict= dict()
for old_name in df.columns:
    if hasattr(old_name, 'strftime'):
        new_name = old_name.strftime('%d-%b-%Y')  # e.g. 11-Nov-2020
        rename_dict[old_name] = new_name
if len(rename_dict) > 0:
    df.rename(columns=rename_dict, inplace=True)
This works, in case your column titles are stored as usual dates, which I suppose is true, because you get a time part after importing them.
strftime of the datetime module is the function you need:
If date_time is a datetime object, you can do
date_time.strftime("%d-%b-%Y")
Example:
>>> from datetime import datetime
>>> timestamp = 1528797322
>>> date_time = datetime.fromtimestamp(timestamp)
>>> print(date_time)
2018-06-12 11:55:22
>>> print(date_time.strftime("%d-%b-%Y"))
12-Jun-2018
In order to apply a function to certain dataframe columns, use:
datetime_cols_list = ['datetime_col1', 'datetime_col2', ...]
for col in dataframe.columns:
    if col in datetime_cols_list:
        dataframe[col] = dataframe[col].apply(lambda x: x.strftime("%d-%b-%Y"))
I am sure this can be done in multiple ways in pandas, this is just what came out the top of my head.
Example:
import pandas as pd
import numpy as np
np.random.seed(0)
# generate some random datetime values
rng = pd.date_range('2015-02-24', periods=5, freq='T')
other_dt_col = rng = pd.date_range('2016-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'Date2': other_dt_col,'Val': np.random.randn(len(rng)) })
print (df)
# Output:
# Date Date2 Val
# 0 2016-02-24 00:00:00 2016-02-24 00:00:00 1.764052
# 1 2016-02-24 00:01:00 2016-02-24 00:01:00 0.400157
# 2 2016-02-24 00:02:00 2016-02-24 00:02:00 0.978738
# 3 2016-02-24 00:03:00 2016-02-24 00:03:00 2.240893
# 4 2016-02-24 00:04:00 2016-02-24 00:04:00 1.867558
datetime_cols_list = ['Date', 'Date2']
for col in df.columns:
    if col in datetime_cols_list:
        df[col] = df[col].apply(lambda x: x.strftime("%d-%b-%Y"))
print (df)
# Output:
# Date Date2 Val
# 0 24-Feb-2016 24-Feb-2016 1.764052
# 1 24-Feb-2016 24-Feb-2016 0.400157
# 2 24-Feb-2016 24-Feb-2016 0.978738
# 3 24-Feb-2016 24-Feb-2016 2.240893
# 4 24-Feb-2016 24-Feb-2016 1.867558

How to read in unusual date\time format

I have a small df with a date\time column using a format I have never seen.
Pandas reads it in as an object even if I use parse_dates, and to_datetime() chokes on it.
The dates in the column are formatted as such:
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
I think the best approach is using a date parsing pattern. Something like this:
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
But I simply do not know how to approach this format.
The datetime format for UTC is very specific about how the offset is written.
strftime() and strptime() Format Codes
The offset must be + or - followed by a zero-padded hh:mm, e.g. +08:00, -08:00, +10:00 or -10:00.
Use str.zfill to backfill the 0 between the sign and the hour digits.
import pandas as pd
# sample data
df = pd.DataFrame({'datetime': ['2019/12/29 GMT+8 18:00', '2019/12/15 GMT+8 05:00', '2019/12/15 GMT+10 05:00', '2019/12/15 GMT-10 05:00']})
# display(df)
datetime
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
2019/12/15 GMT+10 05:00
2019/12/15 GMT-10 05:00
# fix the format
df.datetime = df.datetime.str.split(' ').apply(lambda x: x[0] + x[2] + x[1][3:].zfill(3) + ':00')
# convert to a utc datetime
df.datetime = pd.to_datetime(df.datetime, format='%Y/%m/%d%H:%M%z', utc=True)
# display(df)
datetime
2019-12-29 10:00:00+00:00
2019-12-14 21:00:00+00:00
2019-12-14 19:00:00+00:00
2019-12-15 15:00:00+00:00
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 4 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 160.0 bytes
You could pass the custom format with GMT+8 in the middle and then subtract eight hours with timedelta(hours=8):
import pandas as pd
from datetime import datetime, timedelta
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d GMT+8 %H:%M') - timedelta(hours=8)
df
Date
0 2019-12-29 10:00:00
1 2019-12-14 21:00:00

Split up time series per year for plotting

I would like to plot a time series, starting Oct-2015 and ending Feb-2018, in one graph, with each year as a single line. The time series is an int64 value in a pandas DataFrame. The date is one of the DataFrame's columns, with dtype datetime64[ns].
How would I create a graph from Jan-Dec with 4 lines, one for each year?
graph['share_price'] and graph['date'] are used. I have tried Grouper, but that somehow takes the Oct-2015 values and mixes them with the January values from all other years.
This groupby is close to what I want, but I lose the information about which year each entry of the list belongs to.
graph.groupby('date').agg({'share_price':lambda x: list(x)})
Then I created a DataFrame with 4 columns, 1 for each year, but I still don't know how to group these 4 columns in a way that lets me plot the graph I want.
You can achieve this by:
extracting the year from the date
replacing the dates by the equivalent without the year
setting both the year and the date as index
unstacking the values by year
At this point, each year will be a column, and each date within the year a row, so you can just plot normally.
Here's an example.
Assuming that your DataFrame looks something like this:
>>> import pandas as pd
>>> import numpy as np
>>> index = pd.date_range('2015-10-01', '2018-02-28')
>>> values = np.random.randint(-3, 4, len(index)).cumsum()
>>> df = pd.DataFrame({
... 'date': index,
... 'share_price': values
... })
>>> df.head()
date share_price
0 2015-10-01 0
1 2015-10-02 3
2 2015-10-03 2
3 2015-10-04 5
4 2015-10-05 4
>>> df.set_index('date').plot()
You would transform the DataFrame as follows:
>>> df['year'] = df.date.dt.year
>>> df['date'] = df.date.dt.strftime('%m-%d')
>>> unstacked = df.set_index(['year', 'date']).share_price.unstack(-2)
>>> unstacked.head()
year 2015 2016 2017 2018
date
01-01 NaN 28.0 -16.0 21.0
01-02 NaN 29.0 -14.0 22.0
01-03 NaN 29.0 -16.0 22.0
01-04 NaN 26.0 -15.0 23.0
01-05 NaN 25.0 -16.0 21.0
And just plot normally:
unstacked.plot()

pandas timeseries select single date

Selecting a single date from a timeserie gives a KeyError.
Setup:
import pandas as pd
import numpy as np
ts = pd.DataFrame({'date': pd.date_range(start = '1/1/2017', periods = 5),
'observations': np.random.choice(range(0, 100), 5, replace = True)}).set_index('date')
Dataframe:
observations
date
2017-01-01 58
2017-01-02 88
2017-01-03 53
2017-01-04 4
2017-01-05 26
How do I select the number of observations for a single date?
ts['2017-01-01']
Returns: KeyError: '2017-01-01'
But...
ts['2017-01-01':'2017-01-01']
...seems to work just fine.
Any suggestions how to select/subset with a single date?
As #scnerd pointed out, when you do ts['2017-01-01'], pandas tries to find '2017-01-01' as a column name of the DataFrame ts, which gives you a KeyError since none of the columns in ts has this name.
In order to look up an index label, as in your example where 'date' is set as the index, you need to use the loc method, such as ts.loc['2017-01-01'], and you will get:
observations 58
Name: 2017-01-01 00:00:00, dtype: int32
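Building on that, a quick sketch of the two common forms (reusing the ts frame from the question):
# select the whole row for a single date
ts.loc['2017-01-01']
# or go straight to the scalar value
ts.loc['2017-01-01', 'observations']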
