Pandas Timestamp: What type is this? - python-3.x

I have a pandas dataframe with a parsed time stamp. What type is this? I have tried matching against it with the following rules:
dtype_dbg = df[col].dtype # debugger shows it as 'datetime64[ns]'
if isinstance(df[col].dtype,np.datetime64) # no luck
if isinstance(df[col].dtype,pd.Timestamp) # ditto
if isinstance(df[col].dtype,[all other timestamps I could think of]) # nothing
How does one match against the timestamp dtype in a pandas dataframe?

Pandas datetime64[ns] is a '<M8[ns]' numpy type, so you can just compare the dtypes:
df = pd.DataFrame( {'col': ['2019-01-01', '2019-01-02']})
df.col = pd.to_datetime(df.col)
df.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 2 entries, 0 to 1
#Data columns (total 1 columns):
#col 2 non-null datetime64[ns]
#dtypes: datetime64[ns](1)
#memory usage: 144.0 bytes
df.col.dtype == np.dtype('<M8[ns]')
#True
You can also (or perhaps preferably) use pandas' built-in api.types.is_... functions:
pd.api.types.is_datetime64_ns_dtype(df.col)
#True
Your isinstance(df[col].dtype, ...) comparisons don't work because df[col].dtype is a numpy.dtype object, not an instance of np.datetime64 or pd.Timestamp, so such an isinstance check fails for any data type.
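For illustration, a minimal sketch of that point, using a small demo frame (the column name 'col' is just an example name):
import numpy as np
import pandas as pd
df = pd.DataFrame({'col': pd.to_datetime(['2019-01-01', '2019-01-02'])})
# The dtype object is an instance of numpy.dtype (or a subclass of it),
# so isinstance checks against np.datetime64 or pd.Timestamp are always False.
isinstance(df['col'].dtype, np.dtype)       # True
isinstance(df['col'].dtype, np.datetime64)  # False
# Compare against the dtype itself, or use the pandas type-checking helpers.
df['col'].dtype == np.dtype('<M8[ns]')           # True
pd.api.types.is_datetime64_any_dtype(df['col'])  # True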


Dask memory usage exploding even for simple computations

I have a parquet folder created with dask, containing multiple files of about 100 MB each. When I load the dataframe with df = dask.dataframe.read_parquet(path_to_parquet_folder) and run any sort of computation (such as df.describe().compute()), my kernel crashes.
Things I have noticed:
CPU usage (about 100%) indicates that multithreading is not used
memory usage shoots way past the size of a single file
the kernel crashes after system memory usage approaches 100%
EDIT:
I tried to create a reproducible example, without success, but I discovered some other oddities, seemingly all related to the newer pandas dtypes that I'm using:
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)
import dask.dataframe as dd
# generate dataframe with 3 different nullable dtypes and n rows
n = 10000000
test = pd.DataFrame({
    1: pd.Series(['a', pd.NA]*n, dtype=pd.StringDtype()),
    2: pd.Series([1, pd.NA]*n, dtype=pd.Int64Dtype()),
    3: pd.Series([0.56, pd.NA]*n, dtype=pd.Float64Dtype())
})
dd_df = dd.from_pandas(test, npartitions = 2) # convert to dask df
dd_df.to_parquet('test.parquet') # save as parquet directory
dd_df = dd.read_parquet('test.parquet') # load files back
dd_df.mean().compute() # compute something
dd_df.describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something
Output, respectively:
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
Kernel appears to have died.
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
It seems that the dtypes are preserved even through the parquet IO, but dask has some trouble actually doing anything with these columns.
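For what it's worth, a quick way to check that (sketched here, not in the original post) is to inspect the dtypes dask reports after reading the directory back:
import dask.dataframe as dd
dd_df = dd.read_parquet('test.parquet')  # the directory written in the snippet above
dd_df.dtypes  # expected to list string, Int64 and Float64 for the three columns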
Python version: 3.9.7
dask version: 2021.11.2
It seems the main error is due to NAType, which is not yet fully supported by numpy (version 1.21.4):
~/some_env/python3.8/site-packages/numpy/core/_methods.py in _var(a, axis, dtype, out, ddof, keepdims, where)
240 # numbers and complex types with non-native byteorder
241 else:
--> 242 x = um.multiply(x, um.conjugate(x), out=x).real
243
244 ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)
TypeError: loop of ufunc does not support argument 0 of type NAType which has no callable conjugate method
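The same failure can be reproduced with plain numpy on an object array containing pd.NA, independently of dask (a minimal sketch, not from the original answer):
import numpy as np
import pandas as pd
x = np.array([1.0, pd.NA], dtype=object)
# np.var multiplies each element by its conjugate; pd.NA has no conjugate method,
# so the ufunc loop raises the same TypeError as in the traceback above.
np.var(x)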
As a workaround, casting the columns to float lets the descriptives compute. Note that, to avoid the KeyError, the column names are given as strings rather than ints.
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)
import dask.dataframe as dd
# generate dataframe with 3 different nullable dtypes and n rows
n = 1000
# note that column names are changed to strings rather than ints
test = pd.DataFrame(
    {
        "1": pd.Series(["a", pd.NA] * n, dtype=pd.StringDtype()),
        "2": pd.Series([1, pd.NA] * n, dtype=pd.Int64Dtype()),
        "3": pd.Series([0.56, pd.NA] * n, dtype=pd.Float64Dtype()),
    }
)
dd_df = dd.from_pandas(test, npartitions=2) # convert to dask df
dd_df.to_parquet("test.parquet", engine="fastparquet") # save as parquet directory
dd_df = dd.read_parquet("test.parquet", engine="fastparquet") # load files back
dd_df.mean().compute() # compute something
dd_df.astype({"2": "float"}).describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something

How to create an empty Series on pandas?

I'm learning about Series and I'm using VS Code to do exercises to learn its usage, but when I typed this
current_series_add = pd.Series()
in the terminal, it shows me a message saying "The default dtype for empty Series will be object instead of float64 in a future version".
How can I specify a dtype?
As the docs say:
class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
...
dtype : str, numpy.dtype, or ExtensionDtype, optional
Data type for the output Series. If not specified, this will be
inferred from data. See the user guide for more usages.
Example:
In [1]: import pandas as pd
In [2]: pd.Series(dtype=int)
Out[2]: Series([], dtype: int64)
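Any dtype accepted by pandas works there; for instance, to keep the current default explicitly or to opt in to the future one (a small additional example, not from the docs excerpt above):
In [3]: pd.Series(dtype='float64')   # today's default, stated explicitly
Out[3]: Series([], dtype: float64)
In [4]: pd.Series(dtype=object)      # the future default
Out[4]: Series([], dtype: object)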

dask map_partitions returns pandas data frame, not dask

Everything I can find indicates that dask map_partitions should return a dask dataframe object, but the following code snippet and the corresponding output (logged with logzero) show otherwise. (Note: calc_delta returns an np.array of floats.)
352 logger.debug(type(self.dd))
353 self.dd = self.dd.map_partitions(
354     lambda df: df.assign(
355         duration1=lambda r: calc_delta(r['a'], r['b'])
356         , duration2=lambda r: calc_delta(r['a'], r['c'])
357     )
358 ).compute(scheduler='processes')
359 logger.debug(type(self.dd))
[D 200316 19:19:28 exploratory:352] <class 'dask.dataframe.core.DataFrame'>
[D 200316 19:19:43 exploratory:359] <class 'pandas.core.frame.DataFrame'>
All the guidance (with lots of hacking) suggests that this is the way to add (logical) columns to the partitioned dask dataframe. But not if it doesn't actually return a dask dataframe.
What am I missing?
Is it not because you are calling "compute"?
Maybe this:
self.dd.map_partitions(
    lambda df: df.assign(
        duration1=lambda r: calc_delta(r['a'], r['b'])
        , duration2=lambda r: calc_delta(r['a'], r['c'])
    )
)
actually returns a dask dataframe. But then you call compute, which is supposed to return a concrete result, hence the pandas dataframe, no?
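To make the distinction concrete, here is a minimal sketch (reusing the question's self.dd and its hypothetical calc_delta): keep the map_partitions result unevaluated to stay in dask, and call compute only when a pandas result is actually wanted.
# Still lazy: a dask.dataframe.core.DataFrame with the new columns defined.
lazy_dd = self.dd.map_partitions(
    lambda df: df.assign(
        duration1=lambda r: calc_delta(r['a'], r['b']),
        duration2=lambda r: calc_delta(r['a'], r['c']),
    )
)
# Only this step executes the graph and returns a pandas DataFrame.
result_pdf = lazy_dd.compute(scheduler='processes')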

How to use pandas DataFrame or Series with seaborn compliantly?

I import this dataset and select an interval with
data_frame = pd.read_csv('household_power_consumption.txt',
                         sep=';',
                         parse_dates={'dt' : ['Date', 'Time']},
                         infer_datetime_format=True,
                         low_memory=False,
                         na_values=['nan','?'],
                         index_col='dt')
df_08_09 = data_frame.truncate(before='2008-01-01', after='2010-01-01')
df_08_09.info()
to get
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1052641 entries, 2008-01-01 00:00:00 to 2010-01-01 00:00:00
Data columns (total 7 columns):
Global_active_power 1052641 non-null float64
Global_reactive_power 1052641 non-null float64
Voltage 1052641 non-null float64
Global_intensity 1052641 non-null float64
Sub_metering_1 1052641 non-null float64
Sub_metering_2 1052641 non-null float64
Sub_metering_3 1052641 non-null float64
dtypes: float64(7)
memory usage: 64.2 MB
I just wanted to know how I can treat the DatetimeIndex dt as a data column as well, to make use of lmplot() or regplot(), for example:
seaborn.regplot(x="dt", y="Global_active_power", data=df_08_09)
The dt column always causes problems, because seaborn is not able to access it for some reason. I tried to access the DatetimeIndex, but I found no way to extract it and turn it into a data column, since I'm not very used to pandas.
I expect seaborn to find dt in the data, but it doesn't and throws an error accordingly. That much is clear to me, but I don't know how to handle this in an efficient python/pandas/seaborn fashion. So please help me out! :)
...Another question, btw: I'm also wondering why df_08_09.Global_active_power.values returns an (n,)-shaped np.array and not (n,1). I'm always forced to do values = np.array([values]).transpose() to recover (n,1).
You can use a workaround: convert the datetime column to a number first and replace matplotlib's axis tick labels with the datetime values afterwards, e.g.:
import pandas as pd
import numpy as np
import seaborn
from datetime import datetime
data_frame = pd.read_csv('household_power_consumption.txt',
                         sep=';',
                         parse_dates={'dt' : ['Date', 'Time']},
                         infer_datetime_format=True,
                         low_memory=False,
                         na_values=['nan','?'])
                         # index_col='dt')  -- no need for this, as we are not working with indexes
df_08_09 = data_frame.truncate(before='2008-01-01', after='2010-01-01')
# Convert to integers
df_08_09['date_ordinal'] = pd.to_datetime(df_08_09['dt']).apply(lambda date: date.timestamp())
# Plotting as integers
ax = seaborn.regplot(data=df_08_09, x="date_ordinal", y="Global_active_power")
# Adjust axis
ax.set_xlim(df_08_09['date_ordinal'].min() - 1, df_08_09['date_ordinal'].max() + 1)
ax.set_ylim(0, df_08_09['Global_active_power'].max() + 1)
# Set x-axis-tick-labels to datetime
new_labels = [datetime.utcfromtimestamp(int(item)) for item in ax.get_xticks()]
ax.set_xticklabels(new_labels, rotation = 45)
Reference: This SO answer by waterproof
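As an aside on extracting the DatetimeIndex: if index_col='dt' is kept as in the question, reset_index() turns the index back into a regular 'dt' column; regplot still needs numeric x values, though, so the timestamp conversion above remains necessary (a minimal sketch, not part of the referenced answer):
df_with_dt = df_08_09.reset_index()  # 'dt' becomes an ordinary column again
df_with_dt['date_ordinal'] = df_with_dt['dt'].apply(lambda d: d.timestamp())
ax = seaborn.regplot(data=df_with_dt, x='date_ordinal', y='Global_active_power')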

Can't seem to use use pandas to_csv and read_csv to properly read numpy array

The problem seems to stem from reading the csv back in with read_csv: I then get a type issue when I try to perform operations on the numpy array. The following is a minimal working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame, the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the df5 object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
The following are the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array in [], so you passed a list of numpy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})
# ^ ^
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
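With the column now stored and read back as float64, the computation from the question works as expected (a small follow-up check; the value is simply the mean of the two numbers above):
In [60]: df5['numpy'].mean()
Out[60]: 0.417980915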
