How to use pandas DataFrame or Series with seaborn compliantly? - python-3.x

I import this dataset and select an interval with
data_frame = pd.read_csv('household_power_consumption.txt',
sep=';',
parse_dates={'dt' : ['Date', 'Time']},
infer_datetime_format=True,
low_memory=False,
na_values=['nan','?'],
index_col='dt')
df_08_09 = data_frame.truncate(before='2008-01-01', after='2010-01-01')
df_08_09.info()
to get
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1052641 entries, 2008-01-01 00:00:00 to 2010-01-01 00:00:00
Data columns (total 7 columns):
Global_active_power 1052641 non-null float64
Global_reactive_power 1052641 non-null float64
Voltage 1052641 non-null float64
Global_intensity 1052641 non-null float64
Sub_metering_1 1052641 non-null float64
Sub_metering_2 1052641 non-null float64
Sub_metering_3 1052641 non-null float64
dtypes: float64(7)
memory usage: 64.2 MB
I just want to know how I can treat the DatetimeIndex dt as a data column as well, to make use of lmplot() or regplot(), e.g.:
seaborn.regplot(x="dt", y="Global_active_power", data=df_08_09)
The dt column always causes problems, because seaborn cannot access it for some reason. I tried to access the DatetimeIndex, but I found no way to extract it and turn it into a data column, since I'm not very used to pandas.
I expect seaborn to find dt in the data, but it doesn't and throws an error accordingly. I can see why, but I don't know how to handle this in an efficient python/pandas/seaborn fashion. So please help me out! :)
Another question, btw: I'm also wondering why df_08_09.Global_active_power.values returns an (n,)-shaped np.array and not (n, 1). I'm always forced to do values = np.array([values]).transpose() to recover (n, 1).

As a workaround, you can convert the datetime column to a number (a Unix timestamp) first and replace matplotlib's axis tick labels with the datetime values afterwards, e.g.
import pandas as pd
import numpy as np
import seaborn
from datetime import datetime
data_frame = pd.read_csv('household_power_consumption.txt',
sep=';',
parse_dates={'dt' : ['Date', 'Time']},
infer_datetime_format=True,
low_memory=False,
na_values=['nan','?'])
#index_col='dt') No need for this, as we are not working with indexes
df_08_09 = data_frame.truncate(before='2008-01-01', after='2010-01-01')
# Convert datetimes to numeric Unix timestamps
df_08_09['date_ordinal'] = pd.to_datetime(df_08_09['dt']).apply(lambda date: date.timestamp())
# Plotting as integers
ax = seaborn.regplot(data=df_08_09, x="date_ordinal", y="Global_active_power")
# Adjust axis
ax.set_xlim(df_08_09['date_ordinal'].min() - 1, df_08_09['date_ordinal'].max() + 1)
ax.set_ylim(0, df_08_09['Global_active_power'].max() + 1)
# Set x-axis-tick-labels to datetime
new_labels = [datetime.utcfromtimestamp(int(item)) for item in ax.get_xticks()]
ax.set_xticklabels(new_labels, rotation = 45)
Reference: This SO answer by waterproof
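As for the side question about shapes: a Series is one-dimensional, so .values returns a 1-D array of shape (n,). A minimal sketch (not part of the referenced answer) of two common ways to get (n, 1) without the manual transpose:
values = df_08_09['Global_active_power'].values    # shape (n,)
col = values.reshape(-1, 1)                         # shape (n, 1)
col = df_08_09[['Global_active_power']].values      # shape (n, 1), via a one-column DataFrame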

Related

Getting the MAE values of columns in pandas dataframe with last column

How to compute the MAE of each column in a pandas DataFrame against the last column:
,CPFNN,EN,Blupred,Horvath2,EPM,vMLP,Age
202,4.266596,3.5684403102704,5.2752761330328,5.17705043941232,3.30077613485548,3.412883,4.0
203,5.039452,5.1258136685894,4.40019825995985,5.03563327742846,3.97465334472661,4.140719,4.0
204,5.0227585,5.37207428128756,1.56392554883583,4.41805439337257,4.43779809822224,4.347523,4.0
205,4.796998,5.61052306552109,4.20912233479662,3.57075401779518,3.24902718889411,3.887743,4.0
I have a pandas DataFrame and I want to create a list with the MAE value of each column against "Age".
Is there a "pandas" way of doing this instead of just writing a for loop over the columns?
from sklearn.metrics import mean_absolute_error as mae
mae(blood_bestpred_df["CPFNN"], blood_bestpred_df['Age'])
I'd like to do this:
mae(blood_bestpred_df[["CPFNN", "EN", "Blupred", "Horvath2", "EPM", "vMLP"]], blood_bestpred_df['Age'])
But I have a dimension issue.
Looks like sklearn's MAE requires both inputs to be the same shape and doesn't do any broadcasting (I'm not an sklearn expert, there might be another way around this). You can use raw pandas instead:
import pandas as pd
df = pd.read_clipboard(sep=",", index_col=0) # Your df here
out = df.drop(columns="Age").sub(df["Age"], axis=0).abs().mean()
out:
CPFNN 0.781451
EN 1.134993
Blupred 1.080168
Horvath2 0.764996
EPM 0.478335
vMLP 0.296904
dtype: float64
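If you prefer to stay with sklearn, one option (a sketch, assuming the column names from the question) is to broadcast the Age column yourself and ask for one error per output column:
import numpy as np
from sklearn.metrics import mean_absolute_error as mae
pred_cols = ["CPFNN", "EN", "Blupred", "Horvath2", "EPM", "vMLP"]
y_pred = blood_bestpred_df[pred_cols].to_numpy()
# repeat Age once per prediction column so both inputs have the same shape
y_true = np.tile(blood_bestpred_df["Age"].to_numpy()[:, None], (1, len(pred_cols)))
mae(y_true, y_pred, multioutput="raw_values")  # one MAE per column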

Pandas Timestamp: What type is this?

I have a pandas DataFrame with a parsed timestamp column. What type is this? I have tried matching against it with the following rules:
dtype_dbg = df[col].dtype # debugger shows it as 'datetime64[ns]'
if isinstance(df[col].dtype, np.datetime64):   # no luck
if isinstance(df[col].dtype, pd.Timestamp):    # ditto
if isinstance(df[col].dtype, [all other timestamps I could think of]):  # nothing
How does one match against the timestamp dtype in a pandas dataframe?
Pandas datetime64[ns] is a '<M8[ns]' numpy type, so you can just compare the dtypes:
df = pd.DataFrame( {'col': ['2019-01-01', '2019-01-02']})
df.col = pd.to_datetime(df.col)
df.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 2 entries, 0 to 1
#Data columns (total 1 columns):
#col 2 non-null datetime64[ns]
#dtypes: datetime64[ns](1)
#memory usage: 144.0 bytes
df['col'].dtype == np.dtype('<M8[ns]')
#True
You can also (or maybe better) use pandas' built-in api.types.is_... functions:
pd.api.types.is_datetime64_ns_dtype(df['col'])
#True
Your comparisons isinstance(df[col].dtype, ...) don't work because you compare the dtype object (whose type is numpy.dtype, of course) against datetime classes, which will naturally fail for any data type.
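For illustration, a minimal sketch of what the failing checks actually compare, and what does match instead:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col': pd.to_datetime(['2019-01-01', '2019-01-02'])})
isinstance(df['col'].dtype, np.dtype)        # True - the dtype object is a numpy.dtype
isinstance(df['col'].dtype, np.datetime64)   # False - it is not a datetime value
isinstance(df['col'].iloc[0], pd.Timestamp)  # True - the individual elements are Timestamps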

Keyerror in time/Date Components of datetime - what to do?

I am using a pandas DataFrame with datetime indexing. I know from the
xarray documentation that datetime indexing can be done as ds['date.year'], with ds being the xarray DataArray, date the datetime index, and year the datetime component (here: the years of the dates). Xarray points to its datetime components documentation, which in turn leads to pandas' DatetimeIndex documentation. So I thought of doing the same with pandas, as I really like this feature.
However, it is not working for me. Here is what I did so far:
# Import required modules
import pandas as pd
import numpy as np
# Create DataFrame (name: df)
df=pd.DataFrame({'Date': ['2017-04-01','2017-04-01',
'2017-04-02','2017-04-02'],
'Time': ['06:00:00','18:00:00',
'06:00:00','18:00:00'],
'Active': [True,False,False,True],
'Value': np.random.rand(4)})
# Combine str() information of Date and Time and format to datetime
df['Date']=pd.to_datetime(df['Date'] + ' ' + df['Time'],format = '%Y-%m-%d %H:%M:%S')
# Make the combined data the index
df = df.set_index(df['Date'])
# Erase the rest, as it is not required anymore
df = df.drop(['Time','Date'], axis=1)
# Show me the first day
df['2017-04-01']
OK, so this shows me only the first day's entries. So far, so good.
However
df['Date.year']
results in KeyError: 'Date.year'
I would expect an output like
array([2017,2017,2017,2017])
What am I doing wrong?
EDIT:
I have a workaround, which I am able to go on with, but I am still not satisfied, as this doesn't explain my question. I did not use a pandas DataFrame, but an xarray Dataset and now this works:
# Load modules
import pandas as pd
import numpy as np
import xarray as xr
# Prepare time array
Date = ['2017-04-01','2017-04-01', '2017-04-02','2017-04-02']
Time = ['06:00:00','18:00:00', '06:00:00','18:00:00']
time = [Date[i] + ' ' + Time[i] for i in range(len(Date))]
time = pd.to_datetime(time,format = '%Y-%m-%d %H:%M:%S')
# Create Dataset (name: ds)
ds=xr.Dataset({'time': time,
'Active': [True,False,False,True],
'Value': np.random.rand(4)})
ds['time.year']
which gives:
<xarray.DataArray 'year' (time: 4)>
array([2017, 2017, 2017, 2017])
Coordinates:
* time (time) datetime64[ns] 2017-04-01T06:00:00 ... 2017-04-02T18:00:00
Just in terms of what you're doing wrong, you are
a) trying to call an index as a series
b) chaining accessors within a string: df['Date'] is a single column, while df['Date.year'] looks up a column literally called 'Date.year'
If your datetime is the index, use .year on the index, or .dt.year if it's a series:
df.index.year
#or assuming your dtype is a proper datetime (your code indicates it is)
df.Date.dt.year
hope that helps bud.
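Putting that together with the DataFrame built in the question, a minimal sketch of the pandas equivalent of ds['time.year']:
df.index.year               # years of the DatetimeIndex
np.asarray(df.index.year)   # array([2017, 2017, 2017, 2017])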

Interpolation of a variable for different elevations (automatically determine x1 and x2)

I have a dataframe with 3 temperature values columns, as follows:
T1 at 1000 m
T2 at 2000 m
T3 at 3000 m
And I have a list with different elevations ranging from 1000 to 3000.
For each elevation I want to create the interpolated temperature.
The main issue is that I can't make my code automatically select the correct columns. For example, if my target elevation is 1500, I want to interpolate between 1000 and 2000. I am aiming for simple linear interpolation. I tried the method suggested in Pandas: Make a new column by linearly interpolating between existing columns
But I kept getting TypeError: 'zip' object is not subscriptable
Can you help me solve this problem?
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
data = np.random.randint(1, high=50, size=len(days))
Elevation= np.random.randint(1000, high=3000, size=len(days))
df = pd.DataFrame({'Time': days, 'T1000':data,'T2000':data,'T3000':data} )
df = df.set_index('Time')
print(df)
You can simply interpolate for each row of the DataFrame:
df['Alti'] = Elevation
df['Val'] = df.apply(lambda x: np.interp(x.Alti, [1000, 2000, 3000], x['T1000':'T3000']),
axis=1)
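If the row-wise apply turns out to be slow on a large frame, here is a vectorized sketch of the same linear interpolation (assuming the T1000/T2000/T3000 columns and elevations within 1000-3000 m, as above):
grid = np.array([1000, 2000, 3000])
temps = df[['T1000', 'T2000', 'T3000']].to_numpy()
elev = df['Alti'].to_numpy()
# index of the lower bracketing grid elevation for each row
idx = np.clip(np.searchsorted(grid, elev, side='right') - 1, 0, len(grid) - 2)
lo, hi = grid[idx], grid[idx + 1]
w = (elev - lo) / (hi - lo)
rows = np.arange(len(df))
df['Val_vec'] = (1 - w) * temps[rows, idx] + w * temps[rows, idx + 1]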

Can't seem to use use pandas to_csv and read_csv to properly read numpy array

The problem seems to stem from reading the CSV back in with read_csv: there is a type issue when I try to perform operations on the numpy array afterwards. The following is a minimal working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame, the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the "df5" object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
The following is the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there anyway to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array in [], so you passed a list of numpy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})
# note: x is passed directly, not wrapped in a list
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
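If you really do need one array per cell (rather than one float per row), here is a sketch of reading the file back with a converter, assuming the CSV was written by the original code above:
import numpy as np
import pandas as pd
def parse_array(cell):
    # the cell looks like "[ 0.83151197 0.00444986]": strip the brackets, parse the floats
    return np.fromstring(cell.strip('[]'), sep=' ')
df5 = pd.read_csv('C:/temp/test5.csv', index_col=0, converters={'numpy': parse_array})
df5['numpy'].iloc[0].mean()  # works again, the cell is a float array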
