I'm stuck on this error. I want to subtract two dates and get the difference in days.
I always receive the error below.
Here is the dataframe info:
This is happening because you have pd.MultiIndex column headers. I can tell because the column names appear as tuples in your pd.DataFrame.info() results.
See this example below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, 999, (5, 5)))  # create a dataframe
df.columns = pd.MultiIndex.from_arrays(
    [['A', 'B', 'C', 'D', 'E'], ['max', 'min', 'max', 'min', 'max']]
)  # create MultiIndex column headers
type(df['A'] - df['E'])
Output:
pandas.core.frame.DataFrame
Note the type of the return value: even though you are subtracting one column from another, you get a dataframe back. You expected a pd.Series, but this returns a DataFrame.
You have a couple of options for solving this.
Option 1: use squeeze():
type((df['A'] - df['E']).squeeze())
Output:
pandas.core.series.Series
Option 2: flatten your column headers first:
df.columns = df.columns.map('_'.join)
type(df['A_max'] - df['E_max'])
Output:
pandas.core.series.Series
Now you can apply the .dt datetime accessor to your series. Checking type() is important for knowing which object you are working with.
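For example, here is a minimal sketch of the day-difference calculation on a frame with single-level datetime columns (the column names are hypothetical, not from the question):
import pandas as pd

dates = pd.DataFrame({
    'start': pd.to_datetime(['2020-01-01', '2020-01-05']),
    'end': pd.to_datetime(['2020-01-10', '2020-01-07']),
})
diff = dates['end'] - dates['start']  # a Series of Timedeltas
print(diff.dt.days)                   # .dt works because diff is a Series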
Well, as @EdChum said above, .dt is a pd.Series accessor, not a pd.DataFrame attribute. If you want to get the date difference, you can also use the DataFrame's apply() method.
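For completeness, a minimal sketch of what the apply() route could look like, using the same hypothetical columns as above:
days = dates.apply(lambda row: (row['end'] - row['start']).days, axis=1)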
Related
I have a list that is 5 rows by 5 columns.
I am trying to convert this list into a dataframe.
When I try to do so, it only grabs the first row.
This failed when I had it set to (5, 5):
df2 = pd.DataFrame(np.array(pdf_read).reshape(5,5),columns=list("abcde"))
When I switched it to this:
df2 = pd.DataFrame(np.array(pdf_read).reshape(1,5),columns=list("abcde"))
It only grabbed the first row.
Why does it do this?
Any advice?
Edit: Added Context
I am using the tabula module in python to read a PDF file.
The PDF file results are stored in the variable pdf_read.
When I do len(pdf_read) it has a length of 1, but when I type
print(pdf_read) it says it is 5 rows x 5 columns, which is very strange.
Edit #2: Datatypes
I ran the following:
print(type(pdf_read))
print(type(pdf_read[0]))
I got <class 'list'> and <class 'pandas.core.frame.DataFrame'> respectively.
It seems I have a DataFrame inside of a list.
I ran this code:
df = pd.DataFrame(
    pdf_read[0],
    columns=["column_a", "column_b", "column_c", "column_d", "column_e"]
)
This just returns a 5x5 dataframe, but all of the values in each column are NaN.
Some progress made, but I will need to figure out why the values are not populated now.
EDIT: After some research, the output pdf_read is a list of DataFrames.
So for the first DataFrame:
df = pdf_read[0]
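As for the NaN values in the earlier attempt: when pd.DataFrame() is given an existing DataFrame together with a columns= list, it selects columns by those labels rather than renaming them, and since none of the new names exist in the original, every cell comes back NaN. A minimal sketch of a rename that avoids this (reusing the column names from the question):
df = pdf_read[0]
df.columns = ["column_a", "column_b", "column_c", "column_d", "column_e"]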
Here I am applying OneHotEncoder to one of my dataframe columns.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

dfcars = pd.read_excel('cars.xlsx')
ohe = OneHotEncoder()
temp1 = pd.DataFrame(ohe.fit_transform(dfcars[['Car Model']]).toarray())
ohe.categories_
dfcars = pd.concat([dfcars, temp1], axis=1)
This is my dataset after applying OHE:
dfcars
dfcars[0] doesn't display the first column.
dfcars[4] shows error.
Why is this happening?
This is happening because the dfcars[0] syntax means df[column_label]: after the concat you have a column labeled 0 (created by the encoder), but you don't have a column labeled 4.
You can rename the columns before concatenating:
temp1 = pd.DataFrame(
    ohe.fit_transform(dfcars[['Car Model']]).toarray(),
    columns=['Category_0', 'Category_1', 'Category_2']
)
dfcars = pd.concat([dfcars, temp1], axis=1)
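If your scikit-learn is 1.0 or newer, the encoder can generate the column names itself via get_feature_names_out(), so you don't have to hard-code the list. A hedged sketch:
temp1 = pd.DataFrame(
    ohe.fit_transform(dfcars[['Car Model']]).toarray(),
    columns=ohe.get_feature_names_out(['Car Model'])  # names like 'Car Model_<category>'
)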
For the categories_ attribute of OneHotEncoder, you can visit the sklearn docs.
We can access any column by its name with df['column_name'], and the same thing is happening here with df[0]: one of the columns created by applying OneHotEncoder is named 0.
To slice the dataframe by position instead of by label, use iloc.
df.iloc[:, :1] can be used to access the first column.
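A short sketch contrasting the two access styles discussed above:
first_col = dfcars.iloc[:, 0]  # first column, selected by position
ohe_col = dfcars[0]            # column whose label is the integer 0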
I pulled some stock data from a financial API and created a DataFrame with it. Columns were 'date', 'data1', 'data2', 'data3'. Then, I converted that DataFrame into a CSV with 'date' column as index:
df.to_csv('data.csv', index_label='date')
In a second script, I read that CSV and attempted to slice the resulting DataFrame between two dates:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df['2020-03-28':'2020-04-28']
When I attempt to do this, I get the following TypeError:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2020-03-28] of <class 'str'>
So clearly, the problem is that I'm trying to use a str to slice a datetime object. But here's the confusing part! If in the first step, I save the DataFrame to a csv and DO NOT set 'date' as index:
df.to_csv('data.csv')
In my second script, I no longer get the TypeError:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df['2020-03-28':'2020-04-28']
Now it works just fine. The only problem is I have the default Pandas index column to deal with.
Why do I get a TypeError when I set the 'date' column as index in my CSV...but I do NOT get a TypeError when I don't set any index in the CSV?
It seems that in your "first" instance of df, the date column was an ordinary column (not the index), and the DataFrame had a default index of consecutive integers (its name is not important).
In this situation, running df.to_csv('data.csv', index_label='date') makes the output file contain:
date,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
the index column (integers) was given the name date, which you passed in the index_label parameter,
the next column, which in df was named date, was also given the name date.
Then if you read it by running df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date'):
the first date column (the integers) is read as date and set as the index,
the second date column (the actual dates) is read as date.1 and remains an ordinary column.
Now when you run df['2020-03-28':'2020-04-28'], you attempt to find rows whose index falls in the given range. But the index is of Int64Index type (check this in your installation), hence the exception you mentioned was thrown.
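You can verify this in your second script; a quick check, assuming data.csv was written with index_label='date' as above:
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
print(df.index)    # integers, not dates
print(df.columns)  # includes 'date.1', which holds the real dates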
Things look different when you run df.to_csv('data.csv').
Now this file contains:
,date,data1,data2,data3
0,2020-03-27,10.5,12.3,13.2
1,2020-03-28,10.6,12.9,14.7
i.e.:
the first column (which in df was the index) has no name and holds int values,
the only column named date is the second column, and it contains the dates.
Now when you read it, the result is:
date (converted to a DatetimeIndex) is the index,
the original index column got the name Unnamed: 0, no surprise, since in the source file it had no name.
And now, when you run df['2020-03-28':'2020-04-28'] everything is OK.
The thing to learn for the future: running df.to_csv('data.csv', index_label='date') does not set this column as the index. It only saves the current index column under the given name, without any check of whether another column already has the same name. The result is that 2 columns can end up with the same name.
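To avoid the duplicate name in the first place, a sketch of the writing-side fix (both lines assume date is still an ordinary column in df):
df.to_csv('data.csv', index=False)  # drop the integer index entirely
# or promote 'date' to the real index before saving:
df.set_index('date').to_csv('data.csv')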
I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n number of columns that represent observed values in those time buckets e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, the yhat_df column values are simply the last observed value from the training dataset.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following, but it only populates the column values where the PeriodIndex values match. It seems fillna() attempts to do a join() of sorts internally on the Index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand what you are doing correctly, you could even create the dataframe directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
                       index=test_df.index, columns=test_df.columns)
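For what it's worth, fillna also accepts a Series and matches it against column labels, so the last training row can fill every column in one shot. A minimal sketch, assuming train_df and yhat_df share the same column names:
yhat_df = yhat_df.fillna(train_df.iloc[-1])  # the Series index lines up with yhat_df's columns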
I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has the exact same format of the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format of the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or in some way different from the parent dataframe format. Isn't there some easy way to do this and guarantee it will always be the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
    # Buy or Sell
    if len(orders) > 1:
        empty_order = orders.iloc[0:0]
        for index, order in orders.iterrows():
            empty_order.loc[index] = order
            # empty_order.append(order)
            self.sub_transact(empty_order)
    else:
        self.sub_transact(orders)
In essence, I empty the dataframe and then insert the series, from the for loop, back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am using .loc already, when normally you get this error when you don't use .loc.
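A likely culprit: orders.iloc[0:0] can hand back a view tied to orders, which is exactly what the warning guards against. An explicit copy is the usual hedge (a sketch, not a guaranteed fix):
empty_order = orders.iloc[0:0].copy()  # explicit copy breaks the view linkage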
There is a much much easier way to do what I want.
order.to_frame().T
So...
if len(orders) > 1:
    for index, order in orders.iterrows():
        self.sub_transact(order.to_frame().T)
else:
    self.sub_transact(orders)
What this actually does is translate the series (which still contains the necessary column and index information) back into a dataframe. But for some moronic (though I'm sure Pythonic) reason, to_frame() turns the series into a single column, with the old column names as the row index, so the previous row is now the column and the previous columns are now multiple rows! So you just transpose it back with .T.
Use groupby with a key that is unique per row. groupby does exactly what you are asking for: it iterates over each group, and each group is a dataframe. So if you group by a value that is unique for each and every row, you'll get a single-row dataframe when you iterate over the groups:
import numpy as np

for n, group in df.groupby(np.arange(len(df))):
    pass  # do stuff with `group`, a one-row DataFrame
If I can suggest an alternative way, it would be like this:
for index, order in orders.iterrows():
    orders.loc[index:index]  # a one-row DataFrame slice
orders.loc[index:index] is exactly a one-row dataframe slice with the same structure, including index and column names.