I am trying to plot a column of a pandas DataFrame that has a datetime index (a time series). Some dates and times have no rows in the DataFrame. When I plot it with a simple df['column_name'].plot(), the datetime x-axis still shows those dates and times, and the line connects the data before each empty stretch to the data after it.
How can I get rid of these empty stretches in the plot?
When making a line plot, the plotting library doesn't automatically know between which data points a line should be drawn and between which points there should be a gap.
The most straightforward way to tell the library this, I think, is to create NaN rows so that the index reflects the sampling you expect. I.e. if you think the data should be per minute, make sure that the dataframe index is per minute.
The plotting library then understands that where the data is NaN, no line should be drawn.
Code example:
import pandas as pd
# generate a dataframe with a timestamp column and a value column
df = pd.DataFrame(
[
['2020-04-03 12:10:00',23.2],
['2020-04-03 12:12:00',23.1],
['2020-04-03 12:13:00',14.1], #notice the gap here!
['2020-04-03 12:24:00',23.1],
['2020-04-03 12:25:00',23.3],
],
columns=['timestamp','value']
)
df['timestamp'] = pd.to_datetime(df.timestamp) # make sure that the timestamp data is stored as timestamps
Then we reindex the data, which creates new NaN rows where needed.
df = df.set_index('timestamp')
df = df.reindex(pd.date_range(start=df.index.min(),end=df.index.max(),freq='1min'))
Finally plot it!
df['value'].plot(figsize=(10,6))
In the resulting plot, no line is drawn across the stretches where the reindexed data is NaN.
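As a cross-check of the reindexing step, here is a minimal sketch (the frame name `raw` and the sample values are made up for illustration) suggesting that `asfreq('1min')` gives the same per-minute index as the explicit `reindex` over `pd.date_range`, with the missing minutes turned into NaN:

```python
import pandas as pd

# Illustrative data only: three readings with a gap after 12:13.
raw = pd.DataFrame(
    {"value": [23.1, 14.1, 23.1]},
    index=pd.to_datetime(
        ["2020-04-03 12:12:00", "2020-04-03 12:13:00", "2020-04-03 12:24:00"]
    ),
)

# Explicit reindex onto a per-minute range from first to last timestamp.
full_range = pd.date_range(start=raw.index.min(), end=raw.index.max(), freq="1min")
reindexed = raw.reindex(full_range)

# asfreq builds the same per-minute index and also inserts NaN rows.
resampled = raw.asfreq("1min")

# Count the rows that were created as NaN placeholders.
n_missing = int(reindexed["value"].isna().sum())
print(len(reindexed), n_missing)
```

Either form should plot with a visible break, since the NaN rows interrupt the line.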
I have a long pandas DataFrame with many columns, including ['date', 'MtM', 'desk', 'counterParty', ......].
I would like to sum the values of 'MtM', grouped by ['desk', 'counterParty'], and show the sums on the y-axis.
I would also like the x-axis to be 'date'.
How do I do that?
I only know the syntax df.groupby(['desk', 'counterParty']).sum().
But how do I get 'date' to show up along the x-axis?
Thanks!
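One possible approach, sketched here on made-up data (the values and the names `trades` and `summed` are assumptions, not from the question), is to include 'date' in the groupby keys and then unstack the other two keys, so that each (desk, counterParty) pair becomes its own column and the dates stay on the index:

```python
import pandas as pd

# Made-up trades purely for illustration.
trades = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
        ),
        "desk": ["A", "A", "B", "A", "B"],
        "counterParty": ["X", "X", "Y", "X", "Y"],
        "MtM": [1.0, 2.0, 5.0, 4.0, 6.0],
    }
)

# Sum MtM per (date, desk, counterParty), then move desk and counterParty
# into the columns so that the index holds only the dates.
summed = (
    trades.groupby(["date", "desk", "counterParty"])["MtM"]
    .sum()
    .unstack(["desk", "counterParty"])
)
print(summed)
```

Calling `summed.plot()` on a frame shaped like this would draw one line per (desk, counterParty) pair with the dates along the x-axis.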
I am new to Python and am doing a time series analysis of stocks. I created a DataFrame holding the rolling average of the percentage change in close price for 5 stocks, so this df has 5 columns. I have another df holding the rolling average of the percentage change in closing price of the index. I want to plot each individual stock column of the first df together with the index df. I wrote this code:
fig.add_subplot(5,1,1)
plt.plot(pctchange_RA['HUL'])
plt.plot(N50_RA)
fig.add_subplot(5,1,2)
plt.plot(pctchange_RA['IRCON'])
plt.plot(N50_RA)
fig.add_subplot(5,1,3)
plt.plot(pctchange_RA['JUBLFOOD'])
plt.plot(N50_RA)
fig.add_subplot(5,1,4)
plt.plot(pctchange_RA['PVR'])
plt.plot(N50_RA)
fig.add_subplot(5,1,5)
plt.plot(pctchange_RA['VOLTAS'])
plt.plot(N50_RA)
NOTE: pctchange_RA is a pandas df of 5 stocks and N50_RA is an index df with one column.
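You can put your column names in a list and then just loop over it and create the subplots dynamically. A sketch would look like the following:
import matplotlib.pyplot as plt

cols = ['HUL', 'IRCON', 'JUBLFOOD', 'PVR', 'VOLTAS']
fig = plt.figure()
for i, col in enumerate(cols):
    ax = fig.add_subplot(len(cols), 1, i + 1)  # one row per stock
    ax.plot(pctchange_RA[col])  # the stock's rolling average
    ax.plot(N50_RA)  # the index rolling average on the same axes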
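The same loop can also be written with `plt.subplots`, which creates the whole grid of axes in one call. This is only a sketch: the random data below stands in for the real rolling averages, and N50_RA is modelled as a Series rather than a one-column df, which is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data (assumption): random series in place of the real rolling averages.
cols = ["HUL", "IRCON", "JUBLFOOD", "PVR", "VOLTAS"]
dates = pd.date_range("2020-01-01", periods=30, freq="D")
rng = np.random.default_rng(0)
pctchange_RA = pd.DataFrame(rng.normal(size=(30, 5)), index=dates, columns=cols)
N50_RA = pd.Series(rng.normal(size=30), index=dates, name="N50")

# One row of axes per stock, all sharing the same date axis.
fig, axes = plt.subplots(len(cols), 1, sharex=True, figsize=(8, 12))
for ax, col in zip(axes, cols):
    ax.plot(pctchange_RA[col], label=col)  # the stock
    ax.plot(N50_RA, label="N50")           # the index, repeated on every subplot
    ax.legend(loc="upper right")
```

Sharing the x-axis keeps the dates aligned across the five panels, which makes the stock-versus-index comparison easier to read.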
I have a DataFrame which contains 55000 rows and 3 columns.
I want to get every row of this big DataFrame as its own DataFrame, so I can pass it as a parameter to a different function.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get a DataFrame out of it; each row comes back as a Series. How could I solve this?
I think it is not really necessary, because the index of the Series is the same as the columns of the DataFrame.
But it is possible by:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is very slow.
I suspect you're doing something not optimal if you need what you describe. That said, if you need each row as a dataframe:
l = [df.iloc[[i]] for i in range(len(df))]
This makes a list with one single-row dataframe for each row in df.
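To make the Series-versus-DataFrame difference concrete, here is a small sketch on a made-up three-column frame (the column names and values are illustrative only). It compares converting the row Series with `to_frame().T` against selecting with a list indexer, `df.iloc[[0]]`:

```python
import pandas as pd

# Made-up frame with mixed column types.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5], "c": ["x", "y"]})

s = df.iloc[0]          # a Series: the row's values indexed by column name
row_a = s.to_frame().T  # one-row DataFrame built from that Series
row_b = df.iloc[[0]]    # a list indexer keeps the result a DataFrame

print(type(s).__name__, row_a.shape, row_b.shape)
print(row_a.dtypes.tolist())  # dtypes after the Series round trip
print(row_b.dtypes.tolist())  # dtypes taken straight from df
```

With mixed column types the Series route goes through a single object-dtype Series, so the list-indexer form may be preferable when the original dtypes matter.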
I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n number of columns that represent observed values in those time buckets e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, yhat_df columns values are simply the last observed training dataset value.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following but that just populates the column values correctly where the PeriodIndex values match. It seems fillna() attempts to do a join() of sorts internally on the Index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand correctly what you are doing, you can even create the dataframe directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
                       index=test_df.index, columns=test_df.columns)
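Both routes can be tried on a toy example. The sketch below uses made-up Sales and Price values on a daily PeriodIndex; the names `last_obs`, `yhat_fill` and `yhat_direct` are introduced here for illustration only.

```python
import pandas as pd

# Toy train/test split on a daily PeriodIndex (values are made up).
train_idx = pd.period_range("2020-01-01", periods=4, freq="D")
test_idx = pd.period_range("2020-01-05", periods=3, freq="D")
train_df = pd.DataFrame(
    {"Sales": [10.0, 11.0, 9.0, 12.0], "Price": [2.0, 2.5, 2.5, 3.0]},
    index=train_idx,
)
test_df = pd.DataFrame(
    {"Sales": [13.0, 8.0, 9.0], "Price": [3.0, 3.5, 3.0]},
    index=test_idx,
)

# Last observed training row as a plain {column: value} dict.
last_obs = train_df.tail(1).to_dict("records")[0]

# Route 1: an all-NaN float frame shaped like test_df, filled column by column.
# The dict is matched on column names, so the differing index does not matter.
empty = pd.DataFrame(index=test_df.index, columns=test_df.columns, dtype=float)
yhat_fill = empty.fillna(last_obs)

# Route 2: build the frame directly; each scalar is broadcast down the index.
yhat_direct = pd.DataFrame(last_obs, index=test_df.index, columns=test_df.columns)

print(yhat_direct)
```

The dictionary avoids the index alignment that happens when a DataFrame is passed to fillna, which appears to be what caused the partial fill described in the question.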
We have measurements for n points (say 22 points) over a period of time, stored in a real-time store. We are now looking to understand the trends for these points. To that end we read the measurements into a pandas DataFrame (Python). Within this DataFrame the points are the columns and the rows are the respective measurement times.
We would like to extend the DataFrame by inserting 'mean' and 'std' columns for each existing column, each column being a particular measurement point. This means two new columns for each of the 22 measurement points.
The question is whether this is best achieved by adding the new mean and std columns while iterating over the existing columns, or whether there is a more effective built-in DataFrame operation or trick.
Our understanding is that updating a DataFrame in a for loop would be by far the worst practice.
Thanks for any comment or proposal.
From the comments, I guess this is what you are looking for -
import pandas as pd
import numpy as np
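df = pd.DataFrame(np.random.normal(size=(1000, 22)))  # create an example dataframe
points = df.columns  # the 22 original measurement columns
df['means'] = df[points].mean(axis=1)  # row-wise mean over the original columns only
df['std'] = df[points].std(axis=1)  # row-wise standard deviation over the same columns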
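If the statistics are wanted per measurement point (per column) rather than per row, the following sketch shows two loop-free variants. The rolling version and its window length of 50 are assumptions about what "mean and std columns for each existing column" might mean; the question does not confirm this.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 22)))  # example data, 22 points

# One statistic per measurement point: a 2 x 22 summary frame.
summary = df.agg(["mean", "std"])

# Rolling variant (assumed interpretation): a mean and a std series per point.
window = 50  # hypothetical window length
rolling_stats = pd.concat(
    {"mean": df.rolling(window).mean(), "std": df.rolling(window).std()},
    axis=1,
)  # columns form a MultiIndex of (statistic, point)

print(summary.shape, rolling_stats.shape)
```

Both variants operate on all columns at once, so no explicit iteration over the 22 columns is needed.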