I have a long pandas dataframe with many rows, which has columns ['date', 'MtM', 'desk', 'counterParty', ......].
I would like to sum up the values of 'MtM', with ['desk', 'counterParty'] along the y-axis.
I would also like the x-axis to be 'date'.
How do I do that?
I only know the syntax df.groupby(['desk', 'counterParty']).sum().
But how do I get the 'date' to show up along the x-axis?
Thanks!
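One way this might be done (just a sketch with hypothetical sample values, not the asker's data) is to group by all three keys and then move 'date' into the columns with unstack, or use pivot_table directly:
import pandas as pd

# hypothetical frame with the columns mentioned in the question
df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'desk': ['FX', 'Rates', 'FX', 'Rates'],
    'counterParty': ['A', 'B', 'A', 'B'],
    'MtM': [100.0, 250.0, 110.0, 240.0],
})

# sum MtM per (desk, counterParty, date), then push 'date' into the columns
table = df.groupby(['desk', 'counterParty', 'date'])['MtM'].sum().unstack('date')

# equivalent one-liner with pivot_table
table = df.pivot_table(values='MtM', index=['desk', 'counterParty'],
                       columns='date', aggfunc='sum')

# table.T.plot() would then draw one line per (desk, counterParty) pair
# with 'date' along the x-axis
print(table)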
I have a pandas dataframe called df with 500 columns and 2 million records.
I am able to drop columns that contain more than 90% of missing values.
But how can I drop an entire record in pandas if 90% or more of its columns have missing values?
I have seen a similar post for "R" but I am coding in Python at the moment.
You can use df.dropna() and set the thresh parameter, which is the minimum number of non-NA values a row must have to be kept, to 10% of your column count:
df.dropna(axis=0, thresh=50, inplace=True)
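If the column count ever changes, the threshold can also be computed from the frame itself instead of being hard-coded. A minimal sketch with a made-up stand-in frame:
import numpy as np
import pandas as pd

# toy stand-in for the 500-column frame from the question
df = pd.DataFrame(np.random.rand(1000, 500))
df[df < 0.5] = np.nan  # sprinkle in missing values

# thresh is the minimum number of non-NA values a row needs in order to be kept;
# 10% of the columns reproduces the hard-coded 50 above
thresh = int(0.1 * df.shape[1])
df.dropna(axis=0, thresh=thresh, inplace=True)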
You could use isna + mean on axis=1 to find the percentage of NaN values for each row. Then select the rows where it's less than 0.9 (i.e. 90%) using loc:
out = df.loc[df.isna().mean(axis=1)<0.9]
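As a quick sanity check, here is a toy frame (hypothetical values) that makes the row-wise NaN fractions visible:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, np.nan, 5.0],
    'c': [np.nan, np.nan, 6.0],
})

row_nan_frac = df.isna().mean(axis=1)  # fraction of NaNs per row: 0.33, 1.0, 0.33
out = df.loc[row_nan_frac < 0.9]       # keeps rows 0 and 2, drops the all-NaN row
print(out)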
Let us say we have a dataframe as below:
import pandas as pd

index = pd.MultiIndex.from_tuples(zip(['A','A','A','A','B','B','B','B'],
                                      [2012,2013,2014,2015,2012,2013,2014,2015]))
df = pd.DataFrame({
    'col1':[10,20,10,30,50,20,60,80],
    'col2':[40,20,40,30,50,20,60,80],
    'col3':[10,20,80,30,80,20,80,10],
},index=index)
I then set the names of this multi-index to 'Product' and 'Year' respectively.
Now I need to plot this data in such a way that for each 'Product' there is a different line for a specific column.
I tried this but it doesn't work.
df.plot(kind='line',x='Year')
I tried unstacking the dataframe using unstack(); however, as there are multiple columns, I would have to create as many new dataframes as there are columns for this to work.
Is there any other way?
You can unstack; for example, for 'col1':
df['col1'].unstack(level=0).plot()
Output: a line chart with one line per Product ('A' and 'B') plotted against Year.
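If every value column needs its own chart (the concern about creating one dataframe per column), a loop over the columns avoids building separate dataframes. A sketch, assuming the df constructed in the question with the index names 'Product' and 'Year' already set:
import matplotlib.pyplot as plt

# one figure per value column, each with one line per Product
for col in df.columns:
    df[col].unstack(level=0).plot(title=col)
plt.show()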
I am trying to plot a column of a pandas dataframe that has a datetime index (a time series). Some dates and times have no rows in the dataframe, and when I plot it using a simple df['column_name'].plot(), the datetime x-axis still shows those dates and times and the line connects the data before an empty period straight to the data after it.
How should I handle these missing periods when plotting?
When making a line plot, the plotting library doesn't automatically know between which data points a line should be drawn and between which points there should be a gap.
The most straightforward way to tell the library this, I think, is to create NaN rows so that the index reflects what you think it should reflect, i.e. if you think the data should be per minute, make sure that the dataframe index is per minute.
The plotting library then understands that where there is NaN data, no line should be drawn.
Code example:
import pandas as pd

# generate a dataframe with one column of values and a timestamp column
df = pd.DataFrame(
    [
        ['2020-04-03 12:10:00', 23.2],
        ['2020-04-03 12:12:00', 23.1],
        ['2020-04-03 12:13:00', 14.1],  # notice the gap after this row!
        ['2020-04-03 12:24:00', 23.1],
        ['2020-04-03 12:25:00', 23.3],
    ],
    columns=['timestamp', 'value']
)
df['timestamp'] = pd.to_datetime(df.timestamp)  # make sure the timestamp data is stored as timestamps
Then we reindex the data, which creates new NaN rows where needed.
df = df.set_index('timestamp')
df = df.reindex(pd.date_range(start=df.index.min(),end=df.index.max(),freq='1min'))
Finally plot it!
df['value'].plot(figsize=(10,6))
The result looks like this: a line plot with a visible gap where the NaN rows were inserted.
I have an Excel file with multilevel data and I need to melt it into single-level columns.
df = pd.read_excel('test.xlsx')
df.to_excel('test1.xlsx')
I need the dataframe output to look like below
Geo  PC  Month   A  B  C  Total
         Jan-19
         Feb-19
Consider using pandas.melt?
From the docs
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
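The real workbook structure isn't shown, so the following is only a generic illustration of id_vars/value_vars with hypothetical column names, not the asker's exact data:
import pandas as pd

# hypothetical stand-in for the Excel data
df = pd.DataFrame({
    'Geo': ['EMEA', 'EMEA'],
    'PC': ['P1', 'P2'],
    'Jan-19': [10, 20],
    'Feb-19': [30, 40],
})

# 'Geo' and 'PC' stay as identifiers; the month columns are unpivoted
# into a 'Month'/'value' pair of columns
long_df = df.melt(id_vars=['Geo', 'PC'], var_name='Month', value_name='value')
print(long_df)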
I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n columns that represent observed values in those time buckets, e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, yhat_df columns values are simply the last observed training dataset value.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following but that just populates the column values correctly where the PeriodIndex values match. It seems fillna() attempts to do a join() of sorts internally on the Index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand correctly what you are doing, you could even create the dataframe directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
index = test_df.index, columns = test_df.columns)
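A tiny end-to-end check of both variants, with hypothetical frames standing in for train_df and test_df:
import pandas as pd

idx_train = pd.period_range('2021-01-01', periods=3, freq='D')
idx_test = pd.period_range('2021-01-04', periods=2, freq='D')
train_df = pd.DataFrame({'Sales': [1.0, 2.0, 3.0], 'Price': [9.0, 8.0, 7.0]}, index=idx_train)
test_df = pd.DataFrame({'Sales': [4.0, 5.0], 'Price': [6.0, 5.0]}, index=idx_test)

last_row = train_df.tail(1).to_dict('records')[0]  # {'Sales': 3.0, 'Price': 7.0}

# variant 1: fill an empty frame shaped like test_df
yhat_df = pd.DataFrame().reindex_like(test_df).fillna(last_row)

# variant 2: build it directly from the last observed training row
yhat_df = pd.DataFrame(last_row, index=test_df.index, columns=test_df.columns)
print(yhat_df)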