df.mean() / jupyter / pandas alternating axis for output - python-3.x

I haven't posted many questions, but I have found some very strange behavior causing alternating output. I'm hoping someone can help shed some light on this.
I am using Jupyter and I am creating some data like this:
# Use the following data for this assignment:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(12345)
df = pd.DataFrame([np.random.normal(32000,200000,3650),
                   np.random.normal(43000,100000,3650),
                   np.random.normal(43500,140000,3650),
                   np.random.normal(48000,70000,3650)],
                  index=[1992,1993,1994,1995])
df
Now, in the next cell, I have a couple of lines to get the transpose of the DataFrame and then compute the mean and standard deviation. However, when I run this cell multiple times, I seem to get different output from .mean():
df = df.T
values = df.mean(axis=0)
std = df.std(axis=0)
values
I am using Shift+Enter to run this second cell, and this is what I get:
1992 33312.107476
1993 41861.859541
1994 39493.304941
1995 47743.550969
dtype: float64
And when I run the cell again using Shift+Enter (output truncated, but you should get the idea):
0 5447.716574
1 126449.084350
2 41091.469083
3 -61754.197831
4 223744.364842
5 94746.779056
6 57607.078825
7 109812.089923
8 28283.060354
9 69768.157194
10 32952.030326
11 40222.026635
12 64786.632304
13 17025.266684
14 111334.168830
15 96067.788206
16 -68157.985363
I have tried changing the axis parameter and removing it, but the output remains the same.
Here is a screenshot in case anyone is interested in duplicating what I have done:
Jupyter window on my end
Thanks for reading.

Your problem is that in your second cell you are re-assigning df to df.T, so every time you run the cell it transposes your dataframe again. So don't use df = df.T; just write this instead:
values = df.T.mean(axis=0)
std = df.T.std(axis=0)
Or even better, use axis=1 (compute across the columns, giving one value per row) without transposing:
values = df.mean(axis=1)
std = df.std(axis=1)
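To see why the output alternates, here is a minimal sketch (same seed and data as the question). Each run of df = df.T flips the frame, so mean(axis=0) switches between 4 per-year values and 3650 per-sample values:
import numpy as np
import pandas as pd

np.random.seed(12345)
df = pd.DataFrame([np.random.normal(32000, 200000, 3650),
                   np.random.normal(43000, 100000, 3650),
                   np.random.normal(43500, 140000, 3650),
                   np.random.normal(48000, 70000, 3650)],
                  index=[1992, 1993, 1994, 1995])

df = df.T                      # first run: 3650 rows x 4 year columns
print(df.mean(axis=0).shape)   # (4,)    -> one mean per year
df = df.T                      # second run: back to 4 rows x 3650 columns
print(df.mean(axis=0).shape)   # (3650,) -> one mean per sample column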

You can use describe
df.T.describe()
Out[267]:
1992 1993 1994 1995
count 3650.000000 3650.000000 3650.000000 3650.000000
mean 34922.760627 41574.363827 43186.197526 49355.777683
std 200618.445749 98495.601455 140639.407130 70408.448642
min -632057.636640 -292484.131067 -435217.159232 -181304.694667
25% -98715.272565 -24771.835741 -49460.639563 -973.422386
50% 34446.219184 41474.621854 43323.557410 49281.270881
75% 170722.706967 107502.446843 136286.933017 97422.070284
max 714855.084396 453834.306915 516751.566696 295427.273677
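If you specifically need the mean and standard deviation the question computes, they can be pulled straight out of that table (a small sketch, assuming the original, un-transposed df):
desc = df.T.describe()
values = desc.loc['mean']   # one mean per year
std = desc.loc['std']       # one standard deviation per year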

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and pivot it, but I can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point for each hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
From the details you have provided, I assume you are dealing with time series data and that you have data from different dates acquired at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First, we replicate your DataFrame object:
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from your timestamps because they complicate parsing. We convert each string to a datetime object, take its date, and keep the unique values.
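(As a side note, pd.to_datetime parses these nanosecond timestamps directly, so the manual trimming is optional; a one-line sketch of the same step:)
unique_days = pd.to_datetime(df["date"]).dt.date.unique()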
Now we create a new empty DataFrame in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # this starts a row of 3 elements, which will be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # here we find the data for 02:00 on that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # here we find the data for 03:00 on the same day
    new_df.loc[len(new_df)] = new_row_data  # now we insert the row at the last position
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
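A more compact alternative (a sketch, assuming the same df built above) is to parse the timestamps with pd.to_datetime and pivot on date and time; this also scales to the hourly 00:00-20:00 columns without a loop:
ts = pd.to_datetime(df["date"])
wide = (
    df.assign(day=ts.dt.date, time=ts.dt.strftime("%H:%M"))
      .pivot(index="day", columns="time", values="value")
      .reset_index()
      .rename_axis(None, axis=1)
)
print(wide)   # columns: day, 02:00, 03:00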

Dask apply with custom function

I am experimenting with Dask, but I encountered a problem while using apply after grouping.
I have a Dask DataFrame with a large number of rows. Let's consider, for example, the following:
import dask.dataframe as dd
import numpy as np
import pandas as pd

N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1, and I follow the solution from here:
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
and this works as I expect it to.
Now I want to take the median value in each bin (taken from here)
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions so I guess that somehow the apply is working on each one individually.
However, if I want the mean and use mean()
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
it works and the output has 10 rows.
The question is then: what am I doing wrong that is preventing apply from operating as mean?
Maybe this warning is the key (Dask doc: SeriesGroupBy.apply):
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
You are right! I was able to reproduce your problem on Dask 2.11.0. The good news is that there's a solution! It appears that the Dask groupby problem is specifically with the category type (pandas.core.dtypes.dtypes.CategoricalDtype). If you cast the category column to another column type (float, int, str), the groupby will work correctly.
Here's your code that I copied:
import dask.dataframe as dd
import pandas as pd
import numpy as np
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)
print(ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which prints out the problem you mentioned
bin_num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
5 0.550844
6 0.651036
7 0.751220
8 NaN
9 NaN
Name: col_1, Length: 80, dtype: float64
Here's my solution:
ddf3 = ddf2.copy()
ddf3["bin_num"] = ddf3["bin_num"].astype("int")
print(ddf3.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which printed:
bin_num
9 0.951369
2 0.249150
1 0.149563
0 0.049897
3 0.347906
8 0.847819
4 0.449029
5 0.550608
6 0.652778
7 0.749922
Name: col_1, dtype: float64
@MRocklin or @TomAugspurger: would you be able to create a fix for this in a new release? I think there is sufficient reproducible code here. Thanks for all your hard work. I love Dask and use it every day ;)
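For completeness, the dask.dataframe.groupby.Aggregation interface that the warning recommends looks roughly like this for a statistic that decomposes into per-partition pieces (a sketch based on the example in the Dask docs; median has no such decomposition, which is why the cast workaround above is used instead):
import dask.dataframe as dd

# a custom mean built from per-partition (count, sum) pairs
custom_mean = dd.Aggregation(
    'custom_mean',
    chunk=lambda s: (s.count(), s.sum()),                 # within each partition-group
    agg=lambda count, total: (count.sum(), total.sum()),  # combine partition results
    finalize=lambda count, total: total / count,          # compute the final value
)

# hypothetical usage on the frame from above:
# ddf3.groupby('bin_num')['col_1'].agg(custom_mean).compute()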

Numbers appearing in scientific notation after imputing missing values with mean in a dataframe

I have imputed missing values with the mean for my dataset, but after this process I can see that the amount values are showing in scientific format, though the data type is still float64. I have used the following code:
mean_value1=df1['amount'].mean()
df1['amount']=df1['amount'].fillna(mean_value1)
mean_value2=df1['start_balance'].mean()
df1['start_balance']=df1['start_balance'].fillna(mean_value2)
mean_value3=df1['end_balance'].mean()
df1['end_balance']=df1['end_balance'].fillna(mean_value3)
df1 = df1.fillna(df1.mode().iloc[0])
df1.head()
The missing values are treated correctly, but the values for start_balance and end_balance are coming out in scientific notation. How can I prevent this from happening?
The output looks like the following:
amount booking_date booking_text date_end_balance date_start_balance end_balance month start_balance tx_code
-60790.332082 2017-06-30 SEPA-Gutschrift 2017-06-30 2017-06-01 2.693179e+07 June-2017 2.652441e+07 166.0
-10.000000 2016-03-22 GEBUEHREN 2016-03-22 2016-02-22 3.589838e+06 March-2016 3.590838e+06 808.0
If you don't want to round the numbers, you can change how they are displayed in the output this way:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random(5)*10000000000, columns=['random'])
pd.set_option('display.float_format', lambda x: '%.0f' % x)
df
which gives this output
random
0 7591769472
1 78148991059
2 19880680453
3 1965830619
4 39390983843
instead of this output
random
0 6.704323e+10
1 6.714734e+10
2 8.447027e+09
3 3.051957e+10
4 1.481439e+09
Change %.0f to however many decimal places you want to see: for two decimals change the 0 to 2, for three change it to 3, and so on.
You can also use df.apply(lambda x: '%.0f' % x, axis=1), though note that this returns strings rather than numbers.
df1['amount'] = df1['amount'].astype('int64')
df1['start_balance'] = df1['start_balance'].astype('int64')
This worked well for me! It was in a different step, but it still worked.
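If you prefer format strings over the %-style lambda, the same display option accepts any callable; a small sketch (rounding the affected columns is another option, with the column names taken from the question):
import pandas as pd

# two decimals with thousands separators for all displayed floats
pd.set_option('display.float_format', '{:,.2f}'.format)

# or round just the affected columns instead of changing display settings
# df1[['start_balance', 'end_balance']] = df1[['start_balance', 'end_balance']].round(2)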

Modifying multiple columns of data using iteration, but changing increment value for each column

I'm trying to modify multiple column values in pandas DataFrames with a different increment for each column so that the values in each column do not overlap with each other when graphed on a line graph.
Here's the end goal of what I want to do: link
Let's say I have this kind of Dataframe:
Col1 Col2 Col3
0 0.3 0.2
1 1.1 1.2
2 2.2 2.4
3 3 3.1
but with hundreds of columns and thousands of values.
When graphing this as a line graph in Excel or matplotlib, the values overlap with each other, so I would like to separate the columns by adding a constant offset to each column, like so:
Col1(+0) Col2(+10) Col3(+20)
0 10.3 20.2
1 11.1 21.2
2 12.2 22.4
3 13 23.1
By adding a constant to each column and increasing that constant in increments of 10 from one column to the next, I am able to see every line in one graph without overlap.
I thought of using loops and iterations to automate this value-adding process, but I couldn't find any previous solutions on Stack Overflow that address how to change the increment value between columns (e.g. adding 0 to Col1 in one iteration, then adding 10 to Col2 in the next) without changing it within a column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in CSV format, I first used pandas to read it and store it in a DataFrame, and then selected the columns that I want to edit:
import pandas as pd
#import CSV file
df = pd.read_csv ('data.csv')
#store csv data into dataframe
df1 = pd.DataFrame (data = df)
# Locate columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
Here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print(values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
IIUC, just use df.add() over axis=1 with a list built from the length of df.columns:
df1 = df.add(list(range(0,len(df.columns)*10))[::10],axis=1)
Or, as @jezrael suggested, better:
df1=df.add(range(0,len(df.columns)*10, 10),axis=1)
print(df1)
Col1 Col2 Col3
0 0 10.3 20.2
1 1 11.1 21.2
2 2 12.2 22.4
3 3 13.0 23.1
Details:
list(range(0,len(df.columns)*10))[::10]
#[0, 10, 20]
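As a quick usage check against the original goal (plotting without overlap), something like this should draw each shifted column 10 units above the previous one (a sketch, assuming df1 from above):
import matplotlib.pyplot as plt

df1.plot()   # one line per column, offset by 0, 10, 20, ...
plt.show()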
I would recommend avoiding looping over the data frame, as it is inefficient; instead, think in terms of adding matrices, e.g.:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
Edit: in case you do not want to increment by 10, 20, 30 but by 0, 10, 20, use this instead:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the 1 so there is only an end value -> the default start is 0
# Adjust the length of the vector
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x
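A note on the design: the matrix of ones is not strictly needed, since pandas broadcasts a 1-D array across the columns; a minimal sketch of the same 0, 10, 20 shift:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randn(10, 3))

# adds 0 to the first column, 10 to the second, 20 to the third
shifted = df + 10 * np.arange(df.shape[1])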

P-value normal test for multiple rows

I got the following simple code to calculate normality over an array:
import pandas as pd
df = pd.read_excel("directory\file.xlsx")
import numpy as np
x=df.iloc[:,1:].values.flatten()
import scipy.stats as stats
from scipy.stats import normaltest
stats.normaltest(x,axis=None)
This gives me nicely a p-value and a statistic.
The only thing I want right now is to:
Add 2 columns to the file with this p-value and statistic, and if I have multiple rows, do it for all of them (calculate the p-value & statistic for each row and add 2 columns with these values).
Can someone help?
If you want to calculate a row-wise normaltest, you should not flatten your data into x; use axis=1 instead, such as:
df = pd.DataFrame(np.random.random(105).reshape(5,21)) # to generate data
# calculate normaltest row-wise, skipping the first column as in the question
df['stat'], df['p'] = stats.normaltest(df.iloc[:,1:], axis=1)
Then df contains two columns, 'stat' and 'p', with the values you are looking for, IIUC.
Note: to be able to perform normaltest, you need at least 8 values (from what I have experienced), so you need at least 8 columns in df.iloc[:,1:]; otherwise it will raise an error. Even then, it would be better to have more than 20 values in each row.
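Putting it together, a minimal end-to-end sketch (synthetic data stands in for the Excel file, so the commented read/write lines are only indicative):
import numpy as np
import pandas as pd
import scipy.stats as stats

# df = pd.read_excel("file.xlsx")   # in the question this reads your Excel file
df = pd.DataFrame(np.random.random(105).reshape(5, 21))   # synthetic stand-in

# row-wise test on everything except the first column, as in the question
result = stats.normaltest(df.iloc[:, 1:], axis=1)
df['stat'] = result.statistic
df['p'] = result.pvalue

# df.to_excel("file_with_stats.xlsx")   # write the augmented table back out if needed
print(df[['stat', 'p']])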
