NaN values in product id - python-3.x

I am concatenating two dataframes, base_df and base_df1, where base_df contains the product ID and base_df1 contains sales, profit and discount.
base_df1
sales profit discount
0 0.050090 0.000000 0.262335
1 0.110793 0.000000 0.260662
2 0.309561 0.864121 0.241432
3 0.039217 0.591474 0.260687
4 0.070205 0.000000 0.263628
base_df['Product ID']
0 FUR-ADV-10000002
1 FUR-ADV-10000108
2 FUR-ADV-10000183
3 FUR-ADV-10000188
4 FUR-ADV-10000190
final_df=pd.concat([base_df1,base_df], axis=0, ignore_index=True,sort=False)
But final_df.head() shows NaN values in the Product ID column. What might be the issue?
sales profit discount Product ID
0 0.050090 0.000000 0.262335 NaN
1 0.110793 0.000000 0.260662 NaN
2 0.309561 0.864121 0.241432 NaN
3 0.039217 0.591474 0.260687 NaN
4 0.070205 0.000000 0.263628 NaN

Try using axis=1:
final_df=pd.concat([base_df1,base_df], axis=1, sort=False)
Output:
sales profit discount Product ID
0 0.050090 0.000000 0.262335 FUR-ADV-10000002
1 0.110793 0.000000 0.260662 FUR-ADV-10000108
2 0.309561 0.864121 0.241432 FUR-ADV-10000183
3 0.039217 0.591474 0.260687 FUR-ADV-10000188
4 0.070205 0.000000 0.263628 FUR-ADV-10000190
With axis=0 you are stacking the dataframes vertically; because pandas uses intrinsic data alignment (here, aligning on the column labels), the columns that exist in only one of the frames are filled with NaN, so you generate the following dataframe:
final_df=pd.concat([base_df1,base_df], axis=0, sort=False)
sales profit discount Product ID
0 0.050090 0.000000 0.262335 NaN
1 0.110793 0.000000 0.260662 NaN
2 0.309561 0.864121 0.241432 NaN
3 0.039217 0.591474 0.260687 NaN
4 0.070205 0.000000 0.263628 NaN
0 NaN NaN NaN FUR-ADV-10000002
1 NaN NaN NaN FUR-ADV-10000108
2 NaN NaN NaN FUR-ADV-10000183
3 NaN NaN NaN FUR-ADV-10000188
4 NaN NaN NaN FUR-ADV-10000190
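If the two frames describe the same rows in the same order but their index labels do not match, a common remedy (a minimal sketch, not part of the original answer) is to reset both indexes so the axis=1 concatenation aligns purely by position:
# drop the old index labels so both frames share a fresh 0..n-1 index
final_df = pd.concat([base_df1.reset_index(drop=True),
                      base_df.reset_index(drop=True)], axis=1)
Alternatively, assigning the column by its underlying values (base_df1['Product ID'] = base_df['Product ID'].values) bypasses index alignment entirely.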

Related

How to avoid bfill or ffill when calculating pct_change with NaNs

For a df like below, I use pct_change() to calculate the rolling percentage changes:
price = [np.NaN, 10, 13, np.NaN, np.NaN, 9]
df = pd.DataFrame(price, columns=['price'])
df
Out[75]:
price
0 NaN
1 10.0
2 13.0
3 NaN
4 NaN
5 9.0
But I get these unexpected results:
df.price.pct_change(periods = 1, fill_method='bfill')
Out[76]:
0 NaN
1 0.000000
2 0.300000
3 -0.307692
4 0.000000
5 0.000000
Name: price, dtype: float64
df.price.pct_change(periods = 1, fill_method='pad')
Out[77]:
0 NaN
1 NaN
2 0.300000
3 0.000000
4 0.000000
5 -0.307692
Name: price, dtype: float64
df.price.pct_change(periods = 1, fill_method='ffill')
Out[78]:
0 NaN
1 NaN
2 0.300000
3 0.000000
4 0.000000
5 -0.307692
Name: price, dtype: float64
I would like the result to be NaN wherever the input is NaN, rather than having the NaNs filled forward or backward before the calculation. How can I achieve that? Thanks.
The expected result:
0 NaN
1 NaN
2 0.300000
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html
Maybe you can compute the percentage change manually with diff and shift:
period = 1
pct = df.price.diff(periods=period).div(df.price.shift(period))
print(pct)
# Output
0 NaN
1 NaN
2 0.3
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64
Update: you can simply pass fill_method=None:
period = 1
pct = df.price.pct_change(periods=period, fill_method=None)
print(pct)
# Output
0 NaN
1 NaN
2 0.3
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64

Drop columns if all of their values in a specific date range are NaNs using Pandas

Given a data sample as follows:
date value1 value2 value3
0 2021-10-12 1.015 1.115668 1.015000
1 2021-10-13 NaN 1.104622 1.030225
2 2021-10-14 NaN 1.093685 NaN
3 2021-10-15 1.015 1.082857 NaN
4 2021-10-16 1.015 1.072135 1.077284
5 2021-10-29 1.015 1.061520 1.093443
6 2021-10-30 1.015 1.051010 1.109845
7 2021-10-31 1.015 NaN 1.126493
8 2021-11-1 1.015 NaN NaN
9 2021-11-2 1.015 1.020100 NaN
10 2021-11-3 NaN 1.010000 NaN
11 2021-11-30 1.015 1.000000 NaN
Let's say I want to drop the columns whose values are all NaN in November 2021, i.e. in the range 2021-11-01 to 2021-11-30 (including both the start and end date).
Under this requirement, value3 will be dropped, since all of its values in 2021-11 are NaN. The other columns also have NaNs in 2021-11, but not exclusively, so they will be kept.
How could I achieve that in Pandas? Thanks.
EDIT:
df['date'] = pd.to_datetime(df['date'])
mask = (df['date'] >= '2021-11-01') & (df['date'] <= '2021-11-30')
df.loc[mask]
Out:
date value1 value2 value3
8 2021-11-01 1.015 NaN NaN
9 2021-11-02 1.015 1.0201 NaN
10 2021-11-03 NaN 1.0100 NaN
11 2021-11-30 1.015 1.0000 NaN
You can filter the rows for November 2021 and test whether every value in each column is NaN:
df['date'] = pd.to_datetime(df['date'])
df = df.loc[:, ~df[df['date'].dt.to_period('M') == pd.Period('2021-11')].isna().all()]
Or:
df['date'] = pd.to_datetime(df['date'])
df = df.loc[:, df[df['date'].dt.to_period('M') == pd.Period('2021-11')].notna().any()]
EDIT: The same column filter also works with the boolean date mask from your EDIT:
mask = (df['date'] >= '2021-11-01') & (df['date'] <= '2021-11-30')
df = df.loc[:, df.loc[mask].notna().any()]
Out:
date value1 value2
0 2021-10-12 1.015 1.115668
1 2021-10-13 NaN 1.104622
2 2021-10-14 NaN 1.093685
3 2021-10-15 1.015 1.082857
4 2021-10-16 1.015 1.072135
5 2021-10-29 1.015 1.061520
6 2021-10-30 1.015 1.051010
7 2021-10-31 1.015 NaN
8 2021-11-01 1.015 NaN
9 2021-11-02 1.015 1.020100
10 2021-11-03 NaN 1.010000
11 2021-11-30 1.015 1.000000
EDIT: If you need to manually mark some columns so they are never dropped (for example the all-NaN helper column value4 added below), set their entry in the mask to False:
df = df.assign(value4 = np.nan)
print (df)
date value1 value2 value3 value4
0 2021-10-12 1.015 1.115668 1.015000 NaN
1 2021-10-13 NaN 1.104622 1.030225 NaN
2 2021-10-14 NaN 1.093685 NaN NaN
3 2021-10-15 1.015 1.082857 NaN NaN
4 2021-10-16 1.015 1.072135 1.077284 NaN
5 2021-10-29 1.015 1.061520 1.093443 NaN
6 2021-10-30 1.015 1.051010 1.109845 NaN
7 2021-10-31 1.015 NaN 1.126493 NaN
8 2021-11-1 1.015 NaN NaN NaN
9 2021-11-2 1.015 1.020100 NaN NaN
10 2021-11-3 NaN 1.010000 NaN NaN
11 2021-11-30 1.015 1.000000 NaN NaN
df['date'] = pd.to_datetime(df['date'])
m = df[df['date'].dt.to_period('M') == pd.Period('2021-11')].isna().all()
m.loc['value4'] = False
print (m)
date False
value1 False
value2 False
value3 True
value4 False
dtype: bool
df = df.loc[:, ~m]
print (df)
date value1 value2 value4
0 2021-10-12 1.015 1.115668 NaN
1 2021-10-13 NaN 1.104622 NaN
2 2021-10-14 NaN 1.093685 NaN
3 2021-10-15 1.015 1.082857 NaN
4 2021-10-16 1.015 1.072135 NaN
5 2021-10-29 1.015 1.061520 NaN
6 2021-10-30 1.015 1.051010 NaN
7 2021-10-31 1.015 NaN NaN
8 2021-11-01 1.015 NaN NaN
9 2021-11-02 1.015 1.020100 NaN
10 2021-11-03 NaN 1.010000 NaN
11 2021-11-30 1.015 1.000000 NaN

Filter rows based on missing values in a specific set of columns - pandas

I have a data frame as shown below:
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN NaN 900
1 Eat Food 4 200 NaN NaN
1 Edu Education NaN NaN NaN NaN
1 Bank NaN NaN NaN 4 700
1 Eat Food NaN NaN NaN NaN
2 Edu Education NaN NaN 1 100
2 Bank NaN NaN NaN 8 NaN
2 NaN Food 4 NaN NaN NaN
3 Edu Education NaN NaN NaN NaN
3 Bank NaN 2 300 NaN NaN
3 Eat Food NaN 140 NaN NaN
From the above df, I would like to keep only the rows where exactly one of the columns D_N, D_A, C_N and C_A has a non-missing value.
Expected Output:
ID Type Desc D_N D_A C_N C_A
1 Bank In_Pay NaN NaN NaN 900
2 Bank NaN NaN NaN 8 NaN
2 NaN Food 4 NaN NaN NaN
3 Eat Food NaN 140 NaN NaN
I tried the code below, but it does not work:
df[df.loc[:, ["D_N", "D_A", "C_N", "C_A"]].isna().sum(axis=1).eq(1)]
Use DataFrame.count, which counts values while excluding missing ones:
df1 = df[df[["D_N", "D_A", "C_N", "C_A"]].count(axis=1).eq(1)]
print (df1)
ID Type Desc D_N D_A C_N C_A
1 1 Bank In_Pay NaN NaN NaN 900.0
7 2 Bank NaN NaN NaN 8.0 NaN
8 2 NaN Food 4.0 NaN NaN NaN
11 3 Eat Food NaN 140.0 NaN NaN
Your solution can also be fixed by testing for non-missing values instead:
df1 = df[df[["D_N", "D_A", "C_N", "C_A"]].notna().sum(axis=1).eq(1)]

Remove rows when all of the 4 specific columns are NaN in pandas

I have a df as shown below
df:
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN 8 900
1 Eat Food 4 200 NaN NaN
1 Edu Education NaN NaN NaN NaN
1 Bank NaN NaN NaN 4 700
1 Eat Food NaN NaN NaN NaN
2 Edu Education NaN NaN 1 100
2 Bank In_Pay NaN NaN 8 NaN
2 Eat Food 4 200 NaN NaN
3 Edu Education NaN NaN NaN NaN
3 Bank NaN 2 300 NaN NaN
3 Eat Food NaN NaN NaN NaN
About the df:
Whenever D_N is non-NaN, D_A should be non-NaN as well; at the same time C_N and C_A should be NaN, and vice versa.
From the above data, I would like to keep the rows where any of D_N, D_A, C_N and C_A is non-NaN.
Expected Output:
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN 8 900
1 Eat Food 4 200 NaN NaN
1 Bank NaN NaN NaN 4 700
2 Edu Education NaN NaN 1 100
2 Bank In_Pay NaN NaN 8 NaN
2 Eat Food 4 200 NaN NaN
3 Bank NaN 2 300 NaN NaN
print(df.dropna(subset=["D_N", "D_A", "C_N", "C_A"], how="all"))
Prints:
ID Type Desc D_N D_A C_N C_A
0 1 Edu Education 3.0 100.0 NaN NaN
1 1 Bank In_Pay NaN NaN 8.0 900.0
2 1 Eat Food 4.0 200.0 NaN NaN
4 1 Bank NaN NaN NaN 4.0 700.0
6 2 Edu Education NaN NaN 1.0 100.0
7 2 Bank In_Pay NaN NaN 8.0 NaN
8 2 Eat Food 4.0 200.0 NaN NaN
10 3 Bank NaN 2.0 300.0 NaN NaN

DataFrame remove useless columns

I use the following code to build and prepare my pandas dataframe:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('statistic.csv',
                   parse_dates=True, index_col=['DATE'], low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
data_extracted = data.groupby(['DATE', 'ARTICLENO'])['QUANTITY'].sum().unstack()
# replace string nan with numpy data type
data_extracted = data_extracted.fillna(value=np.nan)
# remove footer of csv file
data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2],
                                      errors="coerce")
# resample to one week rhythm
data_resampled = data_extracted.resample('W-MON', label='left',
                                         loffset=pd.DateOffset(days=1)).sum()
# reduce to one year
data_extracted = data_extracted.loc['2015-01-01':'2015-12-31']
# fill possible NaNs with 1 (not 0, because of division by zero when doing pct_change)
data_extracted = data_extracted.replace([np.inf, -np.inf], np.nan).fillna(1)
data_pct_change = data_extracted.astype(float).pct_change(axis=0).replace(
    [np.inf, -np.inf], np.nan).fillna(0)
# actual dropping logic if column has no values at all
data_pct_change.drop([col for col, val in data_pct_change.sum().iteritems()
                      if val == 0], axis=1, inplace=True)
normalized_modeling_data = preprocessing.normalize(data_pct_change,
                                                   norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data,
                                       columns=data_pct_change.columns)
normalized_modeling_data = normalized_modeling_data.transpose()
kmeans = KMeans(n_clusters=3, random_state=0).fit(normalized_modeling_data)
print(kmeans.labels_)
np.savetxt('log_2016.txt', kmeans.labels_, newline="\n")
for i, cluster_center in enumerate(kmeans.cluster_centers_):
    plp.plot(cluster_center, label='Center {0}'.format(i))
plp.legend(loc='best')
plp.show()
Unfortunately there are a lot of 0's in my dataframe (the articles don't all start at the same date, so if A starts in 2015 and B starts in 2016, B will get 0 through the whole year 2015).
Here is the grouped dataframe:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 9.0 6.0 560.0 2736.0
2015-01-19 2.0 1.0 560.0 3312.0
2015-01-26 NaN 5.0 600.0 2196.0
2015-02-02 NaN NaN 40.0 3312.0
2015-02-16 7.0 6.0 520.0 5004.0
2015-02-23 12.0 4.0 480.0 4212.0
2015-04-13 11.0 6.0 920.0 4230.0
And here the corresponding percentage change:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 -0.777778 -0.833333 0.000000 0.210526
2015-01-26 -0.500000 4.000000 0.071429 -0.336957
2015-02-02 0.000000 -0.800000 -0.933333 0.508197
2015-02-16 6.000000 5.000000 12.000000 0.510870
2015-02-23 0.714286 -0.333333 -0.076923 -0.158273
The factor 12 at 405659844106 is 'correct'
Here is another example from my dataframe:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 2175.0 200.0 NaN NaN
2015-01-19 2550.0 NaN NaN NaN
2015-01-26 925.0 NaN NaN NaN
2015-02-02 675.0 NaN NaN NaN
2015-02-16 1400.0 200.0 120.0 NaN
2015-02-23 6125.0 320.0 NaN NaN
And the corresponding percentage change:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 0.172414 -0.995000 0.000000 -0.058824
2015-01-26 -0.637255 0.000000 0.000000 0.047794
2015-02-02 -0.270270 0.000000 0.000000 -0.996491
2015-02-16 1.074074 199.000000 119.000000 279.000000
2015-02-23 3.375000 0.600000 -0.991667 0.310714
As you can see, there are changes by a factor of 200-300, which come from the jump from a replaced NaN (filled with 1) to a real value.
This data is used for k-means clustering, and such 'nonsense' data ruins my k-means centers.
Does anyone have an idea how to remove such columns?
I used the following statement to drop the nonsense columns:
max_nan_value_count = 5
data_extracted = data_extracted.drop(
    data_extracted.columns[data_extracted.apply(lambda col: col.isnull().sum() > max_nan_value_count)],
    axis=1)
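An equivalent option (a minimal sketch under the same assumptions, not part of the original post) is DataFrame.dropna with a thresh, which keeps only the columns that have enough non-NaN values:
# keep columns with at most max_nan_value_count missing values,
# i.e. with at least len(data_extracted) - max_nan_value_count non-NaN values
max_nan_value_count = 5
data_extracted = data_extracted.dropna(axis=1,
                                       thresh=len(data_extracted) - max_nan_value_count)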
