DataFrame remove useless columns - python-3.x

I use the following code to build and prepare my pandas dataframe:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('statistic.csv',
                   parse_dates=True, index_col=['DATE'], low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
data_extracted = data.groupby(['DATE', 'ARTICLENO'])['QUANTITY'].sum().unstack()
# replace string nan with the numpy data type
data_extracted = data_extracted.fillna(value=np.nan)
# remove footer of csv file
data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2],
                                      errors='coerce')
# resample to a one-week rhythm
data_resampled = data_extracted.resample('W-MON', label='left',
                                         loffset=pd.DateOffset(days=1)).sum()
# reduce to one year
data_extracted = data_extracted.loc['2015-01-01':'2015-12-31']
# fill possible NaNs with 1 (not 0, because of division by zero when doing pct_change)
data_extracted = data_extracted.replace([np.inf, -np.inf], np.nan).fillna(1)
data_pct_change = (data_extracted.astype(float).pct_change(axis=0)
                   .replace([np.inf, -np.inf], np.nan).fillna(0))
# actual dropping logic if a column has no values at all
data_pct_change.drop([col for col, val in data_pct_change.sum().iteritems()
                      if val == 0], axis=1, inplace=True)
normalized_modeling_data = preprocessing.normalize(data_pct_change,
                                                   norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data,
                                       columns=data_pct_change.columns)
normalized_modeling_data = normalized_modeling_data.transpose()
kmeans = KMeans(n_clusters=3, random_state=0).fit(normalized_modeling_data)
print(kmeans.labels_)
np.savetxt('log_2016.txt', kmeans.labels_, newline="\n")
for i, cluster_center in enumerate(kmeans.cluster_centers_):
    plp.plot(cluster_center, label='Center {0}'.format(i))
plp.legend(loc='best')
plp.show()
Unfortunately there are a lot of 0's in my dataframe (the articles don't all start at the same date, so if A starts in 2015 and B starts in 2016, B will get 0 through the whole year 2015).
Here is the grouped dataframe:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 9.0 6.0 560.0 2736.0
2015-01-19 2.0 1.0 560.0 3312.0
2015-01-26 NaN 5.0 600.0 2196.0
2015-02-02 NaN NaN 40.0 3312.0
2015-02-16 7.0 6.0 520.0 5004.0
2015-02-23 12.0 4.0 480.0 4212.0
2015-04-13 11.0 6.0 920.0 4230.0
And here is the corresponding percentage change:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 -0.777778 -0.833333 0.000000 0.210526
2015-01-26 -0.500000 4.000000 0.071429 -0.336957
2015-02-02 0.000000 -0.800000 -0.933333 0.508197
2015-02-16 6.000000 5.000000 12.000000 0.510870
2015-02-23 0.714286 -0.333333 -0.076923 -0.158273
The factor 12 at 405659844106 is 'correct'
Here is another example from my dataframe:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 2175.0 200.0 NaN NaN
2015-01-19 2550.0 NaN NaN NaN
2015-01-26 925.0 NaN NaN NaN
2015-02-02 675.0 NaN NaN NaN
2015-02-16 1400.0 200.0 120.0 NaN
2015-02-23 6125.0 320.0 NaN NaN
And the corresponding percentage change:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 0.172414 -0.995000 0.000000 -0.058824
2015-01-26 -0.637255 0.000000 0.000000 0.047794
2015-02-02 -0.270270 0.000000 0.000000 -0.996491
2015-02-16 1.074074 199.000000 119.000000 279.000000
2015-02-23 3.375000 0.600000 -0.991667 0.310714
As you can see, there are changes by a factor of 200-300 which come from the jump from a replaced NaN (filled with 1) to a real value.
This data is used for k-means clustering, and such 'nonsense' data ruins my k-means centers.
Does anyone have an idea how to remove such columns?

I used the following statement to drop the nonsense columns:
max_nan_value_count = 5
data_extracted = data_extracted.drop(data_extracted.columns[data_extracted.apply(
    lambda col: col.isnull().sum() > max_nan_value_count)], axis=1)
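A more idiomatic way to express the same threshold is DataFrame.dropna with the thresh argument (keep a column only if it has enough non-NaN values). This is only a minimal sketch of the idea, using a made-up demo frame rather than the original article data:
import numpy as np
import pandas as pd

# Demo frame: column 'A' has six NaNs, column 'B' only one.
demo = pd.DataFrame({
    'A': [1.0] + [np.nan] * 6,
    'B': [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0],
})

max_nan_value_count = 5
# Keep a column only if it has at least len(demo) - max_nan_value_count
# non-NaN values, i.e. drop it when it has more than max_nan_value_count NaNs.
filtered = demo.dropna(axis=1, thresh=len(demo) - max_nan_value_count)
print(filtered.columns.tolist())  # ['B']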

Related

How to avoid bfill or ffill when calculating pct_change with NaNs

For a df like below, I use pct_change() to calculate the rolling percentage changes:
price = [np.NaN, 10, 13, np.NaN, np.NaN, 9]
df = pd.DataFrame(price, columns=['price'])
df
Out[75]:
price
0 NaN
1 10.0
2 13.0
3 NaN
4 NaN
5 9.0
But I get these unexpected results:
df.price.pct_change(periods = 1, fill_method='bfill')
Out[76]:
0 NaN
1 0.000000
2 0.300000
3 -0.307692
4 0.000000
5 0.000000
Name: price, dtype: float64
df.price.pct_change(periods = 1, fill_method='pad')
Out[77]:
0 NaN
1 NaN
2 0.300000
3 0.000000
4 0.000000
5 -0.307692
Name: price, dtype: float64
df.price.pct_change(periods = 1, fill_method='ffill')
Out[78]:
0 NaN
1 NaN
2 0.300000
3 0.000000
4 0.000000
5 -0.307692
Name: price, dtype: float64
I would like the result to be NaN wherever a NaN is involved in the calculation, instead of the NaNs being filled forward or backward first and then used.
How can I achieve that? Thanks.
The expected result:
0 NaN
1 NaN
2 0.300000
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html
Maybe you can compute the percentage change manually with diff and shift:
period = 1
pct = df.price.diff().div(df.price.shift(period))
print(pct)
# Output
0 NaN
1 NaN
2 0.3
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64
Update: you can pass fill_method=None
period = 1
pct = df.price.pct_change(periods=period, fill_method=None)
print(pct)
# Output
0 NaN
1 NaN
2 0.3
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64
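Since the original question in this thread calls pct_change on a whole DataFrame, it may be worth noting that fill_method=None works there as well, so NaN gaps stay NaN per column instead of being forward-filled. A small sketch with made-up data (not the article data from above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 10, 13, np.nan, np.nan, 9],
    'B': [5, np.nan, 7, 8, np.nan, 10],
})
# Per-column behaviour matches the Series example above:
# positions next to a NaN stay NaN instead of being filled first.
print(df.pct_change(periods=1, fill_method=None))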

Drop columns if all of their values in a specific date range are NaNs using Pandas

Given a data sample as follows:
date value1 value2 value3
0 2021-10-12 1.015 1.115668 1.015000
1 2021-10-13 NaN 1.104622 1.030225
2 2021-10-14 NaN 1.093685 NaN
3 2021-10-15 1.015 1.082857 NaN
4 2021-10-16 1.015 1.072135 1.077284
5 2021-10-29 1.015 1.061520 1.093443
6 2021-10-30 1.015 1.051010 1.109845
7 2021-10-31 1.015 NaN 1.126493
8 2021-11-1 1.015 NaN NaN
9 2021-11-2 1.015 1.020100 NaN
10 2021-11-3 NaN 1.010000 NaN
11 2021-11-30 1.015 1.000000 NaN
Let's say I want to drop columns whose values are all NaN in November 2021, i.e. in the range 2021-11-01 to 2021-11-30 (including the start and end dates).
Under this requirement, value3 will be dropped since all of its values in 2021-11 are NaN. The other columns have some NaNs in 2021-11, but not only NaNs, so they will be kept.
How could I achieve that in Pandas? Thanks.
EDIT:
df['date'] = pd.to_datetime(df['date'])
mask = (df['date'] >= '2021-11-01') & (df['date'] <= '2021-11-30')
df.loc[mask]
Out:
date value1 value2 value3
8 2021-11-01 1.015 NaN NaN
9 2021-11-02 1.015 1.0201 NaN
10 2021-11-03 NaN 1.0100 NaN
11 2021-11-30 1.015 1.0000 NaN
You can filter the rows for November 2021 and test whether all values per column are NaN:
df['date'] = pd.to_datetime(df['date'])
df = df.loc[:, ~df[df['date'].dt.to_period('m') == pd.Period('2021-11')].isna().all()]
Or:
df['date'] = pd.to_datetime(df['date'])
df = df.loc[:, df[df['date'].dt.to_period('m') == pd.Period('2021-11')].notna().any()]
EDIT: The same can be done with the boolean date mask from the question's EDIT:
mask = (df['date'] >= '2021-11-01') & (df['date'] <= '2021-11-30')
df = df.loc[:, df.loc[mask].notna().any()]
Out:
date value1 value2
0 2021-10-12 1.015 1.115668
1 2021-10-13 NaN 1.104622
2 2021-10-14 NaN 1.093685
3 2021-10-15 1.015 1.082857
4 2021-10-16 1.015 1.072135
5 2021-10-29 1.015 1.061520
6 2021-10-30 1.015 1.051010
7 2021-10-31 1.015 NaN
8 2021-11-01 1.015 NaN
9 2021-11-02 1.015 1.020100
10 2021-11-03 NaN 1.010000
11 2021-11-30 1.015 1.000000
EDIT: If you need to manually exclude some columns from being dropped (here a value4 column that is all NaN but should be kept), set their entry in the mask to False:
df = df.assign(value4 = np.nan)
print (df)
date value1 value2 value3 value4
0 2021-10-12 1.015 1.115668 1.015000 NaN
1 2021-10-13 NaN 1.104622 1.030225 NaN
2 2021-10-14 NaN 1.093685 NaN NaN
3 2021-10-15 1.015 1.082857 NaN NaN
4 2021-10-16 1.015 1.072135 1.077284 NaN
5 2021-10-29 1.015 1.061520 1.093443 NaN
6 2021-10-30 1.015 1.051010 1.109845 NaN
7 2021-10-31 1.015 NaN 1.126493 NaN
8 2021-11-1 1.015 NaN NaN NaN
9 2021-11-2 1.015 1.020100 NaN NaN
10 2021-11-3 NaN 1.010000 NaN NaN
11 2021-11-30 1.015 1.000000 NaN NaN
df['date'] = pd.to_datetime(df['date'])
m = df[df['date'].dt.to_period('m') == pd.Period('2021-11')].isna().all()
m.loc['value4'] = False
print (m)
date False
value1 False
value2 False
value3 True
value4 False
dtype: bool
df = df.loc[:, ~m]
print (df)
date value1 value2 value4
0 2021-10-12 1.015 1.115668 NaN
1 2021-10-13 NaN 1.104622 NaN
2 2021-10-14 NaN 1.093685 NaN
3 2021-10-15 1.015 1.082857 NaN
4 2021-10-16 1.015 1.072135 NaN
5 2021-10-29 1.015 1.061520 NaN
6 2021-10-30 1.015 1.051010 NaN
7 2021-10-31 1.015 NaN NaN
8 2021-11-01 1.015 NaN NaN
9 2021-11-02 1.015 1.020100 NaN
10 2021-11-03 NaN 1.010000 NaN
11 2021-11-30 1.015 1.000000 NaN
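If this check has to be repeated for different ranges, the condition can be wrapped in a small helper. This is only a sketch; drop_if_all_nan_in_range is a hypothetical name, and it assumes the date column parses cleanly with pd.to_datetime:
import pandas as pd

def drop_if_all_nan_in_range(df, start, end, date_col='date'):
    # Drop columns whose values are all NaN between start and end (inclusive).
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    in_range = out[date_col].between(start, end)
    # Keep a column if it has at least one non-NaN value inside the range;
    # the date column itself always survives this test.
    return out.loc[:, out.loc[in_range].notna().any()]

# Usage with the sample frame from the question:
# drop_if_all_nan_in_range(df, '2021-11-01', '2021-11-30')  # drops value3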

NaN values in product id

I am concatenating my two dataframes base_df and base_df1, where base_df holds the product ID and base_df1 holds sales, profit and discount.
base_df1
sales profit discount
0 0.050090 0.000000 0.262335
1 0.110793 0.000000 0.260662
2 0.309561 0.864121 0.241432
3 0.039217 0.591474 0.260687
4 0.070205 0.000000 0.263628
base_df['Product ID']
0 FUR-ADV-10000002
1 FUR-ADV-10000108
2 FUR-ADV-10000183
3 FUR-ADV-10000188
4 FUR-ADV-10000190
final_df=pd.concat([base_df1,base_df], axis=0, ignore_index=True,sort=False)
But my final_df.head() shows NaN values in the product ID column; what might be the issue?
sales Discount profit product id
0 0.050090 0.000000 0.262335 NaN
1 0.110793 0.000000 0.260662 NaN
2 0.309561 0.864121 0.241432 NaN
3 0.039217 0.591474 0.260687 NaN
4 0.070205 0.000000 0.263628 NaN
Try using axis=1:
final_df=pd.concat([base_df1,base_df], axis=1, sort=False)
Output:
sales profit discount ProductID
0 0.050090 0.000000 0.262335 FUR-ADV-10000002
1 0.110793 0.000000 0.260662 FUR-ADV-10000108
2 0.309561 0.864121 0.241432 FUR-ADV-10000183
3 0.039217 0.591474 0.260687 FUR-ADV-10000188
4 0.070205 0.000000 0.263628 FUR-ADV-10000190
With axis=0 you stack the dataframes vertically, and because pandas aligns the data on its labels (intrinsic data alignment), the columns that exist in only one frame are filled with NaN, generating the following dataframe:
final_df=pd.concat([base_df1,base_df], axis=0, sort=False)
sales profit discount ProductID
0 0.050090 0.000000 0.262335 NaN
1 0.110793 0.000000 0.260662 NaN
2 0.309561 0.864121 0.241432 NaN
3 0.039217 0.591474 0.260687 NaN
4 0.070205 0.000000 0.263628 NaN
0 NaN NaN NaN FUR-ADV-10000002
1 NaN NaN NaN FUR-ADV-10000108
2 NaN NaN NaN FUR-ADV-10000183
3 NaN NaN NaN FUR-ADV-10000188
4 NaN NaN NaN FUR-ADV-10000190
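A related pitfall: even with axis=1, pandas still aligns on the row index, so if one frame kept non-matching index labels (for example after filtering) you get the same NaN pattern again. Resetting both indexes first avoids that. A minimal sketch with made-up values, not the asker's actual data:
import pandas as pd

base_df1 = pd.DataFrame({'sales': [0.050090, 0.110793],
                         'profit': [0.000000, 0.000000]})
# Simulate a frame that kept non-matching index labels after filtering.
base_df = pd.DataFrame({'Product ID': ['FUR-ADV-10000002', 'FUR-ADV-10000108']},
                       index=[10, 11])

# Without reset_index the mismatched labels would again produce NaNs.
final_df = pd.concat([base_df1.reset_index(drop=True),
                      base_df.reset_index(drop=True)], axis=1, sort=False)
print(final_df)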

Error while performing operation on DatetimeIndexResampler type

I have a time-series data frame and want to find the difference between the date in each record and the last (maximum) date within that data frame, but I get the error TypeError: unsupported operand type(s) for -: 'DatetimeIndex' and 'SeriesGroupBy'. It seems from the error that the data frame is not in the 'right' type to allow these operations. How can I avoid this, or cast the data to some other format so that the operation works? Below is sample code which reproduces the error:
import pandas as pd
df = pd.DataFrame([[54.7, 36.3, '2010-07-20'], [54.7, 36.3, '2010-07-21'],
                   [52.3, 38.7, '2010-07-26'], [52.3, 38.7, '2010-07-30']],
                  columns=['col1', 'col2', 'date'])
df.date = pd.to_datetime(df.date)
df.index = df.date
df = df.resample('D')
print(type(df))
diff = (df.date.max() - df.date).values
I think you need to create a DatetimeIndex first with DataFrame.set_index; then, if you aggregate by max, you get consecutive daily values:
df = pd.DataFrame([[54.7, 36.3, '2010-07-20'],
                   [54.7, 36.3, '2010-07-21'],
                   [52.3, 38.7, '2010-07-26'],
                   [52.3, 38.7, '2010-07-30']],
                  columns=['col1', 'col2', 'date'])
df.date = pd.to_datetime(df.date)
df1 = df.set_index('date').resample('D').max()
#alternative if not duplicated datetimes
#df1 = df.set_index('date').asfreq('D')
print (df1)
col1 col2
date
2010-07-20 54.7 36.3
2010-07-21 54.7 36.3
2010-07-22 NaN NaN
2010-07-23 NaN NaN
2010-07-24 NaN NaN
2010-07-25 NaN NaN
2010-07-26 52.3 38.7
2010-07-27 NaN NaN
2010-07-28 NaN NaN
2010-07-29 NaN NaN
2010-07-30 52.3 38.7
Then subtract the index from its maximum value and convert the timedeltas to days with TimedeltaIndex.days:
df1['diff'] = (df1.index.max() - df1.index).days
print (df1)
col1 col2 diff
date
2010-07-20 54.7 36.3 10
2010-07-21 54.7 36.3 9
2010-07-22 NaN NaN 8
2010-07-23 NaN NaN 7
2010-07-24 NaN NaN 6
2010-07-25 NaN NaN 5
2010-07-26 52.3 38.7 4
2010-07-27 NaN NaN 3
2010-07-28 NaN NaN 2
2010-07-29 NaN NaN 1
2010-07-30 52.3 38.7 0
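If the daily resampling is not actually needed, the distance to the latest date can also be computed directly on the original date column; a minimal sketch reusing the question's frame:
import pandas as pd

df = pd.DataFrame([[54.7, 36.3, '2010-07-20'], [54.7, 36.3, '2010-07-21'],
                   [52.3, 38.7, '2010-07-26'], [52.3, 38.7, '2010-07-30']],
                  columns=['col1', 'col2', 'date'])
df['date'] = pd.to_datetime(df['date'])

# Subtract each date from the maximum date; .dt.days turns the timedeltas into integers.
df['diff'] = (df['date'].max() - df['date']).dt.days
print(df[['date', 'diff']])
#         date  diff
# 0 2010-07-20    10
# 1 2010-07-21     9
# 2 2010-07-26     4
# 3 2010-07-30     0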

Replacing values in a string with NaN

I am faced with a simple task, but I cannot solve it. There is a table in df:
Date X1 X2
02.03.2019 2 2
03.03.2019 1 1
04.03.2019 2 3
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
And for rows where Date < 05.03.2019 I need to set X1=NaN and X2=NaN:
Date X1 X2
02.03.2019 NaN NaN
03.03.2019 NaN NaN
04.03.2019 NaN NaN
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
First convert the Date column to datetimes and then set the values with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
df.loc[df['Date'] < '2019-03-05', ['X1','X2']] = np.nan
print (df)
Date X1 X2
0 2019-03-02 NaN NaN
1 2019-03-03 NaN NaN
2 2019-03-04 NaN NaN
3 2019-03-05 1.0 12.0
4 2019-03-06 2.0 2.0
5 2019-03-07 3.0 3.0
6 2019-03-08 4.0 1.0
7 2019-03-09 1.0 2.0
If there is a DatetimeIndex:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
#change datetime to 2019-03-04
df.loc[:'2019-03-04'] = np.nan
print (df)
X1 X2
Date
2019-03-02 NaN NaN
2019-03-03 NaN NaN
2019-03-04 NaN NaN
2019-03-05 1.0 12.0
2019-03-06 2.0 2.0
2019-03-07 3.0 3.0
2019-03-08 4.0 1.0
2019-03-09 1.0 2.0
Or:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
df.loc[df.index < '2019-03-05'] = np.nan
Don't use this solution, it is just another possible approach (-: (note that it will affect all columns):
df.mask(df.Date < '05.03.2019').combine_first(df[['Date']])
Date X1 X2
0 02.03.2019 NaN NaN
1 03.03.2019 NaN NaN
2 04.03.2019 NaN NaN
3 05.03.2019 1.0 12.0
4 06.03.2019 2.0 2.0
5 07.03.2019 3.0 3.0
6 08.03.2019 4.0 1.0
7 09.03.2019 1.0 2.0
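To keep the mask-based idea but avoid touching every column, the same condition can be applied to a column subset only; a sketch assuming Date has already been converted to datetime as in the first answer:
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['02.03.2019', '03.03.2019', '04.03.2019', '05.03.2019'],
                           format='%d.%m.%Y'),
    'X1': [2, 1, 2, 1],
    'X2': [2, 1, 3, 12],
})

# mask() only on the X columns, so Date itself stays untouched.
cols = ['X1', 'X2']
df[cols] = df[cols].mask(df['Date'] < '2019-03-05')
print(df)
#         Date   X1    X2
# 0 2019-03-02  NaN   NaN
# 1 2019-03-03  NaN   NaN
# 2 2019-03-04  NaN   NaN
# 3 2019-03-05  1.0  12.0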
