I have the data below from an Excel sheet, and I want every NaN to be filled with the value just before it, even when there are several NaNs in a row. I tried the ffill() method, but it doesn't seem to do what I want: it takes the very first value before the NaN in the column and populates all the NaNs with it.
Could someone help, please?
My DataFrame:
import pandas as pd
df = pd.read_excel("Example-sheat.xlsx",sheet_name='Sheet1')
#df = df.fillna(method='ffill')
#df = df['AuthenticZTed domaTT controller'].ffill()
print(df)
My DataFrame output:
AuthenticZTed domaTT controller source KTvice naHR
0 ZTPGRKMIK1DC200.example.com TTv1614
1 TT1NDZ45DC202.example.com TTv1459
2 TT1NDZ45DC202.example.com TTv1495
3 NaN TTv1670
4 TT1NDZ45DC202.example.com TTv1048
5 TN1CQI02DC200.example.com TTv1001
6 DU2RDCRDC1DC204.example.com TTva082
7 NaN xxgb-gen
8 ZTPGRKMIK1DC200.example.com TTva038
9 DU2RDCRDC1DC204.example.com TTv0071
10 NaN ttv0032
11 KT1MUC02DUDC201.example.com TTv0073
12 NaN TTv0679
13 TN1SZZ67DC200.example.com TTv1180
14 TT1NDZ45DC202.example.com TTv1181
15 TT1BLR01APDC200.example.com TTv0859
16 TN1SZZ67DC200.example.com xxg2089
17 NaN TTv1846
18 ZTPGRKMIK1DC200.example.com TTvtp064
19 PR1CPQ01DC200.example.com TTv0950
20 PR1CPQ01DC200.example.com TTc7005
21 NaN TTv0678
22 KT1MUC02DUDC201.example.com TTv257032798
23 PR1CPQ01DC200.example.com xxg2016
24 NaN TTv0313
25 TT1BLR01APDC200.example.com TTc4901
26 NaN TTv0710
27 DU2RDCRDC1DC204.example.com xxg3008
28 NaN TTv1080
29 PR1CPQ01DC200.example.com xxg2022
30 NaN xxg2057
31 NaN TTv1522
32 TN1SZZ67DC200.example.com TTv258998881
33 PR1CPQ01DC200.example.com TTv259064418
34 ZTPGRKMIK1DC200.example.com TTv259129955
35 TT1BLR01APDC200.example.com xxg2034
36 NaN TTv259326564
37 TNHSZPBCD2DC200.example.com TTv259129952
38 KT1MUC02DUDC201.example.com TTv259195489
39 ZTPGRKMIK1DC200.example.com TTv0683
40 ZTPGRKMIK1DC200.example.com TTv0885
41 TT1BLR01APDC200.example.com dbexh
42 NaN TTvtp065
43 TN1PEK01APDC200.example.com TTvtp057
44 ZTPGRKMIK1DC200.example.com TTvtp007
45 NaN TTvtp063
46 TT1BLR01APDC200.example.com TTvtp032
47 KTphbgsa11dc201.example.com TTvtp046
48 NaN TTvtp062
49 PR1CPQ01DC200.example.com TTv0235
50 NaN TTv0485
51 TT1NDZ45DC202.example.com TTv0236
52 NaN TTv0486
53 PR1CPQ01DC200.example.com TTv0237
54 NaN TTv0487
55 TT1BLR01APDC200.example.com TTv0516
56 TN1CQI02DC200.example.com TTv1285
57 TN1PEK01APDC200.example.com TTv0440
58 NaN liv9007
59 HR1GDL28DC200.example.com TTv0445
60 NaN tuv006
61 FTGFTPTP34DC203.example.com TTv0477
62 NaN tuv002
63 TN1CQI02DC200.example.com TTv0534
64 TN1SZZ67DC200.example.com TTv0639
65 NaN TTv0825
66 NaN TTv1856
67 TT1BLR01APDC200.example.com TTva101
68 TN1SZZ67DC200.example.com TTv1306
69 KTphbgsa11dc201.example.com TTv1072
70 NaN webx02
71 KT1MUC02DUDC201.example.com TTv1310
72 PR1CPQ01DC200.example.com TTv1151
73 TN1CQI02DC200.example.com TTv1165
74 NaN tuv90
75 TN1SZZ67DC200.example.com TTv1065
76 KTphbgsa11dc201.example.com TTv1737
77 NaN ramn01
78 HR1GDL28DC200.example.com ramn02
79 NaN ptb001
80 HR1GDL28DC200.example.com ptn002
81 NaN ptn003
82 TN1SZZ67DC200.example.com TTs0057
83 PR1CPQ01DC200.example.com TTs0058
84 NaN TTs0058-duplicZTe
85 PR1CPQ01DC200.example.com xxg2080
86 KTphbgsa11dc204.example.com xxg2081
87 TN1PEK01APDC200.example.com xxg2082
88 NaN xxg3002
89 TN1SZZ67DC200.example.com xxg2084
90 NaN xxg3005
91 ZTPGRKMIK1DC200.example.com xxg2086
92 NaN xxg3007
93 KT1MUC02DUDC201.example.com xxg2098
94 NaN xxg3014
95 TN1PEK01APDC200.example.com xxg2026
96 NaN xxg2094
97 TN1PEK01APDC200.example.com livtp005
98 KT1MUC02DUDC201.example.com xxg2059
99 ZTPGRKMIK1DC200.example.com acc9102
100 NaN xxg2111
101 TN1CQI02DC200.example.com xxgtp009
Desired Output:
AuthenticZTed domaTT controller source KTvice naHR
0 ZTPGRKMIK1DC200.example.com TTv1614
1 TT1NDZ45DC202.example.com TTv1459
2 TT1NDZ45DC202.example.com TTv1495
3 TT1NDZ45DC202.example.com TTv1670 <---
4 TT1NDZ45DC202.example.com TTv1048
5 TN1CQI02DC200.example.com TTv1001
6 DU2RDCRDC1DC204.example.com TTva082
7 DU2RDCRDC1DC204.example.com xxgb-gen <---
1- You are already close to a solution. One option is to combine shift() with ffill(); note that shift() moves every value down one row, so compare the result with the desired output:
df = df.apply(lambda x: x.fillna(df['AuthenticZTed domaTT controller']).shift()).ffill()
2- As Quang suggested in the comments, this also works:
df['AuthenticZTed domaTT controller'] = df['AuthenticZTed domaTT controller'].ffill()
3- Or you can also try the following:
df = df.fillna({var: df['AuthenticZTed domaTT controller'].shift() for var in df}).ffill()
4- Alternatively, if you have multiple columns, define a cols variable and loop through it:
cols = ['AuthenticZTed domaTT controller', 'source KTvice naHR']
for col in cols:
    df[col] = df[col].ffill()
print(df)
OR
df.loc[:, cols] = df.loc[:, cols].ffill()
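As a quick check of how ffill() treats runs of NaN, here is a minimal sketch on made-up data (the column names and values are hypothetical, not taken from the question's sheet). Each NaN takes the nearest non-missing value above it, so a run of two NaNs is filled from the row just before the run:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data, with a single and a double NaN gap.
demo = pd.DataFrame({
    "controller": ["dc1", "dc2", np.nan, "dc3", np.nan, np.nan, "dc4"],
    "device": ["a", "b", "c", "d", "e", "f", "g"],
})

# ffill() propagates the last valid value downwards into each gap.
demo["controller"] = demo["controller"].ffill()
filled = demo["controller"].tolist()
print(filled)
```

If the real column still shows gaps after this, it may be worth checking whether the missing cells are actual NaN values or strings such as "NaN".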
Related
Let's say I have a daily data as follows:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=list('ABC'))
df
Out:
A B C
2013-02-26 85 57 0
2013-02-27 94 86 44
2013-02-28 62 91 29
2013-03-01 21 93 24
2013-03-02 12 70 70
.. .. ..
2013-05-22 57 13 81
2013-05-23 43 68 85
2013-05-24 55 50 53
2013-05-25 75 78 66
2013-05-26 70 93 3
For columns A and B, I need to calculate the month-over-month percentage change on a daily basis. For example, the value for A on 2013-05-26 is A's value on 2013-05-26 divided by its value on 2013-04-26, minus 1.
My idea is to create new columns 'A1' and 'B1' by shifting A and B one month forward; df['A_MoM'] is then df['A'] / df['A1'] - 1, with the same logic for column B.
Since months differ in length, I will fall back to the last day of the previous month when the matching day does not exist, i.e. the pct change for 2013-03-30 is 2013-03-30's value / 2013-02-28's value - 1.
I tried the code below, but it generates a dataframe with all NaNs:
df[['A1', 'B1']] = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().fillna(method='ffill')
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df[['A1', 'B1']], axis=0) - 1
Out:
A A1 B B1
2013-02-26 NaN NaN NaN NaN
2013-02-27 NaN NaN NaN NaN
2013-02-28 NaN NaN NaN NaN
2013-03-01 NaN NaN NaN NaN
2013-03-02 NaN NaN NaN NaN
.. .. .. ..
2013-05-22 NaN NaN NaN NaN
2013-05-23 NaN NaN NaN NaN
2013-05-24 NaN NaN NaN NaN
2013-05-25 NaN NaN NaN NaN
2013-05-26 NaN NaN NaN NaN
How can I achieve that correctly? Sincere thanks in advance.
Edit:
df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=['A_values', 'B_values', 'C'])
df.columns
df1 = df.filter(regex='_values$').shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill().add_suffix('_shifted')
df2 = df.filter(regex='_values$').div(df1.to_numpy(), axis=0) - 1
df.join(df2.add_suffix('_MoM'))
Out:
ValueError: Unable to coerce to DataFrame, shape must be (90, 2): given (93, 2)
The reason is that the column names differ, so the division cannot align them. One solution is to convert df[['A1', 'B1']] to a NumPy array:
df[['A1', 'B1']] = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df[['A1', 'B1']].to_numpy(), axis=0) - 1
print (df)
A B C A1 B1 A_MoM B_MoM
2013-02-26 85 57 0 NaN NaN NaN NaN
2013-02-27 94 86 44 NaN NaN NaN NaN
2013-02-28 62 91 29 NaN NaN NaN NaN
2013-03-01 21 93 24 NaN NaN NaN NaN
2013-03-02 12 70 70 NaN NaN NaN NaN
.. .. .. ... ... ... ...
2013-05-22 57 13 81 14.0 50.0 3.071429 -0.740000
2013-05-23 43 68 85 2.0 45.0 20.500000 0.511111
2013-05-24 55 50 53 89.0 52.0 -0.382022 -0.038462
2013-05-25 75 78 66 86.0 54.0 -0.127907 0.444444
2013-05-26 70 93 3 4.0 45.0 16.500000 1.066667
[90 rows x 7 columns]
Or, if possible, assign the output to df1 so the column names are unchanged; the division then aligns on the same column names (here A and B):
df1 = df[['A', 'B']].shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
df[['A_MoM', 'B_MoM']] = df[['A', 'B']].div(df1, axis=0) - 1
print (df)
A B C A_MoM B_MoM
2013-02-26 85 57 0 NaN NaN
2013-02-27 94 86 44 NaN NaN
2013-02-28 62 91 29 NaN NaN
2013-03-01 21 93 24 NaN NaN
2013-03-02 12 70 70 NaN NaN
.. .. .. ... ...
2013-05-22 57 13 81 3.071429 -0.740000
2013-05-23 43 68 85 20.500000 0.511111
2013-05-24 55 50 53 -0.382022 -0.038462
2013-05-25 75 78 66 -0.127907 0.444444
2013-05-26 70 93 3 16.500000 1.066667
[90 rows x 5 columns]
EDIT: resample also changes the DatetimeIndex, so reindex is added to give both DataFrames the same index:
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.randint(0, 100, size=(90, 3)), index=dates, columns=['A_values', 'B_values', 'C'])
df1 = df.filter(regex='_values$').shift(freq=pd.DateOffset(months=1)).resample('D').last().ffill()
print (df1.columns)
Index(['A_values', 'B_values'], dtype='object')
df2 = df.filter(regex='_values$').div(df1, axis=0).sub(1).reindex(df.index)
print (df.filter(regex='_values$').columns)
Index(['A_values', 'B_values'], dtype='object')
df = df.join(df2.add_suffix('_MoM'))
print (df)
A_values B_values C A_values_MoM B_values_MoM
2013-02-26 85 57 0 NaN NaN
2013-02-27 94 86 44 NaN NaN
2013-02-28 62 91 29 NaN NaN
2013-03-01 21 93 24 NaN NaN
2013-03-02 12 70 70 NaN NaN
... ... .. ... ...
2013-05-22 57 13 81 3.071429 -0.740000
2013-05-23 43 68 85 20.500000 0.511111
2013-05-24 55 50 53 -0.382022 -0.038462
2013-05-25 75 78 66 -0.127907 0.444444
2013-05-26 70 93 3 16.500000 1.066667
[90 rows x 5 columns]
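To see why the answer chains resample('D').last() after the shift, here is a small sketch on hypothetical values (the four dates and numbers are made up for illustration). Shifting by pd.DateOffset(months=1) clamps late-January dates to the end of February, which produces duplicate index entries; resample('D').last() collapses them to one row per day:

```python
import pandas as pd

# Hypothetical series: Jan 29, 30, 31 and Feb 1 of 2013 (not a leap year).
dates = pd.date_range("2013-01-29", periods=4)
s = pd.Series([10, 20, 30, 40], index=dates)

# Jan 29/30/31 all land on Feb 28 after adding one month; Feb 1 lands on Mar 1.
shifted = s.shift(freq=pd.DateOffset(months=1))
shifted_days = shifted.index.strftime("%Y-%m-%d").tolist()

# One row per calendar day, keeping the last value seen for duplicated days.
daily = shifted.resample("D").last().ffill()
daily_days = daily.index.strftime("%Y-%m-%d").tolist()
print(shifted_days)
print(daily)
```

The shifted frame therefore covers a different set of dates than the original, which is consistent with the shape mismatch reported in the Edit above.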
I have two dfs as shown below.
df1:
Date t_factor
2020-02-01 5
2020-02-02 23
2020-02-03 14
2020-02-04 23
2020-02-05 23
2020-02-06 23
2020-02-07 30
2020-02-08 29
2020-02-09 100
2020-02-10 38
2020-02-11 38
2020-02-12 38
2020-02-13 70
2020-02-14 70
2020-02-15 38
2020-02-16 38
2020-02-17 70
2020-02-18 70
2020-02-19 38
2020-02-20 38
2020-02-21 70
2020-02-22 70
2020-02-23 38
2020-02-24 38
2020-02-25 70
2020-02-26 70
2020-02-27 70
df2:
From to plan score
2020-02-03 2020-02-05 start 20
2020-02-07 2020-02-08 foundation 25
2020-02-10 2020-02-12 learn 10
2020-02-14 2020-02-16 practice 20
2020-02-15 2020-02-21 exam 30
2020-02-20 2020-02-23 test 10
From the above, I would like to append the plan column to df1, based on the From and to dates in df2 and the Date value in df1.
Expected output:
output_df
Date t_factor plan
2020-02-01 5 NaN
2020-02-02 23 NaN
2020-02-03 14 start
2020-02-04 23 start
2020-02-05 23 start
2020-02-06 23 NaN
2020-02-07 30 foundation
2020-02-08 29 foundation
2020-02-09 100 NaN
2020-02-10 38 learn
2020-02-11 38 learn
2020-02-12 38 learn
2020-02-13 70 NaN
2020-02-14 70 practice
2020-02-15 38 NaN
2020-02-16 38 NaN
2020-02-17 70 exam
2020-02-18 70 exam
2020-02-19 38 exam
2020-02-20 38 NaN
2020-02-21 70 NaN
2020-02-22 70 test
2020-02-23 38 test
2020-02-24 38 NaN
2020-02-25 70 NaN
2020-02-26 70 NaN
2020-02-27 70 NaN
Note:
If the date ranges of two plans overlap, keep plan as NaN for the overlapping dates.
Example:
2020-02-14 to 2020-02-16 plan is practice.
And 2020-02-15 to 2020-02-21 plan is exam.
So the overlap is on 2020-02-15 and 2020-02-16.
Hence plan should be NaN for that date range.
I would like to implement a function like the one shown below.
def (df1, df2)
return output_df
Use the following (this solution handles the case where the From and to dates in dataframe df2 overlap, choosing the values from column plan with respect to the earliest possible date):
d1 = df1.sort_values('Date')
d2 = df2.sort_values('From')
df = pd.merge_asof(d1, d2[['From', 'plan']], left_on='Date', right_on='From')
df = pd.merge_asof(df, d2[['to', 'plan']], left_on='Date', right_on='to',
                   direction='forward', suffixes=['', '_r']).drop(['From', 'to'], axis=1)
df['plan'] = df['plan'].mask(df['plan'].ne(df.pop('plan_r')))
Details:
Use pd.merge_asof to perform an asof merge of d1 and d2 on the columns Date and From, with the default direction='backward', creating a new merged dataframe df. Then use pd.merge_asof again to merge df and d2 on the columns Date and to, this time with direction='forward'.
print(df)
Date t_factor plan plan_r
0 2020-02-01 5 NaN start
1 2020-02-02 23 NaN start
2 2020-02-03 14 start start
3 2020-02-04 23 start start
4 2020-02-05 23 start start
5 2020-02-06 23 start foundation
6 2020-02-07 30 foundation foundation
7 2020-02-08 29 foundation foundation
8 2020-02-09 100 foundation learn
9 2020-02-10 38 learn learn
10 2020-02-11 38 learn learn
11 2020-02-12 38 learn learn
12 2020-02-13 70 learn practice
13 2020-02-14 70 practice practice
14 2020-02-15 38 exam practice
15 2020-02-16 38 exam practice
16 2020-02-17 70 exam exam
17 2020-02-18 70 exam exam
18 2020-02-19 38 exam exam
19 2020-02-20 38 test exam
20 2020-02-21 70 test exam
21 2020-02-22 70 test test
22 2020-02-23 38 test test
23 2020-02-24 38 test NaN
24 2020-02-25 70 test NaN
25 2020-02-26 70 test NaN
26 2020-02-27 70 test NaN
Use Series.ne + Series.mask to mask the values in column plan where plan is not equal to plan_r.
print(df)
Date t_factor plan
0 2020-02-01 5 NaN
1 2020-02-02 23 NaN
2 2020-02-03 14 start
3 2020-02-04 23 start
4 2020-02-05 23 start
5 2020-02-06 23 NaN
6 2020-02-07 30 foundation
7 2020-02-08 29 foundation
8 2020-02-09 100 NaN
9 2020-02-10 38 learn
10 2020-02-11 38 learn
11 2020-02-12 38 learn
12 2020-02-13 70 NaN
13 2020-02-14 70 practice
14 2020-02-15 38 NaN
15 2020-02-16 38 NaN
16 2020-02-17 70 exam
17 2020-02-18 70 exam
18 2020-02-19 38 exam
19 2020-02-20 38 NaN
20 2020-02-21 70 NaN
21 2020-02-22 70 test
22 2020-02-23 38 test
23 2020-02-24 38 NaN
24 2020-02-25 70 NaN
25 2020-02-26 70 NaN
26 2020-02-27 70 NaN
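The two-pass merge_asof idea above can be tried on a tiny, non-overlapping example. This is only a sketch under assumed data (the frames days and plans below are hypothetical); the backward pass finds the latest plan that has started, the forward pass finds the earliest plan that has not ended yet, and a date is kept only when both passes agree:

```python
import pandas as pd

# Hypothetical inputs: four dates and two non-overlapping plans.
days = pd.DataFrame({"Date": pd.to_datetime(
    ["2020-02-02", "2020-02-04", "2020-02-06", "2020-02-07"])})
plans = pd.DataFrame({
    "From": pd.to_datetime(["2020-02-03", "2020-02-07"]),
    "to": pd.to_datetime(["2020-02-05", "2020-02-08"]),
    "plan": ["start", "foundation"],
})

# Backward pass: last plan whose From is on or before Date.
out = pd.merge_asof(days, plans[["From", "plan"]],
                    left_on="Date", right_on="From")
# Forward pass: first plan whose to is on or after Date.
out = pd.merge_asof(out, plans[["to", "plan"]], left_on="Date", right_on="to",
                    direction="forward", suffixes=("", "_r"))
out = out.drop(columns=["From", "to"])

# Keep plan only where both passes point at the same plan.
out["plan"] = out["plan"].mask(out["plan"].ne(out.pop("plan_r")))
print(out)
```

Here 2020-02-02 (before any plan) and 2020-02-06 (between two plans) end up as NaN, while 2020-02-04 and 2020-02-07 keep their plan.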
Using pd.to_datetime, convert the date-like columns to pandas datetime series:
df1['Date'] = pd.to_datetime(df1['Date'])
df2[['From', 'to']] = df2[['From', 'to']].apply(pd.to_datetime)
Create a pd.IntervalIndex from the columns From and to of df2, then use Series.map on the column Date of df1 to map it to column plan from df2 (after setting the idx):
idx = pd.IntervalIndex.from_arrays(df2['From'], df2['to'], closed='both')
df1['plan'] = df1['Date'].map(df2.set_index(idx)['plan'])
Result:
Date t_factor plan
0 2020-02-01 5 NaN
1 2020-02-02 23 NaN
2 2020-02-03 14 start
3 2020-02-04 23 start
4 2020-02-05 23 start
5 2020-02-06 23 NaN
6 2020-02-07 30 foundation
7 2020-02-08 29 foundation
8 2020-02-09 100 NaN
9 2020-02-10 38 learn
10 2020-02-11 38 learn
11 2020-02-12 38 learn
12 2020-02-13 70 NaN
13 2020-02-14 70 practice
14 2020-02-15 38 practice
15 2020-02-16 38 practice
16 2020-02-17 70 exam
17 2020-02-18 70 exam
18 2020-02-19 38 NaN
19 2020-02-20 38 test
20 2020-02-21 70 test
21 2020-02-22 70 test
22 2020-02-23 38 test
23 2020-02-24 38 NaN
24 2020-02-25 70 NaN
25 2020-02-26 70 NaN
26 2020-02-27 70 NaN
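For readers who want to inspect the interval lookup on its own, here is a minimal sketch using hypothetical dates. It uses IntervalIndex.get_indexer, which returns -1 for dates outside every interval, and IntervalIndex.is_overlapping, which may be worth checking first because the question's df2 contains ranges that overlap (for example 2020-02-14 to 2020-02-16 and 2020-02-15 to 2020-02-21):

```python
import pandas as pd

# Two hypothetical, non-overlapping closed intervals.
idx = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2020-02-03", "2020-02-07"]),
    pd.to_datetime(["2020-02-05", "2020-02-08"]),
    closed="both",
)
dates = pd.to_datetime(["2020-02-02", "2020-02-04", "2020-02-08"])

# Position of the interval containing each date; -1 means no match.
positions = idx.get_indexer(dates).tolist()

# The same check on two ranges taken from the question's df2.
overlapping = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2020-02-14", "2020-02-15"]),
    pd.to_datetime(["2020-02-16", "2020-02-21"]),
    closed="both",
)
has_overlap = bool(overlapping.is_overlapping)
print(positions, has_overlap)
```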
I have two dataframes as shown below.
df1:
Date t_factor plan plan_score
0 2020-02-01 5 NaN 0
1 2020-02-02 23 NaN 0
2 2020-02-03 14 start 0
3 2020-02-04 23 start 0
4 2020-02-05 23 start 0
5 2020-02-06 23 NaN 0
6 2020-02-07 30 foundation 0
7 2020-02-08 29 foundation 0
8 2020-02-09 100 NaN 0
9 2020-02-10 38 learn 0
10 2020-02-11 38 learn 0
11 2020-02-12 38 learn 0
12 2020-02-13 70 NaN 0
13 2020-02-14 70 practice 0
14 2020-02-15 38 NaN 0
15 2020-02-16 38 NaN 0
16 2020-02-17 70 exam 0
17 2020-02-18 70 exam 0
18 2020-02-19 38 exam 0
19 2020-02-20 38 NaN 0
20 2020-02-21 70 NaN 0
21 2020-02-22 70 test 0
22 2020-02-23 38 test 0
23 2020-02-24 38 NaN 0
24 2020-02-25 70 NaN 0
25 2020-02-26 70 NaN 0
26 2020-02-27 70 NaN 0
df2:
From to plan score
2020-02-03 2020-02-05 start 20
2020-02-07 2020-02-08 foundation 25
2020-02-10 2020-02-12 learn 10
2020-02-14 2020-02-16 practice 20
2020-02-15 2020-02-21 exam 30
2020-02-20 2020-02-23 test 10
Explanation:
I have loaded both dataframes and would like to export them as one Excel file, with Sheet1 = df1 and Sheet2 = df2.
I tried the code below.
import os
import pandas as pd
from pandas import ExcelWriter
def save_xls(list_dfs, xls_path):
    with ExcelWriter(xls_path) as writer:
        for n, df in enumerate(list_dfs):
            df.to_excel(writer,'sheet%s' % n)
        writer.save()
save_xls([df1, df2], os.getcwd())
And it gives me the following error.
---------------------------------------------------------------------------
OptionError Traceback (most recent call last)
~/admvenv/lib/python3.7/site-packages/pandas/io/excel/_base.py in __new__(cls, path, engine, **kwargs)
630 try:
--> 631 engine = config.get_option(f"io.excel.{ext}.writer")
632 if engine == "auto":
~/admvenv/lib/python3.7/site-packages/pandas/_config/config.py in __call__(self, *args, **kwds)
230 def __call__(self, *args, **kwds):
--> 231 return self.__func__(*args, **kwds)
232
~/admvenv/lib/python3.7/site-packages/pandas/_config/config.py in _get_option(pat, silent)
101 def _get_option(pat, silent=False):
--> 102 key = _get_single_key(pat, silent)
103
~/admvenv/lib/python3.7/site-packages/pandas/_config/config.py in _get_single_key(pat, silent)
87 _warn_if_deprecated(pat)
---> 88 raise OptionError(f"No such keys(s): {repr(pat)}")
89 if len(keys) > 1:
OptionError: "No such keys(s): 'io.excel..writer'"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-16-80bc8a5d0d2f> in <module>
----> 1 save_xls([df1, df2], os.getcwd())
<ipython-input-15-0d1448e7aea8> in save_xls(list_dfs, xls_path)
1 def save_xls(list_dfs, xls_path):
----> 2 with ExcelWriter(xls_path) as writer:
3 for n, df in enumerate(list_dfs):
4 df.to_excel(writer,'sheet%s' % n)
5 writer.save()
~/admvenv/lib/python3.7/site-packages/pandas/io/excel/_base.py in __new__(cls, path, engine, **kwargs)
633 engine = _get_default_writer(ext)
634 except KeyError:
--> 635 raise ValueError(f"No engine for filetype: '{ext}'")
636 cls = get_writer(engine)
637
ValueError: No engine for filetype: ''
Your code is fine; you are just missing the Excel file name, and therefore the extension. That is what the error is telling you.
Try
save_xls([df1, df2], os.getcwd() + '/name.xlsx')
or include a default Excel file name in your function.
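The link between the missing file name and the "No engine for filetype: ''" message can be illustrated with the standard library alone. In this sketch the directory and the default name "output.xlsx" are hypothetical; the point is only that a bare directory path has an empty extension, while a path that ends in a file name does not:

```python
import os

def excel_path(directory, filename="output.xlsx"):
    # "output.xlsx" is a hypothetical default, not a name from the question.
    return os.path.join(directory, filename)

# A made-up directory standing in for os.getcwd().
directory = os.path.join("home", "user", "project")

# The extension is what the writer uses to pick an engine.
bare_ext = os.path.splitext(directory)[1]
full_ext = os.path.splitext(excel_path(directory))[1]
print(repr(bare_ext), repr(full_ext))
```

Using os.path.join instead of string concatenation with '/' also keeps the path separator correct on Windows.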
I am doing some research and need to delete the rows containing values that are not in a specific range, using Python.
My Dataset in Excel:
I want to replace the out-of-range values of column A (not within the range 1-20) with NaN, the out-of-range values of column B (not within the range 21-40) with NaN, and so on.
Then I want to drop the rows that contain NaN values.
The expected output should look like:
You can try this to solve your problem. Here I simulated your problem and solved it with the code below:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in range(1,10,1) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in range(10,20,1) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in range(20,30,1) else x)
print(data)
data = data.dropna()
print(data)
Original data:
A B C
0 1 10 20
1 2 11 22
2 4 15 25
3 8 20 30
4 12 25 35
5 18 40 55
6 20 45 60
Output with NaN:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.0 30.0
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Final Output:
A B C
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Try this for non-integer numbers:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(1.00,10.00,0.01)) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(10.00,20.00,0.01)) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(20.00,30.00,0.01)) else x)
print(data)
data = data.dropna()
print(data)
Output:
A B C
0 1.25 10.56 20.11
1 2.39 11.19 22.92
2 4.00 15.65 25.27
3 8.89 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
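The same masking can also be written with vectorized comparisons, which avoids building a list of candidate values and works for integers and floats alike. This is a sketch that reuses the answer's sample data and its half-open ranges (1 to 10, 10 to 20, 20 to 30); the bounds dictionary is a hypothetical helper, and the comparison would need to be inverted to blank out values outside a range instead:

```python
import pandas as pd

data = pd.DataFrame({
    "A": [1, 2, 4, 8, 12, 18, 20],
    "B": [10, 11, 15, 20, 25, 40, 45],
    "C": [20, 22, 25, 30, 35, 55, 60],
})

# Hypothetical per-column bounds: lower bound included, upper bound excluded.
bounds = {"A": (1, 10), "B": (10, 20), "C": (20, 30)}

masked = data.copy()
for col, (low, high) in bounds.items():
    in_range = (data[col] >= low) & (data[col] < high)
    # mask() replaces the values where the condition is True with NaN.
    masked[col] = data[col].mask(in_range)

result = masked.dropna()
print(result)
```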
Try this:
df = df.drop(df.index[df.idxmax()])
Output:
A B C D
0 1 21 41 61
1 2 22 42 62
2 3 23 43 63
3 4 24 44 64
4 5 25 45 65
5 6 26 46 66
6 7 27 47 67
7 8 28 48 68
8 9 29 49 69
13 14 34 54 74
14 15 35 55 75
15 16 36 56 76
16 17 37 57 77
17 18 38 58 78
18 19 39 59 79
19 20 40 60 80
Use idxmax and drop the returned index.
I just started using pandas. I wanted to import an Excel file with 31 rows and 11 columns, but the output shows only some of the columns: the middle columns are replaced by "...", and the first few elements of the first column, 'EST', are displayed with "00:00:00" appended.
Code
import pandas as pd
df = pd.read_excel(r"C:\Users\daryl\PycharmProjects\pandas\Book1.xlsx")
print(df)
Output
C:\Users\daryl\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/daryl/PycharmProjects/pandas/1. Introduction.py"
EST Temperature ... Events WindDirDegrees
0 2016-01-01 00:00:00 38 ... NaN 281
1 2016-02-01 00:00:00 36 ... NaN 275
2 2016-03-01 00:00:00 40 ... NaN 277
3 2016-04-01 00:00:00 25 ... NaN 345
4 2016-05-01 00:00:00 20 ... NaN 333
5 2016-06-01 00:00:00 33 ... NaN 259
6 2016-07-01 00:00:00 39 ... NaN 293
7 2016-08-01 00:00:00 39 ... NaN 79
8 2016-09-01 00:00:00 44 ... Rain 76
9 2016-10-01 00:00:00 50 ... Rain 109
10 2016-11-01 00:00:00 33 ... NaN 289
11 2016-12-01 00:00:00 35 ... NaN 235
12 1-13-2016 26 ... NaN 284
13 1-14-2016 30 ... NaN 266
14 1-15-2016 43 ... NaN 101
15 1-16-2016 47 ... Rain 340
16 1-17-2016 36 ... Fog-Snow 345
17 1-18-2016 25 ... Snow 293
18 1/19/2016 22 ... NaN 293
19 1-20-2016 32 ... NaN 302
20 1-21-2016 31 ... NaN 312
21 1-22-2016 26 ... Snow 34
22 1-23-2016 26 ... Fog-Snow 42
23 1-24-2016 28 ... Snow 327
24 1-25-2016 34 ... NaN 286
25 1-26-2016 43 ... NaN 244
26 1-27-2016 41 ... Rain 311
27 1-28-2016 37 ... NaN 234
28 1-29-2016 36 ... NaN 298
29 1-30-2016 34 ... NaN 257
30 1-31-2016 46 ... NaN 241
[31 rows x 11 columns]
Process finished with exit code 0
To answer your question about the display of only a few columns and "..." :
All of the columns have been properly ingested, but your screen / the console is not wide enough to output all of the columns at once in a "print" fashion. This is normal/expected behavior.
Pandas is not a spreadsheet visualization tool like Excel. Maybe someone can suggest a tool for visualizing dataframes in a spreadsheet format for Python, like in Excel. I think I've seen people visualizing spreadsheets in Spyder but I don't use that myself.
If you want to make sure all of the columns are there, try using list(df) or print(list(df)).
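If the goal is to actually see every column in the console, pandas' display options can be relaxed. The sketch below uses a hypothetical 30-column frame and pd.option_context, so the settings apply only inside the with block; the width of 1000 characters is an arbitrary choice:

```python
import pandas as pd

# Hypothetical wide frame standing in for the 11-column Excel sheet.
df = pd.DataFrame({f"col{i}": range(3) for i in range(30)})

# max_columns=None disables column truncation; a large width avoids wrapping.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full_text = repr(df)

print(full_text)
```

pd.set_option with the same option names makes the change last for the whole session instead.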
To answer your question about the EST format:
It looks like you have some data cleaning to do. This is typical work in data science. I am not sure how to best do this - I haven't worked much with dates/datetime yet. However, here is what I see:
The first few items have timestamps as well, likely formatted in HH:MM:SS
They are formatted YYYY-MM-DD
On index row 18, there are / instead of - in the date
The remaining rows are formatted M-DD-YYYY
There's an option in the read_csv documentation that may take care of those automatically. It's called "parse_dates", and read_excel accepts it too. If you turn it on with a list of column names, like pd.read_excel('file location', parse_dates=['EST']), that could enable the date parser for the EST column and maybe solve your problem.
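If parse_dates does not cope with the mixed formats, another possibility is to parse the column value by value after loading. This is a sketch on three hypothetical strings that mimic the formats seen in the output above; whether month-first parsing is the right reading for the real data is an assumption:

```python
import pandas as pd

# Hypothetical sample mimicking the three formats in the EST column.
raw = pd.Series(["2016-01-01 00:00:00", "1-13-2016", "1/19/2016"])

# Parsing each element separately lets every value use its own format.
parsed = raw.map(pd.to_datetime)
as_text = parsed.dt.strftime("%Y-%m-%d").tolist()
print(as_text)
```

Element-wise parsing is slower than a single vectorized call, which should not matter for a 31-row sheet.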
Hope this helps! This is my first answer, so anyone who sees it, feel free to edit and improve it.