Pandas condition-based row elimination in DataFrame - excel

I have a DataFrame with information stored in a column up to an unknown row number. After this row, the column only stores NaN values. However, some stray NaN values also appear earlier in the column. I want a cumulative check of how many NaN values are repeated, to determine the last row that stores information.
My code is as follows:
first, I create a NaN checker that accumulates the number of NaN values row after row
next, it checks whether the NaN checker exceeds a certain threshold (3 in this case)
last, if the threshold is exceeded, the subsequent rows are eliminated
Check_NaN = (Fruits['bananas'].isnull().astype(int)
             .groupby(Fruits['bananas'].notnull().astype(int).cumsum())
             .sum())
for row in Fruits:
    for cell in row['bananas']:
        if cell(Check_NaN) < 3:
            sum_Fruits.update(Fruits)
        else:
            row.dropna(subset=['bananas'])
Below is a data sample for Fruits['bananas'] (rows 110-129). The end of the Excel information in the DataFrame is indicated by the start of the NaN values.
110 banana red
111 banana green
112 banana white
113 banana yellow
114 banana black
115 banana orange
116 banana purple
117 banana pink
118 banana blue
119 banana silver
120 banana grey
121 banana gold
122 banana white
123 banana orange
124 --
125 NaN
126 NaN
127 NaN
128 NaN
129 NaN
However, I run into a problem with for cell in row['bananas']:, which gives TypeError: string indices must be integers.
This is confusing to me, as I cannot iterate over the rows that I want to eliminate. I need reusable code, because the NaN values start at a different row in each Excel sheet. How can I write my script so that the threshold of 3 NaN values is recognised and the remaining rows are eliminated?

To achieve this you could look at the shift function in pandas: shift twice and check whether all three values are NaN.
Try this:
# Find the rows where the row itself and the two subsequent rows are null in the bananas column
All_three_null = Fruits['bananas'].isna() & Fruits['bananas'].shift(-1).isna() & Fruits['bananas'].shift(-2).isna()
# Find the index of the first row where this happens
First_instance = Fruits[All_three_null].index.min()
# Filter the data to remove all the null rows (< keeps only the rows before the first all-null run)
Good_data = Fruits[Fruits.index < First_instance]
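If it helps, here is a quick self-contained check of that idea on a small made-up frame (the frame and its values are illustrative, not the asker's data):
import numpy as np
import pandas as pd

# A toy version of the 'bananas' column: one stray NaN, then a run of NaNs at the end
demo = pd.DataFrame({"bananas": ["banana red", "banana green", np.nan, "banana white",
                                 np.nan, np.nan, np.nan, np.nan]})

all_three_null = (demo["bananas"].isna()
                  & demo["bananas"].shift(-1).isna()
                  & demo["bananas"].shift(-2).isna())
first_instance = demo[all_three_null].index.min()   # 4 here: the start of the closing NaN run
good_data = demo[demo.index < first_instance]       # keeps rows 0-3, the stray NaN included
print(good_data)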
Another option, which will scale better if you want to move from 3 NaNs in a row to 30!
The basic idea is to group each run of consecutive NaN occurrences into a uniquely identifiable group, then find the first group that exceeds the set limit and use it to filter the original DataFrame.
NaN_in_a_Row = 3
# True for rows that actually hold a value
Fruits['Row_Not_NaN'] = Fruits['bananas'].notna()
# True only for the first NaN of each run of consecutive NaNs
Fruits['First_NaN_After_Not_NaN'] = Fruits['bananas'].isna() & Fruits['bananas'].shift(1).notna()
# The two flags never overlap, so each value row and each new NaN run starts a new group
Fruits['Group_ID'] = (Fruits['Row_Not_NaN'] | Fruits['First_NaN_After_Not_NaN']).cumsum()
Fruits['Number_of_Rows'] = 1
# Group sizes: only NaN runs can be larger than 1
Filter = Fruits.groupby('Group_ID')['Number_of_Rows'].sum()
# Group_ID of the first NaN run that reaches the limit (assumes such a run exists)
Filter = Filter[Filter >= NaN_in_a_Row].index.min()
Fruits = Fruits[Fruits.Group_ID < Filter]
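As a hedged variation on the same idea, the run-length grouping can be wrapped in a small helper so that the column name and the NaN threshold become parameters; the function name and its arguments are mine, not part of the question:
import pandas as pd

def trim_after_nan_run(df, column, max_nan_in_a_row=3):
    """Drop everything from the first run of >= max_nan_in_a_row NaNs onward."""
    is_nan = df[column].isna()
    # A new run id every time the NaN / not-NaN state flips
    run_id = (is_nan != is_nan.shift()).cumsum()
    # Length of the run each row belongs to (0 for value rows, run length for NaN rows)
    run_len = is_nan.groupby(run_id).transform("sum")
    mask = is_nan & (run_len >= max_nan_in_a_row)
    if not mask.any():
        return df
    # Keep everything before the first row of the first long-enough NaN run
    return df.iloc[:df.index.get_loc(mask.idxmax())]

# e.g. Fruits = trim_after_nan_run(Fruits, 'bananas', max_nan_in_a_row=3)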

Related

Convert columns with large numbers and float64 as int type to suppress scientific formatting

I tried many SO answers but nothing I tried so far worked. I have a column in a df where there is a large value like 89898989898989898. Whatever I do, this is not being displayed as a number. All the cols are of float64 dtype; I do not have any float values in my df.
After creating a pivot I get a dataframe df. I tried converting to int and then writing to Excel, but that does not seem to make any difference and it still displays in scientific formatting (I can see the value when I click on the cell in the value bar). I could not convert to int directly as there are NaN values in the column:
ID CATEG LEVEL COLS VALUE COMMENTS
1 A 2 Apple 1e+13 comment1
1 A 3 Apple 1e+13 comment1
1 C 1 Apple 1e+13 comment1
1 C 2 Apple 345 comment1
1 C 3 Apple 289 comment1
1 B 1 Apple 712 comment1
1 B 2 Apple 1e+13 comment1
2 B 3 Apple 376 comment1
2 C None Orange 1e+13 comment1
2 B None Orange 135 comment1
2 D None Orange 423 comment1
2 A None Orange 866 comment1
2 None Orange 496 comment2
After pivot the Apple column looks like this (providing just sample values to show the scientific notation values) :-
index Apple
1655 1e+13
1656 1e+13
1657 1e+13
1658 NaN
1659 NaN
df = pd.pivot_table(dfe, index=['ID','CATEG','LEVEL'], columns=['COLS'], values=['VALUE'])
df = df.fillna(0).astype(np.int64)
with pd.ExcelWriter('file.xlsx', options={'nan_inf_to_errors': True}) as writer:
    df.groupby('ID').apply(lambda x: x.dropna(how='all', axis=1).to_excel(writer, sheet_name=str(x.name), na_rep=0, index=True))
    writer.save()
What should I do to get rid of the scientific formatting and have the values displayed as numbers in Excel?
Also, is there a way to autofit the columns in Excel while writing from Python? I'm using pd.ExcelWriter to write to Excel.
You can use set_option to avoid scientific formatting:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Also you can change column type with this command:
df['col'] = df[['col']].fillna(0)
df['col'] = df['col'].astype(int)
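If the column has to stay integer-typed despite the NaNs (rather than filling them with 0), one hedged alternative is pandas' nullable integer dtype, available in reasonably recent pandas versions; 'col' is just the placeholder name used above:
import pandas as pd

# Keeps missing values as <NA> while storing whole numbers, so nothing is
# pushed into float (and hence scientific notation) before it reaches Excel
df['col'] = pd.to_numeric(df['col'], errors='coerce').astype('Int64')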
The following code generates a DataFrame with two columns, one with text and one with numbers, writes them to Excel, and changes the format so that Excel displays the numbers as integers with a thousands separator rather than in scientific notation. If you prefer a different format, I am sure you can adapt the format string in the code. I hope that's what you're looking for :).
import random

import numpy as np
import pandas as pd

col1 = [float("nan"), *np.random.randint(10**15, size=5)]
col2 = [random.choice(["apple", "orange"]) for _ in range(6)]

with pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter') as writer:
    pd.DataFrame(data={"numbers": col1, "strings": col2}).to_excel(writer)
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    format_ = workbook.add_format({'num_format': '#,##'})
    worksheet.set_column('B:B', None, format_)
For your specific example try:
with pd.ExcelWriter('file.xlsx', options={'nan_inf_to_errors': True}) as writer:
    df.groupby('ID').apply(lambda x: x.dropna(how='all', axis=1).to_excel(writer, sheet_name=str(x.name), na_rep=0, index=True))
    workbook = writer.book
    format_ = workbook.add_format({'num_format': '#,##'})
    # the sheets are named after the group IDs, so apply the format to every sheet that was written
    for worksheet in writer.sheets.values():
        worksheet.set_column('C:C', None, format_)
        worksheet.set_column('E:E', None, format_)
Notice that I am guessing from the other thread that your columns with numbers will be C and E. You will probably have to widen the columns for the huge numbers; otherwise Excel will display ######### instead.
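For the autofit part of the question there is, as far as I know, no single call in older xlsxwriter versions (newer ones add worksheet.autofit()); a common workaround is to measure the longest value per column yourself. A rough sketch, assuming a frame df that was written with its index to a worksheet object like the ones above:
# Widen each column to its longest cell (plus a little padding); column 0 holds the index
for col_num, col_name in enumerate(df.columns, start=1):
    longest = max(df[col_name].astype(str).map(len).max(), len(str(col_name)))
    worksheet.set_column(col_num, col_num, longest + 2)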

How to calculate mean by skipping String Value in Numeric Column?

Name Emp_ID Salary Age
0 John Abc 21000 31
1 mark Abn 34000 82
2 samy bbc thirty 78
3 Johny Ajc 21000 34
4 John Ajk 2100.28 twentyone
How can I calculate the mean of the 'Age' column without changing the string values in that column? Basically I want to loop through the Age column, take the numerical values, and compute the mean of that list, skipping any string value.
Use pd.to_numeric with the argument errors='coerce', which turns values into NaN when they cannot be converted to numeric. Then use Series.mean:
pd.to_numeric(df['Age'],errors='coerce').mean()
#Out
56.25
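For completeness, a self-contained sketch of the same one-liner on the sample data above:
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "mark", "samy", "Johny", "John"],
    "Emp_ID": ["Abc", "Abn", "bbc", "Ajc", "Ajk"],
    "Salary": [21000, 34000, "thirty", 21000, 2100.28],
    "Age": [31, 82, 78, 34, "twentyone"],
})

# 'twentyone' becomes NaN and is ignored by mean(): (31 + 82 + 78 + 34) / 4 = 56.25
print(pd.to_numeric(df["Age"], errors="coerce").mean())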

Issue with parsing text file in pandas

I have the following text file which I would like to load in python:
     cabin embarked  boat   body
0       B5        S     2    NaN
1  C22 C26        S    11    NaN
2  C22 C26        S   NaN    NaN
3  C22 C26        S   NaN  135.0
4  C22 C26        S   NaN    NaN
5      E12        S     3    NaN
6       D7        S    10    NaN
7      A36        S   NaN    NaN
8     C101        S     D    NaN
Based on the response to a similar question that I received, I tried the following:
df = pd.read_fwf("test.csv", header=0, index_col=0)
and it worked fine.
But the following doesn't work:
pd.read_csv("test.csv", sep="\s{2,}", header=0, index_col=0, engine="python")
I get the following error:
ValueError: Expected 4 fields in line 2, saw 5
Given that sep="\s{2,}" treats fields as separated by 2 or more whitespace characters,
line 2 (0 B5 S 2 NaN)
should have been parsed without any problem. Also, I see only 4 fields in line 2 (excluding the row index, which is taken care of by index_col=0); which is the 5th field that the error is referring to?
cabin embarked is only one space apart, so it gets parsed as a single string. pd.read_csv is also given some latitude and treats the leading whitespace on each line as an empty field for the index. Splitting the header line on runs of 2 or more spaces therefore gives 4 fields:
     cabin embarked  boat   body
# -> ['', 'cabin embarked', 'boat', 'body']   # 4 fields - this row establishes expectations
while splitting the first data line gives 5:
0       B5        S     2    NaN
# -> ['0', 'B5', 'S', '2', 'NaN']             # 5 fields
And that's the error: row 1 established a precedent of 4 fields and row 2 shows 5.
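If you still want read_csv rather than read_fwf, one hedged workaround is to skip that header line and supply the column names yourself, so every parsed line has the same five fields (index plus four columns); the name 'row' for the index is just an illustration:
import pandas as pd

df = pd.read_csv(
    "test.csv",
    sep=r"\s{2,}",
    engine="python",
    skiprows=1,                                           # drop the 4-field header line
    names=["row", "cabin", "embarked", "boat", "body"],   # 5 names to match the data lines
    index_col="row",
)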

Sorting and Grouping in Pandas data frame column alphabetically

I want to sort and group by a pandas data frame column alphabetically.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" so that it is in alphabetical order and grouped as well, i.e. the output is as follows:
a b c
1 purchase 130 230.0
2 10 20.0
4 103 320.0
0 sales 2 NaN
3 122 245.0
How can I do this?
I think you should use the sort_values method of pandas:
result = dataframe.sort_values('a')
It will sort your dataframe by column a, and the rows end up grouped as well simply because of the sorting. See ya!
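If the goal is literally the display shown in the question, where each value of a is only printed on the first row of its group, a hedged follow-up is to sort and then blank out the repeats:
import pandas as pd

df = pd.DataFrame({
    "a": ["sales", "purchase", "purchase", "sales", "purchase"],
    "b": [2, 130, 10, 122, 103],
    "c": [None, 230.0, 20.0, 245.0, 320.0],
})

# mergesort is stable, so rows keep their original order inside each group
result = df.sort_values("a", kind="mergesort")
result["a"] = result["a"].mask(result["a"].duplicated(), "")
print(result)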

Using Pandas filtering non-numeric data from two columns of a Dataframe

I'm loading a Pandas dataframe which has many data types (loaded from Excel). Two particular columns should be floats, but occasionally a researcher entered in a random comment like "not measured." I need to drop any row where the value in either of these two columns is not a number, while preserving the non-numeric data in the other columns. A simple use case looks like this (the real table has several thousand rows...)
import pandas as pd
df = pd.DataFrame(dict(A = pd.Series([1,2,3,4,5]), B = pd.Series([96,33,45,'',8]), C = pd.Series([12,'Not measured',15,66,42]), D = pd.Series(['apples', 'oranges', 'peaches', 'plums', 'pears'])))
Which results in this data table:
A B C D
0 1 96 12 apples
1 2 33 Not measured oranges
2 3 45 15 peaches
3 4 66 plums
4 5 8 42 pears
I'm not clear how to get to this table:
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
I tried dropna, but the types are "object" since there are non-numeric entries.
I can't convert the values to floats without either converting the whole table, or doing one series at a time which loses the relationship to the other data in the row. Perhaps there is something simple I'm not understanding?
You can first create a subset with columns B and C, apply to_numeric, and check whether all values are not null. Then use boolean indexing:
print df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)
0 True
1 False
2 True
3 False
4 True
dtype: bool
print df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
The next solution uses str.isdigit with isnull and xor (^):
print df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()
0 True
1 False
2 True
3 False
4 True
dtype: bool
print df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
But the solution with to_numeric, isnull and notnull is fastest:
print df[pd.to_numeric(df['B'], errors='coerce').notnull()
^ pd.to_numeric(df['C'], errors='coerce').isnull()]
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
Timings:
#len(df) = 5k
df = pd.concat([df]*1000).reset_index(drop=True)
In [611]: %timeit df[pd.to_numeric(df['B'], errors='coerce').notnull() ^ pd.to_numeric(df['C'], errors='coerce').isnull()]
1000 loops, best of 3: 1.88 ms per loop
In [612]: %timeit df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
100 loops, best of 3: 16.1 ms per loop
In [613]: %timeit df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.49 ms per loop
