Convert columns with large numbers and float64 as int type to suppress scientific formatting - python-3.x

I tried many SO answers but nothing so far has worked. I have a column in a df with a large value like 89898989898989898, and whatever I do, it is not displayed as a plain number. All the columns are of float64 dtype, and I do not have any actual float values in my df.
After creating a pivot I get a dataframe df. I tried converting to int and then writing to Excel, but it makes no difference: the value still displays in scientific formatting (I can see the full value when I click on the cell in the formula bar). I could not convert to int directly because there are NaN values in the column:
ID  CATEG  LEVEL  COLS    VALUE  COMMENTS
1   A      2      Apple   1e+13  comment1
1   A      3      Apple   1e+13  comment1
1   C      1      Apple   1e+13  comment1
1   C      2      Apple   345    comment1
1   C      3      Apple   289    comment1
1   B      1      Apple   712    comment1
1   B      2      Apple   1e+13  comment1
2   B      3      Apple   376    comment1
2   C      None   Orange  1e+13  comment1
2   B      None   Orange  135    comment1
2   D      None   Orange  423    comment1
2   A      None   Orange  866    comment1
2          None   Orange  496    comment2
After the pivot, the Apple column looks like this (sample values only, to show the scientific notation):
index Apple
1655 1e+13
1656 1e+13
1657 1e+13
1658 NaN
1659 NaN
import numpy as np
import pandas as pd

df = pd.pivot_table(dfe, index=['ID', 'CATEG', 'LEVEL'], columns=['COLS'], values=['VALUE'])
df = df.fillna(0).astype(np.int64)
with pd.ExcelWriter('file.xlsx', options={'nan_inf_to_errors': True}) as writer:
    df.groupby('ID').apply(lambda x: x.dropna(how='all', axis=1)
                            .to_excel(writer, sheet_name=str(x.name), na_rep=0, index=True))
    # the context manager saves the file on exit; no explicit writer.save() needed
What should I do to get rid of the scientific formatting so the values display as plain numbers in Excel?
Also, is there a way to autofit the columns while writing from Python to Excel? I'm using pd.ExcelWriter to write to Excel.

You can use set_option to avoid scientific formatting:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
You can also change the column type with:
df['col'] = df['col'].fillna(0)
df['col'] = df['col'].astype(int)
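Note that display.float_format only changes how pandas prints; it does not change what is written to Excel. If the blocker is that NaN prevents an int cast, a hedged alternative is pandas' nullable Int64 dtype, which holds integers and missing values side by side (a minimal sketch; keep in mind float64 is only exact up to 2**53, so values this large may already have been rounded on the way in):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [89898989898989898, np.nan, 345]})
# 'Int64' (capital I) is the nullable integer dtype: the missing value
# survives as <NA> instead of forcing the whole column to stay float
df['col'] = df['col'].astype('Int64')
print(df)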

The following code generates a DataFrame with two columns, one with text and one with numbers, writes it to Excel, and changes the format so that Excel displays the numbers as digit-grouped integers rather than in scientific notation. If you prefer a different format, you can adapt the format string in the code. I hope that's what you're looking for :).
col1 = [float("nan"),*np.random.randint(10**15,size=5)]
col2 = [random.choice(["apple", "orange"]) for _ in range (6)]
with pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter') as writer:
pd.DataFrame(data={"numbers":col1, "strings":col2})\
.to_excel(writer)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
format_ = workbook.add_format({'num_format': '#,##'})
worksheet.set_column('B:B', None, format_)
For your specific example, try:
with pd.ExcelWriter('file.xlsx', engine='xlsxwriter', options={'nan_inf_to_errors': True}) as writer:
    df.groupby('ID').apply(lambda x: x.dropna(how='all', axis=1)
                            .to_excel(writer, sheet_name=str(x.name), na_rep=0, index=True))
    workbook = writer.book
    format_ = workbook.add_format({'num_format': '#,##'})
    # the groupby writes one sheet per ID, so format every sheet rather than just 'Sheet1'
    for worksheet in writer.sheets.values():
        worksheet.set_column('C:C', None, format_)
        worksheet.set_column('E:E', None, format_)
Notice that I am guessing from the other thread that your number columns will be C and E. You will probably have to widen the columns for the huge numbers; otherwise Excel will display ######### instead.
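For the autofit part of the question: older xlsxwriter releases have no built-in autofit (recent versions do ship worksheet.autofit()), but you can approximate it by setting each column to the width of its longest rendered value. A sketch, with an illustrative frame and filename:
import pandas as pd

df = pd.DataFrame({'VALUE': [89898989898989898, 345, 289],
                   'COMMENTS': ['comment1', 'comment1', 'comment1']})

with pd.ExcelWriter('autofit_demo.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
    worksheet = writer.sheets['Sheet1']
    # column 0 holds the index, so data columns start at 1;
    # width = longest of header or rendered values, plus a little padding
    for i, col in enumerate(df.columns, start=1):
        width = max(len(str(col)), int(df[col].astype(str).str.len().max())) + 2
        worksheet.set_column(i, i, width)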

Related

Map Pandas Series containing key/value pairs to new columns with data

I have a dataframe containing a pandas series (column 2) as below:
column 1  column 2                                                                                                   column 3
1123      Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request          INC29192
1251      NaN                                                                                                        INC18217
1918      Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request   INC19281
I'm struggling to extract, split, and map the column 2 data to a series of new columns with the appropriate data for each record (where possible, i.e. where data is available, as I have NaNs).
The desired output is something like this (where I've dropped the original column 2 for legibility):
column 1  column 3  Requested By  Requested On      Comments
1123      INC29192  John Doe 1    12 October 2021   This is a generic request
1251      INC18217  NaN           NaN               NaN
1918      INC19281  John Doe 2    2 September 2021  This is another generic request
I have spent quite some time trying various approaches, from lambda functions to comprehensions to explode methods, but haven't quite found a solution that produces the desired output.
First I would convert the column 2 values to dictionaries, then convert those to a DataFrame and join it back to your df:
df['column 2'] = df['column 2'].apply(
    lambda x: {y.split(' = ', 1)[0]: y.split(' = ', 1)[1]
               for y in x.split(r'\n ')}
    if not pd.isna(x) else {})
df = df.join(pd.DataFrame(df['column 2'].values.tolist())).drop('column 2', axis=1)
print(df)
Output:
column 1 column 3 Requested By Requested On Comments
0 1123 INC29192 John Doe 1 12 October 2021 This is a generic request
1 1251 INC18217 NaN NaN NaN
2 1918 INC19281 John Doe 2 2 September 2021 This is another generic request
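If the three fields always appear in that order, a hedged alternative is a single regex with str.extract, matching the literal '\n ' separators shown in the data (the group names are illustrative and renamed afterwards to match the desired headers):
import pandas as pd

df = pd.DataFrame({
    'column 1': [1123, 1251, 1918],
    'column 2': [r'Requested By = John Doe 1\n Requested On = 12 October 2021\n Comments = This is a generic request',
                 None,
                 r'Requested By = John Doe 2\n Requested On = 2 September 2021\n Comments = This is another generic request'],
    'column 3': ['INC29192', 'INC18217', 'INC19281']})

# one capture group per field; the NaN row yields NaN in every new column
pattern = (r'Requested By = (?P<Requested_By>.*?)\\n '
           r'Requested On = (?P<Requested_On>.*?)\\n '
           r'Comments = (?P<Comments>.*)')
extracted = (df['column 2'].str.extract(pattern)
             .rename(columns=lambda c: c.replace('_', ' ')))
df = df.join(extracted).drop(columns='column 2')
print(df)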

Convert dataframe display float format to human readable for output display purpose only

I wish to display the dataframe column values in human readable format like 10, 100, 1K, 1M, 1B, etc.
So far, I could convert scientific values such as 1.111111e1 to a plain float format using pandas options with the following setting:
pd.options.display.float_format = '{:.2f}'.format
Note: the 2 above means 2 decimal places; change it as you like.
But still the output is pretty hard to read when the column has many varying numeric values, especially in financial use cases with columns such as currency, turnover, profit, etc.
How to do this?
Note: I do not wish to convert the stored values into string format, as I have calculations on the column values, so that is not feasible. Further, I won't create new columns just for display purposes, so df['new_col'] = df['col']/1000000 won't work either.
Sample dataframe:
pd.DataFrame([10.,100.,1000.,10000.,100000.,1000000.,10000000.,100000000.,1000000000.,10000000000.])
0 1.000000e+01
1 1.000000e+02
2 1.000000e+03
3 1.000000e+04
4 1.000000e+05
5 1.000000e+06
6 1.000000e+07
7 1.000000e+08
8 1.000000e+09
9 1.000000e+10
Use the following function as the display.float_format argument in pandas' set_option to get the desired outcome:
lambda x: ('{:.2f}'.format(x) if abs(x) < 1000
           else '{:.2f} K'.format(x / 1000) if abs(x) < 1000000
           else '{:.2f} M'.format(x / 1000000) if abs(x) < 1000000000
           else '{:.2f} B'.format(x / 1000000000))
Output:
0 10.00
1 100.00
2 1.00 K
3 10.00 K
4 100.00 K
5 1.00 M
6 10.00 M
7 100.00 M
8 1.00 B
9 10.00 B
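For completeness, here is a sketch of the same thresholds written as a named function (the name human_format is illustrative) and registered with set_option:
import pandas as pd

def human_format(x):
    # same thresholds and suffixes as the lambda above
    if abs(x) < 1e3:
        return '{:.2f}'.format(x)
    if abs(x) < 1e6:
        return '{:.2f} K'.format(x / 1e3)
    if abs(x) < 1e9:
        return '{:.2f} M'.format(x / 1e6)
    return '{:.2f} B'.format(x / 1e9)

pd.set_option('display.float_format', human_format)
print(pd.DataFrame([10., 1000., 1000000., 1000000000.]))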

Pandas Drop an Entire Column if All of the Values equal a Certain Value

Let's say I have dataframes that look like this:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
The logic I am trying to implement is something like this:
If all of column C = "NaN" then drop the entire column
Else if all of column C = "Member" drop the entire column
else do nothing
Any suggestions?
Edit: Added Expected Output
Expected Output if using on both data frames:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b
0 dave blue
1 bill red
2 sally green
Edit #2: Why am I doing this in the first place?
I am ripping text from a PDF file and placing it into CSV files using the Tabula library.
The data is not coming out in the way that I am hoping it would, so I am applying ETL concepts to move the data around.
The final outcome would be for management to be able to open the final result into a nicely formatted Excel file.
Some of the columns have part of the headers put into a separate row and things got shifted around for some reason while ripping the data out of the PDF.
The headers look something like this:
Team Type Member Contact
Count
What I am doing is checking an entire column for certain header values. If the entire column has a header value, I'm dropping the entire column.
The idea is to replace 'Member' with missing values first, then test whether at least one value is non-missing using notna with any, and extend the mask to all other columns as True with Series.reindex:
mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print(mask)
Another idea is to chain both masks with & for bitwise AND:
mask = ((df[['c']].notna() & df[['c']].ne('Member'))
        .any()
        .reindex(df.columns, fill_value=True))
print(mask)
Last, filter the columns with DataFrame.loc:
df = df.loc[:, mask]
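The same test generalizes to every column at once, if values like 'Member' could appear anywhere (a hedged sketch; df_two is rebuilt here so the snippet is self-contained):
import numpy as np
import pandas as pd

df_two = pd.DataFrame({'a': ['dave', 'bill', 'sally'],
                       'b': ['blue', 'red', 'green'],
                       'c': [np.nan, np.nan, 'Member']})

# keep only columns holding at least one value that is neither NaN nor 'Member'
df_two = df_two.loc[:, (df_two.notna() & df_two.ne('Member')).any()]
print(df_two)  # columns a and b survive, c is dropped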
Here's an alternate approach to do this.
import pandas as pd
import numpy as np

c = ['a', 'b', 'c']
d = [['dave', 'blue', np.nan],
     ['bill', 'red', np.nan],
     ['sally', 'green', 'Member'],
     ['Ian', 'Org', 'Paid']]

df1 = pd.DataFrame(d, columns=c)
df2 = df1.loc[df1['a'] != 'Ian']
print(df1)
print(df2)

if df1.c.replace('Member', np.nan).isnull().all():
    df1 = df1[df1.columns.drop(['c'])]
print(df1)

if df2.c.replace('Member', np.nan).isnull().all():
    df2 = df2[df2.columns.drop(['c'])]
print(df2)
Output of this is:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b
0 dave blue
1 bill red
2 sally green
My idea is simple; maybe it will help you. To confirm the goal: drop the whole column if it contains only NaN or 'Member', otherwise do nothing.
So we need to check the column first (does it contain only NaN or 'Member'?). We change 'Member' to NaN and test whether everything is then null.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['dave', 'bill', 'sally', 'ian'],
                   'B': ['blue', 'red', 'green', 'org'],
                   'C': [np.nan, np.nan, 'Member', 'Paid']})
df2 = df.drop(index=[3], axis=0)
print(df)
print(df2)

# df
col = pd.Series([np.nan if x == 'Member' else x for x in df['C'].tolist()])
if col.isnull().all():
    df = df.drop(columns='C')

# df2
col = pd.Series([np.nan if x == 'Member' else x for x in df2['C'].tolist()])
if col.isnull().all():
    df2 = df2.drop(columns='C')

print(df)
print(df2)
A B C
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 ian org Paid
A B
0 dave blue
1 bill red
2 sally green

Text data massaging to conduct distance calculations in python

I am trying to get the text data from dataframe "A" converted to columns, while the text data from dataframe "B" goes into rows, in a new dataframe "C", in order to run distance calculations.
Data in dataframe "A" looks like this
Unique -> header
'Amy'
'little'
'sheep'
'dead'
Data in dataframe "B" looks like this
common_words -> header
'Amy'
'George'
'Barbara'
I want the output in dataframe C as:
Amy George Barbara
Amy
little
sheep
dead
Can anyone help me with this?
What should the actual content of dataframe C be? Do you only want to initialise it to some value (e.g. 0) in the first step and then fill it with the distance calculations?
You could initialise C in the following way:
import pandas as pd
A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])
C = pd.DataFrame([[0] * len(B)] * len(A), index=A[0], columns=B[0])
C will then look like:
Amy George Barbara
0
Amy 0 0 0
little 0 0 0
sheep 0 0 0
dead 0 0 0
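If the next step is the distance fill itself, here is a sketch using difflib's SequenceMatcher from the standard library as a stand-in for whichever metric you actually need:
import difflib
import pandas as pd

A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])

# similarity ratio of every word in A against every word in B
C = pd.DataFrame([[difflib.SequenceMatcher(None, a, b).ratio() for b in B[0]]
                  for a in A[0]],
                 index=A[0], columns=B[0])
print(C)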
Use pd.DataFrame(index=[list], columns=[list]).
Extract the relevant lists using list(df.columnname.values).
Dummy data
print(dfA)
Header
0 Amy
1 little
2 sheep
3 dead
print(dfB)
Header
0 Amy
1 George
2 Barbara
dfC = pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values))
Amy George Barbara
Amy NaN NaN NaN
little NaN NaN NaN
sheep NaN NaN NaN
dead NaN NaN NaN
If you want dfC without NaNs:
dfC = pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values)).fillna(' ')
Amy George Barbara
Amy
little
sheep
dead

Pandas condition-based row elimination in DataFrame

I have a DataFrame where information is stored in a column up to an unknown row number; after that row, the column only holds NaN values. However, some stray NaN values also appear throughout the column. I want to accumulate repeated NaN values to determine the last row that stores information.
My code is as follows:
first, I create a NaN checker that accumulates the number of NaN values row after row
next, it checks whether the NaN checker exceeds a certain threshold (3 in this case)
last, if the threshold is exceeded, the subsequent rows are eliminated
Check_NaN = (Fruits['bananas'].isnull().astype(int)
             .groupby(Fruits['bananas'].notnull().astype(int).cumsum())
             .sum())

for row in Fruits:
    for cell in row['bananas']:
        if cell(Check_NaN) < 3:
            sum_Fruits.update(Fruits)
        else:
            row.dropna(subset=['bananas'])
Below is a data sample for Fruits['bananas'] (rows 110-129); the end of the Excel information in the DataFrame is marked by the onset of NaN values.
110 banana red
111 banana green
112 banana white
113 banana yellow
114 banana black
115 banana orange
116 banana purple
117 banana pink
118 banana blue
119 banana silver
120 banana grey
121 banana gold
122 banana white
123 banana orange
124 --
125 NaN
126 NaN
127 NaN
128 NaN
129 NaN
However, I run into a problem: for cell in row['bananas']: gives TypeError: string indices must be integers.
This is confusing, as I cannot iterate over the rows I want to eliminate. I need reusable code, since the NaN values start at a different row in each Excel sheet. How can I write my script so that a threshold of 3 consecutive NaN values is detected and the remaining rows are dropped?
To achieve this you could look at the shift function in pandas, then shift twice and check whether all three values are NaN. (As an aside, the TypeError comes from for row in Fruits:, which iterates over the column names; each row is a string, so row['bananas'] tries to index a string with a string.)
Try this:
# Find the rows where the row itself and the two subsequent rows are null in the bananas column
All_three_null = (Fruits['bananas'].isna()
                  & Fruits['bananas'].shift(-1).isna()
                  & Fruits['bananas'].shift(-2).isna())
# Find the index of the first row where this happens
First_instance = Fruits[All_three_null].index.min()
# Keep only the rows before the run of NaNs starts
Good_data = Fruits[Fruits.index < First_instance]
Another option, which scales better if you want to move from 3 NaNs in a row to 30!
The basic idea is to collapse each run of consecutive NaNs into a uniquely identifiable group, then find the first group that reaches the set limit and use it to filter the original DataFrame:
NaN_in_a_Row = 3
Fruits['Row_Not_NaN'] = Fruits['bananas'].notna()
Fruits['First_NaN_After_Not_NaN'] = Fruits['bananas'].isna() & Fruits['bananas'].shift(1).notna()
# every value row is its own group; each NaN run shares one group id
Fruits['Group_ID'] = (Fruits['Row_Not_NaN'] + Fruits['First_NaN_After_Not_NaN']).cumsum()
Fruits['Number_of_Rows'] = 1
group_sizes = Fruits.groupby('Group_ID')['Number_of_Rows'].sum()
first_long_run = group_sizes[group_sizes >= NaN_in_a_Row].index.min()
Fruits = Fruits[Fruits['Group_ID'] < first_long_run]
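A compact alternative sketch with a rolling window (this assumes a default RangeIndex; the sample data here is illustrative):
import numpy as np
import pandas as pd

NaN_in_a_Row = 3
Fruits = pd.DataFrame({'bananas': ['red', 'green', np.nan, 'blue',
                                   np.nan, np.nan, np.nan, 'stray']})

# flag the first position where 3 consecutive values are all NaN
run_found = Fruits['bananas'].isna().rolling(NaN_in_a_Row).sum() == NaN_in_a_Row
if run_found.any():
    # idxmax returns the index of the first True, i.e. the END of the run;
    # step back to the start of the run before cutting
    start = run_found.idxmax() - (NaN_in_a_Row - 1)
    Fruits = Fruits.iloc[:start]
print(Fruits)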
