Pandas read_excel() parses date columns with blank values to NaT - python-3.x

I am trying to read an Excel file that has date columns with the code below:
src1_df = pd.read_excel("src_file1.xlsx", keep_default_na=False)
Even though I have specified keep_default_na=False, the data frame still has NaT values for the corresponding blank cells in the Excel date columns.
How can I get a blank string instead of NaT when parsing Excel files?
I am using Python 3.x and Pandas 0.23.4.

src1_df = pd.read_excel("src_file1.xlsx", na_filter=False)
Then you will get an empty string ("") as the "na" value.
In my case I read the Excel data row by row and replace "" and NaT with None:
for line in src1_df.values:
    for index, value in enumerate(line):
        if value == '' or isinstance(value, pd._libs.tslibs.nattype.NaTType):
            line[index] = None
    dostuff_with(line)
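If you prefer to fix this up column-wise instead of looping over rows, here is a vectorized sketch; it assumes the affected columns come back with a datetime dtype and that an ISO date string is an acceptable output format:
import pandas as pd

src1_df = pd.read_excel("src_file1.xlsx", na_filter=False)

# Turn each datetime column into text; NaT becomes NaN under strftime,
# which fillna("") then converts to a blank string.
# The "%Y-%m-%d" format is an assumption, adjust it to your file.
for col in src1_df.select_dtypes(include="datetime").columns:
    src1_df[col] = src1_df[col].dt.strftime("%Y-%m-%d").fillna("")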

Related

Change number format using headers - openpyxl

I have an Excel file in which I want to convert the number formatting from 'General' to 'Date'. I know how to do so for one column when referring to the column letter:
workbook = openpyxl.load_workbook(r'path\filename.xlsx')
worksheet = workbook['Sheet1']
for row in range(2, worksheet.max_row+1):
    worksheet["D{}".format(row)].number_format = 'yyyy-mm-dd;#'
As you can see, I now use the column letter "D" to point out the column that I want to be formatted differently. Now, I would like to use the header in row 1 called "Start_Date" to refer to this column. I tried a method from the following post to achieve this: select a column by its name - openpyxl. However, that resulted in a KeyError: "Start_Date":
# Create a dictionary of column names
ColNames = {}
Current = 0
for COL in worksheet.iter_cols(1, worksheet.max_column):
    ColNames[COL[0].value] = Current
    Current += 1

for row in range(2, worksheet.max_row+1):
    worksheet["{}{}".format(ColNames['Start_Date'], row)].number_format = 'yyyy-mm-dd;#'
EDIT
This method results in the following error:
AttributeError: 'tuple' object has no attribute 'number_format'
Additionally, there are more columns whose number formatting needs to be changed. I have a list with the names of those columns:
DateColumns = ['Start_Date', 'End_Date', 'Birthday']
Is there a way that I can use the list DateColumns so that I can save some lines of code?
Thanks in advance.
Please note that I posted a similar question earlier. The following post was referred to as an answer Python: Simulating CSV.DictReader with OpenPyXL. However, I don't see how the answers in that post can be adjusted to my needs.
You need to know which columns you want to change the number format on, and you have conveniently put them into a list, so why not just use that list.
Get the headers in your sheet, check whether each header is in the DateColumns list, and if so update all the entries in that column from row 2 to the last row with the date format you want...
...
DateColumns = ['Start_Date', 'End_Date', 'Birthday']

for COL in worksheet.iter_cols(min_row=1, max_row=1):
    header = COL[0]
    if header.value in DateColumns:
        for row in range(2, worksheet.max_row+1):
            worksheet.cell(row, COL[0].column).number_format = 'yyyy-mm-dd;#'
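For completeness, a minimal end-to-end sketch combining this with the loading and saving from the question (the file and sheet names are the question's placeholders) might look like:
import openpyxl

DateColumns = ['Start_Date', 'End_Date', 'Birthday']

workbook = openpyxl.load_workbook(r'path\filename.xlsx')
worksheet = workbook['Sheet1']

# Walk the header row once and reformat every column whose header is in DateColumns
for COL in worksheet.iter_cols(min_row=1, max_row=1):
    header = COL[0]
    if header.value in DateColumns:
        for row in range(2, worksheet.max_row + 1):
            worksheet.cell(row, header.column).number_format = 'yyyy-mm-dd;#'

workbook.save(r'path\filename.xlsx')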

Display 2 decimal places, and use comma as separator in pandas?

Is there any way to replace the dot in a float with a comma and keep a precision of 2 decimal places?
Example 1 : 105 ---> 105,00
Example 2 : 99.2 ---> 99,20
I used a lambda function df['abc']= df['abc'].apply(lambda x: f"{x:.2f}".replace('.', ',')). But then I have an invalid format in Excel.
I'm updating a specific sheet in Excel, so I'm using:
wb = load_workbook(filename)
ws = wb["FULL"]
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
Let us try
out = (s//1).astype(int).astype(str)+','+(s%1*100).astype(int).astype(str).str.zfill(2)
0 105,00
1 99,20
dtype: object
Input data
s = pd.Series([105, 99.2])
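Applied to the DataFrame column from the question (the column name 'abc' is taken from the question), the same expression would be:
df['abc'] = (df['abc']//1).astype(int).astype(str) + ',' + (df['abc']%1*100).astype(int).astype(str).str.zfill(2)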
s = pd.Series([105, 99.22]).apply(lambda x: f"{x:.2f}".replace('.', ','))
First, .apply takes a function as its argument.
The f-string f"{x:.2f}" turns the float into a string with 2 decimal places, using '.'.
After that, .replace('.', ',') just replaces '.' with ','.
You can change pd.Series([105, 99.22]) to match your dataframe.
I think you're mixing something up here. In Excel you can set the display format, i.e. the format in which numbers are shown (the icon with +/- 0).
But that is not the format of the cell's value; the cell is numeric either way. Your approach changes only the cell value, not its formatting, and because you save it as a string, it is read back from Excel as a string.
Having said this: don't format the value. Upgrade your pandas (if you haven't already) and try something along these lines: https://stackoverflow.com/a/51072652/11610186
To elaborate, try replacing your for loop with:
i = 1
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
    # replace D with the letter of the column you want formatted:
    ws[f'D{i}'].number_format = '#,##0.00'
    i += 1
Well, I found another way to specify the float format directly in Excel, using this code:
for col_cell in ws['S':'CP']:
    for i in col_cell:
        i.number_format = '0.00'

Get the value from another cell if a pattern matches a string in another cell of the same row

Hi,
I have an Excel workbook with the data shown above. I use Pandas to read through the Excel sheet in my Python script.
My requirement is: if I find YES in the Primary Key column, I need to get the Data Element value of that row.
So from the above sample, I need to get the values of the 3 matching Data Elements into an array variable.
I tried the below piece of code but couldn't achieve it.
PK_COLUMNS = {}
workbook_sheet = pd.read_excel(excel_file_name, sheet_name=i, keep_default_na=False)
workbook_sheet = workbook_sheet.fillna("NULL", inplace=False)
df = pd.DataFrame(workbook_sheet, columns=[0])
total_rows = len(df.axes[0])
h = 0
while h < total_rows:
    CURRENT_COLUMN = workbook_sheet["Primary Key"].fillna("NULL", inplace=False)
    #print(CURRENT_COLUMN)
    for i in CURRENT_COLUMN:
        if str(i).upper() == 'YES':
            PK_COLUMNS[h] = workbook_sheet.iat[h, 0].strip()
            print(PK_COLUMNS)
            h = h + 1
        else:
            print("NULL")
print(PK_COLUMNS)
Any help on this is highly appreciated. Python 3.7 with Pandas.
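As a sketch of one possible approach (assuming the columns are headed "Primary Key" and "Data Element" as described, and reusing excel_file_name and the sheet index i from the code above), the whole loop can usually be replaced by a boolean filter:
import pandas as pd

df = pd.read_excel(excel_file_name, sheet_name=i, keep_default_na=False)

# Keep only the rows whose Primary Key says YES (case-insensitive),
# then collect their Data Element values into a list.
pk_columns = df.loc[df["Primary Key"].astype(str).str.upper() == "YES",
                    "Data Element"].astype(str).str.strip().tolist()
print(pk_columns)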

Missing data when exporting data frame from pandas to excel

I have created a program to remove duplicate rows from an Excel file using pandas. After successfully doing so, I exported the new data from pandas to Excel; however, the new Excel file seems to have missing data (specifically in columns involving dates). Instead of showing the actual data, it just shows '##########' in the rows.
Code:
import pandas as pd
data = pd.read_excel('test.xlsx')
data.sort_values("Serial_Nbr", inplace = True)
data.drop_duplicates(subset ="Serial_Nbr", keep = "first", inplace = True)
data.to_excel(r'test_updated.xlsx')
Before and after exporting:
date          date
2018-07-01    ##########
2018-08-01    ##########
2018-08-01    ##########
This means the cell is not wide enough to display the data; try expanding the column width.
Before (screenshot): the column is too narrow, so the cells show ##########.
After expanding the column width (screenshot): the dates display normally.
To export datetimes to Excel correctly, add the format codes when creating the Excel writer:
import pandas as pd

data = pd.read_excel('Book1.xlsx')
data.sort_values("date", inplace=True)
data.drop_duplicates(subset="date", keep="first", inplace=True)

# Writer with datetime formats
writer = pd.ExcelWriter("test_updated.xlsx",
                        datetime_format='mm dd yyyy',
                        date_format='mmm dd yyyy')

# Convert the dataframe to an XlsxWriter Excel object.
data.to_excel(writer, sheet_name='Sheet1')
writer.save()
########## is displayed when a cell's width is too small to display its contents. You need to increase the cells' width or reduce their content
Regarding the original query on data, I agree with the response from ALFAFA.
Here I am adding column resizing, so that the end user does not need to do it manually in the xls.
Steps would be:
Get the xls column letter (in the xls, columns are named 'A', 'B', 'C', etc.):
colPosn = data.columns.get_loc('col#3')  # position of the column in the dataframe
xlsColName = chr(ord('A') + colPosn)     # xls column letter (not the dataframe header); used to address the xls column
Get the width for column 'col#3' from the length of its longest string:
maxColWidth = 1 + data['col#3'].map(len).max()  # length of the longest string in 'col#3' (+1 for some buffer space so the data stays visible)
Use the column_dimensions[colName].width attribute to increase the width of the xls column:
data.to_excel(writer, sheet_name='Sheet1', index=False)  # index=False avoids the unwanted extra index column in the file
sheet = writer.book['Sheet1']
sheet.column_dimensions[xlsColName].width = maxColWidth  # widen the column to match the longest string in it
writer.save()
Replace the last two lines of ALFAFA's post with the blocks above to get the column width adjusted for 'col#3'.
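Putting those fragments together, a self-contained sketch might look like the one below; the file names, sheet name, and the column label 'col#3' are placeholders carried over from the steps above, and it assumes the openpyxl engine so that writer.book behaves like an openpyxl workbook:
import pandas as pd

data = pd.read_excel('test.xlsx')

writer = pd.ExcelWriter('test_updated.xlsx', engine='openpyxl')
data.to_excel(writer, sheet_name='Sheet1', index=False)

# Work out the xls column letter and the width needed for 'col#3'
colPosn = data.columns.get_loc('col#3')
xlsColName = chr(ord('A') + colPosn)   # only valid for the first 26 columns
maxColWidth = 1 + data['col#3'].astype(str).map(len).max()

sheet = writer.book['Sheet1']
sheet.column_dimensions[xlsColName].width = maxColWidth
writer.save()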

Converting Excel file to csv using to_csv, removes leading zeros even when cells are formatted to be string

When I try to convert my Excel file to CSV using the to_csv function, every item number that has a leading 0 loses it, except for the very first row.
I have a simple for loop that iterates through all cells and converts cell values to strings, so I have no idea why only the first row gets converted to CSV correctly with the leading 0.
for row in ws.iter_rows():
    for cell in row:
        cell.value = str(cell.value)

pd.read_excel('example.xlsx').to_csv('result.csv', index=False, line_terminator=',\n')
e.g.
https://i.stack.imgur.com/Njb3n.png (won't let me directly add image but it shows the following in excel)
0100,03/21/2019,4:00,6:00
0101,03/21/2019,4:00,6:00
0102,03/21/2019,4:00,8:00
turns into:
0100,03/21/2019,4:00,6:00,
101,03/21/2019,4:00,6:00,
102,03/21/2019,4:00,8:00,
What can I do to keep the leading 0 on the first item of every row in the CSV?
Any insight would be appreciated.
So if you have no header in the Excel file, the default column names are 0, 1, ... and so on.
If you want to keep the zero in column 0, for example, just do:
pd.read_excel('example.xlsx', header=None, dtype={0: str})\
  .to_csv('result.csv', index=False, line_terminator=',\n')
If you have no header and you don't specify header=None, the first row is used as the header. dtype={0: str} indicates that column 0 will be read as str.
Be careful: when you save the Excel file to CSV, the header is saved too (with your current options), so the first row will be 0, 1, ... (the column names).
If you don't want the header in the CSV file, use:
pd.read_excel('e:/test.xlsx', header=None, dtype={0: str})\
  .to_csv('e:/result.csv', index=False, header=False, line_terminator=',\n')
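If the item numbers were stored in Excel as numbers rather than text, the leading zero is already gone before pandas ever sees it; in that case one option (a sketch, assuming a fixed width of 4 digits) is to re-pad the strings after reading:
import pandas as pd

df = pd.read_excel('example.xlsx', header=None, dtype={0: str})
# zfill re-pads column 0 to 4 characters; the width is an assumption about your data
df[0] = df[0].str.zfill(4)
df.to_csv('result.csv', index=False, header=False, line_terminator=',\n')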
