Missing data when exporting data frame from pandas to excel - excel

I wrote a program that removes duplicate rows from an Excel file using pandas. After doing so, I exported the new data from pandas to Excel, but the new Excel file seems to have missing data (specifically in the columns containing dates). Instead of showing the actual data, the cells just show '##########'.
Code:
import pandas as pd
data = pd.read_excel('test.xlsx')
data.sort_values("Serial_Nbr", inplace = True)
data.drop_duplicates(subset ="Serial_Nbr", keep = "first", inplace = True)
data.to_excel (r'test_updated.xlsx')
Before and after exporting:
date date
2018-07-01 ##########
2018-08-01 ##########
2018-08-01 ##########

It means the cell is too narrow to display the data. Try widening the column.
[Screenshots: the cell with too narrow a width, and the same cell after expanding its width]
To control how datetimes are displayed in the exported file, set the format codes on the Excel writer:
import pandas as pd
data = pd.read_excel('Book1.xlsx')
# sort_values returns a new frame unless inplace=True, so keep the result
data = data.sort_values("date")
data.drop_duplicates(subset="date", keep="first", inplace=True)
# Writer datetime formats
writer = pd.ExcelWriter("test_updated.xlsx",
                        datetime_format='mm dd yyyy',
                        date_format='mmm dd yyyy')
# Write the dataframe to the Excel file through the writer.
data.to_excel(writer, sheet_name='Sheet1')
# close() saves the file; writer.save() was removed in pandas 2.0
writer.close()

########## is displayed when a cell's width is too small to display its contents. You need to increase the cell's width or reduce its content.

Regarding the original question, I agree with the answer from ALFAFA.
Here I resize the column in code, so that the end user does not need to do it manually in the Excel file.
The steps are:
Get the Excel column name (in Excel, columns are named 'A', 'B', 'C', etc.)
colPosn = data.columns.get_loc('col#3')  # 0-based position of the column in the data frame
xlsColName = chr(ord('A') + colPosn)  # Excel column letter (not the data frame's column header), used to set attributes of the Excel column; only valid for the first 26 columns and with index=False below
Get the new width of column 'col#3' from the length of the longest string in the column
maxColWidth = 1 + data['col#3'].map(len).max()  # Length of the longest string in 'col#3' (+1 as buffer so the data is visible); assumes the column holds strings
Use the column_dimensions[colName].width attribute to increase the width of the Excel column
data.to_excel(writer, sheet_name='Sheet1', index=False)  # index=False leaves out the extra index column
sheet = writer.book['Sheet1']  # assumes the writer uses the openpyxl engine
sheet.column_dimensions[xlsColName].width = maxColWidth  # Widen the column to fit its longest string
writer.close()  # saves the file; writer.save() was removed in pandas 2.0
Replace the last two lines of ALFAFA's answer with all of the blocks above to get the column width adjusted for 'col#3'.
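The letter computation chr(ord('A') + colPosn) only reaches column 'Z'. Below is a minimal sketch of a helper that also handles later columns such as 'AA'. The names excel_column_letter and column_width are hypothetical (they are not part of pandas or openpyxl), and column_width measures str(value) so that non-string values such as dates are handled as well, which is an assumption about how the values should be sized.

```python
def excel_column_letter(position):
    """Convert a 0-based column position to an Excel column name (0 -> 'A', 26 -> 'AA')."""
    if position < 0:
        raise ValueError("position must be 0 or greater")
    letters = ""
    n = position + 1  # Excel column names behave like a 1-based, base-26 numbering
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def column_width(values, padding=1):
    """Length of the longest value when rendered with str(), plus some padding."""
    return padding + max((len(str(value)) for value in values), default=0)


# Made-up sample values for illustration.
print(excel_column_letter(2))                  # third column
print(column_width(["2018-07-01", "short"]))   # longest entry plus padding
```

openpyxl also ships openpyxl.utils.get_column_letter, which takes a 1-based index, so the helper above is only needed when avoiding that import.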

Related

Change number format using headers - openpyxl

I have an Excel file in which I want to convert the number formatting from 'General' to 'Date'. I know how to do so for one column when referring to the column letter:
workbook = openpyxl.load_workbook(r'path\filename.xlsx')
worksheet = workbook['Sheet1']
for row in range(2, worksheet.max_row+1):
    worksheet["D{}".format(row)].number_format='yyyy-mm-dd;#'
As you can see, I now use the column letter "D" to point out the column that I want to be formatted differently. Now, I would like to use the header in row 1 called "Start_Date" to refer to this column. I tried a method from the following post to achieve this: select a column by its name - openpyxl. However, that resulted in a KeyError: "Start_Date":
# Create a dictionary of column names
ColNames = {}
Current = 0
for COL in worksheet.iter_cols(1, worksheet.max_column):
    ColNames[COL[0].value] = Current
    Current += 1
for row in range(2, worksheet.max_row+1):
    worksheet["{}{}".format(ColNames['Start_Date'], row)].number_format='yyyy-mm-dd;#'
EDIT
This method results in the following error:
AttributeError: 'tuple' object has no attribute 'number_format'
Additionally, I have more columns from which the number formatting needs to be changed. I have a list with the names of those columns:
DateColumns = ['Start_Date', 'End_Date', 'Birthday']
Is there a way that I can use the list DateColumns so that I can save some lines of code?
Thanks in advance.
Please note that I posted a similar question earlier. The post "Python: Simulating CSV.DictReader with OpenPyXL" was suggested as an answer, but I don't see how the answers there can be adapted to my needs.
You already know which columns need the new number format, and you have conveniently put them in a list, so why not just use that list?
Get the headers in your sheet and check whether each header is in the DateColumns list. If it is, update every entry in that column, from row 2 to the last row, with the date format you want.
...
DateColumns = ['Start_Date', 'End_Date', 'Birthday']
for COL in worksheet.iter_cols(min_row=1,max_row=1):
    header = COL[0]
    if header.value in DateColumns:
        for row in range(2, worksheet.max_row+1):
            worksheet.cell(row, COL[0].column).number_format='yyyy-mm-dd;#'
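The dictionary approach from the question can also be made to work if it stores column letters instead of running integers. As far as I can tell, a purely numeric key such as worksheet["32"] makes openpyxl select a whole row and return a tuple of cells, which would explain the AttributeError. The sketch below is built on a made-up in-memory sheet; header_letters is a hypothetical helper name, and the date format "dd/mm/yyyy" is just an example.

```python
from datetime import date

from openpyxl import Workbook


def header_letters(worksheet):
    """Map each header in row 1 to its column letter, e.g. {'Start_Date': 'B'}."""
    return {cell.value: cell.column_letter
            for cell in worksheet[1] if cell.value is not None}


# Made-up demo sheet, built in memory so that the sketch is self-contained.
workbook = Workbook()
worksheet = workbook.active
worksheet.append(["Name", "Start_Date", "End_Date"])
worksheet.append(["a", date(2018, 7, 1), date(2018, 8, 1)])
worksheet.append(["b", date(2018, 8, 1), date(2018, 9, 1)])

letters = header_letters(worksheet)
for name in ["Start_Date", "End_Date"]:
    for row in range(2, worksheet.max_row + 1):
        # The key is now a real cell reference such as "B2".
        worksheet["{}{}".format(letters[name], row)].number_format = "dd/mm/yyyy"
```

With a real file, the workbook would come from openpyxl.load_workbook and would need a final workbook.save call.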

Automate cell properties multiple Excel files

I have about 100 Excel files, in which columns K and L are represented as floating points, for instance 0.5677. I want to represent those columns as percentages, in this case 56.8%. Is there a way I can automate this? Obviously I can adjust the columns by hand, but this is quite time-consuming.
I have no experience with macros or VBA.
Any help would be greatly appreciated.
Kind regards, M.
I found a useful way to do this using Python Pandas. In my case, Pandas is also the source for the Excel files.
import pandas as pd
# Create a Pandas dataframe from some data.
df = pd.DataFrame({'Numbers': [1010, 2020, 3030, 2020, 1515, 3030, 4545],
                   'Percentage': [.1, .2, .33, .25, .5, .75, .45],
                   })
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Add some cell formats.
format1 = workbook.add_format({'num_format': '#,##0.00'})
format2 = workbook.add_format({'num_format': '0%'})
# Note: It isn't possible to format any cells that already have a format such
# as the index or headers or any cells that contain dates or datetimes.
# Set the column width and format.
worksheet.set_column('B:B', 18, format1)
# Set the format but not the column width.
worksheet.set_column('C:C', None, format2)
# Close the Pandas Excel writer and output the Excel file
# (writer.save() was removed in pandas 2.0).
writer.close()
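The answer above formats a file that pandas itself writes. For Excel files that already exist, a possible alternative is to open each one with openpyxl and change only the number format of columns K and L. This is a minimal sketch under several assumptions: the function names format_as_percent and format_folder are hypothetical, the header is assumed to sit in row 1, all files are assumed to live in one folder, and each file is overwritten in place. The format code "0.0%" shows one decimal, so 0.5677 is displayed as 56.8%.

```python
from pathlib import Path

from openpyxl import Workbook, load_workbook


def format_as_percent(worksheet, columns=("K", "L"), number_format="0.0%", first_row=2):
    """Set a percentage number format on the data cells of the given columns."""
    for letter in columns:
        for row in range(first_row, worksheet.max_row + 1):
            worksheet["{}{}".format(letter, row)].number_format = number_format


def format_folder(folder, pattern="*.xlsx"):
    """Apply format_as_percent to every sheet of every matching file (overwrites the files)."""
    for path in Path(folder).glob(pattern):
        workbook = load_workbook(path)
        for worksheet in workbook.worksheets:
            format_as_percent(worksheet)
        workbook.save(path)


# In-memory demo with made-up headers; the stored value itself is not changed.
demo = Workbook()
sheet = demo.active
sheet["K1"] = "rate_a"
sheet["L1"] = "rate_b"
sheet["K2"] = 0.5677
sheet["L2"] = 0.25
format_as_percent(sheet)
```

Only the display format changes, so the underlying floats stay available for calculations. It is safer to try format_folder on a copy of the files first.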

Converting Excel file to csv using to_csv, removes leading zeros even when cells are formatted to be string

When I convert my Excel file to CSV using the to_csv function, every item number that has a leading 0 loses it, except in the very first row.
I have a simple for loop that iterates through all cells and converts the cell values to strings, so I have no idea why only the first row keeps its leading 0 in the CSV.
for row in ws.iter_rows():
    for cell in row:
        cell.value = str(cell.value)
pd.read_excel('example.xlsx').to_csv('result.csv', index=False, line_terminator = ',\n')
e.g.
https://i.stack.imgur.com/Njb3n.png (won't let me directly add image but it shows the following in excel)
0100,03/21/2019,4:00,6:00
0101,03/21/2019,4:00,6:00
0102,03/21/2019,4:00,8:00
turns into:
0100,03/21/2019,4:00,6:00,
101,03/21/2019,4:00,6:00,
102,03/21/2019,4:00,8:00,
What can I do to have 0 in front of all the first items in csv?
Any insight would be appreciated.
If your Excel file has no header row, the columns are named 0, 1, ... by default.
To keep the leading zero in column 0, for example, just do:
pd.read_excel('example.xlsx', header=None, dtype={0:str})\
    .to_csv('result.csv', index=False, line_terminator = ',\n')
If the file has no header and you do not specify header=None, the first row is treated as the header. dtype={0:str} tells pandas to read column 0 as str.
Be careful when you save to CSV: with these options the header is written too, so the first row of the file will be 0,1,... (the column names).
If you do not want a header in the CSV file, use:
pd.read_excel('e:/test.xlsx', header=None, dtype={0:str})\
    .to_csv('e:/result.csv', index=False, header=False, line_terminator = ',\n')
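To try the dtype approach without a file on disk, the sketch below builds a small made-up workbook in memory (values taken from the question's example), reads it back with every column as str, and renders the CSV as a string. It assumes the item numbers are stored as text cells. Note that newer pandas versions spell the to_csv parameter lineterminator instead of line_terminator, so the sketch leaves it out.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Made-up workbook without a header row; all values are stored as text.
workbook = Workbook()
sheet = workbook.active
for item in ["0100", "0101", "0102"]:
    sheet.append([item, "03/21/2019", "4:00", "6:00"])

buffer = BytesIO()
workbook.save(buffer)
buffer.seek(0)

# header=None keeps the first row as data; dtype=str stops numeric conversion.
frame = pd.read_excel(buffer, header=None, dtype=str)
csv_text = frame.to_csv(index=False, header=False)
print(csv_text)
```

Passing dtype=str for the whole sheet is a design choice for the demo; dtype={0: str} restricts it to the first column, as in the answer.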

Python Pandas check cells for a range of numbers copy or skip if not there

I would use the pandas isin or iloc functions, but the Excel format is complex: sometimes data is followed by columns with no information, and the main pool of entries are columns with up to 3 pieces of data in a cell, separated only by '|'. Some of the cells are missing a number; I want to skip those but copy the ones that have it.
Below is my current code. I have a giant Excel file with thousands of entries and, worse, the columns and rows are not neat. Each cell can hold several pieces of data. I have noticed that a number called 'tail #' is missing from some of them. What I want to do is search for that number: if the cell has it, copy that cell; if not, move on to the next column in the row, and repeat that for all cells. There is a giant header, but I removed it with formatting when I converted the file to CSV. This is also why I am looking for a number, because there are several headers: for example, a year such as 2010, followed by several empty columns until the next one, maybe 10 columns later. Also, please note that under each year header there are several columns of data per row, separated by two columns with no information. The information in a cell looks like this: '13|something something|some more words'. If it has a number, as in this example, I want to copy it. The numbers seem to range from 0 to no greater than 30. Lastly, I'm trying to write this using pandas, but I may need a more manual approach because using isin and iloc was not working.
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import os.path as op
from openpyxl import workbook
import re
def extract_export_columns(df, list_of_columns, file_path):
    column_df = df[list_of_columns]
    column_df.to_csv(file_path, index=False, sep="|")
#Original file
input_base_path = 'C:/Users/somedoc input'
main_df_data_file = pd.read_csv(op.join (input_base_path, 'som_excel_doc.csv '))
#Filter for tailnumbers
tail_numbers = main_df_data_file['abcde'] <= 30
main_df_data_file[tail_numbers]
#iterate over list
#number_filter = main_df_data_file.Updated.isin(["15"])
#main_df_data_file[number_filter]
#print(number_filter)
#for row in main_df_data_file.values:
#for value in row:
# print(value)
#print(row)
# to check the condition
# Product of code
output_base_path = r'C:\Users\some_doc output'
extract_export_columns(main_df_data_file,
                       ['Updated 28 Feb 18 Tail #'],
                       op.join(output_base_path, 'UBC_example3.txt'))
The code I have loads the CSV and successfully creates a text file. I want to build the body of the function so that it scans an Excel/CSV file and copies the data that contains a number to a text file.
https://drive.google.com/file/d/1stXxgqBeo_sGksVYL9HHdn2IflFL_bb8/view?usp=sharing
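One possible way to approach the cell scan is a regular expression that checks whether a cell starts with a number followed by '|'. The sketch below works on plain lists of rows and uses made-up sample cells in the '13|something something|some more words' shape described in the question. The names LEADING_NUMBER, leading_tail_number and cells_with_tail_number are hypothetical, and the assumption that the tail number is always the first field (one or two digits, at most 30) comes from the question's description rather than from the actual file.

```python
import re

# One or two digits at the start of the cell, followed by the '|' separator.
LEADING_NUMBER = re.compile(r"^\s*(\d{1,2})\s*\|")


def leading_tail_number(cell, maximum=30):
    """Return the leading number of a cell, or None if the cell has no valid one."""
    if not isinstance(cell, str):
        return None  # empty cells usually arrive as NaN or None
    match = LEADING_NUMBER.match(cell)
    if match is None:
        return None
    number = int(match.group(1))
    return number if number <= maximum else None


def cells_with_tail_number(rows):
    """Collect every cell that starts with a valid tail number, row by row."""
    return [cell for row in rows for cell in row
            if leading_tail_number(cell) is not None]


# Made-up rows for illustration.
rows = [
    ["13|something something|some more words", None, "|no number|text"],
    ["2010", "", "7|a|b"],
    ["45|too|big"],
]
kept = cells_with_tail_number(rows)
print(kept)
```

With a data frame, the rows could come from main_df_data_file.itertuples(index=False), and the kept cells could then be written to the text file one per line.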

Python: How to read multiple spreadsheets into a new format in a CSV?

I (a newcomer) am trying to read several tables from an Excel document and write them in a new format to a single CSV.
In the CSV, I need the following fields: year (from a global variable), month (from a global variable), outlet (name of the sheet), rowvalue [a] (string describing the row), columnvalue [1] (string describing the column), cellvalue (float).
The corresponding values must then be entered in these fields.
From each sheet, only rows 6 to 89 need to be read.
#BWA-Reader
#read the excel spreadsheet with all sheets
#Python 3.6
# Imports
import openpyxl
import xlrd
from PIL import Image as PILImage
import csv
# year value of the Business analysis
year = "2018"
# month value of the Business analysis
month = "11"
# .xlsx path
wb = openpyxl.load_workbook("BWA Zusammenfassung 18-11.xlsx")
print("Found your Spreadsheet")
# List of sheets (get_sheet_names() is deprecated in favor of sheetnames)
sheets = wb.sheetnames
# remove unnecessary sheets
list_to_remove = ("P",'APn','AP')
sheets_clean = list(set(sheets).difference(set(list_to_remove)))
print("sheets to load: " + str(sheets_clean))
# for loop for every sheet based on sheets_clean
for sheet in sheets_clean:
    # for loop to build list for row and cell value
    all_rows = []
    for row in wb[sheet].rows:
        current_row = []
        for cell in row:
            current_row.append(cell.value)
        all_rows.append(current_row)
    print(all_rows)
# I'm stuck -.-
I expect an output like:
2018;11;Oldenburg;total_sales;monthly;145840.00
with all sheets in one CSV.
Thank you so much for every idea how to solve my project!
The complete answer to this question is very dependent on the actual dataset.
I would recommend looking into pandas' read_excel() function. This will make it so much easier to extract the needed rows/columns/cells, all without looping through all of the sheets.
You might need some tutorials on pandas in order to get there, but judging by what you are trying to do, pandas might be a useful skill to have in the future!
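As a rough illustration of the pandas route, the sketch below turns one table per sheet into the long year;month;outlet;row;column;value layout from the question. It is only a sketch under assumptions: to_long_format is a hypothetical helper, the first column of each table is assumed to hold the row labels, and the sample frame is made up apart from the total_sales value quoted in the question.

```python
import pandas as pd


def to_long_format(frame, year, month, outlet):
    """Reshape one sheet (row labels in the first column) into one row per cell."""
    label_column = frame.columns[0]
    long = frame.melt(id_vars=[label_column],
                      var_name="column_label", value_name="value")
    long = long.rename(columns={label_column: "row_label"})
    long = long.dropna(subset=["value"])  # skip empty cells
    long.insert(0, "outlet", outlet)
    long.insert(0, "month", month)
    long.insert(0, "year", year)
    return long


# Made-up stand-in for the dict that pd.read_excel(..., sheet_name=None) returns.
sheets = {
    "Oldenburg": pd.DataFrame({
        "position": ["total_sales", "costs"],
        "monthly": [145840.0, 90000.0],
        "cumulative": [1500000.0, None],
    }),
}

combined = pd.concat(
    [to_long_format(frame, "2018", "11", name) for name, frame in sheets.items()],
    ignore_index=True,
)
csv_text = combined.to_csv(sep=";", index=False, header=False)
print(csv_text)
```

With the real workbook, the dict could come from pd.read_excel with sheet_name=None, plus skiprows and nrows to limit the read to rows 6 to 89; the exact values depend on where the header row sits in each sheet. The unwanted sheets ("P", "APn", "AP") could be dropped from the dict before concatenating.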
