python xlsxwriter extract value from cell - python-3.x

Is it possible to extract data that I've written to a xlsxwriter.worksheet?
import xlsxwriter
output = "test.xlsx"
workbook = xlsxwriter.Workbook(output)
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, 'top left')
if conditional:
    worksheet.write(1, 1, 'bottom right')
for row in range(2):
    for col in range(2):
        # Now how can I check if a value was written at this coordinate?
        # something like worksheet.get_value_at_row_col(row, col)
workbook.close()

Is it possible to extract data that I've written to a xlsxwriter.worksheet?
Yes. Even though XlsxWriter is write-only, it keeps the cell values in an internal structure and only writes them to the file when workbook.close() is executed. Note that this relies on undocumented internals, which may change between versions.
Every Worksheet has a table attribute. It is a dictionary, containing entries for all populated rows (row numbers starting at 0 are the keys). These entries are again dictionaries, containing entries for all populated cells within the row (column numbers starting at 0 are the keys).
Therefore, table[row][col] gives you the entry at the desired position, but only if there is one; for an unpopulated cell the lookup raises a KeyError.
Note that these entries are still not the text, number or formula you are looking for, but named tuples, which also contain the cell format. You can type check the entries and extract the contents depending on their nature. Here are the possible outcomes of type(entry) and the fields of the named tuples that are accessible:
xlsxwriter.worksheet.cell_string_tuple: string, format
xlsxwriter.worksheet.cell_number_tuple: number, format
xlsxwriter.worksheet.cell_blank_tuple: format
xlsxwriter.worksheet.cell_boolean_tuple: boolean, format
xlsxwriter.worksheet.cell_formula_tuple: formula, format, value
xlsxwriter.worksheet.cell_arformula_tuple: formula, format, value, range
For numbers, booleans, and formulae, the contents can be accessed by reading the respective field of the named tuple.
For array formulae, the contents are only present in the upper left cell of the output range, while the rest of the cells are represented by number entries with 0 value.
For strings, the situation is more complicated, since Excel's storage concept has a shared string table, while the individual cell entries only point to an index of this table. The shared string table can be accessed as the str_table.string_table attribute of the worksheet. It is a dictionary, where the keys are strings and the values are the associated indices. In order to access the strings by index, you can generate a sorted list from the dictionary as follows:
shared_strings = sorted(worksheet.str_table.string_table, key=worksheet.str_table.string_table.get)
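The lookup described above can be sketched without XlsxWriter itself. The named tuples and the string table below are stand-ins that only mirror the field layout listed in this answer; StringCell, NumberCell, BooleanCell, FormulaCell, BlankCell and cell_content are hypothetical names, not part of the library. The sketch dispatches on field names rather than on the tuple classes, which is an assumption made here so that it does not depend on the exact class names:

```python
from collections import namedtuple

# Stand-ins mirroring the field layout listed above (not XlsxWriter's own classes).
StringCell = namedtuple("StringCell", "string format")
NumberCell = namedtuple("NumberCell", "number format")
BooleanCell = namedtuple("BooleanCell", "boolean format")
FormulaCell = namedtuple("FormulaCell", "formula format value")
BlankCell = namedtuple("BlankCell", "format")

# Stand-in for worksheet.str_table.string_table: each string maps to its index.
string_table = {"top left": 0, "more text": 1, "even more text": 2}

# Sorting the keys by their index gives a list that can be addressed by index.
shared_strings = sorted(string_table, key=string_table.get)


def cell_content(entry, shared_strings):
    """Return the content of a cell entry, or None for a blank cell."""
    fields = entry._fields
    if "string" in fields:
        # String cells hold an index into the shared string list.
        return shared_strings[entry.string]
    if "number" in fields:
        return entry.number
    if "boolean" in fields:
        return bool(entry.boolean)
    if "formula" in fields:
        return "=" + entry.formula
    return None


# A small table in the same shape as worksheet.table: {row: {col: entry}}.
table = {
    0: {0: StringCell(0, None), 1: NumberCell(42, None), 2: BlankCell(None)},
    2: {1: BooleanCell(1, None), 2: FormulaCell("SUM(X5:Y7)", None, 0)},
}

for row, cells in table.items():
    for col, entry in cells.items():
        print(row, col, cell_content(entry, shared_strings))
```

With a real worksheet, the same function could be applied to the entries of worksheet.table, provided the installed version still uses these field names.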
I expanded your example from above to include all the explained features. It now looks like this:
import xlsxwriter
output = "test.xlsx"
workbook = xlsxwriter.Workbook(output)
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, 'top left')
worksheet.write(0, 1, 42)
worksheet.write(0, 2, None)
worksheet.write(2, 1, True)
worksheet.write(2, 2, '=SUM(X5:Y7)')
worksheet.write_array_formula(2,3,3,4, '{=TREND(X5:X7,Y5:Y7)}')
worksheet.write(4,0, 'more text')
worksheet.write(4,1, 'even more text')
worksheet.write(4,2, 'more text')
worksheet.write(4,3, 'more text')
for row in range(5):
    row_dict = worksheet.table.get(row, None)
    for col in range(5):
        if row_dict is not None:
            col_entry = row_dict.get(col, None)
        else:
            col_entry = None
        print(row, col, col_entry)
shared_strings = sorted(worksheet.str_table.string_table, key=worksheet.str_table.string_table.get)
print()
if type(worksheet.table[0][0]) == xlsxwriter.worksheet.cell_string_tuple:
    print(shared_strings[worksheet.table[0][0].string])
# type checking omitted for the rest...
print(worksheet.table[0][1].number)
print(bool(worksheet.table[2][1].boolean))
print('='+worksheet.table[2][2].formula)
print('{='+worksheet.table[2][3].formula+'}')
workbook.close()

Is it possible to extract data that I've written to a xlsxwriter.worksheet?
No. XlsxWriter is write only. If you need to keep track of your data you will need to do it in your own code, outside of XlsxWriter.
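One way to follow this advice is a thin wrapper that forwards each write and remembers the value. This is only a sketch: TrackedSheet and FakeWorksheet are hypothetical names, FakeWorksheet exists solely so the example runs without XlsxWriter installed, and the wrapper handles only the write(row, col, value) form, not A1-style cell references:

```python
class TrackedSheet:
    """Forward writes to a worksheet-like object and remember what was written."""

    def __init__(self, worksheet):
        self._worksheet = worksheet
        self.values = {}  # maps (row, col) to the value written there

    def write(self, row, col, value, *args):
        # Record the value first, then pass the call through unchanged.
        self.values[(row, col)] = value
        return self._worksheet.write(row, col, value, *args)

    def get(self, row, col, default=None):
        """Return the value written at (row, col), or default if none was."""
        return self.values.get((row, col), default)


class FakeWorksheet:
    """Hypothetical stand-in for a real worksheet object."""

    def write(self, row, col, value, *args):
        return 0


sheet = TrackedSheet(FakeWorksheet())
sheet.write(0, 0, "top left")

for row in range(2):
    for col in range(2):
        print(row, col, sheet.get(row, col))
```

In real use, the object returned by workbook.add_worksheet() would take the place of FakeWorksheet().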

Related

Change number format using headers - openpyxl

I have an Excel file in which I want to convert the number formatting from 'General' to 'Date'. I know how to do so for one column when referring to the column letter:
workbook = openpyxl.load_workbook(r'path\filename.xlsx')
worksheet = workbook['Sheet1']
for row in range(2, worksheet.max_row+1):
    worksheet["D{}".format(row)].number_format='yyyy-mm-dd;#'
As you can see, I now use the column letter "D" to point out the column that I want to be formatted differently. Now, I would like to use the header in row 1 called "Start_Date" to refer to this column. I tried a method from the following post to achieve this: select a column by its name - openpyxl. However, that resulted in a KeyError: "Start_Date":
# Create a dictionary of column names
ColNames = {}
Current = 0
for COL in worksheet.iter_cols(1, worksheet.max_column):
    ColNames[COL[0].value] = Current
    Current += 1
for row in range(2, worksheet.max_row+1):
    worksheet["{}{}".format(ColNames['Start_Date'], row)].number_format='yyyy-mm-dd;#'
EDIT
This method results in the following error:
AttributeError: 'tuple' object has no attribute 'number_format'
Additionally, I have more columns from which the number formatting needs to be changed. I have a list with the names of those columns:
DateColumns = ['Start_Date', 'End_Date', 'Birthday']
Is there a way that I can use the list DateColumns so that I can save some lines of code?
Thanks in advance.
Please note that I posted a similar question earlier. The following post was referred to as an answer Python: Simulating CSV.DictReader with OpenPyXL. However, I don't see how the answers in that post can be adjusted to my needs.
You already know which columns need the new number format, and you have conveniently put them into a list, so why not just use that list?
Get the headers in your sheet and check whether each header is in the DateColumns list; if so, update all the entries in that column, from row 2 to the last row, with the date format you want:
...
DateColumns = ['Start_Date', 'End_Date', 'Birthday']
for COL in worksheet.iter_cols(min_row=1, max_row=1):
    header = COL[0]
    if header.value in DateColumns:
        for row in range(2, worksheet.max_row+1):
            worksheet.cell(row, header.column).number_format='yyyy-mm-dd;#'

Convert all strings with numbers to integers in DataFrames

I am using pandas with openpyxl to process multiple Excel files into a single Excel file as output. In this output file, cells can contain a combination of numbers and other characters or exclusively numbers, and all cells are stored as text.
I want all cells that only contain numbers in the output file to be stored as numbers. As the columns with numbers are known (5 to 8), I used the following code to transform the text to floats:
for dictionary in list_of_Excelfiles:
    dictionary[DataFrame][5:8].astype(float)
However, this manual procedure is not scalable and might be prone to errors when other characters than numbers are present in the column. As such, I want to create a statement that transforms any cell with only numbers to an integer.
What condition can filter for cells with only numbers and transform these to integers?
You could use try and except together with applymap. Here is a full example.
Create some random data for the example:
import random
import string
import pandas as pd
def s():
    # five random strings of 1 to 5 characters drawn from 'abcdef' and the digits
    return [''.join(random.choices(string.ascii_letters[:6] + string.digits, k=random.randint(1, 5))) for x in range(5)]
df = pd.DataFrame()
for c in range(4):
    df[c] = s()
Define a try and except function:
def try_int(s):
    try:
        return int(s)
    except ValueError:
        return s
Apply it to each cell:
# applymap is deprecated since pandas 2.1 in favor of DataFrame.map
df2 = df.applymap(try_int)
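If a stricter condition is wanted, a regular expression can decide what counts as "only numbers" before converting. This is a sketch under that assumption, not part of the answer above: strict_int is a hypothetical helper, and it differs from int() in rejecting text such as "1_000", which int() would parse as 1000:

```python
import re

# Optional sign followed by ASCII digits only.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def strict_int(cell):
    """Return an int for text that is purely an integer, else the input unchanged."""
    if isinstance(cell, str):
        text = cell.strip()
        if _INTEGER.fullmatch(text):
            return int(text)
    return cell


# Plain lists stand in for the cells of a DataFrame.
rows = [["12", "a1", "1_000"], [" 7 ", "-3", "4.5"]]
converted = [[strict_int(cell) for cell in row] for row in rows]
print(converted)
```

The function takes a single cell, so it can be passed to a cell-wise DataFrame method in the same way as try_int.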

Missing data when exporting data frame from pandas to excel

I have created a program to remove duplicate rows from an excel file using pandas. After successfully doing so I exported the new data from pandas to excel however the new excel file seems to have missing data (specifically columns involving dates). Instead of showing the actual data it just shows '##########' on the rows.
Code:
import pandas as pd
data = pd.read_excel('test.xlsx')
data.sort_values("Serial_Nbr", inplace = True)
data.drop_duplicates(subset ="Serial_Nbr", keep = "first", inplace = True)
data.to_excel (r'test_updated.xlsx')
Before and after exporting:
date date
2018-07-01 ##########
2018-08-01 ##########
2018-08-01 ##########
It means the cell is too narrow to display the data; try widening the column.
(The original post showed screenshots of the column before and after widening.)
To export datetimes to Excel with a specific format, add the format codes to the Excel writer:
import pandas as pd
data = pd.read_excel('Book1.xlsx')
data = data.sort_values("date")
data.drop_duplicates(subset="date", keep="first", inplace=True)
# Writer datetime format
writer = pd.ExcelWriter("test_updated.xlsx",
                        datetime_format='mm dd yyyy',
                        date_format='mmm dd yyyy')
# Convert the dataframe to an XlsxWriter Excel object.
data.to_excel(writer, sheet_name='Sheet1')
# writer.save() was removed in pandas 2.0; close() saves the file
writer.close()
########## is displayed when a cell's width is too small to display its contents. You need to increase the cells' width or reduce their content
Regarding the original question, I agree with the response from ALFAFA.
Here I resize the column so that the end user does not need to do it manually in the spreadsheet.
The steps are:
Get the Excel column letter (Excel columns are named 'A', 'B', 'C', etc.):
colPosn = data.columns.get_loc('col#3')  # 0-based column position in the data frame
xlsColName = chr(ord('A')+colPosn)  # Excel column letter (not the data frame header); valid for the first 26 columns and only when the index is not written
Get the width for column 'col#3' from the length of its longest value:
maxColWidth = 1 + data['col#3'].astype(str).map(len).max()  # length of the longest value as text, +1 for some buffer space
Use the column_dimensions[colName].width attribute (available on openpyxl worksheets) to widen the column:
data.to_excel(writer, sheet_name='Sheet1', index=False)  # index=False leaves out the extra index column
sheet = writer.book['Sheet1']
sheet.column_dimensions[xlsColName].width = maxColWidth  # widen the column to fit the longest value
writer.close()  # writer.save() was removed in pandas 2.0
Replace the last two lines of ALFAFA's post with the blocks above to get the column width adjusted for 'col#3'.
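The expression chr(ord('A') + colPosn) only yields a valid column letter for the first 26 columns. A minimal sketch of a general conversion follows; column_letter is a hypothetical helper name, and openpyxl provides openpyxl.utils.get_column_letter for the same purpose with a 1-based index:

```python
def column_letter(index):
    """Convert a 0-based column index to an Excel column name (0 is 'A', 26 is 'AA')."""
    if index < 0:
        raise ValueError("column index must be non-negative")
    letters = ""
    n = index + 1  # work 1-based: Excel column names have no zero digit
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


for position in (0, 25, 26, 27, 701, 702):
    print(position, column_letter(position))
```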

About lists in python

I have an excel file with a column in which values are in multiple rows in this format 25/02/2016. I want to save all these rows of dates in a list. Each row is a separate value. How do I do this? So far this is my code:
import openpyxl
wb = openpyxl.load_workbook ('LOTERIAREAL.xlsx')
sheet = wb.get_active_sheet()
rowsnum = sheet.get_highest_row()
wholeNum = []
for n in range(1, rowsnum):
    wholeNum = sheet.cell(row=n, column=1).value
print (wholeNum[0])
When I use the print statement, instead of printing the value of the first row which should be the first item in the list e.g. 25/02/2016, it is printing the first character of the row which is the number 2. Apparently it is slicing thru the date. I want the first row and subsequent rows saved as separate items in the list. What am I doing wrong? Thanks in advance
wholeNum = sheet.cell(row=n, column=1).value assigns the value of the cell to the variable wholeNum, so you never add anything to the initial empty list and just overwrite the value each time. When you call wholeNum[0] at the end, wholeNum is the last string that was read, and you get its first character.
You probably want wholeNum.append(sheet.cell(row=n, column=1).value) to accumulate a list.
wholeNum =
This is an assignment. It makes the name wholeNum refer to whatever object the expression to the right of the = operator evaluates to.
for ...:
    wholeNum = ...
Performing assignment in a loop is frequently not useful. The name wholeNum will refer to whatever value was assigned to it in the last iteration of the loop. The other iterations have no discernible effect.
To append values to a list, use the .append() method.
for ...:
    wholeNum.append( ... )
print( wholeNum )
print( wholeNum[0] )
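The difference can be seen side by side in a small runnable sketch. The list cells is a stand-in for the values read from the spreadsheet column, and the names last and dates are chosen here for illustration:

```python
# Stand-in for the values of the cells in column 1.
cells = ["25/02/2016", "26/02/2016", "27/02/2016"]

# Assignment in the loop: the name is rebound on every iteration,
# so the empty list is discarded and only the last string survives.
last = []
for value in cells:
    last = value
print(last[0])   # first character of the last string

# Appending in the loop: every value is kept as a separate list item.
dates = []
for value in cells:
    dates.append(value)
print(dates[0])  # the first date
print(dates)
```

Separately from the list issue, range(1, rowsnum) in the question stops one row short of the last row; range(1, rowsnum + 1) would cover it.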

Generating test data in Excel for an EAV table

This is a pretty complicated question so be prepared! I want to generate some test data in excel for my EAV table. The columns I have are:
user_id, attribute, value
Each user_id will repeat for a random number of times between 1-4, and for each entry I want to pick a random attribute from a list, and then a random value which this can take on. Lastly I want the attributes for each id entry to be unique i.e. I do not want more than one entry with the same id and attribute. Below is an example of what I mean:
user_id   attribute    value
100001    gender       male
100001    religion     jewish
100001    university   imperial
100002    gender       female
100002    course       physics
Possible values:
attribute    value
gender       male
             female
course       maths
             physics
             chemistry
university   imperial
             cambridge
             oxford
             ucl
religion     jewish
             hindu
             christian
             muslim
Sorry that the table above messed up. I don't know how to paste into here while retaining the structure! Hopefully you can see what I'm talking about otherwise I can get a screenshot.
How can I do this? In the past I have generated random data using a random number generator and a VLOOKUP but this is a bit out of my league.
My approach is to create a table with all four attributes for each ID and then filter that table randomly to get between one and four filtered rows per ID. I assigned a random value to each attribute. The basic setup looks like this:
To the left is the randomized EAV table and to the right is the lookup table used for the randomized values. Here are the formulas. Enter them and copy down:
Column A - Establishes a random number for every block of four rows. This determines the attribute that must be selected:
=IF(COUNTIF(C$2:C2,C2)=1,RANDBETWEEN(1,4),A1)
Column B - Uses the formula in A to determine if row is included:
=IF(COUNTIF(C$2:C2,C2)=A2,TRUE,RANDBETWEEN(0,1)=1)
Column C - Creates the IDs, starting with 100,001:
=(INT((ROW()-2)/4)+100000)+1
Column D - Repeats the four attributes:
=CHOOSE(MOD(ROW()-2,4)+1,"gender","course","university","religion")
Column E - Finds the first occurrence of the Column D attribute in the lookup table and selects a randomly offset value:
=INDEX($H$2:$H$14,(MATCH(D2,$G$2:$G$14,0))+RANDBETWEEN(0,COUNTIF($G$2:$G$14,D2)-1))
When you filter on the TRUEs in Column B you'll get your list of one to four Attributes per ID. Disconcertingly, the filtering forces a recalculation, so the filtered list will no longer say TRUE for every cell in column B.
If this were mine I'd automate it a little more, perhaps by putting the "magic number" 4 (the count of attributes) in its own cell.
There are a number of ways to do this. You could use either perl or python. Both have modules for working with spreadsheets. In this case, I used python and the openpyxl module.
# File: datagen.py
# Usage: datagen.py <excel (.xlsx) filename to store data>
# Example: datagen.py myfile.xlsx
import sys
import random
from openpyxl import Workbook
# verify that user specified an argument
if len(sys.argv) < 2:
    print("Specify an excel filename to save the data, e.g myfile.xlsx")
    sys.exit(-1)
# get the excel workbook and worksheet objects
wb = Workbook()
ws = wb.active
# Modify this line to specify the range of user ids
ids = range(100001, 100100)
# data structure for the attributes and values
data = { 'gender': ['male', 'female'],
         'course': ['maths', 'physics', 'chemistry'],
         'university': ['imperial', 'cambridge', 'oxford', 'ucl'],
         'religion': ['jewish', 'hindu', 'christian', 'muslim']}
# Write column headers in the spreadsheet
ws.cell(row=1, column=1).value = 'user_id'
ws.cell(row=1, column=2).value = 'attribute'
ws.cell(row=1, column=3).value = 'value'
row = 1
# Loop through each user id
for user_id in ids:
    # randomly select how many attributes to use
    attr_cnt = random.randint(1, 4)
    # copy the keys into a list so that selected attributes can be removed
    attributes = list(data.keys())
    for idx in range(attr_cnt):
        # randomly select attribute
        attr = random.choice(attributes)
        # remove the selected attribute from further selection for this user id
        attributes.remove(attr)
        # randomly select a value for the attribute
        value = random.choice(data[attr])
        row = row + 1
        # write the values for the current row in the spreadsheet
        ws.cell(row=row, column=1).value = user_id
        ws.cell(row=row, column=2).value = attr
        ws.cell(row=row, column=3).value = value
# save the spreadsheet using the filename specified on the cmd line
wb.save(filename=sys.argv[1])
print("Done!")
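The selection logic can also be sketched with the standard library alone, writing CSV text that Excel can open. This is a variation rather than the script above: generate_eav and ATTRIBUTES are names chosen here, random.sample replaces the choose-and-remove loop to guarantee unique attributes per user, and the fixed seed is only there to make runs repeatable:

```python
import csv
import io
import random

# Attribute names and the values each one can take, as listed in the question.
ATTRIBUTES = {
    "gender": ["male", "female"],
    "course": ["maths", "physics", "chemistry"],
    "university": ["imperial", "cambridge", "oxford", "ucl"],
    "religion": ["jewish", "hindu", "christian", "muslim"],
}


def generate_eav(user_ids, attributes, rng):
    """Yield (user_id, attribute, value) rows with unique attributes per user."""
    names = list(attributes)
    for user_id in user_ids:
        # Between one and all attributes for this user.
        count = rng.randint(1, len(names))
        # sample() draws without replacement, so no attribute repeats.
        for name in rng.sample(names, count):
            yield user_id, name, rng.choice(attributes[name])


rng = random.Random(0)  # fixed seed so that runs are repeatable
rows = list(generate_eav(range(100001, 100006), ATTRIBUTES, rng))

# Write the rows as CSV into an in-memory buffer and show the result.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["user_id", "attribute", "value"])
writer.writerows(rows)
print(buffer.getvalue())
```

Replacing the in-memory buffer with a file opened with newline="" would save the same rows to disk.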
