How to loop lists of list obtained from python-docx where each list is a table and write the tables into a seperate worksheets - python-3.x

I am using python-docx to extract two tables from a document.
I have iterated over the tables and created a list of lists. Each individual list represents a table, and within that I have dictionaries per row. Each dictionary contains a key / value pair. The key is the column heading from the table and value is the cell contents for that row's data for that column.
I am facing difficulty when creating a data frame for each table and writing each table on a seperate excel sheet.
from docx.api import Document
import pandas as pd
import csv
import json
import unicodedata
document = Document('Sampletable1.docx')
tables = document.tables
print (len(tables))
big_data = []
for table in document.tables:
data = []
Keys = None
for i, row in enumerate(table.rows):
text = (cell.text for cell in row.cells)
if i == 0:
keys = tuple(text)
continue
dic = dict(zip(keys, text))
data.append(dic)
big_data.append(data)
print(big_data)
The output of the above code is:
2
[[{'Asset': 'Growth investments', 'Target investment mix': '66.50%', 'Actual investment mix': '66.30%', 'Variance': '-0.20%'}, {'Asset': 'Defensive investments', 'Target investment mix': '33.50%', 'Actual investment mix': '33.70%', 'Variance': '0.20%'}], [{'Owner': 'REST Super', 'Product': 'Superannuation', 'Type': 'Existing', 'Status': 'Existing', 'Customer 2': 'Customer 1'}, {'Owner': 'TWUSUPER TransPension', 'Product': 'TTR Pension', 'Type': 'New', 'Status': 'New', 'Customer 2': 'Customer 1'}, {'Owner': 'TWUSUPER', 'Product': 'Superannuation', 'Type': 'Existing', 'Status': 'Existing'}]]
How do I access the above lists??
Further I tried to create a pandas data frame
#write the data into a data frame
for thing in big_data:
#print(thing)
df = pd.DataFrame(thing)
print(df)
writer = pd.ExcelWriter('dftable3.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
I got the first table on the excel but unable to work with second table.
I am expecting both the table to be in the same excel workbook(dftable3.xlsx) but in different worksheets(Sheet1,Sheet2)
I have attached the images of the tables.
Thanks in advance

How do I access the above lists??
You already did, by iterating over them, or printing them.
Consider using the pretty-print library:
import pprint
pprint.pprint(big_data)
I am expecting ... different worksheets(Sheet1,Sheet2)
Well, that's unlikely, given the constant 'Sheet1' argument you supplied.
Here is one way to accomplish that:
writer = pd.ExcelWriter('dftable3.xlsx', engine='xlsxwriter')
for i, thing in enumerate(big_data):
df = pd.DataFrame(thing)
df.to_excel(writer, sheet_name=f'Sheet{i}')
writer.save()
Note the scope of writer -- it must be longer lived than each of the constituent dfs.

Related

Pandas - dataframe with column with dictionary, save value instead

I have the below stories_data dictionary, which I'm able to create a df from but since owner is a dictionary as well I would like to get the value of that dictionary so the owner column would have 178413540
import numpy as np
import pandas as pd
stories_data = {'caption': 'Tel_gusto', 'like_count': 0, 'owner': {'id': '178413540'}, 'headers': {'Content-Encoding': 'gzip'}
x = pd.DataFrame(stories_data.items())
x.set_index(0, inplace=True)
stories_metric_df = x.transpose()
del stories_metric_df['headers']
I've tried this but it gets the key not the value
stories_metric_df['owner'].explode().apply(pd.Series)
You can use .str, even for objects/dicts:
stories_metric_df['owner'] = stories_metric_df['owner'].str['id']
Output:
>>> stories_metric_df
0 caption like_count owner
1 Tel_gusto 0 178413540
Another solution would be to skip the explode, and just extract id:
stories_metric_df['owner'].apply(pd.Series)['id']
although I suspect my first solution would be faster.

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been doing data extract from many API. I would like to add a common column among all APIs.
And I have tried below
df = pd.DataFrame()
for i in range(1,200):
url = '{id}/values'.format(id=i)
res = request.get(url,headers=headers)
if res.status_code==200:
data =json.loads(res.content.decode('utf-8'))
if data['success']:
df['id'] = i
test = pd.json_normalize(data[parent][child])
df = df.append(test,index=False)
But data-frame id column I'm getting only the last iterated id only. And in case of APIs has many rows I'm getting invalid data.
From performance reasons it would be better first storing data in a dictionary and then create from this dictionary dataframe:
import pandas as pd
from collections import defaultdict
d = defaultdict(list)
for i in range(1,200):
# simulate dataframe retrieved from pd.json_normalize() call
row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
for k, v in row.to_dict().items():
d[k].append(v[0])
df = pd.DataFrame(d)

How to rename columns with Pandas? Object has no attribute error when using rename

I am trying to write a simple code in Python3 that takes all xls files in a folder, converts all text to uppercase, combines all the files into one file and save as an xlsx file. This all works. However, I also want to alter the names of the header row using rename. I can't get the code to rename anything I get the following error message:
data.rename(columns={'A 1': 'A1',
AttributeError: 'ExcelFile' object has no attribute 'rename'
Anyone help please? Thanks.
This is my code so far:
all_data = pd.DataFrame()
for f in glob.glob(r'C:\\Test\\*.xls'):
df = pd.read_excel(f)
df = df.applymap(lambda s:s.upper() if type(s) == str else s)
all_data = all_data.append(df, ignore_index=True)
writer = pd.ExcelWriter(r'C:\\Test\\alldata.xlsx', engine='xlsxwriter')
all_data.to_excel(writer)
writer.save()
print("All data in upload folder combined into one file")
files = glob.glob(r'C:\\Test\\*.xls')
for f in files:
os.remove(f)
data = pd.ExcelFile(r'C:\\Test\\alldata.xlsx')
data.rename(columns={'A 1': 'A1',
'A 2': 'B1',
'A 3: 'C1',
}, inplace=True)
data.ExcelFile.save
Try read_excel and to_excel instead of pd.ExcelFile and data.ExcelFile.save respectively. In the code you upload you have also forget the single quote after the 'A:3'. Here is an example:
all_data = pd.DataFrame()
data = pd.read_excel('test.xlsx')
data.rename(columns={'A 1': 'A1',
'A 2': 'B1',
'A 3': 'C1',
}, inplace=True)
data.to_excel("output.xlsx")

How to create Excel **Table** with pandas.to_excel()?

Need the achieve this programmatically from a dataframe:
https://learn.microsoft.com/en-us/power-bi/service-admin-troubleshoot-excel-workbook-data
Here is one way to do it using XlsxWriter:
import pandas as pd
# Create a Pandas dataframe from some data.
data = [10, 20, 30, 40, 50, 60, 70, 80]
df = pd.DataFrame({'Rank': data,
'Country': data,
'Population': data,
'Data1': data,
'Data2': data})
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter("pandas_table.xlsx", engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object. Turn off the default
# header and index and skip one row to allow us to insert a user defined
# header.
df.to_excel(writer, sheet_name='Sheet1', startrow=1, header=False, index=False)
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Get the dimensions of the dataframe.
(max_row, max_col) = df.shape
# Create a list of column headers, to use in add_table().
column_settings = []
for header in df.columns:
column_settings.append({'header': header})
# Add the table.
worksheet.add_table(0, 0, max_row, max_col - 1, {'columns': column_settings})
# Make the columns wider for clarity.
worksheet.set_column(0, max_col - 1, 12)
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Output:
Update: I've added a similar example to the XlsxWriter docs: Example: Pandas Excel output with a worksheet table
You can't do it with to_excel. A workaround is to open the generated xlsx file and add the table there with openpyxl:
import pandas as pd
df = pd.DataFrame({'Col1': [1,2,3], 'Col2': list('abc')})
filename = 'so58326392.xlsx'
sheetname = 'mySheet'
with pd.ExcelWriter(filename) as writer:
if not df.index.name:
df.index.name = 'Index'
df.to_excel(writer, sheet_name=sheetname)
import openpyxl
wb = openpyxl.load_workbook(filename = filename)
tab = openpyxl.worksheet.table.Table(displayName="df", ref=f'A1:{openpyxl.utils.get_column_letter(df.shape[1])}{len(df)+1}')
wb[sheetname].add_table(tab)
wb.save(filename)
Please note the all table headers must be strings. If you have an un-named index (which is the rule) the first cell (A1) will be empty which leads to file corruption. To avoid this give your index a name (as shown above) or export the dataframe without the index using:
df.to_excel(writer, sheet_name=sheetname, index=False)
Another workaround, if you don't want to save, re-open, and re-save, is to use xlsxwriter. It can write ListObject tables directly, but does not do so directly from a dataframe, so you need to break out the parts:
import pandas as pd
import xlsxwriter as xl
df = pd.DataFrame({'Col1': [1,2,3], 'Col2': list('abc')})
filename = 'output.xlsx'
sheetname = 'Table'
tablename = 'TEST'
(rows, cols) = df.shape
data = df.to_dict('split')['data']
headers = []
for col in df.columns:
headers.append({'header':col})
wb = xl.Workbook(filename)
ws = wb.add_worksheet()
ws.add_table(0, 0, rows, cols-1,
{'name': tablename
,'data': data
,'columns': headers})
wb.close()
The add_table() function expects 'data' as a list of lists, where each sublist represents a row of the dataframe, and 'columns' as a list of dicts for the header where each column is specified by a dictionary of the form {'header': 'ColumnName'}.
I created a package to write properly formatted excel tables from pandas: pandas-xlsx-tables
from pandas_xlsx_tables import df_to_xlsx_table
import pandas as pd
data = [10, 20, 30, 40, 50, 60, 70, 80]
df = pd.DataFrame({'Rank': data,
'Country': data,
'Population': data,
'Strings': [f"n{n}" for n in data],
'Datetimes': [pd.Timestamp.now() for _ in range(len(data))]})
df_to_xlsx_table(df, "my_table", index=False, header_orientation="diagonal")
You can also do the reverse with xlsx_table_to_df
Based on the answer of #jmcnamara, but as a convenient function and using "with" statement:
import pandas as pd
def to_excel(df:pd.DataFrame, excel_name: str, sheet_name: str, startrow=1, startcol=0):
""" Exports pandas dataframe as a formated excel table """
with pd.ExcelWriter(excel_name, engine='xlsxwriter') as writer:
df.to_excel(writer, sheet_name=sheet_name, startrow=startrow, startcol=startcol, header=True, index=False)
workbook = writer.book
worksheet = writer.sheets[sheet_name]
max_row, max_col = df.shape
olumn_settings = [{'header': header} for header in df.columns]
worksheet.add_table(startrow, startcol, max_row+startrow, max_col+startcol-1, {'columns': column_settings})
# style columns
worksheet.set_column(startcol, max_col + startcol, 21)

How to create a dictionary from CSV using a loop function?

I am trying to create several dictionaries out of a table of comments from a CSV with the following columns:
I need to create a dictionary for every row (hopefully using a loop so I don't have to create them all manually), where the dictionary keys are:
ID
ReviewType
Comment
However, I cannot figure out a fast way to do this. I tried creating a list of dictionaries using the following code:
# Import libraries
import csv
import json
import pprint
# Open file
reader = csv.DictReader(open('Comments.csv', 'rU'))
# Create list of dictionaries
dict_list = []
for line in reader:
dict_list.append(line)
pprint.pprint(dict_list)
However, now I do not know how to access the dictionaries or whether the key value pairs are matched properly since in the following image:
The ID, ReviewType and Comment do not seem to be showing as
dictionary keys
The Comment value seems to be showing as a list of half-sentences.
Is there any way to just create one dictionary for each row instead of a list of dictionaries?
Note: I did look at this question, however it didn't really help.
Here you go. I put the comment into an array
# Import libraries
import csv
import json
import pprint
# Open file
def readPerfReviewCSVToDict(csvPath):
reader = csv.DictReader(open(csvPath, 'rU'))
perfReviewsDictionary = []
for line in reader:
perfReviewsDictionary.append(line)
perfReviewsDictionaryWithCommentsSplit = []
for item in perfReviewsDictionary:
itemId = item["id"]
itemType = item["type"]
itemComment = item["comments"]
itemCommentDictionary = []
itemCommentDictionary = itemComment.split()
perfReviewsDictionaryWithCommentsSplit.append({'id':itemId, 'type':itemType, 'comments':itemCommentDictionary})
return perfReviewsDictionaryWithCommentsSplit
dict_list = readPerfReviewCSVToDict("test.csv")
pprint.pprint(dict_list)
The output is:
[{'comments': ['test', 'ape', 'dog'], 'id': '1', 'type': 'Test'},
{'comments': ['dog'], 'id': '2', 'type': 'Test'}]
Since you haven't given a reproducible example, with a sample DataFrame, I've created one for you
import pandas as pd
df = pd.DataFrame([[1, "Contractor", "Please post"], [2, "Developer", "a reproducible example"]])
df.columns = ['ID', 'ReviewType', 'Comment']
In your computer, instead of doing this, type:
df = pd.read_csv(file_path)
to read in the csv file as a pandas DataFrame.
Now I will create a list, called dictList which will be empty initially, I am going to populate it with a dictionary for each row in the DataFrame df
dictList = []
#Iterate over each row in df
for i in df.index:
#Creating an empty dictionary for each row
rowDict = {}
#Populating it
rowDict['ID'] = df.at[i, 'ID']
rowDict['ReviewType'] = df.at[i, 'ReviewType']
rowDict['Comment'] = df.at[i, 'Comment']
#Once I'm done populating it, I will append it to the list
dictList.append(rowDict)
#Go to the next row and repeat.
Now iterating over the list of dictionaries we have created for my example
for i in dictList:
print(i)
We get
{'ID': 1, 'ReviewType': 'Contractor', 'Comment': 'Please post'}
{'ID': 2, 'ReviewType': 'Developer', 'Comment': 'a reproducible example'}
Do you want this?
DICT = {}
for line in reader:
DICT[line['ID']] = line

Resources