Add outer borders using xlsxwriter - python-3.x

I have a dataframe and I want to set outer borders. I tried the code below, but it adds a border to every cell in the 'A1:I83' range. I only want a thick outer border:
border_format=workbook.add_format({
'border':1,
'align':'left',
'font_size':10
})
worksheet_rating_input.conditional_format('A1:I83' , { 'type' : 'no_blanks' , 'format' : border_format})

One way you could do it is to set the format of J1 through J183 with a thick left border and set the format of A184 to I184 to have a thick top border.
I've posted a fully reproducible example of this below. In my example, I use df.shape to place the borders depending on the dimensions of my dataframe.
import xlsxwriter
import pandas as pd
import numpy as np
# Creating a dataframe
df = pd.DataFrame(np.random.randn(182, 9), columns=list('ABCDEFGHI'))
column_list = df.columns
# Create a Pandas Excel writer using XlsxWriter engine.
writer = pd.ExcelWriter("test.xlsx", engine='xlsxwriter')
df.to_excel(writer, index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
leftFormat = workbook.add_format({'left': 5})
topFormat = workbook.add_format({'top': 5})
for row in range(0, df.shape[0] + 1):
    worksheet.write(row, df.shape[1], '', leftFormat)
for col in range(0, df.shape[1]):
    worksheet.write(df.shape[0] + 1, col, '', topFormat)
worksheet.freeze_panes(1, 0)  # freeze the top row
writer.close()  # writes the file; ExcelWriter.save() was removed in pandas 2.0
With Expected Output:
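The example above draws only the right and bottom edges (through the left border of column J and the top border of row 184). A minimal sketch of one way to get all four sides, assuming you are willing to write the cells yourself instead of calling df.to_excel: give each edge cell a format that carries only the sides it needs. The helper names edge_borders and write_with_outer_border are hypothetical, and the data is assumed to contain no NaN values, which XlsxWriter's write() rejects by default.

```python
THICK = 5  # XlsxWriter border style index for a thick line


def edge_borders(row, col, last_row, last_col, style=THICK):
    """Return the border properties a cell needs if it lies on the outer edge."""
    props = {}
    if row == 0:
        props["top"] = style
    if row == last_row:
        props["bottom"] = style
    if col == 0:
        props["left"] = style
    if col == last_col:
        props["right"] = style
    return props


def write_with_outer_border(df, path, sheet_name="Sheet1"):
    """Write header + data cell by cell, formatting only the edge cells."""
    # Imported here so edge_borders() can be tried without XlsxWriter installed.
    import xlsxwriter

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet(sheet_name)
    rows = [list(df.columns)] + df.values.tolist()
    last_row, last_col = len(rows) - 1, len(df.columns) - 1
    formats = {}  # one format object per distinct combination of sides
    for r, values in enumerate(rows):
        for c, value in enumerate(values):
            props = edge_borders(r, c, last_row, last_col)
            key = tuple(sorted(props.items()))
            if key not in formats:
                formats[key] = workbook.add_format(props)
            worksheet.write(r, c, value, formats[key])
    workbook.close()
```

Interior cells get an empty property set, so only the perimeter of the table is bordered. Caching the formats keeps the workbook to at most nine format objects regardless of the table size.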


Altair/Vega-Lite heatmap: Filter top k

I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save
# Read in the file and fill empty cells with zero
df = pd.read_excel(r"path\to\df")  # raw string so that \t is not read as a tab
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()
alt.Chart(df_melted).mark_rect().encode(
alt.X('SampleID:N'),
alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
alt.Color('Relative_abundance:Q')
)
If you know what you want to show is the entries with c, e, b, y and a (and it will not change later) you could simply apply a transform_filter on the field Lowest_Taxon.
If you want to calculate on the spot which ones make it into the top five, it needs a bit more effort, i.e. a combination of joinaggregate, window and filter transforms.
I paste an example of each below. By the way, I converted the data you pasted into a csv file, which the code snippets import. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then be read directly in the code.
Simple approach:
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
alt.Chart(df_melted).mark_rect().encode(
alt.X('SampleID:N'),
alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
alt.Color('Relative_abundance:Q')
).transform_filter(
alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach:
Set n to the number of top entries you want to display.
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()
df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')
n = 5 # number of entries to display
alt.Chart(df_melted).mark_rect().encode(
alt.X('SampleID:N'),
alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
mean_rel_ab = 'mean(Relative_abundance)',
count_of_samples = 'valid(Relative_abundance)',
groupby = ['Lowest_Taxon']
).transform_window(
rank='rank(mean_rel_ab)',
sort=[alt.SortField('mean_rel_ab', order='descending')],
frame = [None, None]
).transform_filter(
(alt.datum.rank <= (n-1) * alt.datum.count_of_samples + 1)
)
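The filter threshold may look odd at first. After the joinaggregate, every row of a taxon carries the same mean, so all rows of the best taxon tie at rank 1, the rows of the second taxon tie at rank 1 + count_of_samples, and so on. With 7 samples and n = 5 the kept ranks are 1, 8, 15, 22 and 29, hence rank <= (n-1) * count_of_samples + 1. A minimal pandas sketch of the same logic, useful only as a cross-check of the Altair transforms; the function name top_n_rows is hypothetical, and it assumes that no two taxa share exactly the same mean:

```python
import pandas as pd


def top_n_rows(df_long, n, group="Lowest_Taxon", value="Relative_abundance"):
    """Mimic the joinaggregate -> window(rank) -> filter chain in pandas."""
    out = df_long.copy()
    # joinaggregate: attach the per-group mean and count to every row
    out["mean_rel_ab"] = out.groupby(group)[value].transform("mean")
    out["count_of_samples"] = out.groupby(group)[value].transform("count")
    # window rank over the whole table: ties share the lowest rank (1, 1, 3, ...)
    out["rank"] = out["mean_rel_ab"].rank(method="min", ascending=False)
    # filter: keep the rows whose rank falls within the first n groups
    return out[out["rank"] <= (n - 1) * out["count_of_samples"] + 1]


toy = pd.DataFrame({
    "Lowest_Taxon": ["x", "x", "y", "y", "z", "z"],
    "SampleID": ["p1", "p2", "p1", "p2", "p1", "p2"],
    "Relative_abundance": [12.0, 8.0, 4.0, 6.0, 1.0, 1.0],
})
kept = top_n_rows(toy, n=2)  # keeps all rows of x and y, drops z
```

Running it on df_melted with n = 5 should leave only the rows of the five taxa the chart displays.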

How to update the date range on X-Axis with python-pptx

I have a multi-line chart whose data I'm trying to update. I can change the data for the series (1 to 5 in my case) using a dataframe, but I can't figure out how to change the range of the category axis.
In the current scenario the date range starts from 2010; I can't figure out how to update it dynamically based on the input data.
My chart is as shown below:
My chart data is as below:
My code is as below:
import pandas as pd
from pptx import Presentation
from pptx.chart.data import CategoryChartData, ChartData
df = pd.DataFrame({
'Date':['2010-01-01','2010-02-01','2010-03-01','2010-04-01','2010-05-01'],
'Series 1': [0.262918, 0.259484,0.263314,0.262108,0.252113],
'Series 2': [0.372340,0.368741,0.375740,0.386040,0.388732],
'Series 3': [0.109422,0.109256,0.112426,0.123932,0.136620],
'Series 4': [0.109422,0.109256,0.112426,0.123932,0.136620], # copy of series 3 for easy testing
'Series 5': [0.109422,0.109256,0.112426,0.123932,0.136620], # copy of series 3 for easy testing
})
prs = Presentation(presentation_path)
def update_multiline(chart, df):
    plot = chart.plots[0]
    category_labels = [c.label for c in plot.categories]
    # series = plot.series[0]
    chart_data = CategoryChartData()
    chart_data.categories = [c.label for c in plot.categories]
    category_axis = chart.category_axis
    category_axis.minimum_scale = 1  # this should be a date
    category_axis.maximum_scale = 100  # this should be a date
    tick_labels = category_axis.tick_labels
    df = df.drop(columns=['Date'])
    for index in range(df.shape[1]):
        columnSeriesObj = df.iloc[:, index]
        chart_data.add_series(plot.series[index].name, columnSeriesObj)
    chart.replace_data(chart_data)
# ================================ slide index 3 =============================================
slide_3 = prs.slides[3]
slide_3_title = slide_3.shapes.title # assigning a title
graphic_frame = slide_3.shapes
# slide has only one chart and that's the 3rd shape, hence graphic_frame[2]
slide_3_chart = graphic_frame[2].chart
update_multiline(slide_3_chart, df)
prs.save(output_path)
How can I update the date range if the dates in my dataframe start from, say, 2015, i.e. 'Date':['2015-01-01','2015-02-01','2015-03-01','2015-04-01','2015-05-01']?
You are simply copying the categories of the old chart into the new chart with:
chart_data.categories = [c.label for c in plot.categories]
You must draw the category labels from the dataframe if you expect them to change.
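A minimal sketch of that change, assuming the dataframe keeps its dates in a column named 'Date' that pandas can parse. The helper names labels_and_series and build_chart_data are hypothetical, and the series names are taken from the dataframe columns rather than from the existing chart:

```python
import pandas as pd


def labels_and_series(df, date_column="Date"):
    """Split the dataframe into category labels and per-series value lists."""
    labels = pd.to_datetime(df[date_column]).dt.date.tolist()
    series = {name: df[name].tolist() for name in df.columns if name != date_column}
    return labels, series


def build_chart_data(df, date_column="Date"):
    """Build chart data whose categories come from the dataframe, not the old chart."""
    # Imported here so labels_and_series() can be tried without python-pptx installed.
    from pptx.chart.data import CategoryChartData

    labels, series = labels_and_series(df, date_column)
    chart_data = CategoryChartData()
    chart_data.categories = labels  # new dates replace the old category labels
    for name, values in series.items():
        chart_data.add_series(name, values)
    return chart_data
```

Inside update_multiline this would replace the manual loop: chart.replace_data(build_chart_data(df)).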

Locate columns in dataframe to graph

I have an Excel file with 3 columns and 60 rows. The first column is the Date, which I want to put on the x axis, and I want to plot the other 2 against it. I need help locating the 2 other columns so I can pass them to ax1.plot() and ax2.plot().
I have tried to locate them with [:,1], but that doesn't work, and I have tried to locate them by column name. The second column is "S&P/TSX Composite index (^GSPTSE)" and the third column is "Bitcoin CAD (BTC-CAD)".
import pandas as pd
import matplotlib.pyplot as plt
InputData = pd.read_excel('Python_assignment_InputData.xlsx')
#InputData = InputData[:15]
"""
print("a)\n", InputData,"\n")
print("b)")
InputData['Date'] = InputData.DATE.dt.year
InputData['Year'] = pd.to_datetime(InputData.Date).dt.year
"""
#ax1 = InputData.iloc[:,1]
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot()
ax2.plot()
ax1.set_ylabel("TSX", color = 'b')
ax2.set_ylabel("BTC", color = 'g')
ax1.set_xlabel("Year")
plt.title("Question 6")
plt.show()
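A minimal sketch of the column selection, assuming the column names are exactly those quoted in the question. InputData[:,1] fails because plain [] on a DataFrame looks up column labels; positional selection goes through .iloc. The function names split_columns and plot_two_axes are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

TSX_COL = "S&P/TSX Composite index (^GSPTSE)"
BTC_COL = "Bitcoin CAD (BTC-CAD)"


def split_columns(frame):
    """Return (dates, tsx, btc) using label-based and positional selection."""
    dates = frame.iloc[:, 0]  # first column by position
    tsx = frame[TSX_COL]      # second column by its exact name
    btc = frame.iloc[:, 2]    # third column by position, same data as frame[BTC_COL]
    return dates, tsx, btc


def plot_two_axes(frame):
    """Plot both series against the dates on twin y axes."""
    dates, tsx, btc = split_columns(frame)
    fig, ax1 = plt.subplots()
    ax2 = ax1.twinx()
    ax1.plot(dates, tsx, color="b")
    ax2.plot(dates, btc, color="g")
    ax1.set_ylabel("TSX", color="b")
    ax2.set_ylabel("BTC", color="g")
    ax1.set_xlabel("Date")
    return fig
```

With the question's data this would be called as plot_two_axes(InputData) followed by plt.show().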

Using xlsxwriter (or other packages) to create Excel tabs with specific naming, and write dataframe to the corresponding tab

I am trying to query based on different criteria, and then create individual tabs in Excel to store the query results.
For example, I want to query all the results that match criterion A, and write the result to an Excel tab named "A". The query result is stored as a pandas DataFrame.
My problem is that when I perform 4 different queries based on the criteria "A", "B", "C", "D", the final Excel file contains only one tab, which corresponds to the last criterion in the list. It seems that all the previous tabs are overwritten.
Here is sample code where I replace the SQL query part with a pre-set dataframe and the tab name is set to 0, 1, 2, 3 ... instead of the default Sheet1, Sheet2... in Excel.
import pandas as pd
import xlsxwriter
import datetime
def GCF_Refresh(fileCreatePath, inputName):
    currentDT = str(datetime.datetime.now())
    currentDT = currentDT[0:10]
    loadExcelName = currentDT + '_' + inputName + '_Load_File'
    fileCreatePath = fileCreatePath +'\\' + loadExcelName+'.xlsx'
    wb = xlsxwriter.Workbook(fileCreatePath)
    data = [['tom'], ['nick'], ['juli']]
    # Create the pandas DataFrame
    df = pd.DataFrame(data, columns=['Name'])
    writer = pd.ExcelWriter(fileCreatePath, engine='xlsxwriter')
    for iCount in range(5):
        #worksheet = writer.sheets[str(iCount)]
        #worksheet.write(0, 0, 'Name')
        df['Name'].to_excel(fileCreatePath, sheet_name=str(iCount), startcol=0, startrow=1, header=None, index=False)
        writer.save()
        writer.close()
# Change the file path here to store on your local computer
GCF_Refresh("H:\\", "Bulk_Load")
My goal for this sample code is to have 5 tabs named, 0, 1, 2, 3, 4 and each tab has 'tom', 'nick' and 'juli' printed to it. Right now, I just have one tab (named 4), which is the last tab among all the tabs I expected.
There are a number of errors in the code:
The xlsx file is created using XlsxWriter directly and then overwritten by creating it again in Pandas.
The to_excel() method takes a reference to the writer object, not the file path.
The save() and close() calls do the same thing and shouldn't be in the loop.
Here is a simplified version of your code with these issues fixed:
import pandas as pd
import xlsxwriter
fileCreatePath = 'test.xlsx'
data = [['tom'], ['nick'], ['juli']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name'])
writer = pd.ExcelWriter(fileCreatePath, engine='xlsxwriter')
for iCount in range(5):
    df['Name'].to_excel(writer,
                        sheet_name=str(iCount),
                        startcol=0,
                        startrow=1,
                        header=None,
                        index=False)
writer.close()  # called once, after the loop; ExcelWriter.save() was removed in pandas 2.0
Output:
See Working with Python Pandas and XlsxWriter in the XlsxWriter docs for some details about getting Pandas and XlsxWriter working together.

Why is plot returning "ValueError: could not convert string to float:" when a dataframe column of floats is being passed to the plot function?

I am trying to plot a dataframe I have created from an Excel spreadsheet using either matplotlib alone or matplotlib with pandas, i.e. df.plot. However, Python keeps returning a "could not convert string to float" error. This is confusing because when I print the dataframe column, all the values appear to be floats.
I've tried printing the values of the dataframe column and using the pandas.plot syntax. I've also tried saving the column to a new variable.
import pandas as pd
from matplotlib import pyplot as plt
import glob
import openpyxl
import math
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl.styles import Border, Side, Alignment
import seaborn as sns
import itertools
directory = 'E:\some directory'
#QA_directory = directory + '**/*COPY.xlsx'
wb = openpyxl.load_workbook(directory + '\\Calcs\\' + "excel file.xlsx", data_only = 'True')
plt.figure(figsize=(16,9))
axes = plt.axes()
plt.title('Drag Amplification', fontsize = 16)
plt.xlabel('Time (s)', fontsize = 14)
plt.ylabel('Cf', fontsize = 14)
d = pd.DataFrame()
n=[]
for sheets in wb.sheetnames:
    if '2_1' in sheets and '2%' not in sheets and '44%' not in sheets:
        name = sheets[:8]
        print(name)
        ws = wb[sheets]
        data = ws.values
        cols = next(data)[1:]
        data = list(data)
        idx = [r[0] for r in data]
        data = (itertools.islice(r, 1, None) for r in data)
        df = pd.DataFrame(data, index=idx, columns=cols)
        df = df.dropna()
        #x = df['x/l']
        #y = df.Cf
        print(df.columns)
        print(df.Cf.values)
        x=df['x/l'].values
        plt.plot(x, df.Cf.values)
        """x = [wb[sheets].cell(row=row,column=1).value for row in range(1,2000) if wb[sheets].cell(row=row,column=1).value]
        print(x)
        Cf = [wb[sheets].cell(row=row,column=6).value for row in range(1,2000) if wb[sheets].cell(row=row,column=1).value]
        d[name+ 'x'] = pd.DataFrame(x)
        d[name + '_Cf'] = pd.Series(Cf, index=d.index)
        print(name)"""
print(df)
plt.show()
I'm expecting a plot of line graphs with the values of x/l on the x axis and Cf on the y axis, with a line for each of the relevant sheets in the workbook. Any insights as to why I am getting this error would be appreciated!
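The question does not show what the offending value is, so the following is only a diagnostic sketch under an assumption: a DataFrame built from openpyxl's ws.values typically has object-dtype columns, and a single text cell (a units row, a placeholder such as "n/a") can sit among values that print like floats. The helper names non_numeric_values and coerce_numeric are hypothetical:

```python
import pandas as pd


def non_numeric_values(series):
    """Return the entries that are present but cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    return series[parsed.isna() & series.notna()]


def coerce_numeric(frame, columns):
    """Convert the given columns to floats and drop rows that fail to parse."""
    cleaned = frame.copy()
    for name in columns:
        cleaned[name] = pd.to_numeric(cleaned[name], errors="coerce")
    return cleaned.dropna(subset=list(columns))
```

Checking df.dtypes and non_numeric_values(df.Cf) before plotting would show whether a string is present; if so, plotting coerce_numeric(df, ['x/l', 'Cf']) instead of df avoids the conversion error.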
