rangeBetween(unboundedPreceding to previous row(included)) - apache-spark

I would like to compute an aggregate over a window frame with a specific range:
rangeBetween(unboundedPreceding to previous row(included))
I do not want to include the current row in my calculations. Using -1 as the upper bound seems to work, but it is a simplification of reality.
val profileWindow = Window.partitionBy("id")
.orderBy(col("source_event_timestamp").cast(LongType))
.rangeBetween(Window.unboundedPreceding, -1)
Is there any other way to write the code presented above?
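To see why -1 is a simplification, it may help to emulate the frame in plain Python. This is only a sketch of range-frame semantics, not Spark code: it assumes the frame for a row contains every row of the partition whose ordering value is at most the current value minus 1. Under that assumption, rows sharing the current row's timestamp are excluded along with the current row. The function name sum_before and the sample data are made up for illustration.

```python
def sum_before(rows):
    """For each (ts, value) row, sum the values of rows with ts <= current ts - 1."""
    result = []
    for ts, _ in rows:
        # Range frame: membership depends on the ordering value, not on row position.
        frame = [v for other_ts, v in rows if other_ts <= ts - 1]
        result.append(sum(frame))  # an empty frame sums to 0 in this sketch
    return result


# Hypothetical partition: two events share timestamp 10.
events = [(5, 1), (10, 2), (10, 3), (12, 4)]
totals = sum_before(events)
print(totals)  # [0, 1, 1, 6]
```

Both rows with timestamp 10 see only the row at timestamp 5, so neither includes the other even though one of them comes first in the data.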

Related

Can you access a VBA list with an in-cell Excel formula?

I wrote a VBA script/macro which runs when a change is detected in a specific range (n x m) of cells. Then, it changes the values in another range (n x 1) based on what is detected in the first range.
This bit works perfectly ... but then comes the age old erased undo stack problem. Unfortunately, the ability for the user to undo their last ~10 or so actions is required.
My understanding is that the undo stack is only cleared when VBA directly edits something on the sheet - but it is preserved if the VBA is just running in the back without editing the sheet.
So my question is: Is it possible to use an in cell formula (something like below) to pull values from a VBA array?
'sample of in-cell function in cell A3
=function_to_get_value_from_vba_array(vba_array, index_of_desired_value)
Essentially, VBA would store a 1D array of strings with the values needed for the range. And by using a formula to grab the value from the array: I might be able to get around the issue of the undo stack being erased.
Thanks!
Solution
You need to do something like the following: the argument to your function should be a call that builds the array. I created a dummy function that returns some sample arrays to demonstrate this. In your case, you will likely need to store the changes from the worksheet event in a global array variable instead and, as you stated, do nothing on the worksheet (whenever a change happens, just ReDim or append to your global array as needed). However, a new problem may arise when you close and reopen the workbook, or when the array is lost for some other reason, so you need to keep track of it. I would suggest catching the before-close event and converting the formulas to static values there.
Function vba_array(TxtCase As String)
Dim ArrDummy(1) As Variant
Select Case TxtCase
Case "Txt": ArrDummy(0) = "Hi": ArrDummy(1) = "Hey"
Case "Long": ArrDummy(0) = 0: ArrDummy(1) = 1
Case "Boolean": ArrDummy(0) = True: ArrDummy(1) = False
End Select
vba_array = ArrDummy
End Function
In your calling function, do the following
Function get_value_from_vba_array(vba_array() As Variant, index_of_desired_value As Long) As Variant
'when parsing, even with option base 0 it starts at 1, so we need to add 1 up
get_value_from_vba_array = vba_array(index_of_desired_value + 1)
End Function
In your book, your formula should be something like:
=get_value_from_vba_array(vba_array("Txt"),1)
Demo
I did some actions before, so you are able to see that the "undo" works

extremely slow add a table to python-docx from a csv file

I have to add a table from a CSV file of around 1500 rows and 9 columns (75 pages) to a docx Word document using python-docx.
I have tried different approaches, reading the CSV with pandas or opening the CSV file directly. It takes around 150 minutes to finish the job regardless of which way I choose.
My question is whether this is normal behavior or whether there is any other way to improve this task.
I'm using this for loop to read several CSV files and parse them into table format:
for toTAB in listBRUTO:
    df = pd.read_csv(toTAB)
    # add a table to the end and create a reference variable
    # extra row is so we can add the header row
    t = doc.add_table(df.shape[0]+1, df.shape[1])
    t.style = 'LightShading-Accent1' # border
    # add the header rows.
    for j in range(df.shape[-1]):
        t.cell(0,j).text = df.columns[j]
    # add the rest of the data frame
    for i in range(df.shape[0]):
        for j in range(df.shape[-1]):
            t.cell(i+1,j).text = str(df.values[i,j])
    #TABLE Format
    for row in t.rows:
        for cell in row.cells:
            paragraphs = cell.paragraphs
            for paragraph in paragraphs:
                for run in paragraph.runs:
                    font = run.font
                    font.name = 'Calibri'
                    font.size = Pt(7)
    doc.add_page_break()
doc.save('blabla.docx')
Thanks in advance
You'll want to minimize the number of calls to table.cell(). Because of the way cell-merging works, these are expensive operations that really add up when performed in a tight loop.
I would start with refactoring this block and see how much improvement that yields:
# --- add the rest of the data frame ---
for i in range(df.shape[0]):
    for j, cell in enumerate(table.rows[i + 1].cells):
        cell.text = str(df.values[i, j])
python-docx walks the whole table every single time you access its "cells" property,
so you should call ".cell" as rarely as possible and use a cache of cells instead.
Here are two examples that access a table of size 3*1500:
code 1: about 150.0s
for row in table.rows:
    print('processing: {0:30s}'.format(row.cells[0].text),end='\r')
code 2: about 1.4s
clls=table._cells
for row_idx in range(len(clls)//table._column_count):
    print('processing: {0:30s}'.format(
        clls[0 + row_idx*table._column_count].text),end='\r')
In code 2, clls=table._cells uses "_cells" to handle cell merging, so clls[column_idx + row_idx*table._column_count].text works just as well as table.rows[row_idx].cells[column_idx].text, and does not require the table to be exactly rectangular.
For a rectangular table without merged cells, you can export all cells into a list-of-lists structure and fill them very quickly (less than 0.5s vs 15s for tables of ~300 lines with 3 columns):
from docx.table import _Cell
def get_cells_grid(table):
    cells = [[]]
    col_count = table._column_count
    for tc in table._tbl.iter_tcs():
        cells[-1].append(_Cell(tc, table))
        if len(cells[-1]) == col_count:
            cells.append([])
    return cells
cells = get_cells_grid(t)
for i in range(df.shape[0]):
    # df.shape[1] is the column count
    for j in range(df.shape[1]):
        cells[i][j].text = str(df.values[i, j])
The function is based on the table._cells code: https://github.com/python-openxml/python-docx/blob/da75fcf01f7f322e846e2ac3e1936aedd766acc8/docx/table.py#L162
Just to add my experience: if you have to create a huge table, create the whole structure first, meaning all the rows and cells you will need, and then store the cells like so:
table_cells = table._cells (according to #kztopia)
From there you can manipulate cells as you wish (merging, adding text, etc.) fairly quickly, since the full cell list is built only once.
In my use case, for a table that is, in my opinion, not so big (~130 rows, 8 cells per row), it used to take 9 s to create the whole thing and now I'm at 0.5 s or so.
Keep in mind that the bigger the table, the more time it'll take to execute cell().
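The cost pattern described in these answers can be illustrated with a toy model. This is not python-docx itself: ToyTable is a hypothetical stand-in whose cells property rebuilds the flat cell list on every access, which is the behavior the answers above attribute to the library. The visits counter just tallies how many cells were walked.

```python
class ToyTable:
    """Toy model: the flat cell list is rebuilt on every access to `cells`."""

    def __init__(self, n_rows, n_cols):
        self.n_rows = n_rows
        self.n_cols = n_cols
        self.text = [""] * (n_rows * n_cols)
        self.visits = 0  # total number of cells walked so far

    @property
    def cells(self):
        count = self.n_rows * self.n_cols
        self.visits += count  # one full walk of the table
        return list(range(count))  # indices stand in for cell objects

    def cell(self, row, col):
        return self.cells[col + row * self.n_cols]


def fill_naive(table):
    for i in range(table.n_rows):
        for j in range(table.n_cols):
            table.text[table.cell(i, j)] = f"{i},{j}"


def fill_cached(table):
    cells = table.cells  # walk the table once and reuse the result
    for i in range(table.n_rows):
        for j in range(table.n_cols):
            table.text[cells[j + i * table.n_cols]] = f"{i},{j}"


naive, cached = ToyTable(100, 9), ToyTable(100, 9)
fill_naive(naive)
fill_cached(cached)
print(naive.visits, cached.visits)  # 810000 900
```

In this model the naive fill walks the table once per cell, so the work grows with the square of the cell count, while the cached fill walks it once.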

Can For loops be dynamic in that the limits change as the code is run?

I'm creating a forecasting model for a fleet of equipment using Excel wholly written with VBA.
While forecasting the utilisation of equipment, some equipment will reach its replacement threshold and a new piece of equipment takes over from there. This will require a new row added to the table for the new equipment.
I would have thought that a For loop would be dynamic, so using a variable for the upper limit would be re-evaluated on every loop, but this seems not to be the case.
I set up a simple scenario to test as per the code below, starting with 2 listrows in the table.
Sub Test1()
Set Table1 = Sheet1.ListObjects("Table1")
x = Table1.ListRows.Count
For i = 1 To x
Set NewRow = Table1.ListRows.Add
x = Table1.ListRows.Count
NewRow.Range(1, 1) = x
Next i
End Sub
I assumed it would run infinitely but it will only run as per the initial case provided.
Is using a different type of loop (Do While or Do Until) the ONLY way to achieve a genuinely dynamic outcome?
To sum up the comments on your question:
In VBA, the limits of a For loop are evaluated once, when the loop is entered, so changing x inside the loop does not change the number of iterations. Some other languages do in principle allow this kind of operation; however, it is not good programming style.
The reason is that a for-loop is meant to count over a fixed interval (one that should not change during the loop's execution).
Instead of using a for-loop here, one may consider using a while-loop:
i = 1
While i <= x
Set NewRow = Table1.ListRows.Add
x = Table1.ListRows.Count
NewRow.Range(1, 1) = x
i = i+1
Wend
Caution:
This loop will run forever (or rather, until you run out of resources, at which point it will crash). The reason is that you move the upper bound of the iteration 1 unit further away each time you approach it by 1 unit.
The best way to approach what you actually want to achieve is using a buffer list:
Identify the items you want to create a new row for
Insert that row into a second list
Iterate over the second list and append the items to the original list
This way, you avoid testing the newly inserted items (which most likely won't be outdated by now).
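The three steps above could look roughly like this. The sketch is in Python rather than VBA, and the dictionary fields, the threshold and the "-replacement" naming are all invented for illustration; the point is only that new rows are collected in a separate buffer and appended after the loop has finished.

```python
def extend_fleet(fleet, threshold):
    """Append replacements for worn-out items without growing the list mid-loop."""
    pending = []  # the buffer list
    for item in fleet:
        # Step 1: identify the items that need a new row.
        if item["hours"] >= threshold:
            # Step 2: put the new row into the second list.
            pending.append({"name": item["name"] + "-replacement", "hours": 0})
    # Step 3: append the buffered rows to the original list.
    fleet.extend(pending)
    return fleet


# Hypothetical fleet: only "A" has reached the replacement threshold.
fleet = [{"name": "A", "hours": 120}, {"name": "B", "hours": 40}]
extend_fleet(fleet, threshold=100)
print([item["name"] for item in fleet])  # ['A', 'B', 'A-replacement']
```

Because the loop only ever sees the original items, it terminates after exactly as many iterations as there were rows at the start.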

Excel Javascript (Office.js) - LastRow/LastColumn - better alternative?

I have been a fervent reader of StackOverflow over the last few years, and I was able to resolve pretty much everything in VBA Excel with a search and some adapting. I never felt the need to post any questions before, so I do apologize if this somehow duplicates something else, or there is an answer to this already and I couldn't find it.
Now I'm considering Excel-JS in order to create an add-in (or more), but I have to say that JavaScript is not exactly my bread and butter. Over my time using VBA, I have found that one of the simplest and most common needs is to get the last row in a sheet or given range, and maybe less often the last column.
I've managed to put some code together in JavaScript to get similar functionality, and as it is... it works. There are 2 reasons I'm posting this:
Looking to improve the code, and my knowledge
Maybe someone else can make use of the code meanwhile
So... in order to get my lastrow/lastcolumn, I use global variables:
var globalLastRow = 0; //Get the last row in used range
var globalLastCol = 0; //Get the last column in used range
Populate the global variables with the function to return lastrow/lastcolumn:
function lastRC(wsName) {
    return Excel.run(function (context) {
        var wsTarget = context.workbook.worksheets.getItem(wsName);
        //Get last row/column from used range
        var uRange = wsTarget.getUsedRange();
        uRange.load(['rowCount', 'columnCount']);
        return context.sync()
            .then(function () {
                globalLastRow = uRange.rowCount;
                globalLastCol = uRange.columnCount;
            });
    });
}
And lastly get the value where I need them in other functions:
var lRow = 0; var lCol = 0;
await lastRC("randomSheetName");
lRow = globalLastRow; lCol = globalLastCol;
I'm mainly interested in whether I can return the values directly from the function lastRC (and how...), rather than working around it with this solution.
Any suggestions are greatly appreciated (ideally if they don't come with stones attached).
EDIT:
I've given up on using an extra function for this for now, given that it uses an extra context.sync and, as I've read since this post, the fewer syncs, the better.
Also, the method above is only good as long as your used range starts in cell "A1" (or at least in the first row/column); otherwise a row/column count is not exactly helpful when you need the index.
Luckily, there is another method to get the last row/column:
var uRowsIndex = ws.getCell(0, 0).getEntireColumn().getUsedRange().getLastCell().load(['rowIndex']);
var uColsIndex = ws.getCell(0, 0).getEntireRow().getUsedRange().getLastCell().load(['columnIndex']);
To break down one of these examples, you are:
starting at cell "A1" getCell(0, 0)
select the entire column "A:A" getEntireColumn()
select the usedrange in that column getUsedRange() (i.e.: "A1:A12")
select the last cell in the used range getLastCell() (i.e.: "A12")
load the row index load(['rowIndex']) (for "A12" rowIndex = 11)
If your data is constant, and you don't need to check lastrow at specific column (or last column at specific row), then the shorter version of the above is:
uIndex = ws.getUsedRange().getLastCell().load(['rowIndex', 'columnIndex']);
Lastly, keep in mind that usedrange will consider formatting as well, not just values, so if you have formatted rows under your data, expect the unexpected.
late edit - you can specify if you want your used range to be of values only (thanks Ethan):
getUsedRange(valuesOnly?: boolean): Excel.Range;
I have to say a big thank you to Michael Zlatkovsky, who has put a lot of work into a lot of documentation, which I'm far from having finished reading.
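The point in the EDIT about counts versus indexes can be checked with a small plain-Python model. This is not Office.js: the sheet is just a list of lists with None for empty cells, and used_range is a hypothetical helper that assumes the sheet has at least one non-empty cell.

```python
def used_range(grid):
    """Return (first_row, first_col, row_count, col_count), 0-based, for the non-empty area."""
    rows = [r for r, row in enumerate(grid) if any(v is not None for v in row)]
    cols = [c for row in grid for c, v in enumerate(row) if v is not None]
    first_row, last_row = min(rows), max(rows)
    first_col, last_col = min(cols), max(cols)
    return first_row, first_col, last_row - first_row + 1, last_col - first_col + 1


# Hypothetical sheet: the data does not start in the first row or column.
sheet = [
    [None, None, None],
    [None, None, None],
    [None, "a", "b"],
    [None, "c", None],
]
first_row, first_col, row_count, col_count = used_range(sheet)
last_row_index = first_row + row_count - 1
last_col_index = first_col + col_count - 1
print(row_count, last_row_index)  # 2 3
```

Here the row count is 2 but the last used row has index 3, which is why the count alone is only usable when the used range starts in the first row.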

VBA subroutine slows down a lot after first execution

I have a subroutine that generates a performance report for different portfolios within 5 families. The portfolios in question are never the same, and neither is the number in each family. So I copy and paste a template (which is already formatted) and add a formatted row (containing the formulas) in the right family for each portfolio in the report. Everything works fine; the code is not optimal, of course, but it does what we need. The problem is not the code itself: the first time I execute it, it runs really fast (about 1 second), but from the second time on it slows down dramatically (almost 30 seconds for a basic task identical to the first one). I tried manual calculation, turning off screen refreshing and so on, but that is not where the problem comes from. It looks like a memory leak to me, but I cannot find where the problem is. Why would the code run very fast the first time but so much slower right after? Whatever the length of the report and the content of the file, I need to close Excel and reopen it for each report.
Not sure if I am clear, but it is not because the code makes the Excel file larger or something like that: after the first (fast) execution, if I save the workbook, close it and reopen it, the (new) first execution is again very fast, but if I do the exact same thing without closing and reopening, it is very slow.
Dim Family As String
Dim FamilyN As String
Dim FamilyP As String
Dim NumberOfFamily As Integer
Dim i As Integer
Dim zone As Integer
Sheets("RapportTemplate").Cells.Copy Destination:=Sheets("Rapport").Cells
Sheets("Rapport").Activate
i = 3
NumberOfFamily = 0
FamilyP = Sheets("RawDataMV").Cells(i, 4)
While (Sheets("RawDataMV").Cells(i, 3) <> "") And (i < 100)
Family = Sheets("RawDataMV").Cells(i, 4)
FamilyN = Sheets("RawDataMV").Cells(i + 1, 4)
If (Sheets("RawDataMV").Cells(i, 3) <> "TOTAL") And _
(Sheets("RawDataMV").Cells(i, 2) <> "Total") Then
If (Family <> FamilyP) Then
NumberOfFamily = NumberOfFamily + 1
End If
With Sheets("Rapport")
.Rows(i + 8 + (NumberOfFamily * 3)).EntireRow.Insert
.Rows(1).Copy Destination:=Sheets("Rapport").Rows(i + 8 + (NumberOfFamily * 3))
.Cells(i + 8 + (NumberOfFamily * 3), 6).Value = Sheets("RawDataMV").Cells(i, 2).Value
.Cells(i + 8 + (NumberOfFamily * 3), 7).Value = Sheets("RawDataMV").Cells(i, 3).Value
End With
End If
i = i + 1
FamilyP = Family
Wend
For i = 2 To 10
If Sheets("Controle").Cells(16, i).Value = "" Then
Sheets("Rapport").Cells(1, i + 11).EntireColumn.Hidden = True
Else
Sheets("Rapport").Cells(1, i + 11).EntireColumn.Hidden = False
End If
Next i
Sheets("Rapport").Cells(1, 1).EntireRow.Hidden = True
'Define printing area
zone = Sheets("Rapport").Cells(4, 3).End(xlDown).Row
Sheets("Rapport").PageSetup.PrintArea = "$D$4:$Y$" & zone
Sheets("Rapport").Calculate
Sheets("RANK").Calculate
Sheets("SommaireGroupeMV").Calculate
Sheets("SommaireGroupeAlpha").Calculate
Application.CutCopyMode = False
End Sub
I do not have my laptop with me at the moment, but you may try several things:
use Option Explicit to make sure you declare all variables before using them;
from what I remember, the native VBA type for numbers is not Integer but Long, and Integers are converted to Long, so to save computation time use Long instead of Integer;
your Family variables are defined as strings, but you store whole cells in them and not their values, i.e. =Cells() instead of =Cells().Value;
a rule of thumb is to use Cells(Rows.Count, 4).End(xlUp).Row
instead of Cells(3, 4).End(xlDown).Row;
conditional formatting may slow things down a lot;
use a For Each loop on a range if possible instead of While, or even copy the range to a Variant array and iterate over that (that is the fastest solution);
use early binding rather than late binding, i.e., declare objects with a proper type as soon as possible;
do not show the printing area (page breaks etc.);
try to do some profiling and look for the bottlenecks - see finding excel vba bottlenecks;
paste only values if you do not need formats;
clear the clipboard after each copy/paste;
set objects to Nothing after you have finished using them;
use Value2 instead of Value - that will ignore formatting and take only the numeric value instead of the formatted value;
use sheet objects and refer to them, for example
Dim sh_raw As Worksheet, sh_rap As Worksheet
Set sh_raw = Sheets("RawDataMV")
Set sh_rap = Sheets("Rapport")
and then use sh_raw instead of Sheets("RawDataMV") everywhere;
I had the same problem, but I finally figured it out. This is going to sound ridiculous, but it has everything to do with print page setup. Apparently Excel recalculates it every time you update a cell and this is what's causing the slowdown.
Try using
Sheets("Rapport").DisplayPageBreaks = False
at the beginning of your routine, before any calculations and
Sheets("Rapport").DisplayPageBreaks = True
at the end of it.
I had the same problem. I am far from an expert programmer. The above answers helped my program but did not solve the problem. I'm running Excel 2013 on a 5-year-old laptop. Open the workbook without running the macro, go to File > Options > Advanced, scroll down to Data and uncheck "Disable undo for large Pivot table refresh...." and "Disable undo for large data Model operation". You could also try leaving them checked but decreasing their values. One or both of these seem to be creating an ever-growing file that slows the macro and eventually grinds it to a stop. I assume closing Excel clears the files they create, which is why the macro runs fast again, at least for a while, after Excel is closed and reopened. Someone with more knowledge will have to explain what these changes do and what the consequences of unchecking them are. It appears these changes will be applied to any new spreadsheets you create. Maybe these changes would not be necessary if I had a newer, more powerful computer.
