Why do huge strings take a lot of time to be initialized?

I have a fairly technical question whose answer I can't work out, and I would like some advice on optimization.
In my sheet, I've built a lot of XML lines: 389,256 of them, spread across Range("A1") to Range("A389256").
My objective is to build a string that contains all these lines, so that I can write it to an XML file. I do this with the following piece of code:
Private Function buildFileText() As String
Dim ss As String
Dim j As Long
For j = 1 To Sheets("FileContent").Range("A1").End(xlDown).Row
ss = ss & Sheets("FileContent").Range("A" & j).Value & vbNewLine
Next j
buildFileText = ss
End Function
Basically, I just build the string starting from an empty string and adding the content of my spreadsheet line by line.
What worries me is the time this piece of code takes to execute: I put a timer right before and right after the For loop, and it took 1 hour and 44 minutes.
I don't find this behavior normal: despite the high number of lines, if I perform the same action on (let's say) 10,000 lines, it takes less than a second. Even assuming a full second, I would expect the whole action to take approximately 1 second * 40 = 40 seconds.
On the other hand, if it were purely a memory issue, I would have expected a stack overflow, which didn't occur. So it seems the time it takes to perform each concatenation grows exponentially.
My questions:
Could anyone explain to me why this is happening?
Does anyone have a suggestion to improve the performance of this code? Maybe I should split the concatenation into several strings (say 40 strings of 10k lines each) and concatenate them at a later stage?

It does not take a long time to initialize huge strings; the time goes into the repeated concatenation.
Strings are not resizable. When you concatenate two strings, a temporary string is created that can hold both values. The values are copied into the temporary string, and the target string is then replaced with it.
Why does it take less than a second for the first 10,000 lines while 389,256 lines take 1 hour and 44 minutes?
"So it seems the time it takes to perform each concatenation grows exponentially" - It is actually growing at a consistent rate: each concatenation copies the whole string built so far, so its cost grows linearly with the current length, and the total cost grows roughly with the square of the number of lines. If it were growing exponentially Excel would crash pretty quickly.
But the problem is that it is growing, and every concatenation needs more memory to create the new, larger string.
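To put rough numbers on this, here is a minimal sketch (in Python rather than VBA, purely for illustration) that counts how many characters get copied if every concatenation builds a fresh string. The function names and the fixed line length of 50 characters are assumptions made for the example; they are not taken from the question.

```python
def chars_copied_naive(line_lengths):
    """Characters copied when each concatenation allocates a new string."""
    total = 0
    current = 0
    for n in line_lengths:
        current += n      # the new string holds everything so far plus this line
        total += current  # ...and all of it has to be copied into the new string
    return total


def chars_copied_builder(line_lengths):
    """Characters copied when each line is written once into a preallocated buffer."""
    return sum(line_lengths)


LINE_LENGTH = 50  # assumed average line length, not a value from the question
small = [LINE_LENGTH] * 10_000
large = [LINE_LENGTH] * 389_256

naive_ratio = chars_copied_naive(large) / chars_copied_naive(small)
builder_ratio = chars_copied_builder(large) / chars_copied_builder(small)
print(f"naive concatenation: {naive_ratio:.0f}x more copying")
print(f"string builder:      {builder_ratio:.1f}x more copying")
```

Under these assumptions the full job copies on the order of 1,500 times as much data as the 10,000-line job, not 40 times as much, which is much closer to the observed slowdown than the 40-second estimate.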
What can we do to improve performance?
In your case I would use MSXML2 to create the XML output. It is well documented and will make your code easily extendable.
The second option is to implement a String Builder pattern. String builders reduce the number of concatenations by initializing a very large output string and writing each new string to the next position in the output string.
My answer to "Excel vba xml parsing performance" shows how to use a String Builder pattern to export an Excel table as XML. Parfiat's answer to the same question demonstrates how to use MSXML2 to create the XML file.
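The String Builder idea can be sketched roughly as follows. This is a Python illustration, not the VBA code from the linked answer: the class name, the bytearray buffer and the doubling growth policy are all assumptions chosen for the example.

```python
class StringBuilder:
    """Illustrative builder: writes into a preallocated buffer instead of
    creating a new string on every append."""

    def __init__(self, capacity=1024):
        self._buf = bytearray(capacity)  # preallocated output buffer
        self._len = 0                    # number of bytes actually used

    def append(self, text):
        data = text.encode("utf-8")
        needed = self._len + len(data)
        if needed > len(self._buf):
            # Grow rarely and by a large step (at least double), so that
            # existing content is copied only a few times overall.
            new_capacity = max(needed, 2 * len(self._buf))
            self._buf.extend(bytes(new_capacity - len(self._buf)))
        # Write the new text at the next free position.
        self._buf[self._len:needed] = data
        self._len = needed

    def to_string(self):
        return self._buf[:self._len].decode("utf-8")


sb = StringBuilder(capacity=4)
for line in ["<root>", "<item/>", "</root>"]:
    sb.append(line)
    sb.append("\r\n")
result = sb.to_string()
```

In real Python code one would normally just use "".join(...) or io.StringIO; the explicit buffer is only there to mirror what a VBA string builder has to do by hand.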

Other than the approaches mentioned in the comments, you may also want to try an "array" approach,
whose maximum array size limitation can be overcome by splitting the range into as many subarrays as necessary, as follows:
Private Function buildFileText() As String
    Dim ss As String
    Dim count As Long
    With Worksheets("FileContent")
        With .Range("A1", .Cells(.Rows.count, 1).End(xlUp))
            ' Join full chunks of 24684 rows; append a line break after each chunk
            ' so that the last line of one chunk does not run into the next chunk
            Do While .count - count > 24684
                ss = ss & Join(Application.Transpose(.Offset(count).Resize(24684).Value), vbNewLine) & vbNewLine
                count = count + 24684
            Loop
            ' Join the remaining rows
            buildFileText = ss & Join(Application.Transpose(.Offset(count).Resize(.count - count).Value), vbNewLine)
        End With
    End With
End Function

Related

Split a string variable into several numeric variables in SPSS

I have a string variable with comma separated numbers that I want to split into four numeric variables.
makeArr     var1a  var1b  var1c  var1d
6,8,13,10   6      8      13     10
10,11,2     10     11     2
7,1,14,3    7      1      14     3
With:
IF (CHAR.INDEX(makeArr,',') >= 1)
f12a=CHAR.SUBSTR(makeArr,1,CHAR.INDEX(makeArr,',')-1).
EXECUTE.
IF (CHAR.INDEX(makeArr,',') >= 1)
f12b=CHAR.SUBSTR(makeArr,CHAR.INDEX(makeArr,',')+1,CHAR.INDEX(makeArr,',')-1).
EXECUTE.
The first variable is always written without any problems.
This no longer works for the second variable, because it has a different length and the comma is written as well.
So I need to split the string at the commas and distribute the numbers between them into separate variables.

Since char.index only tells you the location of the first occurrence of the search string, you need to start the second search from a new location - AFTER the first occurrence - and this gets more and more complicated as you continue. My suggestion is to create a copy of your array variable and cut pieces off it as you proceed, so that you are only ever searching for the first occurrence of ",".
First I recreate your example data to demonstrate on.
data list free/makeArr (a20).
begin data
"6,8,13,10" "10,11,2" "7,1,14,3"
end data.
Now I copy your array into a new variable #tmp. Note that I add a "," at the end so the syntax stays the same for all parts of the array. The "#" at the beginning of the name makes it a scratch variable that does not show up in the dataset; you can remove it if you want.
It is possible to do the following calculation in separate steps, as you started to do, but it is nicer to loop through the steps (especially if this is an example for a longer array).
string f12a f12b f12c f12d #tmp (a20).
compute #tmp=concat(rtrim(makeArr),",").
do repeat nwvr=f12a f12b f12c f12d.
do IF #tmp<>"".
compute nwvr=CHAR.SUBSTR(#tmp,1,CHAR.INDEX(#tmp,',')-1).
compute #tmp=CHAR.SUBSTR(#tmp,CHAR.INDEX(#tmp,',')+1).
end if.
end repeat.
EXECUTE.
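For readers who find the logic easier to follow outside SPSS syntax, the same "cut pieces off a copy" loop can be sketched in Python. The function name and the use of None for missing parts are assumptions for the illustration; the conversion to int is added because the question asks for numeric variables.

```python
def split_by_cutting(value, n_parts):
    """Split a comma-separated string by repeatedly cutting off the first piece."""
    tmp = value.rstrip() + ","      # trailing comma: every piece ends the same way
    parts = []
    for _ in range(n_parts):
        if tmp == "":
            parts.append(None)      # fewer numbers than target variables
            continue
        pos = tmp.index(",")        # only ever look for the FIRST comma
        parts.append(int(tmp[:pos]))
        tmp = tmp[pos + 1:]         # cut the piece (and its comma) off the copy
    return parts


for row in ["6,8,13,10", "10,11,2", "7,1,14,3"]:
    print(row, "->", split_by_cutting(row, 4))
```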

Here I found a different solution for what I think is the same problem:
https://www.ibm.com/mysupport/s/question/0D50z00006PsP3tCAF/splitting-a-string-variable-divided-by-commas-into-new-single-variables?language=es
One line of code does the work:
spssinc trans result=var_1 to var_4 type=20/formula 're.split(", *", makeArr)'.

Lua: Parsing and Manipulating Input with Loops - Looking for Guidance

I am currently attempting to parse data that is sent from an outside source serially. An example is as such:
DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_
This data can come in many different lengths, but the first few pieces are all the same. Each "piece" originally comes in with CRLF after, so I've replaced them with string.gsub(input,"\r\n","|") so that is why my input looks the way it does.
The part I would like to parse is:
4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_
The "4" tells me that there will be four lines total to create this file. I'm using this as a means to set the amount of passes in the loop.
The 7x5 is the font height.
The 1 is the xpos.
The 25 is the ypos.
The variable data (174-24 in this case) is the text at these parameters.
As you can see, it should continue to loop this pattern throughout the input string received. Now the "4" can actually be any number > 0, with each count corresponding to one set of four variables to capture.
Here is what I have so far. Please excuse the loop variable, start variable, and print commands. I'm using Linux to run this function to try to troubleshoot.
function loop_input(input)
var = tonumber(string.match(val, "DATA|0|(%d*).*"))
loop = string.match(val, "DATA|0|")
start = string.match(val, loop.."(%d*)|.*")
for obj = 1, var do
for i = 1, 4 do
if i == 1 then
i = "font" -- want the first group to be set to font
elseif i == 2 then
i = "xpos" -- want the second group to be set to xpos
elseif i == 3 then
i = "ypos" -- want the third group to be set to ypos
else
i = "txt" -- want the fourth group to be set to text
end
obj = font..xpos..ypos..txt
--print (i)
end
objects = objects..obj -- concatenate newly created obj variables with each pass
end
end
val = "DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_"
print(loop_input(val))
Ideally, I want to create a loop that, depending on the var variable, captures the values between the pipe delimiters so that I can use them freely as I wish. When trying to troubleshoot with parentheses around my four variables (like I have above), I receive the full list of four variables four times in a row. Now I'm having difficulty actually cycling through the input string and grabbing the values out as the loop moves down the data string. I was thinking that using the pipes to delineate variables from one another would help. Am I wrong? If it doesn't matter and I can keep the [\r\n]+ instead of each "|", then I am definitely all for that.
I've searched around and found some threads that I thought would help but I'm not sure if tables or splitting the inputs would be advisable. Like these threads:
Setting a variable in a for loop (with temporary variable) Lua
How do I make a dynamic variable name in Lua?
Most efficient way to parse a file in Lua
I'm fairly new to programming and trying to teach myself, so please excuse my beginner thread. I have both the "Lua Reference Manual" and "Programming in Lua" in paperback, and I've tried to model my function(s) on them, but I'm having a problem making the connection.
I thank you all for any input or guidance you can offer!
Cheers.

Try this:
val = "DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_"
val = val .. "|"
data = val:match("DATA|0|%d+|(.*)$")
for fh,xpos,ypos,text in data:gmatch("(.-)|(.-)|(.-)|(.-)|") do
print(fh,xpos,ypos,text)
end
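The same parsing idea (drop the fixed header, then read the remaining fields in groups of four) can be sketched in Python for comparison. The function name and the dictionary keys are assumptions for the illustration; unlike the Lua snippet above, this version also uses the leading count to decide how many groups to read.

```python
def parse_message(message):
    """Parse 'DATA|0|<count>|font|xpos|ypos|text|...' into a list of objects."""
    fields = message.split("|")
    if fields[:2] != ["DATA", "0"]:
        raise ValueError("unexpected header")
    count = int(fields[2])                # number of 4-field groups that follow
    body = fields[3:3 + 4 * count]
    if len(body) < 4 * count:
        raise ValueError("message is shorter than its count announces")
    objects = []
    for i in range(0, len(body), 4):      # step through the body four fields at a time
        font, xpos, ypos, text = body[i:i + 4]
        objects.append({"font": font, "xpos": int(xpos), "ypos": int(ypos), "text": text})
    return objects


val = "DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_"
objects = parse_message(val)
for obj in objects:
    print(obj)
```

Storing each group in a table-like structure, rather than in dynamically named variables, keeps the values easy to loop over later.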

Extracting data from series of excel files (MATLAB)

I'll begin by saying I am really not good at programming, especially at extracting data, so please bear with me. I think my problem is simple; I just can't figure out how to do it.
I want to extract part of the data in a series of Excel files stored in the same folder. To be specific, let's say I have 10 Excel files with 1000 data points in each (A1:A1000). I want to extract the first 100 data points (A1:A100) from each Excel file and store them in a single variable of size 10x100 (each row represents one file).
I would really appreciate it if any of you could help me. This would make my data processing a lot faster.
EDIT: I have figured out the code, but my next problem is to create another loop that rereads the 10 files but this time extracts A101:A200, and so on until A901:A1000.
Here's the code I've written:
for k=1:1:10
file=['',int2str(k),'.xlsx'];
data=(xlsread(file,'A1:A100'))';
z(k,:)=data(1,:);
end
I'm not sure how to edit the part data=(xlsread(file,'A1:A100'))' to do the loop I want.

my next problem is to create another loop such that it will reread again the 10 files but this time extract A101:A200 until A901:A1000.
Why? Why not extract A1:A1000 in one block and then reshape or otherwise split up the data?
data(k,:)=(xlsread(file,'A1:A1000'))';
Then the A1:A100 data is in data(k,1:100), and so on. If you do this:
data = reshape(data, [10 100 10]);
Then data(:,:,1) should be your A1:A100 values as in your original loop, and so on until data(:,:,10).
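The "read everything once, then split" idea can be sketched in plain Python with nested lists. The function name and the generated stand-in data are assumptions for the illustration; no Excel file is read here.

```python
def split_into_blocks(rows, block_size):
    """Split each row (one per file) into consecutive blocks of block_size values.

    Returns blocks[b][k]: block b of file k, so blocks[0] holds the first
    block_size values of every file, blocks[1] the next block_size, and so on.
    """
    n_values = len(rows[0])
    return [
        [row[start:start + block_size] for row in rows]
        for start in range(0, n_values, block_size)
    ]


# Stand-in for the 10 files with 1000 values each (made-up numbers).
rows = [list(range(f * 1000 + 1, f * 1000 + 1001)) for f in range(10)]
blocks = split_into_blocks(rows, 100)
print(len(blocks), len(blocks[0]), len(blocks[0][0]))
```

Each file is read only once, which matters because opening a workbook is usually far slower than slicing data already in memory.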

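This should do it:
for sec = 1:1:10
    for k=1:1:10
        file=['',int2str(k),'.xlsx'];
        % Build the range string, e.g. 'A101:A200' for sec = 2
        section = ['A', num2str(1+100*(sec-1)), ':A', num2str(100*sec)];
        data=(xlsread(file, section))';
        z(k,:)=data(1,:);
    end
    % Store the 10x100 block for this section as one page of a 3-D array
    output(:,:,sec) = z;
end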

Here's a suggestion to loop through the different cells to read. Obviously, you can change how you arrange the collected data in z. I have done it as the first index representing the different cells to read (1 for 1:100, 2 for 101:200, etc...), the second index being the file number (as per your original code) and the third index the data (100 data points).
% pre-allocate data
z = zeros(10,10,100);
for kk=1:10
cells_to_read = ['A' num2str(kk*100-99) ':A' num2str(kk*100)];
for k=1:10
file=['',int2str(k),'.xlsx'];
data=(xlsread(file,cells_to_read))';
z(kk,k,:)=data(1,:);
end
end

Name variable based on string MATLAB

I have a variable that is created by a loop. The variable is large enough and in a complicated enough form that I want to save the variable each time it comes out of the loop with a different name.
PM25 is my variable, but I want to save it as PM25_year, where the year changes based on str = fname(13:end).
PM25 = permute(reshape(E',[c,r/nlay,nlay]),[2,1,3]); % Reshape and permute to achieve the right shape. Each face of the 3D should be one day
str = fname(13:end); % The year
% Third dimension is organized so that the data for each site is on a face
save('PM25_str', 'PM25_Daily_US.mat', '-append')
The str would be a year, like 2008. So the variable saved would be PM25_2008, then PM25_2009, etc. as it is created.

Defining new variables based on data isn't considered best practice, but you can store your data more efficiently using a cell array. You can store even a large, complicated variable like your PM25 variable within a single cell. Here's how you could go about doing it:
Place your PM25 data for each year into the cell array C using your loop:
for i = 1:numberOfYears
C{i} = PM25;
end
Resulting in something like this:
C = { PM25_2005, PM25_2006, PM25_2007 };
Now let's say you want to obtain your variable for the year 2006. This is easy (assuming you aren't skipping years). The first year of your data will correspond to position 1, the second year to position 2, etc. So to find the index of the year you want:
minYear = 2005;
yearDesired = 2006;
index = yearDesired - minYear + 1;
PM25_2006 = C{index};
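The same "one container instead of many named variables" idea, sketched in Python with a dictionary keyed by year so that no index arithmetic is needed. The function make_pm25 is a hypothetical stand-in for the reshape/permute step in the question and returns made-up data.

```python
def make_pm25(year):
    """Hypothetical stand-in for the per-year computation in the question."""
    return [[year, day] for day in range(1, 4)]


pm25_by_year = {}
for year in range(2008, 2015):
    pm25_by_year[year] = make_pm25(year)   # store under the year itself

pm25_2009 = pm25_by_year[2009]             # look up by year; skipped years are fine
print(sorted(pm25_by_year))
```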

You can do this using eval, but note that it's often not considered good practice. eval may be a security risk, as it allows user input to be executed as code. A better way to do this may be to use a cell array or an array of objects.
That said, I think this will do what you want:
for year = 2008:2014
eval(sprintf('PM25_%d = permute(reshape(E'',[c,r/nlay,nlay]),[2,1,3]);',year)); % the quote in E' must be doubled inside the string
save('PM25_Daily_US.mat',sprintf('PM25_%d',year),'-append');
end

I do not recommend setting variables like this, since there is no way to track these variables and it completely prevents the error checking that MATLAB does beforehand. This kind of code is resolved entirely at runtime.
Anyway, in case you have a really good reason for doing this, I recommend that you use the function assignin.
assignin('caller', ['myvar',num2str(1)], 63);

fminsearch is overwriting in loop while storing data

I have the code below in MATLAB. I want to write all 162 rows and 4 columns of calculated values to an Excel file.
When I use xlswrite in the code I get only one row and 4 columns, because the value of P gets overwritten in each iteration.
If I use another loop inside the for loop, the execution time increases drastically. Please help me to at least write the values of P into an array that I can later write to an Excel file (when I tried, the error 'In an assignment A(I) = B, the number of elements in B and I must be the same' appeared).
Please help.
function FitSMC_BC
clc
% Parameters: P(1)=theta_S; P(2)=theta_r; P(3)=psib; P(4)=lamda;
smcdata=xlsread('asimdata');
nn=length(smcdata)-1;
for i=1:nn
psi=smcdata(:,1);
thetaObs=smcdata(:,i+1);
%Make an initial guess:
Pini=[0.5 0.1 -1 1.5];
P=fminsearch(@ObFun,Pini,[],psi,thetaObs);
disp(['result',num2str(i),': P=',num2str(P)]);
theta=Gettheta(P,psi);
end
function OF=ObFun(P,psi,thetaObs)
theta=Gettheta(P,psi);
OF=sqrt(mean((theta - thetaObs).^2));
function theta=Gettheta(P,psi)
SoilPars.theta_S=P(1);
SoilPars.theta_r=P(2);
SoilPars.psib=P(3);
SoilPars.lamda=P(4);
[theta]=thetaFun(psi,SoilPars);
function [theta]=thetaFun(psi,SoilPars)
theta_S=SoilPars.theta_S;
theta_r=SoilPars.theta_r;
psib=SoilPars.psib;
lamda=SoilPars.lamda;
theta=theta_r+((theta_S-theta_r)*((psib./psi).^lamda));
theta(psi>psib)=theta_S;

You can modify the P line to
P(i,:) = fminsearch(@ObFun,Pini,[],psi,thetaObs);
P will store each calculation (a 4-element vector) in a new row.
You may also initialise P before the for loop with P = nan(nn, 4);
Then write P to an Excel file using xlswrite.
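The pattern (preallocate a results table, fill row i in iteration i, write once at the end) can be sketched in Python as follows. fit_all and dummy_fit are hypothetical names, and dummy_fit is only a stand-in for the fminsearch call: it does not perform any optimisation.

```python
import math


def fit_all(columns, fit_one, n_params=4):
    """Run fit_one on every data column and keep each result in its own row."""
    # Preallocate with NaN so that rows never filled in are easy to spot.
    results = [[math.nan] * n_params for _ in columns]
    for i, column in enumerate(columns):
        results[i] = list(fit_one(column))   # row i belongs to data column i
    return results


def dummy_fit(column):
    """Stand-in for the optimiser: returns four numbers per column."""
    return [min(column), max(column), sum(column), len(column)]


columns = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
P = fit_all(columns, dummy_fit)
for row in P:
    print(row)
```

Because nothing is overwritten, the whole table can be written to a file in a single call after the loop.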

I haven't studied your code in-depth, but as far as I can tell, you have two options:
Create a matrix P and use xlswrite on the entire matrix. This seems to me like the most reasonable approach.
Use xlswrite1 from the File Exchange in a loop. This will increase execution time a bit, but not nearly as much as using regular xlswrite, as it is specially designed to be used inside loops. It is so much faster because it only opens and closes the Excel file once, whereas the regular xlswrite opens and closes it every time you call the function.

You seem to know how to use indexing, so I'm not sure why you're not simply doing something like this:
P = zeros(4,nn); % one column of 4 fitted parameters per data column
for i=1:nn
    ...
    P(:,i) = fminsearch(@ObFun,Pini,[],psi,thetaObs);
    disp(['result',num2str(i),': P=',num2str(P(:,i)')]);
    theta = Gettheta(P(:,i),psi); % Why is this here? Are you writing it to file too?
end
xlswrite('My_FileName.xls',P'); % transposed: one row of 4 parameters per fit
Or you could call xlswrite on each iteration of the loop (probably slower) and append the new data using something like this:
for i=1:nn
    ...
    P = fminsearch(@ObFun,Pini,[],psi,thetaObs);
    disp(['result',num2str(i),': P=',num2str(P)]);
    theta = Gettheta(P,psi); % Why is this here? Are you writing it to file too?
    xlswrite('My_FileName.xls',P,1,['A' int2str(i)]); % row i, starting at column A
end
Of course your code isn't runnable, so you'll have to debug any other little errors. Also, since smcdata seems to be a matrix rather than a vector, you should be careful using length with it. You should probably use size.
