How to improve the speed of STRREAD()? - string

I have a cell array named 'datetime' whose entries have the format below:
2009.01.01 00:00:02.169
The 'datetime' array is 1819833x1, which is large.
I want to split it into two cell arrays, 'date' and 'time', so that
date='2009.01.01' and time='00:00:02.169'.
So I use a for loop as below:
for i=1:numel(datetime)
[date(i), time(i)] = strread(datetime{i},'%s%s','delimiter',' ');
end
As you can see, it uses a loop, and it is really slow on such a large data set.
I ran the code this afternoon, and after almost ONE HOUR the job was still not done.
Can anyone give me some advice?
Thanks!

So first I would preallocate date and time, no matter which solution you pick. Next I did some experiments with the following setup
s = '2009.01.01 00:00:02.169';
S = repmat({s}, 100000, 1);
The results are
Using strread
tic, for i=1:numel(S), [~, ~] = strread(S{i},'%s%s','delimiter',' '); end, toc
Elapsed time is 3.694143 seconds.
Using regexp
tic, for i=1:numel(S), [~] = regexp(S{i},'\s+', 'split'); end, toc
Elapsed time is 1.324754 seconds.
Using cellfun
tic, cellfun(@(x) regexp(x, '\s+', 'split'), S, 'UniformOutput', false); toc
Elapsed time is 2.072437 seconds.
As you can see, most of those approaches are very slow. Fortunately, many functions in MATLAB can use cells directly, watch this:
tic, Sresult = regexp(S, '\s+', 'split'); toc
Elapsed time is 0.253819 seconds.
You can now access the result by Sresult{i}{1} or Sresult{i}{2} or simply
date = cellfun(@(x) x{1}, Sresult, 'UniformOutput', false);
time = cellfun(@(x) x{2}, Sresult, 'UniformOutput', false);
Elapsed time is 0.835277 seconds.
Ultra Fast Method
The fastest method I can think of requires that the format is always the same, i.e. every string has the same length. In your case, I can imagine that holds. Then you can use something like this
tic, Sa = cell2mat(S); Sdate = Sa(:,1:10); Stime = Sa(:, 12:end); toc
Elapsed time is 0.060586 seconds.
Here you get another speed factor of about 20!

Here's one approach. Not sure how fast it will be:
datetime = {'2009.01.01 00:00:02.169'
'2009.01.02 00:01:05.169'}; %// example data. Cell array of strings
datetime_split = regexp(datetime, '\s+', 'split'); %// split according to spaces
%// Alternatively: datetime_split = cellfun(@strsplit, datetime, 'uniformoutput', 0);
datetime_split = [datetime_split{:}];
date = datetime_split(1:2:end);
time = datetime_split(2:2:end);
With the above data, this produces
>> date
date =
'2009.01.01' '2009.01.02'
>> time
time =
'00:00:02.169' '00:01:05.169'

So, thanks Robert... your advice was really helpful!
First, I added preallocation, and the time for the loop + strread() combination dropped to less than 40 s on my 1819833x1 'datetime' array.
That was the main improvement: avoiding repeated memory reallocation and data copying speeds the process up a lot, especially on a large data set.

Related

Python: Average of datetime deltas in a for loop (excel file input)

First time posting, been trying to figure this one out for a bit and feel like I'm either approaching it wrong or over complicating it.
Goal: Ingest excel sheet with 2 columns of dates, find the difference of time between the dates per row, then find the average of all the differences.
I'm using openpyxl to do this, as it's an xlsx. The date values in the cell come out in the '%Y-%m-%d %H:%M:%S' format.
Here's what I have at the moment:
import datetime
from openpyxl import load_workbook

Excel_File = r'C:\Some\File\Location'  # raw string so the backslashes are kept literally
wb = load_workbook(Excel_File)
Data_Tab = wb['SheetA']
Dates_A = Data_Tab['A']
Dates_B = Data_Tab['B']
for A, B in zip(Dates_A[1:], Dates_B[1:]):
    A_str = str(A.value)  # converting to string to convert to datetime since I couldn't find another way to do this
    B_str = str(B.value)
    A_conv = datetime.datetime.strptime(A_str, '%Y-%m-%d %H:%M:%S')
    B_conv = datetime.datetime.strptime(B_str, '%Y-%m-%d %H:%M:%S')
    A_B_Delta = B_conv - A_conv
Where I've gotten stuck is how to add all the A_B_Deltas together and get an average.
I would need to figure out how to get the total to input into average, which I guess I could just increment a variable to get this number. Such as:
Total_Count = 0
Total_Count += 1
But how do I add the deltas to get an average?
I have tried adding them to a variable with no success at this point. I also tried setting a empty datetime object variable but that doesn't appear to be possible as it'll just error out.
A_B_Delta = datetime.timedelta(0)  # Initialize as a zero timedelta (an int 0 cannot be added to a timedelta)
denominator = 0  # Initialize the row counter
for A, B in zip(Dates_A[1:], Dates_B[1:]):
    A_str = str(A.value)
    B_str = str(B.value)
    A_conv = datetime.datetime.strptime(A_str, '%Y-%m-%d %H:%M:%S')
    B_conv = datetime.datetime.strptime(B_str, '%Y-%m-%d %H:%M:%S')
    A_B_Delta += B_conv - A_conv  # Use += to accumulate the differences
    denominator += 1  # Use += to count the rows
average = A_B_Delta / denominator  # Compute the average after the for loop
Initializing A_B_Delta as a zero timedelta and incrementing it sums up all the time differences. There are many ways to get the denominator for the average.
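As a self-contained variant of the same idea, here is a minimal sketch that averages the differences with sum(), assuming the dates are already available as strings in the '%Y-%m-%d %H:%M:%S' format. The helper name average_delta and the sample rows are made up for illustration; they are not from the original spreadsheet.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def average_delta(pairs):
    """Return the mean of (end - start) over (start, end) string pairs.

    Hypothetical helper: `pairs` stands in for the two Excel columns.
    """
    deltas = [datetime.strptime(b, FMT) - datetime.strptime(a, FMT)
              for a, b in pairs]
    if not deltas:
        raise ValueError("no rows to average")
    # sum() starts from int 0 by default, which cannot be added to a
    # timedelta, so pass a zero timedelta as the start value.
    return sum(deltas, timedelta(0)) / len(deltas)


# Made-up sample data: differences of 1 hour and 3 hours.
rows = [
    ("2021-01-01 00:00:00", "2021-01-01 01:00:00"),
    ("2021-01-01 00:00:00", "2021-01-01 03:00:00"),
]
average = average_delta(rows)
print(average)  # 2:00:00
```

Dividing a timedelta by an integer is supported in Python 3 and returns another timedelta, so no conversion to seconds is needed.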

MATLAB concatenate string variables

I know strjoin can be used to concatenate strings, like 'a' and 'b' but what if one of the strings is a variable, like
a=strcat('file',string(i),'.mat')
and I want:
strjoin({'rm',a})
MATLAB throws an error when I attempt this, and it's driving me crazy!
Error using strjoin (line 53) First input must be a string array or cell array of character vectors
What version of MATLAB are you using? What is the error? The first input to strjoin needs to be a cell array. Try strjoin({'rm'},a).
Also, before 17a, do:
a = strcat('file', num2str(i),'.mat')
In >=17a do:
a = "file" + i + ".mat";
Here is a performance comparison:
function profFunc
tic;
for i = 1:1E5
a = strcat('file', num2str(i),'.mat');
end
toc;
tic;
for i = 1:1E5
a = "file" + i + ".mat";
end
toc;
end
>> profFunc
Elapsed time is 6.623145 seconds.
Elapsed time is 0.179527 seconds.

Difference between dates in Python and print as 00:00:00

So I have a string for a date that I convert to datetime and I want to print the time difference between this date (in utc) and the current time in utc. ie. if it's 1 day and 5 hours ahead, print "01:05:00". Or if it's 6 minutes ahead, print "00:00:06". If the date is in the past, then prepend "-", like "-00:00:06".
So far I have a pretty bad solution that prints something like "0:0:27" if it's 27 minutes away and "-2:-5:-46" if it's 2 days in the past. I would like a consistent xx:xx:xx format every time. I've looked at many questions and I'm not even sure whether I need relativedelta or just datetime.timedelta. Any suggestions?
for ticket in json.loads(data):
    ticket_rdate = ticket["time_string"]
    if ticket_rdate:
        ticket_rdate = datetime.datetime.strptime(ticket_rdate, "%Y-%m-%d %H:%M:%S")
        difference = relativedelta(datetime.datetime.utcnow(), ticket_rdate)
        ticket["time_until"] = str(difference.days * -1) + ":" + str(difference.hours * -1) + ":" + str(difference.minutes * -1) + ""
    sorted_tickets.append(ticket)
return sorted_tickets
This is in python 3.
If you just do
difference = (datetime.datetime.utcnow() - ticket_rdate)
str(difference)
you'll get the output in a consistent format. You can do some string formatting after that to get that in the format that you mentioned.
Considering you're limiting the precision to days, I'd say you don't really need to use relativedelta. If you wanted weeks or months, sure, but you don't.
So keeping that in mind in Py3 you can just do something like:
delta = time1 - time2
delta = delta // timedelta(minutes=1)
print(str(delta))
And let the provided functions do the heavy lifting.
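To make the two suggestions above concrete, here is a small sketch with made-up timestamps (the variable names are only for illustration). It shows what str() returns for a positive and a negative timedelta, and what floor division by timedelta(minutes=1) yields.

```python
from datetime import datetime, timedelta

# Made-up example timestamps, six minutes apart.
past = datetime(2020, 1, 1, 12, 0, 0)
now = datetime(2020, 1, 1, 12, 6, 0)

ahead = now - past    # positive difference of 6 minutes
behind = past - now   # negative difference of 6 minutes

ahead_text = str(ahead)    # '0:06:00'
behind_text = str(behind)  # '-1 day, 23:54:00'

# Floor division by a one-minute timedelta gives whole minutes as an int.
whole_minutes = ahead // timedelta(minutes=1)  # 6

print(ahead_text, behind_text, whole_minutes)
```

Note that a negative timedelta is normalized to negative days plus a positive remainder, so str() alone does not give a "-00:00:06" style result; the sign has to be handled separately.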
Use string's format method which will let you specify zero padding. Like:
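now = datetime.datetime.utcnow()
diff = ticket_rdate - now  # positive when the ticket date is in the future
negative = '-' if diff.total_seconds() < 0 else ''
total_minutes = int(abs(diff.total_seconds()) // 60)
days, rem = divmod(total_minutes, 24 * 60)  # split whole days from the remainder
hours, mins = divmod(rem, 60)  # split the remainder into hours and minutes
ticket['time_until'] = '{}{:02}:{:02}:{:02}'.format(negative, days, hours, mins)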
Also, the reason to use relativedelta over timedelta is if you want more flexibility. relativedelta will let you work with other units like months and years, and provides better support for negative deltas.
https://docs.python.org/2/library/datetime.html#timedelta-objects
http://labix.org/python-dateutil#head-ba5ffd4df8111d1b83fc194b97ebecf837add454
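For reference, here is one possible way to wrap the zero-padded days:hours:minutes formatting in a reusable function, checked against the three examples from the question. The function name format_delta and its arguments are hypothetical, and only the standard library is used; treat it as a sketch rather than the definitive approach.

```python
from datetime import datetime, timedelta


def format_delta(target, now):
    """Format target - now as [-]DD:HH:MM with zero padding.

    Hypothetical helper; seconds are truncated, not rounded.
    """
    diff = target - now
    sign = "-" if diff < timedelta(0) else ""
    # Work on the absolute value so divmod never sees negative numbers.
    total_minutes = abs(diff) // timedelta(minutes=1)
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return "{}{:02}:{:02}:{:02}".format(sign, days, hours, minutes)


# Made-up reference time for the examples.
now = datetime(2020, 1, 1, 0, 0, 0)
print(format_delta(now + timedelta(days=1, hours=5), now))  # 01:05:00
print(format_delta(now + timedelta(minutes=6), now))        # 00:00:06
print(format_delta(now - timedelta(minutes=6), now))        # -00:00:06
```

One design choice to be aware of: a target less than a minute in the past is printed as "-00:00:00", because the sign is taken from the full difference while the digits are truncated to minutes.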

contains with separate result for each of multiple patterns

Matlab's documentation for the function TF = contains(str,pattern) states:
If pattern is an array containing multiple patterns, then contains returns 1 if it finds any element of pattern in str.
I want a result for each pattern individually however.
That is:
I have string A='a very long string' and two patterns B='very' and C='long'. I want to check if B is contained in A and if C is contained in A. I could do it like this:
result = false(2,1);
result(1) = contains(A,B);
result(2) = contains(A,C);
but for many patterns this takes quite a while. What is the fast way to do this?
I don't know or have access to that function; it must be "new", so I don't know its particular idiosyncrasies.
How I would do that is:
result = ~cellfun('isempty', regexp(A, {B C}));
EDIT
Judging from the documentation, you can do the exact same thing with contains:
result = contains(A, {B C});
except that seems to return contains(A,B) || contains(A,C) rather than the array [contains(A,B) contains(A,C)]. So I don't know, I can't test it here. But if all else fails, you can use the regex solution above.
The new text processing functions in 16b are the fastest with string. If you convert A to a string you may see much better performance.
function profFunc
n = 1E6;
A = 'a very long string';
B = 'very';
C = 'long';
tic;
for i = 1:n
result(1) = contains(A,B);
result(2) = contains(A,C);
end
toc;
tic;
for i = 1:n
x = regexp(A, {B,C});
end
toc;
A = string(A);
tic;
for i = 1:n
result(1) = contains(A,B);
result(2) = contains(A,C);
end
toc;
end
>> profFunc
Elapsed time is 7.035145 seconds.
Elapsed time is 9.494433 seconds.
Elapsed time is 0.930393 seconds.
Questions: Where do B and C come from? Do you have a lot of hard coded variables? Can you loop? Looping would probably be the fastest. Otherwise something like
cellfun(@(x)contains(A,x),{B C})
is an option.

fast way to convert datetime to string

I want to know if there is faster way to convert a datetime to a string besides datestr.
A datetime call is inserted on every other line of my main function (and in all of its dependencies), because I need the time at which each of those lines is executed.
I think my only option is to convert datetime to string faster.
t = datetime('now');
DateString = datestr(t);
I profiled and it seems it is called 12570846 times. It takes 16030.021s in total.
My goal of doing this is get the current time when the line is executed and to match with other information that I get from other program with timestamps. I match two files (one from this MATLAB code and one from my other program) with time stamps.
One way to do this is to compare the current time with the time from the previous pass through the loop and recompute the date string only if it has changed. We can go a step further: the output of datestr (as you're calling it) only shows whole seconds, so sub-second differences can be ignored.
Example Using now (~128 Seconds)
Below I have an example loop that caches the date string representation. It compares the serial date (in seconds) to the date for which the last date string was generated. Only if it's different is the date string updated.
% Number of seconds in a day to convert the serial date number
secperday = 60 * 60 * 24;
% Store the current time as a reference
lasttime = now;
datestring = datestr(lasttime);
for k = 1:12570846
N = now;
seconds = round(N * secperday);
if ~isequal(lasttime, seconds)
% Update the cache
lasttime = seconds;
datestring = datestr(N);
end
% Use datestring however you want
disp(datestring)
end
Example using clock (~24 seconds)
Another option is to use clock which will give you the different date components in a vector. You can round the last element which represents seconds and milliseconds. By rounding it you suppress the milliseconds. This method seems to be a faster approach.
N = clock;
% Remove milliseconds
N(end) = round(N(end));
lasttime = N;
datestring = datestr(N);
for k = 1:12570846
N = clock;
% Ignore milliseconds
N(end) = round(N(end));
if ~isequal(N, lasttime)
lasttime = N;
datestring = datestr(N);
end
disp(datestring)
end
Function-Based Solution
If you want to get the current time as a date string at several points within your code, it is likely much better to create a function which will wrap this functionality. Here is an example of such a function.
function str = getDateString()
% Use persistent variables to cache the current value
persistent lasttime datestring
% Get the current time
thistime = clock;
% Truncate the milliseconds
thistime(end) = floor(thistime(end));
% See if the time has changed since the last time we called this
if ~isequal(thistime, lasttime)
lasttime = thistime;
% Update the cached string representation
datestring = datestr(thistime);
end
str = datestring;
end
You can then call this from anywhere within your code to get the date string and it will only be computed when necessary.
If your loop time is pretty short, you might convert the datetime only every 10th iteration or so, if that is close enough.
