MATLAB vs. GNU Octave Textscan disparity - string

I wish to read some data from a .dat file without saving the file first. In order to do so, my code looks as follows:
urlsearch= 'http://minorplanetcenter.net/db_search/show_object?utf8=&object_id=2005+PM';
url= 'http://minorplanetcenter.net/tmp/2005_PM.dat';
urlmidstep=urlread(urlsearch);
urldata=urlread(url);
received= textscan(urldata , '%5s %7s %1s %1s %1s %17s %12s %12s %9s %6s %6s %3s ' ,'delimiter', '', 'whitespace', '');
data_received = received{:}
urlmidstep's function is just to do a "search", in order to be able to create the temporary .dat file. This data is then stored in urldata, which is a long char array. When I then use textscan in MATLAB, I get 12 columns as desired, which are stored in a cell array data_received.
However, in Octave I get various warning messages: warning: strread: field width '%5s' (fmt spec # 1) extends beyond actual word limit (for various field widths). My question is, why is my result different in Octave and how could I fix this? Shouldn't Octave behave the same as MATLAB, as in theory any differences should be dealt with as bugs?
Surely specifying the width of the strings and leaving both the delimiter and whitespace input arguments empty should tell the function to deal only with the width of each string, allowing spaces to be valid characters.
Any help would be much appreciated.

I think textscan works differently in MATLAB and Octave. To illustrate, let's simplify the example. The code:
test_line = 'K05P00M C2003 01 28.38344309 37 57.87 +11 05 14.9 n~1HzV645';
test = textscan(test_line,'%5s','delimiter','');
test{:}
would yield the following in MATLAB:
>> test{:}
ans =
'K05P0'
'0M C'
'2003 '
'01 28'
'.3834'
'4309 '
'37 57'
'.87 +'
'11 05'
'14.9 '
'n~1Hz'
'V645'
whereas in Octave, you get:
>> test{:}
ans =
{
[1,1] = K05P0
[2,1] = C2003
[3,1] = 01
[4,1] = 28.38
[5,1] = 37
[6,1] = 57.87
[7,1] = +11
[8,1] = 05
[9,1] = 14.9
[10,1] = n~1Hz
}
So it looks like Octave jumps to the next word and discards any remaining character in the current word, whereas MATLAB treats the whole string as one continuous word.
Why that is and which one is correct, I do not know, but hopefully it will point you in the right direction for understanding what is going on. You can try adding the delimiter to see how it affects the results.
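To make the two interpretations concrete, here is a minimal Python sketch. It is not the actual MATLAB or Octave implementation, and the function names are hypothetical. `per_word` mimics the Octave output shown above by truncating each whitespace-separated word, while `fixed_chunks` cuts the raw string every `width` characters and keeps spaces, which is closer to what the question expects from a pure field-width format. It uses the test line exactly as printed above.

```python
def per_word(line, width):
    """Truncate every whitespace-separated word to at most `width` characters."""
    return [word[:width] for word in line.split()]


def fixed_chunks(line, width):
    """Cut the line every `width` characters; spaces count as ordinary characters."""
    return [line[i:i + width] for i in range(0, len(line), width)]


test_line = "K05P00M C2003 01 28.38344309 37 57.87 +11 05 14.9 n~1HzV645"
print(per_word(test_line, 5))      # word-by-word truncation
print(fixed_chunks(test_line, 5))  # position-based slicing
```

If you need identical results in both environments, slicing the character array by column position yourself avoids depending on how either textscan treats whitespace.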

Related

Write statement for a complex format / possibility to write more than once on the same excel line

I am currently working on a program that opens .txt documents one by one, extracts data, and finally fills an Excel document.
Because I did not know how to write several times on the same line of my Excel document after one write statement (each write jumps to the next line), I build a character string that is filled step by step:
Data (data_limite(x),x=1,8)/10, 9, 10, 7, 9, 8, 8, 9/
do file_descr = 1,nombre_fichier,1
taille_data1 = data_limite(file_descr)
nvari = taille_data1-7
write (new_data1,"(A30,A3,A11,A3,F5.1,A3,A7,F4.1,<nvari>(A3))") description,char(9),'T-isotherme',char(9),T_trait,char(9),'d_gamma',taille_Gam,(char(9),i=1,nvari)
ecriture_descr = ecriture_descr//new_data1
end do
The main issue is that I want to adapt the number of char(9) to the data_limite value, so I built a write statement with a variable number of char(9).
At the end of the do-loop, ecriture_descr has a very complex format with no periodic structure, because nvari changes at each iteration.
Now I want to add this to the first line of my Excel file:
Open(Unit= 20 ,File='resultats.RES',status='replace')
write(20,100) 'param',char(9),char(9),char(9),char(9),char(9),'*',char(9),'nuances',char(9),'*',char(9),ecriture_descr
100 format (a5,5(a3),a,a3,a7,a,a3,???)
but I do not know how to write this format. It would have been easier if, at each iteration of the do-loop I could fill the first line of my excel and continue to fill the first line at each new new_data1 value.
EDIT : maybe adding advance='no' in my write statement would help me, I am presently trying to add it
EDIT 2 : it did not work with advance='no', but adding a '$' at the end of the format of my write statement disables the line return. By moving it into my do-loop, I guess I can solve my problem :). I am presently trying to add it
First of all, your line
ecriture_descr = ecriture_descr//new_data1
is almost certainly not doing what you expect it to do. I assume that both ecriture_descr and new_data1 are of type CHARACTER(len=<some value>), that is, fixed-length strings. If you assign anything to such a string, the value is cut to length (if it is too long) or padded with spaces (if it is too short):
program strings
    implicit none
    character(len=8) :: h
    h = "Hello"
    print *, "|" // h // "|"   ! Prints "|Hello   |"
    h = "Hello World"
    print *, "|" // h // "|"   ! Prints "|Hello Wo|"
end program strings
And this combination will work against you: ecriture_descr will already be padded to the max with spaces, so when you append new_data1 it will be just outside the range of ecriture_descr, a bit like this:
h = "Hello"          ! h is actually "Hello   "
h = h // "World"     ! equiv to h = "Hello   " // "World"
                     !            = "Hello   World"
                     !               ^^^^^^^^
                     ! Only this is assigned to h => no change
If you want a string aggregator, you need to use the trim function which removes all trailing spaces:
h = trim(h) // " World"
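The padding and truncation rules can be emulated in a few lines of Python, which may make the effect easier to see. This is only an illustrative sketch: `assign` is a hypothetical helper that mimics assignment to a `CHARACTER(len=8)` variable, and `rstrip` stands in for Fortran's `trim`.

```python
def assign(value, length=8):
    """Emulate assigning `value` to a Fortran CHARACTER(len=length) variable."""
    # Too long: cut to length. Too short: pad with trailing spaces.
    return value[:length].ljust(length)


h = assign("Hello")
print(repr(h))                            # padded with spaces to 8 characters

no_change = assign(h + "World")           # "World" falls outside the 8 characters
print(repr(no_change))

appended = assign(h.rstrip() + " World")  # strip the padding first, then append
print(repr(appended))
```

With a length of 8 the appended text is still truncated, so the aggregator variable has to be declared long enough for the final string.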
Secondly, if you want to write to a file, but don't want to have a newline, you can add the option advance='no' into the write statement:
do i = 1, 100
write(*, '(I4)', advance='no') i
end do
This should make your job a lot easier than to create one very long string in memory and then write it all out in one go.

How to print in ASCII binary fields

I am familiar with C++ and I try to start to work with python.
From a serial line I manage to receive a string of binary characters (not ASCII) with python, let's say the rx 'buffer'.
I have to split this string into different fields, the method I am using is:
stx = (rx[0])
ctl = (rx[1])
node = (rx[2])
cTime = (rx[3:6])
nTime = (rx[7:10])
etx = (rx[11])
(currently I have not found a way to define a structure as in C++).
Now my problem is to print these fields as ASCII typically using:
print "%d-%d-%d-%ld-%ld-%d" % (stx,ctl,node,cTime,nTime,etx)
The error message is:
TypeError: %d format: a number is required, not str
I have already tried to convert the fields in different formats, but nothing works.
Can somebody help me?
You probably need to convert your variables to numbers with float(stx) (or int). If you only want to print them, just use %s instead of %d: %d expects a number, not a string.
For example:
rx = '1' * 12
stx = float(rx[0])
ctl = float(rx[1])
node = float(rx[2])
cTime = float(rx[3:6])
nTime = float(rx[7:10])
etx = float(rx[11])
print "%d-%d-%d-%ld-%ld-%d" % (stx,ctl,node,cTime,nTime,etx)
prints
1-1-1-111-111-1
or with strings you change the last line:
print "%s-%s-%s-%s-%s-%s" % (stx,ctl,node,cTime,nTime,etx)
In case you are dealing with raw binary data you might need the builtin struct module.
As far as I understand it (given you haven't posted an actual content and desired output I cannot verify it) you might want something like:
import struct
stx, ctl, node, cTime, nTime, etx = struct.unpack('fffddf', rx)
That would expect 3 floats, 2 doubles and then again 1 float. If you have integer or other datatypes you must edit the fffddf string. See the format types of struct
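As a concrete illustration of the struct approach, here is a minimal sketch written for Python 3 (the question uses Python 2 print statements). The frame layout is an assumption, since the question does not state it: three single-byte fields, two 4-byte unsigned big-endian integers, and one final byte, 12 bytes in total. The names `FRAME` and `parse_frame` and the sample bytes are hypothetical.

```python
import struct

# Assumed layout: stx, ctl, node (1 byte each), cTime, nTime (4-byte unsigned,
# big-endian), etx (1 byte). ">" also disables alignment padding.
FRAME = struct.Struct(">BBBIIB")


def parse_frame(rx):
    """Unpack one 12-byte frame and format its fields as decimal numbers."""
    stx, ctl, node, c_time, n_time, etx = FRAME.unpack(rx)
    return "%d-%d-%d-%d-%d-%d" % (stx, ctl, node, c_time, n_time, etx)


# Made-up sample frame, only for demonstration.
rx = bytes([0x02, 0x01, 0x07, 0x00, 0x00, 0x01, 0x00,
            0x00, 0x00, 0x00, 0x2A, 0x03])
print(parse_frame(rx))  # 2-1-7-256-42-3
```

If the device sends little-endian values or different field widths, only the format string needs to change.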
Brrr, awful.
I have finally found one way for such a simple function:
print int(stx.encode('hex'), 16), int(ctl.encode('hex'), 16), \
      int(node.encode('hex'), 16), int(cTime.encode('hex'), 16), \
      int(nTime.encode('hex'), 16), int(etx.encode('hex'), 16)
Not sure I will like python

How to read the number of a prespecified character that appears in a string variable in Matlab

I process a big file with Matlab. In each line of the input file, data are separated with dots ".". Due to poor format, the number of dots may change line by line of the input file.
For example:
line1 = 'DIDYMOTE.150.L20'
line2 = 'N.ELBETI.150.L10'
How can I read the number of dots that appear in each line ?
In MATLAB a line of text is a character array, so you can compare it element-wise against '.' and count the matches. Reading the file into a cell array (one line per cell) lets the lines have different lengths:
fid = fopen('file.txt');
data = textscan(fid, '%s', 'Delimiter', '\n');   % one cell per line
fclose(fid);
data = data{1};
no_dots = zeros(numel(data), 1);
for i = 1 : numel(data)
    no_dots(i) = sum(data{i} == '.');            % count the dots in line i
end
However, MATLAB is not very convenient for handling text data. A plain char matrix would raise an error as soon as two lines differ in length, which is why a cell array is used here. If you have a lot of text processing to do, you may be better off using another language: it may take you less time to learn how to process text in Python (for example) than to fit your problem into MATLAB.
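For comparison, the same count in Python needs one expression per line. This is only a sketch that uses the two example lines from the question rather than a real input file.

```python
# The two example lines from the question.
lines = ["DIDYMOTE.150.L20", "N.ELBETI.150.L10"]

# str.count returns the number of non-overlapping occurrences of the substring.
dot_counts = [line.count(".") for line in lines]
print(dot_counts)  # [2, 3]
```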

Replace multiple substrings using strrep in Matlab

I have a big string (around 25M characters) where I need to replace multiple substrings of a specific pattern in it.
Frame 1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
The substring I need to remove is the 'Frame #' and it occurs around 7670 times. I can give multiple search strings in strrep, using a cell array
strrep(text,{'Frame 1','Frame 2',..,'Frame 7670'},';')
However that returns a cell array, where in each cell, I have the original string with the corresponding substring of one of my input cell changed.
Is there a way to replace multiple substrings from a string, other than using regexprep? I noticed that it is considerably slower than strrep, that's why I am trying to avoid it.
With regexprep it would be:
regexprep(text,'Frame \d*',';')
and for a string of 25MB it takes around 47 seconds to replace all the instances.
EDIT 1: added the equivalent regexprep command
EDIT 2: added size of the string for reference, number of occurrences for the substring and timing of execution for the regexprep
Ok, in the end I found a way to go around the problem. Instead of using regexprep to change the substring, I remove the 'Frame ' substring (including whitespace, but not the number)
rawData = strrep(text,'Frame ','');
This results in something like this:
1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Then, I change all the commas (,) and newline characters (\n) into a semicolon (;), using again strrep, and I create a big vector with all the numbers
rawData = strrep(rawData,sprintf('\r\n'),';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,',',';');
rawData = textscan(rawData,'%f','Delimiter',';');
then I remove the unnecessary numbers (1,2,...,7670), since they are located at a specific point in the array (each frame contains a specific amount of numbers).
rawData{1}(firstInstance:spacing:lastInstance)=[];
And then I go on with my manipulations. It seems that the additional strrep calls and the removal of the values from the array are much faster than the equivalent regexprep. With a string of 25M chars, regexprep takes about 47 seconds for the whole operation, while this workaround takes only 5 seconds!
Hope this helps somehow.
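The same sequence of steps (strip the 'Frame ' prefix, flatten everything into one list of numbers, then drop the frame numbers that sit at a fixed spacing) can be sketched in Python. The function name and the tiny sample text are hypothetical, and the sketch assumes every frame holds the same number of values, as in the workaround above.

```python
def frames_to_values(text, values_per_frame):
    """Return all data values as floats, without the frame numbers."""
    # Plain substring replacements only, no regular expressions.
    flat = (text.replace("Frame ", "")
                .replace("\r\n", ";")
                .replace("\n", ";")
                .replace(",", ";"))
    numbers = [float(tok) for tok in flat.split(";") if tok.strip()]
    # Each frame contributes its frame number followed by its values,
    # so the frame numbers sit at indices 0, step, 2*step, ...
    step = values_per_frame + 1
    return [v for i, v in enumerate(numbers) if i % step != 0]


sample = "Frame 1\n0,1\n2,3\nFrame 2\n4,5\n6,7\n"
print(frames_to_values(sample, 4))
```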
I think that this can be done using only textscan, which is known to be very fast. By specifying a 'CommentStyle', the 'Frame #' lines are stripped out. This may only work because these 'Frame #' lines are on their own lines. This code returns the raw data as one big vector:
s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{:}
You may want to know how many elements are in each frame or even reshape the data into a matrix. You can use textscan again (or before the above) to get just the data for the first frame:
f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
f1 = f1{:}
In fact, if you just want the elements from the first line, you can use this:
l1 = textscan(text,'%f,','CommentStyle','Frame 1')
l1 = l1{:}
However, the other nice thing about textscan is that you can use it to read in the file directly (it looks like you may be using some other means currently) using just fopen to get an FID. Thus the string data text doesn't have to be in memory.
Using regular expressions:
result = regexprep(text,'Frame [0-9]+','');
It's possible to avoid regular expressions as follows. I use strrep with suitable replacement strings that act as masks. The obtained strings are equal-length and are assured to be aligned, and can thus be combined into the final result using the masks. I've also included the ; you want. I don't know if it will be faster than regexprep or not, but it's definitely more fun :-)
% Data
text = 'Hello Frame 1 test string Frame 22 end of Frame 2 this'; %//example text
rep_orig = {'Frame 1','Frame 2','Frame 22'}; %//strings to be replaced.
%//May be of different lengths
% Computations
rep_dest = cellfun(@(s) char(zeros(1,length(s))), rep_orig, 'uni', false);
%//series of char(0) of same length as strings to be replaced (to be used as mask)
aux = cell2mat(strrep(text,rep_orig.',rep_dest.'));
ind_keep = all(double(aux)); %//keep characters according to mask
ind_semicolon = diff(ind_keep)==1; %//where to insert ';'
ind_keep = ind_keep | [ind_semicolon 0]; %// semicolons will also be kept
result = aux(1,:); %//for now
result(ind_semicolon) = ';'; %//include `;`
result = result(ind_keep); %//remove unwanted characters
With these example data:
>> text
text =
Hello Frame 1 test string Frame 22 end of Frame 2 this
>> result
result =
Hello ; test string ; end of ; this

Error reading a fixed-width string with textscan in MATLAB

I'm reading fixed-width (9 characters) data from a text file using textscan. Textscan fails at a certain line containing the string:
'   9574865.0E+10  '
I would like to read two numbers from this:
957486 5.0E+10
The problem can be replicated like this:
dat = textscan('   9574865.0E+10  ','%9f %9f','Delimiter','','CollectOutput',true,'ReturnOnError',false);
The following error is returned:
Error using textscan
Mismatch between file and format string.
Trouble reading floating point number from file (row 1u, field 2u) ==> E+10
Surprisingly, if we add a minus, we don't get an error, but a wrong result:
dat = textscan(' -9574865.0E+10 ','%9f %9f','Delimiter','','CollectOutput',true,'ReturnOnError',false);
Now dat{1} is:
-9574865 0
Obviously, I need both cases to work. My current workaround is to add commas between the fields and use commas as a delimiter in textscan, but that's slow and not a nice solution. Is there any way I can read this string correctly using textscan or another built-in (for performance reasons) MATLAB function?
I suspect textscan first trims leading white space, and then parses the format string. I think this because, if you change your format string from
'%9f%9f'
to
'%6f%9f'
your one-liner suddenly works. Also, if you try
'%9s%9s'
you'll see that the first string has its leading whitespace removed (and therefore has 3 characters "too many"), but for some reason, the last string keeps its trailing whitespace.
Obviously, this means you'd have to know exactly how many digits there are in both numbers. I'm guessing this is not desirable.
A workaround could be something like the following:
% Split string on the "dot"
dat = textscan(<your data>,'%9s%9s',...
'Delimiter' , '.',...
'CollectOutput' , true,...
'ReturnOnError' , false);
% Correct the strings; move the last digit of the first string to the
% front of the second string, and put the dot back
dat = cellfun(@(x,y) str2double({y(1:end-1), [y(end) '.' x]}), dat{1}(:,2), dat{1}(:,1), 'UniformOutput', false);
% Cast to regular array
dat = cat(1, dat{:})
I had a similar problem and solved it by calling textscan twice, which proved to be way faster than cellfun or str2double and will work with any input that can be interpreted by Matlab's '%f'
In your case I would first call textscan with only string arguments and Whitespace = '' to correctly define the width of the fields.
data = '   9574865.0E+10  ';
tmp = textscan(data, '%9s %9s', 'Whitespace', '');
Now you need to interweave and append a delimiter that won't interfere with your data, for example ;
tmp = [char(join([tmp{:}],';',2)) ';'];
And now you can apply the right format to your data by calling textscan again with a delimiter like:
result = textscan(tmp, '%f %f', 'Delimiter', ';', 'CollectOutput', true);
format shortE
result{:}
ans =
9.5749e+05 5.0000e+10
Comparing the speed of this approach with str2double:
n = 50000;
data = repmat('   9574865.0E+10  ', n, 1);
% Approach 1 with str2double
tic
tmp = textscan(data', '%9s %9s', 'Whitespace', '');
result1 = str2double([tmp{:}]);
toc
Elapsed time is 2.435376 seconds.
% Approach 2 with double textscan
tic
tmp = textscan(data', '%9s %9s', 'Whitespace', '');
tmp = [char(join([tmp{:}],';',2)) char(59)*ones(n,1)]; % char(59) is just ';'
result2 = cell2mat(textscan(tmp', '%f %f', 'Delimiter', ';', 'CollectOutput', true));
toc
Elapsed time is 0.098833 seconds.
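The underlying idea, splitting by position first and converting to numbers second, can be sketched in a few lines of Python. This is not a replacement for the textscan solution, just an illustration. The function name is hypothetical, and the sample line assumes an 18-character record with three leading and two trailing spaces.

```python
def parse_fixed_width(line, width=9):
    """Split `line` into `width`-character fields, then convert each to float."""
    fields = [line[i:i + width] for i in range(0, len(line), width)]
    # float() ignores surrounding whitespace; fields that are blank are skipped.
    return [float(field) for field in fields if field.strip()]


print(parse_fixed_width("   9574865.0E+10  "))
```

Because the split happens before any numeric parsing, the digit "5" that starts the second field can never be absorbed into the first number.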
