I have a huge csv file (as in: more than a few gigabytes) and would like to read it in Matlab and process each line. Reading the file in its entirety is impossible, so I use this code to read it line by line:
fileName = 'input.txt';
inputfile = fopen(fileName);
while 1
tline = fgetl(inputfile);
if ~ischar(tline)
break
end
end
fclose(inputfile);
This yields a cell array of size(1,1) with the line as string. What I would like is to convert this cell to a normal array with just the numbers.
For example:
input.csv:
0.0,0.0,3.201,0.192
2.0,3.56,0.0,1.192
0.223,0.13,3.201,4.018
End result in Matlab for the first line:
A = [0.0,0.0,3.201,0.192]
I tried converting tline with double(tline) but this yields completely different results. Also tried using a regex but got stuck there. I got to the point where I split up all values into a different cell in one array. But converting to double with str2double yields only NaNs...
Any tips? Preferably without any loops since it already takes a while to read the entire file.
You are looking for str2num
>> A = '0.0,0.0,3.201,0.192';
>> str2num(A)
ans =
0 0 3.2010 0.1920
>> A = '0.0 0.0 3.201 0.192';
>> str2num(A)
ans =
0 0 3.2010 0.1920
>> A = '0.0 0.0 , 3.201 , 0.192';
>> str2num(A)
ans =
0 0 3.2010 0.1920
i.e., it's quite agnostic to input format.
However, I will not advise this for your use case. For your problem, I'd do
C = dlmread('input.txt',',', [0 0 0 3]) % for first line (range is zero-based: [r1 c1 r2 c2])
C = dlmread('input.txt',',') % for entire file
or
[a,b,c,d] = textread('input.txt','%f,%f,%f,%f',1) % for first line
[a,b,c,d] = textread('input.txt','%f,%f,%f,%f') % for entire file
if you want all columns in separate variables:
a = 0
b = 0
c = 3.201
d = 0.192
or
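fid = fopen('input.txt','r');
% use ONE of the following calls; N is the number of lines (or the line number) you want
C = textscan(fid, '%f %f %f %f', 1, 'Delimiter', ','); % for first line only
C = textscan(fid, '%f %f %f %f', N, 'Delimiter', ','); % for first N lines
C = textscan(fid, '%f %f %f %f', 1, 'Delimiter', ',', 'HeaderLines', N-1); % for Nth line only
fclose(fid);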
all of which are much more easily expandable (things like this, whatever they are, tend to grow bigger over time :). Especially dlmread is much less prone to errors than writing your own clauses is, for empty lines, missing values and other great nuisances very common in most data sets.
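Since the file in the question is too large to load in one go, the textscan variant can also be called repeatedly on the same file handle, each call picking up where the previous one stopped. The following is only a sketch under assumptions: it writes the three sample rows to a temporary file so it runs on its own, and blockSize and the running column sums are placeholders for whatever per-block processing is actually needed.

```matlab
% Write the three sample rows to a temporary file so the sketch is self-contained.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '0.0,0.0,3.201,0.192\n2.0,3.56,0.0,1.192\n0.223,0.13,3.201,4.018\n');
fclose(fid);

blockSize = 2;                 % rows per block (hypothetical; use e.g. 1e5 for real data)
colSums = zeros(1, 4);         % placeholder per-block work: running column sums
fid = fopen(fname, 'r');
while ~feof(fid)
    % Each call resumes at the current file position and reads up to blockSize rows.
    C = textscan(fid, '%f %f %f %f', blockSize, 'Delimiter', ',');
    block = [C{:}];            % rows-by-4 numeric matrix for this block
    if isempty(block)
        break                  % nothing left to read
    end
    colSums = colSums + sum(block, 1);
end
fclose(fid);
delete(fname);
```

Only one block is held in memory at a time, so the block size can be tuned to the available RAM.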
Try
data = dlmread('input.txt',',')
It will do exactly what you want to do.
If you still want to convert string to a vector:
line_data = sscanf(line,'%g,',inf)
This code will read the entire comma-separated string and convert each number.
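For a single line as returned by fgetl, a small sketch of the conversions discussed in this thread is shown below. The variable names codes, A1 and A2 are only illustrative. double(tline) returns the character codes of the string, which is presumably why it gave "completely different results", while sscanf and str2double parse the text as numbers.

```matlab
tline = '0.0,0.0,3.201,0.192';           % one line, as fgetl would return it

codes = double(tline);                   % character codes ('0' is 48), not the values
A1 = sscanf(tline, '%g,').';             % parse the numbers; transpose to a row vector
A2 = str2double(strsplit(tline, ','));   % split on commas, then convert each piece
```

Note that str2double applied to the whole unsplit line returns NaN, because the string as a whole is not a single number.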
Very new to matlab and still learning the basics. I'm trying to write a script which calculates the distance between two peaks in a waveform. That part I have managed to do, and I have used xlswrite to put the values I have obtained onto an excel file.
For each file, I have between about 50-250 columns, with just two rows: the second row has the numerical value, and the first has the column headings, copied from original excel files I extracted the data from.
Some of the columns have similar, but not identical, headings, e.g. 'green227RightEyereading3' and 'green227RightEyereading4' etc. Is there a way I can group columns with similar headings, e.g. which have the same number/colour in the heading (i.e. green227) and either 'right eye' or 'left eye', and calculate an average of their numerical values? Link to file here: https://www.dropbox.com/s/ezpyjr3raol31ts/SampleBatchForTesting.xls?dl=0
[Excel_file,PathName] = uigetfile('*.xls', 'Pick a File','C:\Users\User\Documents\Optometry\Year 3\Dissertation\A-scan3');
[~,name,ext] = fileparts(Excel_file);
sheet = 2;
FullXLSfile = [PathName, Excel_file];
[number_data,txt_data,raw_data] = xlsread(FullXLSfile,sheet);
HowManyWide = size(txt_data);
NumberOfTitles = HowManyWide(1,2);
xlRangeA = txt_data;
Chickens = {'Test'};
for f = 1:xlRangeA; %%defined as top line of cells on sheet;
Text = xlRangeA{f};
HyphenLocations = find(Text == '-');
R = HyphenLocations(1,1) -1;
Chick = Text(1:R);
Chick = cellstr(Chick);
B = length(Chick);
TF = strncmp(Chickens,Chick,B);
if any(TF == 1); %do nothing
else
Chickens = {Chickens;Chick};
end
end
Here also is a link to the file that is created when I run my entire script. The values below the headings are the calculated thicknesses of the tissue I'm analysing. https://www.dropbox.com/s/4p6iu9kk75ecyzl/Choroid_Thickness.xls?dl=0
Thanks very much
If the differing characters are located at the very end (or the very beginning) of the heading, you can use the built-in strncmp function and compare only part of the string. See the strncmp documentation for more. But please provide some code and a part of your excel file. It would help.
Also, if I am not mistaken, you are saving all the data into excel and then re-call it again in order to sort it. Maybe you should consider saving only the final result in excel, it will save you some time, especially if you want to run your script many times.
EDIT:
Here is the code I came up with. It is not the best possible solution for sure, but it works with the file you uploaded. I have omitted the unnecessary lines and variables. The code works only if the numbers of each reading have the same amount of digits. They can be 4 digits as long as every entry has 4 digits. Since in each file you have waves of the same color, the only thing that you care about is whether the reading was recorded with the left or the right eye (correct?). Based on that and the code you wrote, the comparison concerns the part of the string that contains the words "Right" or "Left", i.e. the characters between the hyphens.
[Excel_file,PathName] = uigetfile('*.xls', 'Pick a File',...
'C:\Users\User\Documents\Optometry\Year 3\Dissertation\A-scan3');
sheet = 1;
FullXLSfile = [PathName,Excel_file];
[number_data,txt_data,raw_data] = xlsread(FullXLSfile,sheet);
%% data manipulation
NumberOfTitles = length(txt_data);
TextToCompare = txt_data{1};
r1 = 1; % counter for Readings1 vector
r2 = 1; % counter for Readings2 vector
for ff = 1:NumberOfTitles % in your code xlRangeA is a cell vector not a number!
Text = txt_data{ff};
HyphenLocations = find(Text == '-');
Text = Text(HyphenLocations(1,1):HyphenLocations(1,2)); % take only the part that contains the "eye" information
TextToCompare = TextToCompare(HyphenLocations(1,1):HyphenLocations(1,2)); % same here
if (Text == TextToCompare)
Readings1(r1) = number_data(ff); % store the numerical value in a vector
r1 = r1 + 1; % increase the counter of this vector
else
Readings2(r2) = number_data(ff); % same here
r2 = r2 + 1;
end
TextToCompare = txt_data{1}; % TextToCompare re-initialized for the next comparison
end
mean_readings1 = mean(Readings1); % Find the mean of the grouped values
mean_readings2 = mean(Readings2);
I am positive that this can be done in a more efficient and elegant way. I don't know exactly what kind of calculations you want to do, so I only included the mean values as an example. Inside the if statement you can also store the txt_data if you need it. Below I have also included a second way which I find more elegant. Just substitute the %% data manipulation part with the part below if you want to test it:
%% more elegant way
Text_Vector = char(txt_data);
TextToCompare2 = txt_data{1};
HyphenLocations2 = find(TextToCompare2 == '-');
TextToCompare2 = TextToCompare2(HyphenLocations2(1,1):HyphenLocations2(1,2));
Text_Vector = Text_Vector(:,HyphenLocations2(1,1):HyphenLocations2(1,2));
Text_Vector = cellstr(Text_Vector);
dummy = strcmpi(Text_Vector,TextToCompare2);
Readings1 = number_data(dummy);
Readings2 = number_data(~dummy);
I hope this helps.
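If more than two groups are needed (for example several colours, each with a left and a right eye), one possible sketch is to derive a key from each heading and average per key with unique and accumarray. This is not the code from the answer above. The headings and values here are made up for illustration, and the sketch assumes each heading has the form group-eye-reading with hyphens, as the posted code implies.

```matlab
% Hypothetical headings and values standing in for txt_data and number_data.
headings = {'green227-RightEye-reading3', 'green227-RightEye-reading4', ...
            'green227-LeftEye-reading1'};
values = [10 20 40];

% Key = everything before the second hyphen, e.g. 'green227-RightEye'.
keys = regexp(headings, '^[^-]*-[^-]*', 'match', 'once');

% idx(k) is the group number of heading k; groups holds the distinct keys (sorted).
[groups, ~, idx] = unique(keys);

% Average the values that fall into each group.
means = accumarray(idx(:), values(:), [], @mean);
```

Here groups{1} is 'green227-LeftEye' with mean 40, and groups{2} is 'green227-RightEye' with mean 15.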
Is it possible to format the output of sprintf like the following, or should I use another function?
Say I have a variable dt = 9.765625e-05 and I want to use sprintf to build a string for use when saving, say, a figure:
fig = figure(nfig);
plot(x,y);
figStr = sprintf('NS2d_dt%e',dt);
saveas(fig,figStr,'pdf')
The punctuation mark dot presents me with problems, some systems mistake the format of the file.
using
figStr = sprintf('NS2d_dt%.2e',dt);
then
figStr = NS2d_dt9.77e-05
using
figStr = sprintf('NS2d_dt%.e',dt);
then
figStr = NS2d_dt1e-04
which is not precise enough. I would like something like this
using
figStr = sprintf('NS2d_dt%{??}e',dt);
then
figStr = NS2d_dt9765e-08
Essentially, the only way to get your desired output is to manipulate either the value or the string. Here are two solutions: the first uses string manipulation, the second manipulates the value. Hopefully these two approaches will also help with reasoning out solutions to other problems, particularly the number manipulation.
String Manipulation
Solution
fmt = @(x) sprintf('%d%.0fe%03d', (sscanf(sprintf('%.4e', x), '%d.%de%d').' .* [1 0.1 1]) - [0 0.5 3]);
Explanation
First I use sprintf to print the number in a defined format
>> sprintf('%.4e', dt)
ans =
9.7656e-05
then sscanf to read it back in making sure to remove the . and e
>> sscanf(sprintf('%.4e', dt), '%d.%de%d').'
ans =
9 7656 -5
before printing it back we perform some manipulation of the data to get the correct values for printing
>> (sscanf(sprintf('%.4e', dt), '%d.%de%d').' .* [1 0.1 1]) - [0 0.5 3]
ans =
9 765.1 -8
and now we print
>> sprintf('%d%.0fe%03d', (sscanf(sprintf('%.4e', dt), '%d.%de%d').' .* [1 0.1 1]) - [0 0.5 3])
ans =
9765e-08
Number Manipulation
Solution
orderof = @(x) floor(log10(abs(x)));
fmt = @(x) sprintf('%.0fe%03d', x*(10^(abs(orderof(x))+3))-0.5, orderof(x)-3);
Explanation
First I create an anonymous orderof function which tells me the order (the number after e) of the input value. So
>> dt = 9.765625e-05;
>> orderof(dt)
ans =
-5
Next we manipulate the number to convert it to a 4 digit integer, this is the effect of adding 3 in
>> floor(dt*(10^(abs(orderof(dt))+3)))
ans =
9765
finally before printing the value we need to figure out the new exponent with
>> orderof(dt)-3
ans =
-8
and printing will give us
>> sprintf('%.0fe%03d', floor(dt*(10^(abs(orderof(dt))+3))), orderof(dt)-3)
ans =
9765e-08
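As a usage sketch, the number-manipulation idea can be wrapped in a single helper. The variant below is my own adjustment rather than the exact expression above: it scales by 10^(3 - orderof(x)) instead of 10^(abs(orderof(x)) + 3), which should also behave sensibly for values of 1 or larger, and it uses floor with %d instead of subtracting 0.5.

```matlab
% Order of magnitude of x, i.e. the exponent in scientific notation.
orderof = @(x) floor(log10(abs(x)));

% Four significant digits as an integer mantissa, followed by the shifted exponent.
fmt = @(x) sprintf('%de%03d', floor(x * 10^(3 - orderof(x))), orderof(x) - 3);

dt = 9.765625e-05;
figStr = ['NS2d_dt' fmt(dt)];   % dot-free name fragment for dt
```

For the dt in the question, figStr is 'NS2d_dt9765e-08'. The digits are truncated rather than rounded, matching the output requested in the question.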
Reading your question,
The punctuation mark dot presents me with problems, some systems mistake the format of the file.
it seems to me that your actual problem is that when you build the file name using, for example
figStr = sprintf('NS2d_dt%.2e',dt);
you get
figStr = NS2d_dt9.77e-05
and then, when you use that string as the filename, the . is interpreted as the start of the extension and .pdf is not appended, so in Explorer you cannot open the file by double-clicking on it.
Considering that changing the representation of the number dt from 9.765e-05 to 9765e-08 seems quite weird, you can try the following approach:
use the print function to save your figure in .pdf
add .pdf in the format specifier
This should give you both the right file extension and the right format for the dt value.
peaks
figStr = sprintf('NS2d_dt_%.2e.pdf',dt);
print(gcf,'-dpdf', figStr )
Hope this helps.
figStr = sprintf('NS2d_dt%1.4e',dt)
figStr =
NS2d_dt9.7656e-05
In the specifier %1.4e, the number before the dot (1 here) is the minimum field width and the number after it (4 here) is the number of digits printed after the decimal point.
Regarding your request:
A = num2str(dt); %// convert to string
B = A([1 3 4 5]); %// extract first four digits
C = A(end-2:end); %// extract power
fspec = 'NS2d_dt%de%d'; %// format spec
sprintf(fspec ,str2num(B),str2num(C)-3)
NS2d_dt9765e-8
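If the only requirement is that the file name contains no dot before the extension, another option (not mentioned in the answers above, so treat it as a suggestion) is to keep the usual %e formatting and swap the dot for a harmless character with strrep. The replacement letter 'p' is an arbitrary choice.

```matlab
dt = 9.765625e-05;

% Format as usual, then replace the decimal point so the name has no dot in it.
figStr = strrep(sprintf('NS2d_dt%.3e', dt), '.', 'p');
```

This keeps the value readable in the name while leaving the extension handling to saveas or print.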
How would I be able to do the equivalent of this with strings:
a = [1 2 3; 4 5 6];
c = [];
for i=1:5
b = a(1,:)+i;
c = [c;b];
end
c =
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
Basically looking to combine several strings into a Matrix.
You're growing a variable in a loop, which is a kind of sin in Matlab :) So I'm going to show you some better ways of doing array concatenation.
There's cell strings:
>> C = {
'In a cell string, it'
'doesn''t matter'
'if the strings'
'are not of equal length'};
>> C{2}
ans =
doesn't matter
Which you could use in a loop like so:
% NOTE: always pre-allocate everything before a loop
C = cell(5,1);
for ii = 1:5
% assign some random characters
C{ii} = char( '0'+round(rand(1,1+round(rand*10))*('z'-'0')) ); % row vector, i.e. a string
end
There's ordinary arrays, which have as a drawback that you have to know the size of all your strings beforehand:
a = [...
'testy' % works
'droop'
];
b = [...
'testing' % ERROR: CAT arguments dimensions
'if this works too' % are not consistent.
];
for these cases, use char:
>> b = char(...
'testing',...
'if this works too'...
);
b =
'testing '
'if this works too'
Note how char pads the first string with spaces to fit the length of the second string. Now again: don't use this in a loop, unless you've pre-allocated the array, or if there really is no other way to go.
Type help strfun on the Matlab command prompt to get an overview of all string-related functions available in Matlab.
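Putting the two ideas together, a rough string equivalent of the numeric loop in the question could look like the sketch below: collect the strings in a pre-allocated cell array, then convert to a padded character matrix once, after the loop. The strings generated here (runs of 'x') are placeholders.

```matlab
n = 5;
C = cell(n, 1);                 % pre-allocate one cell per string
for ii = 1:n
    C{ii} = repmat('x', 1, ii); % placeholder strings of length 1..n
end

M = char(C);                    % n-by-n char matrix, shorter rows padded with spaces
second = strtrim(M(2, :));      % strip the padding to recover the original string
```

Calling char once at the end avoids growing the matrix inside the loop.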
You mean storing a string on each matrix position? You can't do that, since matrices are defined over basic types. You can have a CHAR on each position:
>> a = 'bla';
>> b = [a; a]
b <2x3 char> =
bla
bla
>> b(2,3) = 'e'
b =
bla
ble
If you want to store whole strings, use a cell array (MATLAB reference, Blog of Loren Shure), which is kind of similar but is indexed with "{}" instead of "()":
>> c = {a; a}
c =
'bla'
'bla'
>> c{2}
ans =
bla
I am trying to read the file with the following format which repeats itself (but I have cut out the data even for the first repetition because of it being too long):
1.00 'day' 2011-01-02
'Total Velocity Magnitude RC - Matrix' 'm/day'
0.190189 0.279141 0.452853 0.61355 0.757833 0.884577
0.994502 1.08952 1.17203 1.24442 1.30872 1.36653
1.41897 1.46675 1.51035 1.55003 1.58595 1.61824
Download the actual file with the complete data here
This is my code which I am using to read the data from the above file:
fid = fopen(file_name); % open the file
dotTXT_fileContents = textscan(fid,'%s','Delimiter','\n'); % read it as string ('%s') into one big array, row by row
dotTXT_fileContents = dotTXT_fileContents{1};
fclose(fid); %# don't forget to close the file again
%# find rows containing 'Total Velocity Magnitude RC - Matrix' 'm/day'
data_starts = strmatch('''Total Velocity Magnitude RC - Matrix'' ''m/day''',...
dotTXT_fileContents); % data_starts contains the line numbers wherever 'Total Velocity Magnitude RC - Matrix' 'm/day' is found
ndata = length(data_starts); % number of data blocks = number of matching header lines found in the .txt file
%# loop through the file and read the numeric data
for w = 1:ndata-1
%# read lines containing numbers
tmp_str = dotTXT_fileContents(data_starts(w)+1:data_starts(w+1)-3); % stores the content from file dotTXT_fileContents of the rows following the row containing 'Total Velocity Magnitude RC - Matrix' 'm/day' in form of string
%# convert strings to numbers
tmp_str = tmp_str{:}; % store the content of the string which contains data in form of a character
%# assign output
data_matrix_grid_wise(w,:) = str2num(tmp_str); % convert the part of the character containing data into number
end
To give you an idea of pattern of data in my text file, these are some results from the code:
data_starts =
2
1672
3342
5012
6682
8352
10022
ndata =
7
Therefore, my data_matrix_grid_wise should contain 1672-2-2-1(for a new line)=1667 rows. However, I am getting this as the result:
data_matrix_grid_wise =
Columns 1 through 2
0.190189000000000 0.279141000000000
0.423029000000000 0.616590000000000
0.406297000000000 0.604505000000000
0.259073000000000 0.381895000000000
0.231265000000000 0.338288000000000
0.237899000000000 0.348274000000000
Columns 3 through 4
0.452853000000000 0.613550000000000
0.981086000000000 1.289920000000000
0.996090000000000 1.373680000000000
0.625792000000000 0.859638000000000
0.547906000000000 0.743446000000000
0.562903000000000 0.759652000000000
Columns 5 through 6
0.757833000000000 0.884577000000000
1.534560000000000 1.714330000000000
1.733690000000000 2.074690000000000
1.078000000000000 1.277930000000000
0.921371000000000 1.080570000000000
0.934820000000000 1.087410000000000
Where am I wrong? In my final result, I should get data_matrix_grid_wise composed of 10000 elements instead of 36 elements. Thanks.
Update: How can I include the number before 'day' i.e. 1,2,3 etc. on a line just before the data_starts(w)? I am using this within the loop but it doesn't seem to work:
days_str = dotTXT_fileContents(data_starts(w)-1);
days_str = days_str{1};
days(w,:) = sscanf(days_str(w-1,:), '%d %*s %*s', [1, inf]);
The problem is in the line tmp_str = tmp_str{:}; MATLAB has some strange behaviour when handling chars. A short solution is to replace the last two statements in the loop with the next two lines:
y = cell2mat( cellfun(@(z) sscanf(z,'%f'),tmp_str,'UniformOutput',false));
data_matrix_grid_wise(w,:) = y;
The problem is with last 2 statements. When you do tmp_str{:} you convert cell array to comma-separated list of strings. If you assign this list to a single variable, only the first string is assigned. So the tmp_str will now have only the first row of data.
Here is what you can do instead of last 2 lines:
tmp_mat = cellfun(@str2num, tmp_str, 'uniformoutput',0);
data_matrix_grid_wise(w,:) = cell2mat(tmp_mat);
However, you will have a problem with the concatenation (cell2mat), since not all of your rows have the same number of columns. How to solve that is up to you.
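One way around rows of unequal length, offered only as a sketch, is to join all rows of a block into one long string and let sscanf read every number in it. The two rows below are shortened, hypothetical stand-ins for tmp_str. The same function can pull the leading day number out of the header line asked about in the update.

```matlab
% Hypothetical rows with different numbers of values, standing in for tmp_str.
tmp_str = {'0.190189 0.279141 0.452853'; '0.994502 1.08952'};

joined = sprintf('%s ', tmp_str{:});       % one long string, rows separated by spaces
block_values = sscanf(joined, '%f').';     % row vector with all numbers of the block

header_line = '1.00 ''day'' 2011-01-02';   % sample header line from the question
day_number = sscanf(header_line, '%f', 1); % read only the leading number
```

Because every block then becomes a single row vector, blocks with the same total number of values can be stacked as rows of data_matrix_grid_wise.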
I am trying to read a text file containing digits and strings using Octave. The file format is something like this:
A B C
a 10 100
b 20 200
c 30 300
d 40 400
e 50 500
but the delimiter can be space, tab, comma or semicolon. The textread function works fine if the delimiter is space/tab:
[A,B,C] = textread ('test.dat','%s %d %d','headerlines',1)
However, it does not work if the delimiter is a comma or semicolon. I tried to use dlmread:
dlmread ('test.dat',';',1,0)
but it does not work because the first column is a string.
Basically, with textread I can't specify the delimiter and with dlmread I can't specify the format of the first column. Not with the versions of these functions in Octave, at least. Has anybody ever had this problem before?
textread allows you to specify the delimiter; it honors the property arguments of strread. The following code worked for me:
[A,B,C] = textread( 'test.dat', '%s %d %d' ,'delimiter' , ',' ,'headerlines', 1 )
I couldn't find an easy way to do this in Octave currently. You could use fopen() to loop through the file and manually extract the data. I wrote a function that would do this on arbitrary data:
function varargout = coltextread(fname, delim)
% Initialize the variable output argument
varargout = cell(nargout, 1);
% Initialize elements of the cell array to nested cell arrays
% This syntax is due to {:} producing a comma-separated list
[varargout{:}] = deal(cell());
fid = fopen(fname, 'r');
while true
% Get the current line
ln = fgetl(fid);
% Stop if EOF
if ln == -1
break;
endif
% Split the line string into components and parse numbers
elems = strsplit(ln, delim);
nums = str2double(elems);
nans = isnan(nums);
% Special case of all strings (header line)
if all(nans)
continue;
endif
% Find the indices of the NaNs
% (i.e. the indices of the strings in the original data)
idxnans = find(nans);
% Assign each corresponding element in the current line
% into the corresponding cell array of varargout
for i = 1:nargout
% Detect if the current index is a string or a num
if any(ismember(idxnans, i))
varargout{i}{end+1} = elems{i};
else
varargout{i}{end+1} = nums(i);
endif
endfor
endwhile
% Close the file once all lines have been read
fclose(fid);
endfunction
It accepts two arguments: the file name, and the delimiter. The function is governed by the number of return variables that are specified, so, for example, [A B C] = coltextread('data.txt', ';'); will try to parse three different data elements from each row in the file, while A = coltextread('data.txt', ';'); will only parse the first elements. If no return variable is given, then the function won't return anything.
The function ignores rows that have all-strings (e.g. the 'A B C' header). Just remove the if all(nans)... section if you want everything.
By default, the 'columns' are returned as cell arrays, although the numbers within those arrays are actually converted numbers, not strings. If you know that a cell array contains only numbers, then you can easily convert it to a column vector with: cell2mat(A)'.
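The per-line step of this approach can be sketched on its own. Passing a cell array of delimiters to strsplit lets one call cope with space, tab, comma or semicolon. The sample line and the variable names label and nums are only illustrative.

```matlab
ln = 'b;20;200';                                  % one data line from the file

% Split on any of the candidate delimiters: comma, semicolon, space or tab.
parts = strsplit(ln, {',', ';', ' ', sprintf('\t')});

label = parts{1};                  % first column stays a string
nums = str2double(parts(2:end));   % remaining columns become numbers
```

The same split works unchanged on 'b 20 200' or 'b,20,200', so the delimiter does not have to be known in advance.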