I created a routine in R to import multiple Excel files that I need to merge into one big txt file, using the read.xls function. Some of these xls files have carriage returns ("\n") inside cells, so when I write the txt file (write.table) R interprets each "\n" as a new line.
How can I clean the xls files, or read them properly, so that the unwanted "\n" characters are removed?
Thanks!
The columns in your table are almost certainly factors (that's the default for character columns in R), so we can just change the factor levels in each column.
First, some dummy data:
R> dd = data.frame(d1 = c("1", "2\n", "33"),
                   d2 = c("1\n", "2\n", "33"))
##Default, factor
R> levels(dd[,1])
[1] "1" "2\n" "33"
Next, we use a for loop to go over the columns:
for(i in 1:ncol(dd))
  levels(dd[,i]) = gsub("\n", "", levels(dd[,i]))
If you want to avoid the for loop and use sapply instead, this should work:
##Can this be improved?
sapply(1:ncol(dd),
       function(i) levels(dd[,i]) <<- gsub("\n", "", levels(dd[,i])))
I am trying to loop not only through different Excel sheets (~125), but also through different Excel files (~12). I managed to write code for the sheets, but now I am struggling to scale that up to multiple Excel files. The Excel files all have the same structure and the same number and names of sheets.
Can anyone help me? Thanks a lot in advance!!
foldername = 'Raw_data';
cd(foldername);
fnames = dir('*raw.xlsx');
%% extraction of sheet name
[~, sheet_name] = xlsfinfo('Test_raw.xlsx');
%% additional array for merging later
cali=[1; 2; 5; 10; 20; 50; 100; 200; 500; 1000];
%for i=1:length(fnames) %I guess ?
for k=1:numel(sheet_name) %operation for all sheets
    %extract data of one excel file, but different sheets
    [~,~,raw{k}] = xlsread('Test_raw.xlsx',sheet_name{k},'A5:A14');
    x = vertcat(raw{:});
end
B = reshape(x,10,k);
numind = cellfun(@isnumeric, B); %identifies numeric values
B(~numind) = {NaN}; %NOT num. values to NaN
b = cell2mat(B);
final_data = [cali b];
%end
You want to loop through all your Excel files, and you have already gathered all the file names in fnames.
You basically already set up your for-loop; the only thing missing is replacing 'Test_raw.xlsx' in the xlsread call with fnames(i).name:
for i=1:length(fnames)
    for k=1:numel(sheet_name) %operation for all sheets
        %extract data of one excel file, but different sheets
        [~,~,raw{k}] = xlsread(fnames(i).name,sheet_name{k},'A5:A14');
        x = vertcat(raw{:});
    end
end
Be aware that you will have to adapt your final_data variable.
To simply collect all the data from all the files, you could make this variable a cell array containing one element per file. It is good practice to preallocate this array before entering the loop:
final_data = cell(length(fnames),1);
%% here go the loops
clear B
B = reshape(x,10,k);
numind = cellfun(@isnumeric, B); %identifies numeric values
B(~numind) = {NaN}; %NOT num. values to NaN
b = cell2mat(B);
final_data{i} = [cali b];
B, numind and b are temporary working variables that are overwritten on each loop iteration; because of this, clearing them before their next use can be good practice.
After the loop, you can access your data with e.g. final_data{5} for the fifth file.
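Putting the pieces together, the whole thing might look roughly like this sketch (it assumes every file contains the same sheets and the same A5:A14 range, and reuses the cali and sheet_name variables from your script):
final_data = cell(length(fnames),1);
for i = 1:length(fnames)
    raw = cell(numel(sheet_name),1);
    for k = 1:numel(sheet_name)
        %extract data of one file, sheet by sheet
        [~,~,raw{k}] = xlsread(fnames(i).name, sheet_name{k}, 'A5:A14');
    end
    x = vertcat(raw{:});
    clear B
    B = reshape(x, 10, numel(sheet_name)); %one column per sheet
    numind = cellfun(@isnumeric, B);       %identifies numeric values
    B(~numind) = {NaN};                    %non-numeric values to NaN
    b = cell2mat(B);
    final_data{i} = [cali b];
end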
I have to add a table from a CSV file of around 1500 rows and 9 columns (about 75 pages) to a docx Word document, using python-docx.
I have tried different approaches, reading the CSV with pandas or opening the CSV file directly, but it takes around 150 minutes to finish the job regardless of the approach I choose.
My question is whether this is normal behavior, or whether there is another way to speed this task up.
I'm using this for loop to read several CSV files and parse them into table format:
for toTAB in listBRUTO:
    df = pd.read_csv(toTAB)
    # add a table to the end and create a reference variable
    # extra row is so we can add the header row
    t = doc.add_table(df.shape[0]+1, df.shape[1])
    t.style = 'LightShading-Accent1'  # border
    # add the header rows.
    for j in range(df.shape[-1]):
        t.cell(0, j).text = df.columns[j]
    # add the rest of the data frame
    for i in range(df.shape[0]):
        for j in range(df.shape[-1]):
            t.cell(i+1, j).text = str(df.values[i, j])
    # TABLE format
    for row in t.rows:
        for cell in row.cells:
            paragraphs = cell.paragraphs
            for paragraph in paragraphs:
                for run in paragraph.runs:
                    font = run.font
                    font.name = 'Calibri'
                    font.size = Pt(7)
    doc.add_page_break()
doc.save('blabla.docx')
Thanks in advance
You'll want to minimize the number of calls to table.cell(). Because of the way cell-merging works, these are expensive operations that really add up when performed in a tight loop.
I would start with refactoring this block and see how much improvement that yields:
# --- add the rest of the data frame ---
for i in range(df.shape[0]):
    for j, cell in enumerate(table.rows[i + 1].cells):
        cell.text = str(df.values[i, j])
python-docx walks the whole table every single time you access its "cells" property,
so you should call ".cell" as little as possible and use a cache of cells instead.
Here are two examples accessing a table of size 3×1500:
code 1: about 150.0s
for row in table.rows:
    print('processing: {0:30s}'.format(row.cells[0].text), end='\r')
code 2: about 1.4s
clls = table._cells
for row_idx in range(len(clls)//table._column_count):
    print('processing: {0:30s}'.format(
        clls[0 + row_idx*table._column_count].text), end='\r')
clls = table._cells in code 2 uses "_cells" to handle the cell-merging once, so clls[column_idx + row_idx*table._column_count].text works just as well as table.rows[row_idx].cells[column_idx].text, and doesn't require the table to be exactly rectangular.
For a rectangular table without merged cells, you can export all cells into a list-of-lists structure and fill them very quickly (less than 0.5 s vs 15 s for a ~300-row table with 3 columns):
from docx.table import _Cell

def get_cells_grid(table):
    cells = [[]]
    col_count = table._column_count
    for tc in table._tbl.iter_tcs():
        cells[-1].append(_Cell(tc, table))
        if len(cells[-1]) == col_count:
            cells.append([])
    return cells

cells = get_cells_grid(t)
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        cells[i][j].text = str(df.values[i, j])
Function based on table._cells() code: https://github.com/python-openxml/python-docx/blob/da75fcf01f7f322e846e2ac3e1936aedd766acc8/docx/table.py#L162
Just to add my experience: if you have to create a huge table, create the whole structure first, meaning all the rows and cells you will need, and then store the cells like so:
table_cells = table._cells (as @kztopia suggests above)
From there you can manipulate the cells as you wish (merging, adding text, etc.) with much better performance, since you only walk the cells once instead of calling cell() repeatedly.
In my use case, for a table that is in my opinion not so big (~130 rows, 8 cells per row), it used to take 9 s to create the whole thing and now I'm at 0.5 s or so.
Keep in mind that the bigger the table, the more time each cell() call will take.
I would like to perform a data reduction operation on a spreadsheet. Preferably I would like to use MATLAB (or Excel), since I need separate output files for each case.
The link to the spreadsheet is below:
Spreadsheet
A screenshot of the spreadsheet is below.
The output I require in text files is something like the below.
The first sheet in the .xls file is the main input, whereas the following sheets (d**) are my required output. I also need these sheets as separate ASCII files (.dat) so that I can plot them later on. Here is how the algorithm works:
1. Look up the number/string in column B (FileName).
2. Extract all data in columns C and D (Saturation and ETC) with the same FileName value (column B).
3. Look up the matching FileName (column B) value in column E (ImageIndex).
4. Copy the value of ImageName (column F) corresponding to that value in ImageIndex (column E).
5. The result would be three columns (ImageName, Saturation, ETC); ImageName would be the same within each subcase.
6. Sort the columns based on Saturation.
7. Write each subcase as a separate .dat file.
I tried a few recipes using categorical arrays (findgroups and splitapply) in MATLAB, but they didn't seem to work out for me. I will later be working on a larger data set, so automation is necessary. I think this could be done using macros in Excel, but I would prefer MATLAB since I will use MATLAB to plot the data. Any alternative suggestions are welcome.
Thanks,
Here's a MATLAB solution. You could do it with a rather convoluted accumarray call, but readability would suffer, so I'm opting for a loop here.
out is a structure which you can use to either write files, or to plot the data.
tbl = readtable('yourFile.xls');
%# get the group indices for the files
%# this assumes that you have cleaned up the dash after the 1
%# so that all of the entries in the FileName column are numeric
idx = tbl.FileName;
%# the uIdx business is to account for the possibility
%# that there are images missing from the sequence
uIdx = unique(idx);
nImages = length(uIdx);
%# preassign output structure
out(1:nImages) = struct('name','','saturation',0,'etc',0);
%# loop to extract relevant information
for iImage = uIdx(:)'
    myIdx = idx == iImage;
    data = tbl(myIdx, {'Saturation','ETC'});
    data = sortrows(data, 'Saturation');
    name = tbl.ImageName{tbl.ImageIdx == iImage};
    out(iImage==uIdx).name = name;
    out(iImage==uIdx).saturation = data.Saturation;
    out(iImage==uIdx).etc = data.ETC;
end
%# plotting
for iImage = 1:nImages
    figure('name', out(iImage).name)
    plot(out(iImage).saturation, out(iImage).etc, '.');
end
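If you also want the separate .dat files mentioned in the question, a minimal sketch could look like the following (the file-name pattern is an assumption; dlmwrite writes the two numeric columns, and the image name is used for the file name):
%# write one ASCII .dat file per image
for iImage = 1:nImages
    outName = sprintf('%s.dat', out(iImage).name);
    dlmwrite(outName, [out(iImage).saturation, out(iImage).etc], '\t');
end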
I have roughly 70,000 sheets that all have to have calculations done, and then all results compiled into a new sheet (which would be 70,000 lines long).
It needs to be sorted by date.
I'm very poor at MATLAB, but I've written what I need the script to do for each Excel sheet; I'm just unsure how to make it do this for all of them.
Thank you! (I took out some of the less important code.)
%Reading in excel sheet
B = xlsread('24259893-008020361800.TorqueData.20160104.034602AM.csv');
%Creating new matrix
[inYdim, inXdim] = size(B);
Ydim = inYdim;
[num,str,raw]=xlsread('24259893-008020361800.TorqueData.20160104.034602AM.csv',strcat('A1:C',num2str(Ydim)));
%Extracting column C
C=raw(:,3);
for k = 1:numel(C)
    if isnan(C{k})
        C{k} = '';
    end
end
%Calculations
TargetT=2000;
AvgT=mean(t12);
TAcc=((AvgT-TargetT)/TargetT)*100 ;
StdDev=std(B(ind1:ind2,2));
ResTime=t4-t3;
FallTime=t6-t5;
DragT=mean(t78);
BreakInT=mean(t910);
BreakInTime=(t10-t9)/1000;
BreakInE=BreakInT*BreakInTime*200*.1047;
%Combining results
Results=[AvgT TAcc StdDev ResTime FallTime DragT BreakInT BreakInTime BreakInE]
I think I need to do something along the lines of:
filenames=dir('*.csv')
and I found this that may be useful:
filenames=dir('*.csv');
for file = filenames'
    csv = load(file.name);
    % ...stuff in here
end
You have the right idea, but you need to index your file names in order to be able to step through them in the for loop.
FileDir = 'Your Directory';
FileNames = {'Test1';'Test2';'Test3'};
for k=1:length(FileNames)
    file = [FileDir,'/',FileNames{k}];
    [outputdata] = xlsread(file, sheet#, data locations); % fill in your sheet number and data range
    % THE REST OF YOUR LOOP, indexed by k
end
How you choose to get the file names and directory is up to you.
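For example, building on the dir call you already found, a sketch like the following avoids hard-coding the names (the directory name is just a placeholder):
FileDir = 'Your Directory';                 % placeholder directory
listing = dir(fullfile(FileDir, '*.csv'));  % all csv files in that folder
FileNames = {listing.name};
for k = 1:length(FileNames)
    file = fullfile(FileDir, FileNames{k});
    B = xlsread(file);  % same call as at the top of your script
    % ...the rest of your calculations for this file, indexed by k,
    % collecting each Results row into one big matrix if you need the compiled sheet
end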
I have several text files with 2 columns, containing only numbers, which I would like to import into a single Excel spreadsheet (Excel 2016) using MATLAB. The reason for using MATLAB (R2014a) is that I later have scripts that process the data, and it is also the only programming language I am mildly familiar with.
I tried to use the following:
Using matlab to save a 100 text files to a one excel file but into different spread sheet?
but I just couldn't understand it, as I am a newbie, and I think that example is for making several Excel files while I want only one. Thanks for the help! Greatly appreciated.
content = dir();
col = 1;
for i = 1:10
    if content(i).isdir ~= 1
        fileID = fopen(content(i).name); % open the current file from the listing
        data = textscan(fileID, '%s %s');
        fclose(fileID);
        datum(:, col) = data{1};
        col = col + 1;
        datum(:, col) = data{2};
        col = col + 1;
        clear data;
    end
end
filename = 'Datum.xls';
sheet=1;
xlswrite(filename, datum, sheet, 'A1');
close all;
This is a basic working algorithm; you will need to work on it further to optimize it for speed.
Hints:
1. Pre-declare the size of datum, based on the number of files.
2. If all the files you have to read have the same extension, list only those files through dir() (see the sketch below).
Good luck with the fine-tuning.
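A rough sketch combining both hints (it assumes the text files all share the .ASC extension, as in the example above, and all have the same known number of rows; nRows is a placeholder you would set yourself):
content = dir('*.ASC');                 % hint 2: list only the files of interest
nRows = 1000;                           % assumed number of rows per file
datum = cell(nRows, 2*numel(content));  % hint 1: pre-declare datum
col = 1;
for i = 1:numel(content)
    fileID = fopen(content(i).name);
    data = textscan(fileID, '%s %s');
    fclose(fileID);
    datum(:, col)   = data{1};
    datum(:, col+1) = data{2};
    col = col + 2;
end
xlswrite('Datum.xls', datum, 1, 'A1');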