Reading a big .xls file into R

I have an Excel file with ~10000 rows and ~250 columns. Currently I am using RODBC to do the importing:
channel <- odbcConnectExcel(xls.file = "s:/demo.xls")  # open an ODBC connection to the workbook
demo <- sqlFetch(channel, "Sheet_1")                   # read the sheet into a data frame
odbcClose(channel)                                     # release the connection
But this way is a bit slow (it takes a minute or two to import), and the Excel file is originally encrypted: I have to remove the password before I can work on it, which I would prefer not to do. Is there a better way (i.e. faster to import, and capable of reading encrypted Excel files)?
Thanks.

I recommend trying the XLConnect package instead of RODBC.
http://cran.r-project.org/web/packages/XLConnect/index.html

Related

How do I use Python to show data in Excel?

I'm trying to display data from Python in Excel. Ideally, a pandas DataFrame's worth of data would appear in a new, unsaved Excel instance. My search has turned up ways to create Excel files, and lots of ways to 'open' an Excel file to read data from it, but no way to display it. My current approach was to create a file and then figure out how to open it, but I consider that approach second-best.
I found this. Another way would be to export your data to CSV and import it in Excel.
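If the data starts as a pandas DataFrame, the CSV route could look roughly like this, combined with the os.startfile trick from the next answer (a sketch only: it assumes Windows, since os.startfile is Windows-only, and the path and example frame are just placeholders):
import os
import pandas as pd

# Illustrative stand-in for whatever DataFrame you want to show
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

csv_path = r"C:\Temp\preview.csv"   # illustrative path
df.to_csv(csv_path, index=False)    # dump the frame as CSV
os.startfile(csv_path)              # Windows-only: opens with whatever handles .csv (usually Excel)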
I found you can 'run' files with the os library. As long as your computer knows what to do with the file type, you can create the .xlsx file with whatever method and then run it to display it:
import xlsxwriter
import os

# Write a small workbook to disk, then hand it to the OS default handler (Excel)
w = xlsxwriter.Workbook(r"C:\Temp\test.xlsx")
s = w.add_worksheet("Sheet1")
s.write(1, 3, "7")   # put the string "7" in cell D2
w.close()
os.startfile(r"C:\Temp\test.xlsx")  # Windows-only: open the file with its associated program
I'm still not sure whether you can work with an unsaved, open instance of Excel, though.
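Since the question mentions a pandas DataFrame, the same pattern also works by letting pandas write the workbook directly (a sketch, assuming an Excel writer engine such as openpyxl or xlsxwriter is installed; the path and frame are illustrative):
import os
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})  # stand-in for your data

xlsx_path = r"C:\Temp\preview.xlsx"   # illustrative path
df.to_excel(xlsx_path, index=False)   # needs an Excel writer engine (openpyxl or xlsxwriter)
os.startfile(xlsx_path)               # Windows-only: open in the default application
Either way the file still exists on disk, so this does not give a truly unsaved Excel instance either.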

Filtering SAS datasets created with PROC IMPORT with dbms=xlsx

My client uses SAS 9.3 running on an AIX (IBM Unix) server. The client interface is SAS Enterprise Guide 5.1.
I ran into this really puzzling problem: when using PROC IMPORT in combination with dbms=xlsx, it seems impossible to filter rows based on the value of a character variable (at least, when we look for an exact match).
With an .xls file, the following import works perfectly well; the expected subset of rows is written to myTable:
proc import out = myTable(where=(myString EQ "ABC"))
    datafile = "myfile.xls"
    dbms = xls replace;
run;
However, using the same data but this time in an .xlsx file, an empty dataset is created (having the right number of variables and adequate column types).
proc import out = myTable(where=(myString EQ "ABC"))
    datafile = "myfile.xlsx"
    dbms = xlsx replace;
run;
Moreover, if we exclude the WHERE clause from the PROC IMPORT, the data is seemingly imported correctly. However, filtering it afterwards is still not possible. For instance, this will create an empty dataset:
data myFilteredTable;
set myTable;
where myString EQ "ABC";
run;
The following will work, but is obviously not satisfactory:
data myFilteredTable;
set myTable;
where myString LIKE "ABC%";
run;
Also note that:
Using compress or other string functions does not help.
Filtering on numeric columns works fine for both .xls and .xlsx files.
My preferred method for reading spreadsheets is to use Excel libnames, but that is technically not possible at this time.
I wonder if this is a known issue; I couldn't find anything about it so far. Any help is appreciated.
It sounds like your strings have extra characters on the end that are not being picked up by compress. Try using the countc function on myString to see whether any extra characters exist at the end; once they're identified, you can figure out what to remove with compress.

openpyxl: close the archive after breaking off a read operation (the sheet reports 1048498 rows)

I have two problems using openpyxl:
The spreadsheet reports 1048498 rows. Iterating over all of them hogs memory, so I added logic that checks for the first five empty columns and breaks out of the loop.
Logic 1 works for me, and the code no longer iterates endlessly over the blank cells. I am using P4Python to delete this read-only file after I am done reading it. However, openpyxl is still holding the file open, and there is no method except save to close the archive it uses internally. Since my file is opened in read-only mode, I cannot save it. When P4 tries to delete the file, I get this error: "The process cannot access the file because it is being used by another process."
Help is appreciated :)
If you open the file in read-only mode then it will not hog memory. Cells are created only when read. Memory use has been tested with huge files but if you think this is a bug then please submit a bug report with a sample file.
This looks like an existing issue or intended behavior in openpyxl. If you have a read-only file (synced with P4Python via p4.run_sync(file_path_to_sync)) and you read it with openpyxl, you will not be able to delete the file (P4Python: p4.run_sync(file_path_to_sync + '#0') to remove it from the workspace) until you save the workbook, which is not possible (or intended, in my case) since it is a read-only file.
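For reference, here is a rough sketch of the pattern that should release the handle on newer openpyxl releases, where read-only workbooks expose a close() method; the path and the blank-row cut-off are only stand-ins for the "Logic 1" described above:
from openpyxl import load_workbook

path = r"C:\Temp\big_report.xlsx"          # illustrative path
wb = load_workbook(path, read_only=True)   # streams cells instead of loading the whole sheet
ws = wb.active

empty_streak = 0
for row in ws.iter_rows():
    if all(cell.value is None for cell in row):
        empty_streak += 1
        if empty_streak >= 5:              # bail out once we hit a run of blank rows
            break
    else:
        empty_streak = 0
        # ... process the row here ...

# In read-only mode the underlying zip archive stays open until close() is called;
# closing it releases the file lock so P4 (or anything else) can delete the file.
wb.close()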

Load a big Excel (.xlsx) file into MATLAB

I use 64-bit Windows with 8 GB of RAM and 64-bit MATLAB.
I am trying to load a .xlsx file into MATLAB. The file is around 700 MB and contains a sheet with 673928 rows and 43 columns.
First I use the GUI tool 'uiimport'. After choosing the file path and name, the GUI tool needs around 3 minutes to read the .xlsx file and then shows the data in a table. If I choose "cell array", it needs around 10 minutes to import the data into the workspace.
>> whos
  Name               Size           Bytes        Class    Attributes
  NBPPdataV3YOS1     673928x43      3473588728   cell
It works very well, but I have many .xlsx files to import, and importing each one through the GUI tool is impractical. So I used the GUI tool to generate a function like this:
function data = importfile(workbookFile, sheetName, range)
%% Import the data
[~, ~, data] = xlsread(workbookFile, sheetName, range);
% Replace cells holding a numeric NaN with empty strings
data(cellfun(@(x) ~isempty(x) && isnumeric(x) && isnan(x), data)) = {''};
For simplicity, I have omitted some irrelevant code. However, when I use this function to import the data, it does not work well: the RAM used by MATLAB and Excel increases dramatically until almost all of it is consumed, and the data still has not been imported after 30 minutes.
I also tried doing it like this:
filename='E:\data.xlsx';
excelObj = actxserver('Excel.Application');
fileObj = excelObj.Workbooks.Open(filename);
sheetObj = fileObj.Worksheets.get('Item', 'sheet2');
%Read in ranges the same way as xlsread!
indata = sheetObj.Range('A1:AQ673928').Value;
The same problem occurs as with xlsread().
My questions are:
1. Does the GUI import tool use xlsread() to read the .xlsx file? If yes, why does the generated function not work? If no, which interface does it use?
2. Is there an efficient way to load Excel files into MATLAB?
Thanks!
It sounds like you may be keeping the Excel file in memory in MATLAB. I would suggest making sure you close the connection to each Excel file after you have imported its data.
You may also find that the MATLAB table class is more memory-efficient than the cell class.
Good luck.

Importing xlsx file with space in filename in Stata .do file

I am totally new to Stata and am wondering how to import .xlsx data into Stata. Let's say the data is in the subdirectory Data and is named "a b c.xlsx", so relative to the working directory the file is in /Data.
I am trying to do
import excel using "\Data\a b c.xlsx", sheet("a")
but it's not working
"it's not working" is anything but a useful error report. For future questions, please report the exact error given by Stata.
Let's say the file is in the directory /home/roberto. Then
clear
set more off
import excel using "/home/roberto/a b c.xlsx"
list
should work.
If you are already in /home/roberto (which you can verify using display c(pwd)), then
import excel using "a b c.xlsx"
should work.
Using backslashes to refer to directories is not encouraged. See Stata tip 65: Beware the backstabbing backslash, by Nick Cox.
See also help cd.
