Spreadsheet::WriteExcel set_optimization() generates unlink errors - linux

We have been using Spreadsheet::WriteExcel for a long time and it worked like a charm.
A few years ago we migrated to Excel::Writer::XLSX, which, as stated in its documentation, uses about 5 times more memory than WriteExcel.
Thanks to XLSX, users are now able to generate larger Excel files.
A few weeks ago we started to face memory usage issues, where about 84% of the server memory was being consumed.
The same documentation states that $workbook->set_optimization() should solve the problem, and the performance figures it gives are promising.
We tried to use $workbook->set_optimization() on a sample file, but this did not work: it generates an unlink error.
If set_optimization() is removed, the Excel file is generated properly.
The example is the one provided by the author in this thread:
#!/usr/bin/perl -w
use strict;
use Excel::Writer::XLSX;

my $workbook = Excel::Writer::XLSX->new('test.xlsx');
$workbook->set_optimization();
my $worksheet = $workbook->add_worksheet();

my @header_values = ( 1, 2, 3, 'foo', 'bar', 6, 7 );
my $header_cnt = 0;

for my $header_cell (@header_values) {
    $worksheet->write( 0, $header_cnt, $header_cell );
    $header_cnt++;
}

$workbook->close();
Error unlinking file /opt/.../rKhGTRYWSJ using unlink0 at /usr/local/share/perl5/Excel/Writer/XLSX/Worksheet.pm line 204
(in cleanup) Error unlinking file /opt/.../iGr8Qo8VBD using unlink0 at /usr/local/share/perl5/Excel/Writer/XLSX/Worksheet.pm line 204
We are running:
Excel-Writer-XLSX 0.70
perl v5.10.1
Red Hat Enterprise Linux Server release 6.8 (Santiago)
Any help would be appreciated.

Excel::Writer::XLSX is used to write large amounts of data in XLSX format; to handle large data sets and reduce memory usage, the set_optimization() method is used.
In an XLSX file, a worksheet can have a maximum of 1,048,576 rows and 16,384 columns. If the row count exceeds that limit, a new sheet can be created in the same workbook; in this way large amounts of data can be handled.
Refer to "Write_largeData_XLSX.pl" from this link https://github.com/AarthiRT/Excel_Writer_XLSX for more details.
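A minimal sketch of that sheet-rollover approach (the data source and output file name are placeholders, not the script from the link; note that in optimization mode rows must be written in sequential order and set_optimization() must be called before add_worksheet()):
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;

my $max_rows = 1_048_576;    # XLSX per-sheet row limit

my $workbook = Excel::Writer::XLSX->new( 'large.xlsx' );
$workbook->set_optimization();

my $worksheet = $workbook->add_worksheet();
my $row       = 0;

for my $i ( 1 .. 2_000_000 ) {    # placeholder data source
    if ( $row >= $max_rows ) {
        # Row limit reached: continue on a fresh sheet in the same workbook.
        $worksheet = $workbook->add_worksheet();
        $row       = 0;
    }
    $worksheet->write( $row, 0, $i );
    $worksheet->write( $row, 1, "item$i" );
    $row++;
}

$workbook->close();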

Related

matlab: Issue reading excel binary (.xlsb) file

I am trying to read in (MATLAB 7.14.0.739 (R2012a), Ubuntu 12.04, file size ~2MB) a binary Excel file containing multiple sheets, but get the following error:
[status,sheets,xlFormat] = xlsfinfo('633933_2014-07-04_11-34-27.xlsb')
status =
''
sheets =
Unreadable Excel file: File contains unexpected record length. Try
saving as Excel 98.
xlFormat =
''
I have a large number of these binary files so I don't want to have to resave them to another format if possible.
The documentation clearly states that support for .xlsb is limited to Windows systems with Excel installed.
You may try to find some third-party solution, e.g. a Python or Java library, which can read .xlsb, but I am not aware of any. Otherwise you have to switch to a different format.
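For anyone hunting for such a third-party route, the Python pyxlsb package is one candidate (an assumption on my part, not something vetted by the answer above): dump each sheet to CSV and read the CSVs into MATLAB. A minimal sketch:
import csv
from pyxlsb import open_workbook

# Convert every sheet of the .xlsb workbook to CSV
# (pip install pyxlsb; file name taken from the question).
with open_workbook('633933_2014-07-04_11-34-27.xlsb') as wb:
    for name in wb.sheets:
        with wb.get_sheet(name) as sheet, \
             open(f'{name}.csv', 'w', newline='') as out:
            writer = csv.writer(out)
            for row in sheet.rows():
                writer.writerow([cell.v for cell in row])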

openpyxl close archive after breaking read operation because max rows are 1048498 rows

I have two problems using openpyxl:
1. The spreadsheet has 1,048,498 rows. Iterating over it hogs memory, so I added logic to check the first five columns for empty cells and break out of the loop.
2. Logic 1 works for me, and the code no longer iterates indefinitely over the spreadsheet's blank cells. However, I am using P4Python to delete this read-only file after I am done reading it, and openpyxl is still using the file; there is no method except save() to close the archive it uses internally. Since my file is in read-only mode, I cannot save it. When P4 tries to delete the file, I get this error: "The process cannot access the file because it is being used by another process."
Help is appreciated :)
If you open the file in read-only mode then it will not hog memory. Cells are created only when read. Memory use has been tested with huge files but if you think this is a bug then please submit a bug report with a sample file.
This looks like an existing issue or intended behavior in openpyxl. If you have a read-only file (P4Python sync operation - p4.run_sync(file_path_to_sync)) and you are reading it using openpyxl, you will not be able to delete the file (P4Python p4.run_sync(file_path_to_sync + '#0') - remove from workspace) until you save the file, which is not possible (or intended, in my case) since it is a read-only file.
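A sketch of that pattern (the path is a placeholder; wb.close() exists on recent openpyxl releases for read-only workbooks, while the older workaround was the private wb._archive.close()):
from openpyxl import load_workbook

wb = load_workbook("big_readonly.xlsx", read_only=True)
ws = wb.active

for row in ws.iter_rows():
    # Stop at the first row whose first five cells are all empty,
    # mirroring the check described in the question.
    if all(cell.value is None for cell in row[:5]):
        break
    # ... process row ...

# Release the file handle so another process (e.g. P4) can delete the file.
wb.close()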

Load big Excel (xlsx) file into matlab

I use 64-bit Windows with 8 GB RAM and 64-bit MATLAB.
I tried to load a .xlsx file into MATLAB. The file size is around 700 MB, containing a sheet with 673928 rows and 43 columns.
First I used the GUI tool 'uiimport'. After choosing the file path and name, the GUI tool needs around 3 minutes to read the .xlsx file, and then shows the data in a table. If I choose "cell array", it needs around 10 minutes to import the data into the workspace.
>> whos
  Name                Size                Bytes  Class    Attributes
  NBPPdataV3YOS1      673928x43      3473588728  cell
It works very well, but I have many .xlsx files to import, and it is impractical to import each file using the GUI tool. So I used the GUI tool to generate a function like this:
function data = importfile(workbookFile, sheetName, range)
%% Import the data
[~, ~, data] = xlsread(workbookFile, sheetName, range);
data(cellfun(@(x) ~isempty(x) && isnumeric(x) && isnan(x), data)) = {''};
For simplicity, I omit some irrelevant code. However, when I use this function to import the data, it does not work well: the RAM used by MATLAB and Excel increases dramatically until almost all RAM is consumed, and the data still has not been imported after 30 minutes.
I also tried to do it like this:
filename='E:\data.xlsx';
excelObj = actxserver('Excel.Application');
fileObj = excelObj.Workbooks.Open(filename);
sheetObj = fileObj.Worksheets.get('Item', 'sheet2');
%Read in ranges the same way as xlsread!
indata = sheetObj.Range('A1:AQ673928').Value;
The same problem occurs as xlsread().
My questions are:
1. Does the GUI import tool use xlsread() to read the .xlsx file? If yes, why does the generated function not work? If no, which interface does it use?
2. Is there an efficient way to load an Excel file into MATLAB?
Thanks!
It sounds like you may be keeping the Excel file in memory in MATLAB. I would suggest making sure you close the connection to each Excel file after you have imported its data.
You may also find that the MATLAB table class is more memory-efficient than the cell class.
Good luck.
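For the actxserver approach from the question, that cleanup might look like this (a sketch; the range and sheet name are copied from the question, and the explicit release calls are the point):
excelObj = actxserver('Excel.Application');
fileObj  = excelObj.Workbooks.Open('E:\data.xlsx');
sheetObj = fileObj.Worksheets.get('Item', 'sheet2');
indata   = sheetObj.Range('A1:AQ673928').Value;

% Release Excel so its memory is actually freed:
fileObj.Close(false);   % close the workbook without saving
excelObj.Quit;          % shut down the Excel process
delete(excelObj);       % release the COM handle in MATLAB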

Oracle Table to SAS Dataset

I am facing a problem in converting a large Oracle table to a SAS dataset. I did this earlier and the method worked. However, this time, it is giving me the following error messages.
SAS code:
option compress = yes;
libname sasdata ".";
libname myora oracle user=scott password=tiger path=XYZDATA ;
data sasdata.expt_tabl;
set myora.expt_tabl;
run;
Log file:
You are running SAS 9. Some SAS 8 files will be automatically converted
by the V9 engine; others are incompatible. Please see
http://support.sas.com/rnd/migration/planning/platform/64bit.html
PROC MIGRATE will preserve current SAS file attributes and is
recommended for converting all your SAS libraries from any
SAS 8 release to SAS 9. For details and examples, please see
http://support.sas.com/rnd/migration/index.html
This message is contained in the SAS news file, and is presented upon
initialization. Edit the file "news" in the "misc/base" directory to
display site-specific news and information in the program log.
The command line option "-nonews" will prevent this display.
NOTE: SAS initialization used:
real time 1.63 seconds
cpu time 0.03 seconds
1 option compress = yes;
2 libname sasdata ".";
NOTE: Libref SASDATA was successfully assigned as follows:
Engine: V9
Physical Name: /******/dibyendu
3 libname myora oracle user=scott password=XXXXXXXXXX path=XYZDATA ;
NOTE: Libref MYORA was successfully assigned as follows:
Engine: ORACLE
Physical Name: XYZDATA
4 data sasdata.expt_tabl;
5 set myora.expt_tabl;
6 run;
NOTE: There were 6422133 observations read from the data set MYORA.EXPT_TABL.DATA.
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
ERROR: File SASDATA.EXPT_TABL.DATA is damaged. I/O processing did not complete.
NOTE: The data set SASDATA.EXPT_TABL.DATA has 6422133 observations and 49 variables.
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
ERROR: Expecting page 1, got page -1 instead.
2 The SAS System 21:40 Monday, April 1, 2013
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
ERROR: Expecting page 1, got page -1 instead.
ERROR: Page validation error while reading SASDATA.EXPT_TABL.DATA.
NOTE: Compressing data set SASDATA.EXPT_TABL.DATA decreased size by 78.88 percent.
Compressed is 37681 pages; un-compressed would require 178393 pages.
ERROR: File SASDATA.EXPT_TABL.DATA is damaged. I/O processing did not complete.
NOTE: SAS set option OBS=0 and will continue to check statements. This might cause NOTE: No observations in data set.
NOTE: DATA statement used (Total process time):
real time 8:55.98
cpu time 1:39.33
7
ERROR: Errors printed on pages 1,2.
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 8:58.67
cpu time 1:39.40
This is running on a RH Linux Server.
Any suggestions will be appreciated.
Thanks and regards,
This sounds like a space issue on your server. How large is the file system in your default directory (from your libname sasdata '.'; statement)? Use the data set option obs=1 on your Oracle table reference to create a new SAS dataset with one row and inspect the variables.
data sasdata.dummy_test;
set myora.expt_tabl(obs=1);
run;
Perhaps there are extremely large VARCHAR or BLOB columns that are consuming too much space. Remember that SAS does not have a VARCHAR type.
Though I am not totally sure, I believe the main issue was that I was initially trying to create/write the dataset in a directory that was restricted in some sense. This indirectly caused trouble, since the dataset created was defective. When I created it elsewhere, it was okay.
Thanks and regards,
Dibyendu
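In other words, the fix amounted to pointing the output libref at a directory with enough space and write permission, along these lines (the path is a placeholder):
/* Point the output library somewhere with room and permissions, then retry. */
libname sasdata '/path/with/space';

data sasdata.expt_tabl;
    set myora.expt_tabl;
run;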

Read an Excel file directly from a R script

How can I read an Excel file directly into R? Or should I first export the data to a text- or CSV file and import that file into R?
Let me reiterate what @Chase recommended: use XLConnect.
The reasons for using XLConnect are, in my opinion:
- Cross-platform. XLConnect is written in Java and thus will run on Windows, Linux, and Mac with no change to your R code (except possibly path strings).
- Nothing else to load. Just install XLConnect and get on with life.
- You only mentioned reading Excel files, but XLConnect will also write Excel files, including changing cell formatting. And it will do this from Linux or Mac, not just Windows.
XLConnect is somewhat new compared to other solutions, so it is less frequently mentioned in blog posts and reference docs. For me it has been very useful.
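A minimal sketch of XLConnect usage (file and sheet names are placeholders):
library(XLConnect)
wb <- loadWorkbook("my-spreadsheet.xlsx")
df <- readWorksheet(wb, sheet = 1)
# Writing works cross-platform too:
writeWorksheetToFile("out.xlsx", data = df, sheet = "results")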
And now there is readxl:
The readxl package makes it easy to get data out of Excel and into R.
Compared to the existing packages (e.g. gdata, xlsx, xlsReadWrite etc)
readxl has no external dependencies so it's easy to install and use on
all operating systems. It is designed to work with tabular data stored
in a single sheet.
readxl is built on top of the libxls C library, which abstracts away
many of the complexities of the underlying binary format.
It supports both the legacy .xls format and .xlsx
readxl is available from CRAN, or you can install it from github with:
# install.packages("devtools")
devtools::install_github("hadley/readxl")
Usage
library(readxl)
# read_excel reads both xls and xlsx files
read_excel("my-old-spreadsheet.xls")
read_excel("my-new-spreadsheet.xlsx")
# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)
# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")
Note that while the description says 'no external dependencies', it does require the Rcpp package, which in turn requires Rtools (for Windows) or Xcode (for OSX), which are dependencies external to R. Though many people have them installed for other reasons.
Yes. See the relevant page on the R wiki. Short answer: read.xls from the gdata package works most of the time (although you need to have Perl installed on your system -- usually already true on MacOS and Linux, but takes an extra step on Windows, i.e. see http://strawberryperl.com/). There are various caveats, and alternatives, listed on the R wiki page.
The only reason I see not to do this directly is that you may want to examine the spreadsheet to see if it has glitches (weird headers, multiple worksheets [you can only read one at a time, although you can obviously loop over them all], included plots, etc.). But for a well-formed, rectangular spreadsheet with plain numbers and character data (i.e., not comma-formatted numbers, dates, formulas with divide-by-zero errors, missing values, etc. etc. ..) I generally have no problem with this process.
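A minimal usage sketch of that gdata route (the file name is a placeholder; Perl must be installed, as noted above):
library(gdata)
df <- read.xls("my-spreadsheet.xls", sheet = 1)
# On Windows you may need to point at Perl explicitly, e.g.:
# df <- read.xls("my-spreadsheet.xls", sheet = 1,
#                perl = "C:/Strawberry/perl/bin/perl.exe")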
EDIT 2015-October: As others have commented here, the openxlsx and readxl packages are far faster than the xlsx package and actually manage to open larger Excel files (>1500 rows & >120 columns). @MichaelChirico demonstrates that readxl is better when speed is preferred, while openxlsx replaces the functionality provided by the xlsx package. If you are looking for a package to read, write, and modify Excel files in 2015, pick openxlsx instead of xlsx.
Pre-2015: I have used the xlsx package. It changed my workflow with Excel and R. No more annoying pop-ups asking if I am sure that I want to save my Excel sheet in .txt format. The package also writes Excel files.
However, I find the read.xlsx function slow when opening large Excel files. The read.xlsx2 function is considerably faster, but does not guess the vector class of data.frame columns. You have to use the colClasses argument to specify the desired column classes if you use the read.xlsx2 function. Here is a practical example:
read.xlsx("filename.xlsx", 1) reads your file and makes the data.frame column classes nearly useful, but is very slow for large data sets. It also works for .xls files.
read.xlsx2("filename.xlsx", 1) is faster, but you will have to define column classes manually. A shortcut is to run the command twice (see the example below). The "character" specification converts your columns to factors. Use the "Date" and "POSIXct" options for time.
coln <- function(x) {                  # a helper to see column numbers
    y <- rbind(seq(1, ncol(x)))
    colnames(y) <- colnames(x)
    rownames(y) <- "col.number"
    return(y)
}
data <- read.xlsx2("filename.xlsx", 1) # open the file
coln(data)    # check the column numbers you want to have as factors
x <- 3        # say you want columns 1-3 as factors, the rest numeric
data <- read.xlsx2("filename.xlsx", 1,
                   colClasses = c(rep("character", x),
                                  rep("numeric", ncol(data) - x)))
Given the proliferation of different ways to read an Excel file in R and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform the best (in a few simple situations).
I myself have been using xlsx since I started using R, for inertia if nothing else, and I recently noticed there doesn't seem to be any objective information about which package works better.
Any benchmarking exercise is fraught with difficulties as some packages are sure to handle certain situations better than others, and a waterfall of other caveats.
That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):
NN <- 1000L        # set to 25000L for the larger benchmark below
set.seed(51423)
DF <- data.frame(  # assigned so it can be written to CSV afterwards
str1 = sample(sprintf("%010d", 1:NN)), #ID field 1
str2 = sample(sprintf("%09d", 1:NN)), #ID field 2
#varying length string field--think names/addresses, etc.
str3 =
replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
collapse = "")),
#factor-like string field with 50 "levels"
str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
#factor-like string field with 17 levels, varying length
str5 =
sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
collapse = "")), NN, TRUE),
#lognormally distributed numeric
num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
#3 binary strings
str6 = sample(c("Y","N"), NN, TRUE),
str7 = sample(c("M","F"), NN, TRUE),
str8 = sample(c("B","W"), NN, TRUE),
#right-skewed integer
int1 = ceiling(rexp(NN)),
#dates by month
dat1 =
sample(seq(from = as.Date("2005-12-31"),
to = as.Date("2015-12-31"), by = "month"),
NN, TRUE),
dat2 =
sample(seq(from = as.Date("2005-12-31"),
to = as.Date("2015-12-31"), by = "month"),
NN, TRUE),
num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
#date by day
dat3 =
sample(seq(from = as.Date("2015-06-01"),
to = as.Date("2015-07-15"), by = "day"),
NN, TRUE),
#lognormal numeric that can be positive or negative
num3 =
(-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
)
I then wrote this to CSV, opened it in LibreOffice, and saved it as an .xlsx file, then benchmarked 4 of the packages mentioned in this thread: xlsx, openxlsx, readxl, and gdata, using the default options (I also tried versions with and without specifying column types, but this didn't change the rankings).
I'm excluding RODBC because I'm on Linux; XLConnect because it seems its primary purpose is not reading in single Excel sheets but importing entire Excel workbooks, so to put its horse in the race on only its reading capabilities seems unfair; and xlsReadWrite because it is no longer compatible with my version of R (seems to have been phased out).
I then ran benchmarks with NN=1000L and NN=25000L (resetting the seed before each declaration of the data.frame above) to allow for differences with respect to Excel file size. gc is primarily for xlsx, which I've found at times can create memory clogs. Without further ado, here are the results I found:
1,000-Row Excel File
library(microbenchmark)
fl <- "NN1000.xlsx"   # path to the .xlsx file saved above (placeholder name)
benchmark1k <-
  microbenchmark(times = 100L,
                 xlsx = {xlsx::read.xlsx2(fl, sheetIndex = 1); invisible(gc())},
                 openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                 readxl = {readxl::read_excel(fl); invisible(gc())},
                 gdata = {gdata::read.xls(fl); invisible(gc())})
# Unit: milliseconds
# expr min lq mean median uq max neval
# xlsx 194.1958 199.2662 214.1512 201.9063 212.7563 354.0327 100
# openxlsx 142.2074 142.9028 151.9127 143.7239 148.0940 255.0124 100
# readxl 122.0238 122.8448 132.4021 123.6964 130.2881 214.5138 100
# gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345 100
So readxl is the winner, with openxlsx competitive and gdata a clear loser. Taking each measure relative to the column minimum:
# expr min lq mean median uq max
# 1 xlsx 1.59 1.62 1.62 1.63 1.63 1.65
# 2 openxlsx 1.17 1.16 1.15 1.16 1.14 1.19
# 3 readxl 1.00 1.00 1.00 1.00 1.00 1.00
# 4 gdata 16.43 16.62 15.77 16.67 16.25 11.31
We see that my own favorite, xlsx, is about 60% slower than readxl.
25,000-Row Excel File
Due to the amount of time it takes, I only did 20 repetitions on the larger file, otherwise the commands were identical. Here's the raw data:
# Unit: milliseconds
# expr min lq mean median uq max neval
# xlsx 4451.9553 4539.4599 4738.6366 4762.1768 4941.2331 5091.0057 20
# openxlsx 962.1579 981.0613 988.5006 986.1091 992.6017 1040.4158 20
# readxl 341.0006 344.8904 347.0779 346.4518 348.9273 360.1808 20
# gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826 20
Here's the relative data:
# expr min lq mean median uq max
# 1 xlsx 13.06 13.16 13.65 13.75 14.16 14.13
# 2 openxlsx 2.82 2.84 2.85 2.85 2.84 2.89
# 3 readxl 1.00 1.00 1.00 1.00 1.00 1.00
# 4 gdata 128.62 128.67 129.22 129.86 129.69 126.75
So readxl is the clear winner when it comes to speed. gdata better have something else going for it, as it's painfully slow in reading Excel files, and this problem is only exacerbated for larger tables.
Two draws of openxlsx are (1) its extensive other methods (readxl is designed to do only one thing, which is probably part of why it's so fast), especially its write.xlsx function, and (2) (more of a drawback for readxl) the fact that the col_types argument in readxl only accepts (as of this writing) some nonstandard values: "text" instead of "character" and "date" instead of "Date".
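For reference, a minimal openxlsx round trip (file names are placeholders):
library(openxlsx)
df <- read.xlsx("in.xlsx", sheet = 1)   # read one sheet into a data.frame
write.xlsx(df, "out.xlsx")              # write it back out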
I've had good luck with XLConnect: http://cran.r-project.org/web/packages/XLConnect/index.html
library(RODBC)
file.name <- "file.xls"
sheet.name <- "Sheet Name"
## Connect to Excel File Pull and Format Data
excel.connect <- odbcConnectExcel(file.name)
dat <- sqlFetch(excel.connect, sheet.name, na.strings=c("","-"))
odbcClose(excel.connect)
Personally, I like RODBC and can recommend it.
Just gave the package openxlsx a try today. It worked really well (and fast).
http://cran.r-project.org/web/packages/openxlsx/index.html
Another solution is the xlsReadWrite package, which doesn't require additional installs but does require that you download the additional shlib before using it for the first time:
require(xlsReadWrite)
xls.getshlib()
Forgetting this can cause utter frustration. Been there and all that...
On a side note: you might want to consider converting to a text-based format (e.g. CSV) and reading in from there, for a number of reasons:
- Whatever your solution (RODBC, gdata, xlsReadWrite), some strange things can happen when your data gets converted. Dates especially can be rather cumbersome. The HFWutils package has some tools to deal with Excel dates (per @Ben Bolker's comment).
- If you have large sheets, reading in text files is faster than reading in from Excel.
- For .xls and .xlsx files, different solutions might be necessary. E.g. the xlsReadWrite package currently does not support .xlsx, AFAIK. gdata requires you to install additional Perl libraries for .xlsx support. The xlsx package can handle the extension of the same name.
As noted above in many of the other answers, there are many good packages that connect to the XLS/X file and get the data in a reasonable way. However, you should be warned that under no circumstances should you use the clipboard (or a .csv file) to retrieve data from Excel. To see why, enter =1/3 into a cell in Excel. Now reduce the number of decimal places visible to two. Then copy and paste the data into R, or save the CSV. You'll notice that in both cases Excel has helpfully kept only the data that was visible to you through the interface, and you've lost all of the precision in your actual source data.
Expanding on the answer provided by @Mikko, you can use a neat trick to speed things up without having to "know" your column classes ahead of time. Simply use read.xlsx to grab a limited number of records to determine the classes, and then follow up with read.xlsx2:
Example
# just the first 50 rows should do...
df.temp <- read.xlsx("filename.xlsx", 1, startRow=1, endRow=50)
df.real <- read.xlsx2("filename.xlsx", 1,
colClasses=as.vector(sapply(df.temp, mode)))
A tab-delimited file saved with an .xls extension can be read directly into R as follows (note that this does not work for a genuine binary .xls file):
my_data <- read.table(file = "xxxxxx.xls", sep = "\t", header = TRUE)
Reading .xls and .xlsx files using the readxl package:
library("readxl")
my_data <- read_excel("xxxxx.xls")
my_data <- read_excel("xxxxx.xlsx")
