How can I read an Excel file directly into R? Or should I first export the data to a text- or CSV file and import that file into R?
Let me reiterate what @Chase recommended: Use XLConnect.
The reasons for using XLConnect are, in my opinion:
Cross platform. XLConnect is written in Java and, thus, will run on Win, Linux, Mac with no change of your R code (except possibly path strings)
Nothing else to load. Just install XLConnect and get on with life.
You only mentioned reading Excel files, but XLConnect will also write Excel files, including changing cell formatting. And it will do this from Linux or Mac, not just Win.
XLConnect is somewhat new compared to other solutions so it is less frequently mentioned in blog posts and reference docs. For me it's been very useful.
And now there is readxl:
The readxl package makes it easy to get data out of Excel and into R.
Compared to the existing packages (e.g. gdata, xlsx, xlsReadWrite etc)
readxl has no external dependencies so it's easy to install and use on
all operating systems. It is designed to work with tabular data stored
in a single sheet.
readxl is built on top of the libxls C library, which abstracts away
many of the complexities of the underlying binary format.
It supports both the legacy .xls format and .xlsx
readxl is available from CRAN, or you can install it from github with:
# install.packages("devtools")
library(readxl)
# read_excel reads both xls and xlsx files
# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)
# If NAs are represented by something other than blank cells,
# set the na argument
read_excel("my-spreadsheet.xls", na = "NA")
Note that while the description says 'no external dependencies', it does require the Rcpp package, which in turn requires Rtools (for Windows) or Xcode (for OSX), which are dependencies external to R. Though many people have them installed for other reasons.
Yes. See the relevant page on the R wiki. Short answer: read.xls from the gdata package works most of the time (although you need to have Perl installed on your system; this is usually already true on MacOS and Linux, but takes an extra step on Windows). There are various caveats, and alternatives, listed on the R wiki page.
The only reason I see not to do this directly is that you may want to examine the spreadsheet to see if it has glitches (weird headers, multiple worksheets [you can only read one at a time, although you can obviously loop over them all], included plots, etc.). But for a well-formed, rectangular spreadsheet with plain numbers and character data (i.e., not comma-formatted numbers, dates, formulas with divide-by-zero errors, missing values, etc. etc. ..) I generally have no problem with this process.
EDIT 2015-October: As others have commented here, the openxlsx and readxl packages are far faster than the xlsx package and actually manage to open larger Excel files (>1500 rows & > 120 columns). @MichaelChirico demonstrates that readxl is better when speed is preferred and that openxlsx replaces the functionality provided by the xlsx package. If you are looking for a package to read, write, and modify Excel files in 2015, pick openxlsx instead of xlsx.
Pre-2015: I have used the xlsx package. It changed my workflow with Excel and R. No more annoying pop-ups asking if I am sure that I want to save my Excel sheet in .txt format. The package also writes Excel files.
However, I find the read.xlsx function slow when opening large Excel files. The read.xlsx2 function is considerably faster, but does not guess the vector class of data.frame columns. You have to use the colClasses argument to specify the desired column classes if you use read.xlsx2. Here is a practical example:
read.xlsx("filename.xlsx", 1) reads your file and makes the data.frame column classes nearly useful, but is very slow for large data sets. Works also for .xls files.
read.xlsx2("filename.xlsx", 1) is faster, but you will have to define column classes manually. A shortcut is to run the command twice (see the example below). The character specification converts your columns to factors. Use the Date and POSIXct options for time.
library(xlsx)
# A function to see column numbers
coln <- function(x) {
  y <- rbind(seq_len(ncol(x)))
  colnames(y) <- colnames(x)
  rownames(y) <- "col.number"
  y
}
data <- read.xlsx2("filename.xlsx", 1) # Open the file
coln(data) # Check the column numbers you want to have as factors
x <- 3 # Say you want columns 1-3 as factors, the rest numeric
data <- read.xlsx2("filename.xlsx", 1,
                   colClasses = c(rep("character", x),
                                  rep("numeric", ncol(data) - x)))
Given the proliferation of different ways to read an Excel file in R and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform the best (in a few simple situations).
I myself have been using xlsx since I started using R, for inertia if nothing else, and I recently noticed there doesn't seem to be any objective information about which package works better.
Any benchmarking exercise is fraught with difficulties as some packages are sure to handle certain situations better than others, and a waterfall of other caveats.
That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):
# NN is the number of rows; the seed was reset before each declaration
data.frame(
  str1 = sample(sprintf("%010d", 1:NN)), #ID field 1
  str2 = sample(sprintf("%09d", 1:NN)),  #ID field 2
  #varying length string field--think names/addresses, etc.
  str3 =
    replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
                         collapse = "")),
  #factor-like string field with 50 "levels"
  str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
  #factor-like string field with 17 levels, varying length
  str5 =
    sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
                                 collapse = "")), NN, TRUE),
  #lognormally distributed numeric
  num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
  #3 binary strings
  str6 = sample(c("Y","N"), NN, TRUE),
  str7 = sample(c("M","F"), NN, TRUE),
  str8 = sample(c("B","W"), NN, TRUE),
  #right-skewed integer
  int1 = ceiling(rexp(NN)),
  #dates by month
  dat1 =
    sample(seq(from = as.Date("2005-12-31"),
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  dat2 =
    sample(seq(from = as.Date("2005-12-31"),
               to = as.Date("2015-12-31"), by = "month"),
           NN, TRUE),
  num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
  #date by day
  dat3 =
    sample(seq(from = as.Date("2015-06-01"),
               to = as.Date("2015-07-15"), by = "day"),
           NN, TRUE),
  #lognormal numeric that can be positive or negative
  num3 =
    (-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
)
I then wrote this to csv and opened in LibreOffice and saved it as an .xlsx file, then benchmarked 4 of the packages mentioned in this thread: xlsx, openxlsx, readxl, and gdata, using the default options (I also tried a version of whether or not I specify column types, but this didn't change the rankings).
I'm excluding RODBC because I'm on Linux; XLConnect because it seems its primary purpose is not reading in single Excel sheets but importing entire Excel workbooks, so to put its horse in the race on only its reading capabilities seems unfair; and xlsReadWrite because it is no longer compatible with my version of R (seems to have been phased out).
I then ran benchmarks with NN=1000L and NN=25000L (resetting the seed before each declaration of the data.frame above) to allow for differences with respect to Excel file size. gc is primarily for xlsx, which I've found at times can create memory clogs. Without further ado, here are the results I found:
1,000-Row Excel File
benchmark1k <-
microbenchmark(times = 100L,
xlsx = {xlsx::read.xlsx2(fl, sheetIndex=1); invisible(gc())},
openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
readxl = {readxl::read_excel(fl); invisible(gc())},
gdata = {gdata::read.xls(fl); invisible(gc())})
# Unit: milliseconds
# expr min lq mean median uq max neval
# xlsx 194.1958 199.2662 214.1512 201.9063 212.7563 354.0327 100
# openxlsx 142.2074 142.9028 151.9127 143.7239 148.0940 255.0124 100
# readxl 122.0238 122.8448 132.4021 123.6964 130.2881 214.5138 100
# gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345 100
So readxl is the winner, with openxlsx competitive and gdata a clear loser. Taking each measure relative to the column minimum:
# expr min lq mean median uq max
# 1 xlsx 1.59 1.62 1.62 1.63 1.63 1.65
# 2 openxlsx 1.17 1.16 1.15 1.16 1.14 1.19
# 3 readxl 1.00 1.00 1.00 1.00 1.00 1.00
# 4 gdata 16.43 16.62 15.77 16.67 16.25 11.31
We see my own favorite, xlsx is 60% slower than readxl.
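The "relative to the column minimum" step is plain division, so it is easy to check by hand. Here is a small Python sketch that redoes it for the median column of the 1,000-row table; the numbers are copied from the output above, and the helper name relative_to_min is only illustrative.

```python
# Median timings in milliseconds, copied from the 1,000-row benchmark table.
medians_ms = {
    "xlsx": 201.9063,
    "openxlsx": 143.7239,
    "readxl": 123.6964,
    "gdata": 2062.5259,
}


def relative_to_min(timings):
    """Divide every timing by the smallest one and round to 2 decimals."""
    fastest = min(timings.values())
    return {name: round(value / fastest, 2) for name, value in timings.items()}


relative = relative_to_min(medians_ms)
for name, ratio in relative.items():
    print(f"{name:<9} {ratio:6.2f}")
```

The ratios match the median column of the relative table, which is where the "60% slower" figure for xlsx comes from.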
25,000-Row Excel File
Due to the amount of time it takes, I only did 20 repetitions on the larger file, otherwise the commands were identical. Here's the raw data:
# Unit: milliseconds
# expr min lq mean median uq max neval
# xlsx 4451.9553 4539.4599 4738.6366 4762.1768 4941.2331 5091.0057 20
# openxlsx 962.1579 981.0613 988.5006 986.1091 992.6017 1040.4158 20
# readxl 341.0006 344.8904 347.0779 346.4518 348.9273 360.1808 20
# gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826 20
Here's the relative data:
# expr min lq mean median uq max
# 1 xlsx 13.06 13.16 13.65 13.75 14.16 14.13
# 2 openxlsx 2.82 2.84 2.85 2.85 2.84 2.89
# 3 readxl 1.00 1.00 1.00 1.00 1.00 1.00
# 4 gdata 128.62 128.67 129.22 129.86 129.69 126.75
So readxl is the clear winner when it comes to speed. gdata better have something else going for it, as it's painfully slow in reading Excel files, and this problem is only exacerbated for larger tables.
Two draws of openxlsx are 1) its extensive other methods (readxl is designed to do only one thing, which is probably part of why it's so fast), especially its write.xlsx function, and 2) (more of a drawback for readxl) the col_types argument in readxl only (as of this writing) accepts some nonstandard R: "text" instead of "character" and "date" instead of "Date".
I've had good luck with XLConnect:
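library(RODBC)
# The variable names were lost in the source; file.name and sheet.name are assumed
file.name <- "file.xls"
sheet.name <- "Sheet Name"
## Connect to Excel File Pull and Format Data
excel.connect <- odbcConnectExcel(file.name)
dat <- sqlFetch(excel.connect, sheet.name, na.strings=c("","-"))
odbcClose(excel.connect)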
Personally, I like RODBC and can recommend it.
Just gave the package openxlsx a try today. It worked really well (and fast).
Another solution is the xlsReadWrite package, which doesn't require additional installs but does require you download the additional shlib before you use it the first time by :
Forgetting this can cause utter frustration. Been there and all that...
On a sidenote : You might want to consider converting to a text-based format (eg csv) and read in from there. This for a number of reasons :
whatever your solution (RODBC, gdata, xlsReadWrite) some strange things can happen when your data gets converted. Especially dates can be rather cumbersome. The HFWutils package has some tools to deal with EXCEL dates (per @Ben Bolker's comment).
if you have large sheets, reading in text files is faster than reading in from EXCEL.
for .xls and .xlsx files, different solutions might be necessary. EG the xlsReadWrite package currently does not support .xlsx AFAIK. gdata requires you to install additional perl libraries for .xlsx support. xlsx package can handle extensions of the same name.
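The .xls/.xlsx split in the last point is visible in the first bytes of the file: .xlsx is a ZIP archive, while legacy .xls uses the OLE2 compound-file container. A minimal Python sketch for telling them apart before choosing a reader follows; the function name and the temporary demo files are made up for illustration, and note that a ZIP signature only shows the file is some ZIP-based document, not necessarily a workbook.

```python
import os
import tempfile

ZIP_MAGIC = b"PK\x03\x04"                         # .xlsx is a ZIP archive
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container


def sniff_excel_format(path):
    """Guess the container format from the first 8 bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    if head == OLE2_MAGIC:
        return "xls"
    return "unknown"


# Demo with fake files that contain only the signatures (not real workbooks).
results = {}
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "new.xlsx": ZIP_MAGIC + b"\x00" * 4,
        "old.xls": OLE2_MAGIC,
        "plain.xls": b"a\tb\tc\n1\t2\t3\n",  # tab-delimited text with an .xls name
    }
    for name, content in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as fh:
            fh.write(content)
        results[name] = sniff_excel_format(path)

print(results)
```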
As noted above in many of the other answers, there are many good packages that connect to the XLS/X file and get the data in a reasonable way. However, you should be warned that under no circumstances should you use the clipboard (or a .csv) file to retrieve data from Excel. To see why, enter =1/3 into a cell in excel. Now, reduce the number of decimal points visible to you to two. Then copy and paste the data into R. Now save the CSV. You'll notice in both cases Excel has helpfully only kept the data that was visible to you through the interface and you've lost all of the precision in your actual source data.
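The precision loss can be mimicked without Excel. The Python sketch below assumes the export writes exactly what a two-decimal display would show, which is the behaviour described above; the variable names are illustrative only.

```python
import csv
import io

value = 1 / 3  # what the cell really holds

# Write the value the way a two-decimal display would export it.
buffer = io.StringIO()
csv.writer(buffer).writerow([f"{value:.2f}"])

# Read the CSV back, as R (or anything else) would.
buffer.seek(0)
restored = float(next(csv.reader(buffer))[0])

print(value, restored)
```

Everything after the second decimal is gone after the round trip, and nothing in the CSV indicates that it ever existed.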
Expanding on the answer provided by @Mikko, you can use a neat trick to speed things up without having to "know" your column classes ahead of time. Simply use read.xlsx to grab a limited number of records to determine the classes and then follow it up with read.xlsx2:
# just the first 50 rows should do...
df.temp <- read.xlsx("filename.xlsx", 1, startRow=1, endRow=50)
df.real <- read.xlsx2("filename.xlsx", 1,
colClasses=as.vector(sapply(df.temp, mode)))
If the file is actually tab-delimited text that merely carries an .xls extension, it can be read directly into R as follows:
my_data <- read.table(file = "xxxxxx.xls", sep = "\t", header=TRUE)
Reading xls and xlsx files using the readxl package:
my_data <- read_excel("xxxxx.xls")
my_data <- read_excel("xxxxx.xlsx")
How do I add a column to an already existing text file in Python? First I just want to add the header, then through my investigation add values in the column.
Here is a minimal solution that does what you asked for, and gives you an overview of some useful Python features. Importantly, what you want to do can probably be done very easily with pandas but I don't use it, so the solution is plain Python. It can also be done quite differently if you want to use numpy, more efficiently too.
This is it,
input_file = 'test.txt'
output_file = 'mod_test.txt'
header = ''
matrix = []

# Parse and store all existing lines
with open(input_file, 'r') as fin:
    # The header - remove the newline character only
    header = next(fin).strip()
    # The data - parse and store each line as a list
    # of strings, store as a sublist of matrix
    for line in fin:
        matrix.append(line.strip().split(' '))

# Update the header - now a newline is needed
header += ' CLASS\n'

# Now perform your calculations for each row
# and add new column - adding 0.0 as in the comment
for row in matrix:
    # Calculations would go here
    # This zero is already a string, normally a
    # conversion would be needed
    row.append('0.0')

# Write it to the new file
with open(output_file, 'w') as fout:
    # First the updated header
    fout.write(header)
    for row in matrix:
        # Turn the entries into a single string
        fout.write((' ').join(row) + '\n')
And this is a simple demonstration file that can go with it, test.txt,
C1 C2 C3
1.0 2.0 2.1
2.0 3.0 3.2
3.0 4.0 3.5
The comments highlight the most important details, but you should research each technique on your own; they are very useful when working with files.
Basically, you first load the original matrix, then modify it - add the last column, then save it into a different file. To save it to the same file you can just adjust output_file or use input_file in both cases. Writing to a different file, especially during development and debugging sounds like a better idea.
The new calculations and the writing can all be done in the same place, and if it is only adding a column of 0.0s that is probably a better way; right now the code is unnecessarily stretched. However, if you want to perform more involved calculations, I recommend keeping this structure (and putting any longer calculations in separate functions).
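For the simple case, computing and writing in the same place could look roughly like this. It is only a sketch: add_column and its compute callback are names invented here, and an in-memory buffer stands in for test.txt so the snippet runs on its own.

```python
import io


def add_column(lines, name, compute):
    """Yield each line with one extra space-separated column appended.

    The first line is treated as the header; compute receives the list of
    existing fields (as strings) and returns the new value for that row.
    """
    rows = iter(lines)
    yield next(rows).rstrip("\n") + " " + name + "\n"
    for line in rows:
        fields = line.split()
        yield " ".join(fields + [str(compute(fields))]) + "\n"


# Stand-in for open('test.txt'); any iterable of lines works.
source = io.StringIO("C1 C2 C3\n1.0 2.0 2.1\n2.0 3.0 3.2\n3.0 4.0 3.5\n")
result = list(add_column(source, "CLASS", lambda fields: 0.0))
print("".join(result), end="")
```

With real files you would pass the open input file as lines and hand the generator to fout.writelines(), so the whole file never has to sit in memory.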
I need to read the records from a mainframe file and apply some filters on the record values.
So I am looking for a solution to convert the mainframe file to csv or text or Excel workbook so that I can easily perform the operations on the file.
I also need to validate the records count.
Who said anything about EBCDIC? The OP didn't.
If it is all text then FTP'ing with EBCDIC to ASCII translation is doable, including within Python.
If not then either:
The extraction and conversion to CSV needs to happen on z/OS. Perhaps with a COBOL program. Then the CSV can be FTP'ed down with
The data has to be FTP'ed BINARY and then parsed and bits of it translated.
But, as so often is the case, we need more information.
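If the file does turn out to be all-text EBCDIC that was transferred in binary mode, the translation itself is short in Python. The sketch below assumes code page 037 (the real code page depends on the system) and fixed-length records with no binary or packed-decimal fields; decode_records is a hypothetical helper, not part of any library.

```python
EBCDIC_CODEC = "cp037"  # assumption: the actual code page depends on the system


def decode_records(raw, record_length):
    """Split a binary transfer into fixed-length records and decode each one."""
    if len(raw) % record_length:
        raise ValueError("file size is not a multiple of the record length")
    return [
        raw[start:start + record_length].decode(EBCDIC_CODEC)
        for start in range(0, len(raw), record_length)
    ]


# Fake two-record "file": 16 EBCDIC bytes, record length 8.
sample = "JOB02679STARTED ".encode(EBCDIC_CODEC)
records = decode_records(sample, 8)
print(records)
print(len(records), "records")
```

The length check doubles as a crude record-count validation: the number of records is just the file size divided by the record length.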
I was recently processing the hardcopy log and wanted to break the record apart. I used Python to do this, as the record was effectively a fixed-position record with different data items at fixed locations in the record. In my case the entire record was text, but one could easily apply this technique to convert various columns to an appropriate type.
Here is a sample record. I added a few lines to help visualize the data offsets used in the code to access the data:
1 2 3 4 5 6 7 8 9
N 4000000 PROD 19114 06:27:04.07 JOB02679 00000090 $HASP373 PWUB02#C STARTED - INIT 17
Note the fixed column positions for the various items and how they are referenced by position. Using this technique you could process the file and create a CSV with the output you want for processing in Excel.
For my case I used Python 3.
def processBaseMessage(self, message):
    self.command = message[1]
    self.routing = list(message[2:9])
    self.routingCodes = []  # These are routing codes extracted from the system log.
    self.sysname = message[10:18]
    self.date = message[19:24]  # attribute name lost in the source; "date" is assumed
    self.time = message[25:36]
    self.ident = message[37:45]
    self.msgflags = message[46:54]
    self.msg = [ message[56:] ]
You can then format the fields into the form you need for further processing. There are other ways to process mainframe data and many variations on this one, but based on the question this approach should suit your needs.
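Going from the slices to a CSV, with a record count for validation, might look like the sketch below. The offsets are the ones used in the method above; the "date" field name, the build_line helper and the sample values are assumptions made for the demo.

```python
import csv
import io

# (name, start, end) offsets taken from the slicing code; "date" is an assumed name.
FIELDS = [
    ("command", 1, 2), ("sysname", 10, 18), ("date", 19, 24),
    ("time", 25, 36), ("ident", 37, 45), ("msgflags", 46, 54),
    ("msg", 56, None),
]


def build_line(values):
    """Lay out values at the FIELDS offsets (demo helper only)."""
    line = ""
    for (_, start, _), value in zip(FIELDS, values):
        line = line.ljust(start) + value
    return line


def parse_record(line):
    """Slice one fixed-position record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}


def records_to_csv(lines):
    """Return (csv_text, record_count) for the non-blank input lines."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=[name for name, _, _ in FIELDS])
    writer.writeheader()
    count = 0
    for line in lines:
        if line.strip():
            writer.writerow(parse_record(line.rstrip("\n")))
            count += 1
    return out.getvalue(), count


line = build_line(["N", "PROD", "19114", "06:27:04.07", "JOB02679",
                   "00000090", "$HASP373 PWUB02#C STARTED - INIT 17"])
csv_text, count = records_to_csv([line, "", line])
print(csv_text)
print(count, "records written")
```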
I wonder, how to save and load numpy.array data properly. Currently I'm using the numpy.savetxt() method. For example, if I got an array markers, which looks like this:
I try to save it by the use of:
numpy.savetxt('markers.txt', markers)
In other script I try to open previously saved file:
markers = np.fromfile("markers.txt")
And that's what I get...
Saved data first looks like this:
But when I save just loaded data by the use of the same method, ie. numpy.savetxt() it looks like this:
What am I doing wrong? PS there are no other "backstage" operation which I perform. Just saving and loading, and that's what I get. Thank you in advance.
The most reliable way I have found to do this is to use np.savetxt with np.loadtxt and not np.fromfile which is better suited to binary files written with tofile. The np.fromfile and np.tofile methods write and read binary files whereas np.savetxt writes a text file.
So, for example:
import numpy as np
a = np.array([1, 2, 3, 4])
# Text round trip
np.savetxt('test1.txt', a, fmt='%d')
b = np.loadtxt('test1.txt', dtype=int)
a == b
# array([ True, True, True, True], dtype=bool)
# Binary round trip
a.tofile('test2.dat')
c = np.fromfile('test2.dat', dtype=int)
c == a
# array([ True, True, True, True], dtype=bool)
I use the former method even if it is slower and creates bigger files (sometimes): the binary format can be platform dependent (for example, the file format depends on the endianness of your system).
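The endianness point can be shown with nothing but the standard library. This sketch uses struct rather than NumPy, so it only illustrates the byte-order issue and is not how tofile itself is implemented.

```python
import struct

value = 1

little = struct.pack("<i", value)  # byte layout on a little-endian machine
big = struct.pack(">i", value)     # byte layout on a big-endian machine

# Reading little-endian bytes as if they were big-endian gives a wrong number.
misread = struct.unpack(">i", little)[0]

print(little, big, misread)
```

The same four bytes decode to 1 or to 16777216 depending on the byte order the reader assumes, and a raw binary dump does not record which one was used.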
There is a platform independent format for NumPy arrays, which can be saved and read with np.save and np.load:
np.save('test3.npy', a) # .npy extension is added if not given
d = np.load('test3.npy')
a == d
# array([ True, True, True, True], dtype=bool)
np.save('data.npy', num_arr) # save
new_num_arr = np.load('data.npy') # load
The short answer is: you should use np.save and np.load.
The advantage of using these functions is that they are made by the developers of the Numpy library and they already work (plus are likely optimized nicely for processing speed).
For example:
import numpy as np
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
np.save(path/'x', x)
np.save(path/'y', y)
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
print(x is x_loaded) # False
print(x == x_loaded) # [[ True True True True True]]
Expanded answer:
In the end it really depends in your needs because you can also save it in a human-readable format (see Dump a NumPy array into a csv file) or even with other libraries if your files are extremely large (see best way to preserve numpy arrays on disk for an expanded discussion).
However (expanding, since you use the word "properly" in your question), I still think that using the numpy functions out of the box will most likely satisfy most user needs (and work with most code!). The most important reason is that they already work. Trying to use something else for any other reason might take you down an unexpectedly LONG rabbit hole to figure out why it doesn't work and to force it to work.
Take for example trying to save it with pickle. I tried that just for fun and it took me at least 30 minutes to realize that pickle wouldn't save my stuff unless I opened & read the file in bytes mode with wb. It took time to google the problem, test potential solutions, understand the error message, etc... It's a small detail, but the fact that it already required me to open a file complicated things in unexpected ways. To add to that, it required me to re-read this (which btw is sort of confusing): Difference between modes a, a+, w, w+, and r+ in built-in open function?.
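The bytes-mode requirement is easy to reproduce. In this sketch in-memory buffers stand in for files opened with 'wb' and 'w'; the payload is just an example object.

```python
import io
import pickle

payload = {"x": [1.0, 2.0, 3.0]}

# A binary buffer behaves like a file opened with 'wb': this works.
binary_buffer = io.BytesIO()
pickle.dump(payload, binary_buffer)
restored = pickle.loads(binary_buffer.getvalue())

# A text buffer behaves like a file opened with 'w': pickle writes bytes,
# so the dump fails with a TypeError.
try:
    pickle.dump(payload, io.StringIO())
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

print(restored)
print(repr(text_mode_error))
```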
So if there is an interface that meets your needs, use it unless you have a (very) good reason (e.g. compatibility with matlab or for some reason your really want to read the file and printing in Python really doesn't meet your needs, which might be questionable). Furthermore, most likely if you need to optimize it, you'll find out later down the line (rather than spending ages debugging useless stuff like opening a simple Numpy file).
So use the interface numpy provides. It might not be perfect, but it's most likely fine, especially for a library that's been around as long as Numpy.
I already spent the time saving and loading data with numpy in a bunch of ways, so have fun with it. Hope this helps!
import numpy as np
import pickle
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
pickle.dump(obj={'x':x, 'y':y}, file=db_file)
## using loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
db_pkl = pickle.load(db_file)
print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
Some comments on what I learned:
np.save, as expected, already stores the data compactly (as raw binary, without compression) and works out of the box without any file opening. Clean. Easy. Efficient. Use it.
np.savez uses an uncompressed format (see docs): "Save several arrays into a single file in uncompressed .npz format." If you decide to use this (you were warned about going away from the standard solution, so expect bugs!) you might discover that you need to use argument names to save it, unless you want to use the default names. So don't use this if the first already works (or if anything else works, use that!)
Pickle also allows for arbitrary code execution. Some people might not want to use this for security reasons.
Human-readable files are expensive to make etc. Probably not worth it.
There is something called hdf5 for large files. Cool!
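On the pickle security point above: loading a pickle can call any callable the file names. A harmless Python sketch, where the Sneaky class is invented for the demo and eval only computes a number:

```python
import pickle


class Sneaky:
    """An object whose pickle tells the loader to call eval on a string."""

    def __reduce__(self):
        # (callable, args): pickle.loads will run eval("6 * 7").
        return (eval, ("6 * 7",))


blob = pickle.dumps(Sneaky())
loaded = pickle.loads(blob)

print(loaded)
```

What comes back is not a Sneaky instance but the result of the call, so the string was executed during loading. A hostile file could name something far less friendly, which is why you should only unpickle data you trust.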
Note that this is not an exhaustive answer. But for other resources check this:
For pickle (guess the top answer is don't use pickle, use np.save): Save Numpy Array using Pickle
For large files (great answer! compares storage size, loading save and more!):
For matlab (we have to accept matlab has some freakin' nice plots!): "Converting" Numpy arrays to Matlab and vice versa
For saving in human-readable format: Dump a NumPy array into a csv file
np.fromfile() has a sep= keyword argument:
Separator between items if file is a text file. Empty (“”) separator means the file should be treated as binary. Spaces (” ”) in the separator match zero or more whitespace characters. A separator consisting only of spaces must match at least one whitespace.
The default value of sep="" means that np.fromfile() tries to read it as a binary file rather than a space-separated text file, so you get nonsense values back. If you use np.fromfile('markers.txt', sep=" ") you will get the result you are looking for.
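To see where the nonsense values come from, here is a standard-library sketch of what happens when the characters of a text file are reinterpreted as a raw 8-byte float. It uses struct instead of NumPy and assumes little-endian float64, so it is an illustration rather than NumPy's actual code path.

```python
import struct

# The first 8 characters savetxt would write for the value 1.0.
text_bytes = b"1.000000"

# Binary interpretation: the 8 ASCII codes are taken as one float64.
as_binary = struct.unpack("<d", text_bytes)[0]

# Text interpretation: parse the characters as a decimal number.
as_text = float(text_bytes.decode("ascii"))

print(as_binary, as_text)
```

Parsing the text gives 1.0; reinterpreting the same bytes gives a tiny unrelated number.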
However, as others have pointed out, np.loadtxt() is the preferred way to convert text files to numpy arrays, and unless the file needs to be human-readable it is usually better to use binary formats instead (e.g. np.load()/np.save()).
I use Windows 64bit with 8GB RAM and Matlab 64bit.
I tried to load a .xlsx file into matlab. The file size is around 700MB, containing a sheet with 673928 rows and 43 columns.
First I use the GUI tool 'uiimport'. After choosing the file path and name, the GUI tool needs around 3 minutes to read the .xlsx file, and then shows the data in a table. If I choose "cell array", it needs around 10 minutes to import the data into workspace.
Name Size Bytes Class Attributes
NBPPdataV3YOS1 673928x43 3473588728 cell
It works very well, but I have many .xlsx files to import. It is impossible to import each file using GUI tool. So I use the GUI tool to generate function like this
function data = importfile(workbookFile, sheetName, range)
%% Import the data
[~, ~, data] = xlsread(workbookFile, sheetName, range);
data(cellfun(@(x) ~isempty(x) && isnumeric(x) && isnan(x),data)) = {''};
For simplicity, I omit some irrelevant code. However, when I use this function to import the data, it does not work well. The RAM used by Matlab and Excel increases dramatically until almost all RAM is used. The data cannot be imported even after 30 minutes.
I also try to do it like this,
excelObj = actxserver('Excel.Application');
fileObj = excelObj.Workbooks.Open(filename);
sheetObj = fileObj.Worksheets.get('Item', 'sheet2');
%Read in ranges the same way as xlsread!
indata = sheetObj.Range('A1:AQ673928').Value;
The same problem occurs as xlsread().
My questions are:
1. Does the GUI import tool use xlsread() to read .xlsx file? If yes, why the generated function does not work? If no, which interface it uses?
2. Is there an efficient way to load Excel file into Matlab?
It sounds like you may be keeping the excel file in memory in Matlab. I would suggest looking into making sure you close the connection to each excel file after you have imported its data.
You may also find that the Matlab table class is more memory efficient than the cell class.
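A rough back-of-the-envelope check of the memory numbers in the question, as a Python sketch. The 8 bytes per element figure assumes the whole sheet could be held as a plain numeric (double) matrix, which may not be true for your data.

```python
rows, cols = 673928, 43
cell_array_bytes = 3473588728  # size reported in the workspace listing

cells = rows * cols
bytes_per_cell = cell_array_bytes / cells

# Assumption: every value fits in an 8-byte double.
dense_double_bytes = cells * 8

print(f"{cells} cells, about {bytes_per_cell:.0f} bytes per cell")
print(f"numeric matrix: about {dense_double_bytes / 1e6:.0f} MB")
```

So the cell array costs on the order of a hundred bytes per entry, which is why a typed container is worth trying for the numeric columns.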
Good luck.
I have an excel file with ~10000 rows and ~250 columns, currently I am using RODBC to do the importing:
channel <- odbcConnectExcel(xls.file="s:/demo.xls")
demo <- sqlFetch(channel,"Sheet_1")
But this is a bit slow (it takes a minute or two to import), and the Excel file is originally encrypted, so I have to remove the password before working on it, which I would prefer not to do. Is there a better way, i.e. one that imports faster and can read encrypted Excel files?
I recommend trying the XLConnect package instead of RODBC.