import excel sheet to Guava Table - excel

Is it possible to import excel as a guava table object where the excel sheet contains more than 3 columns?
Getting confused on this as most code samples talk about only 3 columns in sheet as seen in below link too
https://www.geeksforgeeks.org/table-guava-java/

You misinterpret the Table<R,C,V>. This are not three columns but Rrow, Column and Value.
A Excel table would be a Table<String, String, Object> where row keys are R1, R2, R3, .. and the column keys are C1, C2, C3, ... The objects are the cell values.
When we get each cell content as String, then a Excel table would be:
Table<String, String, String> excelTable = HashBasedTable.create();
and a cell content would be put there like:
excelTable.put("R" + r, "C" + c, value);
Given an Excel sheet like:
The following code gets all it's content to a Guava Table.
import org.apache.poi.ss.usermodel.*;
import java.io.FileInputStream;
import java.util.Map;
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;
class ReadExcelToGuavaTable {
public static void main(String[] args) throws Exception {
Table<String, String, String> excelTable = HashBasedTable.create();
Workbook workbook = WorkbookFactory.create(new FileInputStream("Excel.xlsx"));
DataFormatter dataFormatter = new DataFormatter(java.util.Locale.US);
FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
Sheet sheet = workbook.getSheetAt(0);
int r = 1;
int c = 1;
for (Row row : sheet) {
r = row.getRowNum() + 1;
for (Cell cell : row) {
c = cell.getColumnIndex() + 1;
String value = dataFormatter.formatCellValue(cell, formulaEvaluator);
//System.out.println("R" + r + "C" + c + " = " + value);
excelTable.put("R" + r, "C" + c, value);
}
}
// get Map corresponding to row 1 in Excel
Map<String, String> rowMap = excelTable.row("R1");
System.out.println("List of row 1 content : ");
for (Map.Entry<String, String> row : rowMap.entrySet()) {
System.out.println("Column : " + row.getKey() + ", Value : " + row.getValue());
}
// get a Map corresponding to column 4 in Excel
Map<String, String> columnMap = excelTable.column("C4");
System.out.println("List of column 4 content : ");
for (Map.Entry<String, String> column : columnMap.entrySet()) {
System.out.println("Row : " + column.getKey() + ", Value : " + column.getValue());
}
// get single cell content R5C5
System.out.println("Single cell content R5C5 :");
System.out.println("R5C5 : " + excelTable.get("R5", "C5"));
// get all rows and columns
Map<String,Map<String,String>> allMap = excelTable.rowMap();
System.out.println("List of whole table : ");
for (Map.Entry<String, Map<String, String>> row : allMap.entrySet()) {
Map<String, String> colMap = row.getValue();
for (Map.Entry<String, String> column : colMap.entrySet()) {
System.out.println(row.getKey() + column.getKey() + " = " + column.getValue());
}
}
workbook.close();
}
}

Related

how to update ext list value of pptx scatter chart

Read pptx template then use new data to replace it, there is a scatter chart
associated excel data
xVal and yVal could replace successfully but how to replace C column (extList) ?
xVal and yVal replace by below manner
final CTScatterSer ser = serList.get(0);
final CTAxDataSource xVal = ser.getXVal();
final CTNumDataSource yVal = ser.getYVal();
final CTExtension ctExtension = ser.getExtLst().getExtList().get(0);
final long ptCount = xVal.getNumRef().getNumCache().getPtCount().getVal();
for (int i = 0; i < scData.size(); i++) {
SCNameDouble data = scData.get(i);
CTNumVal xNumVal = ptCount > i ? xVal.getNumRef().getNumCache().getPtArray(i)
: xVal.getNumRef().getNumCache().addNewPt();
xNumVal.setIdx(i);
xNumVal.setV(String.format("%.2f", data.xValue));
CTNumVal yNumVal = ptCount > i ? yVal.getNumRef().getNumCache().getPtArray(i)
: yVal.getNumRef().getNumCache().addNewPt();
yNumVal.setIdx(i);
yNumVal.setV(String.format("%.2f", data.yValue));
}
final int newSize = scData.size();
xVal.getNumRef().setF(
replaceRowEnd(xVal.getNumRef().getF(),
ptCount,
newSize));
yVal.getNumRef().setF(
replaceRowEnd(yVal.getNumRef().getF(),
ptCount,
newSize));
xVal.getNumRef().getNumCache().getPtCount().setVal(newSize);
yVal.getNumRef().getNumCache().getPtCount().setVal(newSize);
Using current apache poi versions one should not trying manipulating charts using the low level CT... classes. There is XDDF for such cases now.
If it comes to PowerPoint charts, then the need is always updating the data in the embedded workbook and updating the data in the chart. See Java edit bar chart in ppt by using poi for an example using bar chart.
Of course a scatter chart is another case then as it not has a category axis but has two value axes. But this also can be updated using XDDF.
The biggest problem you have is the data labels. There is not full support for chart data labels in XDDF upto now. And since you are talkig about extLst and your Excel table shows the data labels in a cell range, I suspect you have set the data labels comming from a cell range. This is a new feature which was not present when Microsoft had published Office Open XML. So not even the low level CT... classes are able to support that feature.
The only way is to manipulate the XML using pure XML manupulating based on org.apache.xmlbeans.XmlObject.
The following shows this on sample of a template you seems to use according to your question.
ScatterChartSample.pptx:
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.xddf.usermodel.chart.*;
import org.apache.poi.xssf.usermodel.*;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.ss.util.CellRangeAddress;
import org.apache.poi.ss.util.AreaReference;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTTableColumns;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTTableColumn;
public class PowerPointChangeScatterChartData {
//patched version of XSSFTable.updateHeaders, see https://stackoverflow.com/questions/55532006/renaming-headers-of-xssftable-with-apache-poi-leads-to-corrupt-xlsx-file/55539181#55539181
static void updateHeaders(XSSFTable table) {
XSSFSheet sheet = (XSSFSheet)table.getParent();
CellReference ref = table.getStartCellReference();
if (ref == null) return;
int headerRow = ref.getRow();
int firstHeaderColumn = ref.getCol();
XSSFRow row = sheet.getRow(headerRow);
DataFormatter formatter = new DataFormatter();
if (row != null /*&& row.getCTRow().validate()*/) {
int cellnum = firstHeaderColumn;
CTTableColumns ctTableColumns = table.getCTTable().getTableColumns();
if(ctTableColumns != null) {
for (CTTableColumn col : ctTableColumns.getTableColumnList()) {
XSSFCell cell = row.getCell(cellnum);
if (cell != null) {
col.setName(formatter.formatCellValue(cell));
}
cellnum++;
}
}
}
}
static void updateScatterChart(XSLFChart chart, Object[][] data) throws Exception {
// get chart's data source which is a Excel sheet
XSSFWorkbook chartDataWorkbook = chart.getWorkbook();
String sheetName = chartDataWorkbook.getSheetName(0);
XSSFSheet chartDataSheet = chartDataWorkbook.getSheet(sheetName);
// current Office uses a table as data source
// so get that table if present
XSSFTable chartDataTable = null;
if (chartDataSheet.getTables().size() > 0) {
chartDataTable = chartDataSheet.getTables().get(0);
}
if (chart.getChartSeries().size() == 1) { // we will process only one chart data
XDDFChartData chartData = chart.getChartSeries().get(0);
if (chartData.getSeriesCount() == 1) { // we will process only templates having one series
int rMin = 1; // first row (0) is headers row
int rMax = data.length - 1;
// column 0 is X-Values
int c = 0;
// set new x data
XDDFDataSource xs = null;
for (int r = rMin; r <= rMax; r++) {
XSSFRow row = chartDataSheet.getRow(r); if (row == null) row = chartDataSheet.createRow(r);
XSSFCell cell = row.getCell(c); if (cell == null) cell = row.createCell(c);
cell.setCellValue((Double)data[r][c]); // in sheet
}
xs = XDDFDataSourcesFactory.fromNumericCellRange(chartDataSheet, new CellRangeAddress(rMin,rMax,c,c)); // in chart
// set new x-title in sheet
String xTitle = (String)data[0][c];
chartDataSheet.getRow(0).getCell(c).setCellValue(xTitle); // in sheet
// column 1 is Y-Values
c = 1;
// set new y data in sheet and in chart
XDDFNumericalDataSource<Double> ys = null;
for (int r = rMin; r <= rMax; r++) {
XSSFRow row = chartDataSheet.getRow(r); if (row == null) row = chartDataSheet.createRow(r);
XSSFCell cell = row.getCell(c); if (cell == null) cell = row.createCell(c);
cell.setCellValue((Double)data[r][c]); // in sheet
}
ys = XDDFDataSourcesFactory.fromNumericCellRange(chartDataSheet, new CellRangeAddress(rMin,rMax,c,c));
XDDFChartData.Series series1 = chartData.getSeries(0);
series1.replaceData(xs, ys); // in chart
// set new y-title in sheet and in chart
String yTitle = (String)data[0][c];
chartDataSheet.getRow(0).getCell(c).setCellValue(yTitle); // in sheet
series1.setTitle(yTitle, new CellReference(sheetName, 0, c, true, true)); // in chart
series1.plot();
// column 2 is data-labels-range
c = 2;
// set new data labels data in sheet and in chart
XDDFDataSource dataLabelsRangeSource = null;
for (int r = rMin; r <= rMax; r++) {
XSSFRow row = chartDataSheet.getRow(r); if (row == null) row = chartDataSheet.createRow(r);
XSSFCell cell = row.getCell(c); if (cell == null) cell = row.createCell(c);
cell.setCellValue((String)data[r][c]); // in sheet
}
dataLabelsRangeSource = XDDFDataSourcesFactory.fromStringCellRange(chartDataSheet, new CellRangeAddress(rMin,rMax,c,c)); // in chart
updateDataLabelsRange(chart, dataLabelsRangeSource); // in chart
// set new data-labels-title in sheet
String descrTitle = (String)data[0][c];
chartDataSheet.getRow(0).getCell(c).setCellValue(descrTitle); // in sheet
// update the table if present
if (chartDataTable != null) {
CellReference topLeft = new CellReference(chartDataSheet.getRow(0).getCell(0));
CellReference bottomRight = new CellReference(chartDataSheet.getRow(rMax).getCell(c));
AreaReference tableArea = chartDataWorkbook.getCreationHelper().createAreaReference(topLeft, bottomRight);
chartDataTable.setArea(tableArea);
updateHeaders(chartDataTable);
}
}
}
}
static void updateDataLabelsRange(XDDFChart chart, XDDFDataSource dataLabelsRangeSource) {
String declareNameSpaces = "declare namespace c='http://schemas.openxmlformats.org/drawingml/2006/chart'; "
+ "declare namespace c15='http://schemas.microsoft.com/office/drawing/2012/chart' ";
org.apache.xmlbeans.XmlObject[] selectedObjects = chart.getCTChart().selectPath(
declareNameSpaces
+ ".//c:ext[c15:datalabelsRange]"); // needs net.sf.saxon - Saxon-HE (Saxon-HE-10.6.jar)
if (selectedObjects.length > 0) { // we have at least one ext containing datalabelsRange
org.apache.xmlbeans.XmlObject ext = selectedObjects[0]; // get first ext containing datalabelsRange
// get dataLabelsRange
org.apache.xmlbeans.XmlObject[] datalabelsRanges = ext.selectChildren(new javax.xml.namespace.QName("http://schemas.microsoft.com/office/drawing/2012/chart", "datalabelsRange", "c15"));
org.apache.xmlbeans.XmlObject dataLabelsRange = datalabelsRanges[0];
// set formula
org.apache.xmlbeans.XmlObject[] formulas = dataLabelsRange.selectChildren(new javax.xml.namespace.QName("http://schemas.microsoft.com/office/drawing/2012/chart", "f", "c15"));
org.apache.xmlbeans.XmlObject formula = formulas[0];
((org.apache.xmlbeans.impl.values.XmlObjectBase)formula).setStringValue(dataLabelsRangeSource.getFormula());
// get dlblRangeCache
org.apache.xmlbeans.XmlObject[] dlblRangeCaches = dataLabelsRange.selectChildren(new javax.xml.namespace.QName("http://schemas.microsoft.com/office/drawing/2012/chart", "dlblRangeCache", "c15"));
org.apache.xmlbeans.XmlObject dlblRangeCache = dlblRangeCaches[0];
// empty the cache
dlblRangeCache.newCursor().removeXmlContents();
// create new cache from dataLabelsRangeSource
org.openxmlformats.schemas.drawingml.x2006.chart.CTStrData cache = org.openxmlformats.schemas.drawingml.x2006.chart.CTStrData.Factory.newInstance();
dataLabelsRangeSource.fillStringCache(cache);
// set new cache
dlblRangeCache.set(cache);
}
}
public static void main(String[] args) throws Exception {
String filePath = "ScatterChartSample.pptx"; // has template scatter chart
String filePathNew = "ScatterChartSample_New.pptx";
Object[][] data = new Object[][] { // new data 1 series, 6 x-y-values and data labels
{"X-Values", "Y-Values", "DataLabels"}, // series title
{0.7d, 1.7d, "aa"}, // x1
{1.8d, 3.2d, "bb"}, // x2
{2.6d, 2.8d, "cc"}, // x3
{1.7d, 3.7d, "dd"}, // x4
{2.8d, 4.2d, "ee"}, // x5
{3.6d, 1.8d, "ff"} // x6
};
XMLSlideShow slideShow = new XMLSlideShow(new FileInputStream(filePath));
XSLFChart chart = slideShow.getCharts().get(0);
updateScatterChart(chart, data);
FileOutputStream out = new FileOutputStream(filePathNew);
slideShow.write(out);
out.close();
slideShow.close();
}
}
Resulting ScatterChartSample_New.pptx:
Note: Tested and works using current apache poi 5.2.0.
To be able to use XPath as .//c:ext[c15:datalabelsRange] it needs net.sf.saxon - Saxon-HE (Saxon-HE-10.6.jar in my case).
And it needs poi-ooxml-full-5.2.0.jar and not only the lite version of ooxml-schemas.

SXSSF with Excel table

I'm trying to create an Excel table in a streamed workbook (SXSSFWorkbook). It is not supported directly by the API but I've had some success by accessing the underlying XSSFWorkbook (workbook.getXSSFWorkbook).
When I open the file in Excel (2007), it says "Excel found unreadable content in "test.xlsx". Do you want to recover the contents of this workbook?". Clicking yes successfully repairs the workbook and I get the correct result.
Log says "Repaired Records: Table from /xl/tables/table1.xml part (Table)".
Anyone has an idea on how I could avoid the Excel error?
Below is an example:
public class SXSSFTest {
private static final int NB_ROWS = 5;
private static final int NB_COLS = 5;
public static void main(String[] args) throws Exception {
try (SXSSFWorkbook workbook = new SXSSFWorkbook();
FileOutputStream outputStream = new FileOutputStream("C:\\test.xlsx")) {
SXSSFSheet sheet = workbook.createSheet();
fillSheet(sheet);
String dataRange = new AreaReference(
new CellReference(0, 0),
new CellReference(NB_ROWS - 1, NB_COLS - 1))
.formatAsString();
CTTable cttable = workbook.getXSSFWorkbook()
.getSheetAt(0)
.createTable()
.getCTTable();
CTTableStyleInfo tableStyle = cttable.addNewTableStyleInfo();
tableStyle.setName("TableStyleMedium17");
cttable.setRef(dataRange);
cttable.setDisplayName("TABLE");
cttable.setName("TABLE");
cttable.setId(1L);
CTTableColumns columns = cttable.addNewTableColumns();
columns.setCount(NB_COLS);
for (int c = 0; c < NB_COLS; c++) {
CTTableColumn column = columns.addNewTableColumn();
column.setName("Column" + c);
column.setId(c + 1L);
}
cttable.setAutoFilter(CTAutoFilter.Factory.newInstance());
workbook.write(outputStream);
}
}
private static void fillSheet(SXSSFSheet sheet) {
for (int rowNb = 0; rowNb < NB_ROWS; rowNb++) {
SXSSFRow row = sheet.createRow(rowNb);
for (int colNb = 0; colNb < NB_COLS; colNb++) {
SXSSFCell cell = row.createCell(colNb);
cell.setCellValue("Cell-" + colNb);
}
}
}
}
The cell values in the first row of the table must correspond with the column names.
Your code in main method names the columns Column0 ... Column4 but your code in fillSheet method writes "Cell-0" ... "Cell-4" into the cells of first row. This does not match.
You could change the fillSheet method like this:
...
private static void fillSheet(SXSSFSheet sheet) {
for (int rowNb = 0; rowNb < NB_ROWS; rowNb++) {
SXSSFRow row = sheet.createRow(rowNb);
for (int colNb = 0; colNb < NB_COLS; colNb++) {
SXSSFCell cell = row.createCell(colNb);
if (rowNb==0) cell.setCellValue("Column" + colNb); //first row are column names
else cell.setCellValue("Cell-" + colNb);
}
}
}
...
Here's an updated version that fixes several usages of deprecated methods (tested with POI 4.1.2). Note that it doesn't require creating columns and settings IDs manually anymore, everything is done by createTable(dataRange):
import org.apache.poi.ss.SpreadsheetVersion;
import org.apache.poi.ss.util.AreaReference;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFCell;
import org.apache.poi.xssf.streaming.SXSSFRow;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.*;
import java.io.FileOutputStream;
public class SXSSFTest {
private static final int NB_ROWS = 5;
private static final int NB_COLS = 5;
public static void main(String[] args) throws Exception {
try (SXSSFWorkbook workbook = new SXSSFWorkbook();
FileOutputStream outputStream = new FileOutputStream("C:\\test.xlsx")) {
SXSSFSheet sheet = workbook.createSheet();
fillSheet(sheet);
AreaReference dataRange = new AreaReference(
new CellReference(0, 0),
new CellReference(NB_ROWS - 1, NB_COLS - 1),
SpreadsheetVersion.EXCEL2007
);
CTTable cttable = workbook.getXSSFWorkbook()
.getSheetAt(0)
.createTable(dataRange)
.getCTTable();
CTTableStyleInfo tableStyle = cttable.addNewTableStyleInfo();
tableStyle.setName("TableStyleMedium17");
cttable.setDisplayName("TABLE");
cttable.setName("TABLE");
CTTableColumns columns = cttable.getTableColumns();
for (int c = 0; c < NB_COLS; c++) {
CTTableColumn column = columns.getTableColumnArray(c);
column.setName("Column title " + c);
}
cttable.setAutoFilter(CTAutoFilter.Factory.newInstance());
workbook.write(outputStream);
}
}
private static void fillSheet(SXSSFSheet sheet) {
for (int rowNb = 0; rowNb < NB_ROWS; rowNb++) {
SXSSFRow row = sheet.createRow(rowNb);
for (int colNb = 0; colNb < NB_COLS; colNb++) {
SXSSFCell cell = row.createCell(colNb);
if (rowNb == 0) {
cell.setCellValue("Column title " + colNb); //first row are column names
} else {
cell.setCellValue("Cell-" + colNb);
}
}
}
}
}
When adding a table, make sure that the names of the column headers are not repeated. If you have duplicates column names Excel throw this error and will fix it by renaming the 2nd duplicated column by adding a number a the end. See example below:
| QTY | UNIT # | QTY2 |
See that 2nd QTY was renamed to QTY2 by excel recovery process.

Modifying .xlsx file works instantly when tested in Eclipse, but when executed as a .jar file, it takes upwards of 20 seconds

I have the three methods that take the longest and then my run() method; the rest of the code takes a negligible amount of time. I'm just confused as to why it would run instantly in Eclipse but not be able to do the same when I run it from a .jar file.
It is around 120x slower when run from the .jar file. Is it my code or is there something else going on here?
public void modifyCell(String input, int rowNum, int column, Workbook wb) throws IOException {
Sheet sheet = wb.getSheet("Sheet1");
Row row = sheet.getRow(rowNum);
Cell cell = row.getCell(column);
cell = row.createCell(column);
CellStyle cellStyle = wb.createCellStyle();
cellStyle.setAlignment(CellStyle.ALIGN_CENTER);
cell.setCellStyle(cellStyle);
cell.setCellValue(input);
FormulaEvaluator evaluator = wb.getCreationHelper().createFormulaEvaluator();
cell = row.getCell(column+1);
evaluator.evaluateFormulaCell(cell);
row = sheet.getRow(3);
cell = row.getCell(getFinalFormula(site));
evaluator.evaluateFormulaCell(cell);
evaluateTopFormulas(wb);
FileOutputStream fileOut = new FileOutputStream(workbook);
wb.write(fileOut);
fileOut.close();
}
public void evaluateTopFormulas(Workbook wb){
FormulaEvaluator evaluator = wb.getCreationHelper().createFormulaEvaluator();
Sheet sheet = wb.getSheet("Sheet1");
Row row = sheet.getRow(1);
Cell cell = row.getCell(6);
evaluator.evaluateFormulaCell(cell);
row = sheet.getRow(2);
cell = row.getCell(6);
evaluator.evaluateFormulaCell(cell);
}
public int getRow(Workbook wb, int column){ // i also think this part is very poorly written
but I don't know how to go about it differently.
All I want is to get the next blank cell in a column after the 4th cell.
long startTime = System.nanoTime();
Sheet sheet1 = wb.getSheetAt(0);
int i = 4;
Row startRow = sheet1.getRow(i);
Cell cell = startRow.getCell(column);
Row row;
try {
while (!(cell == null || cell.getCellType() == Cell.CELL_TYPE_BLANK)){
row = sheet1.getRow(i++);
cell = row.getCell(column);
}
} catch (Exception e){
}
if (i == 4){
return i;
}
long estimatedTime = System.nanoTime() - startTime;
System.out.println(estimatedTime + " getRow");
return i-1;
}
public void run(){
try {
System.out.println("Adding to Invoice");
Workbook wb = openWorkbook();
int column = getColumn(site);
int row = getRow(wb, column);
modifyCell(input, row, column, wb);
System.out.println(input + " was added to " + workbook);
} catch (InvalidFormatException | IOException e) {
System.out.println(e);
}
}

Reading style names in table cell of doc file in Apache-POI

I am able to read the table cells, But I wanted also to read the applied style name of each cell of a row in a table. How can I achieve this?
EDIT
Following is the code snip which I have tried. By this I am able to read cell text also the applied pstyle(para style), but not able to read the rstyles.
private static void processDoc(String path) throws Exception {
POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(path));
HWPFDocument wdDoc = new HWPFDocument(fis);
// list all style names and indexes in stylesheet
/*for (int j = 0; j < wdDoc.getStyleSheet().numStyles(); j++) {
if (wdDoc.getStyleSheet().getStyleDescription(j) != null) {
System.out.println(j + ": " + wdDoc.getStyleSheet().getStyleDescription(j).getName());
} else {
// getStyleDescription returned null
System.out.println(j + ": " + null);
}
}*/
// set range for entire document
Range range = wdDoc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
Paragraph p = range.getParagraph(i);
// check if style index is greater than total number of styles
if (wdDoc.getStyleSheet().numStyles() > p.getStyleIndex()) {
//System.out.println(wdDoc.getStyleSheet().numStyles() + " -> " + p.getStyleIndex());
StyleDescription style = wdDoc.getStyleSheet().getStyleDescription(p.getStyleIndex());
String styleName = style.getName();
// write style name and associated text
System.out.println(styleName + " -> " + p.text().replaceAll("[\u0000-\u001f]", ""));
} else {
System.out.println("\n" + wdDoc.getStyleSheet().numStyles() + " ----> " + p.getStyleIndex());
}
}
}

reading Excel Open XML is ignoring blank cells

I am using the accepted solution here to convert an excel sheet into a datatable. This works fine if I have "perfect" data but if I have a blank cell in the middle of my data it seems to put the wrong data in each column.
I think this is because in the below code:
row.Descendants<Cell>().Count()
is number of populated cells (not all columns) AND:
GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
seems to find the next populated cell (not necessarily what is in that index) so if the first column is empty and i call ElementAt(0), it returns the value in the second column.
Here is the full parsing code.
DataRow tempRow = dt.NewRow();
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
{
Console.Write(tempRow[i].ToString());
}
}
This makes sense since Excel will not store a value for a cell that is null. If you open your file using the Open XML SDK 2.0 Productivity Tool and traverse the XML down to the cell level you will see that only the cells that have data are going to be in that file.
Your options are to insert blank data in the range of cells you are going to traverse or programmatically figure out a cell was skipped and adjust your index appropriately.
I made an example excel document with a string in cell reference A1 and C1. I then opened up the excel document in the Open XML Productivity Tool and here is the XML that was stored:
<x:row r="1" spans="1:3"
xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:c r="A1" t="s">
<x:v>0</x:v>
</x:c>
<x:c r="C1" t="s">
<x:v>1</x:v>
</x:c>
</x:row>
Here you will see that the data corresponds to the first row and that only two cells worth of data are saved for that row. The data saved corresponds to A1 and C1 and that no cells with null values are saved.
To get the functionality that you need, you can traverse over the Cells as you are doing above, but you will need to check what the value the Cell is referencing and determine if any Cells have been skipped. to do that you will need two utility functions to get the Column Name from the cell reference and to then translate that column name into a zero based index:
private static List<char> Letters = new List<char>() { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ' ' };
/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
// Create a regular expression to match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
}
/// <summary>
/// Given just the column name (no row index), it will return the zero based column index.
/// Note: This method will only handle columns with a length of up to two (ie. A to Z and AA to ZZ).
/// A length of three can be implemented when needed.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>Zero based index if the conversion was successful; otherwise null</returns>
public static int? GetColumnIndexFromName(string columnName)
{
int? columnIndex = null;
string[] colLetters = Regex.Split(columnName, "([A-Z]+)");
colLetters = colLetters.Where(s => !string.IsNullOrEmpty(s)).ToArray();
if (colLetters.Count() <= 2)
{
int index = 0;
foreach (string col in colLetters)
{
List<char> col1 = colLetters.ElementAt(index).ToCharArray().ToList();
int? indexValue = Letters.IndexOf(col1.ElementAt(index));
if (indexValue != -1)
{
// The first letter of a two digit column needs some extra calculations
if (index == 0 && colLetters.Count() == 2)
{
columnIndex = columnIndex == null ? (indexValue + 1) * 26 : columnIndex + ((indexValue + 1) * 26);
}
else
{
columnIndex = columnIndex == null ? indexValue : columnIndex + indexValue;
}
}
index++;
}
}
return columnIndex;
}
Then you can iterate over the Cells and check to see what the cell reference is compared to the columnIndex. If it is less than then you add blank data to your tempRow, otherwise just read in the value contained in the cell. (Note: I did not test the code below, but the general idea should help):
DataRow tempRow = dt.NewRow();
int columnIndex = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
// Gets the column index of the cell with data
int cellColumnIndex = (int)GetColumnIndexFromName(GetColumnName(cell.CellReference));
if (columnIndex < cellColumnIndex)
{
do
{
tempRow[columnIndex] = //Insert blank data here;
columnIndex++;
}
while(columnIndex < cellColumnIndex);
}
tempRow[columnIndex] = GetCellValue(spreadSheetDocument, cell);
if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
{
Console.Write(tempRow[i].ToString());
}
columnIndex++;
}
Here's a slightly modified version of Waylon's answer which also relied on other answers. It encapsulates his method in a class.
I changed
IEnumerator<Cell> GetEnumerator()
to
IEnumerable<Cell> GetRowCells(Row row)
Here's the class, you don't need to instantiate it, it just serves as an utility class:
public class SpreedsheetHelper
{
///<summary>returns an empty cell when a blank cell is encountered
///</summary>
public static IEnumerable<Cell> GetRowCells(Row row)
{
int currentCount = 0;
foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
{
string columnName = GetColumnName(cell.CellReference);
int currentColumnIndex = ConvertColumnNameToNumber(columnName);
for (; currentCount < currentColumnIndex; currentCount++)
{
yield return new DocumentFormat.OpenXml.Spreadsheet.Cell();
}
yield return cell;
currentCount++;
}
}
/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
// Match the column name portion of the cell name.
var regex = new System.Text.RegularExpressions.Regex("[A-Za-z]+");
var match = regex.Match(cellReference);
return match.Value;
}
/// <summary>
/// Given just the column name (no row index),
/// it will return the zero based column index.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>Zero based index if the conversion was successful</returns>
/// <exception cref="ArgumentException">thrown if the given string
/// contains characters other than uppercase letters</exception>
public static int ConvertColumnNameToNumber(string columnName)
{
var alpha = new System.Text.RegularExpressions.Regex("^[A-Z]+$");
if (!alpha.IsMatch(columnName)) throw new ArgumentException();
char[] colLetters = columnName.ToCharArray();
Array.Reverse(colLetters);
int convertedValue = 0;
for (int i = 0; i < colLetters.Length; i++)
{
char letter = colLetters[i];
int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
convertedValue += current * (int)Math.Pow(26, i);
}
return convertedValue;
}
}
Now you're able to get all rows' cells in this way:
// skip the part that retrieves the worksheet sheetData
IEnumerable<Row> rows = sheetData.Descendants<Row>();
foreach(Row row in rows)
{
IEnumerable<Cell> cells = SpreedsheetHelper.GetRowCells(row);
foreach (Cell cell in cells)
{
// skip part that reads the text according to the cell-type
}
}
It will contain all cells even if they are empty.
Here's an implementation of IEnumerable that should do what you want, compiled and unit tested.
///<summary>returns an empty cell when a blank cell is encountered
///</summary>
public IEnumerator<Cell> GetEnumerator()
{
int currentCount = 0;
// row is a class level variable representing the current
// DocumentFormat.OpenXml.Spreadsheet.Row
foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
{
string columnName = GetColumnName(cell.CellReference);
int currentColumnIndex = ConvertColumnNameToNumber(columnName);
for ( ; currentCount < currentColumnIndex; currentCount++)
{
yield return new DocumentFormat.OpenXml.Spreadsheet.Cell();
}
yield return cell;
currentCount++;
}
}
Here are the functions it relies on:
/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
// Match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
}
/// <summary>
/// Given just the column name (no row index),
/// it will return the zero based column index.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>Zero based index if the conversion was successful</returns>
/// <exception cref="ArgumentException">thrown if the given string
/// contains characters other than uppercase letters</exception>
public static int ConvertColumnNameToNumber(string columnName)
{
Regex alpha = new Regex("^[A-Z]+$");
if (!alpha.IsMatch(columnName)) throw new ArgumentException();
char[] colLetters = columnName.ToCharArray();
Array.Reverse(colLetters);
int convertedValue = 0;
for (int i = 0; i < colLetters.Length; i++)
{
char letter = colLetters[i];
int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
convertedValue += current * (int)Math.Pow(26, i);
}
return convertedValue;
}
Throw it in a class and give it a try.
See my implementation:
Row[] rows = worksheet.GetFirstChild<SheetData>()
.Elements<Row>()
.ToArray();
string[] columnNames = rows.First()
.Elements<Cell>()
.Select(cell => GetCellValue(cell, document))
.ToArray();
HeaderLetters = ExcelHeaderHelper.GetHeaderLetters((uint)columnNames.Count());
if (columnNames.Count() != HeaderLetters.Count())
{
throw new ArgumentException("HeaderLetters");
}
IEnumerable<List<string>> cellValues = GetCellValues(rows.Skip(1), columnNames.Count(), document);
//Here you can enumerate through the cell values, based on the cell index the column names can be retrieved.
HeaderLetters are collected using this class:
private static class ExcelHeaderHelper
{
public static string[] GetHeaderLetters(uint max)
{
var result = new List<string>();
int i = 0;
var columnPrefix = new Queue<string>();
string prefix = null;
int prevRoundNo = 0;
uint maxPrefix = max / 26;
while (i < max)
{
int roundNo = i / 26;
if (prevRoundNo < roundNo)
{
prefix = columnPrefix.Dequeue();
prevRoundNo = roundNo;
}
string item = prefix + ((char)(65 + (i % 26))).ToString(CultureInfo.InvariantCulture);
if (i <= maxPrefix)
{
columnPrefix.Enqueue(item);
}
result.Add(item);
i++;
}
return result.ToArray();
}
}
And the helper methods are:
private static IEnumerable<List<string>> GetCellValues(IEnumerable<Row> rows, int columnCount, SpreadsheetDocument document)
{
var result = new List<List<string>>();
foreach (var row in rows)
{
List<string> cellValues = new List<string>();
var actualCells = row.Elements<Cell>().ToArray();
int j = 0;
for (int i = 0; i < columnCount; i++)
{
if (actualCells.Count() <= j || !actualCells[j].CellReference.ToString().StartsWith(HeaderLetters[i]))
{
cellValues.Add(null);
}
else
{
cellValues.Add(GetCellValue(actualCells[j], document));
j++;
}
}
result.Add(cellValues);
}
return result;
}
private static string GetCellValue(Cell cell, SpreadsheetDocument document)
{
bool sstIndexedcell = GetCellType(cell);
return sstIndexedcell
? GetSharedStringItemById(document.WorkbookPart, Convert.ToInt32(cell.InnerText))
: cell.InnerText;
}
private static bool GetCellType(Cell cell)
{
return cell.DataType != null && cell.DataType == CellValues.SharedString;
}
private static string GetSharedStringItemById(WorkbookPart workbookPart, int id)
{
return workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(id).InnerText;
}
The solution deals with shared cell items (SST indexed cells).
All good examples. Here is the one I am using since I need to keep track of all rows, cells, values, and titles for correlation and analysis.
The method ReadSpreadsheet opens an xlxs file and goes through each worksheet, row, and column. Since the values are stored in a referenced string table, I also explicitly use that per worksheet. There are other classes used: DSFunction and StaticVariables. The latter holds oft used parameter values, such as the referenced 'quotdouble' ( quotdouble = "\u0022"; ) and 'crlf' (crlf = "\u000D" + "\u000A"; ).
The relevant DSFunction method GetIntColIndexForLetter is included below. It returns an integer value for the column index corresponding to letter names such as (A,B, AA, ADE, etc.). This is used along with the parameter 'ncellcolref' to determine if any columns have been skipped and to enter empty string values for each one that is missing.
I also do some cleaning of the values before storing temporarily in a List object (using Replace method).
Subsequently, I use the hash table (Dictionary) of column names to extract values across different worksheets, correlate them, create normalized values, and then create an object used in our product which is then stored as an XML file. None of this is shown but is why this approach is used.
public static class DSFunction {
/// <summary>
/// Creates an integer value for a column letter name starting at 1 for 'a'
/// </summary>
/// <param name="lettstr">Column name as letters</param>
/// <returns>int value</returns>
public static int GetIntColIndexForLetter(string lettstr) {
string txt = "", txt1="";
int n1, result = 0, nbeg=-1, nitem=0;
try {
nbeg = (int)("a".ToCharArray()[0]) - 1; //1 based
txt = lettstr;
if (txt != "") txt = txt.ToLower().Trim();
while (txt != "") {
if (txt.Length > 1) {
txt1 = txt.Substring(0, 1);
txt = txt.Substring(1);
}
else {
txt1 = txt;
txt = "";
}
if (!DSFunction.IsNumberString(txt1, "real")) {
nitem++;
n1 = (int)(txt1.ToCharArray()[0]) - nbeg;
result += n1 + (nitem - 1) * 26;
}
else {
break;
}
}
}
catch (Exception ex) {
txt = ex.Message;
}
return result;
}
}
public static class Extractor {
public static string ReadSpreadsheet(string fileUri) {
string msg = "", txt = "", txt1 = "";
int i, n1, n2, nrow = -1, ncell = -1, ncellcolref = -1;
Boolean haveheader = true;
Dictionary<string, int> hashcolnames = new Dictionary<string, int>();
List<string> colvalues = new List<string>();
try {
if (!File.Exists(fileUri)) { throw new Exception("file does not exist"); }
using (SpreadsheetDocument ssdoc = SpreadsheetDocument.Open(fileUri, true)) {
var stringTable = ssdoc.WorkbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
foreach (Sheet sht in ssdoc.WorkbookPart.Workbook.Descendants<Sheet>()) {
nrow = 0;
foreach (Row ssrow in ((WorksheetPart)(ssdoc.WorkbookPart.GetPartById(sht.Id))).Worksheet.Descendants<Row>()) {
ncell = 0;
ncellcolref = 0;
nrow++;
colvalues.Clear();
foreach (Cell sscell in ssrow.Elements<Cell>()) {
ncell++;
n1 = DSFunction.GetIntColIndexForLetter(sscell.CellReference);
for (i = 0; i < (n1 - ncellcolref - 1); i++) {
if (nrow == 1 && haveheader) {
txt1 = "-missing" + (ncellcolref + 1 + i).ToString() + "-";
if (!hashcolnames.TryGetValue(txt1, out n2)) {
hashcolnames.Add(txt1, ncell - 1);
}
}
else {
colvalues.Add("");
}
}
ncellcolref = n1;
if (sscell.DataType != null) {
if (sscell.DataType.Value == CellValues.SharedString && stringTable != null) {
txt = stringTable.SharedStringTable.ElementAt(int.Parse(sscell.InnerText)).InnerText;
}
else if (sscell.DataType.Value == CellValues.String) {
txt = sscell.InnerText;
}
else txt = sscell.InnerText.ToString();
}
else txt = sscell.InnerText;
if (txt != "") txt1 = txt.ToLower().Trim(); else txt1 = "";
if (nrow == 1 && haveheader) {
txt1 = txt1.Replace(" ", "");
if (txt1 == "table/viewname") txt1 = "tablename";
else if (txt1 == "schemaownername") txt1 = "schemaowner";
else if (txt1 == "subjectareaname") txt1 = "subjectarea";
else if (txt1.StartsWith("column")) {
txt1 = txt1.Substring("column".Length);
}
if (!hashcolnames.TryGetValue(txt1, out n1)) {
hashcolnames.Add(txt1, ncell - 1);
}
}
else {
txt = txt.Replace(((char)8220).ToString(), "'"); //special "
txt = txt.Replace(((char)8221).ToString(), "'"); //special "
txt = txt.Replace(StaticVariables.quotdouble, "'");
txt = txt.Replace(StaticVariables.crlf, " ");
txt = txt.Replace(" ", " ");
txt = txt.Replace("<", "");
txt = txt.Replace(">", "");
colvalues.Add(txt);
}
}
}
}
}
}
catch (Exception ex) {
msg = "notok:" + ex.Message;
}
return msg;
}
}
The letter code is a base 26 encoding so this should work to convert it into an offset.
// Converts letter code (i.e. AA) to an offset
public int offset( string code)
{
var offset = 0;
var byte_array = Encoding.ASCII.GetBytes( code ).Reverse().ToArray();
for( var i = 0; i < byte_array.Length; i++ )
{
offset += (byte_array[i] - 65 + 1) * Convert.ToInt32(Math.Pow(26.0, Convert.ToDouble(i)));
}
return offset - 1;
}
You can use this function to extract a cell from a row passing the header index:
public static Cell GetCellFromRow(Row r ,int headerIdx) {
string cellname = GetNthColumnName(headerIdx) + r.RowIndex.ToString();
IEnumerable<Cell> cells = r.Elements<Cell>().Where(x=> x.CellReference == cellname);
if (cells.Count() > 0)
{
return cells.First();
}
else {
return null;
}
}
public static string GetNthColumnName(int n)
{
string name = "";
while (n > 0)
{
n--;
name = (char)('A' + n % 26) + name;
n /= 26;
}
return name;
}
Okay, I'm not exactly an expert on this but the other answers do seem like over kill to me so here's my solution:
// Loop through each row in the spreadsheet, skipping the header row
foreach (var row in sheetData.Elements<Row>().Skip(1))
{
var i = 0;
string[] letters = new string[15] {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O" };
List<String> cellsList = new List<string>();
foreach (var cell in row.Elements<Cell>().ToArray())
{
while (cell.CellReference.ToString()[0] != Convert.ToChar(letters[i]))
{//accounts for multiple consecutive blank cells
cellsList.Add("");
i++;
}
cellsList.Add(cell.CellValue.Text);
i++;
}
string[] cells = cellsList.ToArray();
foreach(var cell in cellsList)
{
//display contents of cell, depending on the datatype you may need to call each of the cells manually
}
}
Hope someone finds this useful!
With apologies for posting yet another answer to this question, here's the code I used.
I was having problems with OpenXML not working properly if a worksheet had a blank row at the top. It would sometimes just return a DataTable with 0 rows and 0 columns in it. The code below copes with this, and all other worksheets.
Here's how you would call my code. Just pass in a filename and the name of the Worksheet to read in:
DataTable dt = OpenXMLHelper.ExcelWorksheetToDataTable("C:\\SQL Server\\SomeExcelFile.xlsx", "Mikes Worksheet");
And here's the code itself:
public class OpenXMLHelper
{
// A helper function to open an Excel file using OpenXML, and return a DataTable containing all the data from one
// of the worksheets.
//
// We've had lots of problems reading in Excel data using OLEDB (eg the ACE drivers no longer being present on new servers,
// OLEDB not working due to security issues, and blatantly ignoring blank rows at the top of worksheets), so this is a more
// stable method of reading in the data.
//
public static DataTable ExcelWorksheetToDataTable(string pathFilename, string worksheetName)
{
DataTable dt = new DataTable(worksheetName);
using (SpreadsheetDocument document = SpreadsheetDocument.Open(pathFilename, false))
{
// Find the sheet with the supplied name, and then use that
// Sheet object to retrieve a reference to the first worksheet.
Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName).FirstOrDefault();
if (theSheet == null)
throw new Exception("Couldn't find the worksheet: " + worksheetName);
// Retrieve a reference to the worksheet part.
WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
Worksheet workSheet = wsPart.Worksheet;
string dimensions = workSheet.SheetDimension.Reference.InnerText; // Get the dimensions of this worksheet, eg "B2:F4"
int numOfColumns = 0;
int numOfRows = 0;
CalculateDataTableSize(dimensions, ref numOfColumns, ref numOfRows);
System.Diagnostics.Trace.WriteLine(string.Format("The worksheet \"{0}\" has dimensions \"{1}\", so we need a DataTable of size {2}x{3}.", worksheetName, dimensions, numOfColumns, numOfRows));
SheetData sheetData = workSheet.GetFirstChild<SheetData>();
IEnumerable<Row> rows = sheetData.Descendants<Row>();
string[,] cellValues = new string[numOfColumns, numOfRows];
int colInx = 0;
int rowInx = 0;
string value = "";
SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
// Iterate through each row of OpenXML data, and store each cell's value in the appropriate slot in our [,] string array.
foreach (Row row in rows)
{
for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
// *DON'T* assume there's going to be one XML element for each column in each row...
Cell cell = row.Descendants<Cell>().ElementAt(i);
if (cell.CellValue == null || cell.CellReference == null)
continue; // eg when an Excel cell contains a blank string
// Convert this Excel cell's CellAddress into a 0-based offset into our array (eg "G13" -> [6, 12])
colInx = GetColumnIndexByName(cell.CellReference); // eg "C" -> 2 (0-based)
rowInx = GetRowIndexFromCellAddress(cell.CellReference)-1; // Needs to be 0-based
// Fetch the value in this cell
value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
value = stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
cellValues[colInx, rowInx] = value;
}
}
// Copy the array of strings into a DataTable.
// We don't (currently) make any attempt to work out which columns should be numeric, rather than string.
for (int col = 0; col < numOfColumns; col++)
dt.Columns.Add("Column_" + col.ToString());
for (int row = 0; row < numOfRows; row++)
{
DataRow dataRow = dt.NewRow();
for (int col = 0; col < numOfColumns; col++)
{
dataRow.SetField(col, cellValues[col, row]);
}
dt.Rows.Add(dataRow);
}
#if DEBUG
// Write out the contents of our DataTable to the Output window (for debugging)
string str = "";
for (rowInx = 0; rowInx < maxNumOfRows; rowInx++)
{
for (colInx = 0; colInx < maxNumOfColumns; colInx++)
{
object val = dt.Rows[rowInx].ItemArray[colInx];
str += (val == null) ? "" : val.ToString();
str += "\t";
}
str += "\n";
}
System.Diagnostics.Trace.WriteLine(str);
#endif
return dt;
}
}
private static void CalculateDataTableSize(string dimensions, ref int numOfColumns, ref int numOfRows)
{
// How many columns & rows of data does this Worksheet contain ?
// We'll read in the Dimensions string from the Excel file, and calculate the size based on that.
// eg "B1:F4" -> we'll need 6 columns and 4 rows.
//
// (We deliberately ignore the top-left cell address, and just use the bottom-right cell address.)
try
{
string[] parts = dimensions.Split(':'); // eg "B1:F4"
if (parts.Length != 2)
throw new Exception("Couldn't find exactly *two* CellAddresses in the dimension");
numOfColumns = 1 + GetColumnIndexByName(parts[1]); // A=1, B=2, C=3 (1-based value), so F4 would return 6 columns
numOfRows = GetRowIndexFromCellAddress(parts[1]);
}
catch
{
throw new Exception("Could not calculate maximum DataTable size from the worksheet dimension: " + dimensions);
}
}
public static int GetRowIndexFromCellAddress(string cellAddress)
{
// Convert an Excel CellReference column into a 1-based row index
// eg "D42" -> 42
// "F123" -> 123
string rowNumber = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^0-9 _]", "");
return int.Parse(rowNumber);
}
public static int GetColumnIndexByName(string cellAddress)
{
// Convert an Excel CellReference column into a 0-based column index
// eg "D42" -> 3
// "F123" -> 5
var columnName = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^A-Z_]", "");
int number = 0, pow = 1;
for (int i = columnName.Length - 1; i >= 0; i--)
{
number += (columnName[i] - 'A' + 1) * pow;
pow *= 26;
}
return number - 1;
}
}
Using ClosedXML.Excel Instead of OpenXML:
public DataTable ImportTable(DataTable dt, string FileName)
{
Statics.currentProgressValue = 0;
Statics.maxProgressValue = 100;
Statics.cancelProgress = false;
try
{
bool fileExist = File.Exists(FileName);
if (fileExist)
{
using (XLWorkbook workBook = new XLWorkbook(FileName))
{
IXLWorksheet workSheet = workBook.Worksheet(1);
var rowCount = workSheet.RangeUsed().RowCount();
if (rowCount > 0)
{
var colCount = workSheet.Row(1).CellsUsed().Count();
if (dt.Columns.Count < colCount)
throw new Exception($"Expects at least {dt.Columns.Count} columns.");
//Loop through the Worksheet rows.
Statics.maxProgressValue = rowCount;
for (int i = 1; i < rowCount; i++)
{
Statics.currentProgressValue += 1;
dt.Rows.Add();
for (int j = 2; j < dt.Columns.Count; j++)
{
var cell = (workSheet.Rows().ElementAt(i).Cell(j));
if (!string.IsNullOrEmpty(cell.Value.ToString()))
dt.Rows[i - 1][j] = cell.Value.ToString().Trim();
else
dt.Rows[i - 1][j] = "";
}
if (Statics.cancelProgress == true)
break;
}
}
return dt;
}
}
}
catch (Exception ex)
{
Statics.cancelProgress = true;
throw new Exception("Error exporting data." +
Environment.NewLine + ex.Message);
}
return dt;
}
I can't resist optimizing the subroutines from Amurra's answer to remove need for Regex's.
The first function isn't actually needed since the second one can will accept a cell reference (C3) or a column name (C) (but still a nice helper function). The indices are also one-based (only because our implementation used one-based for the rows to match visually with Excel).
/// <summary>
/// Given a cell name, return the cell column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
/// <exception cref="ArgumentOutOfRangeException">cellReference</exception>
public static string GetColumnName(string cellReference)
{
// Advance from L to R until a number, then return 0 through previous position
//
for (int lastCharPos = 0; lastCharPos <= 3; lastCharPos++)
if (Char.IsNumber(cellReference[lastCharPos]))
return cellReference.Substring(0, lastCharPos);
throw new ArgumentOutOfRangeException("cellReference");
}
/// <summary>
/// Return one-based column index given a cell name or column name
/// </summary>
/// <param name="columnNameOrCellReference">Column Name (ie. A, AB3, or AB44)</param>
/// <returns>One based index if the conversion was successful; otherwise null</returns>
public static int GetColumnIndexFromName(string columnNameOrCellReference)
{
int columnIndex = 0;
int factor = 1;
for (int pos = columnNameOrCellReference.Length - 1; pos >= 0; pos--) // R to L
{
if (Char.IsLetter(columnNameOrCellReference[pos])) // for letters (columnName)
{
columnIndex += factor * ((columnNameOrCellReference[pos] - 'A') + 1);
factor *= 26;
}
}
return columnIndex;
}
Added yet another implementation, this time where the number of columns is known in advance:
/// <summary>
/// Gets a list cells that are padded with empty cells where necessary.
/// </summary>
/// <param name="numberOfColumns">The number of columns expected.</param>
/// <param name="cells">The cells.</param>
/// <returns>List of padded cells</returns>
private static IList<Cell> GetPaddedCells(int numberOfColumns, IList<Cell> cells)
{
// Only perform the padding operation if existing column count is less than required
if (cells.Count < numberOfColumns - 1)
{
IList<Cell> padded = new List<Cell>();
int cellIndex = 0;
for (int paddedIndex = 0; paddedIndex < numberOfColumns; paddedIndex++)
{
if (cellIndex < cells.Count)
{
// Grab column reference (ignore row) <seealso cref="https://stackoverflow.com/a/7316298/674776"/>
string columnReference = new string(cells[cellIndex].CellReference.ToString().Where(char.IsLetter).ToArray());
// Convert reference to index <seealso cref="https://stackoverflow.com/a/848552/674776"/>
int indexOfReference = columnReference.ToUpper().Aggregate(0, (column, letter) => (26 * column) + letter - 'A' + 1) - 1;
// Add padding cells where current cell index is less than required
while (indexOfReference > paddedIndex)
{
padded.Add(new Cell());
paddedIndex++;
}
padded.Add(cells[cellIndex++]);
}
else
{
// Add padding cells when passed existing cells
padded.Add(new Cell());
}
}
return padded;
}
else
{
return cells;
}
}
Call using:
IList<Cell> cells = GetPaddedCells(38, row.Descendants<Cell>().ToList());
Where 38 is the required number of columns.
To read blank cells, I am using a variable named "CN" assigned outside the row reader and in while loop, I am checking if column index is greater than or not from my variable as it is being incremented after each cell read. if this does not match, I am filling my column with value I want to. This is the trick I used to catch up the blank cells into my respecting column value. Here is the code:
public static DataTable ReadIntoDatatableFromExcel(string newFilePath)
{
/*Creating a table with 20 columns*/
var dt = CreateProviderRvenueSharingTable();
try
{
/*using stream so that if excel file is in another process then it can read without error*/
using (Stream stream = new FileStream(newFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(stream, false))
{
var workbookPart = spreadsheetDocument.WorkbookPart;
var workbook = workbookPart.Workbook;
/*get only unhide tabs*/
var sheets = workbook.Descendants<Sheet>().Where(e => e.State == null);
foreach (var sheet in sheets)
{
var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id);
/*Remove empty sheets*/
List<Row> rows = worksheetPart.Worksheet.Elements<SheetData>().First().Elements<Row>()
.Where(r => r.InnerText != string.Empty).ToList();
if (rows.Count > 1)
{
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
int i = 0;
int BTR = 0;/*Break the reader while empty rows are found*/
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
/*ignoring first row with headers and check if data is there after header*/
if (i < 2)
{
i++;
continue;
}
reader.ReadFirstChild();
DataRow row = dt.NewRow();
int CN = 0;
if (reader.ElementType == typeof(Cell))
{
do
{
Cell c = (Cell)reader.LoadCurrentElement();
/*reader skipping blank cells so data is getting worng in datatable's rows according to header*/
if (CN != 0)
{
int cellColumnIndex =
ExcelHelper.GetColumnIndexFromName(
ExcelHelper.GetColumnName(c.CellReference));
if (cellColumnIndex < 20 && CN < cellColumnIndex - 1)
{
do
{
row[CN] = string.Empty;
CN++;
} while (CN < cellColumnIndex - 1);
}
}
/*stopping execution if first cell does not have any value which means empty row*/
if (CN == 0 && c.DataType == null && c.CellValue == null)
{
BTR++;
break;
}
string cellValue = GetCellValue(c, workbookPart);
row[CN] = cellValue;
CN++;
/*if any text exists after T column (index 20) then skip the reader*/
if (CN == 20)
{
break;
}
} while (reader.ReadNextSibling());
}
/*reader skipping blank cells so fill the array upto 19 index*/
while (CN != 0 && CN < 20)
{
row[CN] = string.Empty;
CN++;
}
if (CN == 20)
{
dt.Rows.Add(row);
}
}
/*escaping empty rows below data filled rows after checking 5 times */
if (BTR > 5)
break;
}
reader.Close();
}
}
}
}
}
catch (Exception ex)
{
throw ex;
}
return dt;
}
private static string GetCellValue(Cell c, WorkbookPart workbookPart)
{
string cellValue = string.Empty;
if (c.DataType != null && c.DataType == CellValues.SharedString)
{
SharedStringItem ssi =
workbookPart.SharedStringTablePart.SharedStringTable
.Elements<SharedStringItem>()
.ElementAt(int.Parse(c.CellValue.InnerText));
if (ssi.Text != null)
{
cellValue = ssi.Text.Text;
}
}
else
{
if (c.CellValue != null)
{
cellValue = c.CellValue.InnerText;
}
}
return cellValue;
}
public static int GetColumnIndexFromName(string columnNameOrCellReference)
{
int columnIndex = 0;
int factor = 1;
for (int pos = columnNameOrCellReference.Length - 1; pos >= 0; pos--) // R to L
{
if (Char.IsLetter(columnNameOrCellReference[pos])) // for letters (columnName)
{
columnIndex += factor * ((columnNameOrCellReference[pos] - 'A') + 1);
factor *= 26;
}
}
return columnIndex;
}
public static string GetColumnName(string cellReference)
{
/* Advance from L to R until a number, then return 0 through previous position*/
for (int lastCharPos = 0; lastCharPos <= 3; lastCharPos++)
if (Char.IsNumber(cellReference[lastCharPos]))
return cellReference.Substring(0, lastCharPos);
throw new ArgumentOutOfRangeException("cellReference");
}
Code works for:
This code reads blank cells
skip empty rows after reading complete.
read the sheet from first in ascending order
if excel file is being used by another process, OpenXML still reads that.
Here is my solution. I found the above didn't seem to work well when the missing fields where at the end of a row.
Assuming the first row in the Excel sheet has ALL the columns (via headers), then grab the number of columns expected per row (row == 1). Then loop through the data rows (row > 1). The key to processing the missing cells is in method getRowCells, where the known number of column cells is passed in as well as the current row to process.
int columnCount = worksheetPart.Worksheet.Descendants<Row>().Where(r => r.RowIndex == 1).FirstOrDefault().Descendants<Cell>().Count();
IEnumerable<Row> rows = worksheetPart.Worksheet.Descendants<Row>().Where(r => r.RowIndex > 1);
List<List<string>> docData = new List<List<string>>();
foreach (Row row in rows)
{
List<Cell> cells = getRowCells(columnCount, row);
List<string> rowData = new List<string>();
foreach (Cell cell in cells)
{
rowData.Add(getCellValue(workbookPart, cell));
}
docData.Add(rowData);
}
Method getRowCells has a current limitation of only being able to support a sheet (row) that has less an 26 columns. A loop based on the known column count is used to find missing columns (cells). If found, a new Cell value is inserted into the cells collection, with the new Cell having a default value of "" instead of 'null'. The modified Cell collection is then returned.
private static List<Cell> getRowCells(int columnCount, Row row)
{
const string COLUMN_LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
if (columnCount > COLUMN_LETTERS.Length)
{
throw new ArgumentException(string.Format("Invalid columnCount ({0}). Cannot be greater than {1}",
columnCount, COLUMN_LETTERS.Length));
}
List<Cell> cells = row.Descendants<Cell>().ToList();
for (int i = 0; i < columnCount; i++)
{
if (i < cells.Count)
{
string cellColumnReference = cells.ElementAt(i).CellReference.ToString();
if (cellColumnReference[0] != COLUMN_LETTERS[i])
{
cells.Insert(i, new Cell() { CellValue = new CellValue("") }); }
}
else
{
cells.Insert(i, new Cell() { CellValue = new CellValue("") });
}
}
return cells;
}
private static string getCellValue(WorkbookPart workbookPart, Cell cell)
{
SharedStringTablePart stringTablePart = workbookPart.SharedStringTablePart;
string value = (cell.CellValue != null) ? cell.CellValue.InnerXml : string.Empty;
if ((cell.DataType != null) && (cell.DataType.Value == CellValues.SharedString))
{
return stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
}
else
{
return value;
}
}
it run success with this code:
string filePath = "test.xlsx"//your file path
//Open the Excel file using ClosedXML.
using (XLWorkbook workBook = new XLWorkbook(filePath))
{
//Read the first Sheet from Excel file.
IXLWorksheet workSheet = workBook.Worksheet(1);
//Create a new DataTable.
DataTable dt = new DataTable();
//Loop through the Worksheet rows.
bool firstRow = true;
foreach (IXLRow row in workSheet.Rows())
{
//Use the first row to add columns to DataTable.
if (firstRow)
{
foreach (IXLCell cell in row.Cells())
{
dt.Columns.Add(cell.Value.ToString());
}
firstRow = false;
}
else
{
//Add rows to DataTable.
dt.Rows.Add();
int i = 0;
//for (IXLCell cell in row.Cells())
for (int j = 1; j <= dt.Columns.Count; j++)
{
if (string.IsNullOrEmpty(row.Cell(j).Value.ToString()))
dt.Rows[dt.Rows.Count - 1][i] = "";
else
dt.Rows[dt.Rows.Count - 1][i] =
row.Cell(j).Value.ToString();
i++;
}
}
}
}

Resources