Reading Excel File from Google Cloud Storage and converting to CSV - apache-poi

I have successfully implemented reading an Excel file and converting it to CSV using the Apache POI library (with the file in a local Windows folder).
In my current scenario, I have to read the file from a Cloud Storage bucket (gs://xyz/abc.xlsx), convert it to CSV, and store the result back in GCS.
Is there any way by which I can make my code identify the file present in Google Cloud?
Any leads will be appreciated.
There are Java APIs for communicating with GCS, but I can't find a way to pass the path of a GCS file to the parameterized constructor of 'XSSFWorkbook'. Below is the snippet I'm using to convert an Excel file to CSV.
Workbook wb = new XSSFWorkbook(new File("C:\\Users\\balajeev\\Desktop\\abc.xlsx"));
DataFormatter formatter = new DataFormatter();
PrintStream out = new PrintStream(new FileOutputStream("C:\\Users\\balajeev\\Desktop\\abc.csv"), true);
for (Sheet sheet : wb) {
    for (Row row : sheet) {
        boolean firstCell = true;
        for (Cell cell : row) {
            if (!firstCell) out.print(',');
            String text = formatter.formatCellValue(cell);
            out.print(text);
            firstCell = false;
        }
        out.println();
    }
}
How can I make this code read a file from GCS, convert it to CSV, and then write the converted file back to GCS?
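One possible direction, sketched under assumptions: XSSFWorkbook also has a constructor that accepts an InputStream, and PrintStream wraps any OutputStream, so neither end of the conversion needs a local path. As far as I know, the google-cloud-storage Java client exposes objects as NIO channels (Storage.reader(...) for reading and Storage.writer(...) for writing; please verify against the client documentation), and java.nio.channels.Channels adapts channels to streams. The sketch below uses only the JDK. ChannelBridgeDemo and its methods are hypothetical names, and in-memory channels stand in for the GCS ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: in-memory channels stand in for the ones a storage client would return. */
final class ChannelBridgeDemo {

    /** Adapts a readable channel to an InputStream, the type stream-based parsers accept. */
    static InputStream asInput(ReadableByteChannel source) {
        return Channels.newInputStream(source);
    }

    /** Adapts a writable channel to a PrintStream so existing out.print(...) calls keep working. */
    static PrintStream asOutput(WritableByteChannel target) {
        return new PrintStream(Channels.newOutputStream(target), true, StandardCharsets.UTF_8);
    }

    /** Pushes text through a source channel and a target channel and returns what arrived. */
    static String roundTrip(String text) {
        ReadableByteChannel source =
                Channels.newChannel(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        WritableByteChannel target = Channels.newChannel(sink);
        try (InputStream in = asInput(source); PrintStream out = asOutput(target)) {
            // In the real conversion, `new XSSFWorkbook(in)` and the sheet/row/cell loop
            // would replace this plain copy.
            out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("a,b\n1,2\n"));
    }
}
```

With the real client, the two stand-in channels would be replaced by the channels obtained for gs://xyz/abc.xlsx and for the target CSV object; the conversion loop itself would stay as it is.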

Related

How do I add a script to a Google Sheet to automatically take data from an XLSX file on Google Drive?

I'm trying to automatically update a Google Sheet from a separate XLSX file. The XLSX file gets updated regularly, but I need to do some data cleaning. I tried QUERY and IMPORTRANGE, neither of which can get data from an XLSX file.
It seems like I need to write a script on the Google Sheet to automatically take the data from the xlsx. Where do I add this, and how would I go about getting started? I have access to both files, so permissions shouldn't be an issue.
Suggestion: Temporarily Convert the Excel File to Google Sheets File to Extract Data
Unfortunately, there is no direct way to extract data from Excel files into Google Sheets using Google Apps Script. As a workaround, first convert your Excel file to a Google Sheets file, then copy the data from the converted file into your output Google Sheets file. You may use the following script as a basis for yours:
function importData() {
  var xlsxName = "Test 1.xlsx"; // Change source file name accordingly
  var convertID = convert(xlsxName).toString();
  var xLSX = SpreadsheetApp.openById(convertID).getSheetByName("Input");
  var ss = SpreadsheetApp.openById("<output Sheet ID>").getSheetByName("Output"); // Change output sheet ID
  var lastColumn = xLSX.getLastColumn();
  var lastRow = xLSX.getLastRow();
  // Copy the values from the converted xlsx data to the output sheet
  ss.getRange(1, 1, lastRow, lastColumn).setValues(xLSX.getDataRange().getValues());
  DriveApp.getFileById(convertID).setTrashed(true); // Moves the temporary file to the trash
}
function convert(excelFileName) {
  var files = DriveApp.getFilesByName(excelFileName);
  if (!files.hasNext()) {
    // Fail early instead of calling getBlob() on null
    throw new Error("File not found: " + excelFileName);
  }
  var excelFile = files.next();
  var blob = excelFile.getBlob();
  var config = {
    title: "[Converted File] " + excelFile.getName(), // Sets the title of the converted file
    parents: [{ id: excelFile.getParents().next().getId() }],
    mimeType: MimeType.GOOGLE_SHEETS
  };
  // Drive.Files.insert comes from the Drive advanced service (Drive API v2)
  var spreadsheet = Drive.Files.insert(config, blob);
  return spreadsheet.id; // Returns the ID of the converted file
}
This script involves:
Converting the Excel file to a temporary Google Sheets file.
Importing the data from the temporary Google Sheets file to the desired/output Google Sheets file.
Deleting the temporary Google Sheets file.
NOTE:
Expect a longer runtime when applying this script to a larger Excel file.
You may modify the script to be suitable for your current issue.
The script should be added to your desired output Google Sheets.
Do not forget to add the Drive API service to your script.

Convert array to blob, blob to legacy XLS file

I have an apps script that generates a 2D array. I would like to export this array to a folder on my Google Drive in legacy .XLS format, ideally without first creating a Google Sheet and then converting that sheet.
I thought I could turn my array into a CSV string and convert that to blob with the appropriate MimeType, and save that in Drive.
However, when I download the file from Drive and open it, the values aren't separated (tried "," and ";" as delimiter).
My script is below, with a simplified array as an example.
function createXls() {
  var data = [["a","b","c"],["d","e","f"]];
  var csvString = toCsv(data);
  var xlsName = "here goes the filename";
  var driveFolder = DriveApp.getFolderById("hereGoesTheFolderId");
  var blob = Utilities.newBlob(csvString, MimeType.MICROSOFT_EXCEL_LEGACY);
  blob.setName(xlsName + ".xls");
  driveFolder.createFile(blob);
}
function toCsv(arr) {
  return arr.map(row =>
    row.map(val => val).join(';')
  ).join('\n');
}
Am I missing something here, or is there no way around putting the data in a sheet first and converting that sheet to xls?
Thank you!

Reading only the visible sheets of an xlsb file using XSSFBReader in Apache POI

I am reading an xlsb file using the XSSFBReader class from the Apache POI library. I cannot find a way to read only the visible sheets of the file. Currently, the code reads every sheet in the xlsb file.
How can I read only the visible sheets using XSSFBReader?
pkg = OPCPackage.open(filename, PackageAccess.READ_WRITE);
XSSFBReader r = new XSSFBReader(pkg);
XSSFBReader.SheetIterator it = (XSSFBReader.SheetIterator) r.getSheetsData();
while (it.hasNext()) { // This iterates over all sheets of the xlsb file
    InputStream is = it.next();
    String sheetName = it.getSheetName();
    // some processing
}

CSV generation possible with Apache POI?

I need to generate CSV files, and I stumbled on a module in our own project that already uses Apache POI to generate Excel sheets, so I thought I could use the same library to generate CSV. I searched Google but couldn't find anything that says for sure that Apache POI can be used for CSV file generation. I also checked the following API, and it only talks about xls sheets, not CSV. Any ideas?
http://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/Workbook.html
Apache POI will not output to CSV for you. However, you have a couple of good options, depending on what kind of data you are writing into the CSV.
If you know that none of your cells will contain csv markers such as commas, quotes, line endings, then you can loop through your data rows and copy the text into a StringBuffer and send that to regular java IO.
Here is an example of writing an sql query to csv along those lines: Poi Mailing List: writing CSV
Otherwise, rather than figure out how to escape the special characters yourself, you should check out the opencsv project
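To make the "special characters" concern concrete, here is a minimal sketch of the usual CSV quoting convention (the one described in RFC 4180): a field containing a comma, a double quote, or a line break is wrapped in double quotes, and embedded double quotes are doubled. CsvEscaper is a hypothetical helper written for illustration; it is not part of Apache POI or opencsv, and a maintained library remains the safer choice for real data.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of Apache POI, opencsv, or Commons CSV. */
final class CsvEscaper {

    /** Quotes a field if it contains a comma, double quote, CR, or LF; doubles embedded quotes. */
    static String escape(String field) {
        if (field == null) {
            return ""; // treat a missing value as an empty field
        }
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins already-formatted cell texts into one CSV record, without the line terminator. */
    static String toRecord(List<String> cells) {
        return cells.stream().map(CsvEscaper::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Prints: plain,"a,b","say ""hi"""
        System.out.println(toRecord(List.of("plain", "a,b", "say \"hi\"")));
    }
}
```

In a POI loop, each cell's formatted text would be passed through escape(...) before being printed, instead of being written out as-is.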
If you check the official Apache POI website, you will find many examples there, including one that shows how to produce CSV-formatted output using Apache POI.
ToCSV example
Basic strategy:
1) Apache Commons CSV is the standard library for writing CSV values.
2) But we need to loop through the Workbook ourselves and call Commons CSV's CSVPrinter on each cell value, with a newline at the end of each row. Unfortunately, this is custom code; it's not available out of the box in XSSF. But it's easy:
// In this example we construct CSVPrinter on a file Writer; a Writer wrapping an OutputStream works too
BufferedWriter writer = Files.newBufferedWriter(Paths.get(SAMPLE_CSV_FILE_PATH));
CSVPrinter csvPrinter = new CSVPrinter(writer, CSVFormat.DEFAULT);
if (workbook != null) {
    XSSFSheet sheet = workbook.getSheetAt(0); // Sheet #0
    Iterator<Row> rowIterator = sheet.rowIterator();
    while (rowIterator.hasNext()) {
        Row row = rowIterator.next();
        Iterator<Cell> cellIterator = row.cellIterator();
        while (cellIterator.hasNext()) {
            Cell cell = cellIterator.next();
            // Call Commons CSV here to print (getStringCellValue() only works for string cells)
            csvPrinter.print(cell.getStringCellValue());
        }
        // Newline after each row
        csvPrinter.println();
    }
}
// At the end, flush and close CSVPrinter
csvPrinter.flush();
csvPrinter.close();
An improved and tested version of gene b's response is this:
/**
 * Saves all rows from a single Excel sheet in a workbook to a CSV file.
 *
 * @param excelWorkbook path to the Excel workbook.
 * @param sheetNumber sheet number to export.
 * @param csvFile CSV file path for output.
 * @throws IOException if failed to read the Excel file or create/write to a CSV file.
 */
public static void excelToCsv(String excelWorkbook, int sheetNumber, String csvFile) throws IOException {
    try (Workbook workbook = WorkbookFactory.create(new File(excelWorkbook), null, true); // Read-only: true
         BufferedWriter writer = new BufferedWriter(new FileWriter(csvFile));
         CSVPrinter csvPrinter = new CSVPrinter(writer, CSVFormat.DEFAULT)) {
        Sheet sheet = workbook.getSheetAt(sheetNumber);
        DataFormatter format = new DataFormatter();
        for (Row row : sheet) {
            for (int c = 0; c < row.getLastCellNum(); c++) {
                // Null cells returned as blank
                Cell cell = row.getCell(c, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                String cellValue = format.formatCellValue(cell);
                csvPrinter.print(cellValue);
            }
            csvPrinter.println();
        }
        csvPrinter.flush();
    }
}
The following improvements were made:
NullPointerException won't be thrown if a cell in an Excel Row was never edited. A blank value will be written to the CSV instead.
Excel values are rendered using DataFormatter allowing the CSV to match the visual representation of the Excel sheet.
try-with-resources is used to auto-close the file objects.
The workbook is opened in the read-only mode.

How can I programmatically convert .xls and .csv files to .xlsx?

Is there a programmatic solution to this that does not involve having Office on the server?
Update:
This solution will be deployed in a .Net shop, so for now PHP and Java approaches aren't on the table (though I was impressed with the libraries themselves).
We will be receiving documents in csv, .xls, and .xlsx formats that need to be parsed and their data shoved into a DB. We're planning on using the OpenXML SDK for all of the parsing goodness and want to operate over only one file type.
You can achieve this using the Apache POI library for Java.
HSSF is the POI Project's pure Java implementation of the Excel '97(-2007) file format.
XSSF is the POI Project's pure Java implementation of the Excel 2007 OOXML (.xlsx) file format.
I've used it to read in a complete mix of .xls and .xlsx files, and I always output .xlsx.
For .csv files, import using the Super CSV library and export using the Apache POI library above.
The main motivation for Super Csv is to be the best, fastest and most programmer friendly free CSV package for Java.
Or use PHPExcel (http://www.phpexcel.net) if you want a PHP solution rather than Java.
For CSV files, I would recommend a combination of http://kbcsv.codeplex.com/ to read the CSV file into a DataTable and EPPlus, using its LoadFromDataTable method, to convert it to an xlsx file.
It works great for me and is very fast.
For reading xls files, there is no free implementation that I know of :(
You can use the following to parse columns:
object columnValue = ws.Cells[i, ColIndex, i, ColIndex].Value; // get Specific cell.
You can use the method below for .csv, .xlsx, and .txt files.
public yourReturnType compute()
{
    #region .XLSX Section
    if (FilePath.FullName.Contains(".xlsx") || FilePath.FullName.Contains(".xls"))
    {
        // Open and read the XLSX file.
        using (var package = new ExcelPackage(FilePath))
        {
            ExcelWorkbook wb = package.Workbook; // Get the workbook in the file
            if (wb != null)
            {
                if (wb.Worksheets.Count > 0)
                {
                    ExcelWorksheet ws = wb.Worksheets.First(); // Get the first worksheet
                    yourParseCode(ws);
                }
            } // if end.
        } // using end.
    }
    #endregion
    #region .CSV Section
    if (FilePath.FullName.Contains(".csv") || FilePath.FullName.Contains(".txt"))
    {
        CSVParser c = new CSVParser(FilePath);
        DataTable dt = c.ReadCSVFile();
        using (ExcelPackage pck = new ExcelPackage())
        {
            ExcelWorksheet ws = pck.Workbook.Worksheets.Add("temporary");
            ws.Cells["A1"].LoadFromDataTable(dt, true);
            yourParseCode(ws);
            ////pck.Save(); // no need to save this temporary sheet.
        }
    }
    #endregion
    return (yourReturnType );
}

Resources