How to extract pdf data into excel?

I want to convert PDF data into Excel data. I have converted the PDF to a text file and removed the unnecessary text from the .txt file, but the values are now arranged in rows and I want them arranged column-wise.
PDF file: chemistry-chemists.com/chemister/Spravochniki/handbook-of-aqueous-solubility-data-2010.pdf
Current state of excel file :
Required state of excel file:

PDFtables.com specialises in extracting tables from PDFs into Excel. This should be able to do what you are looking for :)

Have a look at Tabula, a very efficient tool for converting tables from PDFs: https://github.com/tabulapdf/tabula
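If the text has already been extracted with one value per line, as the question describes, the row-to-column step can also be done by hand. The following is only a minimal sketch in Java, assuming every record spans a fixed number of lines (here three); the class name, method names, and sample values are hypothetical placeholders, not data from the handbook. It writes CSV, which Excel opens directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowsToColumns {
    // Groups a flat list of values into rows of `fieldsPerRecord` cells each.
    // Assumption: every record occupies the same number of consecutive lines.
    static List<List<String>> reshape(List<String> lines, int fieldsPerRecord) {
        if (fieldsPerRecord <= 0) {
            throw new IllegalArgumentException("fieldsPerRecord must be positive");
        }
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += fieldsPerRecord) {
            int end = Math.min(i + fieldsPerRecord, lines.size());
            rows.add(new ArrayList<>(lines.subList(i, end)));
        }
        return rows;
    }

    // Quotes a cell when it contains a comma, a double quote, or a line break.
    static String csvCell(String value) {
        boolean needsQuotes = value.contains(",") || value.contains("\"") || value.contains("\n");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Renders the rows as CSV text, one record per line.
    static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            List<String> cells = new ArrayList<>();
            for (String cell : row) {
                cells.add(csvCell(cell));
            }
            sb.append(String.join(",", cells)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder values standing in for the cleaned-up .txt lines.
        List<String> lines = Arrays.asList(
                "name 1", "formula 1", "value 1",
                "name 2", "formula 2", "value 2");
        System.out.print(toCsv(reshape(lines, 3)));
    }
}
```

If the records do not all have the same number of lines, a fixed group size will misalign the columns, and a dedicated table extractor such as the tools mentioned above is likely the safer route.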

In ASP.NET you can use the following code.
<div>
Upload PDF File :<asp:FileUpload ID="fuPdfUpload" runat="server" />
<asp:Button ID="btnExportToExcel" Text="Export To Excel" OnClick="ExportToExcel" runat="server" />
</div>
Note: you have to install iTextSharp from NuGet first.
protected void ExportToExcel(object sender, EventArgs e)
{
if (this.fuPdfUpload.HasFile)
{
string file = Path.GetFullPath(fuPdfUpload.PostedFile.FileName);
this.ExportPDFToExcel(file);
}
}
private void ExportPDFToExcel(string fileName)
{
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
// GetTextFromPage already returns a .NET string; converting it from Encoding.Default to UTF-8 would corrupt non-ASCII characters, so it is appended as-is.
text.Append(currentText);
}
pdfReader.Close();
Response.Clear();
Response.Buffer = true;
Response.AddHeader("content-disposition", "attachment;filename=ReceiptExport.xls");
Response.Charset = "";
Response.ContentType = "application/vnd.ms-excel";
Response.Write(text);
Response.Flush();
Response.End();
}

Related

Excel download using Angular and Spring Boot produces corrupt xls file

I am trying to write an xls download for my Spring Boot application. To generate the file I am using POI. If I write the file directly from my controller, without passing it to the front end, like so:
//Wrote this one just for testing if the file is already corrupt here.
FileOutputStream fos = new FileOutputStream("C:\\dev\\directDownload.xls");
fos.write(byteArray);
fos.flush();
fos.close();
}
It works just fine and the file looks like this:
xls when downloaded from the backend without passing it to angular:
However, that's not my goal. I intend to pass the output stream to my Angular component. The component calls a function from a service class. This class gets the response from the controller and passes it back to my component. In the UI the download dialog opens. The problem is that the downloaded file looks like this (it doesn't matter whether it is opened with Excel or OpenOffice):
Corrupt xls:
My Java Controller:
@CrossOrigin(exposedHeaders = "Content-Disposition")
@RequestMapping(value = "/report/file", produces = "application/vnd.ms-excel;charset=UTF-8")
public void getReportFile(@RequestParam(name = "projectNumber") final String projectNumber,
@RequestParam(name = "month") final int month, @RequestParam(name = "year") final int year,
@RequestParam(name = "employee") final int employee,
@RequestParam(name = "tsKey") final String tsKey,
final HttpServletResponse response) throws IOException {
response.setContentType("application/vnd.ms-excel;charset=UTF-8");
String excelFileName = "test.xls";
String headerKey = "Content-Disposition";
String headerValue = String.format("attachment; filename=\"%s\"",
excelFileName);
response.setHeader(headerKey, headerValue);
//Here I create the workbook that I want to download
ProjectMonthReport report = reportService.getReport(projectNumber, month, year);
//ExcelService builds the workbook using POI
Workbook workbook = excelService.exportExcel(report, employee, tsKey);
//The response is stored in an outputstream
OutputStream out = response.getOutputStream();
response.setContentType("application/vnd.ms-excel");
byte[] byteArray = ((HSSFWorkbook)workbook).getBytes();
out.write(byteArray);
out.flush();
out.close();
//Wrote this one just for testing if the file is already corrupt here. --> It's fine.
FileOutputStream fos = new FileOutputStream("C:\\dev\\directDownload.xls");
fos.write(byteArray);
fos.flush();
fos.close();
}
The Java Service method that builds the file using POI:
public Workbook exportExcel(final ProjectMonthReport report, final int employee, final String tsKey) throws IOException,
InvalidFormatException {
Workbook workbook = new HSSFWorkbook();
CreationHelper createHelper = workbook.getCreationHelper();
// Create a Sheet
Sheet sheet = workbook.createSheet("Employee");
// Create a Font for styling header cells
Font headerFont = workbook.createFont();
headerFont.setBold(true);
headerFont.setFontHeightInPoints((short) 14);
headerFont.setColor(IndexedColors.RED.getIndex());
// Create a CellStyle with the font
CellStyle headerCellStyle = workbook.createCellStyle();
headerCellStyle.setFont(headerFont);
Row headeRow = sheet.createRow(0);
Cell dateHeader = headeRow.createCell(0);
dateHeader.setCellValue("Datum");
Cell startHeader = headeRow.createCell(1);
startHeader.setCellValue("Beginn");
Cell endHeader = headeRow.createCell(2);
endHeader.setCellValue("Ende");
Cell activityHeader = headeRow.createCell(3);
activityHeader.setCellValue("Tätigkeitsbereit");
Cell cardHeader = headeRow.createCell(4);
cardHeader.setCellValue("Kartennummer");
List<WorkDescriptionDetail> details = report.getEmployees().get(employee).getKeyDetailMap().get(Integer.valueOf(tsKey)).getDetailList();
int counter = 1;
for (WorkDescriptionDetail detail : details) {
List <String> stringList= detail.toStringList();
Row row = sheet.createRow(counter);
Cell cellDate = row.createCell(0);
cellDate.setCellValue(stringList.get(0));
Cell cellStart = row.createCell(1);
cellStart.setCellValue(stringList.get(1));
Cell cellEnd = row.createCell(2);
cellEnd.setCellValue(stringList.get(2));
Cell cellActivity = row.createCell(3);
cellActivity.setCellValue(stringList.get(3));
counter ++;
}
return workbook;
}
My angular component:
saveFile(employee: string, tsKey:string) {
this.subscription = this.reportService.saveXlsFile(this.projectNumber, this.year, this.month, employee, tsKey)
.subscribe(response=> {
console.log(response);
let mediatype = 'application/vnd.ms-excel;charset=UTF-8';
const data = new Blob(["\ufeff",response.arrayBuffer()], {type: mediatype});
console.log(data);
saveAs(data, 'test.xls');
},
error => console.log("error downloading the file"));
}
The Ts Service Function that is called:
saveXlsFile(projectNumber:string, year:string, month:string, empId: string, tsKey:string) {
let params:URLSearchParams = new URLSearchParams();
params.set('projectNumber', projectNumber);
console.log(projectNumber);
params.set('month', month);
console.log(month);
params.set( 'year', year);
console.log(year);
params.set('employee', empId);
console.log(empId);
params.set('tsKey', tsKey);
console.log(tsKey);
return this.http.get(this.baseUrl + "/file", { search: params } );
}
I tried to retrieve the response via Postman and download the file directly. When I do that, Excel can't open the file (it just crashes), but I can open it in OpenOffice, where it works fine and is not corrupted.
I've been searching the web for the last couple of days and I think it may be an encoding problem caused in the front end. But maybe Spring Boot is playing tricks on me here. Any suggestions?
Thank you for your help!

I found the solution to this problem myself. Add the following to the Angular service:
return this.http.get(this.baseUrl + "/file", { search: params, responseType: ResponseContentType.Blob }).map(
(res) => {
return new Blob([res.blob()], { type: 'application/vnd.ms-excel' });
});
After that you'll need to modify the component like so:
saveFile(employee: string, tsKey:string) {
this.subscription = this.reportService.saveXlsFile(this.projectNumber, this.year, this.month, employee, tsKey)
.subscribe(response=> {
console.log(response);
let mediatype = 'application/vnd.ms-excel';
saveAs(response, 'test.xls');
},
error => console.log("error downloading the file"));
}
So the problem was that I was not getting a Blob object in my response.
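Why a non-blob response breaks the file can be illustrated outside Angular. The sketch below is in Java and assumes the corruption comes from the binary body being decoded as text and encoded again, which is a plausible mechanism but not something this thread confirms; the class and method names are hypothetical. It pushes the eight-byte signature of an OLE2 file (the container format of .xls) through a UTF-8 round trip and checks whether the bytes survive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BinaryAsTextDemo {
    // Decodes bytes as UTF-8 text and encodes the text back to bytes,
    // roughly what happens when a binary body is handled as a string.
    static byte[] roundTripThroughText(byte[] original) {
        String asText = new String(original, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Signature bytes found at the start of every OLE2 (.xls) file.
        byte[] header = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
        };
        byte[] mangled = roundTripThroughText(header);
        System.out.println("bytes unchanged: " + Arrays.equals(header, mangled));
        System.out.println("length before/after: " + header.length + "/" + mangled.length);
    }
}
```

Byte sequences that are not valid UTF-8 are replaced during decoding, so the original bytes cannot be recovered afterwards. Requesting the body as a blob avoids the text step entirely.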

Export Rich Text to plain text c#

Good day to Stackoverflow community,
I am in need of some expert assistance. I have an MVC4 web app that has a few rich text box fields powered by TinyMCE. Up until now the system is working great. Last week my client informed me that they want to export the data stored in Microsoft SQL to Excel to run custom reports.
I am able to export the data to excel with the code supplied. However it is exporting the data in RTF rather than Plain text. This is causing issues when they try to read the content.
Due to my lack of knowledge, I am unable to figure this out. I did read that it is possible to do this with a regex, but I have no idea how to implement it, so I turn to you for assistance.
public ActionResult ExportReferralData()
{
GridView gv = new GridView();
gv.DataSource = db.Referrals.ToList();
gv.DataBind();
Response.ClearContent();
Response.Buffer = true;
Response.AddHeader("content-disposition", "attachment; filename=UnderwritingReferrals.xls");
Response.ContentType = "application/ms-excel";
Response.AddHeader("Content-Type", "application/vnd.ms-excel");
Response.Charset = "";
Response.Cache.SetCacheability(HttpCacheability.NoCache);
StringWriter sw = new StringWriter();
HtmlTextWriter htw = new HtmlTextWriter(sw);
gv.RenderControl(htw);
Response.Output.Write(sw.ToString());
Response.Flush();
Response.End();
return RedirectToAction("Index");
}
I would really appreciate any assistance, and thank you in advance.
I have looked for solutions on YouTube and web forums without any success.
Kind Regards
Francois Muller

One option is to massage the data before you write it to the exported file.
For example, identify an opening HTML tag in your string and replace it with string.Empty.
Similarly, the matching closing tag can be replaced with string.Empty.
Once you have identified all the variants of the rich-text HTML tags, you can create a list of the tags and, inside a for loop, replace each of them with a suitable string.
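The list-and-loop idea can be sketched as follows. This is a Java illustration rather than the C# of the question, and the tag list is a hypothetical guess at what a TinyMCE field might emit; the real list has to come from inspecting the stored data.

```java
import java.util.Arrays;
import java.util.List;

final class TagStripper {
    // Hypothetical tag list; extend it with whatever the editor actually produces.
    static final List<String> TAGS = Arrays.asList(
            "<p>", "</p>", "<strong>", "</strong>",
            "<em>", "</em>", "<br />", "<br>");

    // Replaces every known tag with the empty string, one tag at a time.
    static String stripKnownTags(String html) {
        String result = html;
        for (String tag : TAGS) {
            result = result.replace(tag, "");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stripKnownTags("<p>Hello <strong>world</strong></p>"));
    }
}
```

A fixed list only matches tags written exactly as listed, so a tag carrying attributes (for example a paragraph with a style attribute) passes through untouched. That limitation is why the other answers in this thread scan for angle brackets instead.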

Did you try saving the file as .xlsx and sending that to the client?
The newer Excel format might handle the data more gracefully.

Add this function to your code, then call it with the HTML string. The returned string will be free of HTML tags.
Warning: This does not work for all cases and should not be used to process untrusted user input. Please test it with variants of your input string.
public static string StripTagsCharArray(string source)
{
char[] array = new char[source.Length];
int arrayIndex = 0;
bool inside = false;
for (int i = 0; i < source.Length; i++)
{
char let = source[i];
if (let == '<')
{ inside = true; continue; }
if (let == '>') { inside = false; continue; }
if (!inside) { array[arrayIndex] = let; arrayIndex++; }
}
return new string(array, 0, arrayIndex);
}

I managed to resolve this issue by changing the original code as follows.
As I'm only trying to convert a few columns, I found this to work well. It ensures each record gets its own row in Excel and converts the HTML to plain text, which lets users add column filters in Excel.
I hope this helps anyone else who has a similar issue.
GridView gv = new GridView();
var From = RExportFrom;
var To = RExportTo;
if (RExportFrom == null || RExportTo == null)
{
/* The actual code to be used */
gv.DataSource = db.Referrals.OrderBy(m =>m.Date_Logged).ToList();
}
else
{
gv.DataSource = db.Referrals.Where(m => m.Date_Logged >= From && m.Date_Logged <= To).OrderBy(m => m.Date_Logged).ToList();
}
gv.DataBind();
foreach (GridViewRow row in gv.Rows)
{
if (row.Cells[20].Text.Contains("<"))
{
row.Cells[20].Text = Regex.Replace(row.Cells[20].Text, "<(?<tag>.+?)(>|>)", " ");
}
if (row.Cells[21].Text.Contains("<"))
{
row.Cells[21].Text = Regex.Replace(row.Cells[21].Text, "<(?<tag>.+?)(>|>)", " ");
}
if (row.Cells[22].Text.Contains("<"))
{
row.Cells[22].Text = Regex.Replace(row.Cells[22].Text, "<(?<tag>.+?)(>|>)", " ");
}
if (row.Cells[37].Text.Contains("<"))
{
row.Cells[37].Text = Regex.Replace(row.Cells[37].Text, "<(?<tag>.+?)(>|>)", " ");
}
if (row.Cells[50].Text.Contains("<"))
{
row.Cells[50].Text = Regex.Replace(row.Cells[50].Text, "<(?<tag>.+?)(>|>)", " ");
}
}
Response.ClearContent();
Response.Buffer = true;
Response.AddHeader("content-disposition", "attachment; filename=Referrals " + DateTime.Now.ToString("dd-MM-yyyy") + ".xls");
Response.ContentType = "application/ms-excel";
Response.ContentEncoding = System.Text.Encoding.UTF8;
Response.AddHeader("Content-Type", "application/vnd.ms-excel");
Response.Charset = "";
Response.Cache.SetCacheability(HttpCacheability.NoCache);
StringWriter sw = new StringWriter();
HtmlTextWriter htw = new HtmlTextWriter(sw);
gv.RenderControl(htw);
//This code will export the data to Excel and remove all HTML Tags to pass everything into Plain text.
//I am using HttpUtility.HtmlDecode twice as the first instance changes null values to "Â" the second time it will run the replace code.
//I am using string.Replace to change the headings to more understandable headings rather than the headings produced by the Model.
Response.Write(HttpUtility.HtmlDecode(sw.ToString())
.Replace("Cover_Details", "Referral Detail")
.Replace("Id", "Identity Number")
.Replace("Unique_Ref", "Reference Number")
.Replace("Date_Logged", "Date Logged")
.Replace("Logged_By", "File Number")
.Replace("Date_Referral", "Date of Referral")
.Replace("Referred_By", "Name of Referrer")
.Replace("UWRules", "Underwriting Rules")
.Replace("Referred_To", "Name of Referrer")
);
Response.Flush();
Response.End();
TempData["success"] = "Data successfully exported!";
return RedirectToAction("Index");
}

Java iText Pdf Writer and PrimeFaces Not Refreshing PDF content

Good morning!
I'm using the iText library to create a pdf template and Primefaces to display the content on a web application.
When I ran the first test to check that the libraries were set up, the PDF was displayed normally. But then I made some changes, and it seems that something is caching my first test in memory: it is the only thing displayed, and no matter what changes I make, the content stays the same. I've already cleaned my NetBeans project, closed the IDE, and restarted the computer.
This is my tag on the JSF page:
<p:media value="#{atividadeController.pdfContent}" player="pdf" width="100%" height="700px"/>
And here is my method in the managed bean, which is SessionScoped:
public String preparePdf()
{
try {
ByteArrayOutputStream output = new ByteArrayOutputStream();
Font fontHeader = new Font(Font.FontFamily.HELVETICA, 20, Font.BOLD);
Font fontLine = new Font(Font.FontFamily.TIMES_ROMAN, 14);
Font fontLineBold = new Font(Font.FontFamily.TIMES_ROMAN, 14, Font.BOLD);
Document document = new Document();
PdfWriter.getInstance(document, output);
document.open();
//Writing document
Chunk preface = new Chunk("GERAL", fontHeader);
document.add(preface);
Calendar cal = Calendar.getInstance();
cal.setTime(current.getData());
int year = cal.get(Calendar.YEAR);
int month = 1 + cal.get(Calendar.MONTH);
int day = cal.get(Calendar.DAY_OF_MONTH);
String dateStr = day+"/"+month+"/"+year;
Paragraph dataAndHour = new Paragraph(dateStr, fontLine);
document.add(dataAndHour);
document.close();
pdfContent = new DefaultStreamedContent(new ByteArrayInputStream(output.toByteArray()), "application/pdf");
} catch (Exception e) {
e.printStackTrace();
}
return "/views/view_atividade_pdf";
}
There is no exception on the server log.
I really appreciate any help. Thanks in advance.

Replace and delete PDF File

I am using the following piece of code to delete the old PDF and replace it with the new one, but with no result. Is it possible to perform this operation on PDF files? Everywhere on the net I see these functions used for .txt, .xls, .doc, and similar file types. Is there anything wrong with my code? Please help.
private void ListFieldNames(string s)
{
try
{
string pdfTemplate = @"z:\TEMP\PDF\PassportApplicationForm_Main_English_V1.0.pdf";
//var newFile = pdfTemplate;
string newFile = @"z:\TEMP\PDF\_PassportApplicationForm_Main_English_V1.0.pdf";
PdfReader pdfReader = new PdfReader(pdfTemplate);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
//ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
PdfReader reader = new PdfReader((string)pdfTemplate);
//PdfStamper stamper = new PdfStamper(reader, new FileStream(newFile, FileMode.Create));
using (PdfStamper stamper = new PdfStamper(reader, new FileStream(newFile, FileMode.Create)))
{
AcroFields form = stamper.AcroFields;
var fieldKeys = form.Fields.Keys;
foreach (string fieldKey in fieldKeys)
{
//Replace Address Form field with my custom data
if (fieldKey.Contains("Surname"))
{
form.SetField(fieldKey, s);
}
}
// set form fields
//form.SetField("Address", s);
stamper.FormFlattening = true;
stamper.Close();
}
}
File.Copy(newFile, pdfTemplate);
File.Delete(pdfTemplate);
}

Everything looks good to me, just change:
File.Copy(newFile, pdfTemplate);
File.Delete(pdfTemplate);
change to:
File.Delete(pdfTemplate);
File.Copy(newFile, pdfTemplate);
You can't copy a file to a location where a file with the same name already exists.
Delete the existing file first.
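The same replace-the-old-file step can be written without an explicit delete when the copy call is allowed to overwrite. Below is a minimal Java sketch of that idea, not the asker's C# code; the class name, method names, and temporary file names are hypothetical. (.NET offers a comparable overwrite flag through the three-argument File.Copy overload.)

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ReplaceFile {
    // Overwrites `target` with the contents of `source`, then removes `source`.
    static void replace(Path source, Path target) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        Files.delete(source);
    }

    // Runs the replacement on two throwaway files and returns the final
    // content of the target, so the effect can be checked.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("replace-demo");
            Path template = dir.resolve("form.pdf");
            Path filled = dir.resolve("_form.pdf");
            Files.write(template, "old".getBytes(StandardCharsets.UTF_8));
            Files.write(filled, "new".getBytes(StandardCharsets.UTF_8));
            replace(filled, template);
            return new String(Files.readAllBytes(template), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Overwriting in one call means the target never goes missing between the delete and the copy, which matters if another process reads the file in the meantime.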

How can we read protected password excel file (.xls) with POI API

I've just learned POI and find that HSSF makes it very simple to read and create Excel files (.xls).
However, I ran into a problem when trying to read an Excel file that is protected with a password.
I spent an hour searching the internet for a solution.
Could you please help me solve this problem?
I would be very glad if you could give me a code snippet.
Thank you.

See http://poi.apache.org/encryption.html - if you're using a recent enough copy of Apache POI (e.g. 3.8), then encrypted .xls files (HSSF) and .xlsx files (XSSF) can be decrypted (provided you have the password!).
At the moment you can't write out encrypted Excel files though, only unencrypted ones.

At the time you wrote your question, it wasn't easy to do with Apache POI. Since then, support has come a long way.
These days, if you want to open a password-protected Excel file, whether .xls or .xlsx, for which you know the password, all you need to do is use WorkbookFactory.create(File, Password), e.g.
File input = new File("password-protected.xlsx");
String password = "nice and secure";
Workbook wb = WorkbookFactory.create(input, password);
That'll identify the type of the file, decrypt it with the given password, and open it for you. You can then read the contents as normal

Here is a complete example that reads a protected Excel file, decrypts it using a password, and writes out an unprotected Excel file:
public static void readProtectedBinFile() {
try {
InputStream inp = new FileInputStream("c:\\tmp\\protectedFile.xls");
org.apache.poi.hssf.record.crypto.Biff8EncryptionKey.setCurrentUserPassword("abracadabra");
Workbook wb;
wb = WorkbookFactory.create(inp);
// Write the output to a file
FileOutputStream fileOut;
fileOut = new FileOutputStream("c:\\tmp\\unprotectedworkbook.xls");
wb.write(fileOut);
fileOut.close();
} catch (InvalidFormatException e) {
e.printStackTrace();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}

Here is a complete example that reads an Excel file and handles both .xls and .xlsx, whether password-protected or not.
private Workbook createWorkbookByCheckExtension() throws IOException, InvalidFormatException {
Workbook workbook = null;
String filePath = "C:\\temp\\TestProtectedFile.xls";
String fileName = "TestProtectedFile.xls";
String fileExtensionName = fileName.substring(fileName.lastIndexOf("."));
if (fileExtensionName.equals(".xls")) {
try {
FileInputStream fileInputStream = new FileInputStream(new File(filePath));
workbook = new HSSFWorkbook(fileInputStream);
} catch (EncryptedDocumentException e) {
// Checking of .xls file with password protected.
FileInputStream fileInputStream = new FileInputStream(new File(filePath));
Biff8EncryptionKey.setCurrentUserPassword("password");
workbook = new HSSFWorkbook(fileInputStream);
}
} else if (fileExtensionName.equals(".xlsx")){
// Checking of .xlsx file with password protected.
String isWorkbookLock = "";
InputStream is = null;
is = new FileInputStream(new File(filePath));
if (!is.markSupported()) {
is = new PushbackInputStream(is, 8);
}
if (POIFSFileSystem.hasPOIFSHeader(is)) {
POIFSFileSystem fs = new POIFSFileSystem(is);
EncryptionInfo info = new EncryptionInfo(fs);
Decryptor d = Decryptor.getInstance(info);
try {
d.verifyPassword("password");
is = d.getDataStream(fs);
workbook = new XSSFWorkbook(OPCPackage.open(is));
isWorkbookLock = "true";
} catch (GeneralSecurityException e) {
e.printStackTrace();
}
}
if (!"true".equals(isWorkbookLock)) {
FileInputStream fileInputStream = new FileInputStream(new File(filePath));
workbook = new XSSFWorkbook(fileInputStream);
}
}
return workbook;
}
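The hasPOIFSHeader check in the code above works because a password-encrypted .xlsx is stored inside an OLE2 container, whereas an unencrypted .xlsx is a ZIP archive. The sketch below shows that distinction using only the JDK, without POI; the class and method names are hypothetical, and it only identifies the container, it does not decrypt anything.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

final class ExcelContainerSniffer {
    // Signature of an OLE2 compound file (.xls, and encrypted .xlsx).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Signature of a ZIP local file header (unencrypted .xlsx).
    static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Classifies the leading bytes of a file as "OLE2", "ZIP", or "UNKNOWN".
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) throws IOException {
        // Expects the path of the spreadsheet as the first argument.
        byte[] head = new byte[8];
        int read;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            read = in.read(head);
        }
        System.out.println(sniff(Arrays.copyOf(head, Math.max(read, 0))));
    }
}
```

A file named .xlsx that reports OLE2 is a strong hint that it is encrypted, or is really an .xls with the wrong extension, so relying on the extension alone can pick the wrong code path.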

POI will not be able to read encrypted workbooks - that means that if you have protected the entire workbook (and not just a sheet), then it won't be able to read it. Otherwise, it should work.

Ravi is right. It seems you can read password protected, but not encrypted files with POI. See http://osdir.com/ml/user-poi.apache.org/2010-05/msg00118.html. The following code prints out a trace of the file
POIFSLister lister = new POIFSLister();
lister.viewFile(spreadsheetPath, true);
If you get an output mentioning encryption then you cannot open the file with POI.
