Azure Databricks Read Text and tables from PDF files-python

Azure Databricks Read Text and tables from PDF files-python - python-3.x

we have pdf documents that contain scanned images. These images contain free text and table data. I would like to read the data from the tables and also the free text using python. End goal is to parse the entire pdf file and convert that to a json and store in table. SO far I have tried the following
Libraries (1.) through (4.) although they are free they are very inconsistent in reading the pdf files mostly because our pdf files are scanned images and tables have no borders.
pip install camelot-py(free)
pip install tabula-py(free)
pip install PyPDF2(free)
fitz - pdf to json(free)
FormRecognizer(License)
Johnsnowlabs(License)
Options 5. and 6. worked fine but due to the image brightness it is not able to parse the headers and certain column data.
Is there any other OCR tool that I can try that integrates well with Azure Databricks. Licensed version is fine too.

Related

How to redact texts in a pdf file in NodeJs

I am struggling to apply text redaction in a PDF file in a aws lambda function written in NodeJs. Here is a list of libraries that I have tried with no success:
pdf-lib: This library almost fulfils all the requirements except that it doesn't redact the text permanently as part of its limitations https://github.com/Hopding/pdf-lib/issues/827
PDF.js: To overcome the above limitation, tried to covert the pdf to an image, so the redaction black boxes are applied permanently. Example code here: https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2png/pdf2png.js However, this lib is not reliable as this cannot extract contents from most pdfs during the process.
Finally, Pdf2Pic: This library helps to overcome the limitation of the first library (pdf-lib) by the converting the pdf into images. But this library internally uses two non node based libraries (graphicsmagick and ghostscript) which I am trying to avoid.
Is there a nodejs based solution that can be used to apply redaction permanently on a pdf file or any solution that can be used to covert a pdf to images to overcome limitation of pdf-lib.

How to access to iCloud Notes with pyiCloud

I'm trying to access to my iCloud Notes with a python script using pyiCloud framework, but when I try to list the notes it seems that Documents folder is empty. Does anyone know how I should make that?
>>> api.files['com~apple~Notes']['Documents'].dir() It returns:
>>> []

It sounds like you have an authentication problem that you can't get access to your Notes from the file storage (the UbiquityService). This issue might give you some more clues.
On the other note (!), I found the following a better way to get my Notes exported. I have tried a couple of ways mentioned around the web. Although there is no in-app solution to export Notes in a format other than PDF files, I have stumbled upon the following two solutions:
Export in Markdown (or in other formats in the paid version) via the Bear app. I found this way easier and of more quality in terms of keeping the formatting, attachments, etc:
Download the Bear migration Workflow script from here and follow the instructions.
[optional] At this point, you have the HTML files with inline encoded images. Use my script to decode images to get regular HTML files with the images in an accompanying directory.
Install Bear and import the exported files from Notes.
Export the files as Markdown, HTML, or whatever format you desire from File -> Export Notes within the Bear app. Don't forget to check the "Export attachments" box in the export dialog.
Export in HTML (and then convert to Markdown if you want) via the Notes Exporter app. The app gives you HTML files with inline encoded Base64 images saved with .txt extension (?!). Although I personally like this way as it generates output files that mimic closely the original Notes, the hyperlinks are missing in the exported files (it still keeps the hyperlink coloring though):
Download Notes Exporter from here.
Export Notes to the path you choose.
[optional] Rename file extensions to .html.
Voilà, now you have your Notes as HTML files with the same formatting and images.
Decode inline Base64 images and save HTML files with images saved in a separate adjacent directory using the script that I wrote for this:
https://gist.github.com/SHi-ON/945ea2272ea4bb29e13bd0305370da90
Hope this helps to give you an idea!

Convert any document, image, text file into PDF

I want to convert any documents or image or text file into PDF for all the OS.
I tried the approach with node-msoffice-pdf, and its working fine for Windows OS but not working in other OS.
Question:
How to convert docs, images, textfile to pdf in nodejs?

I used wkhtmltopdf from years to manage pdf conversion.
https://github.com/devongovett/node-wkhtmltopdf
You can either render an html file and pass it to the module, or render a pdf directly from an url.

If fidelity/conversion quality is important to you, for Word documents (doc/docx) you could try our freemium https://www.npmjs.com/package/#nativedocuments/docx-wasm which will perform the conversion locally (ie where node is running), without the need to LibreOffice etc.

How to read all excel and csv formats using node

I am working in building a project management tool,using MEAN(Mongodb,Expressjs,Angularjs,Nodejs) Stack.
I have a requirement in my project, where users will upload any kind of excel or csv format file and i need to parse each row from the file(excel|csv) and map it to my database model and save it has a mongodb document.I am trying to find an excel and csv parser library to accomplish my task.I also came accross xlsx, it looks good but it doesnt support reading csv files.It will be really helpful if any one could suggest a node.js library that can read all kinds of excel and csv file formats efficiently.Thanks in advance

At one point, I used Node CSV https://github.com/wdavidw/node-csv
to get the data inputted, it's really easy to use. Most of my users were fine with just having the CSV format option.....but you could combine the functionality of each library depending on the file type entered.....

How to open spss data files in Excel?

I want to open spss .sav data files in Excel without opening the spss files (I don't want to convert spss data file into Excel file). I know this is possible using OLDB connection, but I don't know how to do this.

I converted sav to csv online: http://pspp.benpfaff.org/

(Not exactly an answer for you, since do you want avoid opening the files, but maybe this helps others).
I have been using the open source GNU PSPP package to convert the sav tile to csv. You can download the Windows version at least from SourceForge [1]. Once you have the software, you can convert sav file to csv with following command line:
pspp-convert <input.sav> <output.csv>
[1] http://sourceforge.net/projects/pspp4windows/files/?source=navbar

In order to download that driver you must have a license to SPSS. For those who do not, there is an open source tool that is very much like SPSS and will allow you to import SAV files and export them to CSV.
Here's the software
And here are the steps to export the data.

I help develop the Colectica for Excel addin, which opens SPSS and Stata data files in Excel. This does not require ODBC configuration; it reads the file and then inserts the data and metadata into your worksheet.
The addin is downloadable from
http://www.colectica.com/software/colecticaforexcel

You can do it via ODBC. The steps to do it:
Install IBM SPSS Statistics Data File Driver. Standalone Driver is enough.
Create DNS via ODBC manager.
Use the data importer in Excel via ODBC by selecting created DNS.

You can use online converter, developed by me at N'counter.
This is the easiest way to open SPSS file in Excel.
1) You just have to upload your file to SPSS coN'verter at https://secure.ncounter.de/SpssConverter
2) Select some options
3) And your converted Excel file will be downloaded
No information about your file contents is retained on our server. The file travels to our server, is converted in-memory, and is immediately discarded: We don't peer into your data at any time!

I tried the below and it worked well,
Install Dimensions Data Model and OLE DB Access
and follow the below steps in excel
Data->Get External Data ->From Other sources -> From Data Connection Wizard -> Other/Advanced-> SPSS MR DM-2 OLE DB Provider-> Metadata type as SPSS File(SAV)-> SPSS data file in Metadata Location->Finish

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string