Tabula font warnings result in table not getting parsed from document. Is this how it is supposed to work? - tabula-py

I parsed 3 documents to fetch tables. The results are as follows:
Document 1: Perfect parsing.
Document 2: Got the following warning:
Jul 16, 2019 5:25:42 PM org.apache.pdfbox.pdmodel.font.PDType1Font
WARNING: Using fallback font NimbusSanL-Bold for Univers-Bold
Not sure if this is related, but the second page was parsed and the first one was not.
Document 3: Got the following warning, and nothing was parsed from this one:
Jul 17, 2019 10:21:25 AM org.apache.pdfbox.pdmodel.font.PDType1Font
WARNING: Using fallback font NimbusSanL-Regu for Univers
These are the current tabula parsing settings:
import tabula

rows = tabula.read_pdf(filename,
                       pages='all',
                       silent=True,
                       pandas_options={
                           'header': None,
                           'error_bad_lines': False,
                           'warn_bad_lines': False
                       })
Are there other settings that might solve this particular problem?

The warnings come from PDFBox, which tabula-java depends on. Unfortunately, the problem lies in the PDF itself, and there is no way to work around it with tabula-py.
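To see which fonts PDFBox substituted, the warning lines can be pulled out of the Java log output once it has been captured as text. The following is a minimal diagnostic sketch, not a fix for the parsing problem. The function name fallback_fonts and the idea of having the log available as a string are assumptions for illustration; the sample text is copied from the warnings in the question.

```python
import re

# Matches PDFBox lines such as:
#   WARNING: Using fallback font NimbusSanL-Bold for Univers-Bold
FALLBACK_RE = re.compile(r"WARNING: Using fallback font (\S+) for (\S+)")


def fallback_fonts(log_text):
    """Return (requested_font, fallback_font) pairs found in PDFBox log text."""
    pairs = []
    for line in log_text.splitlines():
        match = FALLBACK_RE.search(line)
        if match:
            fallback, requested = match.groups()
            pairs.append((requested, fallback))
    return pairs


# Sample log text taken from the question (hypothetical capture of stderr).
sample_log = """\
Jul 16, 2019 5:25:42 PM org.apache.pdfbox.pdmodel.font.PDType1Font
WARNING: Using fallback font NimbusSanL-Bold for Univers-Bold
Jul 17, 2019 10:21:25 AM org.apache.pdfbox.pdmodel.font.PDType1Font
WARNING: Using fallback font NimbusSanL-Regu for Univers
"""

for requested, fallback in fallback_fonts(sample_log):
    print(requested, "->", fallback)
```

A list like this only shows which fonts the PDF asks for but does not provide; it does not change what tabula-py can extract.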

Related

The dictionary does not contain required key: Pages

I am trying to convert a PDF to PDF/A using PDFNetPython3. However, I am getting the following error.
Main error message:
The dictionary does not contain required key: Pages
Following the PDFNetPython3 docs:
from PDFNetPython3 import PDFNet, PDFACompliance
# ... some necessary code like temp_file_path_in (this is not null and has values of file_object)
pdf_a = PDFACompliance(True, tmp_file_path_in, None, PDFACompliance.e_Level2B, 0, 0, 10)
I also tried this (got the same error):
pdf_a = PDFACompliance(True, filename, None, PDFACompliance.e_Level2B, 0, 10)
I want to know whether this Pages key relates to PDF page numbers or to the total page count. I am merging a blank PDF page with other PDF pages and converting the result to PDF/A.
Reference: https://www.pdftron.com/documentation/python/guides/features/pdfa/convert/
Thanks in advance!!!
The exception indicates that the document you are processing does not contain any pages. Since you are merging a blank PDF, it is likely you missed the PDFDoc.PagePushBack(page) call.
If this does not help, please share your code for creating and merging the PDF.

from invalid argument: 'value' must be a single Unicode code point error using ActionChains class of Selenium through Python

File "C:/Users/User/Test.py", line 58, in <module>
.send_keys(DTD) \
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument
from invalid argument: 'value' must be a single Unicode code point
This is the error I encounter when I use send_keys on the Date field in the Chrome browser.
The following are my data and part of my code.
Data
Part of code
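import pandas
from selenium.webdriver import ActionChains

wb = pandas.read_excel('excel.xlsx')
Journal = wb.values.tolist()
for JV in Journal:
    DTD = str(JV[0])  # Date
    # Make an entry in the field in the Google Chrome browser
    # (driver is the WebDriver instance created earlier)
    ActionChains(driver) \
        .send_keys(DTD) \
        .perform()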
This error message...
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument
from invalid argument: 'value' must be a single Unicode code point
...implies that there was a compatibility issue while converting a non-W3C command to a W3C standard command.
As per the discussion in "ActionChains perform returns exception 'value' must be a single unicode point", this issue was observed with Appium version 1.11.1 when used along with ChromeDriver v2.45, setting the standards mode with:
goog:chromeOptions.w3c:true
Excerpt from release notes:
Resolved issue 2536: Make standards mode (goog:chromeOptions.w3c:true) the default [Pri-2]
Solution
An immediate solution would be to:
Update ChromeDriver to current ChromeDriver v78.0 level.
Update Chrome to current Chrome Version 78.0 level. (as per ChromeDriver v78.0.3904.105 release notes)
tl; dr
A couple of relevant discussions are as follows:
perform() action chain - Getting exception
ActionsChains key_action.pause causes "exception 'value' must be a single unicode point" in Appium Webview or Chromium
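The update advice above amounts to keeping Chrome and ChromeDriver at the same major version. A minimal sketch of that comparison follows; the helper names major_version and same_major are hypothetical, the version strings are illustrative values typed in by hand, and how they are obtained from the browser or driver is left out.

```python
def major_version(version_text):
    """Return the leading integer of a version string such as '78.0.3904.105'."""
    return int(version_text.strip().split(".")[0])


def same_major(chrome_version, chromedriver_version):
    """True when both version strings share the same major version number."""
    return major_version(chrome_version) == major_version(chromedriver_version)


# Illustrative values only: a Chrome 78 build against two driver versions.
print(same_major("78.0.3904.108", "78.0.3904.105"))
print(same_major("78.0.3904.108", "2.45"))
```

The first comparison is a matching pair; the second pairs a current browser with the older ChromeDriver v2.45 mentioned in the discussion and reports a mismatch.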

how to save file as pdf from binary output using logic app

The HTTP output of body('HTTP') is:
{"statusCode":200,"headers":{"dataserviceversion":"2.0","sap-metadata-last-modified":"Fri, 09 Aug 2019 10:41:57 GMT",
"Cache-Control":"no-store, no-cache","Date":"Fri, 09 Aug 2019 11:26:58 GMT",
"Content-Type":"application/atom+xml; type=feed; charset=utf-8","Content-Length":"1365811"},
"body":{"$content-type":"application/atom+xml; type=feed; charset=utf-8","$content":"PGZlZWQgeG1sbn
Note: the content type is application/atom+xml.
$content holds binary data, which is just the PDF file.
I want to take this data and convert it into a PDF file.
In the SharePoint connector's create file action I pass body('HTTP')?['content'], but it still does not create a valid PDF file. I also tried passing only body('HTTP') and got the same error.
When opening the file, it throws an error -
In the SharePoint create file connector I pass that binary string as base64ToBinary(thatBinaryValue), but in SharePoint the type is
"$content-type": "application/octet-stream",
I tried with an HTTP trigger to process a PDF, and I suppose the problem is caused by the content expression. If I use #triggerBody()['$content'] to process the PDF content, the file can't be opened either.
So I changed it to #triggerBody() to get the content and it works, so yours should be just body('HTTP'). Give it a try; I hope this helps.
Update: I tried with an HTTP action and got the same result, so it is an expression problem. Just change the expression.
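One way to check what the $content value actually holds is to base64-decode it outside the Logic App and look at the first bytes; a PDF file begins with %PDF-. The following is a minimal Python sketch, assuming the value has been copied out as a string. The helper names decode_prefix and looks_like_pdf are hypothetical, and only the fragment visible in the question is used as input.

```python
import base64


def decode_prefix(b64_text):
    """Decode as much of a (possibly truncated) base64 string as is valid."""
    # base64 is decoded in groups of 4 characters, so drop any partial group.
    usable = len(b64_text) - len(b64_text) % 4
    return base64.b64decode(b64_text[:usable])


def looks_like_pdf(b64_text):
    """True when the decoded bytes start with the PDF file signature."""
    return decode_prefix(b64_text).startswith(b"%PDF-")


# The only part of $content that is visible in the question.
visible = "PGZlZWQgeG1sbn"
print(decode_prefix(visible))
print(looks_like_pdf(visible))
```

For the fragment shown in the question, the decoded bytes start with <feed xml, which is consistent with the application/atom+xml content type. That suggests the payload may be an Atom feed wrapping the file rather than raw PDF bytes, so it is worth inspecting the decoded body before writing it to SharePoint.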

Exceptions when reading PDF tables with Tabula

I am using tabula-0.9.2 with Python 3.6.1 & java version "1.8.0_45" to extract tables from some PDFs as follows:
from tabula import read_pdf_table
read_pdf_table(pdf_file, pages=1, silent=True)
For the most part this works, but I encounter several of these exceptions. Does anyone know how to find the root cause? Is there an argument to read_pdf_table that I am missing that could fix this issue? I think I have all the dependency versions correct, unless I am missing something. Please advise. Thanks.
Jul 13, 2017 3:52:31 PM org.apache.pdfbox.pdfviewer.PageDrawer processTextPosition
SEVERE: java.io.IOException: Problem reading font data.
java.io.IOException: Problem reading font data.
at java.awt.Font.createFont0(Font.java:1000)
at java.awt.Font.createFont(Font.java:877)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getawtFont(PDTrueTypeFont.java:471)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:110)
at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:260)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:504)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:56)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:93)
at technology.tabula.CommandLineApp$TableExtractor.extractTablesBasic(CommandLineApp.java:372)
at technology.tabula.CommandLineApp$TableExtractor.extractTables(CommandLineApp.java:359)
at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:166)
at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:123)
at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:104)
at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)

Using fileDialog2 to open files in Maya

In the documentation for fileDialog2 (http://download.autodesk.com/us/maya/2011help/pymel/generated/functions/pymel.core.system/pymel.core.system.fileDialog2.html), it says acceptMode (am) can be set to 0 or 1 to tell it if it should be opening or saving images.
However, upon setting this to 0 or 1, nothing actually happens and None is returned, and just leaving it empty results in a save dialog box. I'm currently using fileDialog to get around the problem, but it's an earlier version without as much functionality, and when one newer function should cover both cases, it seems pointless to have to use an old one at the same time.
Here's a quick example of what to do:
import pymel.core as pm
pm.fileDialog2()
#brings up a save file window
pm.fileDialog2( am = 1 )
pm.fileDialog2( acceptMode = 0 )
#nothing happens
Also, using help(pm.fileDialog2) just comes up with help for NoneType or list depending on if a file is selected or not.
You need to specify the fileMode option:
import pymel.core as pm
test = pm.fileDialog2(fileMode=1)
print(test)
