Exceptions when reading PDF tables with Tabula - python-3.x

I am using tabula-0.9.2 with Python 3.6.1 & java version "1.8.0_45" to extract tables from some PDFs as follows:
from tabula import read_pdf_table
read_pdf_table(pdf_file, pages=1, silent=True)
For the most part this works, but I encounter several of these exceptions. Does anyone know how to find the root cause? Is there an argument to read_pdf_table that I am missing which would fix this issue? I think I have all the dependency versions correct, unless I am missing something. Please advise. Thanks.
Jul 13, 2017 3:52:31 PM org.apache.pdfbox.pdfviewer.PageDrawer processTextPosition
SEVERE: java.io.IOException: Problem reading font data.
java.io.IOException: Problem reading font data.
at java.awt.Font.createFont0(Font.java:1000)
at java.awt.Font.createFont(Font.java:877)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getawtFont(PDTrueTypeFont.java:471)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:110)
at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:260)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:504)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:56)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:93)
at technology.tabula.CommandLineApp$TableExtractor.extractTablesBasic(CommandLineApp.java:372)
at technology.tabula.CommandLineApp$TableExtractor.extractTables(CommandLineApp.java:359)
at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:166)
at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:123)
at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:104)
at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)
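The trace fails inside technology.tabula.detectors.NurminenDetectionAlgorithm, which only runs while tabula renders the page to auto-detect table areas. So one hedged workaround (a sketch, not a confirmed fix) is to disable auto-detection and pass an explicit table area; guess and area are standard tabula options, and the area coordinates below are placeholders:

from tabula import read_pdf_table

# Sketch: guess=False skips tabula's auto-detection (NurminenDetectionAlgorithm),
# whose page rendering is where the font error above occurs.
# area is [top, left, bottom, right] in PDF points -- placeholder values.
df = read_pdf_table(pdf_file, pages=1, silent=True, guess=False,
                    area=[50, 30, 750, 570])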

Related

Pyspark - No module named coverage_daemon

I am trying to execute this simple code in my dataframe:
import ast

rddAlertsRdd = df.rdd.map(lambda message: ast.literal_eval(message['value']))
rddAlerts = rddAlertsRdd.collect()
But I'm getting the error below:
Versions:
Spark: 3.3.1
Hadoop: 2.7
Python: 3.7
Pyspark: 3.3.1
Py4j: 0.10.9.5
OpenJDK: 8
Could it be a problem related to version compatibility? I appreciate your help!
To solve the problem, I tried changing the Spark environment variables in my Dockerfile. This is what I have in my Dockerfile:
tl;dr: No idea what could be wrong, but here is a little more about the possible cause from reading the source code. Hope this helps.
The only place with coverage_daemon is python/test_coverage/conf/spark-defaults.conf which (as you may've guessed already) is for test coverage and does not seem to be used in production.
It appears that for some reason python/run-tests-with-coverage got executed.
It looks as if you're using a Jupyter environment that is misconfigured.
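A hedged way to check from inside the notebook whether that coverage configuration leaked into your session (spark.python.daemon.module is the property set in that spark-defaults.conf):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# If this prints "coverage_daemon", the test-coverage spark-defaults.conf is
# being picked up (for example via SPARK_CONF_DIR); overriding it with the
# default, .config("spark.python.daemon.module", "pyspark.daemon"), should help.
print(spark.conf.get("spark.python.daemon.module", "<not set>"))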

Error with adding text components in psychopy procedure

I downloaded the procedure from my Google Drive onto my lab's computer, and I had to make some edits to test it. To do this, I have to add some text components with some specific code. However, when I try to add them and use the $ symbol to reference the text I want, I get the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/psychopy/app/builder/dialogs/paramCtrls.py", line 31, in validate
validate(self, self.valType)
File "/usr/local/lib/python3.8/dist-packages/psychopy/app/builder/dialogs/paramCtrls.py", line 550, in validate
val = str(obj.GetValue())
RecursionError: maximum recursion depth exceeded while calling a Python object
Does anyone have a good piece of advice or a workaround for this?
Thanks
What version of PsychoPy are you using? I believe this was fixed in February 2022, so updating to a newer version of PsychoPy should fix it for you.
https://github.com/psychopy/psychopy/pull/4569
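Assuming PsychoPy was installed with pip (Standalone builds update differently), the upgrade would be:

pip install --upgrade psychopy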

Invalid date: Error while importing CSV to Cassandra using PySpark

I'm using a Jupyter notebook to run PySpark code that imports a CSV file into Cassandra v3.11.3. I am getting the error below.
... 1 more
[screenshot of the error]
The PySpark code I have attached as a picture:
[screenshot of the PySpark code]
Any inputs?
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we really would need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "MODE" - "DROPMALFORMED" is not a valid C* connector option. DataFrame writer and reader options are source specific, so you are unfortunately unable to mix and match.
This makes me think that the data being written actually has a malformed date string or two, and this code is dying when attempting to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
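A minimal sketch of that idea, assuming hypothetical paths, column names, and keyspace/table: malformed rows are dropped at CSV read time (where mode is a valid option), and the Cassandra write then carries only connector-specific options.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-cassandra").getOrCreate()

# Parse dates on read; rows the parser cannot handle are dropped here.
df = (spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")     # valid for the CSV source
      .option("dateFormat", "yyyy-MM-dd")  # must match the CSV's date format
      .schema("id int, event_date date")   # typed schema so dates are cast on read
      .csv("/path/to/input.csv"))

# The write side uses only Spark Cassandra Connector options.
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(table="events", keyspace="demo")
   .mode("append")
   .save())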

Trouble importing VTK files into Paraview (Error reading ascii data)

I am very new to using Paraview, and I'm trying to import a few VTK files and view them. However, I'm receiving the following errors:
Generic Warning: In /Users/kitware/dashboards/buildbot-slave/8275bd07/build/superbuild/paraview/src/VTK/IO/Legacy/vtkDataReader.cxx, line 1436
Error reading ascii data. Possible mismatch of datasize with declaration.
ERROR: In /Users/kitware/dashboards/buildbot-slave/8275bd07/build/superbuild/paraview/src/VTK/IO/Legacy/vtkUnstructuredGridReader.cxx, line 346
vtkUnstructuredGridReader (0x7fb15582bd10): Unrecognized keyword: ,
I can't seem to figure out what's wrong; I've tried converting them to other formats to no avail.
I don't think there's a problem with the files; I can open them with Paraview 5.6. Maybe they were generated with a version of VTK that is more recent than the one used by your version of Paraview. You should install the latest version of Paraview (or at least 5.6).
The big file results in some visible geometry, the smaller one does not. But I get no error message; everything seems OK.
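One hedged way to test the version-mismatch theory (hypothetical filename): the first line of a legacy VTK file declares the format version it was written with, and older readers can choke on newer versions.

from itertools import islice

# Print the header of the legacy VTK file; the first line looks like
# "# vtk DataFile Version 4.2".
with open("mesh.vtk") as f:
    for line in islice(f, 5):
        print(line.rstrip())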

error uploading csv file on cloud jupyter notebook

I have set up a Google Cloud account.
I want to run my deep learning much faster on a Jupyter notebook, but I cannot find a way to read my CSV file.
I downloaded it with wget from my GitHub account, and afterwards I tried
dataset = pd.read_csv('/home/user/.jupyter/SIEMENSTRAIN.csv')
but I get the following error
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
Why? When I read it on my laptop in my Jupyter notebook, everything runs fine.
Any suggestions?
I tried the recommended solutions for this error and got the following warning:
/home/user/anaconda3/lib/python3.5/site-packages/ipykernel/main.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
if __name__ == '__main__':
When I ran dataset.head() this is what appeared
Any help please?
There are a number of possibilities that could be causing the problem... I would first make sure that your Pandas (pd) version is up to date and compatible.
The more likely cause is that the CSV itself is not right, so pd.read_csv() is not able to parse it correctly (hence the parse error). This may have something to do with the headers, though I'm not sure what your original CSV file looks like. It's worth playing around with read_csv, for example:
df = pd.read_csv(fileName, sep='delimiter', header=None)
This changes two things: the delimiter, and whether pd reads a header row from the CSV.
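Since the error says pandas expected 2 fields but saw 12 in line 3, a quick hedged check is to print the raw first lines of the file (using the path from the question) and see what the delimiter and rows actually look like:

# If wget fetched the GitHub HTML page rather than the raw CSV, HTML tags
# will show up here, which would explain the field-count mismatch.
with open('/home/user/.jupyter/SIEMENSTRAIN.csv') as f:
    for _ in range(5):
        print(repr(f.readline()))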
I go through some pd.read_csv() material in my book about Stock Prediction (another cool Machine Learning problem) and Deep Learning; feel free to check it out.
Good Luck!
I tried what you proposed, and this is what I got:
So, any suggestions?
I suppose the path is OK, but it just won't be read properly, or am I wrong?
