Can't get the John Snow Labs OCR notebook to run on Databricks - apache-spark

So I am trying to follow this notebook and get it to work in a Databricks notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/ocr-spell/OcrSpellChecking.ipynb. However, after installing all the packages, I still get stuck by the time I get to:
{ // for displaying
  val regions = data.select("region").collect().map(_.get(0))
  regions.foreach { chunk =>
    println("---------------")
    println(chunk)
  }
}
The error message is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 51, 10.195.249.145, executor 4): java.lang.NoClassDefFoundError: Could not initialize class net.sourceforge.tess4j.TessAPI
Does anyone know why? Much appreciated!

To use Spark NLP OCR you need to install Tesseract 4.x+, as the documentation states. In a cluster, you must have it on all the nodes. However, if you are only dealing with PDFs and not scanned images, you can probably skip the Tesseract 4.x+ installation:
import com.johnsnowlabs.nlp.util.io.OcrHelper
val ocrHelper = new OcrHelper()
val df = ocrHelper.createDataset(spark, "/tmp/Test.pdf")
Update: There is new documentation for Spark OCR, with special instructions for Databricks:
https://nlp.johnsnowlabs.com/docs/en/ocr
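If you do need Tesseract on a Databricks cluster, one common approach is a cluster-scoped init script that installs it on every node. Here is a minimal sketch, assuming an Ubuntu-based Databricks runtime; the DBFS path and package names are illustrative, not taken from the official instructions:
# Run this once in a Python notebook cell; dbutils is available in Databricks notebooks.
# Afterwards, register the script under the cluster's "Init Scripts" settings and restart
# the cluster so every worker gets Tesseract installed.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/install-tesseract.sh",
    """#!/bin/bash
set -e
apt-get update
apt-get install -y tesseract-ocr libtesseract-dev
""",
    True,  # overwrite if the script already exists
)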

Related

to_date conversion failing in PySpark on Spark 3.0

Having read about the calendar change in Spark 3.0, I am trying to understand why the cast is failing in this particular instance. Spark 3.0 has issues with dates before the year 1582; however, in this example the year is greater than 1582.
from pyspark.sql import Row

row = Row("date")  # not shown in the original snippet; reconstructed to match the 'date' column used below
rdd = sc.parallelize(["3192016"])
df = rdd.map(row).toDF()
df.createOrReplaceTempView("date_test")
sqlDF = spark.sql("SELECT to_date(date, 'yyyymmdd') FROM date_test")
Fails with
Py4JJavaError: An error occurred while calling o1519.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 167.0 failed 4 times, most recent failure: Lost task 10.3 in stage 167.0 (TID 910) (172.36.189.123 executor 3): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3192016' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
You just need to set spark.sql.legacy.timeParserPolicy to LEGACY to get the behaviour from previous versions.
The error itself spells this out:
SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3192016' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Here is how you can do it with Python:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
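For completeness, a minimal end-to-end PySpark sketch, reusing the sample value and pattern from the question (note that in the Spark 3.x pattern syntax 'MM' is the month and 'mm' is minutes, so 'yyyymmdd' itself may not be the intended format):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Restore the pre-Spark-3.0 parser, as the error message suggests.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Rebuild the example from the question and re-run the previously failing query.
df = spark.createDataFrame([("3192016",)], ["date"])
df.createOrReplaceTempView("date_test")
spark.sql("SELECT to_date(date, 'yyyymmdd') FROM date_test").show()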

Record larger than the Split size in AWS GLUE?

I'm a newbie with AWS Glue and Spark.
I built my ETL in it.
When I connect it to my S3 bucket with files of approximately 200 MB, it does not read them.
The error is:
An error was encountered:
An error occurred while calling o99.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 16) (91ec547edca7 executor driver): com.amazonaws.services.glue.util.NonFatalException: Record larger than the Split size: 67108864
Update 1:
When I split my JSON file (200 MB) into two parts with jq, AWS Glue reads both parts normally.
My workaround is a Lambda function that splits the file, but I want to know how the AWS Glue split works.
Thanks and regards

org.apache.spark.SparkException: Job aborted due to stage failure in pyspark

Sorry for the duplicate post; I'm creating another one because the earlier posts did not solve my problem.
I'm running ML regression on PySpark 3.0.1, on a cluster with 640 GB of memory and 32 worker nodes.
I have a data set with 33,751 rows and 63 columns, and I'm trying to prepare it for ML regression. So I wrote the following code:
from pyspark.ml.feature import VectorAssembler, StandardScaler

input_col = [...]
vector_assembler = VectorAssembler(inputCols=input_col, outputCol='ss_feature')
temp_train = vector_assembler.transform(train)
standard_scaler = StandardScaler(inputCol='ss_feature', outputCol='scaled')
train = standard_scaler.fit(temp_train).transform(temp_train)
But I'm getting an error message when the last line executes:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 169 in stage 57.0 failed 4
times, most recent failure: Lost task 169.3 in stage 57.0 (TID 5522, 10.8.64.22, executor 11):
org.apache.spark.SparkException: Failed to execute user defined
function(VectorAssembler$$Lambda$6296/1890764576:
Can you suggest how I can solve this issue?
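The stack trace is truncated, but a failed VectorAssembler UDF is very often caused by null (or NaN) values in the input columns; that is an assumption here, since the rest of the trace is not shown. A quick sketch for checking the inputs and, if needed, letting the assembler skip bad rows via its handleInvalid option:
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler

# train and input_col are the same objects as in the question's snippet.
# Count nulls per input column to see whether any feature column is incomplete.
train.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in input_col]
).show()

# If nulls/NaNs show up, either impute them or have the assembler drop those rows.
vector_assembler = VectorAssembler(
    inputCols=input_col, outputCol='ss_feature', handleInvalid='skip'
)
temp_train = vector_assembler.transform(train)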

convert pickle (.pck) file into spark data frame using python

Hello!
Dear members, I want to train a model using BigDL. I have a data set of medical images in the form of pickle object files (.pck); each pickle file is a 3D image (a 3D array).
I have tried to convert this into a Spark DataFrame using the BigDL Python API:
from pyspark.sql import SQLContext

pickleRdd = sc.pickleFile("/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(pickleRdd)
It throws this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver)
: java.io.IOException: file:/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck not a SequenceFile
I have executed this code on Python 3.5 as well as 2.7; in both cases I am getting the error.
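For what it's worth, sc.pickleFile reads RDDs that were previously written with rdd.saveAsPickleFile, which stores them as Hadoop SequenceFiles; that is why an ordinary pickle created with Python's pickle module fails with "not a SequenceFile". Below is a minimal sketch of one alternative, assuming the .pck file holds a single pickled 3D NumPy array small enough to load on the driver; the resulting (slice_index, values) layout is only illustrative:
import pickle

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the ordinary pickle on the driver (path taken from the question).
with open("/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck", "rb") as f:
    volume = pickle.load(f)  # assumed to be a 3D array of shape (slices, height, width)

# Turn each 2D slice into one row of flattened pixel values.
rows = [(i, plane.ravel().tolist()) for i, plane in enumerate(volume)]
df = spark.createDataFrame(rows, ["slice_index", "values"])
df.printSchema()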

Spark and SparklyR error "grows beyond 64 KB"

I am getting the following error on Spark after calling logistic regression using SparklyR and Spark 2.0.2.
ml_logistic_regression(Data, ml_formula)
The dataset that I read into Spark is relatively large (2.2GB). Here is the error message:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task
13 in stage 64.0 failed 1 times, most recent failure:
Lost task 13.0 in stage 64.0 (TID 1132, localhost):
java.util.concurrent.ExecutionException:
java.lang.Exception:
failed to compile: org.codehaus.janino.JaninoRuntimeException:
Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z"
of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate"
grows beyond 64 KB
Others have had a similar issue: https://github.com/rstudio/sparklyr/issues/298 but I cannot find a resolution. Any ideas?
What happens when you subset the data and try running the model? You might need to change your configuration settings to deal with the size of the data:
library(dplyr)
library(sparklyr)
#configure the spark session and connect
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "XXG" #change depending on the size of the data
config$`sparklyr.shell.executor-memory` <- "XXG"
sc <- spark_connect(master='yarn-client', spark_home='/XXXX/XXXX/XXXX',config = config)
There are other settings in spark_config() that you could change as well to deal with performance. This is just an example of a couple.
