convert pickle (.pck) file into spark data frame using python - apache-spark

Hello!
Dear members, I want to train a model using BigDL. I have a dataset of medical images in the form of pickle object files (.pck); each pickle file is a 3D image (a 3D array).
I have tried to convert this into a Spark DataFrame using the BigDL Python API:
pickleRdd = sc.pickleFile("/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(pickleRdd)
It throws this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver)
: java.io.IOException: file:/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck not a SequenceFile
I have executed this code on Python 3.5 as well as 2.7; in both cases I am getting this error.
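For context, sc.pickleFile() reads RDDs that were previously written with RDD.saveAsPickleFile(), which Spark stores as Hadoop SequenceFiles; that is why the error says the .pck file is "not a SequenceFile". Below is a minimal sketch (not the original code, and assuming the file holds a 3D NumPy array as described above) of loading a plain Python pickle and turning it into a DataFrame:
import pickle
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck"
with open(path, "rb") as f:
    volume = np.asarray(pickle.load(f))  # assumed: a 3D array of pixel values

# Flatten each 2D slice into one row so the volume fits a tabular DataFrame.
rows = [(i, s.flatten().tolist()) for i, s in enumerate(volume)]
df = spark.createDataFrame(rows, ["slice_index", "pixels"])
df.printSchema()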

Related

org.apache.spark.SparkException: Job aborted due to stage failure in pyspark

Sorry for the duplicate post. I'm creating another post because the earlier ones didn't solve my problem.
I'm running an ML regression on PySpark 3.0.1, on a cluster with 640 GB of memory and 32 worker nodes.
I have a dataset with 33,751 rows and 63 columns, and I'm trying to prepare it for ML regression. So I wrote the following code:
from pyspark.ml.feature import VectorAssembler, StandardScaler
input_col=[...]
vector_assembler = VectorAssembler(inputCols=input_col, outputCol='ss_feature')
temp_train = vector_assembler.transform(train)
standard_scaler = StandardScaler(inputCol='ss_feature', outputCol='scaled')
train = standard_scaler.fit(temp_train).transform(temp_train)
But I'm getting an error message when the last line executes:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 169 in stage 57.0 failed 4
times, most recent failure: Lost task 169.3 in stage 57.0 (TID 5522, 10.8.64.22, executor 11):
org.apache.spark.SparkException: Failed to execute user defined
function(VectorAssembler$$Lambda$6296/1890764576:
Can you suggest how I can solve this issue?
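One common cause of this VectorAssembler failure (an assumption here, since the data isn't shown) is null or NaN values in the input columns, which the assembler rejects by default. A hedged sketch using its handleInvalid option:
from pyspark.ml.feature import VectorAssembler, StandardScaler

# input_col is the same column list as above, assumed to be defined already.
vector_assembler = VectorAssembler(inputCols=input_col,
                                   outputCol='ss_feature',
                                   handleInvalid='skip')  # default 'error' aborts on null/NaN rows
temp_train = vector_assembler.transform(train)
standard_scaler = StandardScaler(inputCol='ss_feature', outputCol='scaled')
train = standard_scaler.fit(temp_train).transform(temp_train)
Dropping or imputing the offending rows beforehand (e.g. with DataFrame.dropna()) is the alternative if skipping rows is not acceptable.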

Cannot read parquet using PySpark

I'm currently writing a program in PySpark, which involves writing a dataframe to parquet in a loop. New data is appended to the parquet in each cycle, and the parquet is stored in an S3 bucket.
I was able to write the parquet, but when I load it back into a dataframe and try to read it using
df.take(5)
I encounter the following error message:
An error occurred while calling o461.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 57, ip-10-0-2-219.ec2.internal, executor 5):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
...
However, I can run the following commands on the dataframe:
df.count()
df.printSchema()
Any idea why this error occurs?
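A frequent cause (assumed here, since the write loop isn't shown) is that a column's type drifted between appended parquet parts, e.g. written with a string dictionary in one batch and a different type later; df.count() and df.printSchema() succeed because they never decode the column values, while df.take() does. A minimal sketch of pinning the schema before every append:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, DoubleType

# Hypothetical column names, types, and S3 path; cast every batch to one fixed
# schema so all appended parquet parts stay mutually readable.
expected = {"id": StringType(), "amount": DoubleType()}
df_fixed = df.select([F.col(c).cast(t) for c, t in expected.items()])
df_fixed.write.mode("append").parquet("s3://my-bucket/output/")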

Can't get the johnsnow OCR notebook run on databricks

So I am trying to follow this notebook and get it to work in a Databricks notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/ocr-spell/OcrSpellChecking.ipynb. However, after installing all the packages, I still get stuck by the time I get to
{ // for displaying
  val regions = data.select("region").collect().map(_.get(0))
  regions.foreach { chunk =>
    println("---------------")
    println(chunk)
  }
}
Error message is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 51, 10.195.249.145, executor 4): java.lang.NoClassDefFoundError: Could not initialize class net.sourceforge.tess4j.TessAPI
Does anyone know why? Much appreciated!
To use Spark NLP OCR you need to install Tesseract 4.x+, as the documentation states. In a cluster, it must be installed on all nodes. However, if you are only dealing with PDFs and not scanned images, you can probably skip the Tesseract 4.x+ installation:
import com.johnsnowlabs.nlp.util.io.OcrHelper
val ocrHelper = new OcrHelper()
val df = ocrHelper.createDataset(spark, "/tmp/Test.pdf")
Update: There is a new doc for Spark OCR and special instructions for Databricks:
https://nlp.johnsnowlabs.com/docs/en/ocr

I am trying to count the number of rows fetched from a dataframe using Spark SQL, but I am getting the below error

I am trying to count the number of rows fetched from a dataframe using Spark SQL, but I am getting the error below. Can someone help me resolve this issue?
My SQL query is below. collisions is my temporary table name, and I am trying to show the results.
outputDataframe = spark.sql("select count(*) from collisions where YEAR(date) = 2016")
outputDataframe.show()
Error:
Py4JJavaError: An error occurred while calling o395.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, localhost, executor driver): java.lang.IllegalArgumentException: PARKING LOT)."
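One likely explanation (an assumption; the creation of the collisions table isn't shown) is that an exception quoting a data value like "PARKING LOT)." means the source CSV was parsed with its columns shifted by embedded commas or quotes, so the date column holds text from another field and the YEAR(date) filter fails on it. A hedged sketch of re-reading the CSV with explicit quote handling before registering the view:
# Hypothetical path and options; adjust to the actual collisions CSV.
collisions_df = (spark.read
                 .option("header", True)
                 .option("quote", '"')
                 .option("escape", '"')
                 .option("multiLine", True)
                 .csv("/path/to/collisions.csv"))
collisions_df.createOrReplaceTempView("collisions")

outputDataframe = spark.sql("select count(*) from collisions where YEAR(date) = 2016")
outputDataframe.show()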

Read, sort and count 20GB CSV file stored in HDFS by using pyspark RDD

I am new to Spark and Hadoop. I have a use case in which I am trying to read a 20 GB CSV file, count its records, and sort the data. The problem is that when I use these functions it doesn't work. Here is my code; please have a look and suggest an approach for handling this large file with a Spark RDD.
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
APP_NAME = 'My Spark Application'
file = 0
conf = SparkConf().setAppName(APP_NAME).setMaster("local")
sc = SparkContext(conf=conf)
val_file = sc.textFile("hdfs://localhost:50000/yottaa/transactions.csv")
val_file.count()  # It's taking 10 minutes to execute and produce a result.
val_file.count() ---> It's taking 10 minutes to count the rows; how can I increase the speed? I'm using a laptop with 16 GB of RAM, and when I run the val_file.collect() statement it shows me the following error:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
in stage 0.0 (TID 0, localhost): java.lang.OutOfMemoryError: GC
overhead limit exceeded at
java.nio.HeapCharBuffer.(HeapCharBuffer.java:57) at
java.nio.CharBuffer.allocate(CharBuffer.java:331) at
java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:777) at
org.apache.hadoop.io.Text.decode(Text.java:412) at
org.apache.hadoop.io.Text.decode(Text.java:389) at
org.apache.hadoop.io.Text.toString(Text.java:280) at
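For a 20 GB file, collect() is the immediate problem: it pulls every row into the driver's 16 GB of memory, which is what produces the GC overhead error above. A hedged sketch (not from the post) that keeps the work distributed and only brings a small sample back to the driver:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("My Spark Application")
         .master("local[*]")  # use all local cores instead of a single one
         .getOrCreate())

# An explicit schema would avoid the extra pass that inferSchema makes over 20 GB.
df = spark.read.csv("hdfs://localhost:50000/yottaa/transactions.csv",
                    header=True, inferSchema=True)

print(df.count())  # counted on the executors; only the number comes back to the driver
df.orderBy("transaction_id").show(20)  # hypothetical sort column; shows a small sample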

Resources