Not able to use JohnSnowLabs pretrained model in Zeppelin - apache-spark

I want to use the JohnSnowLabs pretrained spell check module in my Zeppelin notebook. As mentioned here, I have added com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.3 to the Zeppelin interpreter's dependency section.
However, when I try to run the following simple code:
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.Finisher
val df = Seq("tiolt cde", "eefg efa efb").toDF("names")
val nlpPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected"),
  new Finisher().setInputCols("corrected")
))
df.transform(df => nlpPipeline.fit(df).transform(df)).show(false)
it gives an error as follows:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, xxx.xxx.xxx.xxx, executor 0): java.io.FileNotFoundException: File file:/root/cache_pretrained/spell_fast_en_1.6.2_2_1534781328404/metadata/part-00000 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
...
How can I add this JohnSnowLabs spell check pretrained model in Zeppelin? The above code works when run directly in the Spark shell.

Whenever you have a problem with the automatic download of pre-trained models/pipelines due to your environment, you can always load them manually.
Here is an example of loading a French model (the same concept applies to any other annotator):
val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
.setInputCols("document", "token")
.setOutputCol("pos")
Source:
https://nlp.johnsnowlabs.com/docs/en/models
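Applied to the spell checker from the question, the same approach would look roughly like this (a hedged sketch; the local path is a placeholder for wherever you extracted the downloaded spell_fast_en model):
// Hedged sketch: load a manually downloaded spell-check model instead of calling pretrained()
// (the path below is a placeholder -- point it at the directory you unpacked the model into)
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel

val spellChecker = NorvigSweetingModel.load("/tmp/spell_fast_en_1.6.2_2_1534781328404/")
  .setInputCols("tokens")
  .setOutputCol("corrected")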

Related

Getting error while doing Standardization after Window Partitioning of Pyspark Dataframe

Dataframe:
Above is my dataframe. I want to add a new column with value 1 if the first transaction_date for an item is after 01.01.2022, else 0.
To do this I use the below window partition code:
from pyspark.sql import functions as f
from pyspark.sql.functions import row_number, when
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("article_id").orderBy("transaction_date")
feature_grid = feature_grid.withColumn("row_number", row_number().over(windowSpec)) \
    .withColumn('new_item',
                when((f.col('row_number') == 1) & (f.col('transaction_date') >= '01.01.2022'), 1)
                .otherwise(0)) \
    .drop('row_number')
I want to perform clustering on the dataframe, for which I am using VectorAssembler with the below code:
from pyspark.ml.feature import VectorAssembler
input_cols = feature_grid.columns
assemble=VectorAssembler(inputCols= input_cols, outputCol='features')
assembled_data=assemble.transform(feature_grid)
For standardisation:
from pyspark.ml.feature import StandardScaler
scale=StandardScaler(inputCol='features',outputCol='standardized')
data_scale=scale.fit(assembled_data)
data_scale_output=data_scale.transform(assembled_data)
display(data_scale_output)
The standardisation code chunk gives me the below error, but only when I use the above partitioning method; without that partitioning method, the code works fine.
Error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 182.0 failed 4 times, most recent failure: Lost task
0.3 in stage 182.0 (TID 3635) (10.205.234.124 executor 1): org.apache.spark.SparkException: Failed to execute user defined
function (VectorAssembler$$Lambda$3621/907379691
Can someone tell me what I am doing wrong here, or what the cause of the error is?
This error is triggered by null values in the columns that are assembled by the Spark VectorAssembler. Please fill the null values before transforming your dataframe.
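For example, a minimal sketch of that suggestion (assuming the numeric feature columns can safely be imputed with 0; pick a fill value that makes sense for your features):
from pyspark.ml.feature import VectorAssembler

# Hedged sketch: impute nulls before assembling; 0 is an assumed placeholder value
# (fillna ignores columns in the subset whose type does not match the fill value)
feature_grid = feature_grid.fillna(0, subset=input_cols)

# Alternative on Spark 2.4+: let VectorAssembler keep or skip invalid rows instead of failing
assemble = VectorAssembler(inputCols=input_cols, outputCol='features', handleInvalid='keep')
assembled_data = assemble.transform(feature_grid)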

PySpark textFile replace text

The following is a few rows from an example file which is ~ 30GB
### s3://mybucket/tmp/file_in.txt
"one"|"mike"|"456"|"2010-01-04"
"two"|"lisa"|"789"|"2011-03-08"
"three"|"ann"|"845"|"2012-06-11"
I'd like to use PySpark to...
read the text file using spark's parallelism
replace the "n" character with "X"
output the updated text to a new text file with the same format
so the resulting file would look like this:
### s3://mybucket/tmp/file_out.txt
"oXe"|"mike"|"456"|"2010-01-04"
"two"|"lisa"|"789"|"2011-03-08"
"three"|"aXX"|"845"|"2012-06-11"
I have tried a variety of ways to achieve this seemingly simple task...
data = sc.textFile('s3://mybucket/tmp/file_in.txt')
def make_replacement(row):
result = row.replace("n", "X")
return result
out_data = data.map(make_replacement).collect()
#out_data = data.map(lambda line: make_replacement(line)).collect()
out_data.coalesce(1).write.format("text").option("header", "false").save("s3://mybucket/tmp/file_out.txt")
but I continue to see the following errors:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 21, <<my_server>>, executor 9): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application....
at org.apache.spark.api.python.VirtualEnvFactory.execCommand(VirtualEnvFactory.scala:120)
Note: solutions using read.csv or dataframe will not be applicable to this problem
Any recommendations on how to solve this?
You can build a list of column expressions and apply it in a select:
from pyspark.sql import functions as F

df = spark.read.csv('s3://mybucket/tmp/file_in.txt', sep='|')
expr = [F.regexp_replace(F.col(column), "n", "X").alias(column) for column in df.columns]
df = df.select(expr)
df.write.format("csv").option("header", "false").option("sep", "|").save("s3://mybucket/tmp/file_out.txt")
If you do not otherwise need to work with the data set in Spark, why use Spark for this at all?
Use plain Python file read and write code and replace the character.
Sample code:
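(A minimal sketch of what that could look like, assuming the file has first been copied from S3 to local disk; reading line by line keeps memory use small even for a ~30GB file.)
# Hedged sketch: plain-Python line-by-line replace on a local copy of the file
with open('file_in.txt') as fin, open('file_out.txt', 'w') as fout:
    for line in fin:
        fout.write(line.replace('n', 'X'))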

Handling corrupt JSON rows in Spark 2.11 - different behaviour than 1.6

We have snappy files that we read with the SQL context, e.g.:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("s3://bucket/problemfile.snappy")
In Spark 1.6 we would handle corrupt records with something like the below:
invalidJSON = rawEvents.select("*").where("_corrupt_record is not null");
validJSON = rawEvents.select("*").where("_corrupt_record is null");
In Spark 2.11, we are not even able to read the corrupted record, e.g.:
scala> df.select("*").where("_corrupt_record is null").count()
18/03/31 00:45:06 ERROR TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-31-48-73.ec2.internal, executor 2):
java.io.CharConversionException: Unsupported UCS-4 endianness (3412) detected
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.reportWeirdUCS4(ByteSourceJsonBootstrapper.java:469)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.checkUTF32(ByteSourceJsonBootstrapper.java:434)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:141)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1287)
I know we can set spark.sql.files.ignoreCorruptFiles=true in 2.X, but then we'd potentially lose records depending on where the corrupted record was.
Is there any other way we can skip over the corrupted record?
Thanks
You could do something like this:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read
.option("mode", "DROPMALFORMED")
.json("s3://bucket/problemfile.snappy")
This way Spark will drop invalid JSON for you, but you won't see any corrupt record.
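If you still want to inspect the bad rows the way you did in 1.6, a hedged alternative sketch is to stay in PERMISSIVE mode and route malformed rows into an explicit corrupt-record column (whether this survives the UCS-4 encoding failure above depends on the Spark version, and newer 2.x releases require caching the dataframe before filtering on that column alone):
// Hedged alternative sketch: keep malformed rows in a corrupt-record column
val dfWithCorrupt = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("s3://bucket/problemfile.snappy")
  .cache()  // cache before referencing only the corrupt-record column

val validJSON = dfWithCorrupt.where("_corrupt_record is null")
val invalidJSON = dfWithCorrupt.where("_corrupt_record is not null")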

How to read/write a hive table from within the spark executors

I have a requirement wherein I am using DStreams to retrieve messages from Kafka. After getting a message (or rather an RDD), I use a map operation to process the messages independently on the executors. The one challenge I am facing is that I need to read/write to a Hive table from within the executors, and for this I need access to a SQLContext. But as far as I know, SparkSession is available on the driver side only and should not be used within the executors, and without the SparkSession (in Spark 2.1.1) I can't get hold of a SQLContext. To summarize:
My driver code looks something like:
if (inputDStream_obj.isSuccess) {
  val inputDStream = inputDStream_obj.get
  inputDStream.foreachRDD(rdd => {
    if (!rdd.isEmpty) {
      val rdd1 = rdd.map(idocMessage => SegmentLoader.processMessage(props, idocMessage.value(), true))
    }
  })
}
After this rdd.map, the next code is executed on the executors, and there I have something like:
val sqlContext = spark.sqlContext
import sqlContext.implicits._
spark.sql("USE " + databaseName)
val result = Try(df.write.insertInto(tableName))
Passing the SparkSession or SQLContext gives an error when they are used on the executors:
When I try to obtain the existing SparkSession: org.apache.spark.SparkException: A master URL must be set in your configuration
When I broadcast the session variable: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 9, <server>, executor 2): java.lang.NullPointerException
When I pass the SparkSession object: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 9, <server>, executor 2): java.lang.NullPointerException
Let me know if you can suggest how to query/update a hive table from within the executors.
Thanks,
Ritwick
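For reference, the pattern usually suggested for this situation (a hedged sketch, not a confirmed answer from this thread) is to keep all Hive access on the driver: the map on the executors only transforms the records, and the resulting RDD is turned into a DataFrame and written inside foreachRDD using the driver-side SparkSession.
// Hedged sketch: write from the driver inside foreachRDD; processMessage is assumed here
// to return plain data (e.g. a case class) instead of touching Hive itself
inputDStream.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    val processed = rdd.map(idocMessage => SegmentLoader.processMessage(props, idocMessage.value(), true))
    import spark.implicits._            // 'spark' is the driver-side SparkSession
    val df = processed.toDF()           // requires processMessage to return a case class / Product
    spark.sql("USE " + databaseName)
    df.write.insertInto(tableName)      // issued from the driver; the actual work still runs on executors
  }
})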

Handle Null Values when using CustomSchema in apache spark

I am importing data based on a customSchema which I have defined in the following way:
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}
val customSchema_train = StructType(Array(
StructField("x53",DoubleType,true),
StructField("x95",DoubleType,true),
StructField("x88",DoubleType,true),
StructField("x30",DoubleType,true),
StructField("x42",DoubleType,true),
StructField("x28",DoubleType,true)
))
val train_orig = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema_train)
  .option("nullValue", "null")
  .load("/....../train.csv")
  .cache
Now, I know there are null values in my data which appear as "null", and I have tried to handle that accordingly. The import happens without any error, but when I try to describe the data I get this error:
train_df.describe().show
SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 1 times, most recent failure: Lost task 0.0 in stage 46.0 (TID 56, localhost): java.text.ParseException: Unparseable number: "null"
How can I handle this error?
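One common workaround for this kind of failure (a hedged sketch, not an answer from this thread; it assumes the literal "null" strings are what the numeric parser chokes on) is to read the columns as strings and cast them afterwards, since cast turns unparseable values into real nulls:
// Hedged sketch: read everything as strings, then cast to DoubleType so that
// literal "null" values become real nulls instead of unparseable numbers
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val train_str = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/....../train.csv")

val train_df = train_str.select(train_str.columns.map(c => col(c).cast(DoubleType).as(c)): _*)
train_df.describe().show()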
