I am getting a scala.MatchError when using a ParamGridBuilder in Spark 1.6.1 and 2.0
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.1, 0.01))
.addGrid(lr.fitIntercept)
.addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
.build()
Error is
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 1 times, most recent failure: Lost task 0.0 in stage 57.0 (TID 257, localhost):
scala.MatchError: [280000,1.0,[2400.0,9373.0,3.0,1.0,1.0,0.0,0.0,0.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Full code
The question is how I should use ParamGridBuilder in this case
The problem here is the input schema, not ParamGridBuilder. The price column is loaded as an integer, while LinearRegression expects a double. You can fix it by explicitly casting the column to the required type:
val houses = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(...)
.withColumn("price", $"price".cast("double"))
Related
I'm trying to create a DataFrame from two custom sentences, just to test, but with the code I wrote I'm unable to create it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('first').getOrCreate()
df = spark.createDataFrame(
[
(0, "Hi this is a Spark tutorial"),
(1, "This tutorial is made in Python language")
], ['id', 'sentence']
)
df.show()
This gives me this error:
Py4JJavaError: An error occurred while calling o73.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
I tried to create a schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType(
    [StructField("id", IntegerType(), True),
     StructField("sentence", StringType(), True)]
)
and pass it as an argument (schema=schema), but it is the same dead end.
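For reference, a minimal sketch of the explicit-schema variant described above, assuming the usual pyspark.sql.types imports; the "Python worker failed to connect back" error generally points at the Python worker environment rather than at the schema, so this only restates the intended call:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('first').getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("sentence", StringType(), True),
])

# Same two rows as above, this time with the schema passed explicitly.
df = spark.createDataFrame(
    [(0, "Hi this is a Spark tutorial"),
     (1, "This tutorial is made in Python language")],
    schema=schema,
)
df.show()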
My code was working fine earlier, but now it is failing with what looks like a cache/memory issue. The program loads a DataFrame, transforms it, and processes it in a Jupyter notebook connected to the PySpark shell.
I do not understand what the main issue is or how to tackle it. Any help is highly appreciated.
My code is:
import time
start = time.time()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('spark://172.16.12.200:7077').appName('new').getOrCreate()

ndf = spark.read.json("Musical_Instruments.json")
pd = ndf.select(ndf['asin'], ndf['overall'], ndf['reviewerID'])

spark.sparkContext.setCheckpointDir("/home/npproject/jupyter_files/checkpoints")

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

# Index the string columns (everything except the rating column "overall")
indexer = [StringIndexer(inputCol=column, outputCol=column + "_index")
           for column in list(set(pd.columns) - set(['overall']))]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(pd).transform(pd)

(training, test) = transformed.randomSplit([0.8, 0.2])

als = ALS(maxIter=5, regParam=0.09, rank=25,
          userCol="reviewerID_index", itemCol="asin_index", ratingCol="overall",
          coldStartStrategy="drop", nonnegative=True)
model = als.fit(training)

evaluator = RegressionEvaluator(metricName="rmse", labelCol="overall", predictionCol="prediction")
predictions = model.transform(test)
rmse = evaluator.evaluate(predictions)
print("RMSE=" + str(rmse))
print("Rank: ", model.rank)
print("MaxIter: ", model._java_obj.parent().getMaxIter())
print("RegParam: ", model._java_obj.parent().getRegParam())

user_recs = model.recommendForAllUsers(10).show(20)

end = time.time()
print("execution time", end - start)
The error is:
Py4JJavaError: An error occurred while calling o40.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 172.16.12.208, executor 1): java.io.FileNotFoundException: File file:/home/npproject/jupyter_files /Musical_Instruments.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
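Note that the job is submitted to a standalone cluster (spark://172.16.12.200:7077), so a relative path like "Musical_Instruments.json" resolves to a file:/ path that every executor must be able to read; the failing task in the trace is on worker 172.16.12.208, not on the driver. A hedged sketch, assuming the file is moved to storage all workers can reach (the hdfs:// path below is only a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('spark://172.16.12.200:7077').appName('new').getOrCreate()

# Read from a location visible to every executor, not just from the driver's local disk.
# The HDFS path is a placeholder; a shared mount present on all workers would also work.
ndf = spark.read.json("hdfs:///data/Musical_Instruments.json")
ndf.printSchema()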
This question already has an answer here:
VectorUDT usage
Error while converting a Vector to a DataFrame
The code mentioned in the first part works well, but it is a non-intuitive way to convert vector data into a DataFrame. I would like to solve this with what I know, i.e. the code mentioned in the second part. Could you please assist?
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._ // needed for .toDF outside the shell

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)
val tupleList = data.map(Tuple1.apply)
val df = tupleList.toDF("features")
Can't we simply do it like below?
val rdd = sc.parallelize(data).map(a => Row(a))
rdd.take(1)
val fields = "features".split(" ").map(fields => StructField(fields,DoubleType, nullable =true))
val df = spark.createDataFrame(rdd, StructType(fields))
df.count()
But I am getting an error as below
df: org.apache.spark.sql.DataFrame = [features: double]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 357.0 failed 4 times, most recent failure: Lost task 1.3 in stage 357.0 (TID 1243, datacouch, executor 3): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: org.apache.spark.ml.linalg.DenseVector is not a valid external type for schema of double
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, features), DoubleType) AS features#6583
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:586)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:586)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
As explained in VectorUDT usage and in the exception you get, the correct DataType for a Vector is org.apache.spark.ml.linalg.SQLDataTypes.VectorType:
spark.createDataFrame(
  rdd,
  StructType(Seq(
    StructField("features", org.apache.spark.ml.linalg.SQLDataTypes.VectorType)
  ))
)
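For comparison, a minimal PySpark sketch of the same idea, assuming an active SparkSession named spark; pyspark.ml.linalg.VectorUDT plays the role of SQLDataTypes.VectorType:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructType, StructField

# VectorUDT() is the SQL data type that matches ml Vectors.
schema = StructType([StructField("features", VectorUDT())])
rows = [(Vectors.dense(4.0, 5.0, 0.0, 3.0),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = spark.createDataFrame(rows, schema)
df.printSchema()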
I have a Spark DataFrame I would like to use to run a simple PCA example. I have looked at this example and notice that it works because they pack the features into vectors:
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
I am trying to reproduce the same kind of simple PCA using a Spark DataFrame I have created myself. How would I transform my Spark DataFrame into a form like the one above, so that I can run it with one input column and one output column?
I looked into using RowMatrix as shown here, but I am not sure whether this is the way to go (see the error below).
import pandas as pd
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.ml.feature import PCA

master = pd.read_parquet('master.parquet', engine='fastparquet')
A = sc.parallelize(master)
mat = RowMatrix(A)
pc = mat.computePrincipalComponents(4)
Py4JJavaError: An error occurred while calling o382.computePrincipalComponents.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException:
Traceback (most recent call last)
In PySpark, the ML library expects all the features to be combined into a single feature vector column.
You can do this with a VectorAssembler:
https://spark.apache.org/docs/latest/ml-features.html#vectorindexer
from pyspark.ml.feature import VectorAssembler

# inputColumnsList is the list of feature column names to combine
assembler = VectorAssembler(inputCols=inputColumnsList, outputCol='features')
output = assembler.transform(df)
where inputColumnsList contains the list of all the feature columns you want to use.
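Putting it together for the PCA use case above, a minimal sketch: assemble the numeric columns into a single "features" vector and then fit the PCA (the input column names below are hypothetical placeholders for your own DataFrame's columns):

from pyspark.ml.feature import VectorAssembler, PCA

# Hypothetical numeric feature columns; replace with your DataFrame's column names.
input_columns = ["col_a", "col_b", "col_c"]

assembler = VectorAssembler(inputCols=input_columns, outputCol="features")
assembled = assembler.transform(df)   # adds one vector column named "features"

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(assembled)            # fit PCA on the assembled vectors
model.transform(assembled).select("pca_features").show(truncate=False)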