This question already has an answer here:
VectorUDT usage
(1 answer)
Closed 4 years ago.
Error while converting Vector to a data frame
The code in the first part works, but it feels like a non-intuitive way to turn vector data into a DataFrame.
I would like to solve this with what I already know, i.e. the approach in the second part.
Could you please assist?
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val data = Seq(
Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
Vectors.dense(4.0, 5.0, 0.0, 3.0),
Vectors.dense(6.0, 7.0, 0.0, 8.0),
Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)
val tupleList = data.map(Tuple1.apply)
val df = tupleList.toDF("features")
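For reference, the working version gives the features column the Spark ML vector type rather than double, which is exactly what the exception further down complains about:
df.printSchema()
// root
//  |-- features: vector (nullable = true)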
Can't we simply do it like below?
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val rdd = sc.parallelize(data).map(a => Row(a))
rdd.take(1)
val fields = "features".split(" ").map(fields => StructField(fields,DoubleType, nullable =true))
val df = spark.createDataFrame(rdd, StructType(fields))
df.count()
But I am getting the error below:
df: org.apache.spark.sql.DataFrame = [features: double]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 357.0 failed 4 times, most recent failure: Lost task 1.3 in stage 357.0 (TID 1243, datacouch, executor 3): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: org.apache.spark.ml.linalg.DenseVector is not a valid external type for schema of double
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, features), DoubleType) AS features#6583
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:586)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:586)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
As clearly explained in VectorUDT usage and in the exception you get, the correct DataType for a Vector is org.apache.spark.ml.linalg.SQLDataTypes.VectorType:
spark.createDataFrame(
rdd,
StructType(Seq(
StructField("features", org.apache.spark.ml.linalg.SQLDataTypes.VectorType)
))
)
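Putting it together with the data and rdd already defined in the question, a minimal end-to-end sketch looks like this:
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{StructField, StructType}

val df = spark.createDataFrame(
  rdd,
  StructType(Seq(StructField("features", SQLDataTypes.VectorType)))
)

df.count()        // no more encoding error
df.printSchema()  // features: vector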
Related
I'm trying to create a DataFrame from 2 custom sentences, just to test, but with the code I wrote I'm unable to create it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('first').getOrCreate()
df = spark.createDataFrame(
[
(0, "Hi this is a Spark tutorial"),
(1, "This tutorial is made in Python language")
], ['id', 'sentence']
)
df.show()
This gives me this error:
Py4JJavaError: An error occurred while calling o73.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
I tried to create a schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType(
[StructField("id", IntegerType(), True),
StructField("sentence", StringType(), True)]
)
and pass it as an argument with schema=schema, but I hit the same dead end.
The Spark SVD code example looks like this:
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val rows = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(rows)
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(5, computeU = true)
Now the problem is how I can create an RDD of sparse vectors in order to fit the SVD function. The data looks like this, which is the JSON form of Vectors.sparse:
{"type":0,"size":205209,"indices":[24119,32380,201090],"values":
[1.8138314440983385,1.6036455249478836,1.3787660101958308]}
{"type":0,"size":205209,"indices":[24119,32380,176747,201090],"values":
[5.441494332295015,3.207291049895767,3.2043056252302478,2.7575320203916616]}
So far I have tried this:
val rows = df_raw_features.select("raw_features").rdd.map(Vectors.sparse).map(Row(_))
I got this error
[error] /home/lujunchen/project/spark_code/src/main/scala/svd_feature_engineer.scala:39:71: type mismatch;
[error] found : (size: Int, elements: Iterable[(Integer, Double)])org.apache.spark.mllib.linalg.Vector <and> (size: Int, elements: Seq[(Int, Double)])org.apache.spark.mllib.linalg.Vector <and> (size: Int, indices: Array[Int], values: Array[Double])org.apache.spark.mllib.linalg.Vector
[error] required: org.apache.spark.sql.Row => ?
[error] val rows = df_raw_features.select("raw_features").rdd.map(Vectors.sparse).map(Row(_))
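For what it's worth, one way to build such an RDD[Vector] is to rebuild each sparse vector from its fields. This is only a sketch and assumes the JSON lines were loaded with spark.read.json, so that size, indices and values end up as top-level columns with the usual inferred types; the actual layout of df_raw_features isn't shown in the question:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical column layout: size bigint, indices array<bigint>, values array<double>
val rows = df_raw_features
  .select("size", "indices", "values")
  .rdd
  .map { r =>
    Vectors.sparse(
      r.getAs[Long]("size").toInt,
      r.getAs[Seq[Long]]("indices").map(_.toInt).toArray,
      r.getAs[Seq[Double]]("values").toArray)
  }

val mat = new RowMatrix(rows)
val svd = mat.computeSVD(5, computeU = true)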
I want to convert an RDD to a DataFrame using StructType, but the record "Broken,Line," causes an error. Is there an elegant way to process records like this? Thanks.
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
val mySchema = StructType(Array(
StructField("colA", StringType, true),
StructField("colB", StringType, true),
StructField("colC", StringType, true)))
val x = List("97573,Start,eee", "9713,END,Good", "Broken,Line,")
val inputx = sc.parallelize(x).
  map((x: String) => Row.fromSeq(x.split(",").slice(0, mySchema.size).toSeq))
val df = spark.createDataFrame(inputx, mySchema)
df.show
Error would be like this:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 2
I'm using:
Spark: 2.2.0
Scala: 2.11.8
And I ran the code in spark-shell.
Row.fromSeq, to which your schema is applied, throws the error you are getting. The third element in your list yields only 2 fields, because split(",") drops the trailing empty string. You can't turn it into a Row with three columns unless you add a null (or placeholder) value for the missing field.
And when creating your DataFrame, Spark expects 3 elements per Row to apply the schema to, hence the error.
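You can see the root cause in a plain spark-shell line: String.split with the default limit drops trailing empty strings, so the broken record only produces two fields (the negative-limit variant is shown purely as an illustration of the difference):
"Broken,Line,".split(",")      // Array(Broken, Line)      -> only 2 fields
"Broken,Line,".split(",", -1)  // Array(Broken, Line, "")  -> trailing empty field kept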
A quick and dirty solution would be to use scala.util.Try to get the fields separately:
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import scala.util.Try
val mySchema = StructType(Array(StructField("colA", StringType, true), StructField("colB", StringType, true), StructField("colC", StringType, true)))
val l = List("97573,Start,eee", "9713,END,Good", "Broken,Line,")
val rdd = sc.parallelize(l).map {
x => {
val fields = x.split(",").slice(0, mySchema.size)
val f1 = Try(fields(0)).getOrElse("")
val f2 = Try(fields(1)).getOrElse("")
val f3 = Try(fields(2)).getOrElse("")
Row(f1, f2, f3)
}
}
val df = spark.createDataFrame(rdd, mySchema)
df.show
// +------+-----+----+
// | colA| colB|colC|
// +------+-----+----+
// | 97573|Start| eee|
// | 9713| END|Good|
// |Broken| Line| |
// +------+-----+----+
I wouldn't say that it's an elegant solution like you've asked for. Parsing strings is never elegant! You ought to use the csv source to read it correctly (or spark-csv for Spark < 2.x).
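For instance, if the same records lived in a CSV file (the path below is hypothetical), the csv reader applied with your schema would leave the missing trailing field as null instead of failing at encoding time:
// Reusing mySchema from above; "/path/to/data.csv" is a made-up path
val dfFromCsv = spark.read
  .schema(mySchema)
  .csv("/path/to/data.csv")

dfFromCsv.show()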
I am getting a scala.MatchError when using a ParamGridBuilder in Spark 1.6.1 and 2.0
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.1, 0.01))
.addGrid(lr.fitIntercept)
.addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
.build()
Error is
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 1 times, most recent failure: Lost task 0.0 in stage 57.0 (TID 257, localhost):
scala.MatchError: [280000,1.0,[2400.0,9373.0,3.0,1.0,1.0,0.0,0.0,0.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Full code
The question is how I should use ParamGridBuilder in this case
The problem here is the input schema, not ParamGridBuilder. The price column is loaded as an integer while LinearRegression expects a double. You can fix it by explicitly casting the column to the required type:
val houses = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(...)
.withColumn("price", $"price".cast("double"))