I'm trying to create a DataFrame from two custom sentences, just as a test, but the code below fails.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('first').getOrCreate()
df = spark.createDataFrame(
    [
        (0, "Hi this is a Spark tutorial"),
        (1, "This tutorial is made in Python language")
    ], ['id', 'sentence']
)
df.show()
This gives me this error:
Py4JJavaError: An error occurred while calling o73.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
I tried to create a schema:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType(
    [StructField("id", IntegerType(), True),
     StructField("sentence", StringType(), True)]
)
and passed it as an argument (schema=schema), but it hits the same dead end.
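For reference, this is how the schema variant is wired up (same two rows as above):

df = spark.createDataFrame(
    [(0, "Hi this is a Spark tutorial"),
     (1, "This tutorial is made in Python language")],
    schema=schema
)
df.show()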
Related
I am trying to solve the problems from the O'Reilly book Learning Spark.
The part of the code below works fine:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# define schema for our data
schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("First", StringType(), False),
    StructField("Last", StringType(), False),
    StructField("Url", StringType(), False),
    StructField("Published", StringType(), False),
    StructField("Hits", IntegerType(), False),
    StructField("Campaigns", ArrayType(StringType()), False)])

# create our data
data = [[1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter", "LinkedIn"]],
        [2, "Brooke", "Wenig", "https://tinyurl.2", "5/5/2018", 8908, ["twitter", "LinkedIn"]],
        [3, "Denny", "Lee", "https://tinyurl.3", "6/7/2019", 7659, ["web", "twitter", "FB", "LinkedIn"]],
        [4, "Tathagata", "Das", "https://tinyurl.4", "5/12/2018", 10568, ["twitter", "FB"]],
        [5, "Matei", "Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web", "twitter", "FB", "LinkedIn"]],
        [6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568, ["twitter", "LinkedIn"]]]

# main program
if __name__ == "__main__":
    # create a SparkSession
    spark = (SparkSession
             .builder
             .appName("Example-3_6")
             .getOrCreate())
    # create a DataFrame using the schema defined above
    blogs_df = spark.createDataFrame(data, schema)
But when I try to execute .show(), I get a Java error. Can somebody help me figure out how to resolve it?
blogs_df.show()
Error :
Py4JJavaError: An error occurred while calling o95.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (<>.<>.com executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:165)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
At the same time, when I execute the code below, the .show() call works and returns the result:
from pyspark.sql.types import StructType, IntegerType, StringType
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
schema = StructType() \
.add("city", StringType(), True) \
.add("state", StringType(), True) \
.add("pop", IntegerType(), True)
df_with_schema1 = spark.read.format("csv") \
.option("delimiter", ",") \
.option("header", True) \
.schema(schema) \
.load("<directory>\\pyspark-test.csv")
df_with_schema1.show()
You most likely don't have Python installed properly, or Spark cannot find the python3 executable (see the error above). Also give this a try:
spark = (SparkSession
.builder
.master('local[*]') # add this
.appName("Example-3_6")
.getOrCreate())
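If Python is installed but the worker still fails to launch (on Windows the interpreter is usually python.exe, not python3), a minimal sketch, assuming you run it before creating the SparkSession, is to point Spark at the interpreter explicitly via the standard PYSPARK_PYTHON variables:

import os
import sys

# Launch Spark's Python workers with the same interpreter that runs this notebook/script.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable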
My code was working fine earlier, but now it is showing a cache issue. My program involves DataFrame loading, transformation, and processing, and runs in a Jupyter notebook connected to the pyspark shell.
I don't understand what the main issue is or how to tackle it. Any help is highly appreciated.
My code is:
import time
start = time.time()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('spark://172.16.12.200:7077').appName('new').getOrCreate()
ndf = spark.read.json("Musical_Instruments.json")
pd=ndf.select(ndf['asin'],ndf['overall'],ndf['reviewerID'])
spark.sparkContext.setCheckpointDir("/home/npproject/jupyter_files/checkpoints")
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit,ParamGridBuilder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import col
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in list(set(pd.columns)-set(['overall'])) ]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(pd).transform(pd)
(training,test)=transformed.randomSplit([0.8, 0.2])
als=ALS(maxIter=5,regParam=0.09,rank=25,userCol="reviewerID_index",itemCol="asin_index",ratingCol="overall",coldStartStrategy="drop",nonnegative=True)
model=als.fit(training)
evaluator=RegressionEvaluator(metricName="rmse",labelCol="overall",predictionCol="prediction")
predictions=model.transform(test)
rmse=evaluator.evaluate(predictions)
print("RMSE="+str(rmse))
print("Rank: ",model.rank)
print("MaxIter: ",model._java_obj.parent().getMaxIter())
print("RegParam: ",model._java_obj.parent().getRegParam())
user_recs=model.recommendForAllUsers(10).show(20)
end = time.time()
print("execution time",end-start)
The error is:
Py4JJavaError: An error occurred while calling o40.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 172.16.12.208, executor 1): java.io.FileNotFoundException: File file:/home/npproject/jupyter_files /Musical_Instruments.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
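One hedged reading of the trace (an assumption, not a confirmed fix): the failing task ran on executor 172.16.12.208, and with a standalone master (spark://...) a plain local path has to exist on every worker, so the JSON would need to be present on each worker or read from shared storage. A minimal sketch, with a hypothetical HDFS location:

# Hypothetical path: any storage visible to all executors (HDFS, NFS, S3, ...) would do.
ndf = spark.read.json("hdfs:///user/npproject/Musical_Instruments.json")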
This question already has an answer here:
VectorUDT usage
(1 answer)
Closed 4 years ago.
Error while converting Vector to a data frame
The code mentioned in the first part works well, but it is a non-intuitive way to convert vector data into a DataFrame.
I would like to solve this with what I know, i.e. the code mentioned in the second part.
Could you please assist?
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)
val tupleList = data.map(Tuple1.apply)
val df = spark.createDataFrame(tupleList).toDF("features")
Can't we simply do it like below?
val rdd = sc.parallelize(data).map(a => Row(a))
rdd.take(1)
val fields = "features".split(" ").map(fields => StructField(fields,DoubleType, nullable =true))
val df = spark.createDataFrame(rdd, StructType(fields))
df.count()
But I am getting the error below:
df: org.apache.spark.sql.DataFrame = [features: double]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 357.0 failed 4 times, most recent failure: Lost task 1.3 in stage 357.0 (TID 1243, datacouch, executor 3): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: org.apache.spark.ml.linalg.DenseVector is not a valid external type for schema of double
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, features), DoubleType) AS features#6583
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:586)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:586)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
As clearly explained in VectorUDT usage and in the exception you get, the correct DataType for a Vector is org.apache.spark.ml.linalg.SQLDataTypes.VectorType:
spark.createDataFrame(
  rdd,
  StructType(Seq(
    StructField("features", org.apache.spark.ml.linalg.SQLDataTypes.VectorType)
  ))
)
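For anyone hitting the same thing from Python, a rough PySpark equivalent (assuming an existing SparkSession named spark) uses VectorUDT from pyspark.ml.linalg as the column type:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructType, StructField

# VectorUDT() plays the same role as SQLDataTypes.VectorType does on the Scala side.
data = [(Vectors.sparse(4, {0: 1.0, 3: -2.0}),),
        (Vectors.dense(4.0, 5.0, 0.0, 3.0),)]
schema = StructType([StructField("features", VectorUDT(), True)])
spark.createDataFrame(data, schema).show(truncate=False)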
I am getting a scala.MatchError when using a ParamGridBuilder in Spark 1.6.1 and 2.0
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.1, 0.01))
.addGrid(lr.fitIntercept)
.addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
.build()
Error is
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 1 times, most recent failure: Lost task 0.0 in stage 57.0 (TID 257, localhost):
scala.MatchError: [280000,1.0,[2400.0,9373.0,3.0,1.0,1.0,0.0,0.0,0.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Full code
The question is how I should use ParamGridBuilder in this case.
The problem here is the input schema, not ParamGridBuilder. The price column is loaded as an integer, while LinearRegression expects a double. You can fix it by explicitly casting the column to the required type:
val houses = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(...)
.withColumn("price", $"price".cast("double"))