Getting requirement failed exception on Applying K Means to PCA data - apache-spark

I was trying K means from Spark ML, on some scaled features which are a column of a spark dataframe. It works as expected. However, as soon as I apply PCA on the same feature set, and try K means on that, I get the error
java.lang.IllegalArgumentException: requirement failed
My original data before PCA has 108 features, which have been Indexed and Assembled. K means runs fine on this and there are no nulls/NANs. After this, I'm setting the number of principal components as 6,running PCA. On the PCA data, K means is being run, and I am getting the above error.
A look at the stacktrace shows that the error is in fastSquaredDistance in MLUtils
.......at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:486)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:589)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:563)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:557)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:557)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:580)
at org.apache.spark.mllib.clustering.KMeansModel$$anonfun$computeCost$1.apply(KMeansModel.scala:88)
at org.apache.spark.mllib.clustering.KMeansModel$$anonfun$computeCost$1.apply(KMeansModel.scala:88)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2120)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2120)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This throws an exception when the dimensions of the vectors are not the same, or when the vector norm is negative
I checked the size of all the vectors in the PCA column and all of the, have size 6( as expected) . As for non negative norm, I can't figure out how that's even possible.
Does anyone have leads into what could be happening?
If my question isn't formatted well, I apologize, it's my first time asking a question here
The code that's throwing the error is below
....import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(6).setSeed(42L)
val model = kmeans.fit(scaledDataPcaNN).setFeaturesCol("pcaFeatures")
val predictions = model.transform(scaledDataPca)
val WSSSE = model.computeCost(scaledDataPca)

Related

Spark exception when inserting dataframe results into a hive table

This is my code snippet. I am getting following exception when spar.sql(query) is getting executed.
My table_v2 has 262 columns. My table_v3 has 9 columns.
Can someone faced similar issue and help to resolve this? TIA
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc=spark.sparkContext
df1 = spark.sql("select * from myDB.table_v2")
df2 = spark.sql("select * from myDB.table_v3")
result_df = df1.join(df2, (df1.id_c == df2.id_c) & (df1.cycle_r == df2.cycle_r) & (df1.consumer_r == df2.consumer_r))
final_result_df = result_df.select(df1["*"])
final_result_df.distinct().createOrReplaceTempView("results")
query = "INSERT INTO TABLE myDB.table_v2_final select * from results"
spark.sql(query);
I tried to set the parameter in conf and it did not help to resolve the issue:
spark.sql.debug.maxToStringFields=500
Error:
20/12/16 19:28:20 ERROR FileFormatWriter: Job job_20201216192707_0002 aborted.
20/12/16 19:28:20 ERROR Executor: Exception in task 90.0 in stage 2.0 (TID 225)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Missing required char ':' at 'struct<>
at org.apache.orc.TypeDescription.requireChar(TypeDescription.java:293)
at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:326)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializer.scala:226)
at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more
I have dropped my myDB.table_v2_final and modified the below line in my code and it worked.
I suspect there might be some issue in the way I created the table.
query = "create external table myDB.table_v2_final as select * from results"

ColumnarBatch DataSource failing with Pushdown columns

I am writing a DataSource that implements SupportsScanColumnarBatch, SupportsPushDownFilters, and SupportsPushDownRequiredColumns.
I am getting with an ArrayIndexOutOfBoundsException deep inside of Spark after populating a ColumnarBatch with the same number of ColumnVectors as the length of requiredSchema provided in the pruneColumns override.
I suspect that Spark is looking for as many ColumnVectors as the column schema returned by readSchema override instead of using the schema provided by pruneColumns.
Doing a "select * from dft" works fine since the schema lengths are the same -- 15 columns in my test case. Anything less (e.g., "select col1, col2 from dft") returns the following stack trace, where it is clear that Spark is looking for more columns.
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.spark.sql.vectorized.ColumnarBatch.column(ColumnarBatch.java:98)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.datasourcev2scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any clues how I can get around this? For the time being, in order to get things to continue to run, I am ignoring the pruneColumns call and returning everything.
I solved it, but it seems like a bit of a kludge.
What I did was create a ColumnVector array of the same length as the original schema (not the pruned columns), and ONLY populated the pruned columns, leaving the others in their original allocated state.
For example, if only columns with indexes 0, 5, and 9 of the original schema are in the pruned list, this is all that is required.
var cva = new Array[ColumnVector](schema.length)
cva(0).putLongs(...)
cva(5).putInts(...)
cva(9).putFloats(...)
var batch = new ColumnarBatch(cva)
...
Found a saner approach...
In your implementation of SupportsPushDownRequiredColumns let the readSchema() method return the same StructType you are getting in the pruneColumns() call!
Basically feedback what you got from Spark!
HTH

Getting Null Pointer exception while performing operations on dataframe spark

I am using following code to create dataframe from RDD. I am able to perform operations on RDD and RDD is not empty.
I tried out following two approaches.
With both I am getting same exception.
Approach 1: Build dataset using sparkSession.createDataframe().
System.out.println("RDD Count: " + rdd.count());
Dataset<Row> rows = applicationSession
.getSparkSession().createDataFrame(rdd, data.getSchema()).toDF(data.convertListToSeq(data.getColumnNames()));
rows.createOrReplaceTempView(createStagingTableName(sparkTableName));
rows.show();
rows.printSchema();
Approach 2: Use Hive Context to create dataset.
System.out.println("RDD Count: " + rdd.count());
System.out.println("Create view using HiveContext..");
Dataset<Row> rows = applicationSession.gethiveContext().applySchema(rdd, data.getSchema());
I am able to print schema for above dataset using both apporaches.
Not sure what exactly causing null pointer exception.
Show() method internally invokes take() method which is throwing null pointer exception.
But why this dataset is populated as NULL? if RDD contains values then it should not be null.
This is a strange behaviour.
Below are logs for the same.
RDD Count: 35
Also I am able to run above code in local mode without any exception it is working fine.
As soon as I deploy this code on Yarn I start getting following exception.
I am able to create dataframe even I am able to register view for the same.
As soon as I perfrom rows.show() or rows.count() operation on this dataset I am getting following error.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:469)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:469)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Am I doing anything wrong here?
Please suggest.
Can you post the schema for dataframe? Issue is with schema string you are using and separator that you are using to split the schema string.

Can't figure out cause of Spark LinearRegression Error

I'm trying to do a very simple LinearRegression in PySpark using a housing data set I found on Kaggle. There are a bunch of columns, but to make this (virtually) as simple as possible, I'm retaining just two of the columns (after starting with all of them), and still no luck getting a model trained. Here is what the data frame looks like before going through the regression step:
2016-09-07 17:12:08,804 root INFO [Row(price=78000.0, sqft_living=780.0, sqft_lot=16344.0, features=DenseVector([780.0, 16344.0])), Row(price=80000.0, sqft_living=430.0, sqft_lot=5050.0, features=DenseVector([430.0, 5050.0])), Row(price=81000.0, sqft_living=730.0, sqft_lot=9975.0, features=DenseVector([730.0, 9975.0])), Row(price=82000.0, sqft_living=860.0, sqft_lot=10426.0, features=DenseVector([860.0, 10426.0])), Row(price=84000.0, sqft_living=700.0, sqft_lot=20130.0, features=DenseVector([700.0, 20130.0])), Row(price=85000.0, sqft_living=830.0, sqft_lot=9000.0, features=DenseVector([830.0, 9000.0])), Row(price=85000.0, sqft_living=910.0, sqft_lot=9753.0, features=DenseVector([910.0, 9753.0])), Row(price=86500.0, sqft_living=840.0, sqft_lot=9480.0, features=DenseVector([840.0, 9480.0])), Row(price=89000.0, sqft_living=900.0, sqft_lot=4750.0, features=DenseVector([900.0, 4750.0])), Row(price=89950.0, sqft_living=570.0, sqft_lot=4080.0, features=DenseVector([570.0, 4080.0]))]
I'm using the following code to train the model:
standard_scaler = StandardScaler(inputCol='features',
outputCol='scaled')
lr = LinearRegression(featuresCol=standard_scaler.getOutputCol(), labelCol='price', weightCol=None,
maxIter=100, tol=1e-4)
pipeline = Pipeline(stages=[standard_scaler, lr])
grid = (ParamGridBuilder()
.baseOn({lr.labelCol: 'price'})
.addGrid(lr.regParam, [0.1, 1.0])
.addGrid(lr.elasticNetParam, elastic_net_params or [0.0, 1.0])
.build())
ev = RegressionEvaluator(metricName="rmse", labelCol='price')
cv = CrossValidator(estimator=pipeline,
estimatorParamMaps=grid,
evaluator=ev,
numFolds=5)
model = cv.fit(data).bestModel
The error I'm getting is:
2016-09-07 17:12:08,805 root INFO Training regression model...
2016-09-07 17:12:09,530 root ERROR An error occurred while calling o60.fit.
: java.lang.NullPointerException
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:164)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:70)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Any thoughts?
You cannot use a Pipeline in this case. When you call pipeline.fit it translates to (roughly)
standard_scaler_model = standard_scaler.fit(dataframe)
lr_model = lr.fit(dataframe)
But you actually need
standard_scaler_model = standard_scaler.fit(dataframe)
dataframe = standard_scaler_model.transform(dataframe)
lr_model = lr.fit(dataframe)
The error is because your lr.fit cannot find the output (i.e., the result of the transformation) of your StandardScaler model.

Spark. ~100 million rows. Size exceeds Integer.MAX_VALUE?

(This is with Spark 2.0 running on a small three machine Amazon EMR cluster)
I have a PySpark job that loads some large text files into a Spark RDD, does count() which successfully returns 158,598,155.
Then the job parses each row into a pyspark.sql.Row instance, builds a DataFrame, and does another count. This second count() on the DataFrame causes an exception in Spark internal code Size exceeds Integer.MAX_VALUE. This works with smaller volumes of data. Can someone explain why/how this would happen?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 1.0 failed 4 times, most recent failure: Lost task 22.3 in stage 1.0 (TID 77, ip-172-31-97-24.us-west-2.compute.internal): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:439)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:604)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
PySpark code:
raw_rdd = spark_context.textFile(full_source_path)
# DEBUG: This call to count() is expensive
# This count succeeds and returns 158,598,155
logger.info("raw_rdd count = %d", raw_rdd.count())
logger.info("completed getting raw_rdd count!!!!!!!")
row_rdd = raw_rdd.map(row_parse_function).filter(bool)
data_frame = spark_sql_context.createDataFrame(row_rdd, MySchemaStructType)
data_frame.cache()
# This will trigger the Spark internal error
logger.info("row count = %d", data_frame.count())
The error comes not from the data_frame.count() itself but rather because parsing the rows via row_parse_function yields some integers which don't fit into the specified integer type in MySchemaStructType.
Try to increase the integer types in your schema to pyspark.sql.types.LongType() or alternatively let spark infer the types by omitting the schema (this however can slow down the evaluation).

Resources