I am writing a DataSource that implements SupportsScanColumnarBatch, SupportsPushDownFilters, and SupportsPushDownRequiredColumns.
I am getting with an ArrayIndexOutOfBoundsException deep inside of Spark after populating a ColumnarBatch with the same number of ColumnVectors as the length of requiredSchema provided in the pruneColumns override.
I suspect that Spark is looking for as many ColumnVectors as the column schema returned by readSchema override instead of using the schema provided by pruneColumns.
Doing a "select * from dft" works fine since the schema lengths are the same -- 15 columns in my test case. Anything less (e.g., "select col1, col2 from dft") returns the following stack trace, where it is clear that Spark is looking for more columns.
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.spark.sql.vectorized.ColumnarBatch.column(ColumnarBatch.java:98)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.datasourcev2scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any clues how I can get around this? For the time being, in order to get things to continue to run, I am ignoring the pruneColumns call and returning everything.
I solved it, but it seems like a bit of a kludge.
What I did was create a ColumnVector array of the same length as the original schema (not the pruned columns), and ONLY populated the pruned columns, leaving the others in their original allocated state.
For example, if only columns with indexes 0, 5, and 9 of the original schema are in the pruned list, this is all that is required.
var cva = new Array[ColumnVector](schema.length)
cva(0).putLongs(...)
cva(5).putInts(...)
cva(9).putFloats(...)
var batch = new ColumnarBatch(cva)
...
Found a saner approach...
In your implementation of SupportsPushDownRequiredColumns let the readSchema() method return the same StructType you are getting in the pruneColumns() call!
Basically feedback what you got from Spark!
HTH
Related
import org.apache.spark.sql.functions.lit
val containsString = (haystack:String, needle:String) =>{
if (haystack.contains(needle)){
1
}
else{
0
}
}
val containsStringUDF = udf(containsString _)
val new_df = df.withColumn("nameContainsxyz", containsStringUDF($"name"),lit("xyz")))
I am new to Spark scala. The above code seems to compile successfully. However, when I try to run
new_df.groupBy("nameContainsxyz").sum().show()
The error throws out. Could someone help me out? The error message is below.
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string, string) => int)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$15$$anon$2.hasNext(WholeStageCodegenExec.scala:655)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
... 3 more
Caused by: java.lang.NullPointerException
at $anonfun$1.apply(<console>:41)
at $anonfun$1.apply(<console>:40)
... 15 more
Just an update: the error throws out because some of the rows in the specified column is null. Adding null check in the UDF completely solved the issue.
Thanks
If I understand what you're trying to do correctly, you want to know how many rows there are in which 'xyz' appears in column name?
You can do that without using a UDF:
df.filter('name.contains("xyz")). count
I am using the spark cosmosdb connector to write data in bulk to cosmosdb container. Since this is a bulk upload/write, and there are read operations happening at the same time. I want to restrict the RUs used by the write operation by spark connector. As per the wiki of the connector i found the config WriteThroughputBudget can be used to limit the write RUs consumption.
https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references
As per the wiki, WriteThroughputBudget is an integer value defining the RU budget the ingestion operations in a certain Spark job should not exceed.
I tried setting this config using the option in write dataframe as below
inputDataset.write.mode(SaveMode.Overwrite).format("com.microsoft.azure.cosmosdb.spark").option("WriteThroughputBudget", 400).options(writeConfig).save()
Also, tried using the options in write dataframe
val writeConfig: Map[String, String] = Map(
"Endpoint" -> accountEndpoint,
"Masterkey" -> accountKey,
"Database" -> database,
"Collection" -> collection,
"Upsert" -> "true",
"ConnectionMode" -> "Gateway",
"WritingBatchSize" -> "1000",
"WriteThroughputBudget" -> "1000")
In both the cases write operation fails with excpetion
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at com.microsoft.azure.cosmosdb.spark.CosmosDBConnectionCache$.getOrCreateBulkExecutor(CosmosDBConnectionCache.scala:104)
at com.microsoft.azure.cosmosdb.spark.CosmosDBConnection.getDocumentBulkImporter(CosmosDBConnection.scala:57)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$.bulkImport(CosmosDBSpark.scala:264)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$.com$microsoft$azure$cosmosdb$spark$CosmosDBSpark$$savePartition(CosmosDBSpark.scala:389)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$$anonfun$1.apply(CosmosDBSpark.scala:152)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$$anonfun$1.apply(CosmosDBSpark.scala:152)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I understand the value of the config WriteThroughputBudget, is Integer value. Even though I am passing the Integer as String, it should have implicitly casted it but if fails.
Is there any other way I can specify WriteThroughputBudget option.
There is something wrong with the specific release version 3.0.5. azure-cosmosdb-spark_2.4.0_2.11-3.0.5-uber.jar.
WriteThroughputBudget can work with below versions:
azure-cosmosdb-spark_2.4.0_2.11-3.4.0-uber.jar
azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar
azure-cosmosdb-spark_2.4.0_2.11-3.6.1-uber.jar
I was trying K means from Spark ML, on some scaled features which are a column of a spark dataframe. It works as expected. However, as soon as I apply PCA on the same feature set, and try K means on that, I get the error
java.lang.IllegalArgumentException: requirement failed
My original data before PCA has 108 features, which have been Indexed and Assembled. K means runs fine on this and there are no nulls/NANs. After this, I'm setting the number of principal components as 6,running PCA. On the PCA data, K means is being run, and I am getting the above error.
A look at the stacktrace shows that the error is in fastSquaredDistance in MLUtils
.......at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:486)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:589)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:563)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:557)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:557)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:580)
at org.apache.spark.mllib.clustering.KMeansModel$$anonfun$computeCost$1.apply(KMeansModel.scala:88)
at org.apache.spark.mllib.clustering.KMeansModel$$anonfun$computeCost$1.apply(KMeansModel.scala:88)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2120)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2120)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This throws an exception when the dimensions of the vectors are not the same, or when the vector norm is negative
I checked the size of all the vectors in the PCA column and all of the, have size 6( as expected) . As for non negative norm, I can't figure out how that's even possible.
Does anyone have leads into what could be happening?
If my question isn't formatted well, I apologize, it's my first time asking a question here
The code that's throwing the error is below
....import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(6).setSeed(42L)
val model = kmeans.fit(scaledDataPcaNN).setFeaturesCol("pcaFeatures")
val predictions = model.transform(scaledDataPca)
val WSSSE = model.computeCost(scaledDataPca)
I am using following code to create dataframe from RDD. I am able to perform operations on RDD and RDD is not empty.
I tried out following two approaches.
With both I am getting same exception.
Approach 1: Build dataset using sparkSession.createDataframe().
System.out.println("RDD Count: " + rdd.count());
Dataset<Row> rows = applicationSession
.getSparkSession().createDataFrame(rdd, data.getSchema()).toDF(data.convertListToSeq(data.getColumnNames()));
rows.createOrReplaceTempView(createStagingTableName(sparkTableName));
rows.show();
rows.printSchema();
Approach 2: Use Hive Context to create dataset.
System.out.println("RDD Count: " + rdd.count());
System.out.println("Create view using HiveContext..");
Dataset<Row> rows = applicationSession.gethiveContext().applySchema(rdd, data.getSchema());
I am able to print schema for above dataset using both apporaches.
Not sure what exactly causing null pointer exception.
Show() method internally invokes take() method which is throwing null pointer exception.
But why this dataset is populated as NULL? if RDD contains values then it should not be null.
This is a strange behaviour.
Below are logs for the same.
RDD Count: 35
Also I am able to run above code in local mode without any exception it is working fine.
As soon as I deploy this code on Yarn I start getting following exception.
I am able to create dataframe even I am able to register view for the same.
As soon as I perfrom rows.show() or rows.count() operation on this dataset I am getting following error.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:469)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:469)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Am I doing anything wrong here?
Please suggest.
Can you post the schema for dataframe? Issue is with schema string you are using and separator that you are using to split the schema string.
(This is with Spark 2.0 running on a small three machine Amazon EMR cluster)
I have a PySpark job that loads some large text files into a Spark RDD, does count() which successfully returns 158,598,155.
Then the job parses each row into a pyspark.sql.Row instance, builds a DataFrame, and does another count. This second count() on the DataFrame causes an exception in Spark internal code Size exceeds Integer.MAX_VALUE. This works with smaller volumes of data. Can someone explain why/how this would happen?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 1.0 failed 4 times, most recent failure: Lost task 22.3 in stage 1.0 (TID 77, ip-172-31-97-24.us-west-2.compute.internal): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:439)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:604)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
PySpark code:
raw_rdd = spark_context.textFile(full_source_path)
# DEBUG: This call to count() is expensive
# This count succeeds and returns 158,598,155
logger.info("raw_rdd count = %d", raw_rdd.count())
logger.info("completed getting raw_rdd count!!!!!!!")
row_rdd = raw_rdd.map(row_parse_function).filter(bool)
data_frame = spark_sql_context.createDataFrame(row_rdd, MySchemaStructType)
data_frame.cache()
# This will trigger the Spark internal error
logger.info("row count = %d", data_frame.count())
The error comes not from the data_frame.count() itself but rather because parsing the rows via row_parse_function yields some integers which don't fit into the specified integer type in MySchemaStructType.
Try to increase the integer types in your schema to pyspark.sql.types.LongType() or alternatively let spark infer the types by omitting the schema (this however can slow down the evaluation).