How to pass "WriteThroughputBudget" config in Spark cosmosdb connector - apache-spark

I am using the spark cosmosdb connector to write data in bulk to cosmosdb container. Since this is a bulk upload/write, and there are read operations happening at the same time. I want to restrict the RUs used by the write operation by spark connector. As per the wiki of the connector i found the config WriteThroughputBudget can be used to limit the write RUs consumption.
https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references
As per the wiki, WriteThroughputBudget is an integer value defining the RU budget the ingestion operations in a certain Spark job should not exceed.
I tried setting this config using the option in write dataframe as below
inputDataset.write.mode(SaveMode.Overwrite).format("com.microsoft.azure.cosmosdb.spark").option("WriteThroughputBudget", 400).options(writeConfig).save()
Also, tried using the options in write dataframe
val writeConfig: Map[String, String] = Map(
"Endpoint" -> accountEndpoint,
"Masterkey" -> accountKey,
"Database" -> database,
"Collection" -> collection,
"Upsert" -> "true",
"ConnectionMode" -> "Gateway",
"WritingBatchSize" -> "1000",
"WriteThroughputBudget" -> "1000")
In both the cases write operation fails with excpetion
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at com.microsoft.azure.cosmosdb.spark.CosmosDBConnectionCache$.getOrCreateBulkExecutor(CosmosDBConnectionCache.scala:104)
at com.microsoft.azure.cosmosdb.spark.CosmosDBConnection.getDocumentBulkImporter(CosmosDBConnection.scala:57)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$.bulkImport(CosmosDBSpark.scala:264)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$.com$microsoft$azure$cosmosdb$spark$CosmosDBSpark$$savePartition(CosmosDBSpark.scala:389)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$$anonfun$1.apply(CosmosDBSpark.scala:152)
at com.microsoft.azure.cosmosdb.spark.CosmosDBSpark$$anonfun$1.apply(CosmosDBSpark.scala:152)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I understand the value of the config WriteThroughputBudget, is Integer value. Even though I am passing the Integer as String, it should have implicitly casted it but if fails.
Is there any other way I can specify WriteThroughputBudget option.

There is something wrong with the specific release version 3.0.5. azure-cosmosdb-spark_2.4.0_2.11-3.0.5-uber.jar.
WriteThroughputBudget can work with below versions:
azure-cosmosdb-spark_2.4.0_2.11-3.4.0-uber.jar
azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar
azure-cosmosdb-spark_2.4.0_2.11-3.6.1-uber.jar

Related

a Spark Scala String Matching UDF

import org.apache.spark.sql.functions.lit
val containsString = (haystack:String, needle:String) =>{
if (haystack.contains(needle)){
1
}
else{
0
}
}
val containsStringUDF = udf(containsString _)
val new_df = df.withColumn("nameContainsxyz", containsStringUDF($"name"),lit("xyz")))
I am new to Spark scala. The above code seems to compile successfully. However, when I try to run
new_df.groupBy("nameContainsxyz").sum().show()
The error throws out. Could someone help me out? The error message is below.
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string, string) => int)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$15$$anon$2.hasNext(WholeStageCodegenExec.scala:655)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
... 3 more
Caused by: java.lang.NullPointerException
at $anonfun$1.apply(<console>:41)
at $anonfun$1.apply(<console>:40)
... 15 more
Just an update: the error throws out because some of the rows in the specified column is null. Adding null check in the UDF completely solved the issue.
Thanks
If I understand what you're trying to do correctly, you want to know how many rows there are in which 'xyz' appears in column name?
You can do that without using a UDF:
df.filter('name.contains("xyz")). count

Cannot Query external Hive table from Spark

I have a Hive external table as TEXTFILE FORMAT
hive> SHOW CRAETE TABLE customers;
CREATE EXTERNAL TABLE `customers`(
`id` int,
`name` string,
`age` int,
`address` string,
`salary` double)
COMMENT 'Customer Details'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'=',','line.delim'='\n','serialization.format'=',')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/shopping.db/customers'
TBLPROPERTIES ('skip.header.line.count'='1','transient_lastDdlTime'='1605818870')
and a spark session as follows :
val warehouseLocation:String = "hdfs://localhost:9000/user/hive/warehouse/"
val spark = SparkSession.builder()
.appName("Spark Extract")
.enableHiveSupport()
.master("local[*]")
.getOrCreate()
another line that extracts the tables from hive using :
val df: DataFrame = spark.sql("SELECT * FROM database.table")
extracting the table with spark.sql("QUERY") gives the following exception:
java.lang.RuntimeException: hdfs://localhost:9000/user/hive/warehouse/shopping.db/customers/customers.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [46, 48, 48, 10]
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:445)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:401)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:106)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:404)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:345)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I tried various methods even:
hive> alter table customers SET FILEFORMAT PARQUET;
Please assist
I tried various methods even:
hive> alter table customers SET FILEFORMAT PARQUET;
You need to remove the customers.txt file from the table if it is not parquet or you need to redefine the table to have a text fileformat

ColumnarBatch DataSource failing with Pushdown columns

I am writing a DataSource that implements SupportsScanColumnarBatch, SupportsPushDownFilters, and SupportsPushDownRequiredColumns.
I am getting with an ArrayIndexOutOfBoundsException deep inside of Spark after populating a ColumnarBatch with the same number of ColumnVectors as the length of requiredSchema provided in the pruneColumns override.
I suspect that Spark is looking for as many ColumnVectors as the column schema returned by readSchema override instead of using the schema provided by pruneColumns.
Doing a "select * from dft" works fine since the schema lengths are the same -- 15 columns in my test case. Anything less (e.g., "select col1, col2 from dft") returns the following stack trace, where it is clear that Spark is looking for more columns.
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.spark.sql.vectorized.ColumnarBatch.column(ColumnarBatch.java:98)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.datasourcev2scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any clues how I can get around this? For the time being, in order to get things to continue to run, I am ignoring the pruneColumns call and returning everything.
I solved it, but it seems like a bit of a kludge.
What I did was create a ColumnVector array of the same length as the original schema (not the pruned columns), and ONLY populated the pruned columns, leaving the others in their original allocated state.
For example, if only columns with indexes 0, 5, and 9 of the original schema are in the pruned list, this is all that is required.
var cva = new Array[ColumnVector](schema.length)
cva(0).putLongs(...)
cva(5).putInts(...)
cva(9).putFloats(...)
var batch = new ColumnarBatch(cva)
...
Found a saner approach...
In your implementation of SupportsPushDownRequiredColumns let the readSchema() method return the same StructType you are getting in the pruneColumns() call!
Basically feedback what you got from Spark!
HTH

Getting Null Pointer exception while performing operations on dataframe spark

I am using following code to create dataframe from RDD. I am able to perform operations on RDD and RDD is not empty.
I tried out following two approaches.
With both I am getting same exception.
Approach 1: Build dataset using sparkSession.createDataframe().
System.out.println("RDD Count: " + rdd.count());
Dataset<Row> rows = applicationSession
.getSparkSession().createDataFrame(rdd, data.getSchema()).toDF(data.convertListToSeq(data.getColumnNames()));
rows.createOrReplaceTempView(createStagingTableName(sparkTableName));
rows.show();
rows.printSchema();
Approach 2: Use Hive Context to create dataset.
System.out.println("RDD Count: " + rdd.count());
System.out.println("Create view using HiveContext..");
Dataset<Row> rows = applicationSession.gethiveContext().applySchema(rdd, data.getSchema());
I am able to print schema for above dataset using both apporaches.
Not sure what exactly causing null pointer exception.
Show() method internally invokes take() method which is throwing null pointer exception.
But why this dataset is populated as NULL? if RDD contains values then it should not be null.
This is a strange behaviour.
Below are logs for the same.
RDD Count: 35
Also I am able to run above code in local mode without any exception it is working fine.
As soon as I deploy this code on Yarn I start getting following exception.
I am able to create dataframe even I am able to register view for the same.
As soon as I perfrom rows.show() or rows.count() operation on this dataset I am getting following error.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:469)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:469)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Am I doing anything wrong here?
Please suggest.
Can you post the schema for dataframe? Issue is with schema string you are using and separator that you are using to split the schema string.

Spark. ~100 million rows. Size exceeds Integer.MAX_VALUE?

(This is with Spark 2.0 running on a small three machine Amazon EMR cluster)
I have a PySpark job that loads some large text files into a Spark RDD, does count() which successfully returns 158,598,155.
Then the job parses each row into a pyspark.sql.Row instance, builds a DataFrame, and does another count. This second count() on the DataFrame causes an exception in Spark internal code Size exceeds Integer.MAX_VALUE. This works with smaller volumes of data. Can someone explain why/how this would happen?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 1.0 failed 4 times, most recent failure: Lost task 22.3 in stage 1.0 (TID 77, ip-172-31-97-24.us-west-2.compute.internal): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:439)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:604)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
PySpark code:
raw_rdd = spark_context.textFile(full_source_path)
# DEBUG: This call to count() is expensive
# This count succeeds and returns 158,598,155
logger.info("raw_rdd count = %d", raw_rdd.count())
logger.info("completed getting raw_rdd count!!!!!!!")
row_rdd = raw_rdd.map(row_parse_function).filter(bool)
data_frame = spark_sql_context.createDataFrame(row_rdd, MySchemaStructType)
data_frame.cache()
# This will trigger the Spark internal error
logger.info("row count = %d", data_frame.count())
The error comes not from the data_frame.count() itself but rather because parsing the rows via row_parse_function yields some integers which don't fit into the specified integer type in MySchemaStructType.
Try to increase the integer types in your schema to pyspark.sql.types.LongType() or alternatively let spark infer the types by omitting the schema (this however can slow down the evaluation).

Resources