Apache Ignite Spark Integration not working with Schema Name - apache-spark

I'm using the Apache Ignite Spark connector (ignite-spark-2.7.5) to persist my DataFrame to an Ignite table using the code below.
val ignite = Ignition.start(CONFIG)

catalog_opportunities_agg.write
  .format(FORMAT_IGNITE)
  .option(OPTION_CONFIG_FILE, CONFIG)
  .option(OPTION_TABLE, "s1.club")
  .option("user", "ignite")
  .option("password", "ignite")
  .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "club_id")
  .option(OPTION_CREATE_TABLE_PARAMETERS, "template=replicated")
  .mode(SaveMode.Overwrite)
  .save()

Ignition.stop(false)
The code works fine against the public schema (when no schema name is given), but it starts to fail as soon as I add the schema name (s1).
Error stack:
19/09/04 10:24:06 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 208)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.ignite.spark.impl.QueryHelper$.org$apache$ignite$spark$impl$QueryHelper$$savePartition(QueryHelper.scala:155)
at org.apache.ignite.spark.impl.QueryHelper$$anonfun$saveTable$1.apply(QueryHelper.scala:117)
at org.apache.ignite.spark.impl.QueryHelper$$anonfun$saveTable$1.apply(QueryHelper.scala:116)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/09/04 10:24:06 ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 202)
java.util.NoSuchElementException: None.get
(same stack trace as above; the same exception repeats for the remaining tasks, e.g. task 5.0, TID 206)
Kindly suggest what I'm doing wrong.

I think it doesn't understand the schema syntax. Instead of:
.option(OPTION_TABLE, "s1.club")
try:
.option(OPTION_SCHEMA, "s1")
.option(OPTION_TABLE, "club")
Note that, as long as the table name is unique, you shouldn't need to specify the schema:
If this is not specified, all schemata will be scanned for a table name which matches the given table name and the first matching table will be used. This option can be used when there are multiple tables in different schemata with the same table name to disambiguate the tables.
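A minimal sketch of the corrected write (OPTION_SCHEMA comes from org.apache.ignite.spark.IgniteDataFrameSettings; check that the ignite-spark version you are on actually exposes it):
import org.apache.ignite.spark.IgniteDataFrameSettings._
import org.apache.spark.sql.SaveMode

// Sketch: pass the schema separately instead of embedding it in the table name.
catalog_opportunities_agg.write
  .format(FORMAT_IGNITE)
  .option(OPTION_CONFIG_FILE, CONFIG)
  .option(OPTION_SCHEMA, "s1")
  .option(OPTION_TABLE, "club")
  .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "club_id")
  .option(OPTION_CREATE_TABLE_PARAMETERS, "template=replicated")
  .mode(SaveMode.Overwrite)
  .save()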

Related

SOLR Read timed out (socket connection timeout) when indexing large amount of data in a collection

We are trying to index around 5 billion records, stored in HDFS as Parquet files, into a Solr collection. We are using Solr 7.2.1 and have spawned an EMR cluster with 7 data nodes (16 vCores, 128 GB each), using the same HDFS nodes to store the Solr data as well (the collection was created with 7 shards). After running for some time, the indexing rate slows down and some tasks start failing with two main errors (listed below). After indexing around 550 million records, the job failed because some tasks had failed too many times.
Error 1:
scala.MatchError: java.net.SocketTimeoutException: Read timed out (of class java.net.SocketTimeoutException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Error 2:
scala.MatchError: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://solr1:8983/solr/connection-2637-config6: 2 Async exceptions during distributed update:
Broken pipe (Write failed)
Broken pipe (Write failed) (of class org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
We then tried splitting the input into files of 300 million records each and indexing them with a cooling-off period in between. Each subsequent batch took longer: 3 hours for the first 300 million, 4.5 hours for the second, 6 hours for the third, and so on.
Now, having indexed over 2 billion records, we are still getting a lot of socket timeout exceptions and the jobs fail, even after waiting 5-6 hours between batches. We believe this may be because index merging continues in the background even after the job shows as completed (we have seen the number of files in the respective core directories shrink afterwards, so that's our guess). Is this true? Either way, is there any way to solve this and improve performance?
Here are some other Solr configs we have set:
SOLR_JAVA_MEM="-Xms30g -Xmx30g"
GC_TUNE="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts -XX:MaxDirectMemorySize=20g"
solrconfig.xml
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">25</int>
<int name="segmentsPerTier">25</int>
<double name="noCFSRatio">0.1</double>
</mergePolicyFactory>
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:1200000}</maxTime>
</autoSoftCommit>
Would really appreciate any help at all.
Update: checking the executor logs, we found a more detailed stack trace.
21/08/31 11:35:53 ERROR BaseCloudSolrClient: Request to collection [connection-2637-config6-65536] failed due to (0) java.net.SocketTimeoutException: Read timed out, retry=0 commError=false errorCode=0
21/08/31 11:35:53 INFO BaseCloudSolrClient: request was not communication error it seems
21/08/31 11:35:53 ERROR SolrSupport$: Send batch to collection connection-2637-config6-65536 failed due to: org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: http://solr4:8983/solr/connection-2637-config6-65536
org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: http://solr4:8983/solr/connection-2637-config6-65536
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:676)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolr(SolrSupport.scala:389)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:354)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at shaded.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at shaded.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at shaded.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
at shaded.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at shaded.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at shaded.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at shaded.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at shaded.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
at shaded.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at shaded.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at shaded.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at shaded.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at shaded.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at shaded.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at shaded.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at shaded.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at shaded.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
... 24 more
21/08/31 11:35:53 ERROR Executor: Exception in task 75.0 in stage 1.0 (TID 77)
scala.MatchError: java.net.SocketTimeoutException: Read timed out (of class java.net.SocketTimeoutException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
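One thing worth trying before deeper tuning is the connector's own batching: the lucidworks spark-solr DataFrame writer documents batch_size and commit_within options, and smaller batches with a lazier commit shrink each request that is timing out. A hedged sketch (df stands for the DataFrame being indexed; the host string and values are illustrative, not tuned):
// batch_size: documents per update request; commit_within (ms): lets Solr
// defer commits instead of committing eagerly during the bulk load.
df.write
  .format("solr")
  .option("zkhost", "zk1:2181,zk2:2181/solr") // your ZooKeeper ensemble
  .option("collection", "connection-2637-config6")
  .option("batch_size", "1000")
  .option("commit_within", "600000") // 10 minutes
  .save()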

Spark EMR Job Failing: Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0

I'm running a Spark Scala EMR job with 12 executors, instance type m5.4xlarge, 51 GB of heap, on Spark 2.4.6 and EMR 5.31.0. My EMR cluster also has the following configuration:
[{"classification":"spark", "properties":{"maximizeResourceAllocation":"true"}, "configurations":[]}]
I don't understand where my OOM error is coming from or what I can do to fix it. Before the write I do a join, two groupBys, and another join. How can I figure out what is causing the OOM error?
DF.withColumn("hour", hour(col("start")))
.withColumn("day", dayofmonth(col("start")))
.withColumn("month", month(col("start")))
.withColumn("year", year(col("start")))
.write
.partitionBy("year", "month", "day", "hour")
.mode(SaveMode.Overwrite)
.parquet(outputPath)
Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:128)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:161)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:128)
at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.add(ExternalAppendOnlyUnsafeRowArray.scala:115)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.fetchNextPartition(WindowExec.scala:343)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.next(WindowExec.scala:369)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.next(WindowExec.scala:303)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:188)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[Screenshots: DAG visualization of the failed stage, aggregated metrics by executor, and the Executors tab]
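The stack trace points into WindowExec's UnsafeExternalSorter, so memory runs out while a window function sorts its partitions, before the partitioned write. A hedged starting point is to spread that sort over more, smaller tasks and leave each task more heap; the values below are illustrative, not tuned:
// More shuffle partitions mean each window-sort task buffers fewer rows
// (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "1000")

// If tasks still cannot acquire memory, run fewer concurrent tasks per
// executor so each gets a larger share of the heap, e.g. via spark-submit
// or the EMR configuration (illustrative values):
//   --conf spark.executor.cores=4
//   --conf spark.executor.memory=24g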

Read all partitioned parquet files in PySpark

I want to load all the Parquet files stored in a folder structure on AWS S3.
The folder structure is as follows: S3/bucket_name/folder_1/folder_2/folder_3/year=2019/month/day
I want to read all the Parquet files at once: PySpark should read all the data from 2019, for every month and day available, and store it in one DataFrame (a concatenated/unioned DataFrame covering all days in 2019).
I am told that these are partitioned files (though I am not sure of this).
Is this possible in PySpark, and if so, how?
When I try spark.read.parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019')
it works. However, when I want to take a look at the Spark dataframe using spark.read.parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019').show()
it says:
An error occurred while calling o934.showString.
: org.apache.spark.SparkException:
Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times,
most recent failure:
Lost task 0.3 in stage 36.0 (TID 718, executor 7):
java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I want to be able to show the dataframe.
Please refer to the "Partition discovery" section of the documentation:
https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#partition-discovery
In PySpark, you can do this simply as follows:
from pyspark.sql.functions import col

(
    spark.read
    .parquet('S3/bucket_name/folder_1/folder_2/folder_3')
    .filter(col('year') == 2019)
)
You point the reader at the folder above the partition directories and apply a filter on the partition column; partition pruning then reads only the data under the matching year=2019 subfolder.

Spark 2.3.0 SQL unable insert data into hive hbase table

I'm using the Spark 2.3 Thrift Server integrated with Hive 2.2.0, running queries from the Spark Beeline client, and trying to insert data into a Hive-on-HBase table (a Hive table with HBase as storage). Inserting into a native Hive table works fine, but inserting into the HBase-backed table throws the following exception:
ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.outputFormat$lzycompute(HiveFileFormat.scala:93)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.outputFormat$lzycompute(HiveFileFormat.scala:93)
This seems to be a known Hive integration issue. There are several pull requests addressing it, but I haven't found an actual fix yet:
https://issues.apache.org/jira/browse/SPARK-6628

Group By in Spark with Error java.lang.IllegalStateException: There is no space for new record

I'm trying to port my Hive query code to Spark SQL.
(I'm pretty sure the queries work in Hive, but I don't know why Spark throws an error.)
I have quite a big table which is generated from Hive:
val df= spark.sql("select ... from a join b ... group by d...")
Showing the df or writing it out to Hive works fine:
df.show()
df.write.mode("overwrite").saveAsTable("tableName")
But, when I do this:
val df2 = df.groupBy("colA").agg(sum("colB"))
df2.show()
//or
df2.write.mode("overwrite").saveAsTable("tableDF2")
After running for a long time, the job fails with an error:
17/06/14 06:39:50 WARN RowBasedKeyValueBatch: Calling spill() on
RowBasedKeyValueBatch. Will not spill but return 0.
17/06/14 06:39:50 ERROR Executor: Exception in task 0.0 in stage 62.0 (TID 1982)
java.lang.IllegalStateException: There is no space for new record
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:225)
at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:130)
at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/06/14 06:39:50 WARN TaskSetManager: Lost task 0.0 in stage 62.0 (TID 1982, localhost, executor driver): java.lang.IllegalStateException: There is no space for new record
(same stack trace as above)
17/06/14 06:39:50 ERROR TaskSetManager: Task 0 in stage 62.0 failed 1 times; aborting job
17/06/14 06:39:50 WARN BlockManager: Putting block rdd_300_15 failed due to an exception
17/06/14 06:39:50 WARN BlockManager: Block rdd_300_15 could not be removed as it was not found on disk or in memory
17/06/14 06:39:50 WARN TaskSetManager: Lost task 15.0 in stage 62.0 (TID 1997, localhost, executor driver): TaskKilled (killed intentionally)
17/06/14 06:39:50 WARN TaskSetManager: Lost task 12.0 in stage 62.0 (TID 1994, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 62.0 failed 1 times, most recent failure: Lost task 0.0 in stage 62.0 (TID 1982, localhost, executor driver): java.lang.IllegalStateException: There is no space for new record
(same stack trace as above)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1920)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1933)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1946)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
... 48 elided
Caused by: java.lang.IllegalStateException: There is no space for new record
(same executor stack trace as above)
Any idea what's going on?
Is there any parameter or memory-management setting I can tune to tackle it? Thanks!
BTW, I'm using Spark 2.1.0.3.
You are most likely running into an issue with the amount of RAM available on your executors.
It's perfectly understandable that the first command works without issue but the second does not: they are two different computations.
In the first, you are reading in the table to a DataFrame, with the number of partitions set by whatever default is used. Then, you're simply writing that same table to disk, which will create a separate file for each partition.
In the second, you're performing a .groupBy and an aggregation, which means that each reducer is being sent the keys from "colA" to aggregate. If these reducer tasks can't hold all the data they receive in memory, they'll spill to disk (as this error seems to suggest).
You can solve this either by allocating more memory to each executor or by increasing the number of partitions used for the aggregation. The latter is usually easier: set spark.sql.shuffle.partitions to a larger number and try your .groupBy/agg again. By default spark.sql.shuffle.partitions is 200, which is sometimes not enough to aggregate a big table, and that is why you're hitting this problem. As a start, try quadrupling the number of shuffle partitions to 800.
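A minimal sketch of that retry (800 is a starting point to iterate on, not a tuned value):
import org.apache.spark.sql.functions.sum

// Quadruple the shuffle partitions so each aggregation task handles less data.
spark.conf.set("spark.sql.shuffle.partitions", "800")

val df2 = df.groupBy("colA").agg(sum("colB"))
df2.write.mode("overwrite").saveAsTable("tableDF2")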
