StackOverflowError while joining two DataFrames - apache-spark

I am facing an issue while joining two DataFrames on a column. I am getting a StackOverflowError but am unable to find the reason behind it. Is there any way to find out why this happens?
var joinedDf = event.join(event_status_df, Seq("_id"), "left")
I did a count of both DataFrames: the first has 150 rows and the second has 15. The first DataFrame has 80 columns, whereas the second has only 5.
This join gives me:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.findAliases(Analyzer.scala:1517)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.collectConflictPlans$1(Analyzer.scala:1183)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$dedupRight$7(Analyzer.scala:1217)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.collectConflictPlans$1(Analyzer.scala:1217)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$dedupRight$7(Analyzer.scala:1217)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.immutable.List.flatMap(List.scala:355)
I have the first DataFrame (the event DF) cached, but I am not sure why the above error occurs. I gather that providing a larger -Xss value can resolve the issue, but I want to understand what causes it.
Any ideas on what else I can check, or another way to resolve this?
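For what it's worth, two commonly suggested mitigations (a sketch, not verified against this exact plan; it assumes spark is the active SparkSession and the checkpoint directory path is hypothetical) are to raise the driver stack size at submit time, or to truncate the DataFrame's lineage before the join so the analyzer recursion stays shallow:
// 1) Raise the driver stack size when launching the job:
//    spark-submit --conf "spark.driver.extraJavaOptions=-Xss16m" ...
// 2) Or cut the logical plan's lineage via checkpointing:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path
val eventTruncated = event.checkpoint() // materializes the data, drops lineage
var joinedDf = eventTruncated.join(event_status_df, Seq("_id"), "left")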

Related

Spark dataset shows schema but throws UnsupportedOperation exception for show() method

I have created a Spark Dataset using a bean encoder of a custom Java class:
Encoder<CustomJavaType> customJavaEncoder = Encoders.bean(CustomJavaType.class);
Dataset<CustomJavaType> customJavaTypeDataset = sparkRunner.getSparkConfig().getSparkSession()
.createDataset(listofCustomJavaTypeObjects, customJavaEncoder);
customJavaTypeDataset.printSchema() works just fine; it shows the schema correctly.
However, customJavaTypeDataset.show() throws the following exception:
java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 0 because the size after growing exceeds size limitation 2147483647
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:65)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply2_2$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:41)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:41)
at org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.LocalTableScanExec.executeTake(LocalTableScanExec.scala:72)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
All the nested classes of CustomJavaType implement Serializable.
The number of objects in the list is 5.
The printSchema output is as expected.
This isn't really a solution to the problem (see my comment above), but it may help someone get a bit closer ...
I believe I have tracked down the point in code that triggers this error. It is in spark-catalyst_2.11-2.2.0:/.../org/apache/spark/sql/catalyst/expressions/UnsafeRow.java:getUTF8String line 418. On that line a "long" is cast to an "int", but the value is too large for an int, and the wrapped value results in a negative number which is then used in an attempt to grow a byte buffer (somewhere along the line, a java.lang.NegativeArraySizeException is thrown and swallowed/ignored).
Ultimately we arrive at spark-catalyst_2.11-2.2.0:/.../org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java:grow line 64 where an if() statement mistakes the negative value for a too-big value, thus throwing the UnsupportedOperationException.
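As a minimal illustration of the narrowing cast described above (hypothetical values, not the actual Spark code):
// A size that fits in a Long but not in an Int:
val size: Long = 3L * 1024 * 1024 * 1024 // ~3 GB
val narrowed: Int = size.toInt           // wraps around to -1073741824
// BufferHolder.grow() then treats this negative "needed" size as if the
// buffer would exceed the 2147483647 limit and throws.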
I'm not sure what to do with this info. Maybe somebody knows. Is this the sort of thing that should be reported as a bug?
Here are a couple of visuals from my debugger to show the detail: [debugger screenshots not reproduced here]

PySpark randomly fails to write to S3

Writing my word2vec model to S3 as follows:
model.save(sc, "s3://output/folder")
It usually works without problems, so it is not an AWS credentials issue, but I randomly get the following error.
17/01/30 20:35:21 WARN ConfigurationUtils: Cannot create temp dir with proper permission: /mnt2/s3
java.nio.file.AccessDeniedException: /mnt2
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
at java.nio.file.Files.createDirectory(Files.java:674)
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
at java.nio.file.Files.createDirectories(Files.java:767)
at com.amazon.ws.emr.hadoop.fs.util.ConfigurationUtils.getTestedTempPaths(ConfigurationUtils.java:216)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.initialize(S3NativeFileSystem.java:447)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:111)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:113)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:88)
at org.apache.parquet.hadoop.ParquetOutputCommitter.<init>(ParquetOutputCommitter.java:41)
at org.apache.parquet.hadoop.ParquetOutputFormat.getOutputCommitter(ParquetOutputFormat.java:339)
I have tried this on various clusters and haven't managed to figure it out. Is this a known problem with PySpark?
This is probably related to SPARK-19247. As of today (Spark 2.1.0), ML writers repartition all data to a single partition and it can result in failures in case of large models. If this is indeed the source of the problem you can try to patch your distribution manually using code from the corresponding PR.
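If patching is not an option, one hedged workaround (a sketch; it assumes model is a pyspark.mllib Word2VecModel and that a SparkSession is active, and the output path and partition count are illustrative) is to persist the vectors yourself and sidestep the single-partition write:
# Persist the word vectors manually across many partitions.
vectors = dict(model.getVectors())  # word -> vector
rdd = sc.parallelize(list(vectors.items()), numSlices=64)
df = rdd.map(lambda kv: (kv[0], list(kv[1]))).toDF(["word", "vector"])
df.write.parquet("s3://output/folder/vectors")
# Note: this loses Word2VecModel.load() compatibility; lookups would
# have to be rebuilt from the saved vectors.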

Failed to apply mutation locally : {}

Cassandra version 3.9 (https://github.com/docker-library/cassandra/blob/4bb926527d4a9eb534508fe0bbae604dee81f40a/3.9/Dockerfile)
This happened when I added 2 nodes to the cluster, and the error occurs only on these 2 nodes, recurring every 2 minutes. I ran a repair across the whole cluster, but it didn't help. It occurs while Cassandra is running.
I see this error on 2 of 3 nodes in the cluster.
ERROR 07:13:44 Failed to apply mutation locally : {}
java.nio.BufferOverflowException: null
at org.apache.cassandra.io.util.DataOutputBufferFixed.doFlush(DataOutputBufferFixed.java:52) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:132) ~[apache-cassandra-3.9.jar:3.9]
...
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
I came across the same error. The answer lies here:
Mutation of bytes is too large for the maximum size of
It was not present in the DataStax help center when the question was posted, but was added a year later. Hope it helps.
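For reference, the remedy that article describes boils down to a cassandra.yaml change (value hypothetical; the maximum mutation size defaults to half the commit log segment size):
# cassandra.yaml: raising the segment size raises the mutation limit,
# since max mutation size defaults to commitlog_segment_size_in_mb / 2.
commitlog_segment_size_in_mb: 64   # allows mutations up to ~32 MB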

Why does an RPC timeout occur in Cassandra

I tried using cqlsh -3 on my keyspace and ran a select query on a column family.
It returns data in some cases and throws an RPC timeout in other cases; I don't know the exact root cause.
I used a select query with a single where condition:
select * FROM date where date='2013-10-11 00:00:00+0000';
The date column has a secondary index, with datatype text (UTF-8).
Request did not complete within rpc_timeout.
I checked the Cassandra log; it throws:
ERROR [ReadStage:117] 2013-12-03 19:21:46,813 CassandraDaemon.java (line 192) Exception in thread Thread[ReadStage:117,5,main]
at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:119)
at org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:60)
at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:81)
at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:68)
at org.apache.cassandra.db.CollationController.collectTimeOrderedData(CollationController.java:132)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1390)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1213)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1125)
at org.apache.cassandra.db.index.keys.KeysSearcher$1.computeNext(KeysSearcher.java:191)
at org.apache.cassandra.db.index.keys.KeysSearcher$1.computeNext(KeysSearcher.java:109)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.ColumnFamilyStore.filter(ColumnFamilyStore.java:1499)
at org.apache.cassandra.db.index.keys.KeysSearcher.search(KeysSearcher.java:82)
at org.apache.cassandra.db.index.SecondaryIndexManager.search(SecondaryIndexManager.java:548)
at org.apache.cassandra.db.ColumnFamilyStore.search(ColumnFamilyStore.java:1487)
at org.apache.cassandra.service.RangeSliceVerbHandler.executeLocally(RangeSliceVerbHandler.java:44)
at org.apache.cassandra.service.StorageProxy$LocalRangeSliceRunnable.runMayThrow(StorageProxy.java:1055)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1547)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Why is this happening?
I am testing locally with a single seed node.
Update 1:
My date table structure:
CREATE TABLE date (
key text PRIMARY KEY,
date text,
date_id text,
day bigint,
day_name text
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
compression={'chunk_length_kb': '64', 'sstable_compression': 'SnappyCompressor'};
I checked the Cassandra log; it shows:
ERROR [ReadStage:94] 2013-12-03 22:07:17,116 CassandraDaemon.java (line 192) Exception in thread Thread[ReadStage:94,5,main]
java.lang.AssertionError: DecoratedKey(-8665312888645846270, ...<some bytes omitted>...)
/var/lib/cassandra/data/keyspace/columnfamily/keyspace-columnfamily-ic-1-Data.db
at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:119)
at org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:60)
at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:81)
at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:68)
at org.apache.cassandra.db.CollationController.collectTimeOrderedData(CollationController.java:132)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1390)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1213)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1125)
at org.apache.cassandra.db.Table.getRow(Table.java:347)
at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:64)
at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1033)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1547)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Currently I am using Cassandra 1.2.6.
I checked this link; is this a Cassandra issue?
https://issues.apache.org/jira/browse/CASSANDRA-4687
If your query is expensive this can lead to rpc-timeouts. There's a bunch of SO questions along this line for different types of query, e.g., Fetching all the records for a partitionID in cassandra gives RPC timeout, RPC timeout in cqlsh - Cassandra (select count(*) queries). However, your question relates specifically to secondary indices.
Querying on secondary indices should be avoided when the number of unique indexed entries is high, as such queries are much more expensive than querying by key (I suspect this is the case if you are indexing dates). Perhaps there is a better way of modelling your data? (If you add the data model to your question I can try to elaborate on this answer; a hypothetical remodeling is sketched below.)
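As one hedged illustration (a hypothetical table, not a recommendation for this exact schema): making the date the partition key turns the query into a key lookup instead of a secondary-index scan.
-- Hypothetical remodeling: partition by date so the WHERE clause hits
-- the primary key rather than a secondary index.
CREATE TABLE events_by_date (
    date text,
    event_id text,
    day bigint,
    day_name text,
    PRIMARY KEY (date, event_id)
);
SELECT * FROM events_by_date WHERE date = '2013-10-11 00:00:00+0000';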

Cassandra 1.1.1 crashes while inserting heavy data using Hector 1.0.5

I am using Cassandra 1.1.1 with Hector 1.0.5, and am trying to insert a heavy volume of data into a column family. During execution of my program, the Cassandra server crashes with an out-of-memory error, after which I have no option but to quit the server. This repeats for one column family in which I am trying to store HTML file contents, and I never get a chance to complete the load. The HTML content varies from 225 KB to 700 KB per row, and I am trying to insert almost 1000 records.
The program throws the following:
Exception in thread "main" me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:393)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:249)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at com.epocrates.soa.rx.util.DiseaseImporter.insertDisease(DiseaseImporter.java:207)
at com.epocrates.soa.rx.util.DiseaseImporter.batchProcess(DiseaseImporter.java:81)
at com.epocrates.soa.rx.util.DiseaseImporter.main(DiseaseImporter.java:37)
In system.log, I find the following:
java.io.IOError: java.io.IOException: Map failed
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:127)
at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:80)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:244)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:49)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:104)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:758)
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:119)
... 6 more
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:755)
... 7 more
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:457)
at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:314)
at org.apache.cassandra.service.StorageProxy$2.apply(StorageProxy.java:119)
at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:260)
at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:193)
at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:637)
at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:587)
at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:595)
at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3112)
at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3100)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:186)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
This means that you've run out of address space to map commitlog segments into.
Best solution: upgrade to a 64-bit JVM.
Worse solution: in cassandra.yaml, set commitlog_segment_size_in_mb and commitlog_total_space_in_mb both to 16.
This isn't the first time this has come up; I've opened https://issues.apache.org/jira/browse/CASSANDRA-4422 to improve the defaults.
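For reference, the cassandra.yaml change from the second option looks like this (values taken from the answer above):
# cassandra.yaml: shrink the commit log so fewer/smaller segments need
# to be mapped into the limited 32-bit address space.
commitlog_segment_size_in_mb: 16
commitlog_total_space_in_mb: 16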
