Spark and Cassandra parallel processing - apache-spark

I have the following task ahead of me.
The user provides a set of IP addresses in a config file when executing the spark-submit command.
Let's say the array looks like this:
val ips = Array(1,2,3,4,5)
There can be up to 100,000 values in the array.
For every element in the array, I should read data from Cassandra, perform some computation, and insert the data back into Cassandra.
If I do:
ips.foreach(ip => {
  // read data from Cassandra for the specific "ip"
  // (for each IP there is a different amount of data to read; within the function I determine the start and end date for each IP)
  // process it
  // save it back to Cassandra
})
this works fine.
However, I believe the process runs sequentially, so I don't exploit any parallelism.
On the other hand, if I do:
val IPRdd = sc.parallelize(Array(1,2,3,4,5))
IPRdd.foreach(ip => {
  // read data from Cassandra -- I need to use the Spark context to make the query
  // process it
  // save it back to Cassandra
})
I get a serialization exception, because Spark is trying to serialize the SparkContext, which is not serializable.
How can I make this work and still exploit parallelism?
Thanks
Edit
This is the exception I get:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1.apply(WibeeeBatchJob.scala:59)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1.apply(WibeeeBatchJob.scala:54)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$.main(WibeeeBatchJob.scala:54)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob.main(WibeeeBatchJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@311ff287)
- field (class: com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1, )
- field (class: com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1$$anonfun$apply$1, name: $outer, type: class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1)
- object (class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1$$anonfun$apply$1, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)

The easiest thing to do is to use the Spark Cassandra Connector, which can handle connection pooling and serialization.
With that you could do something like:
sc.parallelize(inputData, numTasks)
  .mapPartitions { it =>
    val connector = CassandraConnector(yourConf)
    connector.withSessionDo { session =>
      // use the session
    }
    // do any other processing and return an iterator of rows to save
  }
  .saveToCassandra("ks", "table")
This would be a completely manual operation of a Cassandra connection. The sessions are all automatically pooled and cached, and if you prepare a statement, it will be cached on the executor as well.
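For instance, here is a minimal sketch of that pattern. inputData and numTasks come from the snippet above and the IPs are assumed to be Ints; the keyspace, table, and "ip" column in the CQL are placeholders, not names from the question:
import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)
sc.parallelize(inputData, numTasks).foreachPartition { ips =>
  connector.withSessionDo { session =>
    // prepared statements are cached per executor, so re-preparing the same CQL is cheap
    val stmt = session.prepare("SELECT * FROM ks.table WHERE ip = ?")
    ips.foreach { ip =>
      val rows = session.execute(stmt.bind(Int.box(ip)))
      // process the rows and write the results back as needed
    }
  }
}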
If you would like to use more built-in methods, there is also joinWithCassandraTable, which may work in your situation.
sc.parallelize(inputData, numTasks)
  .joinWithCassandraTable("ks", "table") // retrieves all records for which the input data is the primary key
  .map(...)                              // manipulate the returned results if needed
  .saveToCassandra("ks", "table")
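Since joinWithCassandraTable maps the elements of the RDD onto the table's key columns, plain Int values are usually wrapped in a tuple first. A hedged sketch, assuming "ip" is the partition key and where "some_column", "result_table", and "result" are placeholder names:
import com.datastax.spark.connector._

val ips = Array(1, 2, 3, 4, 5)
sc.parallelize(ips)
  .map(Tuple1(_))                        // one-column join key
  .joinWithCassandraTable("ks", "table")
  .on(SomeColumns("ip"))                 // join on the assumed partition key column
  .map { case (Tuple1(ip), row) => (ip, row.getInt("some_column") * 2) } // example computation
  .saveToCassandra("ks", "result_table", SomeColumns("ip", "result"))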

Related

How to fix an IllegalThreadStateException in Spring Batch in the ItemWriter using a JPA repository

I have a Spring Batch app that reads a couple of million records from an Azure SQL DB (P11), makes some calls, and then updates those records. Below is the configuration of that particular step.
The chunkSize is 400 and the throttleLimit is 30.
@Bean
public Step coreCardProcessStep(JpaPagingItemReader<CoreCardEntity> sqlCoreLoadItemReader,
                                StepBuilderFactory stepBuilderFactory,
                                OutCoreCardProcessor outCoreCardProcessor,
                                CoreProcessStepCardWriter coreProcessStepCardWriter) {
    log.info("coreCardProcessingStep: creating a step for processing sql rows");
    return stepBuilderFactory.get("CORE_CARD_PROCESSOR STEP 001")
        .<CoreCardEntity, CoreCardEntity>chunk(chunkSize)
        .reader(sqlCoreLoadItemReader)
        .processor(outCoreCardProcessor)
        .writer(coreProcessStepCardWriter)
        .taskExecutor(new SimpleAsyncTaskExecutor("coreCardLoad"))
        .throttleLimit(throttleLimit)
        .build();
}
My writer is a JPA-based writer that we implemented ourselves:
@Slf4j
@Component
@RequiredArgsConstructor
public class CoreProcessStepCardWriter implements ItemWriter<CoreCardEntity> {

    private final CoreCardRepository coreCardRepository;

    @Override
    public void write(List<? extends CoreCardEntity> coreCards) throws Exception {
        log.info("write: number of cards: {}", coreCards.size());
        coreCardRepository.saveAll(coreCards);
    }
}
The problem is that, after a few tens of thousands of records have already been processed correctly, I somehow get an IllegalThreadStateException, which as far as I know can only happen when something tries to start a Thread that is already started. The exception is thrown from the writer just as it calls saveAll, and it happens on all concurrent chunks. This is the stack trace of the actual underlying problem (Spring Batch itself wraps the exceptions from each chunk in something else after the failure):
org.springframework.dao.InvalidDataAccessApiUsageException: nested exception is java.lang.IllegalThreadStateException
at org.springframework.orm.jpa.EntityManagerFactoryUtils.convertJpaAccessExceptionIfPossible(EntityManagerFactoryUtils.java:374)
at org.springframework.orm.jpa.vendor.HibernateJpaDialect.translateExceptionIfPossible(HibernateJpaDialect.java:235)
at org.springframework.orm.jpa.AbstractEntityManagerFactoryBean.translateExceptionIfPossible(AbstractEntityManagerFactoryBean.java:551)
at org.springframework.dao.support.ChainedPersistenceExceptionTranslator.translateExceptionIfPossible(ChainedPersistenceExceptionTranslator.java:61)
at org.springframework.dao.support.DataAccessUtils.translateIfNecessary(DataAccessUtils.java:242)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:152)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.data.jpa.repository.support.CrudMethodMetadataPostProcessor$CrudMethodMetadataPopulatingMethodInterceptor.invoke(CrudMethodMetadataPostProcessor.java:174)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
at com.sun.proxy.$Proxy225.saveAll(Unknown Source)
at com.abnamro.pim.loader.cardmigration.services.batches.out.core.steps.writers.CoreProcessStepCardWriter.write(CoreProcessStepCardWriter.java:25)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.write(SimpleChunkProcessor.java:294)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalThreadStateException
at java.lang.Thread.start(Thread.java:708)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3759)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:268)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:242)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:456)
at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.tomcat.jdbc.pool.StatementFacade$StatementProxy.invoke(StatementFacade.java:118)
at com.sun.proxy.$Proxy208.executeQuery(Unknown Source)
at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:57)
at org.hibernate.loader.Loader.getResultSet(Loader.java:2322)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:2075)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:2037)
at org.hibernate.loader.Loader.doQuery(Loader.java:956)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:357)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:327)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2440)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:77)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:61)
at org.hibernate.persister.entity.AbstractEntityPersister.doLoad(AbstractEntityPersister.java:4521)
at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:4511)
at org.hibernate.event.internal.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:571)
at org.hibernate.event.internal.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:539)
at org.hibernate.event.internal.DefaultLoadEventListener.load(DefaultLoadEventListener.java:208)
at org.hibernate.event.internal.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:327)
at org.hibernate.event.internal.DefaultLoadEventListener.doOnLoad(DefaultLoadEventListener.java:108)
at org.hibernate.event.internal.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:74)
at org.hibernate.event.service.internal.EventListenerGroupImpl.fireEventOnEachListener(EventListenerGroupImpl.java:118)
at org.hibernate.internal.SessionImpl.fireLoadNoChecks(SessionImpl.java:1231)
at org.hibernate.internal.SessionImpl.fireLoad(SessionImpl.java:1220)
at org.hibernate.internal.SessionImpl.access$2100(SessionImpl.java:202)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.doLoad(SessionImpl.java:2835)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.lambda$load$1(SessionImpl.java:2812)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.perform(SessionImpl.java:2768)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.load(SessionImpl.java:2812)
at org.hibernate.internal.SessionImpl.get(SessionImpl.java:1024)
at org.hibernate.event.internal.DefaultMergeEventListener.entityIsDetached(DefaultMergeEventListener.java:306)
at org.hibernate.event.internal.DefaultMergeEventListener.onMerge(DefaultMergeEventListener.java:172)
at org.hibernate.event.internal.DefaultMergeEventListener.onMerge(DefaultMergeEventListener.java:70)
at org.hibernate.event.service.internal.EventListenerGroupImpl.fireEventOnEachListener(EventListenerGroupImpl.java:107)
at org.hibernate.internal.SessionImpl.fireMerge(SessionImpl.java:829)
at org.hibernate.internal.SessionImpl.merge(SessionImpl.java:816)
at sun.reflect.GeneratedMethodAccessor306.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.orm.jpa.SharedEntityManagerCreator$SharedEntityManagerInvocationHandler.invoke(SharedEntityManagerCreator.java:311)
at com.sun.proxy.$Proxy222.merge(Unknown Source)
at org.springframework.data.jpa.repository.support.SimpleJpaRepository.save(SimpleJpaRepository.java:669)
at org.springframework.data.jpa.repository.support.SimpleJpaRepository.saveAll(SimpleJpaRepository.java:700)
at org.springframework.data.jpa.repository.support.SimpleJpaRepository.saveAll(SimpleJpaRepository.java:88)
at sun.reflect.GeneratedMethodAccessor316.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.data.repository.core.support.RepositoryMethodInvoker$RepositoryFragmentMethodInvoker.lambda$new$0(RepositoryMethodInvoker.java:289)
at org.springframework.data.repository.core.support.RepositoryMethodInvoker.doInvoke(RepositoryMethodInvoker.java:137)
at org.springframework.data.repository.core.support.RepositoryMethodInvoker.invoke(RepositoryMethodInvoker.java:121)
at org.springframework.data.repository.core.support.RepositoryComposition$RepositoryFragments.invoke(RepositoryComposition.java:530)
at org.springframework.data.repository.core.support.RepositoryComposition.invoke(RepositoryComposition.java:286)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$ImplementationMethodExecutionInterceptor.invoke(RepositoryFactorySupport.java:640)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.doInvoke(QueryExecutorMethodInterceptor.java:164)
at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.invoke(QueryExecutorMethodInterceptor.java:139)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.data.projection.DefaultMethodInvokingMethodInterceptor.invoke(DefaultMethodInvokingMethodInterceptor.java:81)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123)
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
... 20 more
I don't quite understand how this happens, let alone how to fix it. Does anyone have any idea?
When you use a multi-threaded step, the batch artifacts (reader, writer, etc.) should be thread-safe. Please refer to the documentation here: Multi-threaded Step. In your case, you should synchronize your item writer or make it step-scoped.
Another option is to use a partitioned step (instead of a multi-threaded step), where each partition is a distinct data set processed by a distinct worker thread.
As a side note, I'd recommend using a ThreadPoolTaskExecutor, because the SimpleAsyncTaskExecutor does not re-use threads.

How to pass ojai configuration from driver to executors in spark?

I wonder how I can pass an OJAI connection from the Spark driver to its executors. Here's my code:
val connection = DriverManager.getConnection("ojai:mapr:")
val store = connection.getStore("/tables/table1")

val someStream = messagesDStream.mapPartitions { iterator =>
  val list = iterator
    .map(record => record.value())
    .toList
    .asJava
  // TODO: serialization/deserialization, the Serializable interface in Java
  val query = connection
    .newQuery()
    .where(connection.newCondition()
      .in("_id", list)
      .build())
    .build()
}
and the error I got:
Caused by: java.io.NotSerializableException: com.mapr.ojai.store.impl.OjaiConnection
Serialization stack:
- object not serializable (class: com.mapr.ojai.store.impl.OjaiConnection, value: com.mapr.ojai.store.impl.OjaiConnection@2a367e93)
- field (class: com.example.App$$anonfun$1, name: connection$1, type: interface org.ojai.store.Connection)
- object (class com.example.App$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
...
As long as the connection to OJAI is created inside the mapPartitions function, everything is fine and dandy. I know that I need to pass the configuration from the driver to the executors for the code to work, but I don't know how to do it. Bye!
You're running into Spark's most infamous error: Task not serializable.
Essentially, it means that one of the classes or objects you're attempting to serialize, i.e. send over the network from the driver to the executors, cannot be processed in this way: here, it's the OJAI connection.
You cannot pass the connection itself from the driver to the executors. What you can do, while avoiding constant re-creation of the connection for each batch of RDDs coming from your stream, is declare the connection in a companion object as
@transient lazy val connection = ...
and refer to that inside mapPartitions. This ensures that each executor has a connection to the database that persists through multiple batches, because fields marked this way are not created on the driver and then serialized, but are created on each executor instead.
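A minimal sketch of that pattern, following the code from the question (the table path and "_id" field are taken from that snippet; the holder object name is made up):
import scala.collection.JavaConverters._
import org.ojai.store.DriverManager

// The connection and store live in an object, so they are created lazily on each
// executor instead of being serialized from the driver.
object OjaiHolder {
  @transient lazy val connection = DriverManager.getConnection("ojai:mapr:")
  @transient lazy val store = connection.getStore("/tables/table1")
}

val someStream = messagesDStream.mapPartitions { iterator =>
  val ids = iterator.map(_.value()).toList.asJava
  val query = OjaiHolder.connection
    .newQuery()
    .where(OjaiHolder.connection.newCondition().in("_id", ids).build())
    .build()
  // the query result is an Iterable of OJAI Documents
  OjaiHolder.store.find(query).asScala.iterator
}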

How to Restrict a column of my pojo to be part of Ignite tables

I have a POJO that I am using to create Ignite caches. Now I want to add one more column (XXX) to that POJO, and I don't want that column (XXX) to be part of the Ignite cache creation.
Caused by: class org.apache.ignite.IgniteException: Failed to prepare Cassandra CQL statement: select "customer_ref", "tenant_id", "event_discount_id", "period_num", "domain_id", "event_source", "prod_group_id", "event_seq", "product_seq", "online_version_num", "total_authorised_mny", "version_num", "bonus_count", "customer_category", "recovery_status", "total_discounted_usage", "external_balance_liid", "total_online_discounted_mny", "anti_event_disc_mny", "total_partials_mny", "counter_usage", "total_partials_usage", "online_event_count", "event_count", "last_rated_dtm", "account_num", "dyn_alloc_charge_data", "anti_event_disc_usage", "total_usage", "anti_event_count", "last_online_event_dtm", "fast_cache_seq", "total_discounted_mny", "latest_event_dtm", "total_online_discounted_usage", "carried_over_boo", "total_authorised_usage", "total_otc_mny", "online_batch_info", "counter_resets", "total_bonus_award" from "smart"."custprodinvoicediscusage" where "customer_ref"=? and "tenant_id"=? and "event_discount_id"=? and "period_num"=? and "domain_id"=? and "event_source"=? and "prod_group_id"=? and "event_seq"=? and "product_seq"=?;
at org.apache.ignite.cache.store.cassandra.session.CassandraSessionImpl.prepareStatement(CassandraSessionImpl.java:603)
at org.apache.ignite.cache.store.cassandra.session.CassandraSessionImpl.execute(CassandraSessionImpl.java:201)
... 12 more
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Undefined column name recovery_status
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:104)
at org.apache.ignite.cache.store.cassandra.session.CassandraSessionImpl.prepareStatement(CassandraSessionImpl.java:585)
... 13 more
Ignite uses the getter and setter methods to read and write fields.
I changed the method signatures in that POJO to the following, instead of getXXX() and setXXX():
public void putRecovery_status(Integer RECOVERY_STATUS) { this.RECOVERY_STATUS = RECOVERY_STATUS; }
public Integer fetchRecovery_status() { return RECOVERY_STATUS; }
If you don't want a field to end up in the Ignite cache, you can probably mark it transient.

How to make concurrency work with dataframes writing into hive tables?

I have multiple threads on Spark 1.6 writing into the same Hive table (using parquet files); when they try to write at the same time, an error is raised during the renaming part of writing the files into HDFS. I'm looking for a solution to bypass this known Spark issue.
class MyThread extends Runnable {
  def run {
    // some code
    myTable.write.format("parquet").mode("append")
      .saveAsTable("hdfstable")
    // some code
  }
}
Executors.defaultThreadFactory().newThread(new MyThread).start()
I get this error:
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:189)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:221)
at fr.neolink.spark.streaming.StreamingNeo$.algo(StreamingNeo.scala:837)
at fr.neolink.spark.streaming.StreamingNeo$$anonfun$main$3$$anonfun$apply$18$MyThread$1.run(StreamingNeo.scala:374)
at java.lang.Thread.run(Thread.java:748)
Caused by:
java.io.IOException: Failed to rename
FileStatus{path=hdfs://my_hdfs_master/user/hive/warehouse/MYDB.db/hdfstable/_temporary/0/task_201812281010_1770_m_000000/part-r-00000-9a70cbea-d105-4f50-ba1b-372f555906ce.gz.parquet;
isDirectory=false; length=4608; replication=3; blocksize=134217728; modification_time=1545988247575;
access_time=1545988247494; owner=owner; group=hive; permission=rw-r--r--; isSymlink=false}
to hdfs://my_hdfs_master/user/hive/warehouse/MYDB.db/hdfstable/part-r-00000-9a70cbea-d105-4f50-ba1b-372f555906ce.gz.parquet
I found this issue on JIRA: https://issues.apache.org/jira/browse/SPARK-18626
Is there a way to make the writing part thread-safe, so that the writes execute one by one, one after another?
Thanks.
SOLUTION
Use this.synchronized{} as below:
class MyThread extends Runnable {
  def run {
    // some code
    this.synchronized {
      myTable.write.format("parquet").mode("append")
        .saveAsTable("hdfstable")
    }
    // some code
  }
}
Executors.defaultThreadFactory().newThread(new MyThread).start()
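Note that this.synchronized only serializes the writes when every thread runs the same MyThread instance; if each thread gets its own new MyThread, a lock shared across instances is needed, for example in the companion object. A minimal sketch of that variant, with the same placeholder names as above:
object MyThread {
  // single lock shared by every MyThread instance
  val writeLock = new Object
}

class MyThread extends Runnable {
  def run {
    // some code
    MyThread.writeLock.synchronized {
      myTable.write.format("parquet").mode("append")
        .saveAsTable("hdfstable")
    }
    // some code
  }
}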

InvalidQueryException when trying to create column family in Cassandra via unit test

I have a 3-node Cassandra cluster, and in a Java unit test I first create a keyspace and then create a column family within that keyspace. Sometimes the unit test passes, but randomly I keep getting the following error. I am using the latest DataStax 2.1.4 Java driver, and the Cassandra version is 2.1.0.
com.symc.edp.database.nosql.NoSQLPersistenceException: com.datastax.driver.core.exceptions.InvalidQueryException: Cannot add column family 'testmaxcolumnstable' to non existing keyspace 'testmaxcolumnskeyspace'.
at com.symc.edp.database.nosql.cassandra.CassandraCQLTableEditor.createTable(CassandraCQLTableEditor.java:67)
at com.symc.edp.database.nosql.cassandra.TestCassandraWideRowPerformance.testWideRowInserts(TestCassandraWideRowPerformance.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.lang.reflect.Method.invoke(Method.java:483)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Cannot add column family 'testmaxcolumnstable' to non existing keyspace 'testmaxcolumnskeyspace'.
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:36)
at com.symc.edp.database.nosql.cassandra.CassandraCQLTableEditor.createTable(CassandraCQLTableEditor.java:65)
... 6 more
Caused by: com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException: Cannot add column family 'testmaxcolumnstable' to non existing keyspace 'testmaxcolumnskeyspace'.
at com.datastax.driver.core.Responses$Error.asException(Responses.java:104)
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:140)
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:249)
at com.datastax.driver.core.RequestHandler.onSet(RequestHandler.java:421)
at com.datastax.driver.core.Connection$Dispatcher.messageReceived(Connection.java:697)
at com.datastax.shaded.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at com.datastax.shaded.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at com.datastax.shaded.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at com.datastax.shaded.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at com.datastax.shaded.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at com.datastax.shaded.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at com.datastax.shaded.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at com.datastax.shaded.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at com.datastax.shaded.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at com.datastax.shaded.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at com.datastax.shaded.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at com.datastax.shaded.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at com.datastax.shaded.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at com.datastax.shaded.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at com.datastax.shaded.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at com.datastax.shaded.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at com.datastax.shaded.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
And in the system.log file of cassandra I see the following exception:
ERROR [SharedPool-Worker-1] 2015-01-28 15:08:24,286 ErrorMessage.java:218 - Unexpected exception during request
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_05]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_05]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_05]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_05]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375) ~[na:1.8.0_05]
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:878) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:114) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_05]
INFO [SharedPool-Worker-1] 2015-01-28 15:13:01,051 MigrationManager.java:229 - Create new Keyspace: KSMetaData{name=testmaxcolumnskeyspace, strategyClass=SimpleStrategy, strategyOptions={replication_factor=1}, cfMetaData={}, durableWrites=true, userTypes=org.apache.cassandra.config.UTMetaData@790ee1bb}
INFO [MigrationStage:1] 2015-01-28 15:13:01,058 ColumnFamilyStore.java:856 - Enqueuing flush of schema_keyspaces: 512 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:7] 2015-01-28 15:13:01,059 Memtable.java:326 - Writing Memtable-schema_keyspaces#1727029917(138 serialized bytes, 3 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:7] 2015-01-28 15:13:01,077 Memtable.java:360 - Completed flushing /usr/share/apache-cassandra-2.1.0/bin/../data/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-103-Data.db (175 bytes) for commitlog position ReplayPosition(segmentId=1422485457803, position=1181)
Also, I verified via DevCenter that the keyspace didn't get created.
Without seeing your code, my guess is that you need a sleep between creating the keyspace and trying to create tables in it. You probably need to give the keyspace definition a couple of seconds to propagate to all the nodes in your cluster before you try to use it.
It would help, as noted, to see your configuration class. We are using ClassPathCQLDataSet to issue our statements and create the keyspace in the same go (see the ClassPathCQLDataSet documentation; note that the booleans in positions 2 and 3 tell it to create and delete the keyspace). db.cql is the file where we keep the CREATE TABLE statements. Here is our configuration, which might help you:
package some.package;

import org.cassandraunit.CQLDataLoader;
import org.cassandraunit.dataset.cql.ClassPathCQLDataSet;
import org.cassandraunit.utils.EmbeddedCassandraServerHelper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.DisposableBean;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
import org.springframework.context.support.PropertySourcesPlaceholderConfigurer;
import org.springframework.core.io.ClassPathResource;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

@Configuration
@Profile({"test"})
public class TestCassandraConfig implements DisposableBean {

    private static final Logger LOGGER = LoggerFactory.getLogger(TestCassandraConfig.class);

    private final String CQL = "db.cql";

    @Value("${cassandra.contact_points:localhost}")
    private String contact_points;

    @Value("${cassandra.port:9142}")
    private int port;

    @Value("${cassandra.keyspace:test}")
    private String keyspace;

    private static Cluster cluster;
    private static Session session;
    private static SessionProxy sessionProxy;

    @Bean
    public Session session() throws Exception {
        if (session == null) {
            initialize();
        }
        return sessionProxy;
    }

    @Bean
    public TestApplicationContext testApplicationContext() {
        return new TestApplicationContext();
    }

    private void initialize() throws Exception {
        LOGGER.info("Starting embedded cassandra server");
        EmbeddedCassandraServerHelper.startEmbeddedCassandra("another-cassandra.yaml");

        LOGGER.info("Connect to embedded db");
        cluster = Cluster.builder().addContactPoints(contact_points).withPort(port).build();
        session = cluster.connect();

        LOGGER.info("Initialize keyspace");
        final CQLDataLoader cqlDataLoader = new CQLDataLoader(session);
        cqlDataLoader.load(new ClassPathCQLDataSet(CQL, false, true, keyspace));
    }

    @Override
    public void destroy() throws Exception {
        if (cluster != null) {
            cluster.close();
            cluster = null;
        }
    }
}
