How to make concurrency work with DataFrames writing into Hive tables?

I have multiple threads on Spark 1.6 writing into the same Hive table (backed by Parquet files). When they try to write at the same time, an error is thrown during the step that renames the written files in HDFS. I'm looking for a way to work around this known Spark issue.
class MyThread extends Runnable {
  def run {
    // some code
    myTable.write.format("parquet").mode("append")
      .saveAsTable("hdfstable")
    // some code
  }
}
Executors.defaultThreadFactory().newThread(new MyThread).start()
I get this error:
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:189)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:221)
at fr.neolink.spark.streaming.StreamingNeo$.algo(StreamingNeo.scala:837)
at fr.neolink.spark.streaming.StreamingNeo$$anonfun$main$3$$anonfun$apply$18$MyThread$1.run(StreamingNeo.scala:374)
at java.lang.Thread.run(Thread.java:748)
Caused by:
java.io.IOException: Failed to rename
FileStatus{path=hdfs://my_hdfs_master/user/hive/warehouse/MYDB.db/hdfstable/_temporary/0/task_201812281010_1770_m_000000/part-r-00000-9a70cbea-d105-4f50-ba1b-372f555906ce.gz.parquet;
isDirectory=false; length=4608; replication=3; blocksize=134217728; modification_time=1545988247575;
access_time=1545988247494; owner=owner; group=hive; permission=rw-r--r--; isSymlink=false}
to hdfs://my_hdfs_master/user/hive/warehouse/MYDB.db/hdfstable/part-r-00000-9a70cbea-d105-4f50-ba1b-372f555906ce.gz.parquet
I found this issue in JIRA: https://issues.apache.org/jira/browse/SPARK-18626
Is there a way to make the writing part thread-safe, so the writes execute one by one, one after another?
Thanks.

SOLUTION
Use this.synchronized { } as shown below:
class MyThread extends Runnable {
  def run {
    // some code
    this.synchronized {
      myTable.write.format("parquet").mode("append")
        .saveAsTable("hdfstable")
    }
    // some code
  }
}
Executors.defaultThreadFactory().newThread(new MyThread).start()
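Note that this.synchronized only serializes writes made through the same MyThread instance. If several MyThread instances are started, a minimal sketch (not from the original answer) would use one shared lock object instead:
object HiveWriteLock // a single JVM-wide lock shared by every writer thread

class MyThread extends Runnable {
  def run {
    // some code
    HiveWriteLock.synchronized {
      myTable.write.format("parquet").mode("append")
        .saveAsTable("hdfstable")
    }
    // some code
  }
}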

Related

Write to streamsets sdc in groovy scripting

I am pretty new to StreamSets and Groovy, and I am replacing my "Dev Raw Data Source" with "Groovy Scripting" as my input. The issue I face is that the Groovy script below throws an error.
Groovy scripting:
import com.streamsets.pipeline.stage.origin.scripting.*
record = sdc.createRecord("bulkUpdateRequestTemplate")
bulkUpdateRequestTemplate = "{ \"payload\": { \"objects\": { \"filter\": \"(equals(type,'configuration/entityTypes/Contract') and equals(attributes.ContractStatus, 'ACTIVE') and lt (attributes.ExpirationDate,'currentTimestampEpoch'))\", \"options\": \"searchByOv,ovOnly\" }, \"actions\": [ { \"operation\": \"UpdateAttribute\", \"operationParameters\": { \"attributeURI\": \"configuration/entityTypes/Contract/attributes/ContractStatus\", \"attributeValue\": \"INACTIVE\" } } ] } }"
record.value = bulkUpdateRequestTemplate
sdc.state['bulkUpdateRequestTemplate'] = record
Error:
com.streamsets.pipeline.api.StageException: SCRIPTING_10 - Script error in user script: javax.script.ScriptException: groovy.lang.MissingPropertyException: No such property: state for class: com.streamsets.pipeline.stage.origin.scripting.ScriptingOriginBindings
at com.streamsets.pipeline.stage.origin.scripting.AbstractScriptingSource.produce(AbstractScriptingSource.java:124)
at com.streamsets.pipeline.api.base.configurablestage.DPushSource.produce(DPushSource.java:44)
at com.streamsets.datacollector.runner.StageRuntime.lambda$execute$1(StageRuntime.java:258)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:232)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:267)
at com.streamsets.datacollector.runner.SourcePipe.process(SourcePipe.java:67)
at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.runPushSource(PreviewPipelineRunner.java:233)
at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.run(PreviewPipelineRunner.java:218)
at com.streamsets.datacollector.runner.Pipeline.run(Pipeline.java:535)
at com.streamsets.datacollector.runner.preview.PreviewPipeline.run(PreviewPipeline.java:39)
at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.start(SyncPreviewer.java:227)
at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer.lambda$start$1(AsyncPreviewer.java:93)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:214)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:44)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:25)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:210)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:214)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:44)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:25)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:210)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at com.streamsets.datacollector.metrics.MetricSafeScheduledExecutorService$MetricsTask.run(MetricSafeScheduledExecutorService.java:88)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Am I missing something here? Please help.
state works only for processors, and only within a given processor: it stores values you want to preserve between invocations of that processor.
The Groovy Scripting origin doesn't have "invocations"; it runs continuously for the whole of the pipeline's execution, so it doesn't need "state".
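A minimal sketch of the origin script without the state call, using only the bindings already shown in the question (the JSON template is elided here):
import com.streamsets.pipeline.stage.origin.scripting.*

record = sdc.createRecord("bulkUpdateRequestTemplate")
bulkUpdateRequestTemplate = "{ \"payload\": { ... } }" // the full JSON template from the question
record.value = bulkUpdateRequestTemplate
// no sdc.state call: an origin keeps running, so nothing needs to be preserved between invocations;
// hand the record to the batch the same way the stage's default script template does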

How to fix an IllegalThreadStateException in Spring Batch in the ItemWriter using a JPA repository

I have a Spring Batch app that reads a couple of million records from an Azure SQL database (P11), makes some calls, then updates those records. Below is the configuration of that particular step.
The chunkSize is 400 and the throttleLimit is 30.
@Bean
public Step coreCardProcessStep(JpaPagingItemReader<CoreCardEntity> sqlCoreLoadItemReader,
                                StepBuilderFactory stepBuilderFactory,
                                OutCoreCardProcessor outCoreCardProcessor,
                                CoreProcessStepCardWriter coreProcessStepCardWriter) {
    log.info("coreCardProcessingStep: creating a step for processing sql rows");
    return stepBuilderFactory.get("CORE_CARD_PROCESSOR STEP 001")
            .<CoreCardEntity, CoreCardEntity>chunk(chunkSize)
            .reader(sqlCoreLoadItemReader)
            .processor(outCoreCardProcessor)
            .writer(coreProcessStepCardWriter)
            .taskExecutor(new SimpleAsyncTaskExecutor("coreCardLoad"))
            .throttleLimit(throttleLimit)
            .build();
}
My writer is a JPA writer, but one we implemented ourselves:
@Slf4j
@Component
@RequiredArgsConstructor
public class CoreProcessStepCardWriter implements ItemWriter<CoreCardEntity> {

    private final CoreCardRepository coreCardRepository;

    @Override
    public void write(List<? extends CoreCardEntity> coreCards) throws Exception {
        log.info("write: number of cards: {}", coreCards.size());
        coreCardRepository.saveAll(coreCards);
    }
}
The problem is that after a few tens of thousands of records have already been processed correctly, I somehow get an IllegalThreadStateException, which as far as I know can only happen when something tries to start a Thread that has already been started. The exception is thrown from the writer when it calls saveAll, and it happens in all concurrent chunks. This is the stack trace of the actual underlying problem (after the failure, Spring Batch itself wraps the exceptions from each chunk in something else):
org.springframework.dao.InvalidDataAccessApiUsageException: nested exception is java.lang.IllegalThreadStateException
at org.springframework.orm.jpa.EntityManagerFactoryUtils.convertJpaAccessExceptionIfPossible(EntityManagerFactoryUtils.java:374)
at org.springframework.orm.jpa.vendor.HibernateJpaDialect.translateExceptionIfPossible(HibernateJpaDialect.java:235)
at org.springframework.orm.jpa.AbstractEntityManagerFactoryBean.translateExceptionIfPossible(AbstractEntityManagerFactoryBean.java:551)
at org.springframework.dao.support.ChainedPersistenceExceptionTranslator.translateExceptionIfPossible(ChainedPersistenceExceptionTranslator.java:61)
at org.springframework.dao.support.DataAccessUtils.translateIfNecessary(DataAccessUtils.java:242)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:152)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.data.jpa.repository.support.CrudMethodMetadataPostProcessor$CrudMethodMetadataPopulatingMethodInterceptor.invoke(CrudMethodMetadataPostProcessor.java:174)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
at com.sun.proxy.$Proxy225.saveAll(Unknown Source)
at com.abnamro.pim.loader.cardmigration.services.batches.out.core.steps.writers.CoreProcessStepCardWriter.write(CoreProcessStepCardWriter.java:25)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.write(SimpleChunkProcessor.java:294)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalThreadStateException
at java.lang.Thread.start(Thread.java:708)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3759)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:268)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:242)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:456)
at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.tomcat.jdbc.pool.StatementFacade$StatementProxy.invoke(StatementFacade.java:118)
at com.sun.proxy.$Proxy208.executeQuery(Unknown Source)
at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:57)
at org.hibernate.loader.Loader.getResultSet(Loader.java:2322)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:2075)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:2037)
at org.hibernate.loader.Loader.doQuery(Loader.java:956)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:357)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:327)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2440)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:77)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:61)
at org.hibernate.persister.entity.AbstractEntityPersister.doLoad(AbstractEntityPersister.java:4521)
at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:4511)
at org.hibernate.event.internal.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:571)
at org.hibernate.event.internal.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:539)
at org.hibernate.event.internal.DefaultLoadEventListener.load(DefaultLoadEventListener.java:208)
at org.hibernate.event.internal.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:327)
at org.hibernate.event.internal.DefaultLoadEventListener.doOnLoad(DefaultLoadEventListener.java:108)
at org.hibernate.event.internal.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:74)
at org.hibernate.event.service.internal.EventListenerGroupImpl.fireEventOnEachListener(EventListenerGroupImpl.java:118)
at org.hibernate.internal.SessionImpl.fireLoadNoChecks(SessionImpl.java:1231)
at org.hibernate.internal.SessionImpl.fireLoad(SessionImpl.java:1220)
at org.hibernate.internal.SessionImpl.access$2100(SessionImpl.java:202)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.doLoad(SessionImpl.java:2835)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.lambda$load$1(SessionImpl.java:2812)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.perform(SessionImpl.java:2768)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.load(SessionImpl.java:2812)
at org.hibernate.internal.SessionImpl.get(SessionImpl.java:1024)
at org.hibernate.event.internal.DefaultMergeEventListener.entityIsDetached(DefaultMergeEventListener.java:306)
at org.hibernate.event.internal.DefaultMergeEventListener.onMerge(DefaultMergeEventListener.java:172)
at org.hibernate.event.internal.DefaultMergeEventListener.onMerge(DefaultMergeEventListener.java:70)
at org.hibernate.event.service.internal.EventListenerGroupImpl.fireEventOnEachListener(EventListenerGroupImpl.java:107)
at org.hibernate.internal.SessionImpl.fireMerge(SessionImpl.java:829)
at org.hibernate.internal.SessionImpl.merge(SessionImpl.java:816)
at sun.reflect.GeneratedMethodAccessor306.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.orm.jpa.SharedEntityManagerCreator$SharedEntityManagerInvocationHandler.invoke(SharedEntityManagerCreator.java:311)
at com.sun.proxy.$Proxy222.merge(Unknown Source)
at org.springframework.data.jpa.repository.support.SimpleJpaRepository.save(SimpleJpaRepository.java:669)
at org.springframework.data.jpa.repository.support.SimpleJpaRepository.saveAll(SimpleJpaRepository.java:700)
at org.springframework.data.jpa.repository.support.SimpleJpaRepository.saveAll(SimpleJpaRepository.java:88)
at sun.reflect.GeneratedMethodAccessor316.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.data.repository.core.support.RepositoryMethodInvoker$RepositoryFragmentMethodInvoker.lambda$new$0(RepositoryMethodInvoker.java:289)
at org.springframework.data.repository.core.support.RepositoryMethodInvoker.doInvoke(RepositoryMethodInvoker.java:137)
at org.springframework.data.repository.core.support.RepositoryMethodInvoker.invoke(RepositoryMethodInvoker.java:121)
at org.springframework.data.repository.core.support.RepositoryComposition$RepositoryFragments.invoke(RepositoryComposition.java:530)
at org.springframework.data.repository.core.support.RepositoryComposition.invoke(RepositoryComposition.java:286)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$ImplementationMethodExecutionInterceptor.invoke(RepositoryFactorySupport.java:640)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.doInvoke(QueryExecutorMethodInterceptor.java:164)
at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.invoke(QueryExecutorMethodInterceptor.java:139)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.data.projection.DefaultMethodInvokingMethodInterceptor.invoke(DefaultMethodInvokingMethodInterceptor.java:81)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123)
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
... 20 more
I don't quite understand how this happens, let alone how to fix it. Does anyone have any idea?
When you use a multi-threaded step, the batch artifacts (reader, writer, etc) should be thread-safe. Please refer to the documentation here: Multi-threaded Step. In your case, you should synchronize your item writer or make it step-scoped.
Another option is to use a partitioned step (instead of a multi-threaded step), where each partition is a distinct data set that is processed by a distinct worker thread.
As a side note, I'd recommend using a ThreadPoolTaskExecutor, because the SimpleAsyncTaskExecutor does not re-use threads.
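A rough sketch of those suggestions, adapted from the code in the question (the bean method name is a placeholder): synchronizing write means only one chunk at a time reaches the repository, and a ThreadPoolTaskExecutor gives you a bounded, reusable pool.
@Slf4j
@Component
@RequiredArgsConstructor
public class CoreProcessStepCardWriter implements ItemWriter<CoreCardEntity> {

    private final CoreCardRepository coreCardRepository;

    @Override
    public synchronized void write(List<? extends CoreCardEntity> coreCards) throws Exception {
        // synchronized: concurrent chunks can no longer interleave inside saveAll
        log.info("write: number of cards: {}", coreCards.size());
        coreCardRepository.saveAll(coreCards);
    }
}

@Bean
public TaskExecutor coreCardTaskExecutor() {
    // pooled executor instead of SimpleAsyncTaskExecutor, which spawns a new thread per task
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(throttleLimit);
    executor.setMaxPoolSize(throttleLimit);
    executor.setThreadNamePrefix("coreCardLoad-");
    executor.initialize();
    return executor;
}
You would then wire the pooled executor into the step with .taskExecutor(coreCardTaskExecutor()) in place of the SimpleAsyncTaskExecutor.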

Exception while doing hbase scan

I was trying out the HBase-Spark distributed scan example.
My simple code looks like this:
public class DistributedHBaseScanToRddDemo {

    public static void main(String[] args) {
        JavaSparkContext jsc = getJavaSparkContext("hbasetable1");
        Configuration hbaseConf = getHbaseConf(0, "", "");
        JavaHBaseContext javaHbaseContext = new JavaHBaseContext(jsc, hbaseConf);

        Scan scan = new Scan();
        scan.setCaching(100);

        JavaRDD<Tuple2<ImmutableBytesWritable, Result>> javaRdd =
                javaHbaseContext.hbaseRDD(TableName.valueOf("hbasetable1"), scan);
        List<String> results = javaRdd.map(new ScanConvertFunction()).collect();
        System.out.println("Result Size: " + results.size());
    }

    public static Configuration getHbaseConf(int pTimeout, String pQuorumIP, String pClientPort) {
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.setInt("timeout", 120000);
        hbaseConf.set("hbase.zookeeper.quorum", "10.56.36.14");
        hbaseConf.set("hbase.zookeeper.property.clientPort", "2181");
        return hbaseConf;
    }

    public static JavaSparkContext getJavaSparkContext(String pTableName) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaHBaseBulkPut" + pTableName);
        sparkConf.setMaster("local");
        sparkConf.set("spark.testing.memory", "471859200");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        return jsc;
    }

    private static class ScanConvertFunction implements Function<Tuple2<ImmutableBytesWritable, Result>, String> {
        public String call(Tuple2<ImmutableBytesWritable, Result> v1) throws Exception {
            return Bytes.toString(v1._1().copyBytes());
        }
    }
}
I am getting the following exception:
Exception in thread "main" org.apache.hadoop.hbase.DoNotRetryIOException: /10.56.48.219:16020 is unable to read call parameter from client 10.56.49.148; java.lang.UnsupportedOperationException: GetRegionLoad
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:93)
at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:83)
at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:368)
at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:345)
at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.getRegionLoad(ProtobufUtil.java:1746)
at org.apache.hadoop.hbase.client.HBaseAdmin.getRegionLoad(HBaseAdmin.java:2089)
at org.apache.hadoop.hbase.mapreduce.RegionSizeCalculator.init(RegionSizeCalculator.java:82)
at org.apache.hadoop.hbase.mapreduce.RegionSizeCalculator.<init>(RegionSizeCalculator.java:60)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.oneInputSplitPerRegion(TableInputFormatBase.java:293)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:257)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:254)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:360)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at com.myproj.poc.sparkhbaseneo4j.DistributedHBaseScanToRddDemo.main(DistributedHBaseScanToRddDemo.java:32)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): /10.56.48.219:16020 is unable to read call parameter from client 10.56.49.148; java.lang.UnsupportedOperationException: GetRegionLoad
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:161)
at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:191)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at org.apache.hadoop.hbase.shaded.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at org.apache.hadoop.hbase.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at org.apache.hadoop.hbase.shaded.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at org.apache.hadoop.hbase.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at org.apache.hadoop.hbase.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at org.apache.hadoop.hbase.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at org.apache.hadoop.hbase.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
I also tried the bulk get and put examples and they work correctly, so I'm wondering what's going wrong with the bulk scan example.
This Cloudera hbase-spark connector seems to work:
https://mvnrepository.com/artifact/org.apache.hbase/hbase-spark?repo=cloudera
So, add something like this in pom.xml:
<repositories>
    <repository>
        <id>cloudera</id>
        <name>cloudera</name>
        <url>https://repository.cloudera.com/content/repositories/releases/</url>
    </repository>
</repositories>
and for dependencies:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>${hbase-spark.version}</version>
</dependency>
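Note that ${hbase-spark.version} is a Maven property; define it in <properties> with whichever hbase-spark release matches your HBase and Spark versions, for example:
<properties>
    <!-- pick the hbase-spark release that matches your HBase/Spark installation -->
    <hbase-spark.version>...</hbase-spark.version>
</properties>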
One thing I noticed is that this functionality doesn't seem to reuse the HBase connection well and tries to re-establish it for every partition. See my question and related discussion here:
HBase-Spark Connector: connection to HBase established for every scan?
For this reason I actually avoid this functionality, but I'm curious to hear about your experience with it.
It is due to a version mismatch between the Hadoop/HBase jars you have and the code you are running.

Spark and Cassandra parallel processing

I have the following task ahead of me.
The user provides a set of IP addresses in a config file when executing the spark-submit command.
Let's say that the array looks like this:
val ips = Array(1,2,3,4,5)
There can be up to 100,000 values in the array.
For each element in the array, I should read data from Cassandra, perform some computation, and insert the data back into Cassandra.
If I do:
ips.foreach(ip => {
  // read data from Cassandra for this specific "ip" (the amount of data differs per IP; within the function I determine the start and end date for each IP)
  // process it
  // save it back to Cassandra
})
this works fine.
I believe that process runs sequentially; I don't exploit parallelism.
On the other hand if I do:
val IPRdd = sc.parallelize(Array(1, 2, 3, 4, 5))
IPRdd.foreach(ip => {
  // read data from Cassandra (I need to use the Spark context to make the query)
  // process it
  // save it back to Cassandra
})
I get a serialization exception, because Spark is trying to serialize the SparkContext, which is not serializable.
How can I make this work while still exploiting parallelism?
Thanks
Edited
This is the exception I get:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1.apply(WibeeeBatchJob.scala:59)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1.apply(WibeeeBatchJob.scala:54)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$.main(WibeeeBatchJob.scala:54)
at com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob.main(WibeeeBatchJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#311ff287)
- field (class: com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1, )
- field (class: com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1$$anonfun$apply$1, name: $outer, type: class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1)
- object (class com.enerbyte.spark.jobs.wibeeebatch.WibeeeBatchJob$$anonfun$main$1$$anonfun$apply$1, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
The easiest thing to do is to use the Spark Cassandra Connector, which handles connection pooling and serialization.
With that, you could do something like:
sc.parallelize(inputData, numTasks)
  .mapPartitions { it =>
    val connector = CassandraConnector(yourConf)
    connector.withSessionDo { session =>
      // use the session
    }
    // do any other processing, returning an iterator of the results to save
    it
  }
  .saveToCassandra("ks", "table")
This is a completely manual way of operating a Cassandra connection. The sessions are all automatically pooled and cached, and if you prepare a statement, it will be cached on the executor as well.
If you would like to use more built-in methods, there is also joinWithCassandraTable, which may work in your situation.
sc.parallelize(inputData, numTasks)
  .joinWithCassandraTable("ks", "table") // retrieves all records for which the input data is the primary key
  .map( /* manipulate the returned results if needed */ )
  .saveToCassandra("ks", "table")
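Applied to the IP use case above, a minimal sketch might look like the following (assuming a keyspace "ks" and a table "table" whose partition key is the single column ip, plus a text column value; the names and the toUpperCase step are placeholders for your real schema and processing):
import com.datastax.spark.connector._

case class IpKey(ip: Int)
case class Processed(ip: Int, value: String)

sc.parallelize(ips.map(IpKey(_)), numTasks)
  .joinWithCassandraTable("ks", "table")   // (IpKey, CassandraRow) pairs, joined on the "ip" partition key
  .map { case (key, row) => Processed(key.ip, row.getString("value").toUpperCase) } // placeholder processing
  .saveToCassandra("ks", "table")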

Gradle remembers old source code

I want to create a Gradle custom task. The problem is that sometimes when I change something in the source code, Gradle reports an error about code that no longer exists. For example, I created a local variable timestamp and compiled the project. Everything worked fine. But later I changed that variable to final ... TIMESTAMP. After that change Gradle shows me the error Could not find property 'timestamp'.... How can I solve this problem? I assume Gradle has some kind of cache?
Code
class MyCustomTask extends DefaultTask {

    private final String TIMESTAMP = System.currentTimeMillis().toString()

    public MyCustomTask() {
    }

    @TaskAction
    def build() {
        // do something
    }
}
Exception
* Exception is:
org.gradle.api.GradleScriptException: A problem occurred evaluating project ':customer'.
at org.gradle.groovy.scripts.internal.DefaultScriptRunnerFactory$ScriptRunnerImpl.run(DefaultScriptRunnerFactory.java:54)
at org.gradle.configuration.DefaultScriptPluginFactory$ScriptPluginImpl.apply(DefaultScriptPluginFactory.java:127)
at org.gradle.configuration.BuildScriptProcessor.execute(BuildScriptProcessor.java:36)
at org.gradle.configuration.BuildScriptProcessor.execute(BuildScriptProcessor.java:23)
at org.gradle.configuration.ConfigureActionsProjectEvaluator.evaluate(ConfigureActionsProjectEvaluator.java:34)
at org.gradle.configuration.LifecycleProjectEvaluator.evaluate(LifecycleProjectEvaluator.java:55)
at org.gradle.api.internal.project.AbstractProject.evaluate(AbstractProject.java:465)
at org.gradle.api.internal.project.AbstractProject.evaluate(AbstractProject.java:76)
at org.gradle.configuration.DefaultBuildConfigurer.configure(DefaultBuildConfigurer.java:31)
at org.gradle.initialization.DefaultGradleLauncher.doBuildStages(DefaultGradleLauncher.java:142)
at org.gradle.initialization.DefaultGradleLauncher.doBuild(DefaultGradleLauncher.java:113)
at org.gradle.initialization.DefaultGradleLauncher.run(DefaultGradleLauncher.java:81)
at org.gradle.launcher.exec.InProcessBuildActionExecuter$DefaultBuildController.run(InProcessBuildActionExecuter.java:64)
at org.gradle.launcher.cli.ExecuteBuildAction.run(ExecuteBuildAction.java:33)
at org.gradle.launcher.cli.ExecuteBuildAction.run(ExecuteBuildAction.java:24)
at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:35)
at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:26)
at org.gradle.launcher.cli.RunBuildAction.run(RunBuildAction.java:50)
at org.gradle.api.internal.Actions$RunnableActionAdapter.execute(Actions.java:171)
at org.gradle.launcher.cli.CommandLineActionFactory$ParseAndBuildAction.execute(CommandLineActionFactory.java:201)
at org.gradle.launcher.cli.CommandLineActionFactory$ParseAndBuildAction.execute(CommandLineActionFactory.java:174)
at org.gradle.launcher.cli.CommandLineActionFactory$WithLogging.execute(CommandLineActionFactory.java:170)
at org.gradle.launcher.cli.CommandLineActionFactory$WithLogging.execute(CommandLineActionFactory.java:139)
at org.gradle.launcher.cli.ExceptionReportingAction.execute(ExceptionReportingAction.java:33)
at org.gradle.launcher.cli.ExceptionReportingAction.execute(ExceptionReportingAction.java:22)
at org.gradle.launcher.Main.doAction(Main.java:48)
at org.gradle.launcher.bootstrap.EntryPoint.run(EntryPoint.java:45)
at org.gradle.launcher.Main.main(Main.java:39)
at org.gradle.launcher.bootstrap.ProcessBootstrap.runNoExit(ProcessBootstrap.java:50)
at org.gradle.launcher.bootstrap.ProcessBootstrap.run(ProcessBootstrap.java:32)
at org.gradle.launcher.GradleMain.main(GradleMain.java:26)
Caused by: org.gradle.api.tasks.TaskInstantiationException: Could not create task of type 'MyCustomTask'.
at org.gradle.api.internal.project.taskfactory.TaskFactory$1.call(TaskFactory.java:115)
at org.gradle.api.internal.project.taskfactory.TaskFactory$1.call(TaskFactory.java:110)
at org.gradle.api.internal.AbstractTask.injectIntoNewInstance(AbstractTask.java:141)
at org.gradle.api.internal.project.taskfactory.TaskFactory.createTaskObject(TaskFactory.java:110)
at org.gradle.api.internal.project.taskfactory.TaskFactory.createTask(TaskFactory.java:70)
at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory.createTask(AnnotationProcessingTaskFactory.java:98)
at org.gradle.api.internal.project.taskfactory.DependencyAutoWireTaskFactory.createTask(DependencyAutoWireTaskFactory.java:39)
at org.gradle.api.internal.tasks.DefaultTaskContainer.create(DefaultTaskContainer.java:53)
at org.gradle.api.internal.project.AbstractProject.task(AbstractProject.java:908)
at org.gradle.api.internal.BeanDynamicObject$MetaClassAdapter.invokeMethod(BeanDynamicObject.java:216)
at org.gradle.api.internal.BeanDynamicObject.invokeMethod(BeanDynamicObject.java:122)
at org.gradle.api.internal.CompositeDynamicObject.invokeMethod(CompositeDynamicObject.java:147)
at org.gradle.groovy.scripts.BasicScript.methodMissing(BasicScript.java:83)
at build_1i1akohldagki7rlpcas5oct58.run(C:\workspace-eclipse-juno\Is2k8\customer\build.gradle:19)
at org.gradle.groovy.scripts.internal.DefaultScriptRunnerFactory$ScriptRunnerImpl.run(DefaultScriptRunnerFactory.java:52)
... 30 more
Caused by: groovy.lang.MissingPropertyException: Could not find property 'timestamp' on task ':customer:is2k8Test'.
at org.gradle.api.internal.AbstractDynamicObject.propertyMissingException(AbstractDynamicObject.java:43)
at org.gradle.api.internal.AbstractDynamicObject.getProperty(AbstractDynamicObject.java:35)
at org.gradle.api.internal.CompositeDynamicObject.getProperty(CompositeDynamicObject.java:94)
at pl.ydp.gradle.MyCustomTask_Decorated.getProperty(Unknown Source)
at pl.ydp.gradle.MyCustomTask.<init>(MyCustomTask.groovy:15)
at pl.ydp.gradle.MyCustomTask_Decorated.<init>(Unknown Source)
at org.gradle.api.internal.DependencyInjectingInstantiator.newInstance(DependencyInjectingInstantiator.java:62)
at org.gradle.api.internal.ClassGeneratorBackedInstantiator.newInstance(ClassGeneratorBackedInstantiator.java:36)
at org.gradle.api.internal.project.taskfactory.TaskFactory$1.call(TaskFactory.java:113)
... 44 more
At the end you will find Caused by: groovy.lang.MissingPropertyException: Could not find property 'timestamp' on task ':customer:is2k8Test'. The constructor of MyCustomTask (MyCustomTask.groovy:15) is still resolving the old timestamp name, which suggests Gradle is still running a stale, previously compiled version of the task class.
