I was trying out the HBase-Spark distributed scan example.
My simple code looks like this:
public class DistributedHBaseScanToRddDemo {

    public static void main(String[] args) {
        JavaSparkContext jsc = getJavaSparkContext("hbasetable1");
        Configuration hbaseConf = getHbaseConf(0, "", "");
        JavaHBaseContext javaHbaseContext = new JavaHBaseContext(jsc, hbaseConf);

        Scan scan = new Scan();
        scan.setCaching(100);

        JavaRDD<Tuple2<ImmutableBytesWritable, Result>> javaRdd =
                javaHbaseContext.hbaseRDD(TableName.valueOf("hbasetable1"), scan);

        List<String> results = javaRdd.map(new ScanConvertFunction()).collect();
        System.out.println("Result Size: " + results.size());
    }

    public static Configuration getHbaseConf(int pTimeout, String pQuorumIP, String pClientPort) {
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.setInt("timeout", 120000);
        hbaseConf.set("hbase.zookeeper.quorum", "10.56.36.14");
        hbaseConf.set("hbase.zookeeper.property.clientPort", "2181");
        return hbaseConf;
    }

    public static JavaSparkContext getJavaSparkContext(String pTableName) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaHBaseBulkPut" + pTableName);
        sparkConf.setMaster("local");
        sparkConf.set("spark.testing.memory", "471859200");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        return jsc;
    }

    private static class ScanConvertFunction implements Function<Tuple2<ImmutableBytesWritable, Result>, String> {
        public String call(Tuple2<ImmutableBytesWritable, Result> v1) throws Exception {
            return Bytes.toString(v1._1().copyBytes());
        }
    }
}
I am getting the following exception:
Exception in thread "main" org.apache.hadoop.hbase.DoNotRetryIOException: /10.56.48.219:16020 is unable to read call parameter from client 10.56.49.148; java.lang.UnsupportedOperationException: GetRegionLoad
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:93)
at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:83)
at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:368)
at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:345)
at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.getRegionLoad(ProtobufUtil.java:1746)
at org.apache.hadoop.hbase.client.HBaseAdmin.getRegionLoad(HBaseAdmin.java:2089)
at org.apache.hadoop.hbase.mapreduce.RegionSizeCalculator.init(RegionSizeCalculator.java:82)
at org.apache.hadoop.hbase.mapreduce.RegionSizeCalculator.<init>(RegionSizeCalculator.java:60)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.oneInputSplitPerRegion(TableInputFormatBase.java:293)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:257)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:254)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:121)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:360)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at com.myproj.poc.sparkhbaseneo4j.DistributedHBaseScanToRddDemo.main(DistributedHBaseScanToRddDemo.java:32)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): /10.56.48.219:16020 is unable to read call parameter from client 10.56.49.148; java.lang.UnsupportedOperationException: GetRegionLoad
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:161)
at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:191)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at org.apache.hadoop.hbase.shaded.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at org.apache.hadoop.hbase.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at org.apache.hadoop.hbase.shaded.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at org.apache.hadoop.hbase.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at org.apache.hadoop.hbase.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
at org.apache.hadoop.hbase.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at org.apache.hadoop.hbase.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at org.apache.hadoop.hbase.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
I also tried the bulk get and put examples and they work correctly, so I am wondering what is going wrong with the bulk scan example.
This Cloudera hbase-spark connector seems to work:
https://mvnrepository.com/artifact/org.apache.hbase/hbase-spark?repo=cloudera
So, add something like this in pom.xml:
<repositories>
    <repository>
        <id>cloudera</id>
        <name>cloudera</name>
        <url>https://repository.cloudera.com/content/repositories/releases/</url>
    </repository>
</repositories>
and for dependencies:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>${hbase-spark.version}</version>
</dependency>
One thing I noticed is that this functionality doesn't seem to reuse the HBase connection well and tries to re-establish it for every partition. See my question and related discussion here:
HBase-Spark Connector: connection to HBase established for every scan?
For this reason I actually avoid this functionality, but I'm curious to hear about your experience with it.
This is due to a version mismatch between the Hadoop/HBase jars your code runs with and the Hadoop/HBase versions on the cluster you are running against.
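For example, a minimal pom.xml sketch of aligning the client-side HBase artifacts with the cluster; REPLACE_WITH_CLUSTER_VERSION is a placeholder you would set to the exact HBase version your cluster runs, and the artifact list may differ depending on your setup:

<properties>
    <hbase.version>REPLACE_WITH_CLUSTER_VERSION</hbase.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>${hbase.version}</version>
    </dependency>
</dependencies>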
Related
Currently I am writing JUnit tests for our web service code.
The web service code and the JUnit code are shown in the code section below.
When I run the JUnit test I get the following error:
java.lang.AbstractMethodError: javax.ws.rs.core.Response$ResponseBuilder.status(ILjava/lang/String;)Ljavax/ws/rs/core/Response$ResponseBuilder;
at javax.ws.rs.core.Response$ResponseBuilder.status(Response.java:921)
at javax.ws.rs.core.Response.status(Response.java:592)
at javax.ws.rs.core.Response.status(Response.java:603)
at javax.ws.rs.core.Response.ok(Response.java:638)
at javax.ws.rs.core.Response.ok(Response.java:650)
at com.renault.rntbci.stl.service.demande.impl.ListDemandServiceImpl.getDemandeListCount(ListDemandServiceImpl.java:281)
at com.renault.rntbci.stl.service.demande.impl.ListDemandeTest.testGetDemandeListCount(ListDemandeTest.java:51)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
One of my friends advised me to change the scope of the JAX-RS dependency to test in pom.xml, as below:
<dependency>
    <groupId>org.jboss.resteasy</groupId>
    <artifactId>jaxrs-api</artifactId>
    <version>2.3.7.Final</version>
    <scope>test</scope>
</dependency>
But the same error persists. Could anyone please help me with a solution?
Thanks a lot in advance.
Note:
This is an existing application and I am new to the project, which uses Java 1.7.
public Response someMethod(String stringValue) {
    Map<String, Long> outputMap = new TreeMap<String, Long>();
    try {
        // This call goes to a service and gets the DB count for a table.
        Long outputValue = service.getCount(stringValue);
        outputMap.put("listCount", outputValue);
    } catch (Exception e) {
        logger.info(e.getMessage(), e);
    }
    return Response.ok(outputMap).build();
}
And the JUnit code is:
@Mock
Service service;

@Mock
Response res;

@Test
public void testSomeMethod() throws SQLException {
    String stringValue = "12";
    Long returnValue = 10L;
    when(service.getCount(stringValue)).thenReturn(returnValue);
    res = obj.someMethod(stringValue);
    Assert.assertNotNull(res);
}
I suspect that what's happening here is that you have two different versions of the JAR containing the javax.ws.rs.core.Response$ResponseBuilder class, and the abstract status() method changed between those two versions. Go through your Maven dependencies by running the mvn dependency:tree command as per this answer, identify the two JAR versions containing the two different javax.ws.rs.core.Response$ResponseBuilder classes, then exclude the version you don't want and keep the one you do.
It could be as simple as deleting the jboss-jaxrs-api_1.1_spec JAR, but depending on what other dependencies (and transitive dependencies) you have, there may be more to it than that.
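As a sketch, once mvn dependency:tree shows which dependency drags in the unwanted JAR, you can exclude it right there. The outer coordinates below are placeholders for whatever your tree output actually shows, and it is worth confirming the spec JAR's groupId (typically org.jboss.spec.javax.ws.rs) against that output as well:

<dependency>
    <!-- placeholder: the dependency that transitively pulls in the duplicate JAX-RS API -->
    <groupId>some.group</groupId>
    <artifactId>some-artifact</artifactId>
    <version>1.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.jboss.spec.javax.ws.rs</groupId>
            <artifactId>jboss-jaxrs-api_1.1_spec</artifactId>
        </exclusion>
    </exclusions>
</dependency>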
I have multiple threads on Spark 1.6 writing into the same Hive table (using Parquet files). When they try to write at the same time, an error occurs during the renaming step of writing the files into HDFS. I'm looking for a way to work around this known Spark issue.
class MyThread extends Runnable {
  def run {
    // some code
    myTable.write.format("parquet").mode("append")
      .saveAsTable("hdfstable")
    // some code
  }
}

Executors.defaultThreadFactory().newThread(new MyThread).start()
I get this error:
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:189)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:221)
at fr.neolink.spark.streaming.StreamingNeo$.algo(StreamingNeo.scala:837)
at fr.neolink.spark.streaming.StreamingNeo$$anonfun$main$3$$anonfun$apply$18$MyThread$1.run(StreamingNeo.scala:374)
at java.lang.Thread.run(Thread.java:748)
Caused by:
java.io.IOException: Failed to rename
FileStatus{path=hdfs://my_hdfs_master/user/hive/warehouse/MYDB.db/hdfstable/_temporary/0/task_201812281010_1770_m_000000/part-r-00000-9a70cbea-d105-4f50-ba1b-372f555906ce.gz.parquet;
isDirectory=false; length=4608; replication=3; blocksize=134217728; modification_time=1545988247575;
access_time=1545988247494; owner=owner; group=hive; permission=rw-r--r--; isSymlink=false}
to hdfs://my_hdfs_master/user/hive/warehouse/MYDB.db/hdfstable/part-r-00000-9a70cbea-d105-4f50-ba1b-372f555906ce.gz.parquet
I found this issue on JIRA: https://issues.apache.org/jira/browse/SPARK-18626
Is there a way to make the write part thread-safe, i.e. to run the writes one by one, one after another?
Thanks.
SOLUTION
Use this.synchronized {} as shown below:
class MyThread extends Runnable {
  def run {
    // some code

    // The lock here is this MyThread instance, so the writer threads must
    // share the same instance (or a common lock object) for the writes to
    // actually run one after another.
    this.synchronized {
      myTable.write.format("parquet").mode("append")
        .saveAsTable("hdfstable")
    }

    // some code
  }
}

Executors.defaultThreadFactory().newThread(new MyThread).start()
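An alternative sketch of the "one after another" execution asked about in the question: funnel every write through a single-threaded executor. This is illustrative Java, not the original code; SerializedWrites and writeTask are made-up names, and each submitted Runnable would wrap the saveAsTable call shown above.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SerializedWrites {

    // A single-threaded executor runs submitted tasks strictly one at a time,
    // so only one saveAsTable call can be writing/renaming files in HDFS at any moment.
    private static final ExecutorService WRITE_EXECUTOR = Executors.newSingleThreadExecutor();

    public static void submitWrite(Runnable writeTask) {
        WRITE_EXECUTOR.submit(writeTask);
    }
}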
I am getting the error below when trying to write a Dataset from Spark to Teradata, when the Dataset contains some string data:
2018-01-02 15:49:05 [pool-2-thread-2] ERROR c.i.i.t.spark2.algo.JDBCTableWriter:115 - Error in JDBC operation:
java.sql.SQLException: [Teradata Database] [TeraJDBC 15.00.00.20] [Error 3706] [SQLState 42000] Syntax error: Data Type "TEXT" does not match a Defined Type name.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:308)
at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:109)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:307)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:196)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:123)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:114)
at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:385)
at com.teradata.jdbc.jdbc_4.TDStatement.doNonPrepExecuteUpdate(TDStatement.java:602)
at com.teradata.jdbc.jdbc_4.TDStatement.executeUpdate(TDStatement.java:1109)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:805)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
How can I ensure that the data gets properly written into Teradata?
I am reading a CSV file from HDFS into a Dataset and then trying to write it to Teradata using DataFrameWriter. I am using the following code for this:
ds.write().mode("append")
    .jdbc(url, tableName, props);
I am using Spark 2.2.0 and Teradata 15.00.00.07.
I get somewhat similar issues when I try writing to Netezza, while with DB2 I am able to write but string values get replaced with .
Is there any option required while writing to these databases?
I was able to fix this issue by implementing a custom JdbcDialect for Teradata.
The same approach can be used to address similar issues with other data sources like Netezza, DB2, Hive, etc.
To do so, you need to extend the JdbcDialect class and register it:
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcType;
import org.apache.spark.sql.types.DataType;

import scala.Option;

public class TDDialect extends JdbcDialect {

    private static final long serialVersionUID = 1L;

    // Maps Spark SQL simple type names to the Teradata column types used when
    // Spark generates the CREATE TABLE statement.
    private static final Map<String, Option<JdbcType>> dataTypeMap = new HashMap<String, Option<JdbcType>>();

    static {
        dataTypeMap.put("int", Option.apply(JdbcType.apply("INTEGER", java.sql.Types.INTEGER)));
        dataTypeMap.put("long", Option.apply(JdbcType.apply("BIGINT", java.sql.Types.BIGINT)));
        dataTypeMap.put("double", Option.apply(JdbcType.apply("DOUBLE PRECISION", java.sql.Types.DOUBLE)));
        dataTypeMap.put("float", Option.apply(JdbcType.apply("FLOAT", java.sql.Types.FLOAT)));
        dataTypeMap.put("short", Option.apply(JdbcType.apply("SMALLINT", java.sql.Types.SMALLINT)));
        dataTypeMap.put("byte", Option.apply(JdbcType.apply("BYTEINT", java.sql.Types.TINYINT)));
        dataTypeMap.put("binary", Option.apply(JdbcType.apply("BLOB", java.sql.Types.BLOB)));
        dataTypeMap.put("timestamp", Option.apply(JdbcType.apply("TIMESTAMP", java.sql.Types.TIMESTAMP)));
        dataTypeMap.put("date", Option.apply(JdbcType.apply("DATE", java.sql.Types.DATE)));
        dataTypeMap.put("string", Option.apply(JdbcType.apply("VARCHAR(255)", java.sql.Types.VARCHAR)));
        dataTypeMap.put("boolean", Option.apply(JdbcType.apply("CHAR(1)", java.sql.Types.CHAR)));
        dataTypeMap.put("text", Option.apply(JdbcType.apply("VARCHAR(255)", java.sql.Types.VARCHAR)));
    }

    @Override
    public boolean canHandle(String url) {
        return url.startsWith("jdbc:teradata");
    }

    @Override
    public Option<JdbcType> getJDBCType(DataType dt) {
        Option<JdbcType> option = dataTypeMap.get(dt.simpleString().toLowerCase());
        if (option == null) {
            option = Option.empty();
        }
        return option;
    }
}
Now you can register this dialect using the snippet below, before calling any action on Spark:
JdbcDialects.registerDialect(new TDDialect());
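For example, in the context of the question's write call this could look like the sketch below (url, tableName and props are the same variables used in the question):

// Register the Teradata dialect once, before the JDBC write is triggered.
JdbcDialects.registerDialect(new TDDialect());

// With the dialect in place, Spark string columns map to VARCHAR(255)
// instead of TEXT, so Teradata accepts the generated CREATE TABLE.
ds.write().mode("append").jdbc(url, tableName, props);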
With some data sources, Hive for example, you may need to override one more method to avoid NumberFormatException or similar exceptions:
@Override
public String quoteIdentifier(String colName) {
    return colName;
}
Hope this will help anyone facing similar issues.
It's working for me; can you please try it once and let me know?
Points to be noted:
Your Hive table must use Text format for storage; it should not be ORC.
Create the schema in Teradata before writing to it from your PySpark notebook.
df = spark.sql("select * from dbname.tableName")

properties = {
    "driver": "com.teradata.jdbc.TeraDriver",
    "user": "xxxx",
    "password": "xxxxx"
}

df.write.jdbc(url='provide_url', table='dbName.tableName', properties=properties)
I am using Kryo serialization in Spark (v1.6.1) with Java, and while serializing a class which has a collection in one of its fields, it throws the following error:
Caused by: java.lang.UnsupportedOperationException
at java.util.Collections$UnmodifiableCollection.add(Collections.java:1055)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:102)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
... 27 more
I found out that this is because Kryo's default CollectionSerializer cannot deserialize the collection, since it is not modifiable, and UnmodifiableCollectionsSerializer should be used instead.
How do I specify in the Spark code that Kryo should use UnmodifiableCollectionsSerializer?
My current configuration is:
SparkConf conf = new SparkConf().setAppName("ABC");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(new Class<?>[] {*list of classes I want to register*});
In case anybody else faces this issue, here is the solution: I got it working by using the javakaffee kryo-serializers library.
Add the following Maven dependency:
<dependency>
    <groupId>de.javakaffee</groupId>
    <artifactId>kryo-serializers</artifactId>
    <version>0.42</version>
</dependency>
Write a custom Kryo registrator to register UnmodifiableCollectionsSerializer:
import com.esotericsoftware.kryo.Kryo;
import de.javakaffee.kryoserializers.UnmodifiableCollectionsSerializer;
import org.apache.spark.serializer.KryoRegistrator;

public class CustomKryoRegistrator implements KryoRegistrator {

    @Override
    public void registerClasses(Kryo kryo) {
        UnmodifiableCollectionsSerializer.registerSerializers(kryo);
    }
}
Set spark.kryo.registrator to the custom registrator's fully-qualified name
conf.set("spark.kryo.registrator", "com.abc.CustomKryoRegistrator");
References:
https://github.com/magro/kryo-serializers
Spark Kryo: Register a custom serializer
Related to this question, I got the tip that the getOrCreate idiom should be used to avoid these issues. But trying:
JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
    @Override
    public JavaStreamingContext create() {
        final SparkConf conf = new SparkConf().setAppName(NAME);
        return new JavaStreamingContext(conf, Durations.seconds(BATCH_SPAN));
    }
};

final JavaStreamingContext context = JavaStreamingContext.getOrCreate("/tmp/" + NAME, contextFactory);
I'm still getting:
Exception in thread "main" org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:874)
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:81)
org.apache.spark.streaming.api.java.JavaStreamingContext.<init>(JavaStreamingContext.scala:140)
org.example.ExamplePipeline$1.create(ExamplePipeline.java:56)
org.apache.spark.streaming.api.java.JavaStreamingContext$$anonfun$7.apply(JavaStreamingContext.scala:706)
org.apache.spark.streaming.api.java.JavaStreamingContext$$anonfun$7.apply(JavaStreamingContext.scala:705)
scala.Option.getOrElse(Option.scala:120)
org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:864)
org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:705)
org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
org.example.ExamplePipeline.createExecutionContext(ExamplePipeline.java:70)
org.example.ExamplePipeline.exec(ExamplePipeline.java:116)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeCustomInitMethod(AbstractAutowireCapableBeanFactory.java:1702)
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1641)
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1570)
at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$1.apply(SparkContext.scala:2257)
at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$1.apply(SparkContext.scala:2239)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2239)
at org.apache.spark.SparkContext$.setActiveContext(SparkContext.scala:2325)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:2197)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:874)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:81)
at org.apache.spark.streaming.api.java.JavaStreamingContext.<init>(JavaStreamingContext.scala:140)
at org.example.ExamplePipeline$1.create(ExamplePipeline.java:56)
at org.apache.spark.streaming.api.java.JavaStreamingContext$$anonfun$7.apply(JavaStreamingContext.scala:706)
at org.apache.spark.streaming.api.java.JavaStreamingContext$$anonfun$7.apply(JavaStreamingContext.scala:705)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:864)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:705)
at org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
at org.example.ExamplePipeline.createExecutionContext(ExamplePipeline.java:70)
at org.example.ExamplePipeline.exec(ExamplePipeline.java:116)
at org.example.ExamplePipeline.main(ExamplePipeline.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:123)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What am I supposed to be doing wrong?
Thanks in advance.
According to this question I think this is how I should do it:
SparkConf conf = new SparkConf().setAppName(NAME);
JavaSparkContext ctx = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate(conf));
JavaStreamingContext context = new JavaStreamingContext(ctx, Durations.seconds(BATCH_SPAN));
Right?