Dependency Issues for Cloud DataProc + Spark + Cloud BigTable with JAVA - apache-spark

I need to create an application to run on Cloud DataProc and process large BigTable writes, scans, and deletes in massively parallel fashion using Spark. This could be in JAVA (or Python if it's doable).
I am trying to write the minimum code using Eclipse to achieve the basic functionality of getting an RDD out of a BigTable table, either by using bulkPut/bulkDelete/bulkGet or by using newAPIHadoopRDD() or something similar.
I have seen multiple posts on SO and elsewhere about how this can be done and about various challenges with connecting the Bigtable API, HBase API, and Spark. Some of these posts are quite outdated by now (a couple of years old, so they may no longer be relevant). So far I haven't managed to get anything working, mostly due to various dependency clashes or inconsistencies. No matter what combination of dependencies and versions I try in pom.xml, I get either a ClassNotFound or a NoSuchMethod exception when I try running things.
Can I get some advice please on the 'working' combinations of Spark, HBase, and Bigtable dependency versions and packages that I need to include? My pom.xml currently looks like this:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>FFSpark5</groupId>
<artifactId>FFSpark5</artifactId>
<version>0.0.1-SNAPSHOT</version>
<properties>
<bigtable.version>1.0.0</bigtable.version>
<hbase.version>1.3.1</hbase.version>
<hbase-shaded.version>2.0.0-alpha2</hbase-shaded.version>
<hbase-spark.version>2.0.0-alpha4</hbase-spark.version>
<maven.compiler.target>1.8</maven.compiler.target>
<maven.compiler.source>1.8</maven.compiler.source>
<spark.version>1.6.2</spark.version>
<spark-streaming.version>1.6.2</spark-streaming.version>
<scala.version>2.11.0</scala.version>
<scalatest.version>3.0.3</scalatest.version>
<bigtable.projectID>my_gcp_project_id</bigtable.projectID>
<bigtable.instanceID>my_bigtable_instance_name</bigtable.instanceID>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark-streaming.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>${hbase-spark.version}</version>
</dependency>
<dependency>
<groupId>com.google.cloud.bigtable</groupId>
<artifactId>bigtable-hbase-1.x-hadoop</artifactId>
<version>${bigtable.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-shaded-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-shaded-client</artifactId>
<version>${hbase-shaded.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-shaded-server</artifactId>
<version>${hbase-shaded.version}</version>
</dependency>
</dependencies>
<build>
<outputDirectory>target/classes</outputDirectory>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
</resources>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>FFSpark5</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
</project>
I realize there is probably a lot of wrong stuff in this version of pom.xml, but I have played around with many combinations of dependencies and versions and couldn't get any of them to actually work. This latest one seems to go the furthest in terms of log output, but it still breaks. Here's the latest stack trace:
18/03/12 15:37:17 INFO BigtableSession: Bigtable options: BigtableOptions{dataHost=bigtable.googleapis.com, tableAdminHost=bigtableadmin.googleapis.com, instanceAdminHost=bigtableadmin.googleapis.com .... (lots of other options here)}.
18/03/12 15:37:17 INFO RefreshingOAuth2CredentialsInterceptor: Refreshing the OAuth token
18/03/12 15:37:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 210.6 KB, free 210.6 KB)
18/03/12 15:37:19 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.9 KB, free 230.5 KB)
18/03/12 15:37:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:58982 (size: 19.9 KB, free: 457.9 MB)
18/03/12 15:37:19 INFO SparkContext: Created broadcast 0 from broadcast at HBaseContext.scala:73
18/03/12 15:37:19 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 160.0 B, free 230.6 KB)
18/03/12 15:37:19 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 120.0 B, free 230.7 KB)
18/03/12 15:37:19 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:58982 (size: 120.0 B, free: 457.9 MB)
18/03/12 15:37:19 INFO SparkContext: Created broadcast 1 from broadcast at HBaseContext.scala:74
Direct test done
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.getMvccReadPoint()J
at org.apache.hadoop.hbase.client.PackagePrivateFieldAccessor.getMvccReadPoint(PackagePrivateFieldAccessor.java:39)
at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:1088)
at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:601)
at FFSpark5.main(FFSpark5.java:64)
And below is my basic code. The idea is to do 3 tests:
Test 1: simply tries to access Bigtable directly via the simple Bigtable API, just to make sure there are no simple issues, like authentication, etc. This works fine
Test 2: tries to get newAPIHadoopRDD(). This fails
Test 3: tries to get bulkPut(). This fails
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.client.Connection;
class PutFunction implements Function<String, Put> {
    private static final long serialVersionUID = 1L;

    public Put call(String v) throws Exception {
        String[] cells = v.split(",");
        Put put = new Put(Bytes.toBytes(cells[0]));
        put.addColumn(Bytes.toBytes(cells[1]), Bytes.toBytes(cells[2]),
                Bytes.toBytes(cells[3]));
        return put;
    }
}

public class FFSpark5
{
    public static void main(String args[]) throws IOException
    {
        SparkConf conf = new SparkConf().setAppName("SparkTest").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        Configuration hBaseConf = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(hBaseConf);
        JavaHBaseContext hbaseContext = new JavaHBaseContext(sc, hBaseConf);

        // test 1: simple direct Bigtable access
        connection.getTable(TableName.valueOf("FFTransFlat".getBytes()))
                .put(new Put("abc".getBytes())
                        .addColumn("CI".getBytes(), "I".getBytes(), "val".getBytes()));
        System.out.println("Direct test done");

        // Test 2: newAPIHadoopRDD()
        Scan scan1 = new Scan();
        scan1.setCaching(500);
        scan1.setCacheBlocks(false);
        hBaseConf.set(TableInputFormat.INPUT_TABLE, "FFTransFlat");
        hBaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan1));
        JavaPairRDD<ImmutableBytesWritable, Result> source = sc
                .newAPIHadoopRDD(hBaseConf, TableInputFormat.class,
                        ImmutableBytesWritable.class, Result.class);
        System.out.println(source.count());

        // Test 3: bulkPut()
        List<String> list = new ArrayList<String>(5);
        list.add("1,CI,a,1");
        list.add("2,CI,a,2");
        list.add("3,CI,a,3");
        JavaRDD<String> rdd = sc.parallelize(list);
        byte tableName[] = "FFTransFlat".getBytes();
        hbaseContext.bulkPut(rdd,
                TableName.valueOf(tableName),
                new PutFunction());
        System.out.println(source.count());

        connection.close();
    }
}
I see from the DataProc website that Spark 2.2.0 and 1.6.2 are supported. I had issues with 2.2.0, so I'm using 1.6.2.
Can I get some advice on the following:
What is the right combination of dependencies and versions to use (specifically to work with Cloud Bigtable, not an HBase cluster)?
Is it recommended to achieve parallelization using newAPIHadoopRDD(), or something like bulkRead()/bulkDelete(), etc.? Or is there another preferred, performant way to do MPP with DataProc/Bigtable?
Apologies for the lengthy post -- this is our first DataProc attempt.
*** UPDATE:
I managed to get something working after updating the Bigtable dependency to bigtable-hbase-2.x-hadoop and the HBase versions to 2.0.0-alpha2. At least bulkPut seems to be working at this stage. Now I will work on cleaning the unneeded stuff out of the dependencies.
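For reference, a minimal sketch of the dependency changes described in the update. Treat the exact versions as assumptions from my experiments rather than an official compatibility matrix, and pick a recent 1.x bigtable-hbase client release:

```xml
<!-- Bigtable's HBase 2.x shim instead of bigtable-hbase-1.x-hadoop -->
<dependency>
    <groupId>com.google.cloud.bigtable</groupId>
    <artifactId>bigtable-hbase-2.x-hadoop</artifactId>
    <version>${bigtable.version}</version> <!-- a recent 1.x client release -->
</dependency>
<!-- HBase artifacts aligned on the 2.0.0-alpha line -->
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>2.0.0-alpha4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.0.0-alpha2</version>
</dependency>
```

The key point is that the bigtable-hbase artifact line (1.x vs 2.x) must match the major version of the HBase API artifacts on the classpath.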

Here is a full working example of Cloud Bigtable with Hortonworks' SHC, which is based on HBase 1.*. We will work on creating a similar example with HBase 2.*'s new Spark integration, built on Cloud Bigtable artifacts designed to work with the new HBase 2.* APIs (tracking issue link).

Related

Problems configuring clustered Vertx Eventbus in Quarkus

I'm using:
Quarkus 1.6.1.Final
Vertx 3.9.1 (provided by quarkus-vertx dependency, see pom.xml below)
And I can't get the clustered Eventbus working. I've followed the instructions listed here:
https://vertx.io/docs/vertx-hazelcast/java/
I've also enabled clustering in Quarkus:
quarkus.vertx.cluster.clustered=true
quarkus.vertx.cluster.port=8081
quarkus.vertx.prefer-native-transport=true
quarkus.http.port=8080
And here is my pom.xml:
<dependencies>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-resteasy</artifactId>
</dependency>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-resteasy-mutiny</artifactId>
</dependency>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-vertx</artifactId>
</dependency>
<dependency>
<groupId>io.vertx</groupId>
<artifactId>vertx-hazelcast</artifactId>
<version>3.9.2</version>
<exclusions>
<exclusion>
<groupId>io.vertx</groupId>
<artifactId>vertx-core</artifactId>
</exclusion>
<!-- <exclusion>-->
<!-- <groupId>com.hazelcast</groupId>-->
<!-- <artifactId>hazelcast</artifactId>-->
<!-- </exclusion>-->
</exclusions>
</dependency>
<!-- <dependency>-->
<!-- <groupId>com.hazelcast</groupId>-->
<!-- <artifactId>hazelcast-all</artifactId>-->
<!-- <version>3.9</version>-->
<!-- </dependency>-->
<dependency>
<groupId>io.netty</groupId>
<artifactId>netty-transport-native-epoll</artifactId>
<classifier>linux-x86_64</classifier>
</dependency>
</dependencies>
And the error I get is the following:
Caused by: java.lang.ClassNotFoundException: com.hazelcast.core.MembershipListener
As you can see in my pom.xml (commented out), I've also tried adding the hazelcast-all:3.9 dependency and excluding the hazelcast dependency from vertx-hazelcast:3.9.2; then this error disappears, but another comes up:
Caused by: com.hazelcast.config.InvalidConfigurationException: cvc-complex-type.2.4.a: Invalid content was found starting with element '{"http://www.hazelcast.com/schema/config":memcache-protocol}'. One of '{"http://www.hazelcast.com/schema/config":public-address, "http://www.hazelcast.com/schema/config":reuse-address, "http://www.hazelcast.com/schema/config":outbound-ports, "http://www.hazelcast.com/schema/config":join, "http://www.hazelcast.com/schema/config":interfaces, "http://www.hazelcast.com/schema/config":ssl, "http://www.hazelcast.com/schema/config":socket-interceptor, "http://www.hazelcast.com/schema/config":symmetric-encryption, "http://www.hazelcast.com/schema/config":member-address-provider}' is expected.
Am I doing something wrong or forgetting something, or is this simply a bug in Quarkus or in Vert.x?
Thanks in advance for any help.
I think the most probable reason for your issue is that you are using the quarkus-universe-bom, which enforces a version of Hazelcast (we have a Hazelcast extension there) that is not compatible with vertx-hazelcast.
Check your dependency tree with mvn dependency:tree and make sure the Hazelcast artifacts are of the version required by vertx-hazelcast.
Another option would be to simply use the quarkus-bom, which does not enforce a Hazelcast version, and let vertx-hazelcast drag in the dependency by itself.
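As a sketch, assuming a standard Quarkus Maven setup, that swap would look something like this in dependencyManagement (the version shown matches the Quarkus version from the question and is otherwise an assumption):

```xml
<dependencyManagement>
    <dependencies>
        <dependency>
            <!-- quarkus-bom instead of quarkus-universe-bom: it does not
                 pin a Hazelcast version, so vertx-hazelcast can pull its own -->
            <groupId>io.quarkus</groupId>
            <artifactId>quarkus-bom</artifactId>
            <version>1.6.1.Final</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
```

You can then confirm the resolved Hazelcast version with `mvn dependency:tree -Dincludes=com.hazelcast`.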
It seems like a bug in Quarkus and this issue is related to:
https://github.com/quarkusio/quarkus/issues/10889
Bringing this from its winter sleep...
I am looking to use Quarkus 2 + Vert.x 4 and use either the Vert.x shared data API or the Vert.x cluster manager in order to achieve an in-process, distributed cache (as opposed to an external distributed cache cluster).
What's unclear to me, also from looking at the git issue described above (which is still open), is whether I can count on these APIs working for me at this time with the versions I mentioned.
Any comments will be great!
Thanks in advance...
[UPDATE]: looks like the clustered cache works with no issues using the shared data API along with Quarkus, Vert.x, Hazelcast & the Mutiny bindings for Vert.x (all with the latest versions).
All I needed to do was set quarkus.vertx.cluster.clustered=true in the Quarkus properties file, use the vertx.sharedData().getClusterWideMap implementation for the distributed cache, and add Gradle/Maven 'io.vertx:vertx-hazelcast:4.3.1' support.
In general, that's all it took for a small POC.
thanks

Converting Java class to Json schema - FasterXML / jackson-module-jsonSchema

I am playing with some Spring Boot code to convert a Java class to a JSON schema, and I am getting strange behavior just by adding the dependency to the POM file, as in:
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-jsonSchema</artifactId>
<version>2.9.4</version>
</dependency>
the error I am getting is:
The Bean Validation API is on the classpath but no implementation could be found
Action:
Add an implementation, such as Hibernate Validator, to the classpath
Any suggestions on reading about this or resolving it?
Thanks.
Yes, this happens because the artifact you mentioned depends on:
<dependency>
<groupId>javax.validation</groupId>
<artifactId>validation-api</artifactId>
<version>1.1.0.Final</version>
</dependency>
But this is only the validation API; a real validator that can do the 'real' work has to be added.
It would be interesting to see your pom.xml because many Spring Boot Starters already come with a validation implementation, e.g.:
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
…comes with…
<dependency>
<groupId>org.hibernate.validator</groupId>
<artifactId>hibernate-validator-parent</artifactId>
</dependency>
By the way, the behavior you describe is also documented in this Spring Boot issue.
It will be fixed with this pull request, which mandates a validation implementation only if a validation action is really performed (@Validated).
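Until then, you can add an implementation yourself alongside jackson-module-jsonSchema. A sketch with Hibernate Validator; the 5.x line matches the validation-api 1.1 dependency above, and the exact version here is an assumption to adjust to your Spring Boot version:

```xml
<dependency>
    <groupId>org.hibernate</groupId>
    <artifactId>hibernate-validator</artifactId>
    <!-- 5.x implements Bean Validation 1.1 (javax.validation:validation-api:1.1.x) -->
    <version>5.4.3.Final</version>
</dependency>
```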

Datastax Cassandra Driver throwing CodecNotFoundException

The exact Exception is as follows
com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [varchar <-> java.math.BigDecimal]
These are the versions of Software I am using
Spark 1.5
Datastax-cassandra 3.2.1
CDH 5.5.1
The code I am trying to execute is a Spark program using the Java API; it basically reads data (CSVs) from HDFS and loads it into Cassandra tables. I am using the spark-cassandra-connector. I initially had a lot of issues with the Google Guava library conflict, which I was able to resolve by shading the Guava library and building a snapshot jar with all the dependencies.
However, I was able to load data for some files, but for others I get the Codec exception. When I researched this issue, I found the following thread on it:
https://groups.google.com/a/lists.datastax.com/forum/#!topic/java-driver-user/yZyaOQ-wazk
After going through this discussion, what I understand is that either I am using the wrong version of the cassandra-driver, or there is still a classpath issue related to the Guava library, since Cassandra 3.0 and later versions use Guava 16.0.1 and the discussion above says there might be a lower version of Guava present on the classpath.
Here is my pom.xml file:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.10</artifactId>
<version>1.5.0-M3</version>
</dependency>
<dependency>
<groupId>org.apache.cassandra</groupId>
<artifactId>cassandra-clientutil</artifactId>
<version>3.2.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<relocations>
<relocation>
<pattern>com.google</pattern>
<shadedPattern>com.pointcross.shaded.google</shadedPattern>
</relocation>
</relocations>
<minimizeJar>false</minimizeJar>
<shadedArtifactAttached>true</shadedArtifactAttached>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
and these are the dependencies that were downloaded using the above pom
spark-core_2.10-1.5.0.jar
spark-cassandra-connector- java_2.10-1.5.0-M3.jar
spark-cassandra-connector_2.10-1.5.0-M3.jar
spark-repl_2.10-1.5.1.jar
spark-bagel_2.10-1.5.1.jar
spark-mllib_2.10-1.5.1.jar
spark-streaming_2.10-1.5.1.jar
spark-graphx_2.10-1.5.1.jar
guava-16.0.1.jar
cassandra-clientutil-3.2.1.jar
cassandra-driver-core-3.0.0-alpha4.jar
Above are some of the main dependencies in my snapshot jar.
Why am I getting the CodecNotFoundException? Is it because of the classpath (Guava)? Or the cassandra-driver (cassandra-driver-core-3.0.0-alpha4.jar for Datastax Cassandra 3.2.1)? Or because of the code?
Another point is that all the dates I am inserting go into columns whose data type is timestamp.
Also, when I do a spark-submit, I see the classpath in the logs; there are other Guava versions under the Hadoop libs. Are these causing the problem?
How do we specify a user-specific classpath while doing a spark-submit? Will that help?
Would be glad to get some points on these.
Thanks
Following is the stacktrace
com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [timestamp <-> java.lang.String]
at com.datastax.driver.core.CodecRegistry.notFound(CodecRegistry.java:689)
at com.datastax.driver.core.CodecRegistry.createCodec(CodecRegistry.java:550)
at com.datastax.driver.core.CodecRegistry.findCodec(CodecRegistry.java:530)
at com.datastax.driver.core.CodecRegistry.codecFor(CodecRegistry.java:485)
at com.datastax.driver.core.AbstractGettableByIndexData.codecFor(AbstractGettableByIndexData.java:85)
at com.datastax.driver.core.BoundStatement.bind(BoundStatement.java:198)
at com.datastax.driver.core.DefaultPreparedStatement.bind(DefaultPreparedStatement.java:126)
at com.cassandra.test.LoadDataToCassandra$1.call(LoadDataToCassandra.java:223)
at com.cassandra.test.LoadDataToCassandra$1.call(LoadDataToCassandra.java:1)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1555)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1121)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1121)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I also got
com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [java.math.BigDecimal <-> java.lang.String]
When you call bind(params...) on a PreparedStatement, the driver expects you to provide values with Java types that map to the CQL types.
This error ([timestamp <-> java.lang.String]) is telling you that there is no Codec registered that maps a Java String to a CQL timestamp. In the Java driver, the timestamp type maps to java.util.Date. So you have two options here:
Where the column being bound is a timestamp, provide a Date-typed value instead of a String.
Create a codec that maps timestamp <-> String. To do so, you could create a subclass of MappingCodec, as described on the documentation site, that maps String to timestamp:
public class TimestampAsStringCodec extends MappingCodec<String, Date> {
    public TimestampAsStringCodec() {
        super(TypeCodec.timestamp(), String.class);
    }

    @Override
    protected Date serialize(String value) { ... }

    @Override
    protected String deserialize(Date value) { ... }
}
You then would need to register the Codec:
cluster.getConfiguration().getCodecRegistry()
.register(new TimestampAsStringCodec());
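The elided serialize()/deserialize() bodies boil down to a String <-> java.util.Date conversion. A driver-independent sketch of just that conversion logic; the "yyyy-MM-dd HH:mm:ss" pattern and the UTC zone are assumptions you would match to the format your CSV files actually contain:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Conversion logic the codec bodies above would delegate to.
class TimestampStrings {
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss"; // assumption

    private static SimpleDateFormat format() {
        SimpleDateFormat f = new SimpleDateFormat(PATTERN);
        f.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption
        return f; // SimpleDateFormat is not thread-safe, so create per call
    }

    // would back serialize(String) in the codec sketch above
    static Date toDate(String value) throws ParseException {
        return value == null ? null : format().parse(value);
    }

    // would back deserialize(Date) in the codec sketch above
    static String toText(Date value) {
        return value == null ? null : format().format(value);
    }
}
```

Null handling matters here: the driver passes null through for unset columns, so both directions should tolerate it.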
A better solution is provided here:
The correct mappings that the driver offers out of the box for temporal types are:
DATE <-> com.datastax.driver.core.LocalDate : use getDate()

Spark 1.5.1 not working with hive jdbc 1.2.0

I am trying to execute a Hive query using Spark 1.5.1 in standalone mode with Hive JDBC version 1.2.0.
Here is my piece of code:
private static final String HIVE_DRIVER = "org.apache.hive.jdbc.HiveDriver";
private static final String HIVE_CONNECTION_URL = "jdbc:hive2://localhost:10000/idw";
private static final SparkConf sparkconf = new SparkConf()
        .set("spark.master", "spark://impetus-i0248u:7077")
        .set("spark.app.name", "sparkhivesqltest")
        .set("spark.cores.max", "1")
        .set("spark.executor.memory", "512m");
private static final JavaSparkContext sc = new JavaSparkContext(sparkconf);
private static final SQLContext sqlContext = new SQLContext(sc);

public static void main(String[] args) {
    // Data source options
    Map<String, String> options = new HashMap<String, String>();
    options.put("driver", HIVE_DRIVER);
    options.put("url", HIVE_CONNECTION_URL);
    options.put("dbtable", "(select * from idw.emp) as employees_name");
    DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
}
I am getting below error at sqlContext.read().format("jdbc").options(options).load();
Exception in thread "main" java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveResultSetMetaData.isSigned(HiveResultSetMetaData.java:143)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:135)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
I am running spark 1.5.1 in standalone mode
Hadoop version is 2.6
Hive version is 1.2.0
Here are the dependencies that I have added to the Java project's pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.5.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.0</version>
<exclusions>
<exclusion>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.0</version>
</dependency>
Can anyone help me out with this?
If somebody has used Spark 1.5.1 with the Hive JDBC driver, can you please tell me the compatible Hive version for Spark 1.5.1?
Thank you in advance!
As far as I can tell, you're unfortunately out of luck in terms of using the JDBC connector until it's fixed upstream; the "Method not supported" in this case is not just a version mismatch, but is explicitly not implemented in the Hive JDBC library branch-1.2, and even if you look at the Hive JDBC master branch or branch-2.0 it's still not implemented:
public boolean isSigned(int column) throws SQLException {
    throw new SQLException("Method not supported");
}
Looking at the Spark callsite, isSigned is called during resolveTable in Spark 1.5 as well as at master.
That said, most likely the real reason this "issue" remains is that when interacting with Hive, you're expected to connect to the Hive metastore directly rather than messing around with JDBC connectors; see the Hive Tables in Spark documentation for how to do this. Essentially, you want to think of Spark as an equal/replacement of Hive rather than as a consumer of Hive.
This way, pretty much all you do is add hive-site.xml to your Spark conf/ directory and make sure the datanucleus jars under lib_managed/jars are available to all Spark executors; then Spark talks directly to the Hive metastore for schema info and fetches data directly from your HDFS in a way amenable to nicely parallelized RDDs.
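A minimal hive-site.xml for that conf/ directory might contain only the metastore location; the host and port below are placeholders for wherever your metastore service runs:

```xml
<configuration>
    <property>
        <!-- point Spark at the existing Hive metastore service -->
        <name>hive.metastore.uris</name>
        <value>thrift://metastore-host:9083</value>
    </property>
</configuration>
```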

Log4J - SiftingAppender-like functionality

I work on a project that uses Log4J. One of the requirements is to create a separate log file for each thread; this in itself was an odd issue, somewhat sorted by creating a new FileAppender on the fly and attaching it to the Logger instance:
Logger logger = Logger.getLogger(<thread dependent string>);
FileAppender appender = new FileAppender();
appender.setFile(fileName);
appender.setLayout(new PatternLayout(lp.getPattern()));
appender.setName(<thread dependent string>);
appender.setThreshold(Level.DEBUG);
appender.activateOptions();
logger.addAppender(appender);
Everything went fine until we realised that another library we use, Spring Framework v3.0.0 (which uses Commons Logging), does not play ball with the technique above: the Spring logging data is "seen" only by Appenders initialised from the log4j configuration file, but not by the runtime-created Appenders.
So, back to square one.
After some investigation, I found out that the new and improved LogBack has an appender, SiftingAppender, which does exactly what we need, i.e. thread-level logging to independent files.
At the moment, moving to LogBack is not an option, so, being stuck with Log4J, how can I achieve SiftingAppender-like functionality and keep Spring happy as well ?
Note: Spring is only used for JdbcTemplate functionality, no IOC; in order to “hook” Spring’s Commons Logging to Log4J I added this line in the log4j.properties file:
log4j.logger.org.springframework=DEBUG
as instructed here.
In Log4j 2, we can now use the RoutingAppender:
The RoutingAppender evaluates LogEvents and then routes them to a subordinate Appender. The target Appender may be an appender previously configured and may be referenced by its name or the Appender can be dynamically created as needed.
From their FAQ:
How do I dynamically write to separate log files?
Look at the RoutingAppender. You can define multiple routes in the configuration, and put values in the ThreadContext map that determine which log file subsequent events in this thread get logged to.
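A sketch of such a Log4j 2 configuration, assuming your code puts a key named threadName into the ThreadContext map before logging; the key name and file paths here are arbitrary choices, not required names:

```xml
<Configuration status="warn">
  <Appenders>
    <Routing name="Routing">
      <!-- $${ctx:threadName} is resolved per event from the ThreadContext map -->
      <Routes pattern="$${ctx:threadName}">
        <Route>
          <File name="File-${ctx:threadName}"
                fileName="logs/${ctx:threadName}.log">
            <PatternLayout pattern="%d %p %c{2}: %m%n"/>
          </File>
        </Route>
      </Routes>
    </Routing>
  </Appenders>
  <Loggers>
    <Root level="debug">
      <AppenderRef ref="Routing"/>
    </Root>
  </Loggers>
</Configuration>
```

Each thread then calls ThreadContext.put("threadName", someId) before logging, and its events are routed to a file appender created on demand for that id.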
LogBack is accessed via the slf4j API. There is an adapter library called jcl-over-slf4j which exposes the Commons Logging interface but sends all the logging to the slf4j API, which goes directly to the implementation, LogBack. If you are using Maven, here are the dependencies:
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.5.8</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<version>1.5.8</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-core</artifactId>
<version>0.9.18</version>
</dependency>
(and add the commons-logging to the exclusion list, see here)
I struggled for a while to find SiftingAppender-like functionality in log4j (we couldn't switch to logback because of some dependencies), and ended up with a programmatic solution that works pretty well, using an MDC and appending loggers at runtime:
// this can be any thread-specific string
String processID = request.getProcessID();
Logger logger = Logger.getRootLogger();

// append a new file logger if no logger exists for this tag
if (logger.getAppender(processID) == null) {
    try {
        String pattern = "%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n";
        String logfile = "log/" + processID + ".log";
        FileAppender fileAppender = new FileAppender(
                new PatternLayout(pattern), logfile, true);
        fileAppender.setName(processID);
        // add a filter so we can ignore any logs from other threads
        fileAppender.addFilter(new ProcessIDFilter(processID));
        logger.addAppender(fileAppender);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

// tag all child threads with this process-id so we can separate out log output
MDC.put("process-id", processID);

// whatever you want to do in the thread
LOG.info("This message will only end up in " + processID + ".log!");
MDC.remove("process-id");
The filter appended above just checks for a specific process id:
public class RunIdFilter extends Filter {
    private final String runId;

    public RunIdFilter(String runId) {
        this.runId = runId;
    }

    @Override
    public int decide(LoggingEvent event) {
        Object mdc = event.getMDC("run-id");
        if (runId.equals(mdc)) {
            return Filter.ACCEPT;
        }
        return Filter.DENY;
    }
}
Hope this helps a bit.
I like to include all of the slf4j facades/re-routers/whatever-you-call-them. Also note the "provided" hack, which keeps dependencies from pulling in Commons Logging; previously I was using a fake, empty Commons Logging library called version-99.0-does-not-exist.
Also see http://blog.springsource.com/2009/12/04/logging-dependencies-in-spring/
<dependencies>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<!-- use provided scope on real JCL instead -->
<!-- <version>99.0-does-not-exist</version> -->
<version>1.1.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging-api</artifactId>
<!-- use provided scope on real JCL instead -->
<!-- <version>99.0-does-not-exist</version> -->
<version>1.1</version>
<scope>provided</scope>
</dependency>
<!-- the slf4j commons-logging replacement -->
<!-- if any package is using jakarta commons logging this will -->
<!-- re-route it through slf4j. -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<version>${version.slf4j}</version>
</dependency>
<!-- the slf4j log4j replacement. -->
<!-- if any package is using log4j this will re-route -->
<!-- it through slf4j. -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>log4j-over-slf4j</artifactId>
<version>${version.slf4j}</version>
</dependency>
<!-- the slf4j java.util.logging replacement. -->
<!-- if any package is using java.util.logging this will re-route -->
<!-- it through slf4j. -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jul-to-slf4j</artifactId>
<version>${version.slf4j}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${version.slf4j}</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>${version.logback}</version>
</dependency>
</dependencies>
<properties>
<version.logback>0.9.15</version.logback>
<version.slf4j>1.5.8</version.slf4j>
</properties>
Have you looked at log4j's NDC and MDC? These at least allow you to tag thread-specific data onto your logging. Not exactly what you're asking for, but it might be useful. There's a discussion here.
