I am working with Apache Spark and Apache Ignite. I have a Spark dataset that I wrote to Ignite using the following code:
dataset.write()
.mode(SaveMode.Overwrite)
.format(FORMAT_IGNITE())
.option(OPTION_CONFIG_FILE(), "ignite-server-config.xml")
.option(OPTION_TABLE(), "CUSTOM_VALUES")
.option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "ID")
.save();
And I am reading it back to perform a group-by operation, which will be pushed down to Ignite:
Dataset igniteDataset = sparkSession.read()
.format(FORMAT_IGNITE())
.option(OPTION_CONFIG_FILE(), "ignite-server-config.xml")
.option(OPTION_TABLE(), "CUSTOM_VALUES")
.load();
RelationalGroupedDataset idGroupedData = igniteDataset.groupBy(customized_id);
Dataset<Row> result = idGroupedData.agg(count(id).as("count_id"),
count(fid).as("count_custom_field_id"),
count(type).as("count_customized_type"),
count(val).as("count_value"), count(customized_id).as("groupCount"));
Now I want to get the number of rows returned by the group-by, so I am calling count() on the dataset: result.count();
When I do this, I get the following exception:
Caused by: org.h2.jdbc.JdbcSQLException: Syntax error in SQL statement "SELECT COUNT(1) AS COUNT FROM (SELECT FROM CUSTOM_VALUES GROUP[*] BY CUSTOMIZED_ID) TABLE1 "; expected "., (, USE, AS, RIGHT, LEFT, FULL, INNER, JOIN, CROSS, NATURAL, ,, SELECT"; SQL statement:
SELECT COUNT(1) AS count FROM (SELECT FROM CUSTOM_VALUES GROUP BY CUSTOMIZED_ID) table1 [42001-197]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:357)
at org.h2.message.DbException.getSyntaxError(DbException.java:217)
Other functions such as show() and collectAsList().size() work.
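For example, counting on the driver side works (a minimal sketch using the result dataset built above):
// Collect the grouped rows to the driver and count them there; this avoids
// the COUNT(1) wrapper query that gets pushed down to Ignite/H2.
long rowCount = result.collectAsList().size();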
What am I missing here?
I tested your example against the latest community version, 8.7.5, of GridGain, which is the open-source edition of GridGain based on the Ignite 2.7.0 sources with a subset of additional fixes (https://www.gridgain.com/resources/download).
Here is the code:
import static org.apache.spark.sql.functions.count;

import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.RelationalGroupedDataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Main {
public static void main(String[] args) {
if (args.length < 1)
throw new IllegalArgumentException("You should set the path to client configuration file.");
String configPath = args[0];
SparkSession session = SparkSession.builder()
.enableHiveSupport()
.getOrCreate();
Dataset<Row> igniteDataset = session.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "Person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), configPath) //Ignite config.
.load();
RelationalGroupedDataset idGroupedData = igniteDataset.groupBy("CITY_ID");
Dataset<Row> result = idGroupedData.agg(count("id").as("count_id"),
count("city_id").as("count_city_id"),
count("name").as("count_name"),
count("age").as("count_age"),
count("company").as("count_company"));
result.show();
session.close();
}
}
Here are the maven dependencies:
<dependencies>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>gridgain-core</artifactId>
<version>8.7.5</version>
</dependency>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>ignite-core</artifactId>
<version>8.7.5</version>
</dependency>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>ignite-spring</artifactId>
<version>8.7.5</version>
</dependency>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>ignite-indexing</artifactId>
<version>8.7.5</version>
</dependency>
<dependency>
<groupId>org.gridgain</groupId>
<artifactId>ignite-spark</artifactId>
<version>8.7.5</version>
</dependency>
</dependencies>
Here is the cache configuration:
<property name="cacheConfiguration">
<list>
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="Person"/>
<property name="cacheMode" value="PARTITIONED"/>
<property name="atomicityMode" value="ATOMIC"/>
<property name="sqlSchema" value="PUBLIC"/>
<property name="queryEntities">
<list>
<bean class="org.apache.ignite.cache.QueryEntity">
<property name="keyType" value="PersonKey"/>
<property name="valueType" value="PersonValue"/>
<property name="tableName" value="Person"/>
<property name="keyFields">
<list>
<value>id</value>
<value>city_id</value>
</list>
</property>
<property name="fields">
<map>
<entry key="id" value="java.lang.Integer"/>
<entry key="city_id" value="java.lang.Integer"/>
<entry key="name" value="java.lang.String"/>
<entry key="age" value="java.lang.Integer"/>
<entry key="company" value="java.lang.String"/>
</map>
</property>
<property name="aliases">
<map>
<entry key="id" value="id"/>
<entry key="city_id" value="city_id"/>
<entry key="name" value="name"/>
<entry key="age" value="age"/>
<entry key="company" value="company"/>
</map>
</property>
</bean>
</list>
</property>
</bean>
</list>
</property>
Using Spark 2.3.0, which is the only version supported by the ignite-spark dependency, I get the following result on my test data:
Data:
ID,CITY_ID,NAME,AGE,COMPANY,
4,1,Justin Bronte,23,bank,
3,1,Helen Richard,49,bank,
Result:
+-------+--------+-------------+----------+---------+-------------+
|CITY_ID|count_id|count_city_id|count_name|count_age|count_company|
+-------+--------+-------------+----------+---------+-------------+
| 1| 2| 2| 2| 2| 2|
+-------+--------+-------------+----------+---------+-------------+
Also, this code can be applied unchanged to Ignite 2.7.0.
Hi everyone:
There is an exception I have never encountered before. Please see below:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.Utilities.copyTableJobPropertiesToConf(Lorg/apache/hadoop/hive/ql/plan/TableDesc;Lorg/apache/hadoop/conf/Configuration;)V
at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:399)
at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1(TableReader.scala:314)
at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$createOldHadoopRDD$1$adapted(TableReader.scala:314)
at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8(HadoopRDD.scala:181)
at org.apache.spark.rdd.HadoopRDD.$anonfun$getJobConf$8$adapted(HadoopRDD.scala:181)
Here is the code:
import org.apache.spark.sql.SparkSession
object test {
def main(args:Array[String]): Unit = {
System.setProperty("HADOOP_USER_NAME", "nuochengze")
val spark: SparkSession = SparkSession.builder()
.appName("Test")
.master("local[*]")
.config("hadoop.home.dir", "hdfs://pc001:8082/user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
spark.sql("use test")
spark.sql(
"""
|select * from emp
|""".stripMargin).show
spark.close()
}
}
Something that left me at a loss happened when I used Spark to operate on Hive:
I can perform DDL operations through spark.sql(...), but when I try to perform DML operations such as SELECT, the above exception is thrown. I know this method is lacking, but after searching the internet I did not find any related posts explaining how to solve it when this method is missing.
Have you encountered this? If so, can I ask for help?
Thanks!
I have found the cause of the error. Due to my negligence, when importing dependencies into pom.xml, the versions of some modules were inconsistent. If you encounter similar errors, you can refer to my current Maven configuration:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>test</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.25</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.10.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>3.1.2</version>
</dependency>
</dependencies>
</project>
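One way to guard against this kind of version drift is to factor the shared versions into Maven properties and reference them from every related dependency; a sketch (the property names here are illustrative):
<properties>
<spark.version>3.1.2</spark.version>
<hive.version>3.1.2</hive.version>
<scala.binary.version>2.12</scala.binary.version>
</properties>
...
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive.version}</version>
</dependency>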
I am working on Kafka Spark Streaming. The IDE doesn't show any errors and the program builds successfully, but I am getting this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at KafkaSparkStream1$.main(KafkaSparkStream1.scala:13)
at KafkaSparkStream1.main(KafkaSparkStream1.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more
I am using Maven, and I have set up my environment variables correctly; every component works individually. My Spark version is 3.0.0-preview2 and my Scala version is 2.12.
I have exported a spark-streaming-kafka JAR file.
Here is my pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.org.cpg.casestudy</groupId>
<artifactId>Kafka_casestudy</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<spark.version>3.0.0</spark.version>
<scala.version>2.12</scala.version>
</properties>
<build>
<plugins>
<!-- Maven Compiler Plugin-->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- Apache Kafka Clients-->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.5.0</version>
</dependency>
<!-- Apache Kafka Streams-->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>2.5.0</version>
</dependency>
<!-- Apache Log4J2 binding for SLF4J -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.11.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0-preview2</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0-preview2</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>3.0.0-preview2</version>
</dependency>
</dependencies>
Here is my code (a word count of the messages sent by the producer):
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark._
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.kafka.common.serialization.StringDeserializer
object KafkaSparkStream {
def main(args: Array[String]): Unit = {
val brokers = "localhost:9092";
val groupid = "GRP1";
val topics = "KafkaTesting";
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("KafkaSparkStreaming");
val ssc = new StreamingContext(sparkConf,Seconds(10))
val sc = ssc.sparkContext
sc.setLogLevel("off")
val topicSet = topics.split(",").toSet
val kafkaPramas = Map[String , Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupid,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)
val messages = KafkaUtils.createDirectStream[String,String](
ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String,String](topicSet,kafkaPramas)
)
val line=messages.map(_.value)
val words = line.flatMap(_.split(" "))
val wordCount = words.map(x=> (x,1)).reduceByKey(_+_)
wordCount.print()
ssc.start()
ssc.awaitTermination()
}
}
Try cleaning your local Maven repository, or else run the command below to force-update your dependency JARs from the remote repositories:
mvn clean install -U
Your Spark dependencies, especially spark-core_2.12-3.0.0-preview2.jar, are not added to your classpath while executing the Spark JAR (note that they are declared with provided scope in your pom).
You can supply them via:
spark-submit --jars <path>/spark-core_2.12-3.0.0-preview2.jar
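For reference, a fuller spark-submit invocation might look like the sketch below; the main class name is taken from your stack trace, the JAR path is a guess based on your pom, and --packages pulls in the Kafka integration, which is not part of the Spark distribution:
spark-submit \
  --class KafkaSparkStream1 \
  --master local[*] \
  --packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.0.0-preview2 \
  target/Kafka_casestudy-1.0-SNAPSHOT.jar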
I configured Databricks Connect on Azure to run my Spark programs in the Azure cloud. For a dry run I tested a word-count program, but it is failing with the following error:
"Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
I am using IntelliJ to run the program. I have the necessary permissions to access the cluster, but I am still getting this error.
The following program is a wrapper that takes in the parameters and publishes the results.
package com.spark.scala
import com.spark.scala.demo.{Argument, WordCount}
import org.apache.spark.sql.SparkSession
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
import scala.collection.mutable.Map
object Test {
def main(args: Array[String]): Unit = {
val argumentMap: Map[String, String] = Argument.parseArgs(args)
val spark = SparkSession
.builder()
.master("local")
.getOrCreate()
println(spark.range(100).count())
val rawread = String.format("/mnt/%s", argumentMap.get("--raw-reads").get)
val data = spark.sparkContext.textFile(rawread)
print(data.count())
val rawwrite = String.format("/dbfs/mnt/%s", argumentMap.get("--raw-write").get)
WordCount.executeWordCount(spark, rawread, rawwrite);
// The Spark code will execute on the Databricks cluster.
spark.stop()
}
}
The following code performs the word-count logic:
package com.spark.scala.demo
import org.apache.spark.sql.SparkSession
object WordCount{
def executeWordCount(sparkSession:SparkSession, read: String, write: String)
{
println("starting word count process ")
//val path = String.format("/mnt/%s", "tejatest\wordcount.txt")
//Reading input file and creating rdd with no of partitions 5
val bookRDD=sparkSession.sparkContext.textFile(read)
//Regex to clean text
val pat = """[^\w\s\$]"""
val cleanBookRDD=bookRDD.map(line=>line.replaceAll(pat, ""))
val wordsRDD=cleanBookRDD.flatMap(line=>line.split(" "))
val wordMapRDD=wordsRDD.map(word=>(word->1))
val wordCountMapRDD=wordMapRDD.reduceByKey(_+_)
wordCountMapRDD.saveAsTextFile(write)
}
}
I have written a mapper to map the given paths, and I am passing the read and write locations through the command line. My pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>ex-com.spark.scala</groupId>
<artifactId>ex-demo</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
<scope>compile</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.clapper</groupId>
<artifactId>grizzled-slf4j_2.11</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>dbutils-api_2.11</artifactId>
<version>0.0.3</version>
</dependency>
<!-- Test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
I have a project where I need to configure Spark and HBase in a local environment. I downloaded Spark 2.2.1, Hadoop 2.7 and HBase 1.1.8 and configured them on a standalone single-node Ubuntu 14.04 OS.
I am able to pull and push data between Spark and HDFS, but not with HBase.
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
[root@localhost conf]# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>dfs.namenode.servicerpc-bind-host</name>
<value>0.0.0.0</value>
</property>
</configuration>
spark-env.sh
[root@localhost conf]# cat spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=1
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/app/spark/tmp
# Options read in YARN client mode
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_EXECUTOR_INSTANCES=1
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=1G
export SPARK_DRIVER_MEMORY=1G
export SPARK_YARN_APP_NAME=Spark
export SPARK_CLASSPATH=/opt/hbase/lib/*
hbase-site.xml:
[root@localhost conf]# cat hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>hdfs://localhost:9000/zookeeper</value>
</property>
<property>
<name>hbase.master.dns.interface</name>
<value>default</value>
</property>
<property>
<name>hbase.master.ipc.address</name>
<value>localhost</value>
</property>
<property>
<name>hbase.regionserver.dns.interface</name>
<value>default</value>
</property>
<property>
<name>hbase.regionserver.ipc.address</name>
<value>HOSTNAME</value>
</property>
<property>
<name>hbase.zookeeper.dns.interface</name>
<value>default</value>
</property>
</configuration>
spark-defaults.conf:
[root@localhost conf]# cat spark-defaults.conf
spark.master spark://127.0.0.1:7077
spark.yarn.dist.files /opt/spark/conf/hbase-site.xml
Errors:
Even though the HBase libraries (JARs) are exported via SPARK_CLASSPATH in spark-env.sh, I am unable to import the HBase classes (e.g. HBaseConfiguration):
scala> import org.apache.hadoop.hbase.HBaseConfiguration
<console>:23: error: object hbase is not a member of package org.apache.hadoop
import org.apache.hadoop.hbase.HBaseConfiguration
^
If I load these JARs through --driver-class-path:
spark-shell --master local --driver-class-path=/opt/hbase/lib/*
scala> conf.set("hbase.zookeeper.quorum","localhost")
scala> conf.set("hbase.zookeeper.property.clientPort", "2181")
scala> val connection: Connection = ConnectionFactory.createConnection(conf)
connection: org.apache.hadoop.hbase.client.Connection = hconnection-0x2a4cb8ae
scala> val tableName = connection.getTable(TableName.valueOf("employee"))
tableName: org.apache.hadoop.hbase.client.Table = employee;hconnection-0x2a4cb8ae
scala> val insertData = new Put(Bytes.toBytes("1"))
insertData: org.apache.hadoop.hbase.client.Put = {"totalColumns":0,"row":"1","families":{}}
scala>
| insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("Name"), Bytes.toBytes("Jeevan"))
res3: org.apache.hadoop.hbase.client.Put = {"totalColumns":1,"row":"1","families":{"emp personal data ":[{"qualifier":"Name","v
n":6,"tag":[],"timestamp":9223372036854775807}]}}
scala> insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("City"), Bytes.toBytes("San Jose"))
res4: org.apache.hadoop.hbase.client.Put = {"totalColumns":2,"row":"1","families":{"emp personal data ":[{"qualifier":"Name","v
n":6,"tag":[],"timestamp":9223372036854775807},{"qualifier":"City","vlen":8,"tag":[],"timestamp":9223372036854775807}]}}
scala> insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("Company"), Bytes.toBytes("Cisco"))
res5: org.apache.hadoop.hbase.client.Put = {"totalColumns":3,"row":"1","families":{"emp personal data ":[{"qualifier":"Name","v
n":6,"tag":[],"timestamp":9223372036854775807},{"qualifier":"City","vlen":8,"tag":[],"timestamp":9223372036854775807},{"qualifi
":"Company","vlen":5,"tag":[],"timestamp":9223372036854775807}]}}
scala> insertData.addColumn(Bytes.toBytes("emp personal data "), Bytes.toBytes("location"), Bytes.toBytes("San Jose"))
res6: org.apache.hadoop.hbase.client.Put = {"totalColumns":4,"row":"1","families":{"emp personal data ":[{"qualifier":"Name","v
n":6,"tag":[],"timestamp":9223372036854775807},{"qualifier":"City","vlen":8,"tag":[],"timestamp":9223372036854775807},{"qualifi
":"Company","vlen":5,"tag":[],"timestamp":9223372036854775807},{"qualifier":"location","vlen":8,"tag":[],"timestamp":9223372036
4775807}]}}
But I don't see any new column in HBase.
Can anyone help, please? Any reference to a working configuration would be great. Do I need to configure ZooKeeper differently? I appreciate your help.
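For completeness, nothing in the session above submits the Put to the table; if that is the missing step, in the HBase client API it would look roughly like this (a sketch reusing the objects from the session above):
// A Put is only staged in memory until it is submitted with Table#put.
tableName.put(insertData);
// Release the table and connection when done.
tableName.close();
connection.close();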
I have some Java code that performs introspection on the schema of Cassandra tables. After upgrading the Cassandra driver dependency, this code no longer works as expected. With the old driver version, the type for a timestamp column was returned from ColumnMetadata#getType() as DataType.Name#TIMESTAMP. With the new driver, the same call returns DataType.Name#CUSTOM, with CustomType#getCustomTypeClassName returning org.apache.cassandra.db.marshal.DateType.
The old driver version is com.datastax.cassandra:cassandra-driver-core:2.1.9:
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>2.1.9</version>
</dependency>
The new driver version is com.datastax.cassandra:dse-driver:1.1.2:
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>dse-driver</artifactId>
<version>1.1.2</version>
</dependency>
The cluster version is DataStax Enterprise 2.1.11.969:
cqlsh> SELECT release_version FROM system.local;
release_version
-----------------
2.1.11.969
To illustrate the problem, I created a simple console application that prints column metadata for a specified table. (See below.) When built with the old driver, the output looks like this:
# old driver
mvn -Pcassandra-driver clean package
java -jar target/cassandra-print-column-metadata-cassandra-driver.jar <address> <user> <password> <keyspace> <table>
...
ts timestamp
...
When built with the new driver, the output looks like this:
# new driver
mvn -Pdse-driver clean package
java -jar target/cassandra-print-column-metadata-dse-driver.jar <address> <user> <password> <keyspace> <table>
...
ts 'org.apache.cassandra.db.marshal.DateType'
...
So far, I have only encountered this problem with timestamp columns. I have not seen it for any other data types, though my schema does not exhaustively use all of the supported data types.
DESCRIBE TABLE shows that the column is timestamp. system.schema_columns shows that the validator is org.apache.cassandra.db.marshal.DateType.
[cqlsh 3.1.7 | Cassandra 2.1.11.969 | CQL spec 3.0.0 | Thrift protocol 19.39.0]
cqlsh:my_keyspace> DESCRIBE TABLE my_table;
CREATE TABLE my_table (
prim_addr text,
ch text,
received_on timestamp,
...
PRIMARY KEY (prim_addr, ch, received_on)
) WITH
bloom_filter_fp_chance=0.100000 AND
caching='{"keys":"ALL", "rows_per_partition":"NONE"}' AND
comment='emm_ks' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
compaction={'sstable_size_in_mb': '160', 'class': 'LeveledCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
cqlsh:system> SELECT * FROM system.schema_columns WHERE keyspace_name = 'my_keyspace' AND columnfamily_name = 'my_table' AND column_name IN ('prim_addr', 'ch', 'received_on');
keyspace_name | columnfamily_name | column_name | component_index | index_name | index_options | index_type | type | validator
---------------+-------------------+-------------+-----------------+------------+---------------+------------+----------------+------------------------------------------
my_keyspace | my_table | ch | 0 | null | null | null | clustering_key | org.apache.cassandra.db.marshal.UTF8Type
my_keyspace | my_table | prim_addr | null | null | null | null | partition_key | org.apache.cassandra.db.marshal.UTF8Type
my_keyspace | my_table | received_on | 1 | null | null | null | clustering_key | org.apache.cassandra.db.marshal.DateType
Is this a bug in the driver, an intentional change in behavior, or some kind of misconfiguration on my part?
pom.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cnauroth</groupId>
<artifactId>cassandra-print-column-metadata</artifactId>
<version>0.0.1-SNAPSHOT</version>
<description>Console application that prints Cassandra table column metadata</description>
<name>cassandra-print-column-metadata</name>
<packaging>jar</packaging>
<properties>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<slf4j.version>1.7.25</slf4j.version>
</properties>
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<addDefaultImplementationEntries>true</addDefaultImplementationEntries>
<mainClass>cnauroth.Main</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<finalName>${project.artifactId}</finalName>
<appendAssemblyId>false</appendAssemblyId>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
<profiles>
<profile>
<id>dse-driver</id>
<activation>
<activeByDefault>true</activeByDefault>
</activation>
<dependencies>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>dse-driver</artifactId>
<version>1.1.2</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<finalName>${project.artifactId}-dse-driver</finalName>
</configuration>
</plugin>
</plugins>
</build>
</profile>
<profile>
<id>cassandra-driver</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<dependencies>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>2.1.9</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<finalName>${project.artifactId}-cassandra-driver</finalName>
</configuration>
</plugin>
</plugins>
</build>
</profile>
</profiles>
<dependencies>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
</dependencies>
</project>
Main.java
package cnauroth;
import java.util.List;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ColumnMetadata;
import com.datastax.driver.core.Session;
class Main {
public static void main(String[] args) throws Exception {
// Skipping validation for brevity
String address = args[0];
String user = args[1];
String password = args[2];
String keyspace = args[3];
String table = args[4];
try (Cluster cluster = new Cluster.Builder()
.addContactPoints(address)
.withCredentials(user, password)
.build()) {
List<ColumnMetadata> columns =
cluster.getMetadata().getKeyspace(keyspace).getTable(table).getColumns();
for (ColumnMetadata column : columns) {
System.out.println(column);
}
}
}
}
It looks like the internal Cassandra type used for timestamp changed from org.apache.cassandra.db.marshal.DateType to org.apache.cassandra.db.marshal.TimestampType between Cassandra 1.2 and 2.0 (CASSANDRA-5723). If you created the table with Cassandra 1.2 (or a DSE version based on it), DateType would be used (even if you upgraded your cluster later).
It appears that the 2.1 version of the Java driver was able to account for this (source), but starting with 3.0 it does not (source); instead, it parses the column as a custom type.
Fortunately, the driver is still able to serialize and deserialize this column, since the CQL timestamp type is communicated over the protocol in responses, but it is a bug that the driver reports the wrong type. I went ahead and created JAVA-1561 to track this.
If you were to migrate your cluster to C* 3.0+ or DSE 5.0+, I suspect the problem would go away, as the schema tables there reference the CQL name instead of the representative Java class name (unless it is indeed a custom type).
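Until the driver fix lands, one client-side workaround is to special-case the legacy marshal class when inspecting column metadata. A sketch against the 3.x driver API (the normalizeType helper is hypothetical):
import com.datastax.driver.core.DataType;

// Map the legacy DateType marshal class back to the CQL timestamp type;
// any other type is returned unchanged.
static DataType normalizeType(DataType type) {
    if (type.getName() == DataType.Name.CUSTOM
        && type instanceof DataType.CustomType
        && "org.apache.cassandra.db.marshal.DateType".equals(
            ((DataType.CustomType) type).getCustomTypeClassName())) {
        return DataType.timestamp();
    }
    return type;
}
In the sample Main above, printing column.getName() together with normalizeType(column.getType()) would then report the ts column as timestamp again.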