Data is not being written when I write from Spark to Cassandra.
The background is:
I am doing a Kafka / Spark Streaming / Cassandra integration.
I am reading Kafka messages and trying to put them in a Cassandra table created with CREATE TABLE TEST_TABLE(key INT PRIMARY KEY, value TEXT).
Kafka to Spark Streaming works fine, but Spark to Cassandra has an issue: the data is not written to the table.
I am able to create a connection to Cassandra, but no data is inserted into the table. The output shows the connector connecting and then disconnecting a second later.
All the System.out.println() strings appear in the output:
+++++++++++cassandra connector created++++++++++++++++++++++++++++
+++++++++++++streaming Connection done!+++++++++++++++++++++++++++
++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++
The Cassandra shell shows 0 rows.
The full code, logs and dependencies are below:
public class SparkStream {
static int key=0;
public static void main(String args[]) throws Exception
{
if(args.length != 3)
{
System.out.println("parameters not given properly");
System.exit(1);
}
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
Map<String,Integer> topicMap = new HashMap<String,Integer>();
String[] topic = args[2].split(",");
for(String t: topic)
{
topicMap.put(t, new Integer(3));
}
/* Connection to Spark */
SparkConf conf = new SparkConf();
conf.set("spark.cassandra.connection.host", "localhost");
JavaSparkContext sc = new JavaSparkContext("local[4]", "SparkStream",conf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(5000));
/* connection to cassandra */
CassandraConnector connector = CassandraConnector.apply(sc.getConf());
System.out.println("+++++++++++cassandra connector created++++++++++++++++++++++++++++");
/* Receive Kafka streaming inputs */
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap );
System.out.println("+++++++++++++streaming Connection done!+++++++++++++++++++++++++++");
/* Create DStream */
JavaDStream<TestTable> data = messages.map(new Function< Tuple2<String,String>, TestTable >()
{
public TestTable call(Tuple2<String, String> message)
{
return new TestTable(new Integer(++key), message._2() );
}
}
);
System.out.println("++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++");
/* Write to cassandra */
javaFunctions(data).writerBuilder("testkeyspace", "test_table", mapToRow(TestTable.class)).saveToCassandra();
jssc.start();
jssc.awaitTermination();
}
}
class TestTable implements Serializable
{
Integer key;
String value;
public TestTable() {}
public TestTable(Integer k, String v)
{
key=k;
value=v;
}
public Integer getKey(){
return key;
}
public void setKey(Integer k){
key=k;
}
public String getValue(){
return value;
}
public void setValue(String v){
value=v;
}
public String toString(){
return MessageFormat.format("TestTable'{'key={0}, value={1}'}'", key, value);
}
}
The log is:
+++++++++++cassandra connector created++++++++++++++++++++++++++++
+++++++++++++streaming Connection done!+++++++++++++++++++++++++++
++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++
14/12/09 12:07:33 INFO core.Cluster: New Cassandra host localhost/127.0.0.1:9042 added
14/12/09 12:07:33 INFO cql.CassandraConnector: Connected to Cassandra cluster: Test Cluster
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:34 INFO cql.CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
14/12/09 12:07:45 INFO core.Cluster: New Cassandra host localhost/127.0.0.1:9042 added
14/12/09 12:07:45 INFO cql.CassandraConnector: Connected to Cassandra cluster: Test Cluster
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:46 INFO cql.CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
The POM.xml dependencies are:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.1.1</version>
</dependency>
<dependency>
<groupId>com.msiops.footing</groupId>
<artifactId>footing-tuple</artifactId>
<version>0.2</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>2.1.3</version>
</dependency>
Is there something wrong with the code, or with the Cassandra configuration?
I solved the issue.
The column mapper was not able to access the getters and setters of the TestTable class, because the class was not public.
So I changed its access modifier to public.
But then I had two public classes in one file, which is a compile error.
So I created another Java file, TestTable.java, with the class declared as:
public class TestTable implements Serializable {
//code
}
Now the messages are read from Kafka and stored in the Cassandra table.
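For reference, here is a minimal sketch of what the separate TestTable.java could look like, reusing the fields and accessors from the question. It assumes the connector's row mapper reaches the bean through its public getters and setters, which is what the fix above suggests; the serialVersionUID and the simplified toString are additions for illustration.

```java
import java.io.Serializable;

// Lives in its own file, TestTable.java, so that it can be declared public.
public class TestTable implements Serializable {
    private static final long serialVersionUID = 1L;

    private Integer key;
    private String value;

    // No-argument constructor, kept from the original class.
    public TestTable() {}

    public TestTable(Integer key, String value) {
        this.key = key;
        this.value = value;
    }

    public Integer getKey() {
        return key;
    }

    public void setKey(Integer key) {
        this.key = key;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return "TestTable{key=" + key + ", value=" + value + "}";
    }
}
```

SparkStream.java then keeps only the SparkStream class and refers to TestTable as before.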
Using the Java API, I wrote a Spark Streaming application that processes and prints results correctly, and now I want to write the results to HDFS. The versions are as follows:
Hadoop 2.7.3
Spark 2.2.0
Java 1.8
The following is the code:
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
public class Spark {
public static void main(String[] args) throws InterruptedException {
SparkConf conf = new SparkConf().setAppName("Spark Streaming").setMaster("local[*]");
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
kafkaParams.put("group.id", "use");
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("testStr");
JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
KafkaUtils.createDirectStream(
ssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams)
);
stream.map(record -> finall(record.value())).map(record -> Arrays.deepToString(record)).dstream().saveAsTextFiles(
"spark", "txt"
);
ssc.start();
ssc.awaitTermination();
}
public static String[][] finall(byte[] record){
String[][] result = new String[4][];
result[0] = javaTest.bytePrintable(record);
result[1] = javaTest.hexTodecimal(record);
result[2] = javaTest.hexToOctal(record);
result[3] = javaTest.hexTobin(record);
return result;
}
}
But it fails on both HDFS and the local filesystem with this error:
ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()Lorg/apache/hadoop/fs/FileSystem$Statistics$StatisticsData;
What's the issue? Do I need to import some libraries from Hadoop?
UPDATE
When I use the local Spark jars instead of the Maven dependencies, it works. So something in the dependencies is wrong. The following is from the POM.xml file:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.2.0</version>
</dependency>
Which one is incompatible? Or is something missing?
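One way to narrow down a NoSuchMethodError like the one above is to check which jar a class is actually loaded from at runtime, since the error usually means the class on the classpath comes from a different library version than the one the caller was compiled against. The helper below is only a diagnostic sketch; the class name WhichJar and its locate method are made up for illustration. Running it with org.apache.hadoop.fs.FileSystem inside the failing application should show which Hadoop artifact the Maven build pulls in.

```java
import java.security.CodeSource;

public class WhichJar {

    // Returns the jar or directory a class was loaded from, or a marker string
    // when the class is missing or comes from the JDK itself.
    public static String locate(String className) {
        try {
            // initialize=false: only load the class, do not run its static initializers.
            Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(bootstrap or unknown)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace above.
        String name = args.length > 0 ? args[0] : "org.apache.hadoop.fs.FileSystem";
        System.out.println(name + " -> " + locate(name));
    }
}
```

Comparing the reported jar under the Maven build with the one under the local Spark jars should point at the mismatched artifact; mvn dependency:tree lists where each transitive dependency comes from.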
I created the following test class using Cassandra.
private static String server_ip = "127.0.0.1";
private static String keyspace = "hr";
private static Cluster cluster = null;
private static Session session = null;
public static void main (String [] args)
{
if(cluster!=null) return;
cluster = Cluster.builder().addContactPoints(server_ip).withPort(9042).build();
final Metadata metadata = cluster.getMetadata();
String msg = String.format("Connected to cluster: %s", metadata.getClusterName());
System.out.println(msg);
System.out.println("List of hosts");
for (final Host host : metadata.getAllHosts())
{
msg = String.format("Datacenter: %s; Host: %s; Rack: %s",
host.getDatacenter(),
host.getAddress(),
host.getRack());
System.out.println(msg);
}
session = cluster.connect(keyspace);
}
(Cluster.java:407)
at samples.SampleB.main(SampleB.java:28)
After googling, I tried all the proposed solutions without success. Could you please help me solve this issue? Thanks a lot.
You should verify these dependencies:
<!-- Apache Cassandra Datastax's CQL driver. -->
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-mapping</artifactId>
<version>3.1.0</version>
</dependency>
HTH
Below is the code I am using to connect to a local Oracle database from an Apache Spark Java program, but I am getting this error:
java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name
Here is my code:
public static void main( String[] args )
{
String MYSQL_DRIVER = "oracle.jdbc.driver.OracleDriver";
String MYSQL_CONNECTION_URL = "jdbc:oracle:thin:@//192.168.2.85:1521/stage";
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
Map<String, String> options = new HashMap<String, String>();
options.put("driver", MYSQL_DRIVER);
options.put("url", MYSQL_CONNECTION_URL);
options.put("dbtable", "select LOC_ITEM_ID from loc_item_master");
options.put("partitionColumn", "LOC_ITEM_ID");
options.put("user", "ad_tables");
options.put("password", "ad_tables");
options.put("lowerBound", "4467");
options.put("upperBound", "4749");
options.put("numPartitions", "10");
try {
Dataset<Row> jdbcDF = sqlContext.load("jdbc", options);
List<Row> employeeFullNameRows = jdbcDF.collectAsList();
for (Row employeeFullNameRow : employeeFullNameRows) {
System.out.println(employeeFullNameRow);
}
sc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
Below is the POM file
<dependencies>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.5.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.univocity/univocity-parsers -->
<dependency>
<groupId>com.univocity</groupId>
<artifactId>univocity-parsers</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>com.oracle</groupId>
<artifactId>ojdbc6</artifactId>
<version>11.2.0.3</version>
</dependency>
</dependencies>
You can try the approach below, which passes the query to the JDBC reader as a parenthesized subquery.
public class OracleDatabaseConnection {
public static void main(String[] args) {
System.setProperty("hadoop.home.dir", "C:\\AD_classfication\\Apachespark\\winutil");
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();
Properties connectionProperties = new Properties();
connectionProperties.put("user", "ad_tables");
connectionProperties.put("password", "ad_tables");
String query = "sparkour.people";
query = "(select LOC_ITEM_ID from loc_item_master)";
Dataset<Row> jdbcDF2 = spark.read().jdbc("jdbc:oracle:thin:@//192.168.2.85:1522/stage", query,connectionProperties);
jdbcDF2.show();
}
}
Your dbtable is incorrect. Try:
options.put("dbtable", "(select LOC_ITEM_ID from loc_item_master) loc_item_id");
Everything in dbtable must be a valid FROM fragment.
You cannot write select ... from select LOC_ITEM_ID from loc_item_master, but you can write select ... from (select LOC_ITEM_ID from loc_item_master) loc_item_id. Note that Oracle does not accept the AS keyword before a table alias, so the alias follows the closing parenthesis directly.
Of course the table's schema must also be correct; you can specify it using schema.table style in the query.
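The rule above can be sketched as a small helper that wraps a query so it can stand where a table name would. This assumes Spark builds its own statement around the dbtable value, roughly SELECT * FROM followed by that value; the class and method names are made up for illustration. The alias is written without AS because Oracle rejects AS before a table alias.

```java
import java.util.HashMap;
import java.util.Map;

public class DbtableOption {

    // Wraps a SELECT statement in parentheses and appends an alias,
    // so the result is usable wherever a table name is expected.
    public static String asFromFragment(String query, String alias) {
        String trimmed = query.trim();
        // A trailing semicolon would be invalid inside a subquery, so drop it.
        if (trimmed.endsWith(";")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        return "(" + trimmed + ") " + alias;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("dbtable",
                asFromFragment("select LOC_ITEM_ID from loc_item_master", "loc_item_id"));

        // Roughly the shape of the statement that ends up being sent to the database.
        System.out.println("SELECT * FROM " + options.get("dbtable"));
    }
}
```

With the bare query as dbtable, the same composition yields SELECT * FROM select LOC_ITEM_ID from loc_item_master, which is consistent with the ORA-00903 error in the question.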
I have two clusters running Kafka and Spark separately. I want to create a Kafka topic from the Spark cluster. As far as I can tell, creating a topic requires invoking kafka-topics.sh, which is not available on the Spark cluster, and the command has to be invoked through a shell.
eg: /kafka_topics.sh --zookeeper :2181 --create --topic test_topic
The script should be called from the Spark cluster and executed on the Kafka cluster.
Can anyone help me?
You can use the Java API, with the Kafka and ZooKeeper Maven dependencies, to create a Kafka topic as shown below. You can invoke this code from wherever you submit the Spark application.
<dependency>
<groupId>com.101tec</groupId>
<artifactId>zkclient</artifactId>
<version>0.3</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.2.1</version>
</dependency>
import java.util.Properties;
import org.I0Itec.zkclient.ZkClient;
import org.I0Itec.zkclient.exception.ZkInterruptedException;
import kafka.utils.ZKStringSerializer$;
import kafka.admin.AdminUtils;
public final class KafkaUtils {
public static void main(String[] args) throws Exception {
// 1 partition, replication factor 1 (the replication factor must be at least 1)
KafkaUtils.createTopic("x.x.x.x:2181,y.y.y.y:2181", "topicName", 1, 1, new Properties());
}
public static void createTopic(String zkHosts, String topicName, int numberOfPartition, int replicationFactor, Properties properties) {
ZkClient zkClient = null;
try {
zkClient = getZkClient(zkHosts);
AdminUtils.createTopic(zkClient, topicName, numberOfPartition, replicationFactor, properties);
} catch (Exception exception) {
exception.printStackTrace();
} finally {
if (zkClient != null) {
try {
zkClient.close();
} catch (ZkInterruptedException ex) {
ex.printStackTrace();
}
}
}
}
private static ZkClient getZkClient(String zkHosts) {
ZkClient zkClient = null;
// Zookeeper sessionTimeoutMs
final int sessionTimeoutMs = 10000;
// Zookeeper connectionTimeoutMs
final int connectionTimeoutMs = 10000;
zkClient = new ZkClient(zkHosts, sessionTimeoutMs, connectionTimeoutMs, ZKStringSerializer$.MODULE$);
return zkClient;
}
}
Here x.x.x.x and y.y.y.y are the ZooKeeper hosts of the Kafka cluster. Hope this helps.
I am running into issues connecting to a single-node Cassandra cluster using spring-data-cassandra. I am using the Docker image found at: https://hub.docker.com/_/cassandra/
with a docker-compose file that sets the following environment variables:
cassandra_n1:
  image: cassandra:latest
  ports:
    - "9042:9042"
    - "9160:9160"
  hostname: cassandra_n1
  environment:
    CASSANDRA_CLUSTER_NAME: "mycluster"
    CASSANDRA_ENDPOINT_SNITCH: "PropertyFileSnitch"
    CASSANDRA_DC: "DC1"
    CASSANDRA_RACK: "R1"
After this starts, I try to connect to it from my Spring Boot application, which is as simple as:
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
@SpringBootApplication
public class MvcApplication {
public static void main(String[] args) {
SpringApplication.run(MvcApplication.class);
}
}
which scans for a configuration class:
@Configuration
@PropertySource(value = { "classpath:cassandra.properties" })
@EnableCassandraRepositories(basePackages = { "myproject.repository" })
public class CassandraConfig {
private static final Logger LOG = LoggerFactory.getLogger(CassandraConfig.class);
@Autowired
private Environment env;
@Bean
public CassandraClusterFactoryBean cluster() {
CassandraClusterFactoryBean cluster = new CassandraClusterFactoryBean();
cluster.setContactPoints(env.getProperty("cassandra.contactpoints"));
cluster.setPort(Integer.parseInt(env.getProperty("cassandra.port")));
return cluster;
}
@Bean
public CassandraMappingContext mappingContext() {
return new BasicCassandraMappingContext();
}
@Bean
public CassandraConverter converter() {
return new MappingCassandraConverter(mappingContext());
}
@Bean
public CassandraSessionFactoryBean session() throws Exception {
CassandraSessionFactoryBean session = new CassandraSessionFactoryBean();
session.setCluster(cluster().getObject());
session.setKeyspaceName(env.getProperty("cassandra.keyspace"));
session.setConverter(converter());
session.setSchemaAction(SchemaAction.NONE);
return session;
}
@Bean
public CassandraOperations cassandraTemplate() throws Exception {
return new CassandraTemplate(session().getObject());
}
}
It reads the property file cassandra.properties, which contains:
cassandra.contactpoints=192.168.99.100
cassandra.port=9042
cassandra.keyspace=mykeyspace
I am using docker-machine as the Docker daemon, which has the address 192.168.99.100.
In my POM I'm using these dependencies:
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-cassandra</artifactId>
<version>1.0.0.RELEASE</version>
<exclusions>
<exclusion>
<groupId>org.springframework</groupId>
<artifactId>spring-expression</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-expression</artifactId>
<version>4.1.7.RELEASE</version>
</dependency>
After I build and run my application, it fails to connect to Cassandra with this message:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed (tried: /192.168.99.100:9042 (com.datastax.driver.core.ConnectionException:
[/192.168.99.100:9042] Unexpected error during transport initialization (com.datastax.driver.core.TransportException:
[/192.168.99.100:9042] Unexpected exception triggered (java.lang.IndexOutOfBoundsException:
Not enough readable bytes - Need 4, maximum is 0))))
I have tried setting listen_address, broadcast_address and rpc_address to the Docker daemon IP, but without success.
Any help would be appreciated.