The Spark version is now 2.3.
I have looked at the Maven Central repository: https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22
The artifact shown there is spark-streaming-kafka-0-10_2.11.
So is Kafka 1.1.0 not supported yet?
Should I still install Kafka 0.10.x?
Based on the following link, you should use spark-streaming-kafka-0-10 for Kafka 0.10.0 or higher.
I have tested Spark 2.3 with Kafka 1.1.0 using this dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
It runs well.
Example code:
// Streaming context with a 2-second batch interval
SparkConf conf = new SparkConf().setAppName("stream test").setMaster("local[*]");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(2));

// Kafka consumer configuration
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "master:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
kafkaParams.put("enable.auto.commit", false);

List<String> topics = Arrays.asList("A29");

// Direct stream using the Kafka 0.10+ consumer API
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
);

JavaDStream<String> lines = stream.map(ConsumerRecord::value);
lines.print(30);

streamingContext.start();
streamingContext.awaitTermination();
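For completeness, the snippet above assumes imports along these lines:

import java.util.*;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;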
I am currently trying to use Spark Streaming to read input from a Kafka topic and save that input to a JSON file. I have gotten as far as saving my InputDStream as a textFile, but the problem is that after each batch the file gets overwritten, and it seems I cannot do anything about it.
Is there a method or config option to change this?
I tried setting spark.files.overwrite to false, but it did not work.
My code is:
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("local-test").setMaster("local[*]")
            .set("spark.shuffle.service.enabled", "false")
            .set("spark.dynamicAllocation.enabled", "false")
            .set("spark.io.compression.codec", "snappy")
            .set("spark.rdd.compress", "true")
            .set("spark.executor.instances", "4")
            .set("spark.executor.memory", "6G")
            .set("spark.executor.cores", "6")
            .set("spark.cores.max", "8")
            .set("spark.driver.memory", "2g")
            .set("spark.files.overwrite", "false");

    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(4));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "xxxxx");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "ID2");

    List<String> topics = Arrays.asList("LEGO_MAX");

    JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

    JavaDStream<String> first = stream.map(record -> (record.value()).toString());
    first.foreachRDD(rdd -> rdd.saveAsTextFile("C:\\Users\\A675866\\Hallo.txt"));

    ssc.start();
    try {
        ssc.awaitTermination();
    } catch (InterruptedException e) {
        System.out.println("Failed to cut connection -> Throwing Error");
        e.printStackTrace();
    }
}
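One possible workaround, sketched below rather than taken from the original post: instead of writing every batch to the same fixed path, include the batch time in the output path so each micro-batch lands in its own directory. The two-argument foreachRDD overload exposes the batch Time; the path pattern here is only an illustration.

// Sketch: write each micro-batch to its own directory, e.g. ...\Hallo-<batch time in ms>
first.foreachRDD((rdd, time) -> {
    if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("C:\\Users\\A675866\\Hallo-" + time.milliseconds());
    }
});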
Using the Java API, I wrote a Spark Streaming application that processes and prints results correctly; now I want to write the results to HDFS. The versions are the following:
Hadoop 2.7.3
Spark 2.2.0
Java 1.8
The following is the code:
import java.util.*;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class Spark {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("Spark Streaming").setMaster("local[*]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "use");
        kafkaParams.put("auto.offset.reset", "earliest");
        kafkaParams.put("enable.auto.commit", false);

        Collection<String> topics = Arrays.asList("testStr");

        JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
                KafkaUtils.createDirectStream(
                        ssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams)
                );

        stream.map(record -> finall(record.value()))
              .map(record -> Arrays.deepToString(record))
              .dstream()
              .saveAsTextFiles("spark", "txt");

        ssc.start();
        ssc.awaitTermination();
    }

    public static String[][] finall(byte[] record) {
        String[][] result = new String[4][];
        result[0] = javaTest.bytePrintable(record);
        result[1] = javaTest.hexTodecimal(record);
        result[2] = javaTest.hexToOctal(record);
        result[3] = javaTest.hexTobin(record);
        return result;
    }
}
But it does not write anything, neither to HDFS nor to the local filesystem; instead it errors:
ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()Lorg/apache/hadoop/fs/FileSystem$Statistics$StatisticsData;
What's the issue? Do I need to import some Hadoop libraries?
UPDATE
Instead of the Maven dependencies, I used local Spark jars and it works, so something in the dependencies is wrong. These are the POM.xml dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
Which one is incompatible? Or maybe something is missing!
I am trying to connect to Kafka from Apache Spark Streaming using the Java API, not Scala.
I am getting the following error:
Caused by: java.lang.NoSuchMethodException: org.apache.spark.util.SystemClock.<init>()
    at java.lang.Class.getConstructor0(Class.java:3082)
    at java.lang.Class.newInstance(Class.java:412)
The error pops up when creating the JavaStreamingContext object.
This is my code:
JavaStreamingContext jssc = new JavaStreamingContext("local[2]", "KafkaWordCount",
        new Duration(2000), System.getenv("SPARK_HOME"),
        JavaStreamingContext.jarOfClass(JavaKafkaWordCount.class));

Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put("fast-messages", 2);

JavaPairDStream<String, String> messages = KafkaUtils.createStream(jssc, "localhost:9092", "test", topicMap);

JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
    // @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});
Jar versions:
spark-streaming-kafka_2.10 - 1.6.2
spark-core_2.11 - 1.2.1
spark-streaming_2.10 - 1.6.0
Please let me know if you need more information.
TL;DR Use spark-core_2.10
That's because of a dependency mismatch in the Scala versions in use.
See spark-streaming-kafka_2.10, spark-core_2.11, and spark-streaming_2.10: the streaming jars are compiled for Scala 2.10 while spark-core is compiled for Scala 2.11.
This does not force you to use Scala over Java. Spark is mostly written in Scala, and dependencies built with Scala have to match their Scala version.
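For example, one consistent set would be the following (an illustration only; aligning the Spark release across all three artifacts is generally safer as well):
spark-streaming-kafka_2.10 - 1.6.2
spark-core_2.10 - 1.6.2
spark-streaming_2.10 - 1.6.2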
Below is the code that I am using to connect to a local Oracle database from an Apache Spark Java program, but I am getting this error:
java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name
Here is my code:
public static void main(String[] args)
{
    String MYSQL_DRIVER = "oracle.jdbc.driver.OracleDriver";
    String MYSQL_CONNECTION_URL = "jdbc:oracle:thin:@//192.168.2.85:1521/stage";

    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(sc);

    Map<String, String> options = new HashMap<String, String>();
    options.put("driver", MYSQL_DRIVER);
    options.put("url", MYSQL_CONNECTION_URL);
    options.put("dbtable", "select LOC_ITEM_ID from loc_item_master");
    options.put("partitionColumn", "LOC_ITEM_ID");
    options.put("user", "ad_tables");
    options.put("password", "ad_tables");
    options.put("lowerBound", "4467");
    options.put("upperBound", "4749");
    options.put("numPartitions", "10");

    try {
        Dataset<Row> jdbcDF = sqlContext.load("jdbc", options);
        List<Row> employeeFullNameRows = jdbcDF.collectAsList();
        for (Row employeeFullNameRow : employeeFullNameRows) {
            System.out.println(employeeFullNameRow);
        }
        sc.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Below is the POM file
<dependencies>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.11</artifactId>
        <version>1.5.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.univocity/univocity-parsers -->
    <dependency>
        <groupId>com.univocity</groupId>
        <artifactId>univocity-parsers</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency> <!-- Spark dependency -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>com.oracle</groupId>
        <artifactId>ojdbc6</artifactId>
        <version>11.2.0.3</version>
    </dependency>
</dependencies>
You can try the approach below. It will work.
public class OracleDatabaseConnection {

    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\AD_classfication\\Apachespark\\winutil");

        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);
        SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();

        Properties connectionProperties = new Properties();
        connectionProperties.put("user", "ad_tables");
        connectionProperties.put("password", "ad_tables");

        String query = "sparkour.people";
        query = "(select LOC_ITEM_ID from loc_item_master)";

        Dataset<Row> jdbcDF2 = spark.read().jdbc("jdbc:oracle:thin:@//192.168.2.85:1522/stage", query, connectionProperties);
        jdbcDF2.show();
    }
}
Your dbtable is incorrect; try:
options.put("dbtable", "(select LOC_ITEM_ID from loc_item_master) as loc_item_id");
Everything in dbtable should be a valid FROM fragment.
You cannot write select ... from select LOC_ITEM_ID from loc_item_master, but you can write select ... from (select LOC_ITEM_ID from loc_item_master) as loc_item_id.
Of course, the table's schema must also be correct; you can specify it using the schema.table style in the query.
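For example, building on the options map from the question (the schema name here is an assumption based on the ad_tables user from the question; adjust it to wherever loc_item_master actually lives):

// Schema-qualified table name as the dbtable value
options.put("dbtable", "ad_tables.loc_item_master");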
While writing data to Cassandra from Spark, the data is not getting written.
Some background: I am doing a Kafka - Spark Streaming - Cassandra integration.
I am reading Kafka messages and trying to put them into a Cassandra table CREATE TABLE TEST_TABLE(key INT PRIMARY KEY, value TEXT).
Kafka to Spark Streaming runs fine, but from Spark to Cassandra there is some issue: the data is not getting written to the table.
I am able to create a connection with Cassandra, but the data is not getting inserted into the Cassandra table. The output shows it connecting and then disconnecting a second later.
All of the System.out.println() strings appear in the output:
+++++++++++cassandra connector created++++++++++++++++++++++++++++
+++++++++++++streaming Connection done!+++++++++++++++++++++++++++
++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++
The Cassandra shell shows 0 rows.
The full code, the logs, and the dependencies are below:
public class SparkStream {

    static int key = 0;

    public static void main(String args[]) throws Exception {
        if (args.length != 3) {
            System.out.println("parameters not given properly");
            System.exit(1);
        }

        Logger.getLogger("org").setLevel(Level.OFF);
        Logger.getLogger("akka").setLevel(Level.OFF);

        Map<String, Integer> topicMap = new HashMap<String, Integer>();
        String[] topic = args[2].split(",");
        for (String t : topic) {
            topicMap.put(t, new Integer(3));
        }

        /* Connection to Spark */
        SparkConf conf = new SparkConf();
        conf.set("spark.cassandra.connection.host", "localhost");
        JavaSparkContext sc = new JavaSparkContext("local[4]", "SparkStream", conf);
        JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(5000));

        /* Connection to Cassandra */
        CassandraConnector connector = CassandraConnector.apply(sc.getConf());
        System.out.println("+++++++++++cassandra connector created++++++++++++++++++++++++++++");

        /* Receive Kafka streaming inputs */
        JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
        System.out.println("+++++++++++++streaming Connection done!+++++++++++++++++++++++++++");

        /* Create DStream */
        JavaDStream<TestTable> data = messages.map(new Function<Tuple2<String, String>, TestTable>() {
            public TestTable call(Tuple2<String, String> message) {
                return new TestTable(new Integer(++key), message._2());
            }
        });
        System.out.println("++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++");

        /* Write to Cassandra */
        javaFunctions(data).writerBuilder("testkeyspace", "test_table", mapToRow(TestTable.class)).saveToCassandra();

        jssc.start();
        jssc.awaitTermination();
    }
}
class TestTable implements Serializable {

    Integer key;
    String value;

    public TestTable() {}

    public TestTable(Integer k, String v) {
        key = k;
        value = v;
    }

    public Integer getKey() {
        return key;
    }

    public void setKey(Integer k) {
        key = k;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String v) {
        value = v;
    }

    public String toString() {
        return MessageFormat.format("TestTable'{'key={0}, value={1}'}'", key, value);
    }
}
The log is:
+++++++++++cassandra connector created++++++++++++++++++++++++++++
+++++++++++++streaming Connection done!+++++++++++++++++++++++++++
++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++
14/12/09 12:07:33 INFO core.Cluster: New Cassandra host localhost/127.0.0.1:9042 added
14/12/09 12:07:33 INFO cql.CassandraConnector: Connected to Cassandra cluster: Test Cluster
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:34 INFO cql.CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
14/12/09 12:07:45 INFO core.Cluster: New Cassandra host localhost/127.0.0.1:9042 added
14/12/09 12:07:45 INFO cql.CassandraConnector: Connected to Cassandra cluster: Test Cluster
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:46 INFO cql.CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
The POM.xml dependencies are:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.10</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.1</version>
</dependency>
<dependency>
    <groupId>com.msiops.footing</groupId>
    <artifactId>footing-tuple</artifactId>
    <version>0.2</version>
</dependency>
<dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-core</artifactId>
    <version>2.1.3</version>
</dependency>
Is there something wrong with the code, or with the Cassandra configuration?
I solved the issue.
The columnMapper wasn't able to access the getters and setters of the class TestTable, so I changed the access modifier to public.
But then I had two public classes in one file, which is an error, so I created a separate file, TestTable.java, with the class:
public class TestTable implements Serializable {
//code
}
Now the messages are being read from Kafka and stored in the Cassandra table.
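For reference, the separate TestTable.java would essentially be the same class as before, just declared public in its own file. A sketch reconstructed from the original class (field visibility tightened to private, which is optional):

import java.io.Serializable;
import java.text.MessageFormat;

public class TestTable implements Serializable {

    private Integer key;
    private String value;

    public TestTable() {}

    public TestTable(Integer k, String v) {
        key = k;
        value = v;
    }

    public Integer getKey() { return key; }
    public void setKey(Integer k) { key = k; }

    public String getValue() { return value; }
    public void setValue(String v) { value = v; }

    @Override
    public String toString() {
        return MessageFormat.format("TestTable'{'key={0}, value={1}'}'", key, value);
    }
}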