Issue writing Spark Streaming results to HDFS or the local filesystem - apache-spark

Using the Java API, I wrote a Spark Streaming application that processes and prints results correctly, and now I want to write the results to HDFS. The versions are:
Hadoop 2.7.3
Spark 2.2.0
Java 1.8
The following is the code:
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class Spark {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("Spark Streaming").setMaster("local[*]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "use");
        kafkaParams.put("auto.offset.reset", "earliest");
        kafkaParams.put("enable.auto.commit", false);

        Collection<String> topics = Arrays.asList("testStr");

        JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
            KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams)
            );

        // Convert each record, then write one output directory of text files per batch
        // (saveAsTextFiles names each directory "spark-<batch time in ms>.txt").
        stream.map(record -> finall(record.value()))
              .map(record -> Arrays.deepToString(record))
              .dstream()
              .saveAsTextFiles("spark", "txt");

        ssc.start();
        ssc.awaitTermination();
    }

    // Builds four string representations of the raw Kafka value.
    public static String[][] finall(byte[] record) {
        String[][] result = new String[4][];
        result[0] = javaTest.bytePrintable(record);
        result[1] = javaTest.hexTodecimal(record);
        result[2] = javaTest.hexToOctal(record);
        result[3] = javaTest.hexTobin(record);
        return result;
    }
}
But it fails to write to either HDFS or the local filesystem, with this error:
ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()Lorg/apache/hadoop/fs/FileSystem$Statistics$StatisticsData;
What's the issue? Do I need to import some Hadoop libraries?
UPDATE
Instead of the Maven dependencies, I used the local Spark jars and it works, so something in the dependencies is wrong. This is the relevant part of the POM.xml file:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.2.0</version>
</dependency>
Which one is incompatible, or is something missing?
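One thing worth checking (an assumption on my part, not a confirmed fix): a NoSuchMethodError on a Hadoop filesystem class usually means two different Hadoop versions ended up on the classpath, since the Spark artifacts pull in their own, older hadoop-client transitively. A minimal sketch of pinning the Hadoop client to the cluster version in the POM:
<!-- sketch: align the Hadoop client with the Hadoop 2.7.3 cluster (assumption, not a confirmed fix) -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>
Running mvn dependency:tree -Dincludes=org.apache.hadoop shows which Hadoop artifacts each Spark module actually brings in.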

Related

org.graalvm.polyglot.PolyglotException when running through the master runner (Cucumber report generation); runs fine individually through Karate

org.graalvm.polyglot.PolyglotException: java.lang.AssertionError
com.oracle.truffle.polyglot.PolyglotList$Cache.lookup(PolyglotList.java:180)
com.oracle.truffle.polyglot.PolyglotList.<init>(PolyglotList.java:73)
com.oracle.truffle.polyglot.PolyglotList.create(PolyglotList.java:96)
com.oracle.truffle.polyglot.PolyglotHostAccess.toList(PolyglotHostAccess.java:107)
com.oracle.truffle.host.HostToTypeNode.asJavaObject(HostToTypeNode.java:384)
com.oracle.truffle.host.HostToTypeNode.convertImpl(HostToTypeNode.java:194)
com.oracle.truffle.host.HostToTypeNode.doCached(HostToTypeNode.java:118)
It throws the above error when run using the master runner below, and fails to run the Java method.
package tests;
import com.intuit.karate.Results;
import com.intuit.karate.Runner;
import net.masterthought.cucumber.Configuration;
import net.masterthought.cucumber.ReportBuilder;
import org.apache.commons.io.FileUtils;
import org.junit.jupiter.api.Test;
import reporting.ExtentReport;
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import static org.junit.jupiter.api.Assertions.assertEquals;
class MasterTestRunner {
// this will run all *.feature files
/*@Karate.Test
Karate testAll() {
return Karate.run().relativeTo(getClass());
}*/
//Karate Parallel Runner
@Test
public void testParallel() {
final Results results = Runner.path("classpath:tests/features").hook(new ExtentReport()).tags("@test")
.outputCucumberJson(true).outputJunitXml(true).parallel(1);
assertEquals(0, results.getFailCount(), results.getErrorMessages());
}
//Boiler Plate for Cucumber report generation
public static void generateReport(final String karateOutputPath) {
final Collection<File> jsonFiles = FileUtils.listFiles(new File(karateOutputPath), new String[]{"json"}, true);
final ArrayList<String> jsonPaths = new ArrayList<>(jsonFiles.size());
jsonFiles.forEach(file -> jsonPaths.add(file.getAbsolutePath()));
final Configuration config = new Configuration(new File("target"), "Karate test");
System.out.println("config report " + config.getBuildNumber());
final ReportBuilder reportBuilder = new ReportBuilder(jsonPaths, config);
reportBuilder.generateReports();
}
}
When it is run individually from the feature file, it is successful.
The Karate test is like below:
Feature: karate testing
@test
Scenario: Karate test case
* def acts = [ { "label": "BUSINESS","name": "David","id": "u374892","type": "DIVISION"},{"label": "Division","name":"MILLER","id": "u236478","type": "BUSINESS"}]
* def data1 = [{"UserName": "ABENANTE","UserID": "u109238","Role": "BUSINESS"},{"UserName": "Louis","UserID": "u784784","Role": "DIVISION"}]
* def desk = utils.returnIds(data1, 2, acts, "DESK")
* print desk
The Java method from utils starts like:
public static List<Map<String, Object>> returnIds(List<Map<String, Object>> listOfAllIds, int numberOfIds, List<Map<String, Object>> listOfIds, String role)
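For context, a hypothetical sketch of what such a utility class might look like; the package, class name, and method body are assumptions, only the signature comes from the question, and the feature file is assumed to bind utils to this class (for example via Java.type in karate-config.js):
// Hypothetical sketch only -- package, class name and body are assumptions;
// only the method signature is taken from the question above.
package utils;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Utils {
    public static List<Map<String, Object>> returnIds(List<Map<String, Object>> listOfAllIds,
                                                      int numberOfIds,
                                                      List<Map<String, Object>> listOfIds,
                                                      String role) {
        // Collect up to numberOfIds entries from listOfIds whose "type" matches the requested role.
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> entry : listOfIds) {
            if (result.size() >= numberOfIds) {
                break;
            }
            if (role.equals(entry.get("type"))) {
                result.add(entry);
            }
        }
        return result;
    }
}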
For Karate, the dependencies are:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<java.version>1.8</java.version>
<maven.compiler.version>3.6.0</maven.compiler.version>
<karate.version>1.1.0</karate.version>
</properties>
<dependencies>
<!-- For Karate Begin -->
<dependency>
<groupId>com.intuit.karate</groupId>
<artifactId>karate-junit5</artifactId>
<version>${karate.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.intuit.karate</groupId>
<artifactId>karate-gatling</artifactId>
<version>${karate.version}</version>
<scope>test</scope>
</dependency>
</dependencies>
I am not sure what might cause this issue.

Does Spark Streaming support Kafka 1.1.0 now?

The current Spark version is 2.3.
I have looked at the Maven Central repository: https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22
The jar shown there is spark-streaming-kafka-0-10_2.11,
so is Kafka 1.1.0 not supported yet?
Should I still install Kafka 0.10.x?
Based on the following link:
you should use spark-streaming-kafka-0-10 for kafka 0.10.0 or higher.
I have tested Spark 2.3 with Kafka 1.1.0 using these jars:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
and it runs well.
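Here ${spark.version} is assumed to be defined as a Maven property matching the Spark build in use, for example:
<properties>
    <!-- assumed value; set this to the Spark version actually in use -->
    <spark.version>2.3.0</spark.version>
</properties>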
Example code:
SparkConf conf = new SparkConf().setAppName("stream test").setMaster("local[*]");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(2));
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "master:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
kafkaParams.put("enable.auto.commit", false);
List<String> topics = Arrays.asList("A29");
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
);
JavaDStream<String> lines = stream.map(ConsumerRecord::value);
lines.print(30);
streamingContext.start();
streamingContext.awaitTermination();

com.datastax.driver.core.OperationTimedOutException

I created the following test class that connects to Cassandra.
private static String server_ip = "127.0.0.1";
private static String keyspace = "hr";
private static Cluster cluster = null;
private static Session session = null;
public static void main (String [] args)
{
if(cluster!=null) return;
cluster = Cluster.builder().addContactPoints(server_ip).withPort(9042).build();
final Metadata metadata = cluster.getMetadata();
String msg = String.format("Connected to cluster: %s", metadata.getClusterName());
System.out.println(msg);
System.out.println("List of hosts");
for (final Host host : metadata.getAllHosts())
{
msg = String.format("Datacenter: %s; Host: %s; Rack: %s",
host.getDatacenter(),
host.getAddress(),
host.getRack());
System.out.println(msg);
}
session = cluster.connect(keyspace);
}
(Cluster.java:407)
at samples.SampleB.main(SampleB.java:28)
After googling, I tried all the proposed solutions without success. Could you please help me solve this issue? Thanks a lot.
You should verify these dependencies:
<!-- Apache Cassandra Datastax's CQL driver. -->
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-mapping</artifactId>
<version>3.1.0</version>
</dependency>
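If the dependencies are already consistent, another knob worth checking is the driver's client-side read timeout, since OperationTimedOutException is what the driver throws when a request gets no response within that timeout. A minimal sketch, with the timeout value chosen arbitrarily (an assumption, not a confirmed fix):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;

public class TimeoutSketch {
    public static void main(String[] args) {
        // Raise the per-request read timeout from the 3.x driver default of 12 seconds
        // (30 seconds here is an arbitrary assumption).
        SocketOptions socketOptions = new SocketOptions().setReadTimeoutMillis(30000);
        Cluster cluster = Cluster.builder()
                .addContactPoints("127.0.0.1")
                .withPort(9042)
                .withSocketOptions(socketOptions)
                .build();
        Session session = cluster.connect("hr");
        System.out.println("Connected to " + cluster.getMetadata().getClusterName());
        cluster.close();
    }
}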
HTH

Error connecting to a local Oracle database from Apache Spark (Java)

Below is the code that I am using to connect to a local Oracle database from an Apache Spark Java program, but I am getting this error:
java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name
Here is my code:
public static void main( String[] args )
{
String MYSQL_DRIVER = "oracle.jdbc.driver.OracleDriver";
String MYSQL_CONNECTION_URL = "jdbc:oracle:thin:@//192.168.2.85:1521/stage";
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
Map<String, String> options = new HashMap<String, String>();
options.put("driver", MYSQL_DRIVER);
options.put("url", MYSQL_CONNECTION_URL);
options.put("dbtable", "select LOC_ITEM_ID from loc_item_master");
options.put("partitionColumn", "LOC_ITEM_ID");
options.put("user", "ad_tables");
options.put("password", "ad_tables");
options.put("lowerBound", "4467");
options.put("upperBound", "4749");
options.put("numPartitions", "10");
try {
Dataset<Row> jdbcDF = sqlContext.load("jdbc", options);
List<Row> employeeFullNameRows = jdbcDF.collectAsList();
for (Row employeeFullNameRow : employeeFullNameRows) {
System.out.println(employeeFullNameRow);
}
sc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
Below is the POM file
<dependencies>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.5.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.univocity/univocity-parsers -->
<dependency>
<groupId>com.univocity</groupId>
<artifactId>univocity-parsers</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>com.oracle</groupId>
<artifactId>ojdbc6</artifactId>
<version>11.2.0.3</version>
</dependency>
</dependencies>
You can try the approach below; it works.
public class OracleDatabaseConnection {
public static void main(String[] args) {
System.setProperty("hadoop.home.dir", "C:\\AD_classfication\\Apachespark\\winutil");
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();
Properties connectionProperties = new Properties();
connectionProperties.put("user", "ad_tables");
connectionProperties.put("password", "ad_tables");
String query = "sparkour.people";
query = "(select LOC_ITEM_ID from loc_item_master)";
Dataset<Row> jdbcDF2 = spark.read().jdbc("jdbc:oracle:thin:@//192.168.2.85:1522/stage", query, connectionProperties);
jdbcDF2.show();
}
}
Your dbtable is incorrect, try:
options.put("dbtable", "(select LOC_ITEM_ID from loc_item_master) as loc_item_id");
Everything in dbtable should be a valid FROM fragment.
You cannot write: select ... from select LOC_ITEM_ID from loc_item_master, but you can write select ... from (select LOC_ITEM_ID from loc_item_master) as loc_item_id
Of course also table's schema must be ok, you can specify it using schema.table style in query
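For example, a hedged one-liner (the schema name ad_tables is an assumption based on the username above):
// Subquery aliased as a table, with an explicit schema (schema name is an assumption)
options.put("dbtable", "(select LOC_ITEM_ID from ad_tables.loc_item_master) loc_items");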

Kafka Spark Streaming data not getting written into Cassandra, zero rows inserted

While writing data to Cassandra from Spark, the data is not getting written.
The background is:
I am doing a Kafka - Spark Streaming - Cassandra integration.
I am reading Kafka messages and trying to put them into a Cassandra table created with CREATE TABLE TEST_TABLE(key INT PRIMARY KEY, value TEXT).
Kafka to Spark Streaming runs fine, but from Spark to Cassandra there is some issue: the data is not getting written to the table.
I am able to create a connection to Cassandra, but the data is not getting inserted into the table. The output shows it connecting and then disconnecting a second later.
All of the System.out.println() strings appear in the output:
+++++++++++cassandra connector created++++++++++++++++++++++++++++
+++++++++++++streaming Connection done!+++++++++++++++++++++++++++
++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++
Cassandra shell shows 0 rows.
The full code, the logs, and the dependencies are below:
public class SparkStream {
static int key=0;
public static void main(String args[]) throws Exception
{
if(args.length != 3)
{
System.out.println("parameters not given properly");
System.exit(1);
}
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
Map<String,Integer> topicMap = new HashMap<String,Integer>();
String[] topic = args[2].split(",");
for(String t: topic)
{
topicMap.put(t, new Integer(3));
}
/* Connection to Spark */
SparkConf conf = new SparkConf();
conf.set("spark.cassandra.connection.host", "localhost");
JavaSparkContext sc = new JavaSparkContext("local[4]", "SparkStream",conf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(5000));
/* connection to cassandra */
CassandraConnector connector = CassandraConnector.apply(sc.getConf());
System.out.println("+++++++++++cassandra connector created++++++++++++++++++++++++++++");
/* Receive Kafka streaming inputs */
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, args[0], args[1], topicMap );
System.out.println("+++++++++++++streaming Connection done!+++++++++++++++++++++++++++");
/* Create DStream */
JavaDStream<TestTable> data = messages.map(new Function< Tuple2<String,String>, TestTable >()
{
public TestTable call(Tuple2<String, String> message)
{
return new TestTable(new Integer(++key), message._2() );
}
}
);
System.out.println("++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++");
/* Write to cassandra */
javaFunctions(data).writerBuilder("testkeyspace", "test_table", mapToRow(TestTable.class)).saveToCassandra();
jssc.start();
jssc.awaitTermination();
}
}
class TestTable implements Serializable
{
Integer key;
String value;
public TestTable() {}
public TestTable(Integer k, String v)
{
key=k;
value=v;
}
public Integer getKey(){
return key;
}
public void setKey(Integer k){
key=k;
}
public String getValue(){
return value;
}
public void setValue(String v){
value=v;
}
public String toString(){
return MessageFormat.format("TestTable'{'key={0}, value={1}'}'", key, value);
}
}
The log is:
+++++++++++cassandra connector created++++++++++++++++++++++++++++
+++++++++++++streaming Connection done!+++++++++++++++++++++++++++
++++++++++++++++JavaDStream<TestTable> created++++++++++++++++++++++++++++
14/12/09 12:07:33 INFO core.Cluster: New Cassandra host localhost/127.0.0.1:9042 added
14/12/09 12:07:33 INFO cql.CassandraConnector: Connected to Cassandra cluster: Test Cluster
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:34 INFO cql.CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
14/12/09 12:07:45 INFO core.Cluster: New Cassandra host localhost/127.0.0.1:9042 added
14/12/09 12:07:45 INFO cql.CassandraConnector: Connected to Cassandra cluster: Test Cluster
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 127.0.0.1 (datacenter1)
14/12/09 12:07:46 INFO cql.CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
The POM.xml dependencies are:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.10</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.1.1</version>
</dependency>
<dependency>
<groupId>com.msiops.footing</groupId>
<artifactId>footing-tuple</artifactId>
<version>0.2</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>2.1.3</version>
</dependency>
Is there something wrong with the code, or with the Cassandra configuration?
Solved the issue.
The column mapper wasn't able to access the getters and setters of the TestTable class, so I changed the access modifier to public.
But then I had two public classes in one file, which is an error, so I created a separate file TestTable.java with the class as:
public class TestTable implements Serializable {
//code
}
Now the messages are being read from Kafka and stored in the Cassandra table.
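For completeness, a standalone TestTable.java along the lines described above might look like this (reconstructed from the class shown earlier, with the class public and JavaBean-style public accessors; the private fields are a sketch, not confirmed from the original):
import java.io.Serializable;
import java.text.MessageFormat;

public class TestTable implements Serializable {
    private Integer key;
    private String value;

    public TestTable() {}

    public TestTable(Integer k, String v) {
        key = k;
        value = v;
    }

    public Integer getKey() { return key; }
    public void setKey(Integer k) { key = k; }

    public String getValue() { return value; }
    public void setValue(String v) { value = v; }

    @Override
    public String toString() {
        return MessageFormat.format("TestTable'{'key={0}, value={1}'}'", key, value);
    }
}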
