I'm trying to use the Spark JdbcRDD to load data from a SQL Server database. I'm using version 4.0 of the Microsoft JDBC driver. Here is a snippet of code:
public static JdbcRDD<Object[]> load() {
    SparkConf conf = new SparkConf().setMaster("local").setAppName("myapp");
    JavaSparkContext context = new JavaSparkContext(conf);
    DbConnection connection = new DbConnection("com.microsoft.sqlserver.jdbc.SQLServerDriver",
            "my-connection-string", "test", "test");
    JdbcRDD<Object[]> jdbcRDD = new JdbcRDD<Object[]>(context.sc(), connection,
            "select * from <table>", 1, 1000, 1, new JobMapper(),
            ClassManifestFactory$.MODULE$.fromClass(Object[].class));
    return jdbcRDD;
}

public static void main(String[] args) {
    JdbcRDD<Object[]> jdbcRDD = load();
    JavaRDD<Object[]> javaRDD = JavaRDD.fromRDD(jdbcRDD,
            ClassManifestFactory$.MODULE$.fromClass(Object[].class));
    List<String> ids = javaRDD.map(new Function<Object[], String>() {
        public String call(final Object[] record) {
            return (String) record[0];
        }
    }).collect();
    System.out.println(ids);
}
I get the following exception:
java.lang.AbstractMethodError: com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.isClosed()Z
at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala:109)
at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:74)
at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:74)
at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:58)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Here is the definition of JobMapper:
public class JobMapper extends AbstractFunction1<ResultSet, Object[]> implements Serializable {
    private static final Logger logger = Logger.getLogger(JobMapper.class);

    public Object[] apply(ResultSet row) {
        return JdbcRDD.resultSetToObjectArray(row);
    }
}
I found the issue with what I was doing. There were a couple of things:
It does not seem to work with version 4.0 of the driver, so I changed it to version 3.0.
The documentation for JdbcRDD states that the SQL string must include two parameter placeholders that indicate the range for the query, so I had to change the query:
JdbcRDD<Object[]> jdbcRDD = new JdbcRDD<Object[]>(context.sc(), connection,
        "SELECT * FROM <table> WHERE Id >= ? AND Id <= ?", 1, 20, 1,
        new JobMapper(), ClassManifestFactory$.MODULE$.fromClass(Object[].class));
The parameters 1 and 20 indicate the range for the query.
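For completeness, JdbcRDD also expects the connection argument to be a Scala Function0<Connection> that opens a fresh connection each time it is called. The DbConnection class referenced in the snippet above is not shown in the question; a minimal sketch of what such a class might look like (the class name and constructor arguments are assumptions based on how it is called above) is:

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import scala.runtime.AbstractFunction0;

// Hypothetical DbConnection: a serializable zero-argument function that JdbcRDD
// calls on each partition to open a new JDBC connection.
public class DbConnection extends AbstractFunction0<Connection> implements Serializable {
    private final String driverClassName;
    private final String connectionUrl;
    private final String userName;
    private final String password;

    public DbConnection(String driverClassName, String connectionUrl, String userName, String password) {
        this.driverClassName = driverClassName;
        this.connectionUrl = connectionUrl;
        this.userName = userName;
        this.password = password;
    }

    @Override
    public Connection apply() {
        try {
            Class.forName(driverClassName); // register the JDBC driver
            return DriverManager.getConnection(connectionUrl, userName, password);
        } catch (Exception e) {
            throw new RuntimeException("Could not open JDBC connection", e);
        }
    }
}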
NOTE: This solution assumes you have the latest build of Spark (1.3.0), and I have only tried this in standalone mode.
I was having a similar issue, but here is how I got it to work.
First make sure that the driver jar (sqljdbc40.jar) for SQL Server is placed in the following directory:
YOUR_SPARK_HOME/core/target/jars
This will ensure that the driver is loaded when Spark computes its classpath.
Next in your code, have the following:
JavaSparkContext sc = new JavaSparkContext("local", appName); //master is set to local
SQLContext sqlContext = new SQLContext(sc);
//This url connection string is not complete (include your credentials or integrated security options)
String url = "jdbc:sqlserver://" + host + ":1433;DatabaseName=" + database;
String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
//Settings for SQL Server jdbc connection
Map<String, String> options = new HashMap<>();
options.put("driver", driver);
options.put("url", url);
options.put("dbtable", tablename);
//Get table from SQL Server and save data in a DataFrame using JDBC
DataFrame jdbcDF = sqlContext.load("jdbc", options);
jdbcDF.printSchema();
long numRecords = jdbcDF.count();
System.out.println("Number of records in jdbcDF: " + numRecords);
jdbcDF.show();
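As a side note, in Spark 1.4 and later 1.x versions the same load can be expressed through the DataFrameReader API; a minimal sketch, assuming the same options map as above:

// Equivalent load with the DataFrameReader API (Spark 1.4+); reuses the driver/url/dbtable options.
DataFrame jdbcDF2 = sqlContext.read()
        .format("jdbc")
        .options(options)
        .load();
jdbcDF2.show();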
Related
I'm trying to get data from Cassandra using Apache Flink, referencing this post. I can read the data, but I don't know how to load it into a DataStream object. The following is the code:
ClusterBuilder cb = new ClusterBuilder() {
    @Override
    public Cluster buildCluster(Cluster.Builder builder) {
        return builder.addContactPoint("localhost")
                /*.withCredentials("hduser".trim(), "hadoop".trim())*/
                .build();
    }
};
CassandraInputFormat<Tuple2<UUID, String>> cassandraInputFormat = new CassandraInputFormat<Tuple2<UUID, String>>(query, cb);
cassandraInputFormat.configure(null);
cassandraInputFormat.open(null);
Tuple2<UUID, String> testOutputTuple = new Tuple2<>();
ByteArrayOutputStream res = new ByteArrayOutputStream();
res.reset();
while (!cassandraInputFormat.reachedEnd()) {
cassandraInputFormat.nextRecord(testOutputTuple);
res.write((testOutputTuple.f0.toString() + "," + testOutputTuple.f1).getBytes());
}
DataStream<byte[]> temp = new DataStream<byte[]>(env, new StreamTransformation<byte[]>(res.toByteArray()));
I tried
DataStream<byte[]> temp = new DataStream<byte[]>(env, new StreamTransformation<byte[]>(res.toByteArray()));
to load the data held in the res variable into a DataStream<byte[]> object, but it's not the correct way. How can I do that? And is my approach to reading Cassandra suitable for stream processing?
Reading data from a DB is a finite task.
You should use the DataSet API, not the DataStream API, when using CassandraInputFormat.
For example:
DataSet<Tuple2<Long, Date>> ds = env.createInput(
        executeQuery(YOUR_QUERY),
        TupleTypeInfo.of(new TypeHint<Tuple2<Long, Date>>() {}));

private static CassandraInputFormat<Tuple2<Long, Date>> executeQuery(String YOUR_QUERY) throws IOException {
    return new CassandraInputFormat<>(YOUR_QUERY, new ClusterBuilder() {
        private static final long serialVersionUID = 1;

        @Override
        protected Cluster buildCluster(com.datastax.driver.core.Cluster.Builder builder) {
            return builder.addContactPoints(CASSANDRA_HOST).build();
        }
    });
}
Creating a DataStream in Flink always starts with the execution environment (for a streaming job, a StreamExecutionEnvironment).
Instead of:
DataStream<byte[]> temp = new DataStream<byte[]>(env, new StreamTransformation<byte[]>(res.toByteArray()));
Try:
DataStream<Tuple2<UUID, String>> raw = env.createInput(cassandraInputFormat);
(Here env is your StreamExecutionEnvironment.) You can then use a map function to change the element type of the stream, for example:
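A minimal sketch of such a map, assuming the Tuple2<UUID, String> element type produced by the input format above and that MapFunction (org.apache.flink.api.common.functions.MapFunction) is imported:

// Sketch: convert each Tuple2 record into a plain String.
DataStream<String> lines = raw.map(new MapFunction<Tuple2<UUID, String>, String>() {
    @Override
    public String map(Tuple2<UUID, String> record) {
        return record.f0.toString() + "," + record.f1;
    }
});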
I have not used the Cassandra connector itself, so I don't know whether you are using that part correctly.
We are getting live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and below is what we want to achieve:
Group the data by deviceId. This is done, but we are not sure whether we can get a DataSet out of it.
Loop through the grouped data above and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code collected the data, looped through the RDDs, converted them to a DataSet, and applied the aggregation logic on the DataSet using the Spark SQLContext APIs.
During load testing we saw that 90% of the processing was happening on the master node; after a while the CPU usage spiked to 100% and the process died.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far; it actually runs on the worker nodes, but we have yet to get a DataSet for the aggregation logic.
public static void main(String[] args) {
try {
mconf = new SparkConf();
mconf.setAppName("OnPrem");
mconf.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(mconf);
jssc = new JavaStreamingContext(sc, Durations.seconds(60));
SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
//spksess.sparkContext().setLogLevel("ERROR");
Map<String, String> rabbitMqConParams = new HashMap<String, String>();
rabbitMqConParams.put("hosts", "localhost");
rabbitMqConParams.put("userName", "guest");
rabbitMqConParams.put("password", "guest");
rabbitMqConParams.put("vHost", "/");
rabbitMqConParams.put("durable", "true");
List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));
Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
public String call(Delivery message) {
return new String(message.getBody());
}
};
JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2),Durations.seconds(60)); //every 60 seconds one RDD is Created
machineDataRDD.print();
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String, Iterable<String>>>() {
    @Override
    public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
        data.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<String>>>>() {
            @Override
            public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
                while (data.hasNext()) {
                    LOGGER.error("Machine Data == >>" + data.next());
                }
            }
        });
    }
});
jssc.start();
jssc.awaitTermination();
}
catch (Exception e) {
    e.printStackTrace();
}
}
The grouping code below gives us an Iterable of Strings for a device; ideally we would like to get a DataSet.
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is looping with foreachPartition so that the code execution gets pushed to the worker nodes.
After looking through more code samples and guidelines: SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we will change our strategy and not try to build a Dataset within the foreachPartition loop.
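One pattern that fits that strategy (a sketch only, not tested against this exact pipeline) is to build the Dataset once per micro-batch inside foreachRDD, which runs on the driver, and let Spark SQL push the actual aggregation work to the executors. The avg over data.speed is just an example aggregation based on the sample JSON; spksess is the SparkSession created in main, and the imports assumed are org.apache.spark.sql.Dataset, org.apache.spark.sql.Row, and org.apache.spark.sql.functions.

// Sketch: parse each windowed micro-batch of JSON strings into a Dataset and aggregate per device.
machineDataRDD.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        Dataset<Row> batch = spksess.read().json(rdd); // runs on the driver; the data itself stays distributed
        batch.groupBy("DeviceId")
             .agg(functions.avg("data.speed").alias("avgSpeed")) // example aggregation
             .show();
    }
});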
I was following the example at https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/cassandra.html to connect to Cassandra as a sink in Flink.
My code is shown below:
public class writeToCassandra {
private static final String CREATE_KEYSPACE_QUERY = "CREATE KEYSPACE test WITH replication= {'class':'SimpleStrategy', 'replication_factor':1};";
private static final String createTable = "CREATE TABLE test.cassandraData(id varchar, heart_rate varchar, PRIMARY KEY(id));" ;
private final static Collection<String> collection = new ArrayList<>(50);
static {
for (int i = 1; i <= 50; ++i) {
collection.add("element " + i);
}
}
public static void main(String[] args) throws Exception {
//setting the env variable to local
StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
DataStream<Tuple2<String, String>> dataStream = environment
.fromCollection(collection)
.map(new MapFunction<String, Tuple2<String, String>>() {
final String mapped = " mapped ";
String[] splitted;
@Override
public Tuple2<String, String> map(String s) throws Exception {
splitted = s.split("\\s+");
return Tuple2.of(
UUID.randomUUID().toString(),
splitted[0] + mapped + splitted[1]
);
}
});
CassandraSink.addSink(dataStream)
.setQuery("INSERT INTO test.cassandraData(id,heart_rate) values (?,?);")
.setHost("127.0.0.1")
.build();
environment.execute();
} //main
} //writeToCassandra
I am getting the following error:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
Not sure if this is always required, but the way that I set up my CassandraSink is like this:
CassandraSink
.addSink(dataStream)
.setClusterBuilder(new ClusterBuilder() {
@Override
protected Cluster buildCluster(Cluster.Builder builder) {
return Cluster.builder()
.addContactPoints(myListOfCassandraUrlsString.split(","))
.withPort(portNumber)
.build();
}
})
.build();
I have annotated POJOs returned by the dataStream, so I don't need the query; you would just include ".setQuery(...)" after the ".addSink(...)" line.
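For the tuple-based stream in the question, combining the query with an explicit cluster builder would look roughly like the sketch below; the contact point and port are placeholders, not values from the question:

// Sketch: explicit insert query plus cluster builder for a Tuple2 stream.
CassandraSink.addSink(dataStream)
        .setQuery("INSERT INTO test.cassandraData(id, heart_rate) values (?, ?);")
        .setClusterBuilder(new ClusterBuilder() {
            @Override
            protected Cluster buildCluster(Cluster.Builder builder) {
                return builder
                        .addContactPoint("cassandra-host") // placeholder address
                        .withPort(9042)                    // default CQL port
                        .build();
            }
        })
        .build();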
The exception simply indicates that the example program cannot reach the C* database.
The flink-cassandra-connector offers a streaming API to connect to a designated C* database, so you need to have a C* instance running.
Each streaming job is pushed/serialized to the node the Task Manager runs on. In your example, you assume C* is running on the same node as the TM. An alternative is to change the C* address from 127.0.0.1 to a publicly reachable address.
Goal: Read Kafka with Spark Streaming and store the data in Cassandra.
Using: Java, Spark Cassandra Connector 1.6.
Data input: a simple JSON line object {"id":"1","field1":"value1"}
I have a Java class that reads from Kafka via Spark Streaming, processes the data, and then stores it in Cassandra.
Here is the main code:
JavaPairReceiverInputDStream<String, String> messages =
        KafkaUtils.createStream(ssc, targetKafkaServerPort, targetTopic, topicMap);

JavaDStream<List<Object>> list = messages.map(new Function<Tuple2<String, String>, List<Object>>() {
    public List<Object> call(Tuple2<String, String> tuple2) {
        List<Object> list = new ArrayList<Object>();
        Gson gson = new Gson();
        MyClass myclass = gson.fromJson(tuple2._2(), MyClass.class);
        myclass.setNewData("new_data");
        String jsonInString = gson.toJson(myclass);
        list.add(jsonInString);
        return list;
    }
});
The next code is incorrect:
javaFunctions(list)
    .writerBuilder("schema", "table", mapToRow(JavaDStream.class))
    .saveToCassandra();
Because "javaFunctions" method expect a JavaRDD object and "list" is a JavaDStream...
I´d need to cast JavaDStream to JavaRDD but I don´t find the right way...
Any help?
Use import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.* instead of com.datastax.spark.connector.japi.CassandraJavaUtil.*.
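With the streaming helper, the DStream can be saved directly without converting it to an RDD first. A sketch, reusing MyClass and the "schema"/"table" placeholders from the question; the lambda syntax assumes Java 8:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.javaFunctions;

// Sketch: map each Kafka message straight to a MyClass object...
JavaDStream<MyClass> objects = messages.map(tuple2 -> {
    MyClass myclass = new Gson().fromJson(tuple2._2(), MyClass.class);
    myclass.setNewData("new_data");
    return myclass;
});

// ...and let the streaming helper save every micro-batch to Cassandra.
javaFunctions(objects)
        .writerBuilder("schema", "table", mapToRow(MyClass.class))
        .saveToCassandra();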
Hmm, not really... What I've done is use a foreachRDD after creating the dStream:
dStream.foreachRDD(new Function<JavaRDD<MyObject>, Void>() {
    @Override
    public Void call(JavaRDD<MyObject> rdd) throws Exception {
        if (rdd != null) {
            javaFunctions(rdd)
                .writerBuilder("schema", "table", mapToRow(MyObject.class))
                .saveToCassandra();
            logging(" --> Saved data to cassandra", 1, null);
        }
        return null;
    }
});
Hope this is useful...
I have a Spark application that uses TwitterUtils to read a Twitter stream, and uses a map and a foreachRDD on the stream to put Twitter messages into a database. That all works great.
My question: what is the most appropriate way to detach from the Twitter stream once everything is running? Suppose I want to collect only 1000 messages or run the collection for 60 seconds.
The code is as follows:
SparkConf sparkConf = new SparkConf().setAppName("Java spark twitter stream");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(1000));
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, filters);
JavaDStream<String> statuses = tweets.map(
new Function<Status, String>() {
public String call(Status status) {
//combine the strings here.
GeoLocation geoLocation = status.getGeoLocation();
if (geoLocation != null) {
String text = status.getText().replaceAll("[\r\n]", " ");
String line = geoLocation.getLongitude() + ",,,,"
+ geoLocation.getLatitude() + ",,,,"
+ status.getCreatedAt().getTime()
+ ",,,," + status.getUser().getId()
+ ",,,," + text;
return line;
} else {
return null;
}
}
}
).filter(new Function<String, Boolean>() {
public Boolean call(String input) {
return input != null;
}
});
statuses.print();
statuses.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
@Override
public Void call(JavaRDD<String> rdd, Time time) {
SQLContext sqlContext
= JavaSQLContextSingleton
.getInstance(rdd.context());
sqlContext.setConf("spark.sql.tungsten.enabled", "false");
JavaRDD<Row> tweetRowRDD
= rdd.map(new TweetMapLoadFunction());
DataFrame statusesDataFrame
= sqlContext
.createDataFrame(
tweetRowRDD,
tweetSchema.createTweetStructType());
return null;
}
});
ssc.start();
ssc.awaitTermination();
This is straight from the documentation (a short sketch applying it to your 60-second case follows the list):
The processing can be manually stopped using streamingContext.stop().
Points to remember:
Once a context has been started, no new streaming computations can be set up or added to it.
Once a context has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
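For the 60-second case in the question, one way to apply this (a minimal sketch, using your existing ssc) is awaitTerminationOrTimeout followed by an explicit stop:

// Sketch: run the streaming collection for at most 60 seconds, then stop.
ssc.start();
ssc.awaitTerminationOrTimeout(60 * 1000); // returns after 60 s, or earlier if the context stops
ssc.stop(true, true); // stopSparkContext = true, stopGracefully = true (finish already-received data first)

Stopping after a fixed number of messages is more involved; one approach is to keep a counter (for example, an accumulator updated in foreachRDD) and call stop() from a separate thread once the threshold is reached.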