I have tried to use Spring Boot to provide a web service on top of Spark, but there seem to be some problems.
When I debug locally it runs correctly, but when I deploy it to the cluster and run it with spark-submit, something goes wrong.
Source code:
public List<Person> txtlist(String sqlstr) {
JavaRDD<Person> peopleRDD = sparkSession.read().textFile("hdfs://master:9000/user/root/people.txt").javaRDD().map(line -> {
String[] parts = line.split(",");
Person person = new Person();
person.setName(parts[0]);
person.setAge(Long.parseLong(parts[1].trim()));
return person;
});
Dataset<Row> peopleDF = sparkSession.createDataFrame(peopleRDD, Person.class);
peopleDF.createOrReplaceTempView("people");
Dataset<Person> sqlFrame = sparkSession.sql(sqlstr).as(Encoders.bean(Person.class));
List<Person> plist = sqlFrame.collectAsList();
return plist;
}
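For context, the snippet above relies on a Person JavaBean that is not shown in the post. A hypothetical sketch of what createDataFrame(peopleRDD, Person.class) and Encoders.bean(Person.class) expect, namely a serializable bean with a no-arg constructor and getters/setters, might look like this:
import java.io.Serializable;

// Hypothetical Person bean (the original class is not shown in the post).
public class Person implements Serializable {
    private String name;
    private long age;

    public Person() { }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public long getAge() { return age; }
    public void setAge(long age) { this.age = age; }
}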
Error output:
17/11/24 00:35:30 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, slave3, executor 2):
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
I tried some solutions from the internet, like .setJars(), and also spark-submit options, but none of them helped.
Has anyone used Spring Boot with lambdas in Spark, run into this problem and solved it, or gotten it to run correctly on a cluster?
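For reference only, and as a hedged sketch rather than a confirmed fix: one thing commonly checked for this ClassCastException is whether the jar that actually contains the lambda classes is shipped to the executors, either via SparkConf.setJars(...) or the --jars / application-jar arguments of spark-submit. The path below is hypothetical, and with a Spring Boot repackaged (nested) jar it is usually the plain, non-repackaged artifact that needs to be shipped.
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Sketch only: explicitly ship the application jar (hypothetical path) so the
// classes behind the lambdas are on the executors' classpath.
SparkConf conf = new SparkConf()
        .setAppName("spring-boot-spark")
        .setJars(new String[] { "/path/to/original-app.jar" }); // not the repackaged fat jar

SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();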
Update:
I tried another version of the code, but it fails as well:
JavaRDD<String> lines = sparkSession.read().textFile("hdfs://master:9000/user/root/people.txt").javaRDD();
JavaRDD<Row> peopleRDD = lines.map(new Function<String,Row>()
{
private static final long serialVersionUID = 1L;
@Override
public Row call(String line) throws Exception {
String[] splited = line.split(",");
return RowFactory.create(splited[0],Integer.valueOf(splited[1]));
}
});
List<StructField> structFields = new ArrayList<StructField>();
structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
StructType structType = DataTypes.createStructType(structFields);
The error:
cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.
Could this be a serialization problem?
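For what it's worth, the second snippet stops right after building structType; a sketch of the assumed continuation (not part of the original post) that applies the schema to the Row RDD would be:
// Assumed continuation of the snippet above (imports as before, plus
// org.apache.spark.sql.Dataset and org.apache.spark.sql.Row).
Dataset<Row> peopleDF = sparkSession.createDataFrame(peopleRDD, structType);
peopleDF.createOrReplaceTempView("people");
List<Row> rows = sparkSession.sql("SELECT name, age FROM people").collectAsList();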
Related
I am facing a strange issue and have tried using a custom receiver as well.
Issue: the Spark driver/executors stop receiving and displaying data in stdout after about 5 minutes of activity. It continues to work only if we keep writing data to the server socket at the other end.
No errors are reported in the driver or executor logs.
Code snippet
SparkConf sparkConf = new SparkConf().setMaster("spark://10.0.0.5:7077").setAppName("SmartAudioAnalytics")
.set("spark.executor.memory", "1g").set("spark.cores.max", "5").set("spark.driver.cores", "2")
.set("spark.driver.memory", "2g");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(3000));
JavaDStream<String> JsonReq1 = ssc.socketTextStream("myip", 9997, StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<String> JsonReq2 = ssc.socketTextStream("myIP", 9997, StorageLevels.MEMORY_AND_DISK_SER);
ArrayList<JavaDStream<String>> streamList = new ArrayList<JavaDStream<String>>();
streamList.add(JsonReq1);
JavaDStream<String> UnionStream = ssc.union(JsonReq2, streamList);
UnionStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
private static final long serialVersionUID = 1L;
int total = 0;
@Override
public void call(JavaRDD<String> rdd) throws Exception {
long count = rdd.count();
total += count;
System.out.println(total);
rdd.foreach(new VoidFunction<String>() {
private static final long serialVersionUID = 1L;
@Override
public void call(String s) throws Exception {
System.out.println(s);
}
});
}
});
System.out.println(UnionStream.count());
ssc.start();
ssc.awaitTermination();
I have opened the Spark UI and found that all threads are working properly even after 24 hours. Please see the pictures.
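One small observation on the snippet above, offered as a sketch rather than a diagnosis: System.out.println(UnionStream.count()) prints the DStream object itself, because count() on a DStream returns another DStream. To see per-batch counts, an output operation is needed, for example:
// Sketch: count() returns a JavaDStream<Long>; print() is an output operation
// that logs each batch's single element (the per-batch count) on the driver.
JavaDStream<Long> perBatchCount = UnionStream.count();
perBatchCount.print();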
I was following the example at https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/cassandra.html to connect to Cassandra as a sink in Flink.
My code is shown below:
public class writeToCassandra {
private static final String CREATE_KEYSPACE_QUERY = "CREATE KEYSPACE test WITH replication= {'class':'SimpleStrategy', 'replication_factor':1};";
private static final String createTable = "CREATE TABLE test.cassandraData(id varchar, heart_rate varchar, PRIMARY KEY(id));" ;
private final static Collection<String> collection = new ArrayList<>(50);
static {
for (int i = 1; i <= 50; ++i) {
collection.add("element " + i);
}
}
public static void main(String[] args) throws Exception {
//set up a local execution environment
StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
DataStream<Tuple2<String, String>> dataStream = environment
.fromCollection(collection)
.map(new MapFunction<String, Tuple2<String, String>>() {
final String mapped = " mapped ";
String[] splitted;
@Override
public Tuple2<String, String> map(String s) throws Exception {
splitted = s.split("\\s+");
return Tuple2.of(
UUID.randomUUID().toString(),
splitted[0] + mapped + splitted[1]
);
}
});
CassandraSink.addSink(dataStream)
.setQuery("INSERT INTO test.cassandraData(id,heart_rate) values (?,?);")
.setHost("127.0.0.1")
.build();
environment.execute();
} //main
} //writeToCassandra
I am getting the following error
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:231)
Not sure if this is always required, but the way that I set up my CassandraSink is like this:
CassandraSink
.addSink(dataStream)
.setClusterBuilder(new ClusterBuilder() {
@Override
protected Cluster buildCluster(Cluster.Builder builder) {
return Cluster.builder()
.addContactPoints(myListOfCassandraUrlsString.split(","))
.withPort(portNumber)
.build();
}
})
.build();
I have annotated POJOs that are returned by the dataStream so I don't need the query, but you would just include ".setQuery(...)" after the ".addSink(...)" line.
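To illustrate the annotated-POJO route mentioned above (this is a hypothetical class, not taken from either post): with DataStax object-mapper annotations on the elements of the stream, the connector can derive the INSERT itself, so setQuery(...) is not needed.
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.Table;

// Hypothetical annotated POJO matching the test.cassandraData table from the question.
@Table(keyspace = "test", name = "cassandradata")
public class CassandraData {

    @Column(name = "id")
    private String id;

    @Column(name = "heart_rate")
    private String heartRate;

    public CassandraData() { }  // no-arg constructor required by the mapper

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getHeartRate() { return heartRate; }
    public void setHeartRate(String heartRate) { this.heartRate = heartRate; }
}
With a stream of such POJOs, CassandraSink.addSink(pojoStream).setClusterBuilder(...).build() should be enough, as far as I understand the connector.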
The exception simply indicates that the example program cannot reach the C* database.
The flink-cassandra-connector offers a streaming API for connecting to a designated C* database, so you need to have a C* instance running and reachable.
Each streaming job is pushed/serialized to the node that the Task Manager runs on. In your example, you assume C* is running on the same node as the TM. An alternative is to change the C* address from 127.0.0.1 to a publicly reachable address.
I have a Spark Streaming job that fetches data from RabbitMQ and saves it into HBase. The save is an Increment operation. I'm using saveAsNewAPIHadoopDataset, but I keep getting the exception below.
Code:
pairDStream.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
@Override
public void call(JavaPairRDD<String, Integer> arg0)
throws Exception {
Configuration dbConf = HBaseConfiguration.create();
dbConf.set("hbase.table.namespace.mappings", "tablename:/mapr/tablename");
Job jobConf = Job.getInstance(dbConf);
jobConf.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tablename");
jobConf.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);
JavaPairRDD<ImmutableBytesWritable, Increment> hbasePuts = arg0.mapToPair(
new PairFunction<Tuple2<String,Integer>, ImmutableBytesWritable, Increment>() {
@Override
public Tuple2<ImmutableBytesWritable, Increment> call(
Tuple2<String, Integer> arg0)
throws Exception {
String[] keys = arg0._1.split("_");
Increment inc = new Increment(Bytes.toBytes(keys[0]));
inc.addColumn(Bytes.toBytes("data"),
Bytes.toBytes(keys[1]),
arg0._2);
return new Tuple2<ImmutableBytesWritable, Increment>(new ImmutableBytesWritable(), inc);
}
});
// save to HBase- Spark built-in API method
hbasePuts.saveAsNewAPIHadoopDataset(jobConf.getConfiguration());
}
});
Exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 6.0 failed 4 times, most recent failure: Lost task 1.3 in stage 6.0 (TID 100, dev-arc-app036.vega.cloud.ironport.com): java.io.IOException: Pass a Delete or a Put
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:128)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:87)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1113)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
Is it possible to use the saveAsNewAPIHadoopDataset method with an Increment rather than a Put?
Any help is greatly appreciated.
Thanks
Akhila.
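For what it's worth, the "Pass a Delete or a Put" message comes from TableOutputFormat's record writer, which only accepts Put and Delete mutations, so saveAsNewAPIHadoopDataset with Increment objects will keep failing with that output format. A hedged workaround sketch (an assumption, not a confirmed answer, written against the HBase 1.x client API) is to issue the increments with the plain HBase client inside foreachPartition; the table and column family names follow the question, everything else is illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Sketch of a possible workaround: apply increments with the plain HBase client
// inside foreachPartition, opening one connection per partition.
void incrementCounts(JavaPairRDD<String, Integer> counts) {
    counts.foreachPartition(partition -> {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("tablename"))) {
            while (partition.hasNext()) {
                Tuple2<String, Integer> record = partition.next();
                String[] keys = record._1.split("_");
                Increment inc = new Increment(Bytes.toBytes(keys[0]));
                inc.addColumn(Bytes.toBytes("data"), Bytes.toBytes(keys[1]), record._2);
                table.increment(inc);
            }
        }
    });
}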
I want to read data from a file and check whether each line's data is already present in Cassandra; if it is, it should be merged, otherwise it should be inserted fresh into C*.
The file data just contains name and address in JSON format. In Cassandra the student table has a UUID as its primary key and there is a secondary index on name.
Once the data is merged into Cassandra, I want to send the new or existing UUID to Kafka.
When I run locally, or on a single machine on the Mesos cluster (keeping the line sparkConf.setMaster("local[4]");), this program works. But when I submit it to the Mesos master with 4 slaves (commenting out //sparkConf.setMaster("local[4]"); on the cluster), there is a NullPointerException while selecting data from Cassandra on the JavaStreamingContext.
I made the streaming context static because earlier it was throwing a serialization exception, as it was being accessed inside the map transformation for the file DStream.
Is there something wrong with the approach? Or is it because I am trying to build a Cassandra RDD within a DStream map transformation, which is causing the issue?
import kafka.producer.KeyedMessage;
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.util.Properties;
import java.util.UUID;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.cloudera.spark.streaming.kafka.JavaDStreamKafkaWriter;
import org.cloudera.spark.streaming.kafka.JavaDStreamKafkaWriterFactory;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
public class DStreamExample {
public DStreamExample() {
}
private static JavaStreamingContext ssc;
public static void main(final String[] args) {
final SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("SparkJob");
sparkConf.setMaster("local[4]"); // for local
sparkConf.set("spark.cassandra.connection.host", cassandra_hosts);
ssc = new JavaStreamingContext(sparkConf,new Duration(2000));
final JavaDStream<Student> studentFileDStream = ssc.textFileStream(
"/usr/local/fileDir/").map(line -> {
final Gson gson = new Gson();
final JsonParser parser = new JsonParser();
final JsonObject jsonObject = parser.parse(line)
.getAsJsonObject();
final Student studentFile = gson.fromJson(jsonObject,
Student.class);
// generate a new UUID by default
studentFile.setId(UUID.randomUUID());
try{
//NullPointer at this line while running on cluster
final JavaRDD<Student> cassandraStudentRDD =
CassandraStreamingJavaUtil.javaFunctions(ssc)
.cassandraTable("keyspace", "student",
mapRowTo(Student.class)).where("name=?",
studentFile.getName());
//If student name is found in cassandra table then assign UUID to fileStudent object
//This way i wont create multiple records for same name student
final Student studentCassandra = cassandraStudentRDD.first();
studentFile.setId(studentCassandra.getId());
}catch(Exception e){
}
return studentFile;
});
//Save student to Cassandra
CassandraStreamingJavaUtil.javaFunctions(studentFileDStream)
.writerBuilder("keyspace", "student", mapToRow(Student.class))
.saveToCassandra();
final JavaDStreamKafkaWriter<Student> writer =
JavaDStreamKafkaWriterFactory.fromJavaDStream(studentFileDStream);
final Properties properties = new Properties();
properties.put("metadata.broker.list", "server:9092");
properties.put("serializer.class", "kafka.serializer.StringEncoder");
//Just send student UUID_PUT to Kafka
writer.writeToKafka(properties,
student ->
new KeyedMessage<>("TOPICNAME", student.getId() + "_PUT"));
ssc.start();
ssc.awaitTermination();
}
}
class Student {
private String address;
private UUID id;
private String name;
public Student() {
}
public String getAddress() {
return address;
}
public void setAddress(String address) {
this.address = address;
}
public UUID getId() {
return id;
}
public void setId(UUID id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
}
Exception stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, servername): java.lang.NullPointerException
at com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.javaFunctions(CassandraStreamingJavaUtil.java:39)
at com.ebates.ps.batch.sparkpoc.DStreamPOCExample.lambda$main$d2c4cc2c$1(DStreamPOCExample.java:109)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.cloudera.spark.streaming.kafka.RDDKafkaWriter$$anonfun$writeToKafka$1.apply(RDDKafkaWriter.scala:47)
at org.cloudera.spark.streaming.kafka.RDDKafkaWriter$$anonfun$writeToKafka$1.apply(RDDKafkaWriter.scala:45)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:898)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:896)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:896)
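As the question already suspects, the NullPointerException at CassandraStreamingJavaUtil.javaFunctions(ssc) strongly suggests the static ssc field is null when the lambda runs on an executor: static fields are not shipped with the closure, and the driver-side streaming context cannot be used inside a transformation anyway. A hedged sketch of one common alternative is shown below: do the lookup with a per-partition Cassandra session instead of building a Cassandra RDD per record. It assumes studentFileDStream has already been mapped to parsed Student objects without the Cassandra lookup; the keyspace, table, and column names follow the original post, everything else is an assumption.
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.spark.connector.cql.CassandraConnector;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.spark.streaming.api.java.JavaDStream;

// Sketch: resolve the id with a per-partition Cassandra session instead of building
// a Cassandra RDD inside the DStream map (which would need the driver-side context).
JavaDStream<Student> resolved = studentFileDStream.mapPartitions(students -> {
    CassandraConnector connector = CassandraConnector.apply(sparkConf); // sparkConf is serializable
    List<Student> out = new ArrayList<>();
    Session session = connector.openSession();
    try {
        while (students.hasNext()) {
            Student s = students.next();
            Row row = session.execute(
                    "SELECT id FROM keyspace.student WHERE name = ?", s.getName()).one();
            // reuse the existing UUID if the name is already present, otherwise create one
            s.setId(row != null ? row.getUUID("id") : UUID.randomUUID());
            out.add(s);
        }
    } finally {
        session.close();
    }
    return out; // Spark 1.x FlatMapFunction returns an Iterable; on 2.x return out.iterator()
});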
I'm trying to use the Spark JdbcRDD to load data from a SQL Server database. I'm using version 4.0 of the Microsoft JDBC driver. Here is a snippet of code:
public JdbcRDD<Object[]> load(){
SparkConf conf = new SparkConf().setMaster("local").setAppName("myapp");
JavaSparkContext context = new JavaSparkContext(conf);
DbConnection connection = new DbConnection("com.microsoft.sqlserver.jdbc.SQLServerDriver","my-connection-string","test","test");
JdbcRDD<Object[]> jdbcRDD = new JdbcRDD<Object[]>(context.sc(),connection,"select * from <table>",1,1000,1,new JobMapper(),ClassManifestFactory$.MODULE$.fromClass(Object[].class));
return jdbcRDD;
}
public static void main(String[] args) {
JdbcRDD<Object[]> jdbcRDD = load();
JavaRDD<Object[]> javaRDD = JavaRDD.fromRDD(jdbcRDD, ClassManifestFactory$.MODULE$.fromClass(Object[].class));
List<String> ids = javaRDD.map(new Function<Object[],String>(){
public String call(final Object[] record){
return (String)record[0];
}
}).collect();
System.out.println(ids);
}
I get the following exception:
java.lang.AbstractMethodError: com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.isClosed()Z
at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala:109)
at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:74)
at org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:74)
at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:58)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Here is the definition of JobMapper:
public class JobMapper extends AbstractFunction1<ResultSet, Object[]> implements Serializable {
private static final Logger logger = Logger.getLogger(JobMapper.class);
public Object[] apply(ResultSet row){
return JdbcRDD.resultSetToObjectArray(row);
}
}
I found the issue with what I was doing. There were a couple of things:
1. It does not seem to work with version 4.0 of the driver, so I changed it to use version 3.0.
2. The documentation for JdbcRDD states that the SQL string must include two parameters that indicate the range for the query, so I had to change the query:
JdbcRDD<Object[]> jdbcRDD = new JdbcRDD<Object[]>(context.sc(),connection,"SELECT * FROM <table> where Id >= ? and Id <= ?",1,20,1,new JobMapper(),ClassManifestFactory$.MODULE$.fromClass(Object[].class));
The parameters 1 and 20 indicate the range for the query.
NOTE: This solution assumes you have the latest build of Spark (1.3.0), and I have only tried it in standalone mode.
I was having a similar issue, but here is how I got it to work.
First make sure that the driver jar (sqljdbc40.jar) for SQL Server is placed in the following directory:
YOUR_SPARK_HOME/core/target/jars
This will ensure that the driver is loaded when Spark computes its classpath.
Next in your code, have the following:
JavaSparkContext sc = new JavaSparkContext("local", appName); //master is set to local
SQLContext sqlContext = new SQLContext(sc);
//This url connection string is not complete (include your credentials or integrated security options)
String url = "jdbc:sqlserver://" + host + ":1433;DatabaseName=" + database;
String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
//Settings for SQL Server jdbc connection
Map<String, String> options = new HashMap<>();
options.put("driver", driver);
options.put("url", url);
options.put("dbtable", tablename);
//Get table from SQL Server and save data in a DataFrame using JDBC
DataFrame jdbcDF = sqlContext.load("jdbc", options);
jdbcDF.printSchema();
long numRecords = jdbcDF.count();
System.out.println("Number of records in jdbcDF: " + numRecords);
jdbcDF.show();