teradata jdbc jar not loading in spark - apache-spark

I'm trying to load the Teradata JDBC jar file in Spark but can't get it to load. I start the Spark shell like this:
spark-shell --jars ~/*.jar --driver-class-path ~/*.jar
In my home directory I have a jar file called terajdbc4.jar.
When the Spark shell starts, I do this:
scala> sc.addJar("terajdbc4.jar")
15/12/07 12:27:55 INFO SparkContext: Added JAR terajdbc4.jar at http://1.2.4.4:41601/jars/terajdbc4.jar with timestamp 1449509275187
scala> sc.jars
res1: Seq[String] = List(file:/home/user1/spark-cassandra-connector_2.10-1.0.0-beta1.jar)
scala>
But it's not there in the jars. Why is it still missing?
EDIT:
OK, I got the jar to load, but I'm getting this error:
java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
I do the following:
scala> sc.jars
res4: Seq[String] = List(file:/home/user/terajdbc4.jar)
scala> import com.teradata.jdbc.TeraDriver
import com.teradata.jdbc.TeraDriver
scala> Class.forName("com.teradata.jdbc.TeraDriver")
res5: Class[_] = class com.teradata.jdbc.TeraDriver
and then this:
val jdbcDF = sqlContext.load("jdbc", Map(
"url" -> "jdbc:teradata://dbinstn, TMODE=TERA, user=user1, password=pass1",
"dbtable" -> "db1a.table1a",
"driver" -> "com.teradata.jdbc.TeraDriver"))
and then I get this:
java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

spark-shell --jars ~/*.jar --driver-class-path ~/*.jar
Please refer to Using wildcards in java classpath.
Wildcards like *.jar are not supported; please add the specific jar file paths.
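For example, a spark-shell invocation with explicit paths might look like the sketch below (the locations are assumptions, and note that the Teradata driver of that era also ships a tdgssconfig.jar which usually has to sit on the same classpath):
spark-shell \
  --jars /home/user1/terajdbc4.jar,/home/user1/tdgssconfig.jar \
  --driver-class-path /home/user1/terajdbc4.jar:/home/user1/tdgssconfig.jar
Note that --jars takes a comma-separated list, while --driver-class-path uses the usual colon-separated classpath syntax.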

Related

Error initializing SparkContext when using SPARK-SHELL in spark standalone

I have installed Scala.
I have installed Java 8.
Also, all environment variables have been set for Spark, Java and Hadoop.
I am still getting this error while running the spark-shell command. Please, someone help; I have googled a lot but didn't find anything.
[screenshot: spark-shell error]
[screenshot: spark-shell error 2]
Spark's shell provides a simple way to learn the API. Start the shell by running the following in the Spark directory:
./bin/spark-shell
Then run the Scala code snippet below:
import org.apache.spark.sql.SparkSession
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
If the error still persists, then we have to look into the environment setup.
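As a first pass at that, a quick sanity check from a shell (a sketch, assuming bash and the usual SPARK_HOME/JAVA_HOME/HADOOP_HOME variables) can confirm the basics:
# each of these should print a real, existing path
echo $JAVA_HOME
echo $SPARK_HOME
echo $HADOOP_HOME
# should report Java 1.8.x
java -version
# should print the Spark build info without errors
$SPARK_HOME/bin/spark-submit --version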

Unable to use a local file using spark-submit

I am trying to execute a Spark word count program. My input file and output dir are local, not on HDFS. When I execute the code, I get an input-directory-not-found exception.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object WordCount {
  val sparkConf = new SparkConf()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(sparkConf).master("yarn").getOrCreate()
    val input = args(0)
    val output = args(1)
    val text = spark.sparkContext.textFile("input", 1)
    val outPath = text.flatMap(line => line.split(" "))
    val words = outPath.map(w => (w, 1))
    val wc = words.reduceByKey((x, y) => x + y)
    wc.saveAsTextFile("output")
  }
}
Spark Submit:
spark-submit --class com.practice.WordCount sparkwordcount_2.11-0.1.jar --files home/hmusr/ReconTest/inputdir/sample /home/hmusr/ReconTest/inputdir/wordout
I am using the --files option to fetch the local input file and point the output to the output dir in spark-submit. When I submit the jar using spark-submit, it says the input path does not exist:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://dev/user/hmusr/input
Could anyone let me know what mistake I am making here?
A couple of things:
val text = spark.sparkContext.textFile(input,1)
To use the variable, remove the double quotes: it should be input, not "input" (and likewise output, not "output").
You expect input and output as arguments, so pass them after the jar in spark-submit (without --files), and use local as the master.
Also, use file:// to use local files.
Your spark-submit should look something like:
spark-submit --master local[2] \
--class com.practice.WordCount \
sparkwordcount_2.11-0.1.jar \
file:///home/hmusr/ReconTest/inputdir/sample \
file:///home/hmusr/ReconTest/inputdir/wordout
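Putting those points together, a corrected version of the program could look like the following sketch (it only applies the changes above to the code from the question):
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val input = args(0)   // e.g. file:///home/hmusr/ReconTest/inputdir/sample
    val output = args(1)  // e.g. file:///home/hmusr/ReconTest/inputdir/wordout

    val text = spark.sparkContext.textFile(input)
    val wc = text.flatMap(line => line.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    wc.saveAsTextFile(output)

    spark.stop()
  }
}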

Save CSV file to hbase table using Spark and Phoenix

Can someone point me to a working example of saving a CSV file to an HBase table using Spark 2.2?
Options that I tried and that failed (note: all of them work with Spark 1.6 for me):
phoenix-spark
hbase-spark
it.nerdammer.bigdata : spark-hbase-connector_2.10
All of them, after fixing everything, finally give an error similar to this Spark HBase one.
Thanks
Add the below parameters to your Spark job:
spark-submit \
--conf "spark.yarn.stagingDir=/somelocation" \
--conf "spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/s‌​omelocation" \
--conf "spark.hadoop.mapred.output.dir=/somelocation"
Phoenix has a plugin and a JDBC thin client which can connect (read/write) to HBase; examples are at https://phoenix.apache.org/phoenix_spark.html
Option 1: Connect via the ZooKeeper URL - phoenix plugin
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.load(
"org.apache.phoenix.spark",
Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
)
df
.filter(df("COL1") === "test_row_1" && df("ID") === 1L)
.select(df("ID"))
.show
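The plugin can also write a DataFrame back to a Phoenix table, which is what saving a CSV amounts to. A minimal sketch for Spark 2.x, run in spark-shell with the phoenix-spark jar on the classpath; the CSV path, the table name OUTPUT_TABLE and the zkUrl are assumptions:
import org.apache.spark.sql.SaveMode

// read the CSV with a header row (path is an assumption)
val csvDF = spark.read.option("header", "true").csv("/path/to/input.csv")

// write through the phoenix-spark data source; Phoenix only accepts
// SaveMode.Overwrite, which it executes as upserts into the existing table
csvDF.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "phoenix-server:2181")
  .save()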
Option 2: Use the JDBC thin client provided by the Phoenix Query Server
More info at https://phoenix.apache.org/server.html
jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF
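From Spark, that URL can be used with the ordinary JDBC data source. A sketch, assuming Spark 2.x in spark-shell, the thin-client jar on the classpath, and that its driver class is org.apache.phoenix.queryserver.client.Driver (check the jar shipped with your Phoenix version):
// driver class and table name are assumptions for illustration
val df = spark.read
  .format("jdbc")
  .option("driver", "org.apache.phoenix.queryserver.client.Driver")
  .option("url", "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF")
  .option("dbtable", "TABLE1")
  .load()
df.show()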

connect to mysql from spark

I am trying to follow the instructions mentioned here...
https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/
and here...
https://www.percona.com/blog/2015/10/07/using-apache-spark-mysql-data-analysis/
I am using the sequenceiq/spark Docker image.
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
cd /usr/local/spark/
./sbin/start-master.sh
./bin/spark-shell --driver-memory 1G --executor-memory 1g --executor-cores 1 --master local
This works as expected:
scala> sc.parallelize(1 to 1000).count()
But this shows an error:
val jdbcDF = spark.read.format("jdbc").options(
Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
"dbtable" -> "ontime.ontime_part",
"fetchSize" -> "10000",
"partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
)).load()
And here is the error:
<console>:25: error: not found: value spark
val jdbcDF = spark.read.format("jdbc").options(
How do I connect to MySQL from within the Spark shell?
With Spark 2.0.x, you can use DataFrameReader and DataFrameWriter.
Use SparkSession.read to access DataFrameReader and use Dataset.write to access DataFrameWriter.
Suppose you are using spark-shell.
read example
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
val df=spark.read.jdbc(url,"table_name",prop)
df.show()
read example 2
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql:dbserver")
.option("dbtable", “schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
from spark doc
write example
import org.apache.spark.sql.SaveMode
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
// df is a DataFrame containing the data which you want to write
df.write.mode(SaveMode.Append).jdbc(url,"table_name",prop)
Create the Spark context first.
Make sure you have the JDBC jar files attached to your classpath if you are trying to read data over JDBC.
Use the DataFrame API instead of RDDs, as DataFrames have better performance.
Here is the syntax for reading from JDBC:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Spark 1.x Java API; a custom serializer can be set on the conf if needed
SparkConf conf = new SparkConf()
        .setAppName("app")
        .setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlCtx = new SQLContext(sc);
DataFrame df = sqlCtx.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://1.2.3.4:3306/test")
        .option("driver", "com.mysql.jdbc.Driver")
        .option("dbtable", "dbtable")
        .option("user", "dbuser")
        .option("password", "dbpwd")
        .load();
It looks like spark is not defined; you should use a SQLContext to connect through the driver, like this:
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dataframe_mysql = sqlcontext.read.format("jdbc")
  .option("url", "jdbc:mysql://Public_IP:3306/DB_NAME")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "tblage")
  .option("user", "sqluser")
  .option("password", "sqluser")
  .load()
Later you can use sqlcontext where you used spark (in spark.read etc.).
This is a common problem for those migrating to Spark 2.0.0 from earlier versions. The Spark documentation is not very good. To solve this, you have to define a SparkSession, like this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Spark SQL Example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
This solution is hidden in the Spark SQL, DataFrames and Datasets Guide located here. SparkSession is the new entry point to the DataFrame API; it incorporates both SQLContext and HiveContext and has some additional advantages, so there is no need to define either of those anymore. Further information about this can be found here.
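With that spark in scope, the JDBC read from the question should work as written, assuming you are actually on Spark 2.x (the sequenceiq/spark:1.6.0 image from the question predates SparkSession) and the MySQL connector jar is on the classpath, e.g. via --jars:
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
    "dbtable" -> "ontime.ontime_part",
    "fetchSize" -> "10000",
    "partitionColumn" -> "yeard", "lowerBound" -> "1988",
    "upperBound" -> "2016", "numPartitions" -> "28"
  )).load()
jdbcDF.show()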
Please accept this as the answer, if you find this useful.

Using Hive along with Spark Cassandra connector?

Can I use Hive in concert with the Spark Cassandra connector?
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveCtx = new HiveContext(sc)
This produces:
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,
/etc/hive/conf.dist/ivysettings.xml will be used
and then
scala> val rows = hiveCtx.sql("SELECT first_name, last_name, house FROM test_gce.students WHERE student_id=1")
results in this error:
org.apache.spark.sql.AnalysisException: no such table test_gce.students; line 1 pos 48
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:260)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:268)
...
Is it possible to create a HiveContext from the SparkContext and use it as I am trying to do while using the Spark Cassandra connector?
Here is how I called spark-shell:
spark-shell --jars ~/spark-cassandra-connector/spark-cassandra-connector-assembly-1.4.0-M1-SNAPSHOT.jar --conf spark.cassandra.connection.host=10.240.0.0
Also, I am able to successfully access Cassandra with the pure connector code rather than just using Hive:
scala> val cRDD = sc.cassandraTable("test_gce", "students")
scala> cRDD.select("first_name", "last_name", "house").where("student_id=?", 1).collect()
res0: Array[com.datastax.spark.connector.CassandraRow] =
Array(CassandraRow{first_name: Harry, last_name: Potter, house: Godric Gryffindor})
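One way to get SQL over that same table without involving the Hive metastore (which knows nothing about the Cassandra keyspace) is to load it through the connector's DataFrame support and register it as a temporary table. A sketch, assuming a connector version that provides the org.apache.spark.sql.cassandra data source:
// keyspace and table come from the question; the temp table name is arbitrary
val studentsDF = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_gce", "table" -> "students"))
  .load()

studentsDF.registerTempTable("students")
sqlContext.sql(
  "SELECT first_name, last_name, house FROM students WHERE student_id = 1").show()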
