Zeppelin Twitter streaming example, unable to query 'tweets' temp table - apache-spark

When following the Zeppelin tutorial for streaming tweets and querying them using Spark SQL, I am running into an error where the 'tweets' temp table is not found. The exact code being used and the link referred to are as follows.
Ref: https://zeppelin.apache.org/docs/0.6.2/quickstart/tutorial.html
import scala.collection.mutable.HashMap
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess
/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
val configs = new HashMap[String, String] ++= Seq(
"apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
println("Configuring Twitter OAuth")
configs.foreach{ case(key, value) =>
if (value.trim.isEmpty) {
throw new Exception("Error setting authentication - value for " + key + " not set")
}
val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
System.setProperty(fullKey, value.trim)
println("\tProperty " + fullKey + " set as [" + value.trim + "]")
}
println()
}
// Configure Twitter credentials
val apiKey = "xxx"
val apiSecret = "xxx"
val accessToken = "xx-xxx"
val accessTokenSecret = "xxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
@transient val ssc = new StreamingContext(sc, Seconds(2))
@transient val tweets = TwitterUtils.createStream(ssc, None)
@transient val twt = tweets.window(Seconds(60), Seconds(2))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Tweet(createdAt:Long, text:String)
twt.map(status=>
Tweet(status.getCreatedAt().getTime()/1000, status.getText())).foreachRDD(rdd=>
// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use rdd.registerTempTable("tweets") instead.
rdd.toDF().registerTempTable("tweets")
)
ssc.start()
In the next paragraph, I have the SQL select statement:
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt
Which throws the following exception
org.apache.spark.sql.AnalysisException: Table not found: tweets;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:305)

I was able to get the above example running by making the following edits. I am not sure if this change was needed due to the Spark version upgrade (v1.6.3) or some other underlying architectural nuance I might be missing, but either way:
REF: SparkSQL error Table Not Found
In the second paragraph, instead of invoking the query directly through the %sql interpreter, try using the sqlContext as follows:
val my_df = sqlContext.sql("SELECT * from tweets LIMIT 5")
my_df.collect().foreach(println)
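For the original aggregation, a minimal sketch of the same approach (assuming the streaming paragraph above has already run and the 60-second window has produced at least one batch) is to issue the query through the same sqlContext that registered the temp table:
val counts = sqlContext.sql(
  "SELECT createdAt, count(1) AS cnt FROM tweets GROUP BY createdAt ORDER BY createdAt")
counts.collect().foreach(println)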

Related

Neo4j thinks that the password is a database

I am trying to integrate Spark and Neo4j. I am new to Neo4j. I have the following short Spark app
import com.typesafe.config._
import org.apache.spark.sql.SparkSession
import org.neo4j.spark._
object Neo4jStorer {
var conf :Config = null
def main(args: Array[String]): Unit = {
val spark = getSparkSession()
val sc = spark.sparkContext
val g = Neo4jGraph.loadGraph(sc, label1="a", relTypes=Seq("rel"), label2 = "b")
val vCount = g.toString
println("Count= " + vCount)
}
def getSparkSession(): SparkSession = {
SparkSession
.builder
.appName("SparkNeo4j")
.config("spark.neo4j.bolt.url", "neo4j://127.0.0.1:7687")
.config("spark.neo4j.bolt.user", "neo4j")
.config("spark.neo4j.bolt.password", "FakePassword")
.getOrCreate()
}
}
I used https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/ as an example for this code, as I am using Spark 3.0. When I run this I get the following:
20/10/17 14:36:36 ERROR LoadBalancer: Failed to update routing table for database 'FakePassword'. Current routing table: Ttl 1602963396190, currentTime 1602963396527, routers AddressSet=[], writers AddressSet=[], readers AddressSet=[], database 'FakePassword'.
org.neo4j.driver.exceptions.FatalDiscoveryException: Unable to get a routing table for database 'FakePassword' because this database does not exist
If I change the password I get an authentication error, and again the incorrect password is shown as being a database. I created a database with the name FakePassword and I still got the same error. Why is this happening and how can I fix it?
Also, when I tried to call g.vertices.count, as shown in the example I am following, I got a compilation error.
With the code below I am able to get data from a DataFrame into Neo4j, which is what I really wanted to do. This does not seem to be the ideal solution, as it uses foreach. I am open to improvements.
import com.typesafe.config._
import org.apache.spark.sql.SparkSession
import org.neo4j.driver.{AuthTokens, GraphDatabase, Session}
import org.neo4j.spark._
object StackoverflowAnswer {
def main(args: Array[String]): Unit = {
val spark = getSparkSession()
val sc = spark.sparkContext
import spark.implicits._
val df = sc.parallelize(List(1, 2, 3)).toDF
df.foreach(
row => {
val query = "CREATE (n:NumLable {num: " + row.get(0).toString +"})"
Neo4jSess.session.run(query)
()
}
)
}
def getSparkSession(): SparkSession = {
SparkSession
.builder
.appName("SparkNeo4j")
.getOrCreate()
}
}
object Neo4jSess {
/**
* Store a Neo4j session in an object so that it can be used by Spark
*/
var conf :Config = null
this.conf = ConfigFactory.load().getConfig("DeltaStorer")
val neo4jUrl: String = "bolt://127.0.0.1:7687"
val neo4jUser: String = "neo4j"
val neo4jPassword: String = "FakePassword"
val driver = GraphDatabase.driver(neo4jUrl, AuthTokens.basic(neo4jUser, neo4jPassword))
val session: Session = driver.session()
}
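One possible improvement (a sketch, not from the original post): instead of routing every row through the single shared session in Neo4jSess, each partition can open and close its own driver and session on the executor. The connection details below are the same placeholders used above.
import org.neo4j.driver.{AuthTokens, GraphDatabase}

df.rdd.foreachPartition { rows =>
  // One driver/session per partition, created on the executor
  val driver = GraphDatabase.driver("bolt://127.0.0.1:7687",
    AuthTokens.basic("neo4j", "FakePassword"))
  val session = driver.session()
  try {
    rows.foreach { row =>
      session.run("CREATE (n:NumLable {num: " + row.get(0).toString + "})")
    }
  } finally {
    session.close()
    driver.close()
  }
}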
Please try to update spark-defaults.conf:
spark.jars.packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
spark.neo4j.url bolt://XX.XXX.X.XXX:7687
spark.neo4j.user neo4j
spark.neo4j.password test
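With those settings in spark-defaults.conf, a minimal sketch of the original read (assuming the 2.4.5-M2 connector picks up the spark.neo4j.* values, so no spark.neo4j.bolt.* entries are needed in getSparkSession(); sc is the SparkContext from main above):
import org.neo4j.spark._

val g = Neo4jGraph.loadGraph(sc, label1 = "a", relTypes = Seq("rel"), label2 = "b")
println("Vertex count = " + g.vertices.count)   // count the loaded vertices via GraphX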

How to convert java Resultset into Spark dataframe

I am trying to use a PreparedStatement with JDBC. It returns a ResultSet object. I want to convert it into a Spark DataFrame.
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet, ResultSetMetaData}

object JDBCRead {
  val tableName: String = "TABLENAME"
  val url: String = "jdbc:teradata://TERADATA_URL/user=USERNAME,password=PWD,charset=UTF8,TYPE=FASTEXPORT,SESSIONS=10"
  val selectTable: String = "SELECT * FROM " + tableName + " sample 10"
  val con: Connection = DriverManager.getConnection(url)
  val pstmt2: PreparedStatement = con.prepareStatement(selectTable)
  val rs: ResultSet = pstmt2.executeQuery
  val rsmd: ResultSetMetaData = rs.getMetaData

  // rs.next() returns a Boolean and advances the cursor, so call it exactly once per row
  while (rs.next()) {
    for (i <- 1 to rsmd.getColumnCount) {
      print(" " + rs.getObject(i))
    }
    println()
  }
}
I want to call the above code from a Spark DataFrame so that I can load the data into a DataFrame and get the results faster, in a distributed way.
I must use PreparedStatement. I cannot use spark.jdbc.load, since Teradata's FASTEXPORT does not work with the JDBC load; it has to be used with a PreparedStatement.
How can I achieve this? How can I use a PreparedStatement along with a SELECT statement to load the data into a Spark DataFrame?
AFAIK there are 2 options available for this kind of requirement:
1. DataFrame 2. JdbcRDD
I'd suggest JdbcRDD (since you are so specific about PreparedStatement), which uses prepareStatement internally in its compute method. Therefore, you don't need to create a connection and maintain it explicitly (which is error prone).
Later you can convert the result into a DataFrame.
For speed you can configure the other parameters (bounds and number of partitions).
Example usage of JdbcRDD is below.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}

object jdbcRddExample {
  def main(args: Array[String]) {
    // Connection string
    val url = "jdbc:teradata://SERVER/demo"
    val username = "demo"
    val password = "Spark"
    Class.forName("com.teradata.jdbc.Driver").newInstance
    // Creating & configuring the Spark context
    val conf = new SparkConf().setAppName("App1").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    println("Start...")
    // Fetching data from the database: lowerBound = 3, upperBound = 5, numPartitions = 1
    val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url, username, password),
      "select first_name, last_name, gender from person limit ?,?",
      3, 5, 1, r => r.getString("last_name") + "," + r.getString("first_name"))
    // Displaying the content
    myRDD.foreach(println)
    // Saving the content to a text file
    myRDD.saveAsTextFile("c://jdbcrdd")
    println("End...")
  }
}
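To follow up on the "convert the result into a DataFrame" step, here is a minimal sketch (assuming Spark 1.x, hence SQLContext, and the mapRow function above that returns "last_name,first_name" strings); the Person case class is purely illustrative:
import org.apache.spark.sql.SQLContext

// Hypothetical case class matching the two columns produced by the mapRow function above
case class Person(lastName: String, firstName: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val personDF = myRDD
  .map(_.split(","))                  // "last,first" -> Array(last, first)
  .map(a => Person(a(0), a(1)))
  .toDF()
personDF.show()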

could not find implicit value for evidence parameter of type org.apache.spark.sql.Encoder[String]

I am trying to load a dataframe into a Hive table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
object SparkToHive {
def main(args: Array[String]) {
val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
val sparkSession = SparkSession.builder.master("local[2]").appName("Saving data into HiveTable using Spark")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
.config("spark.sql.warehouse.dir", warehouseLocation)
.getOrCreate()
**import sparkSession.implicits._**
val partfile = sparkSession.read.text("partfile").as[String]
val partdata = partfile.map(part => part.split(","))
case class Partclass(id:Int, name:String, salary:Int, dept:String, location:String)
val partRDD = partdata.map(line => Partclass(line(0).toInt, line(1), line(2).toInt, line(3), line(4)))
val partDF = partRDD.toDF()
partDF.write.mode(SaveMode.Append).insertInto("parttab")
}
}
I haven't executed it yet but I am getting the following error at this line:
import sparkSession.implicits._
could not find implicit value for evidence parameter of type org.apache.spark.sql.Encoder[String]
How can I fix this ?
Please move your case class Partclass outside of the SparkToHive object. It should be fine then.
Also, there are ** in your implicits import statement. Try:
import sparkSession.sqlContext.implicits._
The mistakes I made were:
1. The case class should be outside main and inside the object.
2. In the line val partfile = sparkSession.read.text("partfile").as[String], I used read.text("..") to get a file into Spark, whereas we can use read.textFile("...") instead.
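Putting both fixes together, a minimal sketch of the corrected structure (assuming Spark 2.x and a comma-separated partfile with id,name,salary,dept,location per line):
import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkToHive {
  // The case class lives outside main, so Spark can derive an Encoder for it
  case class Partclass(id: Int, name: String, salary: Int, dept: String, location: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("Saving data into HiveTable using Spark")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val partDF = spark.read.textFile("partfile")   // Dataset[String]
      .map(_.split(","))
      .map(p => Partclass(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
      .toDF()

    partDF.write.mode(SaveMode.Append).insertInto("parttab")
  }
}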

Spark: Read HBase in secured cluster

I have an easy task: I want to read HBase data in a Kerberos secured cluster.
So far I tried 2 approaches:
sc.newAPIHadoopRDD(): here I don't know how to handle the Kerberos authentication
create an HBase connection from the HBase API: here I don't really know how to convert the result into RDDs
Furthermore, there seem to be some HBase-Spark connectors, but somehow I didn't really manage to find them as a Maven artifact, and/or they require a fixed structure of the result (but I just need to have the HBase Result object, since the columns in my data are not fixed).
Do you have any example or tutorials or ....?
I appreciate any help and hints.
Thanks in advance!
I assume that you are using Spark + Scala + HBase.
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.security.UserGroupInformation // needed for the Kerberos login below
object SparkWithMyTable {
def main(args: Array[String]) {
//Initiate spark context with spark master URL. You can modify the URL per your environment.
val sc = new SparkContext("spark://ip:port", "MyTableTest")
val tableName = "myTable"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "list of cluster ip's")
conf.set("hbase.zookeeper"+ ".property.clientPort","2181");
conf.set("hbase.master", "masterIP:60000");
conf.set("hadoop.security.authentication", "kerberos");
conf.set("hbase.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("user#---", keyTabPath);
// Add local HBase conf
// conf.addResource(new Path("file://hbase/hbase-0.94.17/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
// create my table with column family
val admin = new HBaseAdmin(conf)
if(!admin.isTableAvailable(tableName)) {
print("Creating MyTable")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("cf1".getBytes()));
admin.createTable(tableDesc)
}else{
print("Table already exists!!")
val columnDesc = new HColumnDescriptor("cf1");
admin.disableTable(Bytes.toBytes(tableName));
admin.addColumn(tableName, columnDesc);
admin.enableTable(Bytes.toBytes(tableName));
}
//first put data into table
val myTable = new HTable(conf, tableName);
for (i <- 0 to 5) {
var p = new Put();
p = new Put(new String("row" + i).getBytes());
p.add("cf1".getBytes(), "column-1".getBytes(), new String(
"value " + i).getBytes());
myTable.put(p);
}
myTable.flushCommits();
//how to create rdd
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
//get the row count
val count = hBaseRDD.count()
print("HBase RDD count:"+count)
System.exit(0)
}
}
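Since the columns in your data are not fixed, you can work directly with the raw (ImmutableBytesWritable, Result) pairs. A minimal sketch (continuing from the hBaseRDD created above; the family and qualifier names match the put() calls in the example):
import org.apache.hadoop.hbase.util.Bytes

val rows = hBaseRDD.map { case (_, result) =>
  val rowKey = Bytes.toString(result.getRow)
  val value  = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("column-1")))
  (rowKey, value)   // map to plain serializable values before collecting
}
rows.take(5).foreach(println)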
Maven Artifact
<dependency>
  <groupId>it.nerdammer.bigdata</groupId>
  <artifactId>spark-hbase-connector_2.10</artifactId>
  <!-- Version can be changed as per your Spark version; I am using Spark 1.6.x -->
  <version>1.0.3</version>
</dependency>
You can also have a look at:
Spark play with HBase's Result object: handling HBase KeyValue and ByteArray in Scala with Spark -- Real World Examples
scan-that-works-on-kerberos
HBaseScanRDDExample.scala

Spark and Drools integration (Reading rules from a drl file)

I am working on a Spark program that takes input from an RDD and runs a few Drools rules on it, reading the rules from a drl file.
In the drl file I have made a rule that wherever the hz attribute of the object is 0, it should increment the counter attribute by 1.
I have no clue why that is not working: it gives me an output of 0 for all the data in the stream (yes, there is data with the hz attribute equal to 0, and yes, I can print all the attributes and verify that even for them the counter is 0).
I am using the KieSessionFactory class that I found in a GitHub project here: https://github.com/mganta/sprue/blob/master/src/main/java/com/cloudera/sprue/KieSessionFactory.java
But I am quite sure that this part is not where the problem is; it only reads from the drl file and applies the rules.
Below is my Scala code (I have marked the part where I think the problem lies, but please take a look at the drl file first):
package com.streams.Scala_Consumer
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.{ DStream, InputDStream, ConstantInputDStream }
import org.apache.spark.streaming.kafka.v09.KafkaUtils
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.sql.functions.avg
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka.producer._
import org.apache.kafka.common.serialization.{ Deserializer, Serializer }
import org.apache.kafka.common.serialization.StringSerializer
import org.kie.api.runtime.StatelessKieSession
//import KieSessionFactory.getKieSession;
//import Sensor
object scala_consumer extends Serializable {
// schema for sensor data
class Sensor(resid_1: String, date_1: String, time_1: String, hz_1: Double, disp_1: Double, flo_1: Double, sedPPM_1: Double, psi_1: Double, chlPPM_1: Double, counter_1: Int) extends Serializable
{
var resid = resid_1
var date = date_1
var time = time_1
var hz = hz_1
var disp = disp_1
var flo = flo_1
var sedPPM = sedPPM_1
var psi = psi_1
var chlPPM = chlPPM_1
var counter = counter_1
def IncrementCounter (param: Int) =
{
counter = counter + param
}
}
// function to parse line of sensor data into Sensor class
def parseSensor(str: String): Sensor = {
val p = str.split(",")
//println("printing p: " + p)
new Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble, 0)
}
var counter = 0
val timeout = 10 // Terminate after N seconds
val batchSeconds = 2 // Size of batch intervals
def main(args: Array[String]): Unit = {
val brokers = "maprdemo:9092" // not needed for MapR Streams, needed for Kafka
val groupId = "testgroup"
val offsetReset = "latest"
val batchInterval = "2"
val pollTimeout = "1000"
val topics = "/user/vipulrajan/streaming/original:sensor"
val topica = "/user/vipulrajan/streaming/fail:test"
val xlsFileName = "./src/main/Rules.drl"
val sparkConf = new SparkConf().setAppName("SensorStream").setMaster("local[1]").set("spark.testing.memory", "536870912")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.receiver.maxRate", Integer.toString(2000000))
.set("spark.streaming.kafka.maxRatePerPartition", Integer.toString(2000000));
val ssc = new StreamingContext(sparkConf, Seconds(batchInterval.toInt))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG ->
"org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG ->
"org.apache.kafka.common.serialization.StringDeserializer",
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> offsetReset,
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
"spark.kafka.poll.time" -> pollTimeout
)
val producerConf = new ProducerConf(
bootstrapServers = brokers.split(",").toList
)
val messages = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
val values: DStream[String] = messages.map(_._2)
println("message values received")
//values.print(10)
///////////*************************PART THAT COULD BE CAUSING A PROBLEM**************************/////////////
values.foreachRDD(x => try{
print("did 1\n") //markers for manual and minor debugging
val myData = x.mapPartitions(s => {s.map(sens => {parseSensor(sens)})})
//myData.collect().foreach(println)
//println(youData.date)
print("did 2\n")
val evalData = myData.mapPartitions(s => {
val ksession = KieSessionFactory.getKieSession(xlsFileName)
val retData = s.map(sens => {ksession.execute(sens); sens;})
retData
})
evalData.foreach(t => {println(t.counter)})
print("did 3\n")
}
catch{case e1: ArrayIndexOutOfBoundsException => println("exception in line " )})
///////////*************************PART THAT COULD BE CAUSING A PROBLEM**************************/////////////
println("filtered alert messages ")
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
}
}
the drl file
package droolsexample
import com.streams.Scala_Consumer.Sensor;
import scala.com.streams.Scala_Consumer.Sensor; //imported because my rules file lies in the src/main folder
//and code lies in src/main/scala
// declare any global variables here
dialect "java"
rule "Counter Incrementer"
when
sens : Sensor (hz == 0)
then
sens.IncrementCounter(1);
end
I have tried using an xls file instead of the drl file, and I have tried creating the class in Java and the object in Scala. I have tried a lot of other things, but all I get in the output is a warning:
6/06/27 16:38:30.462 Executor task launch worker-0 WARN AbstractKieModule: No files found for KieBase defaultKieBase
and when I print the counter values I get all zeroes. Anybody to the rescue?
When you are doing the spark-submit and passing your JAR for execution, please ensure that the other dependency JARs (from KIE etc.) are also included within the same JAR, and then run it with spark-submit.
An alternative is to have two separate projects: one with your Spark program and another with your KIE project. You will then have two JARs, and you run it with something like the following:
nohup spark-submit --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/log4j.properties" \
--queue abc \
--master yarn \
--deploy-mode cluster \
--jars drools-kie-project-0.0.1-SNAPSHOT.jar --class com.abc.DroolsSparkJob SparkcallingDrools-0.0.1-SNAPSHOT.jar \
-inputfile /user/hive/warehouse/abc/* -output /user/hive/warehouse/drools-Op > app.log &
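On the packaging point, a related sketch (assumptions: the .drl file is bundled inside the application JAR under src/main/resources together with a META-INF/kmodule.xml that declares a default stateless session). Loading the rules from the classpath instead of a filesystem path such as "./src/main/Rules.drl" avoids the "No files found for KieBase defaultKieBase" warning when that path does not exist on the executors:
import org.kie.api.KieServices

val kieServices = KieServices.Factory.get()
// Builds a container from the rule files found on the classpath (i.e. inside the JAR)
val kieContainer = kieServices.getKieClasspathContainer
val ksession = kieContainer.newStatelessKieSession()
// Execute the rules against a sample Sensor (the class defined in the question) with hz == 0
ksession.execute(new Sensor("res1", "2016-06-27", "16:38:30", 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0))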
