Confused by SparkContext import statements - apache-spark

I am trying to learn Apache Spark and cannot wrap my head around this:
import spark.SparkContext
import SparkContext._
Why do we need the second line, which almost looks like the first? And what does the '._' mean after SparkContext?

You do not need to execute the 2nd line, import SparkContext._ (what the ._ actually does is explained after the SimpleApp example below). Even with the old approach of, say, Spark 1.6.x for a self-contained Spark app, the following example from https://github.com/mk6502/spark-1.6-scala-boilerplate/blob/master/src/main/scala/HelloSpark.scala demonstrates this clearly and briefly:
import org.apache.spark.{SparkContext, SparkConf}
object HelloSpark {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("hello spark").setMaster("local"))
    val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
    println("count: ")
    println(rdd.count())
    sc.stop()
  }
}
In notebooks, the settings, configuration, and entry point are set up for you automatically.
As stated in my comment, move on to Spark 2.x or 3.x and look at SparkSession; see https://data-flair.training/forums/topic/sparksession-vs-sparkcontext-in-apache-spark/
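For orientation, a minimal Spark 2.x/3.x entry point looks roughly like the sketch below (the app name and master value are placeholders for local testing):
import org.apache.spark.sql.SparkSession

object HelloSparkSession {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point in Spark 2.x/3.x; it wraps the SparkContext
    val spark = SparkSession.builder()
      .appName("hello spark")   // placeholder app name
      .master("local[*]")       // placeholder master for local testing
      .getOrCreate()
    val rdd = spark.sparkContext.parallelize(Array(1, 2, 3, 4, 5))
    println("count: " + rdd.count())
    spark.stop()
  }
}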
In the Spark 1.6 guide on Self-Contained Applications we do indeed see the 2nd line, but nothing in the example refers explicitly to what it imports. E.g.
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
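As for the ._ itself: in Scala, import SparkContext._ is a wildcard import that pulls every member of the SparkContext companion object into scope. In older Spark releases that object held implicit conversions such as rddToPairRDDFunctions, which is what made methods like reduceByKey available on RDDs of pairs; since Spark 1.3 those implicits live on the RDD companion object, so the import is redundant. A minimal sketch of where it used to matter:
import org.apache.spark.{SparkConf, SparkContext}
// Before Spark 1.3 this wildcard import supplied the implicit conversion that
// adds reduceByKey (via PairRDDFunctions) to RDD[(K, V)]; today it is redundant.
import org.apache.spark.SparkContext._

object WildcardImportDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wildcard demo").setMaster("local"))
    val counts = sc.parallelize(Seq("to", "be", "or", "not", "to", "be"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // works because of the implicit conversion to PairRDDFunctions
    counts.collect().foreach(println)
    sc.stop()
  }
}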

Related

PySpark Streaming + Kafka Word Count not printing any results

This is my first interaction with Kafka and Spark Streaming, and I am trying to run the WordCount script given below. The script is pretty standard, as given in many online blogs. But for whatever reason, Spark Streaming is not printing the word counts. It does not throw any error; it just does not display the counts.
I have tested the topic via the console consumer, and the messages show up correctly there. I even tried to use foreachRDD to see the lines coming in, and that also does not show anything.
Thanks in advance!
Versions: kafka_2.11-0.8.2.2, Spark 2.2.1, spark-streaming-kafka-0-8-assembly_2.11-2.2.1
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql.context import SQLContext
sc = SparkContext(appName="PythonStreamingKafkaWordCount")
sc.setCheckpointDir('c:\Playground\spark\logs')
ssc = StreamingContext(sc, 10)
ssc.checkpoint('c:\Playground\spark\logs')
zkQuorum, topic = sys.argv[1:]
print(str(zkQuorum))
print(str(topic))
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kvs.map(lambda x: x[1])
print(kvs)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint(num=10)
ssc.start()
ssc.awaitTermination()
Producer Code:
import sys,os
from kafka import KafkaProducer
from kafka.errors import KafkaError
import time
producer = KafkaProducer(bootstrap_servers="localhost:9092")
topic = "KafkaSparkWordCount"
def read_file(fileName):
    with open(fileName) as f:
        print('started reading...')
        contents = f.readlines()
        for content in contents:
            future = producer.send(topic, content.encode('utf-8'))
            try:
                future.get(timeout=10)
            except KafkaError as e:
                print(e)
                break
            print('.', end='', flush=True)
            time.sleep(0.2)
    print('done')

if __name__ == '__main__':
    read_file('C:\\PlayGround\\spark\\BookText.txt')
How many cores do you use?
Spark Streaming needs at least two cores, one for the receiver and one for the processor.
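When running locally, that means using a master such as local[2] or local[*] rather than local. A sketch of a spark-submit invocation for this script (the script file name is a placeholder; the Zookeeper address and topic follow the producer code above):
spark-submit --master local[2] \
  --jars spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar \
  kafka_wordcount.py localhost:2181 KafkaSparkWordCount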

how to run, once my object is created in spark

Please help me out. I have installed Spark and am now trying to run the code below. The object is defined, but I am confused about what to do next.
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkConf
import org.apache.spark.SparkConf
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> object mapTest{
| def main(args: Array[String]) = {
| val spark = SparkSession.builder.appName("mapExample").master("local").getOrCreate()
| val data = spark.read.textFile("file:///home/parv/Desktop/1").rdd
| val mapFile = data.map(line => (line,line.length))
| mapFile.foreach(println)
| }
| }
defined object mapTest
Just call
mapTest.main
in the Scala shell.
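Concretely, main expects an argument array, so in the REPL the call would look like this (the empty array is just a placeholder, since args is unused in the object above):
scala> mapTest.main(Array.empty[String])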

spark 1.3 playing with hbase error

I'm trying to create an HBase table and insert into it using Spark Core (and Spark Streaming afterwards).
I succeeded in creating the table and adding data into it, even though I got this warning:
warning: Class org.apache.hadoop.hbase.classification.InterfaceAudience not found - continuing with a stub.
But when I try to count, I get an error. Can someone help me with the first warning, and with how I can add streaming data into this table?
My code is below:
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
val tableName = "ziedspark"
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
  print("Creating GHbase Table Creating GHbase Table Creating GHbase Table Creating GHbase Table ")
  val tableDesc = new HTableDescriptor(tableName)
  tableDesc.addFamily(new HColumnDescriptor("z1".getBytes()))
  tableDesc.addFamily(new HColumnDescriptor("z2".getBytes()))
  admin.createTable(tableDesc)
} else {
  print("Table already exists!!")
}
val myTable = new HTable(conf, tableName)
for (i <- 414540 to 414545) {
  var p = new Put(Bytes.toBytes("" + i))
  p.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes("" + (i * 5)))
  p.add("z1".getBytes(), "age".getBytes(), Bytes.toBytes("2016-07-01"))
  p.add("z2".getBytes(), "job".getBytes(), Bytes.toBytes("" + i))
  p.add("z2".getBytes(), "salary".getBytes(), Bytes.toBytes("" + i))
  myTable.put(p)
}
myTable.flushCommits()
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
//error here after creating the table count is not working
val count = hBaseRDD.count()
print("HBase RDD count:" + count)
System.exit(0)
Please see a similar question related to reading from HBase with Spark:
How to read from hbase using spark
The libraries mentioned there also give you what you need to read from and write to HBase.
Let me know if you need any more help on this.
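For the streaming part of the question, one common pattern is to write each micro-batch with TableOutputFormat. This is a sketch only: it assumes the same conf and tableName as above and a hypothetical DStream[(String, String)] of (rowKey, value) pairs named stream.
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// `stream` is a hypothetical DStream[(String, String)] of (rowKey, value) pairs
stream.foreachRDD { rdd =>
  // configure an output job against the existing HBase table
  val job = Job.getInstance(conf)
  job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tableName)
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  rdd.map { case (rowKey, value) =>
    val put = new Put(Bytes.toBytes(rowKey))
    put.add(Bytes.toBytes("z1"), Bytes.toBytes("name"), Bytes.toBytes(value))
    (new ImmutableBytesWritable, put)
  }.saveAsNewAPIHadoopDataset(job.getConfiguration)
}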

Spark Streaming - updateStateByKey and caching data

I have a problem using the updateStateByKey function and caching some big data at the same time. Here is an example.
Let's say I get data (lastname, age) from Kafka. I want to keep the current age for every person, so I use updateStateByKey. I also want to know the name of every person, so I join the output with an external table (lastname, name), e.g. from Hive. Let's assume it's a really big table, so I don't want to load it in every batch. And there's the problem.
Everything works well when I load the table in every batch, but when I try to cache the table, the StreamingContext doesn't start. I also tried to use registerTempTable and later join the data with SQL, but I got the same error.
It seems the problem is the checkpoint needed by updateStateByKey. When I remove updateStateByKey and leave the checkpoint, I still get the error, but when I remove both it works.
Error I'm getting: pastebin
Here is code:
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
# function to keep actual state
def updateFunc(channel, actualChannel):
    if (actualChannel is None or not channel is None):
        try:
            actualChannel = channel[-1]
        except Exception:
            pass
    if channel is None:
        channel = actualChannel
    return actualChannel

def splitFunc(row):
    row = row.strip()
    lname, age = row.split()
    return (lname, age)

def createContext(brokers, topics):
    # some conf
    conf = SparkConf().setAppName(appName).set("spark.streaming.stopGracefullyOnShutdown", "true").set("spark.dynamicAllocation.enabled", "false").\
        set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.sql.shuffle.partitions", '100')
    # create SparkContext
    sc = SparkContext(conf=conf)
    # create HiveContext
    sqlContext = HiveContext(sc)
    # create Streaming Context
    ssc = StreamingContext(sc, 5)
    # read big_df and cache (does not work, the StreamingContext does not start)
    big_df = sqlContext.sql('select lastname,name from `default`.`names`')
    big_df.cache().show(10)
    # join table
    def joinTable(time, rdd):
        if rdd.isEmpty() == False:
            df = HiveContext.getOrCreate(SparkContext.getOrCreate()).createDataFrame(rdd, ['lname', 'age'])
            # read big_df (works)
            #big_df = HiveContext.getOrCreate(SparkContext.getOrCreate()).sql('select lastname,name from `default`.`names`')
            # join DMS
            df2 = df.join(big_df, df.lname == big_df.lastname, "left_outer")
            return df2.map(lambda row: row)
    # streaming
    kvs = KafkaUtils.createDirectStream(ssc, [topics], {'metadata.broker.list': brokers})
    kvs.map(lambda (k, v): splitFunc(v)).updateStateByKey(updateFunc).transform(joinTable).pprint()
    return ssc

if __name__ == "__main__":
    appName = "SparkCheckpointUpdateSate"
    if len(sys.argv) != 3:
        print("Usage: SparkCheckpointUpdateSate.py <broker_list> <topic>")
        exit(-1)
    brokers, topics = sys.argv[1:]
    # getOrCreate Context
    checkpoint = 'SparkCheckpoint/checkpoint'
    ssc = StreamingContext.getOrCreate(checkpoint, lambda: createContext(brokers, topics))
    # start streaming
    ssc.start()
    ssc.awaitTermination()
Can you tell me how to properly cache data when checkpoint is enabled? Maybe there is some workaround I don't know.
Spark ver. 1.6
I got this working by using a lazily instantiated global instance of big_df, similar to what is done in recoverable_network_wordcount.py:
def getBigDf():
    if ('bigdf' not in globals()):
        globals()['bigdf'] = HiveContext.getOrCreate(SparkContext.getOrCreate()).sql('select lastname,name from `default`.`names`')
    return globals()['bigdf']

def createContext(brokers, topics):
    ...
    def joinTable(time, rdd):
        ...
        # read big_df (works)
        big_df = getBigDf()
        # join DMS
        df2 = df.join(big_df, df.lname == big_df.lastname, "left_outer")
        return df2.map(lambda row: row)
    ...
It seems that in streaming, all data must be cached inside the streaming processing itself, not before it starts.

Zeppelin: multiple SparkContexts issue

I tried to run the following simple code in Zeppelin:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
System.clearProperty("spark.driver.port")
System.clearProperty("spark.hostPort")
def maxWaitTimeMillis: Int = 20000
def actuallyWait: Boolean = false
val conf = new SparkConf().setMaster("local[2]").setAppName("Streaming test")
var sc = new SparkContext(conf)
def batchDuration: Duration = Seconds(1)
val ssc = new StreamingContext(sc, batchDuration)
This is the output in Zeppelin:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
calculateRMSE: (output: org.apache.spark.streaming.dstream.DStream[(Double, Double)], n: org.apache.spark.streaming.dstream.DStream[Long])Double
res50: String = null
res51: String = null
maxWaitTimeMillis: Int
actuallyWait: Boolean
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf#1daf4e42
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:356)
org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:150)
org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:525)
org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:92)
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:345)
org.apache.zeppelin.scheduler.Job.run(Job.java:176)
org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$1.apply(SparkContext.scala:2257)
at org.apache.spark.SparkContext$$anonfun$assertNoOtherContextIsRunning$1.apply(SparkContext.scala:2239)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2239)
at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2312)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:91)
Why does it say that I have multiple SparkContexts running? If I do not add the line var sc = new SparkContext(conf), then sc is null, so it's not created.
You can't use multiple SparkContexts in Zeppelin. This is one of its limitations, since the Spark interpreter effectively provides a hook to a single SparkContext that it creates for you (exposed as sc).
If you wish to set up your SparkConf in Zeppelin, the easiest way is to set those properties in the Interpreter menu and then restart the interpreter so that they are picked up by its SparkContext.
Now you can go back to your notebook and test your code:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
def maxWaitTimeMillis: Int = 20000
def actuallyWait: Boolean = false
def batchDuration: Duration = Seconds(1)
val ssc = new StreamingContext(sc, batchDuration)
More on that here.
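If you do need a SparkConf-driven handle programmatically inside a note, a possible alternative sketch is SparkContext.getOrCreate, which returns the already-running context instead of constructing a second one (the app name below is a placeholder; conf settings are ignored when a context already exists):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Streaming test")   // no setMaster: the interpreter settings apply
val context = SparkContext.getOrCreate(conf)              // reuses Zeppelin's existing SparkContext
val ssc = new StreamingContext(context, Seconds(1))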
