How to create InputDStream with offsets in PySpark (using KafkaUtils.createDirectStream)?

How do I use KafkaUtils.createDirectStream with the offsets for a particular topic in PySpark?

If you want to create an RDD from records in a Kafka topic, use a static list of OffsetRange objects.
First, import what you need:
from pyspark.streaming.kafka import KafkaUtils, OffsetRange
Then you create a dictionary of Kafka Brokers
kafkaParams = {"metadata.broker.list": "host1:9092,host2:9092,host3:9092"}
Then you create your offsets object
start = 0
until = 10
partition = 0
topic = 'topic'
offset = OffsetRange(topic,partition,start,until)
offsets = [offset]
Finally you create the RDD:
kafkaRDD = KafkaUtils.createRDD(sc, kafkaParams,offsets)
To create a stream with offsets, you need to do the following:
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
from pyspark.streaming import StreamingContext
Then you create your StreamingContext from your SparkContext:
ssc = StreamingContext(sc, 1)
Next we set up all of our parameters
kafkaParams = {"metadata.broker.list": "host1:9092,host2:9092,host3:9092"}
start = 0
partition = 0
topic = 'topic'
Then we create our fromOffset Dictionary
topicPartition = TopicAndPartition(topic, partition)
# Note that on Python 2 the int must be cast to long; on Python 3 use int(start) instead
fromOffset = {topicPartition: long(start)}
Finally we create the Stream
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic],kafkaParams,
fromOffsets=fromOffset)
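When a topic has several partitions, the fromOffsets dictionary needs one entry per partition. The helper below is a minimal sketch of that idea and is not part of the KafkaUtils API: build_from_offsets and its make_key parameter are hypothetical names. By default it keys the dictionary with plain (topic, partition) tuples so it can be tried without Spark; in a real job you would presumably pass TopicAndPartition as make_key. It uses int, assuming Python 3.

```python
def build_from_offsets(topic, start_offsets, make_key=lambda t, p: (t, p)):
    """Build a {partition key: starting offset} dictionary for one topic.

    topic         -- Kafka topic name
    start_offsets -- dict mapping partition number -> starting offset
    make_key      -- hypothetical hook; pass TopicAndPartition in a Spark job
    """
    return {
        make_key(topic, partition): int(offset)
        for partition, offset in sorted(start_offsets.items())
    }


if __name__ == "__main__":
    # Two partitions of the same topic, each starting at a different offset
    print(build_from_offsets("topic", {0: 0, 1: 42}))
```

With pyspark available, the result of build_from_offsets('topic', {0: 0, 1: 42}, make_key=TopicAndPartition) could then be passed as the fromOffsets argument.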

You can do:
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
topic = "test"
brokers = "localhost:9092"
partition = 0
start = 0
topicpartition = TopicAndPartition(topic, partition)
fromoffset = {topicpartition: int(start)}
# spark_streaming is assumed to be an existing StreamingContext
kafkaDStream = KafkaUtils.createDirectStream(spark_streaming, [topic],
    {"metadata.broker.list": brokers}, fromOffsets=fromoffset)
Note: this is for Spark 2.2.0 and Python 3.6.

Related

How to incrementally load a pipeline model, fit it with new data, and save it using Spark?

Any pointers on how to incrementally train and build the model, and get a prediction on a single element?
A web application writes data to a CSV file in a shared path. The ML application reads the data, loads the model, fits it on the data, saves the model, and transforms the test data. (This is supposed to happen in a loop.)
But when loading the saved model the second time, I get the following exception
(I am using a MinMax scaler to normalize the data):
Exception in thread "main" java.lang.IllegalArgumentException: Output column features_intermediate already exists.
Any pointers would be much appreciated. Thank you.
object RunAppPooling {
def main(args: Array[String]): Unit = { // start the spark session
val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
.set("spark.broadcast.compress", "false")
.setAppName("local-spark")
val spark = SparkSession
.builder()
.config(conf)
.getOrCreate()
val filePath = "src/main/resources/train.csv"
val modelPath = "file:///home/vagrant/custom.model"
val schema = StructType(
Array(
StructField("IDLE_COUNT", IntegerType),
StructField("TIMEOUTS", IntegerType),
StructField("ACTIVE_COUNT", IntegerType),
StructField("FACTOR_LOAD", DoubleType)))
while(true){
// read the raw data
val df_raw = spark
.read
.option("header", "true")
.schema(schema)
.csv(filePath)
df_raw.show()
println(df_raw.count())
// fill all na values with 0
val df = df_raw.na.fill(0)
df.printSchema()
// create the feature vector
val vectorAssembler = new VectorAssembler()
.setInputCols(Array("IDLE_COUNT", "TIMEOUTS", "ACTIVE_COUNT" ))
.setOutputCol("features_intermediate")
var lr1: PipelineModel = null
try {
lr1 = PipelineModel.load(modelPath)
} catch {
case ie: InvalidInputException => println(ie.getMessage)
}
import org.apache.spark.ml.feature.StandardScaler
val scaler = new StandardScaler().setWithMean(true).setWithStd(true).setInputCol("features_intermediate").setOutputCol("features")
var pipeline: Pipeline = null
if (lr1 == null) {
val lr =
new LinearRegression()
.setMaxIter(100)
.setRegParam(0.1)
.setElasticNetParam(0.8)
.setLabelCol("FACTOR_LOAD") // setting label column
// create the pipeline with the steps
pipeline = new Pipeline().setStages(Array( vectorAssembler, scaler, lr))
} else {
pipeline = new Pipeline().setStages(Array(vectorAssembler, scaler, lr1))
}
// create the model following the pipeline steps
val cvModel = pipeline.fit(df)
// save the model
cvModel.write.overwrite.save(modelPath)
var testschema = StructType(
Array(
StructField("PACKAGE_KEY", StringType),
StructField("IDLE_COUNT", IntegerType),
StructField("TIMEOUTS", IntegerType),
StructField("ACTIVE_COUNT", IntegerType)
))
val df_raw1 = spark
.read
.option("header", "true")
.schema(testschema)
.csv("src/main/resources/test_pooling.csv")
// fill all na values with 0
val df1 = df_raw1.na.fill(0)
val extracted = cvModel.transform(df1) //.toDF("prediction")
import org.apache.spark.sql.functions._
val test = extracted.select(mean(df("FACTOR_LOAD"))).collect()
println(test.apply(0))
}
}
}
I figured out a way to at least get past the exception, though I am not sure whether it is the right approach. When creating the pipeline after loading the model, set the stages to only the loaded model, because the model has already been defined with the respective schema. I am not sure whether this will normalize the new data or not.
pipeline = new Pipeline().setStages(Array( lr1))

Creating a stream from a text file in Pyspark

I'm getting the following error when I try to create a stream from a text file in Pyspark:
TypeError: unbound method textFileStream() must be called with StreamingContext instance as first argument (got str instance instead)
I don't want to use SparkContext directly because I get another error; to remove that error I have to use SparkSession.
My code:
import sys
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.mllib.stat import Statistics
if __name__ == "__main__":
    spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, 5)
    input_path1 = sys.argv[1]
    input_path2 = sys.argv[2]
    ds1 = ssc.textFileStream(input_path1)
    lines1 = ds1.map(lambda x1: x1[1])
    windowedds1 = lines1.flatMap(lambda line1: line1.strip().split("\n")).map(lambda strelem1: float(strelem1)).window(5,10)
    ds2 = ssc.textFileStream(input_path2)
    lines2 = ds2.map(lambda x2: x2[1])
    windowedds2 = lines2.flatMap(lambda line2: line2.strip().split("\n")).map(lambda strelem2: float(strelem2)).window(5,10)
    result = Statistics.corr(windowedds1,windowedds2, method="pearson")
    if result > 0.7:
        print("ds1 and ds2 are correlated!!!")
    spark.stop()
Thank you!
The error means that textFileStream was called on the StreamingContext class itself rather than on an instance. You have to create a StreamingContext object first and then use it to call textFileStream:
spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 1)
ds = ssc.textFileStream(input_path)
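The difference between calling the method on the class and calling it on an instance can be reproduced without Spark. The sketch below uses a hypothetical stand-in class, FakeStreamingContext, that only mimics the call pattern; it is not the real StreamingContext. On Python 3 the wording of the TypeError differs from the Python 2 "unbound method" message quoted in the question, but the cause is the same.

```python
class FakeStreamingContext(object):
    """Hypothetical stand-in for StreamingContext, used only to show the call pattern."""

    def textFileStream(self, directory):
        return "stream over " + directory


def call_on_class(path):
    # Wrong: the method is looked up on the class, so `path` is taken as `self`
    # and the real `directory` argument is missing -> TypeError
    return FakeStreamingContext.textFileStream(path)


def call_on_instance(path):
    # Right: create an instance first, then call the method on it
    ssc = FakeStreamingContext()
    return ssc.textFileStream(path)


if __name__ == "__main__":
    print(call_on_instance("/data/in"))
    try:
        call_on_class("/data/in")
    except TypeError as err:
        print("TypeError:", err)
```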

Averaging data using Apache Spark Streaming

I'm using Python.
I'm receiving JSON dictionaries through Kafka into a Spark stream.
The JSON looks like {"a":10}{"a":20} (one dict is one Kafka message). The key is always "a", but the number of dictionaries is not known in advance.
Now I want the average of 10 and 20 in the above case.
As far as I know, averageByKey may be useful,
but I don't know how to use it.
Any help would be great!
Thank you for reading.
Update
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
def createContext():
    sc = SparkContext(appName="PythonSparkStreamingKafka_RM_02")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 60)
    kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming-consumer', {'':1})
    raw = kafkaStream.map(lambda kafkaS: kafkaS[1])
    clean = raw.map(lambda v: json.loads(v))
    print (dir(clean))
    clean.pprint()
    add=clean.map(lambda xs: ('Total',xs['hello'])).reduceByKey(lambda a, b: a+b)
    add.pprint()
    count_var = clean.count()
    count_var.pprint()
    average = add.map(lambda tpl: tpl[1]/float(60))
    average.pprint()
    return ssc
if __name__ == "__main__":
    ssc = StreamingContext.getOrCreate('/path/checkpoint_v'+sys.argv[1],lambda: createContext())
    ssc.start()
    ssc.awaitTermination()
Now, in the above program, add.pprint() gives output like the following.
The stream is like:
{u'hello': 26}
{u'hello': 28}
{u'hello': 31}
{u'hello': 35}
{u'hello': 40}
{u'hello': 46}
>('Total',206)
and count_var.pprint() gives, for example:
> 6
The question is about the line below:
> average = add.map(lambda tpl: tpl[1]/float(60))
I want to use the value printed by count_var.pprint() (which is 6) instead of the static value 60.
So how can I use a stream object as an integer in the above operation?
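First, you need to map each event to some processable type, for example a tuple. Then you can use the classic "map -> reduceByKey -> map" pattern to calculate the average like this:
import json
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
# spark is an existing SparkSession; client_configuration is your Kafka parameter dict
ssc = StreamingContext(spark.sparkContext, 1)
dstream = KafkaUtils.createDirectStream(ssc, ['topic'], client_configuration,
                                        valueDecoder=lambda s: json.loads(s.decode('ascii')))
def map_event(raw):
    # raw is a (key, value) pair; emit (dict key, (count, value))
    item = list(raw[1].items())[0]
    return (item[0], (1, item[1]))
# Sum the counts and the values per key, then divide to get the average
dstream.map(map_event).reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .map(lambda r: (r[0], float(r[1][1]) / r[1][0])) \
    .pprint()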
ssc.start()
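The reason for carrying a (count, sum) pair is that the average needs both numbers at the end, and pairs can be added together in any order. Below is a plain-Python sketch of the same reduction for a single batch, with no Spark involved; average_by_key is a hypothetical helper name, and the input messages are assumed to be one-key dictionaries as in the question.

```python
def map_event(message):
    # message is a one-key dict such as {"a": 10}; emit (key, (count, value))
    key, value = next(iter(message.items()))
    return key, (1, value)


def average_by_key(messages):
    """Hypothetical helper: average the values per key for one batch of messages."""
    totals = {}
    for key, (count, value) in map(map_event, messages):
        prev_count, prev_sum = totals.get(key, (0, 0))
        # Same combine step as the reduceByKey lambda: add counts, add sums
        totals[key] = (prev_count + count, prev_sum + value)
    # Final map step: divide the sum by the count
    return {key: total / float(count) for key, (count, total) in totals.items()}


if __name__ == "__main__":
    print(average_by_key([{"a": 10}, {"a": 20}]))
```

For the six "hello" values shown in the question, the per-key pair ends up as (6, 206), so the divisor is the real message count rather than a fixed 60.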

value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]

I am getting a compilation error when converting the pre-LDA transformation to a DataFrame using Scala in Spark 2.0. The specific code that throws the error is below:
val documents = PreLDAmodel.transform(mp_listing_lda_df)
.select("docId","features")
.rdd
.map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF()
The complete compilation error is:
Error:(132, 8) value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Here is the complete code:
import java.io.FileInputStream
import java.sql.{DriverManager, ResultSet}
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA => oldLDA}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
object MPClassificationLDA {
/*Start: Configuration variable initialization*/
val props = new Properties
val fileStream = new FileInputStream("U:\\JIRA\\MP_Classification\\target\\classes\\mpclassification.properties")
props.load(fileStream)
val mpExtract = props.getProperty("mpExtract").toString
val shard6_db_server_name = props.getProperty("shard6_db_server_name").toString
val shard6_db_user_id = props.getProperty("shard6_db_user_id").toString
val shard6_db_user_pwd = props.getProperty("shard6_db_user_pwd").toString
val mp_output_file = props.getProperty("mp_output_file").toString
val spark_warehouse_path = props.getProperty("spark_warehouse_path").toString
val rf_model_file_path = props.getProperty("rf_model_file_path").toString
val windows_hadoop_home = props.getProperty("windows_hadoop_home").toString
val lda_vocabulary_size = props.getProperty("lda_vocabulary_size").toInt
val pre_lda_model_file_path = props.getProperty("pre_lda_model_file_path").toString
val lda_model_file_path = props.getProperty("lda_model_file_path").toString
fileStream.close()
/*End: Configuration variable initialization*/
val conf = new SparkConf().set("spark.sql.warehouse.dir", spark_warehouse_path)
def main(arg: Array[String]): Unit = {
//SQL Query definition and parameter values as parameter upon executing the Object
val cont_id = "14211599"
val top = "100000"
val start_date = "2016-05-01"
val end_date = "2016-06-01"
val mp_spark = SparkSession
.builder()
.master("local[*]")
.appName("MPClassificationLoadLDA")
.config(conf)
.getOrCreate()
MPClassificationLDACalculation(mp_spark, cont_id, top, start_date, end_date)
mp_spark.stop()
}
private def MPClassificationLDACalculation
(mp_spark: SparkSession
,cont_id: String
,top: String
,start_date: String
,end_date: String
): Unit = {
//DB connection definition
def createConnection() = {
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver").newInstance();
DriverManager.getConnection("jdbc:sqlserver://" + shard6_db_server_name + ";user=" + shard6_db_user_id + ";password=" + shard6_db_user_pwd);
}
//DB Field Names definition
def extractvalues(r: ResultSet) = {
Row(r.getString(1),r.getString(2))
}
//Prepare SQL Statement with parameter value replacement
val query = """SELECT docId = audt_id, text = auction_title FROM brands6.dbo.uf_ds_marketplace_classification_listing(#cont_id, #top, '#start_date', '#end_date') WHERE ? < ? OPTION(RECOMPILE);"""
.replaceAll("#cont_id", cont_id)
.replaceAll("#top", top)
.replaceAll("#start_date", start_date)
.replaceAll("#end_date", end_date)
.stripMargin
//Connect to Source DB and execute the Prepared SQL Steatement
val mpDataRDD = new JdbcRDD(mp_spark.sparkContext
,createConnection
,query
,lowerBound = 0
,upperBound = 10000000
,numPartitions = 1
,mapRow = extractvalues)
val schema_string = "docId,text"
val fields = StructType(schema_string.split(",")
.map(fieldname => StructField(fieldname, StringType, true)))
//Create Data Frame using format identified through schema_string
val mpDF = mp_spark.createDataFrame(mpDataRDD, fields)
mpDF.collect()
val mp_listing_tmp = mpDF.selectExpr("cast(docId as long) docId", "text")
mp_listing_tmp.printSchema()
println(mp_listing_tmp.first)
val mp_listing_lda_df = mp_listing_tmp.withColumn("docId", mp_listing_tmp("docId"))
mp_listing_lda_df.printSchema()
val tokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("rawTokens")
.setMinTokenLength(2)
val stopWordsRemover = new StopWordsRemover()
.setInputCol("rawTokens")
.setOutputCol("tokens")
val vocabSize = 4000
val countVectorizer = new CountVectorizer()
.setVocabSize(vocabSize)
.setInputCol("tokens")
.setOutputCol("features")
val PreLDApipeline = new Pipeline()
.setStages(Array(tokenizer, stopWordsRemover, countVectorizer))
val PreLDAmodel = PreLDApipeline.fit(mp_listing_lda_df)
//comment out after saving it the first time
PreLDAmodel.write.overwrite().save(pre_lda_model_file_path)
val documents = PreLDAmodel.transform(mp_listing_lda_df)
.select("docId","features")
.rdd
.map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF()
//documents.printSchema()
val numTopics: Int = 20
val maxIterations: Int = 100
//note the FeaturesCol need to be set
val lda = new LDA()
.setOptimizer("em")
.setK(numTopics)
.setMaxIter(maxIterations)
.setFeaturesCol(("_2"))
val vocabArray = PreLDAmodel.stages(2).asInstanceOf[CountVectorizerModel].vocabulary
}
}
I think it is related to conflicts in the imports section of the code. I'd appreciate any help.
Two things need to be done:
Import implicits: Note that this should be done only after an instance of org.apache.spark.sql.SQLContext is created. It should be written as:
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
With a SparkSession, as in the question, the equivalent is import mp_spark.implicits._ placed after the session is available.
Move the case class outside of the method: a case class that you use to define the schema of the DataFrame should be defined outside of the method that needs it. You can read more about it here: https://issues.scala-lang.org/browse/SI-6649

Why is my Spark streaming app so slow?

I have a cluster with 4 nodes: 3 Spark nodes and 1 Solr node. Each node has an 8-core CPU, 32 GB of memory, and SSD storage. I use Cassandra as my database. My data amounts to 22 GB after 6 hours, and I now have around 3.4 million rows, which should be read in under 5 minutes.
But already it can't complete the task in this amount of time. My future plan is to read 100 million rows in under 5 minutes. I am not sure what I can increase or do better to achieve this result now, as well as to achieve my future goal. Is that even possible, or would it be better to use Spark for the real-time analysis and use, for example, Hadoop for longer-tail data (older than 1 day or a couple of hours)?
Thanks a lot!
Here is my Spark app code:
import sys
import json
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming.kafka import KafkaUtils
from datetime import datetime, timedelta
from dateutil.parser import parse
from cassandra.cluster import Cluster
import pytz
from dateutil.tz import tzutc
tz = pytz.timezone('')
appname = str(sys.argv[1])
source = str(sys.argv[2])
cluster = Cluster(['localhost']);
session_statis = cluster.connect('keyspace')
def read_json(x):
    try:
        y = json.loads(x)
    except:
        y = 0
    return y
def TransformInData(x):
    try:
        body = json.loads(x['body'])
        return (body['articles'])
    except:
        return 0
def axesTransformData(x):
    try:
        body = json.loads(x['body'])
        return (body)
    except:
        return 0
def storeDataToCassandra(rdd):
rdd_cassandra =rdd.map(lambda x:(x[0],(x[0],x[1]['thumbnail'], x[1]['title'], x[1]['url'], datetime.strptime(parse(x[1]['created_at']).strftime('%Y-%m-%d %H:%M:%S'), "%Y-%m-%d %H:%M:%S"),source, x[1]['category'] if x[1]['category'] else '', x[1]['channel'],x[1]['genre']))) \
.subtract(articles)
rdd_article = rdd_cassandra.map(lambda x:Row(id=x[1][0],source=x[1][5],thumbnail=x[1][1],title=x[1][2],url=x[1][3],created_at=x[1][4],category=x[1][6],channel=x[1][7],genre=x[1][8]))
rdd_schedule = rdd_cassandra.map(lambda x:Row(source=x[1][5],type='article',scheduled_for=x[1][4]+timedelta(minutes=5),id=x[1][0]))
rdd_article_by_created_at = rdd_cassandra.map(lambda x:Row(source=x[1][5],created_at=x[1][4],article=x[1][0]))
rdd_article_by_url = rdd_cassandra.map(lambda x:Row(url=x[1][3],article=x[1][0]))
if rdd_article.count()>0:
result_rdd_article = sqlContext.createDataFrame(rdd_article)
result_rdd_article.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
if rdd_schedule.count()>0:
result_rdd_schedule = sqlContext.createDataFrame(rdd_schedule)
result_rdd_schedule.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
if rdd_article_by_created_at.count()>0:
result_rdd_article_by_created_at = sqlContext.createDataFrame(rdd_article_by_created_at)
result_rdd_article_by_created_at.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
if rdd_article_by_url.count()>0:
result_rdd_article_by_url = sqlContext.createDataFrame(rdd_article_by_url)
result_rdd_article_by_url.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
#
def axesStoreToCassandra(rdd):
axes_rdd = rdd.map(lambda x:Row(article=x[1]['id'],at=datetime.now(),comments=x[1]['comments'],likes=x[1]['attitudes'],reads=0,shares=x[1]['reposts']))
if axes_rdd.count()>0:
result_axes_rdd = sqlContext.createDataFrame(axes_rdd)
result_axes_rdd.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
def joinstream(rdd):
article_channels = articlestat.join(channels).map(lambda x:(x[1][0]['id'],{'id':x[1][0]['id'],'thumbnail':x[1][0]['thumbnail'],'title':x[1][0]['title'],'url':x[1][0]['url'],'created_at':x[1][0]['created_at'],'source':x[1][0]['source'],'genre':x[1][0]['genre'],'category':x[1][1]['category'],'author':x[1][1]['author']}))
speed_rdd = axes.map(lambda x:(x.article,[[x.at,x.comments,x.likes,x.reads,x.shares]])) \
.reduceByKey(lambda x,y:x+y) \
.map(lambda x:(x[0],sorted(x[1],key=lambda y:y[0],reverse = True)[0],sorted(x[1],key=lambda y:y[0],reverse = True)[1]) if len(x[1])>=2 else (x[0],sorted(x[1],key=lambda y:y[0],reverse = True)[0],[sorted(x[1],key=lambda y:y[0],reverse = True)[0][0]-timedelta(seconds=300),0,0,0,0])) \
.filter(lambda x:(x[1][0]-x[2][0]).seconds>0) \
.map(lambda x:(x[0],{'id':x[0],'comments':x[1][1],'likes':x[1][2],'reads':x[1][3],'shares':x[1][4],'speed':int(5*288*((x[1][4]-x[2][4])/((x[1][0]-x[2][0]).seconds/60.0)))})) \
.filter(lambda x:x[1]['speed']>=0) \
.filter(lambda x:x[1]['shares']>0)
statistics = article_channels.join(speed_rdd) \
.map(lambda x:{'id':x[1][0]['id'],'thumbnail':x[1][0]['thumbnail'],'title':x[1][0]['title'],'url':x[1][0]['url'],'created_at':x[1][0]['created_at'],'source':x[1][0]['source'],'category':x[1][0]['category'],'author':x[1][0]['author'],'genre':x[1][0]['genre'],'comments':x[1][1]['comments'],'likes':x[1][1]['likes'],'reads':x[1][1]['reads'],'shares':x[1][1]['shares'],'speed':x[1][1]['speed']})
timeone=datetime.now()-timedelta(hours=1)
timethree = datetime.now()-timedelta(hours=3)
timesix = datetime.now()-timedelta(hours=6)
timetwelve = datetime.now()-timedelta(hours=12)
timetwentyfour = datetime.now()-timedelta(hours=24)
result1 = statistics.filter(lambda x:x['created_at']+timedelta(hours=8)>=timeone).map(lambda x:Row(timespan='1',source=source,id=x['id'],title=x['title'],thumbnail=x['thumbnail'],url=x['url'],created_at=x['created_at']+timedelta(hours=8),genre=x['genre'],reads=0,likes=x['likes'],comments=x['comments'],shares=x['shares'],speed=x['speed'],category=x['category'],author=x['author']))
result3 = statistics.filter(lambda x:x['created_at']+timedelta(hours=8)>=timethree and x['created_at']+timedelta(hours=8)<=timeone).map(lambda x:Row(timespan='3',source=source,id=x['id'],title=x['title'],thumbnail=x['thumbnail'],url=x['url'],created_at=x['created_at']+timedelta(hours=8),genre=x['genre'],reads=0,likes=x['likes'],comments=x['comments'],shares=x['shares'],speed=x['speed'],category=x['category'],author=x['author']))
result6 = statistics.filter(lambda x:x['created_at']+timedelta(hours=8)>=timesix and x['created_at']+timedelta(hours=8)<=timethree).map(lambda x:Row(timespan='6',source=source,id=x['id'],title=x['title'],thumbnail=x['thumbnail'],url=x['url'],created_at=x['created_at']+timedelta(hours=8),genre=x['genre'],reads=0,likes=x['likes'],comments=x['comments'],shares=x['shares'],speed=x['speed'],category=x['category'],author=x['author']))
result12 = statistics.filter(lambda x:x['created_at']+timedelta(hours=8)>=timetwelve and x['created_at']+timedelta(hours=8)<=timesix).map(lambda x:Row(timespan='12',source=source,id=x['id'],title=x['title'],thumbnail=x['thumbnail'],url=x['url'],created_at=x['created_at']+timedelta(hours=8),genre=x['genre'],reads=0,likes=x['likes'],comments=x['comments'],shares=x['shares'],speed=x['speed'],category=x['category'],author=x['author']))
result24 = statistics.filter(lambda x:x['created_at']+timedelta(hours=8)>=timetwentyfour and x['created_at']+timedelta(hours=8)<=timetwelve).map(lambda x:Row(timespan='24',source=source,id=x['id'],title=x['title'],thumbnail=x['thumbnail'],url=x['url'],created_at=x['created_at']+timedelta(hours=8),genre=x['genre'],reads=0,likes=x['likes'],comments=x['comments'],shares=x['shares'],speed=x['speed'],category=x['category'],author=x['author']))
if result1.count()>0:
session_statis.execute('DELETE FROM tablename WHERE source = %s and timespan= %s', (source,'1'))
resultschema1 = sqlContext.createDataFrame(result1)
resultschema1.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
if result3.count()>0:
session_statis.execute('DELETE FROM tablename WHERE source = %s and timespan= %s', (source,'3'))
resultschema3 = sqlContext.createDataFrame(result3)
resultschema3.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
if result6.count()>0:
session_statis.execute('DELETE FROM tablename WHERE source = %s and timespan= %s', (source,'6'))
resultschema6 = sqlContext.createDataFrame(result6)
resultschema6.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
if result12.count()>0:
session_statis.execute('DELETE FROM tablename WHERE source = %s and timespan= %s', (source,'12'))
resultschema12 = sqlContext.createDataFrame(result12)
resultschema12.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
if result24.count()>0:
session_statis.execute('DELETE FROM tablename WHERE source = %s and timespan= %s', (source,'24'))
resultschema24 = sqlContext.createDataFrame(result24)
resultschema24.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
conf = SparkConf().setAppName(appname)
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc,30)
sqlContext = SQLContext(sc)
channels = sc.cassandraTable("keyspace","tablename").map(lambda x:(x.id,{'author':x.name,'category':x.category}))
articles = sc.cassandraTable('keyspace','tablename').map(lambda x:(x.id,(x.id,x.thumbnail,x.title,x.url,x.created_at+timedelta(hours=8),source,x.category,x.channel,x.genre)))
articlestat = sc.cassandraTable('keyspace','tablename').map(lambda x:(x.channel,{'id':x.id,'thumbnail':x.thumbnail,'title':x.title,'url':x.url,'created_at':x.created_at,'source':x.source,'category':x.category,'channel':x.channel,'genre':x.genre}))
axes = sc.cassandraTable('keyspace','tablename')
topic = 'topic1'
kafkaParams = {"metadata.broker.list": "localhost:9092"}
article_stream = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams)
article_join_stream=article_stream.map(lambda x:read_json(x[1])).filter(lambda x: x!=0).map(lambda x:TransformInData(x)).filter(lambda x: x!=0).flatMap(lambda x:(a for a in x)).map(lambda x:(x['id'].encode("utf-8") ,x))
article_join_stream.transform(storeDataToCassandra).pprint()
axes_topic = 'topic2'
axes_stream = KafkaUtils.createDirectStream(ssc, [axes_topic], kafkaParams)
axes_join_stream = axes_stream.map(lambda x:read_json(x[1])).filter(lambda x: x!=0).map(lambda x:axesTransformData(x)).filter(lambda x: x!=0).flatMap(lambda x:(a for a in x)).map(lambda x:(str(x['id']),x))
axes_join_stream.transform(axesStoreToCassandra).pprint()
statistics = article_join_stream.map(lambda x:(x[0])).window(15*60,15*60)
statistics.transform(joinstream).pprint()
ssc.start()
EDIT:
This is the stage that seems to consume most time. Any thoughts on that?
At first glance it seems that you just start your application with "spark-submit <your application>".
This means you are using the default allocation of memory and CPUs for your application (which is about 1 CPU and 512 MB of RAM in most default cases).
This assumes you are using YARN, since you don't provide any information on this.
Start your application with the appropriate resources and you'll see improvements.
Edit:
I see you are using a lot of lambdas; those need to be serialized.
Be aware that when a lambda refers to an attribute of an object, the full object is passed around every time.
I.e. you are using the full object this.value (self.value in Python) and not just value.
To fix this, you could copy the attribute into a local variable, _value = this.value, and use that in the lambda.
This might provide you with a speedup.
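What a lambda drags along can be checked in plain Python by looking at its closure. The sketch below is only an illustration under assumptions: the Scaler class, its big_state attribute and the method names are hypothetical and do not come from the application above. It shows that a lambda using self.factor holds a reference to the whole object, while a lambda using a local copy holds only the number.

```python
class Scaler(object):
    """Hypothetical object with one small attribute and one large one."""

    def __init__(self, factor):
        self.factor = factor
        self.big_state = list(range(100000))  # stands in for heavy state

    def heavy_closure(self):
        # The lambda refers to self.factor, so it closes over `self`
        # and the whole object (including big_state) travels with it
        return lambda x: x * self.factor

    def light_closure(self):
        # Copy the attribute into a local variable first;
        # the lambda then closes over just that value
        factor = self.factor
        return lambda x: x * factor


if __name__ == "__main__":
    scaler = Scaler(3)
    heavy = scaler.heavy_closure()
    light = scaler.light_closure()
    print(type(heavy.__closure__[0].cell_contents).__name__)
    print(type(light.__closure__[0].cell_contents).__name__)
```

Both functions compute the same result; only the amount of state that would need to be serialized with them differs.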

Resources