I recently started with Spark Streaming and am implementing checkpointing, storing the checkpoints in HDFS. When the streaming job fails, it is able to go back to the last checkpoint, but it then throws a NullPointerException and the job is killed. I can see the checkpoints in HDFS, so I'm not sure why I'm getting the exception even though a checkpoint exists. Any inputs would be helpful.
17/04/10 16:30:47 INFO JobGenerator: Batches during down time (2 batches):1491841680000 ms, 1491841800000 ms
17/04/10 16:30:47 INFO JobGenerator: Batches pending processing (0 batches):
17/04/10 16:30:47 INFO JobGenerator: Batches to reschedule (2 batches): 1491841680000 ms, 1491841800000 ms
17/04/10 16:30:48 INFO JobScheduler: Added jobs for time 1491841680000 ms
17/04/10 16:30:48 INFO JobScheduler: Starting job streaming job 1491841680000 ms.0 from job set of time 1491841680000 ms
17/04/10 16:30:48 INFO SparkContext: Starting job: isEmpty at piadj.scala:34
17/04/10 16:30:48 INFO DAGScheduler: Got job 0 (isEmpty at piadj.scala:34) with 1 output partitions
17/04/10 16:30:48 INFO DAGScheduler: Final stage: ResultStage 0 (isEmpty at piadj.scala:34)
17/04/10 16:30:48 INFO DAGScheduler: Parents of final stage: List()
17/04/10 16:30:48 INFO DAGScheduler: Missing parents: List()
17/04/10 16:30:48 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at piadj.scala:32), which has no missing parents
17/04/10 16:30:48 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.1 KB, free 4.1 KB)
17/04/10 16:30:48 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.1 KB, free 6.1 KB)
17/04/10 16:30:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.26.118.23:35738 (size: 2.1 KB, free: 5.8 GB)
17/04/10 16:30:48 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
17/04/10 16:30:48 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at piadj.scala:32)
17/04/10 16:30:48 INFO YarnClusterScheduler: Adding task set 0.0 with 1 tasks
17/04/10 16:30:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, oser402370.wal-mart.com, partition 0,ANY, 2108 bytes)
17/04/10 16:30:48 INFO JobScheduler: Added jobs for time 1491841800000 ms
17/04/10 16:30:48 INFO RecurringTimer: Started timer for JobGenerator at time 1491841920000
17/04/10 16:30:48 INFO JobGenerator: Restarted JobGenerator at 1491841920000 ms
17/04/10 16:30:48 INFO JobScheduler: Starting job streaming job 1491841800000 ms.0 from job set of time 1491841800000 ms
17/04/10 16:30:48 INFO JobScheduler: Started JobScheduler
17/04/10 16:30:48 INFO StreamingContext: StreamingContext started
17/04/10 16:30:48 INFO SparkContext: Starting job: isEmpty at piadj.scala:34
17/04/10 16:30:48 INFO DAGScheduler: Got job 1 (isEmpty at piadj.scala:34) with 1 output partitions
17/04/10 16:30:48 INFO DAGScheduler: Final stage: ResultStage 1 (isEmpty at piadj.scala:34)
17/04/10 16:30:48 INFO DAGScheduler: Parents of final stage: List()
17/04/10 16:30:48 INFO DAGScheduler: Missing parents: List()
17/04/10 16:30:48 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at map at piadj.scala:32), which has no missing parents
17/04/10 16:30:48 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 10.2 KB)
17/04/10 16:30:48 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.1 KB, free 12.3 KB)
17/04/10 16:30:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.26.118.23:35738 (size: 2.1 KB, free: 5.8 GB)
17/04/10 16:30:48 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008
17/04/10 16:30:48 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at map at piadj.scala:32)
17/04/10 16:30:48 INFO YarnClusterScheduler: Adding task set 1.0 with 1 tasks
17/04/10 16:30:48 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, XXXXXXX, partition 0,ANY, 2108 bytes)
17/04/10 16:30:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on XXXXXXX (size: 2.1 KB, free: 4.3 GB)
17/04/10 16:30:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on XXXXXXX (size: 2.1 KB, free: 4.3 GB)
17/04/10 16:30:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1142 ms on XXXXXXX (1/1)
17/04/10 16:30:49 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/04/10 16:30:49 INFO DAGScheduler: ResultStage 0 (isEmpty at piadj.scala:34) finished in 1.151 s
17/04/10 16:30:49 INFO DAGScheduler: Job 0 finished: isEmpty at piadj.scala:34, took 1.466449 s
17/04/10 16:30:49 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 979 ms on XXXXXXX (1/1)
17/04/10 16:30:49 INFO YarnClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/04/10 16:30:49 INFO DAGScheduler: ResultStage 1 (isEmpty at piadj.scala:34) finished in 0.983 s
17/04/10 16:30:49 INFO DAGScheduler: Job 1 finished: isEmpty at piadj.scala:34, took 1.006658 s
17/04/10 16:30:49 INFO JobScheduler: Finished job streaming job 1491841680000 ms.0 from job set of time 1491841680000 ms
17/04/10 16:30:49 INFO JobScheduler: Total delay: 169.575 s for time 1491841680000 ms (execution: 1.568 s)
17/04/10 16:30:49 ERROR JobScheduler: Error running job streaming job 1491841680000 ms.0
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
at walmart.com.piadj$$anonfun$createContext$1.apply(piadj.scala:39)
at walmart.com.piadj$$anonfun$createContext$1.apply(piadj.scala:33)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:226)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Below is my code:
def createContext(brokers: String, topics: String, checkpointDirectory: String): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("pi")
  val sc = new SparkContext(sparkConf)
  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
  sqlContext.setConf("hive.exec.dynamic.partition", "true")
  sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
  val ssc = new StreamingContext(sc, Seconds(1))
  ssc.checkpoint(checkpointDirectory)
  val topicsSet = topics.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicsSet)
  val lines = messages.map(_._2)
  lines.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      import sqlContext.implicits._
      val rdd2 = rdd.map(x => new JsonDeserializer().deserialize("pi_adj", x))
      val rdd3 = rdd2.map(x => new String(x, "UTF-8"))
      val df1 = sqlContext.read.json(rdd3)
      /* some other transformations and inserting into hive */
    }
  }
  ssc
}

def main(args: Array[String]) {
  if (args.length < 3) {
    System.err.println("Usage: streaming <brokers> <topics> <checkpointDirectory>")
    System.exit(1)
  }
  val Array(brokers, topics, checkpointDirectory) = args
  val ssc = StreamingContext.getOrCreate(checkpointDirectory, () => createContext(brokers, topics, checkpointDirectory))
  ssc.start()
  ssc.awaitTermination()
}
tl;dr Move the code that creates the Kafka DStream and the foreachRDD call outside createContext and use it in main.
According to the scaladoc of StreamingContext.getOrCreate:
Either recreate a StreamingContext from checkpoint data or create a new StreamingContext. If checkpoint data exists in the provided checkpointPath, then StreamingContext will be recreated from the checkpoint data. If the data does not exist, then the StreamingContext will be created by calling the provided creatingFunc.
And although it may not be stated explicitly, the creatingFunc should do nothing more than create a new StreamingContext, possibly with checkpointing enabled. Nothing else.
You should move the code that creates the Kafka DStream and the foreachRDD call outside createContext and make it part of main (right after ssc is initialized and before it is started).
Concretely, the body of foreachRDD changes from this, which relies on the sqlContext created in createContext:

if (!rdd.isEmpty()) {
  import sqlContext.implicits._
  val rdd2 = rdd.map(x => new JsonDeserializer().deserialize("pi_adj", x))
  val rdd3 = rdd2.map(x => new String(x, "UTF-8"))
  val df1 = sqlContext.read.json(rdd3)
  /* some other transformations and inserting into hive */
}
to this, which obtains the session from the RDD's SparkContext instead:

if (!rdd.isEmpty()) {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val rdd2 = rdd.map(x => new JsonDeserializer().deserialize("pi_adj", x))
  val rdd3 = rdd2.map(x => new String(x, "UTF-8"))
  val df1 = spark.read.json(rdd3)
  /* some other transformations and inserting into hive */
}
Related
I am prototyping a Spark-based data ingestion system. Essentially I need Spark to watch a datalake directory and, as data comes in, add that data to an in-memory dataframe. I understand that the memory sink is meant for debugging purposes, but since this is a prototype I am trying to get this working in memory first, before moving to the more standard Kafka.
Here is my first Python script, which is supposed to getOrCreate a SparkSession, read from the datalake, then write to a data table located in memory:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

APP_NAME = "spark.data_processing_engine"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName(APP_NAME) \
    .getOrCreate()

spark.sparkContext.broadcast([1])

schemaSnmp = StructType([
    StructField("node-hostname", StringType(), True),
    StructField("event-time", TimestampType(), True),
    StructField("oid", StringType(), True),
    StructField("value", StringType(), True)
])

df = spark.readStream.format("json") \
    .option("sourceArchiveDir", "/tmp/datalake-archive") \
    .option("cleanSource", "archive") \
    .schema(schemaSnmp) \
    .load("/var/datalake/snmp-get")

result = df.writeStream.queryName("snmpget").format("memory").start()
result.awaitTermination()
This appears to run just fine; I can see my data being archived and the logs look healthy. Here is a small sample:
$ spark-submit spark-start-stream-for-snmpget.py
22/08/18 11:18:33 WARN Utils: Your hostname, rhel8.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
22/08/18 11:18:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/08/18 11:18:35 INFO SparkContext: Running Spark version 3.3.0
22/08/18 11:18:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/18 11:18:35 INFO ResourceUtils: ==============================================================
22/08/18 11:18:35 INFO ResourceUtils: No custom resources configured for spark.driver.
22/08/18 11:18:35 INFO ResourceUtils: ==============================================================
22/08/18 11:18:35 INFO SparkContext: Submitted application: spark.data_processing_engine
22/08/18 11:18:35 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/08/18 11:18:35 INFO ResourceProfile: Limiting resource is cpu
22/08/18 11:18:35 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/08/18 11:18:35 INFO SecurityManager: Changing view acls to: root
22/08/18 11:18:35 INFO SecurityManager: Changing modify acls to: root
22/08/18 11:18:35 INFO SecurityManager: Changing view acls groups to:
22/08/18 11:18:35 INFO SecurityManager: Changing modify acls groups to:
22/08/18 11:18:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
22/08/18 11:18:36 INFO Utils: Successfully started service 'sparkDriver' on port 42657.
22/08/18 11:18:36 INFO SparkEnv: Registering MapOutputTracker
22/08/18 11:18:36 INFO SparkEnv: Registering BlockManagerMaster
22/08/18 11:18:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/08/18 11:18:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/08/18 11:18:36 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/08/18 11:18:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-364e57a6-0f38-43ea-abcc-428e0ca8684f
22/08/18 11:18:36 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
22/08/18 11:18:36 INFO SparkEnv: Registering OutputCommitCoordinator
...
22/08/18 11:17:55 INFO InMemoryFileIndex: It took 5 ms to list leaf files for 1 paths.
22/08/18 11:17:55 INFO BlockManagerInfo: Removed broadcast_224_piece0 on 10.0.2.15:36859 in memory (size: 34.0 KiB, free: 434.3 MiB)
In a separate process, I fire up pyspark and try to capture this data, but I cannot:
In [1]: APP_NAME = "spark.data_processing_engine"
   ...:
   ...: spark = SparkSession \
   ...:     .builder \
   ...:     .master("local[*]") \
   ...:     .appName(APP_NAME) \
   ...:     .getOrCreate()
In [2]: spark.sql("SELECT * FROM snmpget ORDER BY `event-time` DESC").count()
AnalysisException: Table or view not found: snmpget; line 1 pos 14;
'Sort ['event-time DESC NULLS LAST], true
+- 'Project [*]
+- 'UnresolvedRelation [snmpget], [], false
I have already tried the following:
Using the same app name
Setting spark.sparkContext.broadcast([1])
Scoping variables in the pyspark instance
Instead of using pyspark, creating a script to run:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

APP_NAME = "spark.data_processing_engine"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName(APP_NAME) \
    .getOrCreate()

spark.sql("SELECT * FROM snmpget ORDER BY `event-time` DESC").count()
But I get the same error this way too. I was hoping the getOrCreate method would be able to reuse the SparkSession object and allow other processes to access the data, but obviously there is more to it than that.
Essentially, I need to fire up a Spark process that reads from the datalake and then fire up other jobs that can read that data.
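Note that querying the memory sink does work from inside the same driver application that started the query: as far as I understand, the table created by queryName("snmpget") only exists in the SparkSession of the driver that started it, which is why a separate pyspark process cannot see it. A minimal sketch, assuming it runs in the same script that called start() above:

from pyspark.sql import SparkSession

# Reuse this driver's session; the in-memory "snmpget" table lives only here.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("spark.data_processing_engine") \
    .getOrCreate()

# Works because this runs in the same application as the memory-sink query.
spark.sql("SELECT * FROM snmpget ORDER BY `event-time` DESC").show(5)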
Okay, I think I figured out a way to get this data coordinated. Maybe there is a better way of doing it, but for now it fits the bill.
First I create a long-running spark-submit task that continually reads from the datalake and writes its output as parquet:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

APP_NAME = "spark.data_processing_engine"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName(APP_NAME) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

schemaSnmp = StructType([
    StructField("node-hostname", StringType(), True),
    StructField("event-time", TimestampType(), True),
    StructField("oid", StringType(), True),
    StructField("value", StringType(), True)
])

result = spark.readStream \
    .option("sourceArchiveDir", "/var/datalake-archive") \
    .option("cleanSource", "archive") \
    .schema(schemaSnmp) \
    .json("/var/datalake/snmp-get") \
    .writeStream \
    .queryName("snmpget") \
    .format("parquet") \
    .option("checkpointLocation", "/var/spark-map/snmp-get") \
    .option("path", "/var/spark-map/snmp-get") \
    .start()

result.awaitTermination()
result.stop()
After this is running, I fire up a new Spark job that looks like this:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

import time
import datetime
import logging

APP_NAME = "spark.data_processing_engine"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName(APP_NAME) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

schema = StructType([
    StructField("node-hostname", StringType(), True),
    StructField("event-time", TimestampType(), True),
    StructField("oid", StringType(), True),
    StructField("value", StringType(), True)
])

results = spark.readStream \
    .schema(schema) \
    .parquet("/var/spark-map/snmp-get/") \
    .writeStream \
    .queryName("snmpget") \
    .format("memory") \
    .start()

i = 0
while i < 60:
    x = spark.table("snmpget") \
        .select("value", "oid", "`node-hostname`", "`event-time`") \
        .orderBy("`event-time`", ascending=False) \
        .head()
    if x is None:
        print("Data may still be loading, give this approx 30 seconds")
    else:
        print(f"x = {x}")
    i = i + 1
    time.sleep(1)

results.stop()
That second script can take a few moments to load (so it will print None at first), but after that the data comes in. This works across multiple Spark jobs, each reading from that same parquet output.
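Downstream jobs don't even have to use readStream to get at this data: since the first job keeps committing parquet files, another application can do a plain batch read of the same path and re-run it whenever it wants fresher data. A minimal sketch under that assumption (the appName here is just a placeholder, the path is the one used by the writer above):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("spark.data_processing_engine.reader") \
    .getOrCreate()

# Batch read of the parquet files the streaming job keeps writing;
# re-running this picks up whatever has been committed so far.
df = spark.read.parquet("/var/spark-map/snmp-get/")
print(df.count())
df.orderBy(df["event-time"].desc()).show(5)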
Ideally all of this would just live in memory for demo purposes, but this works nonetheless. If I want more speed, I can always look into putting the referenced directories on RAM disks.
If anyone out there knows how to tie this together with only an in-memory datastore (i.e., no parquet step), I'd be interested in hearing it.
I'm trying to run Spark and MLlib code in a Kubernetes cluster.
When I run the same code in client mode, or inside a Docker container that I built, it works without any issue.
But when I run the same code inside the Kubernetes cluster (PKS platform), it fails with no error.
import os
import numpy as np
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
from pyspark.sql.functions import percent_rank
from pyspark.sql import Window
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler, VectorIndexer
from pyspark.sql.functions import broadcast
import datetime

########## SparkContext ###############
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
print("started")

########### RMSLE calculation ########
def rmsle(real, predicted):
    sum = 0.0
    for x in range(len(predicted)):
        if predicted[x] < 0 or real[x] < 0:  # check for negative values
            continue
        p = np.log(predicted[x] + 1)
        r = np.log(real[x] + 1)
        sum = sum + (p - r)**2
    return (sum / len(predicted))**0.5

########## Reading of input data ###########
customer = sqlContext.read.csv('hdfs://XXXXXX:/user/root/customer.csv', header=True, inferSchema=True)
customer = customer.select("*").toPandas()
lineitem = sqlContext.read.csv('hdfs://XXXXXXX:/user/root/lineitem.csv', header=True, inferSchema=True)
lineitem = lineitem.select("*").toPandas()
order = sqlContext.read.csv('hdfs://XXXXXXX:/user/root/orders.csv', header=True, inferSchema=True)
order = order.select("*").toPandas()
print("data has been read")

########### ETL ############################
sales = order.join(customer, order.o_custkey == customer.c_custkey, how='inner')
sales = sales.sort_index()
sales.columns = ['key_old', 'o_orderdate', 'o_orderkey', 'o_custkey', 'o_orderpriority',
                 'o_shippriority', 'o_clerk', 'o_orderstatus', 'o_totalprice',
                 'o_comment', 'c_custkey', 'c_mktsegment', 'c_nationkey', 'c_name',
                 'c_address', 'c_phone', 'c_acctbal', 'c_comment']
sales2 = sales.join(lineitem, sales.o_orderkey == lineitem.l_orderkey, how='outer')
sales3 = sales2.groupby(by='o_orderdate')
sales4 = sales3.agg({'l_quantity': 'sum'})  # .withColumnRenamed("sum(l_quantity)", "TOTAL_SALES") .withColumnRenamed("o_orderdate", "ORDERDATE")
print("End of ETL pipeline")

orderdates = pd.to_datetime(sales4.index.values)
orderdates = [datetime.datetime(i.year, i.month, i.day) for i in orderdates]

l = []
l2 = []
for i in orderdates:
    l = []
    l.append(i.timestamp())
    l.append(i.day)
    l.append(i.timetuple().tm_wday)
    l.append(i.timetuple().tm_yday)
    l.append(i.isocalendar()[1])
    l2.append(l)
print("dateconverted")

tmp = np.array(sales4.values)
tmp = tmp.reshape(tmp.shape[0],)

data_new = pd.DataFrame()
data_new['SALES'] = tmp
data_new[['DATE', 'DAY', 'WDAY', 'YDAY', 'WEEK']] = pd.DataFrame(np.array(l2))
data_new['ONES'] = np.ones((len(data_new)))
print("converted to datframe")

X = np.array(data_new[['DATE', 'DAY', 'WDAY', 'YDAY', 'WEEK', 'ONES']])
X = X.reshape(X.shape[0], X.shape[1])
Y = np.array(data_new[['SALES']])
Y = Y.reshape(Y.shape[0], 1)

cutoff = 0.1
length = int((1 - cutoff) * (len(X)))
X_train = X[0:length]
X_test = X[length:len(X)]
Y_train = Y[0:length]
Y_test = Y[length:len(Y)]
print("pre-processingdone")

weights = np.dot(np.dot(np.linalg.inv(np.dot(X_train.T, X_train)), X_train.T), Y_train)
print("model Ready")

Y_pred = np.dot(X_test, weights)
Y_pred = Y_pred.reshape(Y_pred.shape[0],)
Y_test = Y_test.reshape(Y_test.shape[0],)
print("predictions done")

RMSE = np.sqrt(np.mean((Y_test - Y_pred)**2))
RMSLE = rmsle(Y_test, Y_pred)
print(RMSE)
print(RMSLE)

sc.stop()
Environment:
Kubernetes cluster: PKS platform
Plenty of memory: 16 GB on the master and 16 GB of RAM on each of the 3 worker nodes
I don't see any spike; during processing it hardly uses 30% of the cluster's memory.
My image is 2 GB and the data is barely 100 MB.
It's failing in the code below:
weights = np.dot(np.dot(np.linalg.inv(np.dot(X_train.T,X_train)),X_train.T),Y_train)
print("model Ready")
Y_pred = np.dot(X_test,weights)
Y_pred = Y_pred.reshape(Y_pred.shape[0],)
Y_test = Y_test.reshape(Y_test.shape[0],)
print("predictions done")
RMSE = np.sqrt(np.mean((Y_test-Y_pred)**2))
RMSLE = rmsle(Y_test,Y_pred)
print(RMSE)
print(RMSLE)
sc.stop()
Below is part of the driver log, and I don't see any error in it:
19/12/26 10:52:23 INFO ContextCleaner: Cleaned accumulator 238
19/12/26 10:52:23 INFO ContextCleaner: Cleaned accumulator 222
19/12/26 10:52:23 INFO ContextCleaner: Cleaned accumulator 241
19/12/26 10:52:23 INFO ContextCleaner: Cleaned accumulator 228
data has been read
End of ETL pipeline
dateconverted
converted to datframe
pre-processingdone
19/12/26 10:52:35 INFO SparkContext: Invoking stop() from shutdown hook
19/12/26 10:52:35 INFO SparkUI: Stopped Spark web UI at http://spark-ml-test-4ff4386f41d48a9f-driver-svc.spark-jobs.svc:4040
19/12/26 10:52:35 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
19/12/26 10:52:35 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
19/12/26 10:52:35 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
19/12/26 10:52:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/12/26 10:52:35 INFO MemoryStore: MemoryStore cleared
19/12/26 10:52:35 INFO BlockManager: BlockManager stopped
19/12/26 10:52:35 INFO BlockManagerMaster: BlockManagerMaster stopped
19/12/26 10:52:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/12/26 10:52:35 INFO SparkContext: Successfully stopped SparkContext
19/12/26 10:52:35 INFO ShutdownHookManager: Shutdown hook called
19/12/26 10:52:35 INFO ShutdownHookManager: Deleting directory /var/data/spark-18e6167c-433f-41d8-82c4-b11ba9a3bf8c/spark-3a8d3e48-292c-4018-b003-bde80641eb90/pyspark-4b10b351-cf07-4452-8285-acb67157db80
19/12/26 10:52:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-29b813d2-f377-4d84-be3a-4372f69e58b5
19/12/26 10:52:35 INFO ShutdownHookManager: Deleting directory /var/data/spark-18e6167c-433f-41d8-82c4-b11ba9a3bf8c/spark-3a8d3e48-292c-4018-b003-bde80641eb90
Any guesses or ideas on this?
Environment: EMR
AWS Kinesis Stream
Language: PySpark
I have an incoming AWS Kinesis stream and I am able to consume it using plain Python (so EMR is able to fetch the stream). When I try to consume it through PySpark Streaming, I do not get the stream; only logs are printed. I am not doing any transformation, just trying to read the stream and print it. Can someone guide me on this?
from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

appName = 'kinesis_myreal_time_stream'
streamName = 'kinesis_myreal_time_stream'
endpointUrl = 'apigateway.us-east-1.amazonaws.com'
regionName = 'us-east-1'

sc = SparkContext()
ssc = StreamingContext(sc, 10)

lines = KinesisUtils.createStream(ssc=ssc, kinesisAppName=appName, streamName=streamName,
                                  endpointUrl=endpointUrl, regionName=regionName,
                                  initialPositionInStream=InitialPositionInStream.LATEST,
                                  checkpointInterval=2)

# counts = lines.flatMap(lambda line: line.split("}{")) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a + b)
# counts.pprint()

lines.pprint()

ssc.start()
ssc.awaitTermination()
I'm getting logs as below:
-------------------------------------------
Time: 2019-02-15 13:17:10
-------------------------------------------
19/02/15 13:17:10 INFO JobScheduler: Finished job streaming job 1550236630000 ms.0 from job set of time 1550236630000 ms
19/02/15 13:17:10 INFO PythonRDD: Removing RDD 59 from persistence list
19/02/15 13:17:10 INFO JobScheduler: Total delay: 0.014 s for time 1550236630000 ms (execution: 0.002 s)
19/02/15 13:17:10 INFO BlockManager: Removing RDD 59
19/02/15 13:17:10 INFO KinesisBackedBlockRDD: Removing RDD 58 from persistence list
19/02/15 13:17:10 INFO BlockManager: Removing RDD 58
19/02/15 13:17:10 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[58] at createStream at NativeMethodAccessorImpl.java:0 of time 1550236630000 ms
19/02/15 13:17:10 INFO ReceivedBlockTracker: Deleting batches: 1550236610000 ms
19/02/15 13:17:10 INFO InputInfoTracker: remove old batch metadata: 1550236610000 ms
19/02/15 13:17:20 INFO JobScheduler: Added jobs for time 1550236640000 ms
19/02/15 13:17:20 INFO JobScheduler: Starting job streaming job 1550236640000 ms.0 from job set of time 1550236640000 ms
-------------------------------------------
Time: 2019-02-15 13:17:20
-------------------------------------------
19/02/15 13:17:20 INFO JobScheduler: Finished job streaming job 1550236640000 ms.0 from job set of time 1550236640000 ms
19/02/15 13:17:20 INFO PythonRDD: Removing RDD 61 from persistence list
19/02/15 13:17:20 INFO JobScheduler: Total delay: 0.018 s for time 1550236640000 ms (execution: 0.001 s)
19/02/15 13:17:20 INFO BlockManager: Removing RDD 61
19/02/15 13:17:20 INFO KinesisBackedBlockRDD: Removing RDD 60 from persistence list
19/02/15 13:17:20 INFO BlockManager: Removing RDD 60
19/02/15 13:17:20 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[60] at createStream at NativeMethodAccessorImpl.java:0 of time 1550236640000 ms
19/02/15 13:17:20 INFO ReceivedBlockTracker: Deleting batches: 1550236620000 ms
19/02/15 13:17:20 INFO InputInfoTracker: remove old batch metadata: 1550236620000 ms
19/02/15 13:17:30 INFO JobScheduler: Added jobs for time 1550236650000 ms
19/02/15 13:17:30 INFO JobScheduler: Starting job streaming job 1550236650000 ms.0 from job set of time 1550236650000 ms
-------------------------------------------
Time: 2019-02-15 13:17:30
-------------------------------------------
I think you copy-pasted the wrong endpoint URL into your app: you are passing an API Gateway service URL, not a Kinesis one. Also, I do not think you always need to pass it.
It should be similar to this example from the Spark source:
@param endpointUrl Url of Kinesis service (e.g., https://kinesis.us-east-1.amazonaws.com)
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisUtils.scala#L90
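For illustration only, a sketch of what the corrected call might look like, keeping the rest of the question's code as-is and swapping in the Kinesis service endpoint for us-east-1 (adjust the endpoint and region to wherever your stream actually lives):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext()
ssc = StreamingContext(sc, 10)

# Kinesis service endpoint, not an API Gateway URL (assumes the stream is in us-east-1)
lines = KinesisUtils.createStream(ssc, kinesisAppName='kinesis_myreal_time_stream',
                                  streamName='kinesis_myreal_time_stream',
                                  endpointUrl='https://kinesis.us-east-1.amazonaws.com',
                                  regionName='us-east-1',
                                  initialPositionInStream=InitialPositionInStream.LATEST,
                                  checkpointInterval=2)

lines.pprint()
ssc.start()
ssc.awaitTermination()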
I have built a Kafka consumer that fetches from Kafka and writes to Elasticsearch. The program runs as expected for a day or two, and then Spark stops capturing data: Kafka logs are still being generated and the Spark stream is still running, but no data is captured. Below is the code being used:
# For Spark
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
# For Kafka
from pyspark.streaming.kafka import KafkaUtils

# Name of Spark app
conf = SparkConf().setAppName("test_topic")

# Spark and Spark Streaming configuration
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

# Kafka endpoints
zkQuorum = '192.0.23.1:2181'
topic = 'test_topic'

# Elasticsearch write endpoint
es_write_conf = {
    "es.nodes": "192.000.0.1",
    "es.port": "9200",
    "es.resource": "test_index/test_type",
    "es.input.json": "true",
    "es.nodes.ingest.only": "true"
}

# Create a Kafka stream
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "cyd-pcs-bro-streaming-consumer", {topic: 1})

# Print stream to console
kafkaStream_json = kafkaStream.map(lambda x: x[1])
kafkaStream_json.pprint()

# Write stream to Elasticsearch
kafkaStream.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)
)

# Start the stream and keep it running unless terminated
ssc.start()
ssc.awaitTermination()
Is there something else my code should be doing, or a way to dig deeper into the issue (the logs don't indicate anything)?
Also, if I am okay with having one Spark app per topic, is there any other reason I would want to use KafkaUtils.createDirectStream? I don't want the overhead of managing offsets.
Language used: PySpark
Code run:
sudo $SPARK_HOME/spark-submit --master local[2] --jars /home/user/jars/elasticsearch-hadoop-6.3.2.jar,/home/user/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar /home/user/code/test_stream.py
This is the output of the stream when no data is being captured:
-------------------------------------------
Time: 2018-08-29 12:23:46
-------------------------------------------
18/08/29 12:23:46 INFO JobScheduler: Finished job streaming job 1535525626000 ms.0 from job set of time 1535525626000 ms
18/08/29 12:23:46 INFO JobScheduler: Total delay: 0.030 s for time 1535525626000 ms (execution: 0.007 s)
18/08/29 12:23:46 INFO PythonRDD: Removing RDD 115 from persistence list
18/08/29 12:23:46 INFO BlockManager: Removing RDD 115
18/08/29 12:23:46 INFO BlockRDD: Removing RDD 114 from persistence list
18/08/29 12:23:46 INFO BlockManager: Removing RDD 114
18/08/29 12:23:46 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[114] at createStream at NativeMethodAccessorImpl.java:0 of time 1535525626000 ms
18/08/29 12:23:46 INFO ReceivedBlockTracker: Deleting batches: 1535525624000 ms
18/08/29 12:23:46 INFO InputInfoTracker: remove old batch metadata: 1535525624000 ms
18/08/29 12:23:47 INFO JobScheduler: Added jobs for time 1535525627000 ms
18/08/29 12:23:47 INFO JobScheduler: Starting job streaming job 1535525627000 ms.0 from job set of time 1535525627000 ms
-------------------------------------------
Time: 2018-08-29 12:23:47
-------------------------------------------
18/08/29 12:23:47 INFO JobScheduler: Finished job streaming job 1535525627000 ms.0 from job set of time 1535525627000 ms
18/08/29 12:23:47 INFO JobScheduler: Total delay: 0.025 s for time 1535525627000 ms (execution: 0.005 s)
18/08/29 12:23:47 INFO PythonRDD: Removing RDD 117 from persistence list
18/08/29 12:23:47 INFO BlockRDD: Removing RDD 116 from persistence list
18/08/29 12:23:47 INFO BlockManager: Removing RDD 117
18/08/29 12:23:47 INFO BlockManager: Removing RDD 116
18/08/29 12:23:47 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[116] at createStream at NativeMethodAccessorImpl.java:0 of time 1535525627000 ms
18/08/29 12:23:47 INFO ReceivedBlockTracker: Deleting batches: 1535525625000 ms
18/08/29 12:23:47 INFO InputInfoTracker: remove old batch metadata: 1535525625000 ms
18/08/29 12:23:48 INFO JobScheduler: Added jobs for time 1535525628000 ms
18/08/29 12:23:48 INFO JobScheduler: Starting job streaming job 1535525628000 ms.0 from job set of time 1535525628000 ms
I am getting a scala.MatchError when using a ParamGridBuilder in Spark 1.6.1 and 2.0:
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()
The error is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 1 times, most recent failure: Lost task 0.0 in stage 57.0 (TID 257, localhost):
scala.MatchError: [280000,1.0,[2400.0,9373.0,3.0,1.0,1.0,0.0,0.0,0.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Full code
The question is how I should use ParamGridBuilder in this case.
The problem here is the input schema, not ParamGridBuilder. The price column is loaded as an integer, while LinearRegression expects a double. You can fix it by explicitly casting the column to the required type:
val houses = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(...)
  .withColumn("price", $"price".cast("double"))