Access In-Memory Spark Dataframe from different nodes - apache-spark

I am prototyping a Spark-based data ingestion system. Essentially I need Spark to watch a data lake directory and, as data comes in, add it to an in-memory DataFrame. I understand that the memory sink is meant for debugging purposes, but since this is a prototype I am trying to get this working in memory first, before moving to the more standard Kafka setup.
Here is my first Python script, which is supposed to getOrCreate a SparkSession, read from the data lake, then write to a data table located in memory:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

APP_NAME = "spark.data_processing_engine"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName(APP_NAME) \
    .getOrCreate()

spark.sparkContext.broadcast([1])

schemaSnmp = StructType([
    StructField("node-hostname", StringType(), True),
    StructField("event-time", TimestampType(), True),
    StructField("oid", StringType(), True),
    StructField("value", StringType(), True)
])

df = spark.readStream.format("json") \
    .option("sourceArchiveDir", "/tmp/datalake-archive") \
    .option("cleanSource", "archive") \
    .schema(schemaSnmp) \
    .load("/var/datalake/snmp-get")

result = df.writeStream.queryName("snmpget").format("memory").start()
result.awaitTermination()
This appears to be running just fine: I can see my data getting archived, and the logs look healthy. Here is a small sample:
$ spark-submit spark-start-stream-for-snmpget.py
22/08/18 11:18:33 WARN Utils: Your hostname, rhel8.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
22/08/18 11:18:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/08/18 11:18:35 INFO SparkContext: Running Spark version 3.3.0
22/08/18 11:18:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/18 11:18:35 INFO ResourceUtils: ==============================================================
22/08/18 11:18:35 INFO ResourceUtils: No custom resources configured for spark.driver.
22/08/18 11:18:35 INFO ResourceUtils: ==============================================================
22/08/18 11:18:35 INFO SparkContext: Submitted application: spark.data_processing_engine
22/08/18 11:18:35 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/08/18 11:18:35 INFO ResourceProfile: Limiting resource is cpu
22/08/18 11:18:35 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/08/18 11:18:35 INFO SecurityManager: Changing view acls to: root
22/08/18 11:18:35 INFO SecurityManager: Changing modify acls to: root
22/08/18 11:18:35 INFO SecurityManager: Changing view acls groups to:
22/08/18 11:18:35 INFO SecurityManager: Changing modify acls groups to:
22/08/18 11:18:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
22/08/18 11:18:36 INFO Utils: Successfully started service 'sparkDriver' on port 42657.
22/08/18 11:18:36 INFO SparkEnv: Registering MapOutputTracker
22/08/18 11:18:36 INFO SparkEnv: Registering BlockManagerMaster
22/08/18 11:18:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/08/18 11:18:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/08/18 11:18:36 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/08/18 11:18:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-364e57a6-0f38-43ea-abcc-428e0ca8684f
22/08/18 11:18:36 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
22/08/18 11:18:36 INFO SparkEnv: Registering OutputCommitCoordinator
...
22/08/18 11:17:55 INFO InMemoryFileIndex: It took 5 ms to list leaf files for 1 paths.
22/08/18 11:17:55 INFO BlockManagerInfo: Removed broadcast_224_piece0 on 10.0.2.15:36859 in memory (size: 34.0 KiB, free: 434.3 MiB)
In a separate process, I fire up pyspark and try to capture this data, but I cannot:
In [1]: APP_NAME = "spark.data_processing_engine"
   ...:
   ...: spark = SparkSession \
   ...:     .builder \
   ...:     .master("local[*]") \
   ...:     .appName(APP_NAME) \
   ...:     .getOrCreate()

In [2]: spark.sql("SELECT * FROM snmpget ORDER BY `event-time` DESC").count()
AnalysisException: Table or view not found: snmpget; line 1 pos 14;
'Sort ['event-time DESC NULLS LAST], true
+- 'Project [*]
   +- 'UnresolvedRelation [snmpget], [], false
I have already tried the following:
- using the same app name
- setting spark.sparkContext.broadcast([1])
- scoping variables in the pyspark instance
- instead of using pyspark, creating a script to run:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

APP_NAME = "spark.data_processing_engine"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName(APP_NAME) \
    .getOrCreate()

spark.sql("SELECT * FROM snmpget ORDER BY `event-time` DESC").count()
But I get the same error this way too. I was hoping the getOrCreate method would be able to reuse the SparkSession object and allow other processes to access the data; obviously there is more to it than that.
Essentially, I need to fire up one Spark process to read from the data lake, and then fire up other jobs that can read the data it collects.
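For what it's worth, the memory sink keeps its rows in the driver's memory and registers the query name as a temporary view in the owning SparkSession, so it is only queryable from that same driver process. A minimal sketch of what does work (same names as the script above; this only illustrates the scoping, it is not a fix for cross-process access):

# Inside the SAME driver that started the memory-sink query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark.data_processing_engine").getOrCreate()

# ... build df and start the query exactly as above ...
# result = df.writeStream.queryName("snmpget").format("memory").start()

# This works here, because this SparkSession owns the "snmpget" temp view:
spark.sql("SELECT * FROM snmpget ORDER BY `event-time` DESC").show(5, truncate=False)

# A second spark-submit or pyspark shell creates a brand-new SparkSession in a
# different JVM, so "snmpget" does not exist there, which is why the
# AnalysisException above is raised.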

Okay, I think I figured out a way to get this data coordinated. Maybe there is a better way of doing this, but for now it fits the bill.
First I create a long-running spark-submit task that continually reads from the data lake and writes out to a Parquet sink:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

APP_NAME = "spark.data_processing_engine"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName(APP_NAME) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

schemaSnmp = StructType([
    StructField("node-hostname", StringType(), True),
    StructField("event-time", TimestampType(), True),
    StructField("oid", StringType(), True),
    StructField("value", StringType(), True)
])

result = spark.readStream \
    .option("sourceArchiveDir", "/var/datalake-archive") \
    .option("cleanSource", "archive") \
    .schema(schemaSnmp) \
    .json("/var/datalake/snmp-get") \
    .writeStream \
    .queryName("snmpget") \
    .format("parquet") \
    .option("checkpointLocation", "/var/spark-map/snmp-get") \
    .option("path", "/var/spark-map/snmp-get") \
    .start()

result.awaitTermination()
result.stop()
After this is running, I fire up a new Spark job that looks like this:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time
import datetime
import logging

APP_NAME = "spark.data_processing_engine"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName(APP_NAME) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

schema = StructType([
    StructField("node-hostname", StringType(), True),
    StructField("event-time", TimestampType(), True),
    StructField("oid", StringType(), True),
    StructField("value", StringType(), True)
])

results = spark.readStream \
    .schema(schema) \
    .parquet("/var/spark-map/snmp-get/") \
    .writeStream \
    .queryName("snmpget") \
    .format("memory") \
    .start()

i = 0
while i < 60:
    x = spark.table("snmpget") \
        .select("value", "oid", "`node-hostname`", "`event-time`") \
        .orderBy("`event-time`", ascending=False) \
        .head()
    if x is None:
        print("Data may still be loading, give this approx 30 seconds")
    else:
        print(f"x = {x}")
    i = i + 1
    time.sleep(1)

results.stop()
That second script can take a few moments to load (so it will print the "Data may still be loading" message at first), but then the data comes in! This works across multiple Spark jobs, each reading from that same Parquet output.
Ideally all of this would just live in memory for demo purposes, but this works nonetheless. If I want more speed, I could always look into putting the referenced directories on RAM disks.
If anyone out there knows how to tie this together with only an in-memory datastore (i.e., no Parquet step), I'd be interested in knowing.
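One way that should keep everything in memory (an untested sketch, assuming a single long-running driver is acceptable for the demo) is to run the "other jobs" as threads inside the same application that owns the memory-sink query, since that in-memory table only exists in the driver that started it:

import threading
import time

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.master("local[*]").appName("spark.data_processing_engine").getOrCreate()

schema = StructType([
    StructField("node-hostname", StringType(), True),
    StructField("event-time", TimestampType(), True),
    StructField("oid", StringType(), True),
    StructField("value", StringType(), True),
])

# Memory-sink query, as in the question.
query = spark.readStream \
    .schema(schema) \
    .json("/var/datalake/snmp-get") \
    .writeStream \
    .queryName("snmpget") \
    .format("memory") \
    .start()

def consumer():
    # Each "job" becomes a thread sharing the one SparkSession, and therefore
    # the one in-memory "snmpget" table.
    for _ in range(60):
        spark.sql("SELECT * FROM snmpget ORDER BY `event-time` DESC").show(5)
        time.sleep(5)

threading.Thread(target=consumer, daemon=True).start()
query.awaitTermination()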

Related

PySpark Structured Streaming Query - query visibility in dashboard

I wrote some example code which connects to a Kafka broker, reads data from a topic, and sinks it to a SnappyData table.
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SQLContext, Row, SparkSession
from pyspark.sql.snappy import SnappySession
from pyspark.rdd import RDD
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import col, explode, split
import time
import sys
import logging

def main(snappy):
    logger = logging.getLogger('py4j')
    logger.info("My test info statement")
    sns = snappy.newSession()
    df = sns \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "10.0.0.4:9092") \
        .option("subscribe", "test_import3") \
        .option("failOnDataLoss", "false") \
        .option("startingOffsets", "latest") \
        .load()

    bdf = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    streamingQuery = bdf \
        .writeStream \
        .format("snappysink") \
        .queryName("Devices3") \
        .trigger(processingTime="30 seconds") \
        .option("tablename", "devices2") \
        .option("checkpointLocation", "/tmp") \
        .start()

    streamingQuery.awaitTermination()

if __name__ == "__main__":
    from pyspark.sql.snappy import SnappySession
    from pyspark import SparkContext, SparkConf

    sc = SparkSession.builder.master("local[*]").appName("test").config("snappydata.connection", "10.0.0.4:1527").getOrCreate()
    snc = SnappySession(sc)
    main(snc)
I'm submitting it with the command:
/opt/snappydata/bin/spark-submit --master spark://10.0.0.4:1527 /path_to/file.py --conf snappydata.connection=10.0.0.4:1527
Everything works: data is read from the Kafka topic and written to the SnappyData table.
I don't understand why I don't see this streaming query in the SnappyData dashboard UI. After submitting the PySpark code in the console, I saw that a new Spark master UI was started.
Is it possible to connect to SnappyData's internal Spark master from PySpark, and if so, how?
SnappyData supports Python jobs only in Smart Connector mode, which means they are always launched via a separate Spark cluster that talks to the SnappyData cluster. Hence your Python job shows up on that Spark cluster's UI and not on SnappyData's dashboard.

unable to read kafka topic data using spark

I have data like the below in one of the topics which I created, named "sampleTopic":
sid,Believer
The first field is the username and the second field is the song name which the user frequently listens to. Now, I have started ZooKeeper, the Kafka server, and a producer with the topic name mentioned above, and I have entered the above data for that topic using the command line. Now, I want to read the topic in Spark, perform some aggregation, and write it back to the stream. Below is my code:
package com.sparkKafka

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SparkKafkaTopic {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
    println("hey")
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sampleTopic1")
      .load()
    val query = df.writeStream
      .outputMode("append")
      .format("console")
      .start().awaitTermination()
  }
}
However, when I execute the above code it gives:
+----+--------------------+------------+---------+------+--------------------+-------------+
| key| value| topic|partition|offset| timestamp|timestampType|
+----+--------------------+------------+---------+------+--------------------+-------------+
|null|[73 69 64 64 68 6...|sampleTopic1| 0| 4|2020-05-31 12:12:...| 0|
+----+--------------------+------------+---------+------+--------------------+-------------+
with the below messages looping infinitely as well:
20/05/31 11:56:12 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-0d6807b9-fcc9-4847-abeb-f0b81ab25187--264582860-driver-0] Resetting offset for partition sampleTopic1-0 to offset 4.
20/05/31 11:56:12 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-0d6807b9-fcc9-4847-abeb-f0b81ab25187--264582860-driver-0] Resetting offset for partition sampleTopic1-0 to offset 4.
20/05/31 11:56:12 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-0d6807b9-fcc9-4847-abeb-f0b81ab25187--264582860-driver-0] Resetting offset for partition sampleTopic1-0 to offset 4.
20/05/31 11:56:12 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-0d6807b9-fcc9-4847-abeb-f0b81ab25187--264582860-driver-0] Resetting offset for partition sampleTopic1-0 to offset 4.
20/05/31 11:56:12 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-0d6807b9-fcc9-4847-abeb-f0b81ab25187--264582860-driver-0] Resetting offset for partition sampleTopic1-0 to offset 4.
20/05/31 11:56:12 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-0d6807b9-fcc9-4847-abeb-f0b81ab25187--264582860-driver-0] Resetting offset for partition sampleTopic1-0 to offset 4.
I need output something like the below.
As modified on the suggestion by Srinivas, I got the following output.
Not sure what exactly is wrong here. Please guide me through it.
Try adding the spark-sql-kafka library to your build file. Check below.
build.sbt
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
// Change to Your spark version
pom.xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.3.0</version> <!-- Change to your Spark version -->
</dependency>
Change your code like below:
package com.sparkKafka

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

case class KafkaMessage(key: String, value: String, topic: String, partition: Int, offset: Long, timestamp: String)

object SparkKafkaTopic {
  def main(args: Array[String]) {
    //val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
    println("hey")
    val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
    import spark.implicits._
    val mySchema = StructType(Array(
      StructField("userName", StringType),
      StructField("songName", StringType)))
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sampleTopic1")
      .load()
    val query = df
      .as[KafkaMessage]
      .select(split($"value", ",")(0).as("userName"), split($"value", ",")(1).as("songName"))
      .writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
/*
+------+--------+
|userid|songname|
+------+--------+
|   sid|Believer|
+------+--------+
*/
The spark-sql-kafka jar is missing; it contains the implementation of the 'kafka' data source.
You can add the jar using a config option, or build a fat jar which includes the spark-sql-kafka jar. Please use the relevant version of the jar:
val spark = SparkSession.builder()
  .appName("SparkKafka").master("local[*]")
  .config("spark.jars", "/path/to/spark-sql-kafka-xxxxxx.jar")
  .getOrCreate()
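Alternatively, rather than pointing at a local jar path, Spark can resolve the connector from Maven via the spark.jars.packages config. A sketch (shown in PySpark for consistency with the rest of this page; the same config key works from the Scala builder, and the artifact coordinates must match your Spark/Scala version):

from pyspark.sql import SparkSession

# Spark downloads the Kafka connector and its dependencies at startup;
# adjust the version to match your Spark and Scala versions.
spark = SparkSession.builder \
    .appName("SparkKafka") \
    .master("local[*]") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0") \
    .getOrCreate()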

SparkContext: Invoking stop() from shutdown hook (Spark and Kubernetes)

I'm trying to run Spark and MLlib code in a Kubernetes cluster.
When I run the same code in client mode, or inside a Docker container which I built, it works without any issue.
But when I run the same code inside the Kubernetes cluster (PKS platform), it fails with no error.
import os
import numpy as np
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
from pyspark.sql.functions import percent_rank
from pyspark.sql import Window
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler,VectorIndexer
from pyspark.sql.functions import broadcast
import datetime
##########sparkcontext###############
sc= SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
print("started")
###########RMSLE-calculation########
def rmsle(real, predicted):
    sum = 0.0
    for x in range(len(predicted)):
        if predicted[x] < 0 or real[x] < 0:  # check for negative values
            continue
        p = np.log(predicted[x] + 1)
        r = np.log(real[x] + 1)
        sum = sum + (p - r)**2
    return (sum/len(predicted))**0.5
##########Reading of imput Data###########
customer = sqlContext.read.csv('hdfs://XXXXXX:/user/root/customer.csv', header=True, inferSchema = True)
customer =customer.select("*").toPandas()
lineitem = sqlContext.read.csv('hdfs://XXXXXXX:/user/root/lineitem.csv', header=True, inferSchema = True)
lineitem =lineitem.select("*").toPandas()
order = sqlContext.read.csv('hdfs://XXXXXXX:/user/root/orders.csv', header=True, inferSchema = True)
order =order.select("*").toPandas()
print("data has been read")
###########ETL############################
sales = order.join(customer, order.o_custkey == customer.c_custkey, how = 'inner')
sales = sales.sort_index()
sales.columns = ['key_old', 'o_orderdate', 'o_orderkey', 'o_custkey', 'o_orderpriority',
'o_shippriority', 'o_clerk', 'o_orderstatus', 'o_totalprice',
'o_comment', 'c_custkey', 'c_mktsegment', 'c_nationkey', 'c_name',
'c_address', 'c_phone', 'c_acctbal', 'c_comment']
sales2 = sales.join(lineitem,sales.o_orderkey == lineitem.l_orderkey, how = 'outer')
sales3 = sales2.groupby(by = 'o_orderdate')
sales4 = sales3.agg({'l_quantity': 'sum'})# .withColumnRenamed("sum(l_quantity)", "TOTAL_SALES") .withColumnRenamed("o_orderdate", "ORDERDATE")
print("End of ETL pipeline")
orderdates = pd.to_datetime(sales4.index.values)
orderdates = [datetime.datetime(i.year, i.month, i.day,) for i in orderdates]
l = []
l2 = []
for i in orderdates:
    l = []
    l.append(i.timestamp())
    l.append(i.day)
    l.append(i.timetuple().tm_wday)
    l.append(i.timetuple().tm_yday)
    l.append(i.isocalendar()[1])
    l2.append(l)
print("dateconverted")
tmp = np.array(sales4.values)
tmp = tmp.reshape(tmp.shape[0],)
data_new = pd.DataFrame()
data_new['SALES'] = tmp
data_new[['DATE','DAY','WDAY','YDAY','WEEK']] = pd.DataFrame(np.array(l2))
data_new['ONES'] = np.ones((len(data_new)))
print("converted to datframe")
X = np.array(data_new[['DATE','DAY','WDAY','YDAY','WEEK','ONES']])
X = X.reshape(X.shape[0],X.shape[1])
Y = np.array(data_new[['SALES']])
Y = Y.reshape(Y.shape[0],1)
cutoff = 0.1
length = int((1-cutoff)*(len(X)))
X_train = X[0:length]
X_test = X[length:len(X)]
Y_train = Y[0:length]
Y_test = Y[length:len(Y)]
print("pre-processingdone")
weights = np.dot(np.dot(np.linalg.inv(np.dot(X_train.T,X_train)),X_train.T),Y_train)
print("model Ready")
Y_pred = np.dot(X_test,weights)
Y_pred = Y_pred.reshape(Y_pred.shape[0],)
Y_test = Y_test.reshape(Y_test.shape[0],)
print("predictions done")
RMSE = np.sqrt(np.mean((Y_test-Y_pred)**2))
RMSLE = rmsle(Y_test,Y_pred)
print(RMSE)
print(RMSLE)
sc.stop()
Environment:
- Kubernetes cluster: PKS platform
- Enough memory: 16 GB on the master and 16 GB RAM on each of the 3 worker nodes
I don't see any spike; during processing it hardly utilises 30% of the cluster's memory.
My image is 2 GB and the data is hardly 100 MB.
It's failing during the below code:
weights = np.dot(np.dot(np.linalg.inv(np.dot(X_train.T,X_train)),X_train.T),Y_train)
print("model Ready")
Y_pred = np.dot(X_test,weights)
Y_pred = Y_pred.reshape(Y_pred.shape[0],)
Y_test = Y_test.reshape(Y_test.shape[0],)
print("predictions done")
RMSE = np.sqrt(np.mean((Y_test-Y_pred)**2))
RMSLE = rmsle(Y_test,Y_pred)
print(RMSE)
print(RMSLE)
sc.stop()
Below is part of the driver log, and I don't see any error in it:
19/12/26 10:52:23 INFO ContextCleaner: Cleaned accumulator 238
19/12/26 10:52:23 INFO ContextCleaner: Cleaned accumulator 222
19/12/26 10:52:23 INFO ContextCleaner: Cleaned accumulator 241
19/12/26 10:52:23 INFO ContextCleaner: Cleaned accumulator 228
data has been read
End of ETL pipeline
dateconverted
converted to datframe
pre-processingdone
19/12/26 10:52:35 INFO SparkContext: Invoking stop() from shutdown hook
19/12/26 10:52:35 INFO SparkUI: Stopped Spark web UI at http://spark-ml-test-4ff4386f41d48a9f-driver-svc.spark-jobs.svc:4040
19/12/26 10:52:35 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
19/12/26 10:52:35 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
19/12/26 10:52:35 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
19/12/26 10:52:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/12/26 10:52:35 INFO MemoryStore: MemoryStore cleared
19/12/26 10:52:35 INFO BlockManager: BlockManager stopped
19/12/26 10:52:35 INFO BlockManagerMaster: BlockManagerMaster stopped
19/12/26 10:52:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/12/26 10:52:35 INFO SparkContext: Successfully stopped SparkContext
19/12/26 10:52:35 INFO ShutdownHookManager: Shutdown hook called
19/12/26 10:52:35 INFO ShutdownHookManager: Deleting directory /var/data/spark-18e6167c-433f-41d8-82c4-b11ba9a3bf8c/spark-3a8d3e48-292c-4018-b003-bde80641eb90/pyspark-4b10b351-cf07-4452-8285-acb67157db80
19/12/26 10:52:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-29b813d2-f377-4d84-be3a-4372f69e58b5
19/12/26 10:52:35 INFO ShutdownHookManager: Deleting directory /var/data/spark-18e6167c-433f-41d8-82c4-b11ba9a3bf8c/spark-3a8d3e48-292c-4018-b003-bde80641eb90
Any guess or idea on this?

how to correctly configure maxResultSize?

I can't find a way to set the driver's max result size. Below is my configuration:
conf = pyspark.SparkConf().setAll([("spark.driver.extraClassPath", "/usr/local/bin/postgresql-42.2.5.jar")
                                   ,("spark.executor.instances", "4")
                                   ,("spark.executor.cores", "4")
                                   ,("spark.executor.memories", "10g")
                                   ,("spark.driver.memory", "15g")
                                   ,("spark.dirver.maxResultSize", "0")
                                   ,("spark.memory.offHeap.enabled", "true")
                                   ,("spark.memory.offHeap.size", "20g")])
sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()
sqlContext = SQLContext(sc)
sqlContext = SQLContext(sc)
I get this error after joining 2 large tables and calling collect:
'Py4JJavaError: An error occurred while calling o292.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 101 tasks (1028.8 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)'
I have seen similar problems on Stack Overflow advising to increase maxResultSize, but I can't figure out how to do that correctly.
The following should do the trick. Also note that you have misspelled ("spark.executor.memories", "10g"); the correct configuration key is 'spark.executor.memory'.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('yarn')  # depends on the cluster manager of your choice
         .appName('StackOverflow')
         .config('spark.driver.extraClassPath', '/usr/local/bin/postgresql-42.2.5.jar')
         .config('spark.executor.instances', 4)
         .config('spark.executor.cores', 4)
         .config('spark.executor.memory', '10g')
         .config('spark.driver.memory', '15g')
         .config('spark.memory.offHeap.enabled', True)
         .config('spark.memory.offHeap.size', '20g')
         .config('spark.dirver.maxResultSize', '4096')
         .getOrCreate())
sc = spark.sparkContext
Alternatively, try this:
from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf() \
    .setMaster('yarn') \
    .setAppName('StackOverflow') \
    .set('spark.driver.extraClassPath', '/usr/local/bin/postgresql-42.2.5.jar') \
    .set('spark.executor.instances', 4) \
    .set('spark.executor.cores', 4) \
    .set('spark.executor.memory', '10g') \
    .set('spark.driver.memory', '15g') \
    .set('spark.memory.offHeap.enabled', True) \
    .set('spark.memory.offHeap.size', '20g') \
    .set('spark.dirver.maxResultSize', '4096')

spark_context = SparkContext(conf=conf)
Old post, but there was a typo in "spark.dirver.maxResultSize". It should of course be "spark.driver.maxResultSize".
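For the record, with the spelling fixed the setting looks like this (a minimal sketch; "4g" is an arbitrary example value, and "0" removes the limit entirely):

from pyspark.sql import SparkSession

# spark.driver.maxResultSize caps the total size of serialized results
# collected back to the driver; "0" means unlimited.
spark = SparkSession.builder \
    .appName("StackOverflow") \
    .config("spark.driver.maxResultSize", "4g") \
    .getOrCreate()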

How do I write to Kafka using pyspark?

I am trying to write to Kafka using PySpark.
I got stuck on stage zero:
[Stage 0:> (0 + 8) / 9]
Then I get a timeout error:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
Code is:
import os
import sys

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 pyspark-shell'

from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql.types import *

def main():
    spark = SparkSession.builder.master("local").appName("Spark CSV Reader") \
        .getOrCreate()

    dirpath = os.path.abspath(sys.argv[1])
    os.chdir(dirpath)

    mySchema = StructType([
        StructField("id", IntegerType()), StructField("name", StringType()),
        StructField("year", IntegerType()), StructField("rating", DoubleType()),
        StructField("duration", IntegerType())])

    streamingDataFrame = spark.readStream.schema(mySchema) \
        .csv('file://' + dirpath + "/")

    streamingDataFrame.selectExpr("CAST(id AS STRING) AS key",
                                  "to_json(struct(*)) AS value") \
        .writeStream.format("kafka").option("topic", "topicName") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("checkpointLocation", "./chkpt").start()
I am running HDP 2.6.
As I mentioned in the comments, Spark runs on multiple machines, and it is highly unlikely that all these machines will be Kafka brokers.
Use the external address(es) for the Kafka cluster
.option("kafka.bootstrap.servers", "<kafka-broker-1>:9092,<kafka-broker-2>:9092")\
