Printing Kafka debug message on PySpark job - apache-spark

Is there a way to print Kafka debug messages (something similar to librdkafka's debug output, or the kafkacat -D option) when running a PySpark job?
The issue is that I used the following code on PySpark to connect to a Kafka cluster, call it A; it works and prints to the console every time a new message comes in. But when I switched to another cluster, call it B, set up the same way as cluster A, nothing is printed to the screen when new messages come in, even though I can see the messages going through just fine with the kafkacat tool on both clusters.
consumer.py
from pyspark.sql import SQLContext, SparkSession

spark = SparkSession.builder.appName("KafkaConsumer").getOrCreate()
sc = spark.sparkContext
sqlc = SQLContext(sc)

hosts = "host1:9092,host2:9092,host3:9092"
topic = "myTopic"
securityProtocol = "SASL_PLAINTEXT"
saslMechanism = "PLAIN"

try:
    df = sqlc \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", hosts) \
        .option("kafka.security.protocol", securityProtocol) \
        .option("kafka.sasl.mechanism", saslMechanism) \
        .option("startingOffsets", "earliest") \
        .option("subscribe", topic) \
        .load()
    dss = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
        .writeStream.outputMode('append') \
        .format("console") \
        .start()
    dss.awaitTermination()
except KeyboardInterrupt:
    print('shutting down...')
kafka.jaas
KafkaClient {
org.apache.kafka.common.security.plain.PlainLoginModule required
username="user1"
password="sssshhhh"
serviceName="kafka";
};
shell command:
spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 \
--files "kafka.jaas" \
--driver-java-options "-Djava.security.auth.login.config=kafka.jaas" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka.jaas" \
"./consumer.py"
It seems Kafka cluster B is reachable, since I am able to get offset information from it; it's just not reading the messages.

The issue was caused by the worker nodes' connection to the Kafka cluster: the worker nodes' IP addresses weren't on the firewall whitelist of the Kafka cluster. The code above caused the worker nodes to time out and keep retrying the connection to the Kafka cluster until an interrupt signal was given.
As for the error message itself, no error was surfaced on the master node while the worker nodes were still attempting to connect to the Kafka cluster, but every now and then a message was printed on the master console saying it failed to communicate with a worker node (or some message like 'gathering information').
NOTE: This is what I presume happened on the worker nodes (which I was unable to log on to, due to admin rights), but there may be a log stored on the worker nodes. (If someone can back this up or prove otherwise, it would be much appreciated.)
As for the Kafka debug messages themselves, errors, warnings and info messages already appear to be printed to the screen by default, depending on the logger level setup; in odd instances like this one, though, the log message may not be directly visible on the screen.
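If you do want to see the Kafka client's own debug output (the closest analogue to librdkafka's debug logs or kafkacat -D), the Java consumer used by the Kafka source logs through log4j, so raising the log level should surface it. A minimal sketch, assuming the default log4j setup (DEBUG is very verbose):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaConsumer").getOrCreate()

# Raise the driver-side log level; at DEBUG the org.apache.kafka.* client
# loggers (connections, metadata, fetches) should show up on the console too.
spark.sparkContext.setLogLevel("DEBUG")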

Related

EMR Multiple Steps Running Spark Streaming

I'm trying to run 2 separate PySpark streaming jobs as two steps in an AWS EMR cluster.
When I start the first streaming job it runs normally and its status stays RUNNING.
When I try to run the second job it gets stuck in ACCEPTED status, as shown in the image below.
My PySpark streaming code, simplified:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

df = spark.readStream \
    .format("csv") \
    .option("delimiter", "|") \
    .option("header", True) \
    .option("multiLine", True) \
    .option("ignoreLeadingWhiteSpace", True) \
    .option("ignoreTrailingWhiteSpace", True) \
    .option("escape", "\"") \
    .load("s3a://input_bucket/path/") \
    .withColumn("file_path", input_file_name())

def for_each_batch(df, batchId):
    df.write.format("delta").mode("append").save("s3a://output_bucket/path/")

query_changes = df.writeStream \
    .foreachBatch(for_each_batch) \
    .option("checkpointLocation", "./checkpoint") \
    .start()

query_changes.awaitTermination()
On the Steps page of the EMR cluster both jobs show a RUNNING status.
I suspect that the awaitTermination() call on the stream is blocking the execution.
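For what it's worth, awaitTermination() only blocks the thread of the application it runs in, so it shouldn't keep a separate EMR step from starting; it only matters if both streams run inside the same Spark application. For that single-application case, a minimal sketch (using placeholder rate sources rather than the CSV/Delta pipeline above) would start both queries before blocking:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-streams").getOrCreate()

# two independent streaming queries in the same application
q1 = spark.readStream.format("rate").load() \
    .writeStream.format("console") \
    .option("checkpointLocation", "/tmp/ckpt1").start()

q2 = spark.readStream.format("rate").load() \
    .writeStream.format("console") \
    .option("checkpointLocation", "/tmp/ckpt2").start()

# blocks until any active query stops, so starting q2 is not held up by q1
spark.streams.awaitAnyTermination()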

Spark job doesn't start consuming from huge Kafka topic

I am facing a supposedly simple problem which is nevertheless making me scratch my head against the wall.
I've set up a Kafka cluster (MSK in AWS) with one topic and 200 partitions; right now the topic has collected 100M events and 1 TB of data.
MSK is configured with 6 kafka.m5.4xlarge brokers and this is the basic config:
log.retention.ms = 300000
message.max.bytes = 10485760
replica.fetch.max.bytes = 10485760
replica.fetch.response.max.bytes = 10485760
socket.receive.buffer.bytes = 10485760
socket.request.max.bytes = 10485760
socket.send.buffer.bytes = 10485760
I want to process these events one by one using a Spark cluster, so I have created a simple Spark job with this code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json, split
from src import util, event_schema

def print_row(row):
    print(row)

if __name__ == "__main__":
    config = util.get_config()

    spark_session = SparkSession \
        .builder \
        .appName('test') \
        .getOrCreate()

    # Read from kafka
    events_df = spark_session.readStream \
        .format('kafka') \
        .option('kafka.bootstrap.servers', config['kafka']['bootstrap_servers']) \
        .option('kafka.sasl.jaas.config', f'org.apache.kafka.common.security.scram.ScramLoginModule required username="{config["kafka"]["username"]}" password="{config["kafka"]["password"]}";') \
        .option('kafka.sasl.mechanism', 'SCRAM-SHA-512') \
        .option('kafka.security.protocol', 'SASL_SSL') \
        .option('subscribe', config['kafka']['topic']) \
        .option('groupIdPrefix', 'test') \
        .option('failOnDataLoss', 'false') \
        .load()

    events_df = events_df.selectExpr('CAST(value AS STRING) as data')
    events_df = events_df.select(explode(split(events_df.data, '\n')))
    events_df = events_df.select(from_json(col('col'), event_schema).alias('value'))
    events_df = events_df.selectExpr('value.*')

    events_df.writeStream \
        .foreach(print_row) \
        .start()
This simple Spark job should start consuming every single event and print it.
When the topic is empty it correctly starts consuming; however, if I attach this consumer group to the existing topic with this amount of data, it simply doesn't start consuming at all, as if it were stuck. The same doesn't happen if I write a simple Kafka consumer (not using PySpark): it correctly starts consuming (even though it takes a few minutes to start).
What is wrong with my code, and how could I simply start consuming events from the Kafka topic straight away?
Thanks
Please follow the Spark–Kafka integration guide when reading a stream from Kafka into Spark, and try exploring the following options described on that page (a sketch applying both follows below):
startingOffsets: This defaults to latest when reading a stream, which means the application will only read new events arriving in Kafka after the Spark application is deployed. If you want to read the historic, not-yet-processed events, try the value earliest for this option. Look into checkpointing as well.
maxOffsetsPerTrigger: This is useful if you want to read only a limited number of events per trigger.
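For instance, reusing the spark_session and config objects from the question (the security options are omitted for brevity and the 10000 limit is just illustrative), applying both options might look like this:
# startingOffsets='earliest' reads the historic events as well;
# maxOffsetsPerTrigger caps how many offsets each micro-batch reads.
events_df = (
    spark_session.readStream
    .format('kafka')
    .option('kafka.bootstrap.servers', config['kafka']['bootstrap_servers'])
    .option('subscribe', config['kafka']['topic'])
    .option('startingOffsets', 'earliest')
    .option('maxOffsetsPerTrigger', 10000)
    .option('failOnDataLoss', 'false')
    .load()
)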

Spark-Streaming hangs with kafka starting offset at earliest (Kafka 2, spark 2.4.3)

I'm having an issue with Spark Streaming and Kafka. While running a sample program to consume from a Kafka topic and output micro-batched results to the terminal, my job seems to hang when I set the option:
df.option("startingOffsets", "earliest")
Starting the job from the latest offset works fine; results are printed to the terminal as each micro-batch streams through.
I was thinking maybe this was a resources issue (I'm trying to read from a topic with quite a bit of data), but I don't seem to have memory/CPU issues (running this job with a local[*] cluster). The job never really seems to start, but just hangs on the line:
19/09/17 15:21:37 INFO Metadata: Cluster ID: JFXVL24JQ3K4CEbE-VA58A
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkConf().setMaster("local[*]").setAppName("spark-test")
val streamContext = new StreamingContext(sc, Seconds(1))

val spark = SparkSession.builder().appName("spark-test")
  .getOrCreate()

val topic = "topic.with.alotta.data"

// subscribe to kafka
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "127.0.0.1:9092")
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()

// write to console
df.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
I'd expect to see results printed to the console... but the application just seems to hang, as I mentioned. Any thoughts? It feels like a Spark resource issue (because I'm running a local "cluster" against a topic that has a lot of data). Is there something about the nature of streaming DataFrames that I'm missing?
Writing to the console causes all data to be collected in memory in the driver on every trigger. Since you're currently not limiting the size of your batches, this means the entire topic contents are being accumulated in the driver. See https://spark.apache.org/docs/2.4.3/structured-streaming-programming-guide.html#output-sinks
Setting a limit on your batch sizes should fix your issue.
Try adding the maxOffsetsPerTrigger setting when reading from Kafka...
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "127.0.0.1:9092")
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 1000)
  .load()
See https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html for details.

How to distribute JDBC jar on Cloudera cluster?

I've just installed a new Spark 2.4 from CSD on my CDH cluster (28 nodes) and am trying to install a JDBC driver in order to read data from a database from within a Jupyter notebook.
I downloaded the jar and copied it to the /jars folder on one node; however, it seems that I have to do the same on each and every host (!). Otherwise I'm getting the following error from one of the workers:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Is there any easy way (without writing bash scripts) to distribute the jar files with packages across the whole cluster? I wish Spark could distribute them itself (or maybe it does and I just don't know how).
Spark has a jdbc format reader you can use.
Launch a Scala shell to confirm your MS SQL Server driver is on your classpath, for example:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it in your classpath where you initialize your session, for example:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Connect to your MS SQL Server via Spark JDBC, for example via Spark Python:
# option 1
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

# option 2
jdbcDF2 = spark.read \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
Specifics and additional ways to build connection strings can be found here:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You mentioned Jupyter... if you still cannot get the above to work, try setting some env vars via this post (cannot confirm whether this works though):
https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6
At the end of the day, all you really need is the driver class placed on an edge node (the client where you launch Spark) and appended to your classpath. Then make the connection and partition the read to scale performance, since a JDBC read from an RDBMS comes in as a single thread, hence 1 partition; a partitioned read is sketched below.
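A sketch of what a partitioned JDBC read could look like (the URL, bounds and column are hypothetical; partitionColumn, lowerBound, upperBound and numPartitions are standard Spark JDBC options):
# partitionColumn should be a numeric column; the bounds only decide how the
# ranges are split across partitions, they do not filter rows.
jdbcDF = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://dbserver;databaseName=mydb")  # hypothetical URL
    .option("dbtable", "schema.tablename")
    .option("user", "username")
    .option("password", "password")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("partitionColumn", "id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .option("numPartitions", 8)
    .load()
)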

Why does spark-submit fail with "AnalysisException: kafka is not a valid Spark SQL Data Source"?

I use Spark 2.1.0 with Kafka 0.10.2.1.
I write a Spark application that reads datasets from a Kafka topic.
The code is as follows:
package com.example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MLP {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("MLP")
            .getOrCreate();
        Dataset<Row> df = spark
            .read()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092,localhost:9093")
            .option("subscribe", "resultsTopic")
            .load();
        df.show();
        spark.stop();
    }
}
My deployment script is as follows:
spark-submit \
--verbose \
--jars $(echo /home/hduser1/spark/jars/*.jar | tr ' ' ',') \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.10 \
--class com.**** \
--master (Spark Master URL) /path/to/jar
However I get the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
kafka is not a valid Spark SQL Data Source.;
I've tried using the same application with a non-Kafka data source and the DataFrame is correctly created. I've also tried using YARN in client mode and I get the same error.
Kafka as a data source for non-streaming DataFrames/Datasets will only be available from Spark 2.2; see this issue in the Spark JIRA.
As @JacekLaskowski mentioned, change the package to (modified Jacek's version to use 2.2):
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
What's more, use readStream to read a stream of data.
You cannot use show with streaming data sources; use the console format instead.
StreamingQuery query = df.writeStream()
    .outputMode("append")
    .format("console")
    .start();

query.awaitTermination();
First of all, you should replace --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.10 (which I doubt works) with the following:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1
I don't think version 2.10 was ever available. You may have been thinking of 2.1.0, which could have worked had you used 2.1.0 (not 2.10).
Secondly, remove --jars $(echo /home/hduser1/spark/jars/*.jar | tr ' ' ','), since Spark loads those jars anyway, except for some additional jars like the one for the Kafka source.
That should give you access to the kafka source format.
