Simple Spark Streaming not printing lines - apache-spark

I am trying to write a Spark script that monitors a directory and processes data as it streams in.
In the code below, I don't get any errors, but it also doesn't print the files.
Does anyone have any ideas?
import findspark
findspark.init()
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext.getOrCreate(conf=conf)
ssc = StreamingContext(sc, 1)  # micro-batch every 1 second
lines = ssc.textFileStream('file:///C:/Users/kiera/OneDrive/Documents/logs')  # directory of log files; does not work for subdirectories
lines.pprint()
ssc.start()
ssc.awaitTermination()

Related

PySpark Cassandra Database Connection Problem

I am trying to use Cassandra with PySpark. I can make a remote connection to the Spark server properly, but I run into trouble at the stage of reading a Cassandra table. I tried all of the DataStax connectors and changed the Spark configs (cores, memory, etc.), but I couldn't get it to work. (The commented-out rows in the code below are my attempts.)
Here is my Python code:
import os
os.environ['JAVA_HOME']="C:\Program Files\Java\jdk1.8.0_271"
os.environ['HADOOP_HOME']="E:\etc\spark-3.0.1-bin-hadoop2.7"
os.environ['PYSPARK_DRIVER_PYTHON']="/usr/local/bin/python3.7"
os.environ['PYSPARK_PYTHON']="/usr/local/bin/python3.7"
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=XX.XX.XX.XX spark.cassandra.auth.username=username spark.cassandra.auth.password=passwd pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars .ivy2\jars\spark-cassandra-connector-driver_2.12-3.0.0-alpha2.jar pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 pyspark-shell'
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
conf.set("spark.shuffle.service.enabled", "false")
conf.set("spark.dynamicAllocation.enabled","false")
conf.set("spark.executor.cores", "2")
conf.set("spark.executor.memory", "5g")
conf.set("spark.executor.instances", "1")
conf.set("spark.jars", "C:\\Users\\verianalizi\\.ivy2\\jars\\spark-cassandra-connector_2.12-3.0.0-beta.jar")
conf.set("spark.cassandra.connection.host","XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username","username")
conf.set("spark.cassandra.auth.password","passwd")
conf.set("spark.cassandra.connection.port", "9042")
# conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
sc = SparkContext(conf=conf)
# sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
# It works well until now
def load_and_get_table_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .option("keyspace", keys_space_name)\
        .option("table", table_name)\
        .load()
    return table_df
movies = load_and_get_table_df("weather", "currentweatherconditions")
The error I get is:
Does anyone have any idea about this?
This happens because you're specifying only the spark.jars property and pointing it to a single jar, but the Spark Cassandra Connector depends on a number of additional jars that aren't included in that list. I recommend instead either using spark.jars.packages with the coordinate com.datastax.spark:spark-cassandra-connector_2.12:3.0.0, or specifying in spark.jars the path to the assembly jar that bundles all the necessary dependencies.
By the way, 3.0 was released several months ago - why are you still using the beta?
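As a rough sketch of the first option, the conf from the question only needs the spark.jars line swapped for spark.jars.packages (the master, host, and credential values below are the placeholders from the question, not real values):
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
# pull the connector plus all of its transitive dependencies from Maven
conf.set("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0")
conf.set("spark.cassandra.connection.host", "XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username", "username")
conf.set("spark.cassandra.auth.password", "passwd")
conf.set("spark.cassandra.connection.port", "9042")
sc = SparkContext(conf=conf)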

Write results from Kafka to CSV in PySpark

I have set up a Kafka broker and I manage to read the records with PySpark.
import os
from pyspark.sql import SparkSession
import pyspark
import sys
from pyspark import SparkConf, SparkContext, SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
conf = SparkConf().setMaster("my-master").setAppName("Kafka_Spark")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc,5)
kvs = KafkaUtils.createDirectStream(ssc,
                                    ['enriched_messages'],
                                    {"metadata.broker.list": "my-kafka-broker", "auto.offset.reset": "smallest"},
                                    keyDecoder=lambda x: x,
                                    valueDecoder=lambda x: x)
lines = kvs.map(lambda x: x[1])
lines.pprint()
ssc.start()
ssc.awaitTermination(10)
Example of the returned data (timestamp, name, lastname, height):
2020-05-07 09:16:38, JoHN, Doe, 182.5
I want to write these records to a CSV file. lines is of type KafkaTransformedDStream, and the classic RDD-based solution is not working.
Has anyone a solution to this?
Converting a DStream to a single RDD is not possible, as DStreams are continuous streams. You can use the following, which produces many files that you can later merge into a single file.
lines.saveAsTextFiles("prefix", "suffix")
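If one output file per micro-batch is acceptable, an alternative sketch (not part of the answer above; the output path is hypothetical) is to handle each batch with foreachRDD, since the records are already comma-separated strings:
import time
def save_batch(rdd):
    # skip empty micro-batches so no empty output directories are created
    if not rdd.isEmpty():
        # coalesce(1) writes one part file per batch; the per-batch directories can be merged later
        rdd.coalesce(1).saveAsTextFile("/tmp/enriched_csv/batch-%d" % int(time.time()))
lines.foreachRDD(save_batch)
Either way, the call has to be registered before ssc.start(), in place of or alongside lines.pprint().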

How to have multiple StreamingContexts in a single Spark application?

I'm trying to run a Spark Streaming job in the spark-shell on localhost. Following the code from here, this is what I first tried:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(30))
Which gives the following error:
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
And so I had to try this:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
  .set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(30))
This runs but with the following warning:
2018-05-17 17:01:14 WARN SparkContext:87 - Multiple running SparkContexts detected in the same JVM!
So I'd like to know: is there another way of declaring a StreamingContext object that does not require allowMultipleContexts == true, since using multiple contexts appears to be discouraged? Thanks.
You need to use the existing SparkContext sc to create the StreamingContext:
val ssc = new StreamingContext(sc, Seconds(30))
When you create it using the alternate constructor, i.e. the one that takes a SparkConf, it internally creates another SparkContext, and that's why you get that warning.
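For what it's worth, the same idea in PySpark is just a sketch away as well, reusing whatever SparkContext is already running instead of building a new one from a SparkConf:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# reuse the existing SparkContext rather than constructing a second one
sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, 30)  # 30-second batch interval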

MySQL read with PySpark

I have the following test code:
from pyspark import SparkContext, SQLContext
sc = SparkContext('local')
sqlContext = SQLContext(sc)
print('Created spark context!')
if __name__ == '__main__':
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost/mysql",
        driver="com.mysql.jdbc.Driver",
        dbtable="users",
        user="user",
        password="****",
        properties={"driver": 'com.mysql.jdbc.Driver'}
    ).load()
    print(df)
When I run it, I get the following error:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
In Scala, this is solved by importing the mysql-connector-java .jar into the project.
However, in Python I have no idea how to tell the pyspark module to link the mysql-connector file.
I have seen this solved with examples like
spark --package=mysql-connector-java testfile.py
But I don't want this since it forces me to run my script in a weird way. I would like an all-Python solution, or to copy a file somewhere, or to add something to the path.
You can pass arguments to spark-submit when creating your SparkContext, before the SparkConf is initialized:
import os
from pyspark import SparkConf, SparkContext
SUBMIT_ARGS = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf()
sc = SparkContext(conf=conf)
Or you can add them to your $SPARK_HOME/conf/spark-defaults.conf.
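For example (a sketch of the equivalent spark-defaults.conf entry, using the same connector coordinate as above):
spark.jars.packages  mysql:mysql-connector-java:5.1.39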
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Word Count")\
    .config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar")\
    .getOrCreate()

dataframe_mysql = spark.read\
    .format("jdbc")\
    .option("url", "jdbc:mysql://localhost/database_name")\
    .option("driver", "com.mysql.jdbc.Driver")\
    .option("dbtable", "employees").option("user", "root")\
    .option("password", "12345678").load()

print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file
If you are using PyCharm and want to run line by line instead of submitting your .py through spark-submit, you can copy your .jar to c:\spark\jars\ and your code could look like this:
from pyspark import SparkConf, SparkContext, sql
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sqlContext = sql.SQLContext(sc)
source_df = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver',  # or com.mysql.jdbc.Driver for the older Connector/J
    dbtable='table1',
    user='root',
    password='****').load()
print(source_df)
source_df.show()

Getting data from HDFS using Spark Streaming

I am trying to read data from HDFS using Spark Streaming.
Below is my code.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.fs._
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Seconds(10))
val directory ="hdfs://pc-XXXX:9000/hdfs/watchdirectory/"
val lines=ssc.fileStream[LongWritable, Text, TextInputFormat](directory, (t:Path) => true, true).map(_._2.toString)
lines.count()
lines.print()
ssc.start
ssc.awaitTermination()
The code runs, but it does not read any data from HDFS.
After every 10 seconds I get a blank line.
I have gone through the documentation for fileStream and I know that I have to move the file into the watched directory.
But it doesn't work for me.
I have also tried textFileStream, but no luck.
I am using Spark 2.0.0 built with Scala 2.11.8.
Any suggestions, please?
Please try the below:
val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.textFileStream("hdfs://pc-XXXX:9000/hdfs/watchdirectory/")
lines.count()
lines.print()
ssc.start
ssc.awaitTermination()
Once you execute this, move the files to
/hdfs/watchdirectory/
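The move matters because fileStream, as called above with newFilesOnly = true, only picks up files that appear in the directory after the context has started, and each file should arrive atomically. A sketch of the HDFS commands (the staging path is hypothetical):
hdfs dfs -put data.log /hdfs/staging/
hdfs dfs -mv /hdfs/staging/data.log /hdfs/watchdirectory/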
