I am trying to read data from HDFS using Spark Streaming.
Below is my code.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.fs._
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Seconds(10))
val directory = "hdfs://pc-XXXX:9000/hdfs/watchdirectory/"
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](directory, (t: Path) => true, true).map(_._2.toString)
lines.count()
lines.print()
ssc.start()
ssc.awaitTermination()
The code runs but it does not read any data from HDFS.
After every 10 seconds I get a blank line.
I have gone through the documentation for fileStream and I know that I have to move the files into the watch directory, but it still doesn't work for me.
I have also tried textFileStream, with no luck.
I am using Spark 2.0.0 built with Scala 2.11.8.
Any suggestions, please?
Please try the below:
val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Seconds(10))
// textFileStream already returns the lines as Strings, so no map over a (key, value) pair is needed
val lines = ssc.textFileStream("hdfs://pc-XXXX:9000/hdfs/watchdirectory/")
lines.count()
lines.print()
ssc.start()
ssc.awaitTermination()
Once you execute this, move the files to /hdfs/watchdirectory/.
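As with fileStream above, textFileStream only picks up files that appear in the watch directory after the streaming context has started, and the file should show up atomically, so land it somewhere else first and then move it in. A rough sketch of doing that from a small Python helper (the local file name and the staging path are made up for illustration; the plain hdfs dfs commands work just as well on their own):
import subprocess

# Only the watch directory below comes from the question; the local file
# and the HDFS staging path are hypothetical.
local_file = "input.txt"
staging_path = "hdfs://pc-XXXX:9000/hdfs/staging/input.txt"
watch_dir = "hdfs://pc-XXXX:9000/hdfs/watchdirectory/"

# Upload to a staging location first, then move into the watched directory
# so the file appears there atomically after ssc.start().
subprocess.run(["hdfs", "dfs", "-put", local_file, staging_path], check=True)
subprocess.run(["hdfs", "dfs", "-mv", staging_path, watch_dir], check=True)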
I am trying to write a Spark script that monitors a directory and processes data as it streams in.
In the code below I don't get any errors, but it also doesn't print the files.
Does anyone have any ideas?
import findspark
findspark.init()
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext.getOrCreate(conf=conf)
ssc = StreamingContext(sc, 1)  # micro-batched every 1 second
lines = ssc.textFileStream('file:///C:/Users/kiera/OneDrive/Documents/logs')  # directory of log files; does not work for subdirectories
lines.pprint()
ssc.start()
ssc.awaitTermination()
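One thing that may explain the missing output, echoing the answer above: textFileStream only reports files that land in the directory after ssc.start(), so files already sitting in the logs folder are ignored. A minimal sketch of a helper you could run in a second terminal to exercise the job while it is running (the file names are invented; only the directory comes from the snippet above):
import os
import time

watch_dir = r"C:/Users/kiera/OneDrive/Documents/logs"  # same directory the stream watches

for i in range(3):
    # Write under a dot-prefixed name first (Spark's file stream skips hidden
    # files by default), then rename so the finished file appears atomically.
    tmp_path = os.path.join(watch_dir, ".batch_%d.tmp" % i)
    final_path = os.path.join(watch_dir, "batch_%d.log" % i)
    with open(tmp_path, "w") as f:
        f.write("test line %d\n" % i)
    os.rename(tmp_path, final_path)
    time.sleep(2)  # give the 1-second micro-batches time to pick each file up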
I'm trying to run a Spark Streaming job in the spark-shell on localhost. Following the code from here, this is what I first tried:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(30))
This gives the following error:
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
And so I had to try this:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
  .set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(30))
This runs but with the following warning:
2018-05-17 17:01:14 WARN SparkContext:87 - Multiple running SparkContexts detected in the same JVM!
So I'd like to know: is there another way of declaring a StreamingContext object that does not require setting spark.driver.allowMultipleContexts to true, since using multiple contexts appears to be discouraged? Thanks.
You need to use the existing SparkContext sc to create the StreamingContext
val ssc = new StreamingContext(sc, Seconds(30))
When you create it using the alternate constructor, i.e. the one that takes a SparkConf, it internally creates another SparkContext, and that's why you get that warning.
I have tried it like this, but no luck. File1 and file2 are on my local machine, not in HDFS. Please help.
val sparkConf = new SparkConf().setAppName("sample")
val sc = new SparkContext(sparkConf)
val sqlContext = SQLContext.getOrCreate(sc)
val file1 = sc.textFile("file1.txt", minPartitions)
val file2 = sc.textFile("file2.txt", minPartitions)
Using a Kafka Stream in PySpark, is it possible to seek to the beginning of a Kafka topic without creating a new consumer group?
For example, I have the following code snippet:
...
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 pyspark-shell'
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext('local[2]', appName="MyStreamingApp_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 30)
spark = SparkSession(sc)
kafkaStream = KafkaUtils.createStream(ssc, zookeeper_ip, 'group-id', {'messages': 1})
counted = kafkaStream.count()
...
My goal is to do something along the lines of
kafkaStream.seekToBeginningOfTopic()
Currently, I'm creating a new consumer group to re-read from the beginning of the topic, e.g.:
kafkaStream = KafkaUtils.createStream(ssc, zookeeper, 'group-id-2', {'messages': 1}, {"auto.offset.reset": "smallest"})
Is this the proper way to consume a topic from the beginning using PySpark?
I have the following test code:
from pyspark import SparkContext, SQLContext
sc = SparkContext('local')
sqlContext = SQLContext(sc)
print('Created spark context!')
if __name__ == '__main__':
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost/mysql",
        driver="com.mysql.jdbc.Driver",
        dbtable="users",
        user="user",
        password="****",
        properties={"driver": 'com.mysql.jdbc.Driver'}
    ).load()
    print(df)
When I run it, I get the following error:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
In Scala, this is solved by adding the mysql-connector-java .jar to the project.
However, in Python I have no idea how to tell the pyspark module to link the mysql-connector file.
I have seen this solved with examples like
spark --package=mysql-connector-java testfile.py
But I don't want this, since it forces me to run my script in a weird way. I would like an all-Python solution, or to copy a file somewhere, or to add something to the path.
You can pass spark-submit arguments from within your script by setting PYSPARK_SUBMIT_ARGS before the SparkConf is initialized and the SparkContext is created:
import os
from pyspark import SparkConf, SparkContext
SUBMIT_ARGS = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf()
sc = SparkContext(conf=conf)
Or you can add them to your $SPARK_HOME/conf/spark-defaults.conf.
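The spark-defaults.conf entry would look something like this (same package coordinates as in the snippet above):
spark.jars.packages  mysql:mysql-connector-java:5.1.39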
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("Word Count")\
    .config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar")\
    .getOrCreate()

dataframe_mysql = spark.read\
    .format("jdbc")\
    .option("url", "jdbc:mysql://localhost/database_name")\
    .option("driver", "com.mysql.jdbc.Driver")\
    .option("dbtable", "employees").option("user", "root")\
    .option("password", "12345678").load()
print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file
If you are using PyCharm and want to run line by line instead of submitting your .py through spark-submit, you can copy your .jar to c:\spark\jars\ and your code could look like this:
from pyspark import SparkConf, SparkContext, sql
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sqlContext = sql.SQLContext(spark.sparkContext, spark)

source_df = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver',  # or com.mysql.jdbc.Driver for older connector versions
    dbtable='table1',
    user='root',
    password='****').load()

print(source_df)
source_df.show()