Write results from Kafka to csv in pyspark - apache-spark

I have set up a Kafka broker and I manage to read the records with PySpark.
import os
from pyspark.sql import SparkSession
import pyspark
import sys
from pyspark import SparkConf, SparkContext, SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
conf = SparkConf().setMaster("my-master").setAppName("Kafka_Spark")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 5)
kvs = KafkaUtils.createDirectStream(
    ssc,
    ['enriched_messages'],
    {"metadata.broker.list": "my-kafka-broker", "auto.offset.reset": "smallest"},
    keyDecoder=lambda x: x,
    valueDecoder=lambda x: x)
lines = kvs.map(lambda x: x[1])
lines.pprint()
ssc.start()
ssc.awaitTermination(10)
Example of the returned data (timestamp, name, lastname, height):
2020-05-07 09:16:38, JoHN, Doe, 182.5
I want to write these records to a CSV file. lines is of type KafkaTransformedDStream, and the classic RDD-based solution does not work.
Has anyone a solution to this?

Converting a DStream to a single RDD is not possible, as a DStream is a continuous stream. You can use the following, which results in many files that you can later merge into a single file.
lines.saveAsTextFiles("prefix", "suffix")
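Alternatively, a minimal sketch (not from the original answer, assuming Spark 2.x and a hypothetical output path) that converts each micro-batch to a DataFrame inside foreachRDD and appends it as CSV:
from pyspark.sql import SparkSession

def write_batch_to_csv(rdd):
    if rdd.isEmpty():
        return
    spark = SparkSession.builder.getOrCreate()
    # each record is a comma-separated string: timestamp, name, lastname, height
    rows = rdd.map(lambda line: line.split(", "))
    df = spark.createDataFrame(rows, ["timestamp", "name", "lastname", "height"])
    # append mode adds a set of part files per batch; merge them later if needed
    df.write.mode("append").csv("/tmp/enriched_messages_csv")

lines.foreachRDD(write_batch_to_csv)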

Related

Simple spark streaming not printing lines

I am trying to write a Spark script that monitors a directory and processes data as it streams in.
In the code below, I don't get any errors, but it also doesn't print the files.
Does anyone have any ideas?
import findspark
findspark.init()
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
conf = (SparkConf()
.setMaster("local")
.setAppName("My app")
.set("spark.executor.memory", "1g"))
sc = SparkContext.getOrCreate(conf=conf)
ssc = StreamingContext(sc, 1) #microbatched every 1 second
lines = ssc.textFileStream('file:///C:/Users/kiera/OneDrive/Documents/logs')#directory of log files, Does not work for subdirectories
lines.pprint()
ssc.start()
ssc.awaitTermination()
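One likely culprit, offered as a side note rather than a confirmed answer: textFileStream only picks up files that appear in the monitored directory after the stream has started, and it expects files to show up atomically (e.g. via a move/rename). A minimal sketch to verify this, using a hypothetical file name and replacing the final ssc.start()/ssc.awaitTermination() lines above:
import os
import time

ssc.start()
time.sleep(2)  # give the receiver a moment to start

# write the file outside the monitored directory first, then move it in;
# textFileStream only detects files added after the stream started
tmp_path = 'C:/Users/kiera/OneDrive/Documents/new_test.log'
dst_path = 'C:/Users/kiera/OneDrive/Documents/logs/new_test.log'
with open(tmp_path, 'w') as f:
    f.write('hello streaming\n')
os.rename(tmp_path, dst_path)

ssc.awaitTermination(timeout=30)  # wait up to 30 seconds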

Pyspark And Cassandra - Extracting Data Into RDD as Fields from Map Field

I have a table in Cassandra with a map field whose data looks as follows:
test_id  test_map
1        {tran_id=99, tran_type=sample}
I am attempting to add these map entries to the existing RDD that I pull this data into, as new fields on the exact same key, which would look as follows:
test_id  test_map                        tran_id  tran_type
1        {tran_id=99, tran_type=sample}  99       sample
I'm able to pull the fields fine using the Spark context, but I can't find a good way to transform this map field into new fields in the RDD as expected above.
Sample Code:
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx pyspark-shell'
sc = SparkContext("local", "test")
sqlContext = SQLContext(sc)
def test_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table=table_name, keyspace=keys_space_name)\
        .load()
    return table_df
df_test = test_df("test", "test")
Then, to query the data, I use Spark SQL in the following format:
df_test.registerTempTable("dftest")
df = sqlContext.sql("""
    select * from dftest
    """)

How can I convert this row form into JSON while pushing into kafka topic

I am using a Spark application to process text files dropped into the /home/user1/files/ folder on my system and map the comma-separated data in those files to a particular JSON format. I have written the following Python code using Spark to do this, but the output that arrives in Kafka looks like this:
Row(Name=Priyesh,Age=26,MailId=priyeshkaratha#gmail.com,Address=AddressTest,Phone=112)
Python Code :
import findspark
findspark.init('/home/user1/spark')
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.sql import Column, DataFrame, Row, SparkSession
from pyspark.streaming.kafka import KafkaUtils
import json
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='server.kafka:9092')
def handler(message):
    records = message.collect()
    for record in records:
        producer.send('spark.out', str(record))
        print(record)
        producer.flush()

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream('/home/user1/files/')
    fields = lines.map(lambda l: l.split(","))
    udr = fields.map(lambda p: Row(Name=p[0], Age=int(p[3].split('#')[0]), MailId=p[31], Address=p[29], Phone=p[46]))
    udr.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
So how can I convert this row form into JSON while pushing into kafka topic?
You can convert Spark Row objects to dicts and then serialize those to JSON. For example, you could change this line:
producer.send('spark.out', str(record))
to this:
producer.send('spark.out', json.dumps(record.asDict()))
Alternatively, since you aren't using DataFrames in your example code, you could just build a dict to begin with instead of a Row.
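A sketch of that dict-based alternative (field positions taken from the question's code, so this is illustrative rather than definitive). Note that KafkaProducer.send expects bytes unless a value_serializer is configured, which is why the JSON string is encoded here:
# build plain dicts instead of Rows
udr = fields.map(lambda p: {
    'Name': p[0],
    'Age': int(p[3].split('#')[0]),
    'MailId': p[31],
    'Address': p[29],
    'Phone': p[46],
})

def handler(message):
    for record in message.collect():
        # record is already a dict, so it serializes directly to JSON
        producer.send('spark.out', json.dumps(record).encode('utf-8'))
    producer.flush()

udr.foreachRDD(handler)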

'RDD' object has no attribute '_jdf' pyspark RDD

I'm new to PySpark. I would like to perform some machine learning on a text file.
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td= train_data.rdd #transformer df to rdd
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
For my last command, I obtain the error:
AttributeError: 'RDD' object has no attribute '_jdf'
Can anyone help me please? Thank you.
You shouldn't be using an RDD with CountVectorizer. Instead, form the array of words in the DataFrame itself:
train_data = spark.read.text("20ng-train-all-terms.txt")
from pyspark.sql import functions as F
td= train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
Then it should work, and you can call the transform function as:
vectorizer_transformer.transform(td).show(truncate=False)
Now, if you want to stick with the old style of converting to an RDD, you have to modify certain lines of your code. The following is your complete code, modified so that it works:
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td= train_data.rdd #transformer df to rdd
tr_data= td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
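As with the first version, you can then apply the fitted model (same call as shown above, just on tr_data):
vectorizer_transformer.transform(tr_data).show(truncate=False)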
But I would suggest you stick with the DataFrame approach.

Seek to Beginning of Kafka Topic Using PySpark

Using a Kafka Stream in PySpark, is it possible to seek to the beginning of a Kafka topic without creating a new consumer group?
For example, I have the following code snippet:
...
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 pyspark-shell'
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext('local[2]', appName="MyStreamingApp_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 30)
spark = SparkSession(sc)
kafkaStream = KafkaUtils.createStream(ssc, zookeeper_ip, 'group-id', {'messages': 1})
counted = kafkaStream.count()
...
My goal is to do something along the lines of
kafkaStream.seekToBeginningOfTopic()
Currently, I'm creating a new consumer group to re-read from the beginning of the topic, e.g.:
kafkaStream = KafkaUtils.createStream(ssc, zookeeper, 'group-id-2', {'messages': 1}, {"auto.offset.reset": "smallest"})
Is this the proper way to consume a topic from the beginning using PySpark?
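For reference, and not an answer from the original thread: with the receiver-based createStream, offsets are tracked per consumer group in ZooKeeper, so a new group (or deleting the stored offsets) is the usual workaround. The direct stream lets you pin the starting offsets explicitly; a hedged sketch, assuming a single partition 0 and a hypothetical broker address:
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

# start from offset 0 of partition 0 of the 'messages' topic
# (enumerate your real partitions here)
from_offsets = {TopicAndPartition('messages', 0): 0}

direct_stream = KafkaUtils.createDirectStream(
    ssc,
    ['messages'],
    {'metadata.broker.list': 'broker-host:9092'},
    fromOffsets=from_offsets)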
