PySpark and Cassandra - Extracting Data into an RDD as Fields from a Map Field

I have a Cassandra table with a map field whose data looks like this:
test_id  test_map
1        {tran_id=99, tran_type=sample}
I am attempting to add these map entries to the existing RDD I pull this data into, as new fields on the same key, so it would look like this:
test_id  test_map                         tran_id  tran_type
1        {tran_id=99, tran_type=sample}   99       sample
I'm able to pull the fields fine using the Spark context, but I can't find a good method to transform this map field into the new columns shown above.
Sample Code:
import os
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx pyspark-shell'
sc = SparkContext("local", "test")
sqlContext = SQLContext(sc)
def test_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table=table_name, keyspace=keys_space_name)\
        .load()
    return table_df
df_test = test_df("test", "test")
Then, to query the data, I use Spark SQL in this form:
df_test.registerTempTable("dftest")
df = sqlContext.sql(
    """
    select * from dftest
    """)

Related

Spark: 'writeStream' can be called only on streaming Dataset/DataFrame

I'm trying to pull tweets from my Kafka cluster into Spark Streaming, where I perform some analysis and then store them in an Elasticsearch index.
Versions :
Spark - 2.3.0
Pyspark - 2.3.0
Kafka - 2.3.0
Elastic Search - 7.9
Elastic Search Hadoop - 7.6.2
I run the following code in my Jupyter environment to write the streaming dataframe into Elasticsearch.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0,org.elasticsearch:elasticsearch-hadoop:7.6.2 pyspark-shell'
from pyspark import SparkContext
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
import nltk
import logging
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer
def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']
def analyze_sentiment(tweet):
    scores = dict([('pos', 0), ('neu', 0), ('neg', 0), ('compound', 0)])
    sentiment_analyzer = SentimentIntensityAnalyzer()
    score = sentiment_analyzer.polarity_scores(tweet)
    for k in sorted(score):
        scores[k] += score[k]
    return json.dumps(scores)
def process(time, rdd):
    print("========= %s =========" % str(time))
    try:
        if rdd.count() == 0:
            raise Exception('Empty')
        sqlContext = getSqlContextInstance(rdd.context)
        df = sqlContext.read.json(rdd)
        df = df.filter("text not like 'RT #%'")
        if df.count() == 0:
            raise Exception('Empty')
        udf_func = udf(lambda x: analyze_sentiment(x), returnType=StringType())
        df = df.withColumn("Sentiment", lit(udf_func(df.text)))
        print(df.take(10))
        df.writeStream.outputMode('append').format('org.elasticsearch.spark.sql').option('es.nodes', 'localhost').option('es.port', 9200)\
            .option('checkpointLocation', '/checkpoint').option('es.spark.sql.streaming.sink.log.enabled', False).start('PythonSparkStreamingKafka_RM_01').awaitTermination()
    except Exception as e:
        print(e)
        pass
sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 20)
kafkaStream = KafkaUtils.createDirectStream(ssc, ['kafkaspark'], {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'spark-streaming',
    'fetch.message.max.bytes': '15728640',
    'auto.offset.reset': 'largest'})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.foreachRDD(process)
ssc.start()
ssc.awaitTermination(timeout=180)
But I get the error :
'writeStream' can be called only on streaming Dataset/DataFrame;
It looks like I have to use .readStream, but how do I use it to read from the Kafka stream without createDirectStream?
Could someone please help me with writing this dataframe into Elasticsearch? I am a beginner with Spark Streaming and Elasticsearch and find it quite challenging; I would be happy if someone could guide me through getting this done.
.writeStream is part of the Spark Structured Streaming API, so you need to use the corresponding API to start reading the data - spark.readStream - pass the options specific to the Kafka source that are described in the separate document, and also add the extra jar that contains the Kafka source implementation. The corresponding code would look like this (full code is here):
val streamingInputDF = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.0.10:9092")
.option("subscribe", "tweets-txt")
.load()
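A rough PySpark equivalent, sketched for the setup in the question (it assumes the spark-sql-kafka package is on the classpath, and it reuses the broker, topic, and index names from the question as placeholders), would be:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonSparkStreamingKafka_RM_01").getOrCreate()

# Read from Kafka with Structured Streaming
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "kafkaspark") \
    .load()

# The Kafka value column arrives as binary; cast it to a string before parsing or analysis
tweets = stream_df.selectExpr("CAST(value AS STRING) AS value")

# Write the streaming DataFrame to Elasticsearch
query = tweets.writeStream \
    .outputMode("append") \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "localhost") \
    .option("es.port", "9200") \
    .option("checkpointLocation", "/checkpoint") \
    .start("PythonSparkStreamingKafka_RM_01")
query.awaitTermination()
Any per-record work, such as the sentiment UDF, would go between the cast and the writeStream call.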

Write results from Kafka to csv in pyspark

I have set up a Kafka broker and I manage to read the records with pyspark.
import os
from pyspark.sql import SparkSession
import pyspark
import sys
from pyspark import SparkConf, SparkContext, SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
conf = SparkConf().setMaster("my-master").setAppName("Kafka_Spark")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc,5)
kvs = KafkaUtils.createDirectStream(ssc,
    ['enriched_messages'],
    {"metadata.broker.list": "my-kafka-broker", "auto.offset.reset": "smallest"},
    keyDecoder=lambda x: x,
    valueDecoder=lambda x: x)
lines = kvs.map(lambda x: x[1])
lines.pprint()
ssc.start()
ssc.awaitTermination(10)
Example of the returned data (timestamp, name, lastname, height):
2020-05-07 09:16:38, JoHN, Doe, 182.5
I want to write these records into a CSV file. lines is of type KafkaTransformedDStream, and the classic RDD-based solution is not working.
Has anyone a solution to this?
Converting a DStream to a single RDD is not possible, as DStreams are continuous streams. You can use the following, which produces many files that you can later merge into a single file.
lines.saveAsTextFiles("prefix", "suffix")
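An alternative sketch (assuming an active SparkSession named spark, the column layout from the example row above, and an output path of your choosing) is to turn each micro-batch into a DataFrame and append it as CSV:
def save_batch(rdd):
    # Skip empty micro-batches so no empty CSV parts get written
    if rdd.isEmpty():
        return
    rows = rdd.map(lambda line: line.split(", "))
    df = spark.createDataFrame(rows, ["timestamp", "name", "lastname", "height"])
    df.write.mode("append").csv("/tmp/enriched_messages_csv")

lines.foreachRDD(save_batch)
This still writes one or more part files per batch, but they all land under a single directory, which makes the later merge straightforward.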

Error while using dataframe show method in pyspark

I am trying to read data from BigQuery using pandas and pyspark. I am able to get the data, but I get the error below while converting it into a Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
Here are the environment details:
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follows:
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession , SQLContext
from google.cloud import bigquery
# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """
#spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()
print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY,project=project_id)
# Wait for query execution
query_job.result()
df = SQLContext(sc).read.format('bigquery') \
    .option('dataset', query_job.destination.dataset_id) \
    .option('table', query_job.destination.table_id)\
    .option("type", "direct")\
    .load()
df.show()
I am looking for some help to solve this issue.
I managed to find a better solution by referencing this link; below is my working code.
Install the pandas_gbq package (e.g. pip install pandas-gbq) before running the code below.
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query=query, project_id=project_id,dialect = 'standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps someone! Keep pysparking :)

Load only part of an HBase/Phoenix table as a Spark Dataframe

I am using the following code in Spark to load specified columns of my HBase/Phoenix table into a Spark Dataframe. I can specify the columns I want to load, but can I specify which rows? Or do I have to load all rows?
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
sc.stop()
val sc = new SparkContext("local", "phoenix-test")
val df = sqlContext.phoenixTableAsDataFrame(
  "TABLENAME", Array("ROWKEY", "CF.COL1", "CF.COL2", "CF.COL3"), conf = configuration
)
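For row-level filtering, one hedged sketch (written in PySpark via the Phoenix Spark data source API rather than phoenixTableAsDataFrame; the zkUrl and the filter value below are placeholders) is to load through the data source and apply a filter, which Phoenix can push down instead of scanning every row:
# Sketch: load the Phoenix table via the data source API and filter rows;
# the filter predicate is pushed down to Phoenix where possible.
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLENAME") \
    .option("zkUrl", "localhost:2181") \
    .load() \
    .filter("ROWKEY = 'some-key'")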

I am able to connect to the Hive database using pyspark, but when I run my program no data is showing

I have written the code below to read data from a Hive table; when I run it there are no compilation errors, but no data is displayed.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext, SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars hive-jdbc-2.1.0.jar pyspark-shell'
sparkConf = SparkConf().setAppName("App")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)
source_df = hiveContext.read.format('jdbc').options(
    url='jdbc:hive2://localhost:10000/sample',
    driver='org.apache.hive.jdbc.HiveDriver',
    dbtable='abc',
    user='root',
    password='root').load()
print source_df.show()
When I run this, I get the output below and am not able to fetch the data from the table.
+--------+------+
|abc.name|abc.id|
+--------+------+
+--------+------+
Just try:
df = hiveContext.read.table("your_hive_table")  # reads from the default db
df = hiveContext.read.table("your_db.your_hive_table")  # reads from your db
You could also do:
df = hiveContext.sql("select * from your_table")
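On Spark 2.x the same read can also go through a SparkSession built with Hive support (a small sketch; the database and table names are the ones from the question, and it assumes Spark can reach the Hive metastore):
from pyspark.sql import SparkSession

# Reads directly from the Hive metastore instead of going through the JDBC driver
spark = SparkSession.builder \
    .appName("App") \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("select * from sample.abc")
df.show()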
