Error while using dataframe show method in pyspark - python-3.x

I am trying to read data from BigQuery using pandas and pyspark. I am able to get the data, but I get the error below when converting it into a Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
The environment details are as follows:
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follows:
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession , SQLContext
from google.cloud import bigquery
# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """
#spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()
print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY,project=project_id)
# Wait for query execution
query_job.result()
df = SQLContext(sc).read.format('bigquery') \
    .option('dataset', query_job.destination.dataset_id) \
    .option('table', query_job.destination.table_id) \
    .option("type", "direct") \
    .load()
df.show()
I am looking for some help to solve this issue.

I managed to find a better solution by referencing this link. Below is my working code.
Install the pandas_gbq package in your Python environment before running the code below.
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query, project_id=project_id, dialect='standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps someone! Keep pysparking :)

Related

getting error while trying to read athena table in spark

I have the following code snippet in pyspark:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe
def validate_data():
    conf = SparkConf().setAppName("app")
    spark = SparkContext(conf=conf)
    config = {
        "val_path" : "s3://forecasting/data/validation.csv"
    }
    data1_df = spark.read.table("db1.data_dest")
    data2_df = spark.read.table("db2.data_source")
    print(data1_df.count())
    print(data2_df.count())
if __name__ == "__main__":
    validate_data()
Now this code works fine when run in a Jupyter notebook on SageMaker (connecting to EMR), but when we run it as a Python script from the terminal, it throws this error:
AttributeError: 'SparkContext' object has no attribute 'read'
We have to automate these notebooks, so we are trying to convert them to python scripts
You can only call read on a Spark Session, not on a Spark Context.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("app")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Or you can convert the Spark context to a Spark session
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
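Once a SparkSession is available, the validation logic can take the session as an argument instead of building a context itself, which lets the same function be called from a notebook or from a script. The following is only a minimal sketch under that assumption: count_tables is a hypothetical helper name, not part of PySpark, and the table names in the usage comment are the ones from the question.

```python
def count_tables(spark, table_names):
    """Return {table_name: row_count} for each table, read via spark.read.table.

    `spark` is expected to be a SparkSession (a SparkContext has no `read`).
    count_tables is a hypothetical helper, not a PySpark API.
    """
    return {name: spark.read.table(name).count() for name in table_names}


# Usage in a script (assumes pyspark is installed and the tables exist):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("app").getOrCreate()
# print(count_tables(spark, ["db1.data_dest", "db2.data_source"]))
```

Passing the session in keeps the function free of any context creation, so it does not matter whether the notebook or the script created the session.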

Spark: 'writeStream' can be called only on streaming Dataset/DataFrame

I'm trying to retrieve tweets from my Kafka cluster into Spark Streaming, where I perform some analysis and then store them in an Elasticsearch index.
Versions :
Spark - 2.3.0
Pyspark - 2.3.0
Kafka - 2.3.0
Elastic Search - 7.9
Elastic Search Hadoop - 7.6.2
I run the following code in my Jupyter environment to write the streaming dataframe into Elasticsearch.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0,org.elasticsearch:elasticsearch-hadoop:7.6.2 pyspark-shell'
from pyspark import SparkContext
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
import nltk
import logging
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer
def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']
def analyze_sentiment(tweet):
    scores = dict([('pos', 0), ('neu', 0), ('neg', 0), ('compound', 0)])
    sentiment_analyzer = SentimentIntensityAnalyzer()
    score = sentiment_analyzer.polarity_scores(tweet)
    for k in sorted(score):
        scores[k] += score[k]
    return json.dumps(scores)
def process(time,rdd):
    print("========= %s =========" % str(time))
    try:
        if rdd.count()==0:
            raise Exception('Empty')
        sqlContext = getSqlContextInstance(rdd.context)
        df = sqlContext.read.json(rdd)
        df = df.filter("text not like 'RT #%'")
        if df.count() == 0:
            raise Exception('Empty')
        udf_func = udf(lambda x: analyze_sentiment(x),returnType=StringType())
        df = df.withColumn("Sentiment",lit(udf_func(df.text)))
        print(df.take(10))
        df.writeStream.outputMode('append').format('org.elasticsearch.spark.sql').option('es.nodes','localhost').option('es.port',9200)\
            .option('checkpointLocation','/checkpoint').option('es.spark.sql.streaming.sink.log.enabled',False).start('PythonSparkStreamingKafka_RM_01').awaitTermination()
    except Exception as e:
        print(e)
        pass
sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 20)
kafkaStream = KafkaUtils.createDirectStream(ssc, ['kafkaspark'], {
    'bootstrap.servers':'localhost:9092',
    'group.id':'spark-streaming',
    'fetch.message.max.bytes':'15728640',
    'auto.offset.reset':'largest'})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.foreachRDD(process)
ssc.start()
ssc.awaitTermination(timeout=180)
But I get the error:
'writeStream' can be called only on streaming Dataset/DataFrame;
It looks like I have to use .readStream, but how do I use it to read from the Kafka stream without createDirectStream?
Could someone please help me write this dataframe into Elasticsearch? I am a beginner with Spark Streaming and Elasticsearch and find it quite challenging. I would be happy if someone could guide me through getting this done.
.writeStream is part of the Spark Structured Streaming API, so you need to use the corresponding API to read the data as well: spark.readStream. Pass it the options specific to the Kafka source, which are described in a separate document, and add the extra jar that contains the Kafka implementation. The corresponding code, in Scala, would look like this (full code is here):
val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.0.10:9092")
  .option("subscribe", "tweets-txt")
  .load()
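Since the question uses PySpark, the same read can be sketched in Python. This is a hedged illustration rather than the answerer's code: read_kafka_stream is a hypothetical helper name, the broker address and topic in the usage comment are the ones from the Scala snippet, and it assumes the Structured Streaming Kafka source (the spark-sql-kafka-0-10 artifact matching your Spark and Scala versions) is on the classpath instead of the spark-streaming-kafka-0-8 package used in the question.

```python
def read_kafka_stream(spark, bootstrap_servers, topic):
    """Build a streaming DataFrame from a Kafka topic with Structured Streaming.

    `spark` is expected to be a SparkSession; read_kafka_stream is a
    hypothetical helper, not a PySpark API.
    """
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )


# Usage (assumes an existing SparkSession named `spark`):
# streaming_df = read_kafka_stream(spark, "192.168.0.10:9092", "tweets-txt")
# streaming_df.isStreaming  # a streaming DataFrame is what .writeStream requires
```

A DataFrame produced this way is a streaming DataFrame, which is the precondition the 'writeStream' error message refers to.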

How to get detailed information about Spark stages and tasks

I've set up an Apache Spark cluster with a master and one worker, and I use Python with Spyder as my IDE. Everything works fine so far, but I need detailed information about the task distribution in the cluster. I know that there is the Spark Web UI, but I would like to have this information directly in my Spyder console: which part of my code/script is executed by which worker or by the master. I think it must be possible to get more information with the Python package "socket" and socket.gethostname(). I look forward to any help.
Here is my code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import matplotlib.pyplot as plt
from datetime import datetime
from pyspark.sql.functions import udf
from datetime import datetime
import pyspark.sql.functions as F
#spark = SparkSession \
# .builder \
# .appName('weather_data') \
# .getOrCreate()
spark = SparkSession \
    .builder \
    .appName("weather_data_u") \
    .master('master_ip#...')\
    .getOrCreate()
data.show()
data.printSchema()
data_selected = data\
    .select(data['Date'],
            data['TemperatureHighC'],
            data['TemperatureAvgC'],
            data['TemperatureLowC'],
            data['DewpointHighC'],
            data['DewpointAvgC'],
            data['DewpointLowC'],
            data['HumidityAvg'],
            data['WindSpeedMaxKMH'],
            data['WindSpeedAvgKMH'],
            data['GustSpeedMaxKMH'],
            data['PrecipitationSumCM'])
data_selected.printSchema()
data_selected.show()
f = udf(lambda row: datetime.strptime(row, '%Y-%m-%d'), TimestampType())
data_selected = data_selected\
    .withColumn('date', f(data['Date'].cast(StringType())))\
    .withColumn('t_max', data['TemperatureHighC'].cast(DoubleType()))\
    .withColumn('t_mean', data['TemperatureAvgC'].cast(DoubleType()))\
    .withColumn('t_min', data['TemperatureLowC'].cast(DoubleType()))\
    .withColumn('dew_max', data['DewpointHighC'].cast(DoubleType()))\
    .withColumn('dew_mean', data['DewpointAvgC'].cast(DoubleType()))\
    .withColumn('dew_min', data['DewpointLowC'].cast(DoubleType()))\
    .cache()
data_selected.show()
t_mean_calculated = data_selected\
    .groupBy(F.date_format(data_selected.date, 'M'))\
    .agg(F.mean(data_selected.t_max))\
    .orderBy('date_format(date, M)')
t_mean_calculated = t_mean_calculated\
    .withColumn('month', t_mean_calculated['date_format(date, M)'].cast(IntegerType()))\
    .withColumnRenamed('avg(t_max)', 't_max_month')\
    .orderBy('month')\
    .drop(t_mean_calculated['date_format(date, M)'])\
    .select('month', 't_max_month')
t_mean_calculated = t_mean_calculated.collect()
As reported by @Jacek Laskowski himself, you can use Spark Core local properties to modify the job name in the web UI:
callSite.short
callSite.long
For instance, my Spark-application syncs multiple MySQL tables to S3, and I set
spark.sparkContext.setLocalProperty("callSite.short", currentTableName)
to reflect the current table name in the web UI.
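The question's own idea of using socket.gethostname() can also be sketched directly: a function passed to mapPartitions runs on whichever executor processes each partition, so it can report that machine's hostname back to the driver console. This is an illustration and not part of the answer above; tag_partition is a hypothetical helper name, and the usage line assumes an existing SparkSession named spark.

```python
import socket


def tag_partition(rows):
    """Yield one (hostname, row_count) pair for the partition being processed.

    When passed to RDD.mapPartitions, this runs on the executor that owns
    the partition, so the hostname identifies the machine doing the work.
    tag_partition is a hypothetical helper, not a PySpark API.
    """
    count = sum(1 for _ in rows)
    yield (socket.gethostname(), count)


# Usage (assumes an existing SparkSession named `spark`):
# pairs = spark.sparkContext.parallelize(range(100), 4).mapPartitions(tag_partition).collect()
# print(pairs)  # one (hostname, row_count) pair per partition
```

This reports where partitions were processed, which is coarser than the per-stage and per-task detail shown in the web UI.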

I am able to connect to the Hive database using pyspark, but when I run my program no data is shown

I have written the code below to read data from a Hive table. When I run it, there are no errors, but no data is displayed.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext, SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars hive-jdbc-2.1.0.jar pyspark-shell'
sparkConf = SparkConf().setAppName("App")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc);
source_df = hiveContext.read.format('jdbc').options(
    url='jdbc:hive2://localhost:10000/sample',
    driver='org.apache.hive.jdbc.HiveDriver',
    dbtable='abc',
    user='root',
    password='root').load()
print source_df.show()
When I run this, I get the output below and am not able to fetch the data from the table.
+--------+------+
|abc.name|abc.id|
+--------+------+
+--------+------+
Just try:
df = hiveContext.read.table("your_hive_table")  # reads from the default db
df = hiveContext.read.table("your_db.your_hive_table")  # reads from your db
You could also do:
df = hiveContext.sql("select * from your_table")

pyspark : NameError: name 'spark' is not defined

I am copying the pyspark.ml example from the official document website:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
However, the example above wouldn't run and gave me the following errors:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-aaffcd1239c9> in <module>()
1 from pyspark import *
2 data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
----> 3 df = spark.createDataFrame(data, ["features"])
4 kmeans = KMeans(k=2, seed=1)
5 model = kmeans.fit(df)
NameError: name 'spark' is not defined
What additional configuration/variable needs to be set to get the example running?
You can add
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
to the beginning of your code to define a SparkSession; then spark.createDataFrame() should work.
The answer by 率怀一 is good and will work the first time.
But the second time you try it, it will throw the following exception:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by __init__ at <ipython-input-3-786525f7559f>:10
There are two ways to avoid it.
1) Using SparkContext.getOrCreate() instead of SparkContext():
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
2) Using sc.stop() at the end, or before you start another SparkContext.
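The second option can be made harder to forget by wrapping the session in a context manager that stops it on exit. This is only a sketch of that idea: managed_session is a hypothetical helper name, and the app name in the usage comment is an assumption made for illustration.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(builder):
    """Yield builder.getOrCreate() and stop the session when the block exits.

    `builder` is expected to be a SparkSession.Builder; managed_session is
    a hypothetical helper, not a PySpark API.
    """
    spark = builder.getOrCreate()
    try:
        yield spark
    finally:
        # Runs even if the body raises, so the context is always released.
        spark.stop()


# Usage (assumes pyspark is installed; "kmeans-demo" is an illustrative name):
# from pyspark.sql import SparkSession
# with managed_session(SparkSession.builder.master("local").appName("kmeans-demo")) as spark:
#     df = spark.createDataFrame(data, ["features"])
```

Because the session is obtained with getOrCreate() and always stopped on exit, re-running the cell should not hit the "Cannot run multiple SparkContexts at once" error.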
Since you are calling createDataFrame(), you need to do this:
df = sqlContext.createDataFrame(data, ["features"])
instead of this:
df = spark.createDataFrame(data, ["features"])
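In the example, spark plays the role of the sqlContext.
Some people use a different variable name, so use whichever SQLContext or SparkSession variable your environment defines. Note that sc conventionally refers to the SparkContext, which has no createDataFrame() method, so sc.createDataFrame(data, ["features"]) will not work.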
If you are using Python, you can import spark as follows, which creates a Spark session. Keep in mind that this is an old method, though it still works.
from pyspark.shell import spark
If it raises an error about another open session, do this:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
scraped_data=spark.read.json("/Users/reihaneh/Desktop/nov3_final_tst1/")
