Module not found error when importing Pyspark Delta Lake module - apache-spark

I'm running Pyspark with delta lake but when I try to import the delta module I get a ModuleNotFoundError: No module named 'delta'. This is on a machine without an internet connection so I had to download the delta-core jar manually from Maven and place it into the %SPARK_HOME%/jars folder.
My program works without any issues and I'm able to write and read from delta lake so I'm happy I've got the correct jar. But when I try and import the delta module from delta.tables import * I get the error.
For information my code is:
import os
from pyspark.sql import SparkSession
from pyspark.sql.types import TimestampType, FloatType, StructType, StructField
from pyspark.sql.functions import input_file_name
from Constants import Constants
if __name__ == "__main__":
constants = Constants()
spark = SparkSession.builder.master("local[*]")\
.appName("Delta Lake Testing")\
.getOrCreate()
# have to start spark session before importing: https://docs.delta.io/latest/quick-start.html#python
from delta.tables import *
# set logging level to limit output
spark.sparkContext.setLogLevel("ERROR")
spark.conf.set("spark.sql.session.timeZone", "UTC")
# push additional python files to the worker nodes
base_path = os.path.abspath(os.path.dirname(__file__))
spark.sparkContext.addPyFile(os.path.join(base_path, 'Constants.py'))
# start pipeline
schema = StructType([StructField("Timestamp", TimestampType(), False),\
StructField("ParamOne", FloatType(), False),\
StructField("ParamTwo", FloatType(), False),\
StructField("ParamThree", FloatType(), False)])
df = spark.readStream\
.option("header", "true")\
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss")\
.schema(schema)\
.csv(constants.input_path)\
.withColumn("input_file_name", input_file_name())
df.writeStream\
.format("delta")\
.outputMode("append")\
.option("checkpointLocation", constants.checkpoint_location)\
.start("/tmp/bronze")
# await on stream
sqm = spark.streams
sqm.awaitAnyTermination()
This is using Spark v2.4.4 and Python v3.6.1 and the job is submitted using spark-submit path/to/job.py

%pyspark
sc.addPyFile("**LOCATION_OF_DELTA_LAKE_JAR_FILE**")
from delta.tables import *

Related

How to call avro SchemaConverters in Pyspark

Although PySpark has Avro support, it does not have the SchemaConverters method. I may be able to use Py4J to accomplish this, but I have never used a Java package within Python.
This is the code I am using
# Import SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
def _test():
# Create SparkSession
spark = SparkSession.builder \
.master("local[1]") \
.appName("sparvro") \
.getOrCreate()
avroSchema = sc._jvm.org.apache.spark.sql.avro.SchemaConverters.toAvroType(StructType([ StructField("firstname", StringType(), True)]))
if __name__ == "__main__":
_test()
however, I keep getting this error
AttributeError: 'StructField' object has no attribute '_get_object_id'

Spark Performance Issue - Writing to S3

I have a AWS Glue job in which I am using pyspark to read a large file (30gb) csv on s3 and then save it as parquet on s3. The job ran for more then 3 hours post which I killed it. Not sure why converting the file format would take so long ? Is spark right tool to do this conversion . below is the code I am using
import logging
import sys
from datetime import datetime
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
import boto3
import time
if __name__ == "__main__":
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
sqc = SQLContext(sc)
rdd = (sc. \
textFile("s3://my-bucket/data.txt")\
.flatMap(lambda line: line.split("END")) \
.map(lambda x: x.split("|")) \
.filter(lambda x: len(x) > 1))
df=sqc.createDataFrame(rdd)
#print(df1.head(10))
print(f'df.rdd.getNumPartitions() - {df.rdd.getNumPartitions()}')
df1.write.mode('overwrite').parquet('s3://my-bucket/processed')
job.commit()
Any suggestions for reducing the run time ?

Spark : writeStream' can be called only on streaming Dataset/DataFrame

I'm trying to retrieve tweets from my Kafka cluster to Spark Streaming in which I perform some analysis to store them in an ElasticSearch Index.
Versions :
Spark - 2.3.0
Pyspark - 2.3.0
Kafka - 2.3.0
Elastic Search - 7.9
Elastic Search Hadoop - 7.6.2
I run the following code in my Jupyter env to write the streaming dataframe into Elastic Search .
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0,org.elasticsearch:elasticsearch-hadoop:7.6.2 pyspark-shell'
from pyspark import SparkContext
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
import nltk
import logging
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer
def getSqlContextInstance(sparkContext):
if ('sqlContextSingletonInstance' not in globals()):
globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
return globals()['sqlContextSingletonInstance']
def analyze_sentiment(tweet):
scores = dict([('pos', 0), ('neu', 0), ('neg', 0), ('compound', 0)])
sentiment_analyzer = SentimentIntensityAnalyzer()
score = sentiment_analyzer.polarity_scores(tweet)
for k in sorted(score):
scores[k] += score[k]
return json.dumps(scores)
def process(time,rdd):
print("========= %s =========" % str(time))
try:
if rdd.count()==0:
raise Exception('Empty')
sqlContext = getSqlContextInstance(rdd.context)
df = sqlContext.read.json(rdd)
df = df.filter("text not like 'RT #%'")
if df.count() == 0:
raise Exception('Empty')
udf_func = udf(lambda x: analyze_sentiment(x),returnType=StringType())
df = df.withColumn("Sentiment",lit(udf_func(df.text)))
print(df.take(10))
df.writeStream.outputMode('append').format('org.elasticsearch.spark.sql').option('es.nodes','localhost').option('es.port',9200)\
.option('checkpointLocation','/checkpoint').option('es.spark.sql.streaming.sink.log.enabled',False).start('PythonSparkStreamingKafka_RM_01').awaitTermination()
except Exception as e:
print(e)
pass
sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 20)
kafkaStream = KafkaUtils.createDirectStream(ssc, ['kafkaspark'], {
'bootstrap.servers':'localhost:9092',
'group.id':'spark-streaming',
'fetch.message.max.bytes':'15728640',
'auto.offset.reset':'largest'})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.foreachRDD(process)
ssc.start()
ssc.awaitTermination(timeout=180)
But I get the error :
'writeStream' can be called only on streaming Dataset/DataFrame;
And , it looks like I have to use .readStream , but how do I use it to read from KafkaStream without CreateDirectStream ?
Could someone please help me with writing this dataframe into Elastic Search . I am a beginner to Spark Streaming and Elastic Search and find it quite challenging . Would be happy if someone could guide me through getting this done.
.writeStream is a part of the Spark Structured Streaming API, so you need to use corresponding API to start reading the data - the spark.readStream, and pass options specific for the Kafka source that are described in the separate document, and also use the additional jar that contains the Kafka implementation. The corresponding code would look like that (full code is here):
val streamingInputDF = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.0.10:9092")
.option("subscribe", "tweets-txt")
.load()

Error while using dataframe show method in pyspark

I am trying to read data from BigQuery using pandas and pyspark. I am able to get the data but somehow getting below error while converting it into Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
Following is the environment detail
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follow
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession , SQLContext
from google.cloud import bigquery
# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """
#spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()
print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY,project=project_id)
# Wait for query execution
query_job.result()
df = SQLContext(sc).read.format('bigquery') \
.option('dataset', query_job.destination.dataset_id) \
.option('table', query_job.destination.table_id)\
.option("type", "direct")\
.load()
df.show()
I am looking some help to solve this issue.
I managed to find the better solution referencing this link , below is my working code :
Install pandas_gbq package in python library before writing below code .
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query=query, project_id=project_id,dialect = 'standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps to someone ! Keep pysparking :)

How to get detailed Information about Spark Stages&Tasks

I´ve set up an Apache Spark cluster with a master and one Worker and I use Python with Spyder as IDE. Everything works fine so far, but I need detailed Information about the task distribution in the Cluster. I know that there is the Spark Web UI but I would like to have Information directly in my Spyder console. So I mean which part of my code/script is done by which Worker/Master. I think with the python package "socket" and socket.gethostname() it must be possible to get more Information. I really look forward to for an help.
Here is my code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import matplotlib.pyplot as plt
from datetime import datetime
from pyspark.sql.functions import udf
from datetime import datetime
import pyspark.sql.functions as F
#spark = SparkSession \
# .builder \
# .appName('weather_data') \
# .getOrCreate()
spark = SparkSession \
.builder \
.appName("weather_data_u") \
.master('master_ip#...')\
.getOrCreate()
data.show()
data.printSchema()
data_selected = data\
.select(data['Date'],
data['TemperatureHighC'],
data['TemperatureAvgC'],
data['TemperatureLowC'],
data['DewpointHighC'],
data['DewpointAvgC'],
data['DewpointLowC'],
data['HumidityAvg'],
data['WindSpeedMaxKMH'],
data['WindSpeedAvgKMH'],
data['GustSpeedMaxKMH'],
data['PrecipitationSumCM'])
data_selected.printSchema()
data_selected.show()
f = udf(lambda row: datetime.strptime(row, '%Y-%m-%d'), TimestampType())
data_selected = data_selected\
.withColumn('date', f(data['Date'].cast(StringType())))\
.withColumn('t_max', data['TemperatureHighC'].cast(DoubleType()))\
.withColumn('t_mean', data['TemperatureAvgC'].cast(DoubleType()))\
.withColumn('t_min', data['TemperatureLowC'].cast(DoubleType()))\
.withColumn('dew_max', data['DewpointHighC'].cast(DoubleType()))\
.withColumn('dew_mean', data['DewpointAvgC'].cast(DoubleType()))\
.withColumn('dew_min', data['DewpointLowC'].cast(DoubleType()))\
.cache()
data_selected.show()
t_mean_calculated = data_selected\
.groupBy(F.date_format(data_selected.date, 'M'))\
.agg(F.mean(data_selected.t_max))\
.orderBy('date_format(date, M)')
t_mean_calculated = t_mean_calculated\
.withColumn('month', t_mean_calculated['date_format(date, M)'].cast(IntegerType()))\
.withColumnRenamed('avg(t_max)', 't_max_month')\
.orderBy('month')\
.drop(t_mean_calculated['date_format(date, M)'])\
.select('month', 't_max_month')
t_mean_calculated = t_mean_calculated.collect()
As reported by #Jacek Laskowski himself, you can use Spark-Core local properties to modify job-name in web-ui
callSite.short
callSite.long
For instance, my Spark-application syncs multiple MySQL tables to S3, and I set
spark.sparkContext.setLocalProperty("callSite.short", currentTableName)
so reflect current table-name in web-ui

Resources