Spark Extension using AWS Glue

I have created a script locally that uses the spark extension 'uk.co.gresearch.spark:spark-extension_2.12:2.2.0-3.3' for comparing different DataFrames in a simple manner.
However, when I tried this out on AWS Glue I ran into some issues and received this error:
ModuleNotFoundError: No module named 'gresearch'
I then tried copying the .jar file from my local disk that was referenced when I initialized the Spark session locally, which had printed this message:
... The jars for the packages stored in: /Users/["SOME_NAME"]/.ivy2/jars
uk.co.gresearch.spark#spark-extension_2.12 added as a dependency...
In that path I found a file named: uk.co.gresearch.spark_spark-extension_2.12-2.2.0-3.3.jar that I copied to S3 and referenced in the Jar lib path.
But this did not work... How would you go about setting this up in the correct manner?
The example code I've used to test this on AWS Glue looks like this:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

appName = 'test_gresearch'
spark_conf = SparkConf()
spark_conf.setAll([('spark.jars.packages', 'uk.co.gresearch.spark:spark-extension_2.12:2.2.0-3.3')])
spark = SparkSession.builder.config(conf=spark_conf)\
    .enableHiveSupport().appName(appName).getOrCreate()
from gresearch.spark.diff import *

df1 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "UK"],
    [3, "GHI", 3000, "JPN"],
    [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "CAN"],
    [3, "GHI", 3500, "JPN"],
    [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])

df1.show()
df2.show()

options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
Any tips are more than welcome. Thank you in advance!
Regards

After some investigation with the AWS support team, I was instructed to include the package's .jar file through the Python library path, since the .jar file contains embedded Python packages. The correct version of the .jar file therefore has to be downloaded (https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension_2.12/2.1.0-3.1 is the version I ended up using), uploaded to S3, and referenced under the Glue job setting for Python library path (e.g. s3://bucket-name/spark-extension_2.12-2.1.0-3.1.jar).
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from gresearch.spark.diff import *

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])

left.diff(right, "id").show()

# commit the job once the work is done
job.commit()

Related

I had the error Py4JJavaError: An error occurred while calling o65.showString in PySpark

I am trying to implement this code using:
Python 3.9
spark-3.3.1-bin-hadoop3 (with the included PySpark)
Java 1.8.0_171
The paths are all right and I am running other code in Jupyter, but I didn't find any answer related to the error Py4JJavaError: An error occurred while calling o65.showString.
Note: my Spark installation contains the spark-sql_2.12-3.3.1 and graphframes-0.8.2-spark3.2-s_2.12 jar files, which means they use the same Scala version, 2.12.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
# The graphframes Python package is bundled inside the jar, so add it to the Python path
sc.addPyFile(r'C:\Program Files\spark-3.3.1-bin-hadoop3\jars\graphframes-0.8.2-spark3.2-s_2.12.jar')
sqlc = SQLContext(sc)

# Create a Vertex DataFrame with unique ID column "id"
v = sqlc.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = sqlc.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
from graphframes import *
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
I hope someone can help me fix the error.

Unable to add/import additional Python library datacompy in AWS Glue

I am trying to import an additional Python library, datacompy, into a Glue job that uses version 2, with the steps below:
Open the AWS Glue console.
Under Job parameters, add the following:
For Key, add --additional-python-modules.
For Value, add datacompy==0.7.3, s3://python-modules/datacompy-0.7.3.whl.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import datacompy
from py4j.java_gateway import java_import

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

## @params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA', 'additional-python-modules'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
but the job returns the error
ModuleNotFoundError: No module named 'datacompy'
How do I resolve this issue?
With Spark 2.4, Python 3 (Glue Version 2.0), I set the --additional-python-modules job parameter as described above.
Then I can import it in my job like so:
import pandas as pd
import numpy as np
import datacompy
df1 = pd.DataFrame(np.random.randn(10,2), columns=['a','b'])
df2 = pd.DataFrame(np.random.randn(10,2), columns=['a','b'])
compare = datacompy.Compare(df1, df2, join_columns='a')
print(compare.report())
and when I check the CloudWatch log for the job run, the report output is there.
If you're using a Python Shell Job, try the following:
Create a datacompy whl file, or download it from PyPI.
Upload that file to an S3 bucket.
Then enter the path to the S3 whl file in the Python library path box, for example
s3://my-bucket/datacompy-0.8.0-py3-none-any.whl
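If you want to script that setup instead of using the console, a minimal boto3 sketch might look like the following. The bucket, job name, and role are placeholders; for a Python shell job the library path box likewise corresponds to the --extra-py-files argument.

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="datacompy-shell-job",  # hypothetical job name
    Role="MyGlueServiceRole",    # hypothetical IAM role
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://my-bucket/scripts/compare.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # equivalent of the "Python library path" box
        "--extra-py-files": "s3://my-bucket/datacompy-0.8.0-py3-none-any.whl"
    },
    MaxCapacity=0.0625,  # smallest capacity allowed for Python shell jobs
)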

Pandas UDF error on EMR: class "io.netty.buffer.ArrowBuf"

I'm trying to use a pandas udf on a Jupyter notebook on AWS EMR to no avail.
First I tried to use a function I wrote myself, but I couldn't get it to work, so I tried some examples from answers to other questions I found here, but I still couldn't get it to work.
I tried this code:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pyspark.sql.functions as F
import pyarrow

df = spark.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["id", "type", "code"])

schema = StructType([
    StructField("code", StringType()),
])

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def dummy_udaf(pdf):
    pdf = pdf[['code']]
    return pdf

df.groupBy('type').apply(dummy_udaf).show()
And I get this error:
Caused by: java.lang.SecurityException: class "io.netty.buffer.ArrowBuf"'s signer information does not match signer information of other classes in the same package
I tried without the pyarrow import and got the same error. I also used other code from answers about this topic, and the result was the same.
In the bootstrap shell script I have a pip install line as follows:
sudo python3 -m pip install pandas==0.24.2 pyarrow==0.14.1
I've tried with pyarrow 0.15.1, but nothing changed.
Do you have any idea what is causing this error? Thank you!
Set the following versions:
sudo python3 -m pip install pyarrow==0.14 pandas==1.1.4
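To confirm the bootstrap action actually took effect everywhere, here is a small sanity-check sketch (assuming the spark session from the EMR notebook is available) that prints the pandas/pyarrow versions seen by the driver and by an executor:

import pandas as pd
import pyarrow as pa

# Versions seen by the driver (the notebook kernel)
print("driver:", pd.__version__, pa.__version__)

# Versions seen by an executor; a mismatch between nodes is a common
# source of Arrow-related errors in pandas UDFs.
def _worker_versions(_):
    import pandas, pyarrow
    return [(pandas.__version__, pyarrow.__version__)]

print("executor:", spark.sparkContext.parallelize([0], 1).mapPartitions(_worker_versions).collect())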

Lost messages in spark-streaming with checkpoint

I read from Kafka with Spark Streaming and checkpointing, but messages are lost.
Code for generating a test stream:
from kafka import KafkaProducer
import time

p = KafkaProducer(bootstrap_servers='kafka.dev:9092')
for i in range(1000):
    time.sleep(2)
    # kafka-python expects bytes for the message value
    p.send('y_test', value=('{"test": ' + str(i) + '}').encode('utf-8'))
Code for reading from kafka:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def createContext():
    sc = SparkContext(appName='test_app')
    ssc = StreamingContext(sc, 60)
    kafkaStream = KafkaUtils.createStream(ssc, zkQuorum='kafka.dev:2181',
                                          groupId='test_app', topics={'y_test': 1})
    kafkaStream.saveAsTextFiles('test_dir/')
    ssc.checkpoint('checkpoint_dir')
    return ssc

context = StreamingContext.getOrCreate('checkpoint_dir', createContext)
context.start()
context.awaitTermination()
How I check:
1) start the reading code
2) start the generating code
3) restart the reading code
4) read from HDFS and see that data is missing:
{"test": 1}
{"test": 2}
{"test": 3}
{"test": 4}
{"test": 8}
{"test": 9}
{"test": 10}
{"test": 11}
{"test": 12}
Kafka 0.9.0.0, Spark 1.6.0

Why does Spark tell me "name 'sqlContext' is not defined", and how can I use sqlContext?

I am trying to run a spark-ml example, but
from pyspark import SparkContext
import pyspark.sql

sc = SparkContext(appName="PythonStreamingQueueStream")

training = sqlContext.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
It cannot run because the terminal tells me that
NameError: name 'SQLContext' is not defined
Why did this happen? How can I solve it?
If you are using the Apache Spark 1.x line (i.e. prior to Apache Spark 2.0), to access the sqlContext you would need to import SQLContext and create it; i.e.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
If you're using Apache Spark 2.0, you can just use the SparkSession directly instead. Therefore your code will be
training = spark.createDataFrame(...)
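For illustration, here is a minimal, self-contained sketch of that Spark 2.x version (assuming the Vectors in the spark-ml example come from pyspark.ml.linalg):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("PythonStreamingQueueStream").getOrCreate()

training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

training.show()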
For more information, please refer to the Spark SQL Programming Guide.
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName("Basics").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = sqlContext.range(0, 10)
The above piece of code will solve your issue.
