Pandas UDF error on EMR: class "io.netty.buffer.ArrowBuf"

I'm trying to use a pandas UDF in a Jupyter notebook on AWS EMR, with no success.
First I tried a function I wrote myself, and when that failed I tried examples from answers to other questions here, but none of them worked either.
I tried this code:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pyspark.sql.functions as F
import pyarrow

df = spark.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["id", "type", "code"])

schema = StructType([
    StructField("code", StringType()),
])

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def dummy_udaf(pdf):
    pdf = pdf[['code']]
    return pdf

df.groupBy('type').apply(dummy_udaf).show()
And I get this error:
Caused by: java.lang.SecurityException: class "io.netty.buffer.ArrowBuf"'s signer information does not match signer information of other classes in the same package
I tried without the import pyarrow line and got the same error. I also used other code from answers about this topic, with the same result.
In the bootstrap shell script I have the following pip install line:
sudo python3 -m pip install pandas==0.24.2 pyarrow==0.14.1
I've also tried pyarrow 0.15.1, but nothing changed.
Do you have any idea what is causing this error? Thank you!

Set the following versions:
sudo python3 -m pip install pyarrow==0.14 pandas==1.1.4
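As a quick sanity check (not part of the original answer), you can confirm from the notebook that the driver and the executors actually picked up the pinned versions, since a mix of Arrow/pandas versions across nodes tends to surface as confusing Arrow errors. A minimal sketch, assuming spark is the session already provided by the EMR notebook:
# Sketch: verify the pinned versions on the driver and on one executor.
import pandas
import pyarrow

print("driver:", pandas.__version__, pyarrow.__version__)

def worker_versions(_):
    import pandas, pyarrow
    return (pandas.__version__, pyarrow.__version__)

print("executor:", spark.sparkContext.parallelize([0], 1).map(worker_versions).first())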

Related

I had the error Py4JJavaError: An error occurred while calling o65.showString in PySpark

I am trying to run this code using:
Python 3.9
spark-3.3.1-bin-hadoop3 (which includes PySpark)
Java 1.8.0_171
The paths are set correctly and I am running other code in Jupyter, but I couldn't find any answer related to the error Py4JJavaError: An error occurred while calling o65.showString.
Note: my Spark installation contains the spark-sql_2.12-3.3.1 and graphframes-0.8.2-spark3.2-s_2.12 jar files, which means they share the same Scala version, 2.12.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sc.addPyFile(r'C:\Program Files\spark-3.3.1-bin-hadoop3\jars\graphframes-0.8.2-spark3.2-s_2.12.jar')
sqlc = SQLContext(sc)

# Create a Vertex DataFrame with unique ID column "id"
v = sqlc.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = sqlc.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
from graphframes import *
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
I hope someone can help me fix the error.
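A side note, not an answer from the thread: addPyFile only puts the jar's embedded Python package on the Python path; the graphframes jar also has to be on the JVM classpath of the driver and executors. A minimal sketch of an alternative setup, assuming internet access so spark.jars.packages can resolve the artifact (the graphframes quick start uses the equivalent pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12):
# Sketch: let Spark resolve graphframes itself instead of pointing
# addPyFile at a local jar; the package is requested before the JVM starts.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("graphframes-test")
         .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
         .getOrCreate())

from graphframes import GraphFrame  # assumes the resolved package is importable

v = spark.createDataFrame([("a", "Alice", 34), ("b", "Bob", 36)], ["id", "name", "age"])
e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])
GraphFrame(v, e).inDegrees.show()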

Spark Extension using AWS Glue

I have created a script locally that uses the spark extension 'uk.co.gresearch.spark:spark-extension_2.12:2.2.0-3.3' for comparing different DataFrames in a simple manner.
However, when I tried this out on AWS Glue, I ran into some issues and received this error:
ModuleNotFoundError: No module named 'gresearch'
I have tried copying the .jar file from my local disk that was referenced when I initialized the Spark session locally, which printed this message:
... The jars for the packages stored in: /Users/["SOME_NAME"]/.ivy2/jars
uk.co.gresearch.spark#spark-extension_2.12 added as a dependency...
In that path I found a file named: uk.co.gresearch.spark_spark-extension_2.12-2.2.0-3.3.jar that I copied to S3 and referenced in the Jar lib path.
But this did not work... How would you go about setting this up in the correct manner?
The example code I've used to test this on AWS Glue looks like this:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

appName = 'test_gresearch'
spark_conf = SparkConf()
spark_conf.setAll([('spark.jars.packages',
                    'uk.co.gresearch.spark:spark-extension_2.12:2.2.0-3.3')])
spark = SparkSession.builder.config(conf=spark_conf) \
    .enableHiveSupport().appName(appName).getOrCreate()
from gresearch.spark.diff import *

df1 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "UK"],
    [3, "GHI", 3000, "JPN"],
    [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "CAN"],
    [3, "GHI", 3500, "JPN"],
    [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])

df1.show()
df2.show()

options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
Any tips are more than welcome. Thank you in advance!
Regards
After some investigation with the AWS support team, I was instructed to include the package .jar file through the Python library path, since the .jar file contains embedded Python packages. Download the correct version of the .jar file (https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension_2.12/2.1.0-3.1 is the version I ended up using), upload it to S3, and reference it under the Glue job setting for the Python library path (e.g. s3://bucket-name/spark-extension_2.12-2.1.0-3.1.jar).
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
job.commit()
left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])
from gresearch.spark.diff import *
left.diff(right, "id").show()

importing from file vs internet in genfromtxt numpy date “TypeError: must be str, not bytes” using python 3.6 [duplicate]

This question already has answers here:
StringIO example does not work
(2 answers)
Closed 4 years ago.
Strangely enough, applying the np.genfromtxt() function to the file (goog.csv), whose data was downloaded and stored from a source, produces no error. Following is the code:
import numpy as np
from matplotlib.dates import bytespdate2num

names = ["A", "B", "C", "D", "E", "F", "G"]
my_array1 = np.genfromtxt("goog.csv",
                          delimiter=',',
                          skip_header=1,
                          names=names,
                          dtype=None,
                          converters={0: bytespdate2num('%Y-%m-%d')})
print(my_array1["A"])
Output:
[ 736536. 736535. 736534. ..., 730124. 730123. 730122.]
However, applying the same function to a list whose data has been fetched from the same source, in the same (.csv) format, produces the TypeError. Following is the code:
import numpy as np
import requests
from matplotlib.dates import bytespdate2num

# Fetch the internet data and store it in a list called stock_data
# (goog_url points to the same CSV source; it is defined elsewhere).
source_code = str(requests.get(goog_url, verify=True, auth=('user', 'pass')).content)
stock_data = []
split_source = source_code.split('\\n')
for line in split_source:
    stock_data.append(line)

names = ["A", "B", "C", "D", "E", "F", "G"]
my_array2 = np.genfromtxt(stock_data,
                          delimiter=',',
                          skip_header=1,
                          names=names,
                          dtype=None,
                          converters={0: bytespdate2num('%Y-%m-%d')})
print(my_array2["A"])
Output:
TypeError: must be str or None, not bytes
Data in the link goog_url, as well as in the file (goog.csv), is of the following format:
2017-07-26,153.3500,153.9300,153.0600,153.5000,153.5000,12778195.00
I can find no reason for the difference and the error in the second case.
This works for me.
Demo:
import numpy as np
import datetime
datefunc = lambda x: datetime.datetime.strptime(x.decode("utf-8"), '%Y-%m-%d')
Date, Open, High, Low, Close, Adjusted_close, Volume = np.genfromtxt(filename, dtype=None, unpack=True, delimiter=',', converters = {0: datefunc}).tolist()
print(Date)
Output:
2017-07-26 00:00:00
Using decode like this assumes x is a bytestring:
In [127]: datefunc = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d')
In [128]: datefunc('1999-01-30')
....
AttributeError: 'str' object has no attribute 'decode'
In [129]: datefunc(b'1999-01-30')
Without the decode it handles the default PY3 string type:
In [130]: datefunc1 = lambda x: datetime.strptime(x, '%Y-%m-%d')
In [132]: datefunc1('1999-01-30')
Out[132]: datetime.datetime(1999, 1, 30, 0, 0)
Previously genfromtxt opened the file in bytestring mode, and thus required this kind of conversion. In current versions it can open the file as unicode and shouldn't need the decode. If your version of genfromtxt accepts an encoding parameter (it may even raise a warning about it), it is one of the newer versions.
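A minimal sketch of that newer behaviour, assuming a numpy recent enough for genfromtxt to accept encoding, in which case the converter receives str and no decode is needed:
import numpy as np
from datetime import datetime

# With encoding set, genfromtxt hands str (not bytes) to the converter,
# so strptime can be applied directly to the field.
datefunc = lambda x: datetime.strptime(x, '%Y-%m-%d')
names = ["A", "B", "C", "D", "E", "F", "G"]
my_array = np.genfromtxt("goog.csv",
                         delimiter=',',
                         skip_header=1,
                         names=names,
                         dtype=None,
                         encoding='utf-8',
                         converters={0: datefunc})
print(my_array["A"][:3])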

With both Cython and Numba, a pandasql sqldf select statement throws sqlite3.OperationalError: no such table

I am really new to Python programming. I have a pandasql query on a DataFrame which runs fine when I run my code with the standard Python 3 implementation. However, after cythonizing it, I always get the following exception:
sqlite3.OperationalError: no such table: dataf
Following is the snippet from the processor.pyx file:
import pandas as pd
from pandasql import sqldf

def process_date(json):
    # json has the format [{"x": "1", "y": "2", "z": "3"}]
    dataf = pd.read_json(json, orient='records')
    sql = """select x, y, z from dataf;"""
    result = sqldf(sql)
Could cythonizing the code make it behave differently? This exact same code runs fine with the standard Python 3 implementation.
Following is the setup.py I have written to transpile the code to C:
# Several files with ext .pyx, which I call by their name
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [
    Extension("c_processor", ["processor.pyx"])]

setup(
    name='RTP_Cython',
    cmdclass={'build_ext': build_ext},
    ext_modules=ext_modules,
)
I also tried to use Numba and got the same error. Code below:
import pandas as pd
from pandasql import sqldf
from numba import jit
from numpy import arange

@jit
def process_data():
    print("In process data")
    json = "[{\"id\": 2, \"name\": \"zain\"}]"
    df = pd.read_json(json, orient='records')
    sql = "select id, name from df;"
    df = sqldf(sql)
    print("the df is %s" % df)

process_data()
If I comment out the @jit annotation, the code works fine.
Should I be using another extension of the pandas libraries that interoperates with C, since both Numba and Cython give me the same error?
I hope there is an easy solution to this.
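One detail that may be worth illustrating (my assumption, not something confirmed in the thread): sqldf finds DataFrames by inspecting the caller's local and global variables, and compiled Cython/Numba functions may not expose their locals the same way. pandasql's sqldf accepts an explicit environment dict, so a sketch that sidesteps the frame inspection looks like this; whether that is enough under Cython or Numba is untested here:
import pandas as pd
from pandasql import sqldf

def process_date(json_str):
    # json_str has the format [{"x": "1", "y": "2", "z": "3"}]
    dataf = pd.read_json(json_str, orient='records')
    sql = """select x, y, z from dataf;"""
    # Pass the table explicitly instead of relying on sqldf's
    # stack-frame inspection of the caller's locals.
    return sqldf(sql, {'dataf': dataf})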

Do not discard keys with null values when converting to JSON in PySpark DataFrame

I am creating a column in a DataFrame from several other columns that I want to store as a JSON serialized string. When the serialization to JSON occurs, keys with null values are dropped. Is there a way to keep keys even if the value is null?
Sample program illustrating the issue:
from pyspark.sql import functions as F

df = sc.parallelize([
    (1, 10),
    (2, 20),
    (3, None),
    (4, 40),
]).toDF(['id', 'data'])

df.collect()
# [Row(id=1, data=10),
#  Row(id=2, data=20),
#  Row(id=3, data=None),
#  Row(id=4, data=40)]

df_s = df.select(F.struct('data').alias('struct'))
df_s.collect()
# [Row(struct=Row(data=10)),
#  Row(struct=Row(data=20)),
#  Row(struct=Row(data=None)),
#  Row(struct=Row(data=40))]

df_j = df.select(F.to_json(F.struct('data')).alias('json'))
df_j.collect()
# [Row(json=u'{"data":10}'),
#  Row(json=u'{"data":20}'),
#  Row(json=u'{}'),           <= would like this to be u'{"data":null}'
#  Row(json=u'{"data":40}')]
Running Spark 2.1.0
I could not find a Spark-specific solution, so I just wrote a UDF and used the Python json package:
import json
from pyspark.sql import types as T

def to_json(data):
    return json.dumps({'data': data})

to_json_udf = F.udf(to_json, T.StringType())

df.select(to_json_udf('data').alias('json')).collect()
# [Row(json=u'{"data": 10}'),
#  Row(json=u'{"data": 20}'),
#  Row(json=u'{"data": null}'),
#  Row(json=u'{"data": 40}')]
Since PySpark 3, one can use the ignoreNullFields option when writing to a JSON file:
spark_dataframe.write.json(output_path, ignoreNullFields=False)
Pyspark docs:
https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.json
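For the column-level case in the question, to_json also takes an options dict; assuming the JSON writer option ignoreNullFields is honored there as well (my assumption, Spark 3+), a sketch continuing the sample program would be:
# Sketch, assuming Spark 3+: pass ignoreNullFields as a to_json option
# so null struct fields are kept in the serialized string.
df_j = df.select(
    F.to_json(F.struct('data'), {'ignoreNullFields': 'false'}).alias('json'))
df_j.collect()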
