Failed to run PySpark's .withColumn function - apache-spark

I am trying to run PySpark in PyCharm on Windows 10, but I keep getting a weird JVM-related error message (mentioning "Node 81") when trying to execute the simple functions .withColumn() and .withColumnRenamed(). I have a tmp folder on my desktop (see the attached image), and I have set all the environment variables: HADOOP_PATH, JAVA_HOME, PATH, PYTHON_PATH and SPARK_HOME. I was also able to create the Spark session with the following lines of code:
spark = SparkSession \
.builder \
.master("local[*]") \
.appName("Data Est") \
.config("spark.driver.memory", memory='4g') \
.config("spark.sql.shuffle.partitions", partitions=400) \
.config("spark.sql.broadcastTimeout", -1) \
.config("spark.sql.session.timezone", "UTC") \
.config("spark.local.dir", spark_local_dir=[some directory path on desktop]) \
.getOrCreate()
System Environment Variables - Windows 10 64-bit
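Since the error points at the JVM side, one thing worth checking is that HADOOP_HOME, JAVA_HOME and SPARK_HOME are actually visible to the Python process before the session is built. A minimal sketch, where every path is a placeholder rather than the asker's actual layout:

import os
from pyspark.sql import SparkSession

# Placeholder paths - adjust to your own install locations
os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Java\jdk1.8.0_281")
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")  # should contain bin\winutils.exe on Windows
os.environ.setdefault("SPARK_HOME", r"C:\spark\spark-3.1.2-bin-hadoop3.2")
os.environ["PATH"] = os.path.join(os.environ["HADOOP_HOME"], "bin") + os.pathsep + os.environ["PATH"]

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Data Est") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

# Quick sanity check that column operations actually reach the JVM
df = spark.createDataFrame([(1, "a")], ["id", "val"])
df.withColumn("id_copy", df["id"]).withColumnRenamed("val", "value").show()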

Related

Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found when trying to write data on S3 bucket from Spark

I am trying to write data to an S3 bucket from my local computer:
spark = SparkSession.builder \
.appName('application') \
.config("spark.hadoop.fs.s3a.access.key", configuration.AWS_ACCESS_KEY_ID) \
.config("spark.hadoop.fs.s3a.secret.key", configuration.AWS_ACCESS_SECRET_KEY) \
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.getOrCreate()
lines = spark.readStream \
.format('kafka') \
.option('kafka.bootstrap.servers', kafka_server) \
.option('subscribe', kafka_topic) \
.option("startingOffsets", "earliest") \
.load()
streaming_query = lines.writeStream \
.format('parquet') \
.outputMode('append') \
.option('path', configuration.S3_PATH) \
.start()
streaming_query.awaitTermination()
Hadoop version: 3.2.1
Spark version: 3.2.1
I have added the dependency jars to the PySpark jars folder:
spark-sql-kafka-0-10_2.12:3.2.1
aws-java-sdk-s3:1.11.375
hadoop-aws:3.2.1
I get the following error when executing:
py4j.protocol.Py4JJavaError: An error occurred while calling o68.start.
: java.io.IOException: From option fs.s3a.aws.credentials.provider
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
In my case, it worked in the end by adding the following statement:
.config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
Also, all the Hadoop jars in site-packages/pyspark/jars must be the same version: hadoop-aws-3.2.2, hadoop-client-api-3.2.2, hadoop-client-runtime-3.2.2, hadoop-yarn-server-web-proxy-3.2.2.
For version 3.2.2 of hadoop-aws, the aws-java-sdk-s3:1.11.563 package is needed.
I also replaced guava-14.0.jar with guava-23.0.jar.
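Putting this answer together with the asker's builder, a minimal sketch of the resulting session (the access/secret key strings and app name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("application") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.access.key", "<your-access-key>") \
    .config("spark.hadoop.fs.s3a.secret.key", "<your-secret-key>") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .getOrCreate()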
I used the same packages as you. In my case, when I added the line below:
.config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
I got this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.parquet.
: java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
....
To solve this, I installed guava-30.0.
Try downloading these jar libraries and putting them into spark/jars:
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar
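As an alternative to copying jars by hand into spark/jars, the same dependencies can usually be pulled in via spark.jars.packages; a sketch, assuming the driver has internet access and that these Maven coordinates match your Hadoop/Spark versions:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("application") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1,"
            "org.apache.hadoop:hadoop-aws:3.2.2,"
            "com.amazonaws:aws-java-sdk-bundle:1.11.563") \
    .getOrCreate()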

Pyspark is unable to find bigquery datasource

This is my PySpark configuration. I've followed the steps mentioned here and didn't create a SparkContext.
spark = SparkSession \
.builder \
.appName(appName) \
.config(conf=spark_conf) \
.config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0') \
.config('spark.jars.packages','com.google.cloud.bigdataoss:gcsio:1.5.4') \
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar,spark-bigquery-with-dependencies_2.12-0.21.1.jar,spark-bigquery-latest_2.11.jar') \
.config('spark.jars', 'postgresql-42.2.23.jar,bigquery-connector-hadoop2-latest.jar') \
.getOrCreate()
Then, when I try to write a demo Spark dataframe to BigQuery:
df.write.format('bigquery') \
.mode(mode) \
.option("credentialsFile", "creds.json") \
.option('table', table) \
.option("temporaryGcsBucket",bucket) \
.save()
It throws an error:
File "c:\sparktest\vnenv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o60.save.
: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
My problem was with faulty jar versions. I am using Spark 3.1.2 and Hadoop 3.2; these are the Maven packages and the code that worked for me.
spark = SparkSession \
.builder \
.master('local') \
.appName('spark-read-from-bigquery') \
.config('spark.jars.packages','com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-1.9.5,com.google.guava:guava:r05') \
.config('spark.jars','guava-11.0.1.jar,gcsio-1.9.0-javadoc.jar') \
.getOrCreate()
# note: you will have to download the jars listed in spark.jars manually
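With the connector resolving correctly, a quick read back from BigQuery is a simple way to confirm that the data source is found; a sketch, where the table name and credentials file are placeholders:

# Reading from BigQuery with the session created above;
# 'my_dataset.my_table' and 'creds.json' are placeholders
df = spark.read.format('bigquery') \
    .option('credentialsFile', 'creds.json') \
    .option('table', 'my_dataset.my_table') \
    .load()
df.show()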

How to use pyspark on yarn-cluster mode by code

This is my code.
spark = SparkSession.builder. \
master("yarn"). \
config("hive.metastore.uris", settings.HIVE_URL). \
config("spark.executor.memory", '5g'). \
config("spark.driver.memory", '2g').\
config('spark.submit.deployMode', 'cluster').\
enableHiveSupport(). \
appName(app_name). \
getOrCreate()
I want to use this approach to submit a task to a YARN cluster.
But it throws an exception:
Cluster deploy mode is not applicable to Spark shells.
What should I do?
I need to use cluster mode, not client mode.
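For context, cluster deploy mode cannot be requested from a SparkSession created inside an already-running Python process, which is why Spark reports it as a shell. A common pattern, sketched below with a placeholder script name and reusing the settings module from the question, is to leave the deploy mode out of the code and pass it to spark-submit instead:

# your_job.py (placeholder name) - deploy mode and memory are not set in code
from pyspark.sql import SparkSession
import settings  # assumes the same settings module as in the question

spark = SparkSession.builder \
    .appName("my_app") \
    .config("hive.metastore.uris", settings.HIVE_URL) \
    .enableHiveSupport() \
    .getOrCreate()

# Submitted from the command line, where cluster mode is allowed:
#   spark-submit --master yarn --deploy-mode cluster \
#     --conf spark.executor.memory=5g --conf spark.driver.memory=2g your_job.py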

Cannot run spark-nlp due to Exception: Java gateway process exited before sending its port number

I have a working PySpark installation running through Jupyter on an Ubuntu VM.
I have only one Java version (openjdk version "1.8.0_265"), and I can run a local Spark (v2.4.4) session like this without problems:
import pyspark
from pyspark.sql import SparkSession
memory_gb = 24
conf = (
pyspark.SparkConf()
.setMaster('local[*]')
.set('spark.driver.memory', '{}g'.format(memory_gb))
)
spark = SparkSession \
.builder \
.appName("My Name") \
.config(conf=conf) \
.getOrCreate()
Now I want to use spark-nlp. I've installed spark-nlp using pip install spark-nlp in the same virtual environment my Pyspark is in.
However, when I try to use it, I get the error Exception: Java gateway process exited before sending its port number.
I've tried to follow the instructions in the documentation here, but without success.
So doing
spark = SparkSession \
.builder \
.appName("RevDNS Stats") \
.config(conf=conf) \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5")\
.getOrCreate()
only results in the error mentioned above.
How do I fix this?
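One thing that may be worth trying (a sketch, not a confirmed fix for this particular setup) is letting the spark-nlp helper build the session, so that the jar coordinates are resolved to match the installed Python package:

import sparknlp

# Let spark-nlp create the SparkSession itself instead of the manual builder above,
# so the spark-nlp jar it pulls in matches the pip-installed package version
spark = sparknlp.start()
print(sparknlp.version(), spark.version)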

Loading data from GCS using Spark Local

I am trying to read data from GCS buckets on my local machine, for testing purposes. I would like to sample some of the data in the cloud.
I have downloaded the GCS Hadoop Connector JAR.
And setup the sparkConf as follow:
conf = SparkConf() \
.setMaster("local[8]") \
.setAppName("Test") \
.set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
.set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/keyfile")
sc = SparkContext(conf=conf)
spark = SparkSession.builder \
.config(conf=sc.getConf()) \
.getOrCreate()
spark.read.json("gs://gcs-bucket")
I have also tried to set the conf like so:
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.json.keyfile", "path/to/keyfile")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
I am using PySpark installed via pip and running the code with the unittest module from IntelliJ. I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.json.
: java.io.IOException: No FileSystem for scheme: gs
What should I do?
Thanks!
To solve this issue, you need to add the fs.gs.impl property in addition to the properties that you have already configured:
sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
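Combined with the configuration already shown in the question, a minimal sketch of the full setup (the connector jar path and keyfile path are placeholders):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# The spark.hadoop.* prefix pushes these values into the Hadoop configuration
conf = SparkConf() \
    .setMaster("local[8]") \
    .setAppName("Test") \
    .set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
    .set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .set("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/keyfile")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.json("gs://gcs-bucket")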
