How to read an xlsx file using PySpark without the help of pandas - apache-spark

I am using the code below to read an XLSX file on my local PC with the "com.crealytics.spark.excel" library, but I am unable to read the file.
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark = SparkSession.builder \
    .appName("test") \
    .master("local[0]") \
    .getOrCreate()

empFile = "C:/Users/Dev/Downloads/SAMPLE.xlsx"
employeesDF = sqlContext.read.format("com.crealytics.spark.excel") \
    .option("sheetName", "Sheet1").option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "false").option("inferSchema", "false") \
    .option("location", empFile).option("addColorColumns", "False") \
    .load()

employeesDF.createOrReplaceTempView("EMP")
expLevel = sqlContext.sql("Select * from EMP")
expLevel.show()
When I run this code, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: java.lang.NoClassDefFoundError: scala/Product$class
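java.lang.NoClassDefFoundError: scala/Product$class usually means the spark-excel artifact was built for a different Scala version than the Spark installation (for example a _2.12 package on a Scala 2.11 build of Spark). Below is a rough sketch of one way to pull a matching artifact and read the file; the package coordinates and version are illustrative and must be matched to your own Spark/Scala build:

from pyspark.sql import SparkSession

# Illustrative coordinates: pick the spark-excel artifact (_2.11 or _2.12) that matches
# the Scala version of your Spark build, otherwise scala/Product$class errors appear.
spark = SparkSession.builder \
    .appName("test") \
    .master("local[*]") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

employeesDF = spark.read.format("com.crealytics.spark.excel") \
    .option("sheetName", "Sheet1") \
    .option("useHeader", "true") \
    .load("C:/Users/Dev/Downloads/SAMPLE.xlsx")  # the path can also be passed to load()

employeesDF.show()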

Related

Error while connecting big query in GCP using Spark

I was trying to connect to Google BigQuery from PySpark using the code below:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"
spark = SparkSession.builder \
    .master("local") \
    .appName("GCP") \
    .getOrCreate()
spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "key.json")

df = spark.read.format('bigquery') \
    .option("parentProject", "project_name") \
    .option('table', 'project_name.table_name') \
    .load()
df.show()
My Spark version is 2.3 and the BigQuery jar is spark-bigquery-latest_2.12.
Although my service account has the "BigQuery Job User" role at the project level, and "BigQuery Data Viewer" and "BigQuery User" at the dataset level, I still get the error below when executing the above code:
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)
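The PERMISSION_DENIED message concerns the BigQuery Storage Read API: the connector creates a read session in the project given as parentProject, and bigquery.readsessions.create is granted by the "BigQuery Read Session User" role, which is separate from the Job User and Data Viewer roles listed above. A minimal sketch of the read once that role has been granted to the service account on the parent project (project and table names are the placeholders from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GCP").getOrCreate()
spark._jsc.hadoopConfiguration().set(
    "google.cloud.auth.service.account.json.keyfile", "key.json")

# The service account needs the BigQuery Read Session User role
# (which includes bigquery.readsessions.create) on the parentProject.
df = spark.read.format("bigquery") \
    .option("parentProject", "project_name") \
    .option("table", "project_name.table_name") \
    .load()
df.show()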

How to connect spark with hive using pyspark?

I am trying to read Hive tables remotely using PySpark, but it fails with an error saying it is unable to instantiate the Hive metastore client.
I have read multiple answers on SO and other sources; they were mostly about configuration, but none of them addressed why I am unable to connect remotely. I read the documentation and observed that Spark can connect to Hive without changes to any configuration file. Note: I have port-forwarded the machine where Hive is running and made it available at localhost:10000. I even connected to it using Presto and was able to run queries on Hive.
The code is:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext

SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
df.write.saveAsTable('example')
I expect the output to be an acknowledgment that the table was saved, but instead I get this error.
The abridged error is:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 775, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I have run this command:
ssh -i ~/.ssh/id_rsa_sc -L 9000:A.B.C.D:8080 -L 9083:E.F.G.H:9083 -L 10000:E.F.G.H:10000 ubuntu@I.J.K.l
When I check for ports 10000 and 9083 via the commands:
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 10000
Connection to localhost 10000 port [tcp/webmin] succeeded!
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 9083
Connection to localhost 9083 port [tcp/*] succeeded!
Upon running the script, I get the following error:
Caused by: java.net.UnknownHostException: ip-172-16-1-101.ap-south-1.compute.internal
... 45 more
The catch is to set the Hive configuration while creating the Spark session itself.
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config("hive.metastore.uris", "thrift://localhost:9083", conf=SparkConf())
                .enableHiveSupport()
                .getOrCreate())
Note that no changes to the Spark configuration files are required; even serverless services like AWS Glue can make such connections.
The full code:
from pyspark import SparkContext, SparkConf
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, HiveContext

"""
SparkSession ss = SparkSession
    .builder()
    .appName(" Hive example")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate();
"""

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config("hive.metastore.uris", "thrift://localhost:9083", conf=SparkConf())
                .enableHiveSupport()
                .getOrCreate())

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write into Hive
#df.write.saveAsTable('example')

df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()
print(df_load.show())

Error thrown while reading data from cassandraTable

I am trying to read data from a Cassandra table using the pyspark-cassandra connector.
I downloaded the package from pyspark-cassandra.
My script is as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import pyspark_cassandra
from pyspark_cassandra import CassandraSparkContext, Row

conf = SparkConf().setMaster("local").setAppName("App1") \
    .set("spark.cassandra.connection.host", "http://192.168.0.2")
sc = CassandraSparkContext(conf=conf)
rdd = sc.cassandraTable("keyspace1", "table1")
When I run the above script using the command:
$ bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.3.5 pyscript.py
I get the error below:
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o83.newInstance. : java.lang.NoClassDefFoundError: Could not
> initialize class com.datastax.spark.connector.types.TypeConverter$
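A NoClassDefFoundError on com.datastax.spark.connector.types.TypeConverter$ usually points to a mismatch between the connector package and the Spark/Scala version in use, so the package coordinates have to be chosen for your Spark release. As an alternative to the RDD wrapper, here is a rough sketch using the DataFrame reader of the DataStax spark-cassandra-connector; the connector version is illustrative and must match your Spark/Scala build, while the host, keyspace, and table values are the ones from the question:

from pyspark.sql import SparkSession

# Illustrative coordinates: choose the spark-cassandra-connector release built for
# your Spark and Scala versions, otherwise class-loading errors like the one above appear.
spark = SparkSession.builder \
    .appName("App1") \
    .master("local[*]") \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.11:2.4.3") \
    .config("spark.cassandra.connection.host", "192.168.0.2") \
    .getOrCreate()

# connection.host takes a plain host or IP, without an http:// scheme.
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="keyspace1", table="table1") \
    .load()
df.show()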

PySpark 2.2.0 Write DataFrame to S3 AmazonServiceException Class Not Found

I'm trying to write a Spark DataFrame to S3 with pyspark. I'm using Spark version 2.2.0.
from pyspark import SparkContext
from pyspark import sql

# aws_key and aws_secret are defined elsewhere
sc = SparkContext('local', 'Test')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", aws_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", aws_secret)
sc._jsc.hadoopConfiguration().set("fs.s3a.multipart.uploads.enabled", "true")

spark = sql.SparkSession \
    .builder \
    .appName("TEST") \
    .getOrCreate()

sql_context = sql.SQLContext(sc, spark)

filename = 'gerrymandering'
s3_uri = 's3a://mybucket/{}'.format(filename)
print(s3_uri)

df = sql_context.createDataFrame([('1', '4'), ('2', '5'), ('3', '6')], ["A", "B"])
df.write.parquet(s3_uri)
The traceback I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
I'm not sure, but there seems to be a jar dependency error. I've tried multiple versions of hadoop-aws-X.jar as well as aws-java-sdk-X.jar, but they all produce the same error.
As of writing this my command was:
spark-submit --jars hadoop-aws-2.9.0.jar,aws-java-sdk-1.7.4.jar test.py
Any ideas on how I can resolve this NoClassDefFoundError?
Don't try to pair a hadoop-aws JAR with an AWS SDK different from the one it ships with; the AWS SDK changes too much between versions. For hadoop-2.9.0 you need aws-java-sdk-bundle version 1.11.199.
See mvnrepo/hadoop-aws
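Assuming your Hadoop libraries really are 2.9.0, as the spark-submit command above implies, a rough sketch of one way to get a matched pair is to let Spark resolve the bundle that hadoop-aws declares as its dependency instead of supplying the jars by hand. Note also that the s3a connector reads its credentials from fs.s3a.access.key / fs.s3a.secret.key rather than the awsAccessKeyId-style names used in the question; aws_key and aws_secret are the question's placeholders.

from pyspark.sql import SparkSession

# Resolving hadoop-aws 2.9.0 through spark.jars.packages pulls in the matching
# aws-java-sdk-bundle (1.11.199) automatically.
spark = SparkSession.builder \
    .appName("TEST") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.9.0") \
    .getOrCreate()

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret)

df = spark.createDataFrame([('1', '4'), ('2', '5'), ('3', '6')], ["A", "B"])
df.write.parquet('s3a://mybucket/gerrymandering')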

TypeError: 'JavaPackage' object is not callable

I get the following error when I call the Spark SQL API hiveContext.sql():
from pyspark import SparkConf,SparkContext
from pyspark.sql import SQLContext,HiveContext
conf = SparkConf().setAppName("spark_sql")
sc = SparkContext(conf = conf)
hc = HiveContext(sc)
#rdd = sc.textFile("test.txt")
sqlContext = SQLContext(sc)
res = hc.sql("use teg_uee_app")
#for each in res.collect():
# print(each[0])
sc.stop()
I got the following error:
enFile "spark_sql.py", line 23, in <module>
res = hc.sql("use teg_uee_app")
File "/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/spark/python/pyspark/sql/context.py", line 683, in _ssql_ctx
self._scala_HiveContext = self._get_hive_ctx()
File "/spark/python/pyspark/sql/context.py", line 692, in _get_hive_ctx
return self._jvm.HiveContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable
How do I add SPARK_CLASSPATH or use SparkContext.addFile? I have no idea.
Maybe this will help you: When using HiveContext I have to add three jars to the spark-submit arguments:
spark-submit --jars /usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/lib/datanucleus-core-3.2.10.jar,/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar ...
Of course the paths and versions depend on your cluster setup.
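The same jars can also be supplied from the script itself rather than on the spark-submit command line. A minimal sketch under that assumption, reusing the datanucleus paths quoted above; adjust paths and versions to your installation, and whether this alone is sufficient can depend on your deploy mode:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Equivalent of passing --jars to spark-submit; the paths below are the ones
# quoted in the answer and depend on your cluster layout.
conf = (SparkConf()
        .setAppName("spark_sql")
        .set("spark.jars",
             "/usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar,"
             "/usr/lib/spark/lib/datanucleus-core-3.2.10.jar,"
             "/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar"))
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
res = hc.sql("use teg_uee_app")
sc.stop()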
In my case this turned out to be a classpath issue: I had a Hadoop jar on the classpath that was from a different version of Hadoop than the one I was running.
Make sure you only set the executor and/or driver classpaths in one place and that there's no system-wide default applied somewhere such as .bashrc or Spark's conf/spark-env.sh.
