Read XML file in Jupyter notebook - python-3.x

I tried to read an XML file in a Jupyter notebook and got the error below:
import os
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.13:0.4.1 pyspark-shell'
df = spark.read.format('xml').option('rowTag', 'row').load('test.xml')
The error:
Py4JJavaError: An error occurred while calling o251.load.
: java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
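A likely cause (a sketch, not a confirmed fix) is that PYSPARK_SUBMIT_ARGS only takes effect if it is set before the SparkSession is created, and that the spark-xml artifact's Scala suffix must match the Scala version of the Spark build. A minimal sketch, assuming a Spark 3.x build on Scala 2.12; the package coordinate below is an assumption and should be adjusted to the actual installation:
# Sketch: register spark-xml via spark.jars.packages before any SparkSession exists.
# The coordinate com.databricks:spark-xml_2.12:0.14.0 is an assumption; use the
# artifact whose Scala suffix matches your Spark build.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-xml")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
         .getOrCreate())

df = spark.read.format("xml").option("rowTag", "row").load("test.xml")
df.printSchema()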

Related

pyspark & hadoop mismatch in jupyter notebook

In Google Colab, I'm able to successfully set up a connection to S3 with PySpark and Hadoop using the code below. When I try running the same code in a Jupyter Notebook, I get the error:
Py4JJavaError: An error occurred while calling o46.parquet. :
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
From what I've found on Stack Overflow, it seems that this error occurs when the Hadoop and PySpark versions are mismatched, but in my setup I've specified that they are both version 3.1.2.
Is someone able to tell me why I am getting this error and what I need to change for it to work? Code below:
In[1]
# Download AWS SDK libs
! rm -rf aws-jars
! mkdir -p aws-jars
! wget -c 'https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar'
! wget -c 'https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.2/hadoop-aws-3.1.2.jar'
! mv *.jar aws-jars
In[2]
! pip install pyspark==3.1.2
! pip install pyspark[sql]==3.1.2
In[3]
from pyspark.sql import SparkSession
AWS_JARS = '/content/aws-jars'
AWS_CLASSPATH = "{0}/hadoop-aws-3.1.2.jar:{0}/aws-java-sdk-bundle-1.11.271.jar".format(AWS_JARS)
spark = SparkSession.\
builder.\
appName("parquet").\
config("spark.driver.extraClassPath", AWS_CLASSPATH).\
config("spark.executor.extraClassPath", AWS_CLASSPATH).\
getOrCreate()
AWS_KEY_ID = "YOUR_KEY_HERE"
AWS_SECRET = "YOUR_SECRET_KEY_HERE"
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY_ID)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
In[4]
dataset1 = spark.read.parquet("s3a://bucket/dataset1/path.parquet")
dataset1.createOrReplaceTempView("dataset1")
dataset2 = spark.read.parquet("s3a://bucket/dataset2/path.parquet")
dataset2.createOrReplaceTempView("dataset2")
After In[4] is run, I receive the error:
Py4JJavaError: An error occurred while calling o46.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
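For reference, a common alternative to assembling the classpath by hand (a sketch, assuming the pip-installed PySpark 3.1.2, which bundles Hadoop 3.2.0 rather than 3.1.2) is to let Spark resolve a matching hadoop-aws and its AWS SDK dependency transitively:
# Sketch: let Spark pull hadoop-aws and its matching aws-java-sdk-bundle itself.
# The 3.2.0 version is an assumption based on the Hadoop build bundled with the
# pip-installed PySpark 3.1.2; adjust it to the Hadoop version Spark actually ships.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
         .getOrCreate())

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_KEY_HERE")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY_HERE")

dataset1 = spark.read.parquet("s3a://bucket/dataset1/path.parquet")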

Failed to load dynlib/dll

I am trying to package the following Python code into an executable file using PyInstaller:
import pandas as pd
import teradatasql
with teradatasql.connect(host='abcdxxx', user='abcdxxx', password='abcdxxx') as connect:
query = "SHOW TABLE AdventureWorksDW.DimAccount"
df = pd.read_sql(query, connect)
print(df)
When I run the .exe file, it gives me the error:
PyInstallerImportError: Failed to load dynlib/dll
'C:\\Users\\GAX~1.P\\AppData\\Local\\Temp\\_MEI153202\\teradatasql\\teradatasql.dll'.
Most likely this dynlib/dll was not found when the application was frozen.
[9924] Failed to execute script 'hello' due to unhandled exception!
I tried to make the following changes to the .spec file:
b = [
    (r'C:\Users\Path_to_Python\Python\Python310\Lib\site-packages\teradatasql\teradatasql.dll',
     '.\\teradatasql')
]
a = Analysis(['hello.py'],
pathex=[],
binaries=b,
datas=[] # , .....
)
But it doesn't resolve the problem. How to fix this?
We provide an article explaining how to include the Teradata SQL Driver for Python into an application packaged by PyInstaller:
https://support.teradata.com/community?id=community_blog&sys_id=c327eac51b1e9c103b00bbb1cd4bcb37
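As a rough sketch of the usual approach (not necessarily the article's exact steps), the .spec file can collect the package's shared libraries with PyInstaller's hook utilities instead of hard-coding the DLL path; collect_dynamic_libs returns (source, destination) pairs suitable for the binaries argument:
# hello.spec (fragment, sketch only): gather teradatasql's bundled shared libraries
# automatically instead of listing teradatasql.dll by absolute path.
from PyInstaller.utils.hooks import collect_dynamic_libs

binaries = collect_dynamic_libs('teradatasql')

a = Analysis(
    ['hello.py'],
    pathex=[],
    binaries=binaries,
    datas=[],
)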

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

I am trying to read files from S3 using hadoop-aws; the command used to run the code is shown below.
Please help me resolve this and understand what I am doing wrong.
# run using command
# time spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.1 connect_s3_using_keys.py
from pyspark import SparkContext, SparkConf
import ConfigParser
import pyspark
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Deepak_1ST_job")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
hadoop_conf = sc._jsc.hadoopConfiguration()
config = ConfigParser.ConfigParser()
config.read("/home/deepak/Desktop/secure/awsCred.cnf")
accessKeyId = config.get("aws_keys", "access_key")
secretAccessKey = config.get("aws_keys", "secret_key")
hadoop_conf.set(
"fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs3a.access.key", accessKeyId)
hadoop_conf.set("s3a.secret.key", secretAccessKey)
sqlContext = pyspark.SQLContext(sc)
df = sqlContext.read.json("s3a://bucket_name/logs/20191117log.json")
df.show()
EDIT 1:
As I am new to PySpark, I am unaware of these dependencies, and the error is not easy to understand. I am getting the following error:
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 98, in deco
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.json.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:816)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:792)
at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:747)
at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.
I had the same issue with spark 3.0.0 / hadoop 3.2.
What worked for me was to replace the hadoop-aws-3.2.1.jar in spark-3.0.0-bin-hadoop3.2/jars with hadoop-aws-3.2.0.jar found here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.2.0
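Equivalently (a sketch; the version must match the Hadoop build the Spark distribution ships with, which is 3.2.0 for the hadoop3.2 builds), the matching connector can be requested at submit time instead of swapping jars by hand:
# Sketch: pin hadoop-aws to the bundled Hadoop version rather than 3.2.1
# time spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.0 connect_s3_using_keys.py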
Check your Spark Guava jar version. If you download Spark from Amazon, as I did, from the link in their documentation (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz), you can see the included Guava version is guava-14.0.1.jar, while their container uses guava-21.0.jar.
I have reported the issue to them and they will repack their Spark to include the correct version. If you are interested in the bug itself, here is the link: https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#ClassNotFoundException:_org.apache.hadoop.fs.s3a.S3AFileSystem
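As a side note, unrelated to the Guava problem: the Hadoop configuration keys in the question differ from the standard s3a property names used elsewhere on this page. A fragment reusing the question's hadoop_conf, accessKeyId, and secretAccessKey shows the usual names:
# Standard s3a property names (compare with the question's "fs.s3n.impl",
# "fs3a.access.key" and "s3a.secret.key")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", accessKeyId)
hadoop_conf.set("fs.s3a.secret.key", secretAccessKey)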

Import class on spark-shell where package name starts with the word "spark"

I have opened spark-shell. In the shell we already have a variable:
spark: org.apache.spark.sql.SparkSession
I have a third-party JAR whose package name starts with "spark", for example:
spark.myreads.one.KafkaProducerWrapper
When I try to import the above package in the spark shell, I get this exception:
scala> import spark.myreads.one.KafkaProducerWrapper
<console>:38: error: value myreads is not a member of org.apache.spark.sql.SparkSession
import spark.myreads.one.KafkaProducerWrapper
How can I import such a package in spark-shell, resolving the above conflict?
I'm using Spark 2.0.0, JDK 1.8, and Scala 2.11.
Use _root_ at the beginning so the import resolves from the root package instead of the spark value already in scope, like this:
import _root_.spark.myreads.one.KafkaProducerWrapper

groovysh imports when working with HBase fail

What *.jar files do I need to make groovysh work with HBase 1.1.2? I am trying to run a simple script and the following imports fail:
groovy:000>
import org.apache.hadoop.hbase.client.Put
ERROR java.lang.NoClassDefFoundError:
org/apache/hadoop/hbase/io/HeapSize
at java_lang_Runnable$run.call (Unknown Source)
groovy:000>
import org.apache.hadoop.hbase.client.Result
ERROR java.lang.NoClassDefFoundError:
org/apache/hadoop/hbase/CellScannable
at java_lang_Runnable$run.call (Unknown Source)
import org.apache.hadoop.hbase.util.Bytes
Invalid import definition: 'org.apache.hadoop.hbase.util.Bytes'; reason: startup failed:
script14891462389401754287428.groovy: 1: unable to resolve class org.apache.hadoop.hbase.util.Bytes
# line 1, column 1.
import org.apache.hadoop.hbase.util.Bytes
I have loaded hbase-client.jar on my classpath. I just need to write a simple script that puts and increments HBase values and executes it via groovysh.
Edit 1
Now I get this:
groovy:000> groovy.grape.Grape.grab(group:'org.apache.hbase', module:'hbase-client', version:'1.3.0')
ERROR java.lang.RuntimeException:
Error grabbing Grapes -- [download failed: junit#junit;4.12!junit.jar, download failed: org.slf4j#slf4j-api;1.7.7!slf4j-api.jar, download failed: org.slf4j#slf4j-log4j12;1.6.1!slf4j-log4j12.jar]
groovy:000> groovy.grape.Grape.grab('org.apache.hbase:hbase-client:1.3.0')
The HBase client has a lot of dependencies:
http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.hbase/hbase-client/1.1.1/
You can't just grab one jar and stick it on the classpath; you need a whole load of them.
I don't use groovysh, but you should be able to do:
:grab 'org.apache.hbase:hbase-client:1.3.0'
That should pull down hbase-client and all of its dependencies onto your classpath.
