Py4JJavaError: An error occurred while calling o389.csv - apache-spark

I'm new to PySpark. I'm running PySpark on Databricks. My data is stored in Azure Data Lake Storage (ADLS). I'm trying to read a CSV file from ADLS into a PySpark DataFrame, so I wrote the following code:
import pyspark
from pyspark import SparkContext
from pyspark import SparkFiles
df = sqlContext.read.csv(SparkFiles.get("dbfs:mycsv path in ADSL/Data.csv"),
header=True, inferSchema= True)
But I'm getting this error message:
Py4JJavaError: An error occurred while calling o389.csv.
How can I fix this error?

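The SparkFiles class is intended for accessing files shipped as part of the Spark job (for example, with SparkContext.addFile). SparkFiles.get returns a path inside the job's local working directory, so it does not point at your file. If you just need to read a CSV file that is already on ADLS, use spark.read.csv directly, like: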
df = spark.read.csv("dbfs:mycsv path in ADSL/Data.csv",
header=True, inferSchema=True)
It's better not to use sqlContext; it's kept only for compatibility reasons.
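As a minimal sketch of the advice above, the read can be wrapped in a small helper that passes the storage path straight to the reader. The helper name read_adls_csv and the path in the usage comment are hypothetical; `spark` is assumed to be the SparkSession that a Databricks notebook predefines.

```python
def read_adls_csv(spark, path):
    """Load a CSV from a storage path into a Spark DataFrame.

    `spark` is assumed to be an existing SparkSession (Databricks notebooks
    predefine one). The path goes straight to the reader, without SparkFiles.
    """
    # header=True uses the first row as column names;
    # inferSchema=True asks Spark to guess column types from the data.
    return spark.read.csv(path, header=True, inferSchema=True)


# Usage in a notebook (hypothetical path, replace with your own):
# df = read_adls_csv(spark, "dbfs:/mnt/<your-mount>/Data.csv")
```

Passing the session in as an argument keeps the helper usable both in notebooks and in scripts that build their own session.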

Related

How to resolve invalid column name on parquet file read itself in PySpark

I set up a standalone Spark and a standalone HDFS.
I installed PySpark and was able to create a Spark session.
I uploaded one Parquet file to HDFS under /data: hdfs://localhost:9000/data
I tried to create a DataFrame from this directory using PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("test").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/data").withColumnRenamed("Wafer ID", "Wafer_ID")
I am getting an invalid column name error even with withColumnRenamed.
I also tried the following code, but I got the same error:
df = spark.read.parquet("hdfs://localhost:9000/data").select(col("Wafer ID").alias("Wafer_ID"))
I could change the column names manually (with pandas) or use a different file entirely, but I want to know whether there is a way to solve this problem in Spark.
What am I doing wrong?

SparkSession Object has no attribute read_csv

I get an error message when running the commands below using PySpark (PyCharm IDE):
spark=SparkSession.builder.master("local").appName("Sample").getOrCreate()
df=spark.read_csv('filename.csv')
Error: SparkSession object has no attribute read_csv
Your syntax is incorrect. Use spark.read.csv(...).
With Spark, you need to call spark.read.csv(file_name).
read_csv(file_name) is a pandas function for reading CSV files.
Don't confuse a pandas DataFrame with a Spark DataFrame.
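To make the contrast concrete, here is a small side-by-side sketch. The function names are hypothetical; the two calls inside them, spark.read.csv and pandas.read_csv, are the real APIs the answers refer to.

```python
def read_with_spark(spark, file_name):
    """Spark: csv() is a method on the DataFrameReader exposed as spark.read."""
    # There is no spark.read_csv; the reader sits behind the `read` property.
    return spark.read.csv(file_name, header=True)


def read_with_pandas(file_name):
    """pandas: read_csv is a module-level function, not a session method."""
    import pandas as pd  # imported lazily so the Spark path does not need pandas
    return pd.read_csv(file_name)
```

The first returns a pyspark.sql.DataFrame and the second a pandas.DataFrame. They are different types with different methods, which is why the pandas spelling fails on a SparkSession.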

java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders while reading from Azure Blob Storage

I am trying to read a CSV file stored in an Azure Storage account. For that, I installed Spark on my virtual machine and am trying to read the CSV file into a DataFrame from PySpark.
I read how to do this, followed the steps, and copied the latest hadoop-azure and azure-storage JAR files into my /jar directories. Then I got this error:
NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
I searched for this error and found that I need to use hadoop-azure-2.8.5.jar instead of the latest hadoop-azure JAR. So I replaced the latest hadoop-azure JAR with this one and ran my PySpark code again.
After running my code, I encountered another error:
: java.lang.NoSuchMethodError:
org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
Below is my PySpark code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()
storage_account_name = "<storage_account_name>"
storage_account_access_key = "<storage_account_access_key>"
spark.conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",storage_account_access_key)
spark._jsc.hadoopConfiguration().set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.my_account.blob.core.windows.net", "storage_account_access_key")
df = spark.read.format("csv").option("inferSchema", "true").load("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path_to_csv>/sample_file.csv")
df.show()
I searched for this and tried various hadoop-azure JAR versions. The one which worked for me was hadoop-azure-2.7.0.jar.
With this JAR version, I was able to read the CSV file from Blob storage.

Reading AVRO from Azure Datalake in Databricks

I am trying to read Event Hub data in Avro format. I am having issues loading the data into a DataFrame in Databricks.
Here's the code I am using. Please let me know if I am doing anything wrong.
path='/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/*.avro'
df = spark.read.format("com.databricks.spark.avro") \
.load(path)
Error
IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI:
I tried some other code to get rid of the error, but now I get a syntax error:
import org.apache.spark.sql.SparkSession
SparkSession spark = SparkSession
.builder()
.config("spark.sql.warehouse.dir","/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/")
.getOrCreate()
SyntaxError: invalid syntax
File "<command-265213674761208>", line 2
SparkSession spark = SparkSession
Relative path in absolute URI
You need to specify the protocol rather than using a bare /mnt path.
For example, wasb://some/path/ if you are reading from Azure Blob storage.
You can also drop *.avro, since the Avro reader should already pick up all Avro files in the path.
https://docs.databricks.com/data/data-sources/read-avro.html#python-api
And if you want to read directly from Event Hubs, that exposes a Kafka API, not a file path, AFAIK.
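The two suggestions above (give the path an explicit scheme, and point the reader at the directory instead of a *.avro glob) can be sketched as follows. The helper names are hypothetical. The default "dbfs" scheme assumes the /mnt path is a DBFS mount, and the "avro" format name assumes a runtime with the built-in Avro source (Spark 2.4 or later); on older runtimes the question's "com.databricks.spark.avro" would be passed instead.

```python
def with_scheme(path, scheme="dbfs"):
    """Prefix a bare absolute path with a URI scheme; leave full URIs alone."""
    if "://" in path or path.startswith(scheme + ":"):
        return path
    return "{}:{}".format(scheme, path)


def read_avro_dir(spark, directory, fmt="avro"):
    """Read every Avro file under `directory` (no trailing *.avro glob)."""
    return spark.read.format(fmt).load(with_scheme(directory))


# Usage in a notebook, with the directory from the question:
# df = read_avro_dir(
#     spark,
#     "/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/",
# )
```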

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in Databricks to get a Spark session and use it to open a CSV file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError: name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow this example (you will understand better if you watch it from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it working by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the PySpark code, as I found that reading a CSV was working in the interactive shell.
Please note that the example code you are using is for Spark version 2.x.
"spark" and "SparkSession" are not available on Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.
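A quick way to run that check is sketched below. The helper name is hypothetical; it only parses a version string such as the one returned by `sc.version`, where `sc` is assumed to be the SparkContext that notebooks and the interactive shell predefine.

```python
def has_spark_session(version):
    """True if this Spark version ships SparkSession (introduced in 2.0)."""
    major = int(version.split(".")[0])
    return major >= 2


# Usage in a notebook or shell where `sc` already exists:
# print(sc.version)
# if not has_spark_session(sc.version):
#     print("Spark 1.x: `spark` and SparkSession are unavailable; use sqlContext")
```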
