SparkSession Object has no attribute read_csv - apache-spark

I am getting an error message while running the commands below using PySpark (PyCharm IDE):
spark = SparkSession.builder.master("local").appName("Sample").getOrCreate()
df=spark.read_csv('filename.csv')
Error: SparkSession object has no attribute read_csv

Your syntax is incorrect. Use spark.read.csv(...)

With Spark you need to use spark.read.csv(file_name).
read_csv(file_name) is the pandas function for reading CSV files.
Don't confuse a pandas DataFrame with a Spark DataFrame.
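A minimal sketch of the difference (the file name and reader options are just for illustration):
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.master("local").appName("Sample").getOrCreate()
# Spark: csv() lives on the DataFrameReader returned by spark.read
spark_df = spark.read.csv("filename.csv", header=True, inferSchema=True)
# pandas: read_csv is a top-level pandas function, unrelated to SparkSession
pandas_df = pd.read_csv("filename.csv")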

Related

How to resolve invalid column name on parquet file read itself in PySpark

I set up a standalone Spark and a standalone HDFS.
I installed PySpark and was able to create a Spark session.
I uploaded one Parquet file to HDFS under /data: hdfs://localhost:9000/data
I tried to create a DataFrame out of this directory using PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("test").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/data").withColumnRenamed("Wafer ID", "Wafer_ID")
I am getting an invalid column name error even with withColumnRenamed.
I tried the following code as well, but I got the same error:
from pyspark.sql.functions import col
df = spark.read.parquet("hdfs://localhost:9000/data").select(col("Wafer ID").alias("Wafer_ID"))
I have ways to change the column names manually (pandas) or to use a different file entirely, but I want to know if there is a way to solve this problem.
What am I doing wrong?

spark-xml on jupyter notebook

I am trying to run spark-xml in my Jupyter notebook in order to read XML files using Spark:
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'
I found out that this is the way to use it. But when I try to import com.databricks.spark.xml._, I get an error saying:
no module named "com"
As I see it, you are not able to load the XML file as-is using PySpark and the Databricks library; this problem happens often. Try running this command from your terminal, or from your notebook as a shell command:
pyspark --packages com.databricks:spark-xml_2.11:0.4.1
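If the package loads, note that in PySpark you do not import com.databricks.spark.xml (that is Scala import syntax); you select the format through the DataFrame reader. A minimal sketch, where the rowTag value and path are assumptions you will need to adapt:
# "record" and the path are placeholders; adjust them to your data
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("/path/to/data/*.xml")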
If that does not work, you can try this workaround: read your file as text and then parse it.
# define your parser function: input is one Row of the text RDD (the whole file as a string)
import xml.etree.ElementTree as ET

def parse_xml(row):
    """
    Parse the XML string held in row.value, extract the elements,
    and return a list of tuples (one per record).
    """
    root = ET.fromstring(row.value)
    # "record" is an assumed element name; adapt it to your XML structure
    return [tuple(elem.text for elem in rec) for rec in root.iter("record")]
# read each file as a single text row, then drop to the RDD level
file_rdd = spark.read.text("/path/to/data/*.xml", wholetext=True).rdd
# parse xml tree, extract the records and transform to new RDD
records_rdd = file_rdd.flatMap(parse_xml)
# convert RDDs to DataFrame with the pre-defined schema
output_df = records_rdd.toDF(my_schema)
If toDF() does not work, build the DataFrame explicitly with spark.createDataFrame(records_rdd, my_schema).

Py4JJavaError: An error occurred while calling o389.csv

I'm new to PySpark. I'm running PySpark on Databricks. My data is stored in Azure Data Lake Storage (ADLS). I'm trying to read a CSV file from ADLS into a PySpark DataFrame, so I wrote the following code:
import pyspark
from pyspark import SparkContext
from pyspark import SparkFiles
df = sqlContext.read.csv(SparkFiles.get("dbfs:mycsv path in ADSL/Data.csv"),
                         header=True, inferSchema=True)
But I'm getting the error message:
Py4JJavaError: An error occurred while calling o389.csv.
Can you suggest how to rectify this error?
The SparkFiles class is intended for accessing the files shipped as part of the Spark job. If you just need access to the CSV file available on ADLS, then you just need to use spark.read.csv, like:
df = spark.read.csv("dbfs:mycsv path in ADSL/Data.csv",
                    header=True, inferSchema=True)
It's better not to use sqlContext; it's kept only for compatibility reasons.
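For contrast, here is a minimal sketch of what SparkFiles is actually for, assuming the usual sc and spark objects of a Databricks notebook and purely illustrative file names: you ship a side file with the job via sc.addFile and resolve it with SparkFiles.get, while data that already lives in ADLS/DBFS is read directly.
from pyspark import SparkFiles
sc.addFile("/tmp/lookup.csv")              # ship a helper file with the job
local_path = SparkFiles.get("lookup.csv")  # resolve its local path on the driver/executors
# data already in ADLS/DBFS is read directly instead
df = spark.read.csv("dbfs:/path/to/Data.csv", header=True, inferSchema=True)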

Is it possible to read an Excel file from Apache Zeppelin into PySpark or into a Pandas DataFrame?

I have got a file in HDFS (/user/username/Project/data/file.xlsx) that I want to read into a DataFrame. (I do not care if it is a PySpark DataFrame or Pandas, but Pandas is preferred.)
I am using a Zeppelin Notebook to do my code.
Is it possible to get data from this file?
I have already tried the following commands, but none of them worked:
df = pd.read_excel("/user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs:///user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs://user/username/Project/data/file.xlsx")
I don't think you can read files stored in HDFS directly with pandas.
You probably have to either:
load the file into Spark and then use toPandas():
df = spark.read.format("excel").load("hdfs:xxx").toPandas()
use some alternative to enable pandas to read directly, as described here (see the sketch after this list).
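For the second option, one possible sketch, assuming pyarrow built with HDFS support and openpyxl are installed; the namenode host and port are placeholders to adjust:
import pandas as pd
import pyarrow.fs as pafs
hdfs = pafs.HadoopFileSystem("localhost", 9000)   # adjust to your namenode
with hdfs.open_input_file("/user/username/Project/data/file.xlsx") as f:
    df = pd.read_excel(f, engine="openpyxl")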
It seems that, in the Python interpreter in Apache Zeppelin, export and import can only be done through pd.read_csv and to_csv.

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in Databricks in order to call a Spark session and use it to open a CSV file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError: name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow this example (you will understand better if you watch it from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it to work by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the PySpark code, as I found that reading CSV was working in the interactive shell.
Please note that the example code you are using is for Spark version 2.x.
"spark" and "SparkSession" are not available on Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.
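If you are indeed on Spark 1.x, a minimal sketch of the equivalent read uses SQLContext together with the spark-csv package (started with --packages com.databricks:spark-csv_2.10:1.5.0; the version is illustrative):
from pyspark import SparkContext
from pyspark.sql import SQLContext
# on Databricks, sc and sqlContext usually already exist; skip these two lines there
sc = SparkContext(appName="csv-read")
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load("/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv")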
