AttributeError: 'SQLContext' object has no attribute 'jsonFile' - apache-spark

I hit this problem on CentOS 7.0 with Spark 2.1.0 when I perform the following actions. I am new to Spark. How can I fix it?
>>> from pyspark.sql import SQLContext
>>> ssc = SQLContext(sc)
>>> df = ssc.jsonFile('file:///root/work/person.json')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SQLContext' object has no attribute 'jsonFile'

Use SparkSession with the newer versions of Spark and read the file with
df = spark.read.json('path to json')

jsonFile has been deprecated; please use sqlContext.read.json instead.
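Both answers point at the same replacement API. Here is a minimal sketch, assuming Spark 2.x and the file path from the question:

# Read JSON through the Spark 2.x DataFrameReader instead of the removed jsonFile.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJson").getOrCreate()

# Preferred in Spark 2.x: the reader hanging off the SparkSession
df = spark.read.json('file:///root/work/person.json')
df.show()

# Equivalent call on an existing SQLContext (kept for backward compatibility):
# df = ssc.read.json('file:///root/work/person.json')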

Related

Pyarrow Table doesn't seem to have to_pylist() as a method

The pyarrow documentation says that the Table class has a to_pylist method which should return a list of dictionaries.
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pylist
When I run their code example:
import pyarrow as pa
import pandas as pd
df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
                   'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
table = pa.Table.from_pandas(df)
table.to_pylist()
I get the following AttributeError:
Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: 'pyarrow.lib.Table' object has no attribute 'to_pylist'
Has to_pylist been removed or is there something wrong with my package?
The to_pylist method was added to pa.Table in pyarrow 7.0.0. Could you check whether the version of pyarrow you are using is older than that?
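A quick way to check is to print the installed version before calling the method. Below is a small sketch; the hasattr fallback via to_pydict is only an illustration, not part of pyarrow's documented example:

# to_pylist only exists on pyarrow.Table from pyarrow 7.0.0 onwards.
import pyarrow as pa
import pandas as pd

print(pa.__version__)  # must be 7.0.0 or newer for Table.to_pylist

df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
                   'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
table = pa.Table.from_pandas(df)

if hasattr(table, 'to_pylist'):
    rows = table.to_pylist()
else:
    # Fallback for older pyarrow: rebuild the row dicts from to_pydict()
    cols = table.to_pydict()
    rows = [dict(zip(cols, values)) for values in zip(*cols.values())]

print(rows[0])  # e.g. {'n_legs': 2, 'animals': 'Flamingo'}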

'SparkSession' object has no attribute 'textFile'

I am currently using SparkSession and was told that the SparkContext is available within the SparkSession. However, when I run the code, it shows me an error that textFile does not exist on SparkSession.
Below is the code I have written:
import findspark
findspark.init()
from pyspark.sql import SparkSession, Row
import collections
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file://C:/temp").appName("SparkSQL").getOrCreate()
lines = spark.textFile('C:/Users/file.xslx')
The error is as follow:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_59944/722806425.py in <module>
----> 1 lines = spark.textFile('C:/Users/samue/bt4221_spark/exercise/week5/customer-orders.xslx')
AttributeError: 'SparkSession' object has no attribute 'textFile'
My current versions:
findspark: 1.4.2
pyspark: 3.0.3
I don't think it's related to any version issue. Any help is greatly appreciated! :)
textFile is a method of the SparkContext class, not of SparkSession.
spark.sparkContext.textFile('filepath')
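A minimal sketch of that fix, assuming the SparkSession from the question (the file path below is only a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# The RDD API lives on SparkContext, which SparkSession exposes as .sparkContext
lines = spark.sparkContext.textFile('C:/Users/file.txt')
print(lines.count())

# For structured files, the DataFrame readers on SparkSession are usually a
# better fit, e.g. spark.read.text(...) or spark.read.csv(...)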

ValueError: Cannot run multiple SparkContexts at once in spark with pyspark

I am new to Spark and I am trying to run this code in PySpark:
from pyspark import SparkConf, SparkContext
import collections
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
but it gives me this error message:
Using Python version 3.5.2 (default, Jul 5 2016 11:41:13)
SparkSession available as 'spark'.
>>> from pyspark import SparkConf, SparkContext
>>> import collections
>>> conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
>>> sc = SparkContext(conf = conf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\python\pyspark\context.py", line 275, in _ensure_initialized
callsite.function, callsite.file, callsite.linenum))
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by getOrCreate at C:\spark\bin\..\python\pyspark\shell.py:43
>>>
I have Spark 2.1.1 and Python 3.5.2. I searched and found that the problem is with sc, which could not be created, but not why. Can anyone help?
You can try this:
sc = SparkContext.getOrCreate()
You can try:
sc = SparkContext.getOrCreate(conf=conf)
Your previous session is still running. You can stop it with
sc.stop()
This works in JupyterLab as well, but because the previous session is still running and local mode cannot run two contexts at a time, you have to use
sc = SparkContext.getOrCreate(conf=conf)
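Putting the answers together, a minimal sketch of the pattern for the PySpark shell:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")

# getOrCreate reuses the context the shell already started (app=PySparkShell)
# instead of raising "Cannot run multiple SparkContexts at once".
sc = SparkContext.getOrCreate(conf=conf)

# Alternatively, stop the existing context first and build a fresh one:
# sc.stop()
# sc = SparkContext(conf=conf)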

Zeppelin PySpark: 'JavaMember' object has no attribute 'parseDataType'

This simple PySpark snippet runs fine with normal spark-submit but fails with Apache Zeppelin on the cast call. Any ideas?
%pyspark
import pyspark.sql.functions as spark_functions
from pyspark.sql.types import StringType

col1 = spark_functions.lit(None)
print("type(col1)={}".format(type(col1)))
col2 = col1.cast(StringType())
error is:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6046223946582899049.py", line 252, in <module>
eval(compiledCode)
File "<string>", line 14, in <module>
File "/usr/lib/spark/python/pyspark/sql/column.py", line 334, in cast
jdt = ctx._ssql_ctx.parseDataType(dataType.json())
AttributeError: 'JavaMember' object has no attribute 'parseDataType'
This is a known bug with Spark 2.0 on Zeppelin 0.6.1 that is targeted to be fixed in Zeppelin 0.6.2: https://issues.apache.org/jira/browse/ZEPPELIN-1411

Where is startTime in pyspark context?

In PySpark 1.3, there appears to be no startTime:
>>> sc.startTime
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'startTime'
Scala has the method:
scala> sc.startTime
res1: Long = 1431974499272
Why?
As @vanza pointed out, you can access startTime through the underlying _jsc:
sc._jsc.startTime()
I put together a quick PR https://github.com/apache/spark/pull/6275 which adds the startTime property to pyspark.
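For completeness, a small usage sketch in the PySpark shell (assuming sc is the shell's SparkContext); the value returned is an epoch timestamp in milliseconds:

import datetime

start_ms = sc._jsc.startTime()                             # e.g. 1431974499272
print(datetime.datetime.fromtimestamp(start_ms / 1000.0))  # when the context was created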
