Where is startTime in pyspark context?

In PySpark 1.3, there appears to be no startTime:
>>> sc.startTime
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'startTime'
Scala has the method:
scala> sc.startTime
res1: Long = 1431974499272
Why?

As @vanza pointed out, you can access the startTime through the underlying JavaSparkContext:
sc._jsc.startTime()
I put together a quick PR https://github.com/apache/spark/pull/6275 which adds the startTime property to pyspark.
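For illustration, here is a minimal sketch of that workaround, assuming a PySpark 1.x shell where sc already exists; it converts the returned epoch milliseconds to a datetime:
from datetime import datetime

# sc._jsc is the underlying JavaSparkContext; its startTime() returns epoch milliseconds
start_ms = sc._jsc.startTime()                     # e.g. 1431974499272
start_dt = datetime.fromtimestamp(start_ms / 1000.0)
print("SparkContext started at", start_dt)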

Related

How to find hours passed since given time in milliseconds in Python?

One of the tools gives the start time in milliseconds, as below:
'StartMilliseconds': 1645250400857
How do I find the hours passed since this timestamp? I tried the following:
>>> start=datetime.datetime.fromtimestamp(1645250400857/1000.0)
>>> now=datetime.datetime.now
>>> now-start
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'builtin_function_or_method' and
'datetime.datetime'
The TypeError happens because datetime.datetime.now was assigned without being called (note the missing parentheses), so now is the function itself rather than a datetime object. You can get the current timestamp in seconds with datetime.now().timestamp() and then apply the conversions to compute the hours since start:
from datetime import datetime

start_ms = 1645250400857                           # start time in epoch milliseconds
# ms -> s, subtract from the current epoch seconds, then s -> hours
hours_since_start = (datetime.now().timestamp() - start_ms / 1000) / 3600
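For comparison, the original attempt also works once datetime.now is actually called; this sketch recomputes the same value with datetime objects, reusing the import above:
start = datetime.fromtimestamp(1645250400857 / 1000.0)
now = datetime.now()                               # note the parentheses: call the function
hours_since_start = (now - start).total_seconds() / 3600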

get columns post group by in pyspark with dataframes

I see a couple of posts, post1 and post2, which are relevant to my question. However, while following the post1 solution I run into the error below.
joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'
Entire code snippet:
df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
joinedDF = df.join(df_agg, "company")
On the second line you have .show() at the end:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
Remove it, like this:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False)
and your code should work.
show() is an action: it prints the DataFrame and returns None, so when you assigned its result, df_agg ended up as NoneType (in Python) or Unit (in Scala).
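Putting it together, a minimal sketch of the corrected flow, assuming the same CSV path and that func is pyspark.sql.functions as in the question:
import pyspark.sql.functions as func

df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
# Keep the aggregated DataFrame; call show() separately if you want to inspect it
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False)
joinedDF = df.join(df_agg, "company")
joinedDF.show()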

AttributeError: 'SQLContext' object has no attribute 'jsonFile'

I hit this problem on CentOS 7.0 with Spark 2.1.0 when performing the following actions. I am new to Spark. How do I fix it?
>>> from pyspark.sql import SQLContext
>>> ssc = SQLContext(sc)
>>> df = ssc.jsonFile('file:///root/work/person.json')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SQLContext' object has no attribute 'jsonFile'
With newer versions of Spark, use SparkSession and read the file with
df = spark.read.json('path/to/json')
jsonFile has been deprecated; use sqlContext.read.json instead.
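A minimal sketch of the SparkSession route, using the JSON path from the question:
from pyspark.sql import SparkSession

# In Spark 2.x the SparkSession is the entry point (the shell already provides `spark`)
spark = SparkSession.builder.appName("read-json").getOrCreate()
df = spark.read.json("file:///root/work/person.json")
df.show()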

Zeppelin PySpark: 'JavaMember' object has no attribute 'parseDataType'

This simple PySpark snippet runs fine with normal spark-submit but fails with Apache Zeppelin on the cast call. Any ideas?
%pyspark
import pyspark.sql.functions as spark_functions
from pyspark.sql.types import StringType
col1 = spark_functions.lit(None)
print("type(col1)={}".format(type(col1)))
col2 = col1.cast(StringType())
The error is:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6046223946582899049.py", line 252, in <module>
eval(compiledCode)
File "<string>", line 14, in <module>
File "/usr/lib/spark/python/pyspark/sql/column.py", line 334, in cast
jdt = ctx._ssql_ctx.parseDataType(dataType.json())
AttributeError: 'JavaMember' object has no attribute 'parseDataType'
This is a known bug with Spark 2.0 on Zeppelin 0.6.1 that is targeted to be fixed in Zeppelin 0.6.2: https://issues.apache.org/jira/browse/ZEPPELIN-1411

Pyspark error while querying cassandra to convert into dataframes

I am getting the following error while executing the command:
user = sc.cassandraTable("DB NAME", "TABLE NAME").toDF()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 60, in toDF
return sqlContext.createDataFrame(self, schema, sampleRatio)
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 333, in createDataFrame
schema = self._inferSchema(rdd, samplingRatio)
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 220, in _inferSchema
raise ValueError("Some of types cannot be determined by the "
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
Load into a DataFrame directly; this also avoids any Python-level code for inferring the types:
sqlContext.read.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="tb").load()
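A minimal usage sketch, assuming the spark-cassandra-connector is on the classpath and keeping the placeholder keyspace/table names from the answer:
# The connector supplies the schema, so no Python-side type inference is needed
user_df = (sqlContext.read
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="ks", table="tb")
           .load())
user_df.show()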
