In PySpark 1.3, there appears to be no startTime:
>>> sc.startTime
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'startTime'
Scala has the method:
scala> sc.startTime
res1: Long = 1431974499272
Why?
As @vanza pointed out, you can access the startTime through the jsc:
sc._jsc.startTime()
I put together a quick PR https://github.com/apache/spark/pull/6275 which adds the startTime property to pyspark.
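For reference, a minimal sketch of that workaround, converting the epoch milliseconds reported by the underlying JavaSparkContext into a readable datetime (variable names here are illustrative):
import datetime

start_ms = sc._jsc.startTime()  # epoch milliseconds from the underlying JavaSparkContext
start = datetime.datetime.fromtimestamp(start_ms / 1000.0)
print(start)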
Related
One of the tools gives the start time in milliseconds, as below:
'StartMilliseconds': 1645250400857
How do I find the hours passed since this timestamp? I tried the following:
>>> start=datetime.datetime.fromtimestamp(1645250400857/1000.0)
>>> now=datetime.datetime.now
>>> now-start
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'builtin_function_or_method' and 'datetime.datetime'
You can get the current timestamp in seconds using datetime.now().timestamp() and then apply the conversions to compute the hours since start (the TypeError above is raised because datetime.datetime.now was assigned without being called):
from datetime import datetime
start_ms = 1645250400857
hours_since_start = (datetime.now().timestamp() - start_ms / 1000) / 3600
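Alternatively, the original approach works once datetime.now is actually called; a sketch using the timestamp from the question:
from datetime import datetime

start = datetime.fromtimestamp(1645250400857 / 1000.0)  # convert milliseconds to seconds
now = datetime.now()                                    # note the parentheses: call the function
hours_since_start = (now - start).total_seconds() / 3600
print(hours_since_start)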
I see a couple of posts, post1 and post2, which are relevant to my question. However, while following the post1 solution I am running into the error below.
joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'
Entire code snippet:
df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
joinedDF = df.join(df_agg, "company")
On the second line you have .show() at the end:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
remove it like this:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False)
and your code should work.
You called an action (.show()) on that DataFrame and assigned the result to the df_agg variable; that's why your variable is NoneType (in Python) or Unit (in Scala).
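Putting it together, a sketch of the corrected snippet (paths and column names taken from the question):
import pyspark.sql.functions as func

df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False)
df_agg.show()  # display the result if you like, but keep the DataFrame itself in df_agg
joinedDF = df.join(df_agg, "company")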
I met this problem on CentOS 7.0 with Spark 2.1.0 when I perform the following actions. I am new to Spark. How do I fix it?
>>> from pyspark.sql import SQLContext
>>> ssc = SQLContext(sc)
>>> df = ssc.jsonFile('file:///root/work/person.json')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SQLContext' object has no attribute 'jsonFile'
Use SparkSession with newer versions of Spark and read the file using
df = spark.read.json('path to json')
jsonFile has been deprecated; use sqlContext.read.json instead.
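A minimal sketch with Spark 2.x, using the file path from the question (the app name is arbitrary):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json").getOrCreate()
df = spark.read.json('file:///root/work/person.json')
df.show()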
This simple PySpark snippet runs fine with normal spark-submit but fails with Apache Zeppelin on the cast call. Any ideas?
%pyspark
import pyspark.sql.functions as spark_functions
from pyspark.sql.types import StringType
col1 = spark_functions.lit(None)
print("type(col1)={}".format(type(col1)))
col2 = col1.cast(StringType())
The error is:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6046223946582899049.py", line 252, in <module>
eval(compiledCode)
File "<string>", line 14, in <module>
File "/usr/lib/spark/python/pyspark/sql/column.py", line 334, in cast
jdt = ctx._ssql_ctx.parseDataType(dataType.json())
AttributeError: 'JavaMember' object has no attribute 'parseDataType'
This is a known bug with Spark 2.0 on Zeppelin 0.6.1 that is targeted to be fixed in Zeppelin 0.6.2: https://issues.apache.org/jira/browse/ZEPPELIN-1411
I am getting the following error while executing the command:
user = sc.cassandraTable("DB NAME", "TABLE NAME").toDF()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 60, in toDF
return sqlContext.createDataFrame(self, schema, sampleRatio)
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 333, in createDataFrame
schema = self._inferSchema(rdd, samplingRatio)
File "/usr/local/src/spark/spark-1.4.1/python/pyspark/sql/context.py", line 220, in _inferSchema
raise ValueError("Some of types cannot be determined by the "
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
Load into a DataFrame directly; this will also avoid any Python-level code for interpreting types:
sqlContext.read.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="tb").load()
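For completeness, the same load assigned to a DataFrame so it can be queried, assuming the spark-cassandra-connector is on the classpath; "ks" and "tb" are the placeholder keyspace and table names from the line above:
df = (sqlContext.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="ks", table="tb")
        .load())
df.printSchema()  # the schema comes from Cassandra, so no Python-side type inference is needed
df.show()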