I am working with data with timestamps that contain nanoseconds and am trying to convert the string to timestamp format.
Here is what the 'Time' column looks like:
+---------------+
| Time |
+---------------+
|091940731349000|
|092955002327000|
|092955004088000|
+---------------+
I would like to cast it to:
+------------------+
| Timestamp |
+------------------+
|09:19:40.731349000|
|09:29:55.002327000|
|09:29:55.004088000|
+------------------+
From what I have found online, I don't need to use a udf to do this and there should be a native function that I can use.
I have tried cast and to_timestamp but got 'null' values:
df_new = df.withColumn('Timestamp', df.Time.cast("timestamp"))
df_new.select('Timestamp').show()
+---------+
|Timestamp|
+---------+
| null|
| null|
+---------+
There are two problems in your code:
Input is not a valid timestamp representation.
Spark doesn't provide a type that can represent a time of day without a date component.
The closest you can get to the required output is to convert the input to a JDBC-compliant java.sql.Timestamp format:
from pyspark.sql.functions import col, regexp_replace
df = spark.createDataFrame(
["091940731349000", "092955002327000", "092955004088000"],
"string"
).toDF("time")
df.select(regexp_replace(
col("time"),
"^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
"1970-01-01 $1:$2:$3.$4"
).cast("timestamp").alias("time")).show(truncate = False)
# +--------------------------+
# |time |
# +--------------------------+
# |1970-01-01 09:19:40.731349|
# |1970-01-01 09:29:55.002327|
# |1970-01-01 09:29:55.004088|
# +--------------------------+
If you want just a string, skip the cast and limit the output to:
df.select(regexp_replace(
col("time"),
"^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
"$1:$2:$3.$4"
).alias("time")).show(truncate = False)
# +------------------+
# |time |
# +------------------+
# |09:19:40.731349000|
# |09:29:55.002327000|
# |09:29:55.004088000|
# +------------------+
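If a regex feels heavy-handed, the same fixed-width slicing can be done with the DataFrame API instead. This is a minimal sketch using the df defined above, assuming the values always carry two digits each for hours, minutes and seconds followed by nine fractional digits (format_string and substring are standard pyspark.sql.functions):
from pyspark.sql.functions import format_string, substring

df.select(format_string(
    "%s:%s:%s.%s",
    substring("time", 1, 2),   # hours
    substring("time", 3, 2),   # minutes
    substring("time", 5, 2),   # seconds
    substring("time", 7, 9)    # nanosecond fraction
).alias("time")).show(truncate=False)
# produces strings like 09:19:40.731349000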
I have the following dataframe that is extracted with the following command:
extract = data.select('properties.id', 'flags')
| id | flags |
|-------| ---------------------------|
| v_001 | "{"93":true,"83":true}" |
| v_002 | "{"45":true,"76":true}" |
The desired result I want is:
| id | flags |
|-------| ------|
| v_001 | 93 |
| v_001 | 83 |
| v_002 | 45 |
| v_002 | 76 |
I tried to apply explode as the following:
extract = data.select('properties.id', explode(col('flags')))
But I encountered the following:
cannot resolve 'explode(flags)' due to data type mismatch: input to function explode should be array or map type, not struct<93:boolean,83:boolean,45:boolean,76:boolean>
This makes sense as the schema of the column is not compatible with the explode function. How can I adjust the function to get my desired result? Is there a better way to solve this problem?
P.D.: The desired table schema is not the best design but this is out of my scope since this will involve another topic discussion.
As you might have already seen, explode requires an array or map type, and it seems you only need the keys from the dict in flags.
So you can first convert flags to MapType and use map_keys to extract all the keys into a list.
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, BooleanType

df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
This results in the following:
+-----+--------+
| id| flags|
+-----+--------+
|v_001|[93, 83]|
|v_002|[45, 76]|
+-----+--------+
Then you can use explode on the flags.
.select('id', F.explode('flags'))
+-----+---+
| id|col|
+-----+---+
|v_001| 93|
|v_001| 83|
|v_002| 45|
|v_002| 76|
+-----+---+
The whole code:
df = (df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
.select('id', F.explode('flags')))
Update
It is probably better to supply the schema and read flags as MapType in the first place, but if your JSON is complex and the schema is hard to write, you can convert the struct to a JSON string once and then parse it as MapType.
# Add this line before `from_json`
df = df.select('id', F.to_json('flags').alias('flags'))
# Or you can do it in one shot.
df = (df.withColumn('flags', F.map_keys(F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType()))))
.select('id', F.explode('flags')))
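As a side note, explode on a MapType column already yields one row per entry with key and value columns, so a slightly shorter variant of the same idea is possible. A sketch under the same assumptions as above (the alias names are illustrative):
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, BooleanType

df = (df.withColumn('flags', F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType())))
        .select('id', F.explode('flags').alias('flags', 'value'))  # explode a map into key/value columns
        .drop('value'))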
I'm trying to create a Dataset from a DataFrame using a case class.
case class test (language:String, users_count: String = "100")
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
df.as[test]
How do I handle the scenario where a column is missing in the DataFrame?
The expectation is that the Dataset populates the default value provided in the case class.
If the DataFrame only has one column, it throws an exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'users_count' given input columns: [language];
Expected Result:
+--------+
|language|
+--------+
| Java|
| Python|
| Scala|
+--------+
df.as[test].collect()(0)
test("Java", "100") // where "100" is the default value
You could use the map function and explicitly call the constructor like this:
import spark.implicits._ // provides the Encoder for the case class

df
  .map(row => test(row.getAs[String]("language")))
  .show
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 100|
| Python| 100|
| Scala| 100|
+--------+-----------+
How can I convert timestamp as string to timestamp in "yyyy-mm-ddThh:mm:ss.sssZ" format using PySpark?
Input timestamp (string), df:
| col_string |
| :-------------------- |
| 5/15/2022 2:11:06 AM |
Desired output (timestamp), df:
| col_timestamp |
| :---------------------- |
| 2022-05-15T2:11:06.000Z |
to_timestamp can be used, providing the optional format parameter:
from pyspark.sql import functions as F
df = spark.createDataFrame([("5/15/2022 2:11:06 AM",)], ["col_string"])
df = df.select(F.to_timestamp("col_string", "M/dd/yyyy h:mm:ss a").alias("col_ts"))
df.show()
# +-------------------+
# | col_ts|
# +-------------------+
# |2022-05-15 02:11:06|
# +-------------------+
df.printSchema()
# root
# |-- col_ts: timestamp (nullable = true)
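If you also need the literal string shape from the question (yyyy-MM-dd'T'HH:mm:ss.SSS'Z') rather than a timestamp column, date_format can render it. A sketch that assumes the session time zone is UTC, since the trailing Z is written as a literal:
from pyspark.sql import functions as F

# the "Z" is hard-coded, so this is only correct if spark.sql.session.timeZone is UTC
df_str = df.select(F.date_format("col_ts", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").alias("col_iso"))
df_str.show(truncate=False)
# e.g. 2022-05-15T02:11:06.000Z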
I am trying to get the sum of Revenue over the last 3 Month rows (excluding the current row) for each Client. Minimal example with current attempt in Databricks:
import numpy as np
import pandas as pd

cols = ['Client','Month','Revenue']
df_pd = pd.DataFrame([['A',201701,100],
['A',201702,101],
['A',201703,102],
['A',201704,103],
['A',201705,104],
['B',201701,201],
['B',201702,np.nan],
['B',201703,203],
['B',201704,204],
['B',201705,205],
['B',201706,206],
['B',201707,207]
])
df_pd.columns = cols
spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql
""")
df_out.show()
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| NaN| 201.0|
| B|201703| 203.0| NaN|
| B|201704| 204.0| NaN|
| B|201705| 205.0| NaN|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
As you can see, if a null value exists anywhere in the 3 month window, a null value is returned. I would like to treat nulls as 0, hence the ifnull attempt, but this does not seem to work. I have also tried a case statement to change NULL to 0, with no luck.
Just coalesce outside sum:
df_out = sqlContext.sql("""
select *, coalesce(sum(Revenue) over (partition by Client
          order by Client, Month
          rows between 3 preceding and 1 preceding), 0) as Total_Sum3
from df_sql
""")
It is Apache Spark, my bad! (I am working in Databricks and I thought it was MySQL under the hood.) Is it too late to change the title?
@Barmar, you are right in that IFNULL() doesn't treat NaN as null. I managed to figure out the fix thanks to @user6910411 from here: SO link. I had to change the numpy NaNs to Spark nulls. The corrected code, starting after the sample df_pd is created:
spark_df = spark.createDataFrame(df_pd)
from pyspark.sql.functions import isnan, col, when
# this converts all NaNs in numeric columns to null:
spark_df = spark_df.select([
    when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
    for c, t in spark_df.dtypes
])
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql order by Client,Month
""")
df_out.show()
which then gives the desired:
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| null| 201.0|
| B|201703| 203.0| 201.0|
| B|201704| 204.0| 404.0|
| B|201705| 205.0| 407.0|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
Is sqlContext the best way to approach this or would it be better / more elegant to achieve the same result via pyspark.sql.window?
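To answer that last question: yes, the same rolling sum can be written with pyspark.sql.window instead of SQL. A sketch of the equivalent DataFrame-API version (applied after the NaN-to-null conversion above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# last 3 rows per Client, excluding the current row
w = Window.partitionBy("Client").orderBy("Month").rowsBetween(-3, -1)

df_out = spark_df.withColumn(
    "Total_Sum3",
    F.sum(F.coalesce(F.col("Revenue"), F.lit(0))).over(w)
).orderBy("Client", "Month")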
Using pyspark, what is the best way to reduce a map to the item with smallest value for each row?
In the below example, I would like to only take the action that is occurred first:
Example dataframe:
+------+-----------------------+
| Name | Actions |
+------+-----------------------+
|Alice |{1978:'aaa',1981:'bbb'}|
|Jack |{1999:'xxx',1988:'yyy'}|
|Bill |{1992:'zzz'} |
+------+-----------------------+
Desired DF:
+------+----------------------+
| Name | Actions |
+------+----------------------+
|Alice |{1978:'aaa'} |
|Jack |{1988:'yyy'} |
|Bill |{1992:'zzz'} |
+------+----------------------+
Convert to arrays with map_keys and map_values:
from pyspark.sql.functions import *
df = spark.createDataFrame([("Name", {1978: 'aaa', 1981: 'bbb'})], ("Name", "Actions"))
df_array = df.select(
"Name",
map_keys("Actions").alias("keys"),
map_values("Actions").alias("values")
)
Combine both with arrays_zip, sort with array_sort:
df_array_sorted = df_array.withColumn("sorted", array_sort(arrays_zip("keys", "values")))
Take the first element and convert it back to a map with map_from_entries:
df_array_sorted.select("Name", map_from_entries(array(col("sorted")[0])).alias("Actions")).show()
# +----+-------------+
# |Name| Actions|
# +----+-------------+
# |Name|[1978 -> aaa]|
# +----+-------------+
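For reference, here is a consolidated sketch of the same steps applied to the sample rows from the question (column names as in the question; array_sort compares the zipped structs by key first, so element 0 is the earliest action):
from pyspark.sql.functions import (map_keys, map_values, arrays_zip,
                                   array_sort, map_from_entries, array, col)

df = spark.createDataFrame(
    [("Alice", {1978: "aaa", 1981: "bbb"}),
     ("Jack", {1999: "xxx", 1988: "yyy"}),
     ("Bill", {1992: "zzz"})],
    ("Name", "Actions"))

result = (df
    .select("Name",
            map_keys("Actions").alias("keys"),
            map_values("Actions").alias("values"))
    .withColumn("sorted", array_sort(arrays_zip("keys", "values")))  # sort entries by key
    .select("Name", map_from_entries(array(col("sorted")[0])).alias("Actions")))

result.show(truncate=False)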