I have a Cassandra table with a Unix epoch timestamp column (e.g. 1599613045). I would like to use the Spark SQLContext to select from this table between a from date and a to date based on this epoch timestamp column. My plan is to convert the from/to date inputs into epoch timestamps and compare them (>= and <=) against the table's epoch column. Is this possible? Any suggestions? Many thanks!
Follow the approach below.
Let's assume Cassandra is running on localhost:9042 with:
keyspace --> mykeyspace
table --> mytable
column name --> timestamp
Spark Scala code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// create SparkSession
val spark=SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
// Read the table from Cassandra; the spark-cassandra-connector must be on the classpath
spark.conf.set("spark.cassandra.connection.host", "localhost")
spark.conf.set("spark.cassandra.connection.port", "9042")
var cassandraDF = spark.read.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "mykeyspace", "table" -> "mytable")).load()
// select the timestamp column
cassandraDF=cassandraDF.select('timestamp)
cassandraDF.show(false)
// let's consider the following as the output
+----------+
| timestamp|
+----------+
|1576089000|
|1575916200|
|1590258600|
|1591900200|
+----------+
// Convert the epoch values to Spark's default date format, yyyy-MM-dd
val outDF=cassandraDF.withColumn("date",to_date(from_unixtime('timestamp)))
outDF.show(false)
+----------+----------+
| timestamp| date|
+----------+----------+
|1576089000|2019-12-12|
|1575916200|2019-12-10|
|1590258600|2020-05-24|
|1591900200|2020-06-12|
+----------+----------+
// You can proceed with the next steps from here
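For the from/to filter the question actually asks about, here is a minimal sketch in PySpark (the equivalent functions exist in the Scala API as well); the yyyy-MM-dd input format and the bound values are assumptions:
from pyspark.sql.functions import col, lit, unix_timestamp
# Assumed inclusive bounds in yyyy-MM-dd format; adjust to your inputs.
from_date = "2019-12-01"
to_date = "2020-06-30"
cassandra_df = (spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="mykeyspace", table="mytable")
    .load())
# Convert both bounds to epoch seconds and compare against the epoch column.
filtered = cassandra_df.where(
    (col("timestamp") >= unix_timestamp(lit(from_date), "yyyy-MM-dd")) &
    (col("timestamp") <= unix_timestamp(lit(to_date), "yyyy-MM-dd")))
filtered.show(truncate=False)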
I am using PySpark version 3.0.1. I am reading a CSV file as a PySpark dataframe that has 2 date columns, but when I print the schema both columns show up as string type.
How can I convert the values in both date columns to timestamp format using PySpark?
I have tried many things, but every approach requires knowing the current format. How can I convert to a proper timestamp when I don't know in advance which format is coming in the CSV file?
I have tried the code below as well, but it creates a new column with null values:
from pyspark.sql.functions import col
df1 = df.withColumn('datetime', col('joining_date').cast('timestamp'))
df1.show()
df1.printSchema()
Since the column contains two different date formats, you need to parse with both formats and coalesce the results.
import pyspark.sql.functions as F
result = df.withColumn(
'datetime',
F.coalesce(
F.to_timestamp('joining_date', 'MM-dd-yy'),
F.to_timestamp('joining_date', 'MM/dd/yy')
)
)
result.show()
+------------+-------------------+
|joining_date| datetime|
+------------+-------------------+
| 01-20-20|2020-01-20 00:00:00|
| 01/19/20|2020-01-19 00:00:00|
+------------+-------------------+
If you want to convert all to a single format:
import pyspark.sql.functions as F
result = df.withColumn(
'datetime',
F.date_format(
F.coalesce(
F.to_timestamp('joining_date', 'MM-dd-yy'),
F.to_timestamp('joining_date', 'MM/dd/yy')
),
'MM-dd-yy'
)
)
result.show()
+------------+--------+
|joining_date|datetime|
+------------+--------+
| 01-20-20|01-20-20|
| 01/19/20|01-19-20|
+------------+--------+
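If more candidate formats may show up and you don't know in advance which one a row uses, one hedged extension of the same idea is to coalesce over a list of candidate patterns (the pattern list below is an assumption):
import pyspark.sql.functions as F
# Hypothetical list of patterns; extend it with whatever formats your files may contain.
candidate_formats = ['MM-dd-yy', 'MM/dd/yy', 'yyyy-MM-dd']
result = df.withColumn(
    'datetime',
    F.coalesce(*[F.to_timestamp('joining_date', fmt) for fmt in candidate_formats])
)
# Rows that match none of the patterns end up as null, which makes them easy to audit:
result.filter(F.col('datetime').isNull()).show()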
I'm trying to read a CSV file into Spark with Databricks, but my time column is in string format. A time column entry looks like 2019-08-01 23:59:05-07:00, and I want to convert it to timestamp type. Here's what I tried:
from pyspark.sql.functions import unix_timestamp
df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path_to_file)
    .withColumn("observed", unix_timestamp("dt", "yyyy-MM-dd hh:mm:ss.SSSZ")
    .cast("double")
    .cast("timestamp"))
)
But I got the error message cannot resolve '`dt`' given input columns. I'm guessing I didn't get the "yyyy-MM-dd hh:mm:ss.SSSZ" format right?
Assuming your csv looks like this:
df = spark.createDataFrame([('2019-08-01 23:59:05-07:00',)], ['dt'])
df.show()
+--------------------+
| dt|
+--------------------+
|2019-08-01 23:59:...|
+--------------------+
You can simply parse the timestamp with the to_timestamp function:
from pyspark.sql.functions import to_timestamp
df.withColumn('observed', to_timestamp('dt', "yyyy-MM-dd HH:mm:ssXXX")).show()
+--------------------+-------------------+
| dt| observed|
+--------------------+-------------------+
|2019-08-01 23:59:...|2019-08-02 08:59:05|
+--------------------+-------------------+
So, as @HristoIliev mentioned, the reason behind cannot resolve '`dt`' is that 'dt' is supposed to be the name of a column that already exists in your dataframe, while 'observed' is the name of the new column. Even if you adjust the names, though, it still won't work because of a format mismatch: yyyy-MM-dd hh:mm:ss.SSSZ won't parse 2019-08-01 23:59:05-07:00, but "yyyy-MM-dd HH:mm:ssXXX" will.
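Putting both fixes into the original read, a minimal sketch (it assumes the CSV really has a header column named dt; adjust the name to your actual column):
from pyspark.sql.functions import to_timestamp
# path_to_file is the same variable as in the question; "dt" is the assumed column name.
df = (spark.read
    .option("header", "true")
    .csv(path_to_file)
    .withColumn("observed", to_timestamp("dt", "yyyy-MM-dd HH:mm:ssXXX")))
df.printSchema()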
I have a column with data like 20180501 in string format and I want to convert it to date format. I tried using
to_date(cast(unix_timestamp('20180501', 'YYYYMMDD') as timestamp))
but it still didn't work. I'm using Spark SQL with dataframes.
The format should be yyyyMMdd:
spark.sql("SELECT to_date(cast(unix_timestamp('20180501', 'yyyyMMdd') as timestamp))").show()
# +------------------------------------------------------------------+
# |to_date(CAST(unix_timestamp('20180501', 'yyyyMMdd') AS TIMESTAMP))|
# +------------------------------------------------------------------+
# | 2018-05-01|
# +------------------------------------------------------------------+
As pointed out in the other answer, the format you used is incorrect. But you can also use to_date directly:
spark.sql("SELECT to_date('20180501', 'yyyyMMdd')").show()
+-------------------------------+
|to_date('20180501', 'yyyyMMdd')|
+-------------------------------+
| 2018-05-01|
+-------------------------------+
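The same call works through the DataFrame API; a minimal sketch (the column name date_str is a hypothetical name for illustration):
from pyspark.sql.functions import to_date
# 'date_str' is a hypothetical column holding strings like '20180501'.
df = spark.createDataFrame([('20180501',)], ['date_str'])
df.withColumn('date', to_date('date_str', 'yyyyMMdd')).show()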
I am using Spark Dataset and having trouble subtracting days from a timestamp column.
I would like to subtract days from a timestamp column and get a new column with the full datetime. Example:
2017-09-22 13:17:39.900 - 10 ----> 2017-09-12 13:17:39.900
With the date_sub function I get 2017-09-12 without 13:17:39.900.
You can cast the data to timestamp and use expr to subtract an INTERVAL:
import org.apache.spark.sql.functions.expr
import spark.implicits._

val df = Seq("2017-09-22 13:17:39.900").toDF("timestamp")
df.withColumn(
"10_days_before",
$"timestamp".cast("timestamp") - expr("INTERVAL 10 DAYS")).show(false)
+-----------------------+---------------------+
|timestamp |10_days_before |
+-----------------------+---------------------+
|2017-09-22 13:17:39.900|2017-09-12 13:17:39.9|
+-----------------------+---------------------+
If the data is already of TimestampType you can skip the cast.
Or you can use the date_sub function available since PySpark 1.5 (note that date_sub returns a date, so the time component is reset to midnight after the cast back to timestamp):
from pyspark.sql.functions import *
df.withColumn("10_days_before", date_sub(col('timestamp'),10).cast('timestamp'))
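If you need to keep the time component in PySpark as well, a hedged sketch of the same INTERVAL approach used in the Scala answer:
from pyspark.sql.functions import col, expr
df = spark.createDataFrame([('2017-09-22 13:17:39.900',)], ['timestamp'])
df.withColumn(
    '10_days_before',
    col('timestamp').cast('timestamp') - expr('INTERVAL 10 DAYS')
).show(truncate=False)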
I want to find out the datatype of each column of a table.
For example, let's say my table was created using this:
create table X
(
col1 string,
col2 int,
col3 int
)
I want to run a command that will output something like this:
column datatype
col1 string
col2 int
Is there a command for this? Preferably in Spark SQL, but if not, how can I get this information another way? I'm using Spark SQL to query Hive tables. Perhaps through the metadata in Hive? Thank you.
You can read the Hive table as a DataFrame and use the printSchema() function.
In the PySpark REPL:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
table = hive_context.table("database_name.table_name")
table.printSchema()
And similarly in the spark-shell REPL (Scala):
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
val table=hiveContext.table("database_name.table_name")
table.printSchema
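With Spark 2.x and later, SparkSession replaces HiveContext; a minimal sketch of the same idea (shown in PySpark, assuming Hive support is enabled on the session):
table = spark.table("database_name.table_name")
table.printSchema()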
You can use desc <db_name>.<tab_name>
(or)
spark.catalog.listColumns("<db>.<tab_name>")
Example:
spark.sql("create table X(col1 string,col2 int,col3 int)")
Using desc to get column_name and datatype:
spark.sql("desc default.x").select("col_name","data_type").show()
//+--------+---------+
//|col_name|data_type|
//+--------+---------+
//| col1| string|
//| col2| int|
//| col3| int|
//+--------+---------+
Using spark.catalog to get column_name and data_type:
spark.catalog.listColumns("default.x").select("name","dataType").show()
//+----+--------+
//|name|dataType|
//+----+--------+
//|col1| string|
//|col2| int|
//|col3| int|
//+----+--------+
In Scala: create a dataframe for your table and try the following:
df.dtypes
Your result:
Array((PS_PROD_DESC,StringType), (PS_OPRTNG_UNIT_ID,StringType),...)
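The Scala df.dtypes call has a direct PySpark counterpart; a minimal sketch reusing the example table from above (assuming it lives in the default database):
# PySpark counterpart of df.dtypes.
df = spark.table("default.x")
print(df.dtypes)   # [('col1', 'string'), ('col2', 'int'), ('col3', 'int')]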