Using the unix_timestamp method to create a timestamp in Spark - apache-spark

I have a CSV file with many columns, two of which are Month and Year. Month is represented as 1...12 and Year as, for example, 2013. I need to create a new column, say 'timestamp', holding the timestamp in mm/yyyy format. I tried the snippet below but it failed.
scala> val df = spark.read.format("csv").option("header",
"true").load("/user/bala/*.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, Month: string ... 28
more fields]
scala> val df = spark.read.format("csv").option("header",
"true").load("/user/bala/AWI/*.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, Month: string ... 28
more fields]
scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf
scala> def makeDT(Month: String, Year: String) = s"$Month $Year"
makeDT: (Month: String, Year: String)String
scala> val makeDt = udf(makeDT(_:String,_:String))
makeDt: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function2>,StringType,Some(List(StringType,
StringType)))
scala> df.select($"Month", $"Year", unix_timestamp(makeDt($"Month",
$"Year"), "mm/yyyy")).show(2)
+-----+----+-----------------------------------------+
|Month|Year|unix_timestamp(UDF(Month, Year), mm/yyyy)|
+-----+----+-----------------------------------------+
|    1|2013|                                     null|
|    1|2013|                                     null|
+-----+----+-----------------------------------------+
only showing top 2 rows
scala>
Can someone point out where I am going wrong?

You need a day as well as the month and year to build a timestamp.
You can redefine your makeDT (and re-create the makeDt UDF from it):
scala> def makeDT(Month: String, Year: String) = s"01/$Month/$Year 00:00:00"
Then you can use it similar to below (I didn't test it):
unix_timestamp(makeDt($"Month", $"Year"), "dd/M/yyyy HH:mm:ss").cast("timestamp")
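A UDF-free variant is also possible by concatenating the columns directly. A minimal sketch, assuming Month and Year are string columns and a fixed day of 01 is acceptable (untested):
import org.apache.spark.sql.functions.{concat_ws, lit, unix_timestamp}
// build "01/<Month>/<Year>", parse it, and cast the epoch seconds to a proper timestamp
val withTs = df.withColumn("timestamp",
  unix_timestamp(concat_ws("/", lit("01"), $"Month", $"Year"), "dd/M/yyyy").cast("timestamp"))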

Related

How can I cast a spanish date (14-ENE-2021) in a single spark sql query?

I have a file with dates that are imported as strings with the following format:
14-ENE-2021
as a Spanish date (ene = January). I need to cast this to a date in a single Spark SQL query. So far I have tried:
spark.sql("select TO_DATE('14-ENE-21', 'dd-MMM-yy')").show()
Which returns null.
It's a tricky one, but you can use from_csv to set a locale:
spark.sql("""
select from_csv(
'14-ENE-21',
'date date',
map('dateFormat', 'dd-MMM-yy', 'locale', 'ES')
).date as date
""").show()
+----------+
|      date|
+----------+
|2021-01-14|
+----------+
(inspired by the docs)
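The same trick can be expressed through the DataFrame API (Spark 3.0+, where from_csv is available). A sketch, assuming the dates sit in a string column named raw (the column name is made up):
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types.{DateType, StructField, StructType}
val schema = StructType(Seq(StructField("date", DateType)))
// pass a locale to the CSV parser so the Spanish month abbreviation is understood
val parsed = df.select(
  from_csv($"raw", schema, Map("dateFormat" -> "dd-MMM-yy", "locale" -> "ES"))
    .getField("date").as("date"))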
You can register a custom toDate UDF like this:
import java.text.SimpleDateFormat
import java.util.Locale

spark.udf.register("toDate", (value: String, pattern: String, locale: String) => {
  try {
    // parse with the requested pattern and locale; return None (null) on failure
    val parser = new SimpleDateFormat(pattern, new Locale(locale))
    Some(new java.sql.Date(parser.parse(value.trim).getTime))
  } catch {
    case _: Exception => None
  }
})
spark.sql("select toDate('14-ENE-21', 'dd-MMM-yy', 'ES') as date").show()
//+----------+
//|      date|
//+----------+
//|2021-01-14|
//+----------+
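Once registered, the UDF can also be applied to a whole column rather than a literal. A small sketch, assuming a hypothetical string column named raw:
import org.apache.spark.sql.functions.expr
// call the registered SQL UDF from the DataFrame API
val withDate = df.withColumn("date", expr("toDate(raw, 'dd-MMM-yy', 'ES')"))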

Converting a column from string to to_date populating a different month in pyspark

I am using spark 1.6.3. When converting a column val1 (of datatype string) to date, the code is populating a different month in the result than what's in the source.
For example, suppose my source is 6/15/2017 18:32. The code below is producing 15-1-2017 as the result (Note that the month is incorrect).
My code snippet is as below
from pyspark.sql.functions import from_unixtime,unix_timestamp ,to_date
df5 = df.withColumn("val1", to_date(from_unixtime(unix_timestamp(("val1"), "mm/dd/yyyy"))))
Expected output is 6/15/2017 of date type. Please suggest.
You're using the incorrect date format. You need to use MM for the month (not mm).
For example:
df = sqlCtx.createDataFrame([('6/15/2017 18:32',)], ["val1"])
df.printSchema()
#root
# |-- val1: string (nullable = true)
As we can see val1 is a string. We can convert to date using your code with the capital M:
from pyspark.sql.functions import from_unixtime, unix_timestamp, to_date
df5 = df.withColumn("val1", to_date(from_unixtime(unix_timestamp(("val1"), "MM/dd/yyyy"))))
df5.show()
#+----------+
#|      val1|
#+----------+
#|2017-06-15|
#+----------+
The new column is a date type, which will display as yyyy-MM-dd:
df5.printSchema()
#root
# |-- val1: date (nullable = true)

Adding month to DateType based on column value

Assuming a dataframe with a date column and an Int column representing a number of months:
val df = Seq(("2011-11-11",1),("2010-11-11",3),("2012-11-11",5))
.toDF("startDate","monthsToAdd")
.withColumn("startDate",'startDate.cast(DateType))
+----------+-----------+
| startDate|monthsToAdd|
+----------+-----------+
|2011-11-11|          1|
|2010-11-11|          3|
|2012-11-11|          5|
+----------+-----------+
Is there a way of creating an endDate column by adding the months to startDate without converting the date column back to a string?
So basically the same as the add_months function
def add_months(startDate: Column, numMonths: Int)
but passing a column instead of a literal.
You can use a UDF (user-defined function) to achieve this. Below I create a myUDF function which adds the months to the date and returns the resulting date as a string, then use it to create a new column with withColumn on the DataFrame.
import java.text.SimpleDateFormat
import java.util.Calendar
import javax.xml.bind.DatatypeConverter
import org.apache.spark.sql.functions._
import sparkSession.sqlContext.implicits._
val df = Seq(("2011-11-11",1),("2010-11-11",3),("2012-11-11",5)).toDF("startDate","monthsToAdd")
val myUDF = udf {
  val simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
  (startDate: String, monthValue: Int) => {
    val calendar = DatatypeConverter.parseDateTime(startDate)
    calendar.add(Calendar.MONTH, monthValue)
    simpleDateFormat.format(calendar.getTime)
  }
}
val newDf = df.withColumn("endDate", myUDF(df("startDate"), df("monthsToAdd")))
newDf.show()
Output:
+----------+-----------+----------+
| startDate|monthsToAdd|   endDate|
+----------+-----------+----------+
|2011-11-11|          1|2011-12-11|
|2010-11-11|          3|2011-02-11|
|2012-11-11|          5|2013-04-11|
+----------+-----------+----------+
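A UDF-free alternative (my addition, not part of the original answer): the SQL add_months function accepts a column expression for the number of months, so you can reach it through expr. A sketch:
import org.apache.spark.sql.functions.expr
// let Spark SQL's add_months take the month count from the monthsToAdd column
val newDf = df.withColumn("endDate", expr("add_months(startDate, monthsToAdd)"))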

Convert date from String to Date format in Dataframes

I am trying to convert a column which is in String format to Date format using the to_date function, but it's returning null values.
df.createOrReplaceTempView("incidents")
spark.sql("select Date from incidents").show()
+----------+
|      Date|
+----------+
|08/26/2016|
|08/26/2016|
|08/26/2016|
|06/14/2016|
spark.sql("select to_date(Date) from incidents").show()
+---------------------------+
|to_date(CAST(Date AS DATE))|
+---------------------------+
|                       null|
|                       null|
|                       null|
|                       null|
The Date column is in String format:
|-- Date: string (nullable = true)
Use to_date with Java SimpleDateFormat.
TO_DATE(CAST(UNIX_TIMESTAMP(date, 'MM/dd/yyyy') AS TIMESTAMP))
Example:
spark.sql("""
SELECT TO_DATE(CAST(UNIX_TIMESTAMP('08/26/2016', 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate"""
).show()
+----------+
|   newdate|
+----------+
|2016-08-26|
+----------+
I solved the same problem without the temp table/view and with dataframe functions.
Of course, I found that only one format works with this solution, and that's yyyy-MM-dd.
For example:
val df = sc.parallelize(Seq("2016-08-26")).toDF("Id")
val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
val df3 = df2.withColumn("Date", (col("Id").cast("date")))
df3.printSchema
root
|-- Id: string (nullable = true)
|-- Timestamp: timestamp (nullable = true)
|-- Date: date (nullable = true)
df3.show
+----------+--------------------+----------+
|        Id|           Timestamp|      Date|
+----------+--------------------+----------+
|2016-08-26|2016-08-26 00:00:...|2016-08-26|
+----------+--------------------+----------+
The timestamp of course has 00:00:00.0 as a time value.
Since your main aim was to convert the type of a column in a DataFrame from String to Timestamp, I think this approach would be better.
import org.apache.spark.sql.functions.{to_date, to_timestamp}
val modifiedDF = DF.withColumn("Date", to_date($"Date", "MM/dd/yyyy"))
You could also use to_timestamp (available since Spark 2.2) if you require a fine-grained timestamp.
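For example, a sketch assuming the Date column also carries a time portion such as 08/26/2016 13:45:00 (made-up values):
import org.apache.spark.sql.functions.to_timestamp
val withTs = DF.withColumn("DateTs", to_timestamp($"Date", "MM/dd/yyyy HH:mm:ss"))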
You can also do it with this query:
sqlContext.sql("""
select from_unixtime(unix_timestamp('08/26/2016', 'MM/dd/yyyy'), 'yyyy:MM:dd') as new_format
""").show()
You can also pass a date format:
df.withColumn("Date",to_date(unix_timestamp(df.col("your_date_column"), "your_date_format").cast("timestamp")))
For example:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq("06 Jul 2018")).toDF("dateCol")
df.withColumn("Date",to_date(unix_timestamp(df.col("dateCol"), "dd MMM yyyy").cast("timestamp")))
I have personally found errors when using unix_timestamp-based date conversions from dd-MMM-yyyy format to yyyy-MM-dd, using Spark 1.6, but this may extend into recent versions. Below I explain a way to solve the problem using java.time that should work in all versions of Spark:
I've seen errors when doing:
from_unixtime(unix_timestamp(StockMarketClosingDate, 'dd-MMM-yyyy'), 'yyyy-MM-dd') as FormattedDate
Below is code to illustrate the error, and my solution to fix it.
First I read in stock market data, in a common standard file format:
import sys.process._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
import sqlContext.implicits._
val EODSchema = StructType(Array(
StructField("Symbol" , StringType, true), //$1
StructField("Date" , StringType, true), //$2
StructField("Open" , StringType, true), //$3
StructField("High" , StringType, true), //$4
StructField("Low" , StringType, true), //$5
StructField("Close" , StringType, true), //$6
StructField("Volume" , StringType, true) //$7
))
val textFileName = "/user/feeds/eoddata/INDEX/INDEX_19*.csv"
// below is code to read using later versions of spark
//val eoddata = spark.read.format("csv").option("sep", ",").schema(EODSchema).option("header", "true").load(textFileName)
// here is code to read using 1.6, via, "com.databricks:spark-csv_2.10:1.2.0"
val eoddata = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // Use first line of all files as header
  .option("delimiter", ",")   // .option("dateFormat", "dd-MMM-yyyy") failed to work
  .schema(EODSchema)
  .load(textFileName)
eoddata.registerTempTable("eoddata")
And here is the date conversions having issues:
%sql
-- notice there are errors around the turn of the year
Select
e.Date as StringDate
, cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date) as ProperDate
, e.Close
from eoddata e
where e.Symbol = 'SPX.IDX'
order by cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)
limit 1000
A chart made in Zeppelin shows spikes, which are errors.
and here is the check that shows the date conversion errors:
// shows the unix_timestamp conversion approach can create errors
val result = sqlContext.sql("""
Select errors.* from
(
Select
t.*
, substring(t.OriginalStringDate, 8, 11) as String_Year_yyyy
, substring(t.ConvertedCloseDate, 0, 4) as Converted_Date_Year_yyyy
from
( Select
Symbol
, cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date) as ConvertedCloseDate
, e.Date as OriginalStringDate
, Close
from eoddata e
where e.Symbol = 'SPX.IDX'
) t
) errors
where String_Year_yyyy <> Converted_Date_Year_yyyy
""")
//df.withColumn("tx_date", to_date(unix_timestamp($"date", "M/dd/yyyy").cast("timestamp")))
result.registerTempTable("SPX")
result.cache()
result.show(100)
result: org.apache.spark.sql.DataFrame = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
res53: result.type = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
+-------+------------------+------------------+-------+----------------+------------------------+
| Symbol|ConvertedCloseDate|OriginalStringDate|  Close|String_Year_yyyy|Converted_Date_Year_yyyy|
+-------+------------------+------------------+-------+----------------+------------------------+
|SPX.IDX|        1997-12-30|       30-Dec-1996| 753.85|            1996|                    1997|
|SPX.IDX|        1997-12-31|       31-Dec-1996| 740.74|            1996|                    1997|
|SPX.IDX|        1998-12-29|       29-Dec-1997| 953.36|            1997|                    1998|
|SPX.IDX|        1998-12-30|       30-Dec-1997| 970.84|            1997|                    1998|
|SPX.IDX|        1998-12-31|       31-Dec-1997| 970.43|            1997|                    1998|
|SPX.IDX|        1998-01-01|       01-Jan-1999|1229.23|            1999|                    1998|
+-------+------------------+------------------+-------+----------------+------------------------+
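As an aside (my note, not the original author's): these year-boundary errors are consistent with the output pattern using 'YYYY', which SimpleDateFormat treats as the week-based year, so dates in the last days of December can be formatted with the following year. A minimal sketch of the same conversion using the plain calendar year 'yyyy' instead:
// same unix_timestamp round trip, but formatting with 'yyyy' (calendar year) rather than 'YYYY' (week year)
val checked = sqlContext.sql("""
  Select
    e.Date as StringDate
    , cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'yyyy-MM-dd') as Date) as ProperDate
  from eoddata e
""")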
After this result, I switched to java.time conversions with a UDF like this, which worked for me:
// now we will create a UDF that uses the very nice java.time library to properly convert the silly stockmarket dates
// start by importing the specific java.time libraries that superceded the joda.time ones
import java.time.LocalDate
import java.time.format.DateTimeFormatter
// now define a specific date conversion function we want
def fromEODDate(YourStringDate: String): String = {
  val formatter = DateTimeFormatter.ofPattern("dd-MMM-yyyy")
  val retDate = LocalDate.parse(YourStringDate, formatter)
  // this parses a proper local date from the silly dd-MMM-yyyy format,
  // then formats it back out in the desired yyyy-MM-dd form
  retDate.format(DateTimeFormatter.ISO_LOCAL_DATE)
}
Now I register it as a function for use in sql:
sqlContext.udf.register("fromEODDate", fromEODDate(_:String))
Then I check the results and rerun the test:
val results = sqlContext.sql("""
Select
e.Symbol as Symbol
, e.Date as OrigStringDate
, Cast(fromEODDate(e.Date) as Date) as ConvertedDate
, e.Open
, e.High
, e.Low
, e.Close
from eoddata e
order by Cast(fromEODDate(e.Date) as Date)
""")
results.printSchema()
results.cache()
results.registerTempTable("results")
results.show(10)
results: org.apache.spark.sql.DataFrame = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
root
|-- Symbol: string (nullable = true)
|-- OrigStringDate: string (nullable = true)
|-- ConvertedDate: date (nullable = true)
|-- Open: string (nullable = true)
|-- High: string (nullable = true)
|-- Low: string (nullable = true)
|-- Close: string (nullable = true)
res79: results.type = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
+--------+--------------+-------------+-------+-------+-------+-------+
|  Symbol|OrigStringDate|ConvertedDate|   Open|   High|    Low|  Close|
+--------+--------------+-------------+-------+-------+-------+-------+
|ADVA.IDX|   01-Jan-1996|   1996-01-01|    364|    364|    364|    364|
|ADVN.IDX|   01-Jan-1996|   1996-01-01|   1527|   1527|   1527|   1527|
|ADVQ.IDX|   01-Jan-1996|   1996-01-01|   1283|   1283|   1283|   1283|
|BANK.IDX|   01-Jan-1996|   1996-01-01|1009.41|1009.41|1009.41|1009.41|
| BKX.IDX|   01-Jan-1996|   1996-01-01|  39.39|  39.39|  39.39|  39.39|
|COMP.IDX|   01-Jan-1996|   1996-01-01|1052.13|1052.13|1052.13|1052.13|
| CPR.IDX|   01-Jan-1996|   1996-01-01|  1.261|  1.261|  1.261|  1.261|
|DECA.IDX|   01-Jan-1996|   1996-01-01|    205|    205|    205|    205|
|DECN.IDX|   01-Jan-1996|   1996-01-01|    825|    825|    825|    825|
|DECQ.IDX|   01-Jan-1996|   1996-01-01|    754|    754|    754|    754|
+--------+--------------+-------------+-------+-------+-------+-------+
only showing top 10 rows
which looks OK, so I rerun my chart to see if there are errors/spikes:
As you can see, no more spikes or errors. I now use a UDF as I've shown to apply my date format transformations to a standard yyyy-MM-dd format, and have not had spurious errors since. :-)
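For what it's worth, on Spark 2.2 and later the same conversion can be done without a UDF by passing the pattern directly to to_date (a sketch; the answer above targeted 1.6):
import org.apache.spark.sql.functions.to_date
// parse the dd-MMM-yyyy strings straight into a DateType column
val converted = eoddata.withColumn("ConvertedDate", to_date($"Date", "dd-MMM-yyyy"))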
You could simply use date_format, though note that it returns a formatted string rather than a date type: df.withColumn("date", date_format(col("string"),"yyyy-MM-dd HH:mm:ss.SSSSSS")).show()
If dateID is an int column containing the date in yyyyMMdd form:
spark.sql("SELECT from_unixtime(unix_timestamp(cast(dateid as varchar(10)), 'yyyyMMdd'), 'yyyy-MM-dd') from XYZ").show(50, false)
Find the below-mentioned code; it might be helpful for you.
val stringDate = spark.sparkContext.parallelize(Seq("12/16/2019")).toDF("StringDate")
val dateConversion = stringDate.withColumn("dateColumn", to_date(unix_timestamp($"StringDate", "MM/dd/yyyy").cast("Timestamp")))
dateConversion.show(false)
+----------+----------+
|StringDate|dateColumn|
+----------+----------+
|12/16/2019|2019-12-16|
+----------+----------+
This works in Spark SQL (2.2 and later):
TO_DATE(date_string_or_column, 'yyyy-MM-dd') AS date_column_name. You can replace the second argument with whatever format your date string uses, e.g. yyyy/MM/dd. The return type is date.
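For instance, a quick sketch with a made-up literal:
spark.sql("SELECT TO_DATE('2016/08/26', 'yyyy/MM/dd') AS date_column_name").show()
+----------------+
|date_column_name|
+----------------+
|      2016-08-26|
+----------------+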
Use the function below in PySpark to convert a datatype into your required datatype. Here I'm converting every date column into a timestamp column.
from pyspark.sql.functions import col

def change_dtype(df):
    for name, dtype in df.dtypes:
        if dtype == "date":
            df = df.withColumn(name, col(name).cast('timestamp'))
    return df
If you try to convert a string column to a date when the data is in 'dd/MM/yyyy' format (with slashes) and you are using a Spark version greater than 3.0, the value is converted to null.
To make this work, you can set the Spark configuration property below:
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
and then use the following code to get the output that you want:
df.withColumn("tx_date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
The solution proposed above by Sai Kiriti Badam worked for me.
I'm using Azure Databricks to read data captured from an EventHub. This contains a string column named EnqueuedTimeUtc with the following format...
12/7/2018 12:54:13 PM
I'm using a Python notebook and used the following...
import pyspark.sql.functions as func
sports_messages = sports_df.withColumn("EnqueuedTimestamp", func.to_timestamp("EnqueuedTimeUtc", "MM/dd/yyyy hh:mm:ss aaa"))
... to create a new column EnqueuedTimestamp of type "timestamp" with data in the following format...
2018-12-07 12:54:13

Issue with DataFrame na() fill methods and ambiguous reference

I'm using Spark 1.3.1 where joining two dataframes repeats the column(s) being
joined. I'm left outer joining two data frames and want to send the
resulting dataframe to the na().fill() method to convert nulls to known
values based on the data type of the column. I've built a map of
"table.column" -> "value" and pass that to the fill method. But I get
exception instead of success :(. What are my options? I see that there is a dataFrame.withColumnRenamed method but I can only rename one column. I have joins that involve more than one column. Do I just have to ensure that there is a unique set of column names, regardless of table aliases in the dataFrame where I apply the na().fill() method?
Given:
scala> val df1 = sqlContext.jsonFile("people.json").as("df1")
df1: org.apache.spark.sql.DataFrame = [first: string, last: string]
scala> val df2 = sqlContext.jsonFile("people.json").as("df2")
df2: org.apache.spark.sql.DataFrame = [first: string, last: string]
I can join them together with
val df3 = df1.join(df2, df1("first") === df2("first"), "left_outer")
And I have a map that converts data type to value.
scala> val map = Map("df1.first"->"unknown", "df1.last" -> "unknown",
"df2.first" -> "unknown", "df2.last" -> "unknown")
But executing fill(map) results in exception.
scala> df3.na.fill(map)
org.apache.spark.sql.AnalysisException: Reference 'first' is ambiguous,
could be: first#6, first#8.;
Here is what I came up with. In my original example there was nothing interesting left in df2 after the join, so I changed this to the classical department/employee example.
department.json
{"department": 2, "name":"accounting"}
{"department": 1, "name":"engineering"}
person.json
{"department": 1, "first":"Bruce", "last": "szalwinski"}
And now I can join the dataframes, build the map, and replace nulls with unknowns.
scala> val df1 = sqlContext.jsonFile("department.json").as("df1")
df1: org.apache.spark.sql.DataFrame = [department: bigint, name: string]
scala> val df2 = sqlContext.jsonFile("people.json").as("df2")
df2: org.apache.spark.sql.DataFrame = [department: bigint, first: string, last: string]
scala> val df3 = df1.join(df2, df1("department") === df2("department"), "left_outer")
df3: org.apache.spark.sql.DataFrame = [department: bigint, name: string, department: bigint, first: string, last: string]
scala> val map = Map("first" -> "unknown", "last" -> "unknown")
map: scala.collection.immutable.Map[String,String] = Map(first -> unknown, last -> unknown)
scala> val df4 = df3.select("df1.department", "df2.first", "df2.last").na.fill(map)
df4: org.apache.spark.sql.DataFrame = [department: bigint, first: string, last: string]
scala> df4.show()
+----------+-------+----------+
|department|  first|      last|
+----------+-------+----------+
|         2|unknown|   unknown|
|         1|  Bruce|szalwinski|
+----------+-------+----------+
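A related note for newer Spark versions (my addition, not part of the original answer): joining with a Seq of column names keeps a single copy of the join key, so the remaining columns stay unambiguous for na.fill. A sketch:
// this join overload appeared after 1.3.1, so it may not apply to the asker's exact version
val joined = df1.join(df2, Seq("department"), "left_outer")
val filled = joined.na.fill(Map("first" -> "unknown", "last" -> "unknown"))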
