How to convert Gregorian Calendar to Hijri/Islamic Calendar in Spark - apache-spark

If you are using Spark with Java and want to convert a Gregorian date column to Hijri, the following code will help.
`
// Register a UDF that converts a java.sql.Date to its Hijri (Islamic) date string.
spark.udf().register("samplehijri", (Date d1) -> {
    if (d1 == null) {
        return null; // keep null dates as null
    }
    Calendar cl = Calendar.getInstance();
    cl.setTime(d1);
    // Calendar.MONTH is zero-based, LocalDate months are one-based.
    HijrahDate islamyDate = HijrahChronology.INSTANCE.date(
        LocalDate.of(cl.get(Calendar.YEAR), cl.get(Calendar.MONTH) + 1, cl.get(Calendar.DATE)));
    return islamyDate.toString();
}, DataTypes.StringType);
// Add the converted value as a new column.
resultDF = resultDF.withColumn("hijridob",
    functions.callUDF("samplehijri", functions.col("dob").cast("date")));
`
Here resultDF is the resulting DataFrame and dob is the date column to convert. The call adds a new column named hijridob.

Related

Get value of column in AWS Glue Custom Transform

I'm working on ETL in AWS Glue. I need to decode a base64-encoded text column from a table, and I'm doing that in a Custom Transform in Python 3.
My code is below:
def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
    import base64
    newdf = dfc.select(list(dfc.keys())[0]).toDF()
    data = newdf["email"]
    data_to_decrypt = base64.b64decode(data)
I get the following error:
TypeError: argument should be a bytes-like object or ASCII string, not 'Column'
How do I get a plain string from the Column object?
I was wrong, and it turned out to be a completely different thing than I thought.
The Column object returned by newdf["email"] refers to the entire column across all rows, so it is not possible to fetch a single value from it.
What I ended up doing is iterating over all rows and mapping each one to new values, like this:
def map_row(row):
    id = row.id
    client_key = row.client_key
    email = decrypt_jasypt_string(row.email.strip())
    phone = decrypt_jasypt_string(row.phone.strip())
    created_on = row.created_on
    return (id, email, phone, created_on, client_key)

df = dfc.select(list(dfc.keys())[0]).toDF()
rdd2 = df.rdd.map(lambda row: map_row(row))
df2 = rdd2.toDF(["id", "email", "phone", "created_on", "client_key"])
dyf_filtered = DynamicFrame.fromDF(df2, glueContext, "does it matter?")
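
The underlying point is that base64.b64decode works on one Python value at a time, so it has to run inside a per-row function rather than on the Column itself. Below is a minimal sketch of such a function in plain Python, with no Glue or Spark dependency. The name decode_email and the dict-shaped rows are hypothetical and only stand in for real rows. In a Spark job, a function like this would typically be applied inside rdd.map as above, or wrapped with pyspark.sql.functions.udf.

```python
import base64


def decode_email(value):
    """Decode one base64-encoded string; pass missing values through as None."""
    if value is None:
        return None
    return base64.b64decode(value.strip()).decode("utf-8")


# Hypothetical stand-in for rows; in a Glue job these would come from df.rdd.
rows = [
    {"id": 1, "email": "dXNlckBleGFtcGxlLmNvbQ=="},
    {"id": 2, "email": None},
]

# Apply the function value by value, which is what rdd.map does per row.
decoded = [{**row, "email": decode_email(row["email"])} for row in rows]
```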

Pyspark change string to timestamptype

I am new to PySpark and I am trying to create a reusable function that converts a given column from StringType to TimestampType.
The input column string looks like this: 23/04/2021 12:00:00 AM
I want this to be turned into TimestampType so I can get the latest date using PySpark.
Below is the function I have created so far:
def datetype_change(self, key, col):
    self.log.info("datetype_change...".format(self.app_name.upper()))
    self.df[key] = self.df[key].withColumn("column_name", F.unix_timestamp(F.col("column_name"), 'yyyy-MM-dd HH:mm:ss').cast(TimestampType()))
When I run it I get an error:
NameError: name 'TimestampType' is not defined
How do I change this function so that it produces the intended output?
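Found my answer. TimestampType has to be imported, and the pattern has to match the layout of the input string:
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType  # fixes the NameError

def datetype_change(self, key, col):
    self.log.info("-datetype_change...".format(self.app_name.upper()))
    # A single 'a' is used for AM/PM; Spark 3 rejects a repeated 'aa' in patterns.
    self.df[key] = self.df[key].withColumn(col, F.unix_timestamp(self.df[key][col], 'dd/MM/yyyy hh:mm:ss a').cast(TimestampType()))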
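
To see what the corrected pattern expresses, the same layout can be checked in plain Python. This is only a sketch: it uses datetime.strptime rather than Spark, so the directives differ (%d/%m/%Y %I:%M:%S %p instead of the Spark pattern letters), and the helper name parse_ts and the second sample value are hypothetical.

```python
from datetime import datetime


def parse_ts(text):
    """Parse a day-first, 12-hour timestamp such as '23/04/2021 12:00:00 AM'."""
    return datetime.strptime(text, "%d/%m/%Y %I:%M:%S %p")


# The first sample is from the question; the second is made up for comparison.
samples = ["23/04/2021 12:00:00 AM", "22/04/2021 03:15:00 PM"]

# Once the strings are real timestamps, the latest one is a plain max().
latest = max(parse_ts(s) for s in samples)
```

Note that 12:00:00 AM is midnight, so the first sample parses to hour 0 of 23 April 2021.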

AWS Glue - How to exclude rows where string does not match a date format

I have a dataset with a datecreated column. This column is typically in the format 'dd/MM/yy', but sometimes it contains garbage text. I ultimately want to convert the column to a DATE and turn the garbage text into NULL values.
I have been trying to use resolveChoice, but it results in all null values.
data_res = date_dyf.resolveChoice(specs=[('datescanned', 'cast:timestamp')])
Sample data
3,1/1/18,text7
93,this is a test,text8
9,this is a test,text9
82,12/12/17,text10
Try converting the DynamicFrame into a Spark DataFrame and parsing the date with the to_date function:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import to_date

df = date_dyf.toDF()  # toDF is a method and must be called
# If values are not zero-padded (e.g. 1/1/18), "d/M/yy" may be needed on Spark 3+.
parsedDateDf = df.withColumn("datescanned", to_date(df["datescanned"], "dd/MM/yy"))
dyf = DynamicFrame.fromDF(parsedDateDf, glueContext, "convertedDyf")
If a string doesn't match the format, the value is set to null.
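
The "parse or null" behaviour can be illustrated in plain Python on the sample values from the question. This is a sketch under assumptions, not Spark code: it uses datetime.strptime (which accepts non-padded days and months), and the helper name parse_date_or_none is hypothetical.

```python
from datetime import date, datetime


def parse_date_or_none(text, fmt="%d/%m/%y"):
    """Return a date if text matches fmt, otherwise None (like a SQL NULL)."""
    try:
        return datetime.strptime(text, fmt).date()
    except (TypeError, ValueError):
        # TypeError covers missing values, ValueError covers garbage text.
        return None


# Second column of the sample data in the question.
values = ["1/1/18", "this is a test", "this is a test", "12/12/17"]
parsed = [parse_date_or_none(v) for v in values]
```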

Define UDF in Spark Scala

I need to use a UDF in Spark that takes in a timestamp, an Integer and another DataFrame, and returns a tuple of 3 values.
I keep hitting error after error and I'm no longer sure I'm fixing it the right way.
Here is the function:
def determine_price (view_date: org.apache.spark.sql.types.TimestampType , product_id: Int, price_df: org.apache.spark.sql.DataFrame) : (Double, java.sql.Timestamp, Double) = {
  var price_df_filtered = price_df.filter($"mkt_product_id" === product_id && $"created"<= view_date)
  var price_df_joined = price_df_filtered.groupBy("mkt_product_id").agg("view_price" -> "min", "created" -> "max").withColumn("last_view_price_change", lit(1))
  var price_df_final = price_df_joined.join(price_df_filtered, price_df_joined("max(created)") === price_df_filtered("created")).filter($"last_view_price_change" === 1)
  var result = (price_df_final.select("view_price").head().getDouble(0), price_df_final.select("created").head().getTimestamp(0), price_df_final.select("min(view_price)").head().getDouble(0))
  return result
}

val det_price_udf = udf(determine_price)
The error it gives me is:
error: missing argument list for method determine_price
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `determine_price _` or `determine_price(_,_,_)` instead of `determine_price`.
If I start adding the arguments, I keep running into other errors, such as "Int expected Int.type found" or "object DataFrame is not a member of package org.apache.spark.sql".
To give some context:
I have one DataFrame with prices, product IDs and creation dates, and another DataFrame containing product IDs and view dates.
For each view I need the price from the most recent price entry created on or before the view date.
Since each product ID has multiple view dates in the second DataFrame, I thought a UDF would be faster than a cross join. If anyone has a different idea, I'd be grateful.
You cannot pass a DataFrame into a UDF, because the UDF runs on the workers, on one particular partition at a time. Just as you cannot use an RDD on a worker (see "Is it possible to create nested RDDs in Apache Spark?"), you cannot use a DataFrame there either.
You need a workaround for this.
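
One possible workaround, not spelled out in the answer, is to join the two DataFrames on product ID and then keep, for each view, the latest price entry created at or before the view date. The sketch below shows only that selection logic in plain Python (no Spark), using bisect. The function name price_at, the tuple layout and the sample data are all hypothetical.

```python
from bisect import bisect_right
from datetime import date


def price_at(view_date, price_history):
    """Return the price of the latest entry created on or before view_date.

    price_history is a list of (created, price) tuples sorted by created.
    Returns None when no entry exists yet at view_date.
    """
    created = [c for c, _ in price_history]
    idx = bisect_right(created, view_date)  # entries [0, idx) are <= view_date
    return price_history[idx - 1][1] if idx else None


# Hypothetical price history for a single product.
history = [(date(2021, 1, 1), 10.0), (date(2021, 2, 1), 12.0)]

mid_january = price_at(date(2021, 1, 15), history)
on_change_day = price_at(date(2021, 2, 1), history)
before_any_price = price_at(date(2020, 12, 31), history)
```

In Spark the same rule would be expressed with DataFrame operations on the driver side (a join plus a filter and an aggregation or window), not inside a UDF.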

CreateDataFrame or SaveAsTable intuitively encode in pyspark 1.6

I am trying to save a table in Spark 1.6 using PySpark. All of the table's columns are saved as text, and I'm wondering if I can change this:
product = sc.textFile('s3://path/product.txt')
product = product.map(lambda x: x.split("\t"))
product = sqlContext.createDataFrame(product, ['productid', 'marketID', 'productname', 'prod'])
product.saveAsTable("product", mode='overwrite')
Is there something in the last 2 commands that could automatically recognize productid and marketid as numerics? I have a lot of files and a lot of fields to upload, so ideally it would be automatic.
"Is there something in the last 2 commands that could automatically recognize productid and marketid as numerics?"
If you pass int or float values (depending on what you need), PySpark will infer the column type for you.
In your case, change the lambda function in
product = product.map(lambda x: x.split("\t"))
product = sqlContext.createDataFrame(product, ['productid', 'marketID', 'productname', 'prod'])
to
from pyspark.sql.types import Row

def split_product_line(line):
    fields = line.split('\t')
    return Row(
        productid=int(fields[0]),
        marketID=int(fields[1]),
        ...
    )

product = m3product.map(split_product_line).toDF()
With a named function you will find it much easier to control data types and to add error or exception checks.
Try to avoid lambda functions where possible :)
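
Since the question asks for something automatic across many files and fields, one option is a generic converter that tries numeric types before falling back to text. This is a minimal sketch in plain Python, assuming tab-separated lines. The helper names to_number_or_text and parse_line and the sample line are hypothetical. It relies on every value in a column converting to the same type, because Spark infers one type per column.

```python
def to_number_or_text(value):
    """Return value as int if possible, else float, else the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    # Caveat: strings such as "nan" or "inf" are accepted by float().
    return value


def parse_line(line):
    """Split one tab-separated line and convert each field independently."""
    return [to_number_or_text(field) for field in line.rstrip("\n").split("\t")]


# Made-up sample line with four fields.
fields = parse_line("101\t7\tWidget\t3.5\n")
```

A function like parse_line could then replace the lambda passed to map, with the column names still supplied to createDataFrame.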
