Hive auto increment UDF doesn't give desired results - apache-spark

I am trying to create a UDF in Hive. This UDF has to auto-increment a Hive table column called id.
The following is the Java code to create the UDF.
package myudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

// Stateful, non-deterministic UDF: Hive must not cache or reorder its results.
@UDFType(deterministic = false, stateful = true)
public class autoincrement extends UDF {
    int lastValue;

    public int evaluate() {
        lastValue++;
        return lastValue;
    }
}
Now I am able to create a jar file and add it to Hive like below:
add jar /home/cloudera/Desktop/increment.jar;
Then I create a temporary function:
create temporary function inc as 'myudf.autoincrement';
Create a table like below:
Create table abc(id int, name string)
Insert values:
INSERT into TABLE abc SELECT inc() as id, 'Tim';
Do a select statement:
select * from abc;
Output:
1 Tim
Insert values:
INSERT into TABLE abc SELECT inc() as id, 'John';
Do a select statement:
select * from abc
Output:
1 Tim
1 John
But this is not what I was expecting when I inserted values for the 2nd time.
My expected output was:
1 Tim
2 John
How do I get the expected output? What should I change in the Java code to get the desired result?
And can I use the same function in Spark as well?
In Spark, when I do
sqlContext.sql("show functions")
it shows the list of all functions available in Hive.
But when I do
sqlContext.sql("INSERT into TABLE abc SELECT inc() as id, 'Jim'")
I get the below error:
pyspark.sql.utils.AnalysisException: u'undefined function inc; line 1 pos 29'
How do I create the same UDF in pyspark and get the desired output?
What happens when the insert statements are executed at the same time?

Follow the steps below.
Change your insert to: INSERT into TABLE abc SELECT max(id)+1 as id, 'Tim' from abc;
or
Modify the UDF to take an int column as input and return input+1, and
modify your insert to: INSERT into TABLE abc SELECT inc(max(id)) as id, 'Tim' from abc;
You will have to verify the correctness of the SQL in Hive; I have checked it and it works in MySQL.
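If you want the same behaviour from PySpark (where inc is not registered, hence the undefined function error), you can run the max(id)+1 variant directly through Spark SQL. A minimal sketch, assuming the Hive table abc already exists and that a SparkSession built with Hive support plays the role of the question's sqlContext; the coalesce(max(id), 0) guard for an empty table is my addition.
# Minimal PySpark sketch of the max(id)+1 approach suggested above.
# Assumes the Hive table abc(id int, name string) already exists.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("auto-increment-sketch")
         .enableHiveSupport()
         .getOrCreate())

# coalesce(max(id), 0) covers the very first insert, when abc is still empty.
spark.sql(
    "INSERT INTO TABLE abc "
    "SELECT coalesce(max(id), 0) + 1 AS id, 'Jim' AS name FROM abc"
)

spark.sql("SELECT * FROM abc").show()
Regarding inserts executed at the same time: max(id)+1 reads the current maximum at insert time, so two concurrent inserts can compute the same id; neither Hive nor Spark serializes them for you.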

Related

Inserting Timestamp Into Snowflake Using Python 3.8

I have an empty table defined in Snowflake as:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP
);
And it creates the correct table, which has been checked using the DESC command in SQL. Then, using the Snowflake Python connector, we are trying to execute the following query:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},{ct});'
ctx.cursor().execute(insert_query)
Just before this query the variables are defined. The main challenge is getting the current timestamp written into Snowflake. Here the value of ct is defined as:
import datetime
ct = datetime.datetime.now()
print(ct)
2021-04-30 21:54:41.676406
But when we try to execute this INSERT query we get the following error message:
ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 1 at position 157 unexpected '21'.
Can I kindly get some help on how to format the datetime value here? Help is appreciated.
In addition to the answer @Lukasz provided, you could also think about defining current_timestamp() as the default for the TIME_PREDICTED column:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP DEFAULT current_timestamp
);
And then just insert ACCOUNT_ID and PREDICTED_PROBABILITY:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY) VALUES ({accountId}, {risk_score});'
ctx.cursor().execute(insert_query)
It will automatically assign the insert time to TIME_PREDICTED.
Educated guess: when performing the insert with:
insert_query = f'INSERT INTO ...(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED)
VALUES ({accountId}, {risk_score},{ct});'
this is plain string interpolation: ct ends up as the string representation of a datetime, which does not match a timestamp data type, hence the error.
I would suggest using proper variable binding instead:
ctx.cursor().execute("INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
                     "(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
                     "VALUES(:1, :2, :3)",
                     (accountId,
                      risk_score,
                      ("TIMESTAMP_LTZ", ct)))
Avoid SQL Injection Attacks
Avoid binding data using Python’s formatting function because you risk SQL injection. For example:
# Binding data (UNSAFE EXAMPLE)
con.cursor().execute(
    "INSERT INTO testtable(col1, col2) "
    "VALUES({col1}, '{col2}')".format(
        col1=789,
        col2='test string3')
)
Instead, store the values in variables, check those values (for example, by looking for suspicious semicolons inside strings), and then bind the parameters using qmark or numeric binding style.
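For comparison, a minimal sketch of qmark binding with the Snowflake Python connector. The connection parameters are placeholders, and note that qmark or numeric binding has to be selected via snowflake.connector.paramstyle before connecting (this also applies to the :1/:2/:3 numeric style shown earlier).
import snowflake.connector

# qmark / numeric binding must be selected before the connection is created.
snowflake.connector.paramstyle = 'qmark'

con = snowflake.connector.connect(
    account='my_account',    # placeholder
    user='my_user',          # placeholder
    password='my_password',  # placeholder
)

# The values travel separately from the SQL text, so no string formatting
# is involved and the datetime is bound with its proper type.
con.cursor().execute(
    "INSERT INTO testtable(col1, col2) VALUES (?, ?)",
    (789, 'test string3'),
)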
You forgot to place quotes before and after {ct}. The code should be:
insert_query = "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},'{ct}');".format(accountId=accountId,risk_score=risk_score,ct=ct)
ctx.cursor().execute(insert_query)

Use Common Table Expression with Pony ORM

I have the following query that contains a common table expression:
WITH example AS (
SELECT unnest(ARRAY['foo', 'bar', 'baz']) as col
)
SELECT *
FROM example
Trying to use it in database.select(query) throws pony.orm.dbapiprovider.ProgrammingError: syntax error at or near "WITH", and database.select(raw_sql(query)) throws TypeError: expected string or bytes-like object.
How can I select data using a CTE with ponyorm?
To use a query containing a CTE, call the execute function on the database and fetch the rows with the returned cursor:
cursor = database.execute("""
WITH example AS (
SELECT unnest(ARRAY['foo', 'bar', 'baz']) as col
)
SELECT *
FROM example
""")
rows = cursor.fetchall()
Note: the cursor here is a psycopg2 cursor, so while this solution does use the pony library, the details may differ depending on the database being used.
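For reference, a hedged usage sketch of what the rows from the CTE above look like with a PostgreSQL backend (the psycopg2 cursor mentioned in the note):
# Re-running the CTE from the answer above and unpacking the result.
cursor = database.execute(
    "WITH example AS (SELECT unnest(ARRAY['foo', 'bar', 'baz']) AS col) "
    "SELECT * FROM example"
)
rows = cursor.fetchall()       # [('foo',), ('bar',), ('baz',)]
values = [r[0] for r in rows]  # ['foo', 'bar', 'baz']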

Spark: How to make Spark execute a transformation once when three actions depend on it

We have written a Spark batch application (Spark version 2.3.0). The code is as below.
Transformation: Dataset<CollectionFlattenedData> collectionDataDS = flatMap(function which parses some files and returns the dataset); This dataset has three types of data, differentiated by the column recordtype: 1, 2, 3.
Load to temp table: collectionDataDS.createOrReplaceTempView(TEMP_TABLE); Creates a temp view of the dataset.
Action 1: sparkSession.sql("INSERT INTO TABLE1 SELECT COL1,COL2,COL3 FROM TEMP_TABLE WHERE recordtype='1'"); Hive query to load the TABLE1 table from the temp table.
Action 2: sparkSession.sql("INSERT INTO TABLE2 SELECT COL4,COL5,COL6 FROM TEMP_TABLE WHERE recordtype='2'"); Hive query to load TABLE2 from the temp table.
Action 3: sparkSession.sql("INSERT INTO TABLE2 SELECT COL7,COL8,COL9 FROM TEMP_TABLE WHERE recordtype='3'"); Hive query to load the error table.
What is happening: because we are running 3 queries, which are nothing but separate actions, the flatMap transformation is called three times (once per action). But our requirement is that the flatMap operation should be called only once.
The CollectionFlattenedData POJO code is something like this:
public class CollectionFlattenedData implements Serializable {
    private String recordtype;
    private String COL1;
    private String COL2;
    private String COL3;
    private String COL4;
    private String COL5;
    private String COL6;
    private String COL7;
    private String COL8;
    private String COL9;
    // getters and setters of all the columns
}
Is there any way we can do this? An early response is highly appreciated.
We can approach this in two ways, but first identify the size of TEMP_TABLE.
If the size is on the order of your RAM, i.e. if a good amount of the temp table fits in memory, then you can cache it and use it in the further calculations. (You can get the data size from the Spark UI.)
The other, better way is to just save the data into a permanent table.
You can then refer to it in the next steps as usual.
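A minimal PySpark sketch of the caching approach (the question's code is Java, but the idea carries over; the source path is a placeholder and TABLE1/TABLE2 are assumed to already exist in Hive):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").enableHiveSupport().getOrCreate()

# Placeholder for the flatMap-style parsing step from the question.
collection_df = spark.read.json("/path/to/parsed/files")

# Cache the parsed data so the three INSERT actions below reuse it
# instead of re-running the parsing transformation once per action.
collection_df.cache()
collection_df.createOrReplaceTempView("TEMP_TABLE")

spark.sql("INSERT INTO TABLE1 SELECT COL1, COL2, COL3 FROM TEMP_TABLE WHERE recordtype = '1'")
spark.sql("INSERT INTO TABLE2 SELECT COL4, COL5, COL6 FROM TEMP_TABLE WHERE recordtype = '2'")
spark.sql("INSERT INTO TABLE2 SELECT COL7, COL8, COL9 FROM TEMP_TABLE WHERE recordtype = '3'")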
When you use .createOrReplaceTempView(), you are just giving the dataframe a name to use in further Spark SQL queries. It does not trigger any action on the resultant dataframe.

Getting ValidationFailureSemanticException on 'INSERT OVERWRITE'

I am creating a DataFrame and registering it as a temp view using df.createOrReplaceTempView('mytable'). After that I try to write the content of 'mytable' into a Hive table (which has a partition) using the following query:
insert overwrite table
myhivedb.myhivetable
partition(testdate) // ( 1) : Note here : I have a partition named 'testdate'
select
Field1,
Field2,
...
TestDate //(2) : Note here : I have a field named 'TestDate' ; Both (1) & (2) have the same name
from
mytable
When I execute this query, I am getting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec
{testdate=, TestDate=2013-01-01}
Looks like I am getting this error because of the identical field names, i.e. testdate (the partition in Hive) & TestDate (the field in the temp table 'mytable').
Whereas if my partition name is different from the field name (i.e. TestDate), the query executes successfully. Example:
insert overwrite table
myhivedb.myhivetable
partition(my_partition) //Note here the partition name is not 'testdate'
select
Field1,
Field2,
...
TestDate
from
mytable
My guess is that it is a bug in Spark... but I would like a second opinion. Am I missing something here?
@DuduMarkovitz @dhee: apologies for the late response. I was finally able to resolve the issue. Earlier I was creating the table using camelCase (in the CREATE statement), which seems to have been the reason for the exception. Now I have created the table using DDL where the field names are in lower case, and this has resolved my issue.
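For illustration, a hypothetical all-lowercase DDL along the lines of the fix described above; only the table name and the testdate partition come from the question, while the column names and types are assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical lowercase DDL: every field name, including the testdate
# partition column, is lower case, so the dynamic-partition INSERT OVERWRITE
# no longer sees two differently-cased versions of the same name.
spark.sql("""
    CREATE TABLE IF NOT EXISTS myhivedb.myhivetable (
        field1 STRING,
        field2 STRING
    )
    PARTITIONED BY (testdate STRING)
""")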

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0),s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: When processing an RDD one row at a time, it's possible to access the columns by name by using pattern-matching syntax:
val secondary = original.map {case Row(myVal: Int, _) => myVal}
although this could get cumbersome if the right-hand side of the '=>' requires access to a lot of the columns, as they would each need to be matched on the left. (This is from a very useful comment in the source code for the Row companion object.)
