How can we write a Hive SELECT statement for the logic below?
If a column value is null, it should return ' '.
If trim(column) is null, it should return ' '.
Otherwise it should return the value of that column.
I guess this can be implemented using a CASE WHEN approach.
How can this be implemented in a Hive query?
I think you want:
select coalesce(trim(column), '')
Note that trim() doesn't take a second argument in Hive. Also, trim() doesn't return NULL unless the argument is NULL; it returns an empty string.
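If you want to sanity-check this, here is a minimal sketch run through a local spark-shell (Spark SQL's trim and coalesce behave the same way as Hive's for this case); the CASE WHEN form the question mentions is shown too, reading the second rule as "the trimmed value is empty":
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("trim-check").getOrCreate()

spark.sql("SELECT coalesce(trim(NULL), '') AS v").show()   // -> '' (empty string), not NULL
spark.sql("SELECT coalesce(trim('   '), '') AS v").show()  // -> '' (trim of spaces gives '', not NULL)

// equivalent CASE WHEN form
spark.sql(
  """SELECT CASE WHEN col IS NULL OR trim(col) = '' THEN ''
    |            ELSE col END AS v
    |  FROM (SELECT '  hi ' AS col) t""".stripMargin).show()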
I noticed (for me) unexpected behavior of the ltrim() and rtrim() functions on Sybase ASE T-SQL engines: for empty and only-spaces string arguments they return NULL. I consider this behavior weird because one (or at least I) would expect ltrim() and rtrim() simply to remove leading and trailing spaces, not to turn a non-null value (such as an empty or only-spaces string) into NULL.
To reproduce the issue, I wrote the script below. Knowing that using the = operator against NULL isn't good practice, I included those examples for educational purposes only.
begin
create table #tmpSomeTable (
rowId int not null,
someStringColumn varchar(10) null
)
insert into #tmpSomeTable(someStringColumn,rowId) values (null,1)
insert into #tmpSomeTable(someStringColumn,rowId) values ('',2)
insert into #tmpSomeTable(someStringColumn,rowId) values (' ',3)
insert into #tmpSomeTable(someStringColumn,rowId) values (' ',4)
--
select '=null predicate' as [predicate]
,*
from #tmpSomeTable
where someStringColumn = null
select 'is null predicate' as [predicate]
,*
from #tmpSomeTable
where someStringColumn is null
--ltrim() returns null when evaluated on the empty string from the rowId=2 row
select '=null predicate' as [predicate]
,*
from #tmpSomeTable
where ltrim(someStringColumn) = null
select 'is null predicate' as [predicate]
,*
from #tmpSomeTable
where ltrim(someStringColumn) is null
--rtrim() returns null when evaluated on the empty string from the rowId=2 row
select '=null predicate' as [predicate]
,*
from #tmpSomeTable
where rtrim(someStringColumn) = null
select 'is null predicate' as [predicate]
,*
from #tmpSomeTable
where rtrim(someStringColumn) is null
drop table #tmpSomeTable
end
My questions about this are: Why do the native ltrim() and rtrim() functions return NULL when evaluated on empty and only-spaces strings on the Sybase ASE database engine? Is that expected behavior? Is it non-deterministic behavior controlled by database instance parameters? Is it a bug or a known issue?
I am trying to filter a column in a DataFrame using the filter() function.
The condition for the filter is saved in a string variable, like below.
val condition = ">10"
val outDF = df.filter(col("value") > expr(condition))
In the above code, is it possible to use expr or any SQL function to convert the condition string ">10" into an actual condition in the filter function?
Try the code below.
val condition = "> 10"
df.filter(s"value ${condition}")
OR
df
.filter(expr(s"value ${condition}"))
.show(false)
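A quick, self-contained check of both forms (a sketch assuming a local spark-shell; the column name value follows the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.master("local[*]").appName("filter-check").getOrCreate()
import spark.implicits._

val df = Seq(5, 15, 25).toDF("value")
val condition = "> 10"

df.filter(s"value ${condition}").show()        // SQL-string form: keeps 15 and 25
df.filter(expr(s"value ${condition}")).show()  // expr form: same result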
Does the Spark when function consistently return the first match?
For example,
val df = spark.sql("SELECT 1 as a")
df.withColumn("a",when($"a">0,1).when($"a">0.5,2)).show()
Does it always return the first when match consistently?
Or is it better practice to do it this way:
df.withColumn("a",when($"a">0,1).otherwise(when($"a">0.5,2))).show()
Which is the better practice to use?
The documentation (especially the example) suggests that the first match is taken:
Evaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.
// Example: encoding gender string column into integer.
// Scala:
people.select(
when(people("gender") === "male", 0)
.when(people("gender") === "female", 1)
.otherwise(2))
EDIT:
Although this example uses disjoint cases, it's standard SQL that the first match is taken; see e.g. https://dba.stackexchange.com/a/43353 and https://stackoverflow.com/a/22641095/1138523
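To see the first-match behaviour concretely, here is a small check (assuming a spark-shell with spark.implicits._ in scope, as in the question): both conditions are true for a = 1, so whichever when comes first determines the result.
import org.apache.spark.sql.functions.when

val df = spark.sql("SELECT 1 AS a")
// the first matching when wins: a > 0 is checked first, so the result is 1
df.withColumn("a", when($"a" > 0, 1).when($"a" > 0.5, 2)).show()
// reversing the order makes the a > 0.5 branch win, so the result is 2
df.withColumn("a", when($"a" > 0.5, 2).when($"a" > 0, 1)).show()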
I'd like to update the value of column A by applying a function to column B.
Is there a simple solution of the form:
knex('table')
.update({
colA: func(${colB})
})
Yes, there is a way to do this within Knex.
For SQL functions which don't have explicit support in Knex, you use knex.raw(SQLstring, parmArray) to encapsulate a SQL snippet, or knex.schema.raw(...) to produce an entire SQL statement. You use single question marks (?) for value replacements and double question marks (??) for field-identifier replacements. (see link)
So the SQL: UPDATE table SET colA = func(colB)
... can be produced by including a SQL snippet: (you were close)
knex('table')
.update({
colA: knex.raw( 'func(??)', ['colB'] )
})
... or as full raw SQL:
knex.schema.raw( 'UPDATE table SET ?? = func(??)', ['colA', 'colB'] )
Cheers, Gary.
So as far as I know, Apache Spark doesn't have functionality that imitates the SQL UPDATE command, i.e. changing a single value in a column given a certain condition. The only way around that is to use the following command I was instructed to use (here on Stack Overflow): withColumn(columnName, where('condition', value));
However, the condition has to be of Column type, meaning I have to use the built-in column filtering functions Spark has (equalTo, isin, lt, gt, etc.). Is there a way I can use an SQL statement instead of those built-in functions?
The problem is that I'm given a text file with SQL conditions, like WHERE ID > 5 or WHERE AGE != 50, etc. I then have to label values based on those conditions, and I thought of following the withColumn() approach, but I can't plug an SQL statement into that function. Any idea how I can get around this?
I found a way to get around this:
You want to split your dataset into two sets: the rows you want to update and the rows you don't want to update.
Dataset<Row> valuesToUpdate = dataset.filter("conditionToFilterValues");
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
valuesToUpdate = valuesToUpdate.withColumn("updatedColumn", lit("updateValue"));
Dataset<Row> updatedDataset = valuesNotToUpdate.union(valuesToUpdate);
This, however, doesn't keep the same order of records as the original dataset, so if order is important to you, this won't suffice.
In PySpark you have to use .subtract instead of .except
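Here is a minimal, self-contained Scala sketch of the same split-and-union idea, driving the split with a SQL condition string such as "ID > 5" (the column and label names below are hypothetical):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder.master("local[*]").appName("update-sketch").getOrCreate()
import spark.implicits._

val dataset   = Seq((1, "a"), (7, "b"), (9, "c")).toDF("ID", "value")
val condition = "ID > 5"                    // e.g. read from the text file
val toUpdate  = dataset.filter(condition)   // rows matching the SQL condition
val untouched = dataset.except(toUpdate)    // everything else
val updated   = toUpdate.withColumn("label", lit("matched"))
val result    = untouched.withColumn("label", lit("unmatched")).union(updated)
result.show()  // note: row order is not preserved, as mentioned above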
If you are using a DataFrame, you can register it as a temp table
using df.registerTempTable("events")
Then you can query it, appending your condition string, like:
sqlContext.sql("SELECT * FROM events " + condition)
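A sketch of how that could look end to end, assuming the text file holds one condition per line (e.g. WHERE ID > 5) and continuing from the df and sqlContext above; the file path is hypothetical:
import scala.io.Source

df.registerTempTable("events")
// one SQL condition per line, e.g. "WHERE ID > 5" or "WHERE AGE != 50"
val conditions = Source.fromFile("/path/to/conditions.txt").getLines().toList
conditions.foreach { cond =>
  sqlContext.sql("SELECT * FROM events " + cond).show()
}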
The when clause translates into a CASE clause, which you can relate to the SQL CASE clause.
Example
scala> val condition_1 = when(col("col_1").isNull,"NA").otherwise("AVAILABLE")
condition_1: org.apache.spark.sql.Column = CASE WHEN (col_1 IS NULL) THEN NA ELSE AVAILABLE END
or you can chain when clauses as well
scala> val condition_2 = when(col("col_1") === col("col_2"),"EQUAL").when(col("col_1") > col("col_2"),"GREATER").
| otherwise("LESS")
condition_2: org.apache.spark.sql.Column = CASE WHEN (col_1 = col_2) THEN EQUAL WHEN (col_1 > col_2) THEN GREATER ELSE LESS END
scala> val new_df = df.withColumn("condition_1",condition_1).withColumn("condition_2",condition_2)
Still, if you want to use a table, you can register your DataFrame / Dataset as a temporary table and run SQL queries against it.
df.createOrReplaceTempView("tempTable")//spark 2.1 +
df.registerTempTable("tempTable")//spark 1.6
Now you can run SQL queries
spark.sql("your query goes here, with CASE clause and WHERE condition")//spark 2.1
sqlContext.sql("your query goes here, with CASE clause and WHERE condition")//spark 1.6
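For instance, the chained when example above can be written as a plain SQL query against the registered view (the column names col_1 / col_2 follow the example; Spark 2.1+ syntax):
val result = spark.sql(
  """SELECT *,
    |       CASE WHEN col_1 = col_2 THEN 'EQUAL'
    |            WHEN col_1 > col_2 THEN 'GREATER'
    |            ELSE 'LESS'
    |       END AS condition_2
    |  FROM tempTable
    | WHERE col_1 IS NOT NULL""".stripMargin)
result.show()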
If you are using a Java Dataset, you can update it as below.
Here is the code:
Dataset<Row> ratesFinal1 = ratesFinal.filter(" on_behalf_of_comp_id != 'COMM_DERIVS' ");
ratesFinal1 = ratesFinal1.filter(" status != 'Hit/Lift' ");
Dataset<Row> ratesFinalSwap = ratesFinal1.filter(" on_behalf_of_comp_id in ('SAPPHIRE','BOND') and cash_derivative != 'cash'");
ratesFinalSwap = ratesFinalSwap.withColumn("ins_type_str", functions.lit("SWAP"));
Adding a new column with the value from an existing column:
ratesFinalSTW = ratesFinalSTW.withColumn("action", ratesFinalSTW.col("status"));