null column NOT IN list of strings strange results - apache-spark

I am getting weird results when using a Spark SQL statement like:
select * from mytab where somecol NOT IN ('ABC','DEF')
If I set somecol to ABC it returns nothing. If I set it to XXX it returns a row.
However, if I leave the column blank, like ,, in the CSV data (so the value is read as null), it still does not return anything, even though null is not in the list of values.
This remains the case even if re-written as NOT(somecol IN ('ABC','DEF')).
I feel like this has to do with comparisons between null and strings, but I am not sure what to do about null column values that end up in IN or NOT IN clauses.
Do I need to convert them to empty strings first?

You can put an explicit check for nulls in the query, since comparing anything with null evaluates to unknown in Spark SQL:
select * from mytab where somecol NOT IN ('ABC','DEF') or somecol is null
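A minimal PySpark sketch of the behaviour, using the table and column names from the question (the sample values are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([("ABC",), ("XXX",), (None,)], ["somecol"]).createOrReplaceTempView("mytab")

# NOT IN evaluates to unknown for the null row, so that row is filtered out
spark.sql("select * from mytab where somecol NOT IN ('ABC','DEF')").show()
# adding the explicit null check keeps the null row
spark.sql("select * from mytab where somecol NOT IN ('ABC','DEF') or somecol is null").show()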

Related

Azure SQL: join of 2 tables with 2 unicode fields returns empty when matching records exist

I have a table with a few key columns created with nvarchar(80) => unicode.
I can list the full dataset with a SELECT * statement (Table1) and can confirm the values I need to filter on are there.
However, I can't get any results from that table if I filter rows using alphabet characters as input on any column.
The columns in Table1 store values in Cyrillic characters.
I know it must have to do with character encoding: what I see in the result list is not what I use as input characters.
The Unicode nvarchar type should resolve this character type mismatch automatically.
What do you suggest I do in order to get results?
Thank you very much.
Paulo
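One common cause (an assumption here, since the filtering query itself isn't shown) is a non-Unicode string literal being converted through the database code page before the comparison. Parameterizing the query, for example with pyodbc, sends the Cyrillic value as nvarchar; the column name, search value and connection details below are hypothetical:

import pyodbc

# hypothetical connection string; adjust driver, server, database and credentials
conn = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret")
cur = conn.cursor()

# a Python str parameter is sent as NVARCHAR, so no code-page conversion happens;
# the raw T-SQL equivalent would use an N'...' literal, e.g. WHERE KeyColumn = N'значение'
cur.execute("SELECT * FROM Table1 WHERE KeyColumn = ?", "значение")
print(cur.fetchall())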

Hive ORC table empty string

I have a Hive table with data stored as ORC.
I write empty values (blank, '') into some fields, but sometimes when I run a select query on this table the empty string columns are shown as NULL in the query result.
I would like to see the empty values I entered, how is this possible?
If you want to see empty values instead of NULL in a Hive table, you can use the NVL function, which produces a default value for NULL column values.
Below is the syntax:
NVL(arg1, arg2) - here arg1 is an expression or column and arg2 is the default value for
NULL values.
e.g. Query - SELECT NVL(blank, '') AS blank_1 FROM db.table;
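To see what NVL does in isolation, here is a small sketch using Spark SQL (which supports the same nvl function) against a temporary view rather than the actual ORC-backed Hive table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([("x",), (None,)], ["blank"]).createOrReplaceTempView("t")

# nvl substitutes the second argument whenever the first is NULL
spark.sql("SELECT nvl(blank, '') AS blank_1 FROM t").show()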

sqlite instr function is not working in some cases

In SQLite we have table1 with column column1.
There are 4 rows with the following values for column1:
(p1, p10, p11, p20)
DROP TABLE IF EXISTS table1;
CREATE TABLE table1(column1 NVARCHAR);
INSERT INTO table1 (column1) values ('p1'),('p10'),('p11'),('p20');
Select instr(',p112,p108,p124,p204,p11,p1124,p1,p10,p20,',','+column1+',') from table1;
We have to get the position of each value of column1 in the given string:
,p112,p108,p124,p204,p11,p1124,p1,p10,p20,
the query
Select instr(',p112,p108,p124,p204,p11,p1124,p1,p10,p20,',column1) from table1;
returns values
(2,7,2,17)
which is not what we want
the query
Select instr(',p112,p108,p124,p204,p11,p1124,p1,p10,p20,',','+column1+',') from table1;
returns 9 for all rows -
it turned out that this is the position of the first "0" character ???
How can we get the exact positions of the column1 values in the given string in SQLite??
In SQLite the concatenation operator is || and not + (as in SQL Server), so do this:
Select instr(',p112,p108,p124,p204,p11,p1124,p1,p10,p20,',',' || column1 || ',') from table1;
What your code did was numeric addition, which resulted in 0, because none of the string operands could successfully be converted to a number,
so instr() was searching for '0' and found it, for every row, at position 9 of the string ',p112,p108,p124,p204,p11,p1124,p1,p10,p20,'.
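A self-contained way to check the corrected query is Python's built-in sqlite3 module; this sketch just reproduces the table from the question:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1(column1 NVARCHAR)")
conn.executemany("INSERT INTO table1 (column1) VALUES (?)", [("p1",), ("p10",), ("p11",), ("p20",)])

haystack = ",p112,p108,p124,p204,p11,p1124,p1,p10,p20,"
# || concatenation searches for each value wrapped in commas
rows = conn.execute("SELECT column1, instr(?, ',' || column1 || ',') FROM table1", (haystack,)).fetchall()
print(rows)  # [('p1', 31), ('p10', 34), ('p11', 21), ('p20', 38)]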

Pandas : Cannot select row from dataframe

Here is my dataframe
Word 1_gram-Probability
0 ('A',) 0.001461
1 ('45',) 0.000730
Now I just want to select the row where Word is 45. I tried
print(simple_df.loc[simple_df['Word']=='45'])
but i get
Empty DataFrame
What am I missing? Is this the correct way of accessing the row? I also tried ('45',) as the value but that did not work either.
It appears that you have the literal string value "('45',)" in the cell of your dataframe, so you must match it exactly as that string:
simple_df.loc[simple_df['Word']=="('45',)"]
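A short sketch of both options, rebuilding the dataframe from the printout in the question: match the literal string, or parse the cells back into tuples first (ast.literal_eval is my addition here, not something from the original post):

import ast
import pandas as pd

# the Word cells hold the string "('45',)", not a tuple
simple_df = pd.DataFrame({"Word": ["('A',)", "('45',)"], "1_gram-Probability": [0.001461, 0.000730]})

# option 1: compare against the exact literal string
print(simple_df.loc[simple_df["Word"] == "('45',)"])

# option 2: parse the strings into real tuples, then compare with the tuple ('45',)
parsed = simple_df["Word"].apply(ast.literal_eval)
print(simple_df.loc[parsed.apply(lambda t: t == ("45",))])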

Difference between na().drop() and filter(col.isNotNull) (Apache Spark)

Is there any difference in semantics between df.na().drop() and df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull() && !df.col("onlyColumnInOneColumnDataFrame").isNaN()) where df is Apache Spark Dataframe?
Or should I consider it a bug if the first one does NOT return null afterwards (not a String "null", but simply a null value) in the column onlyColumnInOneColumnDataFrame while the second one does?
EDIT: added !isNaN() as well. onlyColumnInOneColumnDataFrame is the only column in the given Dataframe. Let's say its type is Integer.
With df.na.drop() you drop the rows containing any null or NaN values.
With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) you drop those rows which have null only in the column onlyColumnInOneColumnDataFrame.
If you want to achieve the same thing, that would be df.na.drop(subset=["onlyColumnInOneColumnDataFrame"]).
In one case, I had to select records that had NAs or nulls or were >= 0. I could do so by using only the coalesce function, and none of the three functions above.
rdd.filter("coalesce(index_column, 1000) >= 0")
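A small PySpark sketch of the difference, using the column name from the question (an Integer column, so the NaN check adds nothing here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (None,)], "onlyColumnInOneColumnDataFrame: int")

# drops rows that have null (or NaN) in any column
df.na.drop().show()
# drops rows that have null in this one column only
df.filter(df["onlyColumnInOneColumnDataFrame"].isNotNull()).show()
# equivalent to the filter above, restricted to the named column
df.na.drop(subset=["onlyColumnInOneColumnDataFrame"]).show()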
