Hive ORC table empty string - string

I have a Hive table with data stored as ORC.
I write empty values (blank, '') into some fields, but sometimes when I run a select query on this table the empty string columns are shown as NULL in the query result.
I would like to see the empty values I entered. How is this possible?

If you want to see empty values instead of NULL in a Hive table, you can use the NVL function, which returns a default value when a column value is NULL.
The syntax is:
NVL(arg1, arg2) - arg1 is an expression or column, and arg2 is the default value returned for NULL values.
e.g. Query - SELECT NVL(blank, '') AS blank_1 FROM db.table;
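To illustrate the idea outside Hive, here is a minimal sketch using Python's built-in sqlite3 module. SQLite has no NVL, so the sketch uses COALESCE, which Hive also supports and which behaves the same way for two arguments. The table name t and the sample rows are hypothetical. Note that after the replacement a stored NULL and a stored empty string can no longer be told apart.

```python
import sqlite3


def select_with_default(rows):
    """Return (id, blank_1) pairs with NULL replaced by an empty string."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table standing in for db.table from the answer.
    conn.execute("CREATE TABLE t (id INTEGER, blank TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    # COALESCE returns its first non-NULL argument, like NVL(arg1, arg2).
    result = conn.execute(
        "SELECT id, COALESCE(blank, '') AS blank_1 FROM t ORDER BY id"
    ).fetchall()
    conn.close()
    return result


# One regular value, one NULL, and one genuine empty string.
sample_rows = [(1, "abc"), (2, None), (3, "")]
print(select_with_default(sample_rows))
```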

Related

null column NOT IN list of strings strange results

I am getting weird results when using a Spark SQL statement like:
select * from mytab where somecol NOT IN ('ABC','DEF')
If I set somecol to ABC it returns nothing. If I set it to XXX it returns a row.
However, if I leave the column blank, like ,, in the CSV data (so the value is read as null), it still does not return anything, even though null is not in the list of values.
This remains the case even if re-written as NOT(somecol IN ('ABC','DEF')).
I feel like this is to do with comparisons between null and strings, but I am not sure what to do about null column values that end up in IN or NOT IN clauses.
Do I need to convert them to empty strings first?
You can add an explicit check for nulls to the query, because a comparison with null evaluates to unknown in Spark SQL:
select * from mytab where somecol NOT IN ('ABC','DEF') or somecol is null
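The three-valued logic behind this is standard SQL behaviour and can be reproduced without Spark. The following is a rough sketch using Python's built-in sqlite3 module rather than Spark itself; the id column and the three sample rows are assumptions added for illustration.

```python
import sqlite3


def query_ids(where_clause):
    """Return the ids of the sample rows that satisfy the given WHERE clause."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytab (id INTEGER, somecol TEXT)")
    # Row 1 is in the list, row 2 is not, row 3 is NULL.
    conn.executemany(
        "INSERT INTO mytab VALUES (?, ?)",
        [(1, "ABC"), (2, "XXX"), (3, None)],
    )
    rows = conn.execute(
        f"SELECT id FROM mytab WHERE {where_clause} ORDER BY id"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# NULL NOT IN (...) evaluates to unknown, so the NULL row is filtered out.
without_null_check = query_ids("somecol NOT IN ('ABC','DEF')")
# Wrapping the IN test in NOT(...) does not help: NOT unknown is still unknown.
negated_in = query_ids("NOT (somecol IN ('ABC','DEF'))")
# The explicit IS NULL branch brings the NULL row back.
with_null_check = query_ids("somecol NOT IN ('ABC','DEF') OR somecol IS NULL")

print(without_null_check, negated_in, with_null_check)
```

Converting nulls to empty strings first would also work, but the explicit IS NULL branch keeps the stored data unchanged.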

How can I write a query with an order by clause when the field I'm ordering by is based on the value of another column?

I'm trying to write a query where the order by clause depends on the value of a key named sortBy in a JSON column.
const userId = req.params.id
SELECT * FROM users
WHERE ...
ORDER BY (SELECT JSON_EXTRACT(preferences, '$.sortBy') FROM users where id=${userId})
I think the problem is that the subquery returns the value "last_active" (one of the columns in the users table) as a text string enclosed in quotes, whereas the order by clause requires a column name without quotes?

Azure SQL: join of 2 tables with 2 unicode fields returns empty when matching records exist

I have a table with a few key columns created as nvarchar(80), i.e. Unicode.
I can list the full dataset with a SELECT * statement (Table1) and can confirm that the values I need to filter on are there.
However, I can't get any results from that table if I filter rows using alphabetic characters as input on any column.
The columns in Table1 store values in Cyrillic characters.
I know it must have to do with character encoding: what I see in the result list is not what I use as input characters.
The Unicode nvarchar type should resolve this character type mismatch automatically.
What do you suggest I do in order to get results?
Thank you very much.
Paulo

How to query ambiguous data types in Athena?

I have a data set stored in Parquet files crawled from S3 and registered in Glue Data Catalog. Some of the columns are of ambiguous type.
For example column col is typed as struct<long:bigint,string:string>.
If I select from that table tbl, then values of col are displayed for example like this:
{long=16, string=null}
{long=null, string=15.2}
What I would like to do now is query specifically those rows where col was classified as a string.
How would I do that?
(What would a query have to look like for filtering rows from tbl whose value in the col column is classified as long and > 10?)
You can filter numeric values like this:
... WHERE col.long > 10
You can filter string values that are actually numbers using the Presto try function, like this:
... WHERE try(CAST(col.string AS bigint)) > 10

HIVE rendered timestamp column data as NULL

I am trying to create an external table using Hive. Below is the Hive query I ran:
create external table trips_raw
(
VendorID int,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location '/user/taxi_trips/';
When I looked at the output from the 'trips_raw' table created by the query above, I saw that both the 'tpep_pickup_datetime' and 'tpep_dropoff_datetime' columns are 'NULL' in all rows. I have seen other threads say that the reason is that the '1/1/2018 11:13:00 AM' timestamp format is not accepted by Hive, but the problem is that this is the timestamp format I have in my csv source data (as you can see from screenshot here).
I could specify those 2 timestamp columns as 'string' and Hive will be able to render them correctly, but I still would want those 2 columns to be 'timestamp' type so specifying those 2 columns as 'string' type is not a viable option here.
I had also tried the following technique using recommendation from this site (https://community.hortonworks.com/questions/55266/hive-date-time-problem.html) but had no success:
Create the 'trips_raw' table using 'string' as type for the 2 timestamp columns. This allows the resulting table to render the timestamps correctly, albeit in 'string' type. The Hive command I used is shown below:
create external table trips_raw
(
VendorID int,
tpep_pickup_datetime string,
tpep_dropoff_datetime string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location
'/user/taxi_trips/';
When I look at the resulting table, the dates are shown as string as you can see from this screenshot below.
But as I had mentioned earlier, I want the time columns to be in timestamp type and not string type. Therefore in the next 2 steps I tried to create a blank table and then insert the data from the table created from Step 1 but converting the string to timestamp this time.
Create an external blank table called 'trips_not_raw' using the following Hive commands:
create external table trips_not_raw
(VendorID int,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp
);
Insert data from 'trips_raw' table (which was mentioned earlier in this question), using the Hive commands below:
insert into table trips_not_raw select vendorid,
from_unixtime(unix_timestamp(tpep_pickup_datetime, 'MM/dd/yyyy HH:mm:ss aa')) as tpep_pickup_datetime,
from_unixtime(unix_timestamp(tpep_dropoff_datetime, 'MM/dd/yyyy HH:mm:ss aa')) as tpep_dropoff_datetime
from trips_raw;
Doing this inserts the rows into the blank table 'trips_not_raw', but the results from the 2 timestamp columns still showed as 'Null' as you can see from the screenshot below:
Is there a simple way to store the 2 time columns as 'timestamp' type and not 'string', but still be able to render them correctly in the output without seeing 'Null/None'?
I'm afraid you need to parse the timestamp column and then cast the resulting string to a timestamp. For example:
select cast(regexp_replace('1/1/2018 11:13:00 AM', '(\\d{1,2})/(\\d{1,2})/(\\d{4})\\s(\\d{2}:\\d{2}:\\d{2}) \\w{2}', '$3-$1-$2 $4') as timestamp)
You can create and use a macro function for convenience, e.g.:
create temporary macro parse_date (ts string)
cast(regexp_replace(ts, '(\\d{1,2})/(\\d{1,2})/(\\d{4})\\s(\\d{2}:\\d{2}:\\d{2}) \\w{2}', '$3-$1-$2 $4') as timestamp);
Then use it as follows:
select parse_date('1/1/2018 11:13:00 AM');
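One caveat worth checking against your data: the regular expression above matches the AM/PM marker with \w{2} but does not capture it, and it expects a two-digit hour, so afternoon times would not be shifted to 24-hour values. The sketch below is written in Python rather than HiveQL and only illustrates what a 12-hour-aware conversion of the same input format would produce. The function name to_hive_timestamp_string and the PM sample value are hypothetical.

```python
from datetime import datetime


def to_hive_timestamp_string(text):
    """Convert 'M/d/yyyy h:mm:ss AM|PM' text to 'yyyy-MM-dd HH:mm:ss'."""
    # %I is the 12-hour clock hour and %p the AM/PM marker, so PM values
    # are shifted into the 13-23 range.
    parsed = datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


# The sample value from the question, plus a hypothetical afternoon value.
print(to_hive_timestamp_string("1/1/2018 11:13:00 AM"))
print(to_hive_timestamp_string("1/1/2018 1:05:00 PM"))
```

The output strings are in the yyyy-MM-dd HH:mm:ss layout, which is the form the cast in the answer is meant to produce.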
