Removing nulls from Array in Presto

I'm extracting data from a JSON column in Presto and getting the output as an array like this: [AL, null, NEW]. The problem is I need to remove the null, since the array has to be mapped to another array. I tried several options but no luck. How can I remove the null and get only [AL, NEW] without unnesting?

You can use filter() for this:
trino> SELECT filter(ARRAY['AL', null, 'NEW'], e -> e IS NOT NULL);
_col0
-----------
[AL, NEW]
(1 row)
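
Since the question mentions that the filtered array then has to be mapped to another array, a rough sketch (the table, column, and mapping lambda below are made up for illustration) is to chain filter() with transform():
-- hypothetical table/column names; the upper() lambda stands in for whatever mapping you need
SELECT transform(
    filter(my_json_array, e -> e IS NOT NULL),  -- drop the nulls first
    e -> upper(e)                               -- then map each remaining element
) AS mapped_array
FROM my_table;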

Related

Null column NOT IN list of strings: strange results

I am getting weird results when using a Spark SQL statement like:
select * from mytab where somecol NOT IN ('ABC','DEF')
If I set somecol to ABC it returns nothing. If I set it to XXX it returns a row.
However, if I leave the column blank, like ,, in the CSV data (so the value is read as null), it still does not return anything, even though null is not in the list of values.
This remains the case even if re-written as NOT(somecol IN ('ABC','DEF')).
I feel like this is to do with comparisons between null and strings, but I am not sure what to do about null column values that end up in IN or NOT IN clauses.
Do I need to convert them to empty strings first?
You can put an explicit check for nulls in the query, since a null comparison returns unknown in Spark (details here):
select * from mytab where somecol NOT IN ('ABC','DEF') or somecol is null
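If you would rather keep the NOT IN as-is, another sketch (assuming an empty string is an acceptable stand-in for null in this column) is to coalesce the value first, so the comparison always yields true or false instead of unknown:
-- null becomes '' before the comparison, so the row is no longer filtered out
select * from mytab where coalesce(somecol, '') NOT IN ('ABC','DEF')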

How to drop rows with nulls in one column pyspark

I have a dataframe and I would like to drop all rows with a NULL value in one of the columns (a string column). I can easily get the count of those rows:
df.filter(df.col_X.isNull()).count()
I have tried dropping them using the following command. It executes, but the count still comes back positive:
df.filter(df.col_X.isNull()).drop()
Other attempts return an 'object is not callable' error.
Use either drop with subset:
df.na.drop(subset=["col_X"])
or isNotNull()
df.filter(df.col_X.isNotNull())
DataFrames are immutable, so applying a filter that keeps only non-null values creates a new DataFrame which won't have the records with null values.
df = df.filter(df.col_X.isNotNull())
if you want to drop any row in which any value is null, use
df.na.drop()  # same as df.na.drop("any"); the default is "any"
to drop a row only if all of its values are null, use
df.na.drop("all")
to drop based only on a given list of columns, use
df.na.drop("all", subset=["col1", "col2", "col3"])
another variation is:
from pyspark.sql.functions import col
df = df.where(col("columnName").isNotNull())
sometimes you may also want to filter out empty strings along with the nulls:
df = df.filter(df.col_X.isNotNull() & (df.col_X != ""))
You can use the expr() function, which accepts SQL-like query syntax:
from pyspark.sql.functions import expr
filteredDF = rawDF.filter(expr("col_X is not null")).filter("col_Y is not null")
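Putting it together, a minimal self-contained sketch (the data and column names here are made up) showing that filter() returns a new DataFrame which you have to keep:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up sample data: col_X contains one null and one empty string
df = spark.createDataFrame([("a", 1), (None, 2), ("", 3)], ["col_X", "col_Y"])

print(df.filter(df.col_X.isNull()).count())  # 1 row has a null in col_X

# filter() does not modify df in place; reassign to keep the filtered result
df = df.filter(df.col_X.isNotNull() & (df.col_X != ""))
print(df.count())  # 1 row remains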

Difference between na().drop() and filter(col.isNotNull) (Apache Spark)

Is there any difference in semantics between df.na().drop() and df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull() && !df.col("onlyColumnInOneColumnDataFrame").isNaN()), where df is an Apache Spark DataFrame?
Or should I consider it a bug if the first one does NOT return nulls afterwards (not a String "null", but simply a null value) in the column onlyColumnInOneColumnDataFrame, while the second one does?
EDIT: added !isNaN() as well. onlyColumnInOneColumnDataFrame is the only column in the given DataFrame. Let's say its type is Integer.
With df.na.drop() you drop the rows containing any null or NaN values.
With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) you drop those rows which have null only in the column onlyColumnInOneColumnDataFrame.
If you wanted to achieve the same thing with na.drop, that would be df.na.drop(subset=["onlyColumnInOneColumnDataFrame"]).
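A quick way to see the difference is a minimal PySpark sketch with a made-up two-column DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (None, "b"), (3, None)], ["x", "y"])

# na.drop(): a null (or NaN) in *any* column removes the row, so only (1, "a") survives
df.na.drop().show()

# filter(isNotNull): only column x is checked, so (1, "a") and (3, None) survive
df.filter(df["x"].isNotNull()).show()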
In one case, I had to select records where the value was NA/null or >= 0. I could do so using only the coalesce function, and none of the three approaches above:
rdd.filter("coalesce(index_column, 1000) >= 0")

Postgres: insert into a table multiple return values from WITH ROWS AS

I'd like to insert multiple (2) rows into a table in Postgres, and use the ids returned by WITH ROWS AS in an insert to another table (columns id_1 and id_2 in second_table), and return that id as the final result of the query.
first_table is the main data collection, and second_table essentially just links two first_table entries together.
Here's the gist of what I'm trying to do:
WITH row AS (
    INSERT INTO first_table (some_data, more_data)
    VALUES (DEFAULT, DEFAULT), (DEFAULT, DEFAULT)
    RETURNING first_table_id
)
INSERT INTO second_table (id_1, id_2, other_data)
VALUES (???, ???, DEFAULT)
RETURNING second_table_id
I am using Node.js and the node-pg module; I know that this can be done in Node.js by having the first query return the two rows returned by WITH ROWS AS, then running another prepared statement using the two ids for the next insert.
I'm not sure if I can do this through a trigger, since ON INSERT only has one row. Maybe it would have two since these rows are inserted into first_table in the same transaction, but I don't have extensive experience with Postgres/SQL in general so I'm not sure how I could do this via trigger.
The problem I'm having is that the WITH ROWS AS statement returns 2 rows (intended, obviously), but I can't figure out how to access each row independently in the next insert, and I have a feeling that it isn't possible.
Can this be accomplished in a single query, or do I have to use a trigger or Node.js for this?
The second insert needs to be an INSERT INTO ... SELECT ... FROM first_cte. If you can make the assumption that the second generated id is larger than the first (or at least that the ids are distinct and comparable -- which is the case if you use PostgreSQL's serial, or a sequence directly), you can use aggregation:
WITH rows AS (
    INSERT INTO first_table (some_data, more_data)
    VALUES (DEFAULT, DEFAULT), (DEFAULT, DEFAULT)
    RETURNING first_table_id
)
INSERT INTO second_table (id_1, id_2)
SELECT min(first_table_id), max(first_table_id)
FROM rows
RETURNING second_table_id
Note that other_data is omitted: you cannot use DEFAULT in INSERT INTO ... SELECT ....
For a more general solution, you can use arrays:
WITH rows AS (
    INSERT INTO first_table (some_data, more_data)
    VALUES (DEFAULT, DEFAULT), (DEFAULT, DEFAULT)
    RETURNING first_table_id
)
INSERT INTO second_table (id_1, id_2, other_data)
SELECT first_table_ids[1], first_table_ids[2], 5 -- value '5' for "other_data"
FROM (
    SELECT array_agg(first_table_id) AS first_table_ids
    FROM rows
) AS rows_as_array
RETURNING second_table_id

Cassandra CQL: selecting rows that have different values in two columns

I create a table:
CREATE TABLE T (
    I int PRIMARY KEY,
    A text,
    B text
);
Then I add two columns X and Y using:
ALTER TABLE T ADD X int;
CREATE INDEX ON T (X);
ALTER TABLE T ADD Y int;
CREATE INDEX ON T (Y);
I put in some data and now I would like to count the rows which have different values in X and Y (even X < Y would be fine). I tried something like this:
select COUNT(*) from T where X < Y ;
But I'm getting the error no viable alternative at input ';'. It also doesn't work without COUNT, with just a simple *.
Do you have some suggestions how to overcome this error?
I tried using counters instead of integers, but they forced me to put all the non-counter data into the primary key, which wasn't a good idea in my case...
I'm using Cassandra 1.2.6 and CQL 3.
PS: can I perform an UPDATE on all rows, without a WHERE clause or with some dummy one?
As Cassandra prefers simple reads, the Cassandra way to do this is to maintain a boolean flag column on insert/update. With a (secondary) index you can query it faster as well.
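A minimal CQL sketch of that approach (the flag column name is made up, and the application has to compute and write the flag on every insert/update):
-- flag column maintained by the application whenever X and Y are written
ALTER TABLE T ADD x_differs_from_y boolean;
CREATE INDEX ON T (x_differs_from_y);

-- on each write, the application sets the flag based on the values it is writing
UPDATE T SET X = 1, Y = 2, x_differs_from_y = true WHERE I = 42;

-- the indexed flag can then be queried (and counted) directly
SELECT COUNT(*) FROM T WHERE x_differs_from_y = true;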
