To find non-numeric rows, we can do something like the following in Spark SQL:
spark.sql("select * from tabl where UPC rlike '[^0-9]'").show()
Can this same query also be written like the one below? I tested it and it does not seem to work; basically I am trying to use the [:alpha:]/[:digit:]/[:alnum:] POSIX character classes.
spark.sql("select * from tabl where UPC rlike '[^[:digit:]]'").show()
spark-sql> select * from ( select '8787687' col1) where rlike (col1,'^[[:digit:]]+$');
Time taken: 0.02 seconds
spark-sql>
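Spark's rlike uses Java regular expressions, which do not recognize the [[:digit:]] bracket syntax as a POSIX class (which is why the query above returns no rows); the Java equivalents are \p{Digit}, \p{Alpha} and \p{Alnum}. A minimal sketch, reusing the tabl/UPC names from the question and assuming the default Spark SQL string-literal escaping (backslashes must be doubled inside the SQL literal unless spark.sql.parser.escapedStringLiterals is enabled):
select * from tabl where UPC rlike '[^\\p{Digit}]';   -- rows containing at least one non-digit character
select * from tabl where UPC rlike '^\\p{Digit}+$';   -- rows that consist only of digits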
I am trying to execute a query on clustering columns in Amazon Keyspaces. Since I don't want to use ALLOW FILTERING with my native query, I have created 4-5 clustering columns for better performance.
But while trying to filter on 2 clustering columns with >= and <=, I get an error with the message below:
message="Clustering column "start_date" cannot be restricted (preceding column "segment_id" is restricted by a non-EQ relation)"
I also tried a multi-column relation, but I get a not-supported error:
message="MultiColumn relation is not yet supported."
Queries for reference:
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and (segment_id, start_date,end_date)>= (-1, '2022-05-16','2017-03-28') and flag = 1;
or
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and segment_id > -1 and start_date >='2022-05-16';
I am assuming that your table has the following primary key:
CREATE TABLE table_name (
...
PRIMARY KEY(shard_id, division, customer_id, segment_id, start_date, end_date)
)
In any case, your CQL query is invalid because you can only apply an inequality operator on the last clustering column in your query. For example, these are valid queries based on your table schema:
SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id <= ?
SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id = ? AND segment_id > ?
SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id = ? AND segment_id = ? AND start_date >= ?
All preceding columns must be filtered by an equality operator except for the very last clustering column in your query.
If you require a complex predicate for your queries, you will need to index your Cassandra data with tools such as Elasticsearch or Apache Solr. They will allow you to run complex searches to retrieve data from your database. Cheers!
ALLOW FILTERING gets a bad rap sometimes. It all depends on how many rows you end up scanning. It's good to understand how many rows per partition will be scanned and work backwards from there. Only the last clustering column in the query can use inequality operators to bound ranges. Try to order your clustering columns so that the most rows are eliminated first, which reduces the number of rows 'filtered'.
In the example below we use the index for the key columns up to start_date and filter on end_date, segment_id, and flag.
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and start_date >= '2022-05-16' and end_date > '2017-03-28' and segment_id > -1 and flag = 1 allow filtering;
I have a query in Cassandra
select count(pk1) from tableA where pk1=xyz
The table is:
create table tableA
(
pk1 uuid,
pk2 int,
pk3 text,
pk4 int,
fc1 int,
fc2 bigint,
....
fcn blob,
primary key (pk1, pk2, pk3, pk4)
);
The query is executed often and takes up to 2s to execute.
I am wondering if there will be any performance gain from refactoring it to:
select count(1) from tableA where pk1 = xyz
Based on the documentation here, there is no difference between count(1) and count(*).
Generally speaking COUNT(1) and COUNT(*) will both return the number of rows that match the condition specified in your query
This is in line with how traditional SQL databases are implemented.
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
COUNT(1) counts the constant expression 1, which is never NULL, so every matching row is counted.
Also, COUNT(column_name) only counts the non-NULL values of that column.
Since in your case the WHERE condition is on a primary key column, which can never be NULL, the non-NULL distinction is a non-factor, and I don't think there will be any difference in performance in using one over the other. This answer tried confirming the same using some performance tests.
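As a quick illustration of that NULL-handling difference in standard SQL terms (a hypothetical two-row table, not the asker's actual data):
-- hypothetical rows: (pk1 = 'a', fc1 = 10) and (pk1 = 'b', fc1 = NULL)
select count(*)   from tableA;  -- 2: counts every row
select count(1)   from tableA;  -- 2: the literal 1 is never NULL, so every row is counted
select count(fc1) from tableA;  -- 1: the row where fc1 is NULL is skipped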
In general, COUNT is not recommended in Cassandra at all, as it may have to scan across multiple nodes to get your answer back, and I'm not sure the count you get is really consistent.
I am trying to recreate a SQL query in Spark SQL. Normally I would insert into a table like this:
INSERT INTO Table_B
(
primary_key,
value_1,
value_2
)
SELECT DISTINCT
primary_key,
value_1,
value_2
FROM
Table_A
WHERE NOT EXISTS
(
SELECT 1 FROM
Table_B
WHERE
Table_B.primary_key = Table_A.primary_key
);
Spark SQL is straightforward, and I can load data from a TempView into a new Dataset. Unfortunately, I don't know how to reconstruct the WHERE clause.
Dataset<Row> Table_B = spark.sql("SELECT DISTINCT primary_key, value_1, value_2 FROM Table_A").where("NOT EXISTS ... ???" );
Queries with NOT EXISTS in T-SQL can be rewritten as a LEFT JOIN with a WHERE clause:
SELECT Table_A.*
FROM Table_A Left Join Table_B on Table_B.primary_key = Table_A.primary_key
Where Table_B.primary_key is null
Maybe a similar approach can be used in Spark, with a left join. For example, for DataFrames, something like:
import org.apache.spark.sql.functions.{col, isnull}
aDF.join(bDF, aDF("primary_key") === bDF("primary_key"), "left_outer").filter(isnull(col("other_b_not_nullable_column")))
Spark SQL doesn't currently have EXISTS and IN (see "(Latest) Spark SQL / DataFrames and Datasets Guide / Supported Hive Features").
EXISTS & IN can always be rewritten using JOIN or LEFT SEMI JOIN. "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION. AND NOT can be rewritten using EXCEPT.
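For the NOT EXISTS pattern in the question specifically, a left anti join expresses the same semantics in Spark SQL; a sketch assuming Table_A and Table_B are registered as temp views:
-- keeps only the Table_A rows that have no matching primary_key in Table_B
SELECT DISTINCT a.primary_key, a.value_1, a.value_2
FROM Table_A a
LEFT ANTI JOIN Table_B b
  ON b.primary_key = a.primary_key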
Is there a SQL function in Spark SQL which returns the current timestamp? For example, in Impala, NOW() is the function which returns the current timestamp; is there something similar in Spark SQL?
Thanks
Try current_timestamp function.
current_timestamp() - Returns the current timestamp at the start of query evaluation. All calls of current_timestamp within the same query return the same value.
It is possible to use the date and timestamp functions from pyspark.sql.functions.
Example:
spark-sql> select current_timestamp();
2022-05-07 16:43:43.207
Time taken: 0.17 seconds, Fetched 1 row(s)
spark-sql> select current_date();
2022-05-07
Time taken: 5.224 seconds, Fetched 1 row(s)
spark-sql>
Reference:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.current_timestamp.html
You can use the following Scala code to get the current date and timestamp in Spark.
import org.apache.spark.sql.functions._
val newDf = df.withColumn("current_date",current_date())
.withColumn("current_timestamp",current_timestamp())
The result will be something like this.
+------------+-----------------------+
|current_date|current_timestamp |
+------------+-----------------------+
|2022-06-06 |2022-06-06 12:25:55.349|
+------------+-----------------------+
Currently I'm doing this (pagination plus a total count) in Informix:
select a.*, b.total from (select skip 0 first 10 * from TABLE) a,(select count(*) total from TABLE) b
The problem is that I'm repeating the same pattern: I get the first ten results and then I count all the results.
I want to make something like this:
select *, count(*) from TABLE
so I can make my query much faster. Is it possible?
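One possible approach, assuming your Informix version supports OLAP window functions (12.10 or later; treat this as an assumption to verify against your installation), is to attach the total as a windowed count inside a derived table so the table is only scanned once, and paginate the outer query:
select skip 0 first 10 *
from (
    select t.*, count(*) over () as total   -- total row count attached to every row
    from TABLE t
) x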