I'm working with Cassandra + Spark + Spark SQL. I'm not using Hive.
I'd like to create my custom aggregation function, like:
select percentile(column, 0.95) from cassandra_table
Spark SQL supports avg(), min(), etc. I want to implement others, like percentile, but I cannot find documentation on this.
Can someone point me to any doc or class to start with?
Thanks!
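One starting point is Spark's user-defined aggregate function API. Below is a minimal sketch in Java against the Spark 3.x Aggregator API (older versions used UserDefinedAggregateFunction instead; this is my assumption of a reasonable shape, not a tested implementation). It buffers every value in memory and sorts at the end, so it is only sensible for modest group sizes; the table and column names are the ones from the question:

```java
import java.util.ArrayList;
import java.util.Collections;

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Naive percentile aggregate: collects all values, sorts them on finish.
// A production version would use a quantile sketch (e.g. t-digest) instead.
public class PercentileAgg extends Aggregator<Double, ArrayList<Double>, Double> {

    private final double p; // e.g. 0.95, fixed at registration time

    public PercentileAgg(double p) {
        this.p = p;
    }

    @Override
    public ArrayList<Double> zero() {
        return new ArrayList<>();
    }

    @Override
    public ArrayList<Double> reduce(ArrayList<Double> buffer, Double value) {
        if (value != null) {
            buffer.add(value);
        }
        return buffer;
    }

    @Override
    public ArrayList<Double> merge(ArrayList<Double> left, ArrayList<Double> right) {
        left.addAll(right);
        return left;
    }

    @Override
    public Double finish(ArrayList<Double> buffer) {
        if (buffer.isEmpty()) {
            return null;
        }
        Collections.sort(buffer);
        // Index of the p-th percentile in the sorted list (nearest-rank method).
        int idx = (int) Math.ceil(p * buffer.size()) - 1;
        return buffer.get(Math.max(idx, 0));
    }

    @Override
    @SuppressWarnings("unchecked")
    public Encoder<ArrayList<Double>> bufferEncoder() {
        // ArrayList is Serializable, so Java serialization works for the buffer.
        return Encoders.javaSerialization((Class<ArrayList<Double>>) (Class<?>) ArrayList.class);
    }

    @Override
    public Encoder<Double> outputEncoder() {
        return Encoders.DOUBLE();
    }
}
```

Registered once, it is callable from SQL. Note the fraction is baked in at registration rather than passed as a second SQL argument, so you would register one name per percentile you need:

```java
spark.udf().register("percentile95",
    functions.udaf(new PercentileAgg(0.95), Encoders.DOUBLE()));
spark.sql("SELECT percentile95(column) FROM cassandra_table").show();
```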
Related
We store very complex JSON in one of our table columns. I would like to write a parser for this. I was reading through table functions and functions, but I never saw a great guide explaining how to create a function and deploy it to our cluster. Does anyone have any good pointers?
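Assuming a Spark SQL environment like the surrounding questions (the ask doesn't name the engine, so this is a guess), it may be worth trying the built-in JSON functions before writing a custom function. A sketch in Java, where the table name ("events"), column name ("payload"), and schema are all made up for illustration:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Declare only the parts of the JSON you care about; from_json leaves
// fields it cannot parse as null instead of failing the query.
StructType schema = new StructType()
        .add("user", new StructType().add("id", DataTypes.LongType))
        .add("tags", DataTypes.createArrayType(DataTypes.StringType));

Dataset<Row> parsed = spark.table("events")
        .withColumn("parsed", from_json(col("payload"), schema))
        .select(col("parsed.user.id").alias("user_id"),
                explode(col("parsed.tags")).alias("tag"));
parsed.show();
```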
I have a notebook in Databricks where I only have SQL queries. I want to know whether it's better (in terms of performance) to switch all of them to PySpark, or whether it would be the same.
In other words, I want to know if Databricks SQL uses Spark SQL to execute the queries.
I found this question (it looks pretty similar to mine), but the answer is not what I want to know.
Yes, you can definitely use PySpark in place of SQL.
The decision mostly depends on the type of data store. If your data is stored in a database, then SQL is the best option. If you are working with DataFrames, then PySpark is a good option, as it gives you more flexibility and features through its supported libraries.
It uses Spark SQL and the DataFrame APIs.
DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as the DataFrame API. With the Dataset API, you have more control over the actual execution plan than with Spark SQL.
Refer to the PySpark documentation for more details and a better understanding.
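One way to verify this yourself is to compare physical plans: both SQL and the DataFrame API are compiled by Catalyst, so an equivalent query typically produces essentially the same plan and therefore the same performance. A small sketch (shown in Java here, but the same holds in PySpark), with "employees" as a placeholder table:

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The same aggregation expressed two ways.
Dataset<Row> viaSql = spark.sql(
    "SELECT dept, avg(salary) AS avg_salary FROM employees GROUP BY dept");
Dataset<Row> viaDataFrame = spark.table("employees")
    .groupBy(col("dept"))
    .agg(avg(col("salary")).alias("avg_salary"));

// Both print essentially the same physical plan, because both queries
// go through the Catalyst optimizer before execution.
viaSql.explain();
viaDataFrame.explain();
```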
I just got introduced to the Spark SQL higher-order functions transform(), filter(), etc. I searched online, but couldn't find many advanced use cases leveraging these functions.
Can anyone please explain transform() with a couple of advanced, real-life use cases using SQL queries? Does it always need to work on nested complex types (arrays, structs, etc.), or can it be used to process simple data-type records as well?
Any help is appreciated.
Thanks
The following online resource demonstrates these functions in %sql mode:
https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
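To give a self-contained flavor of the syntax (the column and values are made up): transform() maps a lambda over each element of an array column, and filter() keeps the elements matching a predicate. They do operate on complex types such as arrays, so for a plain scalar column you would use ordinary expressions, or wrap the value with array() first:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// transform() rewrites each element; filter() keeps matching elements.
// The inline array literal stands in for a real array column.
Dataset<Row> result = spark.sql(
    "SELECT transform(prices, p -> p * 1.1) AS with_tax, " +
    "       filter(prices, p -> p > 100)    AS expensive " +
    "FROM (SELECT array(50.0, 120.0, 300.0) AS prices) t");
result.show(false);
```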
How can I perform (SQL-like) IN queries in Aerospike on a secondary index? Do we need a UDF for this?
Something like this: SELECT * FROM ns.set WHERE si_bin IN (1, 2, 3)
Is there anything available in the Aerospike Java client?
PS: I don't want a range query or anything of that sort.
You can use predicate filtering. https://www.aerospike.com/docs/guide/predicate.html
Python client documentation for predicate filtering has examples for using the aerospike.predexp helper.
The Java client has examples for class PredExp in the repo.
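For illustration, here is a rough sketch using the Java client's PredExp, with the namespace, set, and bin names taken from the question. Note that PredExp predicates are written in postfix order, and that this API was later superseded by the Expressions API in newer client versions:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Record;
import com.aerospike.client.query.PredExp;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

// Equivalent of: SELECT * FROM ns.set WHERE si_bin IN (1, 2, 3)
// Three equality predicates combined by a single OR over all three.
// Without an index Filter this runs as a predicate-filtered scan.
Statement stmt = new Statement();
stmt.setNamespace("ns");
stmt.setSetName("set");
stmt.setPredExp(
    PredExp.integerBin("si_bin"), PredExp.integerValue(1), PredExp.integerEqual(),
    PredExp.integerBin("si_bin"), PredExp.integerValue(2), PredExp.integerEqual(),
    PredExp.integerBin("si_bin"), PredExp.integerValue(3), PredExp.integerEqual(),
    PredExp.or(3));

try (RecordSet rs = client.query(null, stmt)) {
    while (rs.next()) {
        Record record = rs.getRecord();
        System.out.println(record);
    }
}
```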
I am currently working with Presto 0.80. I have to write a user-defined function to convert degrees Celsius to degrees Fahrenheit in a SELECT query. I did the same using HiveQL, but was wondering whether we can replicate it in Facebook Presto.
Any help would be highly appreciated.
Thanks!!
Here is a guide for writing a new function in Presto:
https://trino.io/docs/current/develop/functions.html
After writing your function, add the plugin to the plugin directory, as explained in the SPI Overview.
There is another example of writing a Presto UDF on the Qubole blog:
http://www.qubole.com/blog/product/plugging-in-presto-udfs/
You can give that approach a try and see if it works for your version.
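For orientation, a scalar function for this conversion might look roughly like the sketch below. Caveat: the annotation and package names shown here are from much newer prestodb releases than 0.80, so treat this as the general shape and check the SPI of your exact version:

```java
import com.facebook.presto.common.type.StandardTypes;
import com.facebook.presto.spi.function.Description;
import com.facebook.presto.spi.function.ScalarFunction;
import com.facebook.presto.spi.function.SqlType;

public final class TemperatureFunctions {

    private TemperatureFunctions() {}

    // Usable in a query as: SELECT celsius_to_fahrenheit(temp_c) FROM readings
    @Description("Converts degrees Celsius to degrees Fahrenheit")
    @ScalarFunction("celsius_to_fahrenheit")
    @SqlType(StandardTypes.DOUBLE)
    public static double celsiusToFahrenheit(@SqlType(StandardTypes.DOUBLE) double celsius) {
        return celsius * 9.0 / 5.0 + 32.0;
    }
}
```

The class is then exposed through your plugin and deployed to the plugin directory, as the SPI Overview linked above describes.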