How to cache subquery result in WITH clause in Spark SQL

I wonder if Spark SQL supports caching the result of a query defined in a WITH clause.
The Spark SQL query is something like this:
with base_view as
(
select some_columns from some_table
WHERE
expensive_udf(some_column) = true
)
... multiple queries joining on this view
While this query works in Spark SQL, I noticed that the UDF was applied to the same data set multiple times.
In this use case, the UDF is very expensive. So I'd like to cache the query result of base_view so the subsequent queries would benefit from the cached result.
P.S. I know you can create and cache a table with the given query and then reference it in the subqueries. In this specific case, though, I can't create any tables or views.

That is not possible. The WITH result cannot be persisted after execution or substituted into a new Spark SQL invocation.
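If the only constraint is that no table or view may be created, one workaround outside pure SQL (a minimal PySpark sketch, not part of the original answer; other_table and key are placeholder names) is to evaluate the expensive subquery once, cache it as a DataFrame, and reuse that DataFrame in the later joins:
# Sketch: run the expensive UDF filter once, cache it, and reuse the DataFrame.
base_view = spark.sql("""
    SELECT some_columns
    FROM some_table
    WHERE expensive_udf(some_column) = true
""")
base_view.cache()   # keep the filtered rows in memory after the first use
base_view.count()   # optional: force materialization up front
# Reuse the cached DataFrame instead of repeating the WITH subquery
# (other_table and key are placeholders for the real join targets).
result = base_view.join(spark.table("other_table"), on="key")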

The WITH clause allows you to give a name to a temporary result set so it can be reused several times within a single query. I believe what he's asking for is a materialized view.

This can be done by executing several SQL queries.
# first cache the subquery result
spark.sql("""
CACHE TABLE base_view AS
select some_columns
from some_table
WHERE expensive_udf(some_column) = true
""")
# then use the cached view
spark.sql("""
... multiple queries joining on this view
""")

Not sure if you are still interested in a solution, but the following is a workaround that accomplishes the same thing:
spark.sql("""
| create temp view my_view
| as
| WITH base_view as
| (
| select some_columns
| from some_table
| WHERE
| expensive_udf(some_column) = true
| )
| SELECT *
| from base_view
""");
spark.sql("""CACHE TABLE my_view""");
Now you can use the my_view temp view to join to other tables, as shown below:
spark.sql("""
| select mv.col1, t2.col2, t3.col3
| from my_view mv
| join tab2 t2
| on mv.col2 = t2.col2
| join tab3 t3
| on mv.col3 = t3.col3
""");
Remember to uncache the view after use:
spark.sql("""UNCACHE TABLE my_view""");
Hope this helps.

Related

Can't use a newly created derived column in the same Spark SQL code

I am creating a new column using an "AS" statement in PySpark SQL code:
accounts_2.price - COALESCE(cast(accounts_2.price as Numeric(38,8)), 0) AS follow_on_price
As you can see, I am creating a new column "follow_on_price", but when I try to use this newly created column in the same Spark SQL code:
, accounts_2.unlock_price - COALESCE(cast(accounts_2.upfront_price as Numeric(38,8)), 0) AS follow_on_price
, follow_on_price * exchange_rate_usd AS follow_on_price_usd
Spark SQL does not recognize follow_on_price when it is used immediately in the same query, but when I create a new temp view and use it as a new table for the next step, it works. Please explain why. Why can't Spark SQL resolve the new column reference within the same query, so that I don't have to add an extra step for "follow_on_price * exchange_rate_usd AS follow_on_price_usd" and it can be done in a single step, like in normal SQL such as Postgres?
It's standard SQL behavior, to prevent ambiguities in the query: you cannot reference column aliases in the same SELECT list.
You can use an inner query instead, like below:
>>> data2 = [{"COL_A": 'A',"COL_B": "B","price":212891928.90},{"COL_A": "A1","COL_B": "cndmmkdssac","price":9496943.4593},{"COL_A": 'A',"COL_B": "cndds","price":4634609.6994}]
>>> df=spark.createDataFrame(data2)
>>> df.show()
+-----+-----------+-------------+
|COL_A| COL_B| price|
+-----+-----------+-------------+
| A| B|2.128919289E8|
| A1|cndmmkdssac| 9496943.4593|
| A| cndds| 4634609.6994|
+-----+-----------+-------------+
>>> df.registerTempTable("accounts_2")
>>> spark.sql("select s1.follow_on_price, s1.follow_on_price*70 as follow_on_price_usd from (select COALESCE(accounts_2.price) AS follow_on_price from accounts_2) s1")
+---------------+-------------------+
|follow_on_price|follow_on_price_usd|
+---------------+-------------------+
| 2.128919289E8| 1.4902435023E10|
| 9496943.4593| 6.64786042151E8|
| 4634609.6994| 3.24422678958E8|
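For comparison, the same idea with the DataFrame API (a sketch reusing the toy df from above; the factor 70 is just the illustrative exchange rate). Each withColumn returns a new DataFrame, so a later column can reference an earlier one without an inner query:
>>> from pyspark.sql import functions as F
>>> df2 = df.withColumn("follow_on_price", F.coalesce(df.price)) \
...         .withColumn("follow_on_price_usd", F.col("follow_on_price") * 70)
>>> df2.select("follow_on_price", "follow_on_price_usd").show()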

Handle duplicate data in Kusto

Problem description:
When data is received into the source_table, the update policy in Kusto runs to store the data in end_point_table.
The update function should handle duplicate data and store only the new data in the end_point_table. That means that if we get data from the source_table that is the same as what we already have in the end_point_table, then no data should be stored.
What I did:
The end_point_table already contains data:
.ingest inline into table end_point_table <|
1,2020-01-01T12:00:00Z,property,128
I have a source table called source_table, into which I ingest data like the following:
.ingest inline into table source_table <|
1,2020-01-01T12:00:00Z,128
.ingest inline into table source_table <|
1,2020-01-01T12:00:00Z,property,128
The following function is triggered automatically:
let _incoming =(
source_table
| where property == "property"
| project device_id, timestamp, value
| distinct *
);
let _old_data = (
end_point_table
);
_incoming
| join kind = leftouter(
_old_data
| summarize arg_max(timestamp, *) by device_id
) on device_id
| where (
timestamp != timestamp1
or value != value1
)
| project device_id, timestamp, value
Result:
When I query the data after the ingestion, I get three rows instead of one, like this:
1,2020-01-01T12:00:00Z,property,128
1,2020-01-01T12:00:00Z,property,128
1,2020-01-01T12:00:00Z,property,128
The question is:
Is there any solution to avoid ingesting duplicated data into the end_point_table, or did I use the update policy wrongly?
Update policies are not the right approach for this.
There are multiple correct approaches to handling dedup; please read about them here:
https://learn.microsoft.com/en-us/azure/data-explorer/dealing-with-duplicates

Is it possible to chain subsequent queries' where clauses in Dapper based on the results of a previous query in the same connection?

Is it possible to use .QueryMultiple (or some other method) in Dapper so that the results of each former query are used in the where clause of the next query, without having to run each query individually, get the id, .Query again, get the id, and so on?
For example,
string sqlString = @"select tableA_id from tableA where tableA_lastname = @lastname;
select tableB_id from tableB WHERE tableB_id = tableA_id";
db.QueryMultiple(sqlString, new {lastname = "smith"});
Is something like this possible with Dapper or do I need a view or stored procedure to accomplish this? I can use multiple joins for one SQL statement, but in my real query there are 7 joins, and I didn't think I should return 7 objects.
Right now I'm just using object.
You can store the result of each previous query in a table variable, then select from that variable and use it in the next query, for example:
DECLARE @TableA AS TABLE(
  tableA_id INT
  -- ... all other columns you need ...
)
INSERT @TableA
SELECT tableA_id
FROM tableA
WHERE tableA_lastname = @lastname

SELECT *
FROM @TableA

SELECT tableB_id
FROM tableB
JOIN @TableA a ON tableB_id = a.tableA_id

Result display showing weird with sql : Spark

I am trying to do some analysis with Spark. I tried the same query with foreach, which shows the results correctly, but if I use show or run it in %sql it is weird; it is not showing anything.
sqlContext.sql("select distinct device from TestTable1 where id = 23233").collect.foreach(println)
[ipad]
[desktop]
[playstation]
[iphone]
[android]
[smarTv]
gives the proper devices, but if I use just show or any SQL:
sqlContext.sql("select distinct device from TestTable1 where id = 23233").show()
%sql
select distinct device from TestTable1 where id = 23233
+-----------+
|device |
+-----------+
| |
| |
|ion|
| |
| |
| |
+-----------+
I need graphs and charts, so I would like to use %sql. But this is giving weird results with %sql. Does anyone have any idea why I am getting output like this?
show is a formatted output of your data, whereas collect.foreach(println) is merely printing the Row data. They are two different things. If you want to format your data in a specific way, then stick with foreach...keeping in mind you are printing a sequence of Row. You'll have to pull the data out of the row if you want to get your own formatting for each column.
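For example, a small sketch (in PySpark, matching the question's table and column names) of pulling the value out of each Row and formatting it yourself:
rows = sqlContext.sql("select distinct device from TestTable1 where id = 23233").collect()
for row in rows:
    # each element is a Row; index it by column name and format however you like
    print("device = %s" % row["device"])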
I can probably provide more specific information if you provide the version of spark and zeppelin that you are using.
You stated that you are using %sql because you need Zeppelin's graphs and charts--i.e. you wouldn't be swapping to %sql if you didn't have to.
You can just stick with using Spark dataframes by using z.show(), for example:
%pyspark
df = sqlContext.createDataFrame([
(23233, 'ipad'),
(23233, 'ipad'),
(23233, 'desktop'),
(23233, 'playstation'),
(23233, 'iphone'),
(23233, 'android'),
(23233, 'smarTv'),
(12345, 'ipad'),
(12345, 'palmPilot'),
], ('id', 'device'))
foo = df.filter('id = 23233').select('device').distinct()
z.show(foo)
In the above, z.show(foo) renders the default Zeppelin table view, with options for the other chart types.

How many events are stored in my PredictionIO event server?

I imported an unknown number of events into my PIO eventserver and now I want to know that number (in order to measure and compare recommendation engines). I could not find an API for that, so I had a look at the MySQL database my server uses. I found two tables:
mysql> select count(*) from pio_event_1;
+----------+
| count(*) |
+----------+
| 6371759 |
+----------+
1 row in set (8.39 sec)
mysql> select count(*) from pio_event_2;
+----------+
| count(*) |
+----------+
| 2018200 |
+----------+
1 row in set (9.79 sec)
Both tables look very similar, so I am still unsure.
Which table is relevant? What is the difference between pio_event_1 and pio_event_2?
Is there a command or REST API where I can look up the number of stored events?
You could go through the Spark shell, as described in the troubleshooting docs.
Launch the shell with
pio-shell --with-spark
Then find all events for your app and count them
import io.prediction.data.store.PEventStore
PEventStore.find(appName="MyApp1")(sc).count
You could also filter to find different subsets of events by passing more parameters to find. See the API docs for more details. The LEventStore is also an option.
Connect to your database
\c db_name
List tables
\dt;
Run query
select count(*) from pio_event_1;
PHP
<?php
$dbconn = pg_connect("host=localhost port=5432 dbname=db_name user=postgres");
$result = pg_query($dbconn, "select count(*) from pio_event_1");
if (!$result) {
echo "An error occurred.\n";
exit;
}
// Not the best way, but output the total number of events.
while ($row = pg_fetch_row($result)) {
echo '<P><center>'.number_format($row[0]) .' Events</center></P>';
} ?>
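If you would rather do the same check from Python, here is a roughly equivalent sketch with psycopg2 (assuming the same host, database, and table as the PHP example above):
import psycopg2

# Connect with the same parameters as the PHP example above (adjust as needed).
conn = psycopg2.connect(host="localhost", port=5432, dbname="db_name", user="postgres")
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM pio_event_1")
    (total,) = cur.fetchone()
    print("{:,} events".format(total))
conn.close()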
