I am getting a compilation error while using the following query in U-SQL:
#CourseDataExcludingUpdatedCourse = SELECT * FROM #CourseData AS cd
WHERE cd.CourseID NOT IN (SELECT CourseID FROM #UpdatedCourseData);
It is not allowing me to use a NOT IN clause with a subquery. I want to return all records that are not present in #UpdatedCourseData.
How can I achieve this in U-SQL?
In U-SQL, NOT EXISTS is implemented as ANTISEMIJOIN, something like this:
#CourseDataExcludingUpdatedCourse =
SELECT cd.*
FROM #CourseData AS cd
ANTISEMIJOIN
#UpdatedCourseData AS us
ON cd.CourseID == us.CourseID;
See here for more info:
https://msdn.microsoft.com/en-us/library/azure/mt621330.aspx
#employees =
SELECT * FROM
( VALUES
(1, "Noah", 100, (int?)10000, new DateTime(2012,05,31)),
(2, "Sophia", 100, (int?)15000, new DateTime(2012,03,19)),
(3, "Liam", 100, (int?)30000, new DateTime(2014,09,14)),
(4, "Amy", 100, (int?)35000, new DateTime(1999,02,27)),
(5, "Justin", 600, (int?)15000, new DateTime(2015,01,12)),
(6, "Emma", 200, (int?)8000, new DateTime(2014,03,08)),
(7, "Jacob", 200, (int?)8000, new DateTime(2014,09,02)),
(8, "Olivia", 200, (int?)8000, new DateTime(2013,12,11)),
(9, "Mason", 300, (int?)50000, new DateTime(2016,01,01)),
(10, "Ava", 400, (int?)15000, new DateTime(2014,09,14))
) AS T(EmpID, EmpName, DeptID, Salary, StartDate);
#departments =
SELECT * FROM
( VALUES
(100, "Engineering"),
(200, "HR"),
(300, "Executive"),
(400, "Marketing"),
(500, "Sales"),
(600, "Clerical"),
(800, "Reserved")
) AS T(DeptID, DeptName);
/* T-SQL; Using a subquery with IN
SELECT *
FROM #employees
WHERE DeptID IN
(SELECT DeptID FROM #departments WHERE DeptName IN ('Engineering', 'Executive'));
*/
// U-SQL; Using SEMIJOIN
#result =
SELECT *
FROM #employees AS e
LEFT SEMIJOIN (SELECT DeptID FROM #departments WHERE DeptName IN ("Engineering", "Executive")) AS sc
ON e.DeptID == sc.DeptID;
OUTPUT #result
TO "/Output/ReferenceGuide/Joins/SemiJoins/SubqueryIN.txt"
USING Outputters.Tsv(outputHeader: true);
/* T-SQL; Using a subquery with NOT IN
SELECT *
FROM #employees
WHERE DeptID NOT IN
(SELECT DeptID FROM #departments WHERE DeptName IN ('Engineering', 'Executive'));
*/
// U-SQL; Using ANTISEMIJOIN
#result =
SELECT *
FROM #employees AS e
LEFT ANTISEMIJOIN (SELECT DeptID FROM #departments WHERE DeptName IN ("Engineering", "Executive")) AS sc
ON e.DeptID == sc.DeptID;
OUTPUT #result
TO "/Output/ReferenceGuide/Joins/AntiSemiJoins/SubqueryNOTIN.txt"
USING Outputters.Tsv(outputHeader: true);
// BONUS: Switch "LEFT" to "RIGHT" in the above examples and observe the results.
I have a pandas dataframe that I want to pass into a psycopg2 execute statement as a temporary table. This should be very simple:
pseudo-code...
string = """
with temporary_table (id, value) as (values %s)
select * from temporary_table
"""
cur.execute(string, df)
Where df is just a dataframe with an id and value column.
What would be the syntax to use such that I'd be able to pass this data in as a temporary table and use it in my query?
I would create a temporary table in the Postgres database with df.to_sql (or execute an INSERT query with the values), query it, and delete it at the end of the process.
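A minimal sketch of that approach, assuming a SQLAlchemy engine; the connection string and the staging table name tmp_values are placeholders:
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical connection details and staging table name.
engine = create_engine("postgresql+psycopg2://postgres@localhost:5432/test")
df = pd.DataFrame({"id": [1, 2], "value": [10, 20]})

# Load the dataframe into a staging table, query it, then drop it.
df.to_sql("tmp_values", engine, if_exists="replace", index=False)
with engine.begin() as conn:
    rows = conn.execute(text("SELECT * FROM tmp_values")).fetchall()
    conn.execute(text("DROP TABLE tmp_values"))
print(rows)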
A test case that I think does what you want:
import psycopg2
from psycopg2.extras import execute_values
con = psycopg2.connect("dbname=test user=postgres host=localhost port=5432")
sql_str = """WITH temporary_table (
id,
value
) AS (
VALUES %s
)
SELECT
*
FROM
temporary_table
"""
cur = con.cursor()
execute_values(cur, sql_str, ((1, 2), (2,3)))
cur.fetchall()
[(1, 2), (2, 3)]
Using execute_values from Fast Execution Helpers.
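If the values start out in a pandas dataframe, as in the question, one option is to convert its rows to plain tuples first and reuse the same statement; a small sketch assuming the dataframe has id and value columns:
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": [2, 3]})
records = list(df.itertuples(index=False, name=None))  # [(1, 2), (2, 3)]
execute_values(cur, sql_str, records)
cur.fetchall()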
UPDATE
Sticking to just execute:
import psycopg2
from psycopg2 import sql
con = psycopg2.connect("dbname=test user=postgres host=localhost port=5432")
cur = con.cursor()
input_data = ((1,2), (3,4), (5,6))
sql_str = sql.SQL("""WITH temporary_table (
id,
value
) AS (
VALUES {}
)
SELECT
*
FROM
temporary_table
""").format(sql.SQL(', ').join(sql.Placeholder() * len(input_data)))
cur.execute(sql_str, input_data)
cur.fetchall()
[(1, 2), (3, 4), (5, 6)]
I have this column inside myTable:
myColumn
[red, green]
[green, green, red]
I need to modify it so that I can replace red with 1, green with 2:
myColumn
[1, 2]
[2, 2, 1]
In short, is there a way to apply a CASE clause to each element in the array, row-wise?
The closest I've gotten so far:
select replace(replace(to_json(myColumn), 'red', 1), 'green', 2)
On the other hand, if the column held plain strings, I could simply use:
select (
case
when myColumn='red' then 1
when myColumn='green' then 2
end
) from myTable;
Assuming the dataframe is registered as a temporary view named tmp, use the following SQL statement to get the result.
sql = """
select
collect_list(
case col
when 'red' then 1
when 'green' then 2
end)
myColumn
from
(select mid,explode(myColumn) col
from
(select monotonically_increasing_id() mid,myColumn
from tmp)
)
group by mid
"""
df = spark.sql(sql)
df.show(truncate=False)
In pure Spark SQL, you could convert your array into a string with concat_ws, make the substitutions with regexp_replace and then recreate the array with split.
select split(
regexp_replace(
regexp_replace(
concat_ws(',', myColumn)
, 'red', '1')
, 'green', '2')
, ',') myColumn from df
You could perform a simple transform (Spark 3 onwards):
select transform(myColumn, value ->
    case value
        when 'red' then 1
        when 'green' then 2
    end)
from myTable
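The same idea is also available through the DataFrame API; a small sketch assuming PySpark 3.1+, where pyspark.sql.functions.transform exists:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(["red", "green"],), (["green", "green", "red"],)],
    ["myColumn"],
)

# Elements that match neither branch become null, like a CASE without ELSE.
result = df.withColumn(
    "myColumn",
    F.transform("myColumn", lambda x: F.when(x == "red", 1).when(x == "green", 2)),
)
result.show(truncate=False)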
Let's create some sample data and a map that contains the substitutions you want to make:
// Assumes a spark-shell style environment with a SparkSession named `spark`.
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, Seq("red", "green")),
             (2, Seq("green", "green", "red")))
  .toDF("id", "myColumn")
val values = Map("red" -> "1", "green" -> "2")
The most straightforward way would be to define a UDF that does exactly what you want (note that Spark hands array columns to Scala UDFs as a Seq, not an Array):
val replace = udf((x : Seq[String]) =>
  x.map(value => values.getOrElse(value, value)))
df.withColumn("myColumn", replace('myColumn)).show
+---+---------+
| id| myColumn|
+---+---------+
| 1| [1, 2]|
| 2|[2, 2, 1]|
+---+---------+
Without UDFs, you could transform the array into a string with concat_ws, using separators that are not in your array. Then you can use string functions to make the edits:
val sep = ","
val replace = values
  .foldLeft(col("myColumn")){ case (column, (key, value)) =>
    regexp_replace(column, sep + key + sep, sep + value + sep)
  }
df.withColumn("myColumn", concat(lit(sep), concat_ws(sep+sep, 'myColumn), lit(sep)))
  .withColumn("myColumn", regexp_replace(replace, "(^,)|(,$)", ""))
  .withColumn("myColumn", split('myColumn, sep+sep))
  .show
I am trying to fetch data from a Postgres table using psycopg2.
Here is what I have done.
import psycopg2
con = psycopg2.connect("host=localhost dbname=crm_whatsapp user=odoo password=password")
db_cursor = con.cursor()
sql = """SELECT * from tbl_hospital;"""
db_cursor.execute(sql)
hospital_data = db_cursor.fetchall()
print('hospital_data',hospital_data)
And the output is:
hospital_data [(1, 'hospital1', 1), (2, 'hospital2', 2), (3, 'hospital3', 3), (4, 'hospital4', 1)]
The output does not contain the column headers. I need those too. How can I get them?
The cursor has the metadata in it.
From "Programming Python" by M. Lutz:
...
db_cursor.execute(sql)
colnames = [desc[0] for desc in db_cursor.description]
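For example, the column names can be combined with the fetched rows into a pandas DataFrame (pandas is just one convenient option here; zipping colnames with each row works as well):
import pandas as pd

db_cursor.execute(sql)
colnames = [desc[0] for desc in db_cursor.description]
hospital_df = pd.DataFrame(db_cursor.fetchall(), columns=colnames)
print(hospital_df)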
I need to scan through a Hive table and add values from the first record in a sequence to all linked records.
The logic would be:-
Find the first record (where previous_id is blank).
Find the next record (current_id = previous_id).
Repeat until there are no more linked records.
Add columns from original record to all linked records.
Output results to a Hive table.
Example Source Data:-
current_id previous_id start_date
---------- ----------- ----------
100 01/01/2001
200 100 02/02/2002
300 200 03/03/2003
Example Output Data:-
current_id start_date
---------- ----------
100 01/01/2001
200 01/01/2001
300 01/01/2001
I can achieve this by creating two DataFrames from the source table and performing multiple joins. However, this approach does not seem ideal as data has to be cached to avoid re-querying the source data with each iteration.
Any suggestions on how to approach this problem?
I think you can accomplish this using GraphFrames connected components.
It will help you avoid writing the checkpointing and looping logic yourself. Essentially you create a graph from the current_id and previous_id pairs and use GraphFrames to compute the component for each vertex. The resulting DataFrame can then be joined to the original DataFrame to get the start_date.
from graphframes import *
sc.setCheckpointDir("/tmp/chk")
input = spark.createDataFrame([
(100, None, "2001-01-01"),
(200, 100, "2002-02-02"),
(300, 200, "2003-03-03"),
(400, None, "2004-04-04"),
(500, 400, "2005-05-05"),
(600, 500, "2006-06-06"),
(700, 300, "2007-07-07")
], ["current_id", "previous_id", "start_date"])
input.show()
vertices = input.select(input.current_id.alias("id"))
edges = input.select(input.current_id.alias("src"), input.previous_id.alias("dst"))
graph = GraphFrame(vertices, edges)
result = graph.connectedComponents()
result.join(input.where(input.previous_id.isNull()), result.component == input.current_id)\
.select(result.id.alias("current_id"), input.start_date)\
.orderBy("current_id")\
.show()
Results in the following output:
+----------+----------+
|current_id|start_date|
+----------+----------+
| 100|2001-01-01|
| 200|2001-01-01|
| 300|2001-01-01|
| 400|2004-04-04|
| 500|2004-04-04|
| 600|2004-04-04|
| 700|2001-01-01|
+----------+----------+
Here is an approach that I am not sure sits well with Spark.
There is a lack of a grouping id / key for the data.
Not sure how Catalyst would be able to optimize this - will look at it at a later point in time. Memory errors if the data is too large?
I have made the data more complicated, and this does work. Here goes:
# No grouping key evident, more a linked list with asc current_ids.
# Added more complexity to the example.
# Questions open on performance at scale. Interested to see how well Catalyst handles this.
# Need really some grouping id/key in the data.
from pyspark.sql import functions as f
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
# Started from dataframe.
# Some more realistic data? At least more complex.
columns = ['current_id', 'previous_id', 'start_date']
vals = [
(100, None, '2001/01/01'),
(200, 100, '2002/02/02'),
(300, 200, '2003/03/03'),
(400, None, '2005/01/01'),
(500, 400, '2006/02/02'),
(600, 300, '2007/02/02'),
(700, 600, '2008/02/02'),
(800, None, '2009/02/02'),
(900, 800, '2010/02/02')
]
df = spark.createDataFrame(vals, columns)
df.createOrReplaceTempView("trans")
# Starting data. The null / None entries.
df2 = spark.sql("""
select *
from trans
where previous_id is null
""")
df2.cache()
df2.createOrReplaceTempView("trans_0")
# Loop through the stuff based on traversing the list elements until exhaustion of data, and, write to dynamically named TempViews.
# May need to checkpoint? Depends on depth of chain of linked items.
# Spark not well suited to this type of processing.
dfX_cnt = 1
cnt = 1
while (dfX_cnt != 0):
    tabname_prev = 'trans_' + str(cnt-1)
    tabname = 'trans_' + str(cnt)
    query = "select t2.current_id, t2.previous_id, t1.start_date from {} t1, trans t2 where t1.current_id = t2.previous_id".format(tabname_prev)
    dfX = spark.sql(query)
    dfX.cache()
    dfX_cnt = dfX.count()
    if (dfX_cnt != 0):
        #print('Looping for dynamic creation of TempViews')
        dfX.createOrReplaceTempView(tabname)
        cnt = cnt + 1
# Reduce the TempViews all to one DF. Can reduce an array of DF's as well, but could not find my notes here in this regard.
# Will memory errors occur?
from pyspark.sql.types import *
fields = [StructField('current_id', LongType(), False),
StructField('previous_id', LongType(), True),
StructField('start_date', StringType(), False)]
schema = StructType(fields)
dfZ = spark.createDataFrame(sc.emptyRDD(), schema)
for i in range(0, cnt, 1):
    tabname = 'trans_' + str(i)
    query = "select * from {}".format(tabname)
    df = spark.sql(query)
    dfZ = dfZ.union(df)
# Show final results.
dfZ.select('current_id', 'start_date').sort(col('current_id')).show()
returns:
+----------+----------+
|current_id|start_date|
+----------+----------+
| 100|2001/01/01|
| 200|2001/01/01|
| 300|2001/01/01|
| 400|2005/01/01|
| 500|2005/01/01|
| 600|2001/01/01|
| 700|2001/01/01|
| 800|2009/02/02|
| 900|2009/02/02|
+----------+----------+
Thanks for the suggestions posted here. After trying various approaches I have gone with the following solution which works for multiple iterations (e.g. 20 loops), and does not cause any memory issues.
The "Physical Plan" is still huge, but caching means most of the steps are skipped, keeping performance acceptable.
input = spark.createDataFrame([
(100, None, '2001/01/01'),
(200, 100, '2002/02/02'),
(300, 200, '2003/03/03'),
(400, None, '2005/01/01'),
(500, 400, '2006/02/02'),
(600, 300, '2007/02/02'),
(700, 600, '2008/02/02'),
(800, None, '2009/02/02'),
(900, 800, '2010/02/02')
], ["current_id", "previous_id", "start_date"])
input.createOrReplaceTempView("input")
cur = spark.sql("select * from input where previous_id is null")
nxt = spark.sql("select * from input where previous_id is not null")
cur.cache()
nxt.cache()
cur.createOrReplaceTempView("cur0")
nxt.createOrReplaceTempView("nxt")
i = 1
while True:
    spark.sql("set table_name=cur" + str(i - 1))
    cur = spark.sql(
        """
        SELECT nxt.current_id as current_id,
               nxt.previous_id as previous_id,
               cur.start_date as start_date
        FROM ${table_name} cur,
             nxt nxt
        WHERE cur.current_id = nxt.previous_id
        """).cache()
    cur.createOrReplaceTempView("cur" + str(i))
    i = i + 1
    if cur.count() == 0:
        break

for x in range(0, i):
    spark.sql("set table_name=cur" + str(x))
    cur = spark.sql("select * from ${table_name}")
    if x == 0:
        out = cur
    else:
        out = out.union(cur)
My ADLA solution is being transitioned to Spark. I'm trying to find the right replacement for the U-SQL REDUCE expression to enable:
Read logical partition and store information in a list/dictionary/vector or other data structure in memory
Apply logic that requires multiple iterations
Output results as additional columns together with the original data (the original rows might be partially eliminated or duplicated)
Example of possible task:
Input dataset has sales and return transactions with their IDs and attributes
The solution is supposed to find the most likely sale for each return
Return transaction must happen after the sales transaction and be as similar to the sales transactions as possible (best available match)
Return transaction must be linked to exactly one sales transaction; sales transaction could be linked to one or no return transaction - link is supposed to be captured in the new column LinkedTransactionId
The solution could probably be achieved with the groupByKey command, but I'm failing to identify how to apply the logic across multiple rows. All examples I've managed to find are some variation of an in-line function (usually an aggregate - e.g. .map(t => (t._1, t._2.sum))) which doesn't require information about individual records from the same partition.
Can anyone share an example of a similar solution or point me in the right direction?
Here is one possible solution - feedback, suggestions for a different approach, or examples of iterative Spark/Scala solutions are greatly appreciated:
The example reads Sales and Credit transactions for each customer (CustomerId) and processes each customer as a separate partition (outer mapPartitions loop)
Each credit is mapped to the sale with the closest score (i.e. smallest score difference), using the foreach inner loop inside each partition
The mutable map trnMap prevents double-assignment of transactions and captures updates from the process
Results are output through an iterator into the final dataset dfOut2
Note: in this particular case the same result could have been achieved using window functions without an iterative solution, but the purpose here is to test the iterative logic itself.
import org.apache.spark.sql.SparkSession
import org.apache.spark._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.api.java.JavaRDD
case class Person(name: String, var age: Int)
case class SalesTransaction(
  CustomerId : Int,
  TransactionId : Int,
  Score : Int,
  Revenue : Double,
  Type : String,
  Credited : Double = 0.0,
  LinkedTransactionId : Int = 0,
  IsProcessed : Boolean = false
)
case class TransactionScore(
  TransactionId : Int,
  Score : Int
)
case class TransactionPair(
  SalesId : Int,
  CreditId : Int,
  ScoreDiff : Int
)
object ExampleDataFramePartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Example Combiner")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, 1, 123, "Sales", 100),
      (1, 2, 122, "Credit", 100),
      (1, 3, 99, "Sales", 70),
      (1, 4, 101, "Sales", 77),
      (1, 5, 102, "Credit", 75),
      (1, 6, 98, "Sales", 71),
      (2, 7, 200, "Sales", 55),
      (2, 8, 220, "Sales", 55),
      (2, 9, 200, "Credit", 50),
      (2, 10, 205, "Sales", 50)
    ).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
      .withColumn("Revenue", $"Revenue".cast(DoubleType))
      .repartition(2, $"CustomerId")
    df.show()
    val dfOut2 = df.mapPartitions(p => {
      println(p)
      // Per-partition working state: an index of all transactions plus id/score lists split by type.
      val trnMap = scala.collection.mutable.Map[Int, SalesTransaction]()
      val trnSales = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnCredits = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnPairs = scala.collection.mutable.ArrayBuffer.empty[TransactionPair]
      p.foreach(row => {
        val trnKey: Int = row.getAs[Int]("TransactionId")
        val trnValue: SalesTransaction = new SalesTransaction(row.getAs("CustomerId")
          , trnKey
          , row.getAs("TransactionAttributesScore")
          , row.getAs("Revenue")
          , row.getAs("TransactionType")
        )
        trnMap += (trnKey -> trnValue)
        if (trnValue.Type == "Sales") {
          trnSales += new TransactionScore(trnKey, trnValue.Score)
        } else {
          trnCredits += new TransactionScore(trnKey, trnValue.Score)
        }
      })
      if (trnCredits.size > 0 && trnSales.size > 0) {
        // define transaction pairs: every sale/credit combination with its score difference
        trnCredits.foreach(cr => {
          trnSales.foreach(sl => {
            trnPairs += new TransactionPair(sl.TransactionId, cr.TransactionId, math.abs(cr.Score - sl.Score))
          })
        })
      }
      // Greedily link the closest-scoring pairs, skipping transactions that are already processed.
      trnPairs.sortBy(t => t.ScoreDiff)
        .foreach(t => {
          if (!trnMap(t.CreditId).IsProcessed && !trnMap(t.SalesId).IsProcessed) {
            trnMap(t.SalesId) = new SalesTransaction(trnMap(t.SalesId).CustomerId
              , trnMap(t.SalesId).TransactionId
              , trnMap(t.SalesId).Score
              , trnMap(t.SalesId).Revenue
              , trnMap(t.SalesId).Type
              , math.min(trnMap(t.CreditId).Revenue, trnMap(t.SalesId).Revenue)
              , t.CreditId
              , true
            )
            trnMap(t.CreditId) = new SalesTransaction(trnMap(t.CreditId).CustomerId
              , trnMap(t.CreditId).TransactionId
              , trnMap(t.CreditId).Score
              , trnMap(t.CreditId).Revenue
              , trnMap(t.CreditId).Type
              , math.min(trnMap(t.CreditId).Revenue, trnMap(t.SalesId).Revenue)
              , t.SalesId
              , true
            )
          }
        })
      trnMap.map(m => m._2).toIterator
    })
    dfOut2.show()
    spark.stop()
  }
}
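For comparison, a rough PySpark sketch of the same per-group pattern using groupBy().applyInPandas (PySpark 3.0+, requires pyarrow): each customer's rows arrive as one pandas DataFrame, so arbitrary multi-pass logic can run over the group. The matching logic below is deliberately simplified and does not guard against double-assignment:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, 123, "Sales"), (1, 2, 122, "Credit"), (1, 3, 99, "Sales")],
    ["CustomerId", "TransactionId", "Score", "Type"],
)

def link_returns(pdf: pd.DataFrame) -> pd.DataFrame:
    # All rows of one CustomerId arrive together, so stateful iteration is possible here.
    pdf = pdf.copy()
    pdf["LinkedTransactionId"] = 0
    sales = pdf[pdf["Type"] == "Sales"]
    for idx, credit in pdf[pdf["Type"] == "Credit"].iterrows():
        if len(sales) > 0:
            # Naive "closest score" match.
            best = (sales["Score"] - credit["Score"]).abs().idxmin()
            pdf.loc[idx, "LinkedTransactionId"] = pdf.loc[best, "TransactionId"]
    return pdf

out = df.groupBy("CustomerId").applyInPandas(
    link_returns,
    schema="CustomerId int, TransactionId int, Score int, Type string, LinkedTransactionId int",
)
out.show()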