I'm having a pretty troubling problem with the LAST aggregate in SparkSQL in Spark 2.3.1. It seems to give me around 4 bad results -- that is, values that are not LAST by the specified partitioning and order -- in 500,000 (logical SQL, not Spark) partitions, something like 50MM records. Smaller batches are worse -- the number of errors per batch seems pretty consistent, although I don't think I tried anything smaller than 100,000 logical SQL partitions.
I have roughly 66 FIRST or LAST aggregates, a compound (date, integer) logical SQL partition key, and a compound (string, string) sort key. I tried converting the four-character numeric sort values into integers, and then combining them into a single integer. Neither change resolved the problem; even with a single integer sort key, I was getting a few bad values.
Typically, there are fewer than a hundred records in each partition, and only a handful of non-NULL values for any field. When the result is wrong, it is never the second-to-last value; it's always at least third from the end.
I did try replacing the simple aggregate with a windowed aggregate using ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The one run I did of that gave me six bad records -- the combined integer key had given me only two -- but I didn't do enough runs to really compare the approaches, and of course I need zero.
Why do I not seem to be able to rely on LAST()? Here's a test which just illustrates the unwindowed version of the LAST function, although my partitioning and sorting fields are each two fields.
import org.apache.spark.sql.functions.{expr}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.scalamock.scalatest.MockFactory
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
import collection.JavaConverters._
class LastTest extends FlatSpec with Matchers with MockFactory with BeforeAndAfterAll {
  implicit val spark: SparkSession = SparkSession.builder().appName("Last Test").master("local[2]").getOrCreate()
  import spark.implicits._

  // TRN_DATE, TRN_NUMBER, TRN_TIMESTAMP, DETAILS, DATE_TIME, QUE_LINE_ID, OPR_INITIALS, ENTRY_TYPE, HIST_NO, SUB_HIST_NO, MSG_INFO
  "LAST" must "work with GROUP BY" in {
    val lastSchema = StructType(Seq(
      StructField("Pfield", IntegerType)   // partition field
      , StructField("Ofield", IntegerType) // order field
      , StructField("Vfield", StringType)  // value field
    ))
    val last: DataFrame = spark.createDataFrame(List[Row](
      Row(0, 1, "Pencil")
      , Row(5, 1, "Aardvark")
      , Row(10, 1, "Monastery")
      , Row(10, 2, "Remediation")
      , Row(15, 1, "Parcifal")
      , Row(20, 1, "Montenegro")
      , Row(20, 2, "Susquehana")
      , Row(20, 3, "Perfidy")
      , Row(20, 4, "Prosody")
    ).asJava
      , lastSchema
    ).repartition(expr("MOD(Pfield, 4)"))
    last.createOrReplaceTempView("last_group_test")

    // apply the unwindowed last
    val unwindowed: DataFrame = spark.sql("SELECT Pfield, LAST(Vfield) AS Vlast FROM (SELECT * FROM last_group_test ORDER BY Pfield, Ofield) GROUP BY Pfield ORDER BY Pfield")
    unwindowed.show(5)

    // apply a windowed last
    val windowed: DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    windowed.show(5)

    // include the partitioning function in the window
    val excessivelyWindowed: DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY MOD(Pfield, 4), Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    excessivelyWindowed.show(5)

    assert(unwindowed.collect() === windowed.collect() && windowed.collect() === excessivelyWindowed.collect())
    assert(windowed.count() == 5)
    assert(windowed.filter("Pfield=20").select($"Vlast").collect()(0)(0) === "Prosody")
  }
}
So, all three datasets are the same, which is nice. But if I apply this logic to my actual needs -- which involve sixty-odd columns, almost all of which are LAST values -- I get bad values, roughly 4 per batch of 500,000 groups. If I run the dataset 30 times, I get 30 different sets of bad records.
Am I doing something wrong, or is this a defect? Is it a known defect? Is it fixed in 2.4? I didn't find it reported, but "aggregates simply don't work sometimes" can't be something they released with, right?
I was able to resolve the issue by applying the windowed aggregate to a dataset with the same sorting, sorted in a subquery.
SELECT LAST(VAL) FROM (SELECT * FROM TBL ORDER BY SRT) SRC GROUP BY PRT
was not sufficient, nor was
SELECT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM TBL
I had to do both
SELECT DISTINCT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (SELECT * FROM TBL ORDER BY SRT) SRC
These datasets had been extracted from an Oracle 12.2 instance over JDBC. I also added SRT to the ORDER BY clause of that extraction query, which had previously had only ORDER BY PRT.
Further -- and I think this may have been the most significant change -- I used the cacheTable API on the Spark catalog object after extracting the data. I had been doing
.repartition
.cache
.count
in order to load all the records with a relatively small number of database connections, but I suspect it was not enough to get all the data onto the Spark side before the aggregations took place.
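For reference, here is a minimal sketch of the shape of that change; the connection options, table/view names, and the PRT/SRT columns are illustrative stand-ins, not the real job:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("extract-and-cache").getOrCreate()

// Extract over JDBC, repartition, and register the result as a view.
val extracted = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/svc")           // hypothetical connection string
  .option("dbtable", "(SELECT * FROM TBL ORDER BY PRT, SRT) SRC") // hypothetical extraction query
  .load()
  .repartition(col("PRT"))

extracted.createOrReplaceTempView("src")

// Cache the whole table through the catalog before running the aggregations,
// rather than relying only on repartition().cache().count() to materialize it.
spark.catalog.cacheTable("src")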
I have stumbled across a bit of an issue with U-SQL for which I haven't yet found a workaround.
It seems U-SQL doesn't support anything other than == in join conditions, so you can't put > or < in the join itself.
For the use case below, which I have done in Oracle:
create table trf.test_1(
number_col int
);
insert into trf.test_1 VALUES (10);
insert into trf.test_1 VALUES (20);
insert into trf.test_1 VALUES (30);
insert into trf.test_1 VALUES (60);
drop table trf.test_2;
create table trf.test_2(
number_col int
);
insert into trf.test_2 VALUES (20);
insert into trf.test_2 VALUES (30);
SELECT t1.number_col, t2.number_col
FROM trf.test_1 t1
LEFT JOIN trf.test_2 t2 ON t1.number_col < t2.number_col
;
I get the following (the unmatched rows keep their NULLs):
T1.NUMBER_COL   T2.NUMBER_COL
10              20
10              30
20              30
30              (null)
60              (null)
How might I do that in U-SQL without the < in the join?
I tried a cross join, but if you include the < in the WHERE clause it effectively becomes an inner join and you don't get the rows with the NULLs.
Any ideas appreciated.
#t1 =
SELECT * FROM
( VALUES
(10),
(20),
(30),
(60)
) AS T(num_col);
#t2 =
SELECT * FROM
( VALUES
(20),
(30)
) AS T(num_col);
#result =
SELECT t1.num_col, t2.num_col AS num_col_2
FROM #t1 AS t1
CROSS JOIN #t2 AS t2
WHERE t1.num_col < t2.num_col;
#result2 =
SELECT t1.num_col, t2.num_col AS num_col_2
FROM #t1 AS t1
LEFT JOIN #result AS t2 ON t1.num_col == t2.num_col;
OUTPUT #result2
TO "/Output/ReferenceGuide/Joins/exampleA.csv"
USING Outputters.Csv();
Edit: I added a left join from the #t1 rowset back to the #result rowset, which seems to work, but I'd be interested to hear if there are any better solutions out there. It seems a bit of a workaround.
This is a known restriction and is discussed extensively in the article "U-SQL SELECT Selecting from joins".
Some quotes from that article:
Join Comparisons
U-SQL, like most scaled out Big Data Query languages that support joins, restricts the join comparison to equality comparisons between columns in the rowsets to be joined...
...
If one has a non-equality comparison or a more complex expression (such as a method invocation) in the comparison, one can move the comparison to the SELECT’s WHERE clause. Or the more complex expression can be placed in an earlier SELECT statement’s column and then that alias can be referred to in the join comparison.
Basically, non-equality join comparisons don't scale particularly well on a distributed platform like ADLA.
Is it possible to use .QueryMultiple (or some other method) in Dapper so that the results of each query can be used in the WHERE clause of the next query, without having to run each query individually: .Query, get the id, then .Query again with that id, and so on?
For example,
string sqlString = @"select tableA_id from tableA where tableA_lastname = @lastname;
select tableB_id from tableB WHERE tableB_id = tableA_id";
db.QueryMultiple(sqlString, new {lastname = "smith"});
Is something like this possible with Dapper or do I need a view or stored procedure to accomplish this? I can use multiple joins for one SQL statement, but in my real query there are 7 joins, and I didn't think I should return 7 objects.
Right now I'm just using object.
You can store the result of each previous query in a table variable, then select from that variable to return it and also use it in the next query, for example:
DECLARE @TableA AS TABLE(
    tableA_id INT
    -- ... all other columns you need
);

INSERT @TableA
SELECT tableA_id
FROM tableA
WHERE tableA_lastname = @lastname;

SELECT *
FROM @TableA;

SELECT tableB_id
FROM tableB
JOIN @TableA AS a ON tableB_id = a.tableA_id;
I have a simple Spark SQL query:
SELECT x, y
FROM t1 INNER JOIN t2 ON t1.key = t2.key
WHERE expensiveFunction(t1.key)
where expensiveFunction is a Spark UDF (user-defined function).
When I look at the query plan generated by Spark, I see that it has two filter operations instead of just one: it checks not only expensiveFunction(t1.key), but also expensiveFunction(t2.key).
In general, this optimization is not a bad thing, because it reduces the number of records to join, and joining is an expensive operation. But in my case expensiveFunction(t2.key) always returns true, so I would like to remove it from the plan.
Is there a way to change the query plan before executing a query? Is there a way to indicate to Spark that I don't want a given optimization to be applied to my query?
Is there a way to change the query plan before executing a query?
In general, yes. There are a few extension points in the Spark SQL query planner and optimizer that make this doable.
Is there a way to indicate to Spark that I don't want a given optimization to be applied to my query?
That's nearly impossible unless the optimization allows for it. In other words, you'd have to find out whether the rule has an option to turn it off, e.g. CostBasedJoinReorder with the spark.sql.cbo.enabled or spark.sql.cbo.joinReorder.enabled configuration properties (when either is off, CostBasedJoinReorder does nothing).
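For instance, a minimal sketch of switching those two properties off on an existing session (spark is assumed to be the active SparkSession):
// With cost-based optimization disabled, CostBasedJoinReorder does nothing.
spark.conf.set("spark.sql.cbo.enabled", "false")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "false")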
You could also write a custom logical operator that would make the optimization void (as it would not be matched, being an unknown logical operator) and then remove it at the optimization phase.
Use extendedOperatorOptimizationRules to register custom optimizations.
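Here is a minimal sketch of wiring a custom rule into that extension point through SparkSessionExtensions; the rule below is only a do-nothing placeholder, and a real one would pattern-match the plan and rewrite it (for example, drop the redundant Filter):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: returns the plan unchanged. A real rule would match the
// unwanted Filter node and remove or rewrite it.
object MyCustomOptimization extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("custom-optimizer-rule")
  .withExtensions { extensions =>
    extensions.injectOptimizerRule(session => MyCustomOptimization)
  }
  .getOrCreate()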
This is happening because of the optimizer rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints. Its code comment is as follows (from the GitHub source):
/**
* Infers an additional set of constraints from a given set of equality constraints.
* For e.g., if an operator has constraints of the form (`a = 5`, `a = b`), this returns an
* additional constraint of the form `b = 5`.
*/
def inferAdditionalConstraints(constraints: Set[Expression]): Set[Expression]
You could disable this optimizer rule using spark.sql.optimizer.excludedRules:
val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
.doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
"specified by their rule names and separated by comma. It is not guaranteed that all the " +
"rules in this configuration will eventually be excluded, as some rules are necessary " +
"for correctness. The optimizer will log the rules that have indeed been excluded.")
.stringConf
.createOptional
That way the filter will not get propagated to both sides of the join.
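A minimal sketch of excluding the rule on a session (as far as I know this property was only added in Spark 2.4, so it would require upgrading from 2.3.x; spark is assumed to be the active SparkSession):
// Exclude InferFiltersFromConstraints so the inferred filter on the other side
// of the join is no longer added. Spark logs which rules were actually excluded.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints")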
You can rewrite the query as below to avoid the extra function call.
SELECT x, y
FROM (SELECT <required-columns> FROM t1 WHERE expensiveFunction(t1.key)) t0 INNER JOIN t2 ON t0.key = t2.key
To be extra sure, you can persist this query (SELECT <required-columns> FROM t1 WHERE expensiveFunction(t1.key)) as a separate DataFrame and then join table t2 with that DataFrame.
For example, let's say we have DataFrames df1 and df2 for tables t1 and t2 respectively. We can do something like the following to avoid calling expensiveFunction twice.
val df3 = df1.filter("col1 == 1") // stand-in for the expensiveFunction filter on df1
df3.persist() // cache df3 so the expensive filter is evaluated only once
df3.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
spark.sql("""SELECT t1.col1, t2.col2
             FROM t1 INNER JOIN t2 ON t1.col2 = t2.col1""") // this query now has no reference to expensiveFunction
Can someone help me with dynamically passing a value to a DB2 LIKE search clause while fetching the result from another table?
I am trying this:
select * from table2 where file_name like '%(select file_name from table1)'
I've even tried CONCAT and approaches using sysibm.sysdummy1, but no luck.
Maybe this helps:
SELECT *
FROM table2
JOIN table1
ON table2.file_name LIKE CONCAT('%',table1.file_name)
The DDL for the files, sample data, and expected results were not shown, so a reader cannot determine whether there are [other] considerations acting as implied obstacles. With that caveat, the following variation of the already-offered answer is more liberal in what it selects for the intent behind the OP's select * from table2 where file_name like '%(select file_name from table1)'; i.e. rather than an effective ends-with [or starts-with] predicate on the file-name value, the following achieves an effective contains predicate.
select /* t1.file_name, */ t2.*
from table2 as t2
inner join
table1 as t1
on t2.file_name like '%' concat rtrim(t1.file_name) concat '%'