JPA 2.1 JPQL query with multiple TREAT(...) operations in a JOIN

I am trying to run a JPQL query on the following entity graph: Derived1 and Derived2 are subclasses of ReferencedEntity, and MainEntity has a List of ReferencedEntity. Assume for now that there is only one instance each of Derived1 and Derived2 in the list of each MainEntity.
+--------------+                     +------------------+
|  MainEntity  |                     | ReferencedEntity |
+--------------+                     +------------------+
| list         | --- OneToMany ---> | String a1        |
+--------------+                     +------------------+
                                              ^
                                              |
                                  +-----------+-----------+
                                  |                       |
                          +--------------+        +--------------+
                          |   Derived1   |        |   Derived2   |
                          +--------------+        +--------------+
                          | String d1    |        | String d2    |
                          +--------------+        +--------------+
Now I want to group MainEntities by the d1 and d2 values they contain in their list, so I wrote the following JPQL query:
SELECT COUNT(m) FROM MainEntity m JOIN TREAT(m.list AS Derived1) der1, TREAT(m.list AS Derived2) der2 GROUP BY der1.d1, der2.d2
I think this should work but I get:
Caused by: Exception [EclipseLink-0] (Eclipse Persistence Services - 2.5.0.v20130507-3faac2b): org.eclipse.persistence.exceptions.JPQLException
Exception Description: Syntax error parsing [SELECT COUNT(m) from MainEntity m JOIN TREAT(m.list AS Derived1) der1, TREAT(m.list AS Derived2) der2 GROUP BY der1.d1, der2.d2].
[83, 84] The FROM clause has 'TREAT(m.list' and 'AS Derived2' that are not separated by a comma.
[83, 83] The right parenthesis is missing from the sub-expression.
[77, 83] The identification variable 'm.list' is not following the rules for a Java identifier.
[84, 84] A "root object" must be specified.
[95, 127] The query contains a malformed ending.
at org.eclipse.persistence.internal.jpa.jpql.HermesParser.buildException(HermesParser.java:155)
at org.eclipse.persistence.internal.jpa.jpql.HermesParser.validate(HermesParser.java:334)
at org.eclipse.persistence.internal.jpa.jpql.HermesParser.populateQueryImp(HermesParser.java:278)
at org.eclipse.persistence.internal.jpa.jpql.HermesParser.buildQuery(HermesParser.java:163)
at org.eclipse.persistence.internal.jpa.EJBQueryImpl.buildEJBQLDatabaseQuery(EJBQueryImpl.java:142)
at org.eclipse.persistence.internal.jpa.EJBQueryImpl.buildEJBQLDatabaseQuery(EJBQueryImpl.java:116)
at org.eclipse.persistence.internal.jpa.EJBQueryImpl.<init>(EJBQueryImpl.java:102)
at org.eclipse.persistence.internal.jpa.EJBQueryImpl.<init>(EJBQueryImpl.java:86)
at org.eclipse.persistence.internal.jpa.EntityManagerImpl.createQuery(EntityManagerImpl.java:1583)
... 2 more
Java Result: 1
Just for testing purposes I tried a different join:
SELECT COUNT(m) FROM MainEntity m JOIN TREAT(m.list AS Derived1) der1, m.list refd GROUP BY der1.d1, refd.a1
This one works and delivers a result.
Then I tried a third query and just changed the order of the joins:
SELECT COUNT(m) FROM MainEntity m JOIN m.list refd, TREAT(m.list AS Derived1) der1 GROUP BY der1.d1, refd.a1
This leads to exactly the same exception (but this time for Derived1).
Caused by: Exception [EclipseLink-0] (Eclipse Persistence Services - 2.5.0.v20130507-3faac2b): org.eclipse.persistence.exceptions.JPQLException
Exception Description: Syntax error parsing [SELECT COUNT(m) from MainEntity m JOIN m.list refd, TREAT(m.list AS Derived1) der1 GROUP BY der1.d1, refd.a1].
[64, 65] The FROM clause has 'TREAT(m.list' and 'AS Derived1' that are not separated by a comma.
[64, 64] The right parenthesis is missing from the sub-expression.
[58, 64] The identification variable 'm.list' is not following the rules for a Java identifier.
[65, 65] A "root object" must be specified.
[76, 108] The query contains a malformed ending.
at org.eclipse.persistence.internal.jpa.jpql.HermesParser.buildException(HermesParser.java:155)
at org.eclipse.persistence.internal.jpa.jpql.HermesParser.validate(HermesParser.java:334)
at org.eclipse.persistence.internal.jpa.jpql.HermesParser.populateQueryImp(HermesParser.java:278)
at org.eclipse.persistence.internal.jpa.jpql.HermesParser.buildQuery(HermesParser.java:163)
at org.eclipse.persistence.internal.jpa.EJBQueryImpl.buildEJBQLDatabaseQuery(EJBQueryImpl.java:142)
at org.eclipse.persistence.internal.jpa.EJBQueryImpl.buildEJBQLDatabaseQuery(EJBQueryImpl.java:116)
at org.eclipse.persistence.internal.jpa.EJBQueryImpl.<init>(EJBQueryImpl.java:102)
at org.eclipse.persistence.internal.jpa.EJBQueryImpl.<init>(EJBQueryImpl.java:86)
at org.eclipse.persistence.internal.jpa.EntityManagerImpl.createQuery(EntityManagerImpl.java:1583)
... 2 more
Java Result: 1
To me it looks like there might be a bug in EclipseLink, or am I trying something that's forbidden?

According to the JPA specification, multiple joins are chained with the JOIN keyword; a comma in the FROM clause starts a new range variable declaration, which is why the parser complains. The correct syntax should therefore be:
... JOIN TREAT(m.list AS Derived1) der1 JOIN TREAT(m.list AS Derived2) der2 ...
without any commas.
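Assuming the rest of the query stays the same, the full statement would then presumably read (untested against EclipseLink):
SELECT COUNT(m) FROM MainEntity m JOIN TREAT(m.list AS Derived1) der1 JOIN TREAT(m.list AS Derived2) der2 GROUP BY der1.d1, der2.d2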

Related

How does Spark SQL implement the group by aggregate

How does Spark SQL implement a group-by aggregate? I want to group by the name field and take the salary from the latest record in each group. How do I write the SQL?
The data is:
+------+--------+---------+
| name | salary | date    |
+------+--------+---------+
| AA   |   3000 | 2022-01 |
| AA   |   4500 | 2022-02 |
| BB   |   3500 | 2022-01 |
| BB   |   4000 | 2022-02 |
+------+--------+---------+
The expected result is:
+------+--------+
| name | salary |
+------+--------+
| AA   |   4500 |
| BB   |   4000 |
+------+--------+
Assuming that the DataFrame is registered as a temporary view named tmp: first use the row_number window function to assign a row number (rn) to each row within its name group, ordered by date in descending order, and then keep all rows with rn = 1.
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
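For reference, here is a minimal sketch of the same row_number approach written against the DataFrame API; it assumes a DataFrame df with the name, salary, and date columns from the question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# assign a row number per name, newest date first
w = Window.partitionBy("name").orderBy(F.col("date").desc())

latest = (df
    .withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)   # keep only the newest row per name
    .select("name", "salary"))

latest.show(truncate=False)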
First convert your string to a date.
Convert the date to a Unix timestamp (a numeric representation of the date, so you can use max).
Use "first" as an aggregate function to retrieve a value from your aggregated results. (It takes the first result, so if there is a date tie, it could pull either one.)
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
from pyspark.sql.functions import col, to_date, unix_timestamp, first
from pyspark.sql.functions import max as max_   # avoid shadowing Python's built-in max

(df
    .withColumn(
        "dateUpdated",
        unix_timestamp(to_date(col("updated"), "yyyy-MM-dd"))
    )
    .groupBy("name")
    .agg(
        max_("dateUpdated"),
        first("salary").alias("Salary")
    )
    .show())
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
My usual trick is to "zip" date and salary together (it depends on what you want to sort by first):
from pyspark.sql import functions as F

(df
    .groupBy('name')
    .agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
    .withColumn('max_salary', F.col('max_date_salary')[1])
    .show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+

How to use a filter in subselect

I want to perform a subselect on a related set of data. That subdata needs to be filtered using data from the main query:
customEvents
| extend envId = tostring(customDimensions.EnvironmentId)
| extend organisation = tostring(customDimensions.OrganisationName)
| extend version = tostring(customDimensions.Version)
| extend app = tostring(customDimensions.Appname)
| where customDimensions.EventName contains "ApiSessionStartStart"
| extend dbInfo = toscalar(
customEvents
| extend dbInfo = tostring(customDimensions.dbInfo)
| extend serverEnvId = tostring(customDimensions.EnvironmentId)
| where customDimensions.EventName == "ServiceSessionStart" or customDimensions.EventName == "ServiceSessionContinuation"
| where serverEnvId = envId // This gives an error
| project dbInfo
| take 1)
| order by timestamp desc
| project timestamp, customDimensions.OrganisationName, customDimensions.Version, customDimensions.onBehalfOf, customDimensions.userId, customDimensions.Appname, customDimensions.apiKey, customDimensions.remoteIp, session_Id , dbInfo, envId
The above query results in an error:
Failed to resolve entity 'envId'
How can I filter the data in the subselect based on the field envId in the main query?
I believe you'd need to use join instead, where you'd join to get that value from the second query.
docs for join: https://docs.loganalytics.io/docs/Language-Reference/Tabular-operators/join-operator
The left-hand side of the join is your "outer" query, and the right-hand side of the join would be that "inner" query, though instead of doing take 1, you'd probably do a simpler query that just gets the distinct values of serverEnvId and dbInfo.
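A rough sketch of what that could look like, reusing the column names from the question (untested, so the event-name filters and the final projection may need adjusting):
customEvents
| extend envId = tostring(customDimensions.EnvironmentId)
| extend organisation = tostring(customDimensions.OrganisationName)
| where customDimensions.EventName contains "ApiSessionStartStart"
| join kind=leftouter (
    customEvents
    | where customDimensions.EventName == "ServiceSessionStart"
        or customDimensions.EventName == "ServiceSessionContinuation"
    | extend envId = tostring(customDimensions.EnvironmentId)
    | extend dbInfo = tostring(customDimensions.dbInfo)
    | distinct envId, dbInfo
) on envId
| order by timestamp desc
| project timestamp, organisation, envId, dbInfo, session_Id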

udf that sorts list in pyspark

I have a dataframe where one of the columns, called stopped, is:
+--------------------+
| stopped|
+--------------------+
|[nintendo, dsi, l...|
|[nintendo, dsi, l...|
| [xl, honda, 500]|
|[black, swan, green]|
|[black, swan, green]|
|[pin, stripe, sui...|
| [shooting, braces]|
| [haus, geltow]|
|[60, cm, electric...|
| [yamaha, yl1, yl2]|
|[landwirtschaft, ...|
| [wingbar, 9581]|
| [gummi, 16mm]|
|[brillen, lupe, c...|
|[man, city, v, ba...|
|[one, plus, one, ...|
| [kapplocheisen]|
|[tractor, door, m...|
|[pro, nano, flat,...|
|[kaleidoscope, to...|
+--------------------+
I would like to create another column that contains the same list but where the keywords are ordered.
As I understand it, I need to create a udf that takes and returns a list:
udf_sort = udf(lambda x: x.sort(), ArrayType(StringType()))
ps_clean.select("*", udf_sort(ps_clean["stopped"])).show(5, False)
and I get:
+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+
|client_id|kw_id |keyword |max_click_dt|tokenized |stopped |<lambda>(stopped)|
+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+
|710 |4304414582|nintendo dsi lite new|2017-01-06 |[nintendo, dsi, lite, new]|[nintendo, dsi, lite, new]|null |
|705 |4304414582|nintendo dsi lite new|2017-03-25 |[nintendo, dsi, lite, new]|[nintendo, dsi, lite, new]|null |
|707 |647507047 |xl honda 500 s |2016-10-26 |[xl, honda, 500, s] |[xl, honda, 500] |null |
|710 |26308464 |black swan green |2016-01-01 |[black, swan, green] |[black, swan, green] |null |
|705 |26308464 |black swan green |2016-07-13 |[black, swan, green] |[black, swan, green] |null |
+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+
Why is the sorting not being applied?
x.sort() sorts the list in place (though I suspect it won't actually mutate anything inside a PySpark DataFrame) and returns None. That is the reason your column labeled <lambda>(stopped) contains only null values. sorted(x) sorts the list and returns a new sorted copy. So, replacing your udf with
udf_sort = udf(lambda x: sorted(x), ArrayType(StringType()))
should solve your problem.
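For completeness, a self-contained version of that fix, assuming the ps_clean DataFrame and stopped column from the question (the output column alias is just for illustration):
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# sorted(x) returns a new sorted list instead of None
udf_sort = udf(lambda x: sorted(x), ArrayType(StringType()))
ps_clean.select("*", udf_sort(ps_clean["stopped"]).alias("stopped_sorted")).show(5, False)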
Alternatively, you can use the built-in function sort_array instead of defining your own udf.
from pyspark.sql.functions import sort_array
ps_clean.select("*", sort_array(ps_clean["stopped"])).show(5, False)
This method is a little cleaner, and you can actually expect to get some performance gains because pyspark doesn't have to serialize your udf.
Change your udf to:
udf_sort = udf(lambda x: sorted(x), ArrayType(StringType()))
For the differences between .sort() and sorted(), read:
What is the difference between `sorted(list)` vs `list.sort()`? (Python)

How to do left outer join in spark sql?

I am trying to do a left outer join in Spark (1.6.2) and it doesn't work. My SQL query is like this:
sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
where t.created_year = 2016
and p.created_year = 2016").show()
The result is like this:
+--------------------+--------------------+--------------------+
| type| uuid| uuid|
+--------------------+--------------------+--------------------+
| tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
| swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|
I get the same result whether I use LEFT JOIN or LEFT OUTER JOIN (the second uuid is never null).
I would expect the second uuid column to contain nulls for unmatched rows. How do I do a left outer join correctly?
=== Additional information ===
If I use the DataFrame API to do the left outer join, I get the correct result.
s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')
s.join(p, s.uuid == p.uuid, 'left_outer') \
 .select(s.type, s.uuid.alias('s_uuid'), p.uuid.alias('p_uuid'),
         s.created_date, p.created_year, p.created_month).show()
I got result like this:
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
| type| s_uuid| p_uuid| created_date|created_year|created_month|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
| tained|6d688688-96a4-341...| null|2016-01-28 00:27:...| null| null|
Thanks,
I don't see any issues in your code. Both "left join" and "left outer join" will work fine. Please check the data again; the data you are showing is only for matched rows.
You can also perform the Spark SQL join by using:
# left outer join, with the join type given explicitly
df1.join(df2, df1["col1"] == df2["col1"], "left_outer")
You are filtering out null values for p.created_year (and for p.uuid) with
where t.created_year = 2016
and p.created_year = 2016
The way to avoid this is to move the filtering clause for p into the ON condition:
sqlContext.sql("select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
and p.created_year = 2016
where t.created_year = 2016").show()
This is correct but inefficient, because we would also like to filter on t.created_year before the join happens. So it is recommended to use subqueries:
sqlContext.sql("select t.type, t.uuid, p.uuid
from (
SELECT type, uuid FROM symptom_type WHERE created_year = 2016
) t LEFT JOIN (
SELECT uuid FROM plugin WHERE created_year = 2016
) p
ON t.uuid = p.uuid").show()
I think you just need to use the LEFT OUTER JOIN keyword instead of LEFT JOIN for what you want. For more information, look at the Spark documentation.

In cassandra cqlsh, how do I select from map<ascii, ascii> in a table?

Basically here is how my table is set up:
CREATE TABLE points (
    name ascii,
    id varint,
    attributes map<ascii, ascii>,
    PRIMARY KEY (name, id)
);
and if I run the following SELECT statement I get this returned:
SELECT id, attributes from points limit 5;
id | attributes
----+------------------------------------------
1 | {STATION/Name: ABC, Type: 2, pFreq: 101}
2 | {STATION/Name: ABC, Type: 1, pFreq: 101}
3 | {STATION/Name: DEF, Type: 1, pFreq: 103}
4 | {STATION/Name: GHI, Type: 2, pFreq: 105}
5 | {STATION/Name: GHI, Type: 1, pFreq: 105}
What I would like to do is be able to form a WHERE clause based on info inside of attributes. Something like the following statement:
SELECT id FROM points WHERE name = 'NAME' AND attributes['pFreq'] = 101;
However, when I run this I get the following error:
Bad Request: line 1:56 no viable alternative at input '['
I looked at this discussion and it seems as though it is not supported yet. Is this true? Or is there a way to filter on the attributes information?
Here are the versions I am working with:
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Yes, it is true; instead you can use CONTAINS:
SELECT * FROM points WHERE attributes CONTAINS '101';
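Note that, as far as I know, CONTAINS on a collection only works once the map values are indexed, and secondary indexes on collections were only introduced in Cassandra 2.1, so this may not be available on your 2.0.7 install. A sketch of the index creation (the index name here is just for illustration):
CREATE INDEX points_attributes_idx ON points (attributes);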
