BigQuery Struct Aggregation - struct

I am processing an ETL job on BigQuery, where I am trying to reconcile data from sources that may conflict. I first used array_agg(distinct my_column ignore nulls) to find out where reconciliation was needed, and next I need to prioritize data per column based on the source.
I thought I could array_agg(struct(data_source, my_column)) and then easily extract the preferred source's data for a given column. However, with this method the data is aggregated as an array of structs rather than as a single struct.
Consider the simplified example below, where I would prefer to get job_title from HR and dietary_pref from Canteen:
with data_set as (
select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
union all
select 'John' as employee, 'Manager' as job_title, 'vegetarian' as dietary_pref, 'Canteen' as source
union all
select 'Mary' as employee, 'Marketing Director' as job_title, 'pescatarian' as dietary_pref, 'HR' as source
union all
select 'Mary' as employee, 'Marketing Manager' as job_title, 'gluten-free' as dietary_pref, 'Canteen' as source
)
select employee,
array_agg(struct(source, job_title)) as job_title,
array_agg(struct(source, dietary_pref)) as dietary_pref,
from data_set
group by employee
The data I get for John with regard to the job title is:
[{'source':'HR', 'job_title':'Senior Manager'}, {'source': 'Canteen', 'job_title':'Manager'}]
Whereas I am trying to achieve:
[{'HR' : 'Senior Manager', 'Canteen' : 'Manager'}]
With a struct output, I was hoping to then easily access the preferred source using my_struct.my_preferred_source. In this particular case I hope to invoke job_title.HR and dietary_pref.Canteen.
Hence, in pseudo-SQL, I imagine I would write:
select employee,
AGGREGATE_JOB_TITLE_AS_STRUCT(source, job_title).HR as job_title,
AGGREGATE_DIETARY_PREF_AS_STRUCT(source, dietary_pref).Canteen as dietary_pref,
from data_set group by employee
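The output would then be (job_title taken from HR, dietary_pref taken from Canteen):
employee  job_title           dietary_pref
John      Senior Manager      vegetarian
Mary      Marketing Director  gluten-free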
I'd like help here solving this. Perhaps that's the wrong approach altogether, but given the more complex data set I am dealing with I thought this would be the preferred approach (albeit failed).
Open to alternatives. Please advise. Thanks
Notes: I edited this post after Mikhail's answer, which solved my problem using a slightly different method than I expected, and added more details on my intent to use a single struct per employee

Consider the query below:
select employee,
array_agg(struct(source as job_source, job_title) order by if(source = 'HR', 1, 2) limit 1)[offset(0)].*,
array_agg(struct(source as dietary_source, dietary_pref) order by if(source = 'HR', 2, 1) limit 1)[offset(0)].*
from data_set
group by employee
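If applied to the sample data in your question, the output is:
employee  job_source  job_title           dietary_source  dietary_pref
John      HR          Senior Manager      Canteen         vegetarian
Mary      HR          Marketing Director  Canteen         gluten-free
Update:
Use the query below for the clarified output: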
select employee,
array_agg(job_title order by if(source = 'HR', 1, 2) limit 1)[offset(0)] as job_title,
array_agg(dietary_pref order by if(source = 'HR', 2, 1) limit 1)[offset(0)] as dietary_pref
from data_set
group by employee
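with output:
employee  job_title           dietary_pref
John      Senior Manager      vegetarian
Mary      Marketing Director  gluten-free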
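The "rank the sources per column and keep the best one" idea behind the ORDER BY ... LIMIT 1 trick can also be sketched outside BigQuery. The following is a minimal Python sketch, not the answer's method: ROWS mirrors the sample data from the question, while PREFERRED_SOURCES and reconcile are hypothetical names introduced here for illustration. Unlike the queries above, which do not use IGNORE NULLS, this sketch skips missing values.

```python
from collections import defaultdict

# Sample rows mirroring the data_set CTE in the question.
ROWS = [
    {"employee": "John", "job_title": "Senior Manager", "dietary_pref": "vegan", "source": "HR"},
    {"employee": "John", "job_title": "Manager", "dietary_pref": "vegetarian", "source": "Canteen"},
    {"employee": "Mary", "job_title": "Marketing Director", "dietary_pref": "pescatarian", "source": "HR"},
    {"employee": "Mary", "job_title": "Marketing Manager", "dietary_pref": "gluten-free", "source": "Canteen"},
]

# Hypothetical priority table: for each column, sources in order of preference.
PREFERRED_SOURCES = {
    "job_title": ["HR", "Canteen"],
    "dietary_pref": ["Canteen", "HR"],
}


def reconcile(rows, preferred_sources):
    """Pick, per employee and column, the value from the best-ranked source."""
    by_employee = defaultdict(list)
    for row in rows:
        by_employee[row["employee"]].append(row)

    result = {}
    for employee, group in by_employee.items():
        picked = {}
        for column, ranking in preferred_sources.items():
            # Keep only rows from a ranked source that actually carry a value.
            candidates = [
                r for r in group
                if r["source"] in ranking and r[column] is not None
            ]
            # Equivalent in spirit to ORDER BY <rank> LIMIT 1.
            best = min(candidates, key=lambda r: ranking.index(r["source"]), default=None)
            picked[column] = best[column] if best is not None else None
        result[employee] = picked
    return result


reconciled = reconcile(ROWS, PREFERRED_SOURCES)
print(reconciled)
```

Listing a fallback source after the preferred one means an employee who is missing from HR still gets a job_title from Canteen, which may or may not be what a given reconciliation rule wants.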

Related

Prioritise query values over others using another query for values to be prioritised

I have the following query of Olympic countries in Power Query, which I wish to sort using another query containing "prioritised countries" (the current top 10). I wish to sort the original query such that countries on the prioritised list appear at the top, sorted alphabetically.
Below visually shows what I am trying to achieve:
The best I have been able to do is merge the queries; however, this removes countries that are not in the prioritised query. I appreciate that I can create a second copy of the original query, append it to the prioritised countries and then remove duplicates; however, I am looking for a more elegant solution, as this will require refreshing the data twice.
Let Q be the query to sort and P be the priority list. Then you can get your desired result by appending the intersection Q ∩ P with the set difference Q \ P.
Here's one way to do this in M:
let
    Source =
        Table.FromList(
            List.Combine(
                {
                    List.Sort( List.Intersect( { P[Country], Q[Country] } ) ),
                    List.Sort( List.RemoveItems( Q[Country], P[Country] ) )
                }
            ),
            null,
            {"Country"}
        )
in
    Source
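For readers who want to check the set logic outside Power Query, here is a minimal Python sketch of the same idea: the sorted intersection Q ∩ P followed by the sorted difference Q \ P. The function name prioritise and the sample country lists are assumptions made for illustration only.

```python
def prioritise(countries, priority):
    """Return countries with the prioritised ones first, each part sorted A-Z."""
    priority_set = set(priority)
    top = sorted(c for c in countries if c in priority_set)       # Q intersect P
    rest = sorted(c for c in countries if c not in priority_set)  # Q minus P
    return top + rest


# Hypothetical sample data: Q is the full list, P the priority list.
Q = ["Brazil", "Australia", "China", "Zambia", "Japan", "Albania"]
P = ["Japan", "China", "USA"]

ordered = prioritise(Q, P)
print(ordered)
```

The same ordering can be produced with a single sort on a two-part key, sorted(Q, key=lambda c: (c not in set(P), c)), because False sorts before True.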

AWS Athena working with nested arrays, trying to search for a field within the array

I have a sql query:
SELECT id_str, entities.hashtags
FROM tweets, unnest(entities.hashtags) as t(hashtag)
WHERE cardinality(entities.hashtags)=2 and id_str='1248585590573948928'
limit 5
which returns:
id_str hashtags
1248585590573948928 [{text=LUCAS, indices=[75, 81]}, {text=WayV, indices=[83, 88]}]
1248585590573948928 [{text=LUCAS, indices=[75, 81]}, {text=WayV, indices=[83, 88]}]
The unnesting has returned the row twice which originally was one row, this is because there are 2 objects in this array.
The next part I wanted to add to the sql query was
select hashtag['text'] as htag to the existing select which should return 2 rows still but this time returning LUCAS and WayV in the separate rows in same column, named htag.
But I get this error - any idea what I am doing wrong?
Your query has the following error(s):
SYNTAX_ERROR: line 1:8: '[]' cannot be applied to row(text varchar,indices array(bigint)), varchar(4)
I assume it is because I have another array within this array.. ?
Thanks in advance
I'm not entirely sure where you're adding the hashtag['text'] expression, so I can't say with confidence what your problem is, but I have two suggestions for you to try:
The error says that hashtag is of type row(text varchar, …), which suggests that hashtag.text should work.
If that doesn't work, you can try using element_at e.g. element_at(hashtag, 'text').
I came across this issue as well, and since no solution has been provided, I'd like to chip in:
After you unnest an array, you can address the result with a . reference instead of ['']:
WITH dataset AS (
SELECT ARRAY[
CAST(ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Alice', 35) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Jane', 27) AS ROW(name VARCHAR, age INTEGER))
] AS users
)
SELECT
user,
user.name
FROM dataset
cross join unnest (users) as t(user)
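As a loose analogy only (Athena's type system is not Python's), the behaviour can be pictured with a named tuple: each unnested element is a record with named fields, so a field is reached by name with a dot, while subscripting it with a string fails. The Hashtag type, the tweets list and unnest_hashtags below are hypothetical names for this sketch; the sample values are taken from the question.

```python
from collections import namedtuple

# A record with named fields, standing in for row(text varchar, indices array(bigint)).
Hashtag = namedtuple("Hashtag", ["text", "indices"])

tweets = [
    {
        "id_str": "1248585590573948928",
        "hashtags": [Hashtag("LUCAS", [75, 81]), Hashtag("WayV", [83, 88])],
    },
]


def unnest_hashtags(rows):
    """Yield one (id_str, htag) pair per hashtag, like CROSS JOIN UNNEST."""
    for row in rows:
        for hashtag in row["hashtags"]:
            # Dot access works on the record; hashtag["text"] would raise TypeError.
            yield row["id_str"], hashtag.text


result = list(unnest_hashtags(tweets))
print(result)
```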

Multiple values in WHERE clause using sqldf in R

I am trying to query multiple values in the WHERE clause, using sqldf in R. I have the following query, however, it continues to throw an error. Any help would be appreciated.
sqldf("SELECT amount
from df
where category = 'description' and 'original description'")
ERROR: <0 rows> (or 0-length row.names)
You just need to use the IN condition:
sqldf("SELECT amount
from df
where category in ('description','original description')")
If you want to use the LIKE operator, you need to use OR instead of AND. (I'm not sure what other entries are in category; if no other category has "description" in its name, the following might be enough.)
sqldf("SELECT amount from df where category LIKE '%description%'")
You need to define each where clause explicitly, so
SELECT amount FROM df WHERE category = 'description' OR category = 'original description'
You can pass in multiple values, it's done with the IN operator:
SELECT amount FROM df WHERE category IN ( 'description', 'original description' )
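Since sqldf uses SQLite by default, the difference between these WHERE clauses can be reproduced with Python's built-in sqlite3 module. The sketch below uses a made-up three-row df table and a hypothetical helper named amounts. In the original query, the bare string after AND is evaluated as a boolean on its own; SQLite converts it to the number 0, so the condition is false for every row, which is why zero rows come back.

```python
import sqlite3

# In-memory database with a made-up stand-in for the df data frame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("description", 10.0), ("original description", 20.0), ("other", 30.0)],
)


def amounts(where_clause):
    """Run the SELECT with the given WHERE clause and return the amounts."""
    query = "SELECT amount FROM df WHERE " + where_clause + " ORDER BY amount"
    return [row[0] for row in conn.execute(query)]


# The bare string after AND is treated as a (false) boolean: no rows match.
broken = amounts("category = 'description' AND 'original description'")
# Three equivalent ways to match both categories in this sample table.
with_in = amounts("category IN ('description', 'original description')")
with_or = amounts("category = 'description' OR category = 'original description'")
with_like = amounts("category LIKE '%description%'")

print(broken, with_in, with_or, with_like)
```

The LIKE version only matches the same rows here because no other category in the sample contains "description"; IN is the safer choice when the exact values are known.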

What is the right way to do a semi-join on two Spark RDDs (in PySpark)?

In my PySpark application, I have two RDDs:
items - This contains item ID and item name for all valid items. Approx 100000 items.
attributeTable - This contains the fields user ID, item ID and an attribute value of this combination, in that order. There is a certain attribute for each user-item combination in the system. This RDD has several hundred thousand rows.
I would like to discard all rows in attributeTable RDD that don't correspond to a valid item ID (or name) in the items RDD. In other words, a semi-join by the item ID. For instance, if these were R data frames, I would have done semi_join(attributeTable, items, by="itemID")
I tried the following approach first, but found that this takes forever to return (on my local Spark installation running on a VM on my PC). Understandably so, because there are such a huge number of comparisons involved:
# Create a broadcast variable of all valid item IDs to use in the filter
validItemIDs = sc.broadcast(items.map(lambda (itemID, itemName): itemID).collect())
attributeTable = attributeTable.filter(lambda (userID, itemID, attributes): itemID in set(validItemIDs.value))
After a bit of fiddling around, I found that the following approach works pretty fast (a min or so on my system).
# Create a broadcast variable for item ID to item name mapping (dictionary)
itemIdToNameMap = sc.broadcast(items.collectAsMap())
# From the attribute table, remove records that don't correspond to a valid item name.
# First go over all records in the table and add a dummy field indicating whether the item name is valid
# Then, filter out all rows with invalid names. Finally, remove the dummy field we added.
attributeTable = (attributeTable
.map(lambda (userID, itemID, attributes): (userID, itemID, attributes, itemIdToNameMap.value.get(itemID, 'Invalid')))
.filter(lambda (userID, itemID, attributes, itemName): itemName != 'Invalid')
.map(lambda (userID, itemID, attributes, itemName): (userID, itemID, attributes)))
Although this works well enough for my application, it feels more like a dirty workaround, and I am pretty sure there must be a cleaner or more idiomatic (and possibly more efficient) way to do this in Spark. What would you suggest? I am new to both Python and Spark, so any RTFM advice would also be helpful if you could point me to the right resources.
My Spark version is 1.3.1.
Just do a regular join and then discard the "lookup" relation (in your case items rdd).
If these are your RDDs (example taken from another answer):
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
then you'd do:
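# Parentheses let the chained calls span several lines.
(attributeTable.keyBy(lambda x: x[1])
    .join(items)
    # Each joined element is (itemID, (attributeRow, itemName)); keep the attribute row.
    .map(lambda kv: kv[1][0]))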
And as a result, you only have tuples from attributeTable RDD which have a corresponding entry in the items RDD:
[(123456, 123, 'Attribute for A')]
Doing it via leftOuterJoin as suggested in another answer will also do the job, but is less efficient. Also, the other answer semi-joins items with attributeTable instead of attributeTable with items.
As others have pointed out, this is probably most easily accomplished by leveraging DataFrames. However, you might be able to accomplish your intended goal by using the leftOuterJoin and the filter functions. Something a bit hackish like the following might suffice:
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
sorted(items.leftOuterJoin(attributeTable.keyBy(lambda x: x[1]))
.filter(lambda x: x[1][1] is not None)
.map(lambda x: (x[0], x[1][0])).collect())
returns
[(123, 'Item A')]
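One detail worth noting about the first attempt in the question: set(validItemIDs.value) sits inside the lambda, so the set is rebuilt for every record, which is likely a large part of why it was so slow. The plain-Python sketch below shows the semi-join with the lookup set built once; semi_join and the second sample row are hypothetical and only illustrate the logic, not Spark itself. In PySpark the analogous change would be to broadcast the set rather than the list.

```python
def semi_join(attribute_rows, items):
    """Keep attribute rows whose item ID (second field) appears in items."""
    # Build the lookup set once, not once per row.
    valid_item_ids = {item_id for item_id, _name in items}
    return [row for row in attribute_rows if row[1] in valid_item_ids]


items = [(123, "Item A"), (456, "Item B")]
attribute_table = [
    (123456, 123, "Attribute for A"),
    (654321, 999, "Attribute for an unknown item"),  # hypothetical invalid row
]

kept = semi_join(attribute_table, items)
print(kept)
```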

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, since the query contains a subquery. So, I'll try to describe the query goal and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group, and call the result: sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
In order to calculate the final SUM, I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way for structuring the query in order to make the export work?
I can't see too much wrong with it, but I would take out some of the junk Access puts in and scale the query down to this; hopefully it should run OK:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I have solved my problems thanks to the fact that the outer query is doing a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
you can click on Create a PivotTable report... . If you define the PivotTable properly, Excel will display only the outer sum.
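The query goal from the question (filter, sum per currency, take absolute values, then add them up) can be sanity-checked with a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here: its SQL dialect differs from Access, the sample rows are made up, and this does not address the Excel linking problem itself. The currency filter is placed in WHERE rather than HAVING, which is a design choice of this sketch and gives the same result for this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Currency (ID INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE Positions (
        ID INTEGER PRIMARY KEY,
        position REAL,
        currency_id INTEGER REFERENCES Currency(ID),
        portfolio INTEGER
    );
    INSERT INTO Currency (ID, code) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO Positions (position, currency_id, portfolio) VALUES
        (100.0, 1, 1), (-250.0, 1, 1),  -- currency A nets to -150
        (40.0, 2, 1), (10.0, 2, 1),     -- currency B nets to 50
        (999.0, 3, 1),                  -- currency C is filtered out by code
        (500.0, 1, 2);                  -- other portfolio, filtered out
""")

QUERY = """
    SELECT SUM(ABS(A.SumOfPosition)) AS SumAbs
    FROM (
        SELECT C.code, SUM(P.position) AS SumOfPosition
        FROM Currency AS C
        INNER JOIN Positions AS P ON C.ID = P.currency_id
        WHERE P.portfolio = 1 AND C.code IN ('A', 'B')
        GROUP BY C.code
    ) AS A
"""

# One row, one column: |-150| + |50|.
sum_abs = conn.execute(QUERY).fetchone()[0]
print(sum_abs)
```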
