How to filter column values according to a list of strings - sql-like

I am trying to filter a dataset by comparing a list of strings to column values.
This works well using "LIKE" and a single string, processing about 3 GB:
#standardSQL
SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "") AS FLOAT64)) tone,
avg(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c1.3:(\d+)') as FLOAT64)) anew,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c12.1:(\d+)') as FLOAT64)) ridanxietycnt,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'wc:(\d+)') as FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` t
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
AND V2Themes LIKE 'ECON_INFLATION'
group by daydate
However, when combining more than one string with "LIKE", the amount of data processed suddenly jumps (8 TB):
#standardSQL
SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "") AS FLOAT64)) tone,
avg(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c1.3:(\d+)') as FLOAT64)) anew,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c12.1:(\d+)') as FLOAT64)) ridanxietycnt,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'wc:(\d+)') as FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` t
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
AND V2Themes LIKE 'ECON_INFLATION' OR V2Themes LIKE 'ECON_STOCKMARKET'
group by daydate
Is there a more efficient (and cheaper) way to compare column values to a list of strings?
Any ideas would be much appreciated.

Careful with logic and OR precedence.
This:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
AND V2Themes LIKE 'ECON_INFLATION'
OR V2Themes LIKE 'ECON_STOCKMARKET'
is not the same as:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
AND (
V2Themes LIKE 'ECON_INFLATION'
OR V2Themes LIKE 'ECON_STOCKMARKET'
)
In the first one, AND binds more tightly than OR, so the partition filter only applies to the first LIKE; the OR branch scans the whole table. The second one applies the partition filter to both LIKE conditions, which is why it processes far less data.
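To make the fix concrete, here is a sketch of the query with the OR conditions parenthesized, trimmed to a couple of output columns. Running it through the google-cloud-bigquery Python client is my own assumption (the console or bq CLI works just as well); the only point being made is the parenthesized WHERE clause.
from google.cloud import bigquery

# Sketch only: the client usage is an assumption; the key change is the
# parenthesized OR in the WHERE clause.
client = bigquery.Client()  # assumes default project and credentials are configured

sql = """
SELECT substr(CAST((DATE) AS STRING), 0, 8) AS daydate,
       count(1) AS count
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
  AND (V2Themes LIKE 'ECON_INFLATION' OR V2Themes LIKE 'ECON_STOCKMARKET')
GROUP BY daydate
"""

for row in client.query(sql).result():
    print(row["daydate"], row["count"])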

Related

Prioritise query values over others using another query for values to be prioritised

I have the following query of Olympic countries in power query which I wish to sort using another query containing "prioritised countries" (the current top 10). I wish to sort the original query such that if a country is on the prioritised list it is alphabetically sorted at the top of the query.
[A screenshot in the original question illustrates the desired result.]
The best I have been able to do is to merge the queries; however, this removes countries not on the prioritised query. I appreciate that I could create a second copy of the original query, append it to the prioritised countries and then remove duplicates, but I am looking for a more elegant solution, as that would require refreshing the data twice.
Let Q be the query to sort and P be the priority list. Then you can get your desired result by appending the set difference Q \ P to the intersection Q ∩ P, with each part sorted alphabetically.
Here's one way to do this in M:
let
    Source =
        Table.FromList(
            List.Combine(
                {
                    List.Sort( List.Intersect( { P[Country], Q[Country] } ) ),
                    List.Sort( List.RemoveItems( Q[Country], P[Country] ) )
                }
            ),
            null,
            {"Country"}
        )
in
    Source

SPARK Combining Neighbouring Records in a text file

I am very new to Spark.
I need to read a very large input dataset, but I fear the format of the input files would not be amenable to reading in Spark. The format is as follows:
RECORD,record1identifier
SUBRECORD,value1
SUBRECORD2,value2
RECORD,record2identifier
RECORD,record3identifier
SUBRECORD,value3
SUBRECORD,value4
SUBRECORD,value5
...
Ideally what I would like to do is pull the lines of the file into a Spark RDD, and then transform it into an RDD that only has one item per record (with the subrecords becoming part of their associated record item).
So if the example above was read in, I'd want to wind up with an RDD containing 3 objects: [record1, record2, record3]. Each object would contain the data from its RECORD and any associated SUBRECORD entries.
The unfortunate bit is that the only thing in this data linking subrecords to records is their position in the file, underneath their record. That means the problem is sequentially dependent and might not lend itself to Spark.
Is there a sensible way to do this using Spark (and if so, what could that be, and what transform could be used to collapse the subrecords into their associated record)? Or is this the sort of problem one needs to do outside Spark?
There is a somewhat hackish way to identify the sequence of records and sub-records. This method assumes that each new "record" is identifiable in some way.
// Assumes a spark-shell / notebook session, so spark.implicits._ is already in scope.
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, collect_list}

val df = Seq(
  ("RECORD", "record1identifier"),
  ("SUBRECORD", "value1"),
  ("SUBRECORD2", "value2"),
  ("RECORD", "record2identifier"),
  ("RECORD", "record3identifier"),
  ("SUBRECORD", "value3"),
  ("SUBRECORD", "value4"),
  ("SUBRECORD", "value5")
).toDS().rdd.zipWithIndex.map(r => (r._1._1, r._1._2, r._2)).toDF("record", "value", "id")

// Flag the start of each record, then label it and its sub-records via a running sum.
val win = Window.orderBy("id")
val recids = df.withColumn("newrec", ($"record" === "RECORD").cast(LongType))
  .withColumn("recid", sum($"newrec").over(win))
  .select($"recid", $"record", $"value")

// Split records from sub-records, re-join on the shared identifier and collect the values.
val recs = recids.where($"record" === "RECORD").select($"recid", $"value".as("recname"))
val subrecs = recids.where($"record" =!= "RECORD").select($"recid", $"value".as("attr"))
recs.join(subrecs, Seq("recid"), "left").groupBy("recname").agg(collect_list("attr").as("attrs")).show()
This snippet first uses zipWithIndex to number each row in order, then adds a boolean column that is true every time a "RECORD" is identified and false otherwise. We cast that boolean to a long and take a running sum over it, which has the neat side effect of labeling every record and its sub-records with a common identifier.
In this particular case, we then split to get the record identifiers, re-join only the sub-records, group by the record ids, and collect the sub-record values to a list.
The above snippet results in this:
+-----------------+--------------------+
| recname| attrs|
+-----------------+--------------------+
|record1identifier| [value1, value2]|
|record2identifier| []|
|record3identifier|[value3, value4, ...|
+-----------------+--------------------+
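For anyone working in Python rather than Scala, here is a rough PySpark sketch of the same idea (zipWithIndex plus a running sum over an ordered window); it assumes a modern SparkSession, and the names simply mirror the Scala snippet above.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

rows = [
    ("RECORD", "record1identifier"),
    ("SUBRECORD", "value1"),
    ("SUBRECORD2", "value2"),
    ("RECORD", "record2identifier"),
    ("RECORD", "record3identifier"),
    ("SUBRECORD", "value3"),
    ("SUBRECORD", "value4"),
    ("SUBRECORD", "value5"),
]

# Number every line so the original file order is preserved.
df = (spark.sparkContext.parallelize(rows)
      .zipWithIndex()
      .map(lambda r: (r[0][0], r[0][1], r[1]))
      .toDF(["record", "value", "id"]))

# A running sum of the "new RECORD starts here" flag gives each record and its
# sub-records a common identifier.
win = Window.orderBy("id")
recids = (df
          .withColumn("newrec", (F.col("record") == "RECORD").cast("long"))
          .withColumn("recid", F.sum("newrec").over(win)))

recs = recids.where(F.col("record") == "RECORD").select("recid", F.col("value").alias("recname"))
subrecs = recids.where(F.col("record") != "RECORD").select("recid", F.col("value").alias("attr"))

(recs.join(subrecs, ["recid"], "left")
     .groupBy("recname")
     .agg(F.collect_list("attr").alias("attrs"))
     .show(truncate=False))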

How can I dynamically pass a value to the DB2 search clause 'LIKE' while fetching results from another table

Can someone help me dynamically pass a value to the DB2 search clause LIKE while fetching results from another table?
I am trying this:
select * from table2 where file_name like '%(select file_name from table1)'
I've even tried CONCAT and approaches using sysibm.sysdummy1, but no luck.
Maybe this helps:
SELECT *
FROM table2
JOIN table1
ON table2.file_name LIKE CONCAT('%',table1.file_name)
Without the DDL for the files, sample data, or expected results, a reader cannot tell whether there are other considerations that might get in the way. That said, the following variation of the already-offered answer is more liberal in interpreting what the OP's select * from table2 where file_name like '%(select file_name from table1)' might intend: rather than an effective ends-with (or starts-with) predicate on the file-name value, it achieves an effective contains predicate.
select /* t1.file_name, */ t2.*
from table2 as t2
inner join table1 as t1
  on t2.file_name like '%' concat rtrim(t1.file_name) concat '%'

What is the right way to do a semi-join on two Spark RDDs (in PySpark)?

In my PySpark application, I have two RDD's:
items - This contains the item ID and item name of all valid items. Approx 100,000 items.
attributeTable - This contains the fields user ID, item ID and an attribute value of this combination, in that order. There is a certain attribute for each user-item combination in the system. This RDD has several hundreds of thousands of rows.
I would like to discard all rows in attributeTable RDD that don't correspond to a valid item ID (or name) in the items RDD. In other words, a semi-join by the item ID. For instance, if these were R data frames, I would have done semi_join(attributeTable, items, by="itemID")
I tried the following approach first, but found that this takes forever to return (on my local Spark installation running on a VM on my PC). Understandably so, because there are such a huge number of comparisons involved:
# Create a broadcast variable of all valid item IDs for doing the filter on the workers
validItemIDs = sc.broadcast(items.map(lambda (itemID, itemName): itemID).collect())
attributeTable = attributeTable.filter(lambda (userID, itemID, attributes): itemID in set(validItemIDs.value))
After a bit of fiddling around, I found that the following approach works pretty fast (a min or so on my system).
# Create a broadcast variable for item ID to item name mapping (dictionary)
itemIdToNameMap = sc.broadcast(items.collectAsMap())
# From the attribute table, remove records that don't correspond to a valid item name.
# First go over all records in the table and add a dummy field indicating whether the item name is valid
# Then, filter out all rows with invalid names. Finally, remove the dummy field we added.
attributeTable = (attributeTable
.map(lambda (userID, itemID, attributes): (userID, itemID, attributes, itemIdToNameMap.value.get(itemID, 'Invalid')))
.filter(lambda (userID, itemID, attributes, itemName): itemName != 'Invalid')
.map(lambda (userID, itemID, attributes, itemName): (userID, itemID, attributes)))
Although this works well enough for my application, it feels more like a dirty workaround and I am pretty sure there must be another cleaner or idiomatically correct (and possibly more efficient) way or ways to do this in Spark. What would you suggest? I am new to both Python and Spark, so any RTFM advices will also be helpful if you could point me to the right resources.
My Spark version is 1.3.1.
Just do a regular join and then discard the "lookup" relation (in your case items rdd).
If these are your RDDs (example taken from another answer):
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
then you'd do:
(attributeTable.keyBy(lambda x: x[1])
    .join(items)
    .map(lambda (key, (attribute, item)): attribute))
And as a result, you only have tuples from attributeTable RDD which have a corresponding entry in the items RDD:
[(123456, 123, 'Attribute for A')]
Doing it via leftOuterJoin as suggested in another answer will also do the job, but is less efficient. Also, the other answer semi-joins items with attributeTable instead of attributeTable with items.
As others have pointed out, this is probably most easily accomplished by leveraging DataFrames. However, you might be able to accomplish your intended goal by using the leftOuterJoin and the filter functions. Something a bit hackish like the following might suffice:
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
sorted(items.leftOuterJoin(attributeTable.keyBy(lambda x: x[1]))
.filter(lambda x: x[1][1] is not None)
.map(lambda x: (x[0], x[1][0])).collect())
returns
[(123, 'Item A')]
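Since the answers above point toward DataFrames as the easiest route, here is a minimal sketch of that approach using a "leftsemi" join, which keeps only the attributeTable rows whose item ID appears in items. It assumes a newer Spark than the 1.3.1 mentioned in the question (the SparkSession API), and the column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

itemsDF = spark.createDataFrame([(123, "Item A"), (456, "Item B")],
                                ["itemID", "itemName"])
attributesDF = spark.createDataFrame([(123456, 123, "Attribute for A")],
                                     ["userID", "itemID", "attributes"])

# A semi-join: keep attribute rows whose itemID exists in itemsDF, without
# pulling any columns from itemsDF into the result.
attributesDF.join(itemsDF, "itemID", "leftsemi").show()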

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, in fact the query contains a subquery. So, I'll try to describe the query goal and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group, and call the result sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
in order to calculate the final SUM I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way for structuring the query in order to make the export work?
I can't see too much wrong with it, but I would take out some of the junk Access puts in and scale the query down to this; hopefully it should run OK:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I have solved my problems thanks to the fact that the outer query is doing a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
you can click on Create a PivotTable report... . If you define the PivotTable properly, Excel will display only the outer sum.
