I'm trying to translate a legacy SQL query into a standard SQL query in BigQuery, but I can't find the POSITION() function in standard SQL.
You may be looking for the bracket operator. For example,
SELECT array_column[OFFSET(0)]
FROM dataset.table
This selects the first element of an array column for each row. If you want to flatten an array and get the offset of each element, you can do so like this:
SELECT x, x_offset
FROM dataset.table,
UNNEST(array_column) AS x WITH OFFSET x_offset
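If the legacy query used POSITION() to get the 1-based position of each element within a repeated field, the WITH OFFSET form can reproduce it. A minimal sketch, reusing the names above (OFFSET is 0-based, while legacy POSITION() was 1-based, hence the + 1):
SELECT x, x_offset + 1 AS pos
FROM dataset.table,
UNNEST(array_column) AS x WITH OFFSET x_offset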
See also the working with arrays documentation.
I'm using ADODB to query Sheet1. If I fetch the data using a SQL query on the sheet as below, without grouping, I get all of the characters from the Comment column.
However, if I use grouping, the characters are truncated to 255.
Note – my first row contains a comment of 800 characters (per LEN), so the drivers have identified the data type correctly.
Here is my query without grouping:
Select Product, Value, Comment, len(comment) from [sheet1$A1:T10000]
With grouping:
Select Product, sum(value), Comment, len(comment) from [sheet1$A1:T10000] group by Product, Comment
Thanks for posting this! During my 20+ years of database development using ADO recordsets I had never faced this issue until this week. Once I traced the truncation to the recordset I was really scratching my head. I couldn't figure out how or why it was happening until I found your post, and you got me focused on the GROUP BY. Sure enough, that was the cause (some kind of ADO bug, I guess). I was able to work around it by putting correlated scalar subqueries in the SELECT list instead of using JOIN and GROUP BY.
To elaborate...
At least 9 times out of 10 (in my experience) JOIN/GROUP BY syntax can be replaced with correlated scalar subquery syntax, with no appreciable loss of performance. That's fortunate in this case since there is apparently a bug with ADO recordset objects whereby GROUP BY syntax results in the truncation of text when the string length is greater than 255 characters.
The first example below uses JOIN/GROUP BY. The second uses a correlated scalar subquery. Both would/should provide the same results. However, if any comment is greater than 255 characters these 2 queries will NOT return the same results if an ADODB recordset is involved.
Note that in the second example the last column in the SELECT list is itself a full select statement. It's called a scalar subquery because it will only return 1 row / 1 column. If it returned multiple rows or columns an error would be thrown. It's also known as a correlated subquery because it references something that is immediately outside its scope (e.emp_number in this case).
SELECT e.emp_number, e.emp_name, e.supv_comments, SUM(i.invoice_amt) As total_sales
FROM employees e INNER JOIN invoices i ON e.emp_number = i.emp_number
GROUP BY e.emp_number, e.emp_name, e.supv_comments
SELECT e.emp_number, e.emp_name, e.supv_comments,
(SELECT SUM(i.invoice_amt) FROM invoices i WHERE i.emp_number = e.emp_number) As total_sales
FROM employees e
I understand that rand() produces a column of random values, and that orderBy takes a column and sorts by it in either descending or ascending order.
Looking at dataframe.orderBy(rand):
I find it puzzling that orderBy can take a column and sort by it, even though that column has not been created on the dataframe.
Compared to
dataframe.withColumn("X",rand).orderBy("X")
where dataframe("X") is already defined.
Which leads me to two questions.
Is dataframe.orderBy(rand) the same as dataframe.withColumn("X",rand).orderBy("X") in context of ordering?
Is it necessary to create additional columns for the purpose of ordering before using .orderBy?
Yes, both variants are equivalent, and aren't that surprising. orderBy takes an expression or the name of a column; here it's the first variant. If you're familiar with SQL,
dataframe.withColumn("X",rand).orderBy("X")
is equivalent to
SELECT * FROM (SELECT *, rand() AS X FROM table) ORDER BY X
while
dataframe.orderBy(rand)
is equivalent to
SELECT * FROM table ORDER BY rand()
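To see the two variants side by side in Spark/Scala, here is a minimal sketch (df is just an example DataFrame; each call to rand() draws fresh random values, so the two runs won't give the identical order):
import org.apache.spark.sql.functions.rand

val df = spark.range(5).toDF("id")

// Orders by a random expression without keeping an extra column.
df.orderBy(rand()).show()

// Materialises the random values as column "X", then orders by that column.
df.withColumn("X", rand()).orderBy("X").show()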
I have data lying in multiple files, named with the convention {year}/{month}/{date}, which contain duplicates (a daily delta where records may get updated every day).
I want to create a view that will return the records with the duplicates merged / squashed.
The duplicates will be ranked and only the latest updated records corresponding to each primary key will be returned.
But the use of rowsets in a view does not seem to be supported. Basically I want something like this:
CREATE VIEW viewname AS
@sourceData = EXTRACT //schema
FROM //filenamePattern (regex)
USING Extractors.Tsv();
@rankedData = SELECT *,
ROW_NUMBER() OVER(PARTITION BY primary_Key ORDER BY timestamp DESC) AS RowNumber
FROM @sourceData;
SELECT //schema
FROM @rankedData WHERE RowNumber == 1;
So that when I do
select * from viewname
I get the merged data directly from the underlying files. How can I achieve this?
It is possible to have multiple EXTRACT statements in a view stacked together with a UNION statement, which will implicitly remove duplicates. However, is there any particular reason you need to use a view? This will limit your options, as you will have to code within the limitations of views (e.g. they can't be parameterised). You could also use a table-valued function, a stored procedure, or just a plain old script. These would give you many more options, especially if your de-duplication logic is complex. A simple example:
DROP VIEW IF EXISTS vw_removeDupes;
CREATE VIEW vw_removeDupes
AS
EXTRACT someVal int
FROM "/input/input59a.txt"
USING Extractors.Tsv()
UNION
EXTRACT someVal int
FROM "/input/input59b.txt"
USING Extractors.Tsv();
I think it can be solved with a table-valued function. Have you tried using one?
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-functions
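A rough sketch of what such a table-valued function could look like, assuming a made-up schema (primary_Key, timestamp, someVal) and a hypothetical file-set path; adjust both to your actual data:
DROP FUNCTION IF EXISTS tvf_LatestPerKey;
CREATE FUNCTION tvf_LatestPerKey()
RETURNS @result TABLE(primary_Key string, timestamp DateTime, someVal string)
AS
BEGIN
    // Read every daily file; year/month/day come from the path pattern.
    @source =
        EXTRACT primary_Key string,
                timestamp DateTime,
                someVal string,
                year string,
                month string,
                day string
        FROM "/input/{year}/{month}/{day}.tsv"
        USING Extractors.Tsv();

    // Rank duplicates so the most recently updated record per key comes first.
    @ranked =
        SELECT primary_Key, timestamp, someVal,
               ROW_NUMBER() OVER(PARTITION BY primary_Key ORDER BY timestamp DESC) AS RowNumber
        FROM @source;

    @result =
        SELECT primary_Key, timestamp, someVal
        FROM @ranked
        WHERE RowNumber == 1;
END;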
Teradata has a function called ZEROIFNULL, which does what the name suggests: if the value of a column is NULL, it returns zero. Along similar lines, there's a function called NULLIFZERO as well.
I want to imitate/mock these functions in Spark SQL (not using the DataFrame or RDD APIs; instead, I want to use them in Spark SQL, where you directly pass the SQL).
Any ideas?
Try
sqlContext.sql("select COALESCE(column,0)")
Returns zero if column is NULL.
To mimic NULLIFZERO, you could use CASE WHEN:
select case when col = 0 then NULL else col end from tbl
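Putting both together as plain SQL passed to Spark (a minimal sketch; tbl and col are placeholder names):
-- ZEROIFNULL(col): NULL becomes 0
SELECT COALESCE(col, 0) AS col_zeroifnull FROM tbl

-- NULLIFZERO(col): 0 becomes NULL, other values pass through unchanged
SELECT CASE WHEN col = 0 THEN NULL ELSE col END AS col_nullifzero FROM tbl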
I'm using apache-cassandra 1.2 (the CLI), and I want to search for a substring in columns, just like we search in SQL using LIKE in a WHERE clause.
Can anyone tell me how to search for a substring in rows?
I only want to do it in the CLI; please don't suggest CQL.
Perhaps if Cassandra supports indexing on collection columns in the near future, this will become possible; in that case we could split the text into pieces and easily perform the LIKE operation.
But for now the best you can do is a prefix match, if your ordering is alphabetical. For example, I have a CF with the UTF8Type comparator, and I can do a slice query that brings back all columns starting with the prefix: slice from the prefix to the prefix with its last character replaced by the next one in order (e.g. "aaa" to "aab").