Multiple values in WHERE clause using sqldf in R

I am trying to query multiple values in the WHERE clause, using sqldf in R. I have the following query; however, it keeps returning zero rows. Any help would be appreciated.
sqldf("SELECT amount
from df
where category = 'description' and 'original description'")
Output: <0 rows> (or 0-length row.names)

You just need to use an IN condition. (In your query, AND 'original description' is evaluated as a separate boolean expression; SQLite, sqldf's default backend, casts the string to 0, i.e. false, which is why you get zero rows.)
sqldf("SELECT amount
from df
where category in ('description','original description')")
If you want to match with the LIKE operator instead, you need OR rather than AND. (Not sure what other entries are in category; if no other category has "description" in its name, the following might be enough:)
sqldf("SELECT amount from df where category LIKE 'descriptio%'")

You need to define each WHERE condition explicitly, so:
SELECT amount FROM df WHERE category = 'description' OR category = 'original description'
Alternatively, you can pass in multiple values with the IN operator:
SELECT amount FROM df WHERE category IN ( 'description', 'original description' )
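If it helps to sanity-check the SQL outside R: the Python pandasql package exposes the same sqldf-style interface, so the IN query can be tried there too (a minimal sketch with made-up sample data):
import pandas as pd
from pandasql import sqldf  # pip install pandasql

# made-up sample data for illustration
df = pd.DataFrame({
    'category': ['description', 'original description', 'misc'],
    'amount': [10.0, 20.0, 30.0],
})

result = sqldf(
    "SELECT amount FROM df "
    "WHERE category IN ('description', 'original description')",
    locals(),
)
print(result)  # two rows: 10.0 and 20.0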


BigQuery Struct Aggregation

I am processing an ETL job on BigQuery, where I am trying to reconcile data that may have conflicting sources. I first used array_agg(distinct my_column ignore nulls) to find out where reconciliation was needed, and next I need to prioritize data per column based on the source.
I thought to array_agg(struct(data_source, my_column)) and hoped I could easily extract the preferred source's data for a given column. However, with this method I failed to aggregate the data as a struct and instead aggregated it as an array of structs.
Consider the simplified example below, where I prefer to get job_title from HR and dietary_pref from Canteen:
with data_set as (
select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
union all
select 'John' as employee, 'Manager' as job_title, 'vegetarian' as dietary_pref, 'Canteen' as source
union all
select 'Mary' as employee, 'Marketing Director' as job_title, 'pescatarian' as dietary_pref, 'HR' as source
union all
select 'Mary' as employee, 'Marketing Manager' as job_title, 'gluten-free' as dietary_pref, 'Canteen' as source
)
select employee,
array_agg(struct(source, job_title)) as job_title,
array_agg(struct(source, dietary_pref)) as dietary_pref,
from data_set
group by employee
The data I get for John with regard to the job title is:
[{'source':'HR', 'job_title':'Senior Manager'}, {'source': 'Canteen', 'job_title':'Manager'}]
Whereas I am trying to achieve:
[{'HR' : 'Senior Manager', 'Canteen' : 'Manager'}]
With a struct output, I was hoping to then easily access the preferred source using my_struct.my_preferred_source. In this particular case I hope to invoke job_title.HR and dietary_pref.Canteen.
Hence, in pseudo-SQL, I imagine I would:
select employee,
AGGREGATE_JOB_TITLE_AS_STRUCT(source, job_title).HR as job_title,
AGGREGATE_DIETARY_PREF_AS_STRUCT(source, dietary_pref).Canteen as dietary_pref,
from data_set group by employee
The output would then be:
employee, job_title, dietary_pref
John, Senior Manager, vegetarian
Mary, Marketing Director, gluten-free
I'd like help solving this. Perhaps this is the wrong approach altogether, but given the more complex data set I am dealing with, I thought it would be the preferred one (albeit a failed attempt).
Open to alternatives. Please advise. Thanks
Note: I edited this post after Mikhail's answer, which solved my problem using a slightly different method than I expected, and added more details on my intent to use a single struct per employee.
Consider below
select employee,
array_agg(struct(source as job_source, job_title) order by if(source = 'HR', 1, 2) limit 1)[offset(0)].*,
array_agg(struct(source as dietary_source, dietary_pref) order by if(source = 'HR', 2, 1) limit 1)[offset(0)].*
from data_set
group by employee
if applied to the sample data in your question, the output is
employee, job_source, job_title, dietary_source, dietary_pref
John, HR, Senior Manager, Canteen, vegetarian
Mary, HR, Marketing Director, Canteen, gluten-free
Update:
use below for clarified output
select employee,
array_agg(job_title order by if(source = 'HR', 1, 2) limit 1)[offset(0)] as job_title,
array_agg(dietary_pref order by if(source = 'HR', 2, 1) limit 1)[offset(0)] as dietary_pref
from data_set
group by employee
with output
employee, job_title, dietary_pref
John, Senior Manager, vegetarian
Mary, Marketing Director, gluten-free
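For readers who want to prototype the same preference logic outside BigQuery, here is a rough pandas equivalent of the "order by source preference, take the first" trick (prefer() is a made-up helper, not part of any library):
import pandas as pd

# the sample data from the question
data_set = pd.DataFrame([
    ('John', 'Senior Manager', 'vegan', 'HR'),
    ('John', 'Manager', 'vegetarian', 'Canteen'),
    ('Mary', 'Marketing Director', 'pescatarian', 'HR'),
    ('Mary', 'Marketing Manager', 'gluten-free', 'Canteen'),
], columns=['employee', 'job_title', 'dietary_pref', 'source'])

# mirror the SQL: sort so the preferred source comes first,
# then take the first value per employee
def prefer(df, col, preferred_source):
    ordered = df.sort_values('source', key=lambda s: s != preferred_source)
    return ordered.groupby('employee')[col].first()

result = pd.DataFrame({
    'job_title': prefer(data_set, 'job_title', 'HR'),
    'dietary_pref': prefer(data_set, 'dietary_pref', 'Canteen'),
}).reset_index()
print(result)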

Dealing with duplicates in a pandas query

I have the following DataFrame:
import pandas as pd

data = {'Customer_ID': ['123','2','1010','123'],
        'Date_Create': ['12/08/2010','04/10/1998','27/05/2010','12/08/2010'],
        'Purchase': [1,1,0,1]}
df = pd.DataFrame(data, columns = ['Customer_ID', 'Date_Create','Purchase'])
I want to perform this query:
df_2 = df[['Customer_ID','Date_Create','Purchase']].groupby(['Customer_ID'],
as_index=False).sum().sort_values(by='Purchase', ascending=False)
The objective of this query is to sum all purchases (a 0/1 field) and output a dataframe with 3 columns: 'Customer_ID', 'Date_Create', 'Purchase'.
Problem is: the field Date_Create is not in the query because it contains duplicates, since the creation date of the account does not change.
How can I solve it?
Thanks
If I'm understanding it correctly and your source data has some duplicates, there's a function specifically for this: DataFrame.drop_duplicates()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
To only consider some columns in the duplicate check, use subset:
df2 = df.drop_duplicates(subset=['Customer_ID','Date_Create'])
You can add the column Date_Create to the groupby if it has the same value per Customer_ID:
(df.groupby(['Customer_ID','Date_Create'], as_index=False)['Purchase']
.sum()
.sort_values(by='Purchase', ascending=False))
If not, use some aggregation function - e.g. GroupBy.first for the first date per group:
(df.groupby('Customer_ID')
.agg(Purchase = ('Purchase', 'sum'), Date_Create= ('Date_Create', 'first'))
.reset_index()
.sort_values(by='Purchase', ascending=False))
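For reference, with the sample data from the question both variants produce the same frame (123's two purchases sum to 2, and each customer keeps its single creation date):
result = (df.groupby(['Customer_ID', 'Date_Create'], as_index=False)['Purchase']
            .sum()
            .sort_values(by='Purchase', ascending=False))
print(result)
#   Customer_ID Date_Create  Purchase
# 1         123  12/08/2010         2
# 2           2  04/10/1998         1
# 0        1010  27/05/2010         0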

Cast some columns and select all columns without explicitly writing column names

I want to cast some columns and then select all others
id, name, property, description = column("id"), column("name"), column("property"), column("description")
select([cast(id, String).label('id'), cast(property, String).label('property'), name, description]).select_from(events_table)
Is there any way to cast some columns and select all without mentioning all the column names?
I tried
select([cast(id, String).label('id'), cast(property, String).label('property')], '*').select_from(events_table)
py_.transform(return_obj, lambda acc, element: acc.append(dict(element)), [])
But I get two extra columns (7 columns in total), which are the cast ones, and I can't convert the rows to dictionaries; it throws a KeyError.
I'm using FASTAPI, sqlalchemy and databases(async)
Thanks
Pretty sure you can do
select_columns = []
for field in events_table.columns.keys():
select_columns.append(getattr(events_table.c, field))
select(select_columns).select_from(events_table)
to select all fields from that table. You can also keep a list of the fields you actually want to select instead of using events_table.columns.keys(), like
select_these = ["id", "name", "property", "description"]
select_columns = []
for field in select_these:
select_columns.append(getattr(events_table.c, field))
select(select_columns).select_from(events_table)
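Combining that with the casts from the question, a minimal sketch (same 1.x-style select() as above, assuming events_table is the SQLAlchemy Table from your setup; the set of columns to cast is an assumption to adjust):
from sqlalchemy import String, cast, select

# columns that need casting; everything else passes through unchanged
cast_to_string = {"id", "property"}

select_columns = [
    cast(col, String).label(col.name) if col.name in cast_to_string else col
    for col in events_table.columns
]
query = select(select_columns).select_from(events_table)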

better way to select all columns and join in pyspark data frames

I have two data frames in pyspark. Their schemas are below
df1
DataFrame[customer_id: int, email: string, city: string, state: string, postal_code: string, serial_number: string]
df2
DataFrame[serial_number: string, model_name: string, mac_address: string]
Now I want to do a full outer join on these two data frames, using coalesce on the column common to both.
I have done it like below and got the expected result.
full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select(df1.customer_id, df1.email, df1.city, df1.state, df1.postal_code, f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number'), df2.model_name, df2.mac_address)
Now I want to do the above a little differently. Instead of writing all the column names in the select of the join statement, I want to do something like using * on the data frame. Basically I want something like below.
full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select('df1.*', f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number1'), df2.model_name, df2.mac_address).drop('serial_number')
I am getting what I want. Is there a better way to do this kind of operation in pyspark?
edit
This is not a duplicate of https://stackoverflow.com/questions/36132322/join-two-data-frames-select-all-columns-from-one-and-some-columns-from-the-othe?rq=1 because I am using coalesce in the join statement. I want to know if there is a way to exclude the column on which I am using the coalesce function.
You can do something like this:
from pyspark.sql import functions as f

(df1
 .join(df2, df1.serial_number == df2.serial_number, 'full_outer')
 .select(
     [df1[c] for c in df1.columns if c != 'serial_number'] +
     [f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number')]
 ))
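A possible alternative worth knowing (verify on your Spark version): if you join on the column name rather than an expression, Spark applies SQL USING semantics, keeping a single serial_number column that is already coalesced across both sides in a full outer join:
# joining on the name deduplicates the key column and fills it from
# whichever side is non-null, so no explicit coalesce/drop is needed
full_df = df1.join(df2, on='serial_number', how='full_outer')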

Google BigQuery Replace function for string type

I am trying to replace certain customer names in my data.
Using Google BigQuery SQL, I was able to transform one part of a string into another via the REPLACE function, for one particular string:
Replace(CustomerName, 'ABC', 'XYZ')
However, I have a couple more replacements I would need to apply, such as
Replace(CustomerName, 'PLO', 'Rustic')
Replace(CustomerName, 'Kix', 'BowWow')
and so on.
I've tried doing
Replace(CustomerName, 'ABC', 'XYZ') OR Replace(CustomerName, 'PLO', 'Rustic') OR Replace(CustomerName, 'Kix', 'BowWow')
but that got me an error message.
I've also tried
Replace(CustomerName, 'ABC', 'XYZ') AND Replace(CustomerName, 'PLO', 'Rustic') AND Replace(CustomerName, 'Kix', 'BowWow')
but that also got me an error message.
I am able to just use a CASE WHEN statement and hardcode each one, but I'm wondering if there is a better/faster way using REPLACE instead.
Thanks for your help.
The CASE WHEN option is pretty reasonable. Another option is to chain them together:
REPLACE(
REPLACE(
REPLACE(
CustomerName,
'ABC',
'XYZ'),
'PLO',
'Rustic'),
'Kix',
'BowWow')
Which one you pick really depends on the exact scenario. The chained REPLACE calls are probably faster, but they could overlap in weird ways (e.g., if the output to one replacement matches the input to a subsequent one). The CASE WHEN approach avoids that issue, but it's probably more expensive because you need to do one operation to find the substring and another to actually replace it.
Note that when you're using AND or OR, you're trying to combine the string output of REPLACE as if it were a boolean, which is why it's failing.
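To make the overlap hazard concrete, here is the same effect in Python (the string literals are made up for illustration):
name = "ABC Corp"
name = name.replace("ABC", "XYZ")  # -> "XYZ Corp"
name = name.replace("XYZ", "QWE")  # -> "QWE Corp"
# the second replacement consumed the first one's output, so the
# original "ABC" ends up as "QWE", which may not be what you wanted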
In cases where you have quite a number of replacements, chaining REPLACEs becomes impractical and annoying manual work.
Below addresses this potential issue, assuming you maintain a lookup table with pairs (Word, Replacement); it uses the JS() table function from BigQuery legacy SQL:
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM ReplacementLookup
)
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
You can test it using the example below:
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM (
SELECT CustomerName FROM
(SELECT '1234ABC567' AS CustomerName),
(SELECT '12 34 PLO 56' AS CustomerName),
(SELECT 'Kix' AS CustomerName),
(SELECT '98 ABC PLO Kix ABC 76 XYZ 54' AS CustomerName),
(SELECT 'ABCQweKIX' AS CustomerName)
) YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM (
SELECT Word, Replacement FROM
(SELECT 'XYZ' AS Word, 'QWE' AS Replacement),
(SELECT 'ABC' AS Word, 'XYZ' AS Replacement),
(SELECT 'PLO' AS Word, 'Rustic' AS Replacement),
(SELECT 'Kix' AS Word, 'BowWow' AS Replacement)
)
) ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
Please note: there is still an issue if the result of one replacement matches the input to a subsequent replacement.
I believe there are multiple ways to tackle this problem; it depends on the size of your dataset, the practicality of simply making a lookup table by hand and uploading it to BigQuery, and the granularity of the data you want to replace.
If your values are very granular, you can create a table with "from" and "to" values in different columns, join that table with your main table, and retrieve the new values very cleanly.
# Replace the support_table table with your actual table
WITH support_table AS (
SELECT "ABC" AS OldValue, "XYZ" AS NewValue
)
SELECT main_table.OldValue, support_table.NewValue FROM main_table
JOIN support_table ON main_table.OldValue = support_table.OldValue
Now, if you want to replace a big list of different values with something, you can use REGEXP_REPLACE with a string containing all possible values.
If you have a very big list of items, you can use STRING_AGG over a table with all the values you want to replace, or skip the STRING_AGG step and create that string by hand.
Both of the snippets below result in "item1|item2|item3"; choose whichever is faster for you to do.
# Replace the values_to_replace table with your actual table
WITH values_to_replace AS (
SELECT "item1" AS ColumnWithItemsToReplace
UNION ALL
SELECT "item2"
UNION ALL
SELECT "item3"
)
SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM values_to_replace
SELECT r"item1|item2|item3"
STRING_AGG will retrieve all the values from a table or query and concatenate them using a separator of choice. If you use the pipe separator, you will be able to create a string like "item1|item2|item3|..."
For a regular expression, the pipe counts as "or", which means that the regex will interpret the string as "item1 or item2 or item3". Thus, if you pass that generated string to REGEXP_REPLACE as the values to be replaced, it will be considered valid.
Example code below:
REGEXP_REPLACE(
column_to_replace
,(SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM `YourTable`)
,"Replacer"
)
Hope it helps.
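If you prototype this logic outside BigQuery, the same pipe-separated pattern can drive a single-pass, per-key replacement in Python, which also sidesteps the chained-REPLACE overlap issue (the mapping below is just the example data from this thread):
import re

replacements = {'ABC': 'XYZ', 'PLO': 'Rustic', 'Kix': 'BowWow'}

# build "ABC|PLO|Kix" (the STRING_AGG step) and replace in one pass,
# so the output of one replacement can never feed a later one
pattern = re.compile('|'.join(map(re.escape, replacements)))
fixed = pattern.sub(lambda m: replacements[m.group(0)], '98 ABC PLO Kix 76')
print(fixed)  # 98 XYZ Rustic BowWow 76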
