apache spark pivot dynamically - apache-spark

The PIVOT clause is available in Apache Spark SQL, but it expects an expression_list, which works when you know in advance which columns you expect. I would like to pivot on columns dynamically.
In my use case, I would need to retrieve the list of values in a query and then pass it into IN.
I just don't see how I can do this, beyond dynamically building the SQL string with the expression_list substituted in as a parameter, i.e. building a template and then executing the query.
That would be fine if I could do it in a Databricks notebook (as in my case), but I'm writing a query in Databricks SQL, which only accepts SQL, so I have no way to build a SQL string.
I would really appreciate any advice on how to resolve this.
%sql
drop table person;
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
SELECT * FROM person
PIVOT (
SUM(age) AS a
FOR name IN ('John', 'Dan')
)
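For reference, here is a rough PySpark sketch of the string-building workaround described above, together with the DataFrame pivot() alternative. It assumes a notebook context where Python is available (which Databricks SQL does not allow) and reuses the person table from the example; treat it as an untested outline rather than a verified solution.
# 1. Collect the pivot values dynamically.
names = [r["name"] for r in spark.table("person").select("name").distinct().collect()]

# 2a. Interpolate them into the PIVOT's IN list (the "SQL template" approach).
#     A real implementation should sanitize/escape the values first.
in_list = ", ".join(f"'{n}'" for n in names)
pivoted_sql = spark.sql(f"""
    SELECT * FROM person
    PIVOT (SUM(age) AS a FOR name IN ({in_list}))
""")

# 2b. Or skip SQL entirely and pass the dynamic list to DataFrame.pivot().
pivoted_df = spark.table("person").groupBy("class").pivot("name", names).sum("age")

pivoted_sql.show()
pivoted_df.show()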

Related

Change the datatype of a column in delta table

Is there a SQL command I can use to change the datatype of an existing column in a Delta table? I need to change the column datatype from BIGINT to STRING. Below is the SQL command I'm trying to use, but no luck.
%sql ALTER TABLE [TABLE_NAME] ALTER COLUMN [COLUMN_NAME] STRING
Error I'm getting:
org.apache.spark.sql.AnalysisException
ALTER TABLE CHANGE COLUMN is not supported for changing column 'bam_user' with type
'IntegerType' to 'bam_user' with type 'StringType'
SQL doesn't support this, but it can be done in Python:
from pyspark.sql.functions import col

# set dataset location and columns with new types
table_path = '/mnt/dataset_location...'
types_to_change = {
    'column_1': 'int',
    'column_2': 'string',
    'column_3': 'double'
}

# load to dataframe, change types
df = spark.read.format('delta').load(table_path)
for column in types_to_change:
    df = df.withColumn(column, col(column).cast(types_to_change[column]))

# save df with new types, overwriting the schema
df.write.format("delta").mode("overwrite").option("overwriteSchema", True).save("dbfs:" + table_path)
There is no option to change the data type of a column or to drop a column directly. You can read the data into a DataFrame, change the data types with withColumn(), drop columns with drop(), and overwrite the table.
There is no real way to do this using SQL, unless you copy the data to a different table altogether. That route means inserting the data into a new table, dropping the old table, and re-creating it with the new structure, which is why it is risky (a rough sketch of it appears at the end of this answer).
The way to do this in Python is as follows:
Let's say this is your table :
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
You can check the table structure using the following:
DESCRIBE TABLE person
If you need to change the id to a string:
This is the code:
%py
from pyspark.sql.functions import col

df = spark.read.table("person")
df1 = df.withColumn("id", col("id").cast("string"))
(df1.write
    .format("parquet")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("person"))
A couple of pointers: the format is parquet in this table, which is the default for Databricks, so you can omit the format line. Note that Python is sensitive to whitespace, which is why the chained write call is wrapped in parentheses.
Regarding Databricks:
If the format is "delta", you must specify it.
Also, if the table is partitioned, it's important to mention that in the code:
For example:
(df1.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("col_to_partition1", "col_to_partition2")
    .option("overwriteSchema", "true")
    .save(table_location))
where table_location is the path the Delta table is saved to.
(some of this answer is based on this)
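For completeness, the copy-to-a-new-table route mentioned at the top of this answer could be sketched roughly as follows. This is an untested outline: person_casted is a hypothetical intermediate table, and dropping the original table is exactly what makes this approach risky, so back it up first.
# Rough outline of the copy-and-recreate route.
spark.sql("""
    CREATE TABLE person_casted
    USING delta
    AS SELECT CAST(id AS STRING) AS id, name, age, class, address FROM person
""")
spark.sql("DROP TABLE person")
spark.sql("ALTER TABLE person_casted RENAME TO person")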
Suppose you want to change the data type of column "column_name" (for example, to "int") in table "delta_table_name":
from pyspark.sql.functions import col

(spark.read.table("delta_table_name")
    .withColumn("column_name", col("column_name").cast("new_data_type"))
    .write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", True)
    .saveAsTable("delta_table_name"))
Read the table using Spark.
Use the withColumn method to transform the column you want.
Write the table back with mode "overwrite" and overwriteSchema set to True.
Reference: https://docs.databricks.com/delta/update-schema.html#explicitly-update-schema-to-change-column-type-or-name
from pyspark.sql import functions as F

spark.read.table("<TABLE NAME>") \
    .withColumn("<COLUMN NAME>", F.col("<COLUMN NAME>").cast("<DATA TYPE>")) \
    .write.format("delta").mode("overwrite").option("overwriteSchema", True).saveAsTable("<TABLE NAME>")

BigQuery Struct Aggregation

I am processing an ETL job on BigQuery, where I am trying to reconcile data that may come from conflicting sources. I first used array_agg(distinct my_column ignore nulls) to find out where reconciliation was needed, and next I need to prioritize the data per column based on the source.
I thought of using array_agg(struct(data_source, my_column)) and hoped I could easily extract the preferred source's data for a given column. However, with this method I did not end up with a struct; I ended up with an array of structs.
Consider the simplified example below, where I prefer to get job_title from HR and dietary_pref from Canteen:
with data_set as (
select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
union all
select 'John' as employee, 'Manager' as job_title, 'vegetarian' as dietary_pref, 'Canteen' as source
union all
select 'Mary' as employee, 'Marketing Director' as job_title, 'pescatarian' as dietary_pref, 'HR' as source
union all
select 'Mary' as employee, 'Marketing Manager' as job_title, 'gluten-free' as dietary_pref, 'Canteen' as source
)
select employee,
array_agg(struct(source, job_title)) as job_title,
array_agg(struct(source, dietary_pref)) as dietary_pref,
from data_set
group by employee
The data I get for John with regard to the job title is:
[{'source':'HR', 'job_title':'Senior Manager'}, {'source': 'Canteen', 'job_title':'Manager'}]
Whereas I am trying to achieve:
[{'HR' : 'Senior Manager', 'Canteen' : 'Manager'}]
With a struct output, I was hoping to then easily access the preferred source using my_struct.my_preferred_source. In this particular case I hope to invoke job_title.HR and dietary_pref.Canteen.
Hence, in pseudo-SQL, I imagine I would write:
select employee,
AGGREGATE_JOB_TITLE_AS_STRUCT(source, job_title).HR as job_title,
AGGREGATE_DIETARY_PREF_AS_STRUCT(source, dietary_pref).Canteen as dietary_pref,
from data_set group by employee
The output would then be:
employee  job_title           dietary_pref
John      Senior Manager      vegetarian
Mary      Marketing Director  gluten-free
I'd like help solving this. Perhaps this is the wrong approach altogether, but given the more complex data set I am dealing with, I thought it would be the preferred approach (albeit a failed one).
Open to alternatives. Please advise. Thanks.
Note: I edited this post after Mikhail's answer, which solved my problem using a slightly different method than I expected, and added more details on my intent to use a single struct per employee.
Consider below
select employee,
array_agg(struct(source as job_source, job_title) order by if(source = 'HR', 1, 2) limit 1)[offset(0)].*,
array_agg(struct(source as dietary_source, dietary_pref) order by if(source = 'HR', 2, 1) limit 1)[offset(0)].*
from data_set
group by employee
If applied to the sample data in your question, the output is:
employee  job_source  job_title           dietary_source  dietary_pref
John      HR          Senior Manager      Canteen         vegetarian
Mary      HR          Marketing Director  Canteen         gluten-free
Update:
Use the query below for the clarified output:
select employee,
array_agg(job_title order by if(source = 'HR', 1, 2) limit 1)[offset(0)] as job_title,
array_agg(dietary_pref order by if(source = 'HR', 2, 1) limit 1)[offset(0)] as dietary_pref
from data_set
group by employee
with output:
employee  job_title           dietary_pref
John      Senior Manager      vegetarian
Mary      Marketing Director  gluten-free

SELECT rows with primary key of multiple columns

How do I select all relevant records according to the provided list of pairs?
table:
CREATE TABLE "users_groups" (
"user_id" INTEGER NOT NULL,
"group_id" BIGINT NOT NULL,
PRIMARY KEY (user_id, group_id),
"permissions" VARCHAR(255)
);
For example, say I have the following JavaScript array of pairs that I need to fetch from the DB:
[
{user_id: 1, group_id: 19},
{user_id: 1, group_id: 11},
{user_id: 5, group_id: 19}
]
Here we see that the same user_id can be in multiple groups.
I can loop over every array element and build the following query:
SELECT * FROM users_groups
WHERE (user_id = 1 AND group_id = 19)
OR (user_id = 1 AND group_id = 11)
OR (user_id = 5 AND group_id = 19);
But is this the best solution? Say the array is very long; as far as I know, the query length can reach ~1 GB.
What is the best and quickest way to do this?
Bill Karwin's answer will work for Postgres just as well.
However, in my experience, joining against a VALUES clause is very often faster than a large IN list (with hundreds if not thousands of elements):
select ug.*
from users_groups ug
join (
    values (1,19), (1,11), (5,19), ...
) as l(uid, guid) on l.uid = ug.user_id and l.guid = ug.group_id;
This assumes that there are no duplicates in the values provided, otherwise the JOIN would result in duplicated rows, which the IN solution would not do.
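If the pairs live in application code (as in the JavaScript array above), the VALUES list can also be sent as parameters instead of concatenating a giant SQL string. Below is a rough Python sketch using psycopg2's execute_values helper; the connection string and the choice of driver are assumptions, not part of the original answer.
import psycopg2
from psycopg2.extras import execute_values

# The (user_id, group_id) pairs collected in application code.
pairs = [(1, 19), (1, 11), (5, 19)]

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string
with conn, conn.cursor() as cur:
    rows = execute_values(
        cur,
        """
        SELECT ug.*
        FROM users_groups ug
        JOIN (VALUES %s) AS l(uid, gid)
          ON l.uid = ug.user_id AND l.gid = ug.group_id
        """,
        pairs,
        fetch=True,  # return the SELECT result; long lists are sent in pages
    )
print(rows)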
You tagged both mysql and postgresql, so I don't know which SQL database you're really using.
MySQL at least supports tuple comparisons:
SELECT * FROM users_groups WHERE (user_id, group_id) IN ((1,19), (1,11), (5,19), ...)
This kind of predicate can be optimized in MySQL 5.7 and later. See https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor-range-optimization
I don't know whether PostgreSQL supports this type of predicate, or if it optimizes it.

Apache Cassandra Data modeling

I need some help with a data model for storing smart meter data; I'm pretty new to working with Cassandra.
The data that has to be stored:
This is an example for one smart meter gateway:
{"logical_name": "smgw_123",
"ldevs":
[{"logical_name": "sm_1", "objects": [{"capture_time": 390600, "unit": 30, "scaler": -3, "status": "000", "value": 152.361925}]},
{"logical_name": "sm_2", "objects": [{"capture_time": 390601, "unit": 33, "scaler": -3, "status": "000", "value": 0.3208547253907171}]},
{"logical_name": "sm_3", "objects": [{"capture_time": 390602, "unit": 36, "scaler": -3, "status": "000", "value": 162.636025}]}]
}
So this is one smart meter gateway with the logical_name "smgw_123".
The ldevs array describes 3 smart meters and their values.
So the smart meter gateway has a relation to the 3 smart meters, and the smart meters in turn have their own data.
Questions
I don't know how to store this data, which has relations, in a NoSQL database (in my case Cassandra).
Do I have to use two tables? For example, smartmetergateway (logical name, smart meter 1, smart meter 2, smart meter 3)
and another table, smart meter (logical name, capture time, unit, scaler, status, value)?
Another problem is that smart meter gateways can have different numbers of smart meters.
I hope I described my problem understandably.
Thanks
In Cassandra data modelling, the first thing you should do is to determine your queries. You will model partition keys and clustering columns of your tables according to your queries.
In your example, I assume you will query your smart meter gateways based on their logical names. I mean, your queries will look like
select <some_columns>
from smart_meter_gateway
where smg_logical_name = <a_smg_logical_name>;
I also assume that each smart meter gateway's logical name is unique and that each smart meter in the ldevs array has a unique logical name.
If this is the case, you should create a table with a partition key column of smg_logical_name and clustering column of sm_logical_name. By doing this, you will create a table where each smart meter gateway partition will contain some number of rows of smart meters:
create table smart_meter_gateway
(
smg_logical_name text,
sm_logical_name text,
capture_time int,
unit int,
scaler int,
status text,
value decimal,
primary key ((smg_logical_name), sm_logical_name)
);
And you can insert into this table using the following statements:
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_1', 390600, 30, -3, '000', 152.361925);
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_2', 390601, 33, -3, '000', 0.3208547253907171);
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_3', 390602, 36, -3, '000', 162.636025);
And when you query the smart_meter_gateway table by smg_logical_name, you will get 3 rows in the result set:
select * from smart_meter_gateway where smg_logical_name = 'smgw_123';
The result of this query is:
smg_logical_name  sm_logical_name  capture_time  scaler  status  unit  value
smgw_123          sm_1             390600        -3      000     30    152.361925
smgw_123          sm_2             390601        -3      000     33    0.3208547253907171
smgw_123          sm_3             390602        -3      000     36    162.636025
You can also add sm_logical_name as a filter to your query:
select *
from smart_meter_gateway
where smg_logical_name = 'smgw_123' and sm_logical_name = 'sm_1';
This time you will get only 1 row in the result set:
smg_logical_name  sm_logical_name  capture_time  scaler  status  unit  value
smgw_123          sm_1             390600        -3      000     30    152.361925
Note that there are other ways you can model your data. For example, you could use collection columns for the ldevs array; that approach has its own advantages and disadvantages. As I said at the beginning, it depends on your query needs.

Cassandra CQL retrieve various rows from list of primary keys

At a certain point in my software I have a list of primary keys for which I want to retrieve information from a massively huge table, and I'm wondering what's the most practical way of doing this. Let me illustrate:
Let this be my table structure:
CREATE TABLE table_a (
    name text,
    date timestamp,
    key int,
    information1 text,
    information2 text,
    PRIMARY KEY ((name, date), key)
)
say I have a list of primary keys:
list = [['Jack', '2015-01-01 00:00:00', 1],
['Jack', '2015-01-01 00:00:00', 2],
['Richard', '2015-02-14 00:00:00', 5],
['David', '2015-01-01 00:00:00', 9],
...
['Last', '2014-08-13 00:00:00', 12]]
Say this list is huge (hundreds of thousands) and not ordered in any way. I want to retrieve, for every key on the list, the value of the information columns.
As of now, the way I'm solving this is by executing one SELECT query per key, and that has been sufficient so far. However, I'm worried about execution times when the list of keys gets too large. Is there a more practical way to query Cassandra for a list of rows whose primary keys I know, without executing one query per key?
If the key were a single field, I could use SELECT * FROM table WHERE key IN (1,2,6,3,2,4,8) to obtain all the rows I want in one query; however, I don't see how to do this with composite primary keys.
Any light on the case is appreciated.
The best way to go about something like this is to run the queries in parallel. You can do that on the (Java) application side by using async futures, like this:
Future<List<ResultSet>> future = ResultSets.queryAllAsList(session,
"SELECT * FROM users WHERE id=?",
UUID.fromString("0a63bce5-1ee3-4bbf-9bad-d4e136e0d7d1"),
UUID.fromString("7a69657f-39b3-495f-b760-9e044b3c91a9")
);
for (ResultSet rs : future.get()) {
... // process the results here
}
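For what it's worth, the same run-in-parallel idea can be sketched in Python with the DataStax driver; the contact point, keyspace, and the use of datetime objects for the date column are assumptions.
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("my_keyspace")  # assumed keyspace

prepared = session.prepare(
    "SELECT information1, information2 FROM table_a WHERE name = ? AND date = ? AND key = ?"
)

keys = [
    ("Jack", datetime(2015, 1, 1), 1),
    ("Jack", datetime(2015, 1, 1), 2),
    ("Richard", datetime(2015, 2, 14), 5),
]

# Fire all queries asynchronously, then collect the results. For hundreds of
# thousands of keys you would throttle the number of in-flight futures.
futures = [session.execute_async(prepared, key) for key in keys]
for future in futures:
    for row in future.result():  # blocks until that particular query completes
        print(row.information1, row.information2)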
Create a table that has the 3 columns worth of data piped together into a single value and store that single string value in a single column. Make that column the PK. Then you can use the IN clause to filter. For example, select * from table where key IN ('Jack|2015-01-01 00:00:00|1', 'Jack|2015-01-01 00:00:00|2').
Hope that helps!
Adam
