This is related to the question in Pivot table with Apache Pig.
I have the following input data:
Id Name Value
1 Column1 Row11
1 Column2 Row12
1 Column3 Row13
2 Column1 Row21
2 Column2 Row22
2 Column3 Row23
and want to pivot it to get the following output:
Id Column1 Column2 Column3
1 Row11 Row12 Row13
2 Row21 Row22 Row23
Please let me know how to do it in Pig.
The simplest way to do it without a UDF is to group on Id and then, in a nested FOREACH, select the rows for each of the column names and join them in the GENERATE. See the script:
inpt = load '~/rows_to_cols.txt' as (Id : chararray, Name : chararray, Value: chararray);
grp = group inpt by Id;
maps = foreach grp {
    col1 = filter inpt by Name == 'Column1';
    col2 = filter inpt by Name == 'Column2';
    col3 = filter inpt by Name == 'Column3';
    generate flatten(group) as Id,
             flatten(col1.Value) as Column1,
             flatten(col2.Value) as Column2,
             flatten(col3.Value) as Column3;
};
Output:
(1,Row11,Row12,Row13)
(2,Row21,Row22,Row23)
Another option would be to write a UDF which converts a bag{name, value} into a map[], then get the values by using the column names as keys (e.g. vals#'Column1').
Not sure about Pig, but in Spark you could do this with a one-line command:
df.groupBy("Id").pivot("Name").agg(first("Value"))
I have a table that looks like this
select 'Alice' AS ID, 1 as col1, 3 as col2, -2 as col3, 9 as col4
union all
select 'Bob' AS ID, -9 as col1, 2 as col2, 5 as col3, -6 as col4
I would like to get the top 3 absolute values for each record across the four columns and then format the output as a dictionary or STRUCT like below
select
'Alice' AS ID, [STRUCT('col4' AS column, 9 AS value), STRUCT('col2',3), STRUCT('col3',-2)] output
union all
select
'Bob' AS ID, [STRUCT('col1' AS column, -9 AS value), STRUCT('col4',-6), STRUCT('col3',5)] output
I would like it to be dynamic, i.e. avoid writing out the columns individually; there could be up to 100 columns, and they change.
For more context, I am trying to get the top three features from the batch local explanations output in Vertex AI
https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions
I have looked up some examples and would like something similar to the second answer here: How to get max value of column values in a record? (BigQuery)
EDIT: the data is actually structured like this. If this is easier to work with, it would be a better starting point:
select 'Alice' AS ID, STRUCT(1 as col1, 3 as col2, -2 as col3, 9 as col4) AS featureAttributions
union all
SELECT 'Bob' AS ID, STRUCT(-9 as col1, 2 as col2, 5 as col3, -6 as col4) AS featureAttributions
Consider the query below.
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
Query results
Dynamic Query
I would like it to be dynamic, so avoid writing out columns individually
You need to consider dynamic SQL for this. Referring to the answer from @Mikhail that you linked in the post, you can write a dynamic query like the one below.
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT AS STRUCT * EXCEPT (ID) FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);
For the updated sample table:
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT featureAttributions FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);
I have a Spark DataFrame as below:

ID   Col A        Col B
1    null         Some Value
2    Some Value   null
I need to add a new column that contains the name of the column (Col A or Col B) that is not null.
So the expected DataFrame should look like this:
ID   Col A        Col B        result
1    null         Some Value   Col B
2    Some Value   null         Col A
Any help would be much appreciated.
Thank you!
After creating a temp view from your DataFrame, e.g.
df.createOrReplaceTempView("my_data")
you may run the following on your Spark session using newdf = sparkSession.sql("query here"):
SELECT
  ID,
  `Col A`,
  `Col B`,
  CASE
    WHEN `Col A` IS NULL AND `Col B` IS NULL THEN NULL
    WHEN `Col B` IS NULL THEN 'Col A'
    WHEN `Col A` IS NULL THEN 'Col B'
    ELSE 'Col A Col B'
  END AS result
FROM my_data
Or just using Python:
from pyspark.sql.functions import when, col

df = df.withColumn(
    "result",
    when(col("Col A").isNull() & col("Col B").isNull(), None)
    .when(col("Col B").isNull(), "Col A")
    .when(col("Col A").isNull(), "Col B")
    .otherwise("Col A Col B")
)
I'm attempting to perform some sort of upsert operation in U-SQL, where I pull data every day from a file and compare it with yesterday's data, which is stored in a table in Data Lake Storage.
I have created an ID column in the table in DL using row_number(), and it is this "counter" I wish to continue when appending new rows to the old dataset. E.g.
Last inserted row in DL table could look like this:
ID | Column1 | Column2
---+------------+---------
10 | SomeValue | 1
I want the next rows to have the following ascending ids
11 | SomeValue | 1
12 | SomeValue | 1
How would I go about making sure that the next X rows continue the ID count incrementally, so that each new row's ID is 1 greater than the last?
You could use ROW_NUMBER, then add it to the max value from the original table (i.e. using CROSS JOIN and MAX). A simple demo of the technique:
DECLARE @outputFile string = @"\output\output.csv";

@originalInput =
    SELECT *
    FROM ( VALUES
        ( 10, "SomeValue 1", 1 )
    ) AS x ( id, column1, column2 );

@newInput =
    SELECT *
    FROM ( VALUES
        ( "SomeValue 2", 2 ),
        ( "SomeValue 3", 3 )
    ) AS x ( column1, column2 );

@output =
    SELECT id, column1, column2
    FROM @originalInput
    UNION ALL
    SELECT (int)(x.id + ROW_NUMBER() OVER()) AS id, column1, column2
    FROM @newInput
    CROSS JOIN ( SELECT MAX(id) AS id FROM @originalInput ) AS x;

OUTPUT @output
TO @outputFile
USING Outputters.Csv(outputHeader:true);
My results:
You will have to be careful if the original table is empty and add some additional conditions / null checks but I'll leave that up to you.
I am trying to update a table so that column1 values are copied from column2 in the same table in Cassandra.
I have tried these, but they throw an error:
UPDATE emp SET col1_name = col2_name WHERE id IN (1,2,3);
UPDATE emp SET col1_name = select(col2_name) WHERE id IN (1,2,3);
The error:
no viable alternative at input 'where' (...emp set col1_name = [shape] where...)
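CQL does not allow another column on the right-hand side of SET (the value must be a literal or bind marker), which is what the error is complaining about. One workaround is to read the values client-side and write them back. A minimal sketch with the Python cassandra-driver; the contact point and keyspace are assumptions:
from cassandra.cluster import Cluster

# Assumed contact point and keyspace
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# Read the source column for the ids in question, then copy it into col1_name
rows = session.execute("SELECT id, col2_name FROM emp WHERE id IN (1, 2, 3)")
update = session.prepare("UPDATE emp SET col1_name = ? WHERE id = ?")
for row in rows:
    session.execute(update, (row.col2_name, row.id))

cluster.shutdown()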
Using SparkR (spark-2.1.0) with the DataStax Cassandra connector.
I have a dataframe which connects to a table in Cassandra. Some of the columns in the cassandra table are of type map and set. I need to perform various filtering/aggregation operations on these "collection" columns.
my_data_frame <- read.df(
source = "org.apache.spark.sql.cassandra",
keyspace = "my_keyspace", table = "some_table")
my_data_frame
SparkDataFrame[id:string, col2:map<string,int>, col3:array<string>]
schema(my_data_frame)
StructType
|-name = "id", type = "StringType", nullable = TRUE
|-name = "col2", type = "MapType(StringType,IntegerType,true)", nullable = TRUE
|-name = "col3", type = "ArrayType(StringType,true)", nullable = TRUE
I would like to obtain:
1. A new dataframe containing the unique string KEYS in the col2 map over all rows in my_data_frame.
2. The sum() of VALUES in the col2 map for each row, placed into a new column in my_data_frame.
3. The set of unique values in the col3 array over all rows in my_data_frame, in a new dataframe.
The map data in cassandra for col2 looks like:
VALUES ({'key1':100, 'key2':20, 'key3':50, ... })
If the original cassandra table looks like:
id col2
1 {'key1':100, 'key2':20}
2 {'key3':40, 'key4':10}
3 {'key1':10, 'key3':30}
I would like to obtain a dataframe containing the unique keys:
col2_keys
key1
key2
key3
key4
The sum of values for each id:
id col2_sum
1 120
2 60
3 40
The max of values for each id:
id col2_max
1 100
2 40
3 30
Additional info:
col2_df <- select(my_data_frame, my_data_frame$col2)
head(col2_df)
col2
1 <environment: 0x7facfb4fc4e8>
2 <environment: 0x7facfb4f3980>
3 <environment: 0x7facfb4eb980>
4 <environment: 0x7facfb4e0068>
row1 <- first(my_data_frame)
row1
col2
1 <environment: 0x7fad00023ca0>
I am new to Spark and R and have probably missed something obvious, but I don't see any obvious functions for transforming maps and arrays in this manner.
I did see some references to using "environment" as a map in R but am not sure how that would work for my requirements.
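For comparison, here is a minimal PySpark sketch of those transformations, since explode works on both map and array columns; the load options simply mirror the read.df() call above, and the SparkR equivalents may differ:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, sum as sum_, max as max_

spark = SparkSession.builder.getOrCreate()

# Assumed load, mirroring the read.df() call above
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="some_table")
      .load())

# explode() on a map column produces one row per entry, with columns `key` and `value`
kv = df.select(col("id"), explode(col("col2")))

col2_keys = kv.select("key").distinct()           # unique keys over all rows
col2_agg = kv.groupBy("id").agg(sum_("value").alias("col2_sum"),
                                max_("value").alias("col2_max"))

# explode() on an array column produces one row per element
col3_values = df.select(explode(col("col3")).alias("col3_value")).distinct()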
spark-2.1.0
Cassandra 3.10
spark-cassandra-connector:2.0.0-s_2.11
JDK 1.8.0_101-b13
Thanks so much in advance for any help.