I have a Hive query that I need to convert into a DataFrame. The query is as below:
select sum(col1),max(col2) from table
group by 3,4,5,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24;
I don't know how to do that with the DataFrame API; generally we use
df.groupBy(columnName).agg()
But how can I convert the above query to a Spark DataFrame?
You can simply select the column names from the array of columns (df.columns) using the indexes, then use those selected column names in groupBy together with the aggregation functions. Note that GROUP BY ordinals in SQL are 1-based while df.columns is 0-based, so subtract 1 when indexing.
So the complete translation would be
import org.apache.spark.sql.functions._

// SQL GROUP BY ordinals are 1-based; df.columns is 0-based, hence the "- 1"
val groupingIndexes = Seq(3,4,5,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
val groupingCols = groupingIndexes.map(i => col(df.columns(i - 1)))

df.groupBy(groupingCols: _*).agg(sum("col1"), max("col2"))
I hope the answer is helpful
val df = spark.table("tablename")
// groupBy(lit(n)) would group by the constant n, not by column position,
// so look the column names up by their (1-based) ordinal instead
val groupCols = Seq(1, 2, 5, /* ... */ 24).map(i => col(df.columns(i - 1)))
df.groupBy(groupCols: _*).agg(sum(col("col1")).as("sumval"), max(col("col2")).as("maxval")).select("maxval", "sumval")
How can I add one or more columns in spark-sql?
In Oracle, we do:
select name, (mark1+mark2+mark3) as total from student
I'm looking for the same operation in spark-sql.
If you register the DataFrame as a temporary table (for example, via createOrReplaceTempView()), then the exact same SQL statement that you specified will work.
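For example, a rough sketch of that route; the view name "student", the DataFrame name studentDf, and a SparkSession called spark are assumptions, while the columns come from the question:

// studentDf is assumed to hold the name/mark1/mark2/mark3 columns from the question.
// Register it as a temporary view and run the original SQL against it.
studentDf.createOrReplaceTempView("student")

val withTotal = spark.sql(
  "select name, (mark1 + mark2 + mark3) as total from student")
withTotal.show()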
If you are using the DataFrame API instead, the Column class defines various operators, including addition. In code, it would look something like this:
import spark.implicits._  // for toDF and the $ column syntax

val df = Seq( (1,2), (3,4), (5,6) ).toDF("c1", "c2")
df.withColumn( "c3", $"c1" + $"c2" ).show
You can do it with the withColumn function.
If the columns are numeric you can add them directly:
df.withColumn('total', df['mark1'] + df['mark2'] + df['mark3'])
If the columns are strings and you want to concatenate them:
import pyspark.sql.functions as F
df.withColumn('total', F.concat('mark1','mark2','mark3'))
I have read that in Spark you can easily do a correlated scalar subquery like so:
select
column1,
(select column2 from table2 where table2.some_key = table1.id)
from table1
What I have not figured out is how to do this in the DataFrame API. The best I can come up with is to do a join. The problem with this is that in my specific case I am joining with an enum-like lookup table that actually applies to more than one column.
Below is an example of the DataFrame code.
Dataset<Row> table1 = getTable1FromSomewhere();
Dataset<Row> table2 = getTable2FromSomewhere();

table1
    .as("table1")
    .join(table2.as("table2"),
        col("table1.first_key").equalTo(col("table2.key")), "left")
    .join(table2.as("table3"),
        col("table1.second_key").equalTo(col("table3.key")), "left")
    .select(col("table1.*"),
        col("table2.description").as("first_key_description"),
        col("table3.description").as("second_key_description"))
    .show();
Any help would be greatly appreciated on figuring out how to do this in the DataFrame API.
What I have not figured out is how to do this in the DataFrame API.
Because there is simply no DataFrame API that can express that directly (without an explicit JOIN). It may change in the future:
https://issues.apache.org/jira/browse/SPARK-23945
https://issues.apache.org/jira/browse/SPARK-18455
Does SparkSQL support subquery?
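In the meantime, since the SQL form from the question does work, one workaround is to register the Datasets as temporary views and run that query through spark.sql. A minimal sketch, assuming the two tables are available as DataFrames table1 and table2 and a SparkSession called spark:

// Register both tables as temporary views so the correlated scalar subquery
// from the question can be run as-is via SQL.
table1.createOrReplaceTempView("table1")
table2.createOrReplaceTempView("table2")

val result = spark.sql("""
  select
    column1,
    (select column2 from table2 where table2.some_key = table1.id)
  from table1
""")
result.show()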
How can we split a DataFrame, operate on the individual splits, and union all the individual results back?
Let's say I have a DataFrame with the columns below. I need to split the DataFrame based on channel and operate on the individual splits, which adds a new column called bucket. Then I need to union the results back.
account,channel,number_of_views
groupBy only allows simple aggregate operations. On each split DataFrame I need to do feature extraction.
Currently all feature transformers in spark-mllib support only a single DataFrame.
You can randomly split like this:
val Array(training_data, validation_data, test_data) = raw_data_rating_before_split.randomSplit(Array(0.6,0.2,0.2))
This will create three DataFrames. Do whatever you need on each of them, then you can join or union the results:
val finalDF = df1.join(df2, df1.col("col_name")===df2.col("col_name"))
You can also join multiple DataFrames at the same time.
Is this what you want, or something else?
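If the split has to follow the channel values rather than random proportions, a rough sketch of the split-transform-union idea could look like the code below. The DataFrame name df is assumed, the column names come from the question, and addBucket is a hypothetical stand-in for whatever per-split feature extraction you run:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical per-split step; replace with your actual feature extraction.
def addBucket(split: DataFrame): DataFrame =
  split.withColumn("bucket", lit("some_bucket"))

// Split by the distinct channel values, transform each split, then union the results back.
val channels = df.select("channel").distinct().as[String].collect()
val result = channels
  .map(ch => addBucket(df.filter($"channel" === ch)))
  .reduce(_ union _)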
I am trying to join two DataFrames with the same column names and compute some new values. After that I need to drop all columns of the second table. The number of columns is huge. How can I do it in an easier way? I tried .drop("table2.*"), but this doesn't work.
You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
Or reference columns through the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)
Instead of dropping, you can select all the necessary columns that you want to hold for further operations, something like below:
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")
I am trying to deduplicate values in a Spark DataFrame column based on values in another DataFrame column. It seems that withColumn() only works within a single DataFrame, and subqueries won't be fully available until version 2. I suppose I could try to join the tables, but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically I just want to take the values from df and remove any that are found in df_re, then return the whole DataFrame with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
import org.apache.spark.sql.functions.col
import spark.implicits._

val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")

df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)  // keep only rows with no match in df2
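Since the question asks for the whole DataFrame minus the duplicates, you could tack a select onto the end to keep only df1's columns, roughly:

// Keep only df1's rows and columns once the filter has removed the matches.
val deduped = df1.as("df1")
  .join(df2.as("df2"), col("df1.uniq_id") === col("df2.uniq_id"), "left_outer")
  .filter($"df2.uniq_id".isNull)
  .select($"df1.*")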