I have a Hive query that I need to convert into a DataFrame. The query is as below:
select sum(col1),max(col2) from table
group by 3,4,5,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24;
I don't know how to do that with the DataFrame API; generally we use
df.groupBy(columnName).agg()
But how can I convert the above query to a Spark DataFrame?
You can simply select the column names from the array of columns (df.columns) using the indexes, then use those selected column names in groupBy together with the aggregation functions. Note that GROUP BY ordinals in SQL are 1-based while df.columns is 0-based, so subtract 1 when indexing.
So the complete translation would be
import org.apache.spark.sql.functions._

// SQL GROUP BY ordinals are 1-based; df.columns is 0-based, hence the "- 1"
val groupingIndexes = Seq(3,4,5,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
val groupingCols = groupingIndexes.map(i => col(df.columns(i - 1)))

df.groupBy(groupingCols: _*).agg(sum("col1"), max("col2"))
I hope the answer is helpful
val df = spark.table("tablename")
// groupBy(lit(n)) would group by the constant n, not by column position,
// so look the column names up by their (1-based) ordinal instead
val groupCols = Seq(1, 2, 5, /* ... */ 24).map(i => col(df.columns(i - 1)))
df.groupBy(groupCols: _*).agg(sum(col("col1")).as("sumval"), max(col("col2")).as("maxval")).select("maxval", "sumval")
How can I add one or more columns in spark-sql?
In Oracle, we do:
select name, (mark1+mark2+mark3) as total from student
I'm looking for the same operation in spark-sql.
If you register the DataFrame as a temporary table (for example, via createOrReplaceTempView()), then the exact same SQL statement that you specified will work.
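For example, a rough sketch of that route; the view name "student", the DataFrame name studentDf, and a SparkSession called spark are assumptions, while the columns come from the question:

// studentDf is assumed to hold the name/mark1/mark2/mark3 columns from the question.
// Register it as a temporary view and run the original SQL against it.
studentDf.createOrReplaceTempView("student")

val withTotal = spark.sql(
  "select name, (mark1 + mark2 + mark3) as total from student")
withTotal.show()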
If you are using the DataFrame API instead, the Column class defines various operators, including addition. In code, it would look something like this:
import spark.implicits._  // for toDF and the $ column syntax

val df = Seq( (1,2), (3,4), (5,6) ).toDF("c1", "c2")
df.withColumn( "c3", $"c1" + $"c2" ).show
You can do it with the withColumn function.
If the columns are numeric you can add them directly:
df.withColumn('total', df['mark1'] + df['mark2'] + df['mark3'])
If the columns are strings and you want to concatenate them:
import pyspark.sql.functions as F
df.withColumn('total', F.concat('mark1','mark2','mark3'))
I have read that in Spark you can easily do a correlated scalar subquery like so:
select
column1,
(select column2 from table2 where table2.some_key = table1.id)
from table1
What I have not figured out is how to do this in the DataFrame API. The best I can come up with is to do a join. The problem with this is that in my specific case I am joining with an enum-like lookup table that actually applies to more than one column.
Below is an example of the DataFrame code.
Dataset<Row> table1 = getTable1FromSomewhere();
Dataset<Row> table2 = getTable2FromSomewhere();

table1
    .as("table1")
    .join(table2.as("table2"),
        col("table1.first_key").equalTo(col("table2.key")), "left")
    .join(table2.as("table3"),
        col("table1.second_key").equalTo(col("table3.key")), "left")
    .select(col("table1.*"),
        col("table2.description").as("first_key_description"),
        col("table3.description").as("second_key_description"))
    .show();
Any help would be greatly appreciated on figuring out how to do this in the DataFrame API.
What I have not figured out is how to do this in the DataFrame API.
Because there is simply no DataFrame API that can express that directly (without an explicit JOIN). It may change in the future:
https://issues.apache.org/jira/browse/SPARK-23945
https://issues.apache.org/jira/browse/SPARK-18455
Does SparkSQL support subquery?
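In the meantime, since the SQL form from the question does work, one workaround is to register the Datasets as temporary views and run that query through spark.sql. A minimal sketch, assuming the two tables are available as DataFrames table1 and table2 and a SparkSession called spark:

// Register both tables as temporary views so the correlated scalar subquery
// from the question can be run as-is via SQL.
table1.createOrReplaceTempView("table1")
table2.createOrReplaceTempView("table2")

val result = spark.sql("""
  select
    column1,
    (select column2 from table2 where table2.some_key = table1.id)
  from table1
""")
result.show()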
How can we split a DataFrame, operate on the individual splits, and union all the individual results back?
Let's say I have a DataFrame with the columns below. I need to split the DataFrame based on channel and operate on the individual splits, which adds a new column called bucket. Then I need to union the results back.
account,channel,number_of_views
groupBy only allows simple aggregate operations. On each split DataFrame I need to do feature extraction.
Currently all feature transformers in spark-mllib support only a single DataFrame.
You can randomly split like this:
val Array(training_data, validation_data, test_data) = raw_data_rating_before_split.randomSplit(Array(0.6,0.2,0.2))
This will create three DataFrames. Do whatever you need on each of them, then you can join or union the results:
val finalDF = df1.join(df2, df1.col("col_name")===df2.col("col_name"))
You can also join multiple DataFrames at the same time.
Is this what you want, or something else?
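If the split has to follow the channel values rather than random proportions, a rough sketch of the split-transform-union idea could look like the code below. The DataFrame name df is assumed, the column names come from the question, and addBucket is a hypothetical stand-in for whatever per-split feature extraction you run:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical per-split step; replace with your actual feature extraction.
def addBucket(split: DataFrame): DataFrame =
  split.withColumn("bucket", lit("some_bucket"))

// Split by the distinct channel values, transform each split, then union the results back.
val channels = df.select("channel").distinct().as[String].collect()
val result = channels
  .map(ch => addBucket(df.filter($"channel" === ch)))
  .reduce(_ union _)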
I am trying to join two DataFrames with the same column names and compute some new values. After that I need to drop all columns of the second table. The number of columns is huge. How can I do it in an easier way? I tried .drop("table2.*"), but this doesn't work.
You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
Or reference columns through the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)
Instead of dropping, you can select all the necessary columns that you want to hold for further operations, something like below:
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")
I am trying to deduplicate values in a Spark DataFrame column based on values in another DataFrame column. It seems that withColumn() only works within a single DataFrame, and subqueries won't be fully available until version 2. I suppose I could try to join the tables, but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically I just want to take the values from df and remove any that are found in df_re, then return the whole DataFrame with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
import org.apache.spark.sql.functions.col
import spark.implicits._

val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")

df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)  // keep only rows with no match in df2
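Since the question asks for the whole DataFrame minus the duplicates, you could tack a select onto the end to keep only df1's columns, roughly:

// Keep only df1's rows and columns once the filter has removed the matches.
val deduped = df1.as("df1")
  .join(df2.as("df2"), col("df1.uniq_id") === col("df2.uniq_id"), "left_outer")
  .filter($"df2.uniq_id".isNull)
  .select($"df1.*")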