How can I add two or more columns together in Spark SQL?
In Oracle, we would write:
select name, (mark1+mark2+mark3) as total from student
I'm looking for the same operation in spark-sql.
If you register the DataFrame as a temporary view (for example, via createOrReplaceTempView()), then the exact same SQL statement you specified will work.
If you are using DataFrame API instead, the Column class defines various operators, including addition. In code, it would look something like this:
import spark.implicits._  // needed for toDF and the $"..." column syntax

val df = Seq((1, 2), (3, 4), (5, 6)).toDF("c1", "c2")
df.withColumn("c3", $"c1" + $"c2").show()
You can do it with the withColumn function.
If the columns are numeric, you can add them directly (note that you need col() references; adding the plain strings 'mark1'+'mark2' would just concatenate the column names):
from pyspark.sql.functions import col
df.withColumn('total', col('mark1') + col('mark2') + col('mark3'))
If the columns are strings and you want to concatenate them:
import pyspark.sql.functions as F
df.withColumn('total', F.concat('mark1','mark2','mark3'))
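Per row, both operations are just a map over the records. As a plain-Python sketch of what Spark computes for each row (the mark values here are made up; the column names come from the question):

```python
# Plain-Python sketch of the per-row computation Spark performs.
rows = [
    {"mark1": 10, "mark2": 20, "mark3": 30},
    {"mark1": 5, "mark2": 5, "mark3": 5},
]

# Numeric columns: arithmetic addition, as in col('mark1') + col('mark2') + col('mark3')
totals = [r["mark1"] + r["mark2"] + r["mark3"] for r in rows]

# String columns: concatenation, as in F.concat('mark1', 'mark2', 'mark3')
concats = ["".join(str(r[c]) for c in ("mark1", "mark2", "mark3")) for r in rows]
```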
Related
This is a Spark-related question. I have to add static data to various types of records, each type being processed as a different DataFrame (say df1, df2, ..., df6).
The static data that I intend to add has to be repeated across all 6 DataFrames.
Which would be the more performant way:
For each of the 6 dataframes, use:
.withColumn("testA", lit("somethingA"))
.withColumn("testB", lit("somethingB"))
.withColumn("testC", lit("somethingC"))
or
Create a new DF, say staticDF, which has all the columns that I intend to append to each of the 6 DataFrames, and use a union?
or
Any other option that I have not considered?
The first way is correct. The second way wouldn't work, because union adds rows to a DataFrame, not columns.
Another way is to use select to select all new columns at the same time:
df2 = df.select(
'*',
lit('somethingA').alias('testA'),
lit('somethingB').alias('testB'),
lit('somethingC').alias('testC')
)
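Either way, lit(...) simply attaches the same constant value to every row. A plain-Python sketch of the per-row effect (the row contents here are hypothetical; the column names are from the question):

```python
# Sketch: lit(...) adds the same constant column value to every row.
static_cols = {"testA": "somethingA", "testB": "somethingB", "testC": "somethingC"}

rows = [{"id": 1}, {"id": 2}]

# Equivalent per-row view of withColumn/select with literals:
augmented = [{**r, **static_cols} for r in rows]
```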
I have two DataFrames that are large; here are small samples:
first
firstnames|lastnames|age
tom|form|24
bob|lip|36
....
second
firstnames|lastnames|age
mary|gu|24
jane|lip|36
...
I would like to take both dataframes and combine them into one that look like:
firstnames|lastnames|age
tom|form|24
bob|lip|36
mary|gu|24
jane|lip|36
...
Now, I could write them both out and then read them back in together, but that's a huge waste.
If both DataFrames have an identical structure, then it's straightforward: union().
df1.union(df2)
If either DataFrame is missing a column, you have to add a dummy column to that DataFrame at the matching position; otherwise union will throw a column-mismatch exception. In the example below, column 'c3' is missing from df1, so I add a dummy column to df1 in the last position. (On Spark 3.1+, df1.unionByName(df2, allowMissingColumns=True) handles missing columns automatically.)
from pyspark.sql.functions import lit
df1.select('c1', 'c2', lit('dummy').alias('c3')).union(df2.select('c1', 'c2', 'c3'))
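Since union matches columns by position, the padding trick amounts to appending a filler value to each short row before concatenating. A plain-Python sketch with hypothetical row values:

```python
# Sketch of union-by-position with a dummy filler for a missing column.
df1_rows = [("a", 1)]          # columns: c1, c2  (c3 missing)
df2_rows = [("b", 2, "x")]     # columns: c1, c2, c3

filled_df1 = [r + ("dummy",) for r in df1_rows]   # add dummy c3 in last position
unioned = filled_df1 + df2_rows                   # union appends rows
```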
It's as simple as union, as shown here: https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
I have a Hive query that I need to convert into a DataFrame operation. The query is as below:
select sum(col1),max(col2) from table
group by 3,4,5,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24;
I don't know how to do that with a DataFrame; generally we use
df.groupBy(columnName).agg()
But how can I convert the above query to the Spark DataFrame API?
You can simply select the column names from the array of columns (df.columns) using the indexes, and then pass those names to groupBy along with the aggregation functions. One caveat: SQL positional references are 1-based, while df.columns is 0-based, so subtract 1 from each position.
So the complete translation would be:
import org.apache.spark.sql.functions._

val groupingIndexes = Seq(3,4,5,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
df.groupBy(groupingIndexes.map(i => col(df.columns(i - 1))): _*)
  .agg(sum("col1"), max("col2"))
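The only subtle part of this translation is the index arithmetic: SQL's positional GROUP BY counts from 1, while the columns array counts from 0. A plain-Python sketch of that mapping (the column names col1..col24 are assumed from the query):

```python
# SQL positional GROUP BY is 1-based; Python/Scala arrays are 0-based,
# so position i maps to columns[i - 1].
columns = [f"col{i}" for i in range(1, 25)]               # col1 .. col24
grouping_positions = [3, 4, 5, 1, 2] + list(range(6, 25))  # from the GROUP BY clause

group_cols = [columns[i - 1] for i in grouping_positions]
```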
I hope the answer is helpful
val df = spark.table("tablename")
Note that groupBy(lit(n)) would group by the constant n (collapsing all rows into a single group), not by the n-th column, so reference the columns through df.columns (0-based) instead:
df.groupBy(col(df.columns(0)), col(df.columns(1)), col(df.columns(4)), ..., col(df.columns(23))).agg(sum(col("col1")).as("sumval"), max(col("col2")).as("maxval")).select("maxval", "sumval")
Thanks
Ravi
I am trying to join two DataFrames that have the same column names and compute some new values. After that, I need to drop all the columns of the second table. The number of columns is huge. Is there an easier way to do this? I tried .drop("table2.*"), but that doesn't work.
You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
Alternatively, reference the columns through the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)
Instead of dropping columns, you can select all the columns you want to keep for further operations, something like below:
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")
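In other words, select is a projection onto the columns you keep, rather than a subtraction of the ones you drop. A plain-Python sketch with hypothetical column names and values:

```python
# Sketch: instead of dropping many columns, project only the ones to keep.
keep = ("col1", "col4", "col6")
row = {"col1": 1, "col2": 2, "col3": 3, "col4": 4, "col5": 5, "col6": 6}

projected = {k: row[k] for k in keep}
```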
I am trying to deduplicate values in a Spark DataFrame column based on values in another DataFrame's column. It seems that withColumn() only works within a single DataFrame, and subqueries won't be fully available until version 2.0. I suppose I could try to join the tables, but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically, I just want to take the values from df, remove any that are found in df_re, and return the whole DataFrame with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is a left_outer join, followed by a filter for where the right-hand side of the join is null. (Spark 2.0+ also has a "left_anti" join type that does the same thing in one step.) Something like:
import org.apache.spark.sql.functions.col

val df1 = Seq((1, 2), (2, 123), (3, 101)).toDF("uniq_id", "payload")
val df2 = Seq((2, 432)).toDF("uniq_id", "other_data")

df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)
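The anti-join semantics can be sketched in plain Python (the tuples mirror the toy df1/df2 above): keep only the left-hand rows whose key has no match on the right.

```python
# Sketch of anti-join semantics: keep df1 rows whose key is absent from df2.
df1 = [(1, 2), (2, 123), (3, 101)]   # (uniq_id, payload)
df2 = [(2, 432)]                     # (uniq_id, other_data)

df2_ids = {row[0] for row in df2}
result = [row for row in df1 if row[0] not in df2_ids]
```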