Adding a new column to a spark dataframe [duplicate] - apache-spark

This question already has answers here:
How do I add a new column to a Spark DataFrame (using PySpark)?
(10 answers)
Primary keys with Apache Spark
(4 answers)
Closed 4 years ago.
I want to add a column to a Spark DataFrame which has been registered as a table. This column needs to hold an auto-incrementing long.
df = spark.sql(query)
df.createOrReplaceTempView("user_stories")
df = spark.sql("ALTER TABLE user_stories ADD COLUMN rank int AUTO_INCREMENT")
df.show(5)
This throws the following error,
Py4JJavaError: An error occurred while calling o72.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'ALTER TABLE user_stories ADD COLUMN'(line 1, pos 29)
== SQL ==
ALTER TABLE user_stories ADD COLUMN rank int AUTO_INCREMENT
-----------------------------^^^
What am I missing here?

Spark SQL does not support AUTO_INCREMENT, so the ALTER TABLE approach won't work here. If you want to add a new incremental column to the DataFrame itself, you can do it as follows.
df.show()
+-------+
|   name|
+-------+
|gaurnag|
+-------+
from pyspark.sql.functions import monotonically_increasing_id
new_df = df.withColumn("id", monotonically_increasing_id())
new_df.show()
+-------+---+
|   name| id|
+-------+---+
|gaurnag|  0|
+-------+---+
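Note that monotonically_increasing_id() guarantees unique, increasing values but not consecutive ones (the gaps depend on partitioning). If you need a dense 1, 2, 3, ... sequence, one option is row_number over a window; a minimal PySpark sketch of that idea (the global window pulls all rows into a single partition, so it is best suited to smaller DataFrames):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order by the monotonic id to keep the existing row order, then number rows 1..n.
w = Window.orderBy(F.monotonically_increasing_id())
new_df = df.withColumn("id", F.row_number().over(w))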

Related

Add an index to a dataframe. Pyspark 2.4.4 [duplicate]

This question already has answers here:
Spark Dataframe :How to add a index Column : Aka Distributed Data Index
(7 answers)
Closed 2 years ago.
There are a lot of examples out there that all take the same basic approach:
dfWithIndex = df.withColumn('f_index', \
pyspark.sql.functions.lit(1).cast(pyspark.sql.types.LongType()))
rdd = df.rdd.zipWithIndex().map(lambda row, rowId: (list(row) + [rowId + 1]))
dfIndexed = sqlContext.createDataFrame(rdd, schema=dfWithIndex.schema)
Really new to working with these lambdas, but printSchema-ing that rdd with a plain zipWithIndex() gave me a two-column structure: _1 (a struct) and _2, a long for the index itself. That's what the lambda appears to be referencing. However, I'm getting this error:
TypeError: <lambda>() missing 1 required positional argument: 'rowId'
You're close. You just need to modify the lambda function slightly. It should take a single argument, which is a (Row, id) tuple, and return a single Row object.
from pyspark.sql import Row
from pyspark.sql.types import StructField, LongType
df = spark.createDataFrame([['a'],['b'],['c']],['val'])
df2 = df.rdd.zipWithIndex().map(
    lambda r: Row(*r[0], r[1])
).toDF(df.schema.add(StructField('id', LongType(), False)))
df2.show()
+---+---+
|val| id|
+---+---+
|  a|  0|
|  b|  1|
|  c|  2|
+---+---+
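If you would rather stay close to the snippet in the question, the same fix applies there: zipWithIndex() yields (row, rowId) tuples, so the lambda should take one argument and unpack it. A sketch, assuming dfWithIndex and sqlContext are as defined in the question:
# Each element is a (Row, index) tuple, so unpack it instead of expecting two arguments.
rdd = df.rdd.zipWithIndex().map(lambda pair: list(pair[0]) + [pair[1] + 1])
dfIndexed = sqlContext.createDataFrame(rdd, schema=dfWithIndex.schema)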

In Spark dataframe how to Transpose rows to columns?

This may be a very simple question. I want to transpose all the rows of a DataFrame into columns, converting the input DF below into the output DF shown. What are the ways in Spark to achieve this?
Note: I have a single column in the input DF.
import sparkSession.sqlContext.implicits._
val df = Seq(("row1"), ("row2"), ("row3"), ("row4"), ("row5")).toDF("COLUMN_NAME")
df.show(false)
Input DF:
+-----------+
|COLUMN_NAME|
+-----------+
|row1       |
|row2       |
|row3       |
|row4       |
|row5       |
+-----------+
Output DF
+----+----+----+----+----+
|row1|row2|row3|row4|row5|
+----+----+----+----+----+
Does this help you?
import org.apache.spark.sql.functions.{first, monotonically_increasing_id}
df.withColumn("group", monotonically_increasing_id()).groupBy("group").pivot("COLUMN_NAME").agg(first("COLUMN_NAME")).show
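Note that grouping on a per-row id gives one output row per input row. If the goal is a single row whose columns are row1 ... row5, as the output DF above suggests, grouping on a constant collapses everything into one row; a minimal PySpark sketch of that idea, assuming the same single-column DataFrame:
from pyspark.sql import functions as F

(df.groupBy(F.lit(1).alias("group"))   # a single group for the whole DataFrame
   .pivot("COLUMN_NAME")               # each distinct value becomes a column
   .agg(F.first("COLUMN_NAME"))
   .drop("group")
   .show())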

Dataframe in Pyspark

I was just dropping a column from a DataFrame, and it was dropped. But after calling the show method again, it seems like the column is not dropped from the DataFrame.
Code:
df.drop('Salary').show()
+-----+
| Name|
+-----+
| Arun|
|  Joe|
|Jerry|
+-----+
df.show()
+-----+------+
| Name|Salary|
+-----+------+
| Arun|  5000|
|  Joe|  6300|
|Jerry|  9600|
+-----+------+
I am using Spark version 2.4.4. Could you please tell me why it's not dropped? I thought it would be like dropping a column from a table in an Oracle database.
The drop method returns a new DataFrame. The original df is not changed by this transformation, so calling df.show() a second time will return the original data with your Salary column.
You need to save the DataFrame returned by drop, for example by assigning it to a new variable:
df2 = df.drop('Salary')
df2.show()
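Reassigning the original name works just as well, since drop never modifies df in place:
df = df.drop('Salary')  # df now refers to the new DataFrame without the Salary column
df.show()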

Easy way to center a column in a Spark DataFrame

I want to center a column in a Spark DataFrame, i.e., subtract the mean of the column from each of its elements. Currently, I do it manually: first calculate the mean of the column, get the value out of the reduced DataFrame, and then subtract it from the column. I wonder whether there is an easier way to do this in Spark. Is there any built-in function for it?
There is no built-in function for this, but you can use a user-defined function (udf) as below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{mean, udf}

val df = spark.sparkContext.parallelize(List(
  (2.06, 0.56),
  (1.96, 0.72),
  (1.70, 0.87),
  (1.90, 0.64))).toDF("c1", "c2")

// Returns a UDF that subtracts the given mean from a column value.
def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)

// Centers the named column by subtracting its mean, collected as a Double.
def getCenterDF(df: DataFrame, col: String): DataFrame = {
  val avg = df.select(mean(col)).first().getAs[Double](0)
  df.withColumn(col, subMean(avg)(df(col)))
}
scala> df.show(false)
+----+----+
|c1  |c2  |
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.7 |0.87|
|1.9 |0.64|
+----+----+
scala> getCenterDF(df, "c2").show(false)
+----+--------------------+
|c1  |c2                  |
+----+--------------------+
|2.06|-0.13750000000000007|
|1.96|0.022499999999999853|
|1.7 |0.17249999999999988 |
|1.9 |-0.05750000000000011|
+----+--------------------+
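If you would rather avoid a UDF entirely, the collected mean can be subtracted as an ordinary column expression. A minimal PySpark sketch of the same idea (the helper name center is just illustrative):
from pyspark.sql import functions as F

def center(df, col_name):
    # Collect the column mean as a plain Python float, then subtract it
    # as a regular column expression (no UDF involved).
    avg = df.select(F.mean(col_name)).first()[0]
    return df.withColumn(col_name, F.col(col_name) - F.lit(avg))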

How to explode several columns into rows in Spark SQL [duplicate]

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Unpivot in Spark SQL / PySpark
(2 answers)
Closed 4 years ago.
I am using Spark SQL 2.2.0 and DataFrame/DataSet API.
I need to explode several columns, one value per row.
I have:
+------+------+------+------+------+
|col1  |col2  |col3  |col4  |col5  |
+------+------+------+------+------+
|val11 |val21 |val31 |val41 |val51 |
|val12 |val22 |val32 |val42 |val52 |
+------+------+------+------+------+
And I need to have the following:
+------+------+---------+---------+
|col1  |col2  |col_num  |col_new  |
+------+------+---------+---------+
|val11 |val21 |col3     |val31    |
|val11 |val21 |col4     |val41    |
|val11 |val21 |col5     |val51    |
|val12 |val22 |col3     |val32    |
|val12 |val22 |col4     |val42    |
|val12 |val22 |col5     |val52    |
+------+------+---------+---------+
I managed to build an array and explode it like this:
val df2 = df.select(col("col1"), col("col2"), array(col("col3"), col("col4"), col("col5")) as "array")
val df3 = df2.withColumn("array", explode(col("array")))
This works, but it does not add the col_num column (which I need). I tried to do it with flatMap using a custom map function, but it fails.
Could you please help me with how to do this?
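One common way to keep the originating column name is to explode an array of structs that pairs each column name with its value. A minimal PySpark sketch of that idea (the question uses Scala, and df and the column names follow the example above):
from pyspark.sql import functions as F

# Pair each column name with its value, then explode the pairs into rows.
pairs = F.array(*[
    F.struct(F.lit(c).alias("col_num"), F.col(c).alias("col_new"))
    for c in ["col3", "col4", "col5"]
])
df2 = (df
       .withColumn("kv", F.explode(pairs))
       .select("col1", "col2", "kv.col_num", "kv.col_new"))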
