Add an index to a dataframe. Pyspark 2.4.4 [duplicate]

This question already has answers here:
Spark Dataframe: How to add an index Column (aka Distributed Data Index)
(7 answers)
There are a lot of examples out there that all give the same basic approach:
dfWithIndex = df.withColumn('f_index',
    pyspark.sql.functions.lit(1).cast(pyspark.sql.types.LongType()))
rdd = df.rdd.zipWithIndex().map(lambda row, rowId: (list(row) + [rowId + 1]))
dfIndexed = sqlContext.createDataFrame(rdd, schema=dfWithIndex.schema)
Really new to working with these lambdas, but printSchema-ing that RDD with a plain zipWithIndex() gave me a two-column dataframe: _1 (a struct) and _2 (a long) for the index itself. That's what the lambda appears to be referencing. However, I'm getting this error:
TypeError: <lambda>() missing 1 required positional argument: 'rowId'

You're close. You just need to modify the lambda slightly: map passes each element as a single argument, so the lambda should take one argument (a (Row, id) tuple) and return a single Row object.
from pyspark.sql import Row
from pyspark.sql.types import StructField, LongType

df = spark.createDataFrame([['a'], ['b'], ['c']], ['val'])
df2 = df.rdd.zipWithIndex().map(
    lambda r: Row(*r[0], r[1])  # r is a (Row, index) tuple
).toDF(df.schema.add(StructField('id', LongType(), False)))
df2.show()
df2.show()
+---+---+
|val| id|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
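If a unique but not necessarily consecutive id is enough, you can skip the round-trip through the RDD API entirely with the built-in monotonically_increasing_id. A minimal sketch (note the generated ids are unique and increasing, but not contiguous):
import pyspark.sql.functions as F

# Unique 64-bit ids; increasing within each partition, but NOT consecutive.
df_with_id = df.withColumn('id', F.monotonically_increasing_id())
df_with_id.show()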


Spark: use value of a groupBy column as a name for an aggregate column [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
I want to give aggregate column name which contains a value of one of the groupBy columns:
dataset
  .groupBy("user", "action")
  .agg(collect_list("timestamp").name($"action" + "timestamps"))
This part, .name($"action" + "timestamps"), does not work because name expects a String, not a Column.
Based on: How to pivot Spark DataFrame?
val df = spark.createDataFrame(Seq(("U1","a",1), ("U2","b",2))).toDF("user", "action", "timestamp")
val res = df.groupBy("user", "action").pivot("action").agg(collect_list("timestamp"))
res.show()
+----+------+---+---+
|user|action| a| b|
+----+------+---+---+
| U1| a|[1]| []|
| U2| b| []|[2]|
+----+------+---+---+
The fun part is the column renaming: we need to rename all but the first two columns.
val renames = res.schema.names.drop(2).map (n => col(n).as(n + "_timestamp"))
res.select((col("user") +: renames): _*).show
+----+-----------+-----------+
|user|a_timestamp|b_timestamp|
+----+-----------+-----------+
| U1| [1]| []|
| U2| []| [2]|
+----+-----------+-----------+
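For PySpark users, a rough equivalent of the pivot-and-rename approach above, as a sketch (same toy data, Python instead of Scala):
from pyspark.sql import functions as F

df = spark.createDataFrame([("U1", "a", 1), ("U2", "b", 2)], ["user", "action", "timestamp"])
res = df.groupBy("user", "action").pivot("action").agg(F.collect_list("timestamp"))

# Rename everything after the first two grouping columns.
res.select(
    [F.col("user")] + [F.col(c).alias(c + "_timestamp") for c in res.columns[2:]]
).show()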

PySpark: Use the primary key of a row as a seed for rand [duplicate]

This question already has an answer here:
Using a column value as a parameter to a spark DataFrame function
(1 answer)
I'm trying to use the rand function in PySpark to generate a column with random numbers. I would like the rand function to take in the primary key of the row as the seed so that the number is reproducible. However, when I run:
df.withColumn('rand_key', F.rand(F.col('primary_id')))
I get the error
TypeError: 'Column' object is not callable
How can I use the value in the row as my rand seed?
The problem with the F.rand(seed) function is that it takes a long as the seed parameter and treats it as a literal (static) value.
One way around this is to create your own rand function that takes a column as a parameter:
import random

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def rand(seed):
    random.seed(seed)
    return random.random()

rand_udf = udf(rand, DoubleType())
df = spark.createDataFrame([(1, 'a'), (2, 'b'), (1, 'c')], ['a', 'b'])
df.withColumn('rr', rand_udf(df.a)).show()
+---+---+-------------------+
| a| b| rr|
+---+---+-------------------+
| 1| a|0.13436424411240122|
| 2| b| 0.9560342718892494|
| 1| c|0.13436424411240122|
+---+---+-------------------+
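If you'd rather avoid a Python udf altogether, one alternative is to derive a deterministic pseudo-random value from the key with a built-in hash function. This is only a sketch: it assumes Spark 3.0+ for xxhash64, and the scaling constant is an arbitrary choice, not a rigorous uniform generator.
from pyspark.sql import functions as F

# Deterministic value in [0, 1): the same key always maps to the same
# value, with no Python serialization overhead.
df.withColumn('rr', (F.abs(F.xxhash64('a')) % 1000000) / 1000000.0).show()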

Pyspark substring of one column based on the length of another column

Using PySpark 2.2.
I have a Spark DataFrame with multiple columns. I need to pass 2 columns into a UDF and return a 3rd column.
Input:
+-----+------+
|col_A| col_B|
+-----+------+
| abc|abcdef|
| abc| a|
+-----+------+
Both col_A and col_B are StringType()
Desired output:
+-----+------+-------+
|col_A| col_B|new_col|
+-----+------+-------+
| abc|abcdef| abc|
| abc| a| a|
+-----+------+-------+
I want new_col to be a substring of col_A with the length of col_B.
I tried
udf_substring = F.udf(lambda x: F.substring(x[0],0,F.length(x[1])), StringType())
df.withColumn('new_col', udf_substring([F.col('col_A'),F.col('col_B')])).show()
But it gives the TypeError: Column is not iterable.
Any idea how to do such manipulation?
There are two major things wrong here.
First, you defined your udf to take one input parameter when it should take two.
Second, you can't use the DataFrame API functions within the udf. (Calling the udf serializes the data to Python, so you need to use plain Python syntax and functions.)
Here's a proper udf implementation for this problem:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def my_substring(a, b):
    # You should add in your own error checking
    return a[:len(b)]

udf_substring = F.udf(my_substring, StringType())
And then call it by passing in the two columns as arguments:
df.withColumn('new_col', udf_substring(F.col('col_A'),F.col('col_B')))
However, in this case you can do this without a udf using the method described in this post.
df.withColumn(
    'new_col',
    F.expr("substring(col_A, 0, length(col_B))")
)
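Another udf-free option is Column.substr, which accepts Column arguments for both the position and the length. A sketch (verify on your Spark version; substr positions are 1-based):
import pyspark.sql.functions as F

df.withColumn('new_col', F.col('col_A').substr(F.lit(1), F.length('col_B')))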

Trim in a Pyspark Dataframe

I have a PySpark dataframe (the original dataframe) with the data below (all columns have string datatype). In my use case I am not sure which columns are present in this input dataframe; the user just passes me the dataframe and asks me to trim all of its columns. Data in a typical dataframe looks like this:
id  Value    Value1
1   "Text "  "Avb"
2   1504     " Test"
3   1        2
Is there any way I can do this without depending on which columns are present, and get all of them trimmed? After trimming all the columns, the data should look like:
id  Value   Value1
1   "Text"  "Avb"
2   1504    "Test"
3   1       2
Can someone help me out? How can I achieve this with a PySpark dataframe? Any help will be appreciated.
Input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func

for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using the trim() function from #osbon123's answer:
from pyspark.sql.functions import col, trim

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid calling withColumn in a loop, because every call adds another projection to the query plan, which gets expensive on very wide dataframes; a single select is cheaper. I created the following function based on this solution, and it works with any dataframe, even one that mixes string and non-string columns.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    # Trim the string columns; pass everything else through untouched.
    return of_data.select([
        F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name
        for c in of_data.schema
    ])
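Usage is then a one-liner (sketch, assuming a dataframe df with mixed column types):
df = trim_string_columns(df)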
This is a clean (and computationally cheap) way to strip all spaces from the column names (note: the names, not the values). If you want underscores in place of the spaces, simply replace "" with "_".
# Standardize column names: remove spaces
new_column_name_list = [name.replace(" ", "") for name in df.columns]
df = df.toDF(*new_column_name_list)
You can use the dtypes function in the DataFrame API to get the list of column names along with their datatypes, and then apply the trim function to all the string columns.
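A minimal sketch of that approach (dtypes returns (name, type) pairs, with the types given as strings such as 'string'):
from pyspark.sql import functions as F

# Trim only the string-typed columns; leave the rest untouched.
string_cols = [name for name, dtype in df.dtypes if dtype == 'string']
df = df.select([
    F.trim(F.col(c)).alias(c) if c in string_cols else F.col(c)
    for c in df.columns
])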

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo',41,'US',3),
                       ('Foo',39,'UK',1),
                       ('Bar',57,'CA',2),
                       ('Bar',72,'CA',2),
                       ('Baz',22,'US',6),
                       ('Baz',36,'US',6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed
But how do I remove duplicate rows based on columns 1, 3 and 4 only? i.e. remove either one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In pandas, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?
PySpark does include a dropDuplicates() method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+
From your question, it is unclear as to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
    # Use a delimiter so adjacent values can't merge ambiguously,
    # e.g. ('ab','c') vs. ('a','bc').
    return "{0}|{1}|{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now, you have a key-value RDD that is keyed by columns 1,3 and 4.
The next step would be either a reduceByKey or groupByKey and filter.
This would eliminate duplicates.
r = m.reduceByKey(lambda x, y: x)
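To get back a plain RDD of rows afterwards, drop the keys (a small sketch continuing the code above):
deduped = r.values()  # keep only the surviving row from each key group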
I know you already accepted the other answer, but if you want to do this as a DataFrame, just use groupBy and agg. Assuming you had a DF already created (with columns named "col1", "col2", etc.) you could do:
myDF.groupBy($"col1", $"col3", $"col4").agg(max($"col2"))
Note that in this case I chose the max of col2, but you could use avg, min, etc. The grouping columns are included in the output automatically, so there is no need to repeat them inside agg.
Agree with David. To add on: it may not be the case that we want to group by all columns other than the column(s) in the aggregate function; that is, we may want to remove duplicates purely based on a subset of columns while retaining all columns of the original dataframe. The better way to do this is with the dropDuplicates DataFrame API, available since Spark 1.4.0.
For reference, see: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame
I used the built-in dropDuplicates() function. The Scala code is given below:
val data = sc.parallelize(List(("Foo",41,"US",3),
                               ("Foo",39,"UK",1),
                               ("Bar",57,"CA",2),
                               ("Bar",72,"CA",2),
                               ("Baz",22,"US",6),
                               ("Baz",36,"US",6))).toDF("x","y","z","count")
data.dropDuplicates(Array("x","count")).show()
Output:
+---+---+---+-----+
| x| y| z|count|
+---+---+---+-----+
|Baz| 22| US| 6|
|Foo| 39| UK| 1|
|Foo| 41| US| 3|
|Bar| 57| CA| 2|
+---+---+---+-----+
The program below will drop fully duplicate rows, or, if you want to drop duplicates based only on certain columns, it shows how to do that too:
import org.apache.spark.sql.SparkSession

object DropDuplicates {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("DataFrame-DropDuplicates")
      .master("local[4]")
      .getOrCreate()

    import spark.implicits._

    // create a sequence of tuples with some data
    val custs = Seq(
      (1, "Widget Co", 120000.00, 0.00, "AZ"),
      (2, "Acme Widgets", 410500.00, 500.00, "CA"),
      (3, "Widgetry", 410500.00, 200.00, "CA"),
      (4, "Widgets R Us", 410500.00, 0.0, "CA"),
      (3, "Widgetry", 410500.00, 200.00, "CA"),
      (5, "Ye Olde Widgete", 500.00, 0.0, "MA"),
      (6, "Widget Co", 12000.00, 10.00, "AZ")
    )
    val customerRows = spark.sparkContext.parallelize(custs, 4)

    // convert RDD of tuples to DataFrame by supplying column names
    val customerDF = customerRows.toDF("id", "name", "sales", "discount", "state")

    println("*** Here's the whole DataFrame with duplicates")
    customerDF.printSchema()
    customerDF.show()

    // drop fully identical rows
    val withoutDuplicates = customerDF.dropDuplicates()
    println("*** Now without duplicates")
    withoutDuplicates.show()

    // drop rows that duplicate only the name and state columns
    val withoutPartials = customerDF.dropDuplicates(Seq("name", "state"))
    println("*** Now without partial duplicates too")
    withoutPartials.show()
  }
}
This is my df; it contains the value 4 twice, so dropDuplicates will remove the repeated row:
scala> df.show
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
+-----+
scala> val newdf = df.dropDuplicates
scala> newdf.show
+-----+
|value|
+-----+
| 1|
| 3|
| 5|
| 4|
| 18|
+-----+
