Is it possible to remove rows when a value in the Block column occurs at least twice with different values in the ID column?
My data looks like this:
ID  Block
1   A
1   C
1   C
3   A
3   B
In the above case, the value A in the Block column occurs twice, once with ID 1 and once with ID 3, so both of those rows should be removed.
The expected output should be:
ID  Block
1   C
1   C
3   B
I tried using dropDuplicates after a groupBy, but I don't know how to filter on this kind of condition. It seems I would need the set of ID values for each Block to check against.
One way to do it is using window functions. The first one (lag) flags a row if its ID differs from the previous row within the same Block. The second (sum) propagates that flag to every row of the Block. Finally, the flagged rows and the helper (_flag) column are dropped.
Input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(1, 'A'),
     (1, 'C'),
     (1, 'C'),
     (3, 'A'),
     (3, 'B')],
    ['ID', 'Block'])
Script:
w1 = W.partitionBy('Block').orderBy('ID')
w2 = W.partitionBy('Block')

# 1 where the ID changes within a Block, 0 otherwise
grp = F.when(F.lag('ID').over(w1) != F.col('ID'), 1).otherwise(0)

# a Block is kept only if no ID change was detected anywhere in it
df = df.withColumn('_flag', F.sum(grp).over(w2) == 0) \
       .filter('_flag').drop('_flag')
df.show()
# +---+-----+
# | ID|Block|
# +---+-----+
# |  3|    B|
# |  1|    C|
# |  1|    C|
# +---+-----+
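If you prefer to avoid the two-window bookkeeping, here is a sketch of the same idea (assuming df still holds the original input; re-create it if you have already run the script above, since the script overwrites df): keep a Block only when all of its rows share a single distinct ID.
from pyspark.sql import functions as F, Window as W

# keep a Block only if every row in it has the same ID
w = W.partitionBy('Block')
df_alt = (df.withColumn('_n_ids', F.size(F.collect_set('ID').over(w)))
            .filter(F.col('_n_ids') == 1)
            .drop('_n_ids'))
df_alt.show()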
Use window functions. Get ranks per group of blocks and throw away any rows that rank higher than 1. Code below:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, rank, col

(df.withColumn('index', row_number().over(Window.partitionBy().orderBy('ID', 'Block')))  # create an index to restore the original order afterwards
   .withColumn('BlockRank', rank().over(Window.partitionBy('Block').orderBy('ID')))  # rank per Block
   .orderBy('index')
   .where(col('BlockRank') == 1)
   .drop('index', 'BlockRank')
 ).show()
+---+-----+
| ID|Block|
+---+-----+
|  1|    A|
|  1|    C|
|  1|    C|
|  3|    B|
+---+-----+
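Note that keeping BlockRank == 1 still retains the first occurrence of Block A, which differs from the expected output in the question. A hedged variant of the same rank idea that instead drops the whole Block whenever its ranks go past 1 (i.e. the Block spans more than one ID):
from pyspark.sql import Window
from pyspark.sql.functions import rank, max as max_, col

# rank ties share the same value, so a Block only reaches rank 2 if it has
# at least two distinct IDs; drop every row of such Blocks
(df.withColumn('BlockRank', rank().over(Window.partitionBy('Block').orderBy('ID')))
   .withColumn('maxRank', max_('BlockRank').over(Window.partitionBy('Block')))
   .where(col('maxRank') == 1)
   .drop('BlockRank', 'maxRank')
   .show())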
I have two different DataFrames in PySpark, both of string type. The first DataFrame holds single words, while the second holds strings of words, i.e., sentences. I have to check whether each value of the first DataFrame's column exists in the second DataFrame's column. For example,
df2
+---+------+------+-----------------+
|age|height|  name|        Sentences|
+---+------+------+-----------------+
| 10|    80| Alice|   'Grace, Sarah'|
| 15|  null|   Bob|          'Sarah'|
| 12|  null|   Tom|'Amy, Sarah, Bob'|
| 13|  null|Rachel|       'Tom, Bob'|
+---+------+------+-----------------+
Second dataframe
df1
+-------+
|  token|
+-------+
|  'Ali'|
|'Sarah'|
|  'Bob'|
|  'Bob'|
+-------+
So, how can I search for each token of df1 in the df2 Sentences column? I need the count for each word and want to add it as a new column in df1.
I have tried this solution, but it works only for a single word, i.e., not for a complete DataFrame column.
Considering the DataFrames defined in the previous answer:
from pyspark.sql.functions import explode, split, length, trim

# one row per friend name mentioned in Sentences
df3 = df2.select('Sentences', explode(split('Sentences', ',')).alias('friends'))
df3 = df3.withColumn("friends", trim("friends")).withColumn("length_of_friends", length("friends"))
display(df3)  # display() is Databricks-specific; use df3.show() elsewhere
df3 = df3.join(df1, df1.token == df3.friends, how='inner').groupby('friends').count()
display(df3)
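If you then want those counts back on df1 as a new column, as the question asks, a possible follow-up (a sketch reusing the df1 and df3 names above; tokens that never appear come back as 0):
from pyspark.sql import functions as F

# left-join the per-friend counts onto df1; missing tokens become 0
df1_counts = (df1.join(df3, df1.token == df3.friends, how='left')
                 .select('token', F.coalesce(F.col('count'), F.lit(0)).alias('COUNT')))
df1_counts.show()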
You could use a PySpark UDF to create the new column in df1.
The problem is that you cannot access a second DataFrame inside a UDF (see here).
As advised in the referenced question, you can make the sentences available as a broadcast variable.
Here is a working example:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# Instantiate df2
cols = ["age", "height", "name", "Sentences"]
data = [
    (10, 80, "Alice", "Grace, Sarah"),
    (15, None, "Bob", "Sarah"),
    (12, None, "Tom", "Amy, Sarah, Bob"),
    (13, None, "Rachel", "Tom, Bob")
]
df2 = spark.createDataFrame(data).toDF(*cols)

# Instantiate df1
cols = ["token"]
data = [
    ("Ali",),
    ("Sarah",),
    ("Bob",),
    ("Bob",)
]
df1 = spark.createDataFrame(data).toDF(*cols)

# Create a broadcast variable from the Sentences column of df2
lstSentences = [data[0] for data in df2.select('Sentences').collect()]
sentences = spark.sparkContext.broadcast(lstSentences)

def countWordInSentence(word):
    # count the sentences (read from the broadcast value) that contain the word
    return sum(1 for item in sentences.value if word in item)

func_udf = udf(countWordInSentence, IntegerType())
df1 = df1.withColumn("COUNT", func_udf(df1["token"]))
df1.show()
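One caveat worth noting (my addition, not part of the original answer): word in item is a substring test, so a token such as 'Bob' would also count a hypothetical sentence containing 'Bobby'. If you need exact name matches, you could split each broadcast sentence on commas first, for example:
def countTokenExact(word):
    # compare against the trimmed names rather than the raw sentence string
    return sum(
        1 for item in sentences.value
        if word in [name.strip() for name in item.split(',')]
    )

exact_udf = udf(countTokenExact, IntegerType())
df1 = df1.withColumn("COUNT_EXACT", exact_udf(df1["token"]))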
I have a Spark DataFrame and a selective list of fields that need to be trimmed. In production this list of fields will vary for each data set. I am trying to write a generic piece of code that will do it for me. Here is what I have done so far:
df = sqlContext.createDataFrame([('abcd ','123 ','x ')], ['s', 'd', 'n'])
df.show()
+--------+-------+---+
|       s|      d|  n|
+--------+-------+---+
|abcd    |123    |x  |
+--------+-------+---+
All three of my attributes have trailing spaces. However, I only want to trim the spaces from column "s" and column "d".
>>> col_list=['s','d']
>>> df.select(*map(lambda x: trim(col(x)).alias(x),col_list)).show()
+----+---+
|   s|  d|
+----+---+
|abcd|123|
+----+---+
The above operation does trim the spaces for me if I pass the list to the lambda.
How do I also select the remaining columns? I have tried these:
>>> df.select('*',*map(lambda x: trim(col(x)).alias(x),col_list)).show()
+--------+-------+---+----+---+
|       s|      d|  n|   s|  d|
+--------+-------+---+----+---+
|abcd    |123    |x  |abcd|123|
+--------+-------+---+----+---+
>>> df.select(*map(lambda x: trim(col(x)),col_list),'*').show()
File "<stdin>", line 1
SyntaxError: only named arguments may follow *expression
How do I select the other attributes from this DataFrame without hardcoding them?
You could do something like this:
from pyspark.sql import functions as F

# create a list of all columns which aren't in col_list and concat it with your map
df.select(*([item for item in df.columns if item not in col_list]
            + list(map(lambda x: F.trim(F.col(x)).alias(x), col_list)))).show()
but for readability purposes I would recommend withColumn
for c in col_list:
    df = df.withColumn(c, F.trim(F.col(c)))
df.show()
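If you want the single-select approach but with the original column order preserved, a list-comprehension sketch (assuming the same df and col_list as above) may also read a bit more cleanly than concatenating two lists:
from pyspark.sql import functions as F

# trim the listed columns, pass the rest through untouched, keeping the order
df.select([F.trim(F.col(c)).alias(c) if c in col_list else F.col(c)
           for c in df.columns]).show()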
I have a pyspark DataFrame, say df1, with multiple columns.
I also have a list, say, l = ['a','b','c','d'], and these values are a subset of the values present in one of the columns of the DataFrame.
Now, I would like to do something like this:
df2 = df1.withColumn('new_column', expr("case when col_1 in l then 'yes' else 'no' end"))
But this is throwing the following error:
failure: "(" expected but identifier l found.
Any idea how to resolve this error or any better way of doing it?
You can achieve that with the isin function of the Column object:
df1 = sqlContext.createDataFrame([('a', 1), ('b', 2), ('c', 3)], ('col1', 'col2'))
l = ['a', 'b']
from pyspark.sql.functions import *
df2 = df1.withColumn('new_column', when(col('col1').isin(l), 'yes').otherwise('no'))
df2.show()
+----+----+----------+
|col1|col2|new_column|
+----+----+----------+
|   a|   1|       yes|
|   b|   2|       yes|
|   c|   3|        no|
+----+----+----------+
Note: For Spark < 1.5, use inSet instead of isin.
Reference: pyspark.sql.Column documentation
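If you prefer the SQL-expression style the question attempted, here is a sketch along those lines (hand-building the IN list from the Python list; fine for simple string literals, though isin above is the cleaner option):
from pyspark.sql.functions import expr

# embed the Python list as a literal SQL IN (...) list
in_list = ", ".join("'{}'".format(v) for v in l)
df2 = df1.withColumn(
    'new_column',
    expr("CASE WHEN col1 IN ({}) THEN 'yes' ELSE 'no' END".format(in_list)))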
Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo', 41, 'US', 3),
                       ('Foo', 39, 'UK', 1),
                       ('Bar', 57, 'CA', 2),
                       ('Bar', 72, 'CA', 2),
                       ('Baz', 22, 'US', 6),
                       ('Baz', 36, 'US', 6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed
But how do I remove duplicate rows based only on columns 1, 3 and 4? I.e., remove one of these two rows:
('Baz',22,'US',6)
('Baz',36,'US',6)
In pandas, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?
PySpark does include a dropDuplicates() method, which was introduced in 1.4: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|    80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
+---+------+-----+
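One caveat (my addition): dropDuplicates keeps an arbitrary row from each group of duplicates. If you need to control which row survives, a common pattern is a row_number window, sketched here assuming you want the smallest age per (name, height):
from pyspark.sql import Window
from pyspark.sql.functions import row_number

w = Window.partitionBy('name', 'height').orderBy('age')
(df.withColumn('_rn', row_number().over(w))
   .filter('_rn = 1')
   .drop('_rn')
   .show())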
From your question, it is unclear as to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then, you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
    return "{0}{1}{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now, you have a key-value RDD that is keyed by columns 1,3 and 4.
The next step would be either a reduceByKey or groupByKey and filter.
This would eliminate duplicates.
r = m.reduceByKey(lambda x, y: x)
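If you then want the deduplicated rows back without the synthetic key, a small follow-up:
# drop the key and keep one tuple per (col1, col3, col4) combination
deduped = r.values()
print(deduped.collect())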
I know you already accepted the other answer, but if you want to do this as a
DataFrame, just use groupBy and agg. Assuming you had a DF already created (with columns named "col1", "col2", etc) you could do:
myDF.groupBy($"col1", $"col3", $"col4").agg($"col1", max($"col2"), $"col3", $"col4")
Note that in this case, I chose the Max of col2, but you could do avg, min, etc.
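A rough PySpark equivalent of the same idea, assuming the hypothetical column names above (the select at the end just restores the original column order):
from pyspark.sql import functions as F

# keep one row per (col1, col3, col4), taking the max of col2 within each group
dedup = (myDF.groupBy('col1', 'col3', 'col4')
             .agg(F.max('col2').alias('col2'))
             .select('col1', 'col2', 'col3', 'col4'))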
Agree with David. To add on, we may not always want to groupBy all columns other than the column(s) in the aggregate function, i.e., we may want to remove duplicates purely based on a subset of columns while retaining all columns of the original DataFrame. The better way to do this is with the dropDuplicates DataFrame API, available since Spark 1.4.0.
For reference, see: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame
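Applied to the question's data in PySpark, that recommendation would look roughly like this (assuming the RDD of tuples is first converted to a DataFrame; the column names here are hypothetical):
# convert the RDD of tuples to a DataFrame, then dedupe on a subset of columns
df = data.toDF(['name', 'age', 'country', 'count'])
df.dropDuplicates(['name', 'country', 'count']).show()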
I used the built-in function dropDuplicates(). The Scala code is given below:
val data = sc.parallelize(List(("Foo", 41, "US", 3),
  ("Foo", 39, "UK", 1),
  ("Bar", 57, "CA", 2),
  ("Bar", 72, "CA", 2),
  ("Baz", 22, "US", 6),
  ("Baz", 36, "US", 6))).toDF("x", "y", "z", "count")
data.dropDuplicates(Array("x","count")).show()
Output :
+---+---+---+-----+
|  x|  y|  z|count|
+---+---+---+-----+
|Baz| 22| US|    6|
|Foo| 39| UK|    1|
|Foo| 41| US|    3|
|Bar| 57| CA|    2|
+---+---+---+-----+
The program below will help you drop duplicates across whole rows, or, if you want, drop duplicates based on certain columns:
import org.apache.spark.sql.SparkSession

object DropDuplicates {
  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-DropDuplicates")
        .master("local[4]")
        .getOrCreate()

    import spark.implicits._

    // create an RDD of tuples with some data
    val custs = Seq(
      (1, "Widget Co", 120000.00, 0.00, "AZ"),
      (2, "Acme Widgets", 410500.00, 500.00, "CA"),
      (3, "Widgetry", 410500.00, 200.00, "CA"),
      (4, "Widgets R Us", 410500.00, 0.0, "CA"),
      (3, "Widgetry", 410500.00, 200.00, "CA"),
      (5, "Ye Olde Widgete", 500.00, 0.0, "MA"),
      (6, "Widget Co", 12000.00, 10.00, "AZ")
    )
    val customerRows = spark.sparkContext.parallelize(custs, 4)

    // convert RDD of tuples to DataFrame by supplying column names
    val customerDF = customerRows.toDF("id", "name", "sales", "discount", "state")

    println("*** Here's the whole DataFrame with duplicates")
    customerDF.printSchema()
    customerDF.show()

    // drop fully identical rows
    val withoutDuplicates = customerDF.dropDuplicates()
    println("*** Now without duplicates")
    withoutDuplicates.show()

    val withoutPartials = customerDF.dropDuplicates(Seq("name", "state"))
    println("*** Now without partial duplicates too")
    withoutPartials.show()
  }
}
This is my DataFrame; the value 4 is repeated twice, so dropDuplicates will remove the repeated value.
scala> df.show
+-----+
|value|
+-----+
|    1|
|    4|
|    3|
|    5|
|    4|
|   18|
+-----+
scala> val newdf=df.dropDuplicates
scala> newdf.show
+-----+
|value|
+-----+
|    1|
|    3|
|    5|
|    4|
|   18|
+-----+