Efficient Dataframe lookup in Apache Spark - python-3.x

I want to efficiently look up many IDs. What I have is a dataframe df_source that looks like the one below, but with a couple of million records distributed across 10 workers:
+-------+----------------+
| URI| Links_lists|
+-------+----------------+
| URI_1|[URI_8,URI_9,...|
| URI_2|[URI_6,URI_7,...|
| URI_3|[URI_4,URI_1,...|
| URI_4|[URI_1,URI_5,...|
| URI_5|[URI_3,URI_2,...|
+-------+----------------+
My first step would be to make an RDD out of df_source:
rdd_source = df_source.rdd
Out of rdd_source I want to create an RDD that contains only the URIs paired with IDs. I do it like this:
rdd_index = rdd_source.map(lambda x: x[0]).zipWithUniqueId()
Now I also .flatMap() rdd_source into an RDD that contains all relations, which until now were contained only within the Links_lists column:
rdd_relations = rdd_source.flatMap(lambda x: [(x[0], link) for link in x[1]])
Now I transform both rdd_index and rdd_relations back into dataframes, because I want to do joins and I think (I might be wrong on this) that joins on dataframes are faster.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema_index = StructType([
    StructField("URI", StringType(), True),
    StructField("ID", IntegerType(), True)
])
df_index = sqlContext.createDataFrame(rdd_index, schema=schema_index)
and
schema_relation = StructType([
    StructField("URI", StringType(), True),
    StructField("LINK", StringType(), True)
])
df_relations = sqlContext.createDataFrame(rdd_relations, schema=schema_relation)
The resulting dataframes should look like these two:
df_index:
+-------+-------+
| URI| ID|
+-------+-------+
| URI_1| 1|
| URI_2| 2|
| URI_3| 3|
| URI_4| 4|
| URI_5| 5|
+-------+-------+
df_relations:
+-------+-------+
| URI| LINK|
+-------+-------+
| URI_1| URI_5|
| URI_1| URI_8|
| URI_1| URI_9|
| URI_2| URI_3|
| URI_2| URI_4|
+-------+-------+
Now, to replace the long URI strings in df_relations, I will do joins against df_index. The first join:
from pyspark.sql.functions import col

df_relations =\
    df_relations.join(df_index, df_relations.URI == df_index.URI, 'inner')\
                .select(col('ID').alias('URI_ID'), col('LINK'))
This should yield me a dataframe looking like this:
df_relations:
+-------+-------+
| URI_ID| LINK|
+-------+-------+
| 1| URI_5|
| 1| URI_8|
| 1| URI_9|
| 2| URI_3|
| 2| URI_4|
+-------+-------+
And the second join:
df_relations =\
    df_relations.join(df_index, df_relations.LINK == df_index.URI, 'inner')\
                .select(col('URI_ID'), col('ID').alias('LINK_ID'))
This should result in the final dataframe, the one I need, looking like this:
df_relations:
+-------+-------+
| URI_ID|LINK_ID|
+-------+-------+
| 1| 5|
| 1| 8|
| 1| 9|
| 2| 3|
| 2| 4|
+-------+-------+
where all URIs are replaced with IDs from df_index.
Is this an efficient way to look up the IDs for all URIs in both columns of the relation table, or is there a more effective way of doing this?
I'm using Apache Spark 2.1.0 with Python 3.5

First, you do not need to use RDDs for the operations you described; using RDDs can be very costly. Second, you do not need to do two joins; you can do just one:
import pyspark.sql.functions as f

# add a unique id for each URI
withID = df_source.withColumn("URI_ID", f.monotonically_increasing_id())
# create a single row from each element in the array
exploded = withID.select("URI_ID", f.explode("Links_lists").alias("LINK"))
linkID = withID.withColumnRenamed("URI_ID", "LINK_ID").drop("Links_lists")
joined = exploded.join(linkID, on=exploded.LINK == linkID.URI).drop("URI").drop("LINK")
Lastly, if linkID (which is basically df_source with a column replaced) is relatively small (i.e. it can be fully contained in a single worker), you can broadcast it. Add the following before the join:
linkID = f.broadcast(linkID)
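Alternatively, you can apply the broadcast hint inline in the join instead of reassigning linkID; the effect is the same, it is just a stylistic choice:
# same join as above, with the broadcast hint applied directly to linkID
joined = exploded.join(f.broadcast(linkID), on=exploded.LINK == linkID.URI) \
                 .drop("URI").drop("LINK")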

Related

How to update dataframe column value while joining with other dataframe in pyspark?

I have 3 dataframes, df1 (EMPLOYEE_INFO), df2 (DEPARTMENT_INFO), and df3 (COMPANY_INFO), and I want to update a column in df1 by joining all three dataframes. The name of the column is FLAG_DEPARTMENT, and it is in df1. I need to set FLAG_DEPARTMENT='POLITICS'. In SQL the query would look like this:
UPDATE [COMPANY_INFO] INNER JOIN ([DEPARTMENT_INFO]
INNER JOIN [EMPLOYEE_INFO] ON [DEPARTMENT_INFO].DEPT_ID = [EMPLOYEE_INFO].DEPT_ID)
ON [COMPANY_INFO].[COMPANY_DEPT_ID] = [DEPARTMENT_INFO].[DEP_COMPANYID]
SET EMPLOYEE_INFO.FLAG_DEPARTMENT = "POLITICS";
If the values in the columns of these three tables match, I need to set FLAG_DEPARTMENT='POLITICS' in my EMPLOYEE_INFO table.
How can I achieve the same thing in pyspark? I have just started learning pyspark and don't have much in-depth knowledge.
You can use a chain of joins with a select on top of it.
Suppose that you have the following pyspark DataFrames:
employee_df
+---------+-------+
| Name|dept_id|
+---------+-------+
| John| dept_a|
| Liù| dept_b|
| Luke| dept_a|
| Michail| dept_a|
| Noe| dept_e|
|Shinchaku| dept_c|
| Vlad| dept_e|
+---------+-------+
department_df
+-------+----------+------------+
|dept_id|company_id| description|
+-------+----------+------------+
| dept_a| company1|Department A|
| dept_b| company2|Department B|
| dept_c| company5|Department C|
| dept_d| company3|Department D|
+-------+----------+------------+
company_df
+----------+-----------+
|company_id|description|
+----------+-----------+
| company1| Company 1|
| company2| Company 2|
| company3| Company 3|
| company4| Company 4|
+----------+-----------+
Then you can run the following code to add the flag_department column to your employee_df:
from pyspark.sql import functions as F

employee_df = (
    employee_df.alias('a')
    .join(
        department_df.alias('b'),
        on='dept_id',
        how='left',
    )
    .join(
        company_df.alias('c'),
        on=F.col('b.company_id') == F.col('c.company_id'),
        how='left',
    )
    .select(
        *[F.col(f'a.{c}') for c in employee_df.columns],
        F.when(
            F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
            F.lit('POLITICS')
        ).alias('flag_department')
    )
)
The new employee_df will be:
+---------+-------+---------------+
| Name|dept_id|flag_department|
+---------+-------+---------------+
| John| dept_a| POLITICS|
| Liù| dept_b| POLITICS|
| Luke| dept_a| POLITICS|
| Michail| dept_a| POLITICS|
| Noe| dept_e| null|
|Shinchaku| dept_c| null|
| Vlad| dept_e| null|
+---------+-------+---------------+
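Note that with this approach flag_department will be null for employees whose department or company does not match. If you want to keep an existing FLAG_DEPARTMENT value instead (closer to the semantics of the SQL UPDATE, and assuming your employee_df already has such a column, as the question says), you can add an otherwise clause and use this Column in place of the F.when(...).alias(...) expression in the select above:
from pyspark.sql import functions as F

# assumes employee_df (aliased 'a') already has a FLAG_DEPARTMENT column
flag_col = F.when(
    F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
    F.lit('POLITICS')
).otherwise(F.col('a.FLAG_DEPARTMENT')).alias('flag_department')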

Search for 'Proper case' and mark it invalid using Pyspark

I have a big data set with multiple columns in it. A DataFrame example is below. Here the column 'first' holds names which I want to check for proper case, e.g. aamir should be Aamir and aamir malik should be Aamir Malik.
I want something like below.
I used Pyspark and the code below, where I am getting the right answer, but I want to detect it first and then make the changes.
Here I have added a new column 'correct' and applied the function:
name_check_1 = name_check.withColumn("correct", initcap(col("first")))
Then I compare the columns correct and first, which gives me the names that are not in proper case.
name_check_2 = name_check_1.filter('correct != first')
I need a way to first get the names that are not in proper case, and then the correction.
My solution is below:
Logic: Slice the first letter of the string and compare it with the correct string; if equal, it is valid, else invalid. Uppercase the first letter of firstname and lastname, lowercase the rest, and concatenate. Select only the relevant columns.
from pyspark.sql.functions import *
from pyspark.sql.types import *
values = [
    (1, "aamir"),
    (2, "Aamir"),
    (3, "atif"),
    (4, "Atif"),
    (5, "tahir"),
    (6, "sameer"),
    (7, "ifzaan"),
    (8, "Ifzaan"),
    (9, "Saquib"),
    (10, "aamir malik"),
    (11, "adcA")
]
rdd = sc.parallelize(values)
schema = StructType([
    StructField("IDs", IntegerType(), True),
    StructField("first", StringType(), True)
])
#create dataframe
data = spark.createDataFrame(rdd, schema)
#split first column into firstname and lastname
data = data.withColumn("firstname", split(data["first"]," ")[0]).withColumn("lastname", split(data["first"]," ")[1])
data = data \
    .withColumn("flag", when((trim(substring(data["firstname"],0,1)) == upper(trim(substring(data["firstname"],0,1)))) |
                             (trim(substring(data["lastname"],0,1)) == upper(trim(substring(data["lastname"],0,1)))), lit("valid")).otherwise(lit("invalid"))) \
    .withColumn("correct", concat(concat(upper(trim(substring(data["firstname"],0,1))), trim(lower(substring(data["firstname"],2,1000)))), lit(" "),
                                  when(data["lastname"].isNull(), lit(""))
                                  .otherwise(concat(upper(trim(substring(data["lastname"],0,1))), trim(lower(substring(data["lastname"],2,1000))))))) \
    .select("IDs","first","flag","correct")
data.show()
#Result
+---+-----------+-------+-----------+
|IDs| first| flag| correct|
+---+-----------+-------+-----------+
| 1| aamir|invalid| Aamir |
| 2| Aamir| valid| Aamir |
| 3| atif|invalid| Atif |
| 4| Atif| valid| Atif |
| 5| tahir|invalid| Tahir |
| 6| sameer|invalid| Sameer |
| 7| ifzaan|invalid| Ifzaan |
| 8| Ifzaan| valid| Ifzaan |
| 9| Saquib| valid| Saquib |
| 10|aamir malik|invalid|Aamir Malik|
| 11| adcA|invalid| Adca |
+---+-----------+-------+-----------+
You already know how to use initcap, so just create a new column correct and compare it to the column first to check whether it's already valid or not:
df.withColumn("correct", initcap(lower(col("first")))) \
.withColumn("flag", when(col("correct") != col("first"), lit("invalid")).otherwise("valid")) \
.show()
Gives:
+---+-----------+-----------+-------+
| id| first| correct| flag|
+---+-----------+-----------+-------+
| 1| aamir| Aamir|invalid|
| 2| Aamir| Aamir| valid|
| 3| atif| Atif|invalid|
| 4| Atif| Atif| valid|
| 5| tahir| Tahir|invalid|
| 6| sameer| Sameer|invalid|
| 7| ifzaan| Ifzaan|invalid|
| 8|Ifzaan Abcd|Ifzaan Abcd| valid|
| 9|Saquib abcd|Saquib Abcd|invalid|
+---+-----------+-----------+-------+

Apache Spark: Get the first and last row of each partition

I would like to get the first and last row of each partition in spark (I'm using pyspark). How do I go about this?
In my code I repartition my dataset based on a key column using:
mydf.repartition(keyColumn).sortWithinPartitions(sortKey)
Is there a way to get the first row and last row for each partition?
Thanks
I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.
However, you seem to have a keyColumn and a sortKey, so I'd suggest doing the following:
import pyspark
import pyspark.sql.functions as f
w_asc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))
res_df = mydf. \
    withColumn("rn_asc", f.row_number().over(w_asc)). \
    withColumn("rn_desc", f.row_number().over(w_desc)). \
    where("rn_asc = 1 or rn_desc = 1")
The resulting dataframe will have 2 additional columns, where rn_asc=1 indicates the first row and rn_desc=1 indicates the last row.
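If you only need the original columns in the output, you can drop the helper columns afterwards:
res_df = res_df.drop("rn_asc", "rn_desc")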
Scala: I think repartition is not by some key column; rather, it takes an integer for how many partitions you want to set. I made a way to select the first and last row by using Spark's Window function.
First, this is my test data.
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 3|
| 3| 5|
+---+-----+
Then, I use the Window function twice, because I cannot know the last row easily but the reverse is quite easy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val a = Window.partitionBy("id").orderBy("value")
val d = Window.partitionBy("id").orderBy(col("value").desc)
val df = spark.read.option("header", "true").csv("test.csv")
df.withColumn("marker", when(rank.over(a) === 1, "Y").otherwise("N"))
  .withColumn("marker", when(rank.over(d) === 1, "Y").otherwise(col("marker")))
  .filter(col("marker") === "Y")
  .drop("marker").show
The final result is then:
+---+-----+
| id|value|
+---+-----+
| 3| 5|
| 3| 1|
| 1| 4|
| 1| 1|
| 2| 3|
| 2| 1|
+---+-----+
Here is another approach using mapPartitions from the RDD API. We iterate over the elements of each partition until we reach the end. I would expect this iteration to be very fast, since we keep only the two edge elements and simply pass over everything in between. Here is the code:
df = spark.createDataFrame([
    ["Tom", "a"],
    ["Dick", "b"],
    ["Harry", "c"],
    ["Elvis", "d"],
    ["Elton", "e"],
    ["Sandra", "f"]
], ["name", "toy"])

def get_first_last(it):
    first = last = next(it)
    for last in it:
        pass

    # Attention: if first equals last by reference return only one!
    if first is last:
        return [first]

    return [first, last]

# coalesce here is just for demonstration
first_last_rdd = df.coalesce(2).rdd.mapPartitions(get_first_last)

spark.createDataFrame(first_last_rdd, ["name", "toy"]).show()
# +------+---+
# | name|toy|
# +------+---+
# | Tom| a|
# | Harry| c|
# | Elvis| d|
# |Sandra| f|
# +------+---+
PS: Odd positions will contain the first element of each partition and the even ones the last item. Also note that the number of results will be (numPartitions * 2) - numPartitionsWithOneItem, which I expect to be relatively small, so you shouldn't worry about the cost of the additional createDataFrame statement.

generate 2 rows for each row in spark using optimized DSL

I have data like:
id,ts_start,ts_end,foo_start,foo_end
1,1,2,f_s,f_e
2,3,4,foo,bar
3,3,6,foo,f_e
I.e. a single record with all the start and end information aggregated.
Using a flat map, these could be transformed to
id,ts,foo
1,1,f_s
1,2,f_e
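A minimal sketch of that flatMap approach, for reference (assuming the input dataframe is called df and has the columns above; this is the kind of custom code I would like to avoid):
# emit one (id, ts, foo) row for the start and one for the end of every input row
long_rows = df.rdd.flatMap(lambda r: [
    (r["id"], r["ts_start"], r["foo_start"]),
    (r["id"], r["ts_end"], r["foo_end"]),
])
df_long = long_rows.toDF(["id", "ts", "foo"])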
How can I do the same using the optimized SQL DSL with explode or maybe pivot?
edit
Obviously, I do not want to read in the data two times and union the result.
Or is this the only option if I do not want to use flatmap + serde + custom code?
given:
val df = Seq(
  (1, 1, 2, "f_s", "f_e"),
  (2, 3, 4, "foo", "bar"),
  (3, 3, 6, "foo", "f_e")
).toDF("id", "ts_start", "ts_end", "foo_start", "foo_end")
you can do:
df
  .select($"id",
    explode(
      array(
        struct($"ts_start".as("ts"), $"foo_start".as("foo")),
        struct($"ts_end".as("ts"), $"foo_end".as("foo"))
      )
    ).as("tmp")
  )
  .select(
    $"id",
    $"tmp.*"
  )
  .show()
which gives:
+---+---+---+
| id| ts|foo|
+---+---+---+
| 1| 1|f_s|
| 1| 2|f_e|
| 2| 3|foo|
| 2| 4|bar|
| 3| 3|foo|
| 3| 6|f_e|
+---+---+---+
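In PySpark, an equivalent sketch of the same explode-over-array-of-structs approach would be:
from pyspark.sql import functions as F

# pair each (ts, foo) start/end combination into a struct, explode, then flatten
df.select(
    "id",
    F.explode(
        F.array(
            F.struct(F.col("ts_start").alias("ts"), F.col("foo_start").alias("foo")),
            F.struct(F.col("ts_end").alias("ts"), F.col("foo_end").alias("foo"))
        )
    ).alias("tmp")
).select("id", "tmp.*").show()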

Add columns on a Pyspark Dataframe

I have a Pyspark Dataframe with this structure:
+----+---+---+---+---+
|user|A/B|  C|A/B|  C|
+----+---+---+---+---+
|   1|  0|  1|  1|  2|
|   2|  0|  2|  4|  0|
+----+---+---+---+---+
Originally I had two dataframes, but I outer joined them using user as the key, so there could also be null values. I can't find a way to sum the columns with equal names in order to get a dataframe like this:
+----+---+---+
|user|A/B|  C|
+----+---+---+
|   1|  1|  3|
|   2|  4|  2|
+----+---+---+
Also note that there could be many equal columns, so selecting literally each column is not an option. In pandas this was possible using "user" as Index and then adding both dataframes. How can I do this on Spark?
I have a workaround for this:
val dataFrameOneColumns = df1.columns.map(a => if (a.equals("user")) a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)
Now make the join; the output will contain the values under different names.
Then make the list of tuples of the columns to be combined:
val newlist = df1.columns.filterNot(_.equals("user")).zip(dataFrameOneColumns.filterNot(_.equals("user")))
And then combine the values of the columns within each tuple to get the desired output!
PS: I am guessing you can write the logic for combining, so I am not spoon-feeding!
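For reference, here is one possible way the combining step could look in PySpark (just a sketch: it assumes both original dataframes share the same column names, that df1's columns got the "_1" suffix as above, and that the joined result is called joined; adjust the names to your own):
from pyspark.sql import functions as F

# "joined" and "df2" are assumed names from the description above;
# coalesce handles the nulls introduced by the outer join
summed = joined.select(
    "user",
    *[(F.coalesce(F.col(c + "_1"), F.lit(0)) + F.coalesce(F.col(c), F.lit(0))).alias(c)
      for c in df2.columns if c != "user"]
)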
