What is the right way to do a semi-join on two Spark RDDs (in PySpark)?

In my PySpark application, I have two RDDs:
items - This contains the item ID and item name for all valid items. Approximately 100,000 items.
attributeTable - This contains the fields user ID, item ID, and an attribute value for that combination, in that order. There is a certain attribute for each user-item combination in the system. This RDD has several hundred thousand rows.
I would like to discard all rows in attributeTable RDD that don't correspond to a valid item ID (or name) in the items RDD. In other words, a semi-join by the item ID. For instance, if these were R data frames, I would have done semi_join(attributeTable, items, by="itemID")
I tried the following approach first, but found that this takes forever to return (on my local Spark installation running on a VM on my PC). Understandably so, because there are such a huge number of comparisons involved:
# Create a broadcast variable of all valid item IDs for filtering on the executors
validItemIDs = sc.broadcast(items.map(lambda (itemID, itemName): itemID).collect())
attributeTable = attributeTable.filter(lambda (userID, itemID, attributes): itemID in set(validItemIDs.value))
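A likely culprit, in hindsight, is that set(validItemIDs.value) rebuilds the set for every row of the RDD. A minimal sketch of the same filter with the set constructed once, up front:
# Build the set once on the driver and broadcast it; the filter then
# reuses the prebuilt set instead of reconstructing it per row
validItemIDs = sc.broadcast(set(items.map(lambda (itemID, itemName): itemID).collect()))
attributeTable = attributeTable.filter(lambda (userID, itemID, attributes): itemID in validItemIDs.value)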
After a bit of fiddling around, I found that the following approach works pretty fast (a minute or so on my system).
# Create a broadcast variable for item ID to item name mapping (dictionary)
itemIdToNameMap = sc.broadcast(items.collectAsMap())
# From the attribute table, remove records that don't correspond to a valid item name.
# First go over all records in the table and add a dummy field indicating whether the item name is valid
# Then, filter out all rows with invalid names. Finally, remove the dummy field we added.
attributeTable = (attributeTable
.map(lambda (userID, itemID, attributes): (userID, itemID, attributes, itemIdToNameMap.value.get(itemID, 'Invalid')))
.filter(lambda (userID, itemID, attributes, itemName): itemName != 'Invalid')
.map(lambda (userID, itemID, attributes, itemName): (userID, itemID, attributes)))
Although this works well enough for my application, it feels more like a dirty workaround, and I am pretty sure there must be a cleaner or more idiomatically correct (and possibly more efficient) way to do this in Spark. What would you suggest? I am new to both Python and Spark, so any RTFM advice will also be helpful if you could point me to the right resources.
My Spark version is 1.3.1.

Just do a regular join and then discard the "lookup" relation (in your case, the items RDD).
If these are your RDDs (example taken from another answer):
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
then you'd do:
(attributeTable.keyBy(lambda x: x[1])
 .join(items)
 .map(lambda (key, (attribute, item)): attribute))
And as a result, you only have tuples from attributeTable RDD which have a corresponding entry in the items RDD:
[(123456, 123, 'Attribute for A')]
Doing it via leftOuterJoin, as suggested in the other answer, will also do the job, but is less efficient. Also, the other answer semi-joins items with attributeTable instead of attributeTable with items.

As others have pointed out, this is probably most easily accomplished by leveraging DataFrames. However, you might be able to accomplish your intended goal by using the leftOuterJoin and the filter functions. Something a bit hackish like the following might suffice:
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
sorted(items.leftOuterJoin(attributeTable.keyBy(lambda x: x[1]))
.filter(lambda x: x[1][1] is not None)
.map(lambda x: (x[0], x[1][0])).collect())
returns
[(123, 'Item A')]
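For completeness: on newer Spark versions (with a SparkSession, here assumed to be named spark), the DataFrame API exposes a true semi-join directly, which is the idiomatic equivalent of R's semi_join. A minimal sketch using the column names from the question:
# Build DataFrames from the example RDDs (column names are assumptions)
itemsDF = spark.createDataFrame(items, ["itemID", "itemName"])
attrDF = spark.createDataFrame(attributeTable, ["userID", "itemID", "attributes"])
# left_semi keeps only attrDF rows whose itemID appears in itemsDF,
# without adding any columns from itemsDF
attrDF.join(itemsDF, "itemID", "left_semi").show()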

Related

Reading guarantees for full table scan while updating the table?

Given schema:
CREATE TABLE keyspace.table (
    key text,
    ckey text,
    value text,
    PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
  .mapPartitions { partition =>
    connector.withSessionDo { session =>
      partition.foreach { row =>
        val key = row.getString("key")
        val ckey = Random.nextString(42)
        val value = row.getString("value")
        session.execute(s"INSERT INTO keyspace.table (key, ckey, value)" +
          s" VALUES ('$key', '$ckey', '$value')")
      }
    }
  }
Is it possible for code like this to read a newly inserted value within a single application (Spark job) run? A more general version of my question: can a token range scan CQL query read newly inserted values while iterating over rows?
Yes, it is possible, exactly as Alex wrote, but I don't think it can happen with the code above.
Per the data model, the table is ordered by ckey in ascending order. The funny part, however, is the page size and how many pages are prefetched: since the default is 1000 rows (spark.cassandra.input.fetch.sizeInRows), the only way a problem could occur is if you used something much bigger than 42 and/or the executor had not paged yet.
Also, I think you are using unnecessary nesting, so the code to achieve what you want can be simplified (after all, cassandraTable already gives you an RDD of rows).
(I hope I understand correctly that you want to read per partition (note that a partition in your case is all rows under one partition key, "key") and, for every row (distinguished by ckey) in that partition, generate a new one that duplicates the value under a new random ckey. The use case for such code is a mystery to me, but I hope it makes some sense. :-))
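For illustration only, a rough PySpark sketch of that simplified flow using the connector's DataFrame source (this assumes the spark-cassandra-connector package is on the classpath and a SparkSession named spark; the random ckey is approximated with a hash of rand(), standing in for Random.nextString(42)):
from pyspark.sql import functions as F

# Read the whole table through the connector's DataFrame source
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="keyspace", table="table")
      .load())

# Derive new rows with fresh pseudo-random ckeys and append them back
new_rows = df.withColumn("ckey", F.sha2(F.rand().cast("string"), 256))
(new_rows.write.format("org.apache.spark.sql.cassandra")
 .options(keyspace="keyspace", table="table")
 .mode("append")
 .save())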

SPARK Combining Neighbouring Records in a text file

Very new to Spark.
I need to read a very large input dataset, but I fear the format of the input files will not be amenable to reading in Spark. The format is as follows:
RECORD,record1identifier
SUBRECORD,value1
SUBRECORD2,value2
RECORD,record2identifier
RECORD,record3identifier
SUBRECORD,value3
SUBRECORD,value4
SUBRECORD,value5
...
Ideally, what I would like to do is pull the lines of the file into a Spark RDD and then transform it into an RDD with only one item per record (with the subrecords becoming part of their associated record item).
So if the example above were read in, I'd want to wind up with an RDD containing 3 objects: [record1, record2, record3]. Each object would contain the data from its RECORD line and any associated SUBRECORD entries.
The unfortunate bit is that the only thing in this data linking subrecords to records is their position in the file, underneath their record. That means the problem is sequentially dependent and might not lend itself to Spark.
Is there a sensible way to do this using Spark (and if so, what could that be; what transform could be used to collapse the subrecords into their associated record)? Or is this the sort of problem one needs to solve outside Spark?
There is a somewhat hackish way to identify the sequence of records and sub-records. This method assumes that each new "record" is identifiable in some way.
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("RECORD", "record1identifier"),
  ("SUBRECORD", "value1"),
  ("SUBRECORD2", "value2"),
  ("RECORD", "record2identifier"),
  ("RECORD", "record3identifier"),
  ("SUBRECORD", "value3"),
  ("SUBRECORD", "value4"),
  ("SUBRECORD", "value5")
).toDS().rdd.zipWithIndex.map(r => (r._1._1, r._1._2, r._2)).toDF("record", "value", "id")
val win = Window.orderBy("id")
val recids = df.withColumn("newrec", ($"record" === "RECORD").cast(LongType))
  .withColumn("recid", sum($"newrec").over(win))
  .select($"recid", $"record", $"value")
val recs = recids.where($"record" === "RECORD").select($"recid", $"value".as("recname"))
val subrecs = recids.where($"record" =!= "RECORD").select($"recid", $"value".as("attr"))
recs.join(subrecs, Seq("recid"), "left").groupBy("recname").agg(collect_list("attr").as("attrs")).show()
This snippet will first zipWithIndex to identify each row in order, then add a boolean column that is true every time a "record" is identified and false otherwise. We then cast that boolean to a long and do a running sum over it, which has the neat side effect of labeling every record and its sub-records with a common identifier.
In this particular case, we then split to get the record identifiers, re-join only the sub-records, group by the record ids, and collect the sub-record values to a list.
The above snippet results in this:
+-----------------+--------------------+
| recname| attrs|
+-----------------+--------------------+
|record1identifier| [value1, value2]|
|record2identifier| []|
|record3identifier|[value3, value4, ...|
+-----------------+--------------------+
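For PySpark readers, a rough equivalent of the same pipeline (zipWithIndex, a running sum over an ordering window to label each record group, then split and re-join), assuming a SparkSession named spark:
from pyspark.sql import functions as F, Window

rows = [("RECORD", "record1identifier"), ("SUBRECORD", "value1"),
        ("SUBRECORD2", "value2"), ("RECORD", "record2identifier"),
        ("RECORD", "record3identifier"), ("SUBRECORD", "value3"),
        ("SUBRECORD", "value4"), ("SUBRECORD", "value5")]
# Number each line, in file order
df = (spark.sparkContext.parallelize(rows)
      .zipWithIndex()
      .map(lambda r: (r[0][0], r[0][1], r[1]))
      .toDF(["record", "value", "id"]))

# Running sum of RECORD markers labels each record and its sub-records
win = Window.orderBy("id")
recids = df.withColumn("recid", F.sum((df["record"] == "RECORD").cast("long")).over(win))
recs = recids.where(recids["record"] == "RECORD").select("recid", F.col("value").alias("recname"))
subrecs = recids.where(recids["record"] != "RECORD").select("recid", F.col("value").alias("attr"))
(recs.join(subrecs, ["recid"], "left")
 .groupBy("recname").agg(F.collect_list("attr").alias("attrs"))
 .show())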

spark: How does salting work in dealing with skewed data

I have skewed data in a table, which is then compared with another table that is small.
I understood how salting works with joins: a random number from some range is appended to the keys of the big table with the skewed data, and the rows in the small, non-skewed table are duplicated once per value in that same range. The matching then still happens, because there will be a hit on one of the duplicated values for any particular salted key of the skewed table.
I also read that salting is helpful when performing groupBy. My question is: when random numbers are appended to the key, doesn't that break the group? If it does, then the meaning of the groupBy operation has changed.
"My question is when random numbers are appended to the key doesn't it break the group?"
Well, it does. To mitigate this, you can run the group-by operation twice: first with the salted key, then remove the salt and group again. The second grouping takes partially aggregated data, thus significantly reducing the skew impact.
E.g.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// n is the number of salt buckets; groupByFields and aggFields are placeholders
df.withColumn("salt", (rand * n).cast(IntegerType))
  .groupBy("salt", groupByFields)
  .agg(aggFields)
  .groupBy(groupByFields)
  .agg(aggFields)
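As a concrete illustration, a minimal PySpark sketch of the same two-stage idea for a count; the column name key and the bucket count n are assumptions:
from pyspark.sql import functions as F

n = 16  # number of salt buckets (assumed)
salted = df.withColumn("salt", (F.rand() * n).cast("int"))
# Stage 1: aggregate per (key, salt) so no single task sees a whole hot key
partial = salted.groupBy("key", "salt").agg(F.count("*").alias("cnt"))
# Stage 2: combine the partial counts per original key
result = partial.groupBy("key").agg(F.sum("cnt").alias("cnt"))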
"My question is when random numbers are appended to the key doesn't it break the group? If if does then the meaning of group by operation has changed."
Yes, adding a salt to an existing key will break the group. However, as @Gelerion mentioned in his answer, you can group by the salted and original key and afterwards group by the original key alone. This works well for aggregations such as
count
min
max
sum
where it is possible to combine results from sub-groups. The example below shows the salting technique applied to a join against a skewed DataFrame.
var df1 = Seq((1,"a"),(2,"b"),(1,"c"),(1,"x"),(1,"y"),(1,"g"),(1,"k"),(1,"u"),(1,"n")).toDF("ID","NAME")
df1.createOrReplaceTempView("fact")
var df2 = Seq((1,10),(2,30),(3,40)).toDF("ID","SALARY")
df2.createOrReplaceTempView("dim")
val salted_df1 = spark.sql("""select concat(ID, '_', FLOOR(RAND(123456)*19)) as salted_key, NAME from fact """)
salted_df1.createOrReplaceTempView("salted_fact")
val exploded_dim_df = spark.sql(""" select ID, SALARY, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key from dim""")
//val exploded_dim_df = spark.sql(""" select ID, SALARY, explode(sequence(0, 19)) as salted_key from dim""")  // Spark 2.4+ alternative
exploded_dim_df.createOrReplaceTempView("salted_dim")
val result_df = spark.sql("""select split(fact.salted_key, '_')[0] as ID, dim.SALARY
from salted_fact fact
LEFT JOIN salted_dim dim
ON fact.salted_key = concat(dim.ID, '_', dim.salted_key) """)
display(result_df)

GroupBy of a dataframe by comparison against a set of items

So I have a dataframe of movies with about 10K rows. It has a column that captures each movie's genre as a comma-separated string. Since a movie can be classified in multiple genres, I needed to create a set of genres that contained all possible genres across the 10K rows. So I went about it as follows:
simplist = []
for i in df.genres.values:
    vlist = i.split(', ')
    for item in vlist:
        simplist.append(item)
gset = set(simplist)
This gets me a set of 24 genres from the 27K items in simplist, which is awesome. But here's the pinch:
I want to group by genre, comparing each movie's genres against the set, and then do aggregation and other operations, AND
I want the output to be 24 distinct groups, such that if a movie has more than one of the genres in the set, it shows up in each of those groups (this removes sorting or tagging bias from the data-gathering phase).
Is groupby even the right way to go about this?
Thanks for your input/thoughts/options/approach in advance.
Ok, so I made some headway, but I'm still unable to put the puzzle pieces together.
I started off by making a list and a set (I don't know which I will end up using) of unique values:
simplist = []
for i in df.genres.values:
    vlist = i.split(', ')
    for item in vlist:
        simplist.append(item)
gset = set(simplist)
g_list = list(gset)
Then, separately, I used df.pivot_table to structure the analysis:
table7 = df.pivot_table(index=['release_year'], aggfunc={'runtime': np.median, 'popularity': np.mean}, fill_value=0, dropna=True)
But here's the thing: it would be awesome if I could index by g_list, or check 'genres' against the gset of 24 distinct items, but df.pivot_table does not support that. Leaving the index at genres creates ~2000 rows and is not meaningful.
Got it!! I wanted to thank a bunch of offline folks and Pythonistas who pointed me in the right direction. Turns out I'd been spinning my wheels with sets and lists when a single pandas command (well, 3 to be precise) does the trick!!
df2 = pd.DataFrame(df.genres.str.split(', ').tolist(), index=[df.col1, df.col2, df.coln]).stack()
df2 = df2.reset_index()[[0, 'col1', 'col2', 'coln']]
df2.columns = ['Genre', 'col1', 'col2', 'coln']
This creates a second dataframe (df2) that keeps the key columns for analysis from the original dataframe, with the rows duplicated/attributed to each genre. You see the true value of this when you turn around and do something like:
revenue_table = df2.pivot_table(index=['Release Year', 'Genre'], values=['Profit'], aggfunc={'Profit': np.sum}, fill_value=0, dropna=True)
or anything to similar effect or use case.
Closing this but would appreciate any notes on more efficient ways to do this.
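For what it's worth, newer pandas (0.25+) can do the split-and-duplicate in one step with DataFrame.explode. A minimal sketch, assuming the genres column from the question:
# Split the comma-separated string into a list, then emit one row per genre
df2 = df.assign(Genre=df['genres'].str.split(', ')).explode('Genre')
Each movie row is repeated once per genre, which is exactly the 24-group behaviour asked for above.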

How to efficiently select distinct rows on an RDD based on a subset of its columns

Consider a case class:
case class Prod(productId: String, date: String, qty: Int /* , many other attributes */)
And an
val rdd: RDD[Prod]
containing many instances of that class.
The unique key is intended to be the (productId, date) tuple. However, we do have some duplicates.
Is there any efficient means to remove the duplicates?
The operation
rdd.distinct
would look for entire rows that are duplicated.
A fallback would involve joining the unique (productId, date) combinations back to the entire rows; I am working through exactly how to do this, but even so it is several operations. A simpler (and faster?) approach would be useful if it exists.
I'd use dropDuplicates on Dataset:
val rdd = sc.parallelize(Seq(
  Prod("foo", "2010-01-02", 1), Prod("foo", "2010-01-02", 2)
))
rdd.toDS.dropDuplicates("productId", "date")
but reduceByKey should work as well:
rdd.keyBy(prod => (prod.productId, prod.date)).reduceByKey((x, _) => x).values
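Note that reduceByKey keeps an arbitrary representative per key (here, the first argument wins), which is fine when rows sharing a key are interchangeable. For PySpark readers, a rough equivalent of the RDD route, assuming plain tuples shaped like (productId, date, qty, ...):
# Key by (productId, date), keep one arbitrary row per key, drop the keys
deduped = (rdd.keyBy(lambda p: (p[0], p[1]))
              .reduceByKey(lambda x, _: x)
              .values())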
