how to efficiently parse dataframe object into a map of key-value pairs - apache-spark

i'm working with a dataframe with the columns basketID and itemID. is there a way to efficiently parse through the dataset and generate a map where the keys are basketID and the value is a set of all the itemID contained within each basket?
my current implementation uses a for loop over the data frame which isn't very scalable. is it possible to do this more efficiently? any help would be appreciated thanks!
screen shot of sample data
the goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). heres the implementation I have using a for loop
// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
basket("b" + i.toString) = Set();
}
// loop over every row in df and store the items to the set
df.collect().foreach(row =>
basket(row(0).toString) += row(1).toString
)

You can simply do aggregateByKey operation then collectItAsMap will directly give you the desired result. It is much more efficient than simple groupBy.
import scala.collection.mutable
case class Items(basketID: String,itemID: String)
import spark.implicits._
val result = output.as[Items].rdd.map(x => (x.basketID,x.itemID))
.aggregateByKey[mutable.Buffer[String]](new mutable.ArrayBuffer[String]())
((l: mutable.Buffer[String], p: String) => l += p ,
(l1: mutable.Buffer[String], l2: mutable.Buffer[String]) => (l1 ++ l2).distinct)
.collectAsMap();
you can check other aggregation api's like reduceBy and groupBy over here.
please also check aggregateByKey vs groupByKey vs ReduceByKey differences.

This is efficient assuming your dataset is small enough to fit into the driver's memory. .collect will give you an array of rows on which you are iterating which is fine. If you want scalability then instead of Map[String, Set[String]] (this will reside in driver memory) you can use PairRDD[String, Set[String]] (this will be distributed).
//NOT TESTED
//Assuming df is dataframe with 2 columns, first is your basketId and second is itemId
df.rdd.map(row => (row.getAs[String](0), row.getAs[String](1)).groupByKey().mapValues(x => x.toSet)

Related

How do I write a standalone application in Spark to find 20 of most mentions in a text file filled with extracted tweets

I'm creating a standalone application in spark where I need to read in a text file that is filled with tweets. Every mention starts with the symbol, "#". The objective is to go through this file, and find the most 20 mentions. Punctuation should be stripped from all mentions and if the tweet has the same mention more than once, it should be counted only once. There can be multiple unique mentions in a single tweet. There are many tweets in the file.
I am new to scala and apache-spark. I was thinking of using the filter function and placing the results in a list. Then convert the list into a set where items are unique. But the syntax, regular expressions, and reading the file are a problem i face.
def main(args: Array[String]){
val locationTweetFile = args(0)
val spark = SparkSession.builder.appName("does this matter?").getOrCreate()
tweet file is huge, is this command below, safe?
val tweetsFile = spark.read.textFile(locationTweetFile).cache()
val mentionsExp = """([#])+""".r
}
If the tweet had said
"Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER."
Then the output should be something like, ((honda, 1),(customer,1))
Since there are multiple tweets, another tweet can say,
"#HoNdA I am the same #cuSTomER #STACKEXCHANGE."
Then the Final output will be something like
((honda,2),(customer,2),(stackexchange,1))
Let's go step-by step.
1) appName("does this matter?") in your case doesn't matter
2) spark.read.textFile(filename) is safe due to its laziness, file won't be loaded into your memory
Now, about implementation:
Spark is about transformation of data, so you need to think how to transform raw tweets to list of unique mentions in each tweet. Next you transform list of mentions to Map[Mention, Int], where Int is a total count of that mention in the RDD.
Tranformation is usually done via map(f: A => B) method where f is a function mapping A value to B.
def tweetToMentions(tweet: String): Seq[String] =
tweet.split(" ").collect {
case s if s.startsWith("#") => s.replaceAll("[,.;!?]", "").toLowerCase
}.distinct.Seq
val mentions = tweetToMentions("Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER.")
// mentions: Seq("#honda", "#customer")
The next step is to apply this function to each element in our RDD:
val mentions = tweetsFile.flatMap(tweetToMentions)
Note that we use flatMap instead of map because tweetToMentions returns Seq[String] and we want our RDD to contain only mentions, flatMap will flatten the result.
To count occurences of each mention in the RDD we need to apply some magic:
First, we map our mentions to pairs of (Mention, 1)
mentions.map(mention => (mention, 1))
Then we use reduceByKey which will count how many times each mention occurs in our RDD. Lastly, we order the mentions by their counts and retreive result.
val result = mentions
.map(mention => (mention, 1))
.reduceByKey((a, b) => a + b)
.takeOrdered(20)(Ordering[Int].reverse.on(_.2))

Pyspark applying foreach

I'm nooby in Pyspark and I pretend to play a bit with a couple of functions to understand better how could I use them in more realistic scenarios. for a while, I trying to apply a specific function to each number coming in a RDD. My problem is basically that, when I try to print what I grabbed from my RDD the result is None
My code:
from pyspark import SparkConf , SparkContext
conf = SparkConf().setAppName('test')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
changed = []
def div_two (n):
opera = n / 2
return opera
numbers = [8,40,20,30,60,90]
numbersRDD = sc.parallelize(numbers)
changed.append(numbersRDD.foreach(lambda x: div_two(x)))
#result = numbersRDD.map(lambda x: div_two(x))
for i in changed:
print(i)
I appreciate a clear explanation about why this is coming Null in the list and what should be the right approach to achieve that using foreach whether it's possible.
thanks
Your function definition of div_two seems fine which can yet be reduced to
def div_two (n):
return n/2
And you have converted the arrays of integers to rdd which is good too.
The main issue is that you are trying to add rdds to an array changed by using foreach function. But if you look at the definition of foreach
def foreach(self, f) Inferred type: (self: RDD, f: Any) -> None
which says that the return type is None. And thats what is getting printed.
You don't need an array variable for printing the changed elements of an RDD. You can simply write a function for printing and call that function in foreach function
def printing(x):
print x
numbersRDD.map(div_two).foreach(printing)
You should get the results printed.
You can still add the rdd to an array variable but rdds are distributed collection in itself and Array is a collection too. So if you add rdd to an array you will have collection of collection which means you should write two loops
changed.append(numbersRDD.map(div_two))
def printing(x):
print x
for i in changed:
i.foreach(printing)
The main difference between your code and mine is that I have used map (which is a transformation) instead of foreach ( which is an action) while adding rdd to changed variable. And I have use two loops for printing the elements of rdd

Spark - Performing union of Dataframes inside a for loop starting from empty DataFrame

I have a Dataframe with a column called "generationId" and other fields. Field "generationId" takes a range of integer values from 1 to N (upper bound to N is known and is small, between 10 and 15) and I want to process the DataFrame in the following way (pseudo code):
results = emptyDataFrame <=== how do I do this ?
for (i <- 0 until getN(df)) {
val input = df.filter($"generationId" === i)
results.union(getModel(i).transform(input))
}
Here getN(df) gives the N for that data frame based on some criteria. In the loop, input is filtered based on matching against "i" and then fed to some model (some internal library) which transforms the input by adding 3 more columns to it.
Ultimately I would like to get union of all those transformed data frames, so I have all columns of the original data frame plus the 3 additional columns added by the model for each row. I am not able to figure out how to initialize results and unionize the results in each iteration. I do know the exact schema of the result ahead of time. So I did
val newSchema = ...
but I am not sure how to pass that to emptyRDD function and build a empty Dataframe and use it inside the loop.
Also, if there is a much efficient way to do this inside map operation, please suggest.
you can do something like this:
(0 until getN(df))
.map(i => {
val input = df.filter($"generationId" === i)
getModel(i).transform(input)
})
.reduce(_ union _)
that way you don't need to worry about the empty df

Accessing a global lookup Apache Spark

I have a list of csv files each with a bunch of category names as header columns. Each row is a list of users with a boolean value (0, 1) whether they are part of that category or not. Each of the csv files does not have the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1,
21322,1,1,1,
21323,1,,,
File 2
user_id,cat4,cat5
21321,1,,,
21323,,1,,
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,1,1,,,
21322,1,1,1,,,
21323,1,1,,,,
Probably the title of the question is misleading in the sense that conveys a certain implementation choice as there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files could be divided in tuples of (user,category).
Any number of CSV files containing an arbitrary number of categories can be transformed to this simple format. The resulting CSV results of the union of the previous step, extraction of the total nr of categories present and some data transformation to get it in the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv:RDD[String]):RDD[(String, String)] = {
val headerLine = csv.first
val header = headerLine.split(",")
val data = csv.filter(_ != headerLine).map(line => line.split(","))
data.flatMap{elem =>
val merged = elem.zip(header)
val id = elem.head
merged.tail.collect{case (v,cat) if v == "1" => (id, cat)}
}
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
You dont need to do this as a 2 step process if all you need is the resulting values.
A possible design:
1/ Parse your csv. You dont mention whether your data is on a distributed FS, so i'll assume it is not.
2/ Enter your (K,V) pairs into a mutable parallelized (to take advantage of Spark) map.
pseudo-code:
val directory = ..
mutable.ParHashMap map = new mutable.ParHashMap()
while (files[i] != null)
{
val file = directory.spark.textFile("/myfile...")
val cols = file.map(_.split(","))
map.put(col[0], col[i++])
}
and then you can access your (K/V) tuples by way of an iterator on the map.

Spark - convert string IDs to unique integer IDs

I have a dataset which looks like this, where each user and product ID is a string:
userA, productX
userA, productX
userB, productY
with ~2.8 million products and 300 million users; about 2.1 billion user-product associations.
My end goal is to run Spark collaborative filtering (ALS) on this dataset. Since it takes int keys for users and products, my first step is to assign a unique int to each user and product, and transform the dataset above so that users and products are represented by ints.
Here's what I've tried so far:
val rawInputData = sc.textFile(params.inputPath)
.filter { line => !(line contains "\\N") }
.map { line =>
val parts = line.split("\t")
(parts(0), parts(1)) // user, product
}
// find all unique users and assign them IDs
val idx1map = rawInputData.map(_._1).distinct().zipWithUniqueId().cache()
// find all unique products and assign IDs
val idx2map = rawInputData.map(_._2).distinct().zipWithUniqueId().cache()
idx1map.map{ case (id, idx) => id + "\t" + idx.toString
}.saveAsTextFile(params.idx1Out)
idx2map.map{ case (id, idx) => id + "\t" + idx.toString
}.saveAsTextFile(params.idx2Out)
// join with user ID map:
// convert from (userStr, productStr) to (productStr, userIntId)
val rev = rawInputData.cogroup(idx1map).flatMap{
case (id1, (id2s, idx1s)) =>
val idx1 = idx1s.head
id2s.map { (_, idx1)
}
}
// join with product ID map:
// convert from (productStr, userIntId) to (userIntId, productIntId)
val converted = rev.cogroup(idx2map).flatMap{
case (id2, (idx1s, idx2s)) =>
val idx2 = idx2s.head
idx1s.map{ (_, idx2)
}
}
// save output
val convertedInts = converted.map{
case (a,b) => a.toInt.toString + "\t" + b.toInt.toString
}
convertedInts.saveAsTextFile(params.outputPath)
When I try to run this on my cluster (40 executors with 5 GB RAM each), it's able to produce the idx1map and idx2map files fine, but it fails with out of memory errors and fetch failures at the first flatMap after cogroup. I haven't done much with Spark before so I'm wondering if there is a better way to accomplish this; I don't have a good idea of what steps in this job would be expensive. Certainly cogroup would require shuffling the whole data set across the network; but what does something like this mean?
FetchFailed(BlockManagerId(25, ip-***.ec2.internal, 48690), shuffleId=2, mapId=87, reduceId=25)
The reason I'm not just using a hashing function is that I'd eventually like to run this on a much larger dataset (on the order of 1 billion products, 1 billion users, 35 billion associations), and number of Int key collisions would become quite large. Is running ALS on a dataset of that scale even close to feasible?
I looks like you are essentially collecting all lists of users, just to split them up again. Try just using join instead of cogroup, which seems to me to do more like what you want. For example:
import org.apache.spark.SparkContext._
// Create some fake data
val data = sc.parallelize(Seq(("userA", "productA"),("userA", "productB"),("userB", "productB")))
val userId = sc.parallelize(Seq(("userA",1),("userB",2)))
val productId = sc.parallelize(Seq(("productA",1),("productB",2)))
// Replace userName with ID's
val userReplaced = data.join(userId).map{case (_,(prod,user)) => (prod,user)}
// Replace product names with ID's
val bothReplaced = userReplaced.join(productId).map{case (_,(user,prod)) => (user,prod)}
// Check results:
bothReplaced.collect()) // Array((1,1), (1,2), (2,2))
Please drop a comments on how well it performs.
(I have no idea what FetchFailed(...) means)
My platform version : CDH :5.7, Spark :1.6.0/StandAlone;
My Test Data Size:31815167 all data; 31562704 distinct user strings, 4140276 distinct product strings .
First idea:
My first idea is to use collectAsMap action and then use the map idea to change the user/product string to int . With driver memory up to 12G , i got OOM or GC overhead exception (the exception is limited by driver memory).
But this idea can only use on a small data size, with bigger data size , you need a bigger driver memory .
Second idea :
Second idea is to use join method, as Tobber proposaled. Here is some test result:
Job setup:
driver: 2G , 2 cpu;
executor : (8G , 4 cpu) * 7;
I follow the steps:
1) find unique user strings and zipWithIndexes;
2) join the original data;
3) save the encoded data;
The job take about 10 minutes to finish.

Resources