How to return a single field from each line in a pyspark RDD? - apache-spark

I'm sure this is very simple, but despite my trying and research I can't find the solution. I'm working with flight info here.
I have an RDD with contents of:
[u'2007-09-22,9E,20363,TUL,OK,36.19,-95.88,MSP,MN,44.88,-93.22,1745,1737,-8,1953,1934,-19', u'2004-02-12,NW,19386,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1050,1050,0,1341,1342,1', u'2007-05-07,F9,20436,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1030,1040,10,1325,1347,22']
What transformation do I need in order to make a new RDD containing only the second field from each line?
[u'9E',u'NW',u'F9']
I've tried filtering but can't make it work. This just gives me the entire line, and I only want the second field from each line:
new_rdd = current_rdd.filter(lambda x: x.split(',')[1])

Here is the solution:
data = [u'2007-09-22,9E,20363,TUL,OK,36.19,-95.88,MSP,MN,44.88,-93.22,1745,1737,-8,1953,1934,-19', u'2004-02-12,NW,19386,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1050,1050,0,1341,1342,1', u'2007-05-07,F9,20436,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1030,1040,10,1325,1347,22']
current_rdd = sc.parallelize(data)
rdd = current_rdd.map(lambda x : x.split(',')[1])
rdd.take(10)
# [u'9E', u'NW', u'F9']
You are using filter for the wrong purpose. Let's recall the definition of the filter function:
filter(f) - Return a new RDD containing only the elements that satisfy a predicate.
whereas map returns a new RDD by applying a function to each element of this RDD, and that's what you need.
I advise reading the PySpark RDD API documentation to learn more about it.

Related

How do I write a standalone application in Spark to find the 20 most common mentions in a text file filled with extracted tweets

I'm creating a standalone application in Spark where I need to read in a text file that is filled with tweets. Every mention starts with the symbol "#". The objective is to go through this file and find the 20 most common mentions. Punctuation should be stripped from all mentions, and if a tweet has the same mention more than once, it should be counted only once. There can be multiple unique mentions in a single tweet. There are many tweets in the file.
I am new to Scala and Apache Spark. I was thinking of using the filter function and placing the results in a list, then converting the list into a set where items are unique. But the syntax, regular expressions, and reading the file are the problems I face.
def main(args: Array[String]): Unit = {
  val locationTweetFile = args(0)
  val spark = SparkSession.builder.appName("does this matter?").getOrCreate()
  // the tweet file is huge; is the command below safe?
  val tweetsFile = spark.read.textFile(locationTweetFile).cache()
  val mentionsExp = """([#])+""".r
}
If the tweet had said
"Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER."
Then the output should be something like ((honda,1),(customer,1)).
Since there are multiple tweets, another tweet can say,
"#HoNdA I am the same #cuSTomER #STACKEXCHANGE."
Then the final output will be something like
((honda,2),(customer,2),(stackexchange,1))
Let's go step by step.
1) appName("does this matter?") in your case doesn't matter
2) spark.read.textFile(filename) is safe due to its laziness; the file won't be loaded into memory until an action forces Spark to read it
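As a quick illustration of that laziness (a minimal sketch; spark and locationTweetFile are assumed to be defined as in your snippet):
// no data is read here; Spark only records the plan
val tweetsFile = spark.read.textFile(locationTweetFile)
// the file is actually scanned only when an action such as count() runs
tweetsFile.count()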
Now, about implementation:
Spark is about transforming data, so you need to think about how to transform raw tweets into a list of unique mentions for each tweet. Next you transform the list of mentions into a Map[Mention, Int], where the Int is the total count of that mention in the RDD.
Transformation is usually done via the map(f: A => B) method, where f is a function mapping an A value to a B.
def tweetToMentions(tweet: String): Seq[String] =
  tweet.split(" ").collect {
    case s if s.startsWith("#") => s.replaceAll("[,.;!?]", "").toLowerCase
  }.distinct.toSeq
val mentions = tweetToMentions("Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER.")
// mentions: Seq("#honda", "#customer")
The next step is to apply this function to each element in our RDD:
val mentions = tweetsFile.rdd.flatMap(tweetToMentions) // .rdd so that pair-RDD methods such as reduceByKey are available below
Note that we use flatMap instead of map because tweetToMentions returns a Seq[String] and we want our RDD to contain only mentions; flatMap will flatten the result.
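To see the difference on the example tweets (a minimal sketch; sc is assumed to be an available SparkContext and tweetToMentions is the function defined above):
val toy = sc.parallelize(Seq(
  "Hey #Honda, I am #customer I love #honda.",
  "#HoNdA I am the same #cuSTomER #STACKEXCHANGE."))
toy.map(tweetToMentions).collect()      // Array(Seq("#honda", "#customer"), Seq("#honda", "#customer", "#stackexchange"))
toy.flatMap(tweetToMentions).collect()  // Array("#honda", "#customer", "#honda", "#customer", "#stackexchange")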
To count occurrences of each mention in the RDD we need to apply some magic:
First, we map our mentions to pairs of (Mention, 1)
mentions.map(mention => (mention, 1))
Then we use reduceByKey, which will count how many times each mention occurs in our RDD. Lastly, we order the mentions by their counts and retrieve the result.
val result = mentions
  .map(mention => (mention, 1))
  .reduceByKey((a, b) => a + b)
  .takeOrdered(20)(Ordering[Int].reverse.on(_._2))

Spark save taking a lot of time

I have 2 DataFrames and I want to find the records with all columns equal except 2 (surrogate_key, current).
And then I want to save those records with a new surrogate_key value.
Following is my code:
val seq = csvDataFrame.columns.toSeq
var exceptDF = csvDataFrame.except(csvDataFrame.as('a).join(table.as('b),seq).drop("surrogate_key","current"))
exceptDF.show()
exceptDF = exceptDF.withColumn("surrogate_key", makeSurrogate(csvDataFrame("name"), lit("ecc")))
exceptDF = exceptDF.withColumn("current", lit("Y"))
exceptDF.show()
exceptDF.write.option("driver","org.postgresql.Driver").mode(SaveMode.Append).jdbc(postgreSQLProp.getProperty("url"), tableName, postgreSQLProp)
This code gives correct results, but gets stuck while writing those results to Postgres.
Not sure what the issue is. Also, is there any better approach for this?
Regards,
Sorabh
By default Spark SQL uses 200 shuffle partitions, which means that when you save the DataFrame it will be written by 200 parallel tasks (and, for a JDBC sink, roughly 200 concurrent connections). You can reduce the number of partitions for a DataFrame using the techniques below.
At the application level, set the parameter "spark.sql.shuffle.partitions" as follows:
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
Or reduce the number of partitions for a particular DataFrame as follows:
df.coalesce(10).write.save(...)
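Applied to your write, that would look roughly like this (a sketch based on your own snippet; exceptDF, tableName and postgreSQLProp are assumed to be defined as in your code):
// write with only 10 tasks, i.e. roughly 10 concurrent JDBC connections instead of 200
exceptDF.coalesce(10)
  .write
  .option("driver", "org.postgresql.Driver")
  .mode(SaveMode.Append)
  .jdbc(postgreSQLProp.getProperty("url"), tableName, postgreSQLProp)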
Using var for DataFrames is not recommended; you should always use val and create a new DataFrame after performing a transformation.
Please remove all the vars and replace them with vals.
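For example, the two withColumn reassignments could be written with vals instead (a sketch of the same logic; the name finalDF is just an illustration):
// each step produces a new immutable DataFrame instead of reassigning a var
val finalDF = exceptDF
  .withColumn("surrogate_key", makeSurrogate(csvDataFrame("name"), lit("ecc")))
  .withColumn("current", lit("Y"))
finalDF.show()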
Hope this helps!

PySpark isin function

I am converting my legacy Python code to Spark using PySpark.
I would like to get a PySpark equivalent of:
usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']
Both actdataall and orddata are Spark DataFrames.
I don't want to use the toPandas() function, given the drawbacks associated with it.
If both DataFrames are big, you should consider using an inner join, which will work as a filter:
First let's create a dataframe containing the order IDs we want to keep:
orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
Now let's join it with our actdataall dataframe:
usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
If your target list of order IDs is small, then you can use the pyspark.sql isin function as mentioned in furianpandit's post; don't forget to broadcast your variable before using it (Spark will copy the object to every node, making their tasks a lot faster):
orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect()
orderid_list_broadcast = sc.broadcast(orderid_list)  # keep the handle; executors read it via orderid_list_broadcast.value
The most direct translation of your code would be:
from pyspark.sql import functions as F
# collect all the unique ORDER_IDs to the driver
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
# filter ORDValue column by list of order_ids, then select only User ID column
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids)).select('User ID')
However, you should only filter like this if the number of 'ORDER_ID's is definitely small (perhaps <100,000 or so).
If the number of 'ORDER_ID's is large, you should use a broadcast variable, which sends the list of order_ids to each executor so it can compare against them locally for faster processing. Note that this will work even if the number of 'ORDER_ID's is small.
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
order_ids_broadcast = sc.broadcast(order_ids) # send to broadcast variable
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids_broadcast.value)).select('User ID')
For more information on broadcast variables, check out: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-broadcast.html
So, you have two Spark DataFrames, actdataall and orddata. Use the following command to get your desired result:
usersofinterest = actdataall.where(actdataall['ORDValue'].isin(orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect())).select('User ID')

Can we apply a Action on another Action in Spark?

I am trying to run some basic Spark applications.
Can we apply an action on the result of another action?
Or can an action be applied only to a transformed RDD?
val numbersRDD = sc.parallelize(Array(1,2,3,4,5));
val topnumbersRDD = numbersRDD.take(2)
scala> topnumbersRDD.count
<console>:17: error: missing arguments for method count in trait TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
topnumbersRDD.count
^
I would like to know why I am getting the above error.
Also, what can I do if I want to find the count of the first 2 numbers? I need the output to be 2.
Actions can be applied to an RDD or a DataFrame. The take method returns an Array, so you can use the array's length or size to count its elements.
If you want to select data with a condition, you can use filter, which returns a new RDD.
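For example (a small sketch following your snippet; sc is assumed to be an available SparkContext):
val numbersRDD = sc.parallelize(Array(1, 2, 3, 4, 5))
// take is an action: it returns a plain local Array, not an RDD
val topnumbers = numbersRDD.take(2)   // Array(1, 2)
topnumbers.length                     // 2 -- counts the elements of the local array
// filter is a transformation: it returns a new RDD, on which the count action works
numbersRDD.filter(_ <= 2).count()     // 2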

How to process tab-separated files in Spark?

I have a file which is tab-separated. The third column should be my key and the entire record should be my value (as per the MapReduce concept).
val cefFile = sc.textFile("C:\\text1.txt")
val cefDim1 = cefFile.filter { line => line.startsWith("1") }
val joinedRDD = cefFile.map(x => x.split("\\t"))
joinedRDD.first().foreach { println }
I am able to get the value of the first column but not the third. Can anyone suggest how I could accomplish this?
After you've done the split x.split("\\t"), your RDD (which in your example you called joinedRDD, but I'm going to call it parsedRDD since we haven't joined it with anything yet) is going to be an RDD of Arrays. We can turn this into an RDD of key/value tuples by doing parsedRDD.map(r => (r(2), r)). That being said, you aren't limited to just map and reduce operations in Spark, so it's possible that another data structure might be better suited. Also, for tab-separated files, you could use spark-csv along with Spark DataFrames if that is a good fit for the eventual problem you are looking to solve.
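A minimal sketch of that keying step, reusing your cefFile variable (it assumes every line has at least three tab-separated columns):
// split each tab-separated line into its columns
val parsedRDD = cefFile.map(_.split("\\t"))
// key the whole record by its third column (index 2)
val keyedRDD = parsedRDD.map(r => (r(2), r))   // RDD[(String, Array[String])]
keyedRDD.first()._1                            // the key (third column) of the first record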
