I'm new to PySpark and I'm playing with a couple of functions to understand better how I could use them in more realistic scenarios. For a while I have been trying to apply a specific function to each number in an RDD. My problem is basically that when I try to print what I grabbed from my RDD, the result is None.
My code:
from pyspark import SparkConf , SparkContext
conf = SparkConf().setAppName('test')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
changed = []
def div_two(n):
    opera = n / 2
    return opera
numbers = [8,40,20,30,60,90]
numbersRDD = sc.parallelize(numbers)
changed.append(numbersRDD.foreach(lambda x: div_two(x)))
#result = numbersRDD.map(lambda x: div_two(x))
for i in changed:
    print(i)
I would appreciate a clear explanation of why this comes out as None in the list, and what the right approach would be to achieve this using foreach, if that's possible.
thanks
Your definition of div_two is fine, and it can be reduced to

def div_two(n):
    return n / 2
And you have converted the array of integers to an RDD, which is good too.
The main issue is that you are trying to add RDDs to the array changed by using the foreach function. But if you look at the definition of foreach:

def foreach(self, f) Inferred type: (self: RDD, f: Any) -> None

it says that the return type is None. And that's what is getting printed.
You don't need an array variable to print the changed elements of an RDD. You can simply write a printing function and pass it to foreach:

def printing(x):
    print(x)

numbersRDD.map(div_two).foreach(printing)
You should get the results printed.
You can still add the RDD to an array variable, but RDDs are distributed collections themselves and an array is a collection too. So if you add an RDD to an array you get a collection of collections, which means you need two loops:

changed.append(numbersRDD.map(div_two))

def printing(x):
    print(x)

for i in changed:
    i.foreach(printing)
The main difference between your code and mine is that I used map (which is a transformation) instead of foreach (which is an action) when adding the RDD to the changed variable, and I used two loops to print the elements of the RDD.
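The whole difference can be sketched without Spark at all. This is a plain-Python analogy (the helper names foreach_like and map_like are made up for illustration, not real PySpark APIs) of why appending the result of foreach yields None while a map-style transformation keeps the values:

```python
# Toy stand-ins for the semantics of RDD.foreach and RDD.map (illustrative only).
def foreach_like(collection, f):
    for x in collection:
        f(x)          # side effect only; the results of f are discarded
    return None       # this None is what ended up in `changed`

def map_like(collection, f):
    return [f(x) for x in collection]   # a new collection of results

numbers = [8, 40, 20, 30, 60, 90]
changed = []
changed.append(foreach_like(numbers, lambda x: x / 2))
print(changed)                              # [None]
print(map_like(numbers, lambda x: x / 2))   # [4.0, 20.0, 10.0, 15.0, 30.0, 45.0]
```

The same split applies in Spark: transformations (map) produce a new RDD of values, actions (foreach) run for their side effects and return nothing.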
Related
I'm creating a standalone application in Spark where I need to read in a text file that is filled with tweets. Every mention starts with the symbol "#". The objective is to go through this file and find the top 20 mentions. Punctuation should be stripped from all mentions, and if a tweet contains the same mention more than once it should be counted only once. There can be multiple unique mentions in a single tweet. There are many tweets in the file.
I am new to Scala and Apache Spark. I was thinking of using the filter function and placing the results in a list, then converting the list into a set where items are unique. But the syntax, regular expressions, and reading the file are problems I face.
def main(args: Array[String]) {
  val locationTweetFile = args(0)
  val spark = SparkSession.builder.appName("does this matter?").getOrCreate()
  // the tweet file is huge; is the command below safe?
  val tweetsFile = spark.read.textFile(locationTweetFile).cache()
  val mentionsExp = """([#])+""".r
}
If the tweet had said
"Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER."
Then the output should be something like, ((honda, 1),(customer,1))
Since there are multiple tweets, another tweet can say,
"#HoNdA I am the same #cuSTomER #STACKEXCHANGE."
Then the Final output will be something like
((honda,2),(customer,2),(stackexchange,1))
Let's go step by step.
1) appName("does this matter?") doesn't matter in your case.
2) spark.read.textFile(filename) is safe thanks to its laziness; the file won't be loaded into your memory.
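That laziness can be sketched with a plain-Python generator (an analogy, not Spark's actual mechanism; the names here are invented): defining the source reads nothing, and elements are only materialized when something downstream pulls them.

```python
calls = []

def lazy_source(items):
    # stands in for a lazy reader: items are produced only on demand
    for it in items:
        calls.append(it)   # record when each item is actually materialized
        yield it

src = lazy_source(["tweet1", "tweet2", "tweet3"])
print(calls)          # [] -- defining the source read nothing
first = next(src)
print(calls)          # ['tweet1'] -- only the requested element was produced
```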
Now, about implementation:
Spark is about transformations of data, so you need to think about how to transform raw tweets into a list of unique mentions per tweet. Next, you transform the list of mentions into a Map[Mention, Int], where Int is the total count of that mention in the RDD.
Transformation is usually done via the map(f: A => B) method, where f is a function mapping a value of type A to type B.
def tweetToMentions(tweet: String): Seq[String] =
  tweet.split(" ").collect {
    case s if s.startsWith("#") => s.replaceAll("[,.;!?]", "").toLowerCase
  }.distinct.toSeq
val mentions = tweetToMentions("Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER.")
// mentions: Seq("#honda", "#customer")
The next step is to apply this function to each element in our RDD:
val mentions = tweetsFile.flatMap(tweetToMentions)
Note that we use flatMap instead of map because tweetToMentions returns Seq[String] and we want our RDD to contain only mentions, flatMap will flatten the result.
To count the occurrences of each mention in the RDD we need to apply some magic:
First, we map our mentions to pairs of (Mention, 1)
mentions.map(mention => (mention, 1))
Then we use reduceByKey, which will count how many times each mention occurs in our RDD. Lastly, we order the mentions by their counts and retrieve the result.
val result = mentions
  .map(mention => (mention, 1))
  .reduceByKey((a, b) => a + b)
  .takeOrdered(20)(Ordering[Int].reverse.on(_._2))
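As a quick sanity check of the counting logic, here is the same pipeline sketched in plain Python (the function name tweet_to_mentions is my own, and Counter stands in for the map/reduceByKey steps):

```python
import re
from collections import Counter

def tweet_to_mentions(tweet):
    # unique mentions per tweet: strip punctuation, lowercase, deduplicate
    return {re.sub(r"[,.;!?]", "", w).lower()
            for w in tweet.split() if w.startswith("#")}

tweets = [
    "Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER.",
    "#HoNdA I am the same #cuSTomER #STACKEXCHANGE.",
]
# flatMap + (mention, 1) + reduceByKey, collapsed into a Counter
counts = Counter(m for t in tweets for m in tweet_to_mentions(t))
print(counts["#honda"], counts["#customer"], counts["#stackexchange"])  # 2 2 1
```

This matches the expected final output from the question: honda and customer counted twice, stackexchange once.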
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
test = hive_context.table("dbname.tablename")
iterate = test.map(lambda p:(p.survey_date,p.pro_catg,p.metric_id))
for ite in iterate.collect():
    v = ite.map(lambda p: p.metric_id)
    print(v)
The code above gives an error in the for loop. How can I print a single column without changing the mapping above? Further on, I would like to write code like:
for ite in iterate.collect():
    for ite11 in secondtable.collect():
        if ite.metric_id.find(ite11.column1):
            result.append((ite, ite11))
Could anyone kindly help with this?
Reason for error when running:
for ite in iterate.collect():
    v = ite.map(lambda p: p.metric_id)
The result of iterate.collect() is not an RDD; it is a Python list (or something like that).
map can be executed on an RDD, but not on a Python list.
In general, using collect() is not recommended in Spark.
The following should perform similar operation without error:
iterate = test.map(lambda p: (p.survey_date, p.pro_catg, p.metric_id))
v = iterate.map(lambda row: row[2])  # metric_id is the third field of the tuple
print(v.collect())
Finally, I found one more solution for mapping a single column value in a for loop:
for ite in iterate.collect():
    for itp in prod.collect():
        if itp[0] in ite[1]:
            result.append((ite, itp))
print(result)
It works fine. Note that replacing in with find, as in

if ite[1].find(itp[0]): result.append((ite, itp))

is not equivalent: find returns an index (0 for a match at the start, -1 for no match), so truth-testing its result gives wrong answers.
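The find pitfall is easy to demonstrate with toy data (the strings below are invented for illustration):

```python
metric = "cat_a_metric"
print("cat_a" in metric)         # True -- the membership test is correct
print(metric.find("cat_a"))      # 0  -> falsy, so `if metric.find(...)` skips a real match
print(metric.find("cat_z"))      # -1 -> truthy, so a non-match would be appended
# the robust spelling, if you do want find():
print(metric.find("cat_a") != -1)   # True
```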
I have to write a complex UDF, in which I have to do a join with a different table, and return the number of matches. The actual use case is much more complex, but I've simplified the case here to minimum reproducible code. Here is the UDF code.
def predict_id(date, zip):
    filtered_ids = contest_savm.where((F.col('postal_code') == zip) & (F.col('start_date') >= date))
    return filtered_ids.count()
When I define the UDF using the below code, I get a long list of console errors:
predict_id_udf = F.udf(predict_id,types.IntegerType())
The final line of the error is:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I want to know what is the best way to go about it. I also tried map like this:
result_rdd = df.select("party_id").rdd\
.map(lambda x: predict_id(x[0],x[1]))\
.distinct()
It also resulted in a similar final error. I want to know if there is any way I can do a join within a UDF or map function for each row of the original dataframe.
I have to write a complex UDF, in which I have to do a join with a different table, and return the number of matches.
It is not possible by design. If you want to achieve an effect like this, you have to use high-level DataFrame / RDD operators:
df.join(contest_savm,
    (F.col('postal_code') == df["zip"]) & (F.col('start_date') >= df["date"])
).groupBy(*df.columns).count()
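A pure-Python sketch of what that join-and-count computes per row (the rows below are invented for illustration; string dates compare correctly in ISO format):

```python
# df rows: (date, zip); contest_savm rows: (postal_code, start_date)
df_rows = [("2017-01-01", "560001"), ("2017-06-01", "560001")]
savm_rows = [("560001", "2017-03-01"), ("560001", "2017-09-01"), ("560002", "2017-03-01")]

# join on postal_code == zip and start_date >= date, then count matches per df row
counts = {
    row: sum(1 for pc, sd in savm_rows if pc == row[1] and sd >= row[0])
    for row in df_rows
}
print(counts)  # {('2017-01-01', '560001'): 2, ('2017-06-01', '560001'): 1}
```

The DataFrame version does the same thing, but as a distributed join rather than a per-row lookup inside a UDF.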
I'm sure this is very simple, but despite my trying and research I can't find the solution. I'm working with flight info here.
I have an rdd with contents of :
[u'2007-09-22,9E,20363,TUL,OK,36.19,-95.88,MSP,MN,44.88,-93.22,1745,1737,-8,1953,1934,-19', u'2004-02-12,NW,19386,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1050,1050,0,1341,1342,1', u'2007-05-07,F9,20436,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1030,1040,10,1325,1347,22']
What transform do I need in order to make a new RDD containing the 2nd field of each line?
[u'9E',u'NW',u'F9']
I've tried filtering but can't make it work. This just gives me the entire line, and I only want the 2nd field from each line.
new_rdd = current_rdd.filter(lambda x: x.split(',')[1])
Here is the solution :
data = [u'2007-09-22,9E,20363,TUL,OK,36.19,-95.88,MSP,MN,44.88,-93.22,1745,1737,-8,1953,1934,-19', u'2004-02-12,NW,19386,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1050,1050,0,1341,1342,1', u'2007-05-07,F9,20436,DEN,CO,39.86,-104.67,MSP,MN,44.88,-93.22,1030,1040,10,1325,1347,22']
current_rdd = sc.parallelize(data)
rdd = current_rdd.map(lambda x : x.split(',')[1])
rdd.take(10)
# [u'9E', u'NW', u'F9']
You are using filter for the wrong purpose. Let's recall the definition of the filter function:
filter(f) - Return a new RDD containing only the elements that satisfy a predicate.
whereas map returns a new RDD by applying a function to each element of this RDD, and that's what you need.
I advise reading the PySpark RDD API documentation to learn more about it.
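The difference is easy to see on the data itself with plain-Python equivalents (list comprehensions standing in for map and filter; the lines are shortened versions of the question's data):

```python
lines = [
    "2007-09-22,9E,20363,TUL",
    "2004-02-12,NW,19386,DEN",
    "2007-05-07,F9,20436,DEN",
]
# map: transform each element into its 2nd field
carriers = [x.split(",")[1] for x in lines]
print(carriers)        # ['9E', 'NW', 'F9']
# filter: keep whole elements whose predicate is truthy -- a non-empty
# string is always truthy, so every full line survives unchanged
kept = [x for x in lines if x.split(",")[1]]
print(kept == lines)   # True
```

This is exactly why the filter attempt returned entire lines: the predicate decided which lines to keep, it never transformed them.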
I have an RDD where each element is a tuple of the form
[ (index1,SparseVector({idx1:1,idx2:1,idx3:1,...})) , (index2,SparseVector() ),... ]
I would like to take the dot product of each pair of values in this RDD using the SparseVector1.dot(SparseVector2) method provided by the mllib.linalg.SparseVector class. I am aware that Python's itertools.combinations module can be used to generate the combinations of dot products to be calculated. Could someone provide a code snippet to achieve this? I can only think of doing an RDD.collect() to receive a list of all elements in the RDD and then running itertools.combinations on this list, but as I understand it this would perform all the calculations on the driver and wouldn't be distributed per se. Could someone please suggest a more distributed way of achieving this?
def computeDot(sparseVectorA, sparseVectorB):
    """
    Compute the dot product of two SparseVectors.
    """
    return sparseVectorA.dot(sparseVectorB)
# Use Cartesian function on the RDD to create tuples containing
# 2-combinations of all the rows in the original RDD
combinationRDD = (originalRDD.cartesian(originalRDD))
# The records in combinationRDD will be of the form
# [(Index, SV1), (Index, SV1)], therefore, you need to
# filter all the records where the index is not equal giving
# RDD of the form [(Index1, SV1), (Index2, SV2)] and so on,
# then use the map function to use the SparseVector's dot function
dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1]))
             .cache())
The solution to this question should be along this line.
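One detail worth noting, sketched here with toy dict "vectors" (a stand-in for SparseVector.dot; the data is invented): cartesian yields both (A, B) and (B, A), so the != filter still computes every dot product twice. An ordered comparison halves the work when the dot product is symmetric:

```python
def dot(a, b):
    # toy dot product over {index: value} dicts
    return sum(v * b.get(k, 0.0) for k, v in a.items())

rows = [(0, {1: 1.0, 2: 1.0}), (1, {2: 1.0, 3: 1.0}), (2, {3: 2.0})]
cart = [(x, y) for x in rows for y in rows]            # cartesian: 9 pairs
both_ways = [dot(x[1], y[1]) for x, y in cart if x[0] != y[0]]  # 6 dots, each pair twice
one_way = [dot(x[1], y[1]) for x, y in cart if x[0] < y[0]]     # 3 dots, each pair once
print(len(both_ways), len(one_way))   # 6 3
```

In the RDD version the same change is just `filter(lambda x: x[0][0] < x[1][0])`.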