Unable to get results when doing a DataFrame lookup in Clojure Flambo API calls - apache-spark

I read a Parquet file and get the data as an RDD using the Flambo API. We apply zipmap with the column names to create a hash map / Clojure map for each record.
Let's say my maps have the following values:
[{:a 1 :b 2}
 {:a 2 :b 2}]
(:require [flambo.api :as f])
core.clj
I am using
(f/map rdd-records (f/fn [each-rdd]
                     (perform-calcs each-rdd)))
In the perform-calcs function we do additional calculations based on the input map, something like:
calcs.clj
(defn perform-calcs
  [r]
  (merge r {:c (+ (:a r) (:b r))}))
We have a new requirement to perform another calculation based on a DataFrame loaded from another file. We don't want to load that file for each record, so we kept the code that loads the DataFrame outside the calc and defined it in a commons file. This DataFrame gets loaded as part of the application and is accessible across the application.
commons.clj
(def another-csv-df
  (load-file->df "file-name"))
calcs.clj
(defn df-lookup
  [r df]
  ;; assumes (:import [org.apache.spark.sql Column]) in the ns declaration
  {:d (-> df
          (.filter (format "a = %d and b = %d" (:a r) (:b r)))
          (.select (into-array Column (map #(Column. %) ["d"])))
          (.first)
          (.getString 0))})
Including this, the perform-calcs function changes as follows:
(defn perform-calcs
  [r]
  (-> r
      (merge {:c (+ (:a r) (:b r))})
      (df-lookup commons/another-csv-df)))
In reality I can see the values in the DataFrame. The code works as expected without this external DataFrame call; with the DataFrame lookup code it runs for a long time and never completes the process.

Nested transformations like this one are not allowed in Spark at all. You will have to rethink your approach, likely by converting the RDD to a Dataset and performing a join between the two.
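For illustration, here is a minimal Scala sketch of that join-based approach (the Record case class, column names, and CSV path are assumptions that mirror the maps and files described above, not the asker's actual schema):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical record type mirroring the {:a .. :b ..} maps
case class Record(a: Int, b: Int)

object LookupViaJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lookup-via-join").getOrCreate()
    import spark.implicits._

    // The RDD of records, converted to a Dataset
    val records = spark.sparkContext
      .parallelize(Seq(Record(1, 2), Record(2, 2)))
      .toDS()

    // The "other" DataFrame that was previously filtered once per record
    // (hypothetical path; column types are assumed to match the join keys)
    val other = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file-name")

    // Derive :c once, then join on a and b instead of filtering per record
    val result = records
      .withColumn("c", col("a") + col("b"))
      .join(other, Seq("a", "b"), "left")

    result.show()
    spark.stop()
  }
}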

Related

Spark : put hashmap into Dataset column?

I have a Dataset<Row> which comes from reading a Parquet file. One column inside it, InfoMap, is of type Map.
Now I want to update this column, but when I use withColumn, it tells me that I cannot put a HashMap inside because it is not a literal.
I want to know the correct way to update a column of type Map for a Dataset.
Try using typedLit instead of lit
typedLit
"...The difference between this function and lit() is that this
function can handle parameterized scala types e.g.: List, Seq and Map"
data.withColumn("dictionary", typedLit(Map("foo" -> 1, "bar" -> 2)))
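A slightly fuller sketch with the required import (the data DataFrame here is a stand-in built in place of the Parquet read, and the Map contents are just illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

val spark = SparkSession.builder().appName("typedlit-example").getOrCreate()
import spark.implicits._

// Stand-in for the Dataset read from the Parquet file
val data = Seq("row1", "row2").toDF("id")

// typedLit handles the parameterized Map type, so the column gets a proper MapType schema
val updated = data.withColumn("InfoMap", typedLit(Map("foo" -> 1, "bar" -> 2)))

updated.printSchema()  // InfoMap: map<string,int>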

How do I write a standalone application in Spark to find the 20 most frequent mentions in a text file filled with extracted tweets

I'm creating a standalone application in Spark where I need to read in a text file that is filled with tweets. Every mention starts with the symbol "#". The objective is to go through this file and find the 20 most frequent mentions. Punctuation should be stripped from all mentions, and if a tweet has the same mention more than once, it should be counted only once. There can be multiple unique mentions in a single tweet. There are many tweets in the file.
I am new to Scala and Apache Spark. I was thinking of using the filter function and placing the results in a list, then converting the list into a set where items are unique. But the syntax, regular expressions, and reading the file are problems I face.
def main(args: Array[String]) {
  val locationTweetFile = args(0)
  val spark = SparkSession.builder.appName("does this matter?").getOrCreate()
  // tweet file is huge, is this command below safe?
  val tweetsFile = spark.read.textFile(locationTweetFile).cache()
  val mentionsExp = """([#])+""".r
}
If the tweet had said
"Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER."
Then the output should be something like, ((honda, 1),(customer,1))
Since there are multiple tweets, another tweet can say,
"#HoNdA I am the same #cuSTomER #STACKEXCHANGE."
Then the Final output will be something like
((honda,2),(customer,2),(stackexchange,1))
Let's go step by step.
1) appName("does this matter?") in your case doesn't matter.
2) spark.read.textFile(filename) is safe due to its laziness; the file won't be loaded into your memory.
Now, about implementation:
Spark is about transformation of data, so you need to think about how to transform raw tweets into a list of unique mentions in each tweet. Next, you transform the list of mentions into a Map[Mention, Int], where the Int is the total count of that mention in the RDD.
Transformation is usually done via the map(f: A => B) method, where f is a function mapping an A value to a B.
def tweetToMentions(tweet: String): Seq[String] =
  tweet.split(" ").collect {
    case s if s.startsWith("#") => s.replaceAll("[,.;!?]", "").toLowerCase
  }.distinct.toSeq

val mentions = tweetToMentions("Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER.")
// mentions: Seq("#honda", "#customer")
The next step is to apply this function to each element in our RDD:
val mentions = tweetsFile.flatMap(tweetToMentions)
Note that we use flatMap instead of map because tweetToMentions returns a Seq[String] and we want our RDD to contain only mentions; flatMap will flatten the result.
To count occurrences of each mention in the RDD we need to apply some magic:
First, we map our mentions to pairs of (Mention, 1)
mentions.map(mention => (mention, 1))
Then we use reduceByKey, which will count how many times each mention occurs in our RDD. Lastly, we order the mentions by their counts and retrieve the result.
val result = mentions
  .map(mention => (mention, 1))
  .reduceByKey((a, b) => a + b)
  .takeOrdered(20)(Ordering[Int].reverse.on(_._2))
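Putting the pieces together, here is an end-to-end sketch. Note that spark.read.textFile returns a Dataset[String], so this version drops to the RDD API via .rdd before using reduceByKey; the app name and the final println are placeholders for whatever output you actually need:

import org.apache.spark.sql.SparkSession

object TopMentions {
  def tweetToMentions(tweet: String): Seq[String] =
    tweet.split(" ").collect {
      case s if s.startsWith("#") => s.replaceAll("[,.;!?]", "").toLowerCase
    }.distinct.toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("top-mentions").getOrCreate()

    val result = spark.read.textFile(args(0))
      .rdd                              // reduceByKey is a pair-RDD operation
      .flatMap(tweetToMentions)
      .map(mention => (mention, 1))
      .reduceByKey(_ + _)
      .takeOrdered(20)(Ordering[Int].reverse.on(_._2))

    result.foreach(println)
    spark.stop()
  }
}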

Fastest way of inserting into Cassandra using the Python Cassandra driver

I am inserting and updating multiple entries in a table in Cassandra using the Python Cassandra driver. Currently my code looks like:
cluster = Cluster()
session = cluster.connect('db')
for a in list:
    if bool:
        # calculate b
        session.execute("UPDATE table SET col2 = %s WHERE col1 = %s", (b, a))
    else:
        # calculate b
        session.execute("INSERT INTO table(col1, col2) VALUES(%s, %s)", (a, b))
This method of inserting and updating is quite slow, as the number of entries in the list (all of which are unique) to be inserted is very large. Is there any faster way of doing this?
Generally for this scenario, you will see the best performance by increasing the number of concurrent writes to Cassandra.
You can do this with the Datastax Python Cassandra driver using execute_concurrent
From your description, it is worth noting that for your case there is no difference between an update and an insert in Cassandra (i.e. you can simply use the insert statement from your else clause for all values of (a, b)).
You will want to create a prepared statement.
Rather than doing the inserts one at a time in your for-loop, consider pre-computing groups of (a,b) pairs as input for execute_concurrent; you can also write a generator or generator expression as input for execute_concurrent.
Example:
from cassandra.concurrent import execute_concurrent_with_args
my_prepared_statement = my_session.prepare("INSERT INTO table(col1, col2) VALUES (?, ?)")
parameters = ((a, calculate_b(a)) for a in my_list)
execute_concurrent_with_args(my_session, my_prepared_statement, parameters)

How to map a single column value of a row in a for loop in PySpark

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
test = hive_context.table("dbname.tablename")
iterate = test.map(lambda p: (p.survey_date, p.pro_catg, p.metric_id))
for ite in iterate.collect():
    v = ite.map(lambda p: p.metric_id)
    print(v)
The above code gives an error in the for loop. How can I print a single column without changing the above mapping? Further on, I would like to write the code as:
for ite in iterate.collect():
    for ite11 in secondtable.collect():
        if ite.metric_id.find(ite11.column1):
            result.append((ite, ite11))
Can anyone kindly help with this?
Reason for error when running:
for ite in iterate.collect():
    v = ite.map(lambda p: p.metric_id)
The result of iterate.collect() is not an RDD; it is a plain Python list.
map can be executed on an RDD, but it cannot be executed on a Python list.
In general, using collect() is not recommended in Spark.
The following should perform a similar operation without error:
iterate = test.map(lambda p: (p.survey_date, p.pro_catg, p.metric_id))
v = iterate.map(lambda row: row[2])  # metric_id is the third element of each tuple
print(v.collect())
Finally, I found one more solution to map a single column value in a for loop:
for ite in iterate.collect():
    for itp in prod.collect():
        if itp[0] in ite[1]:
            result.append((ite, itp))
print(result)
It works fine. Instead of in, we can use find:
if ite[1].find(itp[0]) != -1: result.append((ite, itp))

How can I filter RDD rows based on an external Array() of ids

I am trying to execute the following lines of code, but on a much larger RDD. Apparently, I get a heap size error when a is very large. How can I make this work? p is usually small.
val p = Array("id1", "id3", "id2");
val a = sc.parallelize(Array(("id1", ("1", "1")), ("id4", ("4", "4")), ("id2", ("2", "2"))));
val f = a.filter(x=> p contains x._1);
println(f.collect().mkString(";"));
The problem here is not the filter or the small array, but the attempt to collect a large RDD which will effectively send all data to the driver, probably exhausting the driver's available memory.
What happens to the string afterwards? What's probably needed is another method to store the results of the filter computation.
Another note: if the main use case of the small dataset is contains, consider using a Set instead of an Array, as contains is amortized O(1) on Sets and O(n) on Arrays.
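A minimal sketch combining both suggestions (the output path is a placeholder, and saveAsTextFile stands in for whatever storage the filtered result actually needs):

// A Set gives amortized O(1) contains, versus O(n) for an Array
val p = Set("id1", "id3", "id2")

val a = sc.parallelize(Array(("id1", ("1", "1")), ("id4", ("4", "4")), ("id2", ("2", "2"))))

// The filter itself is unchanged; only the membership structure differs
val f = a.filter { case (id, _) => p.contains(id) }

// Store the filtered result instead of collecting it all to the driver
// (hypothetical output path)
f.map { case (id, (x, y)) => s"$id,$x,$y" }.saveAsTextFile("/tmp/filtered-output")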

Resources