Spark: TreeAggregate at IDF is taking ages - apache-spark

I am using Spark 1.6.1, and I have a DataFrame as follows:
+-----------+---------------------+-------------------------------+
|ID         |dateTime             |title                          |
+-----------+---------------------+-------------------------------+
|809907895  |2017-01-21 23:00:01.0|                               |
|1889481973 |2017-01-21 23:00:06.0|Man charged with murder of ... |
|979847722  |2017-01-21 23:00:09.0|Munster cruise to home Cham... |
|18894819734|2017-01-21 23:00:11.0|Man charged with murder of ... |
|17508023137|2017-01-21 23:00:15.0|Emily Ratajkowski hits the ... |
|10321187627|2017-01-21 23:00:17.0|Gardai urge public to remai... |
|979847722  |2017-01-21 23:00:19.0|Sport                          |
|19338946129|2017-01-21 23:00:33.0|                               |
|979847722  |2017-01-21 23:00:35.0|Rassie Erasmus reveals the ... |
|1836742863 |2017-01-21 23:00:49.0|NAMA sold flats which could... |
+-----------+---------------------+-------------------------------+
I am doing the following operation:
val aggDF = df.groupBy($"ID")
  .agg(concat_ws(" ", collect_list($"title")) as "titlesText")
Then on the aggDF DataFrame I am fitting a pipeline that extracts TF-IDF features from the titlesText column (by applying a Tokenizer, StopWordsRemover, HashingTF, then IDF).
When I call pipeline.fit(aggDF), the job reaches the stage treeAggregate at IDF.scala:54 (I can see that in the UI) and then gets stuck there with no progress and no error. I have waited a very long time with no progress and no helpful information in the UI.
Here is an example of what I see in the UI (nothing changes for a very long time):
What are the possible reasons for this?
How to track and debug such problems?
Is there any other way to extract the same feature?

Did you specify a maximum number of features in your HashingTF?
The amount of data the IDF has to deal with is proportional to the number of features produced by HashingTF, and for a very large feature space it will most likely have to spill to disk, which wastes time.
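For reference, a minimal sketch of the kind of pipeline the question describes, with an explicit cap on the HashingTF feature dimension; the column names and the 1 << 16 value are illustrative assumptions, not taken from the original post:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}
// Tokenize the aggregated titles, drop stop words, hash into a bounded
// feature space, then compute IDF weights over that space.
val tokenizer = new Tokenizer().setInputCol("titlesText").setOutputCol("words")
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val hashingTF = new HashingTF()
  .setInputCol("filtered")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 16) // illustrative cap on the feature dimension; tune for your data
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(tokenizer, remover, hashingTF, idf))
val model = pipeline.fit(aggDF) // aggDF from the groupBy/concat_ws step in the question
With the feature dimension bounded, the vectors the IDF step aggregates stay at a predictable size.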

Related

KQL Load Balancer Bytes

Hi, I've been trying to convert a query from bytes to GB, but I'm getting some strange results. I also see the same component showing up with incrementally increasing size, which makes sense, since we are reading the bytes used and would expect that to increase, but I would only like the latest set of data (from when the query was run). Can anyone see where I'm going wrong?
AzureMetrics
| where TimeGenerated >=ago(1d)
| where Resource contains "LB"
| where MetricName contains "ByteCount"
| extend TotalGB = Total/1024
| summarize by Resource, TimeGenerated, TotalGB, MetricName, UnitName
| sort by TotalGB desc
| render piechart
In the table below, loadbal-1 reports several times in a short window, and the same goes for loadbal2 and 13. I'd like to capture these all in a single line. I also think I might have messed up the query for "TotalGB" (converting bytes to GB).
I would only like the latest set of data (when the query was run)
you can use the arg_max() aggregation function.
for example:
AzureMetrics
| where TimeGenerated >= ago(1d)
| where Resource has "LoadBal"
| where MetricName == "ByteCount"
| summarize arg_max(TimeGenerated, *) by Resource
I've been trying to convert a query from bytes to GB, however I'm getting some strange results... I think I might have messed up the query for "TotalGB" (converting bytes to GB)
If the raw data is in bytes, then you need to divide it by exp2(30) (or 1024*1024*1024) to get the value in GB.
Or, you can use the format_bytes() function instead.
for example:
print bytes = 18027051483.0
| extend gb_1 = format_bytes(bytes, 2),
         gb_2 = bytes/exp2(30),
         gb_3 = bytes/1024/1024/1024
bytes        gb_1      gb_2              gb_3
18027051483  16.79 GB  16.7889999998733  16.7889999998733

How to summarize time window based on a status in Kusto

I have recently started working with Kusto. I am stuck with a use case where I need to confirm that the approach I am taking is right.
I have data in the following format
In the above example, if the status is 1 and the time frame is equal to 15 seconds, then I need to count it as 1 occurrence.
So in this case there are 2 occurrences of the status.
My approach was:
If the current and the next row's status is equal to 1, take the time difference, do a row_cumsum, and break it if next(STATUS) != 0.
Even though the approach gives me the correct output, I assume the performance can degrade once the data size increases.
I am looking for an alternative approach, if any. I am also adding the complete scenario to reproduce this with sample data.
.create-or-alter function with (folder = "Tests", skipvalidation = "true") InsertFakeTrue() {
range LoopTime from ago(365d) to now() step 6s
| project TIME=LoopTime,STATUS=toint(1)
}
.create-or-alter function with (folder = "Tests", skipvalidation = "true") InsertFakeFalse() {
range LoopTime from ago(365d) to now() step 29s
| project TIME=LoopTime,STATUS=toint(0)
}
.set-or-append FAKEDATA <| InsertFakeTrue();
.set-or-append FAKEDATA <| InsertFakeFalse();
FAKEDATA
| order by TIME asc
| serialize
| extend cstatus=STATUS
| extend nstatus=next(STATUS)
| extend WindowRowSum=row_cumsum(iff(nstatus ==1 and cstatus ==1, datetime_diff('second',next(TIME),TIME),0),cstatus !=1)
| extend windowCount=iff(nstatus !=1 or isnull(next(TIME)), iff(WindowRowSum ==15, 1,iff(WindowRowSum >15,(WindowRowSum/15)+((WindowRowSum%15)/15),0)),0 )
| summarize IDLE_COUNT=sum(windowCount)
The approach in the question is the way to achieve such calculations in Kusto and, given that the logic requires sorting, it is also efficient (as long as the sorted data can reside on a single machine).
Regarding the union operator: it runs in parallel by default; you can control the concurrency and spread using hints, see: union operator.

Apache Spark (Scala) Aggregation across time with various groups

What I'm trying to accomplish is to calculate the total time a ship spends at anchor. The data I'm dealing with is time-series in nature. Throughout a ship's journey from Point A -> Point B it can stop and start multiple times.
Basically, for each id (ship unique id) I want to calculate the total time spent at anchor (status === "ANCHORED"). For each "anchor" time period, take the last timestamp and subtract the first timestamp (or vice versa; I'll just take the absolute value). I can do this easily if a ship only stops once in its journey (window function), but I'm having trouble when it stops and starts multiple times throughout a journey. Can a window function handle this?
Here is an example of the data I'm dealing with and expected output:
val df = Seq(
  (123, "UNDERWAY", 0),
  (123, "ANCHORED", 12), // first anchored (first time around)
  (123, "ANCHORED", 20), // take this timestamp and sub from previous
  (123, "UNDERWAY", 32),
  (123, "UNDERWAY", 44),
  (123, "ANCHORED", 50), // first anchored (second time around)
  (123, "ANCHORED", 65),
  (123, "ANCHORED", 70), // take this timestamp and sub from previous
  (123, "ARRIVED", 79)
).toDF("id", "status", "time")
+---+--------+----+
|id |status  |time|
+---+--------+----+
|123|UNDERWAY|0   |
|123|ANCHORED|12  |
|123|ANCHORED|20  |
|123|UNDERWAY|32  |
|123|UNDERWAY|44  |
|123|ANCHORED|50  |
|123|ANCHORED|65  |
|123|ANCHORED|70  |
|123|ARRIVED |79  |
+---+--------+----+
// the resulting output I need is as follows (aggregation of total time spent at anchor)
// the ship spent 8 hours at anchor the first time, and then spent
// 20 hours at anchor the second time. So total time is 28 hours
+---+-----------------+
|id |timeSpentAtAnchor|
+---+-----------------+
|123|28               |
+---+-----------------+
Each "segment" the ship is at anchor I want to calculate the time spent at anchor and then add all those segments up to get the total time spent at anchor.
I'm new to Window functions, so it could possibly be done better, but here is what I came up with:
This solution looks at "this - previous" rather than the "last - first" within each "group" of statuses. The net effect should be the same, though, since it sums them all together anyway.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy($"time")
df.withColumn("tdiff", when($"status" === lag($"status", 1).over(w), $"time" - lag($"time", 1).over(w)))
  .where($"status" === lit("ANCHORED"))
  .groupBy("id", "status")
  .agg(sum("tdiff").as("timeSpentAtAnchor"))
  .select("id", "timeSpentAtAnchor")
  .show(false)
Which gives:
+---+-----------------+
|id |timeSpentAtAnchor|
+---+-----------------+
|123|28               |
+---+-----------------+
The answer was formed with information from this answer. And, as stated there:
Note: since this example doesn't use any partition, it could have performance problem, in your real data, it would be helpful if your problem can be partitioned by some variables.
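Following that note, a hedged variant of the window above that partitions by id (assuming each ship's journey can be processed independently), so Spark does not have to sort the entire dataset inside a single partition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Partition the window by ship id so each ship is handled independently
// and the work is not funnelled through a single partition.
val wById = Window.partitionBy($"id").orderBy($"time")
df.withColumn("tdiff", when($"status" === lag($"status", 1).over(wById), $"time" - lag($"time", 1).over(wById)))
  .where($"status" === lit("ANCHORED"))
  .groupBy("id")
  .agg(sum("tdiff").as("timeSpentAtAnchor"))
  .show(false)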

Spark ETL Unique Identifier for Entities Generated

We have a requirement in Spark where every record coming from the feed is broken into a set of entities.
Example: {col1,col2,col3} => Resource, {col4,col5,col6} => Account, {col7,col8} => EntityX, etc.
Now I need a unique identifier generated in the ETL layer which can be persisted to the corresponding database table for each of the above-mentioned tables/entities.
This unique identifier acts as a lookup value to identify each table's records and to generate a sequence in the DB.
The first approach was using Redis keys to generate the keys for every entity, identified using the natural unique columns in the feed.
But this approach was not stable, as Redis used to crash in peak hours, and Redis operates in single-threaded mode, so it would be slow when I'm running too many ETL jobs in parallel.
My thought is to use a cryptographic hash algorithm like SHA-256 rather than a 32-bit hash: with only 32 bits there is a real possibility of hash collisions for different values, whereas SHA-256 produces a 256-bit digest, so the possibility of a hash collision is very low.
But the second option is not well accepted by many people.
What are the other options/solutions to create unique keys in the ETL layer which can be looked up in the DB for comparison?
Thanks in Advance,
Rajesh Giriayppa
With dataframes, you can use the monotonicallyIncreasingId function that "generates monotonically increasing 64-bit integers" (https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.functions$). It can be used this way:
dataframe.withColumn("INDEX", functions.monotonicallyIncreasingId())
With RDDs, you can use zipWithIndex or zipWithUniqueId. The former generates a real index (ordered between 0 and N-1, N being the size of the RDD), while the latter generates unique long IDs without further guarantees, which seems to be what you need (https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.rdd.RDD). Note that zipWithUniqueId does not even trigger a Spark job and is therefore almost free.
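A minimal sketch of the zipWithUniqueId option described above, attaching the generated id back onto the DataFrame; the spark session variable and the "UID" column name are assumptions for illustration:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
// zipWithUniqueId pairs every row with a unique Long without launching a Spark job;
// the ids are unique but not consecutive.
val withUidRdd = dataframe.rdd.zipWithUniqueId().map { case (row, uid) =>
  Row.fromSeq(row.toSeq :+ uid)
}
val withUidDf = spark.createDataFrame(
  withUidRdd,
  StructType(dataframe.schema.fields :+ StructField("UID", LongType, nullable = false)))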
Thanks for the reply. I have tried this method, but it doesn't give me the correlation or surrogate primary key to search the database: every time I run the ETL job the indexes or numbers are different for each record if my dataset count changes.
I need a unique id to correlate with the DB record, one that matches only one record and refers to the same record every time in the DB.
Are there any good design patterns or practices to compare an ETL dataset row to a DB record with a unique id?
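For a key that stays the same for the same record across runs, which is what the comment above asks for, one option in line with the SHA-256 idea from the question is to hash the natural-key columns with Spark's built-in sha2 function. A hedged sketch, assuming col1..col3 are the natural key of the Resource entity and that "resource_key" is just an illustrative column name:
import org.apache.spark.sql.functions.{concat_ws, sha2}
// Deterministic surrogate key: the same natural-key values always hash to the
// same digest, so the ETL output can be matched back to the DB row on every run.
// Note: concat_ws skips null columns, so consider a placeholder for nulls if that matters.
val withKey = dataframe.withColumn(
  "resource_key",
  sha2(concat_ws("||", $"col1", $"col2", $"col3"), 256))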
This is a little late, but in case someone else is looking...
I ran into a similar requirement. As Oli mentioned previously, zipWithIndex will give sequential, zero-indexed ids, which you can then map onto an offset. Note that there is a critical section, so a locking mechanism could be required, depending on the use case.
case class Resource(_1: String, _2: String, _3: String, id: Option[Long])
case class Account(_4: String, _5: String, _6: String, id: Option[Long])

val inDS = Seq(
  ("a1", "b1", "c1", "x1", "y1", "z1"),
  ("a2", "b2", "c2", "x2", "y2", "z2"),
  ("a3", "b3", "c3", "x3", "y3", "z3")).toDS()

val offset = 1001 // load actual offset from db
val withSeqIdsDS = inDS.map(x => (Resource(x._1, x._2, x._3, None), Account(x._4, x._5, x._6, None)))
  .rdd.zipWithIndex // map index from 0 to n-1
  .map(x => (
    x._1._1.copy(id = Option(offset + x._2 * 2)),
    x._1._2.copy(id = Option(offset + x._2 * 2 + 1))
  )).toDS()
// save new offset to db
withSeqIdsDS.show()
+---------------+---------------+
|             _1|             _2|
+---------------+---------------+
|[a1,b1,c1,1001]|[x1,y1,z1,1002]|
|[a2,b2,c2,1003]|[x2,y2,z2,1004]|
|[a3,b3,c3,1005]|[x3,y3,z3,1006]|
+---------------+---------------+
withSeqIdsDS.select("_1.*", "_2.*").show
+---+---+---+----+---+---+---+----+
| _1| _2| _3|  id| _4| _5| _6|  id|
+---+---+---+----+---+---+---+----+
| a1| b1| c1|1001| x1| y1| z1|1002|
| a2| b2| c2|1003| x2| y2| z2|1004|
| a3| b3| c3|1005| x3| y3| z3|1006|
+---+---+---+----+---+---+---+----+

How to convert rows into a list of dictionaries in pyspark?

I have a DataFrame (df) in PySpark, read from a Hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is 'pyspark.sql.column.Column'.
How do I create a dictionary like the following, which can be iterated over later on?
{'person1':'google','msn','yahoo'}
{'person2':'fb.com','airbnb','wired.com'}
{'person3':'fb.com','google.com'}
Appreciate your thoughts and help.
I think you can try row.asDict(); this code runs directly on the executors, and you don't have to collect the data to the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
How about using the PySpark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and would not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
If you wanted your results in a python dictionary, you could use collect() [1] to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
[1] Be advised that for large data sets, this operation can be slow and potentially fail with an out-of-memory error. You should first consider whether this is what you really want to do, as you will lose the parallelization benefits of Spark by bringing the data into local memory.
Given:
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
    .rdd \
    .map(lambda row: {row[0]: row[1]}) \
    .collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you just collect after processing.
Please, let me know if that works for you :)
