Does ng-include have a memory leak? - memory-leaks

I use ng-include to switch between different data pages that do a lot of data rendering.
I found that the browser's memory usage keeps growing and never falls back.
Code
The code is quite simple.
HTML code:
<body ng-controller="MainCtrl">
  <div>
    <button ng-click="url='nodata.html'">No data</button>
    <button ng-repeat="i in getNumArray(10)" ng-click="loadData(i)">Load data {{i}}</button>
  </div>
  <hr/>
  [{{url}}]
  <div ng-include="url"></div>
</body>
It shows a "No data" button and 10 data buttons that load different pages.
The Angular code:
app.controller('MainCtrl', function($scope) {
  $scope.url = "nodata.html";
  $scope.loadData = function(n) {
    $scope.url = "data" + n + ".html";
  };
  $scope.getNumArray = function(n) {
    var arr = [];
    for (var i = 0; i < n; i++) {
      arr.push(i);
    }
    return arr;
  };
});

app.controller('DataCtrl', function($scope, $http) {
  $http.get('data.json').success(function(data) {
    $scope.data = data;
  });
});
And the "dataN.html" pages:
<div ng-controller="DataCtrl">
  <table ng-repeat="x in getNumArray(500)">
    <tbody>
      <tr>
        <td>{{data["key0"]}}</td>
        <td>{{data["key1"]}}</td>
        <td>{{data["key2"]}}</td>
        <td>{{data["key3"]}}</td>
        <td>{{data["key4"]}}</td>
        <td>{{data["key5"]}}</td>
        <td>{{data["key6"]}}</td>
        <td>{{data["key7"]}}</td>
        <td>{{data["key8"]}}</td>
        <td>{{data["key9"]}}</td>
      </tr>
    </tbody>
  </table>
</div>
The "nodata.html" page:
<div>No data yet.</div>
And the "data.json":
{
"key0": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key1": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key2": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key3": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key4": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key5": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key6": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key7": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key8": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f",
"key9": "sdf sdf sdf sdf sdf sdf sdf sdf sdf sd fds fds fsd fds fds fds fds fsd fds fds f"
}
Here is a live demo:
http://plnkr.co/edit/KGZVXIBws1kthgN2bxEJ?p=preview
When I open the live demo in Chrome, the initial memory usage is less than 100 MB. Then I click the "Load data" buttons, and it soon grows to 300 MB and never falls back, even if I click the "No data" button to load "nodata.html".
Is this normal? Does ng-include have a memory leak, or am I missing something? Or is this memory usage just fine and nothing to worry about?
Screencast
I created a screencast to show it:

Try upgrading to version 1.0.5. It doesn't appear to have this problem.
I believe this is because there was a memory leak in 1.0.3/1.0.4 when templates contained top-level whitespace nodes.

Stack Overflow is not the place to file bugs. Please file the issue at https://github.com/angular/angular.js/issues and continue the discussion there.
I have simplified the use case into a single file: http://plnkr.co/edit/Wm4YRsLGJUDqUcww2DQZ?p=preview
Here is what I found out.
It only leaks on Windows; it does not leak on Mac OS X.
It only leaks in Plunker. When I run it outside Plunker, it works fine.
Can you reproduce the issue outside Plunker?

Related

Will Spark store the excess data to disk by default if the size of the input RDD is more than the memory capacity?

I have an input file of size 260 GB and my Spark cluster's memory capacity is 140 GB. When I run my Spark job, will the excess 120 GB of data be stored to disk by default, or should I use some storage level to specify it?
I have not tried any solutions to solve this issue.
def main(args: Array[String]) {
  val conf: SparkConf = new SparkConf().setAppName("optimize_1").setMaster("local")
  val sc: SparkContext = new SparkContext(conf)

  val myRDD = sc.parallelize(List(("1", "abc", "Request"), ("1", "cba", "Response"), ("2", "def", "Request"), ("2", "fed", "Response"), ("3", "ghi", "Request"), ("3", "ihg", "Response")))
  val myRDD_1 = sc.parallelize(List(("1", "abc"), ("1", "cba"), ("2", "def"), ("2", "fed"), ("3", "ghi"), ("3", "ihg")))

  myRDD_1.map(x => x).groupBy(_._1).take(10).foreach(println)
  myRDD_1.groupByKey().foreach(println)
}
Below is the expected and working output for small data:
(2,CompactBuffer(def, fed))
(3,CompactBuffer(ghi, ihg))
(1,CompactBuffer(abc, cba))
But when I apply it at a large scale, I receive the following error:
"Dspark.ui.port=0'
-Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009
-XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://CoarseGrainedScheduler#DOSSPOCVM1:33303 --executor-id 8
--hostname DOSSPOCVM1 --cores 1 --app-id application_1555417914353_0069 --user-class-path file:$PWD/app.jar
1>/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009/stdout
2>/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009/stderr""
ERROR YarnClusterScheduler: Lost executor 17 on DOSSPOCVM2: Container
marked as failed: container_e05_1555417914353_0069_02_000019 on host:
DOSSPOCVM2. Exit status: -100. Diagnostics: Container released on a
lost node
Please suggest a way to resolve this issue.
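For reference, explicitly requesting a disk-spilling storage level looks roughly like the sketch below (using the myRDD_1 from the snippet above; this only illustrates storage levels and is not a guaranteed fix for the lost-executor error):
import org.apache.spark.storage.StorageLevel

// A cached RDD uses MEMORY_ONLY by default: partitions that do not fit in memory
// are recomputed when needed, not written to disk. MEMORY_AND_DISK asks Spark to
// spill the partitions that do not fit to local disk instead.
myRDD_1.persist(StorageLevel.MEMORY_AND_DISK)
myRDD_1.groupByKey().foreach(println)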

What is the use of the _spark_metadata directory?

I am trying to get my head around how streaming works in Spark.
I have a file in a /data/flight-data/csv/ directory. It has the following data:
DEST_COUNTRY_NAME   ORIGIN_COUNTRY_NAME   count
United States       Romania               15
United States       Croatia               1
United States       Ireland               344
Egypt               United States         15
I wanted to test what would happen if I read the file as a stream instead of as a batch. I first created a DataFrame using read:
scala> val dataDF = spark.read.option("inferSchema","true").option("header","true").csv("data/flight-data/csv/2015-summary.csv");
dataDF: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
Then I took the schema from it and created a new DataFrame:
scala> val staticSchema = dataDF.schema;
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))
scala> val dataStream = spark.readStream.schema(staticSchema).option("header","true").csv("data/flight-data/csv");
dataStream: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
Then I started the stream. The path for the checkpoint and output (I suppose) is the /home/manu/test directory, which is initially empty.
scala> dataStream.writeStream.option("checkpointLocation","home/manu/test").start("/home/manu/test");
res5: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5c7df5f1
The return value of start is a StreamingQuery, which I read is "A handle to a query that is executing continuously in the background as new data arrives. All these methods are thread-safe."
I notice that the directory now contains a _spark_metadata directory, but there is nothing else.
Question 1 - What is the _spark_metadata directory? I notice it is empty. What is it used for?
Question 2 - I don't see anything else happening. Is it because I am not running any query on the DataFrame dataStream (or should I say that the query isn't doing anything useful)?
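For comparison, a streaming query normally applies a transformation and writes to an output sink before start; the following is only a minimal sketch (the console sink and the checkpoint path are assumptions, not taken from the post above):
// Count flights per destination and print each micro-batch to the console.
val countsByDest = dataStream.groupBy("DEST_COUNTRY_NAME").count()

val query = countsByDest.writeStream
  .outputMode("complete")                                // required for streaming aggregations
  .format("console")                                     // write each micro-batch to stdout
  .option("checkpointLocation", "/home/manu/checkpoint") // hypothetical checkpoint path
  .start()

query.awaitTermination()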

How to map one RDD to another with PySpark?

I have rdd1, which has labels (0, 1, 4), and rdd2, which has text. I want to map rdd1 to rdd2 such that row 1 of rdd1 is mapped to row 1 of rdd2, and so on.
I have tried:
rdd2.join(rdd1.map(lambda x: (x[0], x[0:])))
It gives me the error:
RDD is empty.
Can someone please guide me here?
Sample data: rdd1 - labels, rdd2 - text
rdd1   rdd2
0      i hate painting i have white paint all over my hands.
0      Bawww I need a haircut No1 could fit me in before work tonight. Sigh.
4      I had a great day
1      what is life.
4      He sings so good
1      i need to go to sleep ....goodnight
If you have rdd1 as
val rdd1 = sc.parallelize(List(0,0,4,1,4,1))
and rdd2 as
val rdd2 = sc.parallelize(List("i hate painting i have white paint all over my hands.",
"Bawww I need a haircut No1 could fit me in before work tonight. Sigh.",
"I had a great day",
"what is life.",
"He sings so good",
"i need to go to sleep ....goodnight"))
I want to map rdd1 with rdd2 such that row1 of rdd1 is mapped with row1 of rdd2 and so on.
Using the zip function
A simple zip should meet your requirement (note that zip requires both RDDs to have the same number of partitions and the same number of elements in each partition).
rdd1.zip(rdd2)
which would give you output as
(0,i hate painting i have white paint all over my hands.)
(0,Bawww I need a haircut No1 could fit me in before work tonight. Sigh.)
(4,I had a great day)
(1,what is life.)
(4,He sings so good)
(1,i need to go to sleep ....goodnight)
Using zipWithIndex and join
This approach gives the same output as the zip approach above, but it is more expensive because the join requires a shuffle.
rdd1.zipWithIndex().map(_.swap).join(rdd2.zipWithIndex().map(_.swap)).map(_._2)
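The same one-liner expanded step by step, purely as an illustrative sketch:
val indexedLabels = rdd1.zipWithIndex().map(_.swap) // (index, label)
val indexedText = rdd2.zipWithIndex().map(_.swap)   // (index, text)
val joined = indexedLabels.join(indexedText)        // (index, (label, text)); the join shuffles by index
val result = joined.map(_._2)                       // (label, text)
result.collect().foreach(println)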
I hope the answer is helpful

Search and replace using Apache Spark (Java)

Problem statement:
We need to replace words in each row with their synonyms, taken from a large synonym list (~40,000+ key-value pairs), across a large dataset (50,000 rows).
Example:
Input
Allen jeevi pramod Allen Armstrong
sandesh Armstrong jeevi
harsha Nischay DeWALT
Synonym list (key-value pairs; we have 40,000 entries):
Key | Value
------------------------------------
Allen | Apex Tool Group
Armstrong | Columbus McKinnon
DeWALT | StanleyBlack
The synonym list above has to be applied to the input, and the output should be in the format shown below.
Expected Output
Apex Tool Group jeevi pramod Apex Tool Group Columbus McKinnon
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
We have tried 3 approaches; each of them has its own limitations.
Approach 1
Using UDF
public void test() {
    List<Row> data = Arrays.asList(
        RowFactory.create(0, "Allen jeevi pramod Allen Armstrong"),
        RowFactory.create(1, "sandesh Armstrong jeevi"),
        RowFactory.create(2, "harsha Nischay DeWALT")
    );
    StructType schema = new StructType(new StructField[] {
        new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

    List<Row> data2 = Arrays.asList(
        RowFactory.create("Allen", "Apex Tool Group"),
        RowFactory.create("Armstrong", "Columbus McKinnon"),
        RowFactory.create("DeWALT", "StanleyBlack")
    );
    StructType schema2 = new StructType(new StructField[] {
        new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
        new StructField("sentence2", DataTypes.StringType, false, Metadata.empty())
    });
    Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);

    UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
        private static final long serialVersionUID = -5239951370238629896L;
        @Override
        public Boolean call(String t1, String t2) throws Exception {
            return t1.contains(t2);
        }
    };
    spark.udf().register("contains", contains, DataTypes.BooleanType);

    UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() {
        private static final long serialVersionUID = -2882956931420910207L;
        @Override
        public String call(String t1, String t2, String t3) throws Exception {
            return t1.replaceAll(t2, t3);
        }
    };
    spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);

    Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2, callUDF("contains", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2")))
        .withColumn("sentence_replaced", callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
        .select(col("sentence_replaced"));
    joined.show(false);
}
Input
Allen jeevi pramod Allen Armstrong
sandesh Armstrong jeevi
harsha Nischay DeWALT
Expected Output
Apex Tool Group jeevi pramod Apex Tool Group Columbus McKinnon
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
Actual Output
Apex Tool Group jeevi pramod Apex Tool Group Armstrong
Allen jeevi pramod Allen Columbus McKinnon
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
Issue with approach 1: if there are multiple synonym keys in an input row, that many output rows are created, as shown in the example output above.
We expected only one row with all the replacements.
Approach 2
Using ImmutableMap with the replace function: here we keep the key-value pairs in a map built with ImmutableMap and call the replace function to do the replacements,
but if a row contains multiple keys it ignores the complete row without replacing a single key…
try {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(sc);
    SparkSession spark = SparkSession.builder()
        .appName("JavaTokenizerExample").getOrCreate();

    HashMap<String, String> options = new HashMap<String, String>();
    options.put("header", "true");
    Dataset<Row> dataFileContent = sqlContext.load("com.databricks.spark.csv", options);

    dataFileContent = dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"), "[^a-zA-Z0-9\\s+]", ""));
    dataFileContent = dataFileContent.na().replace("ManufacturerSource", ImmutableMap.<String, String>builder()
        .put("Allen", "Apex Tool Group")
        .put("Armstrong", "Columbus McKinnon")
        .put("DeWALT", "StanleyBlack")
        // Here we have 40000 entries
        .build()
    );
    dataFileContent.show(10, false);
} catch (Exception e) {
    e.printStackTrace();
}
Here is the sample input and output:
Input
Allen jeevi pramod Allen Armstrong
sandesh Armstrong jeevi
harsha Nischay DeWALT
Expected Output
Apex Tool Group jeevi pramod Apex Tool Group Columbus McKinnon
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
Actual Output
Allen jeevi pramod Allen Armstrong
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
Approach 3
Using replaceAll within a UDF
public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("JoinFunctions").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(sc);
    SparkSession spark = SparkSession.builder().appName("StringSimiliarityExample").getOrCreate();

    Dataset<Row> sourceFileContent = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("source100.csv");
    sourceFileContent.show(false);

    StructType schema = new StructType(new StructField[] {
        new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

    UDF1<String, String> mode = new UDF1<String, String>() {
        public String call(final String types) throws Exception {
            return types.replaceAll("Allen", "Apex Tool Group")
                .replaceAll("Armstrong", "Columbus McKinnon")
                .replaceAll("DeWALT", "StanleyBlack")
                // 40000 more entries.....
                ;
        }
    };
    sqlContext.udf().register("mode", mode, DataTypes.StringType);

    sentenceDataFrame.createOrReplaceTempView("people");
    Dataset<Row> newDF = sqlContext.sql("SELECT mode(sentence), label FROM people").withColumnRenamed("UDF(sentence)", "sentence");
    newDF.show(false);
}
Output
Stack overflow exception.
Here we get a stack overflow exception, because the huge chain of replaceAll calls resembles a recursive function call.
Kindly let us know if there are any other approaches that can help resolve this issue.
None of these will work, since you always have the issue of substring matches. For example:
ABC -> DE
ABCDE -> ABC
With text "ABCDEF HIJ KLM" what will the output be? It should be the same as the input, but your approach will at best output "DEDEF HIJ KLM" and at worst you will do a double replacement and get "DEF HIJ KLM". Either case is incorrect.
You could improve this by adding boundaries to replacements, perhaps using regex. A better way however would be to first tokenize your input correctly, apply token replacement (which can be exact match), and then un-tokenize back to original format. This may be as simple as splitting by space, but you should give proper though as to what token boundaries may exist. (Stops, hyphens, etc).
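A minimal sketch of that tokenize, replace, and re-join idea (written in Scala rather than Java for brevity; the synonym map, column name, and sample rows are assumptions for illustration, not the poster's actual data):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("SynonymReplace").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical synonym map; in practice this would hold the ~40,000 entries.
val synonyms = Map(
  "Allen"     -> "Apex Tool Group",
  "Armstrong" -> "Columbus McKinnon",
  "DeWALT"    -> "StanleyBlack")
val synonymsBc = spark.sparkContext.broadcast(synonyms)

// Tokenize on whitespace, replace each token by exact-match lookup, then re-join.
val replaceTokens = udf { (sentence: String) =>
  sentence.split("\\s+")
    .map(token => synonymsBc.value.getOrElse(token, token))
    .mkString(" ")
}

val input = Seq(
  "Allen jeevi pramod Allen Armstrong",
  "sandesh Armstrong jeevi",
  "harsha Nischay DeWALT").toDF("sentence")

input.withColumn("sentence", replaceTokens(col("sentence"))).show(false)

Because the lookup is an exact token match, "ABCDE" is never touched by an "ABC" entry, and each row is rewritten exactly once regardless of how many keys it contains.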

Reading gz.parquet file

Hello, I need to read the data from gz.parquet files but don't know how. I tried with Impala, but I get the same result as parquet-tools cat, without the table structure.
P.S.: Any suggestions to improve the Spark code are most welcome.
I have the following gz.parquet files as the result of a data pipeline (Twitter => Flume => Kafka => Spark Streaming => Hive/gz.parquet files). For the Flume agent I am using agent1.sources.twitter-data.type = org.apache.flume.source.twitter.TwitterSource.
The Spark code dequeues the data from Kafka and stores it in Hive as follows:
val sparkConf = new SparkConf().setAppName("KafkaTweet2Hive")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) // new org.apache.spark.sql.SQLContext(sc)

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

// Get the data (tweets) from kafka
val tweets = messages.map(_._2)

// adding the tweets to Hive
tweets.foreachRDD { rdd =>
  val hiveContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val tweetsDF = rdd.toDF()
  tweetsDF.write.mode("append").saveAsTable("tweet")
}
When I run the Spark Streaming app, it stores the data as gz.parquet files in the HDFS /user/hive/warehouse directory as follows:
[root@quickstart /]# hdfs dfs -ls /user/hive/warehouse/tweets
Found 469 items
-rw-r--r-- 1 root supergroup 0 2016-03-30 08:36 /user/hive/warehouse/tweets/_SUCCESS
-rw-r--r-- 1 root supergroup 241 2016-03-30 08:36 /user/hive/warehouse/tweets/_common_metadata
-rw-r--r-- 1 root supergroup 35750 2016-03-30 08:36 /user/hive/warehouse/tweets/_metadata
-rw-r--r-- 1 root supergroup 23518 2016-03-30 08:33 /user/hive/warehouse/tweets/part-r-00000-0133fcd1-f529-4dd1-9371-36bf5c3e5df3.gz.parquet
-rw-r--r-- 1 root supergroup 9552 2016-03-30 08:33 /user/hive/warehouse/tweets/part-r-00000-02c44f98-bfc3-47e3-a8e7-62486a1a45e7.gz.parquet
-rw-r--r-- 1 root supergroup 19228 2016-03-30 08:25 /user/hive/warehouse/tweets/part-r-00000-0321ce99-9d2b-4c52-82ab-a9ed5f7d5036.gz.parquet
-rw-r--r-- 1 root supergroup 241 2016-03-30 08:25 /user/hive/warehouse/tweets/part-r-00000-03415df3-c719-4a3a-90c6-462c43cfef54.gz.parquet
The schema from the _metadata file is as follows:
[root@quickstart /]# parquet-tools meta hdfs://quickstart.cloudera:8020/user/hive/warehouse/tweets/_metadata
creator: parquet-mr version 1.5.0-cdh5.5.0 (build ${buildNumber})
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"tweet","type":"string","nullable":true,"metadata":{}}]}
file schema: root
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tweet: OPTIONAL BINARY O:UTF8 R:0 D:1
Furthermore, if I load the data into a DataFrame in Spark, I get the following output from df.show:
+--------------------+
| tweet|
+--------------------+
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|ڕObjavro.sch...|
|��Objavro.sc...|
|ֲObjavro.sch...|
|��Objavro.sc...|
|��Objavro.sc...|
|֕Objavro.sch...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
+--------------------+
only showing top 20 rows
However, I would like to see the tweets as plain text. How can I do that?
sqlContext.read.parquet("/user/hive/warehouse/tweets").show
