Spark is creating too many threads when reading ML model - multithreading

I have a multilabel text classification problem that I tried to resolve using the binary relevance method, by creating one binary classifier per label.
After the training phase, I have to load 10,000 classifier models to run the classification phase on all my documents, using Spark.
But for an unknown reason, it becomes very slow when I try to load more than 1000 models: Spark creates a new thread each time, which progressively slows down the process, and I don't know why.
Here is the minimal code that illustrates my problem.
package entrepot.spark;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.sql.SparkSession;
public class maintest {
public static void main(String[] args) throws FileNotFoundException, IllegalArgumentException, IOException {
try(SparkSession spark = SparkSession.builder().appName("test").getOrCreate()) {
//Listing directories to get the list of labels
Set<String> labels = new HashSet<>();
FileStatus[] filesstatus = FileSystem.get(spark.sparkContext().hadoopConfiguration()).listStatus(new Path("C:\\Users\\*\\Desktop\\model\\"));
for(int i = 0; i < filesstatus.length; i++) {
if(filesstatus[i].isDirectory()) {
labels.add(filesstatus[i].getPath().getName());
}
}
List<MultilayerPerceptronClassificationModel> models = new ArrayList<>();
// Here is the problem
for(String label : labels) {
System.out.println(label);
MultilayerPerceptronClassificationModel model = MultilayerPerceptronClassificationModel.load("C:\\Users\\*\\Desktop\\model\\" + label + "\\CL\\");
models.add(model);
}
System.out.println("done");
}
}
}
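For reference, one way to confirm the thread growth described above is to log the live JVM thread count on each iteration of the loading loop. This is a diagnostic sketch only (not part of the original program), using the standard java.lang.management API:
// Diagnostic only: print how many threads are alive in the JVM before each model load.
java.lang.management.ThreadMXBean threadBean =
        java.lang.management.ManagementFactory.getThreadMXBean();
for (String label : labels) {
    System.out.println(label + " -> live JVM threads: " + threadBean.getThreadCount());
    models.add(MultilayerPerceptronClassificationModel.load("C:\\Users\\*\\Desktop\\model\\" + label + "\\CL\\"));
}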
I'm running the program on Windows, with Spark 2.1.1 and Hadoop 2.7.3, using the following command line:
.\bin\spark-submit^
--class entrepot.spark.maintest^
--master local[*]^
/C:/Users/*/eclipse-workspace/spark/target/spark-0.0.1-SNAPSHOT.jar
To download a small, repetitive sample of one of my label models, here is the link: we.tl/T50s9UffYV (why can't I post a simple link??)
PS: Even though the models are serializable, I couldn't save and load everything at once using a Java collection and an object stream, because I get a Scala conversion error. Instead, I'm using the save/load methods from MLlib on each model, resulting in hundreds of thousands of files.
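For clarity, the per-model save/load calls I'm referring to look like this (minimal sketch, same placeholder path as above):
// Training phase: each label's classifier is written to its own directory.
model.save("C:\\Users\\*\\Desktop\\model\\" + label + "\\CL\\");
// Classification phase: each model is read back individually, as in the loop above.
MultilayerPerceptronClassificationModel reloaded =
        MultilayerPerceptronClassificationModel.load("C:\\Users\\*\\Desktop\\model\\" + label + "\\CL\\");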

Related

How to get all VP and NP that have 2 or 3 direct children

I'm using Stanford NLP to generate parse trees of documents. I then want to iterate over these documents and store all phrases that are 2 or 3 words long that are part of a VP (verb phrase) or NP (noun phrase). What is a strategy I can use to accomplish this?
Here's some sample code that will go through a tree and print out the words in the NPs and VPs:
package edu.stanford.nlp.examples;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.*;
import java.util.*;
public class ConstituentExample {
public static void main(String[] args) {
// set up pipeline properties
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
// use faster shift reduce parser
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
props.setProperty("parse.maxlen", "100");
// set up Stanford CoreNLP pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// build annotation for a review
Annotation annotation =
new Annotation("The small red car turned very quickly around the corner.");
// annotate
pipeline.annotate(annotation);
// get tree
Tree tree =
annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0).get(TreeCoreAnnotations.TreeAnnotation.class);
System.out.println(tree);
Set<Constituent> treeConstituents = tree.constituents(new LabeledScoredConstituentFactory());
for (Constituent constituent : treeConstituents) {
if (constituent.label() != null &&
(constituent.label().toString().equals("VP") || constituent.label().toString().equals("NP"))) {
System.err.println("found constituent: "+constituent.toString());
System.err.println(tree.getLeaves().subList(constituent.start(), constituent.end()+1));
}
}
}
}
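To match the original question (keeping only phrases that are 2 or 3 words long), one possible extension is to filter on the constituent's leaf span before printing; a small sketch built on the variables already used above:
// Only keep NP/VP constituents whose leaf span is 2 or 3 words wide.
int spanLength = constituent.end() - constituent.start() + 1;
if (spanLength == 2 || spanLength == 3) {
    System.err.println(tree.getLeaves().subList(constituent.start(), constituent.end() + 1));
}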

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:92)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:253)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:330)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:304)
at sparkExample.spExample.ClusteringDSPOC.main(ClusteringDSPOC.java:45)
My code is
package sparkExample.spExample;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class ClusteringDSPOC {
private static final Pattern SPACE = Pattern.compile(" ");
private static final SparkContext sc = new SparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
private static final String POSTGRESQL_DRIVER = "org.postgresql.Driver";
private static final String POSTGRESQL_USERNAME = "xyz";
private static final String POSTGRESQL_PWD = "xyz";
private static final String POSTGRESQL_CONNECTION_URL = "jdbc:postgresql://192.168.111.130:5432/xyzdb?user=" + POSTGRESQL_USERNAME + "&password=" + POSTGRESQL_PWD;
private static final String POSTGRESQL_TABLE = "(select id, duration from abc where duration is not null ) as abc";
public static void main(String[] args) throws Exception {
//Datasource options
SparkSession spark = SparkSession.builder().appName("JavaKMeansExample").getOrCreate();
Class.forName(POSTGRESQL_DRIVER);
Properties options = new Properties();
Dataset<Row> sdrDS = spark.read().format("libsvm").jdbc(POSTGRESQL_CONNECTION_URL, POSTGRESQL_TABLE, options);
Dataset<Row> durationDS = sdrDS.select("duration");
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(durationDS);
}
}
I am following this
https://spark.apache.org/docs/latest/ml-clustering.html.
I get this error when the fit method is called. Please help me fix this, or suggest an alternative way to do it. Thanks.
Here I am trying to divide duration into 2 to 3 clusters and then map each cluster to an id. I am able to do the same thing with the Spark MLlib library in this way:
package sparkExample.spExample;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
public class ClusteringPOC1 {
private static final Pattern SPACE = Pattern.compile(" ");
private static final JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
private static final String POSTGRESQL_DRIVER = "org.postgresql.Driver";
private static final String POSTGRESQL_USERNAME = "abc";
private static final String POSTGRESQL_PWD = "abc";
private static final String POSTGRESQL_CONNECTION_URL = "jdbc:postgresql://192.168.111.130:5432/abcdb?user=" + POSTGRESQL_USERNAME + "&password=" + POSTGRESQL_PWD;
private static final SQLContext sqlContext = new SQLContext(sc);
public static void main(String[] args) throws Exception {
//Datasource options
Map<String, String> options = new HashMap<String, String>();
options.put("driver", POSTGRESQL_DRIVER);
options.put("url", POSTGRESQL_CONNECTION_URL);
options.put("dbtable", "(select id, duration from sdr_log where duration is not null ) as sdr_log");
Dataset<Row> sdrDF = sqlContext.load("jdbc", options);
JavaRDD<Row> sdrData = sdrDF.toJavaRDD();
sdrData.cache();
JavaRDD<Vector> durationData = sdrData.map(row -> {
double value = new Double(row.get(2).toString());
return Vectors.dense(value);
});
durationData.cache();
KMeansModel clusters = KMeans.train(durationData.rdd(), numClusters, numIterations);
JavaRDD<Integer> clusterLabel = clusters.predict(durationData);
JavaRDD<Long> id = sdrData.map(row -> new Long(row.get(1).toString()));
JavaPairRDD<Long, Integer> clusterLableData = id.zip(clusterLabel);
clusterLableData.saveAsTextFile("data/mlib/kmeans_output11.txt");
}
}
But I want to do this with the Spark ML library.
K-means is an unsupervised clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other.
Dataset<Row> durationDS = sdrDS.select("duration");
In your code, you are iterating over rows while selecting a single column, 'duration', and you're setting the number of clusters to 2. But how can you classify the data into clusters when you have no basis to do so?
The essence of unsupervised learning algorithms, in this case K-means, is that you do not need to specify parameters describing the logic of the dataset. You just need to pass (fit) the dataset to the model and it partitions it into clusters.
In the K-means algorithm, the model assigns each point to the nearest of the K cluster centroids. It needs some data to build the clusters from, whereas you're passing a single column.
It is better to use Spark's DataFrame API to resolve the error you are facing.
Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.
Import it into a DataFrame object:
> Dataset<Row> jdbcDF = spark.read().jdbc(POSTGRESQL_CONNECTION_URL, POSTGRESQL_TABLE, options);
You can now drop columns you don't want using jdbcDF.drop("columnName").
And/or fit your dataset this way:
> KMeansModel model = kmeans.fit(jdbcDF);
Also, it would be great if you could provide the dataset.
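For what it's worth, the 'Field "features" does not exist' message usually means the input DataFrame has no vector column named features, which Spark ML's KMeans expects by default. A minimal sketch of building one with VectorAssembler, assuming the question's JDBC DataFrame with a numeric duration column:
import org.apache.spark.ml.feature.VectorAssembler;

// Assemble the numeric column(s) into the "features" vector column that KMeans looks for.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[] {"duration"})
        .setOutputCol("features");
Dataset<Row> withFeatures = assembler.transform(sdrDS);
KMeansModel model = new KMeans().setK(2).setSeed(1L).fit(withFeatures);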

Operation APPEND failed with HTTP500?

package org.apache.spark.examples.kafkaToflink;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import com.microsoft.azure.datalake.store.ADLException;
import com.microsoft.azure.datalake.store.ADLFileOutputStream;
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.IfExists;
import com.microsoft.azure.datalake.store.oauth2.AccessTokenProvider;
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider;
public class App {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.1.72:9092");
properties.setProperty("group.id", "test");
DataStream<String> stream = env.addSource(
new FlinkKafkaConsumer010<String>("tenant", new SimpleStringSchema(), properties), "Kafka_Source");
stream.addSink(new ADLSink()).name("Custom_Sink").setParallelism(128);
env.execute("App");
}
}
class ADLSink<String> extends RichSinkFunction<String> {
private java.lang.String clientId = "***********";
private java.lang.String authTokenEndpoint = "***************";
private java.lang.String clientKey = "*****************";
private java.lang.String accountFQDN = "****************";
private java.lang.String filename = "/Bitfinex/ETHBTC/ORDERBOOK/ORDERBOOK.json";
@Override
public void invoke(String value) {
AccessTokenProvider provider = new ClientCredsTokenProvider(authTokenEndpoint, clientId, clientKey);
ADLStoreClient client = ADLStoreClient.createClient(accountFQDN, provider);
try {
client.setPermission(filename, "744");
ADLFileOutputStream stream = client.getAppendStream(filename);
System.out.println(value);
stream.write(value.toString().getBytes());
stream.close();
} catch (ADLException e) {
System.out.println(e.requestId);
} catch (Exception e) {
System.out.println(e.getMessage());
System.out.println(e.getCause());
}
}
}
I am continuously trying to append to a file in Azure Data Lake Store using a while loop. But sometimes it gives this 'Operation APPEND failed with HTTP500' error right at the start, or sometimes after 10 minutes. I am using Java.
Anubhav, Azure Data Lake streams are single-writer streams - i.e., you cannot write to the same stream from multiple threads, unless you do some form of synchronization between these threads. This is because each write specifies the offset it is writing to, and with multiple threads, the offsets are not consistent.
You seem to be writing from multiple threads (the .setParallelism(128) call in your code).
In your case, you have two choices:
Write to a different file in each thread (see the sketch after these two options). I do not know your use case, but we have found that for a lot of cases that is the natural use of different threads - to write to different files.
If it is important to have all the threads write to the same file, then you will need to refactor the sink a little bit so that all the instances have reference to the same ADLFileOutputStream, and you will need to make sure the calls to write() and close() are synchronized.
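A rough sketch of the first choice, based on the ADLSink from the question: derive a distinct path for each parallel sink instance from its subtask index, so no two writers ever share a stream (the file naming below is only an assumption for illustration):
// Inside ADLSink.invoke(): use Flink's subtask index so each parallel instance
// writes to its own file instead of all 128 appending to the same one.
int subtask = getRuntimeContext().getIndexOfThisSubtask();
java.lang.String perTaskFile = "/Bitfinex/ETHBTC/ORDERBOOK/ORDERBOOK-" + subtask + ".json";
ADLFileOutputStream stream = client.getAppendStream(perTaskFile); // create the file beforehand if it does not already exist
stream.write(value.toString().getBytes(StandardCharsets.UTF_8));
stream.close();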
Now, there is one more issue here - the error you got should have been an HTTP 4xx error (indicating a lease conflict, since ADLFileOutputStream acquires a lease), rather than HTTP 500, which says there was a server-side problem. To troubleshoot that, I will need to know your account name and time of access. That info is not safe to share on StackOverflow, so please open a support ticket for that and reference this SO question, so the issue eventually gets routed to me.

Spring Batch thread-safe Map job repository

The Spring Batch docs say of the Map-backed job repository:
Note that the in-memory repository is volatile and so does not allow restart between JVM instances. It also cannot guarantee that two job instances with the same parameters are launched simultaneously, and is not suitable for use in a multi-threaded Job, or a locally partitioned Step. So use the database version of the repository wherever you need those features.
I would like to use a Map job repository, and I do not care about restarting, prevention of concurrent job executions, etc. but I do care about being able to use multi-threading and local partitioning.
My batch application has some partitioned steps, and at first glance it seems to run just fine with a Map-backed job repository.
Why is it said to be not possible with MapJobRepositoryFactoryBean? Looking at the implementation of the Map DAOs, they use ConcurrentHashMap. Is this not thread-safe?
I would advise you to follow the documentation, rather than relying on implementation details. Even if the maps are individually thread-safe, there might be race conditions in changes that involve more than one of these maps.
You can use an in-memory database very easily. Example
@Grapes([
@Grab('org.springframework:spring-jdbc:4.0.5.RELEASE'),
@Grab('com.h2database:h2:1.3.175'),
@Grab('org.springframework.batch:spring-batch-core:3.0.6.RELEASE'),
// must be passed with -cp, for whatever reason the GroovyClassLoader
// is not used for com.thoughtworks.xstream.io.json.JettisonMappedXmlDriver
//@Grab('org.codehaus.jettison:jettison:1.2'),
])
import org.h2.jdbcx.JdbcDataSource
import org.springframework.batch.core.Job
import org.springframework.batch.core.JobParameters
import org.springframework.batch.core.Step
import org.springframework.batch.core.StepContribution
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory
import org.springframework.batch.core.launch.JobLauncher
import org.springframework.batch.core.scope.context.ChunkContext
import org.springframework.batch.core.step.tasklet.Tasklet
import org.springframework.batch.repeat.RepeatStatus
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.context.annotation.AnnotationConfigApplicationContext
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.core.io.ResourceLoader
import org.springframework.jdbc.datasource.init.DatabasePopulatorUtils
import org.springframework.jdbc.datasource.init.ResourceDatabasePopulator
import javax.annotation.PostConstruct
import javax.sql.DataSource
@Configuration
@EnableBatchProcessing
class AppConfig {
@Autowired
private JobBuilderFactory jobs
@Autowired
private StepBuilderFactory steps
@Bean
public Job job() {
return jobs.get("myJob").start(step1()).build()
}
@Bean
Step step1() {
this.steps.get('step1')
.tasklet(new MyTasklet())
.build()
}
@Bean
DataSource dataSource() {
new JdbcDataSource().with {
url = 'jdbc:h2:mem:temp_db;DB_CLOSE_DELAY=-1'
user = 'sa'
password = 'sa'
it
}
}
@Bean
BatchSchemaPopulator batchSchemaPopulator() {
new BatchSchemaPopulator()
}
}
class BatchSchemaPopulator {
@Autowired
ResourceLoader resourceLoader
@Autowired
DataSource dataSource
@PostConstruct
void init() {
def populator = new ResourceDatabasePopulator()
populator.addScript(
resourceLoader.getResource(
'classpath:/org/springframework/batch/core/schema-h2.sql'))
DatabasePopulatorUtils.execute populator, dataSource
}
}
class MyTasklet implements Tasklet {
@Override
RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
println 'TEST!'
return RepeatStatus.FINISHED
}
}
def ctx = new AnnotationConfigApplicationContext(AppConfig)
def launcher = ctx.getBean(JobLauncher)
def jobExecution = launcher.run(ctx.getBean(Job), new JobParameters([:]))
println "Status is: ${jobExecution.status}"

Using Py4J to invoke a method that takes a JavaSparkContext and returns a JavaRDD<Integer>

I am looking for some help or example code that illustrates PySpark calling user-written Java code outside of Spark itself, where the Java code takes a Spark context from Python and then returns an RDD built in Java.
For completeness, I'm using Py4J 0.81, Java 8, Python 2.7, and Spark 1.3.1.
Here is what I am using for the Python half:
import pyspark
sc = pyspark.SparkContext(master='local[4]',
appName='HelloWorld')
print "version", sc._jsc.version()
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
print gateway.entry_point.getRDDFromSC(sc._jsc)
The Java portion is:
import java.util.Map;
import java.util.List;
import java.util.ArrayList;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import py4j.GatewayServer;
public class HelloWorld
{
public JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc)
{
JavaRDD<Integer> result = null;
if (jsc == null)
{
System.out.println("XXX Bad mojo XXX");
return result;
}
int n = 10;
List<Integer> l = new ArrayList<Integer>(n);
for (int i = 0; i < n; i++)
{
l.add(i);
}
result = jsc.parallelize(l);
return result;
}
public static void main(String[] args)
{
HelloWorld app = new HelloWorld();
GatewayServer server = new GatewayServer(app);
server.start();
}
}
Running this produces, on the Python side:
$ spark-1.3.1-bin-hadoop1/bin/spark-submit main.py
version 1.3.1
sc._jsc <class 'py4j.java_gateway.JavaObject'>
org.apache.spark.api.java.JavaSparkContext#50418105
None
The Java side reports:
$ spark-1.3.1-bin-hadoop1/bin/spark-submit --class "HelloWorld" --master local[4] target/hello-world-1.0.jar
XXX Bad mojo XXX
The problem appears to be that I am not correctly passing the JavaSparkContext from Python to Java. The same failure (the JavaRDD being null) occurs when I use sc._scj.sc() from Python.
What is the correct way to invoke user-defined Java code that uses Spark from Python?
So I've got an example of this in a branch that I'm working on for Sparkling Pandas. The branch lives at https://github.com/holdenk/sparklingpandas/tree/add-kurtosis-support and the PR is at https://github.com/sparklingpandas/sparklingpandas/pull/90.
As it stands, it looks like you have two different gateway servers, which seems like it might cause some problems. Instead, you can just use the existing gateway server and do something like:
sc._jvm.what.ever.your.class.package.is.HelloWorld.getRDDFromSC(sc._jsc)
assuming you make that a static method as well.
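For illustration, a minimal sketch of the static variant that the sc._jvm call above assumes (the package name is a placeholder, and the jar still has to be visible to the driver JVM, for example via spark-submit's --driver-class-path):
package examples; // placeholder package; adjust the sc._jvm path accordingly

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HelloWorld {
    // Static, so Python can call it through sc._jvm without a separate GatewayServer.
    public static JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc) {
        List<Integer> l = new ArrayList<Integer>();
        for (int i = 0; i < 10; i++) {
            l.add(i);
        }
        return jsc.parallelize(l);
    }
}
With that in place, the Python side would call sc._jvm.examples.HelloWorld.getRDDFromSC(sc._jsc) instead of going through a separate JavaGateway.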
