Java 8 CompletableFuture web crawler doesn't crawl past one URL - multithreading

I'm playing with the newly introduced concurrency features in Java 8, working through exercises from the book "Java SE 8 for the Really Impatient" by Cay S. Horstmann. I created the following web crawler using the new CompletableFuture and jsoup. The basic idea: given a URL, it finds the first m URLs on that page and repeats the process n times, where m and n are parameters. The problem is that the program fetches the URLs for the initial page but doesn't recurse. What am I missing?
static class WebCrawler {
    CompletableFuture<Void> crawl(final String startingUrl,
            final int depth, final int breadth) {
        if (depth <= 0) {
            return completedFuture(startingUrl, depth);
        }
        final CompletableFuture<Void> allDoneFuture = allOf((CompletableFuture[]) of(
                startingUrl)
                .map(url -> supplyAsync(getContent(url)))
                .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
                .map(urlsFuture -> urlsFuture.thenApply(doForEach(
                        depth, breadth)))
                .toArray(size -> new CompletableFuture[size]));
        allDoneFuture.join();
        return allDoneFuture;
    }

    private CompletableFuture<Void> completedFuture(
            final String startingUrl, final int depth) {
        LOGGER.info("Link: {}, depth: {}.", startingUrl, depth);
        CompletableFuture<Void> future = new CompletableFuture<>();
        future.complete(null);
        return future;
    }

    private Supplier<Document> getContent(final String url) {
        return () -> {
            try {
                return connect(url).get();
            } catch (IOException e) {
                throw new UncheckedIOException(
                        " Something went wrong trying to fetch the contents of the URL: "
                                + url, e);
            }
        };
    }

    private Function<Document, Set<String>> getURLs(final int limit) {
        return doc -> {
            LOGGER.info("Getting URLs for document: {}.", doc.baseUri());
            return doc.select("a[href]").stream()
                    .map(link -> link.attr("abs:href")).limit(limit)
                    .peek(LOGGER::info).collect(toSet());
        };
    }

    private Function<Set<String>, Stream<CompletableFuture<Void>>> doForEach(
            final int depth, final int breadth) {
        return urls -> urls.stream().map(
                url -> crawl(url, depth - 1, breadth));
    }
}
Test case:
@Test
public void testCrawl() {
    new WebCrawler().crawl(
            "http://en.wikipedia.org/wiki/Java_%28programming_language%29",
            2, 10);
}

The problem is in the following code:
final CompletableFuture<Void> allDoneFuture = allOf(
        (CompletableFuture[]) of(startingUrl)
                .map(url -> supplyAsync(getContent(url)))
                .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
                .map(urlsFuture -> urlsFuture.thenApply(doForEach(depth, breadth)))
                .toArray(size -> new CompletableFuture[size]));
For some reason you are doing all this inside a stream of one element (is that part of the exercise?). The result is that allDoneFuture does not track the completion of the sub-tasks. It tracks a future whose value is the Stream<CompletableFuture<Void>> returned by doForEach. That future completes as soon as the stream object exists, and because streams are lazy and nothing ever runs a terminal operation on it, the nested crawl calls are never started.
Fix it by removing the outer stream, which doesn't do anything helpful, and by composing the nested futures:
final CompletableFuture<Void> allDoneFuture = supplyAsync(getContent(startingUrl))
        .thenApply(getURLs(breadth))
        .thenApply(doForEach(depth, breadth))
        .thenApply(futures -> futures.toArray(CompletableFuture[]::new))
        .thenCompose(CompletableFuture::allOf);
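To try the same composition without jsoup or a network connection, here is a minimal sketch that crawls an in-memory link graph. InMemoryCrawler, its links map, and the demo method are hypothetical names made up for illustration; only the thenApply / toArray / thenCompose / allOf chain mirrors the fix above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the crawler: "pages" live in a map, so no jsoup or network is needed.
class InMemoryCrawler {
    private final Map<String, List<String>> links;
    // Thread-safe set, because pages may be visited from pool threads.
    final Set<String> visited = ConcurrentHashMap.newKeySet();

    InMemoryCrawler(Map<String, List<String>> links) {
        this.links = links;
    }

    CompletableFuture<Void> crawl(String url, int depth, int breadth) {
        visited.add(url);
        if (depth <= 0) {
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture
                .supplyAsync(() -> links.getOrDefault(url, Collections.emptyList()))
                // toArray is a terminal operation, so every nested crawl is actually started here.
                .thenApply(urls -> urls.stream()
                        .limit(breadth)
                        .map(next -> crawl(next, depth - 1, breadth))
                        .toArray(CompletableFuture[]::new))
                // thenCompose flattens the nested future; allOf completes when all children have.
                .thenCompose(CompletableFuture::allOf);
    }

    // Crawls a tiny made-up site two levels deep and returns every page that was visited.
    static Set<String> demo() {
        Map<String, List<String>> site = new HashMap<>();
        site.put("a", Arrays.asList("b", "c"));
        site.put("b", Arrays.asList("d"));
        site.put("c", Arrays.asList("e", "f"));
        InMemoryCrawler crawler = new InMemoryCrawler(site);
        crawler.crawl("a", 2, 10).join();
        return crawler.visited;
    }
}
```

Note that join() is only called once, by the outermost caller. Nothing blocks inside the pool threads, which is also why the join() inside crawl in the question is better dropped.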


Spring IntegrationFlow CompositeFileListFilter Not Working

I have two filters, regexFilter and lastModified.
return IntegrationFlows.from(Sftp.inboundAdapter(inboundSftp)
        .localDirectory(this.getlocalDirectory(config.getId()))
        .deleteRemoteFiles(true)
        .autoCreateLocalDirectory(true)
        .regexFilter(config.getRegexFilter())
        .filter(new LastModifiedLsEntryFileListFilter())
        .remoteDirectory(config.getInboundDirectory())
        , e -> e.poller(Pollers.fixedDelay(60_000)
                .errorChannel(MessageHeaders.ERROR_CHANNEL).errorHandler((ex) -> {
                })))
From googling I understand that I have to use CompositeFileListFilter for the regex, so I changed my code to
.filter(new CompositeFileListFilter().addFilter(new RegexPatternFileListFilter(config.getRegexFilter())))
It compiles, but at run time it throws an error and the channel stops. The same error occurs with
.filter(ftpPersistantFilter(config.getRegexFilter()))
.
.
.
public CompositeFileListFilter ftpPersistantFilter(String regexFilter) {
    CompositeFileListFilter filters = new CompositeFileListFilter();
    filters.addFilter(new FtpRegexPatternFileListFilter(regexFilter));
    return filters;
}
I just want to filter on the basis of the file name. There are 2 flows for the same remote folder, and both poll with the same cron, but each should pick up only its relevant files.
EDIT
Adding LastModifiedLsEntryFileListFilter. It's working fine, but I'm adding it upon request.
public class LastModifiedLsEntryFileListFilter implements FileListFilter<LsEntry> {
    private final Logger log = LoggerFactory.getLogger(LastModifiedLsEntryFileListFilter.class);
    private static final long DEFAULT_AGE = 60;
    private volatile long age = DEFAULT_AGE;
    private volatile Map<String, Long> sizeMap = new HashMap<String, Long>();

    public long getAge() {
        return this.age;
    }

    public void setAge(long age) {
        setAge(age, TimeUnit.SECONDS);
    }

    public void setAge(long age, TimeUnit unit) {
        this.age = unit.toSeconds(age);
    }

    @Override
    public List<LsEntry> filterFiles(LsEntry[] files) {
        List<LsEntry> list = new ArrayList<LsEntry>();
        long now = System.currentTimeMillis() / 1000;
        for (LsEntry file : files) {
            if (file.getAttrs()
                    .isDir()) {
                continue;
            }
            String fileName = file.getFilename();
            Long currentSize = file.getAttrs().getSize();
            Long oldSize = sizeMap.get(fileName);
            if (oldSize == null || currentSize.longValue() != oldSize.longValue()) {
                // putting size in map, will verify in next iteration of scheduler
                sizeMap.put(fileName, currentSize);
                log.info("[{}] old size [{}] increased to [{}]...", file.getFilename(), oldSize, currentSize);
                continue;
            }
            int lastModifiedTime = file.getAttrs()
                    .getMTime();
            if (lastModifiedTime + this.age <= now) {
                list.add(file);
                sizeMap.remove(fileName);
            } else {
                log.info("File [{}] is still being uploaded...", file.getFilename());
            }
        }
        return list;
    }
}
PS: While testing the regex filter, I removed LastModifiedLsEntryFileListFilter just for simplicity. So my final flow is
return IntegrationFlows.from(Sftp.inboundAdapter(inboundSftp)
        .localDirectory(this.getlocalDirectory(config.getId()))
        .deleteRemoteFiles(true)
        .autoCreateLocalDirectory(true)
        .filter(new CompositeFileListFilter().addFilter(new RegexPatternFileListFilter(config.getRegexFilter())))
        //.filter(new LastModifiedLsEntryFileListFilter())
        .remoteDirectory(config.getInboundDirectory()),
        e -> e.poller(Pollers.fixedDelay(60_000)
                .errorChannel(MessageHeaders.ERROR_CHANNEL).errorHandler((ex) -> {
                    try {
                        this.destroy(String.valueOf(config.getId()));
                        configurationService.removeConfigurationChannelById(config.getId());
                        // logging here
                    } catch (Exception ex1) {
                    }
                }))).publishSubscribeChannel(s -> s
        .subscribe(f -> {
            f.handle(Sftp.outboundAdapter(outboundSftp)
                    .useTemporaryFileName(false)
                    .autoCreateDirectory(true)
                    .remoteDirectory(config.getOutboundDirectory()), c -> c.advice(startup.deleteFileAdvice()));
        })
        .subscribe(f -> {
            if (doArchive) {
                f.handle(Sftp.outboundAdapter(inboundSftp)
                        .useTemporaryFileName(false)
                        .autoCreateDirectory(true)
                        .remoteDirectory(config.getInboundArchiveDirectory()));
            } else {
                f.handle(m -> {
                });
            }
        })
        .subscribe(f -> f
                .handle(m -> {
                    // I am handling exception here
                })
        ))
        .get();
And here are the exceptions:
2020-01-27 21:36:55,731 INFO o.s.i.c.PublishSubscribeChannel - Channel 'application.2.subFlow#0.channel#0' has 0 subscriber(s).
2020-01-27 21:36:55,731 INFO o.s.i.e.EventDrivenConsumer - stopped 2.subFlow#2.org.springframework.integration.config.ConsumerEndpointFactoryBean#0
2020-01-27 21:36:55,731 INFO o.s.i.c.DirectChannel - Channel 'application.2.subFlow#2.channel#0' has 0 subscriber(s).
2020-01-27 21:36:55,731 INFO o.s.i.e.EventDrivenConsumer - stopped 2.subFlow#2.org.springframework.integration.config.ConsumerEndpointFactoryBean#1
EDIT
Passing the regex to LastModifiedLsEntryFileListFilter and handling it there works for me. When I use any other regex filter inside CompositeFileListFilter, it throws an error.
.filter(new CompositeFileListFilter().addFilter(new LastModifiedLsEntryFileListFilter(config.getRegexFilter())))
Please show your final flow. I don't see that you use LastModifiedLsEntryFileListFilter in your CompositeFileListFilter... You definitely can't use regexFilter() and filter() together: the last one wins. To avoid confusion, we suggest using a single filter() and composing all of your filters with CompositeFileListFilter or ChainFileListFilter.
Also, please tell us which error you are referring to.
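To illustrate what "compose all those" means, here is a framework-free sketch of the two composition styles as I understand them: a composite shows the full listing to every filter and keeps the intersection, while a chain hands each filter only what the previous one accepted. ListFilter, Recording, composite and chain are hypothetical stand-ins written for this example, not the Spring Integration classes, and file names are plain strings here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class FilterCompositionSketch {
    // Hypothetical stand-in for a file-list filter; not the Spring Integration interface.
    interface ListFilter {
        List<String> filter(List<String> names);
    }

    // Accepts only names that fully match the given regular expression.
    static ListFilter regex(String pattern) {
        Pattern p = Pattern.compile(pattern);
        return names -> {
            List<String> accepted = new ArrayList<>();
            for (String name : names) {
                if (p.matcher(name).matches()) {
                    accepted.add(name);
                }
            }
            return accepted;
        };
    }

    // Accepts everything but remembers what it was shown, like a stateful filter would.
    static class Recording implements ListFilter {
        final List<String> seen = new ArrayList<>();

        @Override
        public List<String> filter(List<String> names) {
            seen.addAll(names);
            return new ArrayList<>(names);
        }
    }

    // Composite style: every filter sees the full listing; the result is the intersection.
    static List<String> composite(List<String> names, ListFilter... filters) {
        List<String> result = new ArrayList<>(names);
        for (ListFilter f : filters) {
            result.retainAll(f.filter(names));
        }
        return result;
    }

    // Chain style: each filter only sees what the previous filter accepted.
    static List<String> chain(List<String> names, ListFilter... filters) {
        List<String> current = names;
        for (ListFilter f : filters) {
            current = f.filter(current);
        }
        return current;
    }
}
```

Both styles return the same files here, but a stateful filter placed after the regex (such as one that tracks file sizes between polls) only sees the regex matches in the chain style. That matters when two flows poll the same remote folder.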

How to chain "RoomDatabase and Retrofit" in RxJava

I'm using the Repository Pattern.
I would like to implement logic that, if there is no value in the internal DB, returns the value of the API response and inserts it into the internal DB.
The flow is: read the value from the internal DB (Single); if it is found, return it; if it is not found, request it from the server API, insert the result into the internal DB (Completable), and return it as the final value (Single).
If any of these steps calls onError, the final result of this logic should be onError.
fun getAllStudent(): Single<List<StudentEntity>> =
    cache.getAllStudent().onErrorResumeNext { getAllStudentRemote() }

private fun getAllStudentRemote(): Single<List<StudentEntity>> =
    remote.getAllMember()
        .map { memberData -> memberData.students }
        .map { studentList -> studentList.map { student -> studentMapper.mapToEntity(student) } }
        .doOnSuccess { studentEntityList -> cache.insertStudents(studentEntityList) }
This is what I tried.
However, because the insert is never subscribed to, it can neither insert into the internal DB nor detect onError.
How can I implement this logic? ++ I'm sorry for my poor English.
Since you need to wait for cache.insertStudents() to complete, one thing you can do is chain cache.insertStudents() into the stream using flatMap.
For example:
fun getAllStudent(): Single<List<StudentEntity>> =
    cache.getAllStudent().onErrorResumeNext { getAllStudentRemote() }

private fun getAllStudentRemote(): Single<List<StudentEntity>> =
    remote.getAllMember()
        .map { memberData -> memberData.students }
        .map { studentList -> studentList.map { student -> studentMapper.mapToEntity(student) } }
        .flatMap { studentEntityList ->
            cache.insertStudents(studentEntityList) // Completable
                .toSingleDefault(studentEntityList) // Convert to Single<List<StudentEntity>>
        }
Also note that the .do... operators are side-effect operators: doOnSuccess only runs its lambda and never subscribes to the Completable that cache.insertStudents() returns, so the insert is never executed. Don't use them for work that the rest of the stream depends on.

How to multiply strings in Haxe

I'm trying to multiply some string a by some integer b such that a * b = a + a + a... (b times). I've tried doing it the same way I would in Python:
class Test {
    static function main() {
        var a = "Text";
        var b = 4;
        trace(a * b); // Assumed output: TextTextTextText
    }
}
But this raises:
Build failure Test.hx:6: characters 14-15 : String should be Int
There doesn't seem to be any information in the Haxe Programming Cookbook or the API Documentation about multiplying strings, so I'm wondering if I've mistyped something or if I should use:
class Test {
    static function main() {
        var a = "Text";
        var b = 4;
        var c = "";
        for (i in 0...b) {
            c = c + a;
        }
        trace(c); // Outputs "TextTextTextText"
    }
}
Not very short, but array comprehension might help in some situations:
class Test {
    static function main() {
        var a = "Text";
        var b = 4;
        trace([for (i in 0...b) a].join(""));
        // Output: TextTextTextText
    }
}
See on try.haxe.org.
The numeric multiplication operator * requires numeric types, such as Int. You have a String. If you want to multiply a string, you have to do it manually by appending to a target string within a loop.
The + operator in your example is not the numeric plus, but string concatenation.
You can achieve what you want by operator overloading:
abstract MyAbstract(String) {
    public inline function new(s:String) {
        this = s;
    }

    @:op(A * B)
    public function repeat(rhs:Int):MyAbstract {
        var s:StringBuf = new StringBuf();
        for (i in 0...rhs)
            s.add(this);
        return new MyAbstract(s.toString());
    }
}

class Main {
    static public function main() {
        var a = new MyAbstract("foo");
        trace(a * 3); // foofoofoo
    }
}
To build on tokiop's answer, you could also define a times function, and then use it as a static extension.
using Test.Extensions;

class Test {
    static function main() {
        trace("Text".times(4));
    }
}

class Extensions {
    public static function times(str:String, n:Int) {
        return [for (i in 0...n) str].join("");
    }
}
try.haxe.org demo here
To build on bsinky's answer, you can also define a times function as a static extension, but avoid the array:
using Test.Extensions;

class Test {
    static function main() {
        trace("Text".times(4));
    }
}

class Extensions {
    public static function times(str:String, n:Int) {
        var v = new StringBuf();
        for (i in 0...n) v.add(str);
        return v.toString();
    }
}
Demo: https://try.haxe.org/#e5937
StringBuf may be optimized for different targets. For example, on the JavaScript target it is compiled as if you were just using strings: https://api.haxe.org/StringBuf.html
The fastest method (at least on the JavaScript target, from https://try.haxe.org/#195A8) seems to be using StringTools.lpad or StringTools.rpad.
public static inline function stringProduct(s:String, n:Int) {
    if (n < 0) {
        throw (1);
    }
    return StringTools.lpad("", s, s.length * n);
}
It is not clear whether StringTools.lpad or StringTools.rpad is more efficient. It looks like rpad might be better for larger strings and lpad might be better for smaller strings, but they switch around a bit with each rerun. haxe.format.JsonPrinter uses lpad for concatenation, but I'm not sure which to recommend.

How to concatWith using information from previous Observable for pagination

Let's say I have a blocking method called List<UUID> listOf(int page).
If I want to paginate over it, one idea is to do something like this:
public Observable<UUID> allOf(int initialPage) {
    return fromCallable( () -> listOf(initialPage))
            .concatWith( fromCallable( () -> allOf(initialPage + 1)))
            .flatMap(x -> from(x));
}
If my service doesn't use the page number but the last element of the list to find the next elements, how can I achieve this with RxJava?
I would still like to be able to do something like allOf(0).take(20) and have concatWith call the second Observable only when the first one has completed.
But how can I do it when I need information from the previous call?
You could use a subject to send the next page number back to the beginning of the sequence:
List<Integer> service(int index) {
    System.out.println("Reading " + index);
    List<Integer> list = new ArrayList<>();
    for (int i = index; i < index + 20; i++) {
        list.add(i);
    }
    return list;
}

Flowable<List<Integer>> getPage(int index) {
    FlowableProcessor<Integer> pager = UnicastProcessor.<Integer>create()
            .toSerialized();
    pager.onNext(index);
    return pager.observeOn(Schedulers.trampoline(), true, 1)
            .map(v -> {
                List<Integer> list = service(v);
                pager.onNext(list.get(list.size() - 1) + 1);
                return list;
            });
}

@Test
public void testPager() {
    getPage(0).take(20)
            .subscribe(System.out::println, Throwable::printStackTrace);
}

Groovy - String each method

I have just started learning Groovy, which looks really awesome!
This is a very simple example.
"Groovy".each {a -> println a};
It nicely prints the output given below.
G
r
o
o
v
y
My question: the 'each' method is not part of the String object as per the link below. So how come it works?
http://beta.groovy-lang.org/docs/latest/html/groovy-jdk/
How can I get the parameter list for a closure of an object?
For example, String.each takes 1 parameter, while Map.each takes 1 or 2 parameters (an entry, or a key and a value).
The relevant code in DefaultGroovyMethods is
public static Iterator iterator(Object o) {
    return DefaultTypeTransformation.asCollection(o).iterator();
}
DefaultTypeTransformation.asCollection, in turn, contains:
else if (value instanceof String) {
    return StringGroovyMethods.toList((String) value);
}
StringGroovyMethods.toList is:
public static List<String> toList(String self) {
    int size = self.length();
    List<String> answer = new ArrayList<String>(size);
    for (int i = 0; i < size; i++) {
        answer.add(self.substring(i, i + 1));
    }
    return answer;
}
