HazelcastJet kafka throttling

I couldn't find a way to create a pipeline in hazelcast-jet-kafka that limits the throughput to a specific number of elements per time unit. Could anybody suggest a possible solution? I know that Alpakka Kafka (https://doc.akka.io/docs/alpakka-kafka/current/) has such functionality.

You can define this function:
private <T, S extends GeneralStage<T>> FunctionEx<S, S> throttle(int itemsPerSecond) {
    // context for the mapUsingService stage
    class Service {
        final int ratePerSecond;
        final TreeMap<Long, Long> counts = new TreeMap<>();
        public Service(int ratePerSecond) {
            this.ratePerSecond = ratePerSecond;
        }
    }
    // factory for the service
    ServiceFactory<?, Service> serviceFactory = ServiceFactories
            .nonSharedService(procCtx ->
                    // divide the count for the actual number of processors we have
                    new Service(Math.max(1, itemsPerSecond / procCtx.totalParallelism())))
            // non-cooperative is needed because we sleep in the mapping function
            .toNonCooperative();
    return stage -> (S) stage
            .mapUsingService(serviceFactory,
                    (ctx, item) -> {
                        // current time in 10ths of a second
                        long now = System.nanoTime() / 100_000_000;
                        // include this item in the counts
                        ctx.counts.merge(now, 1L, Long::sum);
                        // clear items emitted more than second ago
                        ctx.counts.headMap(now - 10, true).clear();
                        long countInLastSecond =
                                ctx.counts.values().stream().mapToLong(Long::longValue).sum();
                        // if we emitted too many items, sleep a while
                        if (countInLastSecond > ctx.ratePerSecond) {
                            Thread.sleep(
                                    (countInLastSecond - ctx.ratePerSecond) * 1000/ctx.ratePerSecond);
                        }
                        // now we can pass the item on
                        return item;
                    }
            );
}
Then use it to throttle in the pipeline:
Pipeline p = Pipeline.create();
p.readFrom(TestSources.items(IntStream.range(0, 2_000).boxed().toArray(Integer[]::new)))
.apply(throttle(100))
.writeTo(Sinks.noop());
The above job will take about 20 seconds to complete because it has 2000 items and the rate is limited to 100 items/s. The rate is evaluated over the last second, so if fewer than 100 items arrive per second, they are forwarded immediately. If 101 items arrive within one millisecond, 100 are forwarded immediately and the last one after a sleep.
Also make sure that your source is distributed. The rate is divided by the number of processors in the cluster, so if your source isn't distributed and some members don't see any data, your overall rate will be only a fraction of the desired rate.
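To see the sliding-window arithmetic in isolation, here is a minimal sketch in plain Java with no Jet dependency. The class name SlidingWindowThrottle and the method recordAndGetDelayMillis are hypothetical names chosen for illustration. Time is passed in explicitly (in 10ths of a second) so the behaviour can be checked without sleeping.

```java
import java.util.TreeMap;

/** Hypothetical helper: the per-processor bookkeeping of the throttle, without Jet. */
public class SlidingWindowThrottle {
    private final int ratePerSecond;
    // bucket (time in 10ths of a second) -> number of items seen in that bucket
    private final TreeMap<Long, Long> counts = new TreeMap<>();

    public SlidingWindowThrottle(int ratePerSecond) {
        this.ratePerSecond = ratePerSecond;
    }

    /**
     * Records one item arriving at nowTenths and returns how many milliseconds
     * the caller should sleep before forwarding it (0 if the rate is not exceeded).
     */
    public long recordAndGetDelayMillis(long nowTenths) {
        counts.merge(nowTenths, 1L, Long::sum);
        // drop buckets that are 10 or more tenths (one second) old
        counts.headMap(nowTenths - 10, true).clear();
        long countInLastSecond =
                counts.values().stream().mapToLong(Long::longValue).sum();
        if (countInLastSecond <= ratePerSecond) {
            return 0;
        }
        return (countInLastSecond - ratePerSecond) * 1000 / ratePerSecond;
    }

    public static void main(String[] args) {
        SlidingWindowThrottle throttle = new SlidingWindowThrottle(100);
        long delay = 0;
        for (int i = 0; i < 101; i++) {
            delay = throttle.recordAndGetDelayMillis(0);
        }
        // the first 100 items need no delay; the 101st is held back
        System.out.println("delay for item 101: " + delay + " ms");
    }
}
```

With a rate of 100, the 101st item in the same window gets a delay of (101 - 100) * 1000 / 100 = 10 ms, which matches the sleep computed in the mapping function above.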

Related

Nutch crawl log shows 399054 SCHEDULE_REJECTED and 5892 URLS_SKIPPED_PER_HOST_OVERFLOW: what do they mean?

While crawling, I saw this output:
Generator: number of items rejected during selection:
Generator: 67 HOSTS_AFFECTED_PER_HOST_OVERFLOW
Generator: 3 MALFORMED_URL
Generator: 399054 SCHEDULE_REJECTED
Generator: 5892 URLS_SKIPPED_PER_HOST_OVERFLOW
I understand 67 HOSTS_AFFECTED_PER_HOST_OVERFLOW and 3 MALFORMED_URL.
I do not understand what 399054 SCHEDULE_REJECTED and 5892 URLS_SKIPPED_PER_HOST_OVERFLOW mean.
Can anyone explain what they mean?

The Generator phase has several counters that report the URLs filtered or skipped in the Generator MapReduce phase.
SCHEDULE_REJECTED
if(!schedule.shouldFetch(url, crawlDatum, curTime)){
context.getCounter("Generator", "SCHEDULE_REJECTED").increment(1);
return;}
As per the property defined in nutch-site.xml, the default schedule is DefaultFetchSchedule:
db.fetch.schedule.class = org.apache.nutch.crawl.DefaultFetchSchedule
The shouldFetch method in AbstractFetchSchedule decides whether or not a URL is allowed into the Fetcher phase at the current time.
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
// pages are never truly GONE - we have to check them from time to time.
// pages with too long a fetchInterval are adjusted so that they fit within
// a maximum fetchInterval (segment retention period).
if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
if (datum.getFetchInterval() > maxInterval) {
datum.setFetchInterval(maxInterval * 0.9f);
}
datum.setFetchTime(curTime);
}
if (datum.getFetchTime() > curTime) {
return false; // not time yet
}
return true;
}
The logic above says that a URL fetched in an earlier iteration can be fetched again in an upcoming iteration only once its fetchTime has expired. The fetchTime window is set by db.fetch.interval.default, whose default value is 30 days.
shouldFetch therefore makes sure that a successfully fetched URL is retried only after 30 days; until then it is rejected in the generator and counted as SCHEDULE_REJECTED.
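The effect of this check can be tried out with a small standalone sketch. It mirrors the quoted shouldFetch logic using plain fields instead of Nutch's CrawlDatum. The class FetchScheduleSketch, the nested Datum type and the 90-day maximum interval used in main are assumptions made for illustration, not Nutch code.

```java
public class FetchScheduleSketch {
    /** Hypothetical stand-in for the two CrawlDatum fields the check uses. */
    public static final class Datum {
        public long fetchTimeMs;       // next time the URL is due, in milliseconds
        public float fetchIntervalSec; // re-fetch interval, in seconds

        public Datum(long fetchTimeMs, float fetchIntervalSec) {
            this.fetchTimeMs = fetchTimeMs;
            this.fetchIntervalSec = fetchIntervalSec;
        }
    }

    public static boolean shouldFetch(Datum datum, long curTimeMs, int maxIntervalSec) {
        // a fetch time too far in the future is pulled back to "now"
        if (datum.fetchTimeMs - curTimeMs > (long) maxIntervalSec * 1000) {
            if (datum.fetchIntervalSec > maxIntervalSec) {
                datum.fetchIntervalSec = maxIntervalSec * 0.9f;
            }
            datum.fetchTimeMs = curTimeMs;
        }
        // not due yet: the generator would count this as SCHEDULE_REJECTED
        return datum.fetchTimeMs <= curTimeMs;
    }

    public static void main(String[] args) {
        long dayMs = 24L * 60 * 60 * 1000;
        int daySec = 24 * 60 * 60;
        int maxIntervalSec = 90 * daySec; // assumed maximum interval for this sketch
        // fetched at day 0 with a 30-day interval, so it is due again at day 30
        Datum datum = new Datum(30 * dayMs, 30f * daySec);
        System.out.println("day 10: " + shouldFetch(datum, 10 * dayMs, maxIntervalSec));
        System.out.println("day 31: " + shouldFetch(datum, 31 * dayMs, maxIntervalSec));
    }
}
```

On day 10 the URL is not yet due and would be rejected; on day 31 it is due and passes the check.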
WAIT_FOR_UPDATE (the default wait is 7 days)
This counter only makes sense when generate.update.crawldb=true is set.
It is relevant in highly concurrent environments, where several generate/fetch/update cycles may overlap. Setting generate.update.crawldb to true ensures that generate creates different fetchlists, and it uses crawl.gen.delay to do so.
crawl.gen.delay defines how long items that have already been generated are blocked (the default is 7 days).
LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData()
.get(Nutch.WRITABLE_GENERATE_TIME_KEY);
if (oldGenTime != null) { // awaiting fetch & update
if (oldGenTime.get() + genDelay > curTime) // still wait for
// update
context.getCounter("Generator", "WAIT_FOR_UPDATE").increment(1);
return;
}
MALFORMED_URL: This counter tracks URLs that do not have proper URL syntax or that have URL encoding issues.
HOSTS_AFFECTED_PER_HOST_OVERFLOW/URLS_SKIPPED_PER_HOST_OVERFLOW :
if (maxCount > 0) {
int[] hostCount = hostCounts.get(hostordomain);
if (hostCount == null) {
hostCount = new int[]{1, 0};
hostCounts.put(hostordomain, hostCount);
}
// increment hostCount
hostCount[1]++;
// check if topN reached, select next segment if it is
while (segCounts[hostCount[0] - 1] >= limit
&& hostCount[0] < maxNumSegments) {
hostCount[0]++;
hostCount[1] = 0;
}
// reached the limit of allowed URLs per host / domain
// see if we can put it in the next segment?
if (hostCount[1] > maxCount) {
if (hostCount[0] < maxNumSegments) {
hostCount[0]++;
hostCount[1] = 1;
} else {
if (hostCount[1] == (maxCount + 1)) {
context
.getCounter("Generator", "HOSTS_AFFECTED_PER_HOST_OVERFLOW")
.increment(1);
LOG.info(
"Host or domain {} has more than {} URLs for all {} segments. Additional URLs won't be included in the fetchlist.",
hostordomain, maxCount, maxNumSegments);
}
// skip this entry
context.getCounter("Generator", "URLS_SKIPPED_PER_HOST_OVERFLOW")
.increment(1);
continue;
}
}
entry.segnum = new IntWritable(hostCount[0]);
segCounts[hostCount[0] - 1]++;
} else {
entry.segnum = new IntWritable(currentsegmentnum);
segCounts[currentsegmentnum - 1]++;
}
As the code above shows, the hostCounts object tracks <hostordomain, [segmentNumber, urlCount]>.
HOSTS_AFFECTED_PER_HOST_OVERFLOW
The condition hostCount[1] == (maxCount + 1) is reached, with hostCount[0] no longer below maxNumSegments, only when a host or domain has used its full per-host threshold in every segment. The counter is incremented once per host or domain, at the first URL that overflows the final segment.
URLS_SKIPPED_PER_HOST_OVERFLOW
It counts every URL that is skipped because its host or domain has no room left in any segment.
Other counters such as INTERVAL_REJECTED, SCORE_TOO_LOW and STATUS_REJECTED are self-explanatory; for more detail, check the Generator code.
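The relationship between the two overflow counters is easier to see in a reduced form. The sketch below assumes a single segment and ignores topN; the class HostOverflowSketch and its method are hypothetical and only reproduce the counting rule from the quoted Generator code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostOverflowSketch {
    /**
     * Takes the host of each candidate URL, in selection order, and returns
     * {hostsAffected, urlsSkipped} for a per-host cap of maxCount URLs.
     */
    public static int[] countOverflow(List<String> hostOfEachUrl, int maxCount) {
        Map<String, Integer> urlsSeenPerHost = new HashMap<>();
        int hostsAffected = 0;
        int urlsSkipped = 0;
        for (String host : hostOfEachUrl) {
            int seen = urlsSeenPerHost.merge(host, 1, Integer::sum);
            if (seen > maxCount) {
                if (seen == maxCount + 1) {
                    hostsAffected++; // first URL over the cap: count the host once
                }
                urlsSkipped++; // every URL over the cap is skipped
            }
        }
        return new int[] {hostsAffected, urlsSkipped};
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("a", "a", "a", "a", "a", "b", "b");
        int[] result = countOverflow(hosts, 3);
        System.out.println("hosts affected: " + result[0] + ", URLs skipped: " + result[1]);
    }
}
```

Host a has five URLs against a cap of three, so one host is affected and two URLs are skipped, while host b stays under the cap. This is why the log in the question reports far more skipped URLs (5892) than affected hosts (67).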

How do I train on time when I already know the date?

I'm looking to filter by a time interval. For example, from 9am to noon. It is unclear how to create that functionality in the training and action concept. For example, I currently have this:

You should train it with DateTimeExpression and use the DateTimeInterval component to get the actual difference
action (TestDateTimeInterval) {
type(Search)
description (__DESCRIPTION__)
collect {
input (dateTimeExpression) {
type (time.DateTimeExpression)
min (Optional) max (One)
}
}
output (core.Integer)
}
Action Javascript
module.exports.function = function testDateTimeInterval (dateTimeExpression) {
var dates = require ('dates')
var whenStart;
var whenEnd;
if (dateTimeExpression.dateTimeInterval) {
whenStart = dates.ZonedDateTime.of(
dateTimeExpression.dateTimeInterval.start.date.year,
dateTimeExpression.dateTimeInterval.start.date.month,
dateTimeExpression.dateTimeInterval.start.date.day,
dateTimeExpression.dateTimeInterval.start.time.hour,
dateTimeExpression.dateTimeInterval.start.time.minute,
dateTimeExpression.dateTimeInterval.start.time.second);
whenEnd = dates.ZonedDateTime.of(
dateTimeExpression.dateTimeInterval.end.date.year,
dateTimeExpression.dateTimeInterval.end.date.month,
dateTimeExpression.dateTimeInterval.end.date.day,
dateTimeExpression.dateTimeInterval.end.time.hour,
dateTimeExpression.dateTimeInterval.end.time.minute,
dateTimeExpression.dateTimeInterval.end.time.second);
// For example, to return the difference in hours between the start and end times:
return dateTimeExpression.dateTimeInterval.end.time.hour - dateTimeExpression.dateTimeInterval.start.time.hour
}
return -1;
}
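Subtracting the hour fields, as the action above does, only gives the right answer when the start and end fall on the same day. As a language-neutral illustration of the arithmetic, here is a short sketch in Java rather than Bixby JavaScript. It uses java.time, not the Bixby dates library, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class IntervalHoursSketch {
    /** Whole hours elapsed between start and end, including across midnight. */
    public static long wholeHoursBetween(LocalDateTime start, LocalDateTime end) {
        return Duration.between(start, end).toHours();
    }

    public static void main(String[] args) {
        LocalDateTime nineAm = LocalDateTime.of(2020, 1, 15, 9, 0, 0);
        LocalDateTime noon = LocalDateTime.of(2020, 1, 15, 12, 0, 0);
        System.out.println(wholeHoursBetween(nineAm, noon));

        LocalDateTime tenPm = LocalDateTime.of(2020, 1, 15, 22, 0, 0);
        LocalDateTime twoAm = LocalDateTime.of(2020, 1, 16, 2, 0, 0);
        // hour subtraction would give 2 - 22 = -20 here
        System.out.println(wholeHoursBetween(tenPm, twoAm));
    }
}
```

The same idea applies to the two ZonedDateTime values the action already builds: compute the difference from the full date-times rather than from the hour fields alone.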

How to safely select across channels where some may get concurrently closed?

While answering a question, I attempted to implement a setup where the main thread joins the efforts of the CommonPool to execute a number of independent tasks in parallel (this is how java.util.stream operates).
I create as many actors as there are CommonPool threads, plus a channel for the main thread. The actors use rendezvous channels:
val resultChannel = Channel<Double>(UNLIMITED)
val poolComputeChannels = (1..commonPool().parallelism).map {
actor<Task>(CommonPool) {
for (task in channel) {
task.execute().also { resultChannel.send(it) }
}
}
}
val mainComputeChannel = Channel<Task>()
val allComputeChannels = poolComputeChannels + mainComputeChannel
This allows me to distribute the load by using a select expression to find an idle actor for each task:
select {
allComputeChannels.forEach { chan ->
chan.onSend(task) {}
}
}
So I send all the tasks and close the channels:
launch(CommonPool) {
jobs.forEach { task ->
select {
allComputeChannels.forEach { chan ->
chan.onSend(task) {}
}
}
}
allComputeChannels.forEach { it.close() }
}
Now I have to write the code for the main thread. Here I decided to serve both the mainComputeChannel, executing the tasks submitted to the main thread, and the resultChannel, accumulating the individual results into the final sum:
return runBlocking {
var completedCount = 0
var sum = 0.0
while (completedCount < NUM_TASKS) {
select<Unit> {
mainComputeChannel.onReceive { task ->
task.execute().also { resultChannel.send(it) }
}
resultChannel.onReceive { result ->
sum += result
completedCount++
}
}
}
resultChannel.close()
sum
}
This gives rise to the situation where mainComputeChannel may be closed from a CommonPool thread, but the resultChannel still needs serving. If the channel is closed, onReceive will throw an exception and onReceiveOrNull will immediately select with null. Neither option is acceptable. I didn't find a way to avoid registering the mainComputeChannel if it's closed, either. If I use if (!mainComputeChannel.isClosedForReceive), it will not be atomic with the registration call.
This leads me to my question: what would be a good idiom to select over channels where some may get closed by another thread while others are still live?

The kotlinx.coroutines library is currently missing a primitive that makes this convenient. The outstanding proposal is to add a receiveOrClosed function and an onReceiveOrClosed clause for select that would make it possible to write code like this.
However, you will still have to track manually whether your mainComputeChannel has been closed and stop selecting on it once it has. So, using the proposed onReceiveOrClosed clause, you would write something like this:
// outside of loop
var mainComputeChannelClosed = false
// inside loop
select<Unit> {
if (!mainComputeChannelClosed) {
mainComputeChannel.onReceiveOrClosed {
if (it.isClosed) mainComputeChannelClosed = true
else { /* do something with it */ }
}
}
// more clauses
}
See https://github.com/Kotlin/kotlinx.coroutines/issues/330 for details.
There are no proposals on the table to further simplify this kind of pattern.

geolocalisation is very slow

I have a geolocation application that retrieves the current position, but displaying it in the application is VERY slow...
The constructor:
public TaskGeo()
{
InitializeComponent();
_geolocator = new Geolocator();
_geolocator.DesiredAccuracy = PositionAccuracy.High;
_geolocator.MovementThreshold = 100;
_geolocator.PositionChanged += _geolocator_PositionChanged;
_geolocator.StatusChanged += _geolocator_StatusChanged;
if (_geolocator.LocationStatus == PositionStatus.Disabled)
this.DisplayNeedGPS();
}
The code that displays the position in the app:
void _geolocator_PositionChanged(Geolocator sender, PositionChangedEventArgs args)
{
// saving and display of the position
App.RootFrame.Dispatcher.BeginInvoke(() =>
{
this._CurrentPosition = args.Position;
this.lblLon.Text = "Lon: " + this._CurrentPosition.Coordinate.Longitude;
this.lblLat.Text = "Lat: " + this._CurrentPosition.Coordinate.Latitude;
this.LocationChanged(this._CurrentPosition.Coordinate.Latitude, this._CurrentPosition.Coordinate.Longitude);
});
}
And the code for the query:
private void LocationChanged(double lat, double lon)
{
ReverseGeocodeQuery rgq = new ReverseGeocodeQuery();
rgq.GeoCoordinate = new GeoCoordinate(lat, lon);
rgq.QueryCompleted += rgq_QueryCompleted;
rgq.QueryAsync();
}
How can I improve the code so that the position is displayed faster? Thanks in advance!

Getting this sort of information is basically pretty slow. To quote the great Louis C. K. "It is going to space, give it a second". Because you've specified PositionAccuracy.High this means that the location must be found using GPS, which is comparatively slow, and not any of the faster fallback methods such as using local wi-fi or cell phone towers.
You could reduce your demands for accuracy overall or initially request a lower accuracy and then refine it once the information from the GPS is available. The second option is better. If you look at a map application they typically do this by showing you about where you are and then improving it after the GPS lock is acquired.

CRM 2011 server side paging / data parallelism using LINQ provider

The CRM 2011 LINQ provider performs paging automatically behind the scenes. Is there a way to set an upper limit on the number of records fetched when a LINQ query is executed (similar to setting PagingInfo.Count on a QueryExpression for paging)?
I have a scenario where I need to pull approximately 20K+ records for an update (no, I cannot and do not need to filter the record set down further). Ideally I'd prefer to use the Skip and Take operators, but since Count is not supported, how would you know how many records to skip and when to stop fetching more records?
Ideally I'd like to use TPL and process batches of, say, 3K or 5K records in parallel so that I can get more throughput and don't have to block. The OrganizationServiceContext is not thread-safe as far as I know. Are there any good examples that illustrate how to partition the dataset in this case, say using Parallel.For or Parallel.ForEach?
How would you partition, and would you need to use a different context object for each partition?
Thanks.
UPDATE:
Here is what I came up with:
The idea is to get the total count of records to process and use PLINQ to farm out the processing of each subset of data across tasks using a new OrganizationServiceContext object per task.
static void Main(string[] args)
{
int pagesize = 2000;
// use FetchXML aggregate functions to get total count
// Reference: http://msdn.microsoft.com/en-us/library/gg309565.aspx
int totalcount = GetTotalCount();
int totalPages = (int)Math.Ceiling((double)totalcount / (double)pagesize);
try
{
Parallel.For(0, totalPages, () => new MyOrgserviceContext(),
(pageIndex, state, ctx) =>
{
var items = ctx.myEntitySet.Skip(pageIndex * pagesize).Take(pagesize);
var itemsArray = items.ToArray();
Console.WriteLine("Page:{0} - Fetched:{1}", pageIndex, itemsArray.Length);
return ctx;
},
ctx => ctx.Dispose()
);
}
catch (AggregateException ex)
{
//handle as needed
}
}

The way I would do this is to keep querying the records using Skip and Take until I run out of records.
Check out my example below. It uses ints for simplicity, but the approach should still apply to LINQ to CRM.
So just keep performing your query, skipping the previous records and taking the ones you want for that page, then count at the end to see whether you received a full page. If you didn't, you have run out of records.
Code
List<int> ints = new List<int>()
{
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
};
int pageNumber = 0;
int recordsPerPage = 3;
while(true)
{
IEnumerable<int> page = ints.Where(i => i < 11).Skip(recordsPerPage * pageNumber).Take(recordsPerPage);
foreach(int i in page)
{
Console.WriteLine(i);
}
Console.WriteLine("end of page");
pageNumber++;
if (page.Count() < recordsPerPage)
{
break;
}
}
Output:
1
2
3
end of page
4
5
6
end of page
7
8
9
end of page
10
end of page
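For comparison, the same stop-on-a-short-page loop can be sketched in Java with streams, where skip and limit play the role of Skip and Take. The class SkipTakePagingSketch is hypothetical, and an in-memory list stands in for the remote query. Each page is materialized once into a list so that it is not evaluated a second time when its size is checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SkipTakePagingSketch {
    /** Reads pages of recordsPerPage items until a page comes back short. */
    public static List<List<Integer>> readAllPages(List<Integer> source, int recordsPerPage) {
        List<List<Integer>> pages = new ArrayList<>();
        int pageNumber = 0;
        while (true) {
            // materialize the page once, then reuse the list for the size check
            List<Integer> page = source.stream()
                    .skip((long) recordsPerPage * pageNumber)
                    .limit(recordsPerPage)
                    .collect(Collectors.toList());
            if (!page.isEmpty()) {
                pages.add(page);
            }
            pageNumber++;
            if (page.size() < recordsPerPage) {
                break; // short (or empty) page: no more records
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Integer> ints = IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());
        for (List<Integer> page : readAllPages(ints, 3)) {
            System.out.println(page);
        }
    }
}
```

When the record count is an exact multiple of the page size, the loop issues one extra query that returns an empty page before it stops. That is the cost of not knowing the total count up front.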
