Given that the CRM 2011 LINQ provider performs paging automatically behind the scenes,
is there a way to set an upper limit on the number of records fetched when a LINQ
query is executed (similar to setting PagingInfo.Count on a QueryExpression for paging)?
I have a scenario where I need to pull approximately 20K+ records for an update (no, I cannot and do not need to filter the record set down further). Ideally I'd prefer to use the Skip and Take operators, but since Count is not supported, how would you know how many records to skip and when
to stop fetching more records?
Ideally I'd like to use the TPL and process batches of, say, 3K or 5K records in parallel so that I get more throughput and don't have to block. As far as I know, the OrganizationServiceContext is not thread safe. Are there any good examples that illustrate how to partition the dataset in this case, say using Parallel.For or Parallel.ForEach?
How would you partition, and would you need to use a different context object for each partition?
Thanks.
UPDATE:
Here is what I came up with:
The idea is to get the total count of records to process and use the TPL (Parallel.For) to farm out the processing of each subset of data across tasks, using a new OrganizationServiceContext object per task.
static void Main(string[] args)
{
int pagesize = 2000;
// use FetchXML aggregate functions to get total count
// Reference: http://msdn.microsoft.com/en-us/library/gg309565.aspx
int totalcount = GetTotalCount();
int totalPages = (int)Math.Ceiling((double)totalcount / (double)pagesize);
try
{
Parallel.For(0, totalPages, () => new MyOrgserviceContext(),
(pageIndex, state, ctx) =>
{
// pageIndex is zero-based here, so the first page must skip 0 records
var items = ctx.myEntitySet.Skip(pageIndex * pagesize).Take(pagesize);
var itemsArray = items.ToArray();
Console.WriteLine("Page:{0} - Fetched:{1}", pageIndex, itemsArray.Length);
return ctx;
},
ctx => ctx.Dispose()
);
}
catch (AggregateException ex)
{
//handle as needed
}
}
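The page arithmetic used above can be checked in isolation. The following is a minimal sketch in Java, not CRM code: the class and method names are hypothetical, and the record count of 20,500 is an assumed figure for illustration. It shows ceiling division for the page count and the skip offset for a zero-based page index.

```java
public class PagePartitioner {
    /** Number of pages needed to cover totalCount records (ceiling division). */
    static int totalPages(int totalCount, int pageSize) {
        return (totalCount + pageSize - 1) / pageSize;
    }

    /** Number of records to skip before the given zero-based page index. */
    static int skipFor(int pageIndex, int pageSize) {
        return pageIndex * pageSize;
    }

    public static void main(String[] args) {
        int totalCount = 20_500; // assumed total, e.g. from an aggregate count query
        int pageSize = 2_000;
        int pages = totalPages(totalCount, pageSize);
        for (int i = 0; i < pages; i++) {
            int skip = skipFor(i, pageSize);
            // the last page may be shorter than pageSize
            int take = Math.min(pageSize, totalCount - skip);
            System.out.printf("page %d: skip=%d take=%d%n", i, skip, take);
        }
    }
}
```

Each (skip, take) pair is independent of the others, which is what makes it safe to hand one pair to each parallel task.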
The way I would do this is to keep querying the records using Skip and Take until I run out of records.
Check out my example below. It uses ints for simplicity, but the approach should still apply to LINQ to CRM.
Just keep performing your query, skipping the previous records and taking the ones you want for that page, then count at the end to see whether you received a full page. If you didn't, you have run out of records.
Code:
List<int> ints = new List<int>()
{
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
};
int pageNumber = 0;
int recordsPerPage = 3;
while(true)
{
IEnumerable<int> page = ints.Where(i => i < 11).Skip(recordsPerPage * pageNumber).Take(recordsPerPage);
foreach(int i in page)
{
Console.WriteLine(i);
}
Console.WriteLine("end of page");
pageNumber++;
if (page.Count() < recordsPerPage)
{
break;
}
}
Output:
1
2
3
end of page
4
5
6
end of page
7
8
9
end of page
10
end of page
While crawling, I saw the following output:
Generator: number of items rejected during selection:
Generator: 67 HOSTS_AFFECTED_PER_HOST_OVERFLOW
Generator: 3 MALFORMED_URL
Generator: 399054 SCHEDULE_REJECTED
Generator: 5892 URLS_SKIPPED_PER_HOST_OVERFLOW
I understand 67 HOSTS_AFFECTED_PER_HOST_OVERFLOW and 3 MALFORMED_URL.
I do not understand what 399054 SCHEDULE_REJECTED and 5892 URLS_SKIPPED_PER_HOST_OVERFLOW mean.
Can anyone explain what they mean?
The Generator phase has several counters that track the URLs filtered or skipped in the Generator MapReduce phase.
SCHEDULE_REJECTED
if (!schedule.shouldFetch(url, crawlDatum, curTime)) {
context.getCounter("Generator", "SCHEDULE_REJECTED").increment(1);
return;
}
As per the property defined in nutch-site.xml, the default schedule is DefaultFetchSchedule:
db.fetch.schedule.class = org.apache.nutch.crawl.DefaultFetchSchedule
The shouldFetch method in AbstractFetchSchedule decides whether or not to let a URL through to the Fetcher phase right now.
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
// pages are never truly GONE - we have to check them from time to time.
// pages with too long a fetchInterval are adjusted so that they fit within
// a maximum fetchInterval (segment retention period).
if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
if (datum.getFetchInterval() > maxInterval) {
datum.setFetchInterval(maxInterval * 0.9f);
}
datum.setFetchTime(curTime);
}
if (datum.getFetchTime() > curTime) {
return false; // not time yet
}
return true;
}
The logic above says that a URL fetched in an earlier iteration can be fetched again in an upcoming iteration once its fetchTime has passed. The fetch interval is set by db.fetch.interval.default, whose default value is 30 days.
shouldFetch makes sure that a URL, once successfully fetched, is retried only after 30 days; until then it is rejected in the generator and counted as SCHEDULE_REJECTED.
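To see the decision in action, here is a minimal standalone sketch that mirrors the shouldFetch logic quoted above without depending on Nutch. The Datum class is a hypothetical stand-in for CrawlDatum (times in milliseconds, interval in seconds), and the 90-day maxInterval is an assumed value chosen for the example.

```java
public class FetchScheduleSketch {
    /** Hypothetical stand-in for CrawlDatum: fetchTime in ms, fetchInterval in seconds. */
    static final class Datum {
        long fetchTime;
        float fetchInterval;

        Datum(long fetchTime, float fetchInterval) {
            this.fetchTime = fetchTime;
            this.fetchInterval = fetchInterval;
        }
    }

    /** Same decision as the quoted shouldFetch: false means "not due yet". */
    static boolean shouldFetch(Datum datum, long curTime, int maxInterval) {
        // A fetch time further away than maxInterval is pulled back to now.
        if (datum.fetchTime - curTime > (long) maxInterval * 1000) {
            if (datum.fetchInterval > maxInterval) {
                datum.fetchInterval = maxInterval * 0.9f;
            }
            datum.fetchTime = curTime;
        }
        return datum.fetchTime <= curTime;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 100 * day;                 // arbitrary "current time"
        int maxInterval = 90 * 24 * 60 * 60;  // assumed: 90 days, in seconds
        float thirtyDays = 30 * 24 * 60 * 60f;

        Datum fetchedYesterday = new Datum(now + 29 * day, thirtyDays);
        Datum overdue = new Datum(now - day, thirtyDays);

        System.out.println(shouldFetch(fetchedYesterday, now, maxInterval)); // rejected
        System.out.println(shouldFetch(overdue, now, maxInterval));          // due
    }
}
```

In a crawl that is re-run soon after the previous round, most URLs look like fetchedYesterday, which would explain a large SCHEDULE_REJECTED count.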
WAIT_FOR_UPDATE (the default wait is 7 days)
This counter only makes sense when generate.update.crawldb=true is enabled; otherwise it has no meaning.
It is intended for highly concurrent
environments, where several generate/fetch/update cycles may overlap.
Setting the property to true ensures that generate creates different
fetchlists, and crawl.gen.delay is used to keep those fetchlists distinct.
crawl.gen.delay defines how long items that have already been generated are blocked (default is 7 days).
LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData()
.get(Nutch.WRITABLE_GENERATE_TIME_KEY);
if (oldGenTime != null) { // awaiting fetch & update
if (oldGenTime.get() + genDelay > curTime) { // still waiting for update
context.getCounter("Generator", "WAIT_FOR_UPDATE").increment(1);
return;
}
}
MALFORMED_URL: This counter tracks URLs that do not have proper URL syntax or that have a URL encoding issue.
HOSTS_AFFECTED_PER_HOST_OVERFLOW/URLS_SKIPPED_PER_HOST_OVERFLOW :
if (maxCount > 0) {
int[] hostCount = hostCounts.get(hostordomain);
if (hostCount == null) {
hostCount = new int[]{1, 0};
hostCounts.put(hostordomain, hostCount);
}
// increment hostCount
hostCount[1]++;
// check if topN reached, select next segment if it is
while (segCounts[hostCount[0] - 1] >= limit
&& hostCount[0] < maxNumSegments) {
hostCount[0]++;
hostCount[1] = 0;
}
// reached the limit of allowed URLs per host / domain
// see if we can put it in the next segment?
if (hostCount[1] > maxCount) {
if (hostCount[0] < maxNumSegments) {
hostCount[0]++;
hostCount[1] = 1;
} else {
if (hostCount[1] == (maxCount + 1)) {
context
.getCounter("Generator", "HOSTS_AFFECTED_PER_HOST_OVERFLOW")
.increment(1);
LOG.info(
"Host or domain {} has more than {} URLs for all {} segments. Additional URLs won't be included in the fetchlist.",
hostordomain, maxCount, maxNumSegments);
}
// skip this entry
context.getCounter("Generator", "URLS_SKIPPED_PER_HOST_OVERFLOW")
.increment(1);
continue;
}
}
entry.segnum = new IntWritable(hostCount[0]);
segCounts[hostCount[0] - 1]++;
} else {
entry.segnum = new IntWritable(currentsegmentnum);
segCounts[currentsegmentnum - 1]++;
}
As per the code above, the hostCounts map tracks <domain, [segmentNumber, urlCount]>.
The check hostCount[1] == (maxCount + 1) is only reached once hostCount[0] has hit maxNumSegments, so it is true exactly once per host or domain: at the first URL that exceeds the per-domain threshold after all the segments are full. That event is counted in HOSTS_AFFECTED_PER_HOST_OVERFLOW.
HOSTS_AFFECTED_PER_HOST_OVERFLOW
It counts the hosts/domains that had more URLs than could be allocated across all the segments.
URLS_SKIPPED_PER_HOST_OVERFLOW counts every URL that was skipped because its host/domain had no room left in any segment.
Other counters such as INTERVAL_REJECTED, SCORE_TOO_LOW and STATUS_REJECTED are self-explanatory; for more detail, check the Generator code.
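The difference between the two overflow counters can be illustrated with a small simulation. This is a simplified sketch, not the Nutch code: it keeps only the per-host bookkeeping from the snippet above and ignores the per-segment topN limit (segCounts); the class name and the sample hosts are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostOverflowSketch {
    /** Returns {hostsAffected, urlsSkipped} for a stream of URLs given by their host. */
    static int[] countOverflow(List<String> hosts, int maxCount, int maxNumSegments) {
        // host -> {current segment number, URLs placed in that segment}
        Map<String, int[]> hostCounts = new HashMap<>();
        int hostsAffected = 0;
        int urlsSkipped = 0;
        for (String host : hosts) {
            int[] hostCount = hostCounts.computeIfAbsent(host, h -> new int[] {1, 0});
            hostCount[1]++;
            if (hostCount[1] > maxCount) {
                if (hostCount[0] < maxNumSegments) {
                    // room in a later segment: move on and start counting again
                    hostCount[0]++;
                    hostCount[1] = 1;
                } else {
                    // first overflowing URL of this host: count the host once
                    if (hostCount[1] == maxCount + 1) {
                        hostsAffected++;
                    }
                    // every overflowing URL is counted as skipped
                    urlsSkipped++;
                }
            }
        }
        return new int[] {hostsAffected, urlsSkipped};
    }

    public static void main(String[] args) {
        List<String> hosts = List.of("a.com", "a.com", "a.com", "a.com", "a.com", "b.com", "b.com");
        int[] result = countOverflow(hosts, 2, 1);
        System.out.println("hosts affected: " + result[0] + ", urls skipped: " + result[1]);
    }
}
```

With a limit of 2 URLs per host and a single segment, a.com overflows once as a host but loses three URLs, which is why the URL counter is normally much larger than the host counter.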
I couldn't find any way to create a pipeline in hazelcast-jet-kafka that limits the throughput to a specific number of elements per time unit. Could anybody suggest possible solutions? I know that Alpakka (https://doc.akka.io/docs/alpakka-kafka/current/) has such functionality.
You can define this function:
private <T, S extends GeneralStage<T>> FunctionEx<S, S> throttle(int itemsPerSecond) {
// context for the mapUsingService stage
class Service {
final int ratePerSecond;
final TreeMap<Long, Long> counts = new TreeMap<>();
public Service(int ratePerSecond) {
this.ratePerSecond = ratePerSecond;
}
}
// factory for the service
ServiceFactory<?, Service> serviceFactory = ServiceFactories
.nonSharedService(procCtx ->
// divide the count for the actual number of processors we have
new Service(Math.max(1, itemsPerSecond / procCtx.totalParallelism())))
// non-cooperative is needed because we sleep in the mapping function
.toNonCooperative();
return stage -> (S) stage
.mapUsingService(serviceFactory,
(ctx, item) -> {
// current time in 10ths of a second
long now = System.nanoTime() / 100_000_000;
// include this item in the counts
ctx.counts.merge(now, 1L, Long::sum);
// clear items emitted more than second ago
ctx.counts.headMap(now - 10, true).clear();
long countInLastSecond =
ctx.counts.values().stream().mapToLong(Long::longValue).sum();
// if we emitted too many items, sleep a while
if (countInLastSecond > ctx.ratePerSecond) {
Thread.sleep(
(countInLastSecond - ctx.ratePerSecond) * 1000/ctx.ratePerSecond);
}
// now we can pass the item on
return item;
}
);
}
Then use it to throttle in the pipeline:
Pipeline p = Pipeline.create();
p.readFrom(TestSources.items(IntStream.range(0, 2_000).boxed().toArray(Integer[]::new)))
.apply(throttle(100))
.writeTo(Sinks.noop());
The above job will take about 20 seconds to complete because it has 2000 items and the rate is limited to 100 items/s. The rate is evaluated over the last second, so if there are fewer than 100 items/s, items are forwarded immediately. If 101 items arrive within one millisecond, 100 are forwarded immediately and the next one after a sleep.
Also make sure that your source is distributed. The rate is divided by the number of processors in the cluster, so if your source isn't distributed and some members don't see any data, your overall rate will be only a fraction of the desired rate.
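The two formulas inside throttle can be pulled out and checked on their own. The sketch below is plain Java with no Jet dependency; the class and method names are hypothetical, and the parallelism of 8 is an assumed figure.

```java
public class ThrottleMath {
    /** Share of the global rate given to each parallel processor (at least 1). */
    static int perProcessorRate(int itemsPerSecond, int totalParallelism) {
        return Math.max(1, itemsPerSecond / totalParallelism);
    }

    /** Sleep time in ms once more than ratePerSecond items were seen in the last second. */
    static long sleepMillis(long countInLastSecond, int ratePerSecond) {
        if (countInLastSecond <= ratePerSecond) {
            return 0; // under the limit: forward immediately
        }
        return (countInLastSecond - ratePerSecond) * 1000 / ratePerSecond;
    }

    public static void main(String[] args) {
        int rate = perProcessorRate(100, 8); // assumed total parallelism of 8
        System.out.println("per-processor rate: " + rate);
        System.out.println("sleep for 1 excess item at 100/s: " + sleepMillis(101, 100) + " ms");
    }
}
```

If only one of those 8 processors actually receives data, the job runs at that processor's share rather than at 100 items/s, which is the effect described above for a source that is not distributed.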
I am offloading the search feature of my relational database to Azure Search. My Products table contains columns like serialNumber, partNumber, etc. (there can be multiple serialNumbers with the same partNumber).
I want to create a suggester that can autocomplete partNumbers. But in my scenario I am getting a lot of duplicates in the suggestions because the partNumber match is found in multiple entries.
How can I solve this problem?
The Suggest API suggests documents, not queries. If you repeat the partNumber information for each serialNumber in your index and then suggest based on partNumber, you will get a result for each matching document. You can see this more clearly by including the key field in the $select parameter. Azure Search will eliminate duplicates within the same document, but not across documents. You will have to do that on the client side, or build a secondary index of partNumbers just for suggestions.
See this forum thread for a more in-depth discussion.
Also, feel free to vote on this UserVoice item to help us prioritize improvements to Suggestions.
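The client-side option amounts to over-fetching suggestions and keeping the first occurrence of each text. A minimal sketch of that step, independent of any Azure SDK (the class name and sample part numbers are made up):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class SuggestionDedup {
    /** Keeps the first occurrence of each suggestion, in order, up to limit entries. */
    static List<String> distinctTop(List<String> suggestions, int limit) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (String s : suggestions) {
            if (seen.size() >= limit) {
                break;
            }
            seen.add(s); // duplicates are ignored by the set
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // One suggestion per matching document, so part numbers repeat.
        List<String> raw = List.of("AB-100", "AB-100", "AB-200", "AB-100", "AB-300");
        System.out.println(distinctTop(raw, 2));
    }
}
```

The trade-off is that the service has to return more results than are displayed, since an unknown number of them will collapse into duplicates.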
I'm facing this problem myself. My solution does not involve a new index (that would only get messy and cost us money).
My take on this is a while loop that adds 'UserIdentity' (in your case, 'partNumber') to a filter and searches again until my take/top limit is met or no more suggestions exist:
public async Task<List<MachineSuggestionDTO>> SuggestMachineUser(string searchText, int take, string[] searchFields)
{
var indexClientMachine = _searchServiceClient.Indexes.GetClient(INDEX_MACHINE);
var suggestions = new List<MachineSuggestionDTO>();
var sp = new SuggestParameters
{
UseFuzzyMatching = true,
Top = 100 // Get maximum result for a chance to reduce search calls.
};
// Add searchfields if set
if (searchFields != null && searchFields.Count() != 0)
{
sp.SearchFields = searchFields;
}
// Loop until you have the desired amount of suggestions, or as many as exist if there are fewer.
while (suggestions.Count < take)
{
if (!await DistinctSuggestMachineUser(searchText, take, searchFields, suggestions, indexClientMachine, sp))
{
// If no more suggestions are found, we break out of the while loop
break;
}
}
// Since the list might be bigger than take, we return a narrowed list
return suggestions.Take(take).ToList();
}
private async Task<bool> DistinctSuggestMachineUser(string searchText, int take, string[] searchFields, List<MachineSuggestionDTO> suggestions, ISearchIndexClient indexClientMachine, SuggestParameters sp)
{
var response = await indexClientMachine.Documents.SuggestAsync<MachineSearchDocument>(searchText, SUGGESTION_MACHINE, sp);
if(response.Results.Count > 0){
// Fix filter if search is triggered once more
if (!string.IsNullOrEmpty(sp.Filter))
{
sp.Filter += " and ";
}
foreach (var result in response.Results.DistinctBy(r => new { r.Document.UserIdentity, r.Document.UserName, r.Document.UserCode}).Take(take))
{
var d = result.Document;
suggestions.Add(new MachineSuggestionDTO { Id = d.UserIdentity, Namn = d.UserNamn, Hkod = d.UserHkod, Intnr = d.UserIntnr });
// Add found UserIdentity to filter
sp.Filter += $"UserIdentity ne '{d.UserIdentity}' and ";
}
// Remove end of filter if it is run once more
if (sp.Filter.EndsWith(" and "))
{
sp.Filter = sp.Filter.Substring(0, sp.Filter.LastIndexOf(" and ", StringComparison.Ordinal));
}
}
// Returns false if no more suggestions are found
return response.Results.Count > 0;
}
public async Task<List<string>> SuggestionsAsync(bool highlights, bool fuzzy, string term)
{
SuggestParameters sp = new SuggestParameters()
{
UseFuzzyMatching = fuzzy,
Top = 100
};
if (highlights)
{
sp.HighlightPreTag = "<em>";
sp.HighlightPostTag = "</em>";
}
var suggestResult = await searchConfig.IndexClient.Documents.SuggestAsync(term, "mysuggestion", sp);
// Convert the suggest query results to a list that can be displayed in the client.
return suggestResult.Results.Select(x => x.Text).Distinct().Take(10).ToList();
}
Fetching the top 100 and then applying Distinct works for me.
You can use the Autocomplete API for that, which does the grouping by default. However, if you need more fields together with the result, such as the partNo plus a description, it doesn't support that. The partNo will be distinct, though.
1. Hi SO, I have created a class for fetching a user's tweets from Twitter by screen name. My problem is that I'm getting "rate limit exceeded" very frequently.
2. I created a table for screen names, in which I save all the screen names, and
3. I created another table to store the users' tweets.
Below is my code:
public List<TwitterProfileDetails> GetAllTweets(Func<SingleUserAuthorizer> AuthenticateCredentials,string screenname)
{
List<TwitterProfileDetails> lstofTweets = new List<TwitterProfileDetails>();
TwitterProfileDetails details = new TwitterProfileDetails();
var twitterCtx = new LinqToTwitter.TwitterContext(AuthenticateCredentials());
var helpResult =
(from help in twitterCtx.Help
where help.Type == HelpType.RateLimits &&
help.Resources == "search,users,socialgraph"
select help)
.SingleOrDefault();
foreach (var category in helpResult.RateLimits)
{
Console.WriteLine("\nCategory: {0}", category.Key);
foreach (var limit in category.Value)
{
Console.WriteLine(
"\n Resource: {0}\n Remaining: {1}\n Reset: {2}\n Limit: {3}",
limit.Resource, limit.Remaining, limit.Reset, limit.Limit);
}
}
var tweets = from t in twitterCtx.Status
where t.Type == StatusType.User && t.ScreenName == screenname && t.Count == 15
select t;
if (tweets != null)
{
foreach (var tweetStatus in tweets)
{
if (tweetStatus != null)
{
lstofTweets.Add(new TwitterProfileDetails { Name = tweetStatus.User.Name, ProfileImagePath = tweetStatus.User.ProfileImageUrl, Tweets = tweetStatus.Text, UserID = tweetStatus.User.Identifier.UserID, PostedDate = Convert.ToDateTime(tweetStatus.CreatedAt), ScreenName = screenname });
}
}
}
return lstofTweets;
}
I am using the above method as shown below:
foreach (var screenObj in screenName)
{
var getTweets = api.GetAllTweets(api.AuthenticateCredentials, screenObj.UserName);
foreach (var obj in getTweets)
{
using (var context = new DBContext())
{
tweets.Name = obj.Name;
tweets.ProfileImage = obj.ProfileImagePath;
tweets.PostedOn = obj.PostedDate;
tweets.Tweets = obj.Tweets;
tweets.CreatedOn = DateTime.Now;
tweets.ModifiedOn = DateTime.Now;
tweets.Status = EntityStatus.Active;
tweets.ScreenName = obj.ScreenName;
var exist = context.UserTweets.Any(user => user.Tweets.Equals(obj.Tweets));
if (!exist)
context.UserTweets.Add(tweets);
context.SaveChanges();
}
}
}
I see that you found the Help/RateLimits query. There are various approaches you can take. e.g. add a delay between queries, delay the next query if the limit has been exceeded, or catch the exception and delay until the next 15 minute window.
If you want to monitor interactively, you can watch the rate limit for each query. The TwitterContext instance you use for performing the query contains RateLimitXxx properties that populate after every query. You'll need to read those values after the query, which appears to be inside your GetAllTweets method. You have to expose those values to your loop somehow, via return object, out params, static field, or whatever logic you feel is necessary.
// the first time through, you have the whole rate limit for the 15 minute window
foreach (var screenObj in screenName)
{
var getTweets = api.GetAllTweets(api.AuthenticateCredentials, screenObj.UserName);
// your processing logic ...
// assuming you have the RateLimitXxx values in scope
if (rateLimitRemaining == 0)
Thread.Sleep(CalculateRemainingMilliseconds(RateLimitReset));
}
RateLimitRemaining is how many queries you can do in the current 15 minute window and RateLimitReset is the number of epoch seconds remaining until the rate limit resets (when you can start querying again).
It would be helpful to review the Twitter docs on Rate Limiting.
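The CalculateRemainingMilliseconds helper in the loop above is left to the reader. One way to write it is sketched here in Java; the class and method names are hypothetical, and the sketch assumes the reset value is a UTC epoch-seconds timestamp marking when the window resets. If the value you have is already a remaining duration, no subtraction is needed.

```java
import java.time.Duration;
import java.time.Instant;

public class RateLimitWait {
    /** Milliseconds to wait until the reset instant; never negative. */
    static long millisUntilReset(long resetEpochSeconds, Instant now) {
        Instant reset = Instant.ofEpochSecond(resetEpochSeconds);
        long millis = Duration.between(now, reset).toMillis();
        return Math.max(0, millis); // reset already passed: no wait needed
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        long resetInTenMinutes = now.getEpochSecond() + 600; // made-up reset time
        System.out.println("wait ms: " + millisUntilReset(resetInTenMinutes, now));
    }
}
```

Passing the current time in as a parameter keeps the calculation easy to test and avoids reading the clock twice.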
For reference, here are a couple other questions that might provide more ideas:
Twitter rate limiting
Get all followers using LINQ to Twitter
I'm really engaged with SubSonic, but I'm not sure how to make it work with paging.
I mean, how can I get "the page" as a list, or what is the best way to work through
the whole table in my database, page by page?
You'll see I tried three things
(m02colegio is a class generated by ActiveRecord):
IList<m02colegio> loscolegios;
loscolegios = m02colegio.GetPaged(0, 80).ToList();
----------- and:
SubSonic.Schema.PagedList<m02colegio> loscolegios;
loscolegios = m02colegio.GetPaged(0, 80);
----------- and:
var paged = m02colegio.GetPaged(0,80).All<m02colegio>(x=>x.m02ccolnom.Contains(" "));
// because I don't know how to tell it to consider all records
loscolegios = m02colegio.All().ToList();
But on every try I get no exception, and loscolegios is always NULL.
I need to access the records in this manner,
so what is the best way?
How can I get the first page, and then how do I advance through the pages?
public ActionResult Index(int? page)
{
if (!validateInt(page.ToString()))
page = 0;
else
page = page - 1;
if (page < 0) page = 0;
const int pagesize = 9;
IQueryable<m02colegio> Mym02colegio = m02colegio.All().Where(x => x.category == "test").OrderBy(x => x.id);
// round up so that a final partial page is counted too
ViewData["numpages"] = (m02colegio.All().Where(x => x.category == "test").Count() + pagesize - 1) / pagesize;
ViewData["curpage"] = page;
return View(new PagedList<m02colegio>(Mym02colegio, page ?? 0, pagesize));
}
That is in an MVC context, but it gives you the idea: Index accepts null or a page number,
you query all the records, then return a PagedList of the records you got.
I'm not sure if this is a bug that's been fixed in the current GitHub source or if it's by design, but I've found that GetPaged only works with a 1-based index for the first argument. So if you do the following, you should find it works as you'd expect:
IList<m02colegio> loscolegios = m02colegio.GetPaged(1, 80);