Fastest string comparison using Eloquent - Excel

I'm importing data from an Excel file using Maatwebsite, and before creating a new row in my model I check whether the record already exists, to avoid duplicates. But this takes too long.
In my ProductImport.php:
public function model(array $row)
{
    $exists = Product::where('product_description', $row["product_description"])
        ->where('product_code', $row["product_code"])
        ->first();

    if ($exists) {
        return null;
    }

    ++$this->rows;

    // Autoincrement id
    return new Product([
        "product_description" => $row["art_descripcion"],
        "product_code"        => $row["cui"],
        "id_user"             => $this->id_user,
        ...
    ]);
}

public function chunkSize(): int
{
    return 1000;
}
As you can see, I'm also using chunkSize(), because there are 5000 rows per Excel file.
The problem:
The product_description is between 800 and 900 characters long (varchar(1000)), which makes the where() query very slow on each iteration.
Is there a better way to handle this? Maybe using updateOrCreate instead of searching first and then creating? Although I suspect that amounts to the same approach.
So the main problem is: how do I compare those 800-900 character strings more quickly? This search is what takes most of the time:
$exists = Product::where('product_description', $row["product_description"])
    ->where('product_code', $row["product_code"])
    ->first();
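For reference, the updateOrCreate() variant mentioned above would look roughly like the sketch below. It is only a sketch: it still issues one lookup per row against the long product_description column, so by itself it will not make the comparison faster, and unlike the original check it updates existing rows instead of skipping them.
public function model(array $row)
{
    // Sketch of the updateOrCreate() alternative: the first array is the
    // lookup (the same two columns as the where() calls above), the second
    // holds the remaining values to fill in. It persists immediately.
    $product = Product::updateOrCreate(
        [
            'product_description' => $row['product_description'],
            'product_code'        => $row['product_code'],
        ],
        [
            'id_user' => $this->id_user,
            // ... remaining columns
        ]
    );

    // Only count rows that were actually inserted
    if ($product->wasRecentlyCreated) {
        ++$this->rows;
    }

    // Nothing is left for the importer to save
    return null;
}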

Related

Distinct values in Azure Search Suggestions?

I am offloading my search feature on a relational database to Azure Search. My Products table contains columns like serialNumber, partNumber, etc. (there can be multiple serialNumbers with the same partNumber).
I want to create a suggester that can autocomplete partNumbers. But in my scenario I am getting a lot of duplicates in the suggestions, because the partNumber match was found in multiple entries.
How can I solve this problem?
The Suggest API suggests documents, not queries. If you repeat the partNumber information for each serialNumber in your index and then suggest based on partNumber, you will get a result for each matching document. You can see this more clearly by including the key field in the $select parameter. Azure Search will eliminate duplicates within the same document, but not across documents. You will have to do that on the client side, or build a secondary index of partNumbers just for suggestions.
See this forum thread for a more in-depth discussion.
Also, feel free to vote on this UserVoice item to help us prioritize improvements to Suggestions.
I'm facing this problem myself. My solution does not involve a new index (that would only get messy and cost us money).
My take on this is a while loop that adds 'UserIdentity' (in your case, 'partNumber') to a filter and re-searches until my take/top limit is met or no more suggestions exist:
public async Task<List<MachineSuggestionDTO>> SuggestMachineUser(string searchText, int take, string[] searchFields)
{
    var indexClientMachine = _searchServiceClient.Indexes.GetClient(INDEX_MACHINE);
    var suggestions = new List<MachineSuggestionDTO>();

    var sp = new SuggestParameters
    {
        UseFuzzyMatching = true,
        Top = 100 // Get the maximum number of results for a chance to reduce search calls.
    };

    // Add search fields if set
    if (searchFields != null && searchFields.Count() != 0)
    {
        sp.SearchFields = searchFields;
    }

    // Loop until we have the desired amount of suggestions, or, if there are fewer, the maximum available.
    while (suggestions.Count < take)
    {
        if (!await DistinctSuggestMachineUser(searchText, take, searchFields, suggestions, indexClientMachine, sp))
        {
            // If no more suggestions are found, break out of the while loop
            break;
        }
    }

    // Since the list might be bigger than the take, return a narrowed list
    return suggestions.Take(take).ToList();
}

private async Task<bool> DistinctSuggestMachineUser(string searchText, int take, string[] searchFields, List<MachineSuggestionDTO> suggestions, ISearchIndexClient indexClientMachine, SuggestParameters sp)
{
    var response = await indexClientMachine.Documents.SuggestAsync<MachineSearchDocument>(searchText, SUGGESTION_MACHINE, sp);

    if (response.Results.Count > 0)
    {
        // Extend the filter if the search is triggered once more
        if (!string.IsNullOrEmpty(sp.Filter))
        {
            sp.Filter += " and ";
        }

        foreach (var result in response.Results.DistinctBy(r => new { r.Document.UserIdentity, r.Document.UserName, r.Document.UserCode }).Take(take))
        {
            var d = result.Document;
            suggestions.Add(new MachineSuggestionDTO { Id = d.UserIdentity, Namn = d.UserNamn, Hkod = d.UserHkod, Intnr = d.UserIntnr });

            // Add the found UserIdentity to the filter so it is excluded next time
            sp.Filter += $"UserIdentity ne '{d.UserIdentity}' and ";
        }

        // Remove the trailing " and " in case the search runs once more
        if (sp.Filter.EndsWith(" and "))
        {
            sp.Filter = sp.Filter.Substring(0, sp.Filter.LastIndexOf(" and ", StringComparison.Ordinal));
        }
    }

    // Return false if no more suggestions are found
    return response.Results.Count > 0;
}
public async Task<List<string>> SuggestionsAsync(bool highlights, bool fuzzy, string term)
{
    SuggestParameters sp = new SuggestParameters()
    {
        UseFuzzyMatching = fuzzy,
        Top = 100
    };

    if (highlights)
    {
        sp.HighlightPreTag = "<em>";
        sp.HighlightPostTag = "</em>";
    }

    var suggestResult = await searchConfig.IndexClient.Documents.SuggestAsync(term, "mysuggestion", sp);

    // Convert the suggest query results to a list that can be displayed in the client.
    return suggestResult.Results.Select(x => x.Text).Distinct().Take(10).ToList();
}
After getting the top 100 and applying Distinct, it works for me.
You can use the Autocomplete API for that, which does the grouping by default. However, if you need more fields together with the result, such as the partNumber plus a description, it doesn't support that. The partNumber will be distinct, though.

How can I get the latest record using the FirstOrDefault() method?

Suppose I have 2 records in the database:
1) 2007-12-10 10:35:31.000
2) 2008-12-10 10:35:31.000
The FirstOrDefault() method will give me the first matching record in the sequence, 2007-12-10 10:35:31.000, but I need the latest one, which is 2008-12-10 10:35:31.000.
if ((from value in _names where value != null select value.ExpiryDate < now).Any())
{
    return _names.FirstOrDefault();
}
You can use:
return _names.LastOrDefault();
However, your if statement just sends another unnecessary query (and a wrong one, too). If you don't have any records, LastOrDefault and FirstOrDefault will return null. You can use something like this to improve the code:
var name = _names.LastOrDefault();
if (name != null)
{
    return name;
}
// other code here
If you really want to use FirstOrDefault, you should order descending, like:
var name = _names.Where(n => n.ExpiryDate < now).OrderByDescending(n => n.ExpiryDate).FirstOrDefault();

CRM 2011 server side paging / data parallelism using LINQ provider

Given that the CRM 2011 LINQ provider performs paging automatically behind the scenes, is there a way to set an upper limit on the number of records fetched when a LINQ query is executed (similar to setting PagingInfo.Count on a QueryExpression for paging)?
I have a scenario where I need approximately 20K+ records to be pulled for an update (no, I cannot and do not need to filter the record set down further). Ideally I'd prefer to use the Skip and Take operators, but since Count is not supported, how would you know how many records to skip and when to stop fetching more records?
Ideally I'd like to use the TPL and process batches of, say, 3K or 5K records in parallel so that I get more throughput and don't have to block. The OrganizationServiceContext is not thread safe, from what I know. Are there any good examples that illustrate how to partition the dataset in this case, say using Parallel.For or Parallel.ForEach?
How would you partition it, and would you need to use a different context object for each partition?
Thanks.
UPDATE:
Here is what I came up with:
The idea is to get the total count of records to process and then use Parallel.For to farm out the processing of each subset of data across tasks, using a new OrganizationServiceContext object per task.
static void Main(string[] args)
{
    int pagesize = 2000;

    // Use FetchXML aggregate functions to get the total count
    // Reference: http://msdn.microsoft.com/en-us/library/gg309565.aspx
    int totalcount = GetTotalCount();
    int totalPages = (int)Math.Ceiling((double)totalcount / (double)pagesize);

    try
    {
        Parallel.For(0, totalPages, () => new MyOrgserviceContext(),
            (pageIndex, state, ctx) =>
            {
                // pageIndex runs from 0 to totalPages - 1, so skip pageIndex * pagesize records
                var items = ctx.myEntitySet.Skip(pageIndex * pagesize).Take(pagesize);
                var itemsArray = items.ToArray();
                Console.WriteLine("Page:{0} - Fetched:{1}", pageIndex, itemsArray.Length);
                return ctx;
            },
            ctx => ctx.Dispose()
        );
    }
    catch (AggregateException ex)
    {
        // handle as needed
    }
}
So the way I would do this is to keep querying the records using Skip and Take until I run out of records.
Check out my example below; it uses ints for simplicity, but the approach should still apply to LINQ-to-CRM.
Just keep performing your query, skipping previous records and taking the ones you want for that page, then counting at the end to see if you received a full page. If you didn't, you have run out of records.
Code
List<int> ints = new List<int>()
{
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
};
int pageNumber = 0;
int recordsPerPage = 3;
while (true)
{
    IEnumerable<int> page = ints.Where(i => i < 11).Skip(recordsPerPage * pageNumber).Take(recordsPerPage);

    foreach (int i in page)
    {
        Console.WriteLine(i);
    }

    Console.WriteLine("end of page");
    pageNumber++;

    if (page.Count() < recordsPerPage)
    {
        break;
    }
}
Output:
1
2
3
end of page
4
5
6
end of page
7
8
9
end of page
10
end of page

Multi insert in Kohana ORM 3

In my application I have a loop that executes about 1000 times; inside it I'm creating an object and saving it. This is the part of the application where I populate my database with data. In general it looks like this:
foreach (...) {
    ...
    try {
        $object = new Model_Whatever;
        $object->whatever = $whatever;
        $object->save();
    } catch (Exception $e) {
        ...
    }
}
This produces 1000 INSERT queries. Is it possible, in some way, to make Kohana produce multi-row inserts, splitting this into 10 inserts with 100 data sets each? Is it possible, and if so, how would I do it?
Whilst the Kohana ORM doesn't support multi inserts, you can still use the query builder as follows:
$query = DB::insert('tablename', array('column1', 'column2', 'column3'));

foreach ($data as $d) {
    $query->values($d);
}

try {
    $result = $query->execute();
} catch (Database_Exception $e) {
    echo $e->getMessage();
}
You'll still need to split the data up so the above doesn't try to execute a query with 1000 inserts (a sketch of that splitting follows below).
$data is assumed to be an array of arrays, with the values corresponding to the order of the columns.
thanks Isaiah in #kohana
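A minimal sketch of that splitting, assuming $data is the array of row arrays described above and reusing the same query-builder calls:
// Run one multi-row INSERT per batch of 100 rows
// instead of 1000 single-row INSERTs.
foreach (array_chunk($data, 100) as $chunk) {
    $query = DB::insert('tablename', array('column1', 'column2', 'column3'));

    foreach ($chunk as $row) {
        $query->values($row);
    }

    try {
        $query->execute();
    } catch (Database_Exception $e) {
        echo $e->getMessage();
    }
}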
PHP is very slow when inserting a very big multi-row array (because the ::values() method uses array_merge), so this is faster:
class Database_Query_Builder_Bath_Insert extends Database_Query_Builder_Insert {

    public static function doExecute($table, $data)
    {
        $insertQuery = DB::insert($table, array_keys(current($data)));

        // Assign all of the rows at once instead of calling values() per row
        $insertQuery->_values = $data;

        $insertQuery->execute();
    }
}
Alternatively, pass all of the rows to values() in a single call:
call_user_func_array([$query, 'values'], $data);

Select top/latest 10 in CouchDB?

How would I execute a query equivalent to "select top 10" in CouchDB?
For example, I have a "schema" like so:
title, body, modified
and I want to select the last 10 modified documents.
As an added bonus, if anyone can come up with a way to do the same but per category. So for:
title, category, body, modified
return a list of the latest 10 documents in each category.
I am just wondering if such a query is possible in CouchDB.
To get the first 10 documents from your db you can use the limit query option.
E.g. by calling
http://localhost:5984/yourdb/_design/design_doc/_view/view_name?limit=10
you get the first 10 documents.
View rows are sorted by key; adding descending=true to the query string will reverse their order. You can also return only the documents you are interested in by using the query string to select the keys you want.
So in your view you write your map function like:
function(doc) {
    emit([doc.category, doc.modified], doc);
}
And you query it like this:
http://localhost:5984/yourdb/_design/design_doc/_view/view_name?startkey=["youcategory"]&endkey=["youcategory", date_in_the_future]&limit=10&descending=true
Here is what you need to do.
Map function
function(doc)
{
    if (doc.category)
    {
        emit(['category', doc.category], doc.modified);
    }
}
Then you need a list function that groups them. You might be tempted to abuse a reduce to do this, but it will probably throw errors about not reducing fast enough with large sets of data.
function(head, req)
{
    // this sort function assumes that modified is a number
    // and it sorts in descending order
    function sortCategory(a, b) { b.value - a.value; }

    var categories = {};
    var row;

    while (row = getRow())
    {
        if (!categories[row.key[0]])
        {
            categories[row.key[0]] = [];
        }
        categories[row.key[0]].push(row);
    }

    for (var cat in categories)
    {
        categories[cat].sort(sortCategory);
        categories[cat] = categories[cat].slice(0, 10);
    }

    send(toJSON(categories));
}
You can now get the top 10 of all categories with
http://localhost:5984/database/_design/doc/_list/top_ten/by_categories
and get the docs with
http://localhost:5984/database/_design/doc/_list/top_ten/by_categories?include_docs=true
Now you can query this with a POST of multiple keys and limit which categories are returned:
curl -X POST http://localhost:5984/database/_design/doc/_list/top_ten/by_categories -d '{"keys":[["category1"],["category2"],["category3"]]}'
You could also avoid hard-coding the 10 and instead pass the number in through the req variable.
Here is some more View/List trickery.
Slight correction: it was not sorting until I added the "return" keyword in your sortCategory function. It should be like this:
function sortCategory(a,b) { return b.value - a.value; }
