Efficiently validating large list of objects - c#-4.0

I have a function that is meant to remove items from a Collection if a certain field does not pass a validation check (either email or phone, but that's not important in this context). Problem is that a regular expression is relatively slow, and I have lists of 1 million+ items.
My function
public HashSet<ListItemModel> RemoveInvalid(HashSet<ListItemModel> listItems)
{
string pattern = (this.phoneOrEmail == "email")//phoneOrEmail is set via config file
?
//RFC 5322 compliant email regex. see http://www.regular-expressions.info/email.html
#"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
:
//north-american phone number regex. see http://stackoverflow.com/questions/12101125/regex-to-allow-only-digits-hypens-space-parentheses-and-should-end-with-a-dig
#"(?:\d{3}(?:\d{7}|\-\d{3}\-\d{4}))|(?:\(\d{3}\)(?:\-\d{3}\-)|(?: \d{3} )\d{4})";
Regex re = new Regex(pattern);
if (phoneOrEmail == "email")
{
return new HashSet<ListItemModel>(listItems.Where(x => re.IsMatch(x.Email,0)));
}
else
{
return new HashSet<ListItemModel>(listItems.Where(x => re.IsMatch(x.Tel, 0)));
}
}
This takes way too long to execute. Is there a faster way of returning a subset that contains only valid emails/phone numbers?
I need to come up with something that is lightning quick. My other operations usually take only a couple of seconds on 700k+ items, but this method is taking forever and I hate that. I will be experimenting with a series of LINQ .Contains(x,y,z) checks, but in the meantime, I'd like some input from people who are smarter than me.

Related

Transforming large array of objects to csv using json2csv

I need to transform a large array of JSON (that can have over 100k positions) into a CSV.
This array is created directly in the application, it's not the result of an uploaded file.
Looking at the documentation, I've thought on using parser but it says that:
For that reason is rarely a good reason to use it until your data is very small or your application doesn't do anything else.
Because the data is not small and my app will do other things than creating the csv, I don't think it'll be the best approach but I may be misunderstanding the documentation.
Is it possible to use the others options (async parser or transform) with an already created data (and not a stream of data)?
FYI: It's a nest application but I'm using this node.js lib.
Update: I've tryied to insert with an array with over 300k positions, and it went smoothly.
Why do you need any external modules?
Converting JSON into a javascript array of javascript objects is a piece of cake with the native JSON.parse() function.
let jsontxt=await fs.readFile('mythings.json','uft8');
let mythings = JSON.parse(jsontxt);
if (!Array.isArray(mythings)) throw "Oooops, stranger things happen!"
And, then, converting a javascript array into a CSV is very straightforward.
The most obvious and absurd case is just mapping every element of the array into a string that is the JSON representation of the object element. You end up with a useless CSV with a single column containing every element of your original array. And then joining the resulting strings array into a single string, separated by newlines \n. It's good for nothing but, heck, it's a CSV!
let csvtxt = mythings.map(JSON.stringify).join("\n");
await fs.writeFile("mythings.csv",csvtxt,"utf8");
Now, you can feel that you are almost there. Replace the useless mapping function into your own
let csvtxt = mythings.map(mapElementToColumns).join("\n");
and choose a good mapping between the fields of the objects of your array, and the columns of your csv.
function mapElementToColumns(element) {
return `${JSON.stringify(element.id)},${JSON.stringify(element.name)},${JSON.stringify(element.value)}`;
}
or, in a more thorough way
function mapElementToColumns(fieldNames) {
return function (element) {
let fields = fieldnames.map(n => element[n] ? JSON.stringify(element[n]) : '""');
return fields.join(',');
}
}
that you may invoke in your map
mythings.map(mapElementToColumns(["id","name","element"])).join("\n");
Finally, you might decide to use an automated for "all fields in all objects" approach; which requires that all the objects in the original array maintain a similar fields schema.
You extract all the fields of the first object of the array, and use them as the header row of the csv and as the template for extracting the rest of the elements.
let fieldnames = Object.keys(mythings[0]);
and then use this field names array as parameter of your map function
let csvtxt= mythings.map(mapElementToColumns(fieldnames)).join("\n");
and, also, prepending them as the CSV header
csvtxt.unshift(fieldnames.join(','))
Putting all the pieces together...
function mapElementToColumns(fieldNames) {
return function (element) {
let fields = fieldnames.map(n => element[n] ? JSON.stringify(element[n]) : '""');
return fields.join(',');
}
}
let jsontxt=await fs.readFile('mythings.json','uft8');
let mythings = JSON.parse(jsontxt);
if (!Array.isArray(mythings)) throw "Oooops, stranger things happen!";
let fieldnames = Object.keys(mythings[0]);
let csvtxt= mythings.map(mapElementToColumns(fieldnames)).join("\n");
csvtxt.unshift(fieldnames.join(','));
await fs.writeFile("mythings.csv",csvtxt,"utf8");
And that's it. Pretty neat, uh?

Search in new Sitecore ContentSearch using whole words

I am using new Sitecore search, and the issue I ran into is having results come for words that have nothing to do with my search term.
For example, searching for "lies" will find "applies". And this is not what I am looking for.
This is an example of search I am doing (simplified). It is a direct LINQ check for "Contains" on the "Content" property of the SearchResultItem, and most likely not what I supposed to do. It is just happen to be that samples I find online are practically doing so.
Example of my code (simplified). In here I break down the search sentence to separate keywords. By the way, I am looking for a way to show full sentence match first.
using (var context = ContentSearchManager.GetIndex("sitecore_web_index").CreateSearchContext())
{
var results = context.GetQueryable<SearchResultItem>()
.Filter(i => i.Path.StartsWith(Home.Paths.FullPath))
.Filter(GetTermPredicate(Term));
// use results here
}
protected Expression<Func<SearchResultItem, bool>> GetTermPredicate(string term)
{
var predicate = PredicateBuilder.True<SearchResultItem>();
foreach (var tempTerm in term.Split(' '))
{
predicate = predicate.And(p => p.Content.Contains(tempTerm));
}
return predicate;
}
Thank you in advance!!
All right. I got help from Sitecore Support.
In my version of Sitecore I can use the following to acheive search for a whole word instead of partial:
instead of:
predicate = predicate.And(p => p.Content.Contains(tempTerm));
use
predicate = predicate.And(p => p.Content.Equals(tempTerm));
Issue solved.
Replace Filter in your code by Where, it should be fine,
here is an example :
var currentIndex = ContentSearchManager.GetIndex("sitecore_web_index");
using (var context = currentIndex.CreateSearchContext())
{
var predicate = PredicateBuilder.True();
foreach (var currentWord in term.Split(‘ ‘))
{
predicate = predicate.Or(x => x.Content.Contains(currentWord ));
}
var results = context.GetQueryable().Where(predicate).GetResults();
}
As Ahmed notes, you should use Where instead of Filter, since Filter has no effect on search rank. The classic use case for filters is to apply a facet chosen by the user without distorting the ordering of results, as would happen if you used a Where clause.
Filtering is similar to using Where to restrict the result list. Both methods will affect the result in the same
result list, but when you use a Filter the scoring/ranking of the search hits is not affected by the filters.
Developer's Guide to Item Buckets and Search
There's a good dicussion of when to use Filter on the Sitecore 7 team blog:
Making Good Part 4: Filters, Paging and I'm feeling lucky.
If you only want to search for whole words, you could prefix and postfix the searchterm with a space. Allthough this doesn't catch all situations.

How to maintain counters with LinqToObjects?

I have the following c# code:
private XElement BuildXmlBlob(string id, Part part, out int counter)
{
// return some unique xml particular to the parameters passed
// remember to increment the counter also before returning.
}
Which is called by:
var counter = 0;
result.AddRange(from rec in listOfRecordings
from par in rec.Parts
let id = GetId("mods", rec.CKey + par.UniqueId)
select BuildXmlBlob(id, par, counter));
Above code samples are symbolic of what I am trying to achieve.
According to the Eric Lippert, the out keyword and linq does not mix. OK fair enough but can someone help me refactor the above so it does work? A colleague at work mentioned accumulator and aggregate functions but I am novice to Linq and my google searches were bearing any real fruit so I thought I would ask here :).
To Clarify:
I am counting the number of parts I might have which could be any number of them each time the code is called. So every time the BuildXmlBlob() method is called, the resulting xml produced will have a unique element in there denoting the 'partNumber'.
So if the counter is currently on 7, that means we are processing 7th part so far!! That means XML returned from BuildXmlBlob() will have the counter value embedded in there somewhere. That's why I need it somehow to be passed and incremented every time the BuildXmlBlob() is called per run through.
If you want to keep this purely in LINQ and you need to maintain a running count for use within your queries, the cleanest way to do so would be to make use of the Select() overloads that includes the index in the query to get the current index.
In this case, it would be cleaner to do a query which collects the inputs first, then use the overload to do the projection.
var inputs =
from recording in listOfRecordings
from part in recording.Parts
select new
{
Id = GetId("mods", recording.CKey + part.UniqueId),
Part = part,
};
result.AddRange(inputs.Select((x, i) => BuildXmlBlob(x.Id, x.Part, i)));
Then you wouldn't need to use the out/ref parameter.
XElement BuildXmlBlob(string id, Part part, int counter)
{
// implementation
}
Below is what I managed to figure out on my own:.
result.AddRange(listOfRecordings.SelectMany(rec => rec.Parts, (rec, par) => new {rec, par})
.Select(#t => new
{
#t,
Id = GetStructMapItemId("mods", #t.rec.CKey + #t.par.UniqueId)
})
.Select((#t, i) => BuildPartsDmdSec(#t.Id, #t.#t.par, i)));
I used resharper to convert it into a method chain which constructed the basics for what I needed and then i simply tacked on the select statement right at the end.

Configuring Solr for Suggestive/Predictive Auto Complete Search

We are working on integrating Solr 3.6 to an eCommerce site. We have indexed data & search is performing really good.
We have some difficulties figuring how to use Predictive Search / Auto Complete Search Suggestion. Also interested to learn the best practices for implementing this feature.
Our goal is to offer predictive search similar to http://www.amazon.com/, but don't know how to implement it with Solr. More specifically I want to understand how to build those terms from Solr, or is it managed by something else external to solr? How the dictionary should be built for offering these kind of suggestions? Moreover, for some field, search should offer to search in category. Try typing "xper" into Amazon search box, and you will note that apart from xperia, xperia s, xperia p, it also list xperia s in Cell phones & accessories, which is a category.
Using a custom dictionary this would be difficult to manage. Or may be we don't know how to do it correctly. Looking to you to guide us on how best utilize solr to achieve this kind of suggestive search.
I would suggest you a couple of blogpost:
This one which shows you a really nice complete solution which works well but requires some additional work to be made, and uses a specific lucene index (solr core) for that specific purpose
I used the Highlight approach because the facet.prefix one is too heavy for big index, and the other ones had few or unclear documentation (i'm a stupid programmer)
So let's suppose the user has just typed "aaa bbb ccc"
Our autocomplete function (java/javascript) will call solr using the following params
q="aaa bbb"~100 ...base query, all the typed words except the last
fq=ccc* ...suggest word filter using last typed word
hl=true
hl.q=ccc* ...highlight word will be the one to suggest
fl=NONE ...return empty docs in result tag
hl.pre=### ...escape chars to locate highlight word in the response
hl.post=### ...see above
you can also control the number of suggestion with 'rows' and 'hl.fragsize' parameters
the highlight words in each document will be the right candidates for the suggestion with "aaa bbb" string
more suggestion words are the ones before/after the highlight words and, of course, you can implement more filters to extract valid words, avoid duplicates, limit suggestions
if interested i can send you some examples...
EDITED: Some further details about the approach
The portion of example i give supposes the 'autocomplete' mechanism given by jquery: we invoke a jsp (or a servlet) inside a web application passing as request param 'q' the words just typed by user.
This is the code of the jsp
ByteArrayInputStream is=null; // Used to manage Solr response
try{
StringBuffer queryUrl=new StringBuffer('putHereTheUrlOfSolrServer');
queryUrl.append("/select?wt=xml");
String typedWords=request.getParameter("q");
String base="";
if(typedWords.indexOf(" ")<=0) {
// No space typed by user: the 'easy case'
queryUrl.append("&q=text:");
queryUrl.append(URLEncoder.encode(typedWords+"*", "UTF-8"));
queryUrl.append("&hl.q=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
} else {
// Space chars present
// we split the search in base phrase and last typed word
base=typedWords.substring(0,typedWords.lastIndexOf(" "));
queryUrl.append("&q=text:");
if(base.indexOf(" ")>0)
queryUrl.append("\""+URLEncoder.encode(base, "UTF-8")+"\"~1000");
else
queryUrl.append(URLEncoder.encode(base, "UTF-8"));
typedWords=typedWords.substring(typedWords.lastIndexOf(" ")+1);
queryUrl.append("&fq=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
queryUrl.append("&hl.q=text:"+URLEncoder.encode(typedWords+"*", "UTF-8"));
}
// The additional parameters to control the solr response
queryUrl.append("&rows="+suggestPageSize); // Number of results returned, a parameter to control the number of suggestions
queryUrl.append("&fl=A_FIELD_NAME_THAT_DOES_NOT_EXIST"); // Interested only in highlights section, Solr return a 'light' answer
queryUrl.append("&start=0"); // Use only first page of results
queryUrl.append("&hl=true"); // Enable highlights feature
queryUrl.append("&hl.simple.pre=***"); // Use *** as 'highlight border'
queryUrl.append("&hl.simple.post=***"); // Use *** as 'highlight border'
queryUrl.append("&hl.fragsize="+suggestFragSize); // Another parameter to control the number of suggestions
queryUrl.append("&hl.fl=content,title"); // Look for result only in some fields
queryUrl.append("&facet=false"); // Disable facets
/* Omitted section: use a new URL(queryUrl.toString()) to get the solr response inside a byte array */
is=new ByteArrayInputStream(solrResponseByteArray);
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//response/lst[#name=\"highlighting\"]/lst/arr[#name=\"content\"]/str");
NodeList valueList = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
Vector<String> suggestions=new Vector<String>();
for (int j = 0; j < valueList.getLength(); ++j) {
Element value = (Element) valueList.item(j);
String[] result=value.getTextContent().split("\\*\\*\\*");
for(int k=0;k<result.length;k++){
String suggestedWord=result[k].toLowerCase();
if((k%2)!=0){
//Highlighted words management
if(suggestedWord.length()>=suggestedWord.length() && !suggestions.contains(suggestedWord))
suggestions.add(suggestedWord);
}else{
/* Words before/after highlighted words
we can put these words inside another vector
and use them if not enough suggestions */
}
}
}
/* Finally we build a Json Answer to be managed by our jquery function */
out.print(request.getParameter("json.wrf")+"({ \"suggestions\" : [");
boolean firstSugg=true;
for(String suggestionW:suggestions) {
out.print((firstSugg?" ":" ,"));
out.print("{ \"suggest\" : \"");
if(base.length()>0) {
out.print(base);
out.print(" ");
}
out.print(suggestionW+"\" }");
firstSugg=false;
}
out.print(" ]})");
}catch (Exception x) {
System.err.println("Exception during main process: " + x);
x.printStackTrace();
}finally{
//Gracefully close streams//
try{is.close();}catch(Exception x){;}
}
Hope to be helpfull,
Nik
This might help you out.I am trying to do the same.
http://solr.pl/en/2010/10/18/solr-and-autocomplete-part-1/

Pattern Matching for URL classification

As a part of a project, me and a few others are currently working on a URL classifier. What we are trying to implement is actually quite simple : we simply look at the URL and find relevant keywords occuring within it and classify the page accordingly.
Eg : If the url is : http://cnnworld/sports/abcd, we would classify it under the category "sports"
To accomplish this, we have a database with mappings of the format : Keyword -> Category
Now what we are currently doing is, for each URL, we keep reading all the data items within the database, and using String.find() method to see if the keyword occurs within the URL. Once this is found, we stop.
But this approach has a few problems, the main ones being :
(i) Our database is very big and such repeated querying runs extremely slowly
(ii) A page may belong to more than one category and our approach does not handle such cases. Of-course, one simple way to ensure this would be to continue querying the database even once a category match is found, but this would only make things even slower.
I was thinking of alternatives and was wondering if the reverse could be done - Parse the url, find words occuring within it and then query the database for those words only.
A naive algorithm for this would run in O( n^2 ) - query the database for all substrings that occur within the url.
I was wondering if there was any better approach to accomplish this. Any ideas ?? Thank you in advance :)
In our commercial classifier we have a database of 4m keywords :) and we also search the body of the HTML, there are number of ways to solve this:
Use Aho-Corasick, we have used a modified algorithm specially to work with web content, for example treat: tab, space, \r, \n as space, as only one, so two spaces would be considered as one space, and also ignore lower/upper case.
Another option is to put all your keywords inside a tree (std::map for example) so the search becomes very fast, the downside is that this takes memory, and a lot, but if it's on a server, you wouldn't feel it.
I think your suggestion of breaking apart the URL to find useful bits and then querying for just those items sounds like a decent way to go.
I tossed together some Java that might help illustrate code-wise what I think this would entail. The most valuable portions are probably the regexes, but I hope the general algorithm of it helps some as well:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;
public class CategoryParser
{
/** The db field that keywords should be checked against */
private static final String DB_KEYWORD_FIELD_NAME = "keyword";
/** The db field that categories should be pulled from */
private static final String DB_CATEGORY_FIELD_NAME = "category";
/** The name of the table to query */
private static final String DB_TABLE_NAME = "KeywordCategoryMap";
/**
* This method takes a URL and from that text alone determines what categories that URL belongs in.
* #param url - String URL to categorize
* #return categories - A List<String&rt; of categories the URL seemingly belongs in
*/
public static List<String> getCategoriesFromUrl(String url) {
// Clean the URL to remove useless bits and encoding artifacts
String normalizedUrl = normalizeURL(url);
// Break the url apart and get the good stuff
String[] keywords = tokenizeURL(normalizedUrl);
// Construct the query we can query the database with
String query = constructKeywordCategoryQuery(keywords);
System.out.println("Generated Query: " + query);
// At this point, you'd need to fire this query off to your database,
// and the results you'd get back should each be a valid category
// for your URL. This code is not provided because it's very implementation specific,
// and you already know how to deal with databases.
// Returning null to make this compile, even though you'd obviously want to return the
// actual List of Strings
return null;
}
/**
* Removes the protocol, if it exists, from the front and
* removes any random encoding characters
* Extend this to do other url cleaning/pre-processing
* #param url - The String URL to normalize
* #return normalizedUrl - The String URL that has no junk or surprises
*/
private static String normalizeURL(String url)
{
// Decode URL to remove any %20 type stuff
String normalizedUrl = url;
try {
// I've used a URLDecoder that's part of Java here,
// but this functionality exists in most modern languages
// and is universally called url decoding
normalizedUrl = URLDecoder.decode(url, "UTF-8");
}
catch(UnsupportedEncodingException uee)
{
System.err.println("Unable to Decode URL. Decoding skipped.");
uee.printStackTrace();
}
// Remove the protocol, http:// ftp:// or similar from the front
if (normalizedUrl.contains("://"))
{
normalizedUrl = normalizedUrl.split(":\\/\\/")[1];
}
// Room here to do more pre-processing
return normalizedUrl;
}
/**
* Takes apart the url into the pieces that make at least some sense
* This doesn't guarantee that each token is a potentially valid keyword, however
* because that would require actually iterating over them again, which might be
* seen as a waste.
* #param url - Url to be tokenized
* #return tokens - A String array of all the tokens
*/
private static String[] tokenizeURL(String url)
{
// I assume that we're going to use the whole URL to find tokens in
// If you want to just look in the GET parameters, or you want to ignore the domain
// or you want to use the domain as a token itself, that would have to be
// processed above the next line, and only the remaining parts split
String[] tokens = url.split("\\b|_");
// One could alternatively use a more complex regex to remove more invalid matches
// but this is subject to your (?:in)?ability to actually write the regex you want
// These next two get rid of tokens that are too short, also.
// Destroys anything that's not alphanumeric and things that are
// alphanumeric but only 1 character long
//String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");
// Destroys anything that's not alphanumeric and things that are
// alphanumeric but only 1 or 2 characters long
//String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");
return tokens;
}
private static String constructKeywordCategoryQuery(String[] keywords)
{
// This will hold our WHERE body, keyword OR keyword2 OR keyword3
StringBuilder whereItems = new StringBuilder();
// Potential query, if we find anything valid
String query = null;
// Iterate over every found token
for (String keyword : keywords)
{
// Reject invalid keywords
if (isKeywordValid(keyword))
{
// If we need an OR
if (whereItems.length() > 0)
{
whereItems.append(" OR ");
}
// Simply append this item to the query
// Yields something like "keyword='thisKeyword'"
whereItems.append(DB_KEYWORD_FIELD_NAME);
whereItems.append("='");
whereItems.append(keyword);
whereItems.append("'");
}
}
// If a valid keyword actually made it into the query
if (whereItems.length() > 0)
{
query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
+ " WHERE " + whereItems.toString() + ";";
}
return query;
}
private static boolean isKeywordValid(String keyword)
{
// Keywords better be at least 2 characters long
return keyword.length() > 1
// And they better be only composed of letters and numbers
&& keyword.matches("\\w+")
// And they better not be *just* numbers
// && !keyword.matches("\\d+") // If you want this
;
}
// How this would be used
public static void main(String[] args)
{
List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
}
}
The Generated Query for the SO link would look like:
SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification'
Plenty of room for optimization, but I imagine it to be much faster than checking the string for every possible keyword.
Aho-corasick algorithm is best for searching intermediate string with one traversal. You can form a tree (aho-corasick tree) of your keyword. At the last node contains a number mapped with a particular keyword.
Now, You just need to traverse the URL string on the tree. When you got some number (work as flag in our scenario), it means that we got some mapped category. Go on with that number on hash map and find respective category for further use.
I think this will help you.
Go to this link: good animation of aho-corasick by ivan
If you have (many) fewer categories than keywords, you could create a regex for each category, where it would match any of the keywords for that category. Then you'd run your URL against each category's regex. This would also address the issue of matching multiple categories.

Resources