Isolate lines that don't exist in txt file - text

I have two text files that contain camera models, but not all models in one file are present in the other, so I want to find the missing models. One issue though: some models have extra strings in their name, e.g.:
NIKON D610
D610
CANON POWERSHOT A1200
POWERSHOT A1200
"NIKON" and "CANON" are missing from one of the files.
I've been scratching my head for two days.

First, this answer requires some assumptions:
Two strings describing the same model differ only in whether or not they contain a manufacturer string.
It is feasible to make a list of all possible manufacturer strings.
If these two assumptions are satisfied, one can ignore every substring that is part of the manufacturer string list while comparing two model strings. This way only the rest of the model string is evaluated.
Here is an example in C#. The local strings aClean and bClean are used so the original strings are not modified.
List<string> manufacturers; // List of all possible manufacturer strings
List<string> modelsA;       // List of all model strings from file A
List<string> modelsB;       // List of all model strings from file B

foreach (string a in modelsA)
{
    // Remove manufacturer name and spaces
    string aClean = RemoveManufacturer(a).Replace(" ", "");
    foreach (string b in modelsB)
    {
        // Remove manufacturer name and spaces
        string bClean = RemoveManufacturer(b).Replace(" ", "");
        // Now compare and process the strings,
        // e.g. with aClean.Equals(bClean, StringComparison.OrdinalIgnoreCase).
        // Store the original strings a or b if required.
        ...
    }
}

string RemoveManufacturer(string model)
{
    foreach (string manufacturer in manufacturers)
    {
        // Remove manufacturer from model if present; Replace returns
        // a new string, so the result must be assigned back
        model = model.Replace(manufacturer, "");
    }
    return model;
}
This code is far from optimized, but it seems your use case is not exactly performance-sensitive anyway.

Related

A CSV file has two columns, CATEGORY and MILES; I have to find the % of miles under the Business category and the % of miles under the Personal category, in Python

CATEGORY MILES
Business 5.1
Business 4.6
Business 3.9
Personal 8.5
Business 3.7
Personal 6.2
Personal 11
This is an excerpt from the Excel sheet.
So you have a text file in CSV format, and you need to read it, convert it into combinations of [Category, Miles], and for every Category you want the "% of miles", whatever that may be.
Category Miles
X 1
X 4
X 2
Y 3
I think that you want: "Category X has 70% of the miles, and Category Y has 30% of the miles".
To solve this, it is best to cut your problem into smaller pieces.
Given a string fileName, read the file as text
Given a string in CSV format, convert it into a sequence of class BusinessMiles
Given a sequence of BusinessMiles, convert it to "% of miles", according to the definition above.
Cutting your problem into smaller pieces has several advantages:
The function of each piece will be easier to understand
Easier to unit test.
Easier to change, for instance if you don't read from a CSV file, but from a database, or if you don't read from a CSV string, but from an XML or JSON file.
Most important: you will be able to reuse the pieces for other tasks, like: "How many of my rows are about Business?"
Reusability is demonstrated most clearly here, because several of these pieces already exist and can be used freely: reading the file and converting the CSV into objects.
For this, consider using the NuGet package CsvHelper. It is easy to use, versatile, and thus one of the most used CSV packages.
So let's assume you have procedures to read the CSV file and to convert it to a sequence of BusinessMile objects:
enum Category
{
Business,
Personal,
}
class BusinessMile // TODO: invent proper name
{
public Category Category {get; set;}
public Decimal Miles {get; set;}
}
By using an enum you can be certain that after reading the CSV there won't be any incorrect Categories. It will also be easy to add new Categories in a future version. If you don't know at compile time which Categories are allowed, consider using a string instead. This has the danger that a typing error silently introduces a completely new Category without anyone noticing: "Personnel" instead of "Personal".
IEnumerable<BusinessMile> ReadBusinessMiles(string csvText)
{
    // Use CsvHelper to convert the csvText to the sequence
    // (requires CsvHelper, System.Globalization and System.IO)
    using var reader = new StringReader(csvText);
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
    return csv.GetRecords<BusinessMile>().ToList();
}
IEnumerable<BusinessMile> ReadBusinessMilesFile(string fileName)
{
    // Either use CsvHelper directly, or read the file and call the other method
    return ReadBusinessMiles(File.ReadAllText(fileName));
}
After this, your problem will be easy:
string fileName = ...
IEnumerable<BusinessMile> businessMiles = ReadBusinessMilesFile(fileName);
Make groups of BusinessMiles that have the same Category:
var categoryGroups = businessMiles.GroupBy(
    businessMile => businessMile.Category,

    // parameter resultSelector: for every Category, and all BusinessMiles
    // that have this Category, make one new object
    (category, businessMilesInThisCategory) => new
    {
        Category = category,
        TotalMiles = businessMilesInThisCategory
            .Select(businessMile => businessMile.Miles)
            .Sum(),
    });
So now you've got:
Category TotalMiles
X 7
Y 3
If you really want to have percentages, you need to get the total of all Miles of all Categories (= 10 in this example), and divide TotalMiles by this total:
var totalMiles = categoryGroups.Select(group => group.TotalMiles).Sum();
var result = categoryGroups.Select(group => new
{
    Category = group.Category,
    TotalMilesPercentage = 100.0M * group.TotalMiles / totalMiles,
});
In my definition of BusinessMile, Miles is a decimal. Take care to convert it if your Miles are integers.

Search text for matching large number of strings

I have a use case where I have to check whether a text I receive as input contains any of 3 million strings I have.
I tried regex matching, but once the list of strings crossed 50k, the performance became very bad.
I am doing this for each word in the search list:
inText = java.util.regex.Pattern.compile("\\b" + findStr + "\\b",
        java.util.regex.Pattern.CASE_INSENSITIVE).matcher(inText).replaceAll(repl);
I understand we can use search indexes like Lucene, but I feel those primarily work for searching a particular text within predefined documents, whereas my use case is the opposite: I need to send a large text and check whether any of the predefined strings occur in it.
I think you could look at it the other way around. Your predefined strings are documents stored in the inverted index, and your incoming text is a query that you test against those documents. Since the predefined strings will not change much, this will be very performant.
I prepared some Elasticsearch code, that will do the trick.
public void add(String string, String id) {
    // index() is an ESIntegTestCase helper; no explicit IndexRequest is needed
    index(INDEX, TYPE, id, string);
}
@Test
public void scoring() throws Exception {
    // adding your predefined strings
    add("{\"str\":\"string1\"}", "1");
    add("{\"str\":\"alice\"}", "2");
    add("{\"str\":\"bob\"}", "3");
    add("{\"str\":\"string2\"}", "4");
    add("{\"str\":\"melanie\"}", "5");
    add("{\"str\":\"moana\"}", "6");
    refresh();          // otherwise we would not find anything
    indexExists(INDEX); // verifies that the index exists
    ensureGreen(INDEX); // ensures cluster status is green
    // query with your text split on spaces; if the hits length is bigger than 0, you're good
    SearchResponse searchResponse = client().prepareSearch(INDEX)
            .setQuery(QueryBuilders.termsQuery("str", "string1", "string3", "melani"))
            .execute().actionGet();
    SearchHit[] hits = searchResponse.getHits().getHits();
    assertThat(hits.length, equalTo(1));
    for (SearchHit hit : hits) {
        System.out.println(hit.getSource());
    }
}

Un-nesting nested tuples to single terms

I have written a UDF (extends EvalFunc<Tuple>) whose output is tuples with inner (nested) tuples.
For example, the dump looks like:
(((photo,photos,photo)))
(((wedg,wedge),(audusd,audusd)))
(((quantum,quantum),(mind,mind)))
(((cassi,cassie),(cancion,canciones)))
(((calda,caldas),(nova,novas),(rodada,rodada)))
(((fingerprint,fingerprint),(craft,craft),(easter,easter)))
Now I want to process each of these terms, apply DISTINCT, and give each one an id (RANK). To do this, I need to get rid of the brackets. A simple FLATTEN does not help in this case.
The final output should be like:
1 photo
2 photos
3 wedg
4 wedge
5 audusd
6 quantum
7 mind
....
My code (not the UDF part and not the raw parsing):
tags = FOREACH raw GENERATE FLATTEN(tags) AS tag;
tags_distinct = DISTINCT tags;
tags_sorted = RANK tags_distinct BY tag;
DUMP tags_sorted;
I think your UDF's return type is not optimal for your workflow. Instead of returning a tuple with a variable number of fields (which are themselves tuples), it would be a lot more convenient to return a bag of tuples.
Instead of
(((wedg,wedge),(audusd,audusd)))
you will have
({(wedg,wedge),(audusd,audusd)})
and you will be able to FLATTEN that bag to:
1. apply the DISTINCT
2. RANK the tags
To do so, update your UDF like this:
class MyUDF extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        // create a DataBag (org.apache.pig.data) and add each inner tuple to it
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        for (Object field : input.getAll()) {
            bag.add((Tuple) field); // assumes each field is an inner tuple
        }
        return bag;
    }
}

Efficiently validating large list of objects

I have a function that is meant to remove items from a Collection if a certain field does not pass a validation check (either email or phone, but that's not important in this context). The problem is that a regular expression is relatively slow, and I have lists of 1 million+ items.
My function:
public HashSet<ListItemModel> RemoveInvalid(HashSet<ListItemModel> listItems)
{
    string pattern = (this.phoneOrEmail == "email") // phoneOrEmail is set via config file
        ?
        // RFC 5322 compliant email regex. see http://www.regular-expressions.info/email.html
        @"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
        :
        // north-american phone number regex. see http://stackoverflow.com/questions/12101125/regex-to-allow-only-digits-hypens-space-parentheses-and-should-end-with-a-dig
        @"(?:\d{3}(?:\d{7}|\-\d{3}\-\d{4}))|(?:\(\d{3}\)(?:\-\d{3}\-)|(?: \d{3} )\d{4})";
    Regex re = new Regex(pattern);
    if (phoneOrEmail == "email")
    {
        return new HashSet<ListItemModel>(listItems.Where(x => re.IsMatch(x.Email, 0)));
    }
    else
    {
        return new HashSet<ListItemModel>(listItems.Where(x => re.IsMatch(x.Tel, 0)));
    }
}
This takes way too long to execute. Is there a faster way of returning a subset that contains only valid emails/phone numbers?
I need to come up with something that is lightning quick. My other operations usually take only a couple of seconds on 700k+ items, but this method is taking forever and I hate that. I will be experimenting with a series of LINQ .Contains(x,y,z) checks, but in the meantime, I'd like some input from people who are smarter than me.

Pattern Matching for URL classification

As part of a project, a few others and I are currently working on a URL classifier. What we are trying to implement is actually quite simple: we simply look at the URL, find relevant keywords occurring within it, and classify the page accordingly.
E.g., if the URL is http://cnnworld/sports/abcd, we would classify it under the category "sports".
To accomplish this, we have a database with mappings of the format : Keyword -> Category
What we are currently doing is: for each URL, we keep reading all the data items within the database and use the String.find() method to see if the keyword occurs within the URL. Once a match is found, we stop.
But this approach has a few problems, the main ones being :
(i) Our database is very big, and such repeated querying runs extremely slowly.
(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple way to ensure this would be to continue querying the database even once a category match is found, but this would only make things even slower.
I was thinking of alternatives and was wondering if the reverse could be done: parse the URL, find words occurring within it, and then query the database for those words only.
A naive algorithm for this would run in O(n^2): query the database for all substrings that occur within the URL.
I was wondering if there was any better approach to accomplish this. Any ideas? Thank you in advance :)
In our commercial classifier we have a database of 4M keywords :) and we also search the body of the HTML. There are a number of ways to solve this:
Use Aho-Corasick. We have used a modified algorithm specially adapted to web content, for example treating tab, space, \r, and \n as a single space (so two consecutive spaces are considered one) and ignoring upper/lower case; see the sketch after this list.
Another option is to put all your keywords inside a tree (std::map, for example) so the search becomes very fast. The downside is that this takes a lot of memory, but if it's on a server, you wouldn't feel it.
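To make the Aho-Corasick option concrete, here is a minimal sketch, assuming the open-source org.ahocorasick Java library; the keyword list and URL are made-up examples, not part of the original answer:
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

public class KeywordScanner {
    public static void main(String[] args) {
        // Build the automaton once from the (possibly millions of) keywords
        Trie trie = Trie.builder()
                .ignoreCase()     // ignore upper/lower case, as described above
                .onlyWholeWords() // match whole tokens only
                .addKeyword("sports")
                .addKeyword("cricket")
                .addKeyword("football")
                .build();

        // A single pass over the input finds every keyword occurrence
        String url = "http://cnnworld/sports/abcd";
        for (Emit emit : trie.parseText(url)) {
            System.out.println(emit.getKeyword() + " at offset " + emit.getStart());
        }
    }
}
Because the automaton is built once and reused, each incoming text costs a single traversal regardless of how many keywords are stored.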
I think your suggestion of breaking apart the URL to find useful bits and then querying for just those items sounds like a decent way to go.
I tossed together some Java that might help illustrate code-wise what I think this would entail. The most valuable portions are probably the regexes, but I hope the general algorithm of it helps some as well:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;
public class CategoryParser
{
/** The db field that keywords should be checked against */
private static final String DB_KEYWORD_FIELD_NAME = "keyword";
/** The db field that categories should be pulled from */
private static final String DB_CATEGORY_FIELD_NAME = "category";
/** The name of the table to query */
private static final String DB_TABLE_NAME = "KeywordCategoryMap";
/**
* This method takes a URL and from that text alone determines what categories that URL belongs in.
* @param url - String URL to categorize
* @return categories - A List<String> of categories the URL seemingly belongs in
*/
public static List<String> getCategoriesFromUrl(String url) {
// Clean the URL to remove useless bits and encoding artifacts
String normalizedUrl = normalizeURL(url);
// Break the url apart and get the good stuff
String[] keywords = tokenizeURL(normalizedUrl);
// Construct the query we can query the database with
String query = constructKeywordCategoryQuery(keywords);
System.out.println("Generated Query: " + query);
// At this point, you'd need to fire this query off to your database,
// and the results you'd get back should each be a valid category
// for your URL. This code is not provided because it's very implementation specific,
// and you already know how to deal with databases.
// Returning null to make this compile, even though you'd obviously want to return the
// actual List of Strings
return null;
}
/**
* Removes the protocol, if it exists, from the front and
* removes any random encoding characters
* Extend this to do other url cleaning/pre-processing
* @param url - The String URL to normalize
* @return normalizedUrl - The String URL that has no junk or surprises
*/
private static String normalizeURL(String url)
{
// Decode URL to remove any %20 type stuff
String normalizedUrl = url;
try {
// I've used a URLDecoder that's part of Java here,
// but this functionality exists in most modern languages
// and is universally called url decoding
normalizedUrl = URLDecoder.decode(url, "UTF-8");
}
catch(UnsupportedEncodingException uee)
{
System.err.println("Unable to Decode URL. Decoding skipped.");
uee.printStackTrace();
}
// Remove the protocol, http:// ftp:// or similar from the front
if (normalizedUrl.contains("://"))
{
normalizedUrl = normalizedUrl.split(":\\/\\/")[1];
}
// Room here to do more pre-processing
return normalizedUrl;
}
/**
* Takes apart the url into the pieces that make at least some sense
* This doesn't guarantee that each token is a potentially valid keyword, however
* because that would require actually iterating over them again, which might be
* seen as a waste.
* @param url - Url to be tokenized
* @return tokens - A String array of all the tokens
*/
private static String[] tokenizeURL(String url)
{
// I assume that we're going to use the whole URL to find tokens in
// If you want to just look in the GET parameters, or you want to ignore the domain
// or you want to use the domain as a token itself, that would have to be
// processed above the next line, and only the remaining parts split
String[] tokens = url.split("\\b|_");
// One could alternatively use a more complex regex to remove more invalid matches
// but this is subject to your (?:in)?ability to actually write the regex you want
// These next two get rid of tokens that are too short, also.
// Destroys anything that's not alphanumeric and things that are
// alphanumeric but only 1 character long
//String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");
// Destroys anything that's not alphanumeric and things that are
// alphanumeric but only 1 or 2 characters long
//String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");
return tokens;
}
private static String constructKeywordCategoryQuery(String[] keywords)
{
// This will hold our WHERE body, keyword OR keyword2 OR keyword3
StringBuilder whereItems = new StringBuilder();
// Potential query, if we find anything valid
String query = null;
// Iterate over every found token
for (String keyword : keywords)
{
// Reject invalid keywords
if (isKeywordValid(keyword))
{
// If we need an OR
if (whereItems.length() > 0)
{
whereItems.append(" OR ");
}
// Simply append this item to the query
// Yields something like "keyword='thisKeyword'"
whereItems.append(DB_KEYWORD_FIELD_NAME);
whereItems.append("='");
whereItems.append(keyword);
whereItems.append("'");
}
}
// If a valid keyword actually made it into the query
if (whereItems.length() > 0)
{
query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
+ " WHERE " + whereItems.toString() + ";";
}
return query;
}
private static boolean isKeywordValid(String keyword)
{
// Keywords better be at least 2 characters long
return keyword.length() > 1
// And they better be only composed of letters and numbers
&& keyword.matches("\\w+")
// And they better not be *just* numbers
// && !keyword.matches("\\d+") // If you want this
;
}
// How this would be used
public static void main(String[] args)
{
List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
}
}
The Generated Query for the SO link would look like:
SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification'
Plenty of room for optimization, but I imagine it to be much faster than checking the string for every possible keyword.
The Aho-Corasick algorithm is best for searching a string for many keywords in a single traversal. You can form a tree (an Aho-Corasick tree) of your keywords, where the last node of each keyword contains a number mapped to that keyword.
Now you just need to traverse the URL string over the tree. When you reach a number (it works as a flag in our scenario), it means we have found a mapped keyword. Look that number up in a hash map to find the respective category for further use.
I think this will help you.
Go to this link: good animation of Aho-Corasick by Ivan
If you have (many) fewer categories than keywords, you could create a regex for each category that matches any of the keywords for that category. Then you would run your URL against each category's regex. This would also address the issue of matching multiple categories; a sketch follows below.
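As a minimal sketch of this per-category regex idea (the category names and keywords here are made up for illustration; in practice they would come from your database):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CategoryRegexClassifier {
    public static void main(String[] args) {
        // Hypothetical keyword lists per category
        Map<String, List<String>> categoryKeywords = new HashMap<>();
        categoryKeywords.put("sports", List.of("sports", "cricket", "football"));
        categoryKeywords.put("news", List.of("cnn", "headlines"));

        // Compile one alternation per category, quoting each keyword
        Map<String, Pattern> categoryPatterns = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : categoryKeywords.entrySet()) {
            String alternation = entry.getValue().stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            categoryPatterns.put(entry.getKey(),
                    Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE));
        }

        // A URL may match several categories; collect all of them
        String url = "http://cnnworld/sports/abcd";
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : categoryPatterns.entrySet()) {
            if (entry.getValue().matcher(url).find()) {
                matches.add(entry.getKey());
            }
        }
        // Prints [sports]; "cnn" does not match inside "cnnworld" because of the \b boundaries
        System.out.println(matches);
    }
}
One regex per category keeps the number of scans proportional to the number of categories, not the number of keywords, which is exactly the situation this answer targets.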
