I am working on a website conversion. I have a dump of the database backend as an sql file. I also have a scrape of the website from wget.
What I'm wanting to do is map database tables and columns to directories, pages, and sections of pages in the scrape. I'd like to automate this.
Is there some tool or a script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results that would say something like
string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.
I don't want to do line comparisons because lines from the database dump (INSERT INTO table VALUES (...) ) aren't going to match lines in the page where it actually populates (<div id='left_column'><div id='left_content'>...</div></div>).
I realize this is a computationally intensive task, but even letting it run over the weekend is fine.
I've found similar questions, but I don't have enough CS background to know if they are identical to my problem or not. SO kindly suggested this question, but it appears to be dealing with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack, and see matching straws of hay.
Is there a command-line script or command out there, or is this something I need to build? If I build it, should I use the Aho–Corasick algorithm, as suggested in the other question?
So your two questions are 1) Is there already a solution that will do what you want, and 2) Should you use the Aho-Corasick algorithm.
The first answer is that I doubt you'll find a ready-built tool that will meet your needs.
The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.
I will go one step further and propose an architecture.
First, you need to be able to parse the .sql files in a meaningful way, with a parser that goes line by line and returns tablename, column_name, and value. A StreamReader is probably best for this.
Second, you need a parser for your webpages that goes element by element and returns each text node, the names of its ancestor elements all the way up to the html element, and the filename of the page. An XmlTextReader or similar streaming XML parser, such as SAXON, is probably best, as long as it will operate on non-valid XML.
You would need to tie these two parsers together with a mutual search algorithm of some sort. You will have to customize it to suit your needs. Aho-Corasick will apparently get you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how:
Assuming you have your two parsers, each looping through its own data source (fields on the one hand, text nodes on the other), pick one of them and have it go through each of the strings in its data source, calling the other parser to search the other data source for all possible matches and logging the ones it finds.
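To make that concrete, here is a minimal Python sketch of the naive approach (the dump format, minimum value length, file extensions, and paths are all assumptions about your data; Python's built-in html.parser stands in for an XmlTextReader, since scraped pages are rarely valid XML):

```python
import os
import re
from html.parser import HTMLParser

def sql_values(dump_path):
    """Yield (line_number, table, value) for each quoted string of at least
    8 characters found in a one-row-per-line INSERT statement."""
    insert_re = re.compile(r"INSERT INTO\s+`?(\w+)`?.*?VALUES\s*\((.*)\)", re.IGNORECASE)
    with open(dump_path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            m = insert_re.search(line)
            if not m:
                continue
            for value in re.findall(r"'((?:[^'\\]|\\.){8,}?)'", m.group(2)):
                yield lineno, m.group(1), value

class TextNodeCollector(HTMLParser):
    """Collect (element_path, text) pairs from one scraped page."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}
    def __init__(self):
        super().__init__()
        self.path, self.nodes = [], []
    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.path.append(tag)
    def handle_endtag(self, tag):
        for i in range(len(self.path) - 1, -1, -1):
            if self.path[i] == tag:
                del self.path[i:]
                break
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.path), text))

def page_text_nodes(page_path):
    parser = TextNodeCollector()
    with open(page_path, encoding="utf-8", errors="replace") as f:
        parser.feed(f.read())
    return parser.nodes

def match(dump_path, scrape_root):
    pages = []
    for root, _, files in os.walk(scrape_root):
        for name in files:
            if name.endswith((".asp", ".html", ".htm")):
                path = os.path.join(root, name)
                pages.append((path, page_text_nodes(path)))
    for lineno, table, value in sql_values(dump_path):
        for path, nodes in pages:
            for element_path, text in nodes:
                if value in text:
                    print(f'"{value}" on line {lineno} of the dump (table {table}) '
                          f"matches <{element_path}> in {path}")

# match("table.sql", "website.com/")   # hypothetical paths from the question
```

Aho-Corasick would replace the innermost `value in text` test with a single automaton built from all the dump values, which is where the performance win would come from.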
This cannot work, at least not reliably. In the best case you would map every piece of data to its counterpart in your HTML files, but you would still have many false positives: for example, user names that are actual words, etc.
Furthermore, text is often manipulated before it is displayed: sites often capitalize titles or truncate text for previews, etc.
AFAIK there is no such tool, and in my opinion there cannot exist one that solves your problem adequately.
Your best choice is to get the source code the site uses (or used) and analyze it. If that fails or is not possible, you will have to analyze the database manually: get as much content as possible from the URLs and try to fit the pieces of the puzzle together.
Related
My team is using Solr and I have a question regarding it.
There are some search terms which don't give relevant results, or which miss results that should have been displayed. For example:
Searching for Macy's without the apostrophe, i.e. "Macys", doesn't give back any results for Macy's.
Searching for JPMorgan vs JP Morgan gives different results.
Searching for IBM doesn't show results which contain its full name, i.e. International Business Machines.
How can we improve and optimize such cases so that the fix applies across the board, even to cases we haven't caught, apart from these three above?
Any suggestions?
All these issues are related to how you process the incoming text for those fields. You'll have to create a filter chain for the field that processes the input values to do what you want, and possibly use multiple fields for different use cases, prioritizing those with qf.
Your first case can be solved by using a PatternReplaceFilter to remove any apostrophes - depending on your use case and tokenizer you might want to use the CharFilter version, as it processes the text before it's split into multiple tokens.
Your second case is a straightforward synonym filter or a WordDelimiterFilter, where you expand JPMorgan to "JP Morgan", or use the WordDelimiterFilter to expand case changes into separate tokens. That'll also allow you to search for JP and get JPMorgan related entries. These might have different effects on score, so use debugQuery=true to see exactly how each term in your query contributes to the score.
The third case is in general the same as the second case. You'll have to create a decent synonym word list for the terms used, and this is usually something you build as you get feedback from your users, from existing dictionaries and from domain knowledge. There's also the option of preprocessing text using NLP, or in this case, something as primitive as indexing the initials of any capitalized words after each other could help.
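As an illustration only (field and file names are made up, and you will want to tune the exact filters), a fieldType along these lines could cover all three cases, with synonyms.txt holding entries such as `JPMorgan, JP Morgan` and `IBM, International Business Machines`:

```xml
<!-- Hypothetical fieldType; adjust names and filters to your schema -->
<fieldType name="text_company" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Case 1: strip apostrophes before tokenizing, so Macy's matches Macys -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="'" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Case 2: split on case changes, so JPMorgan also indexes JP + Morgan -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1" preserveOriginal="1"/>
    <!-- Cases 2 and 3: hand-maintained synonyms -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Classic (non-graph) filter factories are shown for brevity; newer Solr versions prefer the Graph variants. After changing the chain, reindex and use the Analysis screen in the admin UI to check how each of your examples is tokenized at index and query time.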
I'll try to be as succinct as possible as this will require some explanation. I am in a situation where I have to match strings, and extract values from those strings based on template strings that we define.
For example, the template string would be:
I want to go to the $websiteurl homepage
And the other string could be
I want to go to the google.com homepage
We've had initial success checking that these strings match by measuring Levenshtein Distances and by using "creative" regular expressions, but we're trying to grow this to be more fault tolerant and accurate and the amount of complicated code is growing more than we would like.
In some scenarios we need to handle compound words, or we need to ignore extra adjectives/descriptive words, etc.
Adjective/Descriptive-word Example:
I want to immediately go to the google.com homepage
Compound Word Example (homepage is broken into two words):
I want to go to the google.com home page
And you can probably imagine even more complicated real-world sentences that should match this one but won't, without some extra cases or additional checks.
Obviously our current setup isn't ideal: we need to do multiple passes over the string for every individual case we check, which not only slows things down but also makes maintenance and debugging more complicated than they should be.
Is there a data structure that would be ideal for holding and comparing sentences in this manner? Something that would be fairly fault tolerant to extra or even missing words (within reason, obviously)? I imagine some kind of tree, but I do not know what type of tree or data structure in general would be best for this situation.
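For concreteness, a stripped-down, hypothetical sketch of the kind of template-to-regex pass described above looks roughly like this; it tolerates the extra-word example but still misses the compound-word one:

```python
import re

def template_to_regex(template):
    """Turn a template like 'I want to go to the $websiteurl homepage' into a
    regex: $placeholders become capture groups, and extra words are allowed
    between the expected words (simplified, hypothetical helper)."""
    parts, names = [], []
    for word in template.split():
        if word.startswith("$"):
            names.append(word[1:])
            parts.append(r"(\S+)")
        else:
            parts.append(re.escape(word))
    # \W+(?:\w+\W+)*? lets extra words such as "immediately" appear
    # between two expected words without breaking the match.
    return re.compile(r"\W+(?:\w+\W+)*?".join(parts), re.IGNORECASE), names

regex, names = template_to_regex("I want to go to the $websiteurl homepage")
m = regex.search("I want to immediately go to the google.com homepage")
print(dict(zip(names, m.groups())) if m else "no match")   # {'websiteurl': 'google.com'}
# The compound-word case ("home page") still fails and needs separate handling:
print(bool(regex.search("I want to go to the google.com home page")))  # False
```

Each new tolerance ends up as another tweak to a pattern like this, which is part of the maintenance problem described above.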
Thanks to any/all in advance!
Somebody has presented me with a very large list of copyedits to make to a long HTML document. The edits are in the format:
"religious" should be "religions"
"their" should be "there"
"you must persistent" should be "you must be persistent"
The copyedits were typed by hand; in some cases, the "actual" value on the left is not an exact match for the content in the document. The order of edits is usually correct, but even that is not guaranteed.
It's a straightforward but very large task to apply these edits by hand to the document. I'd like to automate the process as much as possible, e.g. by automatically searching for the snippets.
In a long document like this, I can't just search for all instances of "their" and replace them with "there." Sometimes "their" was used correctly, just not in one particular instance.
In other words, I'm looking for a fuzzy text match, where the order of the edits influences the search.
What's a good approach to a problem like this? I'm hoping that there's some off-the-shelf open source project that can search for the snippets in a fuzzy order.
I am not aware of any tool. But I would use edit distance for both:
for non-exact string match: probably standard Levenshtein + swap (i.e. Damerau-Levenshtein distance)
for non-exact sequence match: this time probably only with Match and Swap operations. You can use free (zero-cost) Insert to get the words that should not be edited.
It should not be hard to implement, but the computational complexity will be quite high. I would use some heuristics to skip hopeless matches. Preprocessing the words in the document and the edit list could help: get a set of chars for each word to allow a quick comparison before calculating the full edit distance, etc.
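Not an implementation of the alignment described above, but a rough Python sketch of the same idea, using difflib's similarity ratio as a stand-in for Damerau-Levenshtein and a forward-moving cursor to exploit the ordering of the edits (the threshold and helper names are arbitrary):

```python
import difflib

def fuzzy_find(haystack, needle, start, threshold=0.8):
    """Slide a needle-sized window over haystack[start:] and return the
    offset of the best fuzzy match, or -1 if nothing clears the threshold."""
    best_pos, best_score = -1, threshold
    window = len(needle)
    for pos in range(start, max(start, len(haystack) - window) + 1):
        score = difflib.SequenceMatcher(None, needle, haystack[pos:pos + window]).ratio()
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos

def apply_edits(text, edits):
    """Apply (old, new) edits in the given order, searching forward from the
    previous hit so the ordering of the copyedits constrains the search."""
    cursor = 0
    for old, new in edits:
        pos = fuzzy_find(text, old, cursor)
        if pos == -1:
            print(f"could not locate: {old!r}")   # leave for manual review
            continue
        # Note: when the match is only approximate, the replacement window
        # may be off by a character or two and should be reviewed.
        text = text[:pos] + new + text[pos + len(old):]
        cursor = pos + len(new)
    return text

document = ("Their teachings on religious matters say their is no middle ground; "
            "you must persistant in daily practice.")
edits = [("religious", "religions"),
         ("their", "there"),
         ("you must persistent", "you must be persistent")]
print(apply_edits(document, edits))
# The first "Their" (used correctly) is untouched because the cursor has already
# moved past it, and "persistant" is found even though the snippet is not exact.
```

Anything that cannot be located above the threshold is left for manual review, which also catches edits that arrive out of order.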
What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
Also, I can think of having a lookup hash table with names of countries and cities, and then comparing every token extracted from the text against the entries of the hash table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweets text. So the issue of high number of tweets might also affect my choice for a method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
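For instance, from Python, the Stanford NLP group's stanza package gives you a location extractor in a few lines (the package choice is just one option; the default English model uses OntoNotes labels such as GPE, LOC and FAC):

```python
import stanza

# stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

doc = nlp("Flooding reported in New Orleans, people moving toward Baton Rouge.")
locations = [ent.text for ent in doc.ents if ent.type in ("GPE", "LOC", "FAC")]
print(locations)  # e.g. ['New Orleans', 'Baton Rouge']
```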
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitively, make sure the case of your list is already normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word, if you find your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to this 'next' word, in relation to the previous one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
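A small sketch of that lookup (the gazetteer here is obviously a toy, and the "valid word characters" idea is approximated with a simple regex):

```python
import bisect
import re

# Toy gazetteer; in practice this list would be loaded from a file and
# normalized (lowercased) ahead of time.
locations = sorted(["3rd street", "new york", "new york city", "paris",
                    "people's republic of china"])

def find_locations(text):
    """Scan word by word; at each word, binary-search the sorted gazetteer
    and keep extending the phrase while some entry still starts with it,
    which handles multi-word names like 'new york city'."""
    words = re.findall(r"[a-z0-9']+", text.lower())   # rough notion of a "word"
    found, i = [], 0
    while i < len(words):
        match, length = None, 0
        for j in range(i + 1, len(words) + 1):
            phrase = " ".join(words[i:j])
            pos = bisect.bisect_left(locations, phrase)
            if pos < len(locations) and locations[pos] == phrase:
                match, length = phrase, j - i          # longest exact entry so far
            if pos == len(locations) or not locations[pos].startswith(phrase):
                break                                  # no entry can extend this prefix
        if match:
            found.append(match)
            i += length
        else:
            i += 1
    return found

print(find_locations("She flew from New York City to Paris via 3rd Street."))
# ['new york city', 'paris', '3rd street']
```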
How fast are the tweets coming in? That is, is it the full Twitter firehose or some filtered queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do well with Twitter because of all the leet speak. The NLP can be tuned for precision or recall depending on your needs, to cut down on the lookups performed against the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.
When I search for something, I get results that have the same text and title.
Of course, there is always an original (where others copy/leech from).
If you have expertise in search and crawling, how do you recommend that I remove these duplicates (in a feasible and efficient manner)?
Sounds like a programming question to me.
If you have a clear idea about what the stolen and original components of these pages are, and those differences are general enough that you can write a filter to separate them, then do that, hash the 'stolen' content, and then you should be able to compare hashes to determine if two pages are the same.
I guess web-page thieves might go to some further code obfuscation to mess you up, including changing whitespace, so you might want to normalise the html before hashing, for instance removing any redundant whitespace and making all attributes use double quotes, etc.
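A toy sketch of that normalise-then-hash idea (what exactly counts as the "stolen" content, and which normalizations matter, is something you would have to decide for your pages):

```python
import hashlib
import re

def content_fingerprint(html):
    """Normalize the HTML before hashing so trivial obfuscation
    (whitespace changes, quote style, case) doesn't hide a copy."""
    text = html.lower()
    text = text.replace("'", '"')          # unify attribute quote style
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

page_a = "<div class='post'>Original   article text.</div>"
page_b = '<div class="post">Original article text.</div>'
print(content_fingerprint(page_a) == content_fingerprint(page_b))  # True
```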
Here's a technique based on simhash.
Here's one that uses stopwords to work around ads.
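For reference, the simhash idea mentioned above boils down to roughly this (word tokens as features and MD5 as the per-token hash are arbitrary choices for illustration):

```python
import hashlib
import re

def simhash(text, bits=64):
    """64-bit simhash: each token votes on every bit; near-duplicate texts
    end up with fingerprints that differ in only a few bits."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

original = "The quick brown fox jumps over the lazy dog near the old river bank."
scraped  = "The quick brown fox jumps over the lazy dog near the river bank!"
print(hamming(simhash(original), simhash(scraped)))   # small relative to 64 bits
```

Fingerprints within a few bits of each other mark near-duplicate candidates for closer comparison.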
Have you tried looking at the origin date of the site? After comparing word strings to verify duplication, whitelist the one that is earlier.