Monitoring Progress of Cases[] on a Very Large Body of Text

I'm working with a very large body of text (~290 MB of plain text in a single file). After importing it into Mathematica 8, I'm beginning to break it down into lowercase words and so on, so that I can start textual analysis.
The problem is that these operations take a long time. Is there a way to monitor them from within Mathematica? For operations over a variable I've used ProgressIndicator and the like, but this is different. My searching of the documentation and Stack Overflow has not turned up anything similar.
In the following, I would like to monitor the progress of the Cases[] command:
input=Import["/users/USER/alltext.txt"];
wordList=Cases[StringSplit[ToLowerCase[input],Except[WordCharacter]],Except[""]];

Something like StringCases[ToLowerCase[input], WordCharacter..] seems to be a little faster. And I would probably use DeleteCases[expr, ""] instead of Cases[expr, Except[""]].
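To make those suggestions concrete, here is a minimal sketch using the question's input variable; either line produces the word list:
wordList = StringCases[ToLowerCase[input], WordCharacter ..]  (* extracts maximal word runs; never yields "" *)
wordList = DeleteCases[StringSplit[ToLowerCase[input], Except[WordCharacter]], ""]  (* the original split, with empties dropped *)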

It is possible to view the progress of the StringSplit and Cases operations by injecting "counter" operations into the patterns being matched. The following code temporarily shows two progress bars: the first showing the number of characters processed by StringSplit and the second showing the number of words processed by Cases:
input = ExampleData[{"Text", "PrideAndPrejudice"}];
wordList =
  Module[{charCount = 0, wordCount = 0, allWords}
   , PrintTemporary[
      Row[
       { "Characters: "
       , ProgressIndicator[Dynamic[charCount], {0, StringLength@input}]
       }]]
   ; allWords = StringSplit[
      ToLowerCase[input]
      , (_ /; (++charCount; False)) | Except[WordCharacter]
      ]
   ; PrintTemporary[
      Row[
       { "Words: "
       , ProgressIndicator[Dynamic[wordCount], {0, Length@allWords}]
       }]]
   ; Cases[allWords, (_ /; (++wordCount; False)) | Except[""]]
   ]
The key to the technique is that the patterns used in both cases match against the wildcard _. However, that wildcard is guarded by a condition that always fails, but not before it has incremented a counter as a side effect. The "real" match pattern is then tried as an alternative.
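To see the idiom in isolation, here is a toy sketch (not from the answer above) that counts how many elements Cases examines while still selecting the integers:
Module[{n = 0},
  {Cases[{1, "a", 2, "b", 3}, (_ /; (++n; False)) | _Integer], n}
]
(* {{1, 2, 3}, 5} *)
Every element increments n through the always-failing first alternative, and the integers are then matched by the second.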

It depends a little on what your text looks like, but you could try splitting the text into chunks and iterating over those. You could then monitor the iterator using Monitor to see the progress. For example, if your text consists of lines terminated by newlines, you could do something like this:
Module[{list, t = 0},
 list = ReadList["/users/USER/alltext.txt", "String"];
 Monitor[
  wordlist = Flatten@Table[
     StringCases[ToLowerCase[list[[t]]], WordCharacter ..],
     {t, Length[list]}],
  Labeled[ProgressIndicator[t/Length[list]], N@t/Length[list], Right]];
 Print["Ready"]]
On a file of about 3 MB this took only marginally more time than Joshua's suggestion.

I don't know how Cases works internally, but List processing can be time-consuming, especially if the List is being built up as it goes. Since there is an unknown number of terms in the processed expression, that is likely what is occurring with Cases. So I'd try something slightly different: replacing "" with Sequence[]. For instance, this List
{"5", "6", "7", Sequence[]}
becomes
{"5", "6", "7"}.
So, try
bigList /. "" -> Sequence[]
It should run faster, as it is not building up a large List from scratch.
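Applied to a small List:
{"a", "", "b", ""} /. "" -> Sequence[]
(* {"a", "b"} *)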

Related

FuzzyWuzzy for very similar records in Python

I have a dataset in which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy in this way:
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
where t is the string and dev2 is the list to compare against. My problem is that the data sometimes contains very similar records, and the scorers provided by FuzzyWuzzy seem to fall short. I've tested token_sort, token_set, partial_token_sort, partial_token_set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one I want is obviously the second, but character by character the first one is closer, and it is a different league.
This happens with teams as well. If, say, I have the string Buchtholz, I obtain Buchtholz II before I get TSV Buchtholz.
My main guess now would be to weight the presence or absence of certain characters more heavily, such as single capital letters at the end of the string, so that a difference or absence there counts as less close; likewise for parentheses and special characters.
I don't know if there is a way to take this into account, or whether you have a better approach to get the string that really matches.
Similarity matching often requires knowledge of the data being analysed; it is not just a blind, single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut-off scores and working toward more exclusive/pessimistic approaches with higher cut-off scores until you have a clear winner. If you know more about the text you're analysing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions the numbers sequences were more important than the text. e.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000" the 250 and 2000 part of the string are important, otherwise I get a "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON" which is a better result.
I put the similarity match process through two steps: first, get a set of similar matches using an optimistic scorer (token_set_ratio); second, extract the number sequences of those results and pass them through another round of matching with a stricter scorer (token_sort_ratio). Doing this gave me the better result in the example above.
Below are some blocks of code that could be of assistance.
Here's a function to get the numbers from a string (in your case you might use this to exclude numbers from your string instead):
def get_numbers_from_string(description):
    # keep digits, dots and dashes; replace every other character with a space
    numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
    # collapse runs of whitespace into single spaces
    numbers = ' '.join(numbers.split())
    return numbers
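For example, on the description used above:
print(get_numbers_from_string("SLURRY VALVE 250MM RAGMAX 2000"))  # prints "250 2000"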
and here is a portion of the code I used to put the description match through two rounds:
# assumes the usual imports: numpy as np, pandas as pd,
# and: from fuzzywuzzy import process, fuzz
try:
    # get close match from goods move that has material numbers (round 1: optimistic scorer)
    df_material = pd.DataFrame(process.extract(description,
                                               corpus_material,
                                               scorer=fuzz.token_set_ratio),
                               columns=['Similar Text', 'Score'])
    if df_material['Score'][df_material['Score'] >= cut_off_accuracy_materials].count() >= 1:
        similar_text = df_material['Similar Text'].iloc[0]
        score = df_material['Score'].iloc[0]
        if nr_description_numbers > 4:
            # if there are multiple matches found, then get best number combination match
            # (round 2: stricter scorer on the extracted number sequences)
            df_material = df_material[df_material['Score'] >= cut_off_accuracy_materials]
            new_corpus = list(df_material['Similar Text'])
            new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
            df_material['numbers'] = new_corpus
            df_numbers = pd.DataFrame(process.extract(description_numbers,
                                                      new_corpus,
                                                      scorer=fuzz.token_sort_ratio),
                                      columns=['numbers', 'Score'])
            similar_text = df_material['Similar Text'][df_material['numbers'] == df_numbers['numbers'].iloc[0]].iloc[0]
            nr_score = df_numbers['Score'].iloc[0]
except Exception:
    pass  # the original snippet is a portion; its error handling is omitted
Hope it helps, and good luck!

Lua: Parsing and Manipulating Input with Loops - Looking for Guidance

I am currently attempting to parse data that is sent from an outside source serially. An example looks like this:
DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_
This data can come in many different lengths, but the first few pieces are always the same. Each "piece" originally arrives with a trailing CRLF, so I've replaced those with string.gsub(input, "\r\n", "|"); that is why my input looks the way it does.
The part I would like to parse is:
4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_
The "4" tells me that there will be four lines total to create this file. I'm using this as a means to set the amount of passes in the loop.
The 7x5 is the font height.
The 1 is the xpos.
The 25 is the ypos.
The variable data (174-24 in this case) is the text at these parameters.
As you can see, this pattern repeats throughout the received input string. The "4" can actually be any number > 0, with each count corresponding to a set of four fields to capture.
Here is what I have so far. Please excuse the loop variable, start variable, and print commands; I'm running this function on Linux to troubleshoot.
function loop_input(input)
    var = tonumber(string.match(val, "DATA|0|(%d*).*"))
    loop = string.match(val, "DATA|0|")
    start = string.match(val, loop.."(%d*)|.*")
    for obj = 1, var do
        for i = 1, 4 do
            if i == 1 then
                i = "font" -- want the first group to be set to font
            elseif i == 2 then
                i = "xpos" -- want the second group to be set to xpos
            elseif i == 3 then
                i = "ypos" -- want the third group to be set to ypos
            else
                i = "txt" -- want the fourth group to be set to text
            end
            obj = font..xpos..ypos..txt
            --print (i)
        end
        objects = objects..obj -- concatenate newly created obj variables with each pass
    end
end
val = "DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_"
print(loop_input(val))
Ideally, I want to create a loop that, depending on the var variable, plugs in the captured values between the pipe delimiters so I can then use them freely as I wish. When trying to troubleshoot with parentheses around my four variables (like I have above), I receive the full list of four variables four times in a row. Now I'm having difficulty actually cycling through the input string and grabbing the values out as the loop moves down the data string. I was thinking that using the pipes as a means to delineate variables from one another would help. Am I wrong? If it doesn't matter and I can keep the "\r\n" instead of each "|", then I am definitely all for that.
I've searched around and found some threads that I thought would help, but I'm not sure whether tables or splitting the input would be advisable. For example, these threads:
Setting a variable in a for loop (with temporary variable) Lua
How do I make a dynamic variable name in Lua?
Most efficient way to parse a file in Lua
I'm fairly new to programming and trying to teach myself, so please excuse my beginner thread. I have both the "Lua Reference Manual" and "Programming in Lua" books in paperback, which is what I've tried to model my function(s) on. But I'm having a problem making the connection.
I thank you all for any input or guidance you can offer!
Cheers.
Try this:
val = "DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_"
val = val .. "|"
data = val:match("DATA|0|%d+|(.*)$")
for fh,xpos,ypos,text in data:gmatch("(.-)|(.-)|(.-)|(.-)|") do
print(fh,xpos,ypos,text)
end
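For the sample input this prints one record per pass:
7x5    1    25    174-24
7x5    1    17    TERW
7x5    1    9     08MN
7x5    1    1     _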

Oracle/SQL - Removing undefined chars from string

I currently have an assignment where I have to handle data from a lot of countries. My customer has given me a list of acceptable characters, let's call it:
'aber =*'
All other characters should just be changed to '_'.
I know the conversion for my country's specific chars (æøå), easily done with something like
select replace ('Ål', 'Å', 'AA') from dual;
But how would I go about removing all unwanted "noise" without resorting to a char-by-char comparison?
For example "bear*2 = fear" should become "bear*_ = _ear" as 2 and f are not in the accepted list.
This works in Oracle 10g and up. As one approach, you can use the regular expression function regexp_replace():
select regexp_replace('bear*2 = fear', '[^aber =*]', '_') as res
from dual
res
------------------------------
bear*_ = _ear
Find out more about the regexp_replace() function in the Oracle documentation.
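If the accepted list should be treated case-insensitively, regexp_replace() also takes a match parameter as its sixth argument; a small sketch:
select regexp_replace('Bear*2 = Fear', '[^aber =*]', '_', 1, 0, 'i') as res
from dual
res
------------------------------
Bear*_ = _ear
Here 'B' survives because of the 'i' flag, while '2' and 'F' (not in the accepted list even case-insensitively) are replaced.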

How can I achieve fast and effective String compression in Actionscript 3?

I have an Object which stores pairs for a find and replace that I perform on up to 1500 Strings at a time.
The Object is populated with pairs using a method that will accept a String and then store this as a property with the value being an automatically assigned base 36 number, like this:
function addShort(long:String):void
{
    _pairs[long] = _nextShort;
}
_nextShort returns an automatically incremented value converted with .toString(36), so running the above a few times might make _pairs look like this:
_pairs:Object = {
    "class": "0",
    "testing.objects.TestBlock": "1",
    "skin.x": "2",
    "skin.y": "3",
    ...........
    "someString": "az1"
};
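For reference, _nextShort might look something like this (a hypothetical sketch; the original getter isn't shown in the question):
private var _counter:int = 0;

private function get _nextShort():String
{
    // hypothetical counter: 0-9, then a-z, then "10", "11", ... in base 36
    return (_counter++).toString(36);
}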
This Object could realistically end up being really large, holding a couple hundred pairs or more.
I then have a method that will take a "long" String (which will include the Strings I've given to addShort() previously) and return a new String where these have been replaced with their respective short value.
The method looks like this:
public function shorten(long:String):String
{
    for (var i:String in _pairs)
        long = long.split(i).join(_pairs[i]);
    return long;
}
Nice and simple; however, I foresee a massive problem in a case where I might want to "shorten" 2000+ Strings while the _pairs Object holds over 500 pairs.
That ends up being 1,000,000 iterations in total, which obviously doesn't seem very efficient.
How can I improve this process significantly?
Based on comments from @kapep I realized what I needed is actually a compression library that will do this work for me.
I stumbled across an LZW compression class within a package called Calista which works great.
I did notice that the compression was really slow, which is understandable, but if there are any suggestions for something quicker I'm open to them.
How about regular expressions for replacing String patterns?
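A sketch of that idea (not the answerer's original code; it assumes the question's _pairs Object): build a single alternation RegExp from all keys, so each String is scanned once instead of once per pair.
function shortenFast(long:String):String
{
    var keys:Array = [];
    for (var k:String in _pairs)
        keys.push(k.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")); // escape regex metacharacters
    // longer keys first, so a key that is a prefix of another cannot shadow it
    keys.sort(function(a:String, b:String):int { return b.length - a.length; });
    var re:RegExp = new RegExp(keys.join("|"), "g");
    return long.replace(re, function(match:String, ...rest):String
    {
        return _pairs[match];
    });
}
Building the RegExp once and reusing it across all 2000+ Strings avoids the per-pair split/join passes entirely.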

Searching for Number of Term Appearances in Mathematica

I'm trying to search across a large array of textual files (12k+) in Mathematica 8. So far, I've been able to plot the sheer number of times that a word appears (i.e. the word "love" appears 5,000 times across those 12k files). However, I'm running into difficulty determining the number of files in which "love" appears, which might be only 1,000 files, with it repeating several times in others.
I'm finding the documentation on FindList, streams, RecordSeparators, etc. a bit murky. Is there a way to set it up so that it finds one incidence of a term in a file and then moves on to the next?
Example of filelist:
{"89001.txt", "89002.txt", "89003.txt", "89004.txt", "89005.txt", "89006.txt", "89007.txt", "89008.txt", "89009.txt", "89010.txt", "89011.txt", "89012.txt", "89013.txt", "89014.txt", "89015.txt", "89016.txt", "89017.txt", "89018.txt", "89019.txt", "89020.txt", "89021.txt", "89022.txt", "89023.txt", "89024.txt"}
The following returns all of the lines with love across every file. Is there a way to return only the first incidence of love in each file before moving on to the next one?
FindList[filelist, "love"]
Thanks so much. This is my first post and I'm largely learning Mathematica through peer/supervisory help, online tutorials, and the documentation.
In addition to Daniel's answer, you also seem to be asking for a list of files where the word only occurs once. To do that, I'd continue to run FindList across all the files
res = FindList[filelist, "love"]
Then, reduce the results to single lines only, via
lines = Select[ res, Length[#]==1& ]
But, this doesn't eliminate the cases where there is more than one occurrence in a single line. To do that, you could use StringCount and only accept instances where it is 1, as follows
Select[ lines, StringCount[ #, RegularExpression[ "\\blove\\b" ] ] == 1& ]
The RegularExpression specifies that "love" must be a distinct word using the word boundary marker (\\b), so that words like "lovely" won't be included.
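A quick check of the boundary behaviour ("lovely" contains "love" but is not counted):
StringCount["lovely love", RegularExpression["\\blove\\b"]]
(* 1 *)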
Edit: It appears that FindList when passed a list of files returns a flattened list, so you can't determine which item goes with which file. For instance, if you have 3 files, and they contain the word "love", 0, 1, and 2 times, respectively, you'd get a list that looked like
{love, love, love}
which is clearly not useful. To overcome this, you'll have to process each file individually, and that is best done via Map (/@), as follows
res = FindList[#, "love"] & /@ filelist
and the rest of the above code works as expected.
But, if you want to associate the results with a file name, you have to change it a little.
res = {#, FindList[#, "love"]} & /@ filelist
lines = Select[res,
   Length[ #[[2]] ] == 1 && (* <-- Note the use of [[2]] *)
     StringCount[ #[[2, 1]], RegularExpression["\\blove\\b"] ] == 1 &
   ]
which returns a list of the form
{ {filename, { "string with love in it" }},
  {filename, { "string with love in it" }}, ... }
To extract the file names, you simply type lines[[All, 1]].
Note, in order to Select on the properties you wanted, I used Part ([[ ]]) to specify the second element in each datum, and the same goes for extracting the file names.
Help > Documentation Center > FindList item 4:
"FindList[files,text,n]
includes only the first n lines found."
So you could set n to 1.
Daniel Lichtblau
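Combining Daniel's n = 1 with the Map technique above gives the first matching line per file (a short sketch using the question's filelist):
res = FindList[#, "love", 1] & /@ filelist;
(* each element is either {} or a one-line list; count the files containing "love" *)
Count[res, Except[{}]]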
