Apriori Algorithm for text

I'm taking a data mining course and we have to run the Apriori algorithm on a data set of text, i.e. strings:
['acornSquash', 'cottageCheese', 'laundryDetergent', 'oatmeal', 'onions', 'pizza', 'tomatoes', 'yogurt']
['bread', 'cinnamon', 'grapefruit', 'juiceBoxes', 'mayo', 'pastaSauce', 'pepper', 'waterBottles', 'yogurt']
Can I get any code or help to run the Apriori algorithm?
Thanks in advance.

The link below contains source code for a basic Apriori implementation:
https://github.com/ak94/Apriori/
Go through the readme file.
By basic implementation I mean that it does not use any of the efficiency techniques such as hash-based counting, partitioning, sampling, transaction reduction, or dynamic itemset counting.
The code scans the whole dataset on every pass, but it is memory efficient because it always reads input from the file rather than keeping it in memory.
As you are currently taking this course, I assume this is the kind of code you will want to write on your own first.
To read more about the Apriori algorithm I would recommend http://www3.cs.stonybrook.edu/~cse634/lecture_notes/07apriori.pdf
Read it, understand it, and try to implement it on your own.
Now, let's talk about how to implement it.
When you go through the code from the link I posted, you'll see that it works on numbers, i.e. its input file contains itemsets as numbers instead of text (as in your case).
What you can simply do is write a program that maps each text item to a particular number.
For example, suppose your data set contained
[ 'oatmeal', 'onions', 'pizza', 'tomatoes', 'yogurt']
[ 'tomatoes', 'pepper', 'waterBottles', 'yogurt']
So it would look like
1 2 3 4 5 -1
4 6 7 5 -1
(-1 represents the end of a particular transaction, as in the code.)
Then you use this input file with the code (either the same as in the link, or your own in a different language), and once you get the frequent itemsets from the program, you can translate them back using the map you built earlier.
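For illustration, here is a minimal sketch of that mapping step in Python. The file name, the -1 terminator, and the two example transactions follow the convention above; everything else (variable names, the particular id scheme) is just an assumption for the sketch.

# Sketch: map text items to numbers, write the transaction file, and
# translate frequent itemsets back afterwards.
transactions = [
    ['oatmeal', 'onions', 'pizza', 'tomatoes', 'yogurt'],
    ['tomatoes', 'pepper', 'waterBottles', 'yogurt'],
]

item_to_id = {}   # text -> number
id_to_item = {}   # number -> text, for translating results back

with open('transactions.txt', 'w') as f:
    for transaction in transactions:
        ids = []
        for item in transaction:
            if item not in item_to_id:
                new_id = len(item_to_id) + 1
                item_to_id[item] = new_id
                id_to_item[new_id] = item
            ids.append(item_to_id[item])
        f.write(' '.join(map(str, ids)) + ' -1\n')   # -1 ends the transaction

# After running the Apriori program on transactions.txt, translate a frequent
# itemset of numbers back to text with the reverse map:
frequent = [4, 5]                                    # e.g. one itemset from the output
print([id_to_item[i] for i in frequent])             # ['tomatoes', 'yogurt']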

Related

Extract information from a string - what ML technique can solve this?

I would like to know what kind of technique in the Machine Learning domain can solve the problem below (for example: classification, CNN, RNN, etc.).
Problem Description:
A user inputs a string, and I would like to decompose it to get the information I want. For example:
The user inputs "R21TCCCUSISS"; after decoding, I get: "R21" is the product type, "TCC" is the batch number, and "CUSISS" is the place of origin.
The user inputs "TT3SUAWXCCAT"; after decoding, I get: "TT3S" is the product type, "SUAW" is the batch number, "X" is a wrong character the user typed, and "CCAT" is the place of origin.
There is no fixed string length for the product type, batch number, or place of origin: the product type may be "R21" or "TT3S", i.e. it may comprise 3 or 4 characters.
Also, the string may sometimes contain wrong input, like the "X" in example 2 above.
I've tried to find a related solution, and the most relevant thing I found is this: https://github.com/philipperemy/Stanford-NER-Python
However, my strings are not sentences. A sentence has spaces and grammar, but the strings I get don't fit that situation.
Your problem is not reasonably solved with any ML, since you have a defined list of product types etc., since there may not be any actual simple logic behind the codes, and since you are typically not working in a continuum (a vector space etc.). The purpose of ML is to build a regression function from a few pieces of data and hope/expect good generalisation (the regression fits all the unseen examples, past, present, and future).
Basically you are trying to reverse engineer the input grammar and generation (which was done by an algorithm, possibly including a random number generator). But in order to assert that your classifier function is working properly, you would need all of your data to also be ground truth, which defeats the point of ML.
What you want is to list all your defined product types (the ground truth) and scatter the pieces of your input (with or without a regex pattern) into the different fields (batch number, place of origin). The "learning" is really just building a function (or a few, one per field), element by element, that fills a map (C++) or a dictionary (C#), and then using it to parse the input.
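A minimal sketch of that dictionary-based parsing in Python, assuming hypothetical lists of known product types and places of origin (the values below come from the question's examples, but the matching logic is only an illustration):

# Hypothetical ground-truth sets; in practice you would load your real lists.
PRODUCT_TYPES = {"R21", "TT3S"}
PLACES = {"CUSISS", "CCAT"}

def parse_code(code):
    """Greedy parse: match a known product-type prefix and a known
    place-of-origin suffix; whatever remains in the middle is treated as
    the batch number (possibly still containing stray characters)."""
    product = next((p for p in sorted(PRODUCT_TYPES, key=len, reverse=True)
                    if code.startswith(p)), None)
    place = next((pl for pl in sorted(PLACES, key=len, reverse=True)
                  if code.endswith(pl)), None)
    if product is None or place is None:
        return None
    batch = code[len(product):len(code) - len(place)]
    return {"product_type": product, "batch": batch, "place": place}

print(parse_code("R21TCCCUSISS"))
# {'product_type': 'R21', 'batch': 'TCC', 'place': 'CUSISS'}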

Selecting arbitrary rows from a Neo matrix in Nim?

I am using the Neo library for linear algebra in Nim, and I would like to extract arbitrary rows from a matrix.
I can explicitly select a contiguous sequence of rows as per the examples in the README, but I can't select a disjoint subset of rows.
import neo
let x = randomMatrix(10, 4)
let some_rows = @[1, 3, 5]
echo x[2..4, All] # works fine
echo x[some_rows, All] ## error
The first echo works because you are creating a Slice object, which neo has defined a proc for. The second echo uses a sequence of integers, and that kind of access is not defined in the neo library. Unfortunately, Slices define contiguous closed ranges; you can't even specify a step to iterate in increments bigger than one, so there is no way to accomplish what you want.
Looking at the structure of a Matrix, it seems to be highly optimised to avoid copying data. Matrix transformation operations appear to reuse the data of the previous matrix and only change the access pattern/dimensions. As such, a transformation selecting arbitrary rows would not be possible: the indexes in your example access non-contiguous data, and that would need to be encoded somehow in the new structure. Plus, if you wrote @[1, 5, 3], that would defeat any kind of normal iterative looping.
An alternative, of course, is to write a proc which accepts a sequence instead of a slice and builds a new matrix by copying data from the old one. This implies a performance penalty, but if you think it would be a good addition to the library, please request it in the project's issue tracker. If it is not accepted, you will need to write such a proc yourself for use in your own programs.

Reading from a binary file with a given data structure

I want to read the 'Last Traded Price' from the given binary file. How do I extract specific data from the file using format strings like 'hhl10s6sc'? I know I have to use the struct.unpack method, but where can I learn to write such format strings (with some illustrations) so that I can extract any data I want from such a binary file?
The thing that is troubling me is the format string the writer of the code (which I'm trying to understand) has used: 'hlhcl6s10s11s10s2s1s10s12schc'. I understand what 6s...12s mean, but what is the significance of the 'hlhcl' (5 characters at the beginning) and the 'chc' (3 characters at the end)? The writer uses this to retrieve the 'Last Traded Price' from the data structure.
If you could give some examples and/or some sources, it would be very helpful. Attached is an image showing the data structure of the given file.
struct format strings describe the fields in order. Every letter is a format character, so hlhcl translates to "short, long, short, char, long". That doesn't resemble the image you linked (which is a tad impractical, as it's off-site and an extra step to look up): the image starts with a single long and otherwise holds only strings. It might apply to a protocol wrapping that packet.
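To illustrate how such format strings work, here is a small Python sketch; it uses a made-up, shorter layout and invented field positions, not the actual record structure from the question.

import struct

# Hypothetical layout for illustration only: '<' = little-endian, no padding;
# h = short (2 bytes), l = long (4 bytes), c = single byte, 6s/10s = fixed strings.
fmt = '<hlhcl6s10s'
print(struct.calcsize(fmt))          # 29: total record size for this format

# Build a fake record, then unpack it the way you would unpack bytes read
# from the file with f.read(struct.calcsize(fmt)).
record = struct.pack(fmt, 1, 20180101, 7, b'B', 152575, b'SYMBOL', b'0000152575')
fields = struct.unpack(fmt, record)
print(fields)

# If, say, the long after the single byte held the last traded price
# (an assumption here), you would pick it out of the tuple by position:
last_traded_price = fields[4]
print(last_traded_price)             # 152575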

Is it a reasonable practice to serialize Haskell data structures to disk just using Show/Read

I've played around with the Text.Show.Pretty module, and it makes it possible to serialize Haskell data structures like records into a nice human-readable format and still deserialize them easily using read. The output format is even more readable than YAML or JSON.
Example serialized output for a Haskell record using Text.Show.Pretty:
Book
{ author = "Plato"
, title = "Republic"
, numbers = [ 123
, 1234
]
}
Coming from the Ruby world, I know that YAML and JSON are most Rubyists' preferred formats for serializing data structures. Are Haskell's Show and Read instances often used to achieve the same end?
For big structures, I wouldn't recommend it. read is slower than molasses. Anecdote time: I have a program named yeganesh. Conceptually, it's pretty simple: read in a [(String,Double)] with about 2000 elements and dump out the keys sorted by their elements. I used to store this using Show/Read, but found that switching to a custom printer and parser sped up the program by a factor of 8. (Note: it's not that the parsing sped up by a factor of eight. The whole program sped up by a factor of eight. That means the parsing sped up by a bigger factor than that.) That made the difference between uncomfortably long pauses and instant gratification.
I agree with Daniel Wagner, but if you want a file that a user can manipulate with a simple text editor, you could use Read/Show for a small amount of data, e.g. config files.
I don't think that is a common approach amongst Haskellers, though; I usually use Parsec instead of read for config data, and a custom class/instance instead of Show.
If you have a lot of data, one usually uses Data.Binary or Data.Serialize.

Hash function that hashes similar strings in the same bucket

I'm searching for a "bad" hash function:
I'd like to hash strings and put similar strings in one bucket.
Can you give me a hint where to start my research?
Some methods or algorithm names...
Your problem is not an easy one. Two ideas:
This solution might be overly complicated, but you could try a Fourier transform. Treat your input text as a series of samples of a function and then run a Fourier transform to convert the input to the frequency domain. The low-frequency part is the general gist of the text and the high-frequency part is the tiny changes.
This is somewhat similar to what JPEG compression does: throw away the details and keep just the important stuff. If you have two almost-identical images and you JPEG-compress them heavily, you usually get the same output.
pHash uses a method similar to this.
Again, this is going to be a pretty complicated way to do it.
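A rough Python sketch of the Fourier idea (the fixed transform length, the number of low-frequency coefficients kept, and the rounding granularity are all arbitrary choices here, not recommendations):

import numpy as np

def fourier_bucket(text, n_low=4):
    # Treat character codes as samples of a signal, transform, and keep only
    # a few coarse low-frequency magnitudes as the bucket key.
    samples = np.array([ord(c) for c in text], dtype=float)
    spectrum = np.fft.rfft(samples, n=64)            # fixed length so keys are comparable
    low = np.abs(spectrum[:n_low])                   # the "general gist" of the text
    return tuple(np.round(low / 50.0).astype(int))   # coarse rounding so near-duplicates collide

print(fourier_bucket("the quick brown fox"))
print(fourier_bucket("the quick brown fix"))         # one-character edit, usually the same key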
Second idea: minHash
The idea for minHash is that you pick some markers that are likely to be the same when the inputs are the same. Then you compute a vector for the outputs of all the markers. If two inputs have similar vectors then the inputs are similar.
For example, count how many times the word "the" appears in the text. If it's even, 0, if it's odd, 1. Now count how many times the word "math" shows up in the text. Again, 0 for even, 1 for odd. Do that for a lot of words.
Now you process all the texts and each one gives you an output like "011100010101" or whatever. If two texts are similar, they will have similar output strings, differing by just one or two bits. You can use a multi-vantage-point (MVP) tree to search the outputs efficiently.
This, too, might be overkill for your problem.
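A small Python sketch of the marker/parity idea (the marker word list here is arbitrary, purely for illustration):

def parity_signature(text, markers=("the", "math", "and", "of", "is", "to", "in", "a")):
    # One bit per marker word: 0 if it occurs an even number of times, 1 if odd.
    words = text.lower().split()
    return "".join(str(words.count(m) % 2) for m in markers)

def hamming(a, b):
    # Number of positions where the two signatures differ.
    return sum(x != y for x, y in zip(a, b))

s1 = parity_signature("the cat sat on the mat and the dog")
s2 = parity_signature("the cat sat on the mat and a dog")
print(s1, s2, hamming(s1, s2))   # similar texts give signatures differing in only a few bits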
It depends on what you mean by "similar string".
But if you are looking for such a deliberately bad hash, you have to build it yourself.
For example: you can create 10 buckets (0 to 9) and group the strings by their length mod 10, or use a strcmp()-like function and group the strings by their differences from a defined reference string; a quick sketch of both follows.
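Here is a quick Python sketch of both groupings (the reference string and the bucket count are arbitrary examples):

def length_bucket(s, n_buckets=10):
    # Deliberately coarse: strings of the same length (mod 10) share a bucket.
    return len(s) % n_buckets

def difference_bucket(s, reference="acornSquash", n_buckets=10):
    # strcmp()-style idea: bucket by how much the string differs from a fixed
    # reference string (position-by-position differences plus the length gap).
    diffs = sum(a != b for a, b in zip(s, reference)) + abs(len(s) - len(reference))
    return diffs % n_buckets

for item in ["tomatoes", "tomatoe", "yogurt"]:
    print(item, length_bucket(item), difference_bucket(item))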
