Recommended data structure to store a changeable sequence with a number - python-3.x

I am trying to build an FP-tree and am unsure which data structure I should use to record a prefix path and its occurrence. A prefix path is a sequence of items such as ('coffee','milk','bear'), and its occurrence is an integer count. I list the two requirements for the data structure below so that you don't need to go deep into FP-trees:
The occurrence of a prefix path needs to be looked up frequently, so a dict like {prefix_path: occurrence} may be the best way to store them.
A prefix path needs to be updated (re-ranked and filtered) when building a conditional FP-tree.
I have searched through others' work on GitHub and found that people use {tuple(['coffee','milk','bear']): occurrence} or {frozenset(['coffee','milk','bear']): occurrence}. However, when a prefix path is updated, they have to convert the tuple or frozenset into a list and then back again, which does not feel very Pythonic.
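For illustration, the round-trip those projects end up doing looks roughly like this (a minimal sketch; the table contents and the update_path helper are mine, not from any particular repository):

occurrences = {('coffee', 'milk', 'bear'): 3}

def update_path(table, path, keep):
    # Filter a prefix path: tuple -> list -> tuple round trip.
    count = table.pop(path)
    items = [item for item in path if item in keep]  # mutate as a list
    new_path = tuple(items)                          # back to a hashable key
    table[new_path] = table.get(new_path, 0) + count

update_path(occurrences, ('coffee', 'milk', 'bear'), keep={'coffee', 'milk'})
print(occurrences)  # {('coffee', 'milk'): 3}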
I am wondering if there is a better way to store a prefix path together with its occurrence.

Related

Checking that a provided string is a single path component

I have a function that accepts a string which will be used to create a file with that name (e.g. f("foo") will create a /some/fixed/path/foo.txt file). I'd like to prevent users from mistakenly passing strings with / separators that would introduce additional sub-directories. Since PathBuf::push() accepts strings with multiple components (and, confusingly, so does PathBuf::set_file_name()), it doesn't seem possible to prevent pushing multiple components onto a PathBuf without a separate check first.
Naively, I could do a .contains() check:
assert!(!name.contains("/"), "name should be a single path element");
But obviously that's not cross-platform. There is path::is_separator(), so I could do:
name.chars().any(std::path::is_separator)
Alternatively, I looked at Path for any sort of is_single_component() check or similar. Failing that, I could check that file_name() equals the whole path:
let name = Path::new(name);
assert_eq!(Some(name.as_os_str()), name.file_name(),
           "name should be a single path element");
or that iterating over the path yields one element:
assert_eq!(Path::new(name).iter().count(), 1,
           "name should be a single path element");
I'm leaning towards this last approach, but I'm just curious if there's a more idiomatic way to ensure pushing a string onto a PathBuf will just add one path component.
If you are fine with limiting yourself to path names that are valid UTF-8, I suggest this succinct implementation:
fn has_single_component(path: &str) -> bool {
    !path.contains(std::path::is_separator)
}
In contrast to your Path-based approaches, it will stop at the first separator found, and it's easy to read.
Note that testing whether a path only consists of a single component is a rather uncommon thing to do, so there isn't a standard way of doing it.

How to encode blob names that end with a period?

Azure docs:
Avoid blob names that end with a dot (.), a forward slash (/), or a
sequence or combination of the two.
I cannot avoid such names due to legacy S3 compatibility, and so I must encode them.
How should I encode such names?
I don't want to use base64 since that will make it very hard to debug when looking in azure's blob console.
Go has https://golang.org/pkg/net/url/#QueryEscape, but it has a limitation: the implementation (specifically, the private shouldEscape function) escapes all characters except alphabetic characters, decimal digits, and '-', '_', '.', '~'. Since '.' is never escaped, trailing dots pass through unencoded.
I don't think there's any universal solution to this outside your application scope. Within your application scope you can use ANY encoding, so it comes down to personal preference for how you like your data laid out. There is no "right" way to do this.
Regardless, I believe you should go for these properties:
Conversion MUST be bidirectional and without conflicts in your expected file name space
DO keep file names without ending dots unencoded
For dot-ending files, DO encode just the conflicting dots, keeping the original name readable.
This keeps most (non-conflicting) file names short, intuitive, and hopefully meaningful, and should you ever be able to rename or phase out the conflicting files, you can simply remove the conversion logic without restructuring all stored data and their URLs.
I'll suggest two approaches. Let's say you have these files:
/someParent/normal.txt
/someParent/extensionless
/someParent/single.
/someParent/double..
Use special subcontainers
You could remove N dots from the end of the file name and translate them into a subcontainer named "dot", "dotdot", etc.
The resulting URLs would look like:
/someParent/normal.txt
/someParent/extensionless
/someParent/dot/single
/someParent/dotdot/double
When reading, you can remove the "dot"*N folder level and append N dots back to the file name.
Obviously this assumes you don't ever need to have such "dot" folders as data themselves.
This is preferred if stored files can come in with any extension but you can make some assumptions on folder structure.
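A minimal Python sketch of this scheme, assuming '/'-separated names and that "dot"-only folder names never occur naturally (the function names are mine):

def encode_name(name: str) -> str:
    parent, _, base = name.rpartition('/')
    stripped = base.rstrip('.')
    n = len(base) - len(stripped)
    # non-conflicting names stay unencoded, per the properties above
    return name if n == 0 else f"{parent}/{'dot' * n}/{stripped}"

def decode_name(name: str) -> str:
    parent, _, base = name.rpartition('/')
    grandparent, _, folder = parent.rpartition('/')
    n = len(folder) // 3
    if folder and folder == 'dot' * n:  # folder is exactly "dot" * N
        return f"{grandparent}/{base}{'.' * n}"
    return name

assert encode_name('/someParent/double..') == '/someParent/dotdot/double'
assert decode_name('/someParent/dotdot/double') == '/someParent/double..'
assert encode_name('/someParent/normal.txt') == '/someParent/normal.txt'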
Use discardable artificial extension
Since the conflict is at the end, you could just append a never-used dummy extension to the conflicting files, for example "endswithdots" (you could choose something more suitable depending on what extensions you expect):
/someParent/normal.txt
/someParent/extensionless
/someParent/single.endswithdots
/someParent/double..endswithdots
When reading, if the file extension is "endswithdots", remove that part from the end of the file name.
This is preferred if your data could have any container structure but you can make some assumptions on incoming extensions.
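A corresponding Python sketch, assuming "endswithdots" truly never appears as a genuine extension in your data:

MARKER = 'endswithdots'  # the dummy extension suggested above

def encode_name(name: str) -> str:
    # only dot-ending names gain the marker; everything else is untouched
    return name + MARKER if name.endswith('.') else name

def decode_name(name: str) -> str:
    # strip the marker but keep the original trailing dot(s)
    return name[:-len(MARKER)] if name.endswith('.' + MARKER) else name

assert encode_name('/someParent/double..') == '/someParent/double..endswithdots'
assert decode_name('/someParent/double..endswithdots') == '/someParent/double..'
assert encode_name('/someParent/normal.txt') == '/someParent/normal.txt'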
I would advise against Base64 or other full-name encodings, as they make file names notably longer and lose any meaningful details the names may contain.

Possible to balance unidic vs. unidic-neologd?

With the sentence "場所は多少わかりづらいんですけど、感じのいいところでした。" (i.e. "It is a bit hard to find, but it is a nice place.") using mecab with -d mecab-unidic-neologd the first line of output is:
場所 バショ バショ 場所 名詞-固有名詞-人名-姓
I.e. it says 場所 is a person's surname. Using normal mecab-unidic, it more accurately says that 場所 is just a common noun:
場所 バショ バショ 場所 名詞-普通名詞-一般
My first question is has unidic-neologd replaced all the entries in unidic, or has it simply appended its 3 million proper nouns?
Then, secondly, assuming it is a merger, is it possible to re-weight the entries to prefer plain unidic entries a bit more strongly? I.e. I'd love to get 中居正広のミになる図書館 and SMAP each recognized as single proper nouns, but I also need it to see that 場所 is always going to mean "place" (except when it is followed by a name suffix such as さん or 様, of course).
References: unidic-neologd
Neologd merges with unidic (or ipadic), which is the reason it keeps "unidic" in the name. If an entry has multiple parts of speech, like 場所, which entry to use is chosen by minimizing cost across the sentence using part-of-speech transitions and, for words in the dictionary, the per-token cost.
If you look in the CSV file that contains neologd dictionary entries you'll see two entries for 場所:
場所,4786,4786,4329,名詞,固有名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*
場所,4790,4790,4329,名詞,固有名詞,人名,姓,*,*,バショ,場所,場所,バショ,場所,バショ,固,*,*,*,*
And in lex.csv, the default unidic dictionary:
場所,5145,5145,4193,名詞,普通名詞,一般,*,*,*,バショ,場所,場所,バショ,場所,バショ,混,*,*,*,*
The fourth column is the cost. A lower cost item is more likely to be selected, so in this case you can raise the cost for 場所 as a proper noun, though honestly I would just delete it. You can read more about fiddling with cost here (Japanese).
If you want to weight all default unidic entries more strongly, you can modify the neologd CSV file to increase all of its costs. This is one way to create a file like that:
awk -F, 'BEGIN{OFS=FS}{$4 = $4 * 100; print $0}' neolog.csv > neolog.fix.csv
You will have to remove the original csv file before building (see Note 2 below).
In this particular case, I think you should report this as a bug to the Neologd project.
Note 1: As mentioned above, since which entry is selected depends on the sentence as a whole, it's possible to get the non-proper-noun tag even with the default configuration. Example sentence:
お店の場所知っている? ("Do you know where the shop is?")
Note 2: The way the neologd dictionary combines with the default unidic dictionary is based on a subtle aspect of the way Mecab dictionary builds work. Specifically, all CSV files in a dictionary build directory are used when creating the system dictionary. Order isn't specified so it's unclear what happens in the case of collisions.
This feature is mentioned in the Mecab documentation here (Japanese).

Set of strings efficient implementation

Is there an easy way to create a set of strings in Matlab?
I am going through a list of filepaths and want to get all names of folders at a specific level.
But since in some folders there are several files, I get these folders several times.
I know I could create a cell array and check each time whether the current folder name is already in the array, adding it if not.
Another option would be to use the java HashSet class.
But is there any easy inbuilt Matlab way to do something like that?
I can't use a Vector, since that would create a vector of chars, not strings.
Unfortunately there's nothing as efficient as Java Set implementations.
But you can use set operations: either union as you add, or just call unique on your collection once it contains duplicates.
You could use the rdir script... MATLAB file exchange to the rescue!
Use it like this:
listing = rdir(name);
The function returns a structure listing similar to the built-in dir command.
It should save you the headache of iterating through a directory tree yourself.
How about "unique":
x = {'dog', 'cat', 'cat', 'fish', 'horse', 'bird', 'rat', 'rat'};
x_set=unique(x)
x_set =
'bird' 'cat' 'dog' 'fish' 'horse' 'rat'

SAS: generate all possible misspellings

Does anyone know how to generate all possible misspellings of a word?
Example: unemployment
- uemployment
- onemploymnet
- etc.
If you just want to generate a list of possible misspellings, you might try a tool like this one. Otherwise, in SAS you might be able to use a function like COMPGED to compute a measure of the similarity between the string someone entered, and the one you wanted them to type. If the two are "close enough" by your standard, replace their text with the one you wanted.
Here is an example that computes the generalized edit distance between "unemployment" and a variety of plausible misspellings.
data misspell;
   length misspell string $16;
   retain string "unemployment";
   input misspell $16.;
   GED=compged(misspell, string, 'iL');  /* generalized edit distance */
datalines;
nemployment
uemployment
unmployment
uneployment
unemloyment
unempoyment
unemplyment
unemploment
unemployent
unemploymnt
unemploymet
unemploymen
unemploymenyt
unemploymenty
unemploymenht
unemploymenth
unemploymengt
unemploymentg
unemploymenft
unemploymentf
blahblah
;
proc print data=misspell label;
   label GED='Generalized Edit Distance';
   var misspell string GED;
run;
Essentially you are trying to develop a list of text strings based on rules of thumb: one letter is missing from the word, a letter is transposed into the wrong spot, one letter was mistyped, and so on. The problem is that these rules have to be explicitly defined before you can write the code, in SAS or any other language (this is what Chris was referring to). If your requirement is reduced to this one-wrong-letter scenario, then this might be manageable; otherwise, the commenters are correct and you can easily create massive lists of incorrect spellings (after all, every combination except "unemployment" constitutes a misspelling of that word).
Having said that, there are many ways in SAS to accomplish this text manipulation (rx functions, combinations of other text-string functions, macros); however, there are probably better tools for it. I would suggest an external Perl process to generate a text file that can be read into SAS, but other programmers might have better alternatives.
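For instance, such an external script might look like this (sketched in Python rather than Perl; it implements only the one-edit rules of thumb mentioned above, and the output file name is arbitrary):

import string

def one_edit_misspellings(word):
    # candidates one edit away: a deleted, transposed, or mistyped letter
    out = set()
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])                              # deletion
        if i + 1 < len(word):
            out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # transposition
        for c in string.ascii_lowercase:                              # substitution
            out.add(word[:i] + c + word[i + 1:])
    out.discard(word)
    return out

with open('misspellings.txt', 'w') as f:
    f.write('\n'.join(sorted(one_edit_misspellings('unemployment'))))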
If you are looking for a general spell checker, SAS does have proc spell.
It will take some tweaking to get it working for your situation; it's very old and clunky. It doesn't work well in this case, but you may have better results if you use another dictionary. A Google search will show other examples.
filename name temp lrecl=256;
options caps;
data _null_;
   file name;
   informat name $256.;
   input name &;
   put name;
   cards;
uemployment
onemploymnet
;
proc spell in=name
           dictionary=SASHELP.BASE.NAMES
           suggest;
run;
options nocaps;
