Is there a Julia structure allowing searching for a range of keys? - hashmap

I know that I can use integer keys for a hashmap, as in the following Dict example. But Dicts are unordered and do not benefit from having integer keys.
julia> hashmap = Dict( 5 => "five", 9 => "nine", 16 => "sixteen", 70 => "seventy")
Dict{Int64,String} with 4 entries:
9 => "nine"
16 => "sixteen"
70 => "seventy"
5 => "five"
julia> hashmap[9]
"nine"
julia> hashmap[8:50] # I would like to be able to do this to get keys between 8 and 50 (9 and 16 here)
ERROR: KeyError: key 8:50 not found
Stacktrace:
[1] getindex(::Dict{Int64,String}, ::UnitRange{Int64}) at ./dict.jl:477
[2] top-level scope at REPL[3]:1
I'm looking for an ordered structure that allows access to all of its keys within a certain range, while benefiting from the performance optimizations that sorted keys make possible.

There is a dedicated package named DataStructures which provides a SortedDict type and corresponding search functions:
using DataStructures
d = SortedDict(5 => "five", 9 => "nine", 16 => "sixteen", 70 => "seventy")
st1 = searchsortedfirst(d, 8) # semitoken of the first key greater than or equal to 8
st2 = searchsortedlast(d, 50) # semitoken of the last key less than or equal to 50
And now:
julia> [(k for (k,v) in inclusive(d,st1,st2))...]
2-element Array{Int64,1}:
9
16

I do not think there is such a structure in the standard library, but this could be implemented as a function on an ordinary dictionary as long as the keys are of a type that fits the choice of range:
julia> d = Dict(1 => "a", 2 => "b", 5 => "c", 7 => "r", 9 => "t")
Dict{Int64,String} with 5 entries:
7 => "r"
9 => "t"
2 => "b"
5 => "c"
1 => "a"
julia> dictrange(d::Dict, r::UnitRange) = [d[k] for k in sort!(collect(keys(d))) if k in r]
dictrange (generic function with 1 method)
julia> dictrange(d, 2:6)
2-element Array{String,1}:
"b"
"c"

get allows you to supply a default value when a key is not defined; you can default to missing and then skip those entries:
julia> hashmap = Dict( 5 => "five", 9 => "nine", 16 => "sixteen", 70 => "seventy")
Dict{Int64,String} with 4 entries:
9 => "nine"
16 => "sixteen"
70 => "seventy"
5 => "five"
julia> get.(Ref(hashmap), 5:10, missing)
6-element Array{Union{Missing, String},1}:
"five"
missing
missing
missing
"nine"
missing
julia> get.(Ref(hashmap), 5:10, missing) |> skipmissing |> collect
2-element Array{String,1}:
"five"
"nine"

In case you are working with dates, you might consider having a look at the TimeSeries package, which does what you want provided your integer keys represent dates:
using TimeSeries
dates = [Date(2020,11,5), Date(2020,11,9), Date(2020,11,16), Date(2020,11,30)]
times = TimeArray(dates, ["five", "nine", "sixteen", "thirty"])
And then:
times[Date(2020,11,8):Day(1):Date(2020,11,20)]
2×1 TimeArray{String,1,Date,Array{String,1}} 2020-11-09 to 2020-11-16
│            │ A         │
├────────────┼───────────┤
│ 2020-11-09 │ "nine"    │
│ 2020-11-16 │ "sixteen" │


Replace multiple strings with multiple values in Julia

In Python pandas you can pass a dictionary to df.replace in order to replace every matching key with its corresponding value. I use this feature a lot to replace word abbreviations in Spanish that mess up sentence tokenizers.
Is there something similar in Julia? Or even better, so that I (and future users) may learn from the experience, any ideas in how to implement such a function in Julia's beautiful and performant syntax?
Thank you!
Edit: Adding an example as requested
Input:
julia> DataFrames.DataFrame(Dict("A" => ["This is an ex.", "This is a samp.", "This is a samp. of an ex."]))
3×1 DataFrame
 Row │ A
     │ String
─────┼───────────────────────────
   1 │ This is an ex.
   2 │ This is a samp.
   3 │ This is a samp. of an ex.
Desired output:
3×1 DataFrame
 Row │ A
     │ String
─────┼────────────────────────────────
   1 │ This is an example
   2 │ This is a sample
   3 │ This is a sample of an example
In Julia the function for this is also replace. It takes a collection and replaces elements in it. The simplest form is:
julia> x = ["a", "ab", "ac", "b", "bc", "bd"]
6-element Vector{String}:
"a"
"ab"
"ac"
"b"
"bc"
"bd"
julia> replace(x, "a" => "aa", "b" => "bb")
6-element Vector{String}:
"aa"
"ab"
"ac"
"bb"
"bc"
"bd"
If you have a more complex replacement pattern you can pass a function that does the replacement:
julia> replace(x) do s
length(s) == 1 ? s^2 : s
end
6-element Vector{String}:
"aa"
"ab"
"ac"
"bb"
"bc"
"bd"
There is also replace! that does the same in-place.
Is this what you wanted?
EDIT
Replacement of substrings in a vector of strings:
julia> df = DataFrame("A" => ["This is an ex.", "This is a samp.", "This is a samp. of an ex."])
3×1 DataFrame
 Row │ A
     │ String
─────┼───────────────────────────
   1 │ This is an ex.
   2 │ This is a samp.
   3 │ This is a samp. of an ex.
julia> df.A .= replace.(df.A, "ex." => "example", "samp." => "sample")
3-element Vector{String}:
"This is an example"
"This is a sample"
"This is a sample of an example"
Note two things:
you do not need to pass a Dict to the DataFrame constructor; it is enough to just pass pairs.
In the assignment I used .= not =, which performs an in-place replacement of the values in the already existing vector (I show it for comparison with what @Sundar R proposed in a comment, which is an alternative that allocates a new vector; the difference probably does not matter much in your case, but I wanted to show you both syntaxes).

Python "is" operator behaviour with integers [duplicate]

After diving into Python's source code, I found out that it maintains an array of PyInt_Objects ranging from int(-5) to int(256) (see src/Objects/intobject.c).
A little experiment proves it:
>>> a = 1
>>> b = 1
>>> a is b
True
>>> a = 257
>>> b = 257
>>> a is b
False
But if I run that code together in a .py file (or join the statements with semicolons), the result is different:
>>> a = 257; b = 257; a is b
True
I was curious why they are still the same object, so I dug deeper into the syntax tree and compiler and came up with the calling hierarchy listed below:
PyRun_FileExFlags()
    mod = PyParser_ASTFromFile()
        node *n = PyParser_ParseFileFlagsEx()  // source to CST
            parsetok()
                ps = PyParser_New()
                for (;;)
                    PyTokenizer_Get()
                    PyParser_AddToken(ps, ...)
        mod = PyAST_FromNode(n, ...)           // CST to AST
    run_mod(mod, ...)
        co = PyAST_Compile(mod, ...)           // AST to CFG
            PyFuture_FromAST()
            PySymtable_Build()
            co = compiler_mod()
        PyEval_EvalCode(co, ...)
            PyEval_EvalCodeEx()
Then I added some debug code in PyInt_FromLong and before/after PyAST_FromNode, and executed a test.py:
a = 257
b = 257
print "id(a) = %d, id(b) = %d" % (id(a), id(b))
the output looks like:
DEBUG: before PyAST_FromNode
name = a
ival = 257, id = 176046536
name = b
ival = 257, id = 176046752
name = a
name = b
DEBUG: after PyAST_FromNode
run_mod
PyAST_Compile ok
id(a) = 176046536, id(b) = 176046536
Eval ok
It means that during the CST-to-AST transform two different PyInt_Objects are created (this actually happens in the ast_for_atom() function), but they are later merged.
I find it hard to comprehend the source in PyAST_Compile and PyEval_EvalCode, so I'm asking for help here; I'd appreciate any hints.
Python caches integers in the range [-5, 256], so integers in that range are usually but not always identical.
What you see for 257 is the Python compiler optimizing identical literals when compiled in the same code object.
When typing in the Python shell each line is a completely different statement, parsed and compiled separately, thus:
>>> a = 257
>>> b = 257
>>> a is b
False
But if you put the same code into a file:
$ echo 'a = 257
> b = 257
> print a is b' > testing.py
$ python testing.py
True
This happens whenever the compiler has a chance to analyze the literals together, for example when defining a function in the interactive interpreter:
>>> def test():
... a = 257
... b = 257
... print a is b
...
>>> dis.dis(test)
  2           0 LOAD_CONST               1 (257)
              3 STORE_FAST               0 (a)

  3           6 LOAD_CONST               1 (257)
              9 STORE_FAST               1 (b)

  4          12 LOAD_FAST                0 (a)
             15 LOAD_FAST                1 (b)
             18 COMPARE_OP               8 (is)
             21 PRINT_ITEM
             22 PRINT_NEWLINE
             23 LOAD_CONST               0 (None)
             26 RETURN_VALUE
>>> test()
True
>>> test.func_code.co_consts
(None, 257)
Note how the compiled code contains a single constant for the 257.
In conclusion, the Python bytecode compiler is not able to perform massive optimizations (unlike statically typed languages), but it does more than you might think. One of these things is to analyze the usage of literals and avoid duplicating them.
Note that this has nothing to do with the small-integer cache, because it also works for floats, which are not cached:
>>> a = 5.0
>>> b = 5.0
>>> a is b
False
>>> a = 5.0; b = 5.0
>>> a is b
True
For more complex literals, like tuples, it "doesn't work":
>>> a = (1,2)
>>> b = (1,2)
>>> a is b
False
>>> a = (1,2); b = (1,2)
>>> a is b
False
But the literals inside the tuple are shared:
>>> a = (257, 258)
>>> b = (257, 258)
>>> a[0] is b[0]
False
>>> a[1] is b[1]
False
>>> a = (257, 258); b = (257, 258)
>>> a[0] is b[0]
True
>>> a[1] is b[1]
True
(Note that constant folding and the peephole optimizer can change behaviour even between bugfix versions, so which examples return True or False is basically arbitrary and will change in the future).
Regarding why you see that two PyInt_Objects are created, I'd guess that this is done to avoid literal comparison. For example, the number 257 can be expressed by multiple literals:
>>> 257
257
>>> 0x101
257
>>> 0b100000001
257
>>> 0o401
257
The parser has two choices:
Convert the literals to some common base before creating the integer, and see if the literals are equivalent. Then create a single integer object.
Create the integer objects and see if they are equal. If yes, keep only a single value and assign it to all the literals, otherwise, you already have the integers to assign.
Probably the Python parser uses the second approach, which avoids rewriting the conversion code and is also easier to extend (for example, it works with floats as well).
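On CPython 3.x this choice is observable from Python itself: equivalent literals written in different bases end up as a single entry in the compiled constant table. A quick check (not part of the original answer; the exact layout of co_consts is an implementation detail):

```python
# Compile a module where the same value is spelled in two different bases.
code = compile("a = 257\nb = 0x101\n", "<test>", "exec")

# The compiler merges equal constants of the same type, so 257 appears
# only once in the constant table, however the literal was written.
print(code.co_consts.count(257))  # 1

ns = {}
exec(code, ns)
print(ns["a"] is ns["b"])         # True: both names load the same constant
```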
Reading the Python/ast.c file, the function that parses all numbers is parsenumber, which calls PyOS_strtoul to obtain the integer value (for integers) and falls back to PyLong_FromString when the value overflows a long:
x = (long) PyOS_strtoul((char *)s, (char **)&end, 0);
if (x < 0 && errno == 0) {
    return PyLong_FromString((char *)s,
                             (char **)0,
                             0);
}
As you can see here, the parser does not check whether it has already created an integer with the given value, which explains why you see two int objects being created. It also means that my guess was correct: the parser first creates the constants, and only afterwards is the bytecode optimized to use the same object for equal constants.
The code that does this check must be somewhere in Python/compile.c or Python/peephole.c, since these are the files that transform the AST into bytecode.
In particular, the compiler_add_o function seems the one that does it. There is this comment in compiler_lambda:
/* Make None the first constant, so the lambda can't have a
   docstring. */
if (compiler_add_o(c, c->u->u_consts, Py_None) < 0)
    return 0;
So it seems like compiler_add_o is used to insert constants for functions/lambdas etc.
The compiler_add_o function stores the constants into a dict object, and from this immediately follows that equal constants will fall in the same slot, resulting in a single constant in the final bytecode.
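This deduplication is easy to observe on a function compiled in one go (CPython 3.x; how constants are stored in co_consts is an implementation detail):

```python
def test():
    a = 257
    b = 257
    return a is b

# Both 257 literals were merged into a single constant-table entry,
# so the two names end up bound to the very same int object.
print(test.__code__.co_consts.count(257))  # 1
print(test())                              # True
```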

Julia fastest way to read values string to array

I have an array of multiple strings whose values are actually arrays of various types, e.g.:
julia> _string
"[(-33.8800966, 151.2069034), (-33.8801202, 151.2071933), (-33.8803442, 151.2083656), (-33.8804469, 151.2088682), (-33.8788247, 151.2104533)]"
julia> typeof(_string)
String
julia> _string2
"[1, 2, 3, 4]"
julia> typeof(_string2)
String
I would like to convert these into arrays quickly; I know beforehand the type each string is supposed to be, i.e. so that
julia> convert(Array{Tuple{Float64,Float64},1}, _string)
[(-33.8800966, 151.2069034), (-33.8801202, 151.2071933), (-33.8803442, 151.2083656), (-33.8804469, 151.2088682), (-33.8788247, 151.2104533)]
julia> convert(Array{Int,1}, _string2)
[1, 2, 3, 4]
Currently I'm using eval(Meta.parse(_string)), which is super slow when repeated millions of times.
What's a fast way to quickly read these strings into arrays?
This probably isn't the best answer, but one way to speed it up is to parse the strings by exploiting any information you have about their structure, e.g. for your second example:
julia> using BenchmarkTools
julia> @btime parse.(Int64, split(replace(replace($_string2, "[" => ""), "]" => ""), ","))
995.583 ns (19 allocations: 944 bytes)
4-element Array{Int64,1}:
1
2
3
4
Which compares to
julia> @btime eval(Meta.parse($_string2))
135.553 μs (43 allocations: 2.67 KiB)
4-element Array{Int64,1}:
1
2
3
4
on my machine.
How feasible this is will of course depend on whether you can quickly find such patterns for all your strings.

Subclassing int class in Python

I want to do something every time I add two integers in my TestClass.
import builtins

class myInt(int):
    def __add__(self, other):
        print("Do something")

class TestClass:
    def __init__(self):
        builtins.int = myInt

    def testMethod(self):
        a = 1
        b = 2
        c = a + b
When I call my testMethod nothing happens, however if I define it like this I get the desired effect:
def testMethod(self):
    a = int(1)
    b = 2
    c = a + b
Is it possible to make this work for all int literals without having to typecast them before the operations?
Sorry, it's not possible without building your own custom interpreter. Literal objects aren't constructed by calling the constructor in __builtins__; they are constructed using opcodes that directly reference the built-in types.
Also, immutable literals are constructed when the code is compiled, so you were too late anyway. If you disassemble testMethod you'll see it simply uses the constants that were compiled in; it doesn't attempt to construct them:
>>> dis.dis(TestClass.testMethod)
  5           0 LOAD_CONST               1 (1)
              2 STORE_FAST               1 (a)

  6           4 LOAD_CONST               2 (2)
              6 STORE_FAST               2 (b)

  7           8 LOAD_FAST                1 (a)
             10 LOAD_FAST                2 (b)
             12 BINARY_ADD
             14 STORE_FAST               3 (c)
             16 LOAD_CONST               0 (None)
             18 RETURN_VALUE
Mutable literals are constructed at runtime but they use opcodes to construct the appropriate value rather than calling the type:
>>> dis.dis(lambda: {'a': 1, 'b': 2})
  1           0 LOAD_CONST               1 (1)
              2 LOAD_CONST               2 (2)
              4 LOAD_CONST               3 (('a', 'b'))
              6 BUILD_CONST_KEY_MAP      2
              8 RETURN_VALUE
You could do something along the lines of what you want by parsing the source code (use builtin compile() with ast.PyCF_ONLY_AST flag) then walking the parse tree and replacing int literals with a call to your own type (use ast.NodeTransformer). Then all you have to do is finish the compilation (use compile() again). You could even do that with an import hook so it happens automatically when your module is imported, but it will be messy.
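A minimal sketch of that last idea, assuming Python 3.8+ (where literals are ast.Constant nodes); the myInt class and the IntLiteralRewriter name are illustrative, not from any library:

```python
import ast

class myInt(int):
    def __add__(self, other):
        print("Do something")
        return myInt(int(self) + int(other))

class IntLiteralRewriter(ast.NodeTransformer):
    """Replace every int literal with a call myInt(<literal>)."""
    def visit_Constant(self, node):
        # bool is a subclass of int, so exclude True/False explicitly
        if isinstance(node.value, int) and not isinstance(node.value, bool):
            call = ast.Call(func=ast.Name(id="myInt", ctx=ast.Load()),
                            args=[node], keywords=[])
            return ast.copy_location(call, node)
        return node

source = "a = 1\nb = 2\nc = a + b\n"
tree = ast.fix_missing_locations(IntLiteralRewriter().visit(ast.parse(source)))
ns = {"myInt": myInt}
exec(compile(tree, "<rewritten>", "exec"), ns)  # prints "Do something"
print(type(ns["c"]).__name__)                   # myInt
```

Hooking this into the import machinery is possible but, as said above, messy.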

Association and Sequence Mining

Suppose I have a string of numbers with hyphens representing spaces. Define
string, A := 1212-241321413-21341-3
and I have a known group of numbers that I am interested in,
group, G := ( 1, 2)
meaning that I don't care about the order (12 or 21). I just want to know: is there an algorithm that finds all the substrings, however long, and their beginning positions, that contain ONLY 1 and 2? (The substring must contain both a 1 and a 2, and there is no neighboring repetition, i.e. you will never see a 22 or 11.)
That means if I ran the algorithm with the string A and the group G, I would get something like
>> substringfind(A, G)
>>
>> { 1212 : [0], 21 : [9, 15] }
where the algorithm returns a dictionary with the substrings as keys and lists of their beginning locations in the string as values.
Another example would be
group H := (1, 3, 4)
and the algorithm would produce
>> substringfind(A, H)
>>
>> { 413 : [7], 1413 : [11], 1341 : [16] }
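The question leaves the implementation open, but one way to get this behaviour is a single scan for maximal runs of group digits. This Python sketch assumes 0-based positions and that a valid run must contain at least two distinct group digits (both inferred from the examples above; substringfind is the question's own name):

```python
def substringfind(s, group):
    """Map each maximal run of group digits (with at least two distinct
    digits) to the list of 0-based positions where the run starts."""
    digits = {str(g) for g in group}
    found = {}
    i, n = 0, len(s)
    while i < n:
        if s[i] not in digits:
            i += 1
            continue
        j = i
        while j < n and s[j] in digits:   # extend the run as far as possible
            j += 1
        run = s[i:j]
        if len(set(run)) >= 2:            # e.g. must contain both a 1 and a 2
            found.setdefault(run, []).append(i)
        i = j
    return found

print(substringfind("1212-241321413-21341-3", (1, 2)))
# {'1212': [0], '21': [9, 15]}
```

With 0-based indexing this reproduces the G example exactly; for H it finds the same three substrings, though some of the starting positions given in the question's H example appear to be off by one.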
