In Python pandas you can pass a dictionary to df.replace in order to replace every matching key with its corresponding value. I use this feature a lot to replace word abbreviations in Spanish that mess up sentence tokenizers.
Is there something similar in Julia? Or even better, so that I (and future users) may learn from the experience, any ideas on how to implement such a function in Julia's beautiful and performant syntax?
Thank you!
Edit: Adding an example as requested
Input:
julia> DataFrames.DataFrame(Dict("A" => ["This is an ex.", "This is a samp.", "This is a samp. of an ex."]))
3×1 DataFrame
 Row │ A
     │ String
─────┼────────────────────────────
   1 │ This is an ex.
   2 │ This is a samp.
   3 │ This is a samp. of an ex.
Desired output:
3×1 DataFrame
 Row │ A
     │ String
─────┼─────────────────────────────────
   1 │ This is an example
   2 │ This is a sample
   3 │ This is a sample of an example
In Julia the function for this is also replace. It takes a collection and replaces elements in it. The simplest form is:
julia> x = ["a", "ab", "ac", "b", "bc", "bd"]
6-element Vector{String}:
"a"
"ab"
"ac"
"b"
"bc"
"bd"
julia> replace(x, "a" => "aa", "b" => "bb")
6-element Vector{String}:
"aa"
"ab"
"ac"
"bb"
"bc"
"bd"
If you have a more complex replacement pattern you can pass a function that does the replacement:
julia> replace(x) do s
           length(s) == 1 ? s^2 : s
       end
6-element Vector{String}:
"aa"
"ab"
"ac"
"bb"
"bc"
"bd"
There is also replace! that does the same in-place.
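For example, a minimal sketch of the in-place variant, reusing the toy vector x from above (note that replace! mutates x):
julia> replace!(x, "a" => "aa", "b" => "bb")
6-element Vector{String}:
 "aa"
 "ab"
 "ac"
 "bb"
 "bc"
 "bd"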
Is this what you wanted?
EDIT
Replacement of substrings in a vector of strings:
julia> df = DataFrame("A" => ["This is an ex.", "This is a samp.", "This is a samp. of an ex."])
3×1 DataFrame
 Row │ A
     │ String
─────┼────────────────────────────
   1 │ This is an ex.
   2 │ This is a samp.
   3 │ This is a samp. of an ex.
julia> df.A .= replace.(df.A, "ex." => "example", "samp." => "sample")
3-element Vector{String}:
"This is an example"
"This is a sample"
"This is a sample of an example"
Note two things:
1. You do not need to pass a Dict to the DataFrame constructor. It is enough to just pass pairs.
2. In the assignment I used .= not =, which performs an in-place replacement of the updated values in the already existing vector (I show it for comparison with what @Sundar R proposed in a comment, which is an alternative that allocates a new vector; the difference probably does not matter much in your case, but I just wanted to show you both syntaxes).
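And, closer to the original pandas use case, if the abbreviations already live in a dictionary you can splat its pairs into the replace call instead of spelling them out by hand. A minimal sketch (the abbrev name is purely illustrative, and passing multiple pattern pairs to replace on a string requires Julia 1.7 or newer):
julia> abbrev = Dict("ex." => "example", "samp." => "sample");

julia> df.A .= replace.(df.A, abbrev...);
This performs exactly the same replacement as the explicit pairs above.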
I have the following code to find the mean of the ages in the dataframe.
let df = df! [
    "name" => ["panda", "polarbear", "seahorse"],
    "age" => [5, 7, 1],
].unwrap();

let mean = df
    .lazy()
    .select([col("age").mean()])
    .collect().unwrap();

println!("{:?}", mean);
After finding the mean, I want to extract the value as an f64.
┌──────────┐
│ age      │
│ ---      │
│ f64      │ -----> how to transform into a single f64 of value 4.333333?
╞══════════╡
│ 4.333333 │
└──────────┘
Normally, I would do something like df[0,0] to extract the only value. However, as Polars is not a big proponent of indexing, how would one do it using Rust Polars?
Ok, I found a couple of ways to do this, although I'm not sure whether they are the most efficient.
let df = df! [
    "name" => ["panda", "polarbear", "seahorse"],
    "age" => [5, 7, 1],
]?;

let mean = df
    .lazy()
    .select([col("age").mean()])
    .collect()?;

// Select the column as Series, turn into an iterator, select the first
// item and cast it into an f64
let mean1 = mean.column("age")?.iter().nth(0)?.try_extract::<f64>()?;

// Select the column as Series and calculate the sum as f64
let mean2 = mean.column("age")?.sum::<f64>()?;
mean["age"].max().unwrap()
or
mean["age"].f64().unwrap().get(0).unwrap()
I am trying to export a dataframe to xlsx
df |> XLSX.writexlsx("df.xlsx")
and am getting this error:
ERROR: Unsupported datatype String31 for writing data to Excel file. Supported data types are Union{Missing, Bool, Float64, Int64, Dates.Date, Dates.DateTime, Dates.Time, String} or XLSX.CellValue.
I cannot find anything about String31 using Google. Is there a "transform" that I need to perform on the df beforehand?
It seems that currently this is a limitation of XLSX.jl. I have opened an issue proposing to fix it.
I assume your df is not huge, in which case the following solution should work for you:
to_string(x::AbstractString) = String(x)
to_string(x::Any) = x
XLSX.writetable("df.xlsx", to_string.(df))
I guess you're reading the data in from a CSV file? CSV.jl uses fixed-length strings from the InlineStrings.jl package when it reads from a file, which can be much better for managing memory allocation, and hence for performance. For example:
julia> df = CSV.read(IOBuffer("S, I, F
       hello Julian world, 2, 3.14"), DataFrame)
1×3 DataFrame
 Row │ S                   I      F
     │ String31            Int64  Float64
─────┼────────────────────────────────────
   1 │ hello Julian world      2     3.14
Here, the S column was assigned a String31 type since that's the smallest type from InlineStrings that can hold our string.
However, it looks like XLSX.jl doesn't know how to handle InlineString types, unfortunately.
You can specify when reading the file that you want all strings to be read as just String type and not the fixed-length InlineString variety.
julia> df = CSV.read(IOBuffer("S, I, F
       hello Julian world, 2, 3.14"), DataFrame, stringtype = String)
1×3 DataFrame
 Row │ S                   I      F
     │ String              Int64  Float64
─────┼────────────────────────────────────
   1 │ hello Julian world      2     3.14
Or, you can transform the column just before writing to the file.
julia> transform!(df, :S => ByRow(String) => :S)
1×3 DataFrame
 Row │ S                   I      F
     │ String              Int64  Float64
─────┼────────────────────────────────────
   1 │ hello Julian world      2     3.14
If you have multiple InlineString type columns and don't want to specify them individually, you can also do:
julia> to_string(c::AbstractVector) = c;
julia> to_string(c::AbstractVector{T}) where T <: AbstractString = String.(c);
julia> mapcols!(to_string, df)
1×3 DataFrame
 Row │ S                   I      F
     │ String              Int64  Float64
─────┼────────────────────────────────────
   1 │ hello Julian world      2     3.14
(Edited to use multiple dispatch here (based on Bogumił Kamiński's answer) rather than a typecheck in the function.)
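Putting it together, a minimal sketch (assuming XLSX.jl is loaded and df is the data frame from above) of converting the columns and then writing the file:
julia> mapcols!(to_string, df);  # convert any InlineString columns to plain String

julia> XLSX.writetable("df.xlsx", df)  # all column eltypes are now ones XLSX.jl supports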
I have a list in terraform that looks something like:
array = ["a","b","c"]
Within this Terraform file there are two variables called age and gender, and I want the list called array to have an extra element "d" if age is equal to 12 and gender is equal to "male" (i.e. if var.age == 12 && var.gender == "male" then array should be ["a","b","c","d"], else array should be ["a","b","c"]). Is the following going along the right path, or would I need to use another method?
array = ["a","b","c", var.age == 12 && var.gender == "male" ? "d" : null]
There is another way to do that using flatten:
variable = flatten(["a", "b", "c", var.age == 12 ? ["d"] : []])
There are a few ways you could do it. One way would be (for example):
variable "array" {
  default = ["a","b","c"]
}

variable "age" {
  default = 12
}

variable "gender" {
  default = "male"
}

locals {
  array = var.age == 12 && var.gender == "male" ? concat(var.array, ["d"]) : var.array
}

output "test" {
  value = local.array
}
Another approach to the problem in your particular example is removing empty elements at the end of the array.
compact is perhaps the most straightforward, but it requires you to rely on "" as a sentinel value for emptiness.
From the documentation:
compact takes a list of strings and returns a new list with any empty string elements removed.
> compact(["a", "", "b", "c"])
[
  "a",
  "b",
  "c",
]
I would prefer this over the other answers because it's more idiomatic. I suppose it only works for strings though.
array = compact(["a","b","c", var.age == 12 && var.gender == "male" ? "d" : ""])
This yields:
["a","b","c"] if age != 12 or gender != male
["a","b","c","d"] if age == 12 and gender == male
Of course the "" element could be anywhere in your list and compact would handle this optionality issue.
I would also be interested in which has the best O() performance. I don't know how compact is implemented under the hood, but in general you would either copy elements into a new array or shift elements into the gaps left by removed elements.
I don't expect any other solution to be much better than this. Perhaps concat.
The concat function is pulled into the Terraform source code; the concat source code just appends elements into a new slice.
Given that, it probably requires O(n) comparisons + appends, and then takes O(n) space because a new list is created.
I know that I can use integer keys for a hashmap like the following example for a Dictionary. But Dictionaries are unordered and do not benefit from having integer keys.
julia> hashmap = Dict( 5 => "five", 9 => "nine", 16 => "sixteen", 70 => "seventy")
Dict{Int64,String} with 4 entries:
  9  => "nine"
  16 => "sixteen"
  70 => "seventy"
  5  => "five"
julia> hashmap[9]
"nine"
julia> hashmap[8:50] # I would like to be able to do this to get keys between 8 and 50 (9 and 16 here)
ERROR: KeyError: key 8:50 not found
Stacktrace:
[1] getindex(::Dict{Int64,String}, ::UnitRange{Int64}) at ./dict.jl:477
[2] top-level scope at REPL[3]:1
I'm looking for an ordered structure that allows access to all its keys within a certain range while benefiting from the performance optimizations that sorted keys allow.
There is a dedicated library named DataStructures which has a SortedDict structure and corresponding search functions:
using DataStructures
d = SortedDict(5 => "five", 9 => "nine", 16 => "sixteen", 70 => "seventy")
st1 = searchsortedfirst(d, 8) # index of the first key greater than or equal to 8
st2 = searchsortedlast(d, 50) # index of the last key less than or equal to 50
And now:
julia> [(k for (k, v) in inclusive(d, st1, st2))...]
2-element Array{Int64,1}:
  9
 16
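If you also want the values (or the full pairs) in that range, the same inclusive iterator can be reused; a small sketch continuing the example above:
julia> [v for (k, v) in inclusive(d, st1, st2)]
2-element Array{String,1}:
 "nine"
 "sixteen"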
I do not think there is such a structure in the standard library, but this could be implemented as a function on an ordinary dictionary as long as the keys are of a type that fits the choice of range:
julia> d = Dict(1 => "a", 2 => "b", 5 => "c", 7 => "r", 9 => "t")
Dict{Int64,String} with 5 entries:
  7 => "r"
  9 => "t"
  2 => "b"
  5 => "c"
  1 => "a"
julia> dictrange(d::Dict, r::UnitRange) = [d[k] for k in sort!(collect(keys(d))) if k in r]
dictrange (generic function with 1 method)
julia> dictrange(d, 2:6)
2-element Array{String,1}:
"b"
"c"
get allows you to supply a default value when a key is not present; you can default to missing and then skip the missing entries:
julia> hashmap = Dict( 5 => "five", 9 => "nine", 16 => "sixteen", 70 => "seventy")
Dict{Int64,String} with 4 entries:
  9  => "nine"
  16 => "sixteen"
  70 => "seventy"
  5  => "five"
julia> get.(Ref(hashmap), 5:10, missing)
6-element Array{Union{Missing, String},1}:
"five"
missing
missing
missing
"nine"
missing
julia> get.(Ref(hashmap), 5:10, missing) |> skipmissing |> collect
2-element Array{String,1}:
"five"
"nine"
In case you are working with dates, you might consider having a look at the TimeSeries package, which does what you want provided your integer keys represent dates:
using TimeSeries
dates = [Date(2020,11,5), Date(2020,11,9), Date(2020,11,16), Date(2020,11,30)]
times = TimeArray(dates, ["five", "nine", "sixteen", "thirty"])
And then:
times[Date(2020,11,8):Day(1):Date(2020,11,20)]
2×1 TimeArray{String,1,Date,Array{String,1}} 2020-11-09 to 2020-11-16
│            │ A         │
├────────────┼───────────┤
│ 2020-11-09 │ "nine"    │
│ 2020-11-16 │ "sixteen" │
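If you then want the plain values rather than another TimeArray, TimeSeries exposes values (and timestamp for the dates); a small sketch continuing the example above:
julia> values(times[Date(2020,11,8):Day(1):Date(2020,11,20)])
2-element Array{String,1}:
 "nine"
 "sixteen"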
Is there any way to get the function below to return a consistent type? I'm doing some work with Julia GLM (love it). I wrote a function that creates all of the possible regression combinations for a dataset. However, my current method of creating an @formula returns a different type for every different length of rhs.
using GLM

function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
    ts = term.((1, rhs...))
    term(lhs) ~ sum(ts)
end
Using @code_warntype for a simple example returns the following:
julia> @code_warntype compose(:y, [:x])
Variables
  #self#::Core.Compiler.Const(compose, false)
  lhs::Symbol
  rhs::Array{Symbol,1}
  ts::Any

Body::FormulaTerm{Term,_A} where _A
1 ─ %1 = Core.tuple(1)::Core.Compiler.Const((1,), false)
│   %2 = Core._apply(Core.tuple, %1, rhs)::Core.Compiler.PartialStruct(Tuple{Int64,Vararg{Symbol,N} where N}, Any[Core.Compiler.Const(1, false), Vararg{Symbol,N} where N])
│   %3 = Base.broadcasted(Main.term, %2)::Base.Broadcast.Broadcasted{Base.Broadcast.Style{Tuple},Nothing,typeof(term),_A} where _A<:Tuple
│        (ts = Base.materialize(%3))
│   %5 = Main.term(lhs)::Term
│   %6 = Main.sum(ts)::Any
│   %7 = (%5 ~ %6)::FormulaTerm{Term,_A} where _A
└──      return %7
And checking the return type of a few different inputs:
julia> compose(:y, [:x]) |> typeof
FormulaTerm{Term,Tuple{ConstantTerm{Int64},Term}}
julia> compose(:y, [:x1, :x2]) |> typeof
FormulaTerm{Term,Tuple{ConstantTerm{Int64},Term,Term}}
We see that as the length of rhs changes, the return type changes.
Can I change my compose function so that it always returns the same type? This isn't really a big issue; compiling for each new number of regressors only takes ~70 ms. This is really more of a "how can I improve my Julia skills?" question.
I do not think you can avoid the type instability here, as ~ expects the RHS to be a Term or a Tuple of Terms.
However, most of the compilation cost you are paying is in term.((1, rhs...)), as you invoke broadcasting, which is expensive to compile. Here is how you can do it in a cheaper way:
function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
    term(lhs) ~ ntuple(i -> i <= length(rhs) ? term(rhs[i]) : term(1), length(rhs) + 1)
end
or (this is a bit slower but more like your original code):
function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
    term(lhs) ~ map(term, (1, rhs...))
end
Finally, if you are doing such computations, maybe you can drop the formula interface and instead feed matrices directly to lm or glm as the RHS, in which case you should be able to avoid the extra compilation cost, e.g.:
julia> y = rand(10);
julia> x = rand(10, 2);
julia> @time lm(x,y);
  0.000048 seconds (18 allocations: 1.688 KiB)
julia> x = rand(10, 3);
julia> @time lm(x,y);
  0.000038 seconds (18 allocations: 2.016 KiB)
julia> y = rand(100);
julia> x = rand(100, 50);
julia> @time lm(x,y);
  0.000263 seconds (22 allocations: 121.172 KiB)
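For completeness, a small sketch of how such a matrix RHS could be assembled from a data frame for a chosen set of regressors (the data frame df and the column names below are purely illustrative):
julia> using DataFrames, GLM

julia> df = DataFrame(y = rand(10), x1 = rand(10), x2 = rand(10));

julia> X = hcat(ones(nrow(df)), Matrix(df[:, [:x1, :x2]]));  # intercept column plus the chosen regressors

julia> lm(X, df.y);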