I have the following code to find the mean of the ages in the dataframe.
let df = df! [
"name" => ["panda", "polarbear", "seahorse"],
"age" => [5, 7, 1],
].unwrap();
let mean = df
.lazy()
.select([col("age").mean()])
.collect().unwrap();
println!("{:?}", mean);
After finding the mean, I want to extract the value as an f64.
┌──────────┐
│ age │
│ --- │
│ f64 │ -----> how to transform into a single f64 of value 4.333333?
╞══════════╡
│ 4.333333 │
└──────────┘
Normally, I would do something like df[0,0] to extract the only value. However, as Polars is not a big proponent of indexing, how would one do it using Rust Polars?
OK, I found a couple of ways to do this, although I'm not sure they are the most efficient.
let df = df! [
"name" => ["panda", "polarbear", "seahorse"],
"age" => [5, 7, 1],
]?;
let mean = df
.lazy()
.select([col("age").mean()])
.collect()?;
// Select the column as a Series, turn it into an iterator, take the first
// item (an Option, so unwrap it rather than using `?`) and extract it as f64
let mean1 = mean.column("age")?.iter().nth(0).unwrap().try_extract::<f64>()?;
// Select the column as a Series and take its sum as f64; the column has a
// single row, so the sum is that row's value
let mean2 = mean.column("age")?.sum::<f64>()?;
mean["age"].max().unwrap()
or
mean["age"].f64().unwrap().get(0).unwrap()
In Python pandas you can pass a dictionary to df.replace in order to replace every matching key with its corresponding value. I use this feature a lot to replace word abbreviations in Spanish that mess up sentence tokenizers.
Is there something similar in Julia? Or, even better, so that I (and future users) may learn from the experience, any ideas on how to implement such a function in Julia's beautiful and performant syntax?
Thank you!
Edit: Adding an example as requested
Input:
julia> DataFrames.DataFrame(Dict("A" => ["This is an ex.", "This is a samp.", "This is a samp. of an ex."]))
3×1 DataFrame
Row │ A
│ String
─────┼────────────────────
1 │ This is an ex.
2 │ This is a samp.
3 │ This is a samp. of an ex.
Desired output:
3×1 DataFrame
Row │ A
│ String
─────┼────────────────────
1 │ This is an example
2 │ This is a sample
3 │ This is a sample of an example
In Julia the function for this is also replace. It takes a collection and replaces elements in it. The simplest form is:
julia> x = ["a", "ab", "ac", "b", "bc", "bd"]
6-element Vector{String}:
"a"
"ab"
"ac"
"b"
"bc"
"bd"
julia> replace(x, "a" => "aa", "b" => "bb")
6-element Vector{String}:
"aa"
"ab"
"ac"
"bb"
"bc"
"bd"
If you have a more complex replacement pattern, you can pass a function that does the replacement:
julia> replace(x) do s
length(s) == 1 ? s^2 : s
end
6-element Vector{String}:
"aa"
"ab"
"ac"
"bb"
"bc"
"bd"
There is also replace! that does the same in-place.
Is this what you wanted?
EDIT
Replacement of substrings in a vector of strings:
julia> df = DataFrame("A" => ["This is an ex.", "This is a samp.", "This is a samp. of an ex."])
3×1 DataFrame
Row │ A
│ String
─────┼───────────────────────────
1 │ This is an ex.
2 │ This is a samp.
3 │ This is a samp. of an ex.
julia> df.A .= replace.(df.A, "ex." => "example", "samp." => "sample")
3-element Vector{String}:
"This is an example"
"This is a sample"
"This is a sample of an example"
Note two things:
You do not need to pass a Dict to the DataFrame constructor; it is enough to pass pairs.
In the assignment I used .= rather than =, which performs an in-place replacement of the values in the already existing vector (I show it for comparison with what @Sundar R proposed in a comment, an alternative that allocates a new vector; the difference probably does not matter much in your case, but I wanted to show you both syntaxes).
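For comparison with the pandas behaviour mentioned in the question, here is a minimal Python sketch of dictionary-driven substring replacement. The helper name `replace_all` and the choice to try longer keys first are assumptions made for illustration; this is not how pandas or Julia implement it.

```python
import re

def replace_all(text, mapping):
    """Replace every key of `mapping` found in `text` with its value, in one pass."""
    if not mapping:
        return text  # nothing to replace
    # Try longer keys first so overlapping abbreviations resolve predictably.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

abbreviations = {"ex.": "example", "samp.": "sample"}
sentences = ["This is an ex.", "This is a samp.", "This is a samp. of an ex."]
expanded = [replace_all(s, abbreviations) for s in sentences]
print(expanded)
```

Because all keys are matched in a single pass, text produced by one replacement is never re-scanned by another, which may differ from applying the pairs one after the other.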
I am trying to export a dataframe to xlsx
df |> XLSX.writexlsx("df.xlsx")
and am getting this error:
ERROR: Unsupported datatype String31 for writing data to Excel file. Supported data types are Union{Missing, Bool, Float64, Int64, Dates.Date, Dates.DateTime, Dates.Time, String} or XLSX.CellValue.
I cannot find anything about String31 on Google. Is there a "transform" that I need to perform on the df beforehand?
It seems that this is currently a limitation of XLSX.jl. I have opened an issue proposing to fix it.
I assume your df is not huge, in which case the following solution should work for you:
to_string(x::AbstractString) = String(x)
to_string(x::Any) = x
XLSX.writetable("df.xlsx", to_string.(df))
I guess you're reading the data in from a CSV file? CSV.jl uses fixed-length strings from the InlineStrings.jl package when it reads from a file, which can be much better for managing memory allocation, and hence for performance. For example:
julia> df = CSV.read(IOBuffer("S, I, F
hello Julian world, 2, 3.14"), DataFrame)
1×3 DataFrame
Row │ S I F
│ String31 Int64 Float64
─────┼────────────────────────────────────
1 │ hello Julian world 2 3.14
Here, the S column was assigned a String31 type since that's the smallest type from InlineStrings that can hold our string.
However, it looks like XLSX.jl doesn't know how to handle InlineString types, unfortunately.
You can specify when reading the file that you want all strings to be read as just String type and not the fixed-length InlineString variety.
julia> df = CSV.read(IOBuffer("S, I, F
hello Julian world, 2, 3.14"), DataFrame, stringtype = String)
1×3 DataFrame
Row │ S I F
│ String Int64 Float64
─────┼────────────────────────────────────
1 │ hello Julian world 2 3.14
Or, you can transform the column just before writing to the file.
julia> transform!(df, :S => ByRow(String) => :S)
1×3 DataFrame
Row │ S I F
│ String Int64 Float64
─────┼────────────────────────────────────
1 │ hello Julian world 2 3.14
If you have multiple InlineString type columns and don't want to specify them individually, you can also do:
julia> to_string(c::AbstractVector) = c;
julia> to_string(c::AbstractVector{T}) where T <: AbstractString = String.(c);
julia> mapcols!(to_string, df)
1×3 DataFrame
Row │ S I F
│ String Int64 Float64
─────┼────────────────────────────────────
1 │ hello Julian world 2 3.14
(Edited to use multiple dispatch here (based on Bogumil Kaminsky's answer) rather than typecheck in the function.)
I mean to use a struct to hold a "table":
% Sample data
% idx idxstr var1 var2 var3
% 1 i01 3.5 21.0 5
% 12 i12 6.5 1.0 3
The first row contains the field names.
Assume I created a struct
ds2 = struct( ...
'idx', { 1, 12 }, ...
'idxstr', { 'i01', 'i12' }, ...
'var1', { 3.5, 6.5 }, ...
'var2', { 21, 1 }, ...
'var3', { 5, 3 } ...
);
How can I retrieve the value for field var2, for the row corresponding to idxstr equal to 'i01'?
Notes:
I cannot ensure the length of idxstr elements will always be 3.
Ideally, I would have a method that also works for columns var2 containing strings, or any other type of variable.
PS: I think https://stackoverflow.com/a/35976320/2707864 can help.
As I mentioned in the comments, I believe you have the wrong kind of struct for this work. Instead of an array of (effectively single-row) structs, you should have a single struct with 'array' fields (numeric or cell, as appropriate).
E.g.
d = struct( ...
'idx', [1, 12 ], ...
'idxstr', {{'i01', 'i12'}}, ...
'var1', [3.5, 6.5], ...
'var2', [21, 1], ...
'var3', [5, 3] ...
);
With this structure, your problem becomes infinitely easier to deal with:
d.var2( strcmp( 'i01', d.idxstr ) )
% ans = 21
This is also far more comparable to R / pandas dataframes functionality (which are also effectively initialised via names and equally-sized arrays like this).
PS. Note carefully the syntax used for the 'idxstr' field: there is an 'outer' cell array with a single element, meaning you're only creating a single struct, rather than an array of structs. This single element happens to be a cell array of strings, where this cell array is of the same size (i.e. has the same number of 'rows') as the numeric arrays.
UPDATE
In response to the comment, adding 'rows' should be fairly straightforward. Here is one approach:
function S = addrow( S, R )
FieldNames = fieldnames( S ).'; NumFields = length( FieldNames );
for i = 1 : NumFields,
S.( FieldNames{i} ) = horzcat( S.( FieldNames{i} ), R{i} );
end
end
Then you can simply do:
d = addrow( d, {5, 'i011', 2.7, 10, 11} );
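To make the "one struct with array fields" layout concrete outside MATLAB/Octave, here is a rough Python sketch of the same idea using a dict of equal-length lists. The helper names `pick` and `add_row` are hypothetical and only mirror the lookup and `addrow` steps shown above.

```python
# Column-oriented "table": one dict whose values are equal-length lists.
d = {
    "idx": [1, 12],
    "idxstr": ["i01", "i12"],
    "var1": [3.5, 6.5],
    "var2": [21.0, 1.0],
    "var3": [5, 3],
}

def pick(table, key_col, key, value_col):
    """Return the values of value_col in rows where key_col equals key."""
    return [v for k, v in zip(table[key_col], table[value_col]) if k == key]

def add_row(table, row):
    """Append one row, given in the table's column order, to every column."""
    if len(row) != len(table):
        raise ValueError("row length must match the number of columns")
    for column, value in zip(table.values(), row):
        column.append(value)
    return table

print(pick(d, "idxstr", "i01", "var2"))
add_row(d, [5, "i011", 2.7, 10, 11])
print(pick(d, "idxstr", "i011", "var3"))
```

The comparison in `pick` uses plain equality, so it works for keys of any length and for non-string key columns such as `idx`.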
Assuming that idxstr can be more than 3 characters (there is a shorter version if it's always 3 chars), this is what I came up with (tested on MATLAB):
logical_index=~cellfun(@isempty,strfind({ds2(:).idxstr},'i01'))
You can then access the variables as:
ds2(~cellfun(@isempty,strfind({ds2(:).idxstr},'i01'))).var2;
% using above variable
ds2(logical_index).var2;
You can understand now why MATLAB introduced tables hehe.
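You could try something like the following, using strcmp (indexing directly into a bracketed expression like this works in Octave, but not in MATLAB):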
>> [ds2.var2](strcmp('i01',{ds2.idxstr}))
ans = 21
I put together a function
function el = struct_pick(s, cdata, cnames, rname)
% Pick an element from a struct by column and row name
coldata = vertcat(s.(cdata));
colnames = mat2cell(vertcat(s.(cnames)), ones(1, length(s)));
% This assumes rname is a string
flt = strcmp(colnames, rname);
el = coldata(logical(flt));
endfunction
which is called with
% Pick an element by column and row name
cdata = 'var3';
cnames = 'idxstr';
rname = 'i01';
elem = struct_pick(ds2, cdata, cnames, rname);
and it seems to do the job.
I don't know if it is an unnecessarily contrived way of doing it.
Still have to deal with the possibility that the row names are not strings, as with
cnames = 'idx';
rname = 1;
EDIT: If the strings in idxstr are not all of the same length, this throws the error: vertcat: cat: dimension mismatch.
The answer by Ander Biguri can handle this case.
Is there any way to get the function below to return a consistent type? I'm doing some work with Julia GLM (love it). I wrote a function that creates all of the possible regression combinations for a dataset. However, my current method of creating a @formula returns a different type for every different length of rhs.
using GLM
function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
ts = term.((1, rhs...))
term(lhs) ~ sum(ts)
end
Using @code_warntype for a simple example returns the following
julia> @code_warntype compose(:y, [:x])
Variables
#self#::Core.Compiler.Const(compose, false)
lhs::Symbol
rhs::Array{Symbol,1}
ts::Any
Body::FormulaTerm{Term,_A} where _A
1 ─ %1 = Core.tuple(1)::Core.Compiler.Const((1,), false)
│ %2 = Core._apply(Core.tuple, %1, rhs)::Core.Compiler.PartialStruct(Tuple{Int64,Vararg{Symbol,N} where N}, Any[Core.Compiler.Const(1, false), Vararg{Symbol,N} where N])
│ %3 = Base.broadcasted(Main.term, %2)::Base.Broadcast.Broadcasted{Base.Broadcast.Style{Tuple},Nothing,typeof(term),_A} where _A<:Tuple
│ (ts = Base.materialize(%3))
│ %5 = Main.term(lhs)::Term
│ %6 = Main.sum(ts)::Any
│ %7 = (%5 ~ %6)::FormulaTerm{Term,_A} where _A
└── return %7
And checking the return type of a few different inputs:
julia> compose(:y, [:x]) |> typeof
FormulaTerm{Term,Tuple{ConstantTerm{Int64},Term}}
julia> compose(:y, [:x1, :x2]) |> typeof
FormulaTerm{Term,Tuple{ConstantTerm{Int64},Term,Term}}
We see that as the length of rhs changes, the return type changes.
Can I change my compose function so that it always returns the same type? This isn't really a big issue. Compiling for each new number of regressors only takes ~70ms. This is really more of a "how can I improve my Julia skills?"
I do not think you can avoid type instability here, as ~ expects the RHS to be a Term or a Tuple of Terms.
However, most of the compilation cost you are paying is in term.((1, rhs...)), since it invokes broadcasting, which is expensive to compile. Here is how you can do it in a cheaper way:
function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
term(lhs) ~ ntuple(i -> i <= length(rhs) ? term(rhs[i]) : term(1) , length(rhs)+1)
end
or (this is a bit slower but more like your original code):
function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
term(lhs) ~ map(term, (1, rhs...))
end
Finally, if you are doing such computations, you could drop the formula interface and pass matrices directly to lm or glm as the RHS, which should avoid the extra compilation cost, e.g.:
julia> y = rand(10);
julia> x = rand(10, 2);
julia> @time lm(x,y);
0.000048 seconds (18 allocations: 1.688 KiB)
julia> x = rand(10, 3);
julia> @time lm(x,y);
0.000038 seconds (18 allocations: 2.016 KiB)
julia> y = rand(100);
julia> x = rand(100, 50);
julia> @time lm(x,y);
0.000263 seconds (22 allocations: 121.172 KiB)
I have a tree structure in my ArangoDB graph database, like the one below.
The node with id 2699394 is the parent node of this tree, and each node has an attribute named x. I want the sum of x over all descendants of parent node 2699394, excluding its own x.
For example, suppose that
2699399 x value is = 5,
2699408 x value is = 3,
2699428 x value is = 2,
2699418 x value is = 5,
then parent node 2699394 sum of x should be = 5 + 3 + 2 + 5
= 15
So the answer is 15. Can anybody give me an AQL query for this calculation in ArangoDB?
To find the number of descendants of a particular node, I have used the query below:
FOR v, e, p in 1..1000 OUTBOUND 'Person/1648954'
GRAPH 'Appes'
RETURN v.id
Thanks in Advance.
Mayank
Assuming that children are linked to their parents, the data could be visualized like this:
nodes/2699394 SUM of children?
↑
nodes/2699399 {x: 5}
↑
nodes/2699408 {x: 3}
↑
nodes/2699428 {x: 2}
↑
nodes/2699418 {x: 5}
To walk the chain of children, we need to traverse in the INBOUND direction (or OUTBOUND if parent nodes point to children):
FOR v IN 1..10 INBOUND "nodes/2699394" relations
RETURN v
In this example, an anonymous graph is used by specifying an edge collection relations. You can also use a named graph, like GRAPH "yourGraph".
Starting at nodes/2699394, the edges down to nodes/2699418 are traversed and every node on the way is returned unchanged so far.
Since we are only interested in the x attribute, we can change that to only return that attribute: RETURN v.x - which will return [ 5, 3, 2, 5 ]. Unless we say IN 0..10, the start vertex will not be included.
Inside the FOR loop, we don't have access to all the x values, but only one at a time. We can't do something like RETURN SUM(v.x) here. Instead, we need to assign the result of the traversal to a variable, which makes it a sub-query. We can then add up all the numbers and return the resulting value:
LET x = (
FOR v IN 1..10 INBOUND "nodes/2699394" relations
RETURN v.x
)
RETURN SUM(x) // [ 15 ]
If you want to return the start node with a computed x attribute, you may do the following:
LET doc = DOCUMENT("nodes/2699394")
LET x = (
FOR v IN 1..10 INBOUND doc relations
RETURN v.x
)
RETURN MERGE( doc, { x: SUM(x) } )
The result will be like:
[
{
"_id": "nodes/2699394",
"_key": "2699394",
"_rev": "2699394",
"x": 15
}
]
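As a cross-check of what the traversal and SUM() compute, here is a small Python sketch that walks the same example chain in memory. The `children` and `x` dictionaries are a hypothetical copy of the sample data, and `descendant_sum` is an illustrative helper, not an ArangoDB API; it assumes the data is a tree (no cycles).

```python
# Hypothetical in-memory copy of the example chain: parent -> list of children.
children = {
    "2699394": ["2699399"],
    "2699399": ["2699408"],
    "2699408": ["2699428"],
    "2699428": ["2699418"],
}
x = {"2699399": 5, "2699408": 3, "2699428": 2, "2699418": 5}

def descendant_sum(start, children, values, max_depth=10):
    """Sum values over all descendants of start, excluding start itself."""
    total = 0
    # Depth 1 corresponds to the direct children, like the lower bound in 1..10.
    stack = [(child, 1) for child in children.get(start, [])]
    while stack:
        node, depth = stack.pop()
        total += values.get(node, 0)
        if depth < max_depth:
            stack.extend((c, depth + 1) for c in children.get(node, []))
    return total

print(descendant_sum("2699394", children, x))
```

Starting the stack at the children rather than at the start node plays the role of the 1..10 depth range: the start vertex is never added to the total.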