Error when going from dataframe to excel file in Julia - excel

I am trying to export a dataframe to xlsx
df |> XLSX.writexlsx("df.xlsx")
and am getting this error:
ERROR: Unsupported datatype String31 for writing data to Excel file. Supported data types are Union{Missing, Bool, Float64, Int64, Dates.Date, Dates.DateTime, Dates.Time, String} or XLSX.CellValue.
I cannot find anything about String31 on Google. Is there a "transform" that I need to perform on the df beforehand?

It seems that this is currently a limitation of XLSX.jl. I have opened an issue proposing to fix it.
I assume your df is not huge, in which case the following solution should work for you:
to_string(x::AbstractString) = String(x)
to_string(x::Any) = x
XLSX.writetable("df.xlsx", to_string.(df))

I guess you're reading the data in from a CSV file? CSV.jl uses fixed-length strings from the InlineStrings.jl package when it reads from a file, which can be much better for managing memory allocation, and hence for performance. For example:
julia> df = CSV.read(IOBuffer("S, I, F
hello Julian world, 2, 3.14"), DataFrame)
1×3 DataFrame
Row │ S I F
│ String31 Int64 Float64
─────┼────────────────────────────────────
1 │ hello Julian world 2 3.14
Here, the S column was assigned a String31 type since that's the smallest type from InlineStrings that can hold our string.
However, it looks like XLSX.jl doesn't know how to handle InlineString types, unfortunately.
You can specify when reading the file that you want all strings to be read as just String type and not the fixed-length InlineString variety.
julia> df = CSV.read(IOBuffer("S, I, F
hello Julian world, 2, 3.14"), DataFrame, stringtype = String)
1×3 DataFrame
Row │ S I F
│ String Int64 Float64
─────┼────────────────────────────────────
1 │ hello Julian world 2 3.14
Or, you can transform the column just before writing to the file.
julia> transform!(df, :S => ByRow(String) => :S)
1×3 DataFrame
Row │ S I F
│ String Int64 Float64
─────┼────────────────────────────────────
1 │ hello Julian world 2 3.14
If you have multiple InlineString type columns and don't want to specify them individually, you can also do:
julia> to_string(c::AbstractVector) = c;
julia> to_string(c::AbstractVector{T}) where T <: AbstractString = String.(c);
julia> mapcols!(to_string, df)
1×3 DataFrame
Row │ S I F
│ String Int64 Float64
─────┼────────────────────────────────────
1 │ hello Julian world 2 3.14
(Edited to use multiple dispatch here (based on Bogumil Kaminsky's answer) rather than typecheck in the function.)

Extracting a Rust Polars dataframe value as a scalar value

I have the following code to find the mean of the ages in the dataframe.
let df = df! [
"name" => ["panda", "polarbear", "seahorse"],
"age" => [5, 7, 1],
].unwrap();
let mean = df
.lazy()
.select([col("age").mean()])
.collect().unwrap();
println!("{:?}", mean);
After finding the mean, I want to extract the value as an f64.
┌──────────┐
│ age │
│ --- │
│ f64 │ -----> how to transform into a single f64 of value 4.333333?
╞══════════╡
│ 4.333333 │
└──────────┘
Normally, I would do something like df[0,0] to extract the only value. However, as Polars is not a big proponent of indexing, how would one do it using Rust Polars?
Ok guys I found a couple of ways to do this. Although, I'm not sure if they are the most efficient.
let df = df! [
"name" => ["panda", "polarbear", "seahorse"],
"age" => [5, 7, 1],
]?;
let mean = df
.lazy()
.select([col("age").mean()])
.collect()?;
// Select the column as Series, turn into an iterator, select the first
// item and cast it into an f64
let mean1 = mean.column("age")?.iter().nth(0)?.try_extract::<f64>()?;
// Select the column as Series and calculate the sum as f64
let mean2 = mean.column("age")?.sum::<f64>()?;
mean["age"].max().unwrap()
or
mean["age"].f64().unwrap().get(0).unwrap()

Replace multiple strings with multiple values in Julia

In Python pandas you can pass a dictionary to df.replace in order to replace every matching key with its corresponding value. I use this feature a lot to replace word abbreviations in Spanish that mess up sentence tokenizers.
Is there something similar in Julia? Or even better, so that I (and future users) may learn from the experience, any ideas on how to implement such a function in Julia's beautiful and performant syntax?
Thank you!
Edit: Adding an example as requested
Input:
julia> DataFrames.DataFrame(Dict("A" => ["This is an ex.", "This is a samp.", "This is a samp. of an ex."]))
3×1 DataFrame
Row │ A
│ String
─────┼────────────────────
1 │ This is an ex.
2 │ This is a samp.
3 │ This is a samp. of an ex.
Desired output:
3×1 DataFrame
Row │ A
│ String
─────┼────────────────────
1 │ This is an example
2 │ This is a sample
3 │ This is a sample of an example
In Julia the function for this is also replace. It takes a collection and replaces elements in it. The simplest form is:
julia> x = ["a", "ab", "ac", "b", "bc", "bd"]
6-element Vector{String}:
"a"
"ab"
"ac"
"b"
"bc"
"bd"
julia> replace(x, "a" => "aa", "b" => "bb")
6-element Vector{String}:
"aa"
"ab"
"ac"
"bb"
"bc"
"bd"
If you have a more complex replacement pattern, you can pass a function that does the replacement:
julia> replace(x) do s
length(s) == 1 ? s^2 : s
end
6-element Vector{String}:
"aa"
"ab"
"ac"
"bb"
"bc"
"bd"
There is also replace! that does the same in-place.
Is this what you wanted?
EDIT
Replacement of substrings in a vector of strings:
julia> df = DataFrame("A" => ["This is an ex.", "This is a samp.", "This is a samp. of an ex."])
3×1 DataFrame
Row │ A
│ String
─────┼───────────────────────────
1 │ This is an ex.
2 │ This is a samp.
3 │ This is a samp. of an ex.
julia> df.A .= replace.(df.A, "ex." => "example", "samp." => "sample")
3-element Vector{String}:
"This is an example"
"This is a sample"
"This is a sample of an example"
Note two things:
You do not need to pass a Dict to the DataFrame constructor. It is enough to just pass pairs.
In the assignment I used .= rather than =, which performs an in-place replacement of the values in the already existing vector (I show it for comparison with what @Sundar R proposed in a comment, an alternative that allocates a new vector; the difference probably does not matter much in your case, but I wanted to show you both syntaxes).
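For comparison with the dictionary-driven behaviour the question describes, here is a minimal sketch of the same substring substitution in plain Python. It does not use pandas; the `abbreviations` mapping and the variable names are illustrative assumptions. A single compiled pattern is used so that each match is replaced exactly once, in one pass over each string.

```python
import re

# Hypothetical mapping from abbreviation to full word (illustrative only).
abbreviations = {"ex.": "example", "samp.": "sample"}

sentences = [
    "This is an ex.",
    "This is a samp.",
    "This is a samp. of an ex.",
]

# Build one alternation pattern; re.escape keeps the "." literal.
pattern = re.compile("|".join(re.escape(key) for key in abbreviations))

# Look each match up in the mapping and substitute its expansion.
expanded = [pattern.sub(lambda m: abbreviations[m.group(0)], s) for s in sentences]

for line in expanded:
    print(line)
```

Because the pattern is applied in a single pass, text produced by one replacement is never scanned again for further abbreviations.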

Unable to convert pandas str column with .0 to int

Let's say I have a dataframe that looks like this...
df = pd.DataFrame({ 'foo': ['1', '2', '3']})
foo
1
2
3
And I want to convert the column to type int. I can do that easily with...
df = df.astype({ 'foo': 'int' })
But if my dataframe looks like this...
df = pd.DataFrame({ 'foo': ['1.0', '2.0', '3.0']})
foo
1.0
2.0
3.0
And I try to convert it from object to int then I get this error
ValueError: invalid literal for int() with base 10: '1.0'
Why doesn't this work? How would I go about converting this to an int properly?
Just do a two step conversion, string to float then float to int.
>>> df.astype({ 'foo': 'float' }).astype({ 'foo': 'int' })
foo
0 1
1 2
2 3
It works with or without the decimal point.
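The two-step conversion can be illustrated without pandas. This is a minimal sketch, assuming (as the error message suggests) that the direct conversion ends up parsing each string the way Python's built-in int() does: int() rejects a string containing a decimal point, while float() accepts it, and the resulting float can then be converted to an int.

```python
values = ["1.0", "2.0", "3.0"]

# Direct conversion: int() does not accept a decimal point in a string.
try:
    int(values[0])
    direct_ok = True
except ValueError:
    direct_ok = False

# Two-step conversion: string -> float -> int.
two_step = [int(float(v)) for v in values]

print(direct_ok)
print(two_step)
```

Note that int(float(v)) truncates towards zero, so this is only lossless when the fractional part is zero, as it is here.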
You can use the downcast option of the to_numeric method.
df['foo'] = pd.to_numeric(df['foo'], downcast='integer')

How can I get a consistent return type from this function?

Is there any way to get the function below to return a consistent type? I'm doing some work with Julia GLM (love it). I wrote a function that creates all of the possible regression combinations for a dataset. However, my current method of creating a @formula returns a different type for every different length of rhs.
using GLM
function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
ts = term.((1, rhs...))
term(lhs) ~ sum(ts)
end
Using @code_warntype for a simple example returns the following
julia> @code_warntype compose(:y, [:x])
Variables
#self#::Core.Compiler.Const(compose, false)
lhs::Symbol
rhs::Array{Symbol,1}
ts::Any
Body::FormulaTerm{Term,_A} where _A
1 ─ %1 = Core.tuple(1)::Core.Compiler.Const((1,), false)
│ %2 = Core._apply(Core.tuple, %1, rhs)::Core.Compiler.PartialStruct(Tuple{Int64,Vararg{Symbol,N} where N}, Any[Core.Compiler.Const(1, false), Vararg{Symbol,N} where N])
│ %3 = Base.broadcasted(Main.term, %2)::Base.Broadcast.Broadcasted{Base.Broadcast.Style{Tuple},Nothing,typeof(term),_A} where _A<:Tuple
│ (ts = Base.materialize(%3))
│ %5 = Main.term(lhs)::Term
│ %6 = Main.sum(ts)::Any
│ %7 = (%5 ~ %6)::FormulaTerm{Term,_A} where _A
└── return %7
And checking the return type of a few different inputs:
julia> compose(:y, [:x]) |> typeof
FormulaTerm{Term,Tuple{ConstantTerm{Int64},Term}}
julia> compose(:y, [:x1, :x2]) |> typeof
FormulaTerm{Term,Tuple{ConstantTerm{Int64},Term,Term}}
We see that as the length of rhs changes, the return type changes.
Can I change my compose function so that it always returns the same type? This isn't really a big issue. Compiling for each new number of regressors only takes ~70ms. This is really more of a "how can I improve my Julia skills?"
I do not think you can avoid type instability here, as ~ expects the RHS to be a Term or a Tuple of Terms.
However, most of the compilation cost you are paying is in term.((1, rhs...)), since it invokes broadcasting, which is expensive to compile. Here is how you can do it more cheaply:
function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
term(lhs) ~ ntuple(i -> i <= length(rhs) ? term(rhs[i]) : term(1) , length(rhs)+1)
end
or (this is a bit slower but more like your original code):
function compose(lhs::Symbol, rhs::AbstractVector{Symbol})
term(lhs) ~ map(term, (1, rhs...))
end
Finally, if you are doing such computations, you could drop the formula interface and pass matrices directly to lm or glm as the RHS, which should avoid the extra compilation cost, e.g.:
julia> y = rand(10);
julia> x = rand(10, 2);
julia> @time lm(x,y);
0.000048 seconds (18 allocations: 1.688 KiB)
julia> x = rand(10, 3);
julia> @time lm(x,y);
0.000038 seconds (18 allocations: 2.016 KiB)
julia> y = rand(100);
julia> x = rand(100, 50);
julia> @time lm(x,y);
0.000263 seconds (22 allocations: 121.172 KiB)

Problem converting python list to np.array. Process is dropping string type data

My goal is to convert this list of strings to a Numpy Array.
I want to convert the first 2 columns to numerical data (integer)
import numpy as np
list1 = [['380850', '625105', 'Dota 2'],
['354804', '846193', "PLAYERUNKNOWN'S BATTLEGROUNDS"],
['204354', '467109', 'Counter-Strike: Global Offensive']
]
dt = np.dtype('i,i,U')
cast_array = np.array([tuple(row) for row in list1], dtype=dt)
print(cast_array)
The result is ...
[OUT] [(380850, 625105, '') (354804, 846193, '') (204354, 467109, '')]
I am losing the string data. I am interested in
Understanding why the string data is getting dropped
Finding any solution that converts the first 2 columns to integer type in a numpy array
This answer gave me the approach but doesn't seem to work for strings
Thanks to user: 9769953's comment above, this is the solution.
# When specifying a string field you need to give its length (here derived from the longest string in the list)
dtypestr = 'int, int, U' + str(max(len(row[2]) for row in list1))
cast_array = np.array([tuple(row) for row in list1], dtype=dtypestr)
print(cast_array)
The simplest high-level way to do this is to use pandas, as suggested in the comments; it silently handles the tricky parts:
In [64]: df=pd.DataFrame(list1)
In [65]: df2=df.apply(pd.to_numeric,errors='ignore')
In [66]: df2
Out[66]:
0 1 2
0 380850 625105 Dota 2
1 354804 846193 PLAYERUNKNOWN'S BATTLEGROUNDS
2 204354 467109 Counter-Strike: Global Offensive
In [67]: df2.dtypes
Out[67]:
0 int64
1 int64
2 object
dtype: object
df2.iloc[:,:2].values will be the numpy array. You can use all numpy accelerations on this part.
Your dtype is not what you expect it to be - you're running into https://github.com/numpy/numpy/issues/8969:
>>> dt = np.dtype('i,i,U')
>>> dt
dtype([('f0', '<i4'), ('f1', '<i4'), ('f2', '<U')])
>>> dt['f2'].itemsize
0 # 0-length strings!
You need to either specify a maximum number of characters
>>> dt = np.dtype('i,i,16U')
Or use an object type to store variable length strings:
>>> dt = np.dtype('i,i,O')
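Putting the fixed-width option together with the data from the question gives something like the following. This is a sketch under a few assumptions: the field names f0, f1, f2 simply mirror the defaults shown above, the string width is taken from the longest title in the data, and the numeric columns are converted with int() up front rather than relying on NumPy to parse digit strings.

```python
import numpy as np

list1 = [
    ["380850", "625105", "Dota 2"],
    ["354804", "846193", "PLAYERUNKNOWN'S BATTLEGROUNDS"],
    ["204354", "467109", "Counter-Strike: Global Offensive"],
]

# Width of the string field: the longest title in the data.
width = max(len(row[2]) for row in list1)

# Explicit field list; names mirror the default f0/f1/f2 (illustrative).
dt = np.dtype([("f0", "i4"), ("f1", "i4"), ("f2", f"U{width}")])

# Convert the two numeric columns explicitly, keep the title as a string.
rows = [(int(a), int(b), title) for a, b, title in list1]
cast_array = np.array(rows, dtype=dt)

print(cast_array["f0"])
print(cast_array["f2"])
```

With a non-zero width the string field has a non-zero itemsize, so the titles are kept instead of being truncated to empty strings.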
