I use Rust polars for analyzing time-series data.
The CSV data contains output from many sensors; some sensors report mostly 0 and occasionally float values.
The default schema inference treats such columns as i64, but I want them treated as f64.
The solutions I currently know of are listed below, but each has problems.
with_infer_schema_length(None)
This scans all values before deciding the datatype, but it becomes too slow.
with_schema() or with_dtype_overwrite()
These functions let me specify column dtypes, but I have to list the columns myself (e.g. via get_column_names()).
If string columns are mixed in, I have to exclude them manually.
Do you know of any way to force Polars to use f64 for numeric columns?
I want to keep string columns as they are and convert numeric (i64, i32, ...) columns to f64.
Thank you.
I checked the official documentation and found with_infer_schema_length(None), with_schema(), and with_dtype_overwrite().
However, none of these is quite enough for this problem.
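For illustration, here is the kind of two-pass approach I have in mind, sketched in Python Polars for brevity; the override mapping plays the same role as with_dtype_overwrite() above, and the file name, sample size, and keyword name (schema_overrides in newer releases, dtypes in older ones) are assumptions:
import polars as pl

# Infer a schema from a small sample only, then promote every integer column
# to Float64 and pass that mapping as an override for the real read.
sample = pl.read_csv("sensors.csv", n_rows=100)
overrides = {
    name: pl.Float64
    for name, dtype in sample.schema.items()
    if dtype in (pl.Int8, pl.Int16, pl.Int32, pl.Int64)
}
df = pl.read_csv("sensors.csv", schema_overrides=overrides)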
What is the best method to get simple descriptive statistics for any column of a dataframe (or list or array), nested or not? I am after a sort of advanced df.describe() that also covers nested structures with numerical values.
In my case, I have a dataframe with many columns. Some columns hold a numerical list in each row (a time-series structure), which is a nested structure.
By nested structures I mean:
list of arrays,
array of arrays,
series of lists,
dataframe with nested lists of numerical values in some columns (my case)
How can I get simple descriptive statistics from any level of the nested structure in one go?
Asking for
df.describe()
will give me just the statistics of the numerical columns, but not those of the columns that include a list with numerical values.
I cannot get the statistics just by applying
from scipy import stats
stats.describe(arr)
either, since that is the solution from "How can I get descriptive statistics of a NumPy array?" for a non-nested array.
My first approach would be to get the statistics of each numerical list first, and then take the statistics of those again; e.g. the mean of the means or the mean of the variances would then give me some information as well.
In my first approach here, I convert a specific column that has a nested list of numerical values to a series of nested lists first. Nested arrays or lists might need a small adjustment, not tested.
NESTEDSTRUCTURE = df['nestedColumn']
# describe() each row's list, then describe() each of the six DescribeResult
# fields across rows (the outer x indexes the fields of DescribeResult)
[stats.describe([a[x] for a in [stats.describe(x) for x in NESTEDSTRUCTURE]]) for x in range(6)]
gives you the stats of the stats for a nested structure column. If you want the mean of all means of a column, you can use
stats.describe([a[2] for a in [stats.describe(x) for x in NESTEDSTRUCTURE]])
as position 2 stands for "mean" in
DescribeResult(nobs=, minmax=(, ), mean=, variance=, skewness=, kurtosis=)
I expect that there is a better descriptive-statistics approach that automatically understands nested structures with numerical values; this is just a workaround.
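For comparison, here is a pandas-only sketch of the same idea; the column name nestedColumn is reused from above, and this is still a workaround rather than a built-in nested describe:
import pandas as pd

# Stats over every value in every list: explode the lists into one long
# Series and describe it.
flat = df['nestedColumn'].explode().astype(float)
print(flat.describe())

# "Stats of the stats": describe each row's list first, then describe the
# resulting per-row statistics column by column.
per_row = df['nestedColumn'].apply(lambda values: pd.Series(values).describe())
print(per_row.describe())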
Conceptually, the index and column names of a pandas DataFrame seem to me equivalent to its data type. Is there a sensible way to express this using type hints, or is this a matter for docstrings?
I found this library: https://pypi.org/project/dataenforce/
Doesn't seem to support anything with respect to names and dtypes of index columns.
I haven't used it, but it looks interesting.
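For what it's worth, the docstring route can be as simple as the following sketch; the function, index, and column names are made up for illustration:
import pandas as pd

def monthly_mean(prices: pd.DataFrame) -> pd.Series:
    """Average closing price per month.

    Parameters
    ----------
    prices : pd.DataFrame
        Index: DatetimeIndex named "date".
        Columns: "ticker" (str), "close" (float64).

    Returns
    -------
    pd.Series
        One float64 value per month.
    """
    return prices["close"].resample("M").mean()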
I have a Parquet file with 400+ columns. When I read it, the default datatype attached to a lot of the columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter for spark.read.parquet similar to
inferSchema=True  # present for spark.read.csv
I tried setting
mergeSchema=True  # but it doesn't improve the results
To manually cast the columns to float, I used
from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but it converts all the genuinely string column values to null. I can't wrap it in a try/except block, as it doesn't throw any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
You can use the same logic Spark itself uses: define a preferred type hierarchy and attempt to cast, until you find the most selective type that parses all values in the column.
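A sketch of that idea in PySpark; the float target and the "no non-null value becomes null" check are one possible heuristic, not Spark's internal logic:
from pyspark.sql import functions as F

def cast_numeric_string_columns(df, target_type="float"):
    # Cast a string column only if no non-null value would silently turn
    # into null after the cast.
    for name, dtype in df.dtypes:
        if dtype != "string":
            continue
        casted = F.col(name).cast(target_type)
        failures = df.filter(F.col(name).isNotNull() & casted.isNull()).limit(1).count()
        if failures == 0:
            df = df.withColumn(name, casted)
    return df

df_temp = cast_numeric_string_columns(df_temp)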
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
There's no easy way currently. There is an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; something similar could be created for PySpark.
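Until something like that exists for PySpark, one workaround is to let inferSchema read the column as an integer and convert it afterwards; the column name and the yyyyMMdd pattern below are assumptions:
from pyspark.sql import functions as F

# Read normally, then turn the integer column into a proper date column.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = df.withColumn(
    "event_date",
    F.to_date(F.col("event_date").cast("string"), "yyyyMMdd"),
)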
I have a DataFrame df with a column column and I would like to convert column into a vector (e.g. a DenseVector) so that I can use it in vector and matrix products.
Beware: I don't need a column of vectors; I need a vector object.
How to do this?
I found the VectorAssembler transformer (link), but it doesn't help me, as it converts some DataFrame columns into a vector column, which is still a DataFrame column; my desired output should instead be a vector.
About the goal of this question: why am I trying to convert a DF column into a vector? Assume I have a DF with a numerical column and I need to compute a product between a matrix and this column. How can I achieve this? (The same could hold for a DF numerical row.) Any alternative approach is welcome.
You can do it like this:
from pyspark.ml.linalg import DenseVector
DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())
but it doesn't make sense in any practical scenario.
Spark vectors are not distributed, so they are only applicable if the data fits in the memory of one (driver) node. If that is the case, you wouldn't use a Spark DataFrame for processing anyway.
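If the data really is small enough to collect, the rest of the computation is plain local linear algebra; the matrix below is random and its shape is made up for illustration:
import numpy as np
from pyspark.ml.linalg import DenseVector

# Collect the column to the driver and wrap it in a local DenseVector.
values = df.select("column").rdd.map(lambda row: row[0]).collect()
v = DenseVector(values)

# Any matrix with a matching number of columns works.
M = np.random.rand(3, len(values))
product = M @ v.toArray()   # ordinary local matrix-vector product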
I am using this as a resource to get me started: http://www.pantz.org/software/sqlite/sqlite_commands_and_general_usage.html
Currently I am working on creating an AIR program that uses the built-in SQLite database. I could be considered a complete noob at writing SQL queries.
table column types
I have a rather large Excel file (14K rows) that I have exported to a CSV file. It has 65 columns of varying data types (mostly ints, floats, and short strings; maybe a few bools). I have no idea of the proper way to import it so as to preserve the column structure, nor do I know the best data types to choose per db column. I could use some input on this.
table creation utils
Is there a util that can read an XLS file and, based on the column headers, generate a table-creation statement to ease the pain of writing the query manually? I saw this post, but it seems geared towards a preexisting CSV file and makes use of Python (something I am also a noob at).
Thank you in advance for your time.
J
SQLite3's column types basically boil down to:
TEXT
NUMERIC (REAL, FLOAT)
INTEGER (the various lengths of integer; but INT will normally do)
BLOB (binary objects)
Generally in a CSV file you will encounter strings (TEXT), decimal numbers (FLOAT), and integers (INT). If performance isn't critical, those are pretty much the only three column types you need. (CHAR(80) is smaller on disk than TEXT but for a few thousand rows it's not so much of an issue.)
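On the table-creation part of the question, a small Python sketch that guesses those three types from the first data row of the exported CSV might look like this; the file and table names are examples, and the guess is only as reliable as that first row:
import csv
import sqlite3

def create_table_from_csv(csv_path, table_name, connection):
    # Build and run a CREATE TABLE statement from the CSV header row,
    # guessing INTEGER / FLOAT / TEXT from the first data row.
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        first_row = next(reader)

    def guess_type(value):
        for caster, sql_type in ((int, "INTEGER"), (float, "FLOAT")):
            try:
                caster(value)
                return sql_type
            except ValueError:
                pass
        return "TEXT"

    columns = ", ".join(
        '"%s" %s' % (name, guess_type(value))
        for name, value in zip(header, first_row)
    )
    connection.execute('CREATE TABLE "%s" (%s)' % (table_name, columns))

con = sqlite3.connect("mydata.db")
create_table_from_csv("export.csv", "measurements", con)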
As far as putting data into the columns is concerned, SQLite3 uses type coercion to convert the input data type to the column type wherever the conversion makes sense. So all you have to do is specify the correct column type, and SQLite will take care of storing it in the correct way.
For example, the number -1230.00, the string "-1230.00", and the string "-1.23e3" will all coerce to the number -1230.0 when stored in a FLOAT column.
Note that if SQLite3 can't apply a meaningful type conversion, it will just store the original data without attempting to convert it at all. SQLite3 is quite happy to insert "Hello World!" into a FLOAT column. This is usually a Bad Thing.
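A quick way to see that coercion for yourself, using Python's built-in sqlite3 module and an in-memory database:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (reading FLOAT)")

# A real number, two numeric-looking strings, and one genuine string.
con.executemany(
    "INSERT INTO t VALUES (?)",
    [(-1230.00,), ("-1230.00",), ("-1.23e3",), ("Hello World!",)],
)

for value, storage in con.execute("SELECT reading, typeof(reading) FROM t"):
    print(value, storage)
# The three numeric inputs come back as -1230.0 with storage class 'real';
# "Hello World!" is stored unchanged with storage class 'text'.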
See the SQLite3 documentation on column types and conversion for gems such as:
Type Affinity
In order to maximize compatibility between SQLite and other database engines, SQLite supports the concept of "type affinity" on columns. The type affinity of a column is the recommended type for data stored in that column. The important idea here is that the type is recommended, not required. Any column can still store any type of data. It is just that some columns, given the choice, will prefer to use one storage class over another. The preferred storage class for a column is called its "affinity".