How to add Column names in a Polars DataFrame while using CsvReader - rust

I can read a CSV file that does not have column headers, using the following polars code in Rust:
use polars::prelude::*;

fn read_wine_data() -> Result<DataFrame> {
    let file = "datastore/wine.data";
    CsvReader::from_path(file)?
        .has_header(false)
        .finish()
}

fn main() {
    let df = read_wine_data();
    match df {
        Ok(content) => println!("{:?}", content.head(Some(10))),
        Err(error) => panic!("Problem reading file: {:?}", error),
    }
}
Now I want to add column names to the DataFrame, either while reading or after reading. How can I add the column names? Here is the vector of column names:
let column_names = vec![
    "Class label", "Alcohol", "Malic acid", "Ash",
    "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids",
    "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue",
    "OD280/OD315 of diluted wines", "Proline",
];
How can I add these names to the DataFrame? The data can be downloaded with the following command:
wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

This seemed to work: create a Schema object and pass it in with the with_schema method on the CsvReader:
use polars::prelude::*;
use polars::datatypes::DataType;

fn read_wine_data() -> Result<DataFrame> {
    let file = "datastore/wine.data";
    let mut schema: Schema = Schema::new();
    schema.with_column("wine".to_string(), DataType::Float32);
    CsvReader::from_path(file)?
        .has_header(false)
        .with_schema(&schema)
        .finish()
}

fn main() {
    let df = read_wine_data();
    match df {
        Ok(content) => println!("{:?}", content.head(Some(10))),
        Err(error) => panic!("Problem reading file: {:?}", error),
    }
}
Granted I don't know what the column names should be, but this is the output I got when adding the one column:
shape: (10, 1)
┌──────┐
│ wine │
│ --- │
│ f32 │
╞══════╡
│ 1.0 │
├╌╌╌╌╌╌┤
│ 1.0 │
├╌╌╌╌╌╌┤
│ 1.0 │
├╌╌╌╌╌╌┤
│ 1.0 │
├╌╌╌╌╌╌┤
│ ... │
├╌╌╌╌╌╌┤
│ 1.0 │
├╌╌╌╌╌╌┤
│ 1.0 │
├╌╌╌╌╌╌┤
│ 1.0 │
├╌╌╌╌╌╌┤
│ 1.0 │
└──────┘

Here is the full solution that works for me:
use polars::prelude::*;
use polars::prelude::DataType::{Float64, Int64};
use std::path::PathBuf;

fn read_csv_into_df(path: PathBuf) -> Result<DataFrame> {
    // One Field per CSV column, in file order; wine.data has 14 columns.
    let schema = Schema::from(vec![
        Field::new("class_label", Int64),
        Field::new("alcohol", Float64),
        Field::new("malic_acid", Float64),
        Field::new("ash", Float64),
        Field::new("alcalinity_of_ash", Float64),
        Field::new("magnesium", Float64),
        Field::new("total_phenols", Float64),
        Field::new("flavanoids", Float64),
        Field::new("nonflavanoid_phenols", Float64),
        Field::new("proanthocyanins", Float64),
        Field::new("color_intensity", Float64),
        Field::new("hue", Float64),
        Field::new("od280/od315_of_diluted_wines", Float64),
        Field::new("proline", Float64),
    ]);
    CsvReader::from_path(path)?
        .has_header(false)
        .with_schema(&schema)
        .finish()
}
I used a Field and a data type for each column to create a Schema, then passed that schema to CsvReader to read the data.

Related

Handling list of maps in for loop in terraform

I have the following locals file. I need to get the parent and child names separately in a for_each in Terraform.
locals {
  l3_crm = [
    { parent = "crm", child = ["crm-sap", "crm-sf"] },
    { parent = "fin", child = ["fin-mon"] },
  ]
}
In the following OU creation code for AWS, parent_id needs the parent name from the locals, and ou_name needs the corresponding child names:
module "l3_crm" {
  source     = "./modules/ou"
  for_each   = { for idx, val in local.l3_crm : idx => val }
  ou_name    = [each.value.child]
  parent_id  = module.l2[each.key.parent].ou_ids[0]
  depends_on = [module.l2]
  ou_tags    = var.l2_ou_tags
}
I get the following error:
│ Error: Unsupported attribute
│
│ on main.tf line 30, in module "l3_rnd":
│ 30: parent_id = module.l2[each.key.parent].ou_ids[0]
│ ├────────────────
│ │ each.key is a string, known only after apply
│
│ This value does not have any attributes.
╵
Let me know what I am doing wrong in the for loop.
I tried this as well:
module "l3_rnd" {
  source     = "./modules/ou"
  for_each   = { for parent, child in local.l3_crm : parent => child }
  ou_name    = [each.value]
  parent_id  = module.l2[each.key].ou_ids[0]
  depends_on = [module.l2]
  ou_tags    = var.l2_ou_tags
}
with the local.tf:
locals {
  l3_crm = [
    { "rnd" : ["crm-sap", "crm-sf"] },
    { "trade" : ["fin-mon"] }
  ]
}
I get these errors:
╷
│ Error: Invalid value for module argument
│
│ on main.tf line 28, in module "l3_crm":
│ 28: ou_name = [each.value]
│
│ The given value is not suitable for child module variable "ou_name" defined
│ at modules\ou\variables.tf:1,1-19: element 0: string required.
╵
╷
│ Error: Invalid value for module argument
│
│ on main.tf line 28, in module "l3_crm":
│ 28: ou_name = [each.value]
│
│ The given value is not suitable for child module variable "ou_name" defined
│ at modules\ou\variables.tf:1,1-19: element 0: string required.
╵
╷
│ Error: Invalid index
│
│ on main.tf line 29, in module "l3_crm":
│ 29: parent_id = module.l2[each.key].ou_ids[0]
│ ├────────────────
│ │ each.key is "1"
│ │ module.l2 is object with 2 attributes
│
│ The given key does not identify an element in this collection value.
╵
╷
│ Error: Invalid index
│
│ on main.tf line 29, in module "l3_crm":
│ 29: parent_id = module.l2[each.key].ou_ids[0]
│ ├────────────────
│ │ each.key is "0"
│ │ module.l2 is object with 2 attributes
│
│ The given key does not identify an element in this collection value.
╵
time=2022-11-11T13:24:15Z level=error msg=Hit multiple errors:
Hit multiple errors:
exit status 1
With your current structure you can reconstruct a map in your for_each meta-argument like:
for_each = { for l3_crm in local.l3_crm : l3_crm.parent => l3_crm.child }
This accesses the parent and child values of each list element and rebuilds them into a map of parent keys to child values.
You can also restructure the locals as a map of parent names to child lists:
l3_crm = {
  crm = ["crm-sap", "crm-sf"]
  fin = ["fin-mon"]
}
and then:
for_each = { for parent, child in local.l3_crm : parent => child }
(With the original list structure you cannot simply convert it to a set with toset for the for_each, because set(map) is not allowed as an argument value type.)
Either way, each.key is now the parent name and each.value the corresponding list of child names, so the references become:
ou_name   = each.value
parent_id = module.l2[each.key].ou_ids[0]
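Putting the first option together, the whole module block would look something like this sketch (it assumes, as in the question, that module.l2 is keyed by parent name and that the ou module's ou_name variable expects a list of strings):

```hcl
module "l3_crm" {
  source = "./modules/ou"

  # Rebuild the list of objects into a map:
  # { crm = ["crm-sap", "crm-sf"], fin = ["fin-mon"] }
  for_each = { for l3_crm in local.l3_crm : l3_crm.parent => l3_crm.child }

  ou_name    = each.value                    # list of child OU names
  parent_id  = module.l2[each.key].ou_ids[0] # each.key is the parent name
  depends_on = [module.l2]
  ou_tags    = var.l2_ou_tags
}
```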

How to define maps in template file for terraform to pick it up

My code is as follows. The custom_parameters value fails to be decoded, and I'm not sure how to define a map in the template file. How can I pass the map variable into the template file?
Invalid template interpolation value; Cannot include the given value in a
string template: string required..
main.tf looks like this
resource "google_dataflow_flex_template_job" "dataflow_jobs_static" {
  provider = google-beta
  for_each = {
    for k, v in var.dataflows : k => v
    if v.type == "static"
  }
  parameters = merge(
    yamldecode(templatefile("df/${each.key}/config.yaml", {
      tf_host_project         = var.host_project
      tf_dataflow_subnet      = var.dataflow_subnet
      tf_airflow_project      = local.airflow_project
      tf_common_project       = "np-common"
      tf_dataeng_project      = local.dataeng_project
      tf_domain               = var.domain
      tf_use_case             = var.use_case
      tf_env                  = var.env
      tf_region               = lookup(local.regions, each.value.region, "northamerica-northeast1")
      tf_short_region         = each.value.region
      tf_dataflow_job         = each.key
      tf_dataflow_job_img_tag = each.value.active_img_tag
      tf_metadata_json        = indent(6, file("df/${each.key}/metadata.json"))
      tf_sdk_language         = each.value.sdk_language
      tf_custom_parameters    = each.value.custom_parameters[*]
    }))
  )
}
terraform.tfvars looks like this
dataflows = {
  "lastflow" = {
    type           = "static"
    region         = "nane1"
    sdk_language   = "JAVA"
    active_img_tag = "0.2"
    custom_parameters = {
      bootstrapServers = "abc,gv"
    }
  }
}
vars.tf
variable "dataflows" {
  type = map(object({
    type              = string
    region            = string
    sdk_language      = string
    active_img_tag    = string
    custom_parameters = map(string)
  }))
  default = {}
}
config.yaml
custom_var: ${tf_custom_parameters}
Also, my metadata.json file looks like this:
{
  "name": "Streaming Beam PubSub to Kafka Testing",
  "description": "An Apache Beam streaming pipeline that reads JSON encoded messages from Pub/Sub, uses Beam to transform the message data, and writes the results to a Kafka",
  "parameters": [
    {
      "name": "custom_var",
      "isOptional": true
    }
  ]
}
Error
Error: Error in function call
│
│ on dataflow.tf line 60, in resource "google_dataflow_flex_template_job" "dataflow_jobs_static":
│ ├────────────────
│ │ each.key is "lastflow"
│ │ each.value.active_img_tag is "0.2"
│ │ each.value.custom_parameters is map of string with 1 element
│ │ each.value.region is "nane1"
│ │ each.value.sdk_language is "JAVA"
│ │ local.airflow_project is "-01"
│ │ local.dataeng_project is "-02"
│ │ local.regions is object with 2 attributes
│ │ var.common_project_index is "01"
│ │ var.dataflow_subnet is "dev-01"
│ │ var.domain is "datapltf"
│ │ var.env is "npe"
│ │ var.host_project is "rod-01"
│ │ var.use_case is "featcusto"
│
│ Call to function "templatefile" failed: df/lastflow/config.yaml:2,15-35:
│ Invalid template interpolation value; Cannot include the given value in a
│ string template: string required..
I fixed this by referring to the following doc:
https://www.terraform.io/language/functions/templatefile#maps
Following that doc, my solution is as follows.
The config.yaml was changed to:
%{ for key, value in tf_custom_parameters }
${key}: ${value}
%{ endfor ~}
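With that %{ for } directive, a map like custom_parameters = { bootstrapServers = "abc,gv" } is rendered into plain YAML key/value pairs, roughly:

```yaml
bootstrapServers: abc,gv
```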
And the metadata.json file changed to be as follows
{
  "name": "Streaming Beam PubSub to Kafka Testing",
  "description": "An Apache Beam streaming pipeline that reads JSON encoded messages from Pub/Sub, uses Beam to transform the message data, and writes the results to a Kafka",
  "parameters": [
    {
      "name": "bootstrapServers",
      "label": "test label.",
      "isOptional": true,
      "helpText": "test help Text"
    }
  ]
}
And the one change in main.tf file was this
.........
.........
tf_custom_parameters = each.value.custom_parameters
.........
This is the solution that helped.

Abstract out Polars expressions with user-defined chainable functions on the `DataFrame`

Motivation
Abstract out parametrized (via custom function parameters) chainable (preferably via the DataFrame.prototype) Polars expressions to provide user-defined, higher-level, reusable and chainable data analysis functions on the DataFrame
Desired behavior and failed attempt
import pl from "nodejs-polars"
const { DataFrame, col } = pl

// user-defined, higher-level, reusable and chainable data analysis function
// with arbitrarily complex parametrized Polars expressions
DataFrame.prototype.inc = function inc(column, x = 1, alias = `${column}Inc`) {
  return this.withColumn(col(column).add(x).alias(alias))
}

const df = new DataFrame({ a: [1, 2, 3] })
// works, but implies code duplication on reuse
console.log(df.withColumn(col("a").add(1).alias("aInc")))
// desired behavior gives TypeError: df.inc is not a function
console.log(df.inc("a").inc("a", 2, "aInc2"))
What is the recommended way to define custom functions that encapsulate Polars expressions in nodejs-polars?
A functional approach that does not require additional libraries would be to create a simple wrapper function and re-export polars with an overridden DataFrame factory within your own package.
// polars_extension.js
import pl from 'nodejs-polars'

const customMethods = {
  sumAlias() {
    return this.sum();
  },
};

export default {
  ...pl,
  DataFrame(...args) {
    return Object.assign(pl.DataFrame(...args), customMethods);
  }
}

// another_file.js
import pl from './polars_extension'
pl.DataFrame({ num: [1, 2, 3] }).sumAlias()
Prototype-based solution
function DF(df) { this.df = df }

DF.prototype.inc = function inc(column, x = 1, alias = `${column}Inc`) {
  this.df = this.df.withColumn(col(column).add(x).alias(alias))
  return this
}

const df = new DF(new DataFrame({ a: [1, 2, 3] }))
console.log(df.inc("a").inc("a", 2, "aInc2"))
Functional programming solution (preferred)
import { curry, pipe } from "rambda"

function inc(column, x, alias, df) {
  return df.withColumn(col(column).add(x).alias(alias))
}

const makeInc = curry(inc)
const df = new DataFrame({ a: [1, 2, 3] })
console.log(pipe(makeInc("a", 1, "aInc"), makeInc("a", 2, "aInc2"))(df))
Output
shape: (3, 3)
┌─────┬──────┬───────┐
│ a ┆ aInc ┆ aInc2 │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═════╪══════╪═══════╡
│ 1.0 ┆ 2.0 ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2.0 ┆ 3.0 ┆ 4.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3.0 ┆ 4.0 ┆ 5.0 │
└─────┴──────┴───────┘

Terraform Datadog: trace_service_definition does not accept a block for "query" or "formula" even though it is in the documentation

Am I doing something wrong?
widget {
  widget_layout {
    x      = 0
    y      = 47
    width  = 50
    height = 25
  }
  timeseries_definition {
    request {
      formula {
        formula_expression = "query1 * 100"
        alias              = "Total Session Capacity"
      }
      query {
        metric_query {
          data_source = "metrics"
          query       = "sum:.servers.available{$region,$stage,$service-name} by {availability-zone}"
          name        = "query1"
        }
      }
    }
  }
}
Documentation links:
https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/dashboard#nestedblock--widget--group_definition--widget--timeseries_definition--request--query
https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/dashboard#nested-schema-for-widgetgroup_definitionwidgettimeseries_definitionrequestformula
$ terraform --version
Terraform v1.0.11
on darwin_amd64
+ provider registry.terraform.io/datadog/datadog v2.21.0
$ terraform validate
╷
│ Error: Unsupported block type
│
│ on weekly_ops_dashboard.tf line 152, in resource "datadog_dashboard" "weekly_ops":
│ 152: formula {
│
│ Blocks of type "formula" are not expected here.
╵
╷
│ Error: Unsupported block type
│
│ on weekly_ops_dashboard.tf line 156, in resource "datadog_dashboard" "weekly_ops":
│ 156: query {
│
│ Blocks of type "query" are not expected here.
You seem to be using an old version of the Datadog Terraform provider:
provider registry.terraform.io/datadog/datadog v2.21.0
Version 2.21.0 of the provider doesn't support the formula block.
Either upgrade to the newest version or use what is available in 2.21.0.
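To upgrade, pin a newer provider release in required_providers and re-run terraform init -upgrade. The 3.x constraint below is an illustrative sketch, not a specific recommendation:

```hcl
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
      # Any release new enough to support formula/query blocks
      # in timeseries requests.
      version = ">= 3.0.0"
    }
  }
}
```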

node console.log() output array in one line

I use node v10.6.0.
Here's my code:
console.log([{a:1, b:2}, {a:1, b:2}, {a:1, b:2}])
console.log([{a:1, b:2}, {a:1, b:2}, {a:1, b:2}, {a:1, b:2}, {a:1, b:2}, {a:1, b:2}, {a:1, b:2}, {a:1, b:2}, {a:1, b:2}])
The output is as follows:
[ { a: 1, b: 2 }, { a: 1, b: 2 }, { a: 1, b: 2 } ]
[ { a: 1, b: 2 },
{ a: 1, b: 2 },
{ a: 1, b: 2 },
{ a: 1, b: 2 },
{ a: 1, b: 2 },
{ a: 1, b: 2 },
{ a: 1, b: 2 },
{ a: 1, b: 2 },
{ a: 1, b: 2 } ]
How can I make the second array print on one line, instead of spreading over multiple lines?
Although the output is not exactly the same as if console.log is used, it's possible to use JSON.stringify to convert the array to a string, then print it:
console.log(JSON.stringify(array))
It cannot process circular structures, however.
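If circular structures are a concern, a common workaround (a sketch, not part of the original answer; the function name is made up) is a replacer that substitutes a placeholder for objects it has already visited:

```javascript
// Serialize on one line, replacing re-visited objects with "[Circular]".
function stringifyOneLine(value) {
  const seen = new WeakSet();
  return JSON.stringify(value, (key, val) => {
    if (typeof val === "object" && val !== null) {
      if (seen.has(val)) return "[Circular]";
      seen.add(val);
    }
    return val;
  });
}

const arr = [{ a: 1, b: 2 }, { a: 1, b: 2 }];
arr.push(arr); // introduce a cycle
console.log(stringifyOneLine(arr)); // one line; the cycle becomes "[Circular]"
```

Note that this also replaces repeated (but non-circular) references to the same object, so it is lossy in that case.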
I suggest using the following:
console.log(util.inspect(array, {breakLength: Infinity}))
Plus, util.inspect has a bunch of extra options to format and limit the output:
https://nodejs.org/api/util.html#utilinspectobject-options
Why not use console.table instead?
This gives this nice table: https://tio.run/##y0osSyxOLsosKNHNy09J/#T8vOL8nFS9ksSknFQNsAM0//8HAA
┌─────────┬───┬───┐
│ (index) │ a │ b │
├─────────┼───┼───┤
│ 0 │ 1 │ 2 │
│ 1 │ 1 │ 2 │
│ 2 │ 1 │ 2 │
│ 3 │ 1 │ 2 │
│ 4 │ 1 │ 2 │
│ 5 │ 1 │ 2 │
│ 6 │ 1 │ 2 │
│ 7 │ 1 │ 2 │
│ 8 │ 1 │ 2 │
└─────────┴───┴───┘
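For reference, the table above is produced by a call like:

```javascript
// console.table prints an array of objects as an ASCII table,
// one row per element, one column per property.
console.table(Array.from({ length: 9 }, () => ({ a: 1, b: 2 })));
```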
