Adding CSV to Accumulo/GeoMesa with converter files

What is the best way to add (spatial) data to a GeoMesa/Accumulo stack?
(1) If I understand correctly, a SimpleFeature creation file and converter file should be created in order to add the data. The data itself is stored as CSV. Am I correct that we must build these files for each CSV we wish to add?
(2) Are the examples below correct? The geometry in the CSV files is stored as follows: "MULTILINESTRING((2.0116069 48.9172785,2.0116474 48.9172131,2.0117161 48.917135,2.011814 48.9170714,2.0118996 48.9170489))"
(3) How do we add these converter files to the process of adding the data to the GeoMesa/Accumulo stack?
The goal in the end is to have a (simple) procedure for adding data to the stack and, as a next step, to expose the data through GeoServer.
Any kind of help is welcome. Thanks in advance.
Simple feature creation file:
geomesa.sfts.links_geom = {
  attributes = [
    { name = "id",     type = "Long"    }
    { name = "length", type = "Float"   }
    { name = "number", type = "Integer" }
    ...
    { name = "geom",   type = "MultiLineString", srid = 4326 }
  ]
}
Converter file:
geomesa.converters.links_geom = {
  type     = "delimited-text",
  format   = "CSV",
  id-field = "toString($id)",
  fields = [
    { name = "id",     transform = "$1::long"  }
    { name = "length", transform = "$2::float" }
    { name = "number", transform = "$3::int"   }
    ...
    { name = "geom",   transform = "multilinestring($11)" }
  ]
}

There is no "best" way to ingest data into GeoMesa, it depends on your specific use-case. The command-line tools provide an easy entry point, but more advanced scenarios might use Apache NiFi, a stream processing framework like Apache Storm, or cloud native tools like AWS Lambda.
GeoMesa is a GeoTools data store, so you can write data using the DataStore API, without any Converter definitions. There are examples of this in the geomesa-tutorials project. However, Converters provide a declarative way to define your data type without any code. They can also be re-used across environments, so if you develop a Converter for the CLI tools, you can easily use the same definition in e.g. Apache NiFi, allowing you to scale and migrate your ingest as needed.
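For illustration, here is a minimal sketch of the DataStore API route, in the spirit of the geomesa-tutorials quickstarts. The Accumulo connection parameter keys and the SimpleFeatureTypes helper (from geomesa-utils) are assumptions on my part; check the GeoMesa documentation for the exact names in your version.
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;
import org.geotools.data.FeatureWriter;
import org.geotools.data.Transaction;
import org.locationtech.geomesa.utils.interop.SimpleFeatureTypes;
import org.locationtech.jts.io.WKTReader;
import org.opengis.feature.simple.SimpleFeature;
import org.opengis.feature.simple.SimpleFeatureType;

public class LinksIngestSketch {
    public static void main(String[] args) throws Exception {
        // Assumed parameter keys -- see the GeoMesa Accumulo data store docs for your version
        Map<String, String> params = new HashMap<>();
        params.put("accumulo.instance.id", "myInstance");
        params.put("accumulo.zookeepers", "zoo1:2181,zoo2:2181");
        params.put("accumulo.user", "user");
        params.put("accumulo.password", "password");
        params.put("accumulo.catalog", "myCatalog");
        DataStore store = DataStoreFinder.getDataStore(params);

        // Same schema as the SFT config file above, expressed as a spec string
        SimpleFeatureType sft = SimpleFeatureTypes.createType("links_geom",
            "id:Long,length:Float,number:Integer,*geom:MultiLineString:srid=4326");
        store.createSchema(sft);

        // Write one feature; a real ingest would loop over parsed CSV rows
        try (FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
                 store.getFeatureWriterAppend("links_geom", Transaction.AUTO_COMMIT)) {
            SimpleFeature feature = writer.next();
            feature.setAttribute("id", 1L);
            feature.setAttribute("length", 12.3f);
            feature.setAttribute("number", 2);
            feature.setAttribute("geom", new WKTReader().read(
                "MULTILINESTRING((2.0116069 48.9172785,2.0116474 48.9172131))"));
            writer.write();
        }
        store.dispose();
    }
}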
In general, with Converters you do need to define one per file format. GeoMesa offers type inference for CSV files as described here, which may let you ingest your data without a converter, or at least provide an initial template that you can tweak to your needs.
There is information on adding your Converters to the classpath here and here.
When developing an initial Converter definition, it can be helpful to use the convert CLI command with the error mode set to "raise-errors", as described here. Once your definition is solid, you can proceed with ingestion.
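For reference, the error mode can also be set directly in the converter definition through an options block. This is a sketch based on GeoMesa's documented converter options; verify the exact keys and values against the documentation for your version:
geomesa.converters.links_geom = {
  type     = "delimited-text",
  format   = "CSV",
  id-field = "toString($id)",
  options = {
    # fail fast while developing; "skip-bad-records" is the more forgiving mode
    error-mode = "raise-errors"
  }
  fields = [ ... ]
}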

Related

Use JOOQ Multiset with custom RecordMapper - How to create Field<List<String>>?

Suppose I have two tables, USER_GROUP and USER_GROUP_DATASOURCE. I have a classic relation where one userGroup can have multiple dataSources, and one DataSource is simply a String.
For several reasons (mainly compatibility with the rest of the codebase, and always being explicit about what's happening), I have a custom RecordMapper that creates a Java UserGroup POJO. This mapper sometimes creates POJOs containing data only from the USER_GROUP table, and sometimes also the left-joined dataSources.
Currently, I am trying to write the Multiset query along with the custom record mapper. My query thus far looks like this:
List<UserGroup> userGroups = ctx
    .select(
        asterisk(),
        multiset(
            select(USER_GROUP_DATASOURCE.DATASOURCE_ID)
            .from(USER_GROUP_DATASOURCE)
            .where(USER_GROUP.ID.eq(USER_GROUP_DATASOURCE.USER_GROUP_ID))
        ).as("datasources").convertFrom(r -> r.map(Record1::value1))
    )
    .from(USER_GROUP)
    .where(condition)
    .fetch(new UserGroupMapper());
Now my question is: How to create the UserGroupMapper? I am stuck right here:
public class UserGroupMapper implements RecordMapper<Record, UserGroup> {
    @Override
    public UserGroup map(Record rec) {
        UserGroup grp = new UserGroup(
            rec.getValue(USER_GROUP.ID),
            rec.getValue(USER_GROUP.NAME),
            rec.getValue(USER_GROUP.DESCRIPTION),
            javaParseTags(rec.getValue(USER_GROUP.TAGS))
        );
        // Convention: if we have an additional field "datasources", we assume it
        // to be a list of dataSources to be filled in
        if (rec.indexOf("datasources") >= 0) {
            // How to make `rec.getValue` return my List<String>????
            List<String> dataSources = ?????;
            grp.dataSources.addAll(dataSources);
        }
        return grp;
    }
}
My guess is to have something like List<String> dataSources = rec.getValue(..) where I pass in a Field<List<String>>, but I have no clue how I could create such a Field<List<String>> with something like DSL.field().
How to get a type-safe reference to your field from your RecordMapper
There are mostly two ways to do this:
Keep a reference to your multiset() field definition somewhere and reuse that (see the sketch after this list). Keep in mind that every jOOQ query is a dynamic SQL query, so you can use this feature of jOOQ to assign arbitrary query fragments to local variables (or return them from methods), in order to improve code reuse.
You can just raw type cast the value, and not care about type safety. It's always an option, even if not the cleanest one.
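As a sketch of the first option (assuming jOOQ 3.15+ for multiset and ad-hoc converters, the generated USER_GROUP / USER_GROUP_DATASOURCE tables from the question, and static imports of DSL.multiset, DSL.select and DSL.asterisk):
// "ctx", "condition", UserGroup and javaParseTags are as in the question
Field<List<String>> DATASOURCES =
    multiset(
        select(USER_GROUP_DATASOURCE.DATASOURCE_ID)
        .from(USER_GROUP_DATASOURCE)
        .where(USER_GROUP.ID.eq(USER_GROUP_DATASOURCE.USER_GROUP_ID))
    ).as("datasources").convertFrom(r -> r.map(Record1::value1));

List<UserGroup> userGroups = ctx
    .select(asterisk(), DATASOURCES)
    .from(USER_GROUP)
    .where(condition)
    .fetch(rec -> {
        UserGroup grp = new UserGroup(
            rec.get(USER_GROUP.ID),
            rec.get(USER_GROUP.NAME),
            rec.get(USER_GROUP.DESCRIPTION),
            javaParseTags(rec.get(USER_GROUP.TAGS)));
        // rec.get(DATASOURCES) is already typed as List<String>, no cast needed
        grp.dataSources.addAll(rec.get(DATASOURCES));
        return grp;
    });
The same DATASOURCES field could equally be passed into (or exposed to) the existing UserGroupMapper, which can then call rec.get(DATASOURCES) instead of looking the field up by name.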
How to improve your query
Unless you're re-using that RecordMapper several times for different types of queries, why not just use Java's type inference instead? The main reason why you're not getting type information in your output is your asterisk() usage. But what if you did this instead:
List<UserGroup> userGroups = ctx
    .select(
        USER_GROUP, // Instead of asterisk()
        multiset(
            select(USER_GROUP_DATASOURCE.DATASOURCE_ID)
            .from(USER_GROUP_DATASOURCE)
            .where(USER_GROUP.ID.eq(USER_GROUP_DATASOURCE.USER_GROUP_ID))
        ).as("datasources").convertFrom(r -> r.map(Record1::value1))
    )
    .from(USER_GROUP)
    .where(condition)
    .fetch(r -> {
        UserGroupRecord ug = r.value1();
        List<String> list = r.value2(); // Type information available now
        // ...
    });
There are other ways than the above, which uses jOOQ 3.17+'s support for Table as SelectField. E.g. in jOOQ 3.16+, you can use row(USER_GROUP.fields()).
The important part is that you avoid the asterisk() expression, which removes type safety. You could even convert the USER_GROUP to your UserGroup type using USER_GROUP.convertFrom(r -> ...) when you project it:
List<UserGroup> userGroups = ctx
    .select(
        USER_GROUP.convertFrom(r -> ...),
        // ...
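The truncated snippet above might, for example, continue along these lines. This is only a sketch, assuming jOOQ 3.17+ and conventionally generated record getters (getId(), getName(), and so on) on UserGroupRecord:
List<UserGroup> userGroups = ctx
    .select(
        // Convert the projected UserGroupRecord straight into the POJO
        USER_GROUP.convertFrom(r -> new UserGroup(
            r.getId(), r.getName(), r.getDescription(), javaParseTags(r.getTags()))),
        multiset(
            select(USER_GROUP_DATASOURCE.DATASOURCE_ID)
            .from(USER_GROUP_DATASOURCE)
            .where(USER_GROUP.ID.eq(USER_GROUP_DATASOURCE.USER_GROUP_ID))
        ).as("datasources").convertFrom(r -> r.map(Record1::value1))
    )
    .from(USER_GROUP)
    .where(condition)
    .fetch(r -> {
        UserGroup grp = r.value1();          // already converted to UserGroup
        grp.dataSources.addAll(r.value2());  // List<String>
        return grp;
    });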

Terraform cloudwatch_log_metric_filter - technique to filter to specific stream prefix?

Asking the question to make sure I'm not missing out on something. When using Terraform's cloudwatch_log_metric_filter, and you have a log group that has many streams, is there a way to filter to a specific log stream prefix?
Thanks!
Reference:
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_metric_filter
resource "aws_cloudwatch_log_metric_filter" "yada" {
name = "MyAppAccessCount"
pattern = "_special_logstream_prefix_filter_pattern_???"
log_group_name = aws_cloudwatch_log_group.dada.name
metric_transformation {
name = "EventCount"
namespace = "YourNamespace"
value = "1"
}
}
is there a way to filter to a specific log stream prefix?
No. You are doing it correctly in your code, as metric filters are applied at the log group level, not the log stream level.

How to import Datadog JSON template in terraform DSL?

{
  "title": "xxxx",
  "description": "xxx",
  "widgets": [
    {
      "id": 0,
      "definition": {
        "type": "timeseries",
        "requests": [
          {
            "q": "xxxxxx{xxxx:xx}",
            "display_type": "bars",
            "style": {
              "palette": "cool",
              "line_type": "solid",
              "line_width": "normal"
            }
          }
        ]
      }
    }
  ]
}
I have the above Datadog JSON template, which I want to import into Terraform rather than recreating it as Terraform DSL.
I'm not particularly familiar with this Datadog JSON format but the general pattern I would propose here has multiple steps:
Decode the serialized data into a normal Terraform value. In this case that would be using jsondecode, because the data is JSON-serialized.
Transform and normalize that raw data into a consistent shape that is more convenient to use in a declarative Terraform configuration. This will usually involve at least one named local value containing an expression that uses for expressions and the try function, along with the type conversion functions, to try to force the raw data into a more consistent shape.
Use the transformed/normalized result with Terraform's resource and block repetition constructs (resource for_each and dynamic blocks) to describe how the data maps onto physical resource types.
Here's a basic example of that to show the general principle. It will need more work to capture all of the details you included in your initial example.
variable "datadog_json" {
type = string
}
locals {
raw = jsondecode(var.datadog_json)
screenboard = {
title = local.raw.title
description = try(local.raw.description, tostring(null))
widgets = [
for w in local.raw.widgets : {
type = w.definition.type
title = w.definition.title
title_size = try(w.definition.title_size, 16)
title_align = try(w.definition.title_align, "center")
x = try(w.definition.x, tonumber(null))
y = try(w.definition.y, tonumber(null))
width = try(w.definition.x, tonumber(null))
height = try(w.definition.y, tonumber(null))
requests = [
for r in w.definition.requests : {
q = r.q
display_type = r.display_type
style = tomap(try(r.style, {}))
}
]
}
]
}
}
resource "datadog_screenboard" "acceptance_test" {
title = local.screenboard.title
description = local.screenboard.description
read_only = true
dynamic "widget" {
for_each = local.screenboard.widgets
content {
type = widget.value.type
title = widget.value.title
title_size = widget.value.title_size
title_align = widget.value.title_align
x = widget.value.x
y = widget.value.y
width = widget.value.width
height = widget.value.height
tile_def {
viz = widget.value.type
dynamic "request" {
for_each = widget.value.requests
content {
q = request.value.q
display_type = request.value.display_type
style = request.value.style
}
}
}
}
}
}
The separate normalization step to build local.screenboard here isn't strictly necessary: you could instead put the same sort of normalization expressions (using try to set defaults for things that aren't set) directly inside the resource "datadog_screenboard" block arguments if you wanted. I prefer to treat normalization as a separate step because then this leaves a clear definition in the configuration for what we're expecting to find in the JSON and what default values we'll use for optional items, separate from defining how that result is then mapped onto the physical datadog_screenboard resource.
I wasn't able to test the example above because I don't have a Datadog account. I'm sorry if there are minor typos/mistakes in it that lead to errors. My hope was to show the general principle of mapping from a serialized data file to a resource rather than to give a ready-to-use solution, so I hope the above includes enough examples of different situations that you can see how to extend it for the remaining Datadog JSON features you want to support in this module.
If this JSON format is an interchange format formally documented by Datadog, it could make sense for Terraform's Datadog provider to have the option of accepting a single JSON string in this format as configuration, for easier exporting. That may require changes to the Datadog provider itself, which is beyond what I can answer here, but it might be worth raising in the GitHub issues for that provider to streamline this use case.

Construct Completely Ad-hoc Slick Query

Pardon my newbieness, but I'm trying to build a completely ad-hoc query builder using Slick. From our API, I will get a list of strings representing the tables, as well as another list representing the filters for those tables, and munge them together to create a query. The hope is that I can take these and create the inner join. A similar example of what I'm trying to do would be JIRA's advanced query builder.
I've been trying to build it using reflection, but I've come across so many blocking issues that I'm wondering if this is even possible at all.
In code this is what I want to do:
def getTableQueryFor(tbl: String): TableQuery[_] = {
  ... a matcher that returns a tableQuery?
  ... I think the return type is incorrect b/c erasure?
}
def getJoinConditionFor: (tbl1: String, tbl2: String) => scala.slick.lifted.Column[Boolean] = (l: Coffees, r: Suppies) => {
  ... a matcher
}
Is the following even possible?
val q1 = getTableQueryFor("coffee")
val q2 = getTableQueryFor("supply")
val q3 = q1.innerJoin(q2).on(getJoinConditionFor("coffee", "supply"))

Dynamic data structures in C#

I have data in a database, and my code is accessing it using LINQ to Entities.
I am writing some software where I need to be able to create a dynamic script. Clients may write the scripts, but it is more likely that they will just modify them. The script will specify stuff like this,
Dataset data = GetDataset("table_name", "field = '1'");
if (data.Read())
{
    string field = data["field"];
    while (data.Read())
    {
        // do some other stuff
    }
}
So the script above is going to read data from the database table called 'table_name' into a list of some kind, based on the filter I have specified ('field = '1''). It is going to read particular fields and perform normal comparisons and calculations.
The most important thing is that this has to be dynamic. I can specify any table in our database, any filter and I then must be able to access any field.
I am using a script engine, which means the script I am writing has to be written in C#. DataSets are outdated and I would rather stay away from them.
Just to reiterate: I am not set on keeping the above format, and I can define any method I want behind the scenes for my C# script to call. The above could end up like this, for instance:
var data = GetData("table_name", "field = '1'");
while (data.ReadNext())
{
    var value = data.DynamicField;
}
Could I use reflection, for instance, or would that perhaps be too slow? Any ideas?
If you want to read a DataReader dynamically, it's pretty easy:
ArrayList al = new ArrayList();
SqlDataReader dataReader = myCommand.ExecuteReader();

if (dataReader.HasRows)
{
    while (dataReader.Read())
    {
        // Copy every column of the current row into a string array
        string[] fields = new string[dataReader.FieldCount];
        for (int i = 0; i < dataReader.FieldCount; ++i)
        {
            fields[i] = dataReader[i].ToString();
        }
        al.Add(fields);
    }
}
This leaves you with an ArrayList containing one string array per row, sized dynamically according to the number of fields the reader exposes.
