BigQuery record of repeated fields - nested

I have a BigQuery table with (what is conceptually) a field containing repeated records.
However, this field is stored as a record of repeated fields. This is caused either by the export from AppEngine DataStore (using Mache) or by the representation of the data (using Objectify 3); I don't know which.
So what I have is a field (exercises) that looks like this:
exercises RECORD NULLABLE exercises
exercises.id INTEGER REPEATED id
exercises.weight FLOAT REPEATED weight
exercises.duration STRING REPEATED duration
instead of
exercises RECORD REPEATED exercises
exercises.id INTEGER NULLABLE id
exercises.weight FLOAT NULLABLE weight
exercises.duration STRING NULLABLE duration
The latter can be queried easily using FLATTEN (legacy SQL) or UNNEST (standard SQL). However, with the schema I have now, I seem to be stuck.
I guess I would have to transpose the exercises field in some way, from a record of arrays to an array of records.
The sub-fields of exercises always have the same length, so that should not be a problem.
How can I query and select this field?

I have tried UNNEST WITH OFFSET, as suggested here:
SELECT
exerciseId, exofs,
exercises.weight[OFFSET(exofs)] AS exerciseWeight,
exercises.duration[OFFSET(exofs)] AS exerciseDuration
FROM Session, UNNEST(exercises.id) AS exerciseId WITH OFFSET exofs
This works! Note that this feature is only available in standard SQL; FLATTEN in legacy SQL does not support WITH OFFSET.
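Building on that, the whole exercises record of arrays can also be transposed into a proper array of records in one pass with a correlated ARRAY subquery. The following is only an untested sketch reusing the Session table and the field names from above; ORDER BY exofs simply preserves the original element order:

-- sketch: rebuild exercises as an ARRAY<STRUCT>, one struct per element
SELECT
  s.* EXCEPT (exercises),
  ARRAY(
    SELECT AS STRUCT
      id,
      exercises.weight[OFFSET(exofs)] AS weight,
      exercises.duration[OFFSET(exofs)] AS duration
    FROM UNNEST(exercises.id) AS id WITH OFFSET exofs
    ORDER BY exofs
  ) AS exercises
FROM Session AS s

The resulting exercises column then has the repeated-record shape shown earlier and can be queried with a plain UNNEST(exercises).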

Related

What are the performance implications of a sparsely populated frozen User Defined Type?


We have a frozen UDT with ~2000 fields as one of the columns in a table.
We use this table to implement append-only writes so that the data is auditable and not overwritten.
We are seeing degradation in write performance when only 1 (out of 2000) field in the UDT is populated.
Trying to understand the performance implications of using sparsely populated frozen UDTs. How are UDTs serialized/deserialized internally? Any documentation on this would be highly appreciated.
We tried to gather some metrics from cass session, but couldn't get much information.
Edit: Using the C++ Cassandra driver with prepared statements for writes.
Cassandra version: 3.11.6
Data Model:
CREATE TYPE udt_xyx (
  field1 bigint,
  field2 ..
  ..
  ..
  field2000
);
CREATE TABLE table_xyz (
  key_1 text,
  txn_id int,
  fields frozen<udt_xyx>,
  PRIMARY KEY ((key_1), txn_id)
);
Workflow:
A request comes in from the caller to write n fields (out of 2000) for a given key_1.
We assign a unique txn_id (transaction ID) to the request.
We then create a UDT object which has 2000 fields, populate only n of those fields, and persist it in the table.
A new request that comes in for the same key_1 with different (or the same) fields is assigned a new txn_id and written to the table as a new record.
That way we are not updating any previously written UDT, but always creating a new record in the table (associated with a new txn_id).
When the UDT is sparsely populated, we are experiencing write performance degradation.
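For illustration, a single append-only write in this workflow might look roughly like the CQL below (the key and values are invented; with a frozen UDT, any fields omitted from the literal are simply null):

INSERT INTO table_xyz (key_1, txn_id, fields)
VALUES ('customer-123', 42, {field1: 99});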
EDIT:
After doing some analysis, we narrowed down the slowness to this:
https://github.com/datastax/cpp-driver/blob/master/src/data_type.hpp#L352-L380
Basically, every time we bind a UDT, the "check" method runs and compares the string names of every field in the UDT.
Since we have ~2000 fields and we do over 100,000 binds, we're doing about 100 million string comparisons.
What performance are you measuring here? Are you comparing inserts into a table using only non-UDT columns versus inserts using both non-UDT and UDT-typed columns?
A column whose type is a frozen collection (set, map, or list) or UDT can only have its value replaced as a whole. In other words, we can't add, update, or delete individual elements from the collection as we can with non-frozen collection types. So the frozen keyword can be useful, for example, when we want to protect collections against single-value updates.
For example, in the case of the snippet below,
CREATE TYPE IF NOT EXISTS race (
  race_title text,
  race_date date
);
CREATE TABLE IF NOT EXISTS race_data (
  id INT PRIMARY KEY,
  races frozen<list<race>>,
  ...
);
the UDT nested in the list is frozen, so the entire list will be read when querying the table.
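To make the "replaced as a whole" point concrete, here is a rough sketch against the race_data table above (values invented): the entire frozen list can be overwritten in one write, but element-level operations that work on non-frozen collections are rejected.

-- allowed: replace the whole frozen list
UPDATE race_data
SET races = [{race_title: 'Spring Classic', race_date: '2021-04-18'}]
WHERE id = 1;

-- not allowed on frozen<list<race>>: appending a single element
-- UPDATE race_data SET races = races + [{race_title: 'Autumn Tour', race_date: '2021-10-03'}] WHERE id = 1;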
Since you did not provide "how" you're updating the frozen collection, it is hard to triage why there is a performance concern here.
References for exploration:
Freezing collections
Essentially, you will not be able to do an append-only operation with a frozen type as you will always have to perform read-before-write operation for any upserts.

MongoDB, should a number field be indexed?

I'm trying to get a proper understanding of using MongoDB to optimise queries. In this case it's for fields that would hold an integer. So say I have a collection
with two fields, value and cid, where value will store data of type string and cid will store data of type number.
I intend to write queries that will search for records by matching the fields value and cid. Also, the expectation is that the saved records for this collection will get very large, and hence queries could benefit from MongoDB indexes. It makes sense to me to index the value field, which holds strings. But I wonder if the cid field requires indexing, or whether it's okay as is, given that it will be holding integers.
I'm asking because I was going through a code base with this exact scenario described, and I can't figure out why the number field was not indexed. Hoping my question makes sense.
Regardless of data types, generally speaking all queries should use an index. If you use a sort predicate, you can assist the database by having a compound index on both the equality portion of the query (the filter predicate) and the sorting portion (the sort predicate). MongoDB recommends following the index strategy referred to as the E.S.R. (Equality, Sort, Range) rule - see Performance Best Practices for the E.S.R. rule.
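As a small sketch in the mongo shell (the collection name is invented), a compound index covering both fields from the question would look like this; since both predicates are equalities, under E.S.R. they simply come before any sort or range fields:

// compound index on both equality fields
db.items.createIndex({ value: 1, cid: 1 })

// a query shaped like the one described can then be served by that index
db.items.find({ value: "some string", cid: 42 })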

Solr query to sort results in descending order on the basis of price

I am a complete beginner with Solr and I am trying to query my data. I am trying to find data with name=plant and sort it by price, highest first.
In my schema, both name and price are of text type.
For example, let's say the data is
name:abc, price:25;
name:plant, price:35;
name:plant, price:45; // 1000 other records
My Approach
/query?q=(name:"Plant")&stopwords=true
but the above gives me the plant results and I am not sure how to sort them using the price field.
Any help will be appreciated.
You can use the sort param to achieve the sorting.
Your query would be like q=(name:"Plant")&sort=price desc
The sort parameter arranges search results in either ascending (asc)
or descending (desc) order. The parameter can be used with either
numerical or alphabetical content. The directions can be entered in
either all lowercase or all uppercase letters (i.e., both asc or ASC).
Solr can sort query responses according to document scores or the
value of any field with a single value that is either indexed or uses
DocValues (that is, any field whose attributes in the Schema include
multiValued="false" and either docValues="true" or indexed="true" – if
the field does not have DocValues enabled, the indexed terms are used
to build them on the fly at runtime), provided that:
the field is non-tokenized (that is, the field has no analyzer and its contents have not been parsed into multiple tokens, which would make the sorting inconsistent), or
the field uses an analyzer (such as the KeywordTokenizer) that
produces only a single term.
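Per the quoted requirements, the sort field should be single-valued and not tokenized; since price is currently a text type, any sort on it will be lexicographic ("9" sorts after "100"), not numeric. A rough, untested sketch assuming a classic schema.xml would declare price as a numeric docValues field and then sort on it:

<field name="price" type="pfloat" indexed="true" stored="true" docValues="true" multiValued="false"/>

/select?q=name:"plant"&sort=price desc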

Can I index EXTRACT(WEEK from startDateTime)? Or, will the query planner use an index directly on 'startDateTime'?

I have a large number of records indexed on some startDateTime field, and want to select aggregates (SUM and COUNT) on all records grouped by WEEKOFYEAR(startDateTime) (i.e., EXTRACT(WEEK FROM startDateTime)). Can I put a secondary index on EXTRACT(WEEK FROM startDateTime)? Or, even better, will the query use an index on startDateTime appropriately to optimize a request grouped by WEEK?
See this similar question about MySQL indices. How would this be handled in the Cloud Spanner world?
Secondary indexes on generated columns (i.e., EXTRACT(WEEK FROM startDateTime)) are not supported yet. If you have a covering index that includes all the columns required for the query (i.e., startDateTime and the other columns required for grouping and aggregation), the planner will use that covering index over the base table, but the aggregation is likely to be based on hash aggregation. Unless you aggregate over a very long period of time, it should not be a big problem (I admit that it is not ideal, though).
If you want to restrict the aggregated time range, you need to spell it out in terms of startDateTime (i.e., you need to convert the min/max datetime to the same type as startDateTime).
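As a rough sketch (the table name Sessions and the amount column are invented for illustration), that covering-index approach could look like the following, with the WEEK extraction done at query time and the time-range restriction expressed directly on startDateTime:

-- covering index over the timestamp plus the aggregated column
CREATE INDEX SessionsByStartDateTime ON Sessions(startDateTime) STORING (amount);

SELECT
  EXTRACT(WEEK FROM startDateTime) AS week,
  COUNT(*) AS numSessions,
  SUM(amount) AS totalAmount
FROM Sessions@{FORCE_INDEX=SessionsByStartDateTime}
WHERE startDateTime >= @startTime
  AND startDateTime < @endTime
GROUP BY week;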
Hope this helps.

How to prevent a Dataset from renaming columns to value while mapping?

While mapping a Dataset I keep having the problem that columns are renamed from _1, _2, etc. to value, value.
What is it which is causing the rename?
That's because map on a Dataset causes the query to be serialized and deserialized in Spark.
To serialize it, Spark must know the Encoder. That's why there is an object ExpressionEncoder with an apply method. Its Scaladoc says:
A factory for constructing encoders that convert objects and primitives to and from the
internal row format using catalyst expressions and code generation. By default, the
expressions used to retrieve values from an input row when producing an object will be created as
follows:
- Classes will have their sub fields extracted by name using [[UnresolvedAttribute]] expressions
and [[UnresolvedExtractValue]] expressions.
- Tuples will have their subfields extracted by position using [[BoundReference]] expressions.
- Primitives will have their values extracted from the first ordinal with a schema that defaults
to the name `value`.
Please look at the last point. Your query just maps to primitives, so Catalyst uses the name "value".
If you add .select('value.as("MyPropertyName")).as[CaseClass], the field names will be correct.
Types that will have column name "value":
Option(_)
Array
Collection types like Seq, Map
types like String, Timestamp, Date, BigDecimal
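A minimal, untested sketch of both behaviours and of the rename fix (the SparkSession setup and the Person case class are only illustrative):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("rename-example").getOrCreate()
import spark.implicits._

val people: Dataset[Person] = Seq(Person("a", 1), Person("b", 2)).toDS()

// mapping to a primitive: the single output column is called "value"
val ages = people.map(_.age)
ages.printSchema()   // root |-- value: integer

// mapping to a tuple: the output columns are called _1 and _2
val pairs = people.map(p => (p.name, p.age))

// renaming the columns and going back to a typed Dataset, as described above
val fixed = pairs
  .select($"_1".as("name"), $"_2".as("age"))
  .as[Person]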
