Nodes order in xml created from dataset - xsd

I'am filling tables in .net DataSet with data.
There is a nested relation between the tables, so the exported XML (by using GetXml() method) is nested (the child rows are becoming child nodes).
I 'am sending this XML to a conversion module that converts the XML from the DataSet schema (I' am using the dataset XSD file) to other schema by XSLT map.
The problem is that in the XML that I' am receiving from the DataSet (by using GetXml method) the child nodes are not in the correct order (different from the order they are in the schema). From this reason the schema validation in the conversion module is failing!
I've found this W# documentation:
All or Sequence
I've tried to act according to this, but it seems like the value "all" can't "live" with the relations between the tables in the DataSet and I'am getting many weird error messages.
Is there a better way to control the child nodes order or to make the schema to succeed in the validation process even if the order is different?

I would use explicit select statements in your SQL
SELECT Column1, Column2 From ...
If this isn't possible, you will need to make your XSD match your physical table specifications.

Related

Multi-schema data pipeline

We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis), apply some basic transformations, and write the data to a file-based sink (e.g. s3). We have thousands of different event types coming in and the transformations would take place on a set of common fields. Once the events are transformed, they need to be split by writing them to different output locations according to the event type. This pipeline is described in the figure below:
Goals:
To infer schema safely in order to apply transformations based on the merged schema. The assumption is that the event types are compatible with each other (i.e. without overlapping schema structure) but the schema of any of them can change at unpredictable times. The pipeline should handle it dynamically.
To split the output after the transformations while keeping the original individual schema.
What we considered:
Schema inference seems to work fine on sample data. But is it safe for production usecases and for a large number of different event types?
Simply using partitionBy("type") while writing out is not enough because it would use the merged schema.
Doing the same here, casting everything to string, using marshamallow to validate, and then using from_json in a foreach like in https://www.waitingforcode.com/apache-spark-structured-streaming/two-topics-two-schemas-one-subscription-apache-spark-structured-streaming/read
seems the more reasonable approach

Spark read json with partial schema

I need to process quite a big json file using spark. I don't need all the fields in the json and actually would like to read only part of them (not read all fields and project).
I was wondering If I could use the json connector and give it a partial read schema with only the fields I'm interested loading.
It depends on whether your json is multi line. Currently spark only support json on single line as data frame. The next release of spark 2.3 will support multiline json.
But for your question. I don't think you can use a partial schema to read in json. You can first provide the full schema to read in as a dataframe, then select the specific column you need to construct your partial schema as a seperate dataframe. Since spark's use lazy evaluation and the sql engine is able to push down the filter, the performance won't be bad.

Complex json log data transformation using?

I am new to data science tools and have a use case to transform json logs into a flattened columnar data maybe considered as normal csv, I was looking into a lot of alternatives (tools) to approach this problem and found that I can easily solve this using Apache Spark Sql but the problem is my json log can be a complex data structure with hierarchical arrays i.e. I would have to explode the dataset multiple times to transform it.
The problem is I don't want to hard code the logic for data transformation as I wish to reuse the same chunk of code with different transformation logic, or to put it in a better way I want my transformation to be driven by configurations rather than code.
For the same reason I was looking into Apache Avro which provides me with liberty to define my own schema for the input, but here the problem is I am unaware if I can also define the output schema as well ? If not then it will be same as reading and filtering the avro data structure (generated) into my code logic.
One probable solution which I can think of is to define my schema along with the array fields and some flags to notify my parser to explode on them, which might be recursive as well till I transform the input schema into output i.e. generating the transformation logic based on my input and output schemas.
Is there any better approach which I am unaware of or not being able to think about ?

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into a org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what it is you are trying to accomplish but it is safe to say that much migrating between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
Hope this helps.

What is the best way to store and search through object transactions?

We have a decent sized object-oriented application. Whenever an object in the app is changed, the object changes are saved back to the DB. However, this has become less than ideal.
Currently, transactions are stored as a transaction and a set of transactionLI's.
The transaction table has fields for who, what, when, why, foreignKey, and foreignTable. The first four are self-explanatory. ForeignKey and foreignTable are used to determine which object changed.
TransactionLI has timestamp, key, val, oldVal, and a transactionID. This is basically a key/value/oldValue storage system.
The problem is that these two tables are used for every object in the application, so they're pretty big tables now. Using them for anything is slow. Indexes only help so much.
So we're thinking about other ways to do something like this. Things we've considered so far:
- Sharding these tables by something like the timestamp.
- Denormalizing the two tables and merge them into one.
- A combination of the two above.
- Doing something along the lines of serializing each object after a change and storing it in subversion.
- Probably something else, but I can't think of it right now.
The whole problem is that we'd like to have some mechanism for properly storing and searching through transactional data. Yeah you can force feed that into a relational database, but really, it's transactional data and should be stored accordingly.
What is everyone else doing?
We have taken the following approach:-
All objects are serialised (using the standard XMLSeriliser) but we have decorated our classes with serialisation attributes so that the resultant XML is much smaller (storing elements as attributes and dropping vowels on field names for example). This could be taken a stage further by compressing the XML if necessary.
The object repository is accessed via a SQL view. The view fronts a number of tables that are identical in structure but the table name appended with a GUID. A new table is generated when the previous table has reached critical mass (a pre-determined number of rows)
We run a nightly archiving routine that generates the new tables and modifies the views accordingly so that calling applications do not see any differences.
Finally, as part of the overnight routine we archive any old object instances that are no longer required to disk (and then tape).
I've never found a great end all solution for this type of problem. Some things you can try is if your DB supports partioning (or even if it doesn't you can implement the same concept your self), but partion this log table by object type and then you can further partion by date/time or by your object ID (if your ID is a numeric this works nicely not sure how a guid would partion).
This will help maintain the size of the table and keep all related transactions to a single instance of an object to itself.
One idea you could explore is instead of storing each field in a name value pair table, you could store the data as a blob (either text or binary). For example serialize the object to Xml and store it in a field.
The downside of this is that as your object changes you have to consider how this affects all historical data if your using Xml then there are easy ways to update the historical xml structures, if your using binary there are ways but you have to be more concious of the effort.
I've had awsome success storing a rather complex object model that has tons of interelations as a blob (the xml serializer in .net didn't handle the relationships btw the objects). I could very easily see myself storing the binary data. A huge downside of storing it as binary data is that to access it you have to take it out of the database with Xml if your using a modern database like MSSQL you can access the data.
One last approach is to split the two patterns, you could define a Difference Schema (and I assume more then one property changes at a time) so for example imagine storing this xml:
<objectDiff>
<field name="firstName" newValue="Josh" oldValue="joshua"/>
<field name="lastName" newValue="Box" oldValue="boxer"/>
</objectDiff>
This will help alleviate the number of rows, and if your using MSSQL you can define an XML Schema and get some of the rich querying ability around the object. You can still partition the table.
Josh
Depending on the characteristics of your specific application an alternative approach is to keep revisions of the entities themselves in their respective tables, together with the who, what, why and when per revision. The who, what and when can still be foreign keys.
Although I would be very careful to use this approach, since this is only viable for applications with a relatively small amount of changes per entity/entity type.
If querying the data is important I would use true Partitioning in SQL Server 2005 and above if you have enterprise edition of SQL Server. We have millions of rows partitioned by year down to day for the current month - you can be as granular as your application demands with a maximum number of 1000 partitions.
Alternatively , if you are using SQL 2008 you could look into filtered indexes.
These are solutions that will enable you to retain the simplified structure you have whilst providing the performance you need to query that data.
Splitting/Archiving older changes obviously should be considered.

Resources