What code could be used as a string aggregator for Sybase? (Like Oracle's stragg) - string

In my travels in Oracle, the 'stragg' function, or 'String Aggregator' was life-saving when I had to create dynamic SQL queries on the fly.
You can read up about it here: http://www.oratechinfo.co.uk/delimited_lists_to_collections.html
The basic use of it was:
select stragg(fruit) from food;
fruit
-----------
apple,pear,banana,strawberry
1 row(s) returned
So simple to use, concatenating chr(13) turned it into a long list, and selecting information from system tables gave a 5 minute solution to dynamically generated SQL, e.g. auditing triggers.
Now I've been charged with transferring oracle functionality related to auditing into Sybase, and a function similar to Stragg would be ideal for this purpose.
E.g.
select #my_table = 'table_of_fruit'
select 'insert into '+#mytable+'_copy (' +char(10)
+ stragg(c.name) +char(10)
+ 'select '
+ stragg('inserted.'+c.name) + char(10)
+ 'from '+#mytable
from syscolumns c
where objectid(#mytable) = c.id
------------------------------------------
insert into table_of_fruit_copy
(fruit, sweetness, price)
select fruit, sweetness,price
from inserted
Done. Simple.
Except I don't know how to get a string-aggregation function working in Sybase.
Does anyone know of an attempt to do this kind of thing, or code that could work the same as stragg that could be used in this way?
The alternative at the moment is printing code based on complex cursors and such (sample LOC: 500), or select statements combining static strings and columns from user tables (sample LOC: 200). Stragg would severely reduce the complexity of this code, and would be a great deal of help in the future (sample LOC: who knows, maybe 50?)
p.s. I'm calling these selects through a shell script then piping them to file, then running the file through iSQL. Not the nicest solution, but it's better than the alternatives.

There are three separate answers
Question
You have made comments about simplicity, which need to be addressed before we get to the solution.
It is a common requirement to be able to take a delimited list of values, say A,B,C,D, and treat this data like it was a set of rows in a table, or vice versa
This one of the Top Ten Worst Programming Practices I read about recently.
In general, Sybase types tend to be somewhat more academically and Relationally qualified than Oracle types, so we simply do not do that sort of thing in SybaseLand or DB2Land.
In 20 years of working with Sybase, I have had to code that as part of my project just once, and that was for non-technical Auditor who loaded the result set into MS Access.
On the other hand, I have had to code that at least 12 times, when producing text files for importation into Oracle databases (fulfilling external requirements is outside my project, but I satisfy any such requirement free). Obviously the target databases were sub-standard and non-relational (loading a column with more than one datum breaks 1NF, and creates Update Anomalies), which is typical of what Oracle types have to do to get some speed.
Therefore, no, it is not simplicity, at least in the sense of that principle. It is by definition, complexity.
Your reference to "arrays" is incorrect. All commercial dbms handle arrays, according to the ISO/IEC/ANSI SQL (STRAGGR and LIST operators are non-standard SQL, therefore not SQL). Sybase is very strong in processing arrays. If it was an array, you would not need special hand coding to handle it (and you do, as per your question). This is not an array, there is no definition to the cells. This is a single concatenated scalar string.
Pivoting is an entirely different process, which uses set-processing; it does not require row-processing. (I understand on good authority, that Oracle is hopeless at scalar subqueries, and thus Oracle people are used to writing them as [very inefficient] joins or inline views, and then filtering: all that can be elevated to set-processing via scalar subqueries, and it will perform much faster. Particularly your Pivots.)
Even the author in your link posts as follows. Please familiarise yourself with the caveats:
It's as simple as this: If you want to have a system with no logical limitation in the number of data elements passed to a given process, then forget the following mechanisms! They are simply the wrong way to approach the problem.
Therefore, know whatever you are doing is sub-standard, non-relational, and limited; and go ahead with your eyes open. No use pretending that: it will not break; it is not limited; it is an "array"; or that Sybase doesn't have a neat little function that Oracle has. Any professional will see through all that. And if the string length is exceeded, for God's sake send some indicator back to the caller ("!Exceeded" in the string) identifying that condition.
Essentially you are turning the set-processing engine on its head, and forcing it into row-processing mode, so it will be very slow. A WHILE loop is distinctly faster than a cursor, but both are in the same class, row-processors.
The alternative at the moment is printing code based on complex cursors and such
What 200 or 500 LoC ? It is possible I am missing something, but my code is the same few lines of code identified under "Using a Table Function" in your link. Maximum 20, if you count nice formatting; the loop; initialisation; error handling. There is nothing "complex" about it. Do the exact reverse to cancatenate a single string from multiple rows. We use stored procedures for this (which oracle does not have, really, PL/SQL is a different animal). If you have ASE 15.0.2 or greater, you can use a User Defined Function, which you can then use in place of a column. Stored procs are better for true arrays.
the concatenation operator in Sybase is the plus sign. For reversal (decomposing the CSV string) you need CHARINDEX and SUBSTRING functions
You may need the Function Reference Manual, if for nothing else, to avoid writing code where we have functions.
Likewise, we do not have a RANK() function. We are quite happy with the 4 lines of code requires for the subquery. It is only required for Oracle because subqueries are crippled.
Ok, I have answered your question, Now to address the approaches.
You will be aware that code using Oracle Extensions to the SQL standard will need to be changed.
Sybase is way more automated than Oracle; if you familiarise yourself with its feature set, in many instances, you can get the same result (as you did in Oracle) without writing any code. Writing code-for-code blocks is the chain gang, rock-breaking method of building roads, in the context of bulldozers. Even if your company had good reason to use that method, you need to the aware that features work quite differently, eg. triggers, which is why I am posting so much detail.
Another issue that will annoy you is that Oracle isn't really ANSI SQL compliant (stretches the definitions in many places, in order to appear to be compliant), and Sybase, given its customer base, is rigidly SQL compliant. So in addition to the same function working differently, or in a different deployment, you need to be aware that code changes may be required to elevate Oracle code to ANSI compliance levels, just to execute on an ANSI SQL compliant platform.
I am not sure if you are trying to write code for the content of a trigger, or if you are trying to capture the changes to a database. I will provide both answers.
Auditing
Capture Changes to Database
We have an very robust, fast and configurable Auditing subsystem, fit for high volumes and banking level auditing requirements. Get your DBA to setup the sybaudit (separate) database, and to configure exactly what changes need to be captured. This facility will perform much faster than any code you or I can write in a trigger (as much as 100 times faster than your row-by-processing required for the above, as it is executed within the engine, within your executing thread). And of course the setup time is a fraction of your coding time.
Triggers
Again, I am not sure exactly what you are trying to achieve, but assuming you want to copy every insert to some table to a COPY of that table (inside the Trigger), that example code you have provided will not work (and I am not counting syntax issues).
Speaking to your example, you need to do way more work, to deal with the different datatypes; column sizes; precisions; scale; etc. And perhaps the UPDATE() function to identify which columns have changed (for an UPDATE trigger of course). If all you are trying to do is convert the various datatypes to strings, check the CONVERT() function.
Triggers are transactional.
Never place row-processing code in a Trigger (it will strangle the table)
You can't place Dynamic SQL in a Trigger.
But in Sybase even that is not necessary. Refer to the User Guide, chapter 19 is devoted to Triggers, with several variations, and examples. Inside the trigger, you should be able to simply:
INSERT table_copy
SELECT column_list -- never use * unless you want the db fixed in cement
FROM inserted
If you are trying to copy the inserts to all tables into one Audit table, then beware. Then I understand your example a little bit more. You will be forcing a highly Symmetric Muli-Threading server (oracle is not a server in the architecture sense) into single-threading through your table. Auditing is multi-threaded.
Last, the use of manual methods of any kind is not required, so if you could expand a bit more on your PS, what the requirement you are trying to fulfil is, I can identify the programmatic method for you. It appears you are trying to use the PL/SQL approach (which is very limited).

Just use the LIST() function. It's a direct replacement for stragg() function. Example:
SELECT LIST(state, ', ') FROM cities
Result:
name
CA, CA, MA, NY

Related

How do you save a List<> as a column in a table in room?

I am building an app in which I have a Room entity that one of its columns is supposed to hold a List.
What is the best approach for doing this in an app that uses Flow, Coroutines and Room?
I tried serializing with Jackson (turning the List to a long json String and then bring it back to a List when fetched) but I am not sure if this is the correct approach.
Thank you,
What is the best approach for doing this in an app that uses Flow, Coroutines and Room?
This is very much open to opinion.
From a database perspective the approach would be to have any list as a table and thus
reducing the JSON bloat and thus reducing efficiency,
reduce duplication and thus be more likely to conform to normalisation
not potentially introducing complexities and even greater inefficiencies (e.g. not mentioned in the answer below but wild-character as the first character must do a full table scan)
perhaps consider this question and answer matching multiple title in single query using like keyword where if the table per list approach were taken then a simple SELECT * FROM task WHERE task_tags IN(:taglist) could do the same
From a coding point of view at first the coding is simpler when embedding JSON as the complex code is within the JSON libraries.

How do I find out right data design and right tools/database/query for below requirement

I have a kind of requirement but not able to figure out how can I solve it. I have datasets in below format
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use-case is to perform comparison, aggregation and queries over multiple row like
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123&GradeA
Time difference between first, 3rd, 5th and latest one
all data (or last 10 records for particular id) should be easily accessible.
Also need to further do compute. What format should I chose for dataset
and what database/tools should I use?
I don't Relational Database is useful here. I am not able to solve it with Solr/Elastic if you have any ideas, please give a brief.Or any other tool Spark, hadoop, cassandra any heads?
I am trying out things but any help is appreciated.
Choosing the right technology is highly dependent on things related to your SLA. things like how much can your query have latency? what are your query types? is your data categorized as big data or not? Is data updateable? Do we expect late events? Do we need historical data in the future or we can use techniques like rollup? and things like that. To clarify my answer, probably by using window functions you can solve your problems. For example, you can store your data on any of the tools you mentioned and by using the Presto SQL engine you can query and get your desired result. But not all of them are optimal. Furthermore, usually, these kinds of problems can not be solved with a single tool. A set of tools can cover all requirements.
tl;dr. In the below text we don't find a solution. It introduces a way to think about data modeling and choosing tools.
Let me take try to model the problem to choose a single tool. I assume your data is not updatable, you need a low latency response time, we don't expect any late event and we face a large volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you wanna query on a particular ID), so solutions like parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. Both the first and second requirements and the last requirement, count on ID as an identifier part and it seems there is nothing like join and global ordering based on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the cluster part; For each ID, events are ordered based on the time.
The third requirement is a bit vague. You wanna result on all data? or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the mentioned notes, it seems we should choose a tool that has good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are things that currently I can remember them. Let's check them:
Cassandra: It's great on response time on random access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you should carefully design your data model and it seems it's not a good tool that we can choose (because of future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now, we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries and indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we can have data locality on computations). But there is a problem: ID is not unique; so we can not choose ID as the primary key and we face some problems with random access (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us).
Druid: It's a great OLAP tool. Based on the storing manner (segment files) that Druid follows, by choosing the right data model, you can have analytic queries on a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST), we can answer our questions. But by using rollup, we lose raw data and we need them.
MongoDB: It supports random access queries and sharding. Also, we can have some type of window function on its computing framework and we can define some sort of pipelines for doing aggregations. It supports capped collections and we can use it to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kind of filter aggregations, we can have a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine we can answer the first and second questions with ES, but for now, I can't make a query in my mind. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can answer our requirements, but there is a lot of 'if's on the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

Code generation against Sprocs?

I'm trying to understand choices for code generation tools/ORM tools and discover what solution will best meet the requirements that I have and the limitations present.
I'm creating a foundational solution to be used for new projects. It consists of ASP.NET MVC 3.0, layers for business logic and data access. The data access layer will need to go against Oracle for now, and then switch to SQL this year as the db migration is finished.
From a DTO standpoint mapping to custom types in the solution, what ORM/code generation tool will work with creating my needed code but can ONLY access Stored Procs in Oracle and SQL.?
Meaning, I need to generate the custom objects that are the artifacts from and being pushed to the stored procedures as the parameters, I don't need to generate the sprocs themselves, they already exist. I'm looking for the representation of what the sproc needs and gives back to be generated into DTOs. In some cases I can go against views and generate DTOs. I'm assuming most tools already do this. But for 90% of the time, I don't have access directly to any tables or views, only stored procs.
Does this make sense?
ORMs are best at mapping objects to tables (and/or views), not mapping objects to sprocs.
Very few tools can do automated code generation against whatever output a sproc may generate, depending on the complexity of the sproc. It's much more straight-forward to code generate the input to a sproc as that is generally well defined and clear.
I would say if you are stuck with sprocs, your options for using third party code to help reduce your development and maintenance time are severely limited.
I believe either LinqToSql or EntityFramework (or both?) are capable of some magic with regards to SQL Server to try to mostly automatically figure out what a sproc may be returning. I don't think it works all the time, it's just sophisticated guess work and I seriously doubt it would work with Oracle. I am not aware of anything else software-wise that even attempts to figure out what a sproc may return.
A sproc can return multiple diverse record sets that can be built dynamically by the sproc depending on the input and data in the database. A technical solution to automatically anticipating sproc output seems like it would require the following:
A static set of underlying data in the database
The ability to pass all possible inputs to the sproc and execute the sproc without any negative impact or side effects
That would give you a static set of possible outputs for any given valid input. A small change in the data in the database could invalidate everything.
If I recall correctly, the magic Microsoft did was something like calling the sproc passing NULL for all input parameters and assuming the output is always exactly the first recordset that comes back from the database. That is clearly an incomplete solution to the problem, but in simple cases it appears to be magic because it can work very well some of the time.

Transform MongoDB Data on Find

Is it possible to transform the returned data from a Find query in MongoDB?
As an example, I have a first and last field to store a user's first and last name. In certain queries, I wish to return the first name and last initial only (e.g. 'Joe Smith' returned as 'Joe S'). In MySQL a SUBSTRING() function could be used on the field in the SELECT statement.
Are there data transformations or string functions in Mongo like there are in SQL? If so can you please provide an example of usage. If not, is there a proposed method of transforming the data aside from looping through the returned object?
It is possible to do just about anything server-side with mongodb. The reason you will usually hear "no" is you sacrifice too much speed for it to make sense under ordinary circumstances. One of the main forces behind PyMongo, Mike Dirolf with 10gen, has a good blog post on using server-side javascript with pymongo here: http://dirolf.com/2010/04/05/stored-javascript-in-mongodb-and-pymongo.html. His example is for storing a javascript function to return the sum of two fields. But you could easily modify to return the first letter of your user name field. The gist would be something like:
db.system_js.first_letter = "function (x) { return x.charAt(0); }"
Understand first, though, that mongodb is made to be really good at retrieving your data, not really good at processing it. The recommendation (see for example 50 tips and tricks for mongodb developers from Kristina Chodorow by Oreilly) is to do what Andrew tersely alluded to doing above: make a first letter column and return that instead. Any processing can be more efficiently done in the application.
But if you feel that even querying for the fullname before returning fullname[0] from your 'view' is too much of a security risk, you don't need to do everything the fastest possible way. I'd avoided map-reduce in mongodb for awhile because of all the public concerns about speed. Then I ran my first map reduce and twiddled my thumbs for .1 seconds as it processed 80,000 10k documents. I realize in the scheme of things, that's tiny. But it illustrates that just because it's bad for a massive website to take a performance hit on some server side processing, doesn't mean it would matter to you. In my case, I imagine it would take me slightly longer to migrate to Hadoop than to just eat that .1 seconds every now and then. Good luck with your site
The question you should ask yourself is why you need that data. If you need it for display purposes, do that in your view code. If you need it for query purposes, then do as Andrew suggested, and store it as an extra field on the object. Mongo doesn't provide server-side transformations (usually, and where it does, you usually don't want to use them); the answer is usually to not treat your data as you would in a relational DB, but to use the more flexible nature of the data store to pre-bake your data into the formats that you're going to be using.
If you can provide more information on how this data should be used, then we might be able to answer a little more usefully.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident its a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and the application and the channel. The integration piece does the work of making the actual query, and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services, we had some base classes that implemented common functionality, so the actual customization of the integration piecess was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming them into a normalized bit of XML, and return the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources where there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted indexed. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
Have you taken a look at YQL? It may not be the perfect solution but I might give you starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned cache and parallize optimizations.
However, with all of that it you will need weighing up the development time/deployment time /long term benefits of the effort against migrating the old legacy database to a faster more modern one. You haven't said how tied into other systems those databases are so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if your data if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache them. If you employ caching then you could make your system configurable as to what tables/columns to include or exclude from the cache and you could give each table/column a personalizable cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.

Resources