Given a table like this:
CREATE TABLE demo (
id BIGINT PRIMARY KEY,
offset_minutes INTEGER NOT NULL,
scheduled_time TIMESTAMP WITH TIME ZONE
);
I want to update scheduled_time to an application-supplied time plus offset_minutes. In plain SQL on PostgreSQL, it'd look something like
UPDATE demo
SET scheduled_time =
timestamp with time zone '2022-09-12T01:23:45Z' +
offset_minutes * interval '1 minute'
WHERE id = 12345;
What's the best way to express this in jOOQ in a database-engine-independent way?
This is the reverse of the usual way people want to add minutes to a time value, which is well-covered by other SO questions: in my case, the Instant is supplied by the code and the number of minutes is in a database column, not the other way around.
The best I've been able to come up with is to do a floating-point calculation on the number of minutes, since we can add days to Instants:
dslContext.update(DEMO)
.set(DEMO.SCHEDULED_TIME,
DSL.instant(instant).add(DEMO.OFFSET_MINUTES.div(24.0 * 60.0)))
.where(DEMO.ID.eq(id))
.execute();
On PostgreSQL, jOOQ ends up generating a SQL expression for the value of the SET clause:
(timestamp with time zone '2022-09-12 01:23:45+00:00' +
("public"."demo"."offset_minutes" / 1.44E3) * interval '1 day')
The fact that jOOQ is generating interval '1 day' makes me hopeful that there's a way to tell it to change day to minute and avoid the floating-point calculation. I'm never nuts about doing floating-point calculations on discrete quantities if it can be avoided.
Pending feature requests
Fixing your immediate problem
There's a pending feature request #6723 to add support for DSL::offsetDateTimeAdd, which would allow for adding intervals to TIMESTAMP WITH TIME ZONE data types.
It still wouldn't work for Instant data types, where we'd need yet another "overload" for all the possible date time arithmetic variants.
Both of these would produce a lot of new methods in an already crowded DSL class, without really adding much new functionality, given they're doing something similar to the existing timestampAdd() or localDateTimeAdd() functions, just offering the same thing for new types.
A more strategic, thorough change
There's a big change #11088 suggested on the long term roadmap, which would allow for more cleanly separating the usual two types T and U involved with a Field:
T being the "JDBC type", i.e. the type the database understands (e.g. OffsetDateTime)
U being the "user type", i.e. the type you want to see in your code (e.g. Instant)
This way, there would be a single method accepting T = OffsetDateTime and any arbitrary U type (e.g. Instant).
Workaround for you right now
While the above tasks take time to get right, you can always use the usual escape hatch and use plain SQL templating. E.g.
DSL.field("({0} + {1} * interval '1 minute')",
SQLDataType.INSTANT,
DSL.instant(instant),
DEMO.OFFSET_MINUTES
);
You can extract the hard-coded arguments (DSL.instant(instant) and DEMO.OFFSET_MINUTES) and make a reusable function for the above, and thus create a mini library for the missing functionality in jOOQ.
Related
Basically what I want to do is the Sequelize's equivalent of this question:
More efficient way of querying for this data?
My use case is a bit different from the above question though, much more troublesome. In particular:
Unlike the original question, I use MySQL.
My case could potentially have not just a pair of values, but a set of up to 4 different values (number of values in each set are not fixed), all thanks to my company's immaculate database
The maximum amount of sets is not just limited to ~100 sets. I can see this easily exceeds 2000 sets. (this is my main concern)
This query is a part of a already rather complex function. I tried to trim the thing down as much as possible already, but it still take quite a while to do. This query would be triggered, in my estimation, 5 to 7 times throughout the runtime of the function. I have tried the following:
The conventional way of just stuffing the processed search set inside of [Op.or] would fire up a really long query, which could exceed MySQL's query line limit (I'm not allowed to change this).
Querying item by item is reliable but slower.
The main function right now runs in approximately 1 minute (note that this is me using a smaller set of data for the purpose of testing, actual runtime can easily be 4-5 times this), which I don't think is acceptable as it is called multiple times a day. I also can't heavily modify the database itself, as it is a legacy database which is also used by other applications. If the original database had been designed properly, we wouldn't have gone to this, but alas, I can only try my best.
Any help would be very appreciated.
In MySQL, you can use tuple in WHERE clause and you can fill the missing value with ANY_VALUE(attribute name) to match anything.
SELECT * FROM Employees
WHERE (name, age, dept, salary) IN (
('Alice', 40, ANY_VALUE(dept), ANY_VALUE(salary)),
('Bob', ANY_VALUE(age), 'Tech', 120),
('Mike', 25, 'HR', ANY_VALUE(salary))
)
I tested with 100k data with 1k criterion and the query returns with 2.954s on my laptop.
========================================================
UPDATE
If you always have 4 values and no needs of ANY_VALUE, it can write in Sequelize with least literal.
const criteria = [
['Alice', 40, 'Tech', 120],
['Bob', 30, 'Tech', 120],
['Mike', 25, 'HR', 120]
];
const result = await db.Employee.findAll({
where: Sequelize.where(Sequelize.literal('(name, age, dept, salary)'), Op.in, [criteria])
});
However, in your case, the set doesn't guarantee to have all 4 values, thus needs ANY_VALUE. Unfortunately, I cannot use Sequelize.fn('ANY_VALUE', 'name') in Sequelize.where as it tries to escape it and it cannot be escaped.
Therefore, the 3rd argument for Sequelize.where also need to be replaced with literal. At this point, the code is mostly literal and I don't see any differences to just using Sequelize.query unless if you are using many other options such as offset, limit, attributes... that can still benefit the Sequelize's query generator.
const result = await db.sequelize.query(`
SELECT * FROM Employees WHERE (name, age, dept, salary) IN (${constructedCriteria})`,
{ type: Sequelize.QueryTypes.SELECT } // This will let Sequelize to format the response as in `findAll` function.
);
Some thoughts.
What is the use case of >1k criterion? Are all 1k criterion distinct?
Maybe this scenario is more suited with search engine like ElasticSearch? (if your situation is flexible)
I would like to represent the timestamp coming from an HTMLMediaElement. Its defining characteristics are:
Its value is represented as a Double
It can be queried at any time using getCurrentTime :: IO Double (as partially applied on a given HTMLMediaElement)
It is potentially continuously changing (whenever the media player is playing)
My initial plan was to represent it as a Behavior t Double that re-runs the IO Double every time it is observed, but that hasn't worked out too well.
Things I've tried:
Using a Behavior that is prodded under the hood at a fixed frequency, as described in the workaround section of this question
Passing an Event t () representing the desired sampling frequency, and returning an Event t Double that holds the coinciding timestamps
I don't really like either -- the first one either couples the behaviour (sorry) too much to my specific use case (if I use the eventual sampling frequency I'll use in my app) or seems wasteful (if I use something like 1 kHz sampling when creating the Behavior just to then sample it at 60 Hz on the application end), and the second is quite inflexible if you want to do more than one thing with the timestamp at different sampling rates.
Right now, using an Event to explicitly sample the time (your second option) value is your best bet. We haven't yet created a way to write Behaviors that lazily poll outside resources, although that is something that I hope we'll be able to get done soon.
Keep in mind that, with your second option, you don't necessarily need to use a specific sampling rate; instead, you can sample on-demand, and even have multiple locations doing that sampling. It's not perfect, but I hope that'll let you get the job done!
I have an input file in this format: (length 20, 10 chars and 10 numerics)
jname1 0000500006
bname1 0000100002
wname1 0000400007
yname1 0000000006
jname1 0000100001
mname1 0000500012
mname2 0000700013
In my jcl I have defined my sysin data as such:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,FD)
DATAEND
*
It works fine as long as I don't add the sum fields so I'm wondering if I'm using the wrong format for my numerics seeing as I know they start at field 11 and have a length of 10 the format is the only thing that could be wrong.
As you might have already realised the point of this JCL is to just list the values but grouped by the first letter of the name (so for the example data and JCL I have given it would group the numeric for mname1 and mname2 together but leave the other records untouched).
I'm kind of new at this so I was wonder what I need for the format if my numerics are like that in the input file.
If new to DFSORT, get hold of the DFSORT Getting Started guide for your version of DFSORT (http://www-01.ibm.com/support/docview.wss?uid=isg3T7000080).
This takes your through all the basic operations with many examples.
The DFSORT Application Programming Guide describes everything you need to know, in detail. Again with examples. Appendix C of that document contains all the data-types available (note, when you tried to use FD, FD is not valid data-type, so probably a typo). There are Tables throughout the document listing what data-types are available where, if there is a particular limit.
For advanced techniques, consult the DFSORT Smart Tricks publication here: http://www-01.ibm.com/support/docview.wss?uid=isg3T7000094
You need to understand a bit more the way data is stored on a Mainframe as well.
Decimals (which can be "packed-decimal" or "zoned-decimal") do not contain a decimal-point. The decimal-point is implied. In high-level languages you tell the compiler where the decimal-point is (in a fixed position) and the compiler does the alignments for you. In Assembler, you do everything yourself.
Decimals are 100% accurate, as there are machine-instructions which act directly on packed-decimal data giving packed-decimal results.
A field which actually contains a decimal-point, cannot be directly used in arithmetic.
An unsigned field is treated as positive when used in any arithmetic.
The SUM statement supports a limited number of numeric definitions, and you have chosen the correct one. It does not matter that your data is unsigned.
If the format of the output from SUM is not what you want, look at OPTION ZDPRINT (or NOZDPRINT).
If you want further formatting, you can use OUTREC or OUTFIL.
As an option to using SUM, you can use OUTFIL reporting functions (especially, although not limited to, if you want a report). You can use SECTIONS and TRAILER3 with TOT/TOTAL.
Something to watch for with SUM (which is not a problem with the reporting features) is if any given one (or more) of your SUMmed fields exceed the field size. To continue to use SUM if that happens, you need to extend the field in INREC and then get SUM to use the new, sufficient, size.
After some trial and error I finally found it, appearantly the format I needed to use was the ZD format (zoned decimal, signed), so my sysin becomes this:
SYSIN DATA *
SORT FIELDS=(1,1,CH,A)
SUM FIELDS=(11,10,ZD)
DATAEND
*
even though my records don't contain any decimals and they are unsigned, I don't really get it so if someone knows why it's like that please go ahead and explain it to me.
For now the way I'm going to remember it is this: Z = symbol for real (meaning integers so no decimals)
I'm trying to store some time series data on the following column family:
create column family t_data with comparator=TimeUUIDType and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
I'm successfully inserting data this way:
data={datetime.datetime(2013, 3, 4, 17, 8, 57, 919671):'VALUE'}
key='row_id'
col_fam.insert(key,data)
As you can see, using a datetime object as the column name pycassa converts to a timeUUID object correctly.
[default#keyspace] get t_data[row_id];
=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)
Sometimes, the application needs to update some data. The problem is that when I try to update that column, passing the same datetime object, pycassa creates a different UUID object (the time part is the same) so instead of updating the column, it creates another one.
[default#keyspace] get t_data[row_id];
=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)
=> (column=**f36ad7be**-84ed-11e2-b2fa-a6d3e28fea13, value=VALUE, timestamp=1362424025433209)
The question is, how can I update TimeUUID based columns with pycassa passing the datetime object? or, if this is not the correct way to doing it, what is the recommended way?
Unless you do a read-modify-write you can't. UUIDs are by their nature unique. They exist to solve the problem of how to get unique IDs that sort in chronological order but at the same time avoid collisions for things that happen at exactly the same time.
So to update that column you need to first read it, so you can find its column key, change its value and write it back again.
It's not a particularly elegant solution. You should really avoid read-modify-write in Cassandra. Perhaps TimeUUID isn't the right type for your column keys? Or perhaps there's another way you can design your application to avoid having to go back and change things.
Without knowing what your query patterns look like I can't say exactly what you should do instead, but here are some suggestions that hopefully are relevant:
Don't update values, just write new values. If something was true at time T will always have been true for time T, even if it changes at time T + 1. When things change you write a new value with the time of the change and let the old values be. When you read the time line you resolve these conflics by picking the most recent value -- and since the values will be sorted in chronological order the most recent value will always be the last one. This is very similar to how Cassandra does things internally, and it's a very powerful pattern.
Don't worry that this will use up more disk space, or require some extra CPU when reading the time series, it will most likely be tiny in comparison with the read-modify-write complexity that you would otherwise have to implement.
There might be other ways to solve your problem, and if you give us some more details maybe we can come up with someting that fits better.
What's the difference between the text data type and the character varying (varchar) data types?
According to the documentation
If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension.
and
In addition, PostgreSQL provides the text type, which stores strings of any length. Although the type text is not in the SQL standard, several other SQL database management systems have it as well.
So what's the difference?
There is no difference, under the hood it's all varlena (variable length array).
Check this article from Depesz: http://www.depesz.com/index.php/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/
A couple of highlights:
To sum it all up:
char(n) – takes too much space when dealing with values shorter than n (pads them to n), and can lead to subtle errors because of adding trailing
spaces, plus it is problematic to change the limit
varchar(n) – it's problematic to change the limit in live environment (requires exclusive lock while altering table)
varchar – just like text
text – for me a winner – over (n) data types because it lacks their problems, and over varchar – because it has distinct name
The article does detailed testing to show that the performance of inserts and selects for all 4 data types are similar. It also takes a detailed look at alternate ways on constraining the length when needed. Function based constraints or domains provide the advantage of instant increase of the length constraint, and on the basis that decreasing a string length constraint is rare, depesz concludes that one of them is usually the best choice for a length limit.
As "Character Types" in the documentation points out, varchar(n), char(n), and text are all stored the same way. The only difference is extra cycles are needed to check the length, if one is given, and the extra space and time required if padding is needed for char(n).
However, when you only need to store a single character, there is a slight performance advantage to using the special type "char" (keep the double-quotes — they're part of the type name). You get faster access to the field, and there is no overhead to store the length.
I just made a table of 1,000,000 random "char" chosen from the lower-case alphabet. A query to get a frequency distribution (select count(*), field ... group by field) takes about 650 milliseconds, vs about 760 on the same data using a text field.
(this answer is a Wiki, you can edit - please correct and improve!)
UPDATING BENCHMARKS FOR 2016 (pg9.5+)
And using "pure SQL" benchmarks (without any external script)
use any string_generator with UTF8
main benchmarks:
2.1. INSERT
2.2. SELECT comparing and counting
CREATE FUNCTION string_generator(int DEFAULT 20,int DEFAULT 10) RETURNS text AS $f$
SELECT array_to_string( array_agg(
substring(md5(random()::text),1,$1)||chr( 9824 + (random()*10)::int )
), ' ' ) as s
FROM generate_series(1, $2) i(x);
$f$ LANGUAGE SQL IMMUTABLE;
Prepare specific test (examples)
DROP TABLE IF EXISTS test;
-- CREATE TABLE test ( f varchar(500));
-- CREATE TABLE test ( f text);
CREATE TABLE test ( f text CHECK(char_length(f)<=500) );
Perform a basic test:
INSERT INTO test
SELECT string_generator(20+(random()*(i%11))::int)
FROM generate_series(1, 99000) t(i);
And other tests,
CREATE INDEX q on test (f);
SELECT count(*) FROM (
SELECT substring(f,1,1) || f FROM test WHERE f<'a0' ORDER BY 1 LIMIT 80000
) t;
... And use EXPLAIN ANALYZE.
UPDATED AGAIN 2018 (pg10)
little edit to add 2018's results and reinforce recommendations.
Results in 2016 and 2018
My results, after average, in many machines and many tests: all the same (statistically less than standard deviation).
Recommendation
Use text datatype, avoid old varchar(x) because sometimes it is not a standard, e.g. in CREATE FUNCTION clauses varchar(x)≠varchar(y).
express limits (with same varchar performance!) by with CHECK clause in the CREATE TABLE e.g. CHECK(char_length(x)<=10). With a negligible loss of performance in INSERT/UPDATE you can also to control ranges and string structure e.g. CHECK(char_length(x)>5 AND char_length(x)<=20 AND x LIKE 'Hello%')
On PostgreSQL manual
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
I usually use text
References: http://www.postgresql.org/docs/current/static/datatype-character.html
In my opinion, varchar(n) has it's own advantages. Yes, they all use the same underlying type and all that. But, it should be pointed out that indexes in PostgreSQL has its size limit of 2712 bytes per row.
TL;DR:
If you use text type without a constraint and have indexes on these columns, it is very possible that you hit this limit for some of your columns and get error when you try to insert data but with using varchar(n), you can prevent it.
Some more details: The problem here is that PostgreSQL doesn't give any exceptions when creating indexes for text type or varchar(n) where n is greater than 2712. However, it will give error when a record with compressed size of greater than 2712 is tried to be inserted. It means that you can insert 100.000 character of string which is composed by repetitive characters easily because it will be compressed far below 2712 but you may not be able to insert some string with 4000 characters because the compressed size is greater than 2712 bytes. Using varchar(n) where n is not too much greater than 2712, you're safe from these errors.
text and varchar have different implicit type conversions. The biggest impact that I've noticed is handling of trailing spaces. For example ...
select ' '::char = ' '::varchar, ' '::char = ' '::text, ' '::varchar = ' '::text
returns true, false, true and not true, true, true as you might expect.
Somewhat OT: If you're using Rails, the standard formatting of webpages may be different. For data entry forms text boxes are scrollable, but character varying (Rails string) boxes are one-line. Show views are as long as needed.
A good explanation from http://www.sqlines.com/postgresql/datatypes/text:
The only difference between TEXT and VARCHAR(n) is that you can limit
the maximum length of a VARCHAR column, for example, VARCHAR(255) does
not allow inserting a string more than 255 characters long.
Both TEXT and VARCHAR have the upper limit at 1 Gb, and there is no
performance difference among them (according to the PostgreSQL
documentation).
The difference is between tradition and modern.
Traditionally you were required to specify the width of each table column. If you specify too much width, expensive storage space is wasted, but if you specify too little width, some data will not fit. Then you would resize the column, and had to change a lot of connected software, fix introduced bugs, which is all very cumbersome.
Modern systems allow for unlimited string storage with dynamic storage allocation, so the incidental large string would be stored just fine without much waste of storage of small data items.
While a lot of programming languages have adopted a data type of 'string' with unlimited size, like C#, javascript, java, etc, a database like Oracle did not.
Now that PostgreSQL supports 'text', a lot of programmers are still used to VARCHAR(N), and reason like: yes, text is the same as VARCHAR, except that with VARCHAR you MAY add a limit N, so VARCHAR is more flexible.
You might as well reason, why should we bother using VARCHAR without N, now that we can simplify our life with TEXT?
In my recent years with Oracle, I have used CHAR(N) or VARCHAR(N) on very few occasions. Because Oracle does (did?) not have an unlimited string type, I used for most string columns VARCHAR(2000), where 2000 was at some time the maximum for VARCHAR, and in all practical purposes not much different from 'infinite'.
Now that I am working with PostgreSQL, I see TEXT as real progress. No more emphasis on the VAR feature of the CHAR type. No more emphasis on let's use VARCHAR without N. Besides, typing TEXT saves 3 keystrokes compared to VARCHAR.
Younger colleagues would now grow up without even knowing that in the old days there were no unlimited strings. Just like that in most projects they don't have to know about assembly programming.
I wasted way too much time because of using varchar instead of text for PostgreSQL arrays.
PostgreSQL Array operators do not work with string columns. Refer these links for more details: (https://github.com/rails/rails/issues/13127) and (http://adamsanderson.github.io/railsconf_2013/?full#10).
If you only use TEXT type you can run into issues when using AWS Database Migration Service:
Large objects (LOBs) are used but target LOB columns are not nullable
Due to their unknown and sometimes large size, large objects (LOBs) require more processing
and resources than standard objects. To help with tuning migrations of systems that contain
LOBs, AWS DMS offers the following options
If you are only sticking to PostgreSQL for everything probably you're fine. But if you are going to interact with your db via ODBC or external tools like DMS you should consider using not using TEXT for everything.
character varying(n), varchar(n) - (Both the same). value will be truncated to n characters without raising an error.
character(n), char(n) - (Both the same). fixed-length and will pad with blanks till the end of the length.
text - Unlimited length.
Example:
Table test:
a character(7)
b varchar(7)
insert "ok " to a
insert "ok " to b
We get the results:
a | (a)char_length | b | (b)char_length
----------+----------------+-------+----------------
"ok "| 7 | "ok" | 2