How much space does a Spark decimal column really consume?

For a double data type, the docs state that it consumes exactly 8 bytes, but for decimal there is no such clear statement. In SQL Server, by comparison, it is documented that decimal takes between 5 and 17 bytes depending on the chosen precision.
How much space does decimal consume in Spark?
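The Spark docs don't spell it out, but if I read the internal UnsafeRow format correctly (worth verifying against your Spark version), a decimal with precision <= 18 is stored as a single 8-byte long holding the unscaled value, while wider decimals are stored as the variable-length two's-complement bytes of the unscaled BigInteger, plus per-field offset/size bookkeeping. A rough sketch of the raw payload size under that assumption:

```python
import math

def unscaled_bytes(precision):
    """Bytes needed for the two's-complement unscaled value of a decimal
    with the given precision. This mirrors the variable-length scheme Spark
    is believed to use for precision > 18 (an assumption, not a guarantee),
    and ignores Spark's per-field offset/length words and alignment."""
    max_unscaled = 10 ** precision - 1
    # +1 bit for the sign, rounded up to whole bytes
    return math.ceil((max_unscaled.bit_length() + 1) / 8)

for p in (9, 18, 19, 38):
    print(p, unscaled_bytes(p))   # 9 -> 4, 18 -> 8, 19 -> 9, 38 -> 16
```

Note that precision 18 still fits in 8 bytes, which is consistent with the long-based fast path; precision 19 is the first width that no longer does.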

Related

Oracle - export results to excel with headers/columns more than 30 characters

I can export Oracle (12.1) SQL results to Excel using PL/SQL Developer.
But sometimes the requirement is to give a meaningful name to the column/header,
for example "total amount for previous 21 days".
Obviously that exceeds 30 characters and raises ORA-00972: identifier is too long.
Prior to Oracle version 12.2, identifiers are not allowed to exceed 30 characters in length. See the Oracle SQL Language Reference.
However, from version 12.2 they can be up to 128 bytes long. (Note: bytes, not characters.)
This question is also relevant to the newer version's limit.
Can I export with different column names without manually renaming them in the output Excel file?
EDIT
When I don't define an explicit alias, the generated name can exceed the 30-character limit, e.g. using an inner select
(select 'longtext' from veryverylongtablename),
which will create a column selectlongtextfromveryverylongtablename.
Or
'total amount for previous 21 days'||id
will create a column totalamountforprevious21daysis.
So is there a workaround for showing meaningful headers?
No, it isn't possible to do this. As stated in the docs, the maximum length of object names (tables, columns, triggers, packages, etc.) is 30 bytes:
http://docs.oracle.com/database/121/SQLRF/sql_elements008.htm#SQLRF51129
The only exceptions are database names (8 byte limit) and database links (128 bytes).
As of Oracle Database 12.2, the maximum length of names increased to 128 bytes (provided compatible is set to 12.2 or higher). Database names are still limited to 8 bytes. And the names of disk groups, pluggable databases (PDBs), rollback segments, tablespaces, and tablespace sets are limited to 30 bytes.
According to AllRoundAutomations it isn't possible
On Oracle 12.1 this is not possible. On Oracle 12.2 and later you can use long identifiers.
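If you can step outside PL/SQL Developer's own export, one workaround on any version is to fetch with short (legal) aliases and write the long headers yourself when producing the file, since a CSV that Excel opens has no identifier limit. A hedged sketch: `export_with_headers` is a made-up helper, and the rows are assumed to come from a database cursor (e.g. cx_Oracle), which is not shown here:

```python
import csv

def export_with_headers(rows, headers, path):
    """Write query results with arbitrary header text (hypothetical helper).
    'rows' would come from a cursor fetch; the column aliases used in the
    SQL no longer matter -- only this header list does."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)  # headers may freely exceed 30 characters
        writer.writerows(rows)

# example with dummy data in place of a cursor fetch
export_with_headers(
    [(1, 1234.5)],
    ["id", "total amount for previous 21 days"],
    "report.csv",
)
```

Excel opens the resulting CSV with the full header text intact.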

Cassandra timeuuid column to nanoseconds precision

A Cassandra table has a column of the timeuuid data type; how do I see the value of the timeuuid in nanoseconds?
timeuuid:
49cbda60-961b-11e8-9854-134d5b3f9cf8
49cbda60-961b-11e8-9854-134d5b3f9cf9
How do I convert this timeuuid to nanoseconds?
I need a select statement like:
select dateOf(timeuuid) from the table a;
There is a utility method in the Java driver, UUIDs.unixTimestamp(UUID id), that returns a normal epoch timestamp which can be converted into a Date object.
Worth noting that ns precision from the time UUID will not necessarily be meaningful. A type 1 UUID includes a timestamp, which is the number of 100-nanosecond intervals since midnight, October 15, 1582 UTC (when the Gregorian calendar was first adopted). But the driver takes a ~1 ms timestamp (the precision really depends on the OS; it can even be 10 or 40 ms) and keeps a monotonic counter to fill the remaining 10,000 sub-millisecond units, which can end up counting into the future if more than 10k values are generated within a single millisecond (note: performance limitations will ultimately prevent this). This is much more performant and guarantees no duplicates, especially as sub-ms time accuracy in computers is pretty meaningless in a distributed system.
So if you're looking at it from a purely CQL perspective, there's no way to do it without a UDF; not that there is much value in going beyond the ms value anyway, so dateOf should be sufficient. If you REALLY want it though:
CREATE OR REPLACE FUNCTION uuidToNS (id timeuuid)
CALLED ON NULL INPUT RETURNS bigint
LANGUAGE java AS '
return id.timestamp();
';
This will give you the number of 100 ns intervals since October 15, 1582. To translate that to nanoseconds since the epoch, multiply it by 100 to convert to nanos and add the difference from epoch time (-12219292800L * 1_000_000_000 in nanos). This might overflow longs, so you might need to use something different.
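The same conversion can also be done client-side instead of in a UDF. A sketch in Python, whose stdlib uuid module exposes the same 60-bit count of 100-ns intervals since 1582-10-15 via UUID.time:

```python
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01),
# in 100-ns intervals: 12_219_292_800 seconds * 10_000_000.
GREGORIAN_OFFSET_100NS = 12_219_292_800 * 10_000_000

def timeuuid_to_unix_nanos(tu):
    u = uuid.UUID(tu)
    assert u.version == 1, "not a timeuuid"
    # u.time is the 60-bit count of 100-ns intervals since 1582-10-15;
    # Python ints don't overflow, unlike a Java long
    return (u.time - GREGORIAN_OFFSET_100NS) * 100

nanos = timeuuid_to_unix_nanos("49cbda60-961b-11e8-9854-134d5b3f9cf8")
print(nanos)                                            # ~1.533e18
print(datetime.fromtimestamp(nanos / 1e9, tz=timezone.utc))  # August 2018
```

As noted above, everything below the millisecond is really the driver's counter, not wall-clock time.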

Is it worth converting 64bit integers to 32bit (or 16bit) ints in a spark dataframe?

I have a dataframe that contains ~4bn records. Many of the columns are 64bit ints, but could be truncated into 32bit or 16bit ints without data loss. When I try converting the data types using the following function:
from pyspark.sql.types import ShortType

def switchType(df, colName):
    df = df.withColumn(colName + "SmallInt", df[colName].cast(ShortType()))
    df = df.drop(colName)
    return df.withColumnRenamed(colName + 'SmallInt', colName)

positionsDf = switchType(positionsDf, "FundId")
# repeat for 4 more cols...
print(positionsDf.cache().count())
This shows as taking 54.7 MB in RAM. When I don't do this, it shows as 56.7 MB in RAM.
So, is it worth trying to truncate ints at all?
I am using Spark 2.0.1 in standalone mode.
If you plan to write it in a format that stores numbers in binary (Parquet, Avro), it may save some space. For calculations there will probably be no difference in speed.
OK, for the benefit of anyone else who stumbles across this: if I understand it, it depends on your JVM implementation (so it is machine/OS specific), but in my case it makes little difference. I'm running Java 1.8.0_102 on RHEL 7 64bit.
I tried it with a larger dataframe (3tn+ records). The dataframe contains 7 columns of type short/long, and 2 as doubles:
As longs - 59.6Gb
As shorts - 57.1Gb
The tasks I used to create this cached dataframe also showed no real difference in execution time.
What is nice to note is that the storage size does seem to scale linearly with the number of records. So that is good.
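For intuition on why the cached size barely moves: a cached JVM row is not a packed C array, and per-row overhead is untouched by narrowing the fields, so only the payload fraction shrinks. In the best case, a packed buffer with no per-row overhead, narrowing 64-bit to 16-bit really is 4x, which you can see with the stdlib array module (plain Python as an illustration, not Spark itself):

```python
from array import array

n = 1000
as_longs  = array("q", range(n))  # 'q': signed 64-bit integers
as_shorts = array("h", range(n))  # 'h': signed 16-bit integers

# Payload size of each packed buffer: the full 4x saving shows up here
print(n * as_longs.itemsize)   # 8000 bytes
print(n * as_shorts.itemsize)  # 2000 bytes
```

The gap between this ideal 4x and the ~4% observed above is the fixed per-record cost that dominates the cached representation.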

What is the byte size of common Cassandra data types - To be used when calculating partition disk usage?

I am trying to calculate the partition size for each row in a table with an arbitrary number of columns and types, using a formula from the Datastax Academy Data Modeling Course.
In order to do that I need to know the "size in bytes" of some common Cassandra data types. I tried to google this but I got a lot of different suggestions, so I am puzzled.
The data types I would like to know the byte size of are:
A single Cassandra TEXT character (I googled answers from 2 - 4 bytes)
A Cassandra DECIMAL
A Cassandra INT (I suppose it is 4 bytes)
A Cassandra BIGINT (I suppose it is 8 bytes)
A Cassandra BOOLEAN (I suppose it is 1 byte, .. or is it a single bit)
Any other considerations would of course also be appreciated regarding data types sizes in Cassandra.
Adding more info since it seems to be causing confusion: I am only trying to estimate the "worst scenario disk usage" the data would occupy, without any compression and other optimizations done by Cassandra behind the scenes.
I am following the Datastax Academy Course DS220 (see link at end), implementing its formula, and will use the info from answers here as variables in that formula.
https://academy.datastax.com/courses/ds220-data-modeling/physical-partition-size
I think, from a pragmatic point of view, that it is wise to get a back-of-the-envelope estimate of worst case using the formulae in the ds220 course up-front at design time. The effect of compression often varies depending on algorithms and patterns in the data. From ds220 and http://cassandra.apache.org/doc/latest/cql/types.html:
uuid: 16 bytes
timeuuid: 16 bytes
timestamp: 8 bytes
bigint: 8 bytes
counter: 8 bytes
double: 8 bytes
time: 8 bytes
inet: 4 bytes (IPv4) or 16 bytes (IPV6)
date: 4 bytes
float: 4 bytes
int: 4 bytes
smallint: 2 bytes
tinyint: 1 byte
boolean: 1 byte (hopefully.. no source for this)
ascii: requires an estimate of average # chars * 1 byte/char
text/varchar: requires an estimate of average # chars * (avg. # bytes/char for language)
map/list/set/blob: an estimate
hope it helps
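The DS220 formula the question references combines these per-type sizes into a worst-case estimate. As best I recall the course (so treat the details, especially the ~8 bytes of cell metadata per value, as assumptions to check against DS220), it is roughly: partition size = Σ partition-key columns + Σ static columns + N_rows × (Σ clustering + Σ regular columns) + 8 × N_values. A sketch of that back-of-the-envelope calculation:

```python
def worst_case_partition_bytes(pk_sizes, static_sizes,
                               clustering_sizes, regular_sizes, n_rows):
    """Worst-case (uncompressed) partition size per the DS220 formula,
    as I recall it -- verify against the course material."""
    # one timestamped cell per regular column per row, plus one per static column
    n_values = n_rows * len(regular_sizes) + len(static_sizes)
    data = (sum(pk_sizes) + sum(static_sizes)
            + n_rows * (sum(clustering_sizes) + sum(regular_sizes)))
    return data + 8 * n_values  # ~8 bytes of cell metadata each

# hypothetical table: int partition key, timestamp clustering, bigint + double values
print(worst_case_partition_bytes([4], [], [8], [8, 8], n_rows=100_000))  # 4000004
```

As the next answer stresses, compression makes the real on-disk number much smaller; this is only the pre-compression ceiling.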
The only reliable way to estimate the overhead associated with something is to actually perform measurements. Really, you can't take the single data types and generalize something about them. If you have 4 bigint columns and you suppose that your overhead is X, with 400 bigint columns your overhead probably won't be 100x. That's because Cassandra compresses everything before storing data on disk (by default; it's a setting tunable per column family).
Try to load some data, I mean production data, into the cluster, and then let us know your results and compression configuration. You'll find some surprises.
Know your data.

Cassandra 2.0 eating disk space

I am using Cassandra in my app and it started eating up disk space much faster than I expected and much faster than defined in the manual. Consider this simplest example:
CREATE TABLE sizer (
id ascii,
time timestamp,
value float,
PRIMARY KEY (id,time)
) WITH compression={'sstable_compression': ''}
I am turning off compression on purpose to see how many bytes will each record take.
Then I insert a few values, run nodetool flush, and check the size of the data file on disk to see how much space it took.
The results show a huge waste of space. Each record takes 67 bytes; I am not sure how that is possible.
My id is 13 bytes long and it is saved only once in the data file, since it is always the same for testing purposes.
According to: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/architecture/architecturePlanningUserData_t.html
Size should be:
timestamp should be 8 bytes
value as column name takes 6 bytes
column value float takes 4 bytes
column overhead 15 bytes
TOTAL: 33 bytes
For testing sake, my id is always same, so I have actually only 1 row if I understood correctly.
So, my question is how I end up using 67 bytes instead of 33.
Datafile size is correct, I tried inserting 100, 1000 and 10000 records. Size is always 67 bytes.
There are 3 overheads discussed in the linked documentation. One is the column overhead, which you have accounted for. The second is the row overhead. And if you have a replication_factor greater than 1, there is an overhead for that as well.
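To see how much that row-level overhead must be accounting for, you can redo the question's arithmetic (plain arithmetic on the numbers already given above; attributing the remainder to row-level overhead is the claim of this answer, not an exact byte-by-byte breakdown):

```python
# Per-cell estimate from the question (Cassandra 2.0 sizing guide numbers)
clustering_in_name = 8   # timestamp clustering value, encoded in the cell name
column_name        = 6   # "value"
cell_value         = 4   # float
column_overhead    = 15
per_column = clustering_in_name + column_name + cell_value + column_overhead
print(per_column)             # 33 bytes, matching the question's total

measured = 67
print(measured - per_column)  # 34 bytes/record left for row-level overhead
```

So more than half of each 67-byte record is overhead that the per-column formula alone doesn't capture.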
