When does compressing a SAS dataset increase its size?


I wrote code that creates a SAS dataset with the COMPRESS=YES option. However, the resulting dataset comes out larger after compression, as the log shows:
proc sql;
  create table seg.KRG_EO_PVS_CUST_PROD_&op_cyc.
  (
    COMPRESS = YES
  ) as
  select
    W6DFFTE1.DIB_CUST_ID length = 8
      format = 15.
      informat = 15.
      label = 'The logical customer id',
    W6DFFTE1.DIB_PROD_ID length = 8
      format = 15.
      informat = 15.
      label = 'The product id',
    case when W5TM24S0.OFFER_FLAG = "1" then "1" else "0" end as OFFER_FLAG length = 1,
    sum(W6DFFTE1.TOT_QUANTITY ) as TOT_QUANTITY length = 8
      format = 10.
      informat = 5.
      label = 'Number of items purchased'
  from
    work.W6DFFTE1 left join
    work.W5TM24S0
    on
    (
      W5TM24S0.DIB_STORE_ID = W6DFFTE1.DIB_STORE_ID
      and W5TM24S0.DIB_SCAN_ID = W6DFFTE1.DIB_SCAN_ID
    )
  group by
    W6DFFTE1.DIB_CUST_ID,
    W6DFFTE1.DIB_PROD_ID,
    W5TM24S0.OFFER_FLAG
  ;
NOTE: Compressing data set SEG.KRG_EO_PVS_CUST_PROD_20150701 increased size by 43.27 percent.
Compressed is 1961732 pages; un-compressed would require 1369265 pages.
NOTE: Table SEG.KRG_EO_PVS_CUST_PROD_20150701 created, with 346423801 rows and 4 columns.
What are the probable reasons for this?

SAS compression is pretty primitive. COMPRESS=YES (the same as COMPRESS=CHAR) uses run-length encoding, so it mainly saves disk space by not writing out repeated bytes, such as the unused trailing blanks in character variables. It looks like your data is three numeric variables plus a one-character-long variable. That is not much to work with, and every compressed observation also carries some bookkeeping overhead, so the file can end up larger than the uncompressed one.
If you need to compress files for medium- or long-term storage, you're much better off using a separate utility such as zip or gzip.
EDIT: I don't mean to disparage SAS compression. I believe the designers were more concerned with preserving relatively fast disk access than with providing actual zip-style compression.
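To illustrate why run-length style compression can backfire on narrow, mostly numeric rows, here is a toy sketch in Python. It is not SAS's actual on-disk format: the (count, byte) encoding, the function name rle_encode, and both sample rows are assumptions made up for the illustration.

```python
def rle_encode(data: bytes) -> bytes:
    """Toy run-length encoder: every run becomes a (count, byte) pair, count <= 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

# Hypothetical 25-byte row with no repeated neighbouring bytes, standing in
# for three 8-byte numerics plus a 1-byte flag.
narrow_row = bytes(range(1, 26))
# Hypothetical 200-byte character column that is mostly trailing blanks.
padded_row = b"Smith".ljust(200)

narrow_encoded = rle_encode(narrow_row)
padded_encoded = rle_encode(padded_row)

print(len(narrow_row), "->", len(narrow_encoded))  # 25 -> 50
print(len(padded_row), "->", len(padded_encoded))  # 200 -> 12
```

The blank-padded row shrinks because one long run collapses into a single pair, while the row without runs only gains overhead. Real formats keep that overhead smaller than this toy does, but the direction of the effect is the same.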

Related

Why does Scapy recalculate the fragment size?

I am trying to fragment a 120-byte IP payload into 100-byte fragments. However, the output contains two packets, one of 138 bytes and one of 50 bytes (the Ethernet and IP headers are 14 and 20 bytes respectively). The first packet carries payload bytes 0 to 103 and the second carries bytes 104 to 119. I could not understand why it works this way, so I looked at the source of the fragment function defined in layers/inet.py, line 552.
Scapy recalculates fragmentation size as follows:
def fragment(self, fragsize=1480):
    """Fragment IP datagrams"""
    fragsize = (fragsize + 7) // 8 * 8  # <- RECALCULATION OF FRAGMENT SIZE
    lst = []
    fnb = 0
    fl = self
    while fl.underlayer is not None:
        fnb += 1
        fl = fl.underlayer
    for p in fl:
        s = raw(p[fnb].payload)
        nb = (len(s) + fragsize - 1) // fragsize
        for i in range(nb):
            q = p.copy()
            del(q[fnb].payload)
            del(q[fnb].chksum)
            del(q[fnb].len)
            if i != nb - 1:
                q[fnb].flags |= 1
            q[fnb].frag += i * fragsize // 8
            r = conf.raw_layer(load=s[i * fragsize:(i + 1) * fragsize])
            r.overload_fields = p[fnb].payload.overload_fields.copy()
            q.add_payload(r)
            lst.append(q)
    return lst
Can somebody explain why it does this?
N.B.:
Ethernet header size: 14 bytes
IPv4 header size: 20 bytes

See https://github.com/secdev/scapy/issues/2424#issuecomment-576879663
From https://www.rfc-editor.org/rfc/rfc791#section-3.2 (page 25, top):
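"If an internet datagram is fragmented, its data portion must be broken on 8 octet boundaries."
To answer your question: the fragment offset in the IP header is counted in units of 8 octets, so every fragment except the last must carry a multiple of 8 bytes.
100 is not a multiple of 8, so Scapy rounds it up to the next one: (100 + 7) // 8 * 8 = 104. The first fragment therefore carries 104 bytes and the second the remaining 16.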
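The rounding can be reproduced without Scapy. The following is a minimal sketch that only mirrors the arithmetic of the fragment function quoted in the question; plan_fragments and the two header constants are names chosen for this illustration, not part of Scapy.

```python
def plan_fragments(payload_len, fragsize):
    """Return (offset_in_8_byte_units, length) for each fragment."""
    fragsize = (fragsize + 7) // 8 * 8  # round up to a multiple of 8
    count = (payload_len + fragsize - 1) // fragsize
    plan = []
    for i in range(count):
        start = i * fragsize
        length = min(fragsize, payload_len - start)
        plan.append((start // 8, length))
    return plan

ETH_HEADER = 14  # bytes, as stated in the question
IP_HEADER = 20   # bytes, as stated in the question

plan = plan_fragments(120, 100)
frame_sizes = [ETH_HEADER + IP_HEADER + length for _, length in plan]

print(plan)         # [(0, 104), (13, 16)]
print(frame_sizes)  # [138, 50]
```

The frame sizes match the 138-byte and 50-byte packets observed in the question, and the second fragment's offset field is 13 because 13 * 8 = 104.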

Add model information to the Gurobi log

I'm using Gurobi in Python. I'm iterating over a set of nodes and at each iteration, I'm adding a constraint to solve. After solving, it produces the Gurobi log as follows:
Optimize a model with 6 rows, 36 columns and 41 nonzeros
Variable types: 0 continuous, 36 integer (36 binary)
Coefficient statistics:
Matrix range [1e+00, 1e+00]
Objective range [2e+01, 9e+01]
Bounds range [1e+00, 1e+00]
RHS range [2e+00, 2e+00]
MIP start did not produce a new incumbent solution
MIP start violates constraint R5 by 2.000000000
Found heuristic solution: objective 347.281
Presolve removed 2 rows and 21 columns
Presolve time: 0.00s
Presolved: 4 rows, 15 columns, 27 nonzeros
Found heuristic solution: objective 336.2791955
Variable types: 0 continuous, 15 integer (15 binary)
Root relaxation: objective 3.043757e+02, 6 iterations, 0.00 seconds
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
* 0 0 0 304.3757488 304.37575 0.00% - 0s
Explored 0 nodes (6 simplex iterations) in 0.02 seconds
Thread count was 4 (of 4 available processors)
Solution count 3: 304.376 336.279 339.43
Optimal solution found (tolerance 1.00e-04)
Best objective 3.043757488224e+02, best bound 3.043757488224e+02, gap 0.0000%
But after a certain iteration, the answer is not what I expect. So I want to print all my model details (objective function, constraints, etc.) in the Gurobi log at every iteration. How can I do that?
model.write() only prints the objective function and the constraints that I have coded:
Minimize
0 x(0,0) + 75.47184905645283 x(0,1) + 57.55866572463264 x(0,2)
+ 33.97057550292606 x(0,3) + 23.3238075793812 x(0,4)
+ 40.80441152620633 x(0,5) + 75.47184905645283 x(1,0) + 0 x(1,1)
+ 32.7566787083184 x(1,2) + 90.60905032059435 x(1,3)
+ 55.71355310873648 x(1,4) + 40.60788100849391 x(1,5)
+ 57.55866572463264 x(2,0) + 32.7566787083184 x(2,1) + 0 x(2,2)
+ 83.36066218546971 x(2,3) + 46.57252408878007 x(2,4)
+ 41.4004830889689 x(2,5) + 33.97057550292606 x(3,0)
+ 90.60905032059435 x(3,1) + 83.36066218546971 x(3,2) + 0 x(3,3)
+ 37.12142238654117 x(3,4) + 50.00999900019995 x(3,5)
+ 23.3238075793812 x(4,0) + 55.71355310873648 x(4,1)
+ 46.57252408878007 x(4,2) + 37.12142238654117 x(4,3) + 0 x(4,4)
+ 17.69180601295413 x(4,5) + 40.80441152620633 x(5,0)
+ 40.60788100849391 x(5,1) + 41.4004830889689 x(5,2)
+ 50.00999900019995 x(5,3) + 17.69180601295413 x(5,4) + 0 x(5,5)
Subject To
R0: x(0,1) + x(0,2) + x(0,3) + x(0,4) + x(0,5) >= 2
R1: x(1,0) + x(1,2) + x(1,3) + x(1,4) + x(1,5) >= 2
R2: x(1,0) + x(1,3) + x(1,4) + x(2,0) + x(2,3) + x(2,4) + x(5,0) +
x(5,3)+ x(5,4) >= 2
R3: x(3,0) + x(3,1) + x(3,2) + x(3,4) + x(3,5) >= 2
R4: x(0,1) + x(0,2) + x(0,5) + x(3,1) + x(3,2) + x(3,5) + x(4,1) +
x(4,2)+ x(4,5) >= 2
R5: x(0,1) + x(0,2) + x(3,1) + x(3,2) + x(4,1) + x(4,2) + x(5,1) +
x(5,2)>= 2
Bounds
Binaries
x(0,0) x(0,1) x(0,2) x(0,3) x(0,4) x(0,5) x(1,0) x(1,1) x(1,2) x(1,3)
x(1,4) x(1,5) x(2,0) x(2,1) x(2,2) x(2,3) x(2,4) x(2,5) x(3,0) x(3,1)
x(3,2) x(3,3) x(3,4) x(3,5) x(4,0) x(4,1) x(4,2) x(4,3) x(4,4) x(4,5)
x(5,0) x(5,1) x(5,2) x(5,3) x(5,4) x(5,5)
End
What I need is to know what is happening at each iteration. One iteration gives me a wrong answer, so I want to check whether any redundant constraint is being added to the model during the solve.
In other words, do Gurobi callbacks allow us to access all the information that is available in the model? What will they produce?

"In other words, does "Gurobi callbacks" allow us to access all information that is available in the model? What will it produce?"
No, you cannot print constraints generated in a callback function.
Most likely, the issue is one of the following:
1. You are calling the wrong function inside the callback. There are two kinds of constraints you can add: lazy constraints and user cuts. Lazy constraints are necessary for the structure; a solution must satisfy all lazy constraints. However, you use lazy constraints when they are too numerous to add to the model, and you only want to add those that get violated. User cuts are not necessary, but they can help remove fractional solutions and tighten the LP relaxation of a MIP. In your case, it sounds like you have lazy constraints.
2. You are not adding all violated lazy constraints. As stated in the documentation: "Your callback should be prepared to cut off solutions that violate any of your lazy constraints, including those that have already been added." You should not track whether you added a lazy constraint already; you must add it every time you see that it is violated. This is due to the parallel processing of the Gurobi solver.
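The "check every lazy constraint on every candidate" rule can be sketched in plain Python, without the solver. Everything here is an illustration under assumptions: the constraint representation, the function name violated, and the candidate solution are made up, and only R5 is copied from the LP listing in the question. In gurobipy the candidate would typically come from Model.cbGetSolution when the callback fires with where == GRB.Callback.MIPSOL, and each violated constraint would be passed to Model.cbLazy (with the LazyConstraints parameter set to 1).

```python
# Each lazy constraint is (name, variable keys, rhs) and means
# sum(x[key] for key in keys) >= rhs. R5 is taken from the question's LP file.
LAZY_CONSTRAINTS = [
    ("R5", [(0, 1), (0, 2), (3, 1), (3, 2), (4, 1), (4, 2), (5, 1), (5, 2)], 2),
]

def violated(constraints, solution, tol=1e-6):
    """Return (name, shortfall) for every constraint the candidate violates.

    No record is kept of constraints reported earlier: every constraint is
    re-checked on every call, as the documentation quote above requires.
    """
    result = []
    for name, keys, rhs in constraints:
        lhs = sum(solution.get(key, 0.0) for key in keys)
        if lhs < rhs - tol:
            result.append((name, rhs - lhs))
    return result

# Hypothetical candidate in which none of the R5 variables is selected.
candidate = {(0, 3): 1.0, (0, 4): 1.0}
print(violated(LAZY_CONSTRAINTS, candidate))  # [('R5', 2.0)]

# Hypothetical candidate that satisfies R5.
feasible = {(0, 1): 1.0, (3, 2): 1.0}
print(violated(LAZY_CONSTRAINTS, feasible))   # []
```

Printing the list returned by such a check from inside the callback is also one way to see which constraints are being added at each step, since the solver log itself does not list them.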

How do I convert μm^2 to meters^2?

From my text, I read:
Estimated soma area, in μm^2, is from 1073 to 2400 and estimated total
somadendritic area is from 3914 to 11,158 μm^2.
How do I convert μm^2 to meters^2?

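1 μm = 10^-6 m
Squaring both sides:
1 μm^2 = (10^-6 m)^2 = 10^-12 m^2
Equivalently, in centimeters:
1 μm = 10^-4 cm
so
1 μm^2 = 10^-8 cm^2
For example, 1073 μm^2 = 1.073 × 10^-9 m^2.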
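Because an area scales with the square of the length factor, the conversion to square meters uses (10^-6)^2 = 10^-12. A small Python sketch applying it to the four values quoted in the question (the function and constant names are chosen for this example):

```python
UM2_TO_M2 = 1e-12  # (1e-6 m per um) squared

def um2_to_m2(area_um2):
    """Convert an area from square micrometers to square meters."""
    return area_um2 * UM2_TO_M2

for area in (1073, 2400, 3914, 11158):
    print(f"{area} um^2 = {um2_to_m2(area):.3e} m^2")
```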

Distribution of bytes within jpeg files

When observing compressed data, I expect an almost uniformly distributed byte stream. When I use the chi-square test to measure the distribution, I get this result for ZIP files and other compressed data, for example, but not for JPG files. I have spent the last few days looking for the reason, but I cannot find one.
When calculating the entropy of JPGs, I get a high result (e.g. 7.95 bits/byte). I thought there must be a connection between the entropy and the distribution: the entropy is high when every byte value appears with almost the same probability. But when using chi-square, I get a p-value of about 4.5e-5...
I just want to understand how different distributions influence the test results. I thought I could measure the same property with both tests, but obviously I cannot.
Thank you very much for any hint!
tom

Distribution in jpeg-files
Ignoring the metadata and the JPEG header data, the payload of a JPEG consists of segments describing Huffman tables or encoded MCUs (minimum coded units, blocks of typically 8x8 or 16x16 pixels depending on the chroma subsampling). There may be others, but these are the most frequent ones.
Those segments are delimited by 0xFF 0xSS, where 0xSS is a specific marker code. Here is the first problem: 0xFF is a bit more frequent, as twalberg mentioned in the comments.
0xFF may also occur inside an encoded MCU. To distinguish this normal payload from the start of a new segment, a 0x00 byte is inserted after it, giving 0xFF 0x00. If the distribution of the unstuffed payload is perfectly uniform, 0x00 will occur twice as often in the stuffed data. To make matters worse, the entropy-coded data is padded with binary ones to reach byte alignment before a marker (a slight bias towards larger values), and that padding might need stuffing again.
There may also be other factors I'm not aware of. If you need more information, you will have to provide the JPEG file.
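The effect of byte stuffing on the histogram can be checked with a few lines of Python. This is a minimal sketch on synthetic data, not a JPEG parser; the function name stuff and the perfectly uniform test payload are assumptions made for the demonstration.

```python
def stuff(payload: bytes) -> bytes:
    """Insert 0x00 after every 0xFF, as is done inside JPEG entropy-coded data."""
    return payload.replace(b"\xff", b"\xff\x00")

# Perfectly uniform synthetic payload: every byte value occurs 4096 times.
uniform = bytes(range(256)) * 4096
stuffed = stuff(uniform)

zeros_before = uniform.count(0)
zeros_after = stuffed.count(0)
print(zeros_before, zeros_after)  # 4096 8192
```

After stuffing, 0x00 occurs twice as often as any other value, which is already enough for a chi-square test on a large file to reject uniformity.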
And about your basic assumption:
for rand_data:
    dd if=/dev/urandom of=rand_data count=4096 bs=256
for rand_pseudo (Python):
    # write the 256 byte values 0..255 in order, 4096 times
    s = bytes(range(256))
    with open("rand_pseudo", "wb") as f:
        for i in range(4096):
            f.write(s)
Both should be uniform regarding byte-values, shouldn't they? ;)
$ ll rand_*
-rw-r--r-- 1 apuch apuch 1048576 2012-12-04 20:11 rand_data
-rw-r--r-- 1 apuch apuch 1048967 2012-12-04 20:13 rand_data.tar.gz
-rw-r--r-- 1 apuch apuch 1048576 2012-12-04 20:14 rand_pseudo
-rw-r--r-- 1 apuch apuch 4538 2012-12-04 20:15 rand_pseudo.tar.gz
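A uniform byte distribution gives a high entropy estimate when only single-byte frequencies are counted, but it is not a guarantee that the data is incompressible: rand_pseudo is perfectly uniform and still shrinks to a few kilobytes. Also, rand_data might consist of 1 MB of 0x00. It's extremely unlikely, but possible.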
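The relationship between the two measures can be tried out on synthetic data. The sketch below uses only the Python standard library; the helper names and the two test inputs are assumptions for illustration. It reports the chi-square statistic only, since turning it into a p-value would need a distribution function from an extra library.

```python
import math
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Zero-order Shannon entropy in bits per byte."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def chi_square(data: bytes) -> float:
    """Chi-square statistic against a uniform distribution over 256 byte values."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))

# Same content as rand_pseudo: perfectly uniform, but highly structured.
pattern = bytes(range(256)) * 4096
# Same data with 0x00 and 0xFF made twice as frequent as the other values.
biased = pattern + b"\x00\xff" * 4096

for name, data in (("pattern", pattern), ("biased", biased)):
    print(name,
          round(byte_entropy(data), 4),
          round(chi_square(data), 1),
          len(zlib.compress(data)))
```

Two things show up. The pattern file has the maximum entropy of 8 bits/byte and a chi-square statistic of 0, yet it compresses to a tiny fraction of its size. The biased file still has an entropy above 7.99 bits/byte, but its chi-square statistic is in the thousands with 255 degrees of freedom, so the p-value is practically 0. With about a million samples the chi-square test reacts strongly to a bias that barely moves the entropy.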

Here you can find two files: the first one is random data generated with /dev/urandom (about 46 MB), the second one is a normal JPG file (about 9 MB). It is obvious that the symbols of the JPG file are not as equally distributed as in /dev/urandom.
If I compare both files:
Entropy:
JPG: 7.969247 bits/byte
RND: 7.999996 bits/byte
P-value of chi-square test:
JPG: 0
RND: 0.3621
How can the entropy be so high when the chi-square test rejects uniformity so clearly?

Here is my java code
// Requires: java.awt.image.BufferedImage, java.util.HashMap, java.util.Map
public static double getShannonEntropy_Image(BufferedImage actualImage) {
    int n = 0;
    // Histogram of greyscale levels: level -> number of pixels
    Map<Integer, Integer> occ = new HashMap<>();
    for (int i = 0; i < actualImage.getHeight(); i++) {
        for (int j = 0; j < actualImage.getWidth(); j++) {
            int pixel = actualImage.getRGB(j, i);
            int red = (pixel >> 16) & 0xff;
            int green = (pixel >> 8) & 0xff;
            int blue = (pixel) & 0xff;
            // 0.2989 * R + 0.5870 * G + 0.1140 * B greyscale conversion
            int d = (int) Math.round(0.2989 * red + 0.5870 * green + 0.1140 * blue);
            occ.merge(d, 1, Integer::sum);
            ++n;
        }
    }
    double e = 0.0;
    for (int count : occ.values()) {
        double p = (double) count / n;
        // log base 2 via the natural logarithm
        e += p * (Math.log(p) / Math.log(2));
    }
    return -e;
}

What exactly does a Sample Rate of 44100 sample?

I'm using the FMOD library to extract PCM from an MP3. I get the whole 2 channel - 16 bit thing, and I also get that a sample rate of 44,100 Hz is 44,100 samples of "sound" in 1 second. What I don't get is what exactly the 16-bit value represents. I know how to plot coordinates on an xy axis, but what am I plotting? The y axis represents time, the x axis represents what? Sound level? Is that the same as amplitude? How do I determine the different sounds that compose this value? I mean, how do I get a spectrum from a 16-bit number?
This may be a separate question, but it's actually what I really need answered: How do I get the amplitude at every 25 milliseconds? Do I take 44,100 values, divide by 40 (40 * 0.025 seconds = 1 sec) ? That gives 1102.5 samples; so would I feed 1102 values into a blackbox that gives me the amplitude for that moment in time?
Edited original post to add code I plan to test soon: (note, I changed the frame rate from 25 ms to 40 ms)
// 44100 / 25 frames = 1764 samples per frame -> 1764 * 2 channels * 2 bytes [16 bit sample] = 7056 bytes
private const int CHUNKSIZE = 7056;
uint bytesread = 0;
var squares = new double[CHUNKSIZE / 4];
const double scale = 1.0d / 32768.0d;
do
{
    result = sound.readData(data, CHUNKSIZE, ref read);
    Marshal.Copy(data, buffer, 0, CHUNKSIZE);
    //PCM samples are 16 bit little endian
    Array.Reverse(buffer);
    for (var i = 0; i < buffer.Length; i += 4)
    {
        var avg = scale * (Math.Abs((double)BitConverter.ToInt16(buffer, i)) + Math.Abs((double)BitConverter.ToInt16(buffer, i + 2))) / 2.0d;
        squares[i >> 2] = avg * avg;
    }
    var rmsAmplitude = ((int)(Math.Floor(Math.Sqrt(squares.Average()) * 32768.0d))).ToString("X2");
    fs.Write(buffer, 0, (int) read);
    bytesread += read;
    statusBar.Text = "writing " + bytesread + " bytes of " + length + " to output.raw";
} while (result == FMOD.RESULT.OK && read == CHUNKSIZE);
After loading an MP3, my rmsAmplitude seems to stay in the range 3C00 to 4900. Have I done something wrong? I was expecting a wider spread.

Yes, a sample represents amplitude (at that point in time).
To get a spectrum, you typically convert it from the time domain to the frequency domain.
For your last question: several approaches are used. You may want the RMS (root mean square) of the samples in each time window.
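As a rough sketch of the RMS approach, the following Python function computes one RMS value per 40 ms frame from interleaved 16-bit little-endian PCM. The function name, the default parameters, and the synthetic test signals are assumptions for illustration; the 1764 samples per frame match the figure in the question's code comment.

```python
import math
import struct

def frame_rms(pcm: bytes, sample_rate=44100, channels=2, frame_ms=40):
    """Return one RMS value (0.0 to 1.0) per full frame of 16-bit LE PCM."""
    samples_per_frame = sample_rate * frame_ms // 1000   # 1764 at 44.1 kHz
    frame_bytes = samples_per_frame * channels * 2       # 7056 bytes
    result = []
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        chunk = pcm[start:start + frame_bytes]
        # "<h" reads each sample as a little-endian signed 16-bit integer.
        values = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
        mean_square = sum(v * v for v in values) / len(values)
        result.append(math.sqrt(mean_square) / 32768.0)
    return result

# Synthetic input: one frame at constant half scale, then one silent frame.
half_scale = struct.pack("<h", 16384) * (1764 * 2)
silence = struct.pack("<h", 0) * (1764 * 2)
print(frame_rms(half_scale + silence))  # [0.5, 0.0]
```

Note that the samples are decoded in place, two bytes at a time. Reversing an entire buffer changes which bytes are paired into each sample, so that may be worth checking if the RMS values stay in an unexpectedly narrow band.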

Generally, the x axis is the time value and the y axis is the amplitude. To get the frequency, you need to take the Fourier transform of the data (most likely using the Fast Fourier Transform [FFT] algorithm).
To use one of the simplest "sounds", let's assume you have a pure tone with a single frequency f. This is represented (in the amplitude/time domain) as y = sin(2 * pi * f * x), where x is the time in seconds.
If you convert that into the frequency domain, you just end up with a single peak at Frequency = f.
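The time-to-frequency step can be sketched with a naive discrete Fourier transform in plain Python. The sample rate, tone frequency, and function name below are arbitrary choices for the illustration, kept tiny so the O(n^2) loop stays fast; real audio code would use an FFT library on windows of the 44,100 Hz signal.

```python
import cmath
import math

SAMPLE_RATE = 64   # samples per second (deliberately small)
TONE_HZ = 5        # frequency f of the test tone
N = 64             # one second of samples

# Time domain: y = sin(2 * pi * f * t) sampled at t = k / SAMPLE_RATE
samples = [math.sin(2 * math.pi * TONE_HZ * k / SAMPLE_RATE) for k in range(N)]

def dft_magnitudes(signal):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * b * k / n)
                for k, x in enumerate(signal)))
        for b in range(n // 2 + 1)
    ]

magnitudes = dft_magnitudes(samples)
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N
print(peak_hz)  # 5.0
```

All the energy lands in the bin that corresponds to 5 Hz, which is the "Frequency = f" result described above. A single sample carries no frequency information on its own; the spectrum only exists for a window of consecutive samples.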

Each sample represents the voltage of the analog signal at a given time.
