The JPEG standard defines the DECODE procedure like below. I'm confused about a few parts.
CODE > MAXCODE(I), if this is true then it enters in a loop and apply left shift (<<) to code. AFAIK, if we apply left shift on non-zero number, the number will be larger then previous. In this figure it applies SLL (shift left logical operation), would't CODE always be greater than MAXCODE?
Probably I coundn't read the figure correctly
What does + NEXTBIT mean? For instance if CODE bits are 10101 and NEXTBIT is 00000001 then will result be 101011 (like string appending), am I right?
Does HUFFVAL list is same as defined in DHT marker (Vi,j values). Do I need to build extra lookup table or something? Because it seems the procedure used that list directly
Thanks for clarifications
EDIT:
My DECODE code (C):
uint8_t
jpg_decode(ImScan * __restrict scan,
ImHuffTbl * __restrict huff) {
int32_t i, j, code;
i = 1;
code = jpg_nextbit(scan);
/* TODO: infinite loop ? */
while (code > huff->maxcode[i]) {
i++;
code = (code << 1) | jpg_nextbit(scan);
}
j = huff->valptr[i];
j = code + huff->delta[i]; /* delta = j - mincode[i] */
return huff->huffval[j];
}
It's not MAXCODE, it's MAXCODE(I), which is a different value each time I is incremented.
+NEXTBIT means literally adding the next bit from the input, which is a 0 or a 1. (NEXTBIT is not 00000001. It is only one bit.)
Once you've found the length of the current code, you get the Vi,j indexing into HUFFVAL decoding table.
Related
I've been running Blarggs CPU tests through my Gameboy emulator, and the op r,r test shows that my ADC instruction is not working properly, but that ADD is. My understanding is that the only difference between the two is adding the existing carry flag to the second operand before addition. As such, my ADC code is the following:
void Emu::add8To8Carry(BYTE &a, BYTE b) //4 cycles - 1 byte
{
if((Flags >> FLAG_CARRY) & 1)
b++;
FLAGCLEAR_N;
halfCarryAdd8_8(a, b); //generates H flag based on addition
carryAdd8_8(a, b); //generates C flag appropriately
a+=b;
if(a == 0)
FLAGSET_Z;
else
FLAGCLEAR_Z;
}
I entered the following into a test ROM:
06 FE 3E 01 88
Which leaves A with the value 0 (Flags = B0) when the carry flag is set, and FF (Flags = 00) when it is not. This is how it should work, as far as my understanding goes. However, it still fails the test.
From my research, I believe that flags are affected in an identical manner to ADD. Literally the only change in my code from the working ADD instruction is the addition of the flag check/potential increment in the first two lines, which my test code seems to prove works.
Am I missing something? Perhaps there's a peculiarity with flag states between ADD/ADC? As a side note, SUB instructions also pass, but SBC fails in the same way.
Thanks
The problem is that b is an 8 bit value. If b is 0xff and carry is set then adding 1 to b will set it to 0 and won't generate carry if added with a >= 1. You get similar problems with the half carry flag if the lower nybble is 0xf.
This might be fixed if you call halfCarryAdd8_8(a, b + 1); and carryAdd8_8(a, b + 1); when carry is set. However, I suspect that those routines also take byte operands so you may have to make changes to them internally. Perhaps by adding the carry as a separate argument so that you can do tmp = a + b + carry; without overflow of b. But I can only speculate without the source to those functions.
On a somewhat related note, there's a fairly simple way to check for carry over all the bits:
int sum = a + b;
int no_carry_sum = a ^ b;
int carry_into = sum ^ no_carry_sum;
int half_carry = carry_into & 0x10;
int carry = carry_info & 0x100;
How does that work? Consider that bitwise "xor" gives the expected result of each bit if there is no carry going in to that bit: 0 ^ 0 == 0, 1 ^ 0 == 0 ^ 1 == 1 and 1 ^ 1 == 0. By xoring sum with no_carry_sum we get the bits where the sum differs from the bit-by-bit addition. sum is only different whenever there is a carry into a particular bit position. Thus both the half carry and carry bits can be obtained with almost no overhead.
I am not familiar with C++ and current face a problem about size_t calculation with double type.
I provide a part of source code as below. The variable "storage" is define as double and "pos" as size_t type. How come they can be calculate together? I review the value of "pos and it shows value like 0, 1, 2 and so on. Moreover, in the case of double* result = storage + pos, it shows 108 + 2 comes out the result x 117.
Further, sometimes 108 + 0 comes out the result x zero. what the condition lead to the result?
How do I know the exact value of size_t before the calculation?
Any advice & suggestion is appreciated.
double* getPosValue(size_t pos, IdentifierType *idRule, unsigned int *errorNumber, bool *found)
{
double * storage = *from other function with value 108*
double* result = storage + pos;
uint16_t* stat = status + pos; }
The size of a variable (or type) can be obtained with:
sizeof(variableNameOrTypeName)
If you're after the address of a given array element such as variableName[42], it's simply:
&(variableName[42])
with no explicit mucking about with pointers.
If you want to manipulate the actual double value when you only have a pointer to it, you need to dereference the pointer. For example:
double xyzzy = 108.0; // this is the VALUE.
double *pXyzzy = &xyzzy; // this is a POINTER to it.
double plugh = *pXyzzy + 12.0;
The final line above gets the value from the pointer (*pXyzzy) and adds twelve to that, before storing it into another variable named plugh.
You should be very wary of things like:
double * storage = 108;
That creates a pointer to a double with the address of 108. In no way does it create an actual double with the value 108. Dereferencing that pointer is likely to lead to, shall we say, interesting results :-)
I have an array that I want to load up from a binary file:
parameter c_ROWS = 8;
parameter c_COLS = 16;
reg [15:0] r_Image_Raw[0:c_ROWS-1][0:c_COLS-1];
My input file is binary data, 256 bytes long (same total space as r_Image_Raw). I tried using $fread to accomplish this, but it only works through the 4th column of the last row:
n_File_ID = $fopen(s_File_Name, "r");
n_Temp = $fread(r_Image_Raw, n_File_ID);
I also tried using $fscanf for this, but I get an error about packed types when opening the synthesis tool:
while (!$feof(n_File_ID))
n_Temp = $fscanf(n_File_ID, "%h", r_Image_Raw);
I feel like this should be easy to do. Do I have create a 2D for loop and loop through the r_Image_Raw variable, reading in 16 bits at a time? I feel like it should not be that complicated.
I realized my mistake. It should be:
n_File_ID = $fopen(s_File_Name, "rb");
n_Temp = $fread(r_Image_Raw, n_File_ID);
I was using "r" and not "rb" to specify that it was a binary file. Interestingly enough, "r" does work for the majority of the data, but it is unable read in the last ~13 locations from the file.
Try this.
f_bin = $fopen(s_File_Name,"rb");
for (r = 0; r < c_ROWS; r = r+1) begin
for (c = 0; c < c_COLS; c = c+1) begin
f = $fread(r16,f_bin);
r_Image_Raw[r][c] = r16;
end
end
See that $fread(r16,f_bin) first param is reg, second - file!
Below an example for reading from a binary file with systemverilog.
As shown in IEEE SV Standard documentation, the "nchar_code" will return the number of bytes/chars read. In case EOF have been already reached on last read this number will be zero.
Please, notice that "nchar_code" can be zero but EOF has not been reached, this happens if you have spaces or returns at the end of the data file.
You can control the number of bytes to be read with the $fread function. This is done with the type definition of the "data_write_temp" or "mem" of the below examples. If the "data_write_temp" variable is 16bits long then it will read 16bits each time the $fread is called. Besides, $fread will return "nchar_code=2" because 16bits are 2bytes. In case, "data_write_temp" is 32bits as in the example, the $fread will read nchar_code=4bytes(32bits). You can also define an array and the $fread function will try to fill that array.
Lets define a multidimensional array mem.
logic [31:0] mem [0:2][0:4][5:8];
In the example word contents, wzyx,
-w shows the start of the word
-z corresponds to words of the [0:2] dimension (3 blocks).
-y corresponds to words of the [0:4] dimension (5 rows).
-x corresponds to words of the [5:8] dimension (4 columns).
The file will be structure as below (notice #z shows the z dimension blocks):
#0 w005 w006 w007 w008
w015 w016 w017 w018
w025 w026 w027 w028
w035 w036 w037 w038
w045 w046 w047 w048
#1 w105 w106 w107 w108
w115 w116 w117 w118
w125 w126 w127 w128
w135 w136 w137 w138
w145 w146 w147 w148
#2 w205 w206 w207 w208
w215 w216 w217 w218
w225 w226 w227 w228
w235 w236 w237 w238
w245 w246 w247 w248
In the previous structure, the numbers shows the index of each dimension.
e.g. w048 means, the word w (32bits) value on index z =0, index y= 4 and index x= 8.
Now, you have many ways to read this.
You can read all in a single shot using the type "mem" declared above, or you can do a while loop until EOF reading pieces of 32bits using a "data_write_temp" variable of 32bits. The loop is interesting if you want to do something some checks for every word piece and you are not interested having a memory value.
In case multidimensional array / single shot read is chosen, then you can either use $fread or use an specific function $readmemh defined in SV standard.
$readmemh("mem.data", mem, 1, (3*5*4));
is equivalent to
$readmemh("mem.data", mem);
The $readmemh spare you the need to open/close the file.
If you use $fread for one shot read
logic [31:0] mem [0:2][0:4][5:8];
register_init_id = $fopen("mem.data","rb");
nchar_code = $fread(mem, register_init_id);
if (nchar_code!=(3*5*4)*4)) begin
`uvm_error("do_read_file", $sformatf("Was not possible to read the whole expected bytes"));
end
$fclose(register_init_id);
In case you wanted to do a loop using 32b word read. Then see the following example.
The example uses the data which is read from the file to write to AHB Bus using an AHB Verification Component.
logic [31:0] data_write_temp;
...
//DO REGISTER FILE
register_init_id = $fopen("../../software/binary.bin","rb");
if (register_init_id==0) begin `uvm_error("do_read_file", $sformatf("Was not possible to open the register_init_id file")); end
count_32b_words=0;
while(!$feof(register_init_id)) begin
nchar_code = $fread(data_write_temp, register_init_id);
if ((nchar_code!=4)||(nchar_code==0)) begin
if (nchar_code!=0) begin
`uvm_error("do_read_file", $sformatf("Was not possible to read from file a whole 4bytes word:%0d",nchar_code));
end
end else begin
tmp_ahb_address = (pnio_pkg::conf_ahb_register_init_file_part1 + 4*count_32b_words);
data_write_temp = (data_write_temp << 8*( (tmp_ahb_address)%(DATAWIDTH/(8))));//bit shift if necessary not aligned to 4 bytes
`uvm_create_on(m_ahb_xfer,p_sequencer.ahb0_seqr);
assert(m_ahb_xfer.randomize(* solvefaildebug *) with {
write == 1;//perform a write
HADDR == tmp_ahb_address;
HSIZE == SIZE_32_BIT;
HBURST == HBURST_SINGLE;
HXDATA.size() == 1; //only one data for single bust
HXDATA[0] == data_write_temp;
}) else $fatal (0, "Randomization failed"); //end assert
`uvm_send(m_ahb_xfer);
count_32b_words++;
end //end if there is a word read
end //end while
$fclose(register_init_id);
I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudo code, I would like to able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N
You can use new work_group_reduce_add() function for sum reduction inside single work group if you have support for OpenCL C 2.0 features
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
In a sequence S of n characters; each character may occur many times in the sequence. You want to find the longest subsequence of S where all occurrences of the same character are together in one place;
For ex. if S = aaaccaaaccbccbbbab, then the longest such subsequence(answer) is aaaaaaccccbbbb i.e= aaa__aaacc_ccbbb_b.
In other words, any alphabet character that appears in S may only appear in one contiguous block in the subsequence. If possible, give a polynomial time
algorithm to determine the solution.
Design
Below I give a C++ implementation of a dynamic programming algorithm that solves this problem. An upper bound on the running time (which is probably not tight) is given by O(g*(n^2 + log(g))), where n is the length of the string and g is the number of distinct subsequences in the input. I don't know a good way to characterise this number, but it can be as bad as O(2^n) for a string consisting of n distinct characters, making this algorithm exponential-time in the worst case. It also uses O(ng) space to hold the DP memoisation table. (A subsequence, unlike a substring, may consist of noncontiguous character from the original string.) In practice, the algorithm will be fast whenever the number of distinct characters is small.
The two key ideas used in coming up with this algorithm were:
Every subsequence of a length-n string is either (a) the empty string or (b) a subsequence whose first element is at some position 1 <= i <= n and which is followed by another subsequence on the suffix beginning at position i+1.
If we append characters (or more specifically character positions) one at a time to a subsequence, then in order to build all and only the subsequences that satisfy the validity criteria, whenever we add a character c, if the previous character added, p, was different from c, then it is no longer possible to add any p characters later on.
There are at least 2 ways to manage the second point above. One way is to maintain a set of disallowed characters (e.g. using a 256-bit array), which we add to as we add characters to the current subsequence. Every time we want to add a character to the current subsequence, we first check whether it is allowed.
Another way is to realise that whenever we have to disallow a character from appearing later in the subsequence, we can achieve this by simply deleting all copies of the character from the remaining suffix, and using this (probably shorter) string as the subproblem to solve recursively. This strategy has the advantage of making it more likely that the solver function will be called multiple times with the same string argument, which means more computation can be avoided when the recursion is converted to DP. This is how the code below works.
The recursive function ought to take 2 parameters: the string to work on, and the character most recently appended to the subsequence that the function's output will be appended to. The second parameter must be allowed to take on a special value to indicate that no characters have been appended yet (which happens in the top-level recursive case). One way to accomplish this would be to choose a character that does not appear in the input string, but this introduces a requirement not to use that character. The obvious workaround is to pass a 3rd parameter, a boolean indicating whether or not any characters have already been added. But it's slightly more convenient to use just 2 parameters: a boolean indicating whether any characters have been added yet, and a string. If the boolean is false, then the string is simply the string to be worked on. If it is true, then the first character of the string is taken to be the last character added, and the rest is the string to be worked on. Adopting this approach means the function takes only 2 parameters, which simplifies memoisation.
As I said at the top, this algorithm is exponential-time in the worst case. I can't think of a way to completely avoid this, but some optimisations can help certain cases. One that I've implemented is to always add maximal contiguous blocks of the same character in a single step, since if you add at least one character from such a block, it can never be optimal to add fewer than the entire block. Other branch-and-bound-style optimisations are possible, such as keeping track of a globally best string so far and cutting short the recursion whenever we can be certain that the current subproblem cannot produce a longer one -- e.g. when the number of characters added to the subsequence so far, plus the total number of characters remaining, is less than the length of the best subsequence so far.
Code
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <functional>
#include <map>
using namespace std;
class RunFinder {
string s;
map<string, string> memo[2]; // DP matrix
// If skip == false, compute the longest valid subsequence of t.
// Otherwise, compute the longest valid subsequence of the string
// consisting of t without its first character, taking that first character
// to be the last character of a preceding subsequence that we will be
// adding to.
string calc(string const& t, bool skip) {
map<string, string>::iterator m(memo[skip].find(t));
// Only calculate if we haven't already solved this case.
if (m == memo[skip].end()) {
// Try the empty subsequence. This is always valid.
string best;
// Try starting a subsequence whose leftmost position is one of
// the remaining characters. Instead of trying each character
// position separately, consider only contiguous blocks of identical
// characters, since if we choose one character from this block there
// is never any harm in choosing all of them.
for (string::const_iterator i = t.begin() + skip; i != t.end();) {
if (t.end() - i < best.size()) {
// We can't possibly find a longer string now.
break;
}
string::const_iterator next = find_if(i + 1, t.end(), bind1st(not_equal_to<char>(), *i));
// Just use next - 1 to cheaply give us an extra char at the start; this is safe
string u(next - 1, t.end());
u[0] = *i; // Record the previous char for the recursive call
if (skip && *i != t[0]) {
// We have added a new segment that is different from the
// previous segment. This means we can no longer use the
// character from the previous segment.
u.erase(remove(u.begin() + 1, u.end(), t[0]), u.end());
}
string v(i, next);
v += calc(u, true);
if (v.size() > best.size()) {
best = v;
}
i = next;
}
m = memo[skip].insert(make_pair(t, best)).first;
}
return (*m).second;
}
public:
RunFinder(string s) : s(s) {}
string calc() {
return calc(s, false);
}
};
int main(int argc, char **argv) {
RunFinder rf(argv[1]);
cout << rf.calc() << '\n';
return 0;
}
Example results
C:\runfinder>stopwatch runfinder aaaccaaaccbccbbbab
aaaaaaccccbbbb
stopwatch: Terminated. Elapsed time: 0ms
stopwatch: Process completed with exit code 0.
C:\runfinder>stopwatch runfinder abbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbfabbaaasdbasdnfa,mnbmansdbfsbdnamsdnbf
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,mnnsdbbbf
stopwatch: Terminated. Elapsed time: 609ms
stopwatch: Process completed with exit code 0.
C:\runfinder>stopwatch -v runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop
stopwatch: Command to be run: <runfinder abcdefghijklmnopqrstuvwxyz123456abcdefghijklmnop>.
stopwatch: Global memory situation before commencing: Used 2055507968 (49%) of 4128813056 virtual bytes, 1722564608 (80%) of 2145353728 physical bytes.
stopwatch: Process start time: 21/11/2012 02:53:14
abcdefghijklmnopqrstuvwxyz123456
stopwatch: Terminated. Elapsed time: 8062ms, CPU time: 7437ms, User time: 7328ms, Kernel time: 109ms, CPU usage: 92.25%, Page faults: 35473 (+35473), Peak working set size: 145440768, Peak VM usage: 145010688, Quota peak paged pool usage: 11596, Quota peak non paged pool usage: 1256
stopwatch: Process completed with exit code 0.
stopwatch: Process completion time: 21/11/2012 02:53:22
The last run, which took 8s and used 145Mb, shows how it can have problems with strings containing many distinct characters.
EDIT: Added in another optimisation: we now exit the loop that looks for the place to start the subsequence if we can prove that it cannot possibly be better than the best one discovered so far. This drops the time needed for the last example from 32s down to 8s!
EDIT: This solution is wrong for OP's problem. I'm not deleting it because it might be right for someone else. :)
Consider a related problem: find the longest subsequence of S of consecutive occurrences of a given character. This can be solved in linear time:
char c = . . .; // the given character
int start = -1;
int bestStart = -1;
int bestLength = 0;
int currentLength = 0;
for (int i = 0; i < S.length; ++i) {
if (S.charAt(i) == c) {
if (start == -1) {
start = i;
}
++currentLength;
} else {
if (currentLength > bestLength) {
bestStart = start;
bestLength = currentLength;
}
start = -1;
currentLength = 0;
}
}
if (bestStart >= 0) {
// longest sequence of c starts at bestStart
} else {
// character c does not occur in S
}
If the number of distinct characters (call it m) is reasonably small, just apply this algorithm in parallel to each character. This can be easily done by converting start, bestStart, currentLength, bestLength to arrays m long. At the end, scan the bestLength array for the index of the largest entry and use the corresponding entry in the bestStart array as your answer. The total complexity is O(mn).
import java.util.*;
public class LongestSubsequence {
/**
* #param args
*/
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
String str = sc.next();
execute(str);
}
static void execute(String str) {
int[] hash = new int[256];
String ans = "";
for (int i = 0; i < str.length(); i++) {
char temp = str.charAt(i);
hash[temp]++;
}
for (int i = 0; i < hash.length; i++) {
if (hash[i] != 0) {
for (int j = 0; j < hash[i]; j++)
ans += (char) i;
}
}
System.out.println(ans);
}
}
Space: 256 -> O(256), I don't if it's correct to say this way..., cause O(256) I think is O(1)
Time: O(n)