The function f in the following code simply prints its arguments and how many it received. However, it expands array parameters (but not ArrayLists), as illustrated on the line f(x) // 3. Is there any way to get f not to expand array parameters, or at the very least to detect that it has happened and perhaps correct for it? The reason is that my "real" f function isn't as trivial: it passes its parameters to a given function g, which often isn't a variable-argument function and instead expects an array directly as an argument, and the expansion by f mucks that up.
def f = {
Object... args ->
print "There are: ";
print args.size();
println " arguments and they are: ";
args.each { println it };
println "DONE"
}
def x = new int[2];
x[0] = 1;
x[1] = 2;
f(1,2); // 1
f([1,2]); // 2
f(x); // 3
I doubt there is any clean solution to this, as it behaves like Java varargs. You may test the size of the array inside the closure or, as in Java, use a method overload:
public class Arraying {
public static void main(String[] args) {
// prints "2"
System.out.println( concat( new Object[] { "a", "b" } ) );
// prints "a". Commenting the second concat(), prints "1"
System.out.println( concat( "a" ) );
// prints "3"
System.out.println( concat( "a", "b", "c" ) );
}
static String concat(Object... args) {
return String.valueOf(args.length);
}
static String concat(Object obj) { return obj.toString(); }
}
If you comment out the concat(Object obj) method, all three calls will match concat(Object... args).
You can use a label for the argument, as follows (the closure then receives a single map argument):
def f = {
Object... args ->
print "There are: ";
print args.size();
println " arguments and they are: ";
args.each { println it };
println "DONE"
}
def x = new int[2];
x[0] = 1;
x[1] = 2;
f(1,2); // 1
f([1,2]); // 2
f(a:x); // <--- used label 'a', or anything else
then the output is:
There are: 2 arguments and they are:
1
2
DONE
There are: 1 arguments and they are:
[1, 2]
DONE
There are: 1 arguments and they are:
[a:[1, 2]]
DONE
I've come across this
#define DsHook(a,b,c) if (!c##_) { INT_PTR* p=b+*(INT_PTR**)a; VirtualProtect(&c##_,4,PAGE_EXECUTE_READWRITE,&no); *(INT_PTR*)&c##_=*p; VirtualProtect(p,4,PAGE_EXECUTE_READWRITE,&no); *p=(INT_PTR)c; }
and everything is clear except the "c##_" word, what does that mean?
It means to "glue" together, so c and _ get "glued together" to form c_. This glueing happens after argument replacement in the macro. See my example:
#include <stdio.h>
#define glue(a,b) a##_##b
const char *hello_world = "Hello, World!";
int main(int arg, char *argv[]) {
printf("%s\n", glue(hello,world)); // prints Hello, World!
return 0;
}
It is called a token-pasting operator. Example:
// preprocessor_token_pasting.cpp
#include <stdio.h>
#define paster( n ) printf( "token" #n " = %d", token##n )
int token9 = 9;
int main()
{
paster(9);
}
Output
token9 = 9
That's concatenation that appends an underscore to the name passed as c. So when you use
DsHook(a,b,Something)
that part turns into
if (!Something_)
After preprocessing, a call such as DsHook(a,b,c) expands to:
if (!c_) { INT_PTR* p=b+*(INT_PTR**)a; VirtualProtect(&c_,4,PAGE_EXECUTE_READWRITE,&no); *(INT_PTR*)&c_=*p; VirtualProtect(p,4,PAGE_EXECUTE_READWRITE,&no); *p=(INT_PTR)c; }
The ## operator concatenates the value of c, which you pass as a macro parameter, with _.
Simple one:
#define Check(a) if(a##x == 0) { }
At call site:
int varx; // Note the x
Check(var);
Would expand as:
if(varx == 0) { }
It is called token concatenation, and it is used to concatenate tokens during preprocessing.
For example, the following code prints the values of c, c_ and c_spam:
#include<stdio.h>
#define DsHook(a,b,c) if (!c##_) \
{printf("c=%d c_ = %d and c_spam = %d\n",\
c, c##_,c##_spam);}
int main(){
int a,b,c=3;
int c_ = 0, c_spam = 4;
DsHook(a,b,c);
return 0;
}
Output:
c=3 c_ = 0 and c_spam = 4
Does C# 4.0 allow optional out or ref arguments?
No.
A workaround is to overload with another method that doesn't have out / ref parameters, and which just calls your current method.
public bool SomeMethod(out string input)
{
...
}
// new overload
public bool SomeMethod()
{
string temp;
return SomeMethod(out temp);
}
If you have C# 7.0, you can simplify:
// new overload
public bool SomeMethod()
{
return SomeMethod(out _); // declare out as an inline discard variable
}
(Thanks @Oskar / @Reiner for pointing this out.)
As already mentioned, this is simply not allowed, and I think that makes very good sense.
However, to add some more details, here is a quote from the C# 4.0 Specification, section 21.1:
Formal parameters of constructors, methods, indexers and delegate types can be declared optional:
fixed-parameter:
attributes(opt) parameter-modifier(opt) type identifier default-argument(opt)
default-argument:
= expression
A fixed-parameter with a default-argument is an optional parameter, whereas a fixed-parameter without a default-argument is a required parameter.
A required parameter cannot appear after an optional parameter in a formal-parameter-list.
A ref or out parameter cannot have a default-argument.
No, but another great alternative is having the method use a generic template class for optional parameters as follows:
public class OptionalOut<Type>
{
public Type Result { get; set; }
}
Then you can use it as follows:
public string foo(string value, OptionalOut<int> outResult = null)
{
// .. do something
if (outResult != null) {
outResult.Result = 100;
}
return value;
}
public void bar ()
{
string str = "bar";
string result;
OptionalOut<int> optional = new OptionalOut<int> ();
// example: call without the optional out parameter
result = foo (str);
Console.WriteLine ("Output was {0} with no optional value used", result);
// example: call it with optional parameter
result = foo (str, optional);
Console.WriteLine ("Output was {0} with optional value of {1}", result, optional.Result);
// example: call it with named optional parameter
foo (str, outResult: optional);
Console.WriteLine ("Output was {0} with optional value of {1}", result, optional.Result);
}
There actually is a way to do this that is allowed by C#. This gets back to C++, and rather violates the nice Object-Oriented structure of C#.
USE THIS METHOD WITH CAUTION!
Here's the way you declare and write your function with an optional parameter:
unsafe public void OptionalOutParameter(int* pOutParam = null)
{
int lInteger = 5;
// If the parameter is NULL, the caller doesn't care about this value.
if (pOutParam != null)
{
// If it isn't null, the caller has provided the address of an integer.
*pOutParam = lInteger; // Dereference the pointer and assign the return value.
}
}
Then call the function like this:
unsafe { OptionalOutParameter(); } // does nothing
int MyInteger = 0;
unsafe { OptionalOutParameter(&MyInteger); } // pass in the address of MyInteger.
In order to get this to compile, you will need to enable unsafe code in the project options. This is a really hacky solution that usually shouldn't be used, but if you for some strange, arcane, mysterious, management-inspired decision, REALLY need an optional out parameter in C#, then this will allow you to do just that.
ICYMI: among the new features for C# 7.0 enumerated here, "discards" are now allowed as out parameters in the form of a _, to let you ignore out parameters you don't care about:
p.GetCoordinates(out var x, out _); // I only care about x
P.S. If you're also confused by the "out var x" part, read about the new "Out Variables" feature at the same link.
No, but you can use a delegate (e.g. Action) as an alternative.
Inspired in part by Robin R's answer when facing a situation where I thought I wanted an optional out parameter, I instead used an Action delegate. I've borrowed his example code to modify for use of Action<int> in order to show the differences and similarities:
public string foo(string value, Action<int> outResult = null)
{
// .. do something
outResult?.Invoke(100);
return value;
}
public void bar ()
{
string str = "bar";
string result;
int optional = 0;
// example: call without the optional out parameter
result = foo (str);
Console.WriteLine ("Output was {0} with no optional value used", result);
// example: call it with optional parameter
result = foo (str, x => optional = x);
Console.WriteLine ("Output was {0} with optional value of {1}", result, optional);
// example: call it with named optional parameter
foo (str, outResult: x => optional = x);
Console.WriteLine ("Output was {0} with optional value of {1}", result, optional);
}
This has the advantage that the optional variable appears in the source as a normal int (the compiler wraps it in a closure class, rather than us wrapping it explicitly in a user-defined class).
The variable needs explicit initialisation because the compiler cannot assume that the Action will be called before the function call exits.
It's not suitable for all use cases, but worked well for my real use case (a function that provides data for a unit test, and where a new unit test needed access to some internal state not present in the return value).
Use an overloaded method without the out parameter to call the one with the out parameter for C# 6.0 and lower. I'm not sure why a C# 7.0 for .NET Core is even the correct answer for this thread when it was specifically asked if C# 4.0 can have an optional out parameter. The answer is NO!
For simple types you can do this using unsafe code, though it's not idiomatic nor recommended. Like so:
// unsafe since remainder can point anywhere
// and we can do arbitrary pointer manipulation
public unsafe int Divide( int x, int y, int* remainder = null ) {
if( null != remainder ) *remainder = x % y;
return x / y;
}
That said, there's no theoretical reason C# couldn't eventually allow something like the above with safe code, such as this below:
// safe because remainder must point to a valid int or to nothing
// and we cannot do arbitrary pointer manipulation
public int Divide( int x, int y, out? int remainder = null ) {
if( null != remainder ) *remainder = x % y;
return x / y;
}
Things could get interesting though:
// remainder is an optional output parameter
// (to a nullable reference type)
public int Divide( int x, int y, out? object? remainder = null ) {
if( null != remainder ) *remainder = 0 != y ? x % y : null;
return x / y;
}
The direct question has been answered in other well-upvoted answers, but sometimes it pays to consider other approaches based on what you're trying to achieve.
If you're wanting an optional parameter to allow the caller to possibly request extra data from your method on which to base some decision, an alternative design is to move that decision logic into your method and allow the caller to optionally pass a value for that decision criteria in. For example, here is a method which determines the compass point of a vector, in which we might want to pass back the magnitude of the vector so that the caller can potentially decide if some minimum threshold should be reached before the compass-point judgement is far enough away from the origin and therefore unequivocally valid:
public enum Quadrant {
North,
East,
South,
West
}
// INVALID CODE WITH MADE-UP USAGE PATTERN OF "OPTIONAL" OUT PARAMETER
public Quadrant GetJoystickQuadrant([optional] out magnitude)
{
Vector2 pos = GetJoystickPositionXY();
float azimuth = Mathf.Atan2(pos.y, pos.x) * 180.0f / Mathf.PI;
Quadrant q;
if (azimuth > -45.0f && azimuth <= 45.0f) q = Quadrant.East;
else if (azimuth > 45.0f && azimuth <= 135.0f) q = Quadrant.North;
else if (azimuth > -135.0f && azimuth <= -45.0f) q = Quadrant.South;
else q = Quadrant.West;
if ([optional.isPresent(magnitude)]) magnitude = pos.Length();
return q;
}
In this case we could move that "minimum magnitude" logic into the method and end-up with a much cleaner implementation, especially because calculating the magnitude involves a square-root so is computationally inefficient if all we want to do is a comparison of magnitudes, since we can do that with squared values:
public enum Quadrant {
None, // Too close to origin to judge.
North,
East,
South,
West
}
public Quadrant GetJoystickQuadrant(float minimumMagnitude = 0.33f)
{
Vector2 pos = GetJoystickPosition();
if (minimumMagnitude > 0.0f && pos.LengthSquared() < minimumMagnitude * minimumMagnitude)
{
return Quadrant.None;
}
float azimuth = Mathf.Atan2(pos.y, pos.x) * 180.0f / Mathf.PI;
if (azimuth > -45.0f && azimuth <= 45.0f) return Quadrant.East;
else if (azimuth > 45.0f && azimuth <= 135.0f) return Quadrant.North;
else if (azimuth > -135.0f && azimuth <= -45.0f) return Quadrant.South;
return Quadrant.West;
}
Of course, that might not always be viable. Since other answers mention C# 7.0, if instead what you're really doing is returning two values and allowing the caller to optionally ignore one, idiomatic C# would be to return a tuple of the two values, and use C# 7.0's Tuples with positional initializers and the _ "discard" parameter:
public (Quadrant, float) GetJoystickQuadrantAndMagnitude()
{
Vector2 pos = GetJoystickPositionXY();
float azimuth = Mathf.Atan2(pos.y, pos.x) * 180.0f / Mathf.PI;
Quadrant q;
if (azimuth > -45.0f && azimuth <= 45.0f) q = Quadrant.East;
else if (azimuth > 45.0f && azimuth <= 135.0f) q = Quadrant.North;
else if (azimuth > -135.0f && azimuth <= -45.0f) q = Quadrant.South;
else q = Quadrant.West;
return (q, pos.Length());
}
(Quadrant q, _) = GetJoystickQuadrantAndMagnitude();
if (q == Quadrant.South)
{
// Do something.
}
What about like this?
public bool OptionalOutParamMethod([Optional] ref string pOutParam)
{
return true;
}
You still have to pass a value for the parameter from C#, but it is an optional ref parameter (the [Optional] attribute lives in System.Runtime.InteropServices).
void foo(ref int? n)
{
n = null; // a void method cannot return a value, so assign through the ref parameter instead
}
I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.
For instance, if my file were the following;
Entry
-----
Yabba
Dabba
Doo
Then the result would be
Unique characters: {abdoy}
Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.
Update
I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.
Update 2
By Fast, I mean fast to implement...not necessarily fast to run.
BASH shell script version (no sed/awk):
while read -n 1 char; do echo "$char"; done < entry.txt | tr 'A-Z' 'a-z' | sort -u
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
#include <iostream>
#include <set>
#include <cctype> // isspace, tolower
int main() {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
char ch;
/* ignore whitespace and case */
while ( std::cin.get(ch) ) {
if (! isspace(ch) ) {
seen_chars.insert(tolower(ch));
}
}
for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
std::cout << *iter << std::endl;
}
return 0;
}
Note that I'm ignoring whitespace and it's case insensitive as requested.
For a 450,000+ entry file (chars.txt), here's a sample run time:
[user@host]$ g++ -o unique_chars unique_chars.cpp
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y
real 0m0.638s
user 0m0.612s
sys 0m0.017s
As requested, a pure shell-script "solution":
sed -e "s/./\0\n/g" inputfile | sort -u
It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.
For even more ridiculousness, I present the version that dumps the output on one line:
sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or generally, dictionary) implementation and just omit the value field. Use your characters as keys. These data structures generally filter out duplicate entries (hence the name set, from its mathematical usage: sets don't have a particular order and only unique values).
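As a rough illustration of the two variants described above, here is a minimal Python sketch. It is not taken from any answer in this thread: the function names are hypothetical, and the in-memory sample simply mirrors the example from the question.

```python
def unique_chars_set(text):
    """Collect distinct characters with a set, ignoring case and whitespace."""
    return {ch for ch in text.lower() if not ch.isspace()}


def unique_chars_dict(text):
    """Same idea with a dictionary used as a set: characters are keys, values are unused."""
    seen = {}
    for ch in text.lower():
        if not ch.isspace():
            seen[ch] = None  # storing an existing key again changes nothing
    return set(seen)


# Sample data mirroring the question's example (an assumption, not a real file).
sample = "Yabba\nDabba\nDoo\n"
print("Unique characters: {%s}" % "".join(sorted(unique_chars_set(sample))))
```

For a real file, the text could be obtained with something like open(path).read(), assuming the file fits in memory.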
Quick and dirty C program that's blazingly fast:
#include <stdio.h>
int main(void)
{
int chars[256] = {0}, c;
while((c = getchar()) != EOF)
chars[c] = 1;
for(c = 32; c < 127; c++) // printable chars only
{
if(chars[c])
putchar(c);
}
putchar('\n');
return 0;
}
Compile it, then do
cat file | ./a.out
To get a list of the unique printable characters in file.
Python w/sets (quick and dirty)
s = open("data.txt", "r").read()
print("Unique Characters: {%s}" % ''.join(set(s)))
Python w/sets (with nicer output)
import re
text = open("data.txt", "r").read().lower()
unique = re.sub(r'\W', '', ''.join(set(text))) # Ignore non-alphanumeric
print("Unique Characters: {%s}" % unique)
Here's a PowerShell example:
gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique
which produces:
D
Y
a
b
o
I like that it's easy to read.
EDIT: Here's a faster version:
$letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.
Why the arbitrary limitation that you need a "script" that does it?
What exactly is a script anyway?
Would Python do?
If so, then this is one solution:
import sys
s = set()
while True:
    line = sys.stdin.readline()
    if not line:
        break
    line = line.rstrip()
    for c in line.lower():
        s.add(c)
print("".join(sorted(s)))
Algorithm:
Slurp the file into memory.
Create an array of unsigned ints, initialized to zero.
Iterate though the in memory file, using each byte as a subscript into the array.
increment that array element.
Discard the in memory file
Iterate the array of unsigned int
if the count is not zero,
display the character, and its corresponding count.
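The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the input is treated as raw bytes (so a 256-entry array suffices), and a small in-memory sample stands in for the slurped file.

```python
def count_bytes(data):
    """Return a 256-entry list where entry i is how often byte value i occurs."""
    counts = [0] * 256           # one counter per possible byte value
    for b in bytearray(data):    # each byte is used as a subscript into the array
        counts[b] += 1
    return counts


def report(counts):
    """List (character, count) pairs for every byte value that occurred."""
    return [(chr(i), n) for i, n in enumerate(counts) if n]


# Stand-in for the slurped file; a real run might use open(path, "rb").read().
sample = b"Yabba\nDabba\nDoo\n"
for ch, n in report(count_bytes(sample)):
    print(repr(ch), n)
```

Note that this counts bytes, not characters, so multi-byte encodings such as UTF-8 would need decoding first.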
cat yourfile |
perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
Alternative solution using bash:
sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$
EDIT Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick but then, this solution has no real advantage to saua's (especially since he now uses the same sed pattern instead of explicit captures).
While not a script, this Java program will do the work. It's easy to understand and fast (to run).
import java.util.*;
import java.io.*;
public class Unique {
public static void main( String [] args ) throws IOException {
int c = 0;
Set<Character> s = new TreeSet<Character>();
while( ( c = System.in.read() ) != -1 ) { // read() returns -1 at end of stream
s.add( Character.toLowerCase((char)c));
}
System.out.println( "Unique characters:" + s );
}
}
You'll invoke it like this:
type yourFile | java Unique
or
cat yourFile | java Unique
For instance, the unique characters in the HTML of this question are:
Unique characters:[ , , , , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, #, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
Print unique characters (ASCII and Unicode UTF-8)
import codecs
file = codecs.open('my_file_name', encoding='utf-8')
# Runtime: O(1)
letters = set()
# Runtime: O(n), where n is the total number of characters in the file
for line in file:
    for character in line:
        letters.add(character)
# Runtime: O(k), where k is the number of unique characters
letter_str = ''.join(letters)
print(letter_str)
Save as unique.py, and run as python unique.py.
In C++ I would first loop through the letters of the alphabet, then run strchr() on each, with the file contents as a string. This tells you whether that letter exists; if it does, just add it to the list.
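The same idea, checking each letter of the alphabet against the whole text, can be sketched in Python, where the in operator plays the role of strchr(). The function name and the sample text are assumptions made for illustration only.

```python
import string


def letters_present(text):
    """Return the letters a-z that occur in text, ignoring case."""
    lowered = text.lower()
    # One membership test per letter of the alphabet, 26 scans at most.
    return "".join(letter for letter in string.ascii_lowercase if letter in lowered)


print(letters_present("Yabba\nDabba\nDoo\n"))
```

This only finds the letters a to z; digits, punctuation and non-ASCII characters would need a larger candidate alphabet.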
Try this file with JSDB Javascript (includes the javascript engine in the Firefox browser):
var seenAlreadyMap={};
var seenAlreadyArray=[];
while (!system.stdin.eof)
{
var L = system.stdin.readLine();
for (var i = L.length; i-- > 0; )
{
var c = L[i].toLowerCase();
if (!(c in seenAlreadyMap))
{
seenAlreadyMap[c] = true;
seenAlreadyArray.push(c);
}
}
}
system.stdout.writeln(seenAlreadyArray.sort().join(''));
Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted, a set is probably more efficient than a dictionary. However, both are supposed to take constant time to access items, and both run circles around a list, where for each character you search the list to see whether the character is already in it. Also, lists and dictionaries are built-in Python datatypes that everyone should be using all the time. So even if a set doesn't come to mind, a dictionary should.
file = open('location.txt', 'r')
letters = {}
for line in file:
    if line == "":
        break
    for character in line.strip():
        if character not in letters:
            letters[character] = True
file.close()
print("Unique Characters: {" + "".join(letters.keys()) + "}")
A C solution. Admittedly it is not the fastest to code solution in the world. But since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions so I wanted to post one for the pure sadistic pleasure :)
#include <stdio.h>
#include <stdlib.h> /* calloc */
#define CHARSINSET 256
#define FILENAME "location.txt"
char buf[CHARSINSET + 1];
char *getUniqueCharacters(int *charactersInFile) {
int x;
char *bufptr = buf;
for (x = 0; x< CHARSINSET;x++) {
if (charactersInFile[x] > 0)
*bufptr++ = (char)x;
}
*bufptr = '\0'; /* terminate the string; the original assigned to the pointer itself */
return buf;
}
int main() {
FILE *fp;
int c; /* int rather than char, so EOF can be told apart from valid characters */
int *charactersInFile = calloc(sizeof(int), CHARSINSET);
if (NULL == (fp = fopen(FILENAME, "rt"))) {
printf ("File not found.\n");
return 1;
}
while(1) {
c = getc(fp);
if (c == EOF) {
break;
}
if (c != '\n' && c != '\r')
charactersInFile[c]++;
}
fclose(fp);
printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
return 0;
}
Quick and dirty solution using grep (assuming the file name is "file"):
for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
if [ ! -z "`grep -li $char file`" ]; then
echo -n $char;
fi;
done;
echo
I could have made it a one-liner but just want to make it easier to read.
(EDIT: forgot the -i switch to grep)
Well my friend, I think this is what you had in mind....At least this is the python version!!!
f = open("location.txt", "r") # open file
ll = sorted(list(f.read().lower())) #Read file into memory, split into individual characters, sort list
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
f.close()
print("Unique Characters: {%s}" % "".join(ll)) # print list of characters; a newline in the input will show up as a line break
It does not iterate through each character, it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM) but for shorter files it is fun :)
I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, I also reduced the active code from 3 lines to 2. Basically if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports.....Anyway now we are having fun :)
import itertools, sys
# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x:x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print("Unique Characters: {%s}" % "".join(ll)) # print list of characters; a newline in the input will show up as a line break
This answer above mentioned using a dictionary.
If so, the code presented there can be streamlined a bit, since the Python documentation states:
It is best to think of a dictionary as
an unordered set of key: value pairs,
with the requirement that the keys are
unique (within one dictionary).... If
you store using a key that is already
in use, the old value associated with
that key is forgotten.
Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:
if character not in letters:
And that should make it a little faster.
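A minimal sketch of the streamlined loop, assuming the same dictionary approach as the answer referred to above. The function name is hypothetical, and a list of strings stands in for the open file.

```python
def unique_letters(lines):
    """Collect the distinct characters of all lines as dictionary keys."""
    letters = {}
    for line in lines:
        for character in line.strip():
            # No membership test needed: assigning to an existing key
            # simply overwrites its value and leaves the key set unchanged.
            letters[character] = True
    return letters


# Stand-in for iterating over a file object (assumed sample data).
result = unique_letters(["Yabba\n", "Dabba\n", "Doo\n"])
print("Unique Characters: {" + "".join(sorted(result)) + "}")
```

Like the original, this version is case-sensitive, so "D" and "d" would be counted separately.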
Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code
using System;
using System.IO;
using System.Collections;
using System.Diagnostics;
namespace ConsoleApplication {
class Program {
static void Main(string[] args) {
FileInfo fileInfo = new FileInfo(@"C:/data.txt");
Console.WriteLine(fileInfo.Length);
Stopwatch sw = new Stopwatch();
sw.Start();
Hashtable table = new Hashtable();
StreamReader sr = new StreamReader(@"C:/data.txt");
while (!sr.EndOfStream) {
char c = Char.ToLower((char)sr.Read());
if (!table.Contains(c)) {
table.Add(c, null);
}
}
sr.Close();
foreach (char c in table.Keys) {
Console.Write(c);
}
Console.WriteLine();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
}
}
produces output
4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .
The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.
s = open("text.txt", "r").read()
unique = {}
for ch in s:
    if ch in unique:
        unique[ch] = unique[ch] + 1
    else:
        unique[ch] = 1
print(unique)
Python without using a set.
file = open('location', 'r')
letters = []
for line in file:
    for character in line:
        if character not in letters:
            letters.append(character)
print(letters)
Old question, I know, but here's a fast solution, meaning it runs fast, and it's probably also pretty fast to code if you know how to copy/paste ;)
BACKGROUND
I had a huge csv file (12 GB, 1.34 million lines, 12.72 billion characters) that I was loading into postgres that was failing because it had some "bad" characters in it, so naturally I was trying to find a character not in that file that I could use as a quote character.
1. First try: Jay's C++ solution
I started with @jay's C++ answer:
(Note: all of these code examples were compiled with g++ -O2 uniqchars.cpp -o uniqchars)
#include <iostream>
#include <set>
#include <cctype> // isspace, tolower
int main() {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
char ch;
/* ignore whitespace and case */
while ( std::cin.get(ch) ) {
if (! isspace(ch) ) {
seen_chars.insert(tolower(ch));
}
}
for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
std::cout << *iter << std::endl;
}
return 0;
}
Timing for this one:
real 10m55.026s
user 10m51.691s
sys 0m3.329s
2. Read entire file at once
I figured it'd be more efficient to read in the entire file into memory at once, rather than all those calls to cin.get(). This reduced the run time by more than half.
(I also added a filename as a command line argument, and made it print out the characters separated by spaces instead of newlines).
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <cstdio> // fprintf
#include <cctype> // isspace, tolower
int main(int argc, char **argv) {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
std::ifstream ifs(argv[1]);
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
fprintf(stderr, "Size of file: %zu\n", size);
std::string str(size, ' ');
ifs.seekg(0);
ifs.read(&str[0], size);
/* ignore whitespace and case */
for (char& ch : str) {
if (!isspace(ch)) {
seen_chars.insert(tolower(ch));
}
}
for(iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
std::cout << *iter << " ";
}
std::cout << std::endl;
return 0;
}
Timing for this one:
real 4m41.910s
user 3m32.014s
sys 0m17.858s
3. Remove isspace() check and tolower()
Besides the set insert, isspace() and tolower() are the only things happening in the for loop, so I figured I'd remove them. It shaved off another 1.5 minutes.
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <cstdio> // fprintf
int main(int argc, char **argv) {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
std::ifstream ifs(argv[1]);
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
fprintf(stderr, "Size of file: %zu\n", size);
std::string str(size, ' ');
ifs.seekg(0);
ifs.read(&str[0], size);
for (char& ch : str) {
// removed isspace() and tolower()
seen_chars.insert(ch);
}
for(iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
std::cout << *iter << " ";
}
std::cout << std::endl;
return 0;
}
Timing for final version:
real 3m12.397s
user 2m58.771s
sys 0m13.624s
The simple solution from @Triptych helped me already (my input was a file of 124 MB, so reading the entire contents into memory still worked).
However, I had a problem with encoding: Python didn't interpret the UTF-8 encoded input correctly. So here's a slightly modified version which works for UTF-8 encoded files (and also sorts the collected characters in the output):
import io
with io.open("my-file.csv", 'r', encoding='utf8') as f:
    text = f.read()
print("Unique Characters: {%s}" % ''.join(sorted(set(text))))