Extra characters and symbols outputted when doing substitution in C - string

When I run the code using the following key, extra characters are outputted...
TERMINAL WINDOW:
$ ./substitution abcdefghjklmnopqrsTUVWXYZI
plaintext: heTUXWVI ii ssTt
ciphertext: heUVYXWJ jj ttUuh|
This is the instructions (cs50 substitution problem)
Design and implement a program, substitution, that encrypts messages using a substitution cipher.
Implement your program in a file called substitution.c in a ~/pset2/substitution directory.
Your program must accept a single command-line argument, the key to use for the substitution. The key itself should be case-insensitive, so whether any character in the key is uppercase or lowercase should not affect the behavior of your program.
If your program is executed without any command-line arguments or with more than one command-line argument, your program should print an error message of your choice (with printf) and return from main a value of 1 (which tends to signify an error) immediately.
If the key is invalid (as by not containing 26 characters, containing any character that is not an alphabetic character, or not containing each letter exactly once), your program should print an error message of your choice (with printf) and return from main a value of 1 immediately.
Your program must output plaintext: (without a newline) and then prompt the user for a string of plaintext (using get_string).
Your program must output ciphertext: (without a newline) followed by the plaintext’s corresponding ciphertext, with each alphabetical character in the plaintext substituted for the corresponding character in the ciphertext; non-alphabetical characters should be outputted unchanged.
Your program must preserve case: capitalized letters must remain capitalized letters; lowercase letters must remain lowercase letters.
After outputting ciphertext, you should print a newline. Your program should then exit by returning 0 from main.
My code:
#include <cs50.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int main(int argc,string argv[])
{
char alpha[26] = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'};
string key = argv[1];
int totalchar = 0;
for (char c ='a'; c <= 'z'; c++)
{
for (int i = 0; i < strlen(key); i++)
{
if (tolower(key[i]) == c)
{
totalchar++;
}
}
}
//accept only singular 26 key
if (argc == 2 && totalchar == 26)
{
string plaint = get_string("plaintext: ");
int textlength =strlen(plaint);
char subchar[textlength];
for (int i= 0; i< textlength; i++)
{
for (int j =0; j<26; j++)
{
// substitute
if (tolower(plaint[i]) == alpha[j])
{
subchar[i] = tolower(key[j]);
// keep plaintext's case
if (plaint[i] >= 'A' && plaint[i] <= 'Z')
{
subchar[i] = (toupper(key[j]));
}
}
// if isn't char
if (!(isalpha(plaint[i])))
{
subchar[i] = plaint[i];
}
}
}
printf("ciphertext: %s\n", subchar);
return 0;
}
else
{
printf("invalid input\n");
return 1;
}
}

strcmp compares two strings. plaint[i] and alpha[j] are chars. The can be compared with "regular" comparison operators, like ==.

Related

Why I cant print initials of a name?

I am trying to write a code that prints initials of a given name? I have been getting error whenever I use '/0' - to initialize the ith character of a given string. I am using it to identify initials of the 2nd word? Any suggestions to detect the 2nd word and print the initial of the 2nd word? Additionally, I am trying to write this code so that it ignores any additional spaces:
#include <stdio.h>
#include <cs50.h>
#include <string.h>
#include <ctype.h>
int main(void)
{
printf ("Enter you Name:");//print name
string s = get_string(); //get input from user
printf("%c", toupper(s[0])); // use placeholder to call the string
for (int i=0, n = strlen(s); i < n; i++)
{
if (s[i] == '/0')
{
printf("%c", toupper(s[i+1]));
}
}
printf ("\n");
}
I'm not sure what your get_string() function does but this does what you're asking for.
int main(void)
{
printf ("Enter you Name:");//print name
char input[100];
cin.getline(input, sizeof(input));
string s(input);
//get input from user
printf("%c", toupper(s[0])); // use placeholder to call the string
for (int i=0, n = s.length(); i < n; i++)
{
if (isspace(s.at(i)))
{
printf("%c", toupper(s[i+1]));
}
}
printf ("\n");
}
Try to use ' ' instead of '/0'.
if (s[i] == ' ')
or if you prefer ASCII codes (Space is 0x20 or 32 as decimal):
if (s[i] == 0x20)
your code will work if there is only 1 space between names. To avoid this, the condition should also check the next char. If it's not a space then probably it's a 'letter':
if (s[i] == 0x20 && s[i+1] != 0x20) {...
Note that the for cycle should stop at i < n - 1, otherwise the loop will fail if there are trailing spaces...

Append Char To StringBuilder C++/CLI

I am trying to use StringBuilder to create the output that is being sent over the serial port for a log file. The output is stored in a byte array, and I am recursing through it.
ref class UART_G {
public:
static array<System::Byte>^ message = nullptr;
static uint8_t message_length = 0;
};
static void logSend ()
{
StringBuilder^ outputsb = gcnew StringBuilder();
outputsb->Append("Sent ");
for (uint8_t i = 0; i < UART_G::message_length; i ++)
{
unsigned char mychar = UART_G::message[i];
if (
(mychar >= ' ' && mychar <= 'Z') || //Includes 0-9, A-Z.
(mychar >= '^' && mychar <= '~') || //Includes a-z.
(mychar >= 128 && mychar <= 254)) //I think these are okay.
{
outputsb->Append(L""+mychar);
}
else
{
outputsb->Append("[");
outputsb->Append(mychar);
outputsb->Append("]");
}
}
log_line(outputsb->ToString());
}
I want all plain text characters (eg A, :) to be sent as text, while functional characters (eg BEL, NEWLINE) will be sent like [7][13].
What is happening is that the StringBuilder, in all cases, is outputting the character as a number. For example, A is being sent out as 65.
For example, if I have the string 'APPLE' and a newline in my byte array, I want to see:
Sent APPLE[13]
Instead, I see:
Sent 6580807669[13]
I have tried every way imaginable to get it to display the character properly, including type-casting, concatenating it to a string, changing the variable type, etc... I would really appreciate if anyone knows how to do this. My log files are largely unreadable without this function.
You're getting the ASCII values because the compiler is choosing one of the Append overloads that takes an integer of some sort. To fix this, you could do a explicit cast to System::Char, to force the correct overload.
However, that won't necessarily give the proper results for 128-255. You could cast a value in that range from Byte to Char, and it'll give something, but not necessarily what you expect. First off, 0x80 through 0x9F are control characters, and whereever you're getting the bytes from might not intend the same representation for 0xA0 through 0xFF as Unicode has.
In my opinion, the best solution would be to use the "[value]" syntax that you're using for the other control characters for 0x80 through 0xFF as well. However, if you do want to convert those to characters, I'd use Encoding::Default, not Encoding::ASCII. ASCII only defines 0x00 through 0x7F, 0x80 and higher will come out as "?". Encoding::Default is whatever code page is defined for the language you have selected in Windows.
Combine all that, and here's what you'd end up with:
for (uint8_t i = 0; i < UART_G::message_length; i ++)
{
unsigned char mychar = UART_G::message[i];
if (mychar >= ' ' && mychar <= '~' && mychar != '[' && mychar != ']')
{
// Use the character directly for all ASCII printable characters,
// except '[' and ']', because those have a special meaning, below.
outputsb->Append((System::Char)(mychar));
}
else if (mychar >= 128)
{
// Non-ASCII characters, use the default encoding to convert to Unicode.
outputsb->Append(Encoding::Default->GetChars(UART_G::message, i, 1));
}
else
{
// Unprintable characters, use the byte value in brackets.
// Also do this for bracket characters, so there's no ambiguity
// what a bracket means in the logs.
outputsb->Append("[");
outputsb->Append((unsigned int)mychar);
outputsb->Append("]");
}
}
You are recieveing ascii value of the string .
See the Ascii chart
65 = A
80 = P
80 = P
76 = L
69 = E
Just write a function that converts the ascii value to string
Here is the code I came up with which resolved the issue:
static void logSend ()
{
StringBuilder^ outputsb = gcnew StringBuilder();
ASCIIEncoding^ ascii = gcnew ASCIIEncoding;
outputsb->Append("Sent ");
for (uint8_t i = 0; i < UART_G::message_length; i ++)
{
unsigned char mychar = UART_G::message[i];
if (
(mychar >= ' ' && mychar <= 'Z') || //Includes 0-9, A-Z.
(mychar >= '^' && mychar <= '~') || //Includes a-z.
(mychar >= 128 && mychar <= 254)) //I think these are okay.
{
outputsb->Append(ascii->GetString(UART_G::message, i, 1));
}
else
{
outputsb->Append("[");
outputsb->Append(mychar);
outputsb->Append("]");
}
}
log_line(outputsb->ToString());
}
I still appreciate any alternatives which are more efficient or simpler to read.

Vigenere.c CS50 Floating Point Exception (Core Dumped)

I am working on the Vigenere exercise from Harvard's CS50 (in case you noticed I'm using string and not str).
My program gives me a Floating Point Exception error when I use "a" in the keyword.
It actually gives me that error
when I use "a" by itself, and
when I use "a" within a bigger word it just gives me wrong output.
For any other kind of keyword, the program works perfectly fine.
I've run a million tests. Why is it doing this? I can't see where I'm dividing or % by 0. The length of the keyword is always at least 1. It is probably going to be some super simple mistake, but I've been at this for about 10 hours and I can barely remember my name.
#include <stdio.h>
#include <cs50.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
int main (int argc, string argv[])
{
//Error message if argc is not 2 and argv[1] is not alphabetical
if (argc != 2)
{
printf("Insert './vigenere' followed by an all alphabetical key\n");
return 1;
}
else if (argv[1])
{
for (int i = 0, n = strlen(argv[1]); i < n; i++)
{
if (isalpha((argv[1])[i]) == false)
{
printf("Insert './vigenere' followed by an all alphabetical key\n");
return 1;
}
}
//Store keyword in variable
string keyword = argv[1];
//Convert all capital chars in keyword to lowercase values, then converts them to alphabetical corresponding number
for (int i = 0, n = strlen(keyword); i < n; i++)
{
if (isupper(keyword[i])) {
keyword[i] += 32;
}
keyword[i] -= 97;
}
//Ask for users message
string message = GetString();
int counter = 0;
int keywordLength = strlen(keyword);
//Iterate through each of the message's chars
for (int i = 0, n = strlen(message); i < n; i++)
{
//Check if ith char is a letter
if (isalpha(message[i])) {
int index = counter % keywordLength;
if (isupper(message[i])) {
char letter = (((message[i] - 65) + (keyword[index])) % 26) + 65;
printf("%c", letter);
counter++;
} else if (islower(message[i])) {
char letter = (((message[i] - 97) + (keyword[index])) % 26) + 97;
printf("%c", letter);
counter++;
}
} else {
//Prints non alphabetic characters
printf("%c", message[i]);
}
}
printf("\n");
return 0;
}
}
This behavior is caused by the line keyword[i] -= 97;, there you make every 'a' in the key stream a zero. Later you use strlen() on the transformed key. So when the key starts with an 'a', keywordLength therefor is set to zero, and the modulo keywordLength operation get into a division by zero. You can fix this by calculating the keyword length before the key transformation.

Replacing and deleting a character from a string in c++?

This program is giving wrong output,, basically i want to remove the character specified and replace it by 'g'...For e.g: All that glitters is not gold if the user entered o then the output should be All that glitters is ngt ggld but the program is deleting all the characters from n onwards
#include <iostream>
using namespace std;
int main()
{
string input(" ALL GLItters are not gold");
char a;
cin>>a;
for(int i=0;i<input.size();i++)
{
if(input.at(i)==a)
{
input.erase(i,i+1);
input.insert(i,"g");
}
}
cout<<"\n";
cout<<input;
}
string& erase (size_t pos = 0, size_t len = npos);
The second parameter ( len ) is the Number of characters to erase.
You have to put 1 not i+1 :
input.erase(i,1);
http://www.cplusplus.com/reference/string/string/erase/
Why not replace it directly? Replace your for loop with this:
for (char& c : input)
{
if (c == a)
c = 'g';
}
Live example here.

Find Unique Characters in a File

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.
For instance, if my file were the following;
Entry
-----
Yabba
Dabba
Doo
Then the result would be
Unique characters: {abdoy}
Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.
Update
I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.
Update 2
By Fast, I mean fast to implement...not necessarily fast to run.
BASH shell script version (no sed/awk):
while read -n 1 char; do echo "$char"; done < entry.txt | tr [A-Z] [a-z] | sort -u
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
#include <iostream>
#include <set>
int main() {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
char ch;
/* ignore whitespace and case */
while ( std::cin.get(ch) ) {
if (! isspace(ch) ) {
seen_chars.insert(tolower(ch));
}
}
for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
std::cout << *iter << std::endl;
}
return 0;
}
Note that I'm ignoring whitespace and it's case insensitive as requested.
For a 450,000+ entry file (chars.txt), here's a sample run time:
[user#host]$ g++ -o unique_chars unique_chars.cpp
[user#host]$ time ./unique_chars < chars.txt
a
b
d
o
y
real 0m0.638s
user 0m0.612s
sys 0m0.017s
As requested, a pure shell-script "solution":
sed -e "s/./\0\n/g" inputfile | sort -u
It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.
For even more ridiculousness, I present the version that dumps the output on one line:
sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or generally, dictionary) implementation and just omit the value field. Use your characters as keys. These data structures generally filter out duplicate entries (hence the name set, from its mathematical usage: sets don't have a particular order and only unique values).
Quick and dirty C program that's blazingly fast:
#include <stdio.h>
int main(void)
{
int chars[256] = {0}, c;
while((c = getchar()) != EOF)
chars[c] = 1;
for(c = 32; c < 127; c++) // printable chars only
{
if(chars[c])
putchar(c);
}
putchar('\n');
return 0;
}
Compile it, then do
cat file | ./a.out
To get a list of the unique printable characters in file.
Python w/sets (quick and dirty)
s = open("data.txt", "r").read()
print "Unique Characters: {%s}" % ''.join(set(s))
Python w/sets (with nicer output)
import re
text = open("data.txt", "r").read().lower()
unique = re.sub('\W, '', ''.join(set(text))) # Ignore non-alphanumeric
print "Unique Characters: {%s}" % unique
Here's a PowerShell example:
gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique
which produces:
D
Y
a
b
o
I like that it's easy to read.
EDIT: Here's a faster version:
$letters = #{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.
Why the arbitrary limitation that you need a "script" that does it?
What exactly is a script anyway?
Would Python do?
If so, then this is one solution:
import sys;
s = set([]);
while True:
line = sys.stdin.readline();
if not line:
break;
line = line.rstrip();
for c in line.lower():
s.add(c);
print("".join(sorted(s)));
Algorithm:
Slurp the file into memory.
Create an array of unsigned ints, initialized to zero.
Iterate though the in memory file, using each byte as a subscript into the array.
increment that array element.
Discard the in memory file
Iterate the array of unsigned int
if the count is not zero,
display the character, and its corresponding count.
cat yourfile |
perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
Alternative solution using bash:
sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$
EDIT Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick but then, this solution has no real advantage to saua's (especially since he now uses the same sed pattern instead of explicit captures).
While not an script this java program will do the work. It's easy to understand an fast ( to run )
import java.util.*;
import java.io.*;
public class Unique {
public static void main( String [] args ) throws IOException {
int c = 0;
Set s = new TreeSet();
while( ( c = System.in.read() ) > 0 ) {
s.add( Character.toLowerCase((char)c));
}
System.out.println( "Unique characters:" + s );
}
}
You'll invoke it like this:
type yourFile | java Unique
or
cat yourFile | java Unique
For instance, the unique characters in the HTML of this question are:
Unique characters:[ , , , , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, #, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
Print unique characters (ASCII and Unicode UTF-8)
import codecs
file = codecs.open('my_file_name', encoding='utf-8')
# Runtime: O(1)
letters = set()
# Runtime: O(n^2)
for line in file:
for character in line:
letters.add(character)
# Runtime: O(n)
letter_str = ''.join(letters)
print(letter_str)
Save as unique.py, and run as python unique.py.
in c++ i would first loop through the letters in the alphabet then run a strchr() on each with the file as a string. this will tell you if that letter exists, then just add it to the list.
Try this file with JSDB Javascript (includes the javascript engine in the Firefox browser):
var seenAlreadyMap={};
var seenAlreadyArray=[];
while (!system.stdin.eof)
{
var L = system.stdin.readLine();
for (var i = L.length; i-- > 0; )
{
var c = L[i].toLowerCase();
if (!(c in seenAlreadyMap))
{
seenAlreadyMap[c] = true;
seenAlreadyArray.push(c);
}
}
}
system.stdout.writeln(seenAlreadyArray.sort().join(''));
Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted a set is probably more efficient than a dictionary. However both are supposed to take constant time to access items. And both run circles around a list where for each character you search the list to see if the character is already in the list or not. Also Lists and Dictionaries are built in Python datatatypes that everyone should be using all the time. So even if set doesn't come to mind, dictionary should.
file = open('location.txt', 'r')
letters = {}
for line in file:
if line == "":
break
for character in line.strip():
if character not in letters:
letters[character] = True
file.close()
print "Unique Characters: {" + "".join(letters.keys()) + "}"
A C solution. Admittedly it is not the fastest to code solution in the world. But since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions so I wanted to post one for the pure sadistic pleasure :)
#include<stdio.h>
#define CHARSINSET 256
#define FILENAME "location.txt"
char buf[CHARSINSET + 1];
char *getUniqueCharacters(int *charactersInFile) {
int x;
char *bufptr = buf;
for (x = 0; x< CHARSINSET;x++) {
if (charactersInFile[x] > 0)
*bufptr++ = (char)x;
}
bufptr = '\0';
return buf;
}
int main() {
FILE *fp;
char c;
int *charactersInFile = calloc(sizeof(int), CHARSINSET);
if (NULL == (fp = fopen(FILENAME, "rt"))) {
printf ("File not found.\n");
return 1;
}
while(1) {
c = getc(fp);
if (c == EOF) {
break;
}
if (c != '\n' && c != '\r')
charactersInFile[c]++;
}
fclose(fp);
printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
return 0;
}
Quick and dirty solution using grep (assuming the file name is "file"):
for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
if [ ! -z "`grep -li $char file`" ]; then
echo -n $char;
fi;
done;
echo
I could have made it a one-liner but just want to make it easier to read.
(EDIT: forgot the -i switch to grep)
Well my friend, I think this is what you had in mind....At least this is the python version!!!
f = open("location.txt", "r") # open file
ll = sorted(list(f.read().lower())) #Read file into memory, split into individual characters, sort list
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
f.close()
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return
It does not iterate through each character, it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM) but for shorter files it is fun :)
I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, I also reduced the active code from 3 lines to 2. Basically if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports.....Anyway now we are having fun :)
import itertools, sys
# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x:x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return
This answer above mentioned using a dictionary.
If so, the code presented there can be streamlined a bit, since the Python documentation states:
It is best to think of a dictionary as
an unordered set of key: value pairs,
with the requirement that the keys are
unique (within one dictionary).... If
you store using a key that is already
in use, the old value associated with
that key is forgotten.
Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:
if character not in letters:
And that should make it a little faster.
Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code
using System;
using System.IO;
using System.Collections;
using System.Diagnostics;
namespace ConsoleApplication {
class Program {
static void Main(string[] args) {
FileInfo fileInfo = new FileInfo(#"C:/data.txt");
Console.WriteLine(fileInfo.Length);
Stopwatch sw = new Stopwatch();
sw.Start();
Hashtable table = new Hashtable();
StreamReader sr = new StreamReader(#"C:/data.txt");
while (!sr.EndOfStream) {
char c = Char.ToLower((char)sr.Read());
if (!table.Contains(c)) {
table.Add(c, null);
}
}
sr.Close();
foreach (char c in table.Keys) {
Console.Write(c);
}
Console.WriteLine();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
}
}
produces output
4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .
The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.
s=open("text.txt","r").read()
l= len(s)
unique ={}
for i in range(l):
if unique.has_key(s[i]):
unique[s[i]]=unique[s[i]]+1
else:
unique[s[i]]=1
print unique
Python without using a set.
file = open('location', 'r')
letters = []
for line in file:
for character in line:
if character not in letters:
letters.append(character)
print(letters)
Old question, I know, but here's a fast solution, meaning it runs fast, and it's probably also pretty fast to code if you know how to copy/paste ;)
BACKGROUND
I had a huge csv file (12 GB, 1.34 million lines, 12.72 billion characters) that I was loading into postgres that was failing because it had some "bad" characters in it, so naturally I was trying to find a character not in that file that I could use as a quote character.
1. First try: Jay's C++ solution
I started with #jay's C++ answer:
(Note: all of these code examples were compiled with g++ -O2 uniqchars.cpp -o uniqchars)
#include <iostream>
#include <set>
int main() {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
char ch;
/* ignore whitespace and case */
while ( std::cin.get(ch) ) {
if (! isspace(ch) ) {
seen_chars.insert(tolower(ch));
}
}
for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
std::cout << *iter << std::endl;
}
return 0;
}
Timing for this one:
real 10m55.026s
user 10m51.691s
sys 0m3.329s
2. Read entire file at once
I figured it'd be more efficient to read in the entire file into memory at once, rather than all those calls to cin.get(). This reduced the run time by more than half.
(I also added a filename as a command line argument, and made it print out the characters separated by spaces instead of newlines).
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
int main(int argc, char **argv) {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
std::ifstream ifs(argv[1]);
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
fprintf(stderr, "Size of file: %lu\n", size);
std::string str(size, ' ');
ifs.seekg(0);
ifs.read(&str[0], size);
/* ignore whitespace and case */
for (char& ch : str) {
if (!isspace(ch)) {
seen_chars.insert(tolower(ch));
}
}
for(iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
std::cout << *iter << " ";
}
std::cout << std::endl;
return 0;
}
Timing for this one:
real 4m41.910s
user 3m32.014s
sys 0m17.858s
3. Remove isspace() check and tolower()
Besides the set insert, isspace() and tolower() are the only things happening in the for loop, so I figured I'd remove them. It shaved off another 1.5 minutes.
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
int main(int argc, char **argv) {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
std::ifstream ifs(argv[1]);
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
fprintf(stderr, "Size of file: %lu\n", size);
std::string str(size, ' ');
ifs.seekg(0);
ifs.read(&str[0], size);
for (char& ch : str) {
// removed isspace() and tolower()
seen_chars.insert(ch);
}
for(iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
std::cout << *iter << " ";
}
std::cout << std::endl;
return 0;
}
Timing for final version:
real 3m12.397s
user 2m58.771s
sys 0m13.624s
The simple solution from #Triptych helped me already (my input was a file of 124 MB in size, so this approach to read the entire contents into memory still worked).
However, I had a problem with encoding, python didn't interpret the UTF8 encoded input correctly. So here's a slightly modified version which works for UTF8 encoded files (and also sorts the collected characters in the output):
import io
with io.open("my-file.csv",'r',encoding='utf8') as f:
text = f.read()
print "Unique Characters: {%s}" % ''.join(sorted(set(text)))

Resources