The program can't find my file - visual-c++

I am writing a program that requires me to prepare two separate input files, named group1.txt and group2.txt, and then finds the average score for each group.
I wrote this, but it can't find my files for reasons I do not know.
Some help, maybe? Thanks!
Here is the code:
#include <iostream>
#include <fstream>
#include <string>
#include <stdlib.h>
using namespace std;
int main()
{
string line;
const int num_lines = 10; //change numline to any number you like, its to set the size of the array
string sub_code[num_lines];
float avrgs[num_lines];
string sub_code2[num_lines];
float avrgs2[num_lines];
ifstream myfile ("group1.txt");
int sum=0,score=0,j=1,i=0,i2=0;
double ave=0.0;
if (myfile.is_open())
{
while ( myfile>>line )
{
//cout<<line<<endl;
sub_code[i] = line;
while ( myfile>>score && score <100 && score >= 0)
{
sum += score;
j++;
}
ave = sum/j;
j=1;
sum = 0;
avrgs[i]=ave;
i++;
}
}
else
{cout << "Unable to open file"<<endl;}
myfile.close();
//this is for the secode line
ifstream myfile2 ("group2.txt");
if (myfile2.is_open())
{
while ( myfile2>>line )
{
//cout<<line<<endl;
sub_code2[i2] = line; //add all the subject code into the array to store sub codes
while ( myfile2>>score && score <100 && score >= 0) //well basically the score should be between 0 - 100,
{ //so -999 wont be read. Can change to while ( myfile2>>score && score!=-999)
sum += score; //read each grade and add it to sum
j++; //just to know how many grades are there so that division can be done
}
ave = sum/j; //find the average.
j=1; //set j back to 1, cause j is used to count the number of marks.
sum = 0; //since its sum+=score, we need to set sum back to 0, or else it will be adding on to the old marks
avrgs2[i2]=ave; //add that calculated ave into the array for average
i2++; //i2 is to basically know how many entries are in the file for grp2
}
}
else
{cout << "Unable to open file"<<endl;}
myfile2.close();
int gr1,gr2;
//outputing the averages and so on...
for (gr1=0;gr1<i;gr1++)
{
for(gr2=0;gr2<i2;gr2++)
{
if(sub_code[gr1]==sub_code2[gr2]) // compare subject id before displaying
{
cout<<sub_code[gr1]<<"\t"<<" 1 "<<avrgs[gr1]<<endl;
cout<<sub_code2[gr2]<<"\t"<<" 2 "<<avrgs2[gr2]<<endl;
break;
}
}
}
//
cout<<endl;
double grpave1=0,grpave2=0;
//to find the average of each group
for (gr1=0;gr1<i;gr1++)
{
grpave1+=avrgs[gr1];
}
for(gr2=0;gr2<i2;gr2++)
{
grpave2+=avrgs2[gr2];
}
grpave1=grpave1/i;
grpave2=grpave2/i2;
cout<<"Average for group 1:"<<grpave1<<endl;
cout<<"Average for group 2:"<<grpave2<<endl;
system("pause");
return 0;
}
I just need to know how to get the program to find my files! I have put them on the Desktop, in My Documents, in Projects, and alongside the C++ source file, and I have no idea why none of that works! What's more, I need to hand in a soft copy, and I don't know whether the program will be able to find my files there!

Because you specify the file name as "group1.txt", which is a relative name, the program will only find the file if it is in the current working directory of the running program.
There are two ways to solve this problem:
Copy the file to the working directory of the program. In many cases that is the directory the executable resides in.
Use an absolute file name, such as "c:\\Users\\youruser\\Desktop\\group1.txt" for a file stored on the desktop on a Win7 system (note the doubled backslashes, which a C++ string literal requires).
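To see where a relative name such as "group1.txt" actually points, it can help to print the working directory and the resolved path. Below is a minimal sketch in Python, used here only for brevity; the helper name resolve is made up for this illustration, and in C++17 the equivalent check would go through std::filesystem::current_path().

```python
import os

def resolve(filename, cwd=None):
    """Return the absolute path that a relative file name resolves to.

    cwd is a hypothetical override for illustration; by default the
    current working directory of the running process is used.
    """
    base = os.getcwd() if cwd is None else cwd
    return os.path.normpath(os.path.join(base, filename))

if __name__ == "__main__":
    # The input file must exist at the second path for the open to succeed.
    print("Working directory:", os.getcwd())
    print("group1.txt resolves to:", resolve("group1.txt"))
```

If the printed path is not where group1.txt was saved, either move the file there or pass an absolute path instead.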

Related

pset2 readability always printing before grade 1 no matter what input

I know this is a fairly newbie question so I'm sorry if the solution is painfully obvious to you guys.
I've fully coded up pset 2 readability and it worked for printing out the number of letters, words and sentences for the user-inputted text. I have since removed those print statements as they aren't needed for the pset (I just wanted to actually make sure the functions were returning something- they worked just fine).
I'm up to printing out the grade level now, but no matter what text I input I only get "Before grade 1". I've already checked to see if I had anything wrong with my print statements and I can't seem to find an issue there, so I'm thinking that there may be an error in the calculation of the grade level itself. I've looked until my eyes have gone square and for the life of me I cannot see anything wrong.
If someone could shed some light on my problem I would love to be saved the headache :), or even point me in the right direction so I get the learning. (Also, first-time poster, long-time lurker, so forgive me if anything is formatted incorrectly.)
Thank you all!
Here is my code:
#include <cs50.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <math.h>
// my functions to calculate letters, words
int count_letters(string text);
int count_words(string text);
int count_sentences(string text);
int main(void)
{
string text = get_string("Text: ");
int letters = count_letters(text);
int words = count_words(text);
int sentences = count_sentences(text);
float calculation = (0.0588 * letters / words * 100) - (0.296 * sentences / words *
100) - 15.8;
int index = round(calculation);
if (index < 1)
{
printf("Before grade 1.\n");
}
else if (index >= 16)
{
printf("Grade: 16+.\n");
}
else
{
printf("Grade: %i.\n", index);
}
}
int count_letters(string text)
{
int letters = 0;
for (int i = 0; i < strlen(text); i++)
//((text[i] > 65 && text[i] < 90) || (text[i] > 97 && text[i] < 122))
if (isalpha(text[i]))
{
letters++;
}
return letters;
}
int count_words(string text)
{
int words = 0;
for (int i = 0; i < strlen(text); i++)
if isspace ((text[i]))
{
words++;
words = words + 1;
}
return words;
}
int count_sentences(string text)
{
int sentences = 0;
for (int i = 0; i < strlen(text); i++)
if (text[i] == '.' || text[i] == '!' || text[i] == '?')
{
sentences++;
}
return sentences;
}
"I just wanted to actually make sure the functions were returning something- they worked just fine"
Yes, the functions return "something" but is it the right thing? Suggest you add back the debug printf and look at the results carefully and critically. Start with the simplest text ("One fish. Two fish. Red fish. Blue fish."). 29 letters, 8 words, 4 sentences. What result is printed?
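Inside the function count_words you have an if statement. The condition of an if statement needs to be wrapped in parentheses, so if isspace ((text[i])) should read if (isspace(text[i])). Even with the parentheses fixed, the body increments words twice for every space (words++ followed by words = words + 1), so double-check the logic as well.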
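For reference, the suggested test sentence can be checked by hand. The sketch below is in Python rather than C, and the helper names are made up for this illustration; it counts words as whitespace-separated tokens, which is an assumption about what the pset intends, and plugs the counts into the same formula as the question's code.

```python
def coleman_liau(letters, words, sentences):
    """Same formula as in the question: 0.0588 * L - 0.296 * S - 15.8."""
    L = letters / words * 100     # average letters per 100 words
    S = sentences / words * 100   # average sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

text = "One fish. Two fish. Red fish. Blue fish."
letters = sum(ch.isalpha() for ch in text)     # 29
words = len(text.split())                      # 8 (number of spaces + 1)
sentences = sum(ch in ".!?" for ch in text)    # 4
index = round(coleman_liau(letters, words, sentences))
print(letters, words, sentences, index)        # 29 8 4 -9
```

An index of -9 is below 1, so "Before grade 1" is the right answer for this particular text; a longer passage is needed to tell whether the word count is still off.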

check50 error messages: On my Mario.c pset1 solution

:( handles a height of 0 correctly \ expected an exit code of 0, not output of "\n Please enter a positive integer valu..."
:( rejects a non-numeric height of "" \ expected output, not a prompt for input
https://sandbox.cs50.net/checks/5593ad8059ce4492804c07aff8e377eb
I think I should put part of my code too:
#include <stdio.h>
int clean_stdin()
{
while (getchar()!='\n');
return 1;
} //snippet gotten from http://stackoverflow.com/questions/14104013/prevent-users-from-entering-wrong-data-types
int main (void)
{
int row, pyramid_height, space, hash;
char c;
do
{
printf("\n Please enter a positive integer value, less than 24 as the height: ");
}
while (((scanf("%i%c", &pyramid_height, &c) != 2 || c!='\n') && clean_stdin()) || pyramid_height < 1 || pyramid_height > 23);
//snippet gotten from http://stackoverflow.com/questions/14104013/prevent-users-from-entering-wrong-data-types
Please help:
Also, is there an easier way to prevent users from entering wrong data?
Thank you.
You should do something like this to ask the user for input:
int height;
printf("Enter a non-negative height no greater than 23\n");
do {
    printf("Height: ");
    height = GetInt(); // ask user again until valid input is given
} while (height < 0 || height > 23);
If you don't want to use GetInt() you can do this with scanf too. Just replace the GetInt line with scanf("%d", &height);. It'll work the same, except that when you enter a wrong number it'll yell at you by saying Height: rather than Retry:. Note that, unlike GetInt(), scanf does not discard non-numeric input, so the loop is only safe if the user actually types a number.
And you should remove that clean_stdin function. That level of precision is not required in pset1.
Now the remaining part is the nested for loops, which you've not provided in the question, so I am assuming that you have a problem there too, since your program can't handle 0 properly.
Try something like this in place of the for loops.
for(int i=1; i<=height; i++){ // i number of #s in each step
for(int j=0; j<height-i; j++){ //print appropriate number of spaces
printf(" ");
}
for(int k=0; k<=i; k++){ //print #s
printf("#");
}
printf("\n"); //change line
}
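The same row layout can be checked quickly outside of C. This is a minimal Python sketch that mirrors the loops above (row i has height - i spaces followed by i + 1 hashes); the function name pyramid_rows is made up for this illustration.

```python
def pyramid_rows(height):
    """Build the rows produced by the nested loops above."""
    rows = []
    for i in range(1, height + 1):
        spaces = " " * (height - i)   # inner loop over j
        hashes = "#" * (i + 1)        # inner loop over k (k <= i)
        rows.append(spaces + hashes)
    return rows

for row in pyramid_rows(3):
    print(row)
```

With a height of 0 the outer loop never runs, so nothing is printed, which is the behaviour the first check50 message asks for.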

Counter for two binary strings C++

I am trying to count two binary numbers given as strings. The numbers can have up to 253 digits. Short numbers work, but when I enter longer numbers, the output is wrong. An example of input that gives a bad result is "10100101010000111111" with "000011010110000101100010010011101010001101011100000000111000000000001000100101101111101000111001000101011010010111000110".
#include <iostream>
#include <stdlib.h>
using namespace std;
bool isBinary(string b1,string b2);
int main()
{
string b1,b2;
long binary1,binary2;
int i = 0, remainder = 0, sum[254];
cout<<"Get two binary numbers:"<<endl;
cin>>b1>>b2;
binary1=atol(b1.c_str());
binary2=atol(b2.c_str());
if(isBinary(b1,b2)==true){
while (binary1 != 0 || binary2 != 0){
sum[i++] =(binary1 % 10 + binary2 % 10 + remainder) % 2;
remainder =(binary1 % 10 + binary2 % 10 + remainder) / 2;
binary1 = binary1 / 10;
binary2 = binary2 / 10;
}
if (remainder != 0){
sum[i++] = remainder;
}
--i;
cout<<"Result: ";
while (i >= 0){
cout<<sum[i--];
}
cout<<endl;
}else cout<<"Wrong input"<<endl;
return 0;
}
bool isBinary(string b1,string b2){
bool rozhodnuti1,rozhodnuti2;
for (int i = 0; i < b1.length();i++) {
if (b1[i]!='0' && b1[i]!='1') {
rozhodnuti1=false;
break;
}else rozhodnuti1=true;
}
for (int k = 0; k < b2.length();k++) {
if (b2[k]!='0' && b2[k]!='1') {
rozhodnuti2=false;
break;
}else rozhodnuti2=true;
}
if(rozhodnuti1==false || rozhodnuti2==false){ return false;}
else{ return true;}
}
One of the problems might be here: sum[i++]
This expression, as it is, first returns the value of i and then increases it by one.
Did you do it on purpose?
If not, change it to ++i.
It'd help if you could also post the "bad" output, so that we can try to move backward through the code starting from it.
EDIT 2015-11-07_17:10
Just to be sure everything was correct, I've added a cout to check what binary1 and binary2 contain after you assign them the result of the atol function: they contain the integer numbers 547284487 and 18333230, which obviously don't represent the correct binary-to-integer conversion of the two 0/1 strings you presented in your post.
Probably the strings exceed what atol can store in a long.
Also, your "math" operations lead to an even stranger result, 6011111101, which obviously doesn't make any sense.
What do you mean, exactly, when you say you want to count these two numbers? Maybe you want to make a sum? I guess that's it.
But then, again, what you got there is two signed integer numbers and not two binaries, which means those %10 and %2 operations are (probably) misused.
EDIT 2015-11-07_17:20
I've tried to use your program with small binary strings and it actually works; with small binary strings.
It's a fact(?), at this point, that atol can't handle numerical strings that long.
My suggestion: use char arrays instead of strings and replace the '0' and '1' characters with numerical values (if (bin1[i] == '1') { bin1[i] = 1; } else { bin1[i] = 0; }), with which you'll be able to perform all the math operations you want (you've already written a working sum function, after all).
Once done with the math, you can just convert the char array back to actual characters for 0 and 1 and cout it on the screen.
EDIT 2015-11-07_17:30
Tested atol on my own: it correctly converts only strings that are up to 10 characters long, which is consistent with a 32-bit long (maximum value 2147483647, a 10-digit number).
Anything beyond the 10th character makes the function go crazy.
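The digit-by-digit idea can be sketched without ever converting the whole string to a number. The version below is in Python for brevity and is only an illustration (the function name add_binary is made up); the same loop ports directly to C++ by walking both std::string objects from the end and subtracting '0' from each character.

```python
def add_binary(a, b):
    """Add two binary numbers given as strings of '0' and '1'."""
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    # Walk both strings from the least significant digit.
    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += int(a[i])
            i -= 1
        if j >= 0:
            total += int(b[j])
            j -= 1
        result.append(str(total % 2))   # digit for this position
        carry = total // 2              # carry into the next position
    return "".join(reversed(result)) or "0"

print(add_binary("101", "11"))   # 1000
```

Because each step only ever handles single digits and a carry, the length of the input is limited by memory rather than by the size of long.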

Vigenere.c CS50 Floating Point Exception (Core Dumped)

I am working on the Vigenere exercise from Harvard's CS50 (in case you noticed I'm using string and not str).
My program gives me a Floating Point Exception error when I use "a" in the keyword.
It actually gives me that error
when I use "a" by itself, and
when I use "a" within a bigger word it just gives me wrong output.
For any other kind of keyword, the program works perfectly fine.
I've run a million tests. Why is it doing this? I can't see where I'm dividing or % by 0. The length of the keyword is always at least 1. It is probably going to be some super simple mistake, but I've been at this for about 10 hours and I can barely remember my name.
#include <stdio.h>
#include <cs50.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
int main (int argc, string argv[])
{
//Error message if argc is not 2 and argv[1] is not alphabetical
if (argc != 2)
{
printf("Insert './vigenere' followed by an all alphabetical key\n");
return 1;
}
else if (argv[1])
{
for (int i = 0, n = strlen(argv[1]); i < n; i++)
{
if (isalpha((argv[1])[i]) == false)
{
printf("Insert './vigenere' followed by an all alphabetical key\n");
return 1;
}
}
//Store keyword in variable
string keyword = argv[1];
//Convert all capital chars in keyword to lowercase values, then converts them to alphabetical corresponding number
for (int i = 0, n = strlen(keyword); i < n; i++)
{
if (isupper(keyword[i])) {
keyword[i] += 32;
}
keyword[i] -= 97;
}
//Ask for users message
string message = GetString();
int counter = 0;
int keywordLength = strlen(keyword);
//Iterate through each of the message's chars
for (int i = 0, n = strlen(message); i < n; i++)
{
//Check if ith char is a letter
if (isalpha(message[i])) {
int index = counter % keywordLength;
if (isupper(message[i])) {
char letter = (((message[i] - 65) + (keyword[index])) % 26) + 65;
printf("%c", letter);
counter++;
} else if (islower(message[i])) {
char letter = (((message[i] - 97) + (keyword[index])) % 26) + 97;
printf("%c", letter);
counter++;
}
} else {
//Prints non alphabetic characters
printf("%c", message[i]);
}
}
printf("\n");
return 0;
}
}
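This behavior is caused by the line keyword[i] -= 97;, which turns every 'a' in the key into a zero byte, and a zero byte is what terminates a C string. Later you call strlen() on the transformed key, and strlen() stops counting at the first zero byte. So when the key starts with an 'a', keywordLength is set to zero, and the counter % keywordLength operation becomes a division by zero, which is reported as a floating point exception. When the 'a' appears later in the key, the key is silently truncated at that position, which explains the wrong output. You can fix this by calculating the keyword length before the key transformation.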
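The effect is easy to reproduce outside of C. The Python sketch below is only an illustration with made-up helper names: c_strlen imitates how strlen() counts, shift_key imitates the keyword[i] -= 97 step, and encrypt takes the key length from the shift list instead, which corresponds to computing the length before the transformation. It assumes an ASCII, all-alphabetic key.

```python
def c_strlen(buf):
    """Length as C's strlen() would see it: bytes before the first 0."""
    return buf.index(0) if 0 in buf else len(buf)

def shift_key(key):
    """Imitate keyword[i] -= 97 after lowercasing: 'a' becomes byte 0."""
    return bytes(ord(ch) - ord('a') for ch in key.lower())

def encrypt(message, key):
    shifts = [ord(ch) - ord('a') for ch in key.lower()]
    n = len(shifts)                      # length of the original key
    out, counter = [], 0
    for ch in message:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shifts[counter % n]) % 26 + base))
            counter += 1
        else:
            out.append(ch)               # non-letters pass through
    return "".join(out)

print(c_strlen(shift_key("a")))     # 0, hence the division by zero
print(c_strlen(shift_key("bad")))   # 1, the key is cut off at the 'a'
print(encrypt("Hello", "a"))        # Hello
```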

Find Unique Characters in a File

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.
For instance, if my file were the following;
Entry
-----
Yabba
Dabba
Doo
Then the result would be
Unique characters: {abdoy}
Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.
Update
I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.
Update 2
By Fast, I mean fast to implement...not necessarily fast to run.
BASH shell script version (no sed/awk):
while read -n 1 char; do echo "$char"; done < entry.txt | tr 'A-Z' 'a-z' | sort -u
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
#include <iostream>
#include <set>
#include <cctype>
int main() {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
char ch;
/* ignore whitespace and case */
while ( std::cin.get(ch) ) {
if (! isspace(ch) ) {
seen_chars.insert(tolower(ch));
}
}
for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
std::cout << *iter << std::endl;
}
return 0;
}
Note that I'm ignoring whitespace and it's case insensitive as requested.
For a 450,000+ entry file (chars.txt), here's a sample run time:
[user@host]$ g++ -o unique_chars unique_chars.cpp
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y
real 0m0.638s
user 0m0.612s
sys 0m0.017s
As requested, a pure shell-script "solution":
sed -e "s/./\0\n/g" inputfile | sort -u
It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.
For even more ridiculousness, I present the version that dumps the output on one line:
sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or generally, dictionary) implementation and just omit the value field. Use your characters as keys. These data structures generally filter out duplicate entries (hence the name set, from its mathematical usage: sets don't have a particular order and only unique values).
Quick and dirty C program that's blazingly fast:
#include <stdio.h>
int main(void)
{
int chars[256] = {0}, c;
while((c = getchar()) != EOF)
chars[c] = 1;
for(c = 32; c < 127; c++) // printable chars only
{
if(chars[c])
putchar(c);
}
putchar('\n');
return 0;
}
Compile it, then do
cat file | ./a.out
To get a list of the unique printable characters in file.
Python w/sets (quick and dirty)
s = open("data.txt", "r").read()
print("Unique Characters: {%s}" % ''.join(set(s)))
Python w/sets (with nicer output)
import re
text = open("data.txt", "r").read().lower()
unique = re.sub(r'\W', '', ''.join(set(text)))  # Ignore non-alphanumeric
print("Unique Characters: {%s}" % unique)
Here's a PowerShell example:
gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique
which produces:
D
Y
a
b
o
I like that it's easy to read.
EDIT: Here's a faster version:
$letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.
Why the arbitrary limitation that you need a "script" that does it?
What exactly is a script anyway?
Would Python do?
If so, then this is one solution:
import sys
s = set()
while True:
    line = sys.stdin.readline()
    if not line:
        break
    line = line.rstrip()
    for c in line.lower():
        s.add(c)
print("".join(sorted(s)))
Algorithm:
Slurp the file into memory.
Create an array of unsigned ints, initialized to zero.
Iterate through the in-memory file, using each byte as a subscript into the array.
Increment that array element.
Discard the in memory file
Iterate the array of unsigned int
if the count is not zero,
display the character, and its corresponding count.
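The steps above can be sketched as follows. This is a minimal Python illustration, not a tuned implementation; the function name byte_counts is made up, and each byte value is shown as the character with the same code point, which is only meaningful for ASCII or Latin-1 text.

```python
def byte_counts(data):
    """Count how often each byte value occurs in a bytes object."""
    counts = [0] * 256            # one counter per possible byte value
    for b in data:                # iterating over bytes yields ints 0..255
        counts[b] += 1
    # Keep only the byte values that actually occurred.
    return {chr(i): n for i, n in enumerate(counts) if n}

print(byte_counts(b"Yabba"))
```

For a real file, the data would come from reading it in binary mode, for example open("data.txt", "rb").read(), where data.txt is a placeholder name.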
cat yourfile |
perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
Alternative solution using bash:
sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$
EDIT Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick but then, this solution has no real advantage to saua's (especially since he now uses the same sed pattern instead of explicit captures).
While not a script, this Java program will do the work. It's easy to understand and fast (to run).
import java.util.*;
import java.io.*;
public class Unique {
public static void main( String [] args ) throws IOException {
int c = 0;
Set<Character> s = new TreeSet<>();
while( ( c = System.in.read() ) != -1 ) {
s.add( Character.toLowerCase((char)c));
}
System.out.println( "Unique characters:" + s );
}
}
You'll invoke it like this:
type yourFile | java Unique
or
cat yourFile | java Unique
For instance, the unique characters in the HTML of this question are:
Unique characters:[ , , , , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, @, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
Print unique characters (ASCII and Unicode UTF-8)
import codecs
file = codecs.open('my_file_name', encoding='utf-8')
# Runtime: O(1)
letters = set()
# Runtime: O(n), where n is the total number of characters in the file
for line in file:
    for character in line:
        letters.add(character)
# Runtime: O(k), where k is the number of unique characters
letter_str = ''.join(letters)
print(letter_str)
Save as unique.py, and run as python unique.py.
In C++ I would first loop through the letters of the alphabet, then run strchr() on each with the file contents as a string. This will tell you whether that letter exists; if it does, just add it to the list.
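A minimal sketch of that idea, written in Python for brevity, where the in operator plays the role of strchr(). The function name letters_present is made up for this illustration, and only the 26 ASCII letters are checked.

```python
import string

def letters_present(text):
    """Return the ASCII letters that occur in text, ignoring case."""
    lowered = text.lower()
    # One membership test per letter of the alphabet, like one strchr() call each.
    return [ch for ch in string.ascii_lowercase if ch in lowered]

print("".join(letters_present("Yabba\nDabba\nDoo\n")))   # abdoy
```

This scans the text up to 26 times, so it trades speed for simplicity, and it cannot report characters outside the alphabet that is looped over.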
Try this file with JSDB Javascript (includes the javascript engine in the Firefox browser):
var seenAlreadyMap={};
var seenAlreadyArray=[];
while (!system.stdin.eof)
{
var L = system.stdin.readLine();
for (var i = L.length; i-- > 0; )
{
var c = L[i].toLowerCase();
if (!(c in seenAlreadyMap))
{
seenAlreadyMap[c] = true;
seenAlreadyArray.push(c);
}
}
}
system.stdout.writeln(seenAlreadyArray.sort().join(''));
Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted, a set is probably more efficient than a dictionary. However, both are supposed to take constant time to access items. And both run circles around a list, where for each character you search the list to see if the character is already in it. Also, lists and dictionaries are built-in Python datatypes that everyone should be using all the time. So even if a set doesn't come to mind, a dictionary should.
file = open('location.txt', 'r')
letters = {}
for line in file:
    if line == "":
        break
    for character in line.strip():
        if character not in letters:
            letters[character] = True
file.close()
print("Unique Characters: {" + "".join(letters.keys()) + "}")
A C solution. Admittedly it is not the fastest to code solution in the world. But since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions so I wanted to post one for the pure sadistic pleasure :)
#include <stdio.h>
#include <stdlib.h>
#define CHARSINSET 256
#define FILENAME "location.txt"
char buf[CHARSINSET + 1];
char *getUniqueCharacters(int *charactersInFile) {
int x;
char *bufptr = buf;
for (x = 0; x< CHARSINSET;x++) {
if (charactersInFile[x] > 0)
*bufptr++ = (char)x;
}
*bufptr = '\0';
return buf;
}
int main() {
FILE *fp;
int c; /* int rather than char, so that EOF can be told apart from valid characters */
int *charactersInFile = calloc(sizeof(int), CHARSINSET);
if (NULL == (fp = fopen(FILENAME, "rt"))) {
printf ("File not found.\n");
return 1;
}
while(1) {
c = getc(fp);
if (c == EOF) {
break;
}
if (c != '\n' && c != '\r')
charactersInFile[c]++;
}
fclose(fp);
printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
return 0;
}
Quick and dirty solution using grep (assuming the file name is "file"):
for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
if [ ! -z "`grep -li $char file`" ]; then
echo -n $char;
fi;
done;
echo
I could have made it a one-liner but just want to make it easier to read.
(EDIT: forgot the -i switch to grep)
Well my friend, I think this is what you had in mind....At least this is the python version!!!
f = open("location.txt", "r") # open file
ll = sorted(list(f.read().lower())) #Read file into memory, split into individual characters, sort list
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
f.close()
print("Unique Characters: {%s}" % "".join(ll)) #print list of characters, carriage return will throw in a return
It does not iterate through each character, it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM) but for shorter files it is fun :)
I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, I also reduced the active code from 3 lines to 2. Basically if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports.....Anyway now we are having fun :)
import itertools, sys
# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x:x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print("Unique Characters: {%s}" % "".join(ll)) #print list of characters, carriage return will throw in a return
An answer above mentioned using a dictionary.
If so, the code presented there can be streamlined a bit, since the Python documentation states:
It is best to think of a dictionary as
an unordered set of key: value pairs,
with the requirement that the keys are
unique (within one dictionary).... If
you store using a key that is already
in use, the old value associated with
that key is forgotten.
Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:
if character not in letters:
And that should make it a little faster.
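With the membership test removed, the loop might look like the sketch below. It takes any iterable of lines instead of an open file so that it can be tried without a data file; the function name unique_chars_dict is made up for this illustration.

```python
def unique_chars_dict(lines):
    """Collect the distinct characters of all lines as dictionary keys."""
    letters = {}
    for line in lines:
        for character in line.strip():
            # Assigning to a key that already exists just overwrites its value.
            letters[character] = True
    return letters

found = unique_chars_dict(["Yabba\n", "Dabba\n", "Doo\n"])
print("Unique Characters: {" + "".join(found) + "}")
```

Whether this is measurably faster depends on the data; it removes one lookup per character but adds an assignment for characters that were already present.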
Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code
using System;
using System.IO;
using System.Collections;
using System.Diagnostics;
namespace ConsoleApplication {
class Program {
static void Main(string[] args) {
FileInfo fileInfo = new FileInfo(@"C:/data.txt");
Console.WriteLine(fileInfo.Length);
Stopwatch sw = new Stopwatch();
sw.Start();
Hashtable table = new Hashtable();
StreamReader sr = new StreamReader(@"C:/data.txt");
while (!sr.EndOfStream) {
char c = Char.ToLower((char)sr.Read());
if (!table.Contains(c)) {
table.Add(c, null);
}
}
sr.Close();
foreach (char c in table.Keys) {
Console.Write(c);
}
Console.WriteLine();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
}
}
produces output
4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .
The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.
s = open("text.txt", "r").read()
l = len(s)
unique = {}
for i in range(l):
    if s[i] in unique:
        unique[s[i]] = unique[s[i]] + 1
    else:
        unique[s[i]] = 1
print(unique)
Python without using a set.
file = open('location', 'r')
letters = []
for line in file:
    for character in line:
        if character not in letters:
            letters.append(character)
print(letters)
Old question, I know, but here's a fast solution, meaning it runs fast, and it's probably also pretty fast to code if you know how to copy/paste ;)
BACKGROUND
I had a huge csv file (12 GB, 1.34 million lines, 12.72 billion characters) that I was loading into postgres that was failing because it had some "bad" characters in it, so naturally I was trying to find a character not in that file that I could use as a quote character.
1. First try: Jay's C++ solution
I started with @jay's C++ answer:
(Note: all of these code examples were compiled with g++ -O2 uniqchars.cpp -o uniqchars)
#include <iostream>
#include <set>
int main() {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
char ch;
/* ignore whitespace and case */
while ( std::cin.get(ch) ) {
if (! isspace(ch) ) {
seen_chars.insert(tolower(ch));
}
}
for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
std::cout << *iter << std::endl;
}
return 0;
}
Timing for this one:
real 10m55.026s
user 10m51.691s
sys 0m3.329s
2. Read entire file at once
I figured it'd be more efficient to read in the entire file into memory at once, rather than all those calls to cin.get(). This reduced the run time by more than half.
(I also added a filename as a command line argument, and made it print out the characters separated by spaces instead of newlines).
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <cstdio>
#include <cctype>
int main(int argc, char **argv) {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
std::ifstream ifs(argv[1]);
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
fprintf(stderr, "Size of file: %zu\n", size);
std::string str(size, ' ');
ifs.seekg(0);
ifs.read(&str[0], size);
/* ignore whitespace and case */
for (char& ch : str) {
if (!isspace(ch)) {
seen_chars.insert(tolower(ch));
}
}
for(iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
std::cout << *iter << " ";
}
std::cout << std::endl;
return 0;
}
Timing for this one:
real 4m41.910s
user 3m32.014s
sys 0m17.858s
3. Remove isspace() check and tolower()
Besides the set insert, isspace() and tolower() are the only things happening in the for loop, so I figured I'd remove them. It shaved off another 1.5 minutes.
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <cstdio>
int main(int argc, char **argv) {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
std::ifstream ifs(argv[1]);
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
fprintf(stderr, "Size of file: %zu\n", size);
std::string str(size, ' ');
ifs.seekg(0);
ifs.read(&str[0], size);
for (char& ch : str) {
// removed isspace() and tolower()
seen_chars.insert(ch);
}
for(iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
std::cout << *iter << " ";
}
std::cout << std::endl;
return 0;
}
Timing for final version:
real 3m12.397s
user 2m58.771s
sys 0m13.624s
The simple solution from @Triptych helped me already (my input was a file of 124 MB in size, so this approach of reading the entire contents into memory still worked).
However, I had a problem with encoding: Python didn't interpret the UTF-8 encoded input correctly. So here's a slightly modified version which works for UTF-8 encoded files (and also sorts the collected characters in the output):
import io
with io.open("my-file.csv", 'r', encoding='utf8') as f:
    text = f.read()
print("Unique Characters: {%s}" % ''.join(sorted(set(text))))
