PowerShell string to number functions?

I am looking for an easy way to convert numeric strings, e.g. "1.78M" or "1.47B", to an integer variable.
Any help is appreciated.
Thanks

Are you looking for M == MB or M == 1E6? If it is the former, PowerShell understands KB, MB, GB and TB e.g.:
C:\PS> Invoke-Expression "2MB"
2097152
Big caveat here with Invoke-Expression: if you're getting the string from a user or a file, i.e. an untrusted source, you have to be careful about executing it. Say the string is "2MB; Remove-Item C:\ -Recurse -Force -Whatif -EA 0"; you'd have a bad day using Invoke-Expression on that string. BTW, I'm being nice here by adding the -Whatif. :-)
If it is the latter, you could do a regex -replace followed by a coercion e.g.:
C:\PS> [long]("3.34 B" -replace '(\d+)\s*(B)','$1E9')
3340000000

There is nothing built in that supports suffixes like M for million, B for billion, etc. There is only built-in support for "file size" suffixes, for example 32KB -> 32768
Here's my attempt at a basic script version to address your problem. This supports multi-character suffixes if needed, or no suffix at all. It will always return an [int], so be wary of overflow (e.g. 5.5B will cause an error since it won't fit in an int). You can modify the types a bit to support bigger numbers.
function ToNumber
{
    param([string] $NumberString)

    # add other multiplier suffixes to this table
    $multipliers = @{ 'B' = 1000000000; 'M' = 1000000; 'K' = 1000; '' = 1 }

    switch -regex ($NumberString)
    {
        '^(?<base>[\d\.]+)(?<suffix>\w*)$'
        {
            $base = [double] $matches['base']
            $multiplier = [int] $multipliers[$matches['suffix']]
            if ($multiplier)
            {
                [int]($base * $multiplier)
            }
            else
            {
                throw "$($matches['suffix']) is an unknown suffix"
            }
        }
        default
        {
            throw 'Unable to parse input'
        }
    }
}
C:\> ToNumber '1.7B'
1700000000
C:\> ToNumber '1.7K'
1700
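If you need values like 5.5B that overflow [int], a small variation of the function above (my own sketch, not part of the original answer) is to do the arithmetic in [double] and return a [long]:
function ToNumberLong
{
    param([string] $NumberString)
    # same multiplier table as above, but a [long] result avoids [int] overflow
    $multipliers = @{ 'B' = 1000000000; 'M' = 1000000; 'K' = 1000; '' = 1 }
    if ($NumberString -match '^(?<base>[\d\.]+)(?<suffix>\w*)$' -and
        $multipliers.Contains($matches['suffix']))
    {
        return [long]([double] $matches['base'] * $multipliers[$matches['suffix']])
    }
    throw 'Unable to parse input'
}
C:\> ToNumberLong '5.5B'
5500000000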

Why does using a variable as a function parameter cause a type mismatch error in a later function call?

def make-list [val?: string] {
["f1" "f2"]
}
def use-list [list: list] {
$list | each { |it| $it + "!" }
}
let $val = "abc"
let $l = (make-list $val)
use-list $l
In this code, I have two functions, one to make a list, and another to consume that list. For the purposes of this MRE, make-list returns a simple hard-coded list, and use-list prints each element with an exclamation mark added to the end.
If this script executes correctly, it should print a list with two elements "f1!" and "f2!".
make-list has an optional parameter. In the code above, I create a $val variable, then pass that into make-list, and store the result of the pipeline to $l. Executing $l | describe instead of use-list $l prints list<string>, so I'm confident that $l is indeed a list.
The code above throws the following compile error:
Error: nu::parser::type_mismatch
× Type mismatch.
╭─[/.../test.nu:10:1]
10 │ let $l = (make-list $val)
11 │ use-list $l
· ─┬
· ╰── expected List(Any), found String
╰────
However, I can modify the let $l ... line to allow the script to compile and execute correctly. Both of the following options allow the script to compile and return the expected result.
The parameter for make-list can be removed (because the parameter is optional):
let $l = (make-list)
The parameter for make-list can be replaced by a string literal:
let $l = (make-list "abc")
I don't understand why passing a variable as an argument is suddenly causing a type issue with the use-list function call. What am I doing wrong here?
As an interesting side note, and this might explain the "...found String" part of the error message, if I change the make-list parameter to an int:
def make-list [val?: int] {
["f1" "f2"]
}
# ...
let $val = 3
let $l = (make-list $val)
use-list $l
It will fail to compile with the same kind of error, but the "found" type will reflect the updated parameter type.
Error: nu::parser::type_mismatch
× Type mismatch.
╭─[/.../test.nu:10:1]
10 │ let $l = (make-list $val)
11 │ use-list $l
· ─┬
· ╰── expected List(Any), found Int
╰────
Tested using Nushell version 0.74.0 and 0.75.0 on Arch Linux and Windows 11, respectively.
Something seems to "go wrong" with the variable binding. Your scenario does work if the literal string is being evaluated in a subexpression before it is being bound:
let $val = ("abc")
# or
let $val = do { "abc" }
Even a seemingly useless type casting later on does the trick:
let $val = "abc"
let $l = (make-list ($val | into string))
This is also consistent with other types. Adapting your int example:
def make-list [val?: int] { … }
…
let $val = 123 # fails
let $val = (123) # works
Likewise, a do closure or a type cast using into int works as well.
The syntax of the variable binding, however, does not matter: with the $ sign omitted, all examples from above still work or fail under the same circumstances, e.g.:
let val = "abc" # fails
let val = ("abc") # works
As expected, using let l = instead of let $l = also yields the same results.

Perl - How to set key based on column header when converting from xlsx to perl hash

I have an xlsx file that I'm converting into a Perl hash:
Name   Type  Symbol  Colour
JOHN   SUV   X       R
ROB    MPV   Y       B
JAMES  4X4   Y       G
Currently, I can only set the hash super-key to the wanted column by its array index. I can't seem to figure out how to choose it based on the column header.
use Data::Dumper;
use Text::Iconv;
my $converter = Text::Iconv->new("utf-8", "windows-1251");
use Spreadsheet::XLSX;
my $excel = Spreadsheet::XLSX->new('file.xlsx', $converter);
foreach my $sheet (@{$excel->{Worksheet}}) {
    if ($sheet->{Name} eq "sheet1") {
        my %data;
        for my $row (0 .. $sheet->{MaxRow}) {
            if ($sheet->{Cells}[0][$col]->{Val} eq "Symbol") {
                my $super_key = $sheet->{Cells}[$row][$col]{Val};
            }
            my $key = $sheet->{Cells}[$row][0]{Val};
            my $value = $sheet->{Cells}[$row][2]{Val};
            my $value2 = $sheet->{Cells}[$row][3]{Val};
            $data{$super_key}->{$key}->{$value} = $value2;
        }
        print Dumper \%data;
    }
}
The output I get is:
$VAR1 = {
    '' => {
        'JOHN' => {
            'SUV' => R
I would like to have:
$VAR1 = {
    'X' => {
        'JOHN' => {
            'SUV' => R
You are missing use strict; in your Perl script. If you had it, you would have seen the error yourself.
Declaring $super_key with my inside your if-clause makes the variable go out of scope as soon as you exit it.
And using a variable $col without defining it doesn't work either.
Better (and probably working) is:
for my $row (0 .. $sheet->{MaxRow}) {
    my $super_key;
    foreach my $col (0 .. 3) {
        if ($sheet->{Cells}[0][$col]->{Val} eq "Symbol") {
            $super_key = $sheet->{Cells}[$row][$col]{Val};
        }
    }
    my $key = $sheet->{Cells}[$row][0]{Val};
    my $value = $sheet->{Cells}[$row][2]{Val};
    my $value2 = $sheet->{Cells}[$row][3]{Val};
    $data{$super_key}->{$key}->{$value} = $value2;
}
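If you would rather locate the column by its header text than hard-code an index, which is what the question asks about, you can scan the header row once before the data loop. A sketch (variable names are mine; assumes Perl 5.10+ for the // operator):
# find the "Symbol" column index once, from the header row (row 0)
my $symbol_col;
for my $col (0 .. $sheet->{MaxCol}) {
    if (($sheet->{Cells}[0][$col]{Val} // '') eq 'Symbol') {
        $symbol_col = $col;
        last;
    }
}
die "no 'Symbol' column found" unless defined $symbol_col;

for my $row (1 .. $sheet->{MaxRow}) {   # start at 1 to skip the header row
    my $super_key = $sheet->{Cells}[$row][$symbol_col]{Val};
    my $key       = $sheet->{Cells}[$row][0]{Val};
    my $value     = $sheet->{Cells}[$row][2]{Val};
    my $value2    = $sheet->{Cells}[$row][3]{Val};
    $data{$super_key}{$key}{$value} = $value2;
}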

Taking a string as input for a calculator program in Perl

I am new to Perl and I'm trying to create a simple calculator program, but the rules are different from normal maths: all operations have the same precedence, and the expression must be solved from left to right.
Here is an example:
123 - 10 + 4 * 10 = ((123 - 10) + 4) * 10 = 1170
8 * 7 / 3 + 2 = ((8 * 7) / 3) + 2 = 20.666
So in the first case the user needs to enter one string: 123 - 10 + 4 * 10.
How do I approach this task?
I'm sorry if it's too much of a general question, but I'm not sure how to even begin. Do I need a counter? Like, every second character of the string is an operator, while the ones on either side are digits.
I'm afraid I'm lazy so I'll parse with a regex and process as I parse.
#!/usr/bin/env perl
#use Data::Dumper;
use Params::Validate (':all');
use 5.01800;
use warnings;
my $string = q{123 - 10 + 4 * 10};
my $result;

sub fee {
    my ($a) = validate_pos(@_, { type => SCALAR });
    #warn Data::Dumper->Dump([\$a],[qw(*a)]),' ';
    $result = $a;
}

sub fi {
    my ($op, $b) = validate_pos(@_, { type => SCALAR }, { type => SCALAR });
    #warn Data::Dumper->Dump([\$op,\$b],[qw(*op *b)]),' ';
    $result = $op eq '+' ? $result + $b :
              $op eq '-' ? $result - $b :
              $op eq '*' ? $result * $b :
              $op eq '/' ? $result / $b :
              undef;
    #warn Data::Dumper->Dump([\$result],[qw(*result)]),' ';
}

$string =~ m{^(\d+)(?{ fee($1) })(?:(?: *([-+/*]) *)(\d+)(?{ fi($2,$3) }))*$};
say $result;
Note the use of (?{ ... }) to run code during the match.
To be clear, you are not looking for a regular calculator. You are looking for a calculator that bends the rules of math.
What you want is to extract the operands and operators, then handle them three at a time, with the first one being the rolling "sum", the second an operator and the third an operand.
A simple way to handle it is to just eval the strings. But since eval is a dangerous operation, we need to de-taint the input. We do this with a regex match: /\d+|[+\-*\/]+/g. This matches either one or more (+) digits (\d), or (|) one or more of the characters +, -, * and /. And we do this match as many times as we can (/g).
use strict;
use warnings;
use feature 'say';
while (<>) {                                     # while we get input
    my ($main, @ops) = /\d+|[+\-*\/]+/g;         # extract the ops
    while (@ops) {                               # while the list is not empty
        $main = calc($main, splice @ops, 0, 2);  # take 2 items off the list and process
    }
    say $main;                                   # print result
}
sub calc {
    eval "@_";  # simply eval a string of 3 ops, e.g. eval("1 + 2")
}
You may wish to add some input checking, to count the args and make sure they are the correct number.
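For example, a minimal guard (my own sketch, using the core Scalar::Util module) could look like:
use Scalar::Util 'looks_like_number';

sub calc {
    # expect exactly: <number> <operator> <number>
    die "usage: calc NUM OP NUM\n"
        unless @_ == 3
            && looks_like_number($_[0])
            && $_[1] =~ m{^[-+*/]$}
            && looks_like_number($_[2]);
    eval "@_";
}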
A more sensible solution is to use a calling table, using the operator as the key from a hash of subs designed to handle each math operation:
sub calc {
    my %proc = (
        "+" => sub { $_[0] + $_[1] },
        "-" => sub { $_[0] - $_[1] },
        "/" => sub { $_[0] / $_[1] },
        "*" => sub { $_[0] * $_[1] },
    );
    return $proc{$_[1]}->($_[0], $_[2]);
}
As long as the middle argument is an operator, this will perform the required operation without the need for eval. This will also allow you to add other math operations that you might want for the future.
Just to read raw input from the user you would simply read the STDIN file handle.
$input = <STDIN>;
This will give you a string, say "123 + 234 - 345", which will have an end-of-line marker. You can remove this safely with the chomp command.
After that you will want to parse your string to get your appropriate variables. You can brute force this with a stream scanner that looks at each character as you read it and processes it accordingly. For example:
my @input = split //, $input;
for my $ch (@input) {
    if ($ch ge '0' and $ch le '9') {
        $tVal = ($tVal * 10) + $ch;
    } elsif ($ch eq " ") {
        $newVal = $oldVal;
    } elsif ($ch eq "+") {
        # Do addition stuff
    }...
}
Another approach would be to split it into words so you can just deal with whole terms.
#input = split /\s+/, $input;
Instead of a stream of characters, as you process the array values will be 123, +, 234, -, and 345...
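For instance, a complete left-to-right evaluator built on that word-splitting approach might look like this (my own sketch; it assumes well-formed input):
use strict;
use warnings;

chomp(my $input = <STDIN>);                 # e.g. "123 - 10 + 4 * 10"
my ($result, @rest) = split /\s+/, $input;  # first operand, then operator/operand pairs

while (@rest) {
    my ($op, $num) = splice @rest, 0, 2;    # consume one operator and one operand
    if    ($op eq '+') { $result += $num }
    elsif ($op eq '-') { $result -= $num }
    elsif ($op eq '*') { $result *= $num }
    elsif ($op eq '/') { $result /= $num }
    else  { die "unknown operator '$op'\n" }
}
print "$result\n";                          # prints 1170 for the example above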
Hope this points you in the right direction...

Stuck at understanding the "$worksheet->add_write_handler(qr[\w], \&store_string_widths);" line in the piece of code below

My final goal for my first Perl program: to create an Excel sheet for reporting purposes and email the sheet as an attachment.
I have got as far as creating a CSV file. Now I want to convert this to an Excel sheet and autofit the content.
I have example code in our environment; could someone take the time to explain each line of the code below? I would be very grateful.
$outputfile, $urloutputfile and $scomoutputfile are the CSV files now being converted to Excel sheets.
Please also explain how an element is passed to the other function.
my $parser = Text::CSV::Simple->new;
my $workbook = Excel::Writer::XLSX->new($auditxl);
my @totcsvlist;
push(@totcsvlist, $outputfile);
push(@totcsvlist, $urloutputfile);
push(@totcsvlist, $scomoutputfile);
my @data;
my $subject = 'worksheet';
foreach my $totcsvlis (@totcsvlist)
{
    undef @data;
    chomp($totcsvlis);
    if ($totcsvlis eq $outputfile)     { $subject = "Service Status"; }
    if ($totcsvlis eq $urloutputfile)  { $subject = "URL Status"; }
    if ($totcsvlis eq $scomoutputfile) { $subject = "SCOM Agent Status"; }
    @data = $parser->read_file($totcsvlis);
    my $headers = shift @data;
    import_data($workbook, $subject, $headers, \@data);
}
$workbook->close();

sub autofit_columns {
    my $worksheet = shift;
    my $col = 0;
    for my $width (@{$worksheet->{__col_widths}}) {
        $worksheet->set_column($col, $col, $width) if $width;
        $col++;
    }
}

sub import_data {
    my $workbook  = shift;
    my $base_name = shift;
    my $colums    = shift;
    my $data      = shift;
    my $limit     = shift || 50_000;
    my $start_row = shift || 1;
    my $bold = $workbook->add_format();
    $bold->set_bold(1);
    $bold->set_bg_color('gray');
    $bold->set_border();
    my $celbor = $workbook->add_format();
    $celbor->set_border();
    my $worksheet = $workbook->add_worksheet($base_name);
    $worksheet->add_write_handler(qr[\w], \&store_string_widths);
    my $w = 1;
    $worksheet->write('A' . $start_row, $colums, $bold);
    my $i = $start_row;
    my $qty = 0;
    for my $row (@$data) {
        $qty++;
        $worksheet->write($i++, 0, $row, $celbor);
    }
    autofit_columns($worksheet);
    warn "Converted $qty rows.";
    return $worksheet;
}

sub autofit_columns {
    my $worksheet = shift;
    my $col = 0;
    for my $width (@{$worksheet->{__col_widths}}) {
        $worksheet->set_column($col, $col, $width + 5) if $width;
        $col++;
    }
}

sub store_string_widths {
    my $worksheet = shift;
    my $col   = $_[1];
    my $token = $_[2];
    return if not defined $token;     # Ignore undefs.
    return if $token eq '';           # Ignore blank cells.
    return if ref $token eq 'ARRAY';  # Ignore array refs.
    return if $token =~ /^=/;         # Ignore formulas.
    return if $token =~ m{^[fh]tt?ps?://};
    return if $token =~ m{^mailto:};
    return if $token =~ m{^(?:in|ex)ternal:};
    my $old_width    = $worksheet->{__col_widths}->[$col];
    my $string_width = string_width($token);
    if (not defined $old_width or $string_width > $old_width) {
        $worksheet->{__col_widths}->[$col] = $string_width;
    }
    return undef;
}

sub string_width {
    return length $_[0];
}
I have tried to search for and read about the modules used in the above code, but it is over my head.
https://github.com/jmcnamara/spreadsheet-writeexcel/blob/master/examples/autofit.pl
It has similar code and provided a basic overview, but I would like to understand it in detail.
Thank you so much in advance.
Regards,
Kaushik KM
Here is the documentation for the add_write_handler() method call. It says:
add_write_handler( $re, $code_ref )
This method is used to extend the Excel::Writer::XLSX write() method
to handle user defined data.
And later, it says:
The add_write_handler() method take two arguments, $re, a regular
expression to match incoming data and $code_ref a callback function
to handle the matched data
So, here you have a method call that takes two arguments. The first is a regex that tells the object what type of data this new write handler is used for. The second is a reference to the subroutine that should be used as the write handler for data that matches the regex.
The regex you have is qr[\w]. The actual regex bit of that is \w. And that just means "match a word character". The qr is to compile a string into a regex and the [ ... ] is just the delimiter for the regex string (qr/.../ is one of a class of Perl operators that allows you to use almost any character you want as a delimiter).
So, if your object is called on to write some data that contains at least one word character, the subroutine which is given as the second argument is used. But we take a reference to the subroutine.
Elsewhere in your code, you define the store_string_widths() subroutine. Subroutines in Perl are a lot like variables, and that means they have their own sigil. The sigil for a subroutine is & (like the $ for scalars and @ for arrays). You very rarely need the & in modern Perl code, so you won't see it used very often. One place it is still used is when we take a reference to a subroutine. You take a reference to any variable by putting a backslash in front of the variable's full name (like \@array or \%hash) and subroutines are no different. So \&store_string_widths means "get a reference to the subroutine called store_string_widths()".
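A tiny standalone illustration of that last point (my own example, nothing Excel-specific, just the reference mechanics):
sub greet { print "called with: @_\n" }

my $handler = \&greet;       # take a reference to the subroutine
$handler->('some', 'args');  # call it through the reference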
You say that this is your first Perl program. I have to say that this feels a little ambitious for your first Perl code. I don't cover references at all in my two-day beginners course, and on my intermediate course I cover most references but only mention subroutine references in passing. If you can understand references enough to get this all working, then I think you're doing really well.

Find Unique Characters in a File

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know are the unique characters in this file.
For instance, if my file were the following;
Entry
-----
Yabba
Dabba
Doo
Then the result would be
Unique characters: {abdoy}
Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.
Update
I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.
Update 2
By Fast, I mean fast to implement...not necessarily fast to run.
BASH shell script version (no sed/awk):
while read -n 1 char; do echo "$char"; done < entry.txt | tr '[A-Z]' '[a-z]' | sort -u
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
#include <iostream>
#include <set>
#include <cctype>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;
    /* ignore whitespace and case */
    while (std::cin.get(ch)) {
        if (!isspace(ch)) {
            seen_chars.insert(tolower(ch));
        }
    }
    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << std::endl;
    }
    return 0;
}
Note that I'm ignoring whitespace and it's case insensitive as requested.
For a 450,000+ entry file (chars.txt), here's a sample run time:
[user@host]$ g++ -o unique_chars unique_chars.cpp
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y
real 0m0.638s
user 0m0.612s
sys 0m0.017s
As requested, a pure shell-script "solution":
sed -e "s/./\0\n/g" inputfile | sort -u
It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.
For even more ridiculousness, I present the version that dumps the output on one line:
sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or generally, dictionary) implementation and just omit the value field. Use your characters as keys. These data structures generally filter out duplicate entries (hence the name set, from its mathematical usage: sets don't have a particular order and only unique values).
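For example, in Python the built-in set type does this directly; a minimal sketch (the file name is just a placeholder):
seen = set()
with open("entries.txt") as fh:            # placeholder file name
    for line in fh:
        seen.update(line.strip().lower())  # a set silently drops duplicates
print("Unique characters: {%s}" % "".join(sorted(seen)))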
Quick and dirty C program that's blazingly fast:
#include <stdio.h>

int main(void)
{
    int chars[256] = {0}, c;
    while ((c = getchar()) != EOF)
        chars[c] = 1;
    for (c = 32; c < 127; c++)  // printable chars only
    {
        if (chars[c])
            putchar(c);
    }
    putchar('\n');
    return 0;
}
Compile it, then do
cat file | ./a.out
to get a list of the unique printable characters in file.
Python w/sets (quick and dirty)
s = open("data.txt", "r").read()
print "Unique Characters: {%s}" % ''.join(set(s))
Python w/sets (with nicer output)
import re
text = open("data.txt", "r").read().lower()
unique = re.sub('\W', '', ''.join(set(text)))  # Ignore non-alphanumeric
print "Unique Characters: {%s}" % unique
Here's a PowerShell example:
gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique
which produces:
D
Y
a
b
o
I like that it's easy to read.
EDIT: Here's a faster version:
$letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.
Why the arbitrary limitation that you need a "script" that does it?
What exactly is a script anyway?
Would Python do?
If so, then this is one solution:
import sys

s = set()
while True:
    line = sys.stdin.readline()
    if not line:
        break
    line = line.rstrip()
    for c in line.lower():
        s.add(c)
print("".join(sorted(s)))
Algorithm (a C sketch follows this list):
1. Slurp the file into memory.
2. Create an array of unsigned ints, initialized to zero.
3. Iterate through the in-memory file, using each byte as a subscript into the array.
4. Increment that array element.
5. Discard the in-memory file.
6. Iterate through the array of unsigned ints.
7. If a count is not zero, display the character and its corresponding count.
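A sketch of that algorithm in C (my own; it streams stdin instead of slurping the file, which doesn't change the counts, and it shows printable characters only to keep the output readable):
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    unsigned int counts[256] = {0};
    int c;

    /* steps 3-4: use each byte as a subscript and increment its count */
    while ((c = getchar()) != EOF)
        counts[(unsigned char)c]++;

    /* steps 6-7: display each character that occurred, with its count */
    for (c = 0; c < 256; c++)
        if (counts[c] && isprint(c))
            printf("%c: %u\n", c, counts[c]);

    return 0;
}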
cat yourfile |
perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
Alternative solution using bash:
sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$
EDIT Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick, but then this solution has no real advantage over saua's (especially since he now uses the same sed pattern instead of explicit captures).
While not a script, this Java program will do the work. It's easy to understand and fast (to run).
import java.util.*;
import java.io.*;

public class Unique {
    public static void main(String[] args) throws IOException {
        int c = 0;
        Set s = new TreeSet();
        while ((c = System.in.read()) > 0) {
            s.add(Character.toLowerCase((char) c));
        }
        System.out.println("Unique characters:" + s);
    }
}
You'll invoke it like this:
type yourFile | java Unique
or
cat yourFile | java Unique
For instance, the unique characters in the HTML of this question are:
Unique characters:[ , , , , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, @, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
Print unique characters (ASCII and Unicode UTF-8)
import codecs

file = codecs.open('my_file_name', encoding='utf-8')

# Runtime: O(1)
letters = set()

# Runtime: O(n) -- each character in the file is visited once
for line in file:
    for character in line:
        letters.add(character)

# Runtime: O(n)
letter_str = ''.join(letters)
print(letter_str)
Save as unique.py, and run as python unique.py.
In C++ I would first loop through the letters of the alphabet, then run strchr() on the file contents for each one. That tells you whether the letter exists; if so, just add it to the list.
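A rough sketch of that idea (my own code, not the poster's; the file name is a placeholder, and note it scans the text once per letter, so it's O(26n)):
#include <cstdio>
#include <cstring>
#include <fstream>
#include <sstream>
#include <string>

int main() {
    // read the whole file into one string ("entries.txt" is a placeholder)
    std::ifstream in("entries.txt");
    std::stringstream ss;
    ss << in.rdbuf();
    std::string contents = ss.str();

    // one strchr() probe per letter (lower and upper case), as described above
    for (char letter = 'a'; letter <= 'z'; ++letter) {
        if (std::strchr(contents.c_str(), letter) ||
            std::strchr(contents.c_str(), letter - 'a' + 'A')) {
            std::putchar(letter);
        }
    }
    std::putchar('\n');
    return 0;
}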
Try this file with JSDB JavaScript (it includes the JavaScript engine used in the Firefox browser):
var seenAlreadyMap = {};
var seenAlreadyArray = [];
while (!system.stdin.eof)
{
    var L = system.stdin.readLine();
    for (var i = L.length; i-- > 0; )
    {
        var c = L[i].toLowerCase();
        if (!(c in seenAlreadyMap))
        {
            seenAlreadyMap[c] = true;
            seenAlreadyArray.push(c);
        }
    }
}
system.stdout.writeln(seenAlreadyArray.sort().join(''));
Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted, a set is probably more efficient than a dictionary, but both take constant time to access items, and both run circles around a list, where for each character you search the list to see whether the character is already in it. Also, lists and dictionaries are built-in Python datatypes that everyone should be using all the time. So even if set doesn't come to mind, dictionary should.
file = open('location.txt', 'r')
letters = {}
for line in file:
    if line == "":
        break
    for character in line.strip():
        if character not in letters:
            letters[character] = True
file.close()
print "Unique Characters: {" + "".join(letters.keys()) + "}"
A C solution. Admittedly it is not the fastest-to-code solution in the world, but since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions so I wanted to post one, for the pure sadistic pleasure :)
#include <stdio.h>
#include <stdlib.h>

#define CHARSINSET 256
#define FILENAME "location.txt"

char buf[CHARSINSET + 1];

char *getUniqueCharacters(int *charactersInFile) {
    int x;
    char *bufptr = buf;
    for (x = 0; x < CHARSINSET; x++) {
        if (charactersInFile[x] > 0)
            *bufptr++ = (char) x;
    }
    *bufptr = '\0';
    return buf;
}

int main() {
    FILE *fp;
    int c;
    int *charactersInFile = calloc(CHARSINSET, sizeof(int));
    if (NULL == (fp = fopen(FILENAME, "rt"))) {
        printf("File not found.\n");
        return 1;
    }
    while (1) {
        c = getc(fp);
        if (c == EOF) {
            break;
        }
        if (c != '\n' && c != '\r')
            charactersInFile[c]++;
    }
    fclose(fp);
    printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
    return 0;
}
Quick and dirty solution using grep (assuming the file name is "file"):
for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    if [ ! -z "`grep -li $char file`" ]; then
        echo -n $char
    fi
done
echo
I could have made it a one-liner but just want to make it easier to read.
(EDIT: forgot the -i switch to grep)
Well my friend, I think this is what you had in mind... at least, this is the Python version!
f = open("location.txt", "r") # open file
ll = sorted(list(f.read().lower())) #Read file into memory, split into individual characters, sort list
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
f.close()
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return
It does not iterate through each character, it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM) but for shorter files it is fun :)
I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, I also reduced the active code from 3 lines to 2. Basically if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports.....Anyway now we are having fun :)
import itertools, sys
# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x:x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return
This answer above mentioned using a dictionary.
If so, the code presented there can be streamlined a bit, since the Python documentation states:
It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary).... If you store using a key that is already in use, the old value associated with that key is forgotten.
Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:
if character not in letters:
And that should make it a little faster.
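With that membership test removed, the dictionary version from that answer shrinks to something like this (my own condensed sketch, in the same Python 2 style as the original):
letters = {}
file = open('location.txt', 'r')
for line in file:
    for character in line.strip():
        letters[character] = True   # overwriting an existing key changes nothing
file.close()
print "Unique Characters: {" + "".join(letters.keys()) + "}"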
Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code
using System;
using System.IO;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication {
    class Program {
        static void Main(string[] args) {
            FileInfo fileInfo = new FileInfo(@"C:/data.txt");
            Console.WriteLine(fileInfo.Length);
            Stopwatch sw = new Stopwatch();
            sw.Start();
            Hashtable table = new Hashtable();
            StreamReader sr = new StreamReader(@"C:/data.txt");
            while (!sr.EndOfStream) {
                char c = Char.ToLower((char)sr.Read());
                if (!table.Contains(c)) {
                    table.Add(c, null);
                }
            }
            sr.Close();
            foreach (char c in table.Keys) {
                Console.Write(c);
            }
            Console.WriteLine();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}
produces output
4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .
The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.
s=open("text.txt","r").read()
l= len(s)
unique ={}
for i in range(l):
if unique.has_key(s[i]):
unique[s[i]]=unique[s[i]]+1
else:
unique[s[i]]=1
print unique
Python without using a set.
file = open('location', 'r')
letters = []
for line in file:
    for character in line:
        if character not in letters:
            letters.append(character)
print(letters)
Old question, I know, but here's a fast solution, meaning it runs fast, and it's probably also pretty fast to code if you know how to copy/paste ;)
BACKGROUND
I had a huge csv file (12 GB, 1.34 million lines, 12.72 billion characters) that I was loading into postgres that was failing because it had some "bad" characters in it, so naturally I was trying to find a character not in that file that I could use as a quote character.
1. First try: Jay's C++ solution
I started with @jay's C++ answer:
(Note: all of these code examples were compiled with g++ -O2 uniqchars.cpp -o uniqchars)
#include <iostream>
#include <set>
#include <cctype>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;
    /* ignore whitespace and case */
    while (std::cin.get(ch)) {
        if (!isspace(ch)) {
            seen_chars.insert(tolower(ch));
        }
    }
    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << std::endl;
    }
    return 0;
}
Timing for this one:
real 10m55.026s
user 10m51.691s
sys 0m3.329s
2. Read entire file at once
I figured it'd be more efficient to read in the entire file into memory at once, rather than all those calls to cin.get(). This reduced the run time by more than half.
(I also added a filename as a command line argument, and made it print out the characters separated by spaces instead of newlines).
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <cstdio>
#include <cctype>

int main(int argc, char **argv) {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;

    std::ifstream ifs(argv[1]);
    ifs.seekg(0, std::ios::end);
    size_t size = ifs.tellg();
    fprintf(stderr, "Size of file: %lu\n", size);

    std::string str(size, ' ');
    ifs.seekg(0);
    ifs.read(&str[0], size);

    /* ignore whitespace and case */
    for (char& ch : str) {
        if (!isspace(ch)) {
            seen_chars.insert(tolower(ch));
        }
    }
    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << " ";
    }
    std::cout << std::endl;
    return 0;
}
Timing for this one:
real 4m41.910s
user 3m32.014s
sys 0m17.858s
3. Remove isspace() check and tolower()
Besides the set insert, isspace() and tolower() are the only things happening in the for loop, so I figured I'd remove them. It shaved off another 1.5 minutes.
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <cstdio>

int main(int argc, char **argv) {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;

    std::ifstream ifs(argv[1]);
    ifs.seekg(0, std::ios::end);
    size_t size = ifs.tellg();
    fprintf(stderr, "Size of file: %lu\n", size);

    std::string str(size, ' ');
    ifs.seekg(0);
    ifs.read(&str[0], size);

    for (char& ch : str) {
        // removed isspace() and tolower()
        seen_chars.insert(ch);
    }
    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << " ";
    }
    std::cout << std::endl;
    return 0;
}
Timing for final version:
real 3m12.397s
user 2m58.771s
sys 0m13.624s
The simple solution from @Triptych helped me already (my input was a file of 124 MB in size, so this approach of reading the entire contents into memory still worked).
However, I had a problem with encoding, python didn't interpret the UTF8 encoded input correctly. So here's a slightly modified version which works for UTF8 encoded files (and also sorts the collected characters in the output):
import io
with io.open("my-file.csv", 'r', encoding='utf8') as f:
    text = f.read()
print "Unique Characters: {%s}" % ''.join(sorted(set(text)))
