Print specific lines from file

Print specific lines from file - search

I have a text file like this:
OBJ O1 = {
{P11},
{0}
};
OBJ O2 = {
{P21},
{P22},
{P23},
{0}
};
OBJ O3 = {
{P31},
{P32},
{0}
};
I want to print only lines starting by OBJ O2 and ending by first {0} after OBJ O2
OBJ O2 = {
{P21},
{P22},
{P23},
{0}
};
Is there a solution to print these lines using cat, sed, awk or grep?

You can do this :
awk '/OBJ O2/,/^};$/' file

An awk alternative (without range expression), cf.
Is a /start/,/end/ range expression ever useful in awk?
awk '/^OBJ O2/{f=1} f{print; if(/};/) f=0}' file

Related

How to sort the characters of a word using awk?

I can't seem to find any way of sorting a word based on its characters in awk.
For example if the word is "hello" then its sorted equivalent is "ehllo". how to achieve this in awk ?

With GNU awk for PROCINFO[], "sorted_in" (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning) and splitting with a null separator resulting in an array of chars:
$ echo 'hello' |
awk '
BEGIN { PROCINFO["sorted_in"]="#val_str_asc" }
{
split($1,chars,"")
word = ""
for (i in chars) {
word = word chars[i]
}
print word
}
'
ehllo
$ echo 'hello' | awk -v ordr='#val_str_asc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ehllo
$ echo 'hello' | awk -v ordr='#val_str_desc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ollhe

Another option is a Decorate-Sort-Undecorate with sed. Essentially, you use sed to break "hello" into one character per-line (decorating each character with a newline '\n') and pipe the result to sort. You then use sed to do the reverse (undecorate each line by removing the '\n') to join the lines back together.
printf "hello" | sed 's/\(.\)/\1\n/g' | sort | sed '{:a N;s/\n//;ta}'
ehllo
There are several approaches you can use, but this one is shell friendly, but the behavior requires GNU sed.

This would be more doable with gawk, which includes the asort function to sort an array:
awk 'BEGIN{FS=OFS=ORS=""}{split($0,a);asort(a);for(i in a)print a[i]}'<<<hello
This outputs:
ehllo
Demo: https://ideone.com/ylWQLJ

You need to write a function to sort letters in a word (see : https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html):
function siw(word, result, arr, arrlen, arridx) {
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
return result
}
And define a sort sub-function to compare two words (see : https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html):
function compare_by_letters(i1, v1, i2, v2, left, right) {
left = siw(v1)
right = siw(v2)
if (left < right)
return -1
else if (left == right)
return 0
else
return 1
}
And use this function with awk sort function:
asort(array_test, array_test_result, "compare_by_letters")
Then, the sample program is:
function siw(word, result, arr, arrlen, arridx) {
result = hash_word[word]
if (result != "") {
return result
}
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
hash_word[word] = result
return result
}
function compare_by_letters(i1, v1, i2, v2, left, right) {
left = siw(v1)
right = siw(v2)
if (left < right)
return -1
else if (left == right)
return 0
else
return 1
}
{
array_test[i++] = $0
}
END {
alen = asort(array_test, array_test_result, "compare_by_letters")
for (aind = 1; aind <= alen; aind++) {
print array_test_result[aind]
}
}
Executed like this:
echo -e "fail\nhello\nborn" | awk -f sort_letter.awk
Output:
fail
born
hello
Of course, if you have a big input, you could adapt siw function to memorize result for fastest compute:
function siw(word, result, arr, arrlen, arridx) {
result = hash_word[word]
if (result != "") {
return result
}
split(word, arr, "")
arrlen = asort(arr)
for (arridx = 1; arridx <= arrlen; arridx++) {
result = result arr[arridx]
}
hash_word[word] = result
return result
}

here's a very unorthodox method for a quick-n-dirty approach, if you really want to sort "hello" into "ehllo" :
mawk/mawk2/gawk 'BEGIN { FS="^$"
# to make it AaBbCc… etc; chr(65) = ascii "A"
for (x = 65; x < 91; x++) {
ref = sprintf("%s%c%c",ref, x, x+32)
} } /^[[:alpha:]]$/ { print } /[[:alpha:]][[:alpha:]]+/ {
# for gawk/nawk, feel free to change
# that to /[[:alpha:]]{2,}/
# the >= 2+ condition is to prevent wasting time
# sorting single letter words "A" and "I"
s=""; x=1; len=length(inp=$0);
while ( len && (x<53) ) {
if (inp~(ch = substr(ref,x++,1))) {
while ( sub(ch,"",inp) ) {
s = s ch;
len -= 1 ;
} } }
print s }'
I'm aware it's an extremely inefficient way of doing selection sort. The potential time-savings stem from instant loop ending the moment all letters are completed, instead of iterating all 52 letters everytime. The downside is that it doesn't pre-profile the input
(e.g. if u detect that this row is only lower-case, then u can speed it up with a lowercase only loop instead)
The upside is that it eliminates the need for custom-functions, eliminate any gawk dependencies, and also eliminate the need to split every row into an array (or every character into its own field)
i mean yes technically one can set FS to null string thus automatically becomes having NF as the string length. But at times it could be slow if input is a bit large. If you need unicode support, then a match()-based approach is more desirable.
added (x<53) condition to prevent run-away infinite loops in case input isn't pure ASCII letters

Relocation strings using awk/sed from a index file

I'd always appreciate all helps from this website. I would like to relocate strings based on the index number from an index file.
Index numbers are shown on the first column in the index file (index.txt) and I would like to relocate "path" based on index numbers. Paths are placed in the same row if the index number is the same. For example, there are two zeros so path_sparc_ifu_dec_in_3826 is placed on the first row and path_sparc_ifu_dec_in_4349 is placed on the first row and next to path_sparc_ifu_dec_in_3826.
index.txt:
0 path_sparc_ifu_dec_in_3826 str DR - -
0 path_sparc_ifu_dec_in_4349 stf DR - -
1 path_sparc_ifu_dec_in_2374 stf DR - -
1 path_sparc_ifu_dec_in_4011 stf DR - -
2 path_sparc_ifu_dec_in_3078 stf DR - -
However, strings are written in another file (source.txt) and each "path" has four lines of strings.
source.txt:
path_sparc_ifu_dec_in_3826
dtu_inst_d[14]
dec_fcl_rdsr_sel_pc_d
0.8664
path_sparc_ifu_dec_in_4349
dtu_inst_d[18]
dec_swl_rdsr_sel_thr_d
0.795429
path_sparc_ifu_dec_in_2374
dtu_inst_d[13]
dec_dcl_cctype_d[2]
0.938914
path_sparc_ifu_dec_in_4011
dtu_inst_d[13]
ifu_exu_useimm_d
0.843643
path_sparc_ifu_dec_in_3078
dtu_inst_d[12]
ifu_exu_shiftop_d[2]
0.915818
The desired output is:
path_sparc_ifu_dec_in_3826 path_sparc_ifu_dec_in_4349
dtu_inst_d[14] dtu_inst_d[18]
dec_fcl_rdsr_sel_pc_d dec_swl_rdsr_sel_thr_d
0.8664 0.795429
path_sparc_ifu_dec_in_2374 path_sparc_ifu_dec_in_4011
dtu_inst_d[13] dtu_inst_d[13]
dec_dcl_cctype_d[2] ifu_exu_useimm_d
0.938914 0.843643
path_sparc_ifu_dec_in_3078
dtu_inst_d[12]
ifu_exu_shiftop_d[2]
0.915818
My idea is that (1)combining two files first and (2) relocate path info using the index number, but I don't know how to do this work. Probably, sed/awk is an appropriate language.
Any help is appreciated.
Best,
Jaeyoung

a one line awk solution could be :
awk -F'\t' 'FNR==NR{ind[$2]=$1;next} { if($1 in ind) { l=4*ind[$1]} else {l=l+1}; text[l]=text[l]"\t"$1 } END { for (i = 0; i < length(text); i++) {print substr(text[i],2)} }' index.txt source.txt
Explanation :
-F'\t'
This is to use tab as separator
FNR==NR
To process file after file
{ind[$2]=$1;next}
Use the first file to create an index
if($1 in ind) { l=4*ind[$1]} else {l=l+1}
"l" is the line number in the output file. If the string is in the index the line number is index*4. If it is not in the index it's the previous line number + 1.
text[l]=text[l]"\t"$1
Add the current string to the correct line.
END { for (i = 0; i < length(text); i++) {print substr(text[i],2)} }
At the end print everything. The subrstr is only here to delete the first useless tab (first char) of each line
My output from your data :
path_sparc_ifu_dec_in_3826 path_sparc_ifu_dec_in_4349
dtu_inst_d[14] dtu_inst_d[18]
dec_fcl_rdsr_sel_pc_d dec_swl_rdsr_sel_thr_d
0.8664 0.795429
path_sparc_ifu_dec_in_2374 path_sparc_ifu_dec_in_4011
dtu_inst_d[13] dtu_inst_d[13]
dec_dcl_cctype_d[2] ifu_exu_useimm_d
0.938914 0.843643
path_sparc_ifu_dec_in_3078
dtu_inst_d[12]
ifu_exu_shiftop_d[2]
0.915818

This is another code that works for me.
awk '
NR==FNR {T[$2] = $1
MX = $1
next
}
$1 in T {IX = T[$1]
}
{P[IX, (FNR+3)%4] = P[IX, (FNR+3)%4] "\t" $0
}
END {for (i=0; i<=MX; i++) for (j=0; j<4; j++) print P[i, j]
}
' index.txt source.txt

Remove function from file using sed or awk

I want to remove the function engine "map" { ... "foobar" ... }.
I tried in so many ways, it's so hard because it has empty lines and '}' at the end, delimiters doesn't work
mainfunc {
var = "baz"
engine "map" {
func {
var0 = "foo"
border = { 1, 1, 1, 1 }
var1 = "bar"
}
}
}
mainfunc {
var = "baz"
engine "map" {
func {
var0 = "foo"
border = { 1, 1, 1, 1 }
var1 = "foobar"
}
}
}
... # more functions like 'mainfunc'
I tried
sed '/engine/,/^\s\s}$/d' file
but removes every engine function, I just need the one containing "foobar", maybe a pattern match everything even newlines until foobar something like this:
sed '/engine(.*)foobar/,/^\s\s}$/d' file
Is it possible?

Try:
sed '/engine/{:a;N;/foobar/{N;N;d};/ }/b;ba}' filename
or:
awk '/engine/{c=1}c{b=b?b"\n"$0:$0;if(/{/)a++;if(/}/)a--;if(!a){if(b!~/foobar/)print b;c=0;b="";next}}!c' filename

I would simple count the numbers of open / close brackets when you match engine "map", cannot say if this only works in gawk
awk '
/^[ \t]*engine "map"/ {
ship=1; # ship is used as a boolean
b=0 # The factor between open / close brackets
}
ship {
b += split($0, tmp, "{"); # Count numbers of { in line
b -= split($0, tmp, "}"); # Count numbers of } in line
# If open / close brackets are equal the function ends
if(b==0) {
ship = 0;
}
# Ship the rest (printing)
next;
}
1 # Print line
' file
Split returns the number of matches: split(string, array [, fieldsep [, seps ] ]):
Divide
string into pieces defined by fieldpat
and store the pieces in array and the separator strings in the
seps array. The first piece is stored in
array[1], the second piece in array[2], and so
forth. The third argument, fieldpat, is
a regexp describing the fields in string (just as FPAT is
a regexp describing the fields in input records).
It may be either a regexp constant or a string.
If fieldpat is omitted, the value of FPAT is used.
patsplit() returns the number of elements created.

C++/CLI - Split a string with a unknown number of spaces as separator?

I'm wondering how (and in which way it's best to do it) to split a string with a unknown number of spaces as separator in C++/CLI?
Edit: The problem is that the space number is unknown, so when I try to use the split method like this:
String^ line;
StreamReader^ SCR = gcnew StreamReader("input.txt");
while ((line = SCR->ReadLine()) != nullptr && line != nullptr)
{
if (line->IndexOf(' ') != -1)
for each (String^ SCS in line->Split(nullptr, 2))
{
//Load the lines...
}
}
And this is a example how Input.txt look:
ThisISSomeTxt<space><space><space><tab>PartNumberTwo<space>PartNumber3
When I then try to run the program the first line that is loaded is "ThisISSomeTxt" the second line that is loaded is "" (nothing), the third line that is loaded is also "" (nothing), the fourth line is also "" nothing, the fifth line that is loaded is " PartNumberTwo" and the sixth line is PartNumber3.
I only want ThisISSomeTxt and PartNumberTwo to be loaded :? How can I do this?

Why not just using System::String::Split(..)?

The following code example taken from http://msdn.microsoft.com/en-us/library/b873y76a(v=vs.80).aspx#Y0 , demonstrates how you can tokenize a string with the Split method.
using namespace System;
using namespace System::Collections;
int main()
{
String^ words = "this is a list of words, with: a bit of punctuation.";
array<Char>^chars = {' ',',','->',':'};
array<String^>^split = words->Split( chars );
IEnumerator^ myEnum = split->GetEnumerator();
while ( myEnum->MoveNext() )
{
String^ s = safe_cast<String^>(myEnum->Current);
if ( !s->Trim()->Equals( "" ) )
Console::WriteLine( s );
}
}

I think you can do what you need to do with the String.Split method.
First, I think you're expecting the 'count' parameter to work differently: You're passing in 2, and expecting the first and second results to be returned, and the third result to be thrown out. What it actually return is the first result, and the second & third results concatenated into one string. If all you want is ThisISSomeTxt and PartNumberTwo, you'll want to manually throw away results after the first 2.
As far as I can tell, you don't want any whitespace included in your return strings. If that's the case, I think this is what you want:
String^ line = "ThisISSomeTxt \tPartNumberTwo PartNumber3";
array<String^>^ split = line->Split((array<String^>^)nullptr, StringSplitOptions::RemoveEmptyEntries);
for(int i = 0; i < split->Length && i < 2; i++)
{
Debug::WriteLine("{0}: '{1}'", i, split[i]);
}
Results:
0: 'ThisISSomeTxt'
1: 'PartNumberTwo'

Find Unique Characters in a File

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.
For instance, if my file were the following;
Entry
-----
Yabba
Dabba
Doo
Then the result would be
Unique characters: {abdoy}
Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.
Update
I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.
Update 2
By Fast, I mean fast to implement...not necessarily fast to run.

BASH shell script version (no sed/awk):
while read -n 1 char; do echo "$char"; done < entry.txt | tr [A-Z] [a-z] | sort -u
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
#include <iostream>
#include <set>
int main() {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
char ch;
/* ignore whitespace and case */
while ( std::cin.get(ch) ) {
if (! isspace(ch) ) {
seen_chars.insert(tolower(ch));
}
}
for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
std::cout << *iter << std::endl;
}
return 0;
}
Note that I'm ignoring whitespace and it's case insensitive as requested.
For a 450,000+ entry file (chars.txt), here's a sample run time:
[user#host]$ g++ -o unique_chars unique_chars.cpp
[user#host]$ time ./unique_chars < chars.txt
a
b
d
o
y
real 0m0.638s
user 0m0.612s
sys 0m0.017s

As requested, a pure shell-script "solution":
sed -e "s/./\0\n/g" inputfile | sort -u
It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.
For even more ridiculousness, I present the version that dumps the output on one line:
sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done

Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or generally, dictionary) implementation and just omit the value field. Use your characters as keys. These data structures generally filter out duplicate entries (hence the name set, from its mathematical usage: sets don't have a particular order and only unique values).

Quick and dirty C program that's blazingly fast:
#include <stdio.h>
int main(void)
{
int chars[256] = {0}, c;
while((c = getchar()) != EOF)
chars[c] = 1;
for(c = 32; c < 127; c++) // printable chars only
{
if(chars[c])
putchar(c);
}
putchar('\n');
return 0;
}
Compile it, then do
cat file | ./a.out
To get a list of the unique printable characters in file.

Python w/sets (quick and dirty)
s = open("data.txt", "r").read()
print "Unique Characters: {%s}" % ''.join(set(s))
Python w/sets (with nicer output)
import re
text = open("data.txt", "r").read().lower()
unique = re.sub('\W, '', ''.join(set(text))) # Ignore non-alphanumeric
print "Unique Characters: {%s}" % unique

Here's a PowerShell example:
gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique
which produces:
D
Y
a
b
o
I like that it's easy to read.
EDIT: Here's a faster version:
$letters = #{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys

A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.
Why the arbitrary limitation that you need a "script" that does it?
What exactly is a script anyway?
Would Python do?
If so, then this is one solution:
import sys;
s = set([]);
while True:
line = sys.stdin.readline();
if not line:
break;
line = line.rstrip();
for c in line.lower():
s.add(c);
print("".join(sorted(s)));

Algorithm:
Slurp the file into memory.
Create an array of unsigned ints, initialized to zero.
Iterate though the in memory file, using each byte as a subscript into the array.
increment that array element.
Discard the in memory file
Iterate the array of unsigned int
if the count is not zero,
display the character, and its corresponding count.

cat yourfile |
perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'

Alternative solution using bash:
sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$
EDIT Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick but then, this solution has no real advantage to saua's (especially since he now uses the same sed pattern instead of explicit captures).

While not an script this java program will do the work. It's easy to understand an fast ( to run )
import java.util.*;
import java.io.*;
public class Unique {
public static void main( String [] args ) throws IOException {
int c = 0;
Set s = new TreeSet();
while( ( c = System.in.read() ) > 0 ) {
s.add( Character.toLowerCase((char)c));
}
System.out.println( "Unique characters:" + s );
}
}
You'll invoke it like this:
type yourFile | java Unique
or
cat yourFile | java Unique
For instance, the unique characters in the HTML of this question are:
Unique characters:[ , , , , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, #, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]

Print unique characters (ASCII and Unicode UTF-8)
import codecs
file = codecs.open('my_file_name', encoding='utf-8')
# Runtime: O(1)
letters = set()
# Runtime: O(n^2)
for line in file:
for character in line:
letters.add(character)
# Runtime: O(n)
letter_str = ''.join(letters)
print(letter_str)
Save as unique.py, and run as python unique.py.

in c++ i would first loop through the letters in the alphabet then run a strchr() on each with the file as a string. this will tell you if that letter exists, then just add it to the list.

Try this file with JSDB Javascript (includes the javascript engine in the Firefox browser):
var seenAlreadyMap={};
var seenAlreadyArray=[];
while (!system.stdin.eof)
{
var L = system.stdin.readLine();
for (var i = L.length; i-- > 0; )
{
var c = L[i].toLowerCase();
if (!(c in seenAlreadyMap))
{
seenAlreadyMap[c] = true;
seenAlreadyArray.push(c);
}
}
}
system.stdout.writeln(seenAlreadyArray.sort().join(''));

Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted a set is probably more efficient than a dictionary. However both are supposed to take constant time to access items. And both run circles around a list where for each character you search the list to see if the character is already in the list or not. Also Lists and Dictionaries are built in Python datatatypes that everyone should be using all the time. So even if set doesn't come to mind, dictionary should.
file = open('location.txt', 'r')
letters = {}
for line in file:
if line == "":
break
for character in line.strip():
if character not in letters:
letters[character] = True
file.close()
print "Unique Characters: {" + "".join(letters.keys()) + "}"

A C solution. Admittedly it is not the fastest to code solution in the world. But since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions so I wanted to post one for the pure sadistic pleasure :)
#include<stdio.h>
#define CHARSINSET 256
#define FILENAME "location.txt"
char buf[CHARSINSET + 1];
char *getUniqueCharacters(int *charactersInFile) {
int x;
char *bufptr = buf;
for (x = 0; x< CHARSINSET;x++) {
if (charactersInFile[x] > 0)
*bufptr++ = (char)x;
}
bufptr = '\0';
return buf;
}
int main() {
FILE *fp;
char c;
int *charactersInFile = calloc(sizeof(int), CHARSINSET);
if (NULL == (fp = fopen(FILENAME, "rt"))) {
printf ("File not found.\n");
return 1;
}
while(1) {
c = getc(fp);
if (c == EOF) {
break;
}
if (c != '\n' && c != '\r')
charactersInFile[c]++;
}
fclose(fp);
printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
return 0;
}

Quick and dirty solution using grep (assuming the file name is "file"):
for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
if [ ! -z "`grep -li $char file`" ]; then
echo -n $char;
fi;
done;
echo
I could have made it a one-liner but just want to make it easier to read.
(EDIT: forgot the -i switch to grep)

Well my friend, I think this is what you had in mind....At least this is the python version!!!
f = open("location.txt", "r") # open file
ll = sorted(list(f.read().lower())) #Read file into memory, split into individual characters, sort list
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
f.close()
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return
It does not iterate through each character, it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM) but for shorter files it is fun :)
I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, I also reduced the active code from 3 lines to 2. Basically if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports.....Anyway now we are having fun :)
import itertools, sys
# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x:x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print "Unique Characters: {%s}" % "".join(ll) #print list of characters, carriage return will throw in a return

This answer above mentioned using a dictionary.
If so, the code presented there can be streamlined a bit, since the Python documentation states:
It is best to think of a dictionary as
an unordered set of key: value pairs,
with the requirement that the keys are
unique (within one dictionary).... If
you store using a key that is already
in use, the old value associated with
that key is forgotten.
Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:
if character not in letters:
And that should make it a little faster.

Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code
using System;
using System.IO;
using System.Collections;
using System.Diagnostics;
namespace ConsoleApplication {
class Program {
static void Main(string[] args) {
FileInfo fileInfo = new FileInfo(#"C:/data.txt");
Console.WriteLine(fileInfo.Length);
Stopwatch sw = new Stopwatch();
sw.Start();
Hashtable table = new Hashtable();
StreamReader sr = new StreamReader(#"C:/data.txt");
while (!sr.EndOfStream) {
char c = Char.ToLower((char)sr.Read());
if (!table.Contains(c)) {
table.Add(c, null);
}
}
sr.Close();
foreach (char c in table.Keys) {
Console.Write(c);
}
Console.WriteLine();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
}
}
produces output
4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .
The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.

s=open("text.txt","r").read()
l= len(s)
unique ={}
for i in range(l):
if unique.has_key(s[i]):
unique[s[i]]=unique[s[i]]+1
else:
unique[s[i]]=1
print unique

Python without using a set.
file = open('location', 'r')
letters = []
for line in file:
for character in line:
if character not in letters:
letters.append(character)
print(letters)

Old question, I know, but here's a fast solution, meaning it runs fast, and it's probably also pretty fast to code if you know how to copy/paste ;)
BACKGROUND
I had a huge csv file (12 GB, 1.34 million lines, 12.72 billion characters) that I was loading into postgres that was failing because it had some "bad" characters in it, so naturally I was trying to find a character not in that file that I could use as a quote character.
1. First try: Jay's C++ solution
I started with #jay's C++ answer:
(Note: all of these code examples were compiled with g++ -O2 uniqchars.cpp -o uniqchars)
#include <iostream>
#include <set>
int main() {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
char ch;
/* ignore whitespace and case */
while ( std::cin.get(ch) ) {
if (! isspace(ch) ) {
seen_chars.insert(tolower(ch));
}
}
for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
std::cout << *iter << std::endl;
}
return 0;
}
Timing for this one:
real 10m55.026s
user 10m51.691s
sys 0m3.329s
2. Read entire file at once
I figured it'd be more efficient to read in the entire file into memory at once, rather than all those calls to cin.get(). This reduced the run time by more than half.
(I also added a filename as a command line argument, and made it print out the characters separated by spaces instead of newlines).
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
int main(int argc, char **argv) {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
std::ifstream ifs(argv[1]);
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
fprintf(stderr, "Size of file: %lu\n", size);
std::string str(size, ' ');
ifs.seekg(0);
ifs.read(&str[0], size);
/* ignore whitespace and case */
for (char& ch : str) {
if (!isspace(ch)) {
seen_chars.insert(tolower(ch));
}
}
for(iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
std::cout << *iter << " ";
}
std::cout << std::endl;
return 0;
}
Timing for this one:
real 4m41.910s
user 3m32.014s
sys 0m17.858s
3. Remove isspace() check and tolower()
Besides the set insert, isspace() and tolower() are the only things happening in the for loop, so I figured I'd remove them. It shaved off another 1.5 minutes.
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
int main(int argc, char **argv) {
std::set<char> seen_chars;
std::set<char>::const_iterator iter;
std::ifstream ifs(argv[1]);
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
fprintf(stderr, "Size of file: %lu\n", size);
std::string str(size, ' ');
ifs.seekg(0);
ifs.read(&str[0], size);
for (char& ch : str) {
// removed isspace() and tolower()
seen_chars.insert(ch);
}
for(iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
std::cout << *iter << " ";
}
std::cout << std::endl;
return 0;
}
Timing for final version:
real 3m12.397s
user 2m58.771s
sys 0m13.624s

The simple solution from #Triptych helped me already (my input was a file of 124 MB in size, so this approach to read the entire contents into memory still worked).
However, I had a problem with encoding, python didn't interpret the UTF8 encoded input correctly. So here's a slightly modified version which works for UTF8 encoded files (and also sorts the collected characters in the output):
import io
with io.open("my-file.csv",'r',encoding='utf8') as f:
text = f.read()
print "Unique Characters: {%s}" % ''.join(sorted(set(text)))

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Print specific lines from file - search

You can do this : awk '/OBJ O2/,/^};$/' file

An awk alternative (without range expression), cf. Is a /start/,/end/ range expression ever useful in awk? awk '/^OBJ O2/{f=1} f{print; if(/};/) f=0}' file

Related

How to sort the characters of a word using awk?

Relocation strings using awk/sed from a index file

Remove function from file using sed or awk

C++/CLI - Split a string with a unknown number of spaces as separator?

Find Unique Characters in a File

Categories

Resources