Running statistics on multiple lines in bash - linux
I have multiple HTTP headers in one giant file, separated with one empty line.
Host
Connection
Accept
From
User-Agent
Accept-Encoding
Host
Connection
Accept
From
User-Agent
Accept-Encoding
X-Forwarded-For
cookie
Cache-Control
referer
x-fb-sim-hni
Host
Accept
user-agent
x-fb-net-sid
x-fb-net-hni
X-Purpose
accept-encoding
x-fb-http-engine
Connection
User-Agent
Host
Connection
Accept-Encoding
I have approximately 10,000,000 such header blocks, each separated by an empty line.
If I want to discover trends, like header order, I'd like to aggregate each block into a one-liner (how can I join the lines of each block, up to the empty line, into one line, and do that separately for every block?):
Host,Connection,Accept,From,User-Agent,Accept-Encoding
and follow with: uniq -c | sort -nk1
so I could receive:
197897 Host,Connection,Accept,From,User-Agent,Accept-Encoding
8732233 User-Agent,Host,Connection,Accept-Encoding
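A minimal sketch of the transformation I have in mind, on made-up sample data:

```shell
# Two tiny header blocks, separated by an empty line (made-up data).
printf 'Host\nConnection\nAccept\n\nUser-Agent\nHost\n' > sample.txt

# Paragraph mode: RS="" makes each blank-line-separated block one record
# and FS="\n" makes each header line a field; rebuilding the record with
# OFS="," gives one comma-joined line per block, ready for uniq -c | sort.
awk 'BEGIN{RS=""; FS="\n"; OFS=","} {$1=$1; print}' sample.txt |
  sort | uniq -c | sort -rn
```

Each output line is then a count followed by one block's header order.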
What would be the best and most efficient approach to parse that massive file and get that data?
Thanks for any hints.
Using GNU awk for sorted_in, all you need is:
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="," }
{ $1=$1; cnt[$0]++ }
END {
    PROCINFO["sorted_in"] = "#val_num_desc"
    for (rec in cnt) {
        print cnt[rec] " " rec
    }
}
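The `$1=$1` assignment is the step that rebuilds each record with the new OFS; a quick illustration on a toy two-block input (made-up data):

```shell
# In paragraph mode (RS=""), each blank-line-separated block is one record
# and FS="\n" makes each header line a field. Assigning $1=$1 forces awk
# to rebuild $0, replacing the field separators with OFS (a comma).
printf 'Host\nAccept\n\nHost\nConnection\n' |
  awk 'BEGIN{RS=""; FS="\n"; OFS=","} {$1=$1; print}'
# prints:
# Host,Accept
# Host,Connection
```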
After running dos2unix on the sample you posted (1.5milGETs.txt):
$ time awk -f tst.awk 1.5milGETs.txt > ou.awk
real 0m4.898s
user 0m4.758s
sys 0m0.108s
$ head -10 ou.awk
71639 Host,Accept,User-Agent,Pragma,Connection
70975 Host,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent
40781 Host,Accept,User-Agent,Pragma,nnCoection,Connection,X-Forwarded-For
35485 Accept,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent,Accept-Language,UA-CPU,Accept-Encoding,Host,Connection
34005 User-Agent,Host,Connection,Accept-Encoding
30668 Host,User-Agent,Accept-Encoding,Connection
25547 Host,Accept,Accept-Language,Connection,Accept-Encoding,User-Agent
22581 Host,User-Agent,Accept,Accept-Encoding
19311 Host,Connection,Accept,From,User-Agent,Accept-Encoding
14694 Host,Connection,User-Agent,Accept,Referer,Accept-Encoding,Accept-Language,Cookie
Here's an answer written in (POSIX) C, which as far as I can tell does what the OP wants. The C solution seems to be faster than an awk-based solution. Whether that's useful depends on how often the program is run and on the input data.
The main takeaways:
The program memory maps the input file and alters the mapped copy.
It replaces newline characters with commas where appropriate, and
newlines with nul characters to separate each entry in the input
file. IOW, foo\nbar\n\nbaz\n becomes foo,bar\0baz\0.
The program also builds a table of entries, which is just an array of
char-pointers into the memory mapped file.
The program sorts the entries using standard string functions, but moves only the pointer values, not the actual data.
Then the program creates a new array of unique entries and counts how many instances there are of each string. (This part can probably be made a bit faster.)
The array of unique entries is then sorted in descending order.
Finally, the program prints the contents of the unique array.
Anyway, here's the code. (Disclaimer: it's written to be postable here on SO.)
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
struct uniq {
char *val;
size_t count;
};
struct entry {
char *val;
};
// Some globals
size_t g_filesize;
char* g_baseaddr;
struct entry *g_entries;
size_t g_entrysize, g_entrycapacity;
struct uniq *g_unique;
size_t g_uniquesize, g_uniquecapacity;
static inline void mapfile(const char *filename)
{
int fd;
struct stat st;
if ((fd = open(filename, O_RDWR)) == -1 || fstat(fd, &st)) {
perror(filename);
exit(__LINE__);
}
g_baseaddr = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
if (g_baseaddr == (void *)MAP_FAILED) {
perror(filename);
close(fd);
exit(__LINE__);
}
close(fd);
g_filesize = st.st_size;
}
// Guestimate how many entries we have. We do this only to avoid early
// reallocs, so this isn't that important. Let's say 100 bytes per entry.
static inline void setup_entry_table(void)
{
g_entrycapacity = g_filesize / 100;
g_entrysize = 0;
size_t cb = sizeof *g_entries * g_entrycapacity;
if ((g_entries = malloc(cb)) == NULL)
exit(__LINE__);
memset(g_entries, 0, cb);
}
static inline void realloc_if_needed(void)
{
if (g_entrysize == g_entrycapacity) {
size_t newcap = g_entrycapacity * 2;
size_t cb = newcap * sizeof *g_entries;
struct entry *tmp = realloc(g_entries, cb);
if (tmp == NULL)
exit(__LINE__);
g_entries = tmp;
g_entrycapacity = newcap;
}
}
static inline void add_entry(char *p)
{
realloc_if_needed();
g_entries[g_entrysize].val = p;
g_entrysize++;
}
// Convert input data to proper entries by replacing \n with either
// ',' or \0. We add \0 to separate the entries.
static inline void convert_to_entries(void)
{
char *endaddr = g_baseaddr + g_filesize;
char *prev, *s = g_baseaddr;
// First entry
prev = s;
while(s < endaddr) {
char *nl = strchr(s, '\n');
if (nl == s) {
if (nl - prev > 0) // Skip empty strings
add_entry(prev);
*nl = '\0'; // Terminate entry
s = nl + 1; // Skip to first byte after \0
prev = s; // This is the start of the 'previous' record
}
else {
*nl = ','; // Replace \n with comma
s = nl + 1; // Move pointer forward (optimization).
if (*s == '\n')
*(s - 1) = '\0';// Don't add trailing comma
}
}
if (prev < s)
add_entry(prev); // Don't forget last entry
}
static int entrycmp(const void *v1, const void *v2)
{
const struct entry *p1 = v1, *p2 = v2;
return strcmp(p1->val, p2->val);
}
// Sort the entries so the pointers point to a sorted list of strings.
static inline void sort_entries(void)
{
qsort(g_entries, g_entrysize, sizeof *g_entries, entrycmp);
}
// We keep things really simple and allocate one unique entry for each
// entry. That's the worst case anyway and then we don't have to test
// for reallocation.
static inline void setup_unique_table(void)
{
size_t cb = sizeof *g_unique * g_entrysize;
if ((g_unique = malloc(cb)) == NULL)
exit(__LINE__);
g_uniquesize = 0;
g_uniquecapacity = g_entrysize;
}
static inline void add_unique(char *s)
{
g_unique[g_uniquesize].val = s;
g_unique[g_uniquesize].count = 1;
g_uniquesize++;
}
// Now count and skip duplicate entries.
// How? Just iterate over the entries table and find duplicates.
// For each duplicate, increment count. For each non-dup,
// add a new entry.
static inline void find_unique_entries(void)
{
char *last = g_entries[0].val;
add_unique(last);
for (size_t i = 1; i < g_entrysize; i++) {
if (strcmp(g_entries[i].val, last) == 0) {
g_unique[g_uniquesize - 1].count++; // Increment the last added entry's count
}
else {
last = g_entries[i].val;
add_unique(last);
}
}
}
static inline void print_unique_entries(void)
{
for (size_t i = 0; i < g_uniquesize; i++)
printf("%zu %s\n", g_unique[i].count, g_unique[i].val);
}
static inline void print_entries(void)
{
for (size_t i = 0; i < g_entrysize; i++)
printf("%s\n", g_entries[i].val);
}
static int uniquecmp(const void *v1, const void *v2)
{
const struct uniq *p1 = v1, *p2 = v2;
return (int)p2->count - (int)p1->count;
}
static inline void sort_unique_entries(void)
{
qsort(g_unique, g_uniquesize, sizeof *g_unique, uniquecmp);
}
int main(int argc, char *argv[])
{
if (argc != 2) {
fprintf(stderr, "USAGE: %s filename\n", argv[0]);
exit(__LINE__);
}
mapfile(argv[1]);
setup_entry_table();
convert_to_entries();
if (g_entrysize == 0) // no entries in file.
exit(0);
sort_entries();
setup_unique_table();
find_unique_entries();
sort_unique_entries();
if (0) print_entries();
if (1) print_unique_entries();
// cleanup
free(g_entries);
free(g_unique);
munmap(g_baseaddr, g_filesize);
exit(0);
}
Using your 1.5milGETs.txt file (and converting the triple \n\n\n to \n\n to separate blocks), you can use Ruby in paragraph mode:
$ ruby -F'\n' -lane 'BEGIN{h=Hash.new(0); $/=""
def commafy(n)
n.to_s.reverse.gsub(/...(?=.)/,"\\&,").reverse
end
}
h[$F.join(",")]+=1
# p $_
END{ printf "Total blocks: %s\n", commafy(h.values.sum)
h2=h.sort_by {|k,v| -v}
h2[0..10].map {|k,v| printf "%10s %s\n", commafy(v), k}
}' 1.5milGETs.txt
That prints the total number of blocks, sorts them large->small, and prints the top 11 entries (h2[0..10] is inclusive).
Prints:
Total blocks: 1,262,522
71,639 Host,Accept,User-Agent,Pragma,Connection
70,975 Host,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent
40,781 Host,Accept,User-Agent,Pragma,nnCoection,Connection,X-Forwarded-For
35,485 Accept,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent,Accept-Language,UA-CPU,Accept-Encoding,Host,Connection
34,005 User-Agent,Host,Connection,Accept-Encoding
30,668 Host,User-Agent,Accept-Encoding,Connection
25,547 Host,Accept,Accept-Language,Connection,Accept-Encoding,User-Agent
22,581 Host,User-Agent,Accept,Accept-Encoding
19,311 Host,Connection,Accept,From,User-Agent,Accept-Encoding
14,694 Host,Connection,User-Agent,Accept,Referer,Accept-Encoding,Accept-Language,Cookie
12,290 Host,User-Agent,Accept-Encoding
That takes about 8 seconds on a 6 year old Mac.
Awk will be 3x faster and entirely appropriate for this job.
Ruby will give you more output options and easier analysis of the data. You can create interactive HTML documents; output JSON, quoted CSV, or XML trivially; interact with a database; invert keys and values in a statement; filter; and so on.
Personally, I'd use a C program, but other alternatives exist as well. Here's an awk snippet which folds the lines of each block into one. Not perfect, but it should get you started :)
$ cat foo.awk
{
    if (NF == 0)
        printf("\n");
    else
        printf("%s ", $0);
}
$ awk -f foo.awk < lots_of_data | sort | uniq -c | sort -nk1
The last statement will take "forever", which is why a C program may be a good alternative. It depends mostly on how often you need to run the commands.
If you have enough memory (10M records at about 80 chars per record in your sample is roughly 800 MB, and since you're counting them I assume there are lots of duplicates), you could hash the records in memory and count while hashing:
$ awk 'BEGIN{ RS=""; OFS=","}
{
b="" # reset buffer b
for(i=1;i<=NF;i++) # for every header element in record
b=b (b==""?"":OFS) $i # buffer them and comma separate
a[b]++ # hash to a, counting
}
END { # in the end
for(i in a) # go thru the a hash
print a[i] " " i} # print counts and records
' file
1 Host,Connection,Accept,From,User-Agent,Accept-Encoding
1 cookie,Cache-Control,referer,x-fb-sim-hni,Host,Accept,user-agent,x-fb-net-sid,x-fb-net-hni,X-Purpose,accept-encoding,x-fb-http-engine,Connection
1 User-Agent,Host,Connection,Accept-Encoding
1 Host,Connection,Accept,From,User-Agent,Accept-Encoding,X-Forwarded-For
Output order is random due to the nature of for (i in a), so sort the output afterwards however you please.
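For example (the input file name is hypothetical), piping the same counting logic through sort -rn puts the most frequent patterns first:

```shell
# Same counting logic as above, condensed to one line; sort -rn then
# orders the "count pattern" lines numerically, most frequent first.
awk 'BEGIN{RS="";OFS=","}{b="";for(i=1;i<=NF;i++)b=b (b==""?"":OFS) $i;a[b]++}END{for(i in a)print a[i] " " i}' file |
  sort -rn
```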
Edit:
As @dawg kindly pointed out in the comments, $1=$1 is enough to rebuild the record into comma-separated form:
$ awk 'BEGIN{ RS=""; OFS=","}
{
$1=$1 # rebuild the record
a[$0]++ # hash $0 to a, counting
}
END { # in the end
for(i in a) # go thru the a hash
print a[i] " " i} # print counts and records
' file
Related
Why I cant print initials of a name?
I am trying to write a code that prints initials of a given name? I have been getting error whenever I use '/0' - to initialize the ith character of a given string. I am using it to identify initials of the 2nd word? Any suggestions to detect the 2nd word and print the initial of the 2nd word? Additionally, I am trying to write this code so that it ignores any additional spaces: #include <stdio.h> #include <cs50.h> #include <string.h> #include <ctype.h> int main(void) { printf ("Enter you Name:");//print name string s = get_string(); //get input from user printf("%c", toupper(s[0])); // use placeholder to call the string for (int i=0, n = strlen(s); i < n; i++) { if (s[i] == '/0') { printf("%c", toupper(s[i+1])); } } printf ("\n"); }
I'm not sure what your get_string() function does but this does what you're asking for. int main(void) { printf ("Enter you Name:");//print name char input[100]; cin.getline(input, sizeof(input)); string s(input); //get input from user printf("%c", toupper(s[0])); // use placeholder to call the string for (int i=0, n = s.length(); i < n; i++) { if (isspace(s.at(i))) { printf("%c", toupper(s[i+1])); } } printf ("\n"); }
Try to use ' ' instead of '/0'. if (s[i] == ' ') or if you prefer ASCII codes (Space is 0x20 or 32 as decimal): if (s[i] == 0x20) your code will work if there is only 1 space between names. To avoid this, the condition should also check the next char. If it's not a space then probably it's a 'letter': if (s[i] == 0x20 && s[i+1] != 0x20) {... Note that the for cycle should stop at i < n - 1, otherwise the loop will fail if there are trailing spaces...
How to return a int converted to char array back to main for displaying it
My doubts are as follows : 1 : how to send 'str' from function 'fun' , So that i can display it in main function. 2 : And is the return type correct in the code ? 2 : the current code is displaying some different output. char * fun(int *arr) { char *str[5]; int i; for(i=0;i<5;i++) { char c[sizeof(int)] ; sprintf(c,"%d",arr[i]); str[i] = malloc(sizeof(c)); strcpy(str[i],c); } return str; } int main() { int arr[] = {2,1,3,4,5},i; char *str = fun(arr); for(i=0;i<5;i++) { printf("%c",str[i]); } return 0; }
how to send 'str' from function 'fun' , So that i can display it in main function. This is the way: char* str = malloc( size ); if( str == NULL ) { fprintf( stderr,"Failed to malloc\n"); } /* Do stuff with str, use str[index], * remember to free it in main*/ free(str); And is the return type correct in the code ? No, Probably char** is the one you need to return. the current code is displaying some different output. Consider explaining what/why do you want to do ? The way you have written, seems completely messed up way to me. You're passing array of integer but not its length. How is the fun() supposed to know length of array? Another problem is array of pointers in fun(). You can't write a int to a char (See the both size). So I used char array instead. However, I'm not sure if this is what you want to do (might be a quick and dirty way of doing it): #include <stdio.h> #include <stdlib.h> #include <string.h> char** fun(int *arr, int size) { char **str = malloc( sizeof(char*)*size ); if( str == NULL ) { fprintf( stderr, "Failed malloc\n"); } int i; for(i=0;i<5;i++) { str[i] = malloc(sizeof(int)); if( str == NULL ) { fprintf( stderr, "Failed malloc\n"); } sprintf(str[i],"%d",arr[i]); } return str; } int main() { int arr[] = {2,1,3,4,5},i; char **str = fun(arr, 5); for(i=0;i<5;i++) { printf("%s\n",str[i]); free(str[i]); } free(str); return 0; }
I made these changes to your code to get it working: #include <stdio.h> #include <stdlib.h> #include <string.h> char **fun(int *arr) { char **str = malloc(sizeof(char *) * 5); int i; for(i = 0; i < 5; i++) { if ((arr[i] >= 0) && (arr[i] <= 9)) { char c[2] ; sprintf(c, "%d", arr[i]); str[i] = (char *) malloc(strlen(c) + 1); strcpy(str[i],c); } } return str; } int main() { int arr[] = {2, 1, 3, 4, 5}, i; char **str = fun(arr); for(i = 0; i < 5; i++) { printf("%s", str[i]); free(str[i]); } printf("\n"); free(str); return 0; } Output 21345 I added a check to make sure that arr[i] is a single digit number. Also, returning a pointer to a stack variable will result in undefined behavior, so I changed the code to allocate an array of strings. I don't check the return value of the malloc calls, which means this program could crash due to a NULL pointer reference.
This solution differs from the others in that it attempts to answer your question based on the intended use. how to send 'str' from function 'fun' , So that i can display it in main function. First, you need to define a function that returns a pointer to array. char (*fun(int arr[]))[] Allocating variable length strings doesn't buy you anything. The longest string you'll need for 64bit unsigned int is 20 digits. All you need is to allocate an array of 5 elements of 2 characters long each. You may adjust the length to suit your need. This sample assumes 1 digit and 1 null character. Note the allocation is done only once. You may choose to use the length of 21 (20 digits and 1 null). For readability on which values here are related to the number of digits including the terminator, I'll define a macro that you can modify to suit your needs. #define NUM_OF_DIGITS 3 You can then use this macro in the whole code. char (*str)[NUM_OF_DIGITS] = malloc(5 * NUM_OF_DIGITS); Finally the receiving variable in main() can be declared and assigned the returned array. char (*str)[NUM_OF_DIGITS] = fun(arr); Your complete code should look like this: Code char (*fun(int arr[]))[] { char (*str)[NUM_OF_DIGITS] = malloc(5 * NUM_OF_DIGITS); int i; for(i=0;i<5;i++) { snprintf(str[i],NUM_OF_DIGITS,"%d",arr[i]); //control and limit to single digit + null } return str; } int main() { int arr[] = {24,1,33,4,5},i; char (*str)[NUM_OF_DIGITS] = fun(arr); for(i=0;i<5;i++) { printf("%s",str[i]); } free(str); return 0; } Output 2413345 With this method you only need to free the allocated memory once.
Vigenere.c CS50 Floating Point Exception (Core Dumped)
I am working on the Vigenere exercise from Harvard's CS50 (in case you noticed I'm using string and not str). My program gives me a Floating Point Exception error when I use "a" in the keyword. It actually gives me that error when I use "a" by itself, and when I use "a" within a bigger word it just gives me wrong output. For any other kind of keyword, the program works perfectly fine. I've run a million tests. Why is it doing this? I can't see where I'm dividing or % by 0. The length of the keyword is always at least 1. It is probably going to be some super simple mistake, but I've been at this for about 10 hours and I can barely remember my name. #include <stdio.h> #include <cs50.h> #include <stdlib.h> #include <ctype.h> #include <string.h> int main (int argc, string argv[]) { //Error message if argc is not 2 and argv[1] is not alphabetical if (argc != 2) { printf("Insert './vigenere' followed by an all alphabetical key\n"); return 1; } else if (argv[1]) { for (int i = 0, n = strlen(argv[1]); i < n; i++) { if (isalpha((argv[1])[i]) == false) { printf("Insert './vigenere' followed by an all alphabetical key\n"); return 1; } } //Store keyword in variable string keyword = argv[1]; //Convert all capital chars in keyword to lowercase values, then converts them to alphabetical corresponding number for (int i = 0, n = strlen(keyword); i < n; i++) { if (isupper(keyword[i])) { keyword[i] += 32; } keyword[i] -= 97; } //Ask for users message string message = GetString(); int counter = 0; int keywordLength = strlen(keyword); //Iterate through each of the message's chars for (int i = 0, n = strlen(message); i < n; i++) { //Check if ith char is a letter if (isalpha(message[i])) { int index = counter % keywordLength; if (isupper(message[i])) { char letter = (((message[i] - 65) + (keyword[index])) % 26) + 65; printf("%c", letter); counter++; } else if (islower(message[i])) { char letter = (((message[i] - 97) + (keyword[index])) % 26) + 97; printf("%c", letter); counter++; } } 
else { //Prints non alphabetic characters printf("%c", message[i]); } } printf("\n"); return 0; } }
This behavior is caused by the line keyword[i] -= 97;, there you make every 'a' in the key stream a zero. Later you use strlen() on the transformed key. So when the key starts with an 'a', keywordLength therefor is set to zero, and the modulo keywordLength operation get into a division by zero. You can fix this by calculating the keyword length before the key transformation.
Converting a 1D pointer array (char) into a 2D pointer array (char) in Visual C++.
I am new to c++ programming I have to call a function with following arguments. int Start (int argc, char **argv). When I try to call the above function with the code below I get run time exceptions. Can some one help me out in resolving the above problem. char * filename=NULL; char **Argument1=NULL; int Argument=0; int j = 0; int k = 0; int i=0; int Arg() { filename = "Globuss -dc bird.jpg\0"; for(i=0;filename[i]!=NULL;i++) { if ((const char *)filename[i]!=" ") { Argument1[j][k++] = NULL; // Here I get An unhandled // exception of type //'System.NullReferenceException' // occurred j++; k=0; } else { (const char )Argument1[j][k] = filename [j]; // Here I also i get exception k++; Argument++; } } Argument ++; return 0; } Start (Argument,Argument1);
Two things: char **Argument1=NULL; This is pointer to pointer, You need to allocate it with some space in memory. *Argument1 = new char[10]; for(i=0, i<10; ++i) Argument[i] = new char(); Don't forget to delete in the same style.
You appear to have no allocated any memory to you arrays, you just have a NULL pointer char * filename=NULL; char **Argument1=NULL; int Argument=0; int j = 0; int k = 0; int i=0; int Arg() { filename = "Globuss -dc bird.jpg\0"; //I dont' know why you have 2D here, you are going to need to allocate //sizes for both parts of the 2D array **Argument1 = new char *[TotalFileNames]; for(int x = 0; x < TotalFileNames; x++) Argument1[x] = new char[SIZE_OF_WHAT_YOU_NEED]; for(i=0;filename[i]!=NULL;i++) { if ((const char *)filename[i]!=" ") { Argument1[j][k++] = NULL; // Here I get An unhandled // exception of type //'System.NullReferenceException' // occurred j++; k=0; } else { (const char )Argument1[j][k] = filename [j]; // Here I also i get exception k++; Argument++; } } Argument ++; return 0; }
The first thing you have to do is to find the number of the strings you will have. Thats easy done with something like: int len = strlen(filename); int numwords = 1; for(i = 0; i < len; i++) { if(filename[i] == ' ') { numwords++; // eating up all spaces to not count following ' ' // dont checking if i exceeds len, because it will auto-stop at '\0' while(filename[i] == ' ') i++; } } In the above code i assume there will be at least one word in the filename (i.e. it wont be an empty string). Now you can allocate memory for Argument1. Argument1 = new char *[numwords]; After that you have two options: use strtok (http://www.cplusplus.com/reference/clibrary/cstring/strtok/) implement your function to split a string That can be done like this: int i,cur,last; for(i = last = cur = 0; cur < len; cur++) { while(filename[last] == ' ') { // last should never be ' ' last++; } if(filename[cur] == ' ') { if(last < cur) { Argument1[i] = new char[cur-last+1]; // +1 for string termination '\0' strncpy(Argument1[i], &filename[last], cur-last); last = cur; } } } The above code is not optimized, i just tried to make it as easy as possible to understand. I also did not test it, but it should work. Assumptions i made: string is null terminated there is at least 1 word in the string. Also whenever im referring to a string, i mean a char array :P Some mistakes i noticed in your code: in c/c++ " " is a pointer to a const char array which contains a space. If you compare it with another " " you will compare the pointers to them. They may (and probably will) be different. Use strcmp (http://www.cplusplus.com/reference/clibrary/cstring/strcmp/) for that. You should learn how to allocate dynamically memory. In c you can do it with malloc, in c++ with malloc and new (better use new instead of malloc). Hope i helped! PS if there is an error in my code tell me and ill fix it.
Find Unique Characters in a File
I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file. For instance, if my file were the following; Entry ----- Yabba Dabba Doo Then the result would be Unique characters: {abdoy} Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve. Update I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution. Update 2 By Fast, I mean fast to implement...not necessarily fast to run.
BASH shell script version (no sed/awk): while read -n 1 char; do echo "$char"; done < entry.txt | tr [A-Z] [a-z] | sort -u UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries. #include <iostream> #include <set> int main() { std::set<char> seen_chars; std::set<char>::const_iterator iter; char ch; /* ignore whitespace and case */ while ( std::cin.get(ch) ) { if (! isspace(ch) ) { seen_chars.insert(tolower(ch)); } } for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) { std::cout << *iter << std::endl; } return 0; } Note that I'm ignoring whitespace and it's case insensitive as requested. For a 450,000+ entry file (chars.txt), here's a sample run time: [user#host]$ g++ -o unique_chars unique_chars.cpp [user#host]$ time ./unique_chars < chars.txt a b d o y real 0m0.638s user 0m0.612s sys 0m0.017s
As requested, a pure shell-script "solution": sed -e "s/./\0\n/g" inputfile | sort -u It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly. For even more ridiculousness, I present the version that dumps the output on one line: sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or generally, dictionary) implementation and just omit the value field. Use your characters as keys. These data structures generally filter out duplicate entries (hence the name set, from its mathematical usage: sets don't have a particular order and only unique values).
Quick and dirty C program that's blazingly fast: #include <stdio.h> int main(void) { int chars[256] = {0}, c; while((c = getchar()) != EOF) chars[c] = 1; for(c = 32; c < 127; c++) // printable chars only { if(chars[c]) putchar(c); } putchar('\n'); return 0; } Compile it, then do cat file | ./a.out To get a list of the unique printable characters in file.
Python w/sets (quick and dirty) s = open("data.txt", "r").read() print "Unique Characters: {%s}" % ''.join(set(s)) Python w/sets (with nicer output) import re text = open("data.txt", "r").read().lower() unique = re.sub('\W, '', ''.join(set(text))) # Ignore non-alphanumeric print "Unique Characters: {%s}" % unique
Here's a PowerShell example: gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique which produces: D Y a b o I like that it's easy to read. EDIT: Here's a faster version: $letters = #{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result. Why the arbitrary limitation that you need a "script" that does it? What exactly is a script anyway? Would Python do? If so, then this is one solution: import sys; s = set([]); while True: line = sys.stdin.readline(); if not line: break; line = line.rstrip(); for c in line.lower(): s.add(c); print("".join(sorted(s)));
Algorithm: Slurp the file into memory. Create an array of unsigned ints, initialized to zero. Iterate though the in memory file, using each byte as a subscript into the array. increment that array element. Discard the in memory file Iterate the array of unsigned int if the count is not zero, display the character, and its corresponding count.
cat yourfile | perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
Alternative solution using bash: sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$ EDIT Sorry, I actually misread the question. The above code counts the unique characters. Just omitting the c switch at the end obviously does the trick but then, this solution has no real advantage to saua's (especially since he now uses the same sed pattern instead of explicit captures).
While not an script this java program will do the work. It's easy to understand an fast ( to run ) import java.util.*; import java.io.*; public class Unique { public static void main( String [] args ) throws IOException { int c = 0; Set s = new TreeSet(); while( ( c = System.in.read() ) > 0 ) { s.add( Character.toLowerCase((char)c)); } System.out.println( "Unique characters:" + s ); } } You'll invoke it like this: type yourFile | java Unique or cat yourFile | java Unique For instance, the unique characters in the HTML of this question are: Unique characters:[ , , , , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, #, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
Print unique characters (ASCII and Unicode UTF-8) import codecs file = codecs.open('my_file_name', encoding='utf-8') # Runtime: O(1) letters = set() # Runtime: O(n^2) for line in file: for character in line: letters.add(character) # Runtime: O(n) letter_str = ''.join(letters) print(letter_str) Save as unique.py, and run as python unique.py.
in c++ i would first loop through the letters in the alphabet then run a strchr() on each with the file as a string. this will tell you if that letter exists, then just add it to the list.
Try this file with JSDB Javascript (includes the javascript engine in the Firefox browser): var seenAlreadyMap={}; var seenAlreadyArray=[]; while (!system.stdin.eof) { var L = system.stdin.readLine(); for (var i = L.length; i-- > 0; ) { var c = L[i].toLowerCase(); if (!(c in seenAlreadyMap)) { seenAlreadyMap[c] = true; seenAlreadyArray.push(c); } } } system.stdout.writeln(seenAlreadyArray.sort().join(''));
Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted a set is probably more efficient than a dictionary. However both are supposed to take constant time to access items. And both run circles around a list where for each character you search the list to see if the character is already in the list or not. Also Lists and Dictionaries are built in Python datatatypes that everyone should be using all the time. So even if set doesn't come to mind, dictionary should. file = open('location.txt', 'r') letters = {} for line in file: if line == "": break for character in line.strip(): if character not in letters: letters[character] = True file.close() print "Unique Characters: {" + "".join(letters.keys()) + "}"
A C solution. Admittedly it is not the fastest to code solution in the world. But since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions so I wanted to post one for the pure sadistic pleasure :) #include<stdio.h> #define CHARSINSET 256 #define FILENAME "location.txt" char buf[CHARSINSET + 1]; char *getUniqueCharacters(int *charactersInFile) { int x; char *bufptr = buf; for (x = 0; x< CHARSINSET;x++) { if (charactersInFile[x] > 0) *bufptr++ = (char)x; } bufptr = '\0'; return buf; } int main() { FILE *fp; char c; int *charactersInFile = calloc(sizeof(int), CHARSINSET); if (NULL == (fp = fopen(FILENAME, "rt"))) { printf ("File not found.\n"); return 1; } while(1) { c = getc(fp); if (c == EOF) { break; } if (c != '\n' && c != '\r') charactersInFile[c]++; } fclose(fp); printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile)); return 0; }
Quick and dirty solution using grep (assuming the file name is "file"): for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do if [ ! -z "`grep -li $char file`" ]; then echo -n $char; fi; done; echo I could have made it a one-liner but just want to make it easier to read. (EDIT: forgot the -i switch to grep)
Well my friend, I think this is what you had in mind... at least, this is the Python version!

f = open("location.txt", "r")
# read file into memory, lowercase, split into individual characters, sort
ll = sorted(list(f.read().lower()))
# eliminate duplicates by dropping each value equal to its predecessor
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])]
f.close()
print "Unique Characters: {%s}" % "".join(ll)

It does not iterate through each character explicitly, and it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending on your RAM), but for shorter files it is fun :)

I also have to add my final attack!!! Admittedly I eliminated two lines by using standard input instead of a file, and I reduced the active code from 3 lines to 2. Basically, if I replaced ll in the print line with the expression from the line above it, I could have had one line of active code and one line of imports... anyway, now we are having fun :)

import itertools, sys
# read standard input into memory, split into characters, eliminate duplicates
ll = map(lambda x: x[0], itertools.groupby(sorted(list(sys.stdin.read().lower()))))
print "Unique Characters: {%s}" % "".join(ll)
An answer above mentioned using a dictionary. If so, the code presented there can be streamlined a bit, since the Python documentation states:

    It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). ... If you store using a key that is already in use, the old value associated with that key is forgotten.

Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:

if character not in letters:

And that should make it a little faster.
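The documented behavior is easy to check in a couple of lines: assigning to an existing key just overwrites its value, so repeated characters never create duplicate keys.

```python
letters = {}
for character in "abracadabra":
    letters[character] = True  # re-assigning an existing key is harmless

# only the five distinct characters survive as keys
print("".join(letters))
```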
Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code

using System;
using System.IO;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            FileInfo fileInfo = new FileInfo(@"C:/data.txt");
            Console.WriteLine(fileInfo.Length);
            Stopwatch sw = new Stopwatch();
            sw.Start();
            Hashtable table = new Hashtable();
            StreamReader sr = new StreamReader(@"C:/data.txt");
            while (!sr.EndOfStream)
            {
                char c = Char.ToLower((char)sr.Read());
                if (!table.Contains(c))
                {
                    table.Add(c, null);
                }
            }
            sr.Close();
            foreach (char c in table.Keys)
            {
                Console.Write(c);
            }
            Console.WriteLine();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}

produces the output

4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .

The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.
s = open("text.txt", "r").read()
l = len(s)
unique = {}
for i in range(l):
    if unique.has_key(s[i]):
        unique[s[i]] = unique[s[i]] + 1
    else:
        unique[s[i]] = 1
print unique
Python without using a set.

file = open('location', 'r')
letters = []
for line in file:
    for character in line:
        if character not in letters:
            letters.append(character)
print(letters)
Old question, I know, but here's a fast solution, meaning it runs fast, and it's probably also pretty fast to code if you know how to copy/paste ;)

BACKGROUND

I had a huge csv file (12 GB, 1.34 million lines, 12.72 billion characters) that I was loading into postgres. The load was failing because the file had some "bad" characters in it, so naturally I was trying to find a character not in that file that I could use as a quote character.

1. First try: Jay's C++ solution

I started with @jay's C++ answer. (Note: all of these code examples were compiled with g++ -O2 uniqchars.cpp -o uniqchars.)

#include <iostream>
#include <set>
#include <cctype>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while (std::cin.get(ch)) {
        if (!isspace(ch)) {
            seen_chars.insert(tolower(ch));
        }
    }

    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << std::endl;
    }
    return 0;
}

Timing for this one:

real 10m55.026s
user 10m51.691s
sys 0m3.329s

2. Read entire file at once

I figured it'd be more efficient to read the entire file into memory at once rather than make all those calls to cin.get(). This reduced the run time by more than half. (I also added a filename as a command-line argument, and made it print the characters separated by spaces instead of newlines.)
#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <cstdio>
#include <cctype>

int main(int argc, char **argv) {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;

    std::ifstream ifs(argv[1]);
    ifs.seekg(0, std::ios::end);
    size_t size = ifs.tellg();
    fprintf(stderr, "Size of file: %lu\n", size);

    std::string str(size, ' ');
    ifs.seekg(0);
    ifs.read(&str[0], size);

    /* ignore whitespace and case */
    for (char& ch : str) {
        if (!isspace(ch)) {
            seen_chars.insert(tolower(ch));
        }
    }

    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << " ";
    }
    std::cout << std::endl;
    return 0;
}

Timing for this one:

real 4m41.910s
user 3m32.014s
sys 0m17.858s

3. Remove isspace() check and tolower()

Besides the set insert, isspace() and tolower() are the only things happening in the for loop, so I figured I'd remove them. That shaved off another minute and a half.

#include <set>
#include <string>
#include <iostream>
#include <fstream>
#include <iterator>
#include <cstdio>

int main(int argc, char **argv) {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;

    std::ifstream ifs(argv[1]);
    ifs.seekg(0, std::ios::end);
    size_t size = ifs.tellg();
    fprintf(stderr, "Size of file: %lu\n", size);

    std::string str(size, ' ');
    ifs.seekg(0);
    ifs.read(&str[0], size);

    // no isspace() or tolower(); just insert every byte
    for (char& ch : str) {
        seen_chars.insert(ch);
    }

    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << " ";
    }
    std::cout << std::endl;
    return 0;
}

Timing for the final version:

real 3m12.397s
user 2m58.771s
sys 0m13.624s
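The same trade-off shows up in higher-level languages too: reading fixed-size chunks and calling set.update on each keeps memory bounded (useful at 12 GB, where reading the whole file at once may not fit in RAM) while still avoiding a per-character inner loop. A Python sketch of that idea (the `unique_chunked` helper and the 1 MB chunk size are my own choices):

```python
def unique_chunked(path, chunk_size=1 << 20):
    # the set holds at most 256 byte values, so it stays tiny
    seen = set()
    with open(path, "rb") as f:
        # read ~1 MB at a time so a huge file never sits in memory whole
        while chunk := f.read(chunk_size):
            seen.update(chunk)
    return seen
```

Any byte value in 0-255 that is absent from the returned set is a candidate quote character.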
The simple solution from @Triptych helped me already (my input was a file of 124 MB in size, so this approach of reading the entire contents into memory still worked). However, I had a problem with encoding: Python didn't interpret the UTF-8 encoded input correctly. So here's a slightly modified version that works for UTF-8 encoded files (and also sorts the collected characters in the output):

import io
with io.open("my-file.csv", 'r', encoding='utf8') as f:
    text = f.read()
print "Unique Characters: {%s}" % ''.join(sorted(set(text)))