Comparing strings with symbols from different alphabets - string

I want to compare two strings that contain symbols from different alphabets (e.g. Russian and English). Symbols that look alike should be treated as equal.
E.g. in the word "Mom" the letter "o" is from the English alphabet (code 006F in Unicode), and in the word "Mоm" the letter "о" is from the Russian alphabet (code 043E in Unicode). So ("Mom" = "Mоm") => false, but I want it to be true. Is there a standard SAS function for this, or should I write a macro?
Thanks!

I would do it like this:
First I would build a map that says which Russian letter corresponds to which English letter. Example:
б = b
в = v
...
I would store this map in a separate table or as macro variables.
Then I would write a macro loop that walks through the map and applies the tranwrd function to each pair.
For example:
data _null_;
stringBefore = "без";
stringAfter = tranwrd(stringBefore,"а","a");
stringAfter = tranwrd(stringAfter,"б","b");
stringAfter = tranwrd(stringAfter,"в","v");
...
run;
After this transformation you should be able to compare your strings.

I also coded some functions to deal with keyboard layout misprints. Here is the code:
/***************************************************************************/
/* FUNCTION count_rus_letters RETURNS NUMBER OF CYRILLIC LETTERS IN STRING */
/***************************************************************************/
proc fcmp outlib=sasuser.userfuncs.mystring;
FUNCTION count_rus_letters(string $);
length letter $2;
rus_count=0;
len=klength(string);
do i=1 to len;
letter=ksubstr(string,i,1);
if letter in ("А","а","Б","б","В","в","Г","г","Д","д","Е","е","Ё","ё","Ж","ж"
"З","з","И","и","Й","й","К","к","Л","л","М","м","Н","н","О","о","П","п","Р","р",
"С","с","Т","т","У","у","Ф","ф","Х","х","Ц","ц","Ч","ч","Ш","ш","Щ","щ","Ъ","ъ"
"Ы","ы","Ь","ь","Э","э","Ю","ю","Я","я")
then rus_count+1;
end;
return(rus_count);
endsub;
run;
/**************************************************************************/
/* FUNCTION count_eng_letters RETURNS NUMBER OF ENGLISH LETTERS IN STRING */
/**************************************************************************/
proc fcmp outlib=sasuser.userfuncs.mystring;
FUNCTION count_eng_letters(string $);
length letter $2;
eng_count=0;
len=klength(string);
do i=1 to len;
letter=ksubstr(string,i,1);
if rank('A') <= rank(letter) <= rank('Z') or rank('a') <= rank(letter) <= rank('z')
then eng_count+1;
end;
return(eng_count);
endsub;
run;
/**************************************************************************/
/* FUNCTION is_string_russian RETURNS 1 IF NUMBER OF RUSSIAN SYMBOLS IN */
/* STRING >= NUMBER OF ENGLISH SYMBOLS */
/**************************************************************************/
proc fcmp outlib=sasuser.userfuncs.mystring;
FUNCTION is_string_russian(string $);
length letter $2 result 8;
eng_count=0;
rus_count=0;
len=klength(string);
do i=1 to len;
letter=ksubstr(string,i,1);
if letter in ("А","а","Б","б","В","в","Г","г","Д","д","Е","е","Ё","ё","Ж","ж"
"З","з","И","и","Й","й","К","к","Л","л","М","м","Н","н","О","о","П","п","Р","р",
"С","с","Т","т","У","у","Ф","ф","Х","х","Ц","ц","Ч","ч","Ш","ш","Щ","щ","Ъ","ъ"
"Ы","ы","Ь","ь","Э","э","Ю","ю","Я","я")
then rus_count+1;
if rank('A') <= rank(letter) <= rank('Z') or rank('a') <= rank(letter) <= rank('z')
then eng_count+1;
end;
if rus_count>=eng_count
then result=1;
else result=0;
return(result);
endsub;
run;
/**************************************************************************/
/* FUNCTION fix_layout_misprints REPLACES MISPRINTED SYMBOLS BY ANALYSING */
/* LANGUAGE OF THE STRING (FOR ENGLISH STRING RUSSIAN SYMBOLS ARE */
/* REPLACED BY ENGLISH COPIES AND FOR RUSSIAN STRING SYMBOLS ARE */
/* REPLACED BY RUSSIAN COPIES) */
/**************************************************************************/
proc fcmp outlib=sasuser.userfuncs.mystring;
FUNCTION fix_layout_misprints(string $) $ 1000;
length letter $2 result $1000;
eng_count=0;
rus_count=0;
len=klength(string);
do i=1 to len;
letter=ksubstr(string,i,1);
if letter in ("А","а","Б","б","В","в","Г","г","Д","д","Е","е","Ё","ё","Ж","ж"
"З","з","И","и","Й","й","К","к","Л","л","М","м","Н","н","О","о","П","п","Р","р",
"С","с","Т","т","У","у","Ф","ф","Х","х","Ц","ц","Ч","ч","Ш","ш","Щ","щ","Ъ","ъ"
"Ы","ы","Ь","ь","Э","э","Ю","ю","Я","я")
then rus_count+1;
if rank('A') <= rank(letter) <= rank('Z') or rank('a') <= rank(letter) <= rank('z')
then eng_count+1;
end;
if rus_count>=eng_count
then result=ktranslate(string,"АаВЕеКкМОоРрСсТХх","AaBEeKkMOoPpCcTXx");
else result=ktranslate(string,"AaBEeKkMOoPpCcTXx","АаВЕеКкМОоРрСсТХх");
return(result);
endsub;
run;
/***********/
/* EXAMPLE */
/***********/
options cmplib=sasuser.userfuncs;
data _null_;
good_str="Иванов";
err_str="Ивaнов";
fixed_str=fix_layout_misprints(err_str);
put "Good string=" good_str;
put "Error string=" err_str;
put "Fixed string=" fixed_str;
rus_count_in_err=count_rus_letters(err_str);
put "Count of Cyrillic symbols in error string=" rus_count_in_err;
eng_count_in_err=count_eng_letters(err_str);
put "Count of English symbols in error string=" eng_count_in_err;
is_error_str_russian=is_string_russian(err_str);
put "Is error string language Russian=" is_error_str_russian;
if (good_str ne err_str)
then put "Before clearing - strings are not equal to each other";
if (good_str = fixed_str)
then put "After clearing - strings are equal to each other";
run;

Related

Extra characters and symbols outputted when doing substitution in C

When I run the code using the following key, extra characters are outputted...
TERMINAL WINDOW:
$ ./substitution abcdefghjklmnopqrsTUVWXYZI
plaintext: heTUXWVI ii ssTt
ciphertext: heUVYXWJ jj ttUuh|
These are the instructions (CS50 substitution problem):
Design and implement a program, substitution, that encrypts messages using a substitution cipher.
Implement your program in a file called substitution.c in a ~/pset2/substitution directory.
Your program must accept a single command-line argument, the key to use for the substitution. The key itself should be case-insensitive, so whether any character in the key is uppercase or lowercase should not affect the behavior of your program.
If your program is executed without any command-line arguments or with more than one command-line argument, your program should print an error message of your choice (with printf) and return from main a value of 1 (which tends to signify an error) immediately.
If the key is invalid (as by not containing 26 characters, containing any character that is not an alphabetic character, or not containing each letter exactly once), your program should print an error message of your choice (with printf) and return from main a value of 1 immediately.
Your program must output plaintext: (without a newline) and then prompt the user for a string of plaintext (using get_string).
Your program must output ciphertext: (without a newline) followed by the plaintext’s corresponding ciphertext, with each alphabetical character in the plaintext substituted for the corresponding character in the ciphertext; non-alphabetical characters should be outputted unchanged.
Your program must preserve case: capitalized letters must remain capitalized letters; lowercase letters must remain lowercase letters.
After outputting ciphertext, you should print a newline. Your program should then exit by returning 0 from main.
My code:
#include <cs50.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int main(int argc,string argv[])
{
char alpha[26] = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'};
string key = argv[1];
int totalchar = 0;
for (char c ='a'; c <= 'z'; c++)
{
for (int i = 0; i < strlen(key); i++)
{
if (tolower(key[i]) == c)
{
totalchar++;
}
}
}
//accept only singular 26 key
if (argc == 2 && totalchar == 26)
{
string plaint = get_string("plaintext: ");
int textlength =strlen(plaint);
char subchar[textlength];
for (int i= 0; i< textlength; i++)
{
for (int j =0; j<26; j++)
{
// substitute
if (tolower(plaint[i]) == alpha[j])
{
subchar[i] = tolower(key[j]);
// keep plaintext's case
if (plaint[i] >= 'A' && plaint[i] <= 'Z')
{
subchar[i] = (toupper(key[j]));
}
}
// if isn't char
if (!(isalpha(plaint[i])))
{
subchar[i] = plaint[i];
}
}
}
printf("ciphertext: %s\n", subchar);
return 0;
}
else
{
printf("invalid input\n");
return 1;
}
}
strcmp compares two strings. plaint[i] and alpha[j] are chars. They can be compared with "regular" comparison operators, like ==.

MIPS, Number of occurrences in a string located in the stack

I have an exercise to solve in MIPS assembly. Some parts are clear to me, but I have trouble writing its code. The exercise asks:
Write a program that reads a string from the keyboard, finds the character with the highest number of occurrences, and shows it.
How can I check all 26 characters and find the one with the most occurrences?
Example:
Give me a string: Hello world!
The character with the most occurrences is: l
Thanks a lot in advance for any answers.
P.S.
This is the first part of my program:
#First message
li $v0, 4
la $a0, mess
syscall
#Stack space allocated
addi $sp, $sp, -257
#Read the string
move $a0, $sp
li $a1, 257
li $v0, 8
syscall
Since this is your assignment I'll leave the MIPS assembly implementation to you. I'll just show you the logic for the code in a higher-level language:
// You'd keep these variables in some MIPS registers of your choice
int c, i, count, max_count=0;
char max_char;
// Iterate over all ASCII character codes
for (c = 0; c < 128; c+=1) {
count = 0;
// Count the number of occurrences of this character in the string
for (i = 0; string[i]!=0; i+=1) {
if (string[i] == c) count++;
}
// Was it greater than the current max?
if (count > max_count) {
max_count = count;
max_char = c;
}
}
// max_char now holds the ASCII code of the character with the highest number
// of occurrences, and max_count holds the number of times that character was
// found in the string.
@Michael, I saw you answered before I posted; I just want to expand on that with a more detailed answer. If you edit yours to add more explanation, I will delete mine. I did not edit yours directly because I was already halfway there when you posted. Anyway:
@Marco:
You can create a temporary array of 26 counters (initialized to 0).
Each counter corresponds to one letter (i.e. it holds the number of times that letter occurs). For example, counter[0] holds the number of occurrences of the letter 'a', counter[1] of the letter 'b', etc.
Then iterate over each character in the input string and for each character do:
a) Obtain the index of the character in the counter array.
b) Increase counter["obtained index"] by 1.
To obtain the index of the character you can do the following:
a) First make sure the character is lowercase, i.e. in 'a' to 'z' rather than 'A' to 'Z'. If it is uppercase, convert it.
b) Subtract the letter 'a' from the character. This way 'a'-'a' gives 0, 'b'-'a' gives 1, 'c'-'a' gives 2, etc.
I will demonstrate in C, because the exercise is yours to write in MIPS (the goal is to learn MIPS assembly):
#include <stdio.h>
int main()
{
//Maximum length of string:
int stringMaxLength = 100;
//Create string in stack. Size of string is length+1 to
//allow the '\0' character to mark the end of the string.
char str[stringMaxLength + 1];
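//Read one line of at most stringMaxLength characters. fgets() stops at
//the buffer size, unlike scanf("%*s"), which discards its input.
puts("Enter string:");
if (fgets(str, stringMaxLength + 1, stdin) == NULL)
return 1; //Nothing could be read.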
//Create array of counters in stack:
int counter[26];
//Initialize the counters to 0:
int i;
for (i=0; i<26; ++i)
counter[i] = 0;
//Main counting loop:
for (i=0; str[i] != '\0'; ++i)
{
char tmp = str[i]; //Storing of str[i] in tmp, to write tmp if needed,
//instead of writing str[i] itself. Optional operation in this particular case.
if (tmp >= 'A' && tmp <= 'Z') //If the current character is upper:
tmp = tmp + 32; //Convert the character to lower.
if (tmp >= 'a' && tmp <='z') //If the character is a lower letter:
{
//Obtain the index of the letter in the array:
int index = tmp - 'a';
//Increment its counter by 1:
counter[index] = counter[index] + 1;
}
//Else, if the character is not a lowercase letter by now, we ignore it,
//or we could inform the user, for example, or we could reject the
//whole string as invalid.
}
//Now find the maximum occurrences of a letter:
int indexOfMaxCount = 0;
int maxCount = counter[0];
for (i=1; i<26; ++i)
if (counter[i] > maxCount)
{
maxCount = counter[i];
indexOfMaxCount = i;
}
//Convert the indexOfMaxCount back to the character it corresponds to:
char maxChar = 'a' + indexOfMaxCount;
//Inform the user of the letter with maximum occurrences:
printf("Maximum %d occurrences for letter '%c'.\n", maxCount, maxChar);
return 0;
}
If you don't understand why I convert an uppercase letter to lowercase by adding 32, read on:
Each character is stored as an integer value in memory, and arithmetic on characters operates on their corresponding numbers in the encoding table.
An encoding is just a table that matches characters with numbers.
For example, 'a' corresponds to the number 97 in the ASCII table.
Likewise, 'b' corresponds to the number 98.
So 'a'+1 gives 97+1=98, which is the character 'b'. They are all numbers in memory; the difference is how you represent (decode) them. The same encoding table is used for decoding, of course. Since 'A' is 65 and 'a' is 97, every uppercase letter sits exactly 32 below its lowercase counterpart, which is why adding 32 converts it.
Examples:
printf("%c", 'a'); //Prints 'a'.
printf("%d", (int) 'a'); //Prints '97'.
printf("%c", (char) 97); //Prints 'a'.
printf("%d", 97); //Prints '97'.
printf("%d", (int) 'b'); //Prints '98'.
printf("%c", (char) (97 + 1)); //Prints 'b'.
printf("%c", (char) ( ((int) 'a') + 1 ) ); //Prints 'b'.
//Etc...
//All the casting in the above examples is just for demonstration,
//it would work without them also, in this case.

Modifying strings in emu8086 assembly

I am currently working on an intro assignment for a computer architecture course and was asked to perform some string modifications. My question is not how to do it, but what I should be researching to be able to do it. Are there any functions that will make this easier, for example .reverse() in Java?
What I need to accomplish is getting string input from the user, reversing the letters (while keeping numbers where they are), adding spaces whenever there is a vowel, and alternating the caps.
Example:
Input: AbC_DeF12
Output: f E d _ c B a 2 1
This is code I took from the lecture: http://pastebin.com/2E1UtGdD I put it on pastebin to avoid clutter. Anything used in it is fair game. (This code does have limitations, though: it only supports ~9 characters and the looping doesn't work at the end of strings.)
I would look at it like this.
Sketch on paper how you want to achieve this. The following are notes and only a starting point.
Loop from 0 to string length - 1.
if(byte >= 'A' && byte <= 'Z') then byte -= 'A' - 'a'; /* convert to lower case */
else if(byte >= 'a' && byte <= 'z') then byte += 'A' - 'a'; /* convert to upper case */
/* Switch the letters only. */
p = 0; q = string length - 1
Loop while p < q.
if(!((input[p] >= 'A' && input[p] <= 'Z') || (input[p] >= 'a' && input[p] <= 'z'))) p++; /* skip non-letter on the left */
else if(!((input[q] >= 'A' && input[q] <= 'Z') || (input[q] >= 'a' && input[q] <= 'z'))) q--; /* skip non-letter on the right */
else { c = input[p]; input[p] = input[q]; input[q] = c; p++; q--; }
/* Regenerate the string and add spaces. */
j = 0; loop i from 0 to string length - 1
if(input[i] is a vowel: 'A' 'a' 'E' 'e' ...) { string2[j] = ' '; j++; }
string2[j] = input[i]; j++;
After that if you don't know 8086 I would look at examples online of how to do each individual part. The most important bit is generating the code in your head and on paper on how it is going to work.

Char Array Returning Integers

I've been working through this exercise, and my output is not what I expect.
(Check substrings) You can check whether a string is a substring of another string
by using the indexOf method in the String class. Write your own method for
this function. Write a program that prompts the user to enter two strings, and
checks whether the first string is a substring of the second.
My code departs from the problem's specification in two ways: it only reports matching substrings of exactly 3 letters, and it cannot handle strings shorter than 4 letters. I mistakenly began writing the program without using the suggested method, indexOf. My program's objective (although it shouldn't deviate entirely from the assignment's) is to determine whether two strings share at least three consecutive letters.
The program's primary error is that it prints numbers instead of characters. I've tried several ideas, without success, to find the logical error. I first tried to identify whether the char values (which, as I understand it, are stored as Unicode) were being converted to integers, considering that the printed numbers are also three characters long. Even without consulting a reference, I know this isn't the case: a comparison between java and javac printed permutations of 312, and a comparison between abab and ababbab printed combinations of 219, yet j should be > b. My next thought was that the outputs were indexes into the arrays I used. Once again, this isn't true: a comparison between java and javac would print 0 if that reasoning were correct.
public class Substring {
public static char [] array;
public static char [] array2;
public static void main (String[]args){
java.util.Scanner input = new java.util.Scanner (System.in);
System.out.println("Enter your two strings here, the longer one preceding the shorter one");
String container1 = input.next();
String container2 = input.next();
char [] placeholder = container1.toCharArray();
char [] placeholder2 = container2.toCharArray();
array = placeholder;
array2 = placeholder2;
for (int i = 0; i < placeholder2.length; i++){
for (int j = 0; j < placeholder.length; j ++){
if (array[j] == array2[i]) matcher(j,i);
}
}
}
public static void matcher(int higher, int lower){
if ((higher < array.length - 2) && (lower < array2.length - 2))
if (( array[higher+1] == array2[lower+1]) && (array[higher+2] == array2[lower+2]))
System.out.println(array[higher] + array[higher+1] + array[higher+2] );
}
}
The + operator promotes short, char, and byte operands to int, so
array[higher] + array[higher+1] + array[higher+2]
has type int, not type char which means that
System.out.println(...)
binds to
System.out.println(int)
which displays its argument as a decimal number, instead of binding to
System.out.println(char)
which outputs the given character using the PrintStream's encoding.

MIPS- Open a file with Strings of ints, convert them to ints and print

Given a file with strings of numbers, how can I convert them to ints and then print them in MIPS? Any tips on where to start?
Start from this (pseudocode):
1. open the file (syscall #13)
2. read a number (syscall #14) (are they separated by commas, newlines, ...?)
3. convert into int (see below for a good algo)
4. output that int (syscall #1)
5. go to step 2
int ToInteger(char *digit) // please note: destination base is 10!
{
int result = 0;
while (*digit) {
result = (result * 10) + (*digit - '0');
digit++;
}
return result;
}
