C++ Replacing non-alpha/apostrophe with spaces in a string

I am reading a text file and parsing its words into a map to count the occurrences of each word on each line. I am required to ignore all non-alphabetic characters (punctuation, digits, white space, etc.) except apostrophes. I can delete all of these characters using the code below, but that produces incorrect words: "one-two" comes out as "onetwo" when it should be two words, "one" and "two".
Instead, I am now trying to replace all of these characters with spaces rather than deleting them, but I can't figure out how. I thought std::replace_if would be a good algorithm to use, but I can't work out the proper syntax. C++11 is fine. Any suggestions?
Sample output would be the following:
"first second" = "first" and "second"
"one-two" = "one" and "two"
"last.First" = "last" and "first"
"you're" = "you're"
"great! A" = "great" and "A"
// What I initially used to delete non-alpha characters and white space (apostrophes are not handled correctly yet)
// Read file one line at a time
while (getline(text, line)) {
    istringstream iss(line);
    // Parse line on white space, storing values into tokens map
    while (iss >> word) {
        word.erase(remove_if(word.begin(), word.end(), my_predicate), word.end());
        ++tokens[word][linenum];
    }
    ++linenum;
}
bool my_predicate(char c) {
    return c == '\'' || !isalpha(c); // This line is not working properly for apostrophes yet
}

bool my_predicate(char c){
return c == '\'' || !isalpha(c);
}
Here you're saying that you want to remove the char if it is an apostrophe or if it is not an alphabetic character.
Since you want to replace these, you should use std::replace_if() :
std::replace_if(std::begin(word), std::end(word), my_predicate, ' ');
And you should correct your predicate too :
return !isalpha(c) && c != '\'';

You could use std::replace_if to pre-process the input line before sending it to the istringstream. This will also simplify your inner loop.
while (getline(text, line)) {
    replace_if(line.begin(), line.end(), my_predicate, ' ');
    istringstream iss(line);
    // Parse line on white space, storing values into tokens map
    while (iss >> word) {
        ++tokens[word][linenum];
    }
    ++linenum;
}
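For comparison, here is a minimal sketch of the same preprocessing idea in Python rather than C++: every character that is neither an ASCII letter nor an apostrophe becomes a space, and the line is then split on white space. The names is_separator, split_words and count_words are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def is_separator(c):
    """Return True for any character that should be replaced by a space."""
    return not (c.isascii() and c.isalpha()) and c != "'"


def split_words(line):
    """Replace separators with spaces, then split the line on white space."""
    cleaned = "".join(" " if is_separator(c) else c for c in line)
    return cleaned.split()


def count_words(lines):
    """Map each word to a {line number: occurrences} dictionary."""
    tokens = defaultdict(lambda: defaultdict(int))
    for linenum, line in enumerate(lines):
        for word in split_words(line):
            tokens[word][linenum] += 1
    return tokens
```

Because the replacement happens on the whole line before splitting, "one-two" is counted as two words while "you're" stays intact.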

Related

Perl Morgan and a String?

I am trying to solve this problem on hackerrank:
So the problem is:
Jack and Daniel are friends. Both of them like letters, especially upper-case ones.
They are cutting upper-case letters from newspapers, and each one of them has their collection of letters stored in separate stacks.
One beautiful day, Morgan visited Jack and Daniel. He saw their collections. Morgan wondered what is the lexicographically minimal string, made of that two collections. He can take a letter from a collection when it is on the top of the stack.
Also, Morgan wants to use all the letters in the boys' collections.
This is my attempt in Perl:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $n=<>);
while($n>0){
    chomp(my $string1=<>);
    chomp(my $string2=<>);
    lexi($string1,$string2);
    $n--;
}
sub lexi{
    my($str1,$str2)=@_;
    my @str1=split(//,$str1);
    my @str2=split(//,$str2);
    my $final_string="";
    while(@str2 && @str1){
        my $st2=$str2[0];
        my $st1=$str1[0];
        if($st1 le $st2){
            $final_string.=$st1;
            shift @str1;
        }
        else{
            $final_string.=$st2;
            shift @str2;
        }
    }
    if(@str1){
        $final_string=$final_string.join('',@str1);
    }
    else{
        $final_string=$final_string.join('',@str2);
    }
    print $final_string,"\n";
}
Sample Input:
2
JACK
DANIEL
ABACABA
ABACABA
The first line contains the number of test cases, T.
Every next two lines have such format: the first line contains string A, and the second line contains string B.
Sample Output:
DAJACKNIEL
AABABACABACABA
It gives the right results for the sample test case but wrong results for other test cases. One case for which it gives an incorrect result is
1
AABAC
AACAB
It outputs AAAABACCAB instead of AAAABACABC.
I don't know what is wrong with the algorithm or why it fails on the other test cases.
Update:
As per @squeamishossifrage's comments, if I add
($str1,$str2)=sort{$a cmp $b}($str1,$str2);
the results become the same irrespective of the input order, but the test case still fails.
The problem is in your handling of the equal characters. Take the following example:
ACBA
BCAB
When faced with two identical characters (C in my example), you naïvely chose the one from the first string, but that's not always correct. You need to look ahead to break ties. You may even need to look many characters ahead. In this case, next character after C of the second string is lower than the next character of the first string, so you should take the C from the second string first.
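By leaving the strings as strings, a simple string comparison will compare as many characters as needed to determine which character to consume. One caveat: when one remaining string is a prefix of the other (for example C and CAB), the shorter one compares as smaller even though taking from the longer one gives the better result. The comparison below therefore appends a sentinel that sorts after every upper-case letter (~) to both sides.
sub lexi {
    my ($str1, $str2) = @_;
    utf8::downgrade($str1);  # Makes sure length() will be fast
    utf8::downgrade($str2);  # since we only have ASCII letters.
    my $final_string = "";
    while (length($str2) && length($str1)) {
        # '~' sorts after 'Z', so a string that is a prefix of the other loses the tie.
        $final_string .= substr(($str1 . '~') le ($str2 . '~') ? $str1 : $str2, 0, 1, '');
    }
    $final_string .= $str1;
    $final_string .= $str2;
    print $final_string, "\n";
}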
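The look-ahead idea can also be sketched in Python rather than Perl. This is an illustration under stated assumptions, not the reference solution: morgan_merge is a hypothetical name, and the "~" sentinel is an assumption that relies on the input containing only upper-case ASCII letters. Comparing the remaining suffixes breaks ties between equal leading characters, and the sentinel keeps a suffix that is a prefix of the other from winning that tie.

```python
def morgan_merge(a, b):
    """Merge two stacks of upper-case letters into the smallest possible string."""
    sentinel = "~"  # assumption: sorts after every upper-case ASCII letter
    a += sentinel
    b += sentinel
    i = j = 0
    out = []
    # Stop as soon as one string has only its sentinel left.
    while i < len(a) - 1 and j < len(b) - 1:
        # Comparing whole suffixes looks ahead as far as needed to break ties.
        if a[i:] <= b[j:]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # Append whatever is left of either string, without the sentinel.
    out.append(a[i:-1])
    out.append(b[j:-1])
    return "".join(out)
```

The suffix slices make this quadratic in the worst case, which is fine for illustration but may be too slow for large inputs.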
Too little rep to comment thus the answer:
What you need to do is to look ahead if the two characters match. You currently do a simple le match and in the case of
ZABB
ZAAA
You'll get ZABBZAAA, since the first Z is le the other Z. So what you need to do (a naive solution which most likely won't be very efficient) is to keep looking ahead as long as the strings/chars match:
Z eq Z
ZA eq ZA
ZAB gt ZAA
and at that point you will know that the second string is the one to pop the first character from.
Edit
You updated the question to sort the strings, but as I wrote, you still need to look ahead. Sorting fixes the two strings above but fails with these two:
ZABAZA
ZAAAZB
for which it produces
ZAAAZBZABAZA
whereas the correct answer is ZAAAZABAZAZB, and you can't find that by simply comparing character by character.

How to compare upper and lowercase letters in a conditional in Swift

Apologies if this is a duplicate. I have a helper function called inputString() that takes user input and returns a String. I want to proceed based on whether an upper or lowercase character was entered. Here is my code:
print("What do you want to do today? Enter 'D' for Deposit or 'W' for Withdrawl.")
operation = inputString()
if operation == "D" || operation == "d" {
print("Enter the amount to deposit.")
My program quits after the first print function, but gives no compiler errors. I don't know what I'm doing wrong.
It's important to keep in mind that there is a whole slew of purely whitespace characters that show up in strings, and sometimes, those whitespace characters can lead to problems just like this.
So, whenever you are certain that two strings should be equal, it can be useful to print them with some sort of non-whitespace character on either end of them.
For example:
print("Your input was <\(operation)>")
That should print the user input with angle brackets on either side of the input.
And if you stick that line into your program, you'll see it prints something like this:
Your input was <D
>
So it turns out that your inputString() method is capturing the newline character (\n) that the user presses to submit their input. You should improve your inputString() method to go ahead and trim that newline character before returning its value.
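The same debugging trick and fix can be sketched in Python rather than Swift. This is a minimal illustration; show_input and normalize_choice are hypothetical names, and the raw string stands in for whatever the input helper returns.

```python
def show_input(raw):
    """Wrap the input in angle brackets so stray whitespace becomes visible."""
    return f"<{raw}>"


def normalize_choice(raw):
    """Trim surrounding whitespace (including the newline) and upper-case the rest."""
    return raw.strip().upper()


raw = "d\n"  # what a helper returns when it keeps the trailing newline
print("Your input was", show_input(raw))
if normalize_choice(raw) == "D":
    print("Enter the amount to deposit.")
```

Normalizing once, right after reading, means a single comparison covers both "D" and "d".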
I feel it's really important to mention here that your inputString method is really clunky and requires importing modules. But there's a way simpler pure Swift approach: readLine().
Swift's readLine() method does exactly what your inputString() method is supposed to be doing, and by default, it strips the newline character off the end for you (there's an optional parameter you can pass to prevent the method from stripping the newline).
My version of your code looks like this:
func fetchInput(prompt: String? = nil) -> String? {
if let prompt = prompt {
print(prompt, terminator: "")
}
return readLine()
}
if let input = fetchInput("Enter some input: ") {
if input == "X" {
print("it matches X")
}
}
The cause of the error you experienced is explained in the question "Swift how to compare string which come from NSString". Essentially, we need to remove any whitespace or non-printing characters, such as the newline.
I also used .uppercaseString to simplify the comparison.
The amended code is as follows:
func inputString() -> String {
var keyboard = NSFileHandle.fileHandleWithStandardInput()
var inputData = keyboard.availableData
let str: String = (NSString(data: inputData, encoding: NSUTF8StringEncoding)?.stringByTrimmingCharactersInSet(
NSCharacterSet.whitespaceAndNewlineCharacterSet()))!
return str
}
print("What do you want to do today? Enter 'D' for Deposit or 'W' for Withdrawl.")
let operation = inputString()
if operation.uppercaseString == "D" {
print("Enter the amount to deposit.")
}

Get indexOf special characters in ActionScript3

In ActionScript 3 I want to get the text between two quotes in some HTML, given an input index that I would simply increase by 1 to find the second quote character. This should be very simple, but I have noticed that indexOf does not seem to work correctly with quotes and other special characters.
So my question is if you have some HTML style text like this:
var MyText:String = '<div style="text-align:center;line-height:150%"><a href="http://www.website.com/page.htm">';
How can I correctly get the index of a quote " or other special character?
Currently I try this:
MyText.indexOf('"',1)
but for any start index after 0 it always seems to return the wrong index value.
As a quick additional question: is there a better way than using ' ' to store strings that contain characters like "? That way other ' characters would not cause problems either.
Edit -
This is the function I created (usage: GetQuote(MyText,0) etc.)
// GetQuote Function (Gets the content between quotes at a set index value)
function GetQuote(Input:String, Index:Number):String {
return String(Input.substr(Input.indexOf('"', Index), Input.indexOf('"', Index + 1)));
}
GetQuote(MyText,0) returns "text-align, but I need text-align:center;line-height:150% instead.
First off, the index of the first quote is 11, and both MyText.indexOf('"') and MyText.indexOf('"',1) return the right value (the latter also works because you don't actually have a quote at the beginning of your string).
When you need to use a single quote inside a single-quoted string, or a double quote inside a double-quoted string, you need to escape the inner one(s) with backslashes. So to match a single quote you would write '\''
There are several ways of stripping a value from a string. You can use the RegExp class or use standard String functions like indexOf, substr etc.
Now what exactly would you like the result to become? Your question is not obvious.
EDIT:
Using the RegExp class is much easier:
var myText:String = '<div style="text-align:center;line-height:150%"><a href="http://www.website.com/page.htm">';
function getQuote(input:String, index:int=0):String {
// I declared the default index as the first one
var matches:Array = [];
// create an array for the matched results
var rx:RegExp = /"(\\"|[^"])*"/g;
// create a RegExp rule to catch all grouped chars
// rule also includes escaped quotes
input.replace(rx,function(a:*) {
// if it's "etc." we want etc. only so...
matches.push(a.substr(1,a.length-2));
});
// above method does not replace anything actually.
// it just cycles in the input value and pushes
// captured values into the matches array.
return (index >= matches.length || index < 0) ? '' : matches[index];
}
trace('Index 0 -->',getQuote(myText))
trace('Index 1 -->',getQuote(myText,1))
trace('Index 2 -->',getQuote(myText,2))
trace('Index -1 -->',getQuote(myText,-1))
Outputs:
Index 0 --> text-align:center;line-height:150%
Index 1 --> http://www.website.com/page.htm
Index 2 -->
Index -1 -->
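The standard string functions mentioned above also work. The likely reason GetQuote returns "text-align is that substr takes a start index and a length, not an end index: both indexOf calls return 11, so it copies 11 characters starting at the opening quote. Below is a minimal sketch of the find-based approach, written in Python rather than ActionScript; quoted_at is a hypothetical name, and unlike the regular expression above it does not handle escaped quotes.

```python
def quoted_at(text, index):
    """Return the content of the index-th pair of double quotes, or '' if absent."""
    if index < 0:
        return ""
    pos = 0
    value = ""
    for _ in range(index + 1):
        start = text.find('"', pos)        # opening quote
        if start == -1:
            return ""
        end = text.find('"', start + 1)    # matching closing quote
        if end == -1:
            return ""
        value = text[start + 1:end]        # slice by end position, not by length
        pos = end + 1                      # continue after the closing quote
    return value


my_text = '<div style="text-align:center;line-height:150%"><a href="http://www.website.com/page.htm">'
print(quoted_at(my_text, 0))
print(quoted_at(my_text, 1))
```

Searching for the closing quote from start + 1 and resuming after it keeps a closing quote from being mistaken for the next opening one.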

Programmatically determining the difference between a comma-separated list and a paragraph

I am working on a data migration where on the old system the users were allowed to enter their interests in a large text-field with no formatting instructions followed at all. As a result some wrote in bio format and others wrote in comma-separated list format. There are a few other formats, but these are the primary ones.
Now I know how to identify a comma-separated list (CSL). That is easy enough. But how about determining if a string is a CSL (maybe a short one with two terms or phrases) or just a paragraph someone wrote that contains a comma?
One thought that I have is to automatically ignore strings that contain punctuation and strings that don't contain commas. However, I am concerned that this won't be enough or will leave much to be desired. So I would like to query the community to see what you guys think. In the mean time I will try out my idea.
UPDATE:
Ok guys, I have my algorithm. Here it is below...
MY CODE:
//Process our interests text field and get the list of interests
function process_interests($interests)
{
$interest_list = array();
if ( preg_match('/(\.)/', $interests) 0 && $word_cnt > 0)
$ratio = $delimiter_cnt / $word_cnt;
//If delimiter is found with the right ratio then we can go forward with this.
//Should not be any more than 5 words per delimiter (ratio = delimiter / words ... this must be at least 0.2)
if (!empty($delimiter) && $ratio > 0 && $ratio >= 0.2)
{
//Check for label with colon after it
$interests = remove_colon($interests);
//Now we make our array
$interests = explode($delimiter, $interests);
foreach ($interests AS $val)
{
$val = humanize($val);
if (!empty($val))
$interest_list[] = $val;
}
}
}
return $interest_list;
}
//Cleans up strings a bit
function humanize($str)
{
if (empty($str))
return ''; //Lets not waste processing power on empty strings
$str = remove_colon($str); //We do this one more time for inline labels too.
$str = trim($str); //Remove unused bits
$str = ltrim($str, ' -'); //Remove leading dashes
$str = str_replace('  ', ' ', $str); //Remove double spaces, replace with single spaces
$str = str_replace(array(".", "(", ")", "\t"), '', $str); //Replace some unwanted junk
if ( strtolower( substr($str, 0, 3) ) == 'and')
$str = substr($str, 3); //Remove leading "and" from term
$str = ucwords(preg_replace('/[_]+/', ' ', strtolower(trim($str))));
return $str;
}
//Check for label with colon after it and remove the label
function remove_colon($str)
{
//Check for label with colon after it
if (strstr($str, ':'))
{
$str = explode(':', $str); //If we find it we must remove it
unset($str[0]); //To remove it we just explode it and take everything to the right of it.
$str = trim(implode(':', $str)); //Sometimes colons are still used elsewhere, I am going to allow this
}
return $str;
}
Thank you for all your help and suggestions!
You could, in addition to the filtering you mentioned, compute the ratio of the number of commas to the string length. In CSLs this ratio will tend to be high; in paragraphs, low. You could set a threshold and classify each entry by whether its ratio is high enough. Entries with ratios close to the threshold could be marked as error-prone and then checked by a moderator.
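A minimal sketch of the ratio idea, in Python rather than PHP. It measures commas per word (as the code in the question does) instead of commas per character. The 0.2 threshold is taken from the question's code, while the 0.05 margin for flagging borderline entries and the names comma_ratio and classify are assumptions made for this illustration.

```python
def comma_ratio(text):
    """Return commas per word, or 0.0 for an empty string."""
    words = text.split()
    return text.count(",") / len(words) if words else 0.0


def classify(text, threshold=0.2, margin=0.05):
    """Label an entry as 'list', 'paragraph', or 'review' (too close to call)."""
    ratio = comma_ratio(text)
    if abs(ratio - threshold) <= margin:
        return "review"  # borderline: leave it to a moderator
    return "list" if ratio > threshold else "paragraph"


print(classify("hiking, cooking, chess, jazz"))
print(classify("I have always enjoyed long walks in the hills, mostly on quiet weekend mornings with my dog"))
```

Both numbers would need tuning against a sample of the real data before being trusted.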

Split Long Text into meaningful sentences with certain length using C#

I would like to split a text at a dot followed by a whitespace, or a dot followed by a new line (\n), into pieces of a certain length.
E.g. if I have a long text with 3456 characters in total, I want to split it into several texts of 1000 characters, or the closest possible number, where each text ends with a full, meaningful sentence.
The reason I want to do this is that I am using an API which accepts only 1000 characters or fewer for data conversion, but some of my texts are longer than that. So I want to split them into multiple texts, none longer than 1000 characters, each ending at a full sentence, i.e. at a dot followed by a whitespace or a dot followed by a new line (\n).
I'm working with C# .NET.
Thanks in advance.
Something like this. Obviously, replace "someText" with your data, and set the shardLength to 1000 for your example. This solution gives an error if there is a sentence larger than the block size.
It currently handles newlines by effectively ignoring them; it only splits on "."
This means that sentences that end in ".\n" will be split after the ".", and the "\n" will be at the start of the next sentence.
The advantage here is that if you pass this to your API, you should be able to concatenate the results and retain the newlines in the appropriate places (assuming the API handles newlines).
using System;
using System.Text.RegularExpressions;
public static void BlockSplitter()
{
    String someText = @"This is some text.
The quick brown fox jumps over the lazy dog. Testing 1 2 3.
Sentence with no fullstop";
    String[] sentences;
    // Zero-width split: break after every "." and keep the dot with its sentence
    string delimiters = @"(?<=\.)";
    sentences = Regex.Split(someText, delimiters);
    String shard = String.Empty;
    int shardLength = 45;
    foreach (String sentence in sentences)
    {
        if (sentence.Length > shardLength)
        {
            //Raise an exception, as the sentence is longer than the shard length
            throw new ArgumentException("Sentence is longer than the shard length.");
        }
        if ((shard.Length + sentence.Length) <= shardLength)
        {
            shard += sentence;
        }
        else
        {
            Console.WriteLine(shard);
            shard = sentence;
        }
    }
    Console.WriteLine(shard);
}
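The same greedy packing can be sketched in Python rather than C#, returning the shards instead of printing them. This is a minimal illustration: split_into_shards is a hypothetical name, and like the code above it splits only after "." and refuses any sentence longer than the limit.

```python
import re


def split_into_shards(text, max_len=1000):
    """Pack whole sentences into shards of at most max_len characters."""
    shards = []
    shard = ""
    # Each match is either text ending in a dot or a trailing run with no dot.
    for sentence in re.findall(r"[^.]*\.|[^.]+", text):
        if len(sentence) > max_len:
            raise ValueError("sentence is longer than the shard length")
        if len(shard) + len(sentence) <= max_len:
            shard += sentence
        else:
            shards.append(shard)
            shard = sentence
    if shard:
        shards.append(shard)
    return shards


some_text = ("This is some text.\n"
             "The quick brown fox jumps over the lazy dog. Testing 1 2 3.\n"
             "Sentence with no fullstop")
for piece in split_into_shards(some_text, max_len=45):
    print(repr(piece))
```

Since nothing is dropped, joining the shards back together reproduces the original text, newlines included.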
