Why am I getting incorrect values for string length?


My professor is teaching us Scala using Horstmann's book "Scala for the Impatient", and one of our homework exercises is straight from the book: Chapter 4, exercise 2.
We are expected to read in the eBook in text format. The professor has specified that the input file should be "Moby Dick", available for free from Project Gutenberg here: http://www.gutenberg.org/ebooks/2701.txt.utf-8
My code works as far as counting instances of words. However, he has added the requirement that we must format the output in two columns, with words left-justified and counts right-justified. To do so, I am determining the longest word in the book so I can figure out the width of the "word" column. However, the values I am getting for the lengths of the strings are just wrong. In fact, it tells me that all the strings are the same length: "a" is reported as length 26, as are "Whale", "Ishmael", etc.
Here's the code:
object Chapter4Exercise2 extends App {
  //for sorting
  import util.Sorting._
  //grab the file
  val inputFile = new java.util.Scanner(new java.io.File("moby.txt"))
  //create a mutable map where key/values == word/count
  val wordMap = collection.mutable.Map[String, Int]() withDefault (_ => 0)
  //for formatting output (later), the longest word length is relevant
  var longestWord = 0
  var theWord: String = ""
  //start reading each word in the input file
  while (inputFile hasNext) {
    //grab the next word for processing, convert it to lower case, trim spaces and punctuation
    var nextWord = inputFile.next().toLowerCase().trim().filter(Character.isLetter(_))
    //if it's the longest word, update both theWord and longestWord
    if (nextWord.size > longestWord) longestWord = nextWord.size; theWord = nextWord; println(theWord + " " + longestWord)
    //update the map value for the key with same value as nextWord
    wordMap(nextWord) += 1
  }
  println("Longest word is " + theWord + " at " + longestWord + " Characters")
}
The output of these lines:
if (nextWord.size > longestWord) longestWord = nextWord.size; theWord = nextWord; println(theWord + " " + longestWord)
and
println("Longest word is " + theWord + " at " + longestWord + " Characters")
is way off. It's telling me that EVERY word in the input file is 26 characters long!
Here's a small sample of what's being output:
husks 26
on 26
a 26
surfbeaten 26
beach 26
and 26
then 26
diving 26
down 26
into 26
What am I missing/doing wrong?

if (nextWord.size > longestWord) longestWord = nextWord.size; theWord = nextWord; println(theWord + " " + longestWord)
You shouldn't write multiple statements on a single line like that. Let's write this out in multiple lines and properly indent it:
if (nextWord.size > longestWord)
  longestWord = nextWord.size
theWord = nextWord
println(theWord + " " + longestWord)
Do you see the problem now?

Try putting { and } around your if statement alternatives.
You can avoid this kind of pitfall by formatting your code in a structured manner - always using braces around code blocks.
if (nextWord.size > longestWord)
{
  longestWord = nextWord.size;
  theWord = nextWord;
  println(theWord + " " + longestWord);
}
Your current code is equivalent to
if (nextWord.size > longestWord)
{
  longestWord = nextWord.size;
}
theWord = nextWord;
println(theWord + " " + longestWord);
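For the formatting goal itself, one option is to compute the column widths after counting rather than tracking the longest word inside the loop, which sidesteps this kind of bug. The following is only a sketch of that idea in Python rather than Scala; the function names count_words and format_table and the sample sentence are made up for illustration, and format_table assumes at least one word was counted.

```python
from collections import Counter


def count_words(text):
    """Count lower-cased words, keeping only the letters in each token."""
    counts = Counter()
    for token in text.split():
        word = "".join(ch for ch in token.lower() if ch.isalpha())
        if word:  # skip tokens that were only punctuation or digits
            counts[word] += 1
    return counts


def format_table(counts):
    """Left-justify words and right-justify counts (counts must be non-empty)."""
    # The widths are derived once, after all words have been counted.
    word_width = max(len(word) for word in counts)
    count_width = max(len(str(n)) for n in counts.values())
    return [f"{word:<{word_width}} {n:>{count_width}}"
            for word, n in sorted(counts.items())]


sample = "Call me Ishmael. Call me, call me!"
for line in format_table(count_words(sample)):
    print(line)
```

Deriving the widths from the finished map means there is no separate "longest word" state to keep consistent while reading the file.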

Related

How do I count number of words in a string in Typescript without counting extraneous spaces?

I have seen many cases where people rely on counting whitespace, which causes miscalculations.
For example, take two strings:
const str1: string = 'I love stackoverflow'
const str2: string = 'I   love   stackoverflow'
Counting the spaces and adding 1 (numOfWhitespaces + 1) gives the wrong number of words for str2, because it counts all 6 spaces.
So what would be an easy and better alternative?

The shortest would be using: str1.split(/\s+/).length
But in case a beginner wants to do it with a basic loop, here it is:
let str1: string = 'I love stackoverflow'
let numberOfSpaces: number = 0
// Start at 1 so that index - 1 is always a valid position.
for (let index = 1; index < str1.length; index++) {
    const currentChar: string = str1.charAt(index)
    const lastChar: string = str1.charAt(index - 1)
    // Count a space only when it follows a non-space character,
    // so a run of consecutive spaces is counted once.
    if (currentChar === " " && lastChar !== " ") {
        numberOfSpaces = numberOfSpaces + 1
    }
    // No other branch is needed: a space after a space, or any
    // non-space character, leaves the counter unchanged.
}
const finalNumberOfWords: number = numberOfSpaces + 1
console.log(`Number of words final are = ${finalNumberOfWords}`)
This looks similar to the whitespace-counting method, and it is, but it doesn't count the extraneous spaces (a space followed by another space).
The for loop runs over the length of the string and compares the character at the current index with the one before it. If both are spaces, nothing is counted; if the previous character is not a space and the current one is, the counter is incremented by one.
Finally, we add 1 to the counter to get the number of words.

An alternative solution would be to use a regex:
const str2: string = 'I   love   stackoverflow'
console.log(str2.split(/\s+/).length);
This treats any run of consecutive whitespace as a single separator.
Test:
console.log('I love stackoverflow'.split(/\s+/).length);
console.log('Ilovestackoverflow'.split(/\s+/).length);
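One edge case worth checking with a regex split is leading or trailing whitespace, which produces empty strings at the ends of the result. The sketch below illustrates this in Python rather than TypeScript; the function names are hypothetical, and the same filtering idea should carry over to the split(/\s+/) approach.

```python
import re


def count_words(text):
    """Count words separated by any run of whitespace."""
    # str.split() with no arguments splits on runs of whitespace and
    # drops the empty strings caused by leading or trailing whitespace.
    return len(text.split())


def count_words_regex(text):
    """Same count using a regex split, filtering out empty pieces."""
    # re.split leaves an empty string at either end when the text
    # starts or ends with whitespace, so discard those before counting.
    return len([part for part in re.split(r"\s+", text) if part])


print(count_words("I   love   stackoverflow"))
print(count_words_regex("  I love  "))
```

Without the filter, the regex split of "  I love  " yields four pieces, two of which are empty.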

CSV file parsing in C#

I want to check whether any column in a row of the CSV file is empty, and also that the number column has a value after the /. If either check fails, the entire row should not be written to the output file.
name,number,gender,country
iva 1/001 f Antartica
aaju 2/002 m russia
lax 3/ m brazil
ana 4/004 f Thailand
vis 5/005 m
For example, the 3rd and 5th rows should not be written to the output file.
            using (StreamWriter file = new StreamWriter(filepathop)) {
                for (int i = 0; i < csv.Length; i++) {
                    {
                        if (i == 0) {
                            file.WriteLine(header + "," + "num" + "," + "serial" + "," + "date");
                        }
                        else {
                            var newline = new StringBuilder();
                            string[] words = csv[i].Split(',');
                            string[] no = words[1].Split('/');
                            string number = no[0];
                            string serial = no[1];
                            newline.Append(number + "," + serial + "," + tokens[0]);
                            file.WriteLine(csv[i] + "," + newline);
                        }
                    }
                }
            }
        }
    }
}

You can test for null columns with string.IsNullOrEmpty(column) or column.Length == 0 like so:
if (!string.IsNullOrEmpty(serial) && !string.IsNullOrEmpty(country))
    file.WriteLine(csv[i] + "," + newline);
You might want to check and remove white space, too. Depends on your input.
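The row filter described above can be sketched as follows. This is an illustration in Python rather than C#, and it assumes the data rows are comma-separated like the header; the keep_row name and the four-column check are assumptions, not part of the original code.

```python
def keep_row(line):
    """Return True when no field is empty and the number has a serial after '/'."""
    # Strip surrounding white space so a field of only spaces counts as empty.
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4 or any(field == "" for field in fields):
        return False
    # partition() never raises: a missing '/' or missing serial gives "".
    number, _, serial = fields[1].partition("/")
    return number != "" and serial != ""


rows = [
    "iva,1/001,f,Antartica",
    "aaju,2/002,m,russia",
    "lax,3/,m,brazil",
    "ana,4/004,f,Thailand",
    "vis,5/005,m,",
]
kept = [row for row in rows if keep_row(row)]
print(kept)
```

With this sample input, the 3rd row (no serial after the slash) and the 5th row (empty country) are dropped.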

Replace series of Unicode characters / Python / Twitter

I am pulling text from a tweet using the Twitter API and Python 3.3, and I'm running into a problem with a tweet in which the author included three symbols.
The two flags and the thumbs up seem to be causing the problem. The following is the plain text tweet.
RT #John_Hunt07: Just voted for #marcorubio is Florida! I am ready for a New American Century!! #FLPrimary \ud83c\uddfa\ud83c\uddf8\ud83c\uddfa\ud83c\uddf8\ud83d\udc4d
The following is the code I'm using.
import json
import mysql.connector
import sys
from datetime import datetime
from MySQLCL import MySQLCL
class Functions(object):
    """This is a class for Python functions"""
    @staticmethod
    def Clean(string):
        temp = str(string)
        temp = temp.replace("'", "").replace("(", "").replace(")", "").replace(",", "").strip()
        return temp
    @staticmethod
    def ParseTweet(string):
        for x in range(0, len(string)):
            tweetid = string[x]["id_str"]
            tweetcreated = string[x]["created_at"]
            tweettext = string[x]["text"]
            tweetsource = string[x]["source"]
            tweetsource = tweetsource
            truncated = string[x]["truncated"]
            inreplytostatusid = string[x]["in_reply_to_status_id"]
            inreplytouserid = string[x]["in_reply_to_user_id"]
            inreplytoscreenname = string[x]["in_reply_to_screen_name"]
            geo = string[x]["geo"]
            coordinates = string[x]["coordinates"]
            place = string[x]["place"]
            contributors = string[x]["contributors"]
            isquotestatus = string[x]["is_quote_status"]
            retweetcount = string[x]["retweet_count"]
            favoritecount = string[x]["favorite_count"]
            favorited = string[x]["favorited"]
            retweeted = string[x]["retweeted"]
            if "possibly_sensitive" in string[x]:
                possiblysensitive = string[x]["possibly_sensitive"]
            else:
                possiblysensitive = ""
            language = string[x]["lang"]
            #print(possiblysensitive)
            print(Functions.UnicodeFilter(tweettext))
            #print(inreplytouserid)
#print("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + Functions.UnicodeFilter(tweettext) + "', " + str(truncated) + ", " + Functions.CheckNull(inreplytostatusid) + ", " + Functions.CheckNull(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + str(language) + "', '" + Functions.ToSQL(tweetcreated) + "', '" + Functions.ToSQL(tweetsource) + "', " + str(possiblysensitive) + ")")
#MySQLCL.Set("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + tweettext + "', " + str(truncated) + ", " + Functions.CheckNullNum(inreplytostatusid) + ", " + Functions.CheckNullNum(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + language + "', '" + str(Functions.FormatDate(tweetcreated)) + "', '" + str(Functions.UnicodeFilter(tweetsource)) + "', " + str(possiblysensitive) + ")")
    @staticmethod
    def ToBool(variable):
        if variable.lower() == 'true':
            return True
        elif variable.lower() == 'false':
            return False
    @staticmethod
    def CheckNullNum(var):
        if var == None:
            return "0"
        else:
            return str(var)
    @staticmethod
    def CheckNull(var):
        if var == None:
            return ""
        else:
            return var
    @staticmethod
    def ToSQL(var):
        temp = var
        temp = temp.replace("'", "")
        return str(temp)
    @staticmethod
    def UnicodeFilter(var):
        temp = var
        temp = temp.replace(chr(0x2019), "")
        temp = temp.replace(chr(0x003c), "(lessthan)")
        temp = temp.replace(chr(0x003e), "(greaterthan)")
        temp = temp.replace(chr(0xd83c), "")
        temp = temp.replace(chr(0xddfa), "")
        temp = temp.replace(chr(0xddf8), "")
        temp = temp.replace(chr(0xd83d), "")
        temp = temp.replace(chr(0xdc4d), "")
        temp = Functions.ToSQL(temp)
        return temp
    @staticmethod
    def FormatDate(var):
        temp = var
        dt = datetime.strptime(temp, "%a %b %d %H:%M:%S %z %Y")
        retdt = str(dt.year) + "-" + str(dt.month) + "-" + str(dt.day) + "T" + str(dt.hour) + ":" + str(dt.minute) + ":" + str(dt.second)
        return retdt
As you can see, I've been using the function UnicodeFilter in order to try to filter out the unicode characters in hex. The function works when dealing with single unicode characters, but when encountering multiple unicode characters placed together, this method fails and gives the following error:
'charmap' codec can't encode characters in position 107-111: character maps to 'undefined'
Do any of you have any ideas about how to get past this problem?
UPDATE: I have tried Andrew Godbehere's solution and I was still running into the same issues. However, I decided to see if there were any specific characters that were causing a problem, so I decided to print the characters to the console character by character. That gave me the error as follows:
'charmap' codec can't encode character '\U0001f1fa' in position 0: character maps to 'undefined'
Upon seeing this, I added this to the UnicodeFilter function and continued testing. I have run into multiple errors of the same kind while printing the tweets character by character. However, I don't want to have to keep making these exceptions. For example, see the revised UnicodeFilter function:
    @staticmethod
    def UnicodeFilter(var):
        temp = var
        temp = temp.encode(errors='ignore').decode('utf-8')
        temp = temp.replace(chr(0x2019), "")
        temp = temp.replace(chr(0x003c), "(lessthan)")
        temp = temp.replace(chr(0x003e), "(greaterthan)")
        temp = temp.replace(chr(0xd83c), "")
        temp = temp.replace(chr(0xddfa), "")
        temp = temp.replace(chr(0xddf8), "")
        temp = temp.replace(chr(0xd83d), "")
        temp = temp.replace(chr(0xdc4d), "")
        temp = temp.replace(chr(0x2026), "")
        temp = temp.replace(u"\U0001F1FA", "")
        temp = temp.replace(u"\U0001F1F8", "")
        temp = temp.replace(u"\U0001F44D", "")
        temp = temp.replace(u"\U00014F18", "")
        temp = temp.replace(u"\U0001F418", "")
        temp = temp.replace(u"\U0001F918", "")
        temp = temp.replace(u"\U0001F3FD", "")
        temp = temp.replace(u"\U0001F195", "")
        temp = Functions.ToSQL(temp)
        return str(temp)
I don't want to have to add a new line for every character that causes a problem. Through this method, I have been able to pass multiple tweets, but this issue resurfaces with every tweet that contains different symbols. Is there not a solution that will filter out all these characters? Is it possible to filter out all characters not in the utf-8 character set?

Try the built-in unicode encode/decode error handling functionality: str.encode(errors='ignore')
For example:
problem_string = """\
RT #John_Hunt07: Just voted for #marcorubio is Florida! I am ready for a New American Century!! #FLPrimary \ud83c\uddfa\ud83c\uddf8\ud83c\uddfa\ud83c\uddf8\ud83d\udc4d
"""
print(problem_string.encode(errors='ignore').decode('utf-8'))
Ignoring errors removes problematic characters.
> RT #John_Hunt07: Just voted for #marcorubio is Florida! I am ready for a New American Century!! #FLPrimary
Other error handling options may be of interest.
xmlcharrefreplace for instance would yield:
> RT #John_Hunt07: Just voted for #marcorubio is Florida! I am ready for a New American Century!! #FLPrimary 🇺🇸🇺🇸👍
If you require custom filtering as implied by your UnicodeFilter function, see Python documentation on registering an error handler.
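To compare the built-in error handlers side by side, here is a small sketch. It assumes the goal is plain ASCII output (a stricter target than the UTF-8 used above); the to_ascii helper and the sample string are made up for illustration.

```python
def to_ascii(text, errors):
    """Encode to ASCII with the given error handler, then decode back to str."""
    return text.encode("ascii", errors=errors).decode("ascii")


sample = "caf\u00e9 \U0001F44D ok"  # contains e-acute and a thumbs-up emoji

# "ignore" silently drops anything that cannot be encoded.
print(to_ascii(sample, "ignore"))
# "xmlcharrefreplace" substitutes a numeric character reference.
print(to_ascii(sample, "xmlcharrefreplace"))
# "backslashreplace" substitutes a Python-style escape sequence.
print(to_ascii(sample, "backslashreplace"))
```

Because every result is pure ASCII, printing it should not trigger the 'charmap' error even on a console with a limited code page.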

Python provides a useful stack trace so you can tell where errors are coming from.
Using it, you would have found that your print call is causing the exception.
print() is failing because you're running Python from the Windows console, which by default only supports your local 8-bit charmap. You can add support with: https://github.com/Drekin/win-unicode-console
You can also just write your data straight to a text file. Open the file with:
open('output.txt', 'w', encoding='utf-8')

Found the answer. The issue was that there was a range of characters in the tweets that were causing problems. Once I found the correct Unicode range for the characters, I implemented the for loop to replace any occurrence of any Unicode character within that range. After implementing that, I was able to pull thousands of tweets without any formatting or MySQL errors at all.
    @staticmethod
    def UnicodeFilter(var):
        temp = var
        temp = temp.replace(chr(0x2019), "'")
        temp = temp.replace(chr(0x2026), "")
        for x in range(127381, 129305):
            temp = temp.replace(chr(x), "")
        temp = MySQLCL.ToSQL(temp)
        return str(temp)
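A more general variant of the range-based loop is to drop every character above U+FFFF in a single pass, instead of calling replace once per code point. This is only a sketch: it assumes the problematic characters are all outside the Basic Multilingual Plane (as emoji and flag symbols are), and strip_non_bmp is a hypothetical name.

```python
def strip_non_bmp(text):
    """Remove every character whose code point is above U+FFFF."""
    # Emoji and regional-indicator (flag) symbols live above U+FFFF,
    # so this drops them without listing individual ranges.
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


tweet = "Just voted! \U0001F1FA\U0001F1F8\U0001F44D"
cleaned = strip_non_bmp(tweet)
print(len(tweet), len(cleaned))
```

Characters such as the right single quotation mark (U+2019) are inside the BMP and pass through unchanged, so they would still need the explicit replacements shown above.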

String search logic - not language specific

I have to check data entry on an address field. The client does not want users to use abbreviations like Rd. or Rd for road, ave or ave. for avenue, etc. I have no problem with most of the terms. Where I have issues is with 'Ave', let's say. If I look for ' AVE ', that's fine, but it will not pick up ' AVE' at the end of the string, and if I look for ' AVE' it will give a false positive on ' Avenue', since it finds ' Ave' within that string. Does anyone have an idea of how I can go about this?
Thank you for any help.
Norst
Although the Q: is not language specific, here is how I'm going about this in JS:
//function to check address for rd rd. ave ave. st st. etc
function checkaddy() {
    //array of items to look for
    var watchfor = Array();
    watchfor[0] = " RD";
    watchfor[1] = " RD.";
    watchfor[2] = " AVE ";
    watchfor[3] = " AVE.";
    watchfor[4] = " ST ";
    watchfor[5] = " ST.";
    watchfor[6] = " BLVD.";
    watchfor[7] = " CRT ";
    watchfor[8] = " CRT.";
    watchfor[9] = " CRES ";
    watchfor[10] = " CRES.";
    watchfor[11] = " E ";
    watchfor[12] = " E.";
    watchfor[13] = " W ";
    watchfor[14] = " W.";
    watchfor[15] = " N ";
    watchfor[16] = " N.";
    watchfor[17] = " S ";
    watchfor[18] = " S.";
    watchfor[19] = " PKWY ";
    watchfor[20] = " PKWY.";
    watchfor[21] = " DR ";
    watchfor[22] = " DR.";
    //get what the user has in the address box
    var addcheck = $("#address").val();
    //upper case the address to check
    addcheck = addcheck.toUpperCase();
    //check to see if any of these terms are in our address
    watchfor.forEach(function(i) {
        if (addcheck.search(watchfor[i]) > 0 ) {
            alert("Found One!");
        }
    });
}

Perhaps you need to look for word boundary character \b. Here are some Ruby examples:
irb(main):002:0> " Avenue" =~ / AVE\b/i
=> nil
irb(main):003:0> " Ave" =~ / AVE\b/i
=> 0
irb(main):005:0> " Ave" =~ /\bAVE\b/i
=> 1
Notice how " Avenue" doesn't match while " Ave" does. Also notice how \b behaves: the pattern with a leading space matches at index 0, while the pattern with a leading \b matches at index 1.
There are other character classes in regular expressions as well, so all you need to do is formulate the correct regular expressions for your problem set.
I hope that helps.

Why would your client insist on spelling out names? The United States Postal Service actually encourages abbreviations. Not only that, they prefer addresses to be in all uppercase letters and no more than 5 lines. Such specifications are what their automated sorters were built for. But I digress.
To actually answer your question, you may consider the following code. There was a mistake in your forEach callback: you were using i as an index, but forEach passes the array entry itself. I modified it below. Also, because we're passing a string to the RegExp constructor, the \ in \b has to be escaped, so we write two backslashes inside the string. Because we're using the \b construct for word boundaries, we don't need to add extra entries with periods to the test array. I hope you find this helpful.
//array of items to look for
var watchfor = ['RD','AVE','ST','BLVD','CRT','CRES','E','W','N','S','PKWY','DR'];
//function to check address for rd rd. ave ave. st st. etc
function checkaddy(address) {
    //check to see if any of these terms are in our address
    watchfor.forEach(function(entry) {
        var patt1 = RegExp('.*\\b' + entry + '\\b.*','gim');
        if (patt1.test(address)) {
            document.write("Found " + entry);
        }
    });
}
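Since the question is not language-specific, the same word-boundary idea can be sketched in Python with a single compiled pattern. The names ABBREVIATIONS, PATTERN and find_abbreviations are hypothetical, and the list simply mirrors the one in the answer above.

```python
import re

# Abbreviations to flag, mirroring the watchfor list.
ABBREVIATIONS = ["RD", "AVE", "ST", "BLVD", "CRT", "CRES",
                 "E", "W", "N", "S", "PKWY", "DR"]

# \b on both sides means "AVE" matches as a whole word, at the end of
# the string or before a period, but never inside "AVENUE".
PATTERN = re.compile(r"\b(?:" + "|".join(ABBREVIATIONS) + r")\b",
                     re.IGNORECASE)


def find_abbreviations(address):
    """Return the abbreviations found in the address, upper-cased, in order."""
    return [match.group(0).upper() for match in PATTERN.finditer(address)]


print(find_abbreviations("12 Main Ave"))
print(find_abbreviations("12 Main Avenue"))
```

One caveat with any purely textual check: "St" can also stand for "Saint" (as in a street named after a saint), so a flagged term may still be legitimate.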

How to go to next line while using a loop to setText in JTextArea?

This is my code
for (int m=0; m < i ; m++){
    ta1.setText( s[m].getName().toString() + ", " + s[m].getProgramName().toString() + ", " + s[m].getUni1() + ", " + s[m].getUni2() + ", " + s[m].getUni3() + ", " );
}
It's supposed to print a line for each element of an array of students (called s) into a JTextArea (called ta1). The problem is that it only ever prints the last student in the array.
I need to print each student on a new line. Could anyone help me sort it out?

When you call setText inside the loop, each call replaces the text set by the previous one, so only the last student remains.
Instead, build up one string, adding a newline ("\n") to the end of each entry, and set it once after the loop:
String text = "";
for (int m = 0; m < i; m++) {
    text += s[m].getName().toString() + ", " + s[m].getProgramName().toString() + "\n";
}
ta1.setText(text);

setText overwrites whatever is the current text.
You need append instead; you also need a "\n" at the end of a line.
