Efficient splitting of elements in a field - string

I have a field in a text file exported from a database. The field contains addresses but sometimes they are quite long and the database allows them to contain multiple lines. When exported, the newline character gets replaced with a dollar sign like this:
first part of very long address$second part of very long address$third part of very long address
Not every address has multiple lines and no address contains more than three lines. The length of each line is variable.
I'm massaging the data for import into MS Access which is used for a mailmerge. I want to split the field on the $ sign if it's there but if the field only contains 1 line, I want to set my two extra output fields to a zero length string so that I don't wind up with blank lines in the address when it gets printed.
I have an awk file that's working correctly on all the other data in the textfile but I need to get this last bit working. I tried the below code. Aside from the fact that I get a syntax error at the else, I'm not sure this is a good way to do what I want. This is being done with gawk on Windows.
BEGIN { FS = "|" }
$1 != "HEADER" {
if ($6 ~ /\$/)
split($6, arr, "$")
address = arr[1]
addresstwo = arr[2]
addressthree = arr[3]
addressLength = length(address)
addressTwoLength = length(addresstwo)
addressThreeLength = length(addressthree)
else {
address = $6
addressLength = length($6)
addresstwo = ""
addressTwoLength = length(addresstwo)
addressthree = ""
addressThreeLength = length(addressthree)
}
printf("%*s\t%*s\t\%*s\n",
addressLength, address, addressTwoLength, addresstwo, addressThreeLength, addressthree)
}
EDIT:
Sorry about that. Here's a sample
HEADER|0000000130|0000527350|0000171250|0000058000|0000756600|0000814753|0000819455|100106
rec1|ILL/COLORADO COLLEGE$TUTT LIBRARY|1021 N CASCADE$COLORADO SPRINGS, CO 80903|
rec2|ILL /PIKES PEAK LIBRARY DISTRICT|20 N. CASCADE AVE. / PO BOX 1579$COLORADO SPRINGS, CO 80903|
rec3|DOE,JOHN|PO Box 8034|
rec4|ILL/GEORGIA INSTITUTE OF TECHNOLOGY|INFORMATION DELIVERY DEPT$704 CHERRY ST$ATLANTA, GA 30332-0900
I match only lines without HEADER in them. I need to split the textstrings on the $ signs. The string between the pipes should not be padded (which is why I was trying to get the length in my original code). For this example, there are 6 output fields and any field for which there is no data is simply an empty string (also what I was trying to do in the code).
rec1|ILL/COLORADO COLLEGE|TUTT LIBRARY|1021 N CASCADE|COLORADO SPRINGS, CO 80903||
rec2|ILL /PIKES PEAK LIBRARY DISTRICT||20 N. CASCADE AVE. / PO BOX 1579|COLORADO SPRINGS, CO 80903||
rec3|DOE,JOHN||PO Box 8034|||
rec4|ILL/GEORGIA INSTITUTE OF TECHNOLOGY||INFORMATION DELIVERY DEPT|704 CHERRY ST|ATLANTA, GA 30332-0900|
Hope that helps! Let me know if this still isn't clear.

BEGIN { FS = "|" }
$1 != "HEADER" {
    for (i = gsub(/\$/, "\t", $6); i < 3; i++)
        $6 = $6 "\t"
    print $6
}
I'm not really sure if I got your requirements right though.
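For comparison, here is a minimal Python sketch of the split-and-pad idea. It follows the sample data in the question (record id, a name field of up to two lines, an address field of up to three lines) rather than field 6 of the original script. The function names split_multiline and convert are made up for illustration.

```python
def split_multiline(field, parts, sep="$"):
    """Split `field` on `sep` and pad with empty strings up to `parts` items."""
    pieces = field.split(sep)
    # A negative multiplier yields an empty list, so longer fields pass through.
    return pieces + [""] * (parts - len(pieces))


def convert(line):
    """Turn 'rec|name|address' into six pipe-separated fields plus a trailing pipe."""
    fields = line.rstrip("\n").split("|")
    rec, name, address = fields[0], fields[1], fields[2]
    out = [rec] + split_multiline(name, 2) + split_multiline(address, 3)
    return "|".join(out) + "|"


def convert_all(lines):
    """Convert every record except the HEADER line."""
    return [convert(line) for line in lines if not line.startswith("HEADER")]
```

Missing parts come out as empty strings, so no padding or length bookkeeping is needed.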

Related

AWK - enclose found strings with symbols in one command

I have a large body of text and I print only lines that contain one of several strings. Each line can contain more than one string.
Example of the rule:
(house|mall|building)
I want to mark the found string for making the result easier to read.
Example of the result I want:
New record: Two New York houses under contract for nearly $5 million each.
New record: Two New York #house#s under contract for nearly $5 million each.
I know I can find the location, trim, add marker, add string etc.
I am asking if there is a way to mark the found string in one command.
Thanks.
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html
gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression ...
sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of the
extended regular expression ERE in string in and return the number of
substitutions. An ampersand ( '&' ) appearing in the string repl shall
be replaced by the string from in that matches the ERE ...
BEGIN {
    r = "house|mall|building"
    s = "Two New York houses under contract for nearly $5 million each."
    gsub(r, "#&#", s)
    print s
}
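The same one-step marking can be sketched in Python, where the whole match is available through the match object instead of the ampersand. The helper name mark and its marker parameter are assumptions made for this example.

```python
import re


def mark(text, words, marker="#"):
    """Wrap every occurrence of any word in `words` with `marker`."""
    # Escape each word so it is matched literally, then join as alternatives.
    pattern = "|".join(re.escape(word) for word in words)
    # A function replacement avoids backslash handling in the marker itself.
    return re.sub(pattern, lambda m: marker + m.group(0) + marker, text)
```

Calling mark(s, ["house", "mall", "building"]) on the sample sentence gives the "#house#s" result asked for in the question.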

Display lines containing duplicates within subset of string

How would I find duplicate lines by matching only one part of each line and not the whole line itself?
Take for example the following text:
uid=154163(j154163) gid=10003(pemcln) groups=10003(pemcln) j154163
uid=152084(k152084) gid=10003(pemcln) groups=10003(pemcln) k152084
uid=154163(b153999) gid=10003(pemcln) groups=10003(pemcln) b153999
uid=154226(u154226) gid=10003(pemcln) groups=10003(pemcln) u154226
I would like to show only the 1st and 3rd lines, as they have the same duplicate UID value "154163".
The only ways I know how would match the whole line and not the subset of each one.
This code looks for the ID from each line. If any ID appears more than once, its lines are printed:
$ awk -F'[=(]' '{cnt[$2]++;lines[$2]=lines[$2]"\n"$0} END{for (k in cnt){if (cnt[k]>1)print lines[k]}}' file
uid=154163(j154163) gid=10003(pemcln) groups=10003(pemcln) j154163
uid=154163(b153999) gid=10003(pemcln) groups=10003(pemcln) b153999
How it works:
-F'[=(]'
awk separates input files into records (lines) and separates the records into fields. Here, we tell awk to use either = or ( as the field separator. This is done so that the second field is the ID.
cnt[$2]++; lines[$2]=lines[$2]"\n"$0
For every line that is read in, we keep a count, cnt, of how many times that ID has appeared. Also, we save all the lines associated with that ID in the array lines.
END{for (k in cnt){if (cnt[k]>1)print lines[k]}}
After we reach the end of the file, we go through each observed ID and, if it appeared more than once, its lines are printed.
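The same count-and-collect approach, sketched in Python for readers who want to compare. The function name duplicate_lines is made up, and the regular expression assumes every line starts with uid=<digits> as in the sample.

```python
import re
from collections import defaultdict


def duplicate_lines(lines):
    """Return all lines whose uid value appears on more than one line."""
    groups = defaultdict(list)
    for line in lines:
        match = re.match(r"uid=(\d+)", line)
        if match:
            # Collect every line under its uid, like the awk `lines` array.
            groups[match.group(1)].append(line)
    # Keep only the uids seen more than once, in first-seen order.
    return [line for group in groups.values() if len(group) > 1 for line in group]
```

Like the awk version, this holds every line in memory until the input is exhausted.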
Someone has already provided an awk script that will do what you need, assuming the files are small enough to fit into memory (they store all the lines until the end then decide what to output). There's nothing wrong with it, indeed it could be considered the canonical awk solution to this problem. I provide this answer really for those cases where awk may struggle with the storage requirements.
Specifically, if you have larger files that cause problems with that approach, the following awk script, myawkscript.awk, will handle it, provided you first sort the file so it can rely on the fact related lines are together. In order to ensure it's sorted and that you can easily get at the relevant key (using = and ( as field separators), you call it with:
sort <inputfile | awk -F'[=(]' -f myawkscript.awk
The script is:
state == 0 {
    if (lastkey == $2) {
        printf "%s", lastrec;
        print;
        state = 1;
    };
    lastkey = $2;
    lastrec = $0"\n";
    next;
}
state == 1 {
    if (lastkey == $2) {
        print;
    } else {
        lastkey = $2;
        lastrec = $0"\n";
        state = 0;
    }
}
It's basically a state machine where state zero is scanning for duplicates and state one is outputting the duplicates.
In state zero, the relevant part of the current line is checked against the previous and, if there's a match, it outputs both and switches to state one. If there's no match, it simply moves on to the next line.
In state one, it checks each line against the original in the set and outputs it as long as it matches. When it finds one that doesn't match, it stores it and reverts to state zero.
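A rough Python equivalent of this streaming idea uses itertools.groupby, which also relies on related lines being adjacent, so the input must be sorted first. This swaps the hand-written state machine for a library helper; the names uid_key and duplicates_sorted are made up for the sketch.

```python
import re
from itertools import groupby


def uid_key(line):
    """Extract the uid digits from a line, or None if the line has no uid."""
    match = re.match(r"uid=(\d+)", line)
    return match.group(1) if match else None


def duplicates_sorted(lines):
    """Yield lines whose uid repeats; `lines` must already be sorted."""
    for key, group in groupby(lines, key=uid_key):
        first = next(group)
        second = next(group, None)
        if key is None or second is None:
            continue  # a single line for this uid, or no uid at all
        yield first
        yield second
        yield from group  # any further lines with the same uid
```

Only one run of equal keys is looked at at a time, so memory use stays small once the sorting is done elsewhere.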

Programmatically determining the difference between a comma-separated list and a paragraph

I am working on a data migration where on the old system the users were allowed to enter their interests in a large text-field with no formatting instructions followed at all. As a result some wrote in bio format and others wrote in comma-separated list format. There are a few other formats, but these are the primary ones.
Now I know how to identify a comma-separated list (CSL). That is easy enough. But how about determining if a string is a CSL (maybe a short one with two terms or phrases) or just a paragraph someone wrote that contains a comma?
One thought that I have is to automatically ignore strings that contain punctuation and strings that don't contain commas. However, I am concerned that this won't be enough or will leave much to be desired. So I would like to query the community to see what you guys think. In the mean time I will try out my idea.
UPDATE:
Ok guys, I have my algorithm. Here it is below...
MY CODE:
//Process our interests text field and get the list of interests
function process_interests($interests)
{
$interest_list = array();
if ( preg_match('/(\.)/', $interests) 0 && $word_cnt > 0)
$ratio = $delimiter_cnt / $word_cnt;
//If delimiter is found with the right ratio then we can go forward with this.
//Should not be any more than 5 words per delimiter (ratio = delimiter / words ... this must be at least 0.2)
if (!empty($delimiter) && $ratio > 0 && $ratio >= 0.2)
{
//Check for label with colon after it
$interests = remove_colon($interests);
//Now we make our array
$interests = explode($delimiter, $interests);
foreach ($interests AS $val)
{
$val = humanize($val);
if (!empty($val))
$interest_list[] = $val;
}
}
}
return $interest_list;
}
//Cleans up strings a bit
function humanize($str)
{
if (empty($str))
return ''; //Lets not waste processing power on empty strings
$str = remove_colon($str); //We do this one more time for inline labels too.
$str = trim($str); //Remove unused bits
$str = ltrim($str, ' -'); //Remove leading dashes
$str = str_replace('  ', ' ', $str); //Remove double spaces, replace with single spaces
$str = str_replace(array(".", "(", ")", "\t"), '', $str); //Replace some unwanted junk
if ( strtolower( substr($str, 0, 3) ) == 'and')
$str = substr($str, 3); //Remove leading "and" from term
$str = ucwords(preg_replace('/[_]+/', ' ', strtolower(trim($str))));
return $str;
}
//Check for label with colon after it and remove the label
function remove_colon($str)
{
//Check for label with colon after it
if (strstr($str, ':'))
{
$str = explode(':', $str); //If we find it we must remove it
unset($str[0]); //To remove it we just explode it and take everything to the right of it.
$str = trim(implode(':', $str)); //Sometimes colons are still used elsewhere, I am going to allow this
}
return $str;
}
Thank you for all your help and suggestions!
You could, in addition to the filtering you mentioned, create a ratio of the number of commas to the string length. In CSLs this ratio will tend to be high; in paragraphs it will tend to be low. You could set some kind of threshold and choose based on whether or not the entry has a high enough ratio. Entries with ratios close to the threshold could be marked as prone to error and then checked by a moderator.
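A minimal sketch of that threshold idea in Python. It measures commas per word rather than per character, following the 0.2 ratio used in the question's code; the margin of 0.05 for the "needs review" band and the name classify are assumptions picked only for illustration.

```python
def classify(text, threshold=0.2, margin=0.05):
    """Label text as 'list', 'paragraph', 'review' or 'empty' by comma density."""
    words = text.split()
    if not words:
        return "empty"
    ratio = text.count(",") / len(words)
    # Entries close to the threshold are flagged for a human to check.
    if abs(ratio - threshold) < margin:
        return "review"
    return "list" if ratio > threshold else "paragraph"
```

Both numbers would need tuning against a sample of the real data before being trusted.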

Find all occurrences of a string in a sequence of strings in F#?

String processing in C# and VB.NET is easy for me, but understanding how to do the same in F# is not so easy. I am reading two Apress F# books (Foundations and Expert). Most samples are number crunching and, I think, include very little string manipulation. In particular, I am looking for samples that use seq { sequence-expression } and Lists.
I have a C# program I want to convert to F#. Here is what it does:
Open a txt file
split file paragraphs, look for CRLF between paragraphs
split paragraph lines, look for . ! ? between lines
split line words, look for " " space between words
output number of paragraphs, lines and words
Loop over the collection of words, find and count all occurrences of a string within the collection, mark the locations of each word found.
Here is a simple example of what I can do in C#, but not yet in F#.
Suppose this is a text file:
Order, Supreme Court, New York County
(Paul G Someone), entered March 18,
2008, which, in an action for personal
injuries sustained in a trip and fall
over a pothole allegedly created by
the negligence of defendants City or
Consolidated McPherson, and a third-party
action by Consolidated McPherson against
its contractor (Mallen), insofar as
appealed from, denied, as untimely,
Mallen's motion for summary judgment
dismissing the complaint and
third-party complaint, unanimously
affirmed, without costs.
Parties are afforded great latitude in
charting their procedural course
through the courts, by stipulation or
otherwise. Thus, we affirm the denial
of Mallen's motion as untimely since
Mallen offered no excuse for the late
filing.
I get this output:
2 Paragraphs
3 Lines
109 Words
Found Tokens: 2
Token insofar: ocurrence(s) 1: position(s): 52
Token thus: ocurrence(s) 1: position(s): 91
Lines should have been called Sentences :(
There are several tokens. I'd say more than 100, grouped by class. I have to iterate over the same text several times trying to match each and every token. Here are portions of the code. They show how I split sentences and put them in a ListBox, which makes it easy to get the item count. This works for paragraphs, sentences and tokens. They also show how I am relying on for and foreach. It is this approach I want to avoid by using, if possible, seq { sequence-expression } and Lists, Seq.iter or List.iter, and whatever match token ... with expressions are necessary.
/// <summary>
/// split the text into sentences and displays
/// the results in a list box
/// </summary>
private void btnParseText_Click(object sender, EventArgs e)
{
lstLines.Items.Clear();
ArrayList al = SplitLines(richTextBoxParagraphs.Text);
for (int i = 0; i < al.Count; i++)
//populate a list box
lstLines.Items.Add(al[i].ToString());
}
/// <summary>
/// parse a body of text into sentences
/// </summary>
private ArrayList SplitLines(string sText)
{
// array list to hold the sentences
ArrayList al = new ArrayList();
// split the lines regexp
string[] splitLines =
Regex.Split(sText, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
// loop the sentences
for (int i = 0; i < splitLines.Length; i++)
{
string sOneLine =
splitLines[i].Replace(Environment.NewLine, string.Empty);
al.Add(sOneLine.Trim());
}
// update statistics
lblLineCount.Text = "Line Count: " +
GetLineCount(splitLines).ToString();
// words
lblWordCount.Text = "Word Count: " +
GetWordCount(al).ToString();
// tokens
lblTokenCount.Text = "Token Count: " +
GetTokenCount(al).ToString();
// return the arraylist
return al;
}
/// <summary>
/// count of all words contained in the ArrayList
/// </summary>
public int GetWordCount(ArrayList allLines)
{
// return value
int rtn = 0;
// iterate through list
foreach (string sLine in allLines)
{
// empty space is the split char
char[] arrSplitChars = {' '};
// create a string array and populate
string[] arrWords = sLine.Split(arrSplitChars, StringSplitOptions.RemoveEmptyEntries);
rtn += arrWords.Length;
}
// return word count
return rtn;
}
In fact, it is a very simple Windows Application. A form with one RichTextBox and three ListBoxes(paragraphs, lines, tokens found), labels to display output and one button.
Brian has a good start, but functional code will focus more on "what" you're trying to do than "how".
We can start out in a similar way:
open System
open System.Text.RegularExpressions
let text = @"Order, Supreme Court, New York County (Paul G Someone), entered..."
let lines = text.Split([|Environment.NewLine|], StringSplitOptions.None)
First, let's look at paragraphs. I like Brian's approach to count blank lines separating paragraphs. So we filter to find only blank lines, count them, then return our paragraph count based on that value:
let numParagraphs =
    let blankLines =
        lines
        |> Seq.filter (fun line -> Regex.IsMatch(line, @"^\s*$"))
        |> Seq.length
    blankLines + 1
For sentences, we can view the full text as a sequence of characters and count the number of sentence-ending characters. Because it's F#, let's use pattern matching:
let numSentences =
    let isSentenceEndChar c =
        match c with
        | '.' | '!' | '?' -> true
        | _ -> false
    text
    |> Seq.filter isSentenceEndChar
    |> Seq.length
Matching words can be as easy as a simple regular expression, but could vary with how you want to handle punctuation:
let words = Regex.Split(text, @"\s+")
let numWords = words.Length
numParagraphs |> printfn "%d paragraphs"
numSentences |> printfn "%d sentences"
numWords |> printfn "%d words"
Finally, we define a function to print token occurrences, which is easily applied to a list of tokens:
let findToken token =
    let tokenMatch (word : string) = word.Equals(token, StringComparison.OrdinalIgnoreCase)
    words |> Seq.iteri (fun n word ->
        if tokenMatch word then
            printfn "Found %s at word %d" word n)
let tokensToFind = ["insofar"; "thus"; "the"]
tokensToFind |> Seq.iter findToken
Note that this code does not find "thus" because of its trailing comma. You will likely want to adjust either how words is generated or tokenMatch is defined.
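One way to make the word list ignore punctuation is to extract runs of word characters instead of splitting on whitespace. The Python sketch below shows the idea, not the F# fix itself; find_tokens is a made-up name and positions are zero-based word indexes.

```python
import re


def find_tokens(text, tokens):
    """Map each token (lower-cased) to the word positions where it occurs."""
    # \w+ drops punctuation, so "Thus," is seen as the word "Thus".
    words = re.findall(r"\w+", text)
    positions = {token.lower(): [] for token in tokens}
    for index, word in enumerate(words):
        key = word.lower()
        if key in positions:
            positions[key].append(index)
    return positions
```

One side effect to be aware of: \w+ also splits "Mallen's" into "Mallen" and "s", which changes the word count.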
You should post your C# code in the question (sounds a bit like homework, people will have more faith if you demonstrate you've already done the effort in one language and are really trying to learn more about another).
There isn't necessarily much F#-specific here, you can do this pretty similarly in any .Net language. There are a number of strategies you can use, for example below I use regular expressions for lexing out the words... only a couple F# idioms below, though.
open System
open System.Text.RegularExpressions
let text = @"Order, Supreme Court, New York County (Paul G Someone), entered
March 18, 2008, which, in an action for personal injuries sustained in a
trip and fall over a pothole allegedly created by the negligence of
defendants City or Consolidated McPherson, and a third-party action by
Consolidated McPherson against its contractor (Mallen), insofar as appealed
from, denied, as untimely, Mallen's motion for summary judgment dismissing
the complaint and third-party complaint, unanimously affirmed, without costs.
Parties are afforded great latitude in charting their procedural course
through the courts, by stipulation or otherwise. Thus, we affirm the denial
of Mallen's motion as untimely since Mallen offered no excuse for the late
filing."
let lines = text.Split([|'\n'|])
// If was in file, could use
//let lines = System.IO.File.ReadAllLines(@"c:\path\filename.txt")
// just like C#. For this example, assume have giant string above
let fullText = String.Join(" ", lines)
let numParagraphs =
    let mutable count = 1
    for line in lines do
        // look for blank lines, assume each delimits another paragraph
        if Regex.IsMatch(line, @"^\s*$") then
            count <- count + 1
    count
let numSentences =
    // start at zero: each sentence-ending character closes one sentence
    let mutable count = 0
    for c in fullText do
        if c = '.' || c = '!' || c = '?' then
            count <- count + 1
    count
let words =
    let wordRegex = new Regex(@"\b(\w+)\b")
    let fullText = String.Join(" ", lines)
    [| for m in wordRegex.Matches(fullText) do
        yield m.Value |]
printfn "%d paragraphs" numParagraphs
printfn "%d sentences" numSentences
printfn "%d words" words.Length
let Find token =
    words |> Seq.iteri (fun n word ->
        if 0=String.Compare(word, token,
                            StringComparison.OrdinalIgnoreCase) then
            printfn "Found %s at word %d" word n)
let tokensToFind = ["insofar"; "thus"; "the"]
for token in tokensToFind do
    Find token
Could you post your C#-program? (Edit your question)
I think you can implement this in a very similar way in F# unless your original code is heavily based on changing variables (for which I don't see reasons in your problem description).
In case you used String.Split in C#: It's basically the same thing:
open System
let results = "Hello World".Split [|' '|]
let results2 = "Hello, World".Split ([| ", "|], StringSplitOptions.None)
In order to concatenate the resulting sequences, you can combine yield and yield!.
Abstract example
let list = [ yield! [1..8]; for i in 3..10 do yield i * i ]

C# Formatting - How to correctly format a name? I.e. forename or surname

I'm currently being visually assaulted by all of the names that are being displayed and entered on one of my systems. Basically, users have use of an on-screen keyboard and don't tend to write things in neatly! I.e. John Smith ends up getting entered as JOHN SMITH or john smith.
I want a way to neatly enter names and display them. I've written a method that goes through all the names and does just this, but it's about 20 lines of code and not very efficient.
Is there a good way of achieving this? I have tried .ToTitleCase(), but it doesn't work for cases such as O'Brien and McCarthy. Is there anything out there that can do this nicely? My code at the moment basically has a list of special cases and goes through and manipulates the names if they contain a special case... It's not the most efficient thing in the world though.
Thanks in advance.
Does it really matter? If a user doesn't care whether their name is all upper case or all lower case then I'd suggest that you don't need to worry about that either.
Users who do care about how their name is capitalised will presumably enter their name with care.
If you start to mess around with capitalisation then there's the risk of getting it wrong and offending a user.
Surely there are other aspects of the system that warrant more attention...
As you've already suggested there is no real easy way to achieve this without having to handle the special cases that always get thrown up with names.
This question has several suggestions that may be of help to you:
How do I capitalize first letter of first name and last name in C#?
I had a client insistent on some level of auto format.
The following code resolved all of the previous posters' examples correctly.
We only run it once and set flags on the form so the user, if frustrated, can override the auto settings.
Feedback has actually been very positive.
Hope this helps someone out there.
public string FormalFormat(string inString)
{
string outString = string.Empty;
string _ErrorMessage = string.Empty;
try
{
// Formal Format is made for names and addresses to assure
// proper formatting and capitalization
if (string.IsNullOrEmpty(inString))
{
return string.Empty;
}
inString = inString.Trim();
if (string.IsNullOrEmpty(inString))
{
return string.Empty;
}
// see if this is a word or a series of words
//if(inString.IndexOf(" ") > 0)
//{
// Break out each word in the string.
char[] charSep = { ' ' };
string[] aWords = inString.Split(charSep);
int i = 0;
int CapAfterHyphen = 0;
for (i = 0; i < aWords.Length; i++)
{
string Word = aWords[i].Trim();
CapAfterHyphen = Word.IndexOf("-");
char[] chars = Word.ToCharArray();
if (chars.Length > 3)
{
if (Char.IsLower(chars[1]) && Char.IsUpper(chars[2]))
{
Word = Word.Substring(0, 1).ToUpper() + Word.Substring(1, 1).ToLower() + Word.Substring(2, 1).ToUpper() + Word.Substring(3).ToLower();
}
else
{
Word = Word.Substring(0, 1).ToUpper() + Word.Substring(1).ToLower();
}
}
if (CapAfterHyphen > 0)
{
Word = Word.Substring(0, CapAfterHyphen + 1) + Word.Substring(CapAfterHyphen + 1, 1).ToUpper() + Word.Substring(CapAfterHyphen + 2);
}
if (i > 0)
{
outString += " " + Word;
}
else
{
outString = Word;
}
}
}
catch (Exception e)
{
outString = inString;
_ErrorMessage = e.Message;
}
return outString;
}
Problems will arise when Chinese, Japanese, Arabic, Hebrew and many other people from everywhere join your system...
Think global. It's already hard to do with Irish people :P
Perhaps show your users how their name will be displayed; they'll be a bit more careful.
Also consider that in some cultures the surname comes first, followed by the given name, and that some people simply do not have a surname.
Additionally some names are more than just two or three words. Consider:
Manuel de la Cruz
Shawn van DeMark
Alice St. Claire
Anna Eastman-Smith
Po Yin
Ho Chi Minh
Julia Running Bear
Talks to Spirits
(The last was a Native American given name without a surname.)
I believe formatting matters. Badly formatted names look bad in a list or on an envelope (at least in my eyes). I always feel the urge to format them correctly. It's almost an obsession of mine.
So DanD, this is how I do it: When a user enters their full name in my app, I trim() all the names, remove double spaces and then format all the names in Proper Case. I then display the names to the user with a prompt like: "Did we format your names correctly?" and give the user the opportunity to correct the formatting. After this, I just save the names.
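A minimal Python sketch of the trim, collapse and Proper Case steps, assuming a simple rule that capitalizes each space-separated word and each hyphen-separated part. The name tidy_name is made up, and the confirmation prompt is left to the surrounding application.

```python
def tidy_name(raw):
    """Trim, collapse repeated spaces, and capitalize each part of a name."""
    words = raw.split()  # split() with no argument trims and collapses whitespace
    tidied = []
    for word in words:
        # Capitalize both halves of hyphenated names such as Eastman-Smith.
        parts = [part.capitalize() for part in word.split("-")]
        tidied.append("-".join(parts))
    return " ".join(tidied)
```

This rule still turns McCarthy into "Mccarthy" and O'Brien into "O'brien", which is why letting the user correct the result matters.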
Good luck.
