Split Long Text into meaningful sentences with certain length using C# - c#-4.0

I would like to split a long text into pieces of at most a certain length, breaking only at a dot followed by whitespace or a dot followed by a newline (\n).
For example, if I have a long text of 3456 characters, I want to split it into several texts of 1000 characters, or as close to that as possible, where each text ends with a complete, meaningful sentence.
The reason is that I am using an API that accepts only 1000 characters or fewer for data conversion, but some of my texts are longer than that. I therefore want to split them so that no text exceeds 1000 characters and each one ends at the end of a sentence, that is, at a dot followed by whitespace or a dot followed by a newline (\n).
I'm working with C# .NET.
Thanks in advance.

Something like this. Obviously, replace "someText" with your data, and set shardLength to 1000 for your example. The code checks for a sentence longer than the block size, but raising the actual error is left to you.
It currently handles newlines by effectively ignoring them: it only splits on "."
This means that sentences that end in ".\n" will be split after the ".", and the "\n" will be at the start of the next sentence.
The advantage here is that if you pass this to your API, you should be able to concatenate the results and retain the newlines in the appropriate places (assuming the API handles newlines).
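using System;
using System.Text.RegularExpressions;

// Wrapper class added so the method compiles; the name is arbitrary
public static class Splitter
{
    public static void BlockSplitter()
    {
        // Verbatim string: the continuation lines are left unindented so that
        // no extra whitespace ends up in the text
        String someText = @"This is some text.
The quick brown fox jumps over the lazy dog. Testing 1 2 3.
Sentence with no fullstop";
        // Zero-width split after every "." so the dot stays with its sentence
        string delimiters = @"(?<=\.)";
        String[] sentences = Regex.Split(someText, delimiters);
        String shard = String.Empty;
        int shardLength = 45;
        foreach (String sentence in sentences)
        {
            if (sentence.Length > shardLength)
            {
                // Raise an exception here, as the sentence is longer than the shard length
            }
            if ((shard.Length + sentence.Length) <= shardLength)
            {
                shard += sentence;
            }
            else
            {
                // The shard is full: output it and start a new one
                Console.WriteLine(shard);
                shard = sentence;
            }
        }
        // Output the final shard
        Console.WriteLine(shard);
    }
}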
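For comparison, here is a minimal sketch of the same greedy idea in Java rather than C#, this time splitting only where a dot is followed by whitespace (space or newline), as the question asks. The class name SentenceChunker and the method chunk are made up for illustration, and throwing on an over-long sentence is one possible policy, not the only one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library
public class SentenceChunker {

    // Splits text into chunks of at most maxLength characters,
    // breaking only after a "." that is followed by whitespace.
    static List<String> chunk(String text, int maxLength) {
        // Zero-width split: the dot stays with its sentence and the
        // following whitespace starts the next one, so nothing is lost.
        String[] sentences = text.split("(?<=\\.)(?=\\s)");
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String sentence : sentences) {
            if (sentence.length() > maxLength) {
                // Assumed policy: refuse sentences that cannot fit in any chunk
                throw new IllegalArgumentException(
                        "Sentence longer than " + maxLength + " characters");
            }
            if (current.length() + sentence.length() > maxLength) {
                // Current chunk is full: store it and start a new one
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Calling SentenceChunker.chunk(text, 1000) would give the pieces to send to the API; because the split is zero-width, concatenating the chunks reproduces the original text.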

Related

How to indent and outdent on line beginning and end with Flutter TextFormField

I am using a Flutter TextFormField Component to capture some multiline text. I would like to process that text string by breaking it down into a list of words using the following.
List<String> splitText = text.split(' ');
However, I realised that if the user presses return to start a new line, splitting on spaces no longer works: the last word on one line and the first word on the next line are treated as a single word.
If I could detect the last word on a line and the first word on the next line and pad them with a space, my issue would be solved, but I don't know how to detect the end and start of lines in a TextFormField.
I came up with one option which is probably not ideal, but maybe someone can comment.
Convert the String to runes, then look for the newline rune (10) and add the space rune (32) before and after it for good measure. Problem is I still need to figure out how to convert it back.
List<int> inputTextRunes = text.runes.toList();
List<int> outputTextInRunes = [];
for (var i = 0; i < inputTextRunes.length; i++) {
  if (inputTextRunes[i] == 10) {
    // Surround each newline (10) with a space (32)
    outputTextInRunes.add(32);
    outputTextInRunes.add(10);
    outputTextInRunes.add(32);
  } else {
    outputTextInRunes.add(inputTextRunes[i]);
  }
}
String outputText = String.fromCharCodes(outputTextInRunes);
Question: How to indent and outdent on line beginning and line end within Flutter TextFormField
You should just perform the original split targeting more than just space. In this instance, you could use a regular expression that matches all forms of whitespace, including tab characters or newlines.
List<String> splitText = text.split(RegExp(r'\s'));
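The same idea, sketched in Java rather than Dart: splitting on runs of whitespace ("\\s+") instead of a single character also avoids empty entries when a space and a newline occur next to each other. The class name WordSplitter and the method words are invented for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of any library
public class WordSplitter {

    // Splits text on any run of whitespace (spaces, tabs, newlines).
    static List<String> words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // Splitting an empty string would yield one empty "word"
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }
}
```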

Split text file by word ( character) count

In short: how do I split a 10,000-word TXT file into TXT files of 100 words each?
I am trying to split a large text file by word count.
The best option I have found so far is this software (Gsplit), but it only has an option to split by line.
This app has an option to split by pattern.
For example, the pattern for splitting by line is "0x0D0x0A".
So is there a pattern that splits the text file by word count (using Gsplit or some other way)?
Since you have tagged your question with "java", here is some simple Java code that counts the number of words by splitting on spaces.
import java.util.Arrays;

public class Splitting {
    public static void main(String[] args) {
        String inputLines = "This class is for split words by spaces and count the number of words";
        // Split on single spaces to get the individual words
        String[] words = inputLines.split(" ");
        System.out.println(Arrays.toString(words));
        System.out.println("No of words: " + words.length);
    }
}
output:
[This, class, is, for, split, words, by, spaces, and, count, the, number, of, words]
No of words: 14
Hope this would help somehow!
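Building on the word split above, a rough sketch of the actual chunking step might look like the following. It groups the words into blocks of a fixed size; the class name WordCountSplitter and the method splitByWordCount are hypothetical, and the sketch assumes the whole file fits in memory as one String.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library
public class WordCountSplitter {

    // Groups the words of text into chunks of wordsPerChunk words each;
    // the last chunk may be shorter.
    static List<String> splitByWordCount(String text, int wordsPerChunk) {
        if (wordsPerChunk <= 0) {
            throw new IllegalArgumentException("wordsPerChunk must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        // Any run of whitespace (including line breaks) separates words
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += wordsPerChunk) {
            int end = Math.min(start + wordsPerChunk, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return chunks;
    }
}
```

Each returned chunk could then be written to its own file. Note that this sketch joins words with single spaces, so the original line breaks are not preserved.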

SAS finding an uppercase word within a string

I have a string which contains one word in uppercase somewhere within it. I want to extract that one word into a new variable using SAS.
I think I need to find a way to code up finding a word which contains two or more uppercase letters (as the start of a sentence would begin with an uppercase letter).
i.e. How do I create the variable 'word':
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
Hope someone can help and the question is clear
Thanks in advance
Consider a regex match/replace with a negative lookahead that covers two types of matches:
runs of at least two upper case letters or spaces, i.e. upper case words followed by a space (the two-character minimum avoids matching the capital letter at the beginning of a sentence): (([A-Z ]){2,})
runs of at least two upper case letters or periods, i.e. upper case words followed by a period (again with the two-character minimum): (([A-Z.]){2,})
CAVEAT: This solution also matches the pronoun I, which is technically a valid match since it is an all-uppercase one-letter word. As this is the only such case in English, consider a tranwrd() replace for it. Relatedly, this solution matches ALL uppercase words in the string, not just one.
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
data example;
set example;
pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
wordextract = prxchange(pattern_num, -1, txtString);
wordextract = tranwrd(wordextract, " I ", "");
drop pattern_num;
run;
txtString                               word      wordextract
This is one EXAMPLE. Of what I need.    EXAMPLE   EXAMPLE
THIS is another.                        THIS      THIS
etc ETC                                 ETC       ETC
SAS has a prxsubstr() function call that finds the starting position and length of a substring that matches a given regex pattern within a given string. Here's a sample solution using the prxsubstr() function call:
data solution;
set example;
/* Build a regex pattern of the word to search for, and hang on to it */
/* (The regex below means: word boundary, then two or more capital letters,
then word boundary. Word boundary here means the start or the end of a string
of letters, digits and/or underscores.) */
if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");
retain pattern_num;
/* Get the starting position and the length of the word to extract */
call prxsubstr(pattern_num, txtString, mypos, mylength);
/* If a word matching the regex pattern is found, extract it */
if mypos ^= 0 then word = substr(txtString, mypos, mylength);
run;
SAS prxsubstr() documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295971.htm
Regex word boundary info: http://www.regular-expressions.info/wordboundaries.html
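To show the word-boundary pattern outside SAS, here is a small Java sketch that applies the same regex, \b[A-Z]{2,}\b, and returns the first match. The class name UpperCaseWord and the method firstUpperCaseWord are invented for this example; returning an empty string when nothing matches is an assumption, chosen to mirror an empty SAS character variable.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library
public class UpperCaseWord {

    // Word boundary, two or more capital letters, word boundary
    private static final Pattern UPPER = Pattern.compile("\\b[A-Z]{2,}\\b");

    // Returns the first all-uppercase word of length >= 2, or "" if none.
    static String firstUpperCaseWord(String s) {
        Matcher m = UPPER.matcher(s);
        return m.find() ? m.group() : "";
    }
}
```

Because the pattern requires at least two capital letters between word boundaries, a sentence-initial capital such as the "T" in "This" and the single-letter pronoun "I" are both skipped.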

C++ Replacing non-alpha/apostrophe with spaces in a string

I am reading in a text file and parsing the words into a map to count numbers of occurrences of each word on each line. I am required to ignore all non-alphabetic chars (punctuation, digits, white space, etc) except for apostrophes. I can figure out how to delete all of these characters using the following code, but that causes incorrect words, like "one-two" comes out as "onetwo", which should be two words, "one" and "two".
Instead, I am trying to now replace all of these values with spaces instead of simply deleting, but can't figure out how to do this. I figured the replace-if algorithm would be a good algorithm to use, but can't figure out the proper syntax to accomplish this. C++11 is fine. Any suggestions?
Sample output would be the following:
"first second" = "first" and "second"
"one-two" = "one" and "two"
"last.First" = "last" and "first"
"you're" = "you're"
"great! A" = "great" and "A"
// What I initially used to delete non-alpha and white space (apostrophes not working currently, though)
// Read file one line at a time
while (getline(text, line)){
istringstream iss(line);
// Parse line on white space, storing values into tokens map
while (iss >> word){
word.erase(remove_if(word.begin(), word.end(), my_predicate), word.end());
++tokens[word][linenum];
}
++linenum;
}
bool my_predicate(char c){
return c == '\'' || !isalpha(c); // This line is not working properly for apostrophes yet
}
bool my_predicate(char c){
return c == '\'' || !isalpha(c);
}
Here you're saying that you want to remove the char if it is an apostrophe or if it is not an alphabetic character.
Since you want to replace these, you should use std::replace_if():
std::replace_if(std::begin(word), std::end(word), my_predicate, ' ');
And you should correct your predicate too:
return !isalpha(c) && c != '\'';
You could use std::replace_if to pre-process the input line before sending it to the istringstream. This will also simplify your inner loop.
while (getline(text, line)){
replace_if(line.begin(), line.end(), my_predicate, ' ');
istringstream iss(line);
// Parse line on white space, storing values into tokens map
while (iss >> word){
++tokens[word][linenum];
}
++linenum;
}

remove fragments in a sentence [puzzle]

Question:
Write a program to remove every fragment that occurs in "all" strings, where a fragment is 3 or more consecutive words.
Example:
Input::
s1 = "It is raining and I want to drive home.";
s2 = "It is raining and I want to go skiing.";
s3 = "It is hot and I want to go swimming.";
Output::
s1 = "It is raining drive home.";
s2 = "It is raining go skiing.";
s3 = "It is hot go swimming.";
Removed fragment = "and i want to"
The program will be tested against large files.
Efficiency will be taken into consideration.
Assumptions: Ignore capitalization and punctuation when matching, but preserve them in the output.
Note: Take care of cases like
a a a a a b c b c b c b c where removing would create more fragments.
My solution (which I think is not the most efficient):
Hash each three-word phrase into an int and store the hashes in an array, one array per string.
Each string reduces to an array of numbers like
1 2 3 4 5
3 5 7 9 8
9 3 1 7 9
The problem reduces to the intersection of the arrays.
Sort the arrays (k * n log n).
Keep k pointers. If all pointed-to values are equal, a match is found; otherwise increment the pointer pointing to the smallest value.
To handle the note above, I was thinking of doing a lazy delete, i.e. mark phrases for deletion and delete them at the end.
Are there cases where my solution might not work? Can we optimize my solution or find a better one?
First observation: replace each word with a single "letter" in a big alphabet (i.e. hash the words in some way), and remove whitespace and punctuation.
Now the problem is reduced to removing the longest letter sequence that appears in all "words" of a given list.
So you have to compute the longest common substring for a set of "words". You can find it using a generalized suffix tree, as this is the most efficient algorithm. This should do the trick and I believe it has the best possible complexity.
The first step is as already suggested by izomorphius:
Replace each word with a single "letter" in a big alphabet (i.e. hash the words in some way), and remove whitespace and punctuation.
For the second step you don't need to know the longest common substring; you just want to erase it from all the strings.
Note that this is equivalent to erasing all common substrings of length exactly 3, because if you have a longer common substring, then its substrings of length 3 are also common.
To do that you can use a hash table (storing key-value pairs).
Just iterate over the first string and put all its 3-substrings into the hash table as keys with values equal to 1.
Then iterate over the second string and for each 3-substring x, if x is in the hash table and its value is 1, then set the value to 2.
Then iterate over the third string and for each 3-substring x, if x is in the hash table and its value is 2, then set the value to 3.
...and so on.
At the end, the keys that have a value of k, where k is the number of strings, are the common 3-substrings.
Now just iterate once more over all the strings and remove those 3-substrings that are common.
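The hash-table steps above can be sketched in Java roughly as follows. This is an illustration under assumptions rather than a tuned solution: words are compared in lower case with punctuation stripped, the three normalized words joined by spaces serve as the hash key instead of a single "letter", and the names CommonFragments, norm, key and removeCommon are all made up. Words are first marked and only removed at the end, so removals cannot create new fragments.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any library
public class CommonFragments {

    // Normalizes a word for matching: lower case, punctuation stripped
    static String norm(String w) {
        return w.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9']", "");
    }

    // Hash key for the three words starting at index i
    static String key(String[] w, int i) {
        return norm(w[i]) + " " + norm(w[i + 1]) + " " + norm(w[i + 2]);
    }

    static List<String> removeCommon(List<String> sentences) {
        List<String[]> split = new ArrayList<>();
        for (String s : sentences) {
            split.add(s.trim().split("\\s+"));
        }
        // Value n means: this 3-word key occurred in each of the first n strings
        Map<String, Integer> seen = new HashMap<>();
        for (int n = 0; n < split.size(); n++) {
            String[] w = split.get(n);
            for (int i = 0; i + 2 < w.length; i++) {
                String k = key(w, i);
                if (n == 0) {
                    seen.put(k, 1);
                } else if (seen.getOrDefault(k, 0) == n) {
                    seen.put(k, n + 1);
                }
            }
        }
        int total = split.size();
        List<String> result = new ArrayList<>();
        for (String[] w : split) {
            // Mark every word covered by a common 3-word key, delete afterwards
            boolean[] drop = new boolean[w.length];
            for (int i = 0; i + 2 < w.length; i++) {
                if (seen.getOrDefault(key(w, i), 0) == total) {
                    drop[i] = drop[i + 1] = drop[i + 2] = true;
                }
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < w.length; i++) {
                if (!drop[i]) {
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(w[i]); // original capitalization and punctuation kept
                }
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

For the three example sentences in the question, the common keys are "and i want" and "i want to", so the words "and I want to" are dropped from each sentence while the rest is left untouched.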
import java.io.*;
import java.util.*;
public class remove_unique{
public static void main(String args[]){
String s1 = "Everyday I do exercise if";
String s2 = "Sometimes I do exercise if i feel stressed";
String s3 = "Mostly I do exercise on morning";
String[] words1=s1.split("\\s");
String[] words2=s2.split("\\s");
String[] words3=s3.split("\\s");
StringBuilder sb = new StringBuilder();
for(int i=0;i<words1.length;i++){
for(int j=0;j<words2.length;j++){
for(int k=0;k<words3.length;k++){
if(words1[i].equals(words2[j]) && words2[j].equals(words3[k])
&&words3[k].equals(words1[i])){
//Concatenating the returned Strings
sb.append(words1[i]+" ");
}
}
}
}
System.out.println(s1.replaceAll(sb.toString(), ""));
System.out.println(s2.replaceAll(sb.toString(), ""));
System.out.println(s3.replaceAll(sb.toString(), ""));
}
}
//LAKSHMI ARJUNA
My solution would be something like:
F = all fragments with length >= 3 shared by the first 2 lines, avoiding overlaps
for each line from the 3rd line and up
remove fragments in F which do not exist in the line, or which cause overlaps
return sentences with fragments in F removed
I assume finding/matching fragments in sentences can be done with some known algorithm, but in terms of the number of lines n the time complexity is O(n).

Resources