Split text file by word (character) count - text

In short: how can I split a 10,000-word TXT file into TXT files of 100 words each?
I am trying to split a large text file by word count.
The best option I have found so far is this software (Gsplit), but it only has an option to split by line.
This app has an option to split by pattern.
For example, the pattern to split by line is: "0x0D0x0A".
So is there a pattern to split the text file by word count (with this Gsplit app or another way)?

Since you have tagged your question with "java", here is simple Java code that counts the number of words by splitting on spaces.
import java.util.Arrays;

public class Splitting {
    public static void main(String[] args) {
        String inputLines = "This class is for split words by spaces and count the number of words";
        // Split on single spaces; each array element is one word
        String[] words = inputLines.split(" ");
        System.out.println(Arrays.toString(words));
        System.out.println("No of words: " + words.length);
    }
}
output:
[This, class, is, for, split, words, by, spaces, and, count, the, number, of, words]
No of words: 14
Hope this would help somehow!
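The snippet above only counts words. As a rough sketch of the actual split, the same idea can be extended to group the words into chunks of 100 and write each chunk to its own file. The class name WordSplitter, the method splitByWordCount and the file names input.txt and part_N.txt are made up for illustration, and the sketch assumes the whole file fits in memory and that original line breaks do not need to be preserved.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordSplitter {

    // Splits text into chunks of at most wordsPerChunk words.
    // A "word" is any run of non-whitespace characters; words in a chunk
    // are re-joined with single spaces, so line breaks are not preserved.
    static List<String> splitByWordCount(String text, int wordsPerChunk) {
        if (wordsPerChunk <= 0) {
            throw new IllegalArgumentException("wordsPerChunk must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += wordsPerChunk) {
            int end = Math.min(start + wordsPerChunk, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input file name; pass a real path as the first argument.
        Path input = Paths.get(args.length > 0 ? args[0] : "input.txt");
        String text = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        List<String> chunks = splitByWordCount(text, 100);
        for (int i = 0; i < chunks.size(); i++) {
            // Hypothetical output naming scheme: part_1.txt, part_2.txt, ...
            Path out = Paths.get("part_" + (i + 1) + ".txt");
            Files.write(out, chunks.get(i).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Splitting on "\\s+" rather than a single space treats tabs, newlines and repeated spaces as one separator, which is usually what a word count over a real text file needs.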

Related

How to print a list containing strings with a specific number of spaces between each string?

I'm working on code that prints strings from a list, but I need a specific number of spaces between each word. I'm also working with Java.
You can do this like so:
for (String item : list) {
    System.out.print(item + " ");
}
System.out.println();
Assuming the number of spaces is small, you can just append them to the end of each item.
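If the number of spaces is a variable rather than a fixed literal, one possible sketch is to build the gap once and join the items with it, which also avoids trailing spaces after the last item. The names SpacedPrinter and joinWithSpaces are illustrative only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SpacedPrinter {

    // Joins the items with exactly `spaces` space characters between neighbours.
    static String joinWithSpaces(List<String> items, int spaces) {
        // Build a string made of `spaces` copies of " "
        String gap = String.join("", Collections.nCopies(spaces, " "));
        return String.join(gap, items);
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("red", "green", "blue");
        // Three spaces between each word
        System.out.println(joinWithSpaces(list, 3));
    }
}
```

On Java 11 or later, " ".repeat(spaces) could replace the nCopies call.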

SAS finding an uppercase word within a string

I have a string which contains one word in uppercase somewhere within it. I want to extract that one word into a new variable using SAS.
I think I need to find a way to code up finding a word which contains two or more uppercase letters (as the start of a sentence would begin with an uppercase letter).
i.e. How do I create the variable 'word':
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
Hope someone can help and the question is clear
Thanks in advance
Consider a regex match/replace with a negative lookahead that covers two types of matches:
consecutive uppercase characters (at least two) followed by a space, to avoid title case at the beginning of a sentence: (([A-Z ]){2,})
consecutive uppercase characters (at least two) followed by a period, for the same reason: (([A-Z.]){2,})
CAVEAT: This solution also matches the pronoun I, which is technically a valid match since it is an all-uppercase one-letter word. As it is the only such case in English, consider a tranwrd() replace for it. Relatedly, this solution matches ALL uppercase words in the string.
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
data example;
set example;
pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
wordextract = prxchange(pattern_num, -1, txtString);
wordextract = tranwrd(wordextract, " I ", "");
drop pattern_num;
run;
txtString                               word     wordextract
This is one EXAMPLE. Of what I need.    EXAMPLE  EXAMPLE
THIS is another.                        THIS     THIS
etc ETC                                 ETC      ETC
SAS has a prxsubstr() function call that finds the starting position and length of a substring that matches a given regex pattern within a given string. Here's a sample solution using the prxsubstr() function call:
data solution;
set example;
/* Build a regex pattern of the word to search for, and hang on to it */
/* (The regex below means: word boundary, then two or more capital letters,
then word boundary. Word boundary here means the start or the end of a string
of letters, digits and/or underscores.) */
if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");
retain pattern_num;
/* Get the starting position and the length of the word to extract */
call prxsubstr(pattern_num, txtString, mypos, mylength);
/* If a word matching the regex pattern is found, extract it */
if mypos ^= 0 then word = substr(txtString, mypos, mylength);
run;
SAS prxsubstr() documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295971.htm
Regex word boundary info: http://www.regular-expressions.info/wordboundaries.html

AWK - enclose found strings with symbols in one command

I have a large body of text and I print only lines that contain one of several strings. Each line can contain more than one string.
Example of the rule:
(house|mall|building)
I want to mark the found string for making the result easier to read.
Example of the result I want:
New record: Two New York houses under contract for nearly $5 million each.
New record: Two New York #house#s under contract for nearly $5 million each.
I know I can find the location, trim, add marker, add string etc.
I am asking if there is a way to mark the found string in one command.
Thanks.
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html
gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression ...
sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of the
extended regular expression ERE in string in and return the number of
substitutions. An ampersand ( '&' ) appearing in the string repl shall
be replaced by the string from in that matches the ERE ...
BEGIN {
r = "house|mall|building"
s = "Two New York houses under contract for nearly $5 million each."
gsub(r, "#&#", s)
print s
}

Split Long Text into meaningful sentences with certain length using C#

I would like to split a text at a dot followed by whitespace or a dot followed by a newline (\n), keeping each piece under a certain length.
E.g. if I have a long text with 3456 characters in total, I want to split it into multiple texts of 1000 characters or as close to that as possible, where each text ends with a complete, meaningful sentence.
The reason is that I am using an API that accepts only 1000 characters or fewer for data conversion, but some of my texts are longer than that. I want to split them into multiple texts so that none exceeds 1000 characters and each ends at a complete sentence, i.e. at a dot followed by whitespace or a dot followed by a newline (\n).
I'm working with C# .NET.
Thanks in Advance.
Something like this. Obviously, replace "someText" with your data, and set shardLength to 1000 for your example. This solution gives an error if there is a sentence larger than the block size.
It currently handles newlines by effectively ignoring them: it only splits on "."
This means that sentences that end in ".\n" will be split after the ".", and the "\n" will be at the start of the next sentence.
The advantage here is that if you pass this to your API, you should be able to concatenate the results and retain the newlines in the appropriate places (assuming the API handles newlines).
using System;
using System.Text.RegularExpressions;

public static void BlockSplitter()
{
    String someText = @"This is some text.
The quick brown fox jumps over the lazy dog. Testing 1 2 3.
Sentence with no fullstop";
    String[] sentences;
    // Zero-width lookbehind: split after each "." and keep the dot with its sentence
    string delimiters = @"(?<=\.)";
    sentences = Regex.Split(someText, delimiters);
    String shard = String.Empty;
    int shardLength = 45;
    foreach (String sentence in sentences)
    {
        if (sentence.Length > shardLength)
        {
            // Raise an exception here, as the sentence is longer than shardLength
        }
        if ((shard.Length + sentence.Length) <= shardLength)
        {
            // The sentence still fits, so append it to the current shard
            shard += sentence;
        }
        else
        {
            // Emit the full shard and start a new one with this sentence
            Console.WriteLine(shard);
            shard = sentence;
        }
    }
    Console.WriteLine(shard);
}

What is an easy way to tell if a list of words are anagrams of each other?

How would you list words that are anagrams of each other?
I was asked this question when I applied for my current job.
orchestra can be rearranged into carthorse with all original letters used exactly once therefore the words are anagrams of each other.
Put all the letters in alphabetical order in the string (sorting algorithm) and then compare the resulting string.
Good thing we all live in the C# reality of in-place sorting of short words on quad core machines with oodles of memory. :-)
However, if you happen to be memory constrained and can't touch the original data and you know that those words contain characters from the lower half of the ASCII table, you could go for a different algorithm that counts the occurrence of each letter in each word instead of sorting.
You could also opt for that algorithm if you want to do it in O(N) and don't care about the memory usage (a counter for each Unicode char can be quite expensive).
Sort each element (removing whitespace) and compare against the previous. If they are all the same, they're all anagrams.
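The sort-and-compare idea from the answers above could look roughly like this in Java. The names AnagramCheck, key and allAnagrams are invented for this sketch, and lowercasing the input is an assumption not stated in the answers.

```java
import java.util.Arrays;
import java.util.List;

class AnagramCheck {

    // Canonical form of a word: lowercased, whitespace removed, letters sorted.
    static String key(String word) {
        char[] letters = word.toLowerCase().replaceAll("\\s+", "").toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // True if every word in the list has the same canonical form as the first one.
    static boolean allAnagrams(List<String> words) {
        if (words.isEmpty()) {
            return true;
        }
        String first = key(words.get(0));
        for (String w : words) {
            if (!key(w).equals(first)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allAnagrams(Arrays.asList("orchestra", "carthorse"))); // true
        System.out.println(allAnagrams(Arrays.asList("abc", "abd")));             // false
    }
}
```

Comparing the sorted strings with equals() rather than by hash code avoids false positives from hash collisions.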
Interestingly enough, Eric Lippert's Fabulous Adventures In Coding Blog dealt with a variation on this very problem on February 4, 2009 in this post.
The following algorithm should work:
Sort the letters in each word.
Sort the sorted lists of letters in each list.
Compare each element in each list for equality.
Well, sort the letters of each word in the list.
If abc, bca, cab, cba are the inputs, then the sorted list will be abc, abc, abc, abc.
Now all of their hash codes are equal. Compare the hash codes (equal hash codes are necessary but not sufficient, so confirm a match with a string comparison).
Sorting the letters and comparing (letter by letter, string compare, ...) is the first thing that comes to mind.
Compare lengths (if not equal, not a chance).
Make a bit vector of the length of the strings.
For each char in the first string, find occurrences of it in the second.
Set the bit for the first unset occurrence.
If you can't find one, stop with fail.
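The steps above can be sketched in Java as follows, using java.util.BitSet as the bit vector. The class and method names are hypothetical, and the sketch compares characters exactly (case-sensitive), which the answer does not specify.

```java
import java.util.BitSet;

class BitVectorAnagram {

    // Marks each character of b as "used" once it has been matched by a character of a.
    static boolean isAnagram(String a, String b) {
        // Different lengths: not a chance
        if (a.length() != b.length()) {
            return false;
        }
        BitSet used = new BitSet(b.length());
        for (int i = 0; i < a.length(); i++) {
            boolean found = false;
            for (int j = 0; j < b.length(); j++) {
                // Take the first occurrence in b that has not been claimed yet
                if (!used.get(j) && a.charAt(i) == b.charAt(j)) {
                    used.set(j);
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false; // no unclaimed match for this character
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAnagram("abc", "cba")); // true
        System.out.println(isAnagram("aab", "abb")); // false
    }
}
```

This runs in quadratic time in the word length but needs only one bit per character and leaves both input strings untouched.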
// Returns true if s and s1 are anagrams; assumes only the letters a-z after lowercasing
public static boolean isAnagram(String s, String s1) {
    // Strings of different lengths can never be anagrams
    if (s.length() != s1.length())
        return false;
    char[] aArr = s.toLowerCase().toCharArray();
    char[] bArr = s1.toLowerCase().toCharArray();
    // An array to hold the number of occurrences of each character
    int[] counts = new int[26];
    for (int i = 0; i < aArr.length; i++) {
        counts[aArr[i] - 'a']++; // Increment the count for the character from the first string
        counts[bArr[i] - 'a']--; // Decrement the count for the character from the second string
    }
    // If the strings are anagrams, the counts array will be all zeros, not otherwise
    for (int i = 0; i < 26; i++) {
        if (counts[i] != 0)
            return false;
    }
    return true;
}

public static void main(String[] args) {
    String s = "abc";
    String s1 = "cba";
    System.out.println(isAnagram(s, s1));
}
I tried hash code logic for the anagram check, but it gives me the wrong output:
public static Boolean anagramLogic(String s, String s2) {
    char[] ch1 = s.toLowerCase().toCharArray();
    Arrays.sort(ch1);
    char[] ch2 = s2.toLowerCase().toCharArray();
    Arrays.sort(ch2);
    // wrong: toString() on a char[] returns a type-and-identity string, not the array contents
    return ch1.toString().hashCode() == ch2.toString().hashCode();
}
To rectify this code, the following is the only option I see; I'd appreciate any recommendations:
public static Boolean anagramLogic(String s, String s2) {
    char[] ch1 = s.toLowerCase().toCharArray();
    Arrays.sort(ch1);
    char[] ch2 = s2.toLowerCase().toCharArray();
    Arrays.sort(ch2);
    // Arrays.equals compares the arrays element by element
    return Arrays.equals(ch1, ch2);
}
