How to split a 4000-symbol text into 4 parts?

I have a long text, a single paragraph of around 4000 symbols.
I want to split it into smaller paragraphs of 1000 or fewer symbols. The text splitters I found online cut the text right in the middle of sentences, which is not what I want: sentence boundaries should be respected. Is there an online tool that does what I'm asking for? (Not a programmer myself.)
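
(For anyone comfortable running a small script rather than an online tool, the sentence-aware splitting being asked for fits in a few lines of Python. This is a minimal sketch that assumes sentences end with ".", "!" or "?"; abbreviations and other edge cases would need a smarter splitter.)

    import re

    def split_text(text, limit=1000):
        # Split on sentence-ending punctuation followed by whitespace.
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        parts, current = [], ''
        for sentence in sentences:
            # Start a new part once adding a sentence would exceed the limit.
            if current and len(current) + 1 + len(sentence) > limit:
                parts.append(current)
                current = sentence
            else:
                current = current + ' ' + sentence if current else sentence
        if current:
            parts.append(current)
        return parts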

Related

Replace specific characters in a paragraph of text based on its numbered position with randomly generated characters

I dabbled a bit in JavaScript years ago but never quite grasped the logic behind it. I still have some understanding of the basics, but not enough to achieve what I'd like to. I don't have the time to research how to write the code myself, but if you could point me to already-coded, individual functions that achieve the results I'm looking for, perhaps I could play around with them and then ask for further help when needed.
I've got a paragraph of text (it could be anything) about 300 characters long, including spaces, capitalization, and punctuation. I would like a function that generates a random number based on the length of the paragraph (i.e., it counts the characters, so the generated number is never higher than the number of characters in the paragraph) and then replaces the character at that position with a randomly generated character drawn from a list of characters that appear in the paragraph (e.g. a-z, A-Z, and punctuation).
For example, if the number generated is 34, then the 34th character (whatever it may be) will be replaced by whatever character is randomly generated.
And finally, a function to set how many times this process should repeat (e.g. 10 times, 100 times) before stopping, so one can see how the resulting paragraph of text has changed.
Any suggestions will be appreciated. Thanks.
Sorry, I've not tried anything yet, as I'd like to get advice first.
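
(The asker mentions JavaScript, but the logic is language-independent. Here is a minimal sketch of the described process in Python; the function name and the example string are illustrative only.)

    import random

    def mutate(paragraph, repeats):
        chars = list(paragraph)
        # Replacement pool: the characters that appear in the paragraph.
        pool = sorted(set(chars))
        for _ in range(repeats):
            # Random position, never higher than the paragraph length.
            i = random.randrange(len(chars))
            chars[i] = random.choice(pool)
        return ''.join(chars)

    print(mutate("It could be anything, about 300 characters long.", 10))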

Wrap text in openPDF

Is there a way to wrap text around an image in openPDF, or around other text? I'm trying to mesh two texts onto one page (texts which may contain pictures), but one text may be larger or smaller than the other. They could look like one of these two pictures:
I was going to make a column text and test the length of text 2 against the length of the part of text 1 that remains after the first few lines (see the picture), but I couldn't return the remaining text with columnText.go(true).
Is there an easy way to wrap text around a picture or another variable-sized object (i.e., other text)?

Preprocessing text so that two words without a separating space (or hyphen separated) are detected

Let's say I have a text corpus with inconsistently written bi-grams. An example would be "bi gram", "bi-gram", "bigram". Is there any standard text preprocessing method to normalize all of these to the same thing, i.e. to replace all such occurrences with "bigram"? I should also mention that I have no prior knowledge of which exact bi-grams are present in the corpus.
Another thing I'm curious about: spell correction of standard words like common nouns is easy, but what about spell correction of proper nouns? I'm assuming that the correct spelling occurs more frequently than the incorrect one. Say I have a pandas Series of text in which the majority of rows contain "California", but there are some occurrences of "Califonria" as well.
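
(There is no single standard method for the bi-gram question, but one common heuristic is to merge adjacent tokens whose concatenation already occurs as a word somewhere in the corpus. A rough sketch, assuming whitespace tokenization and treating hyphens as joiners:)

    from collections import Counter

    def normalize_bigrams(docs):
        # Vocabulary of single tokens seen anywhere in the corpus
        # (hyphens removed, so "bi-gram" counts toward "bigram").
        vocab = Counter(w.replace('-', '') for doc in docs for w in doc.split())
        out = []
        for doc in docs:
            words = [w.replace('-', '') for w in doc.split()]
            merged, i = [], 0
            while i < len(words):
                # Merge adjacent words whose concatenation is a known token.
                if i + 1 < len(words) and words[i] + words[i + 1] in vocab:
                    merged.append(words[i] + words[i + 1])
                    i += 2
                else:
                    merged.append(words[i])
                    i += 1
            out.append(' '.join(merged))
        return out

    print(normalize_bigrams(["a bi gram", "a bi-gram", "a bigram"]))
    # -> ['a bigram', 'a bigram', 'a bigram']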

What is the best algorithm to compare strings and put similar ones together?

I'm trying to group redundancies in a dataset for some analysis. My primary tool for this analysis is their titles.
I might have things like "blue bird" "big blue bird" "brown dog" "red dog", etc.
In this case, I want to group "blue bird" and "big blue bird" together but none of the other elements should be grouped.
I know about string metrics, but how effective are they in general on phrases, as opposed to single words or noisy strings, and which would be an effective solution for this problem?
You could use the same logic people usually put in programs that sort an array: fix a variable (in this case a string, using its first word) and compare it with the strings you have, always looking for an equal word; when you find one, place the string in a separate vector or at a specific position.
However, doing so would take a lot of time and is probably not the best way to go, because it proceeds phrase by phrase, word by word, letter by letter. Instead, it may help to first separate the strings into large groups by the initial letter of the first word. That way you spend less time searching for repeated words, which also optimizes the use of memory.
I found a paper from Carnegie Mellon University that seems very interesting; it talks about this problem, and you should take a closer look:
String Metric
String metrics don't care whether your words contain spaces or not. Thus phrases are mostly just longer strings than words (in this regard), so string metrics work just as well if you are performing a fuzzy search (although you might want to search for every word individually).
Since you seem to be looking for exact matches, though, I would recommend building a suffix tree from the concatenation of your titles. You can then search that tree for each of your titles and build title groups whenever you get more than one match. However, you will need to decide what you want to do with combinations like
blue bird
big blue bird
small blue bird
Following the brown/red dog example, you would not want to group "big blue bird" with "small blue bird", but "blue bird" would be grouped with both of these.
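
(A rough stand-in for the suffix-tree approach: the sketch below is a naive quadratic substring check, not an actual suffix tree. Note that it puts "small blue bird" and "big blue bird" into the same group via "blue bird", which is exactly the undecided combination case described above.)

    def group_titles(titles):
        groups = []
        for title in titles:
            placed = False
            for group in groups:
                # Join a group if the title contains, or is contained in,
                # any title already in that group.
                if any(title in t or t in title for t in group):
                    group.append(title)
                    placed = True
            if not placed:
                groups.append([title])
        return groups

    print(group_titles(["blue bird", "big blue bird", "brown dog", "red dog"]))
    # -> [['blue bird', 'big blue bird'], ['brown dog'], ['red dog']]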

Source for word weights?

I am building a very basic result-ranking algorithm, and one thing I'd like is a way to determine which words are generally more important in a given phrase. It doesn't have to be exact, just general.
Obvious steps are dropping any word under 4 letters and identifying names. But what other ways are there to pick out the 3 most significant words in a sentence?
In the absence of any other information, it is fair to assume that important words are rare words. Count how many times each word appears in your set of documents. The words with the lowest counts are more important, while the words with the highest counts are less important (if not nearly useless).
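
(That idea is essentially inverse document frequency. A minimal sketch, with illustrative names and toy data:)

    from collections import Counter

    def top_words(documents, query, k=3):
        # Corpus-wide frequency of every word.
        counts = Counter(w.lower() for doc in documents for w in doc.split())
        # Drop short words, then rank the rest by rarity in the corpus.
        words = [w for w in query.split() if len(w) >= 4]
        return sorted(words, key=lambda w: counts[w.lower()])[:k]

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
    print(top_words(docs, "the quick brown fox jumps"))
    # -> ['jumps', 'brown', 'quick']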
Related reading:
http://en.wikipedia.org/wiki/Stop_words
http://en.wikipedia.org/wiki/Googlewhack
http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
