Is there any way to get the output of Spark's Dataset.show() method as a string? - apache-spark

The Spark Dataset.show() method is useful for seeing the contents of a dataset, particularly for debugging (it prints out a nicely-formatted table). As far as I can tell, it only prints to the console, but it would be useful to be able to get this as a string. For example, it would be nice to be able to write it to a log, or see it as the result of an expression when debugging with, say, IntelliJ.
Is there any way to get the output of Dataset.show() as a string?

The method behind show is showString, which is not visible from outside the sql package. I copied it from Dataset.scala and changed it so that a DataFrame can be passed as a parameter:
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.DataFrame
def showString(df: DataFrame, _numRows: Int = 20, truncate: Int = 20): String = {
val numRows = _numRows.max(0)
val takeResult = df.take(numRows + 1)
val hasMoreData = takeResult.length > numRows
val data = takeResult.take(numRows)
// For array values, replace Seq and Array with square brackets
// For cells that are beyond `truncate` characters, replace it with the
// first `truncate-3` and "..."
val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.map { row =>
row.toSeq.map { cell =>
val str = cell match {
case null => "null"
case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
case array: Array[_] => array.mkString("[", ", ", "]")
case seq: Seq[_] => seq.mkString("[", ", ", "]")
case _ => cell.toString
}
if (truncate > 0 && str.length > truncate) {
// do not show ellipses for strings shorter than 4 characters.
if (truncate < 4) str.substring(0, truncate)
else str.substring(0, truncate - 3) + "..."
} else {
str
}
}: Seq[String]
}
val sb = new StringBuilder
val numCols = df.schema.fieldNames.length
// Initialise the width of each column to a minimum value of '3'
val colWidths = Array.fill(numCols)(3)
// Compute the width of each column
for (row <- rows) {
for ((cell, i) <- row.zipWithIndex) {
colWidths(i) = math.max(colWidths(i), cell.length)
}
}
// Create SeparateLine
val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString()
// column names
rows.head.zipWithIndex.map { case (cell, i) =>
if (truncate > 0) {
StringUtils.leftPad(cell, colWidths(i))
} else {
StringUtils.rightPad(cell, colWidths(i))
}
}.addString(sb, "|", "|", "|\n")
sb.append(sep)
// data
rows.tail.map {
_.zipWithIndex.map { case (cell, i) =>
if (truncate > 0) {
StringUtils.leftPad(cell.toString, colWidths(i))
} else {
StringUtils.rightPad(cell.toString, colWidths(i))
}
}.addString(sb, "|", "|", "|\n")
}
sb.append(sep)
// For Data that has more than "numRows" records
if (hasMoreData) {
val rowsString = if (numRows == 1) "row" else "rows"
sb.append(s"only showing top $numRows $rowsString\n")
}
sb.toString()
}
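If copying Spark internals is undesirable, another option is to capture whatever show() prints. The following is a minimal sketch, assuming show() writes through Scala's Console (it does so via println in the Spark versions I am aware of). The names CaptureOutput and capture are hypothetical and not part of Spark.

```scala
import java.io.ByteArrayOutputStream

object CaptureOutput {
  // Runs `body` with Console.out redirected to an in-memory buffer
  // and returns everything that was printed while it ran.
  def capture(body: => Unit): String = {
    val buffer = new ByteArrayOutputStream()
    Console.withOut(buffer)(body)
    // Decoded with the platform default charset, matching the PrintStream
    // that Console.withOut wraps around the buffer.
    buffer.toString
  }
}
```

With a DataFrame df in scope, CaptureOutput.capture(df.show()) would then return the formatted table as a string, under the same assumption. Output written directly to System.out (bypassing Console) would not be captured.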

Related

How to reverse words in a string while keeping punctuation marks and upper-case positions

private def reverseHelper(word: String): String = {
var result = new StringBuilder(word)
if (word.head.isUpper) {
result.setCharAt(0, word.head.toLower)
result.setCharAt(word.length - 1, word.last.toUpper)
}
result.reverse.result()
}
val formatString = str
.split("[.,!?: ]+")
.map(result => str.replaceFirst(result, reverseHelper(result)))
.foreach(println)
Example:
Input: What is a sentence?
Output: Tahw si a ecnetnes?
But instead I get one string per word, each with only that word reversed: Tahw is a sentence?, What si a sentence?, What is a sentence?, What is a ecnetnes?
How can I produce the output in the right format?
Restoring the original capitalization is a bit tricky.
def reverser(s:Seq[Char], idx:Int = 0) :String = {
val strt = s.indexWhere(_.isLetter, idx)
if (strt < 0) s.mkString
else {
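// Treat the end of the input as the end of the word; otherwise an input that
// ends in a letter gives end == -1 and the recursion never terminates.
val end = s.indexWhere(!_.isLetter, strt) match { case -1 => s.length; case e => e }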
val len = end - strt
val rev = Range(0,len).map{ x =>
if (s(strt+x).isUpper) s(end-1-x).toUpper
else s(end-1-x).toLower
}
reverser(s.patch(strt,rev,len), end)
}
}
testing:
reverser( "What, is A sEntence?")
//res0: String = Tahw, si A eCnetnes?
You can first split your string at a list of special characters, reverse each individual word, and store the results in a temporary string. Then traverse the original string: keep every special character as it is, and replace every other character with the next character from the temporary string.
private def reverseHelper(word: String): String = {
var result = new StringBuilder(word)
if (word.head.isUpper) {
result.setCharAt(0, word.head.toLower)
result.setCharAt(word.length - 1, word.last.toUpper)
}
result.reverse.result()
}
val tempStr = str
.split("[.,!?: ]+")
.map(result => reverseHelper(result))
.mkString("")
val sList = "[.,!?: ]+".toList
var curr = 0
val formatString = str.map(c => {
if(!sList.contains(c)) {
curr = curr + 1
tempStr(curr-1)
}
else c
})
Here's one approach that uses a Regex pattern to generate a list of paired strings of Seq(word, nonWord), followed by reversal and positional uppercasing of the word strings:
def reverseWords(s: String): String = {
val pattern = """(\w+)(\W*)""".r
pattern.findAllMatchIn(s).flatMap(_.subgroups).grouped(2).
map{ case Seq(word, nonWord) =>
val caseList = word.map(_.isUpper)
val newWord = (word.reverse zip caseList).map{
case (c, true) => c.toUpper
case (c, false) => c.toLower
}.mkString
newWord + nonWord
}.
mkString
}
reverseWords("He likes McDonald's burgers. I prefer In-and-Out's.")
//res1: String = "Eh sekil DlAnodcm's sregrub. I referp Ni-dna-Tuo's."
A version using split on word boundaries:
def reverseWords(string: String): String = {
def revCap(s: String): String =
s.headOption match {
case Some(c) if c.isUpper =>
(c.toLower +: s.drop(1)).reverse.capitalize
case Some(c) if c.isLower =>
s.reverse
case _ => s
}
string
.split("\\b")
.map(revCap)
.mkString("")
}
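For comparison, the approaches above can be condensed into a single regex replacement that reverses each run of letters and re-applies the original upper/lower-case pattern by position. This is only a sketch: the names ReverseWords and reverseKeepingCase are hypothetical, and it assumes a "word" is a maximal run of Unicode letters.

```scala
object ReverseWords {
  // Reverses every run of letters in place; punctuation and spaces are untouched.
  // The character at position i of the result gets the case of position i of the original word.
  def reverseKeepingCase(s: String): String =
    """\p{L}+""".r.replaceAllIn(s, m => {
      val word = m.matched
      word.reverse.zip(word).map { case (c, original) =>
        if (original.isUpper) c.toUpper else c.toLower
      }.mkString
    })
}
```

Because only letters are matched, the replacement text never contains `$` or `\`, which replaceAllIn would otherwise treat specially.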

Single Speechmarks not added to numbers

The code below works, except when a text field contains a number: no single quote is added around, say, 1, although one is added around one.
As an aside, I don't want quotes on the first column (the ID value).
SEP = ", "
QUOTE = "\'"
NEWLINE = System.getProperty("line.separator")
KEYWORDS_LOWERCASE = com.intellij.database.util.DbSqlUtil.areKeywordsLowerCase(PROJECT)
KW_INSERT_INTO = KEYWORDS_LOWERCASE ? "insert into " : "INSERT INTO "
KW_VALUES = KEYWORDS_LOWERCASE ? ") values (" : ") VALUES ("
KW_NULL = KEYWORDS_LOWERCASE ? "null" : "NULL"
def record(columns, dataRow) {
OUT.append(KW_INSERT_INTO)
if (TABLE == null) OUT.append("MY_TABLE")
else OUT.append(TABLE.getParent().getName()).append(".").append(TABLE.getName())
OUT.append(" (")
columns.eachWithIndex { column, idx ->
OUT.append(column.name()).append(idx != columns.size() - 1 ? SEP : "")
}
OUT.append(KW_VALUES)
columns.eachWithIndex { column, idx ->
def value = dataRow.value(column)
def stringValue = value != null ? FORMATTER.format(dataRow, column) : KW_NULL
if (DIALECT.getDbms().isMysql())
stringValue = stringValue.replace("\\", "\\\\")
OUT.append(skipQuote ? "": QUOTE).append(stringValue.replace(QUOTE, QUOTE + QUOTE))
.append(skipQuote ? "": QUOTE).append(idx != columns.size() - 1 ? SEP : "")
}
OUT.append(");").append(NEWLINE)
}
ROWS.each { row -> record(COLUMNS, row) }
I'm not 100% sure what you are trying to achieve or why, but I would write something like this in idiomatic Groovy:
LinkedHashMap.metaClass.value = { delegate.get it } // mock value()
TABLE = [ parent:[ name:'OTHER_TABLE' ] ] // fake TABLE
KEYWORDS_LOWERCASE = true // false
KW_INSERT_INTO = 'INSERT INTO'
KW_VALUES = 'VALUES'
if( KEYWORDS_LOWERCASE ){
KW_INSERT_INTO = KW_INSERT_INTO.toLowerCase()
KW_VALUES = KW_VALUES.toLowerCase()
}
COLUMNS = [ 'a', 'nullllll', 'c', 'numberString' ]
String record(columns, dataRow) {
List values = columns.collect{
def v = dataRow.value it
switch( v ){
case Number:
case ~/\d+/: return v
case String: return "'$v'"
default: return 'null'
}
}
"$KW_INSERT_INTO ${TABLE?.parent?.name ?: 'MY_TABLE'} (${columns.join( ', ' )}) $KW_VALUES (${values.join( ', ' )});\n"
}
String res = record( COLUMNS, [ a:'aa', c:42, numberString:'84' ] )
assert res == "insert into OTHER_TABLE (a, nullllll, c, numberString) values ('aa', null, 42, 84);\n"
The values are formatted in the switch statement.
You can try out yourself at https://groovyconsole.appspot.com/script/5151418931478528
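The rule the question asks for (quote every non-null value except the ID in the first column, even when the text looks numeric) can be sketched independently of the IntelliJ extractor API. This is a minimal sketch in Scala rather than Groovy; InsertStatement, literal and insert are hypothetical names, and the first column is assumed to be the ID.

```scala
object InsertStatement {
  // Formats one value as an SQL literal.
  // Assumption: column 0 is the ID and is emitted without quotes.
  def literal(value: Any, columnIndex: Int): String = value match {
    case null                  => "NULL"
    case v if columnIndex == 0 => v.toString
    // Everything else is quoted, numbers included; embedded quotes are doubled.
    case v                     => "'" + v.toString.replace("'", "''") + "'"
  }

  def insert(table: String, columns: Seq[String], row: Seq[Any]): String = {
    val values = row.zipWithIndex.map { case (v, i) => literal(v, i) }
    s"INSERT INTO $table (${columns.mkString(", ")}) VALUES (${values.mkString(", ")});"
  }
}
```

The point of the design is that quoting depends on the column position and on null-ness only, never on whether the value happens to parse as a number.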

How to find a string within another, ignoring some characters?

Background
Suppose you want to find a partial text inside a formatted phone number and mark the match.
For example, given the phone number "+972 50-123-4567" and the query 2501, you should be able to mark the text "2 50-1" within it.
More examples, as a hash map from query to expected result, where the text to search in is "+972 50-123-45678" and the allowed characters are "01234567890+*#":
val tests = hashMapOf(
"" to Pair(0, 0),
"9" to Pair(1, 2),
"97" to Pair(1, 3),
"250" to Pair(3, 7),
"250123" to Pair(3, 11),
"250118" to null,
"++" to null,
"8" to Pair(16, 17),
"+" to Pair(0, 1),
"+8" to null,
"78" to Pair(15, 17),
"5678" to Pair(13, 17),
"788" to null,
"+ " to Pair(0, 1),
" " to Pair(0, 0),
"+ 5" to null,
"+ 9" to Pair(0, 2)
)
The problem
You might think: why not just use "indexOf", or clean the string and find the occurrence?
That is not enough, because I want to mark the occurrence in the original text while ignoring some characters along the way.
What I've tried
I actually have the answer, after working on it for quite some time. I just wanted to share it, and optionally see whether anyone can write nicer or shorter code that produces the same behavior.
I had a solution before that was considerably shorter, but it assumed that the query contains only allowed characters.
The question
Well, there is no question this time, because I've found an answer myself.
However, if you can think of a more elegant and/or shorter solution that is as efficient as what I wrote, please let me know.
I'm pretty sure regular expressions could be a solution here, but they tend to be unreadable and can be much less efficient than hand-written code. It would still be nice to know how they would work for this kind of question. Maybe I could run a small benchmark on it too.
OK so here's my solution, including a sample to test it:
TextSearchUtil.kt
object TextSearchUtil {
/**@return where the query was found. The first integer is the start index; the second is the end index, exclusive.
* Special cases: Pair(0,0) if query is empty or ignored, null if not found.
* @param text the text to search within. Only allowed characters are searched for. Rest are ignored
* @param query what to search for. Only allowed characters are searched for. Rest are ignored
* @param allowedCharactersSet the only characters we should be allowed to check. Rest are ignored*/
fun findOccurrenceWhileIgnoringCharacters(text: String, query: String, allowedCharactersSet: HashSet<Char>): Pair<Int, Int>? {
//get index of first char to search for
var searchIndexStart = -1
for ((index, c) in query.withIndex())
if (allowedCharactersSet.contains(c)) {
searchIndexStart = index
break
}
if (searchIndexStart == -1) {
//query contains only ignored characters, so it's like an empty one
return Pair(0, 0)
}
//got index of first character to search for
if (text.isEmpty())
//need to search for a character, but the text is empty, so not found
return null
var mainIndex = 0
while (mainIndex < text.length) {
var searchIndex = searchIndexStart
var isFirstCharToSearchFor = true
var secondaryIndex = mainIndex
var charToSearch = query[searchIndex]
secondaryLoop@ while (secondaryIndex < text.length) {
//skip ignored characters on query
if (!isFirstCharToSearchFor)
while (!allowedCharactersSet.contains(charToSearch)) {
++searchIndex
if (searchIndex >= query.length) {
//reached end of search while all characters were fine, so found the match
return Pair(mainIndex, secondaryIndex)
}
charToSearch = query[searchIndex]
}
//skip ignored characters on text
var c: Char? = null
while (secondaryIndex < text.length) {
c = text[secondaryIndex]
if (allowedCharactersSet.contains(c))
break
else {
if (isFirstCharToSearchFor)
break@secondaryLoop
++secondaryIndex
}
}
//reached end of text
if (secondaryIndex == text.length) {
if (isFirstCharToSearchFor)
//couldn't find the first character anywhere, so failed to find the query
return null
break@secondaryLoop
}
//time to compare
if (c != charToSearch)
break@secondaryLoop
++searchIndex
isFirstCharToSearchFor = false
if (searchIndex >= query.length) {
//reached end of search while all characters were fine, so found the match
return Pair(mainIndex, secondaryIndex + 1)
}
charToSearch = query[searchIndex]
++secondaryIndex
}
++mainIndex
}
return null
}
}
Sample usage to test it :
MainActivity.kt
class MainActivity : AppCompatActivity() {
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
setContentView(R.layout.activity_main)
//
val text = "+972 50-123-45678"
val allowedCharacters = "01234567890+*#"
val allowedPhoneCharactersSet = HashSet<Char>(allowedCharacters.length)
for (c in allowedCharacters)
allowedPhoneCharactersSet.add(c)
//
val tests = hashMapOf(
"" to Pair(0, 0),
"9" to Pair(1, 2),
"97" to Pair(1, 3),
"250" to Pair(3, 7),
"250123" to Pair(3, 11),
"250118" to null,
"++" to null,
"8" to Pair(16, 17),
"+" to Pair(0, 1),
"+8" to null,
"78" to Pair(15, 17),
"5678" to Pair(13, 17),
"788" to null,
"+ " to Pair(0, 1),
" " to Pair(0, 0),
"+ 5" to null,
"+ 9" to Pair(0, 2)
)
for (test in tests) {
val result = TextSearchUtil.findOccurrenceWhileIgnoringCharacters(text, test.key, allowedPhoneCharactersSet)
val isResultCorrect = result == test.value
val foundStr = if (result == null) null else text.substring(result.first, result.second)
when {
!isResultCorrect -> Log.e("AppLog", "checking query of \"${test.key}\" inside \"$text\" . Succeeded?$isResultCorrect Result: $result found String: \"$foundStr\"")
foundStr == null -> Log.d("AppLog", "checking query of \"${test.key}\" inside \"$text\" . Succeeded?$isResultCorrect Result: $result")
else -> Log.d("AppLog", "checking query of \"${test.key}\" inside \"$text\" . Succeeded?$isResultCorrect Result: $result found String: \"$foundStr\"")
}
}
//
Log.d("AppLog", "special cases:")
Log.d("AppLog", "${TextSearchUtil.findOccurrenceWhileIgnoringCharacters("a", "c", allowedPhoneCharactersSet) == Pair(0, 0)}")
Log.d("AppLog", "${TextSearchUtil.findOccurrenceWhileIgnoringCharacters("ab", "c", allowedPhoneCharactersSet) == Pair(0, 0)}")
Log.d("AppLog", "${TextSearchUtil.findOccurrenceWhileIgnoringCharacters("ab", "cd", allowedPhoneCharactersSet) == Pair(0, 0)}")
Log.d("AppLog", "${TextSearchUtil.findOccurrenceWhileIgnoringCharacters("a", "cd", allowedPhoneCharactersSet) == Pair(0, 0)}")
}
}
If I want to highlight the result, I can use something like this:
val pair = TextSearchUtil.findOccurrenceWhileIgnoringCharacters(text, "2501", allowedPhoneCharactersSet)
if (pair == null)
textView.text = text
else {
val wordToSpan = SpannableString(text)
wordToSpan.setSpan(BackgroundColorSpan(0xFFFFFF00.toInt()), pair.first, pair.second, Spannable.SPAN_EXCLUSIVE_EXCLUSIVE)
textView.setText(wordToSpan, TextView.BufferType.SPANNABLE)
}
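As one possible shorter variant of the search above, the "clean the string" idea can be kept if the original position of every retained character is remembered, so a match in the cleaned text can be mapped back to the original text. This is a sketch in Scala rather than Kotlin; TextSearch and find are hypothetical names, Option is used instead of null, and it allocates extra memory proportional to the text, so it is not claimed to be as efficient as the hand-written loop.

```scala
object TextSearch {
  // Returns (start, endExclusive) in the original text, Some((0, 0)) when the
  // query has no allowed characters, or None when there is no match.
  def find(text: String, query: String, allowed: Set[Char]): Option[(Int, Int)] = {
    val cleanedQuery = query.filter(allowed)
    if (cleanedQuery.isEmpty) Some((0, 0))
    else {
      // Original indices of the characters that survive cleaning.
      val positions = text.indices.filter(i => allowed(text(i)))
      val cleanedText = positions.map(i => text(i)).mkString
      cleanedText.indexOf(cleanedQuery) match {
        case -1 => None
        case start =>
          val last = start + cleanedQuery.length - 1
          Some((positions(start), positions(last) + 1))
      }
    }
  }
}
```

For the phone number "+972 50-123-45678" and the query "250", the cleaned text is "+9725012345678", the match starts at cleaned index 3, and mapping back gives the original range (3, 7), which covers "2 50".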

Parallel Merge Sort in Scala

I have been trying to implement parallel merge sort in Scala. But with 8 cores, using .sorted is still about twice as fast.
edit:
I rewrote most of the code to minimize object creation. Now it runs about as fast as the .sorted
Input file with 1.2M integers:
1.333580 seconds (my implementation)
1.439293 seconds (.sorted)
How should I parallelize this?
New implementation
object Mergesort extends App
{
//=====================================================================================================================
// UTILITY
implicit object comp extends Ordering[Any] {
def compare(a: Any, b: Any) = {
(a, b) match {
case (a: Int, b: Int) => a compare b
case (a: String, b: String) => a compare b
case _ => 0
}
}
}
//=====================================================================================================================
// MERGESORT
val THRESHOLD = 30
def inssort[A](a: Array[A], left: Int, right: Int): Array[A] = {
for (i <- (left+1) until right) {
var j = i
val item = a(j)
while (j > left && comp.lt(item,a(j-1))) {
a(j) = a(j-1)
j -= 1
}
a(j) = item
}
a
}
def mergesort_merge[A](a: Array[A], temp: Array[A], left: Int, right: Int, mid: Int) : Array[A] = {
var i = left
var j = right
while (i < mid) { temp(i) = a(i); i+=1; }
while (j > mid) { temp(i) = a(j-1); i+=1; j-=1; }
i = left
j = right-1
var k = left
while (k < right) {
if (comp.lt(temp(i), temp(j))) { a(k) = temp(i); i+=1; k+=1; }
else { a(k) = temp(j); j-=1; k+=1; }
}
a
}
def mergesort_split[A](a: Array[A], temp: Array[A], left: Int, right: Int): Array[A] = {
if (right-left == 1) a
if ((right-left) > THRESHOLD) {
val mid = (left+right)/2
mergesort_split(a, temp, left, mid)
mergesort_split(a, temp, mid, right)
mergesort_merge(a, temp, left, right, mid)
}
else
inssort(a, left, right)
}
def mergesort[A: ClassTag](a: Array[A]): Array[A] = {
val temp = new Array[A](a.size)
mergesort_split(a, temp, 0, a.size)
}
Previous implementation
Input file with 1.2M integers:
4.269937 seconds (my implementation)
1.831767 seconds (.sorted)
What tricks are there to make it faster and cleaner?
object Mergesort extends App
{
//=====================================================================================================================
// UTILITY
val StartNano = System.nanoTime
def dbg(msg: String) = println("%05d DBG ".format(((System.nanoTime - StartNano)/1e6).toInt) + msg)
def time[T](work: =>T) = {
val start = System.nanoTime
val res = work
println("%f seconds".format((System.nanoTime - start)/1e9))
res
}
implicit object comp extends Ordering[Any] {
def compare(a: Any, b: Any) = {
(a, b) match {
case (a: Int, b: Int) => a compare b
case (a: String, b: String) => a compare b
case _ => 0
}
}
}
//=====================================================================================================================
// MERGESORT
def merge[A](left: List[A], right: List[A]): Stream[A] = (left, right) match {
case (x :: xs, y :: ys) if comp.lteq(x, y) => x #:: merge(xs, right)
case (x :: xs, y :: ys) => y #:: merge(left, ys)
case _ => if (left.isEmpty) right.toStream else left.toStream
}
def sort[A](input: List[A], length: Int): List[A] = {
if (length < 100) return input.sortWith(comp.lt)
input match {
case Nil | List(_) => input
case _ =>
val middle = length / 2
val (left, right) = input splitAt middle
merge(sort(left, middle), sort(right, middle + length%2)).toList
}
}
def msort[A](input: List[A]): List[A] = sort(input, input.length)
//=====================================================================================================================
// PARALLELIZATION
//val cores = Runtime.getRuntime.availableProcessors
//dbg("Detected %d cores.".format(cores))
//lazy implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(cores))
def futuremerge[A](fa: Future[List[A]], fb: Future[List[A]])(implicit order: Ordering[A], ec: ExecutionContext) =
{
for {
a <- fa
b <- fb
} yield merge(a, b).toList
}
def parallel_msort[A](input: List[A], length: Int)(implicit order: Ordering[A]): Future[List[A]] = {
val middle = length / 2
val (left, right) = input splitAt middle
if(length > 500) {
val fl = parallel_msort(left, middle)
val fr = parallel_msort(right, middle + length%2)
futuremerge(fl, fr)
}
else {
Future(msort(input))
}
}
//=====================================================================================================================
// MAIN
val results = time({
val src = Source.fromFile("in.txt").getLines
val header = src.next.split(" ").toVector
val lines = if (header(0) == "i") src.map(_.toInt).toList else src.toList
val f = parallel_msort(lines, lines.length)
Await.result(f, concurrent.duration.Duration.Inf)
})
println("Sorted as comparison...")
val sorted_src = Source.fromFile(input_folder+"in.txt").getLines
sorted_src.next
time(sorted_src.toList.sorted)
val writer = new PrintWriter("out.txt", "UTF-8")
try writer.print(results.mkString("\n"))
finally writer.close
}
My answer is probably going to be a bit long, but I hope it will be useful for both of us.
So, the first question is: "how does Scala sort a List?" Let's have a look at the code from the Scala repo!
def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
val len = this.length
val b = newBuilder
if (len == 1) b ++= this
else if (len > 1) {
b.sizeHint(len)
val arr = new Array[AnyRef](len) // Previously used ArraySeq for more compact but slower code
var i = 0
for (x <- this) {
arr(i) = x.asInstanceOf[AnyRef]
i += 1
}
java.util.Arrays.sort(arr, ord.asInstanceOf[Ordering[Object]])
i = 0
while (i < arr.length) {
b += arr(i).asInstanceOf[A]
i += 1
}
}
b.result()
}
So what the hell is going on here? Long story short: the sorting is delegated to Java. Everything else is just size adjustment and casting. Basically, this is the line that does the work:
java.util.Arrays.sort(arr, ord.asInstanceOf[Ordering[Object]])
Let's go one level deeper into JDK sources:
public static <T> void sort(T[] a, Comparator<? super T> c) {
if (c == null) {
sort(a);
} else {
if (LegacyMergeSort.userRequested)
legacyMergeSort(a, c);
else
TimSort.sort(a, 0, a.length, c, null, 0, 0);
}
}
legacyMergeSort is nothing but a single-threaded implementation of the merge sort algorithm.
The next question is: "what is TimSort.sort and when do we use it?"
To the best of my knowledge, LegacyMergeSort.userRequested is false by default, which leads us to the TimSort.sort algorithm. Description can be found here. Why is it better? Fewer comparisons than in merge sort, according to the comments in the JDK sources.
Moreover, you should be aware that it is all single-threaded, so there is no parallelization here.
Third question, "your code":
You create too many objects. When it comes to performance, mutation (sadly) is your friend.
Premature optimization is the root of all evil -- Donald Knuth. Before making any optimizations (like parallelism), implement a single-threaded version and compare the results.
Use something like JMH to test the performance of your code.
You should probably not use the Stream class if you want the best performance, as it memoizes the elements it has already computed.
I intentionally did not give you answer like "super-fast merge sort in scala can be found here", but just some tips for you to apply to your code and coding practices.
Hope it will help you.
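To make the "How should I parallelize this?" part concrete, here is one possible sketch that follows the tips above: it mutates arrays in place, sorts small ranges sequentially, and only forks a Future for the top few levels of the recursion. Everything here is an assumption for illustration: the names (ParallelMergeSort, sortRange), the Threshold value, the depth heuristic, and the restriction to Array[Int]. It has not been benchmarked, so measure it (for example with JMH) before drawing conclusions.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelMergeSort {
  // Ranges at or below this size are sorted sequentially (assumed value; tune by measurement).
  val Threshold: Int = 1 << 13

  // Merges the sorted ranges a[left, mid) and a[mid, right) back into a, using temp as scratch space.
  private def merge(a: Array[Int], temp: Array[Int], left: Int, mid: Int, right: Int): Unit = {
    System.arraycopy(a, left, temp, left, right - left)
    var i = left
    var j = mid
    var k = left
    while (k < right) {
      if (j >= right || (i < mid && temp(i) <= temp(j))) { a(k) = temp(i); i += 1 }
      else { a(k) = temp(j); j += 1 }
      k += 1
    }
  }

  // depth limits how many levels fork a task, so roughly 2^depth tasks exist at most.
  private def sortRange(a: Array[Int], temp: Array[Int], left: Int, right: Int, depth: Int): Unit =
    if (right - left <= Threshold || depth <= 0) java.util.Arrays.sort(a, left, right)
    else {
      val mid = (left + right) >>> 1
      // The left half runs on the pool while the current thread sorts the right half.
      val leftHalf = Future(sortRange(a, temp, left, mid, depth - 1))
      sortRange(a, temp, mid, right, depth - 1)
      Await.result(leftHalf, Duration.Inf)
      merge(a, temp, left, mid, right)
    }

  // Sorts the array in place and returns it.
  def sort(a: Array[Int]): Array[Int] = {
    val cores = Runtime.getRuntime.availableProcessors
    val depth = 32 - Integer.numberOfLeadingZeros(cores) // about log2(cores) + 1
    sortRange(a, new Array[Int](a.length), 0, a.length, depth)
    a
  }
}
```

The two halves touch disjoint index ranges of both arrays, so the tasks do not race with each other; the merge only starts after the left half's Future has completed.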

Truncate text to get preview in Scala

I need to truncate a text to get a preview. The preview is a prefix of the text of about N characters (but no more), and it should not split words in the middle.
preview("aaa", 10) = "aaa"
preview("a b c", 10) = "a b c"
preview("aaa bbb", 5) = "aaa"
preview("a b ccc", 3) = "a b"
I coded a function as follows:
def preview(s:String, n:Int) =
if (s.length <= n) s else s.take(s.lastIndexOf(' ', n))
Would you change or fix it?
Now I am thinking about how to handle the case where the words are separated by one or more whitespace characters (including \n, \t, etc.) rather than just a single space. How would you improve the function to handle this case?
How about the following:
def preview(s: String, n: Int) = if (s.length <= n) {
s
} else {
// Index n is the (n + 1)-th character; searching up to index n + 1 could return n + 1 characters
s.take(s.lastIndexWhere(_.isSpaceChar, n)).trim
}
This function will:
For strings of length n or less, return the string itself (no preview required).
Otherwise, find the last space character among the first n + 1 characters and take the string up to that point. Looking at character n + 1 tells us whether the last word is being split: if it is a space, the first n characters end on a word boundary; otherwise they end mid-word.
Note: isSpaceChar matches the Unicode space, line-separator and paragraph-separator characters, but not control characters such as \n or \t. Replace it with isWhitespace if you need those as word separators as well.
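For the follow-up about words separated by arbitrary whitespace (\n, \t, runs of spaces), a variant based on isWhitespace might look like the sketch below. The object name Preview is hypothetical, and two choices are assumptions rather than requirements from the question: trailing whitespace is stripped from the preview, and a text with no whitespace in range is cut hard at n characters.

```scala
object Preview {
  def preview(s: String, n: Int): String =
    if (s.length <= n) s
    else {
      // Index n holds the first character that does not fit into the preview.
      val cut = s.lastIndexWhere(_.isWhitespace, n)
      if (cut < 0) s.take(n) // no word boundary available: hard cut (assumption)
      else s.take(cut).reverse.dropWhile(_.isWhitespace).reverse // strip trailing whitespace
    }
}
```

For example, Preview.preview("aaa\t\nbbb", 5) finds the newline at index 4, takes "aaa\t", and strips the tab, giving "aaa".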
I propose the following (updated):
def ellipsize(text : String, max : Int): String = {
def ellipsize0(s : String): String =
if(s.length <= max) s
else {
val end = s.lastIndexOf(" ")
if(end == -1) s.take(max)
else ellipsize0(s.take(end))
}
ellipsize0("\\s+".r.replaceAllIn(text, " "))
}
Or yours, modified:
def preview(str : String, n : Int) = {
(s : String) => if (s.length <= n) s else s.take(s.lastIndexOf(' ', n))
}.apply( "\\s+".r.replaceAllIn(str, " "))
How about this
def preview(s:String, n:Int) =
if (s.length <= n) s
else s.take(n).takeWhile(_ != ' ')
Try it here: http://scalafiddle.net/console/a05d886123a54de3ca4b0985b718fb9b
This seems to work:
// find the last word that is not split by n, then take to its end
def preview(text: String, n: Int): String =
text take (("""\S+""".r findAllMatchIn text takeWhile (_.end <= n)).toList match {
case Nil => n
case ms => ms.last.end
})
An alternative take (pun intended) but doesn't like input of all whitespace:
text take (("""\S+""".r findAllMatchIn text takeWhile (m => m.start == 0 || m.end <= n)).toList.last.end min n)
Extensionally:
object Previewer {
implicit class `string preview`(val text: String) extends AnyVal {
// find the last word that is not split by n, then take to its end
def preview(n: Int): String =
text take (("""\S+""".r findAllMatchIn text takeWhile (_.end <= n)).toList match {
case Nil => n
case ms => ms.last.end
})
}
}
Looks nice that way:
import org.junit.Test
import org.junit.Assert.assertEquals
class PreviewTest {
import Previewer._
@Test def shorter(): Unit = {
assertEquals("aaa", "aaa" preview 10)
}
@Test def spacey(): Unit = {
assertEquals("a b c", "a b c" preview 10)
}
@Test def split(): Unit = {
assertEquals("abc", "abc cba" preview 5)
}
@Test def onspace(): Unit = {
assertEquals("a b", "a b cde" preview 3)
}
@Test def trimming(): Unit = {
assertEquals("a b", "a b cde" preview 5)
}
@Test def none(): Unit = {
assertEquals(" " * 5, " " * 8 preview 5)
}
@Test def prefix(): Unit = {
assertEquals("a" * 5, "a" * 10 preview 5)
}
}
