Search a string and list all sentences matching that string

I am trying to write Scala code for the following use case:
Search for a string in a text file and list only the sentences that contain a match for this string.
I tried the following:
val fileContents = Source.fromFile("/Users/sc/Documents/Scala_Code/input.txt").getLines.mkString
val sentence = fileContents.filter(line => fileContents.contains("string to search"))
This returns the entire text file even if there is only one match. I need just the sentences that contain a match.
I would appreciate any suggestions.

I think it's hard to describe a sentence reliably with a regex. Nevertheless, here's my suggestion.
For all sentences (in case you want to pattern match on them):
"""\A?\b((?!\?+"?|!+"?|\.+)(.|\n))+(\Z|\?+"?|!+"?|\.+)""".r.findAllIn(fileContents.mkString) //.toSeq
For a specific string (for example, "you"):
"""\A?\b((?!\?+"?|!+"?|\.+)(.|\n))+(\Z|\?+"?|!+"?|\.+)""".r.findAllIn(fileContents.mkString).toIterator.withFilter(_.contains("you")) //.toSeq
Appending toSeq (or toList) is useful for inspecting the results on a small amount of data.
You can test it here: https://scalafiddle.io/sf/0znMzyi/8
Hope it helps.
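For comparison, here is a simpler sketch that avoids a full sentence regex. It assumes a sentence ends with ., ! or ? followed by whitespace, which will mis-split abbreviations such as "e.g."; the object and method names are made up for illustration. It also shows why the code in the question returns everything: filter on a String iterates over characters, and the predicate there tests the whole file rather than the current element.

```scala
object SentenceSearch {
  // Naive splitter (assumption): a sentence ends with '.', '!' or '?'
  // followed by whitespace. The lookbehind keeps the punctuation attached.
  def sentences(text: String): Seq[String] =
    text.split("""(?<=[.!?])\s+""").toSeq.map(_.trim).filter(_.nonEmpty)

  // Keep only the sentences that contain the search string.
  def matching(text: String, needle: String): Seq[String] =
    sentences(text).filter(_.contains(needle))
}
```

When reading the file, joining the lines with getLines.mkString(" ") rather than plain mkString keeps words on adjacent lines from being glued together.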

Related

Groovy - characters loss with stream.getText

I have this Groovy script that I'm testing:
InputStream is = awsS3Stream.getObjectContent();
def lines = is.getText("UTF-8");
println "lines:"+lines;
Pattern pattern = ~/type\"\:\"[A-Z][a-z]*\"/;
Matcher matcher = pattern.matcher(lines);
...
I noticed that, depending on the size of the awsS3Stream object, the variable lines may not contain all of the text: the end of it is missing. I was hoping that using StringBuffer instead of String would solve the issue, but it did not. I hope someone knows a Groovy-based solution, as I'm not very familiar with Groovy. I much appreciate your time.
P.S. The issue I'm seeing is not related to the pattern. I don't need the pattern to see that the variable lines doesn't always contain all of the data.

Are you trying to match alphabetic strings with just one initial uppercase letter? If not, the problem is with your regexp. To match camel-case strings with any number of capital letters, use this:
Pattern pattern = ~/type\"\:\"[A-Za-z]*\"/;
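To make the difference between the two patterns concrete, here is a small sketch written in Scala rather than Groovy (both use java.util.regex underneath). The object name and the sample input are assumptions for illustration only.

```scala
object TypePattern {
  // Original pattern: one uppercase letter followed by lowercase letters only.
  val strict = "type\":\"[A-Z][a-z]*\"".r
  // Relaxed pattern: any mix of ASCII letters.
  val relaxed = "type\":\"[A-Za-z]*\"".r

  def strictMatch(s: String): Option[String] = strict.findFirstIn(s)
  def relaxedMatch(s: String): Option[String] = relaxed.findFirstIn(s)
}
```

With an input such as {"type":"FooBar"}, only the relaxed pattern finds a match, because the strict one stops at the second capital letter.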

The issue was with the data going into s3, not how I retrieve it.

Checking if values in List is part of String

I have a string like this:
val a = "some random test message"
I have a list like this:
val keys = List("hi","random","test")
Now I want to check whether the string a contains any of the values from keys. How can we do this using Scala's built-in library functions?
(I know I could split a into a List, compare it against the keys list, and derive the answer from that. But I'm looking for a simpler way using standard library functions.)

Something like this?
keys.exists(a.contains(_))
Or even more idiomatically
keys.exists(a.contains)

The simple case is to test substring containment (as remarked in rarry's answer), e.g.
keys.exists(a.contains(_))
You didn't say whether you actually want to find whole word matches instead. Since rarry's answer assumed you didn't, here's an alternative that assumes you do.
val a = "some random test message"
val words = a.split(" ")
val keys = Set("hi","random","test") // could be a List (see below)
words.exists(keys contains _)
Bear in mind that keeping the keys in a List is only efficient when the list is small. With a List, the contains method scans the elements linearly until it finds a match or reaches the end.
For larger numbers of items, a Set is not only preferable but also a truer representation of the information. Sets are typically optimised via hash codes and therefore need little or no linear searching.
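The two interpretations can give different results, which the following sketch illustrates. The object and method names are hypothetical, and splitting on whitespace is an assumption about what counts as a word (punctuation is not stripped).

```scala
object KeyMatch {
  // Substring containment: "random" matches inside "randomized".
  def containsAnySubstring(text: String, keys: Iterable[String]): Boolean =
    keys.exists(k => text.contains(k))

  // Whole-word match: split on whitespace and look each word up in a Set.
  def containsAnyWord(text: String, keys: Set[String]): Boolean =
    text.split("\\s+").exists(keys.contains)
}
```

For the text "some randomized message" and the key "random", the substring check succeeds while the whole-word check does not.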

Dynamically create all sentences of given type based on word substitutions

I have a C# problem at the moment that I haven't been able to get my head around. Essentially I need to generate a list or array of strings which are sentences based on the fact that in the sentence, one or more words may have different spellings or uses. I intend to have a number of different possibilities for (potentially) each word in the sentence.
For example if I define that the word 'are' could be written as 'are' or 'r' and the word 'you' be written 'you' or 'u', expected output for passing something along the lines of "how are you" would be:
"how are you"
"how r you"
"how are u"
"how r u"
I've considered that I could use Enums for word types, e.g:
public enum Word
{
Are,
You
}
and return an array of possible uses using some kind of helper method:
public static string[] GetVariants(Word w)
{
    switch (w)
    {
        case Word.Are:
            return new string[] { "are", "r" };
        case Word.You:
            return new string[] { "you", "u" };
        default:
            // No variants defined for this word
            return new string[0];
    }
}
But I cannot seem to find a decent way to define a sentence using a mix of fixed strings and these variable word type identifiers and create possible combinations.
The words need to be in the right order as they would be written, I just need to be able to generate a number of different ways of writing the same thing. Once I've got something going with this, I'd also like for it to be not just applicable to this particular sentence structure. Is this possible?

In case anybody comes across a similar problem, I eventually got this working the way I wanted using an adaptation of Eric Lippert's Cartesian product solution:
http://blogs.msdn.com/b/ericlippert/archive/2010/06/28/computing-a-cartesian-product-with-linq.aspx
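The idea behind a Cartesian product approach can be sketched as follows. This is not the code from the linked post or from my solution; it is a minimal illustration written in Scala, with a hypothetical variant table and names, assuming words are separated by single spaces.

```scala
object Variants {
  // Hypothetical variant table: each known word maps to its spellings.
  val variants: Map[String, Seq[String]] = Map(
    "are" -> Seq("are", "r"),
    "you" -> Seq("you", "u")
  )

  def expand(sentence: String): Seq[String] = {
    // One list of choices per word; unknown words have a single choice.
    val options: Seq[Seq[String]] =
      sentence.split(" ").toSeq.map(w => variants.getOrElse(w, Seq(w)))
    // Cartesian product: extend every partial sentence with every choice.
    options
      .foldLeft(Seq(Seq.empty[String])) { (acc, choices) =>
        for (prefix <- acc; c <- choices) yield prefix :+ c
      }
      .map(_.mkString(" "))
  }
}
```

Calling Variants.expand("how are you") yields the four sentences from the question, with word order preserved. Because the table is just data, the same function works for other sentence structures.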

Select substring between two characters in Scala

I'm getting a garbled JSON string from an HTTP request, so I'm looking for a temporary solution that selects only the JSON string.
The request.params() returns this:
[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,
callback=jQuery1707229194729661704_1329793018352
I would like everything from the start of the '{' to the end of the '}'.
I found lots of examples of doing similar things with other languages, but the purpose of this is not to only solve the problem, but also to learn Scala. Will someone please show me how to select that {....} part?

Regexps should do the trick:
"\\{.*\\}".r.findFirstIn("your json string here")

As Jens said, a regular expression usually suffices for this. With a triple-quoted string the backslashes do not need to be doubled, so the syntax is a bit different:
"""\{.*\}""".r
creates an instance of scala.util.matching.Regex, which provides the usual query methods for a regular expression.
In your case, you are only interested in the first occurrence in the string, which is what findFirstIn returns:
scala> """\{.*\}""".r.findFirstIn("""[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,callback=jQuery1707229194729661704_1329793018352""")
res1: Option[String] = Some({"insured_initials":"Tt","insured_surname":"Test"})
Note that it returns an Option, which you can easily use in a match to find out whether the regexp matched or not.
Edit: A final point to watch out for is that, by default, . does not match line breaks. If your JSON is not fully contained in the first line, you may want to remove the line breaks first or enable DOTALL mode with the (?s) flag.
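Putting these points together, here is a minimal sketch. The object and method names are made up for illustration, and the (?s) flag is used so that the pattern also spans line breaks. Note that .* is greedy, so the match runs from the first { to the last } in the input.

```scala
object JsonSlice {
  // (?s) turns on DOTALL, so '.' also matches line breaks.
  private val Braces = """(?s)\{.*\}""".r

  // Some(text between the first '{' and the last '}') or None.
  def extract(raw: String): Option[String] = Braces.findFirstIn(raw)

  // Pattern matching on the Option to handle both outcomes.
  def describe(raw: String): String = extract(raw) match {
    case Some(json) => s"found: $json"
    case None       => "no JSON object found"
  }
}
```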

Sphinx sentence like query

My task is to find similar sentences in a database collection.
Could you advise me which query type to use?
Sample:
Search: Welcome to the first sample code.
Let's say the following sentences are acceptable results for my query:
Dbase:
...
Welcome in first movie ...
This is first sample code ...
Welcome!
...
Thanks

If I got it correctly, each sentence in the DB, which includes one or more words from the search query, is fine.
In this case, you have to use the SPH_MATCH_ANY mode or SPH_MATCH_EXTENDED2 with | (OR) operator.
If you want to exclude such words as "to", "the" and other short words, you have several options:
1) If you are sure that every word shorter than 4 letters should be excluded, add the following line to your sphinx.conf file:
min_word_len = 4
2) If you want to exclude specific words, use one or more stopwords files.
Add the following lines to sphinx.conf:
#path to txt file with words to be excluded (space separated)
stopwords = /usr/local/sphinx/configuration/stopwords.txt
And the last thing you should know is that I just provided very basic things which are clearly explained in the documentation and my examples are also taken from there.
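If you go the extended-syntax route, the OR query string has to be built from the search sentence on the client side. The sketch below only shows that string-building step in Scala; it does not call any Sphinx API, the names are hypothetical, and the length filter merely imitates on the client what min_word_len does on the server.

```scala
object OrQuery {
  // Split on non-word characters, drop short words, join with the
  // extended-syntax OR operator. minLen is an illustrative parameter.
  def build(sentence: String, minLen: Int = 4): String =
    sentence
      .split("\\W+")
      .filter(w => w.nonEmpty && w.length >= minLen)
      .mkString(" | ")
}
```

For the sample search sentence this produces "Welcome | first | sample | code", which can then be passed as the query text.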
