Looking up a list of words into a String in Scala - string

I need to find if any word from a word list (which could be a Set or List or another structure) is contained (as a sub-string) in another String and I need the best possible performance.
This could be an example:
val query = "update customer set id=rand() where id=1000000009;"
val wordList = Set("NOW(", "now(", "LOAD_FILE(", "load_file(", "UUID(", "uuid(", "UUID_SHORT(",
"uuid_short(", "USER(", "user(", "FOUND_ROWS(", "found_rows(", "SYSDATE(", "sysdate(", "GET_LOCK(", "get_lock(",
"IS_FREE_LOCK(", "is_free_lock(", "IS_USED_LOCK(", "is_used_lock(", "MASTER_POS_WAIT(", "master_pos_wait(",
"RAND(", "rand(", "RELEASE_LOCK(", "release_lock(", "SLEEP(", "sleep(", "VERSION(", "version(")
What is the best option to achieve the best performance? I have read about the contains method but it doesn't work for sub-strings. Is the only option to iterate through the list and to use the method indexOf or there is a better option?

For Scala collections, the method to use in order to answer a question like "is there an item in this collection that satisfies my condition?" is exists (scroll up slightly when you get there because the scaladoc pages are weird about linking directly to methods).
Your condition is "does the string (query) contain this item (word)?" For this, you can use String's contains method, which comes from Java.
Putting it together, you'll get
wordList.exists { word => query.contains(word) }
// or, with some syntax sugar
wordList exists { query.contains }
You can also use .find instead of .exists, which will return an Option containing the first match that was found, instead of just a Boolean indicating whether or not something was found.
scala> wordList.exists(query.contains)
res1: Boolean = true
scala> wordList.find(query.contains)
res2: Option[String] = Some(rand()

This is advice for solution:
Check that you need to optimize it. "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Array is collection with fastest access to element. Use it to increase access's speed.
Sometimes use a ParArray may increase performance.
If it's acceptable, for best performance first cast string to lower case, and remove all UPPER_CASE from set.
You can use own "contains" method to find any of substring. E.g., you can group some words by their prefixes (or suffixes) and don't pass all group if next (prev) symbol is different.
Use native Java to increase performance (Scala can wrap array)
First find all positions of (, because all variants related to it. Than you can check last word's symbol.
Sorry for my English. It is not best advice, but I know small amount of people (e.g. on acm.timus.ru) which can write more faster functions at Scala.

Related

using list values in an if statement to compare with a string

I have this list and string
list_var=(['4 3/4','6','6 3/4'])
string_var='there are 6 inches of cement inside'
i want to check through and find if any of the elements of the list is in the string
if list_var in strin_var (wont work of course due to the fact its a list but this is what i want to achieve is to compare all of the values and if any match return true)
i tried using any function but this not seem to work i would also like to achieve this without using loops as the code i will be using this in contains extracting data from a lot of files so execution speed is an issue
you will have to iteratre over the list at least once as you are trying a search operation.
is_found = any(True for i in list_var if i in string_var)
works perfectly for this scenario.

Separating fields out of a string in Hive

I have the following problem...
I work with Hive and want to add a file with several (different) rows of Strings. Those contain fields with a fixed size, like this:
A20130420bcd 34 fgh
where the fields have the length 1,8,6,4,3.
Separated it would look like this:
"A,20130420,bcd,fgh"
Is there any possibility to read the String and sort it into a field besides getting it as a substring for every field like
substring(col_value,1,1) Field1
etc?
I would imagine that cutting the already read part of the string would increase the performance, but i could think of any way to do this with the given functions here.
Secondly, as stated before, there are different types of strings, ordered and identified by the first character.right now just check those with the WHERE-Statement, but it's horrible, as it runs through the whole file just to find only the first String. Is there any way to read specific lines by their number? If i know, that the first string will be of a certain kind, can read it directly?
right it looks like this:
insert overwrite table TEST
SELECT
substring(col_value,1,1) field1,
...
substring(col_value,10,3) field 5
from temp_data WHERE substring(col_value,1,1) = 'A';
any ideas on this?
I would love to hear some ideas =)
You need to write yours generic-UDF parser that output the struct or map or whatever appropriate. you can refer to UDF that output multi-values.
then you can write
insert overwrite table output
select parsed.first, parsed.second
from (
select parse(taget)
from input
) parsed
where first='X';
About second question,you may need to check "explain" command of hive to see if hive do filter push-down for you.(just see how many map reduce it takes, theoretically it should be one map, depending on 1.hive version,
2.output table format
.)
In general sense, this is why database is popular -- take optimization into consideration for you .

Checking if values in List is part of String

I have a string like this:
val a = "some random test message"
I have a list like this:
val keys = List("hi","random","test")
Now, I want to check whether the string a contains any values from keys. How can we do this using the in built library functions of Scala ?
( I know the way of splitting a to List and then do a check with keys list and then find the solution. But I'm looking a way of solving it more simply using standard library functions.)
Something like this?
keys.exists(a.contains(_))
Or even more idiomatically
keys.exists(a.contains)
The simple case is to test substring containment (as remarked in rarry's answer), e.g.
keys.exists(a.contains(_))
You didn't say whether you actually want to find whole word matches instead. Since rarry's answer assumed you didn't, here's an alternative that assumes you do.
val a = "some random test message"
val words = a.split(" ")
val keys = Set("hi","random","test") // could be a List (see below)
words.exists(keys contains _)
Bear in mind that the list of keys is only efficient for small lists. With a list, the contains method typically scans the entire list linearly until it finds a match or reaches the end.
For larger numbers of items, a set is not only preferable, but also is a more true representation of the information. Sets are typically optimised via hashcodes etc and therefore need less linear searching - or none at all.

Select substring between two characters in Scala

I'm getting a garbled JSON string from a HTTP request, so I'm looking for a temp solution to select the JSON string only.
The request.params() returns this:
[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,
callback=jQuery1707229194729661704_1329793018352
I would like everything from the start of the '{' to the end of the '}'.
I found lots of examples of doing similar things with other languages, but the purpose of this is not to only solve the problem, but also to learn Scala. Will someone please show me how to select that {....} part?
Regexps should do the trick:
"\\{.*\\}".r.findFirstIn("your json string here")
As Jens said, a regular expression usually suffices for this. However, the syntax is a bit different:
"""\{.*\}""".r
creates an object of scala.util.matching.Regex, which provides the typical query methods you may want to do on a regular expression.
In your case, you are simply interested in the first occurrence in a sequence, which is done via findFirstIn:
scala> """\{.*\}""".r.findFirstIn("""[{"insured_initials":"Tt","insured_surname":"Test"}=, _=1329793147757,callback=jQuery1707229194729661704_1329793018352""")
res1: Option[String] = Some({"insured_initials":"Tt","insured_surname":"Test"})
Note that it returns on Option type, which you can easily use in a match to find out if the regexp was found successfully or not.
Edit: A final point to watch out for is that the regular expressions normally do not match over linebreaks, so if your JSON is not fully contained in the first line, you may want to think about eliminating the linebreaks first.

How to make this Groovy string search code more efficient?

I'm using the following groovy code to search a file for a string, an account number. The file I'm reading is about 30MB and contains 80,000-120,000 lines. Is there a more efficient way to find a record in a file that contains the given AcctNum? I'm a novice, so I don't know which area to investigate, the toList() or the for-loop. Thanks!
AcctNum = 1234567890
if (testfile.exists())
{
lines = testfile.readLines()
words = lines.toList()
for (word in words)
{
if (word.contains(AcctNum)) { done = true; match = 'YES' ; break }
chunks += 1
if (done) { break }
}
}
Sad to say, I don't even have Groovy installed on my current laptop - but I wouldn't expect you to have to call toList() at all. I'd also hope you could express the condition in a closure, but I'll have to refer to Groovy in Action to check...
Having said that, do you really need it split into lines? Could you just read the whole thing using getText() and then just use a single call to contains()?
EDIT: Okay, if you need to find the actual line containing the record, you do need to call readLines() but I don't think you need to call toList() afterwards. You should be able to just use:
for (line in lines)
{
if (line.contains(AcctNum))
{
// Grab the results you need here
break;
}
}
When you say efficient you usually have to decide which direction you mean: whether it should run quickly, or use as few resources (memory, ...) as possible. Often both lie on opposite sites and you have to pick a trade-off.
If you want to search memory-friendly I'd suggest reading the file line-by-line instead of reading it at once which I suspect it does (I would be wrong there, but in other languages something like readLines reads the whole file into an array of strings).
If you want it to run quickly I'd suggest, as already mentioned, reading in the whole file at once and looking for the given pattern. Instead of just checking with contains you could use indexOf to get the position and then read the record as needed from that position.
I should have explained it better, if I find a record with the AcctNum, I extract out other information on the record...so I thought I needed to split the file into multiple lines.
if you control the format of the file you are reading, the solution is to add in an index.
In fact, this is how databases are able to locate records so quickly.
But for 30MB of data, i think a modern computer with a decent harddrive should do the trick, instead of over complicating the program.

Resources