Separating fields out of a string in Hive

I have the following problem...
I work with Hive and want to load a file with several (different) kinds of rows of strings. Those contain fields with a fixed size, like this:
A20130420bcd 34 fgh
where the fields have the length 1,8,6,4,3.
Separated it would look like this:
"A,20130420,bcd,fgh"
Is there any way to read the string and split it into fields other than taking a substring for every field, like
substring(col_value,1,1) Field1
etc?
I would imagine that cutting off the already-read part of the string would increase performance, but I couldn't think of any way to do this with the given functions here.
Secondly, as stated before, there are different types of strings, ordered and identified by the first character. Right now I just check those with a WHERE statement, but it's horrible, as it runs through the whole file just to find only the first string. Is there any way to read specific lines by their number? If I know that the first string will be of a certain kind, can I read it directly?
Right now it looks like this:
insert overwrite table TEST
SELECT
substring(col_value,1,1) field1,
...
substring(col_value,10,3) field5
from temp_data WHERE substring(col_value,1,1) = 'A';
Any ideas on this? I would love to hear some suggestions =)

You need to write your own generic UDF parser that outputs a struct or map or whatever is appropriate. You can refer to existing UDFs that output multiple values.
Then you can write:
insert overwrite table output
select t.parsed.first, t.parsed.second
from (
  select parse(target) as parsed
  from input
) t
where t.parsed.first = 'X';
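A minimal sketch of what such a generic UDF could look like in Java, assuming the field widths from the question (1, 8, 6, 4, 3); the class name, struct field names, and the trimming are illustrative, not a definitive implementation:

import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class FixedWidthParseUDF extends GenericUDF {
    // field widths from the question: 1, 8, 6, 4, 3
    private static final int[] WIDTHS = {1, 8, 6, 4, 3};
    private static final String[] NAMES = {"first", "second", "third", "fourth", "fifth"};

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // declare the return type: struct<first:string, ..., fifth:string>
        return ObjectInspectorFactory.getStandardStructObjectInspector(
            Arrays.asList(NAMES),
            Collections.<ObjectInspector>nCopies(WIDTHS.length,
                PrimitiveObjectInspectorFactory.javaStringObjectInspector));
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        Object value = args[0].get();
        if (value == null) {
            return null;
        }
        String line = value.toString();
        Object[] fields = new Object[WIDTHS.length];
        int pos = 0;
        for (int i = 0; i < WIDTHS.length; i++) {
            int end = Math.min(pos + WIDTHS[i], line.length());
            fields[i] = line.substring(pos, end).trim();
            pos = end; // each field starts where the previous one ended, so the line is walked only once
        }
        return fields; // Hive reads an Object[] as a struct row
    }

    @Override
    public String getDisplayString(String[] children) {
        return "parse(" + children[0] + ")";
    }
}

After ADD JAR and CREATE TEMPORARY FUNCTION parse AS '...FixedWidthParseUDF', the query above should work as written.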
About the second question: you may want to check Hive's EXPLAIN command to see whether Hive does filter push-down for you (just see how many MapReduce stages the query takes; theoretically it should be a single map, depending on 1. the Hive version, 2. the output table format).
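For example, prefixing the query from the question with EXPLAIN shows the plan without running it:

EXPLAIN
SELECT substring(col_value,1,1) AS field1
FROM temp_data
WHERE substring(col_value,1,1) = 'A';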
In a general sense, this is why databases are popular -- they take optimization into consideration for you.

Related

Can we write an Insert query in a Lookup Activity in ADF

At the end of the pipeline I wanted to run the below query:
INSERT INTO [TestTable] (Job_name, status) VALUES (Job_name, current_timestamp())
Job_name will be passed as a parameter.
Can this be written in a Lookup? Please let me know.
Definitely possible, but should you write massive inserts/scripts within a Lookup... probably not a great idea, but see below (a truncate example, though it works the same with an insert).
I use this method for small things like truncating a table, but never for big code that should be stored as source in the DB.
EDIT:
If you need to pass parameters or variables into the Lookup, you should use string interpolation, like so:
INSERT INTO dbo.MyTable
SELECT '@{variables('YourVariable')}' as Variable1,
       '@{pipeline().Pipeline}' as PipelineName
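For the insert from the question, it could look something like this (a sketch, not a definitive recipe: the parameter reference is an assumption, and the trailing SELECT is a common workaround because a Lookup expects a result set back):

INSERT INTO [TestTable] (Job_name, status)
VALUES ('@{pipeline().parameters.Job_name}', CURRENT_TIMESTAMP);
SELECT 1 AS done; -- Lookup needs at least one row back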

Is it possible to split an array in Gherkin onto the next line

I have a step with a string array, something like this:
Then Drop-dow patient_breed contains ['Breed1', 'Breed2',.... Breed20']
I need to split this text across two lines. I know that Gherkin has the """ docstring expression. I tried something like this:
Then Drop-dow patient_breed contains ['Breed1',
"""
'Breed2',.... Breed20']
"""
It didn't help. Is there any solution?
What do you gain by putting this string in your scenario? IMO all you are doing is making the scenario harder to read!
What do you lose by putting this string in your scenario?
Well, first of all, you now have at least two things that determine the exact contents of the string: the thing in the application that creates it and the hardcoded string in your scenario. So you are repeating yourself.
In addition, you've increased the cost of change. Let's say we want our strings to change from 'Breedx' to 'Breed: x'. Now you have to change every scenario that looks at the drop-down. This will take much, much longer than changing the code.
So what can you do instead?
Change your scenario step so that it becomes Then I should see the patient breeds and delegate the HOW of the presentation of the breeds and even the sort of control that the breeds are presented in to something that is outside of Cucumber e.g. a helper method called by a step definition, or perhaps even something in your codebase.
Try a data table approach. You will have to add a DataTable argument to the step definition; see the sketch after the example below.
Then Drop-dow patient_breed contains
| Breed1 |
| Breed2 |
...
| Breed20 |
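A minimal sketch of the matching step definition, assuming Cucumber-Java; the class name and the comparison logic are placeholders:

import io.cucumber.datatable.DataTable;
import io.cucumber.java.en.Then;
import java.util.List;

public class BreedSteps {
    @Then("Drop-dow patient_breed contains")
    public void dropDownContains(DataTable expected) {
        List<String> breeds = expected.asList(); // one breed per table row
        // compare 'breeds' with the options actually shown in the drop-down
    }
}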
For a multiline approach, try the below. In this case you will have to add a String argument to the step definition; see the sketch after the example.
Then Drop-dow patient_breed contains
"""
['Breed1','Breed2',.... Breed20']
"""
I would read the entire string and then split it using Java after it has been passed into the step. In order to keep my step as a one or two liner, I would use a helper method that I implemented myself.

Checking if values in List is part of String

I have a string like this:
val a = "some random test message"
I have a list like this:
val keys = List("hi","random","test")
Now, I want to check whether the string a contains any values from keys. How can we do this using the built-in library functions of Scala?
(I know the way of splitting a into a List and then checking it against the keys list. But I'm looking for a simpler way of solving it using standard library functions.)
Something like this?
keys.exists(a.contains(_))
Or even more idiomatically
keys.exists(a.contains)
The simple case is to test substring containment (as remarked in rarry's answer), e.g.
keys.exists(a.contains(_))
You didn't say whether you actually want to find whole word matches instead. Since rarry's answer assumed you didn't, here's an alternative that assumes you do.
val a = "some random test message"
val words = a.split(" ")
val keys = Set("hi","random","test") // could be a List (see below)
words.exists(keys contains _)
Bear in mind that a list of keys is only efficient when it is small. With a list, the contains method typically scans the list linearly until it finds a match or reaches the end.
For larger numbers of items, a set is not only preferable but also a truer representation of the information. Sets are typically optimised via hashcodes etc. and therefore need less linear searching - or none at all.
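If the keys start out as a List, converting once up front is cheap; a Scala Set is itself a String => Boolean predicate, so it can be passed to exists directly (a small sketch):

val keySet = keys.toSet // one-time conversion
words.exists(keySet)    // hash lookup per word instead of a linear scan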

ABAP startRFC.exe UTF-8 diacritics text transfer

I have a function module (FM) in SAP and I call it externally using startRFC. The only output of the FM is one internal table. This table has only one column of type char(100), and I need to get it into a text file. startRFC works well, but if there are diacritics (for example, Czech: ěščřžýáíé), only hashes # appear instead of these characters.
Has someone ever solved a similar issue?
If I call the same algorithm manually and write strings on screen in SAP, everything is ok. But startRFC somehow destroys it. The problem may be in the data transfer between SAP and startRFC. But I don't know how this transfer works.
I found a solution, but it is terribly slow. It converts the string to a hexadecimal string using "gcl_conv_to_x->write" and "gcl_conv_to_x->get_buffer", then calls "SCMS_XSTRING_TO_BINARY", and you need a binary table. But it takes 5 minutes to do all this stuff. Without this conversion my algorithm takes 15 seconds.
So finally a solution...
You need to create an XSTRING variable and fill it with your text. To convert a STRING to an XSTRING, use the FM SCMS_STRING_TO_XSTRING, as sketched below.
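A minimal sketch of that conversion, assuming my_string holds the text to return (variable names are illustrative):

DATA: my_xstring TYPE xstring.

CALL FUNCTION 'SCMS_STRING_TO_XSTRING'
  EXPORTING
    text   = my_string
  IMPORTING
    buffer = my_xstring.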
Then you will need an internal table with row type BAPICONTEN. It already contains a component (column) of type SDOK_SDATX (RAW 1022).
And you just append a new line to this table, like this:
DATA: my_table_row LIKE LINE OF my_table.
my_table_row-line = my_xstring. " BAPICONTEN-LINE is RAW 1022
APPEND my_table_row TO my_table. " APPEND ... TO, not INTO
This table (my_table) can be returned via RFC and will contain Cyrillic, German characters, etc.
I am just a beginner, so do not ask me how to create the table, please :)

Cassandra comments data model

I am trying to store very simple comments in a wide row, but the problem is that I want to have top comments.
So at first I tried to use the UTF8 comparator type, where each column name would begin with the number of likes followed by a timestamp, for example:
Comments_CF = {
parent:{
8_timestamp: comment,
5_timestamp: comment,
1_timestamp: comment,
...
}
...
}
The problem with this approach is that, for example, 2_timestamp > 19_timestamp, because lexicographically 2 is bigger than 19.
I could probably store top comments in a separate CF, but then I would need to do two queries instead of one, so I would really like to avoid that. Any suggestions?
Two queries instead of one is usually not a big deal. You could also just use a composite value (number of likes + the comment) and sort the comments yourself.... From what I have seen, there are never a lot of comments except on a few posts anyway, so that would be very quick.
There are other patterns that might spark ideas here as well...
https://github.com/deanhiller/playorm/wiki/Patterns-Page
Use a composite, where the first component is a long and the second is whatever type is appropriate for your timestamp format. This way the sorting will be correct.
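In CQL 3 the same idea can be expressed with clustering columns (a sketch, not the thrift composite syntax itself; table and column names are illustrative):

CREATE TABLE comments (
    parent    text,
    likes     bigint,     -- first clustering component: a long
    posted_at timestamp,  -- second component: the timestamp
    comment   text,
    PRIMARY KEY (parent, likes, posted_at)
) WITH CLUSTERING ORDER BY (likes DESC, posted_at DESC);

-- top comments for a parent come back already sorted by likes
SELECT comment FROM comments WHERE parent = 'post42' LIMIT 10;

Note the trade-off: with likes in the clustering key, bumping a like count means deleting the old row and writing a new one.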
