Using a Large Lookup Table - apache-spark

Problem Statement:
I have two tables - Data (40 cols) and Lookup (2 cols). I need to use col10 in the Data table together with the Lookup table to extract the relevant value.
However, I cannot make an equi join. I need a join based on like/contains, as the values in the Lookup table contain only partial content of the value in the Data table, not the complete value. Hence some regex-based matching is required.
Data Size:
Data table: approx. 2.3 billion entries (1 TB of data)
Lookup table: approx. 1.4 million entries (50 MB of data)
Approach 1: Using the database (I am using Google BigQuery) - a join based on LIKE takes close to 3 hrs, yet returns no result. I believe the regex-based join leads to a Cartesian join.
Approach 2: Using Apache Beam/Spark - I tried to construct a trie for the lookup table, which is then shared/broadcast to the worker nodes. However, with this approach I get OOM errors because I create too many Strings. I tried increasing the memory to 4 GB+ per worker node, but to no avail.
I am using the trie to extract the longest matching prefix.
I am open to using other technologies like Apache Spark, Redis etc.
Please suggest how I can go about handling this problem.
This processing needs to be performed on a day-to-day basis, hence both time and resources need to be optimized.

However, I cannot make an equi join

Below is just an idea to explore for addressing your equi-join issue in pure BigQuery.
It is based on an assumption I derived from your comments - it covers the use case where you are looking for the longest match from the very right towards the left; matches in the middle do not qualify.
The approach is to reverse both the url (col10) and shortened_url (col2) fields and then SPLIT() them and UNNEST() while preserving positions:
UNNEST(SPLIT(REVERSE(field), '.')) part WITH OFFSET position
With this done, you can now do an equi join, which can potentially address your issue to some extent.
So, you JOIN by parts and positions, then GROUP BY the original url and shortened_url, keeping only the groups HAVING a count of matches equal to the count of parts in shortened_url, and finally you GROUP BY url and keep only the entry with the highest number of matching parts.
Hope this can help :o)
This is for BigQuery Standard SQL.
#standardSQL
WITH data_table AS (
  SELECT 'cn456.abcd.tech.com' url UNION ALL
  SELECT 'cn457.abc.tech.com' UNION ALL
  SELECT 'cn458.ab.com'
), lookup_table AS (
  SELECT 'tech.com' shortened_url, 1 val UNION ALL
  SELECT 'abcd.tech.com', 2
), data_table_parts AS (
  SELECT url, x, y
  FROM data_table, UNNEST(SPLIT(REVERSE(url), '.')) x WITH OFFSET y
), lookup_table_parts AS (
  SELECT shortened_url, a, b, val,
    ARRAY_LENGTH(SPLIT(REVERSE(shortened_url), '.')) len
  FROM lookup_table, UNNEST(SPLIT(REVERSE(shortened_url), '.')) a WITH OFFSET b
)
SELECT url,
  ARRAY_AGG(STRUCT(shortened_url, val) ORDER BY weight DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT url, shortened_url, COUNT(1) weight, ANY_VALUE(val) val
  FROM data_table_parts d
  JOIN lookup_table_parts l
  ON x = a AND y = b
  GROUP BY url, shortened_url
  HAVING weight = ANY_VALUE(len)
)
GROUP BY url
with result as

Row  url                  shortened_url  val
1    cn457.abc.tech.com   tech.com       1
2    cn456.abcd.tech.com  abcd.tech.com  2
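Since the question mentions being open to Spark, below is a hedged PySpark sketch of the same technique as the SQL above: reverse the values, explode them into dot-separated parts with their positions, equi-join on (part, position), keep only candidates whose every part matched, then take the candidate with the most matching parts per url. Column names mirror the SQL; the broadcast() hint reflects the small lookup table and is an assumption - this is a sketch, not a tested pipeline.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [('cn456.abcd.tech.com',), ('cn457.abc.tech.com',), ('cn458.ab.com',)], ['url'])
lookup = spark.createDataFrame(
    [('tech.com', 1), ('abcd.tech.com', 2)], ['shortened_url', 'val'])

def parts(df, col):
    # split the reversed value on '.' and keep each part together with its offset
    return df.select('*', F.posexplode(F.split(F.reverse(F.col(col)), r'\.')).alias('pos', 'part'))

data_parts = parts(data, 'url')
lookup_parts = parts(
    lookup.withColumn('len', F.size(F.split(F.reverse(F.col('shortened_url')), r'\.'))),
    'shortened_url')

matches = (
    data_parts.join(F.broadcast(lookup_parts), ['part', 'pos'])
              .groupBy('url', 'shortened_url', 'len', 'val')
              .count()
              .where(F.col('count') == F.col('len'))  # every part of the candidate matched
)

best = Window.partitionBy('url').orderBy(F.col('count').desc())
result = (matches.withColumn('rn', F.row_number().over(best))
                 .where('rn = 1')
                 .select('url', 'shortened_url', 'val'))
result.show(truncate=False)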

Related

Pyspark filter or split based on row value

I'm dealing with some data in pyspark. We have the issue that metadata and actual data are mixed. This means we have Strings which are interrupted by a "STOP" string. The number of Strings between the "STOP"s is variable, and we would like to filter out short occurrences.
An example dataframe where we have ints instead of Strings and 0 is the stop signal is below:
df = spark.createDataFrame([*[(1,),(0,),(3,),(0,),(4,),(4,),(5,),(0,)]])
My goal would now be to have a filter function where I can say how many elements need to be between two stop signals in order for the data to be kept. E.g. if min_length was two, we would end up with the dataframe:
df = spark.createDataFrame([(4,),(4,),(5,),(0,)])
My idea was to create a separate column and create a group in there:
df.select("_1", F.when(df["_1"]==2, 0).otherwise(get_counter())).show()
The get_counter function should count how many times we've already seen "Stop" (or 0 in the example df). Due to the distributed nature of Spark that does not work though.
Is it somehow possible to easily achieve this by filtering? Or is it maybe possible to split the dataframe every time "STOP" occurs? I could then delete the dataframes that are too short and merge the rest again.
Preferably this would be solved in pyspark or sql-spark. But if someone knows how to do this with the spark-shell, I'd also be curious :)
Spark sql implementation:
with t2 as (
  select
    monotonically_increasing_id() as id
    , _1
    , case when _1 = 0 then 1 else 0 end as stop
  from
    t1
)
, t3 as (
  select
    *
    , row_number() over (partition by stop order by id) as stop_seq
  from
    t2
)
select * from t3 where stop_seq > 2
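If you prefer to stay in the DataFrame API, here is a hedged PySpark sketch of the logic described in the question (it is not a transcription of the SQL above): number the rows, derive a group id as the running count of stop rows seen so far, and keep only the groups that contain at least min_length non-stop rows. Ordering by monotonically_increasing_id() is an assumption carried over from the SQL.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (0,), (3,), (0,), (4,), (4,), (5,), (0,)])

min_length = 2
w = Window.orderBy('id')

grouped = (
    df.withColumn('id', F.monotonically_increasing_id())
      .withColumn('stop', (F.col('_1') == 0).cast('int'))
      # group id = number of stop rows strictly before this row
      .withColumn('grp', F.sum('stop').over(w) - F.col('stop'))
)

# size of each group, counting only the non-stop rows
sizes = grouped.groupBy('grp').agg(F.sum(1 - F.col('stop')).alias('n'))

result = (grouped.join(sizes, 'grp')
                 .where(F.col('n') >= min_length)
                 .orderBy('id')
                 .select('_1'))
result.show()   # 4, 4, 5, 0 for min_length = 2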

What is the industry standard Deduping method in Dataflows?

So deduping is one of the basic and important data cleaning techniques.
There are a number of ways to do that in a data flow.
For example, I do deduping with the help of an Aggregate transformation, where I put the key columns (consider "Firstname" and "LastName" as the cols) that need to be unique in Group by, and a column pattern like name != 'Firstname' && name != 'LastName' with $$ as the column name expression and first($$) as the value expression in the Aggregates tab.
The problem with this method is that if we have a total of 200 cols among 300 cols to be considered as unique cols, it is very tedious to include 200 cols in my column pattern.
Can anyone suggest a better and more optimised deduping process in Data Flow for the above situation?
I tried to repro the deduplication process using dataflow. Below is the approach.
The list of columns that need to be grouped by is given in the data flow parameters.
In this repro, three columns are given. This can be extended as per requirements.
Parameter name: Par1
Type: String
Default value: 'col1,col2,col3'
The source is taken as below.
(Group by columns: col1, col2, col3;
Aggregate column: col4)
Then an Aggregate transform is taken, and in Group by,
sha2(256,byNames(split($Par1,','))) is given in columns and it is named groupbycolumn.
In Aggregates, click + Add column pattern near column1 and then delete Column1. Then enter true() in the matching condition. Then click on the undefined column expression and enter $$ in the column name expression and first($$) in the value expression.
Output of the Aggregate transformation:
Data is grouped by col1, col2 and col3, and the first value of col4 is taken for every col1, col2, col3 combination.
Then, using a Select transformation, groupbycolumn from the above output can be removed before copying to the sink.
Reference: MS document on Mapping data flow script - Azure Data Factory | Microsoft Learn
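For comparison only, a minimal PySpark sketch of the same idea (this is not Data Flow syntax): hash the configurable key columns and keep the first row per hash. The column names and the 'col1,col2,col3' parameter string are assumptions mirroring the repro above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
key_cols = 'col1,col2,col3'.split(',')   # plays the role of the Par1 parameter

df = spark.createDataFrame(
    [(1, 'a', 'x', 10), (1, 'a', 'x', 20), (2, 'b', 'y', 30)],
    ['col1', 'col2', 'col3', 'col4'])

deduped = (
    df.withColumn('groupbycolumn', F.sha2(F.concat_ws('||', *key_cols), 256))
      .groupBy('groupbycolumn')
      .agg(*[F.first(c).alias(c) for c in df.columns])   # first() of every original column
      .drop('groupbycolumn')                             # mirrors the final Select step
)
deduped.show()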

Custom partitioning on JDBC in PySpark

I have a huge table in an oracle database that I want to work on in pyspark. But I want to partition it using a custom query, for example imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in pyspark?
P.S.: The reason that I need to control the partitioning, is that I need to perform some aggregations on each partition (partitions have meaning, not just to distribute the data) and so I want them to be on the same machine to avoid any shuffles. Is this possible? or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turned out that Spark has a way of controlling the partitioning logic exactly, and that is the predicates option of spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition it based on userId and productId so that all the records of an entity is kept together on the same machine, and we can perform aggregations on these entities without shuffling)
First, produce the histogram of every column that you want to partition by (count of each value):
userId   count
123456   1640
789012   932
345678   1849
901234   11
...      ...

productId   count
123456789   5435
523485447   254
363478326   2343
326484642   905
...         ...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId   bin
123456   1
789012   1
345678   1
901234   2
...      ...

productId   bin
123456789   1
523485447   2
363478326   2
326484642   3
...         ...
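A minimal sketch of the binning step, using a greedy "heaviest value to the lightest bin" heuristic as a stand-in for full multifit, and assuming the histograms are small enough to handle on the driver:

import heapq

def assign_bins(histogram, n_bins):
    """Spread values over n_bins so that the total counts stay roughly balanced."""
    # (current load, bin id) min-heap: the next heaviest value always goes to the lightest bin
    heap = [(0, b) for b in range(1, n_bins + 1)]
    heapq.heapify(heap)
    bins = {}
    for value, count in sorted(histogram, key=lambda vc: vc[1], reverse=True):
        load, b = heapq.heappop(heap)
        bins[value] = b
        heapq.heappush(heap, (load + count, b))
    return bins

# example with the userId histogram above: maps each userId to a bin between 1 and n_bins
user_bins = assign_bins([(123456, 1640), (789012, 932), (345678, 1849), (901234, 11)], n_bins=2)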
Then, store these bin assignments in the database.
Then update your query and join on these tables to get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'
query = '''
(SELECT
    MY_TABLE.*,
    USER_PARTITION.BIN AS USER_BIN,
    PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
    ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
    ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY'''
# predicates is the list of per-partition WHERE clauses generated below
df = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates. One for each partition, like these:
predicates = [
    'USER_BIN = 1 OR PRODUCT_BIN = 1',
    'USER_BIN = 2 OR PRODUCT_BIN = 2',
    'USER_BIN = 3 OR PRODUCT_BIN = 3',
    ...
    'USER_BIN = n OR PRODUCT_BIN = n',
]
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in the workers' memory, then you need to do a sub-partitioning. Meaning that you should first add a new random numeric column to your data (between 0 and some chunk_size like 10000 or something), then do the partitioning based on the combination of that number and the original IDs (like userId). This causes each entity to be split into fixed-sized chunks (i.e., 10000) to ensure it fits in the workers' memory.
And after the aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again.
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way you can achieve the desired results.
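A hedged sketch of that sub-partitioning idea, illustrating only the two-phase aggregation (the 'purchases' table and 'AMOUNT' column are illustrative, not from the answer): pre-aggregate per (USER_ID, chunk), then merge the partial results per USER_ID in the final, unavoidable shuffle.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
chunk_size = 10000

df = spark.table('purchases')   # assumed table with USER_ID and AMOUNT columns

partial = (
    df.withColumn('chunk', (F.rand() * chunk_size).cast('int'))   # random sub-bucket per record
      .groupBy('USER_ID', 'chunk')
      .agg(F.sum('AMOUNT').alias('amount'), F.count(F.lit(1)).alias('n'))
)

# second pass merges the per-chunk partials into one row per user
totals = partial.groupBy('USER_ID').agg(F.sum('amount').alias('amount'), F.sum('n').alias('n'))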

Google BigQuery nested select subquery in cross join

I have the following code:
SELECT ta.application as koekkoek, ta.ipc, ipc_count/ipc_tot as ipc_share, t3.sfields FROM (
select t1.appln_id as application, t1.ipc_subclass_symbol as ipc, count(t2.appln_id) as ipc_count, sum(ipc_count) over (PARTITION BY application) as ipc_tot
FROM temp.tls209_small t1
CROSS JOIN
(SELECT appln_id, FROM temp.tls209_small group by appln_id ) t2
where t1.appln_id = t2.appln_id
GROUP BY application, ipc
) as ta
CROSS JOIN thesis.ifris_ipc_concordance t3
WHERE ta.ipc LIKE t3.ipc+'%'
AND ta.ipc NOT LIKE t3.not_ipc+'%'
AND t3.not_appln_id NOT IN
(SELECT ipc_subclass_symbol from temp.tls209_small t5 where t5.appln_id = ta.application)
This gives the following error:
Field 'ta.application' not found.
I have tried numerous notations for the field, but BigQuery doesn't seem to recognize any reference to other tables in the subquery.
The purpose of the code is to assign new technology classifications to records based on a concordance table:
I have got two tables:
One large table with application ids, classifications and some other fields: tls209_small.
And a concordance table with some exception rules: ifris_ipc_concordance.
In the end I need to assign the sfields label to each row in tls209 (300 million rows). The rule is that ipc_class_symbol from the first table should be LIKE ipc+'%' from the second table, but NOT LIKE not_ipc+'%'.
In addition, the not_appln_id value, if present, should not be associated with the same appln_id in the first table.
So a small example, say this is the input of the query:
appln_id | ipc_class_symbol
1        | A1
1        | A2
1        | A3
1        | C3

sfields | ipc | not_ipc | not_appln_id
X       | A   | A2      | null
Y       | A   | null    | A3
appln_id 1 should get sfields X two times, because ipc=A matches A1 and A3 (A2 is excluded by not_ipc).
Y should not be assigned at all, as its not_appln_id A3 occurs under appln_id 1.
In the results, I also need the share of the ipc_class_symbol for a single application (1 for 328100001, 0.5 for 32100009 etc.).
Without the last condition (AND t3.not_appln_id NOT IN (SELECT ipc_subclass_symbol from temp.tls209_small t5 where t5.appln_id = ta.application) ) the query works fine:
Any suggestions on how to get the subquery to recognize the application id (ta.application), or other ways to introduce the last condition to the query?
I realize my explanation of the problem may not be very straightforward, so if anything is not clear please indicate so, I'll try to clarify the issues.
The query you're performing is doing an anti-join. You can re-write this as an explicit join, but it is a little verbose:
SELECT *
FROM [x.z] as z
LEFT OUTER JOIN EACH [x.y] as y ON y.appln_id = z.application
WHERE y.not_appln_id is NULL
A working solution for the problem was achieved by first generating a table by matching only the ipc_class_symbol from the first table to the ipc column of the second, while also including the not_ipc and not_appln_id columns from the second. In addition, a list of all ipc class labels assigned to each appln_id was added using the GROUP_CONCAT method.
Finally, with help from Pentium10, the resulting table was filtered based on the exception rules, as also discussed in this question.
In the final query, the GROUP BY and JOIN clauses needed EACH modifiers to allow the large tables to be processed:
SELECT application as appln_id, ipc as ipc_class, ipc_share, sfields as ifris_class
FROM (
  SELECT * FROM (
    SELECT ta.application as application, ta.ipc as ipc, ipc_count/ipc_tot as ipc_share,
      t3.sfields as sfields, t3.ipc as yes_ipc, t3.not_ipc as not_ipc,
      t3.not_appln_id as exclude, t4.classes as other_classes
    FROM (
      SELECT t1.appln_id as application, t1.ipc_class_symbol as ipc,
        count(t2.appln_id) as ipc_count,
        sum(ipc_count) over (PARTITION BY application) as ipc_tot
      FROM thesis.tls209_appln_ipc t1
      FULL OUTER JOIN EACH
        (SELECT appln_id, FROM thesis.tls209_appln_ipc GROUP EACH BY appln_id) t2
        ON t1.appln_id = t2.appln_id
      GROUP EACH BY application, ipc
    ) AS ta
    LEFT JOIN EACH (
      SELECT appln_id, GROUP_CONCAT(ipc_class_symbol) as classes
      FROM [thesis.tls209_appln_ipc]
      GROUP EACH BY appln_id) t4
      ON ta.application = t4.appln_id
    CROSS JOIN thesis.ifris_ipc_concordance t3
    WHERE ta.ipc CONTAINS t3.ipc
  ) as tx
  WHERE (not ipc contains not_ipc or not_ipc is null)
    AND (not other_classes contains exclude or exclude is null or other_classes is null)
)

Cassandra CQL selecting rows with has values different on two columns

I create a table:
CREATE TABLE T (
I int PRIMARY KEY,
A text,
B text
);
Then I add two columns X and Y using:
ALTER TABLE T ADD X int;
CREATE INDEX ON T (X);
ALTER TABLE T ADD Y int;
CREATE INDEX ON T (Y);
I put some data, and now I would like to count the rows which have different values in X and Y (even X < Y would be fine). I tried something like this:
select COUNT(*) from T where X < Y ;
But I'm getting the error no viable alternative at input ';'. This also doesn't work without COUNT - just a simple *.
Do you have any suggestions on how to overcome this error?
I tried using counters instead of integers, but they forced me to put all non-counter data into the primary key, which wasn't a good idea in my case ...
I'm using Cassandra 1.2.6 and CQL 3.
PS: can I perform an UPDATE on all rows? Without a WHERE clause, or with some dummy one?
As Cassandra prefers simple reads, the Cassandra way to do this is to maintain a boolean flag column that you set on every update/insert. With a (secondary) index on that flag, those reads can also be served faster.
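A hedged sketch of that pattern with the Python driver (the flag column X_LT_Y, the keyspace name and the contact point are assumptions): maintain the flag whenever X and Y are written, then read through the indexed flag.

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')   # assumed keyspace

session.execute('ALTER TABLE T ADD X_LT_Y boolean')       # one-off schema change
session.execute('CREATE INDEX ON T (X_LT_Y)')

def upsert(i, a, b, x, y):
    # keep the flag in sync with X and Y on every write
    session.execute(
        'INSERT INTO T (I, A, B, X, Y, X_LT_Y) VALUES (%s, %s, %s, %s, %s, %s)',
        (i, a, b, x, y, x < y))

# rows with X < Y can now be counted via the indexed flag
# (note: older Cassandra versions cap COUNT(*) by the query LIMIT)
rows = session.execute('SELECT COUNT(*) FROM T WHERE X_LT_Y = true')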
