Let's assume we have a big CSV/Excel file containing a large number of records with the following fields:
1. Email
2. First Name
3. Last Name
4. Phone Number, etc.
Among these records, we need to identify the duplicates based on matching the Email, First Name, and Last Name fields.
For the duplicate calculation, some custom rules are defined that give a score to an individual record.
For example:
1. If the email is an exact match, the score is 100, else 0.
2. For First Name, Last Name, etc., the score is based on the edit distance (normalized to a 0-100 similarity).
For example, lets assume that search parameter is like the following
Email:xyz#gmail.com,First Name: ABCD,Last Name:EFGH
The rows/records are like:
1. Email: xyz@gmail.com, First Name: ABC, Last Name: EFGH
2. Email: 123xyz@gmail.com, First Name: ABC, Last Name: EFGH
For record 1, score = 100 (email) + 75 (first name) + 100 (last name) = 275, i.e. 91.7%.
For record 2, score = 0 (email) + 75 (first name) + 100 (last name) = 175, i.e. 58.3%.
The duplicate-detection threshold is 75%, so record 1 is a duplicate and record 2 is not. This is fairly simple to implement when we have input parameters and use them to determine the duplicates in a file.
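In code, those rules can be sketched like this (plain Python; the 75 for the first name comes from normalizing the edit distance by the longer string's length):

def edit_distance(a, b):
    # classic Levenshtein distance, single-row dynamic programming
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
            prev, dp[j] = dp[j], cur
    return dp[-1]

def score(query, record):
    # email: exact match scores 100, else 0; names: normalized similarity 0-100
    s = 100 if query["email"] == record["email"] else 0
    for field in ("first_name", "last_name"):
        q, r = query[field], record[field]
        s += round(100 * (1 - edit_distance(q, r) / max(len(q), len(r))))
    return 100.0 * s / 300  # percentage of the maximum possible score

# record 1 above: 100 + 75 + 100 = 275 -> 91.7%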
But how do we apply this logic when all the records are in a file and we need to find out, for all of them, which ones are duplicates?
Here no input parameter is defined, and we need to compare each record with all the other records to compute the relevance score.
How can this be achieved in Apache Spark?
Load the data into Spark and group by the email column; then, within each bag, apply a distance algorithm to the first-name and last-name columns. (With a 75% threshold, any duplicate must match on email anyway, since the name fields alone contribute at most 200/300, about 67%.) This should be pretty straightforward in Spark:
val records = sc.textFile("hdfs://path/to/data")   // one CSV line per record
records.map(line => (line.split(",")(0), line))    // key each record by its email column
  .groupByKey()                                    // bag of records sharing the same email
  .map { case (email, bag) => /* score the records within each bag against each other */ }
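In PySpark the same shape would be roughly as follows (a sketch; the path and the scoring inside each bag are assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="dedup")             # assumes an existing Spark setup
records = sc.textFile("hdfs:///path/to/data")  # hypothetical path

def score_bag(pair):
    email, rows = pair
    rows = list(rows)
    # pairwise-score the names within this bag; the emails already match
    return email, rows

result = (records
          .map(lambda line: (line.split(",")[0], line))  # key by the email column
          .groupByKey()
          .map(score_bag)
          .collect())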
Suppose I have a CSV whose column headers carry parametric information written as a string, something like "Title, temperature=25, voltage=0.8V, x=0" for each column. The rows are statistical data, let's say 10 samples for each set of title, temperature, voltage, and x.
x is a variable (0 to 10, for example) that I want to use as the x axis of the new tables.
I want to find a way to group all columns with the same title, temperature, and voltage into separate tables.
Each column of the new table is an iteration of the statistical data; each row is one of the x values.
My input CSV is:
My output should be 2 CSVs (because there are 2 different sets of title, temperature, and voltage):
CSV out 1
CSV out 2
In general there can be more sets, so there is no limit on how many outputs I will get.
I was thinking of doing some kind of loop that goes through all the columns and writes all the different combinations (excluding the x value) into a set. So in our example it will be:
test_set = set()
# first loop to create all the table names
for i in range(csv_in.shape[1]):
    col_name = csv_in.columns[i]
    # here I need some string manipulation to extract only the title, temp and voltage from the column name
    test_set.add(col_name)
Then another loop goes over that set, creates a sub-table, pivots it so that the x values become the rows of the table, and saves the file with the appropriate name using the title, temp, and voltage parameters (see the sketch below).
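Something like this is what I have in mind, just as a sketch (the header format is assumed from my example above):

import re
import pandas as pd

csv_in = pd.read_csv("data.csv")  # hypothetical input file

def split_name(col):
    # split 'Title, temperature=25, voltage=0.8V, x=0' into (base, x)
    return re.match(r"(.+),\s*x=(.+)", col).groups()

# collect the columns belonging to each (title, temperature, voltage) set
tables = {}
for col in csv_in.columns:
    base, x = split_name(col)
    tables.setdefault(base, {})[x] = csv_in[col].reset_index(drop=True)

# one output CSV per set; transposing makes the x values the rows
for i, (base, cols) in enumerate(tables.items(), start=1):
    # base could be slugified into a descriptive filename instead
    pd.DataFrame(cols).T.to_csv(f"csv_out_{i}.csv")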
Practically, I don't really know where to start since I'm pretty new to Python, so I would like some suggestions, and any pointers you can give me would be great :)
Thanks!
I have a huge table in an Oracle database that I want to work on in PySpark. But I want to partition it using a custom query; for example, imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in PySpark?
P.S.: The reason I need to control the partitioning is that I have to perform some aggregations on each partition (the partitions have meaning; they are not just there to distribute the data), so I want each of them to be on the same machine to avoid any shuffles. Is this possible? Or am I wrong about something?
NOTE: I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city, etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turns out that Spark has a way of controlling the partitioning logic exactly: the predicates option of spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition them based on userId and productId so that all the records of an entity are kept together on the same machine, and we can perform aggregations on these entities without shuffling.)
First, produce the histogram of every column that you want to partition by (the count of each value):
userId     count
123456      1640
789012       932
345678      1849
901234        11
...          ...

productId    count
123456789     5435
523485447      254
363478326     2343
326484642      905
...            ...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId     bin
123456       1
789012       1
345678       1
901234       2
...        ...

productId    bin
123456789      1
523485447      2
363478326      2
326484642      3
...          ...
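For the binning step, if you don't have a multifit implementation handy, a simple greedy heuristic (heaviest value first, always into the currently lightest bin) gets close to balanced bins. A sketch in Python, with the histogram as a dict; this is an approximation, not the exact multifit algorithm:

import heapq

def balanced_bins(counts, n):
    # assign each value to one of n bins, keeping the bin totals balanced
    bins = [(0, b) for b in range(1, n + 1)]  # (running total, bin id)
    heapq.heapify(bins)
    assignment = {}
    for value, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        total, bin_id = heapq.heappop(bins)   # lightest bin so far
        assignment[value] = bin_id
        heapq.heappush(bins, (total + count, bin_id))
    return assignment

# e.g. balanced_bins({123456: 1640, 789012: 932, 345678: 1849, 901234: 11}, 2)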
Then, store these mappings in the database.
Then, update your query to join on these tables and get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'
query = """
(SELECT
    MY_TABLE.*,
    USER_PARTITION.BIN AS USER_BIN,
    PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
    ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
    ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY"""
df = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates, one for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
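These can also be generated programmatically from the number of partitions:

n = 8  # hypothetical partition count
predicates = [f'USER_BIN = {i} OR PRODUCT_BIN = {i}' for i in range(1, n + 1)]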
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in a worker's memory, then you need to do sub-partitioning: first add a new random numeric column to your data (between 0 and some chunk_size, like 10000), then do the partitioning based on the combination of that number and the original ID (like userId). This splits each entity into fixed-size chunks, ensuring that each chunk fits in a worker's memory.
And after the aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again.
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way you can achieve the desired results.
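As an illustration of that last point, a two-phase aggregation over such sub-partitions might look like this in PySpark (the amount column and the sum are assumptions for the example):

from pyspark.sql import functions as F

chunk_size = 10000  # hypothetical chunk size

# split each entity into chunks with a random sub-partition key
df = df.withColumn("sub", (F.rand() * chunk_size).cast("int"))

# phase 1: aggregate within each (userId, sub) chunk
partial = df.groupBy("userId", "sub").agg(F.sum("amount").alias("part_sum"))

# phase 2: merge the chunks of each user; this is the inevitable shuffle
totals = partial.groupBy("userId").agg(F.sum("part_sum").alias("total"))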
Problem
I have two queries: one contains product data (data_query); the other (recode_query) contains product names from within the data_query and assigns them specific id_tags. The id_tags are also column names within the data_query.
What I need to achieve and fail at
I need the data_query to look up the id_tag of the specific product name, as parsed from the recode_query (this is already working and in place), and put the retrieved value into the specific custom column cell. In Excel, I would use an INDEX/MATCH combo:
{=INDEX(data_query[#Data];; MATCH(data_query[#id_tag]; data_query[#Headers]; 0))}
I have searched near and far, but I probably couldn't even spot the solution if I came across it, as I am not that deep into data manipulation and Power Query myself.
Is this what you're wanting?
let
    DataQuery = Table.FromColumns({{1,2,3}, {"Boxed", "Bagged", "Rubberbanded"}}, {"ID","Pkg"}),
    RecodeQuery = Table.FromColumns({{"Squirt Gun", "Coffee Maker", "Trenching Tool"}, {1,2,3}}, {"Prod Name", "ID2"}),
    // join the two queries on their id columns, the Power Query equivalent of the INDEX/MATCH lookup
    Rzlt = Table.Join(DataQuery, "ID", RecodeQuery, "ID2", JoinKind.Inner)
in
    Rzlt
The following is working as expected, but how do I execute range queries like "where age > 40 and age < 50"?
create keyspace Keyspace1;
use Keyspace1;
create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
set Users[jsmith][first] = 'John';
set Users[jsmith][last] = 'Smith';
set Users[jsmith][age] = long(42);
get Users[jsmith];
=> (column=age, value=42, timestamp=1341375850335000)
=> (column=first, value=John, timestamp=1341375827657000)
=> (column=last, value=Smith, timestamp=1341375838375000)
The best way to do this in Cassandra varies depending on your requirements, but the approaches are fairly similar for supporting these types of range queries.
Basically, you will take advantage of the fact that columns within a row are sorted by their names. So, if you use the age as the column name (or part of the column name), the row will be sorted by age.
You will find a lot of similarities between this and storing time-series data. I suggest you take a look at Basic Time Series with Cassandra for the fundamentals, and the second half of an intro to the latest CQL features that gives an example of a somewhat more powerful approach.
The built-in secondary indexes are basically designed like a hash table, and don't work for range queries unless that range expression accompanies an equality expression on an indexed column. So, you could ask for select * from users where name = 'Joe' and age > 54, but not simply select * from users where age > 54, because that would require a full table scan. See Secondary Indexes doc for more details.
You have to create a Secondary index on the column age:
update column family Users with column_metadata=[{column_name: age, validation_class: LongType, index_type: KEYS}];
Then use:
get Users where age > 40 and age < 50
Note: I think exclusive operators are no longer supported as of 1.2.
DataStax has good documentation about this: http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes Or you can create and maintain your own secondary index; this is a good link about that:
http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html
I have predefined item combinations (for example brand1|brand2|brand3, etc.) in a table.
I'd like to collect brands and check them against the predefined table data.
For example, if I collected brand1|brand2|brand3, then I can get some value from that predefined table (it meets the condition).
How can I check this?
The number of brands is unlimited. Also, if brand1|brand2|brand3 or brand1|brand2 exists, then it returns true.
Okay, taking a wild guess at what you're asking: you have a delimited field with brands in it, separated by a | character. You want to return any row that has the right combination of brands, but you don't want to return rows with, for example, the brand "testify" in them when you search for "test".
You have four search conditions (looking for brand3):
the brand exists by itself: "brand3"
the brand starts the delimited field: "brand3|brand4|brand6"
the brand is in the middle of the field: "brand1|brand3|brand6"
the brand is at the end of the field: "brand1|brand2|brand3"
so, in SQL:
SELECT *
FROM MyTable
WHERE BrandField = 'brand3'
OR BrandField LIKE 'brand3|%'
OR BrandField LIKE '%|brand3|%'
OR BrandField LIKE '%|brand3'
Repeat as required for multiple brands.
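If you're assembling the query from code, the per-brand conditions can be generated instead of written by hand. A sketch in Python (table and column names taken from the example above):

def brand_conditions(brand):
    # the four positions a brand can occupy in the delimited field
    return [
        f"BrandField = '{brand}'",
        f"BrandField LIKE '{brand}|%'",
        f"BrandField LIKE '%|{brand}|%'",
        f"BrandField LIKE '%|{brand}'",
    ]

brands = ["brand1", "brand3"]  # hypothetical search set
where = " AND ".join("(" + " OR ".join(brand_conditions(b)) + ")" for b in brands)
sql = "SELECT * FROM MyTable WHERE " + where
# in real code, prefer parameterized queries over string interpolation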