Deduplicate Delta Table while Merging - apache-spark

I have a delta table in Azure and a dataframe. The dataframe has duplicates in it, and I want those duplicates not to be added to the delta table when they are merged. I know that I can remove the duplicates before the merge, but I want to know if it's possible to do it during the merge. I have an example below.
Original Delta table
1, "abc", 45678
2, "def", 89076
3, "lop", 90678
New incoming data - dataframe
4, "abc", 98067 --- partition 1
4, "abc", 98067 --- partition 2
4, "abc", 98067 --- partition 3
Merge two tables as shown
Final table
1, "abc", 45678
2, "def", 89076
3, "lop", 90678
4, "abc", 98067 --- THIS ROW SHOULD EXIST ONLY ONCE
Edit: I have tried omitting the “whenMatched” clause when merging, but it keeps the duplicates, and when I add “whenMatchedDelete” it deletes the whole table.
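Note that Delta's MERGE inserts every non-matching source row, so duplicates inside the source dataframe survive unless the source itself is deduplicated first (e.g. dropDuplicates() on the key columns, either before the merge or in the source query passed to it). As a conceptual sketch of the same upsert-with-dedup logic in plain pandas (column names are illustrative, not the OP's actual schema):

```python
import pandas as pd

# Existing table
target = pd.DataFrame({"id": [1, 2, 3],
                       "name": ["abc", "def", "lop"],
                       "value": [45678, 89076, 90678]})

# Incoming data with duplicates (one copy per partition)
source = pd.DataFrame({"id": [4, 4, 4],
                       "name": ["abc", "abc", "abc"],
                       "value": [98067, 98067, 98067]})

# Deduplicate the source on the merge key, then upsert
deduped = source.drop_duplicates(subset=["id"])
merged = (pd.concat([target, deduped])
            .drop_duplicates(subset=["id"], keep="last")
            .reset_index(drop=True))
print(merged)
```

The key point is the drop_duplicates on the source before it ever reaches the target: once the duplicate rows have entered the merge, each one counts as a separate insert.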

Related

Check if each Column in pandas DF is only values [ 0-9]

I am trying to scan a column in a df for values that contain only 0-9. I want to exclude or flag columns in this dataframe that contain alphabetical or mixed values. My attempt:
df_analysis[df_analysis['unique_values'].astype(str).str.contains(r'^[0-9]*$', na=True)]
import pandas as pd
df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"], "numbers": [6, 4, 5], "more_numbers": [1, 2, 3], "mixed": ["wef", 8, 9]})
print(df.select_dtypes(include=["int64"]).columns.to_list())
print(df.select_dtypes(include=["object"]).columns.to_list())
Create dataframe with multiple columns. Use .select_dtypes to find the columns that are integers and return them as a list. You can add "float64" or any other numeric type to the include list.
Output:
['numbers', 'more_numbers']
['string', 'mixed']
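If the goal is instead to flag columns by their string content rather than their dtype (e.g. a column of digit strings stored as object), a per-column regex check along the lines of the OP's contains() attempt might look like this (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"],
                   "numbers": ["6", "4", "5"],      # digits, but stored as strings
                   "mixed": ["wef", "8", "9"]})

# Keep only columns where every value consists solely of the digits 0-9
numeric_cols = [c for c in df.columns
                if df[c].astype(str).str.fullmatch(r"[0-9]+").all()]
print(numeric_cols)  # ['numbers']
```

str.fullmatch anchors the pattern to the whole string, which avoids the partial-match pitfall of str.contains without explicit ^ and $ anchors.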

Subsets of a data frame where certain columns satisfy a condition

I have a dataset catalog with 3 columns: product id, brand name and product class.
import pandas as pd
catalog = {'product_id': [1, 2, 3, 1, 2, 4, 3, 5, 6],
           'brand_name': ['FW', 'GW', 'FK', 'FW', 'GW', 'WU', 'FK', 'MU', 'AS'],
           'product_class': ['ACCESSORIES', 'DRINK', 'FOOD', 'ACCESSORIES', 'DRINK',
                             'FURNITURE', 'FOOD', 'ELECTRONICS', 'APPAREL']}
df = pd.DataFrame(data=catalog)
Assume I have a list of product id prod = [1,3,4]. Now, with Python, I want to list all the brand names corresponding to this list prod based on the product_id. How can I do this using only groupby() and get_group() functions? I can do this using pd.DataFrame() combined with the zip() function, but it is too inefficient, as I would need to obtain each column individually.
Expected output (in dataframe)
Product_id Brand_name
1 'FW'
3 'FK'
4 'WU'
Can anyone give some help on this?
You can use pandas functions isin() and drop_duplicates() to achieve this:
prod = [1,3,4]
print(df[df.product_id.isin(prod)][["product_id", "brand_name"]].drop_duplicates())
Output:
product_id brand_name
0 1 FW
2 3 FK
5 4 WU
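Since the question specifically asks for groupby() and get_group(), here is a sketch of the same result using those two functions (drop_duplicates() is still needed, because get_group() returns every row for a given id):

```python
import pandas as pd

catalog = {'product_id': [1, 2, 3, 1, 2, 4, 3, 5, 6],
           'brand_name': ['FW', 'GW', 'FK', 'FW', 'GW', 'WU', 'FK', 'MU', 'AS'],
           'product_class': ['ACCESSORIES', 'DRINK', 'FOOD', 'ACCESSORIES', 'DRINK',
                             'FURNITURE', 'FOOD', 'ELECTRONICS', 'APPAREL']}
df = pd.DataFrame(catalog)

prod = [1, 3, 4]
groups = df.groupby('product_id')

# Pull each requested group, then collapse the repeated rows
result = (pd.concat(groups.get_group(p) for p in prod)
            [['product_id', 'brand_name']]
            .drop_duplicates()
            .reset_index(drop=True))
print(result)
```

This is more code than the isin() version and offers no real advantage here; it mainly shows that the groupby()/get_group() route is possible.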

Adding a list of unique values to an existing DataFrame column

I want to add a list of unique values to a DataFrame column. Here is the code:
IDs = set(Remedy['Ticket ID'])
log['ID Incidencias'] = IDs
But I obtain the following error:
ValueError: Length of values does not match length of index
Any idea about how could I add a list of unique values to an existing DataFrame column?
Thanks
Not sure if this is what you really need, but to add a list or set of values to each row of an existing dataframe column you can use:
log['ID Incidencias'] = [IDs] * len(log)
Example:
import pandas as pd
df = pd.DataFrame({'col1': list('abc')})
IDs = set((1,2,3,4))
df['col2'] = [IDs] * len(df)
print(df)
# col1 col2
#0 a {1, 2, 3, 4}
#1 b {1, 2, 3, 4}
#2 c {1, 2, 3, 4}
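If the intent was instead one row per unique ID, rather than the whole set repeated on every row, building a fresh frame from the set may be closer to what was wanted. A sketch (Remedy and log are the OP's frames; a stand-in Series is used here, and the set is sorted because Python sets have no defined order):

```python
import pandas as pd

# Stand-in for Remedy['Ticket ID'], with duplicates
ticket_ids = pd.Series([101, 103, 101, 102, 103])

# One row per unique ID, sorted for a deterministic order
ids = pd.DataFrame({'ID Incidencias': sorted(set(ticket_ids))})
print(ids)
```

The original ValueError arose because direct column assignment requires one value per existing row; a new frame sidesteps that length constraint.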

Subsetting a pandas dataframe based on column names stored in a vector

I am scraping data from a website which builds a Pandas dataframe with different column names dependent on the data available on the site. I have a vector of column names, say:
colnames = ['column1', 'column2', 'column3', 'column5']
which are the columns of a postgres database for which I wish to store the scraped data in.
The problem is that, the way I have had to set up the scraping to get all the data I require, I end up grabbing some columns for which I have no use and which aren't in my postgres database. These columns won't have the same names each time, since some pages have extra data, so I can't simply exclude the column names I don't want: I don't know in advance what all of them will be. There will also be columns in my postgres database for which the data is not scraped every time.
Hence, when I try and upload the resulting dataframe to postgres, I get the error:
psycopg2.errors.UndefinedColumn: column "column4" of relation "my_db" does not exist
This leads to my question:
How do I subset the resulting pandas dataframe using the column names I have stored in the vector, given some of the columns may not exist in the dataframe? I have tried my_dt = my_dt[colnames], which returns the error:
KeyError: ['column1', 'column2', 'column3'] not in index
Reproducible example:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
subset_columns = ['column1', 'column2', 'column3', 'column5']
test = df[subset_columns]
Any help will be appreciated.
You can simply do:
colnames = ['column1', 'column2', 'column3', 'column5']
df[df.columns.intersection(colnames)]
(df.columns & colnames also works on older pandas, but set operations on an Index via & are deprecated in newer versions, so intersection() is the supported spelling.)
I managed to find a fix, though I still don't understand why the initial KeyError listed the whole vector rather than just the elements which weren't columns of my dataframe:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
subset_columns = ['column1', 'column2', 'column3', 'column5']
column_match = set(subset_columns) & set(df.columns)
df = df[column_match]
Out[69]:
column2 column1 column3
0 2 1 3
1 6 5 7
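An alternative that avoids the set (and so keeps the requested column order, unlike the output above) is DataFrame.filter, which silently ignores labels that don't exist:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
subset_columns = ['column1', 'column2', 'column3', 'column5']

# filter(items=...) keeps only the labels that actually exist,
# in the order they appear in the items list
subset = df.filter(items=subset_columns)
print(subset.columns.tolist())  # ['column1', 'column2', 'column3']
```

This also explains the arbitrary column order in the set-based fix: Python sets are unordered, whereas filter() preserves the order of the list you pass in.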

extract second row values in spark data frame

I have a Spark dataframe for a table (1000000x4), sorted by the second column.
I need to get two values from the second row: column 0 and column 3.
How can I do it?
If you just need the values it's pretty simple: just use the DataFrame's internal RDD. You didn't specify the language, so I will take the liberty of showing you how to achieve this in Python.
df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])

requiredRows = [0, 2]
data = (df.rdd.zipWithIndex()
          .filter(lambda pair: pair[1] in requiredRows)  # pair is (Row, index)
          .collect())
And now you can manipulate the variables inside the data list. By the way, I left the index inside each tuple to give you another idea of how the pairing works.
print(data)
#[(Row(name='Bonsanto', age=20, balance=2000.0), 0),
# (Row(name='Mises', age=60, balance=1000.0), 2)]
