Databricks Deduplication PySpark Code Removing All Rows in Table - apache-spark

I previously got help with a question about removing duplicate rows in Databricks with PySpark.
The dataframe I run the code against is as follows:
Id | createdon | SinkModifiedOn | title | caltype
e64650d3-94fb-ec11-82e6-0022481b14fc | 04/07/2022 | 16/01/2023 14:37:14 | Partner | Enterprise
8b97aa35-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
bd97aa35-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise
42b5f518-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise
8ab2abf9-6169-ec11-8943-000d3a870a5b | 30/12/2021 | 16/01/2023 14:37:14 | Partner | Basic
2d51010f-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
40e1feb7-efa5-e811-a96b-00224800cdc2 | 22/08/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
d3320875-1e81-e811-a95c-00224800cc97 | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise
1ea0055a-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
81cf613f-1e81-e811-a95c-00224800cc97 | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise
fb50010f-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
af4a3b88-1d81-e811-a95c-00224800c4f1 | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
0551010f-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
8c20f753-5a78-e811-a95b-00224800c3e8 | 25/06/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
e597aa35-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
0adad377-b039-e911-a999-00224800c9f4 | 26/02/2019 | 16/01/2023 14:37:14 | Partner | Enterprise
3315cd94-1d81-e811-a95c-00224800c4f1 | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
151ca1a8-a586-e811-a95c-00224800c4f1 | 13/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
55ce798d-2081-e811-a95c-00224800cc97 | 06/07/2018 | 17/01/2023 18:08:12 | Partner | Enterprise
55ce798d-2081-e811-a95c-00224800cc97 | 06/07/2018 | 16/01/2023 13:37:09 | Partner | Enterprise
6fa9fe08-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
8197aa35-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
4151010f-1d81-e811-a95c-00224800c9ff | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
74cf613f-1e81-e811-a95c-00224800cc97 | 06/07/2018 | 16/01/2023 14:37:14 | Partner | Enterprise
The PySpark code is as follows:
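from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
# Number the rows within each (Id, CreatedOn) group, ordered by SinkModifiedOn.
w = Window.partitionBy("Id", "CreatedOn").orderBy("SinkModifiedOn")
df2 = partdf.withColumn("rn", row_number().over(w))
# Keep only the first row of each group and drop the helper column.
df3 = df2.filter("rn = 1").drop("rn")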
After executing the code, no rows at all are returned.
Can someone take a look at the code and let me know why no rows are returned after deduping?
My guess is that it has something to do with the SinkModifiedOn field.

Use dropDuplicates to remove duplicate rows:
df = partdf.dropDuplicates(subset=['Id', 'createdon', 'SinkModifiedOn', 'caltype'])
Or, in this specific case, simply:
partdf.drop_duplicates()
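Note that the two rows sharing Id 55ce798d-2081-e811-a95c-00224800cc97 differ in SinkModifiedOn, so a dropDuplicates call that includes that column would keep both of them. If the goal is one row per (Id, createdon), one option is to keep the row with the newest SinkModifiedOn. The sketch below shows that logic in plain Python rather than Spark, so it can be run anywhere; the helper names parse_ts and keep_latest are made up for illustration. It assumes the timestamps are day-first strings, as in the table. In PySpark the equivalent would likely be to parse the column with to_timestamp and the pattern "dd/MM/yyyy HH:mm:ss" before ordering the window in descending order.

```python
from datetime import datetime

# Sample rows taken from the question's table: (Id, createdon, SinkModifiedOn).
rows = [
    ("55ce798d-2081-e811-a95c-00224800cc97", "06/07/2018", "17/01/2023 18:08:12"),
    ("55ce798d-2081-e811-a95c-00224800cc97", "06/07/2018", "16/01/2023 13:37:09"),
    ("74cf613f-1e81-e811-a95c-00224800cc97", "06/07/2018", "16/01/2023 14:37:14"),
]


def parse_ts(text):
    # The timestamps are day-first strings, so parse them before comparing;
    # comparing the raw strings would not order them chronologically in general.
    return datetime.strptime(text, "%d/%m/%Y %H:%M:%S")


def keep_latest(rows):
    """Keep one row per (Id, createdon): the one with the newest SinkModifiedOn."""
    latest = {}
    for row in rows:
        key = (row[0], row[1])
        if key not in latest or parse_ts(row[2]) > parse_ts(latest[key][2]):
            latest[key] = row
    return list(latest.values())


deduped = keep_latest(rows)
for row in deduped:
    print(row)
```

Sorting on parsed timestamps rather than on the raw strings matters because a day-first string such as "02/02/2023" sorts before "16/01/2023" even though it is later.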

Related

AADSTS50042: The salt required to generate a pairwise identifier is missing in the principal

I created a web application to process data inside an Office 365 organization. The application is registered in the Azure AD portal. I use the OAuth 2.0 authorization code flow to request tokens from Microsoft. It works fine with the default Office 365 deployment. However, I can't obtain tokens when working with the German Office 365 deployment.
When I execute the token request, I get the following error:
{
"error": "server_error",
"error_description": "AADSTS50042: The salt required to generate a pairwise identifier is missing in the principal.\r\nTrace ID: 2700e09a-b4a8-4081-b1b5-3e598ad01800\r\nCorrelation ID: 2ff3954f-066f-41e4-b72a-7039aca83497\r\nTimestamp: 2021-02-12 13:40:21Z",
"error_codes": [
50042
],
"timestamp": "2021-02-12 13:40:21Z",
"trace_id": "2700e09a-b4a8-4081-b1b5-3e598ad01800",
"correlation_id": "2ff3954f-066f-41e4-b72a-7039aca83497",
"error_uri": "https://login.microsoftonline.de/error?code=50042"
}
What is the reason for this behavior? How can it be fixed?

Set Signing Options of Enterprise Application using Graph API

Does anyone know how I can set the 'Signing Option' of an Azure enterprise app's single sign-on to 'Sign SAML response and assertion' using the Graph API?
If you want to use the Microsoft Graph API to set the 'Signing Option' of an Azure enterprise app's single sign-on, follow these steps:
1. Get the enterprise app's token issuance policy:
GET https://graph.microsoft.com/v1.0/servicePrincipals/<the enterprise app object id>/tokenIssuancePolicies
2. Update the token issuance policy:
PATCH https://graph.microsoft.com/v1.0/policies/tokenIssuancePolicies/{id}
Content-type: application/json
{
"definition":["{\r\n \"TokenIssuancePolicy\": {\r\n \"Version\": 1,\r\n \"SigningAlgorithm\": \"http://www.w3.org/2001/04/xmldsig-more#rsa-sha256\",\r\n \"TokenResponseSigningPolicy\": \"ResponseAndToken\",\r\n \"SamlTokenVersion\": \"2.0\"\r\n }\r\n}"]
}
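The "definition" value in the request body above is a JSON document serialized into a string and then placed in an array, which is easy to get wrong when escaping by hand. As a minimal sketch, the body could be built with Python's json module instead; the function name build_policy_patch_body is hypothetical, and the policy fields are copied from the body shown above.

```python
import json


def build_policy_patch_body(signing_policy="ResponseAndToken"):
    """Build the PATCH body for a token issuance policy (illustrative helper)."""
    policy = {
        "TokenIssuancePolicy": {
            "Version": 1,
            "SigningAlgorithm": "http://www.w3.org/2001/04/xmldsig-more#rsa-sha256",
            "TokenResponseSigningPolicy": signing_policy,
            "SamlTokenVersion": "2.0",
        }
    }
    # "definition" is a list whose single element is the policy as a JSON string,
    # so the policy is serialized once on its own and once as part of the body.
    return json.dumps({"definition": [json.dumps(policy)]})


body = build_policy_patch_body()
print(body)
```

Serializing twice lets the library handle the escaped quotes that appear in the hand-written body.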
(Screenshots of the original setting and the new setting.)
For more details, please refer to here and here

Token-based authentication support for Azure SQL DB using Azure AD auth

According to this page:
https://learn.microsoft.com/en-us/archive/blogs/sqlsecurity/token-based-authentication-support-for-azure-sql-db-using-azure-ad-auth
AAD token-based authentication for Azure SQL DB is supported only if the client runs on Windows.
Can macOS and Linux clients use AAD token-based authentication to access Azure SQL DB?
https://github.com/mkleehammer/pyodbc/issues/228
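import struct
import pyodbc

# `context` is presumably an ADAL AuthenticationContext created earlier (not shown in the question).
token = context.acquire_token_with_client_credentials(
    database_url,
    azure_client_id,
    azure_client_secret
)
print(token)
tokenb = bytes(token["accessToken"], "UTF-8")
# Expand each token byte to two bytes: the byte itself followed by a zero byte.
exptoken = b''
for i in tokenb:
    exptoken += bytes([i])
    exptoken += bytes(1)
# Prefix the expanded token with its length as a 4-byte integer.
tokenstruct = struct.pack("=i", len(exptoken)) + exptoken
SQL_COPT_SS_ACCESS_TOKEN = 1256  # connection attribute used to pass the access token
CONNSTRING = "DRIVER={};SERVER={};DATABASE={}".format("ODBC Driver 17 for SQL Server", prod_server, prod_db)
db_connector = pyodbc.connect(CONNSTRING, attrs_before={SQL_COPT_SS_ACCESS_TOKEN: tokenstruct})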
This is the Python code I run on macOS.
I keep getting this error:
pyodbc.InterfaceError: ('28000', "[28000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Login failed for user ''. (18456) (SQLDriverConnect)")
Does anyone have an idea?
It seems that you have not added your application's service principal to your Azure SQL database.
What you need to do is:
1. Enable AAD authentication for your Azure SQL Server. Please select an AAD user in this step.
2. Connect to your Azure SQL Database with the user account you set in step 1.
3. Add your application service principal to your SQL Server, and assign an appropriate role to it.
CREATE USER [Azure_AD_principal_name] FROM EXTERNAL PROVIDER;
EXEC sp_addrolemember 'db_owner', 'Azure_AD_principal_name';
Here, the Azure_AD_principal_name should be the application's name.
4. Connect to your Azure SQL Database with AAD
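For step 4, the access token still has to be handed to the ODBC driver in the packed layout that the question's loop builds by hand. As a minimal sketch, the same bytes can be produced with a UTF-16-LE encode; the function name pack_access_token is hypothetical, and the equivalence with the question's loop assumes the token contains only ASCII characters.

```python
import struct


def pack_access_token(access_token: str) -> bytes:
    """Pack a token as a 4-byte length prefix followed by the expanded token bytes."""
    # For ASCII input, UTF-16-LE yields each byte followed by a zero byte,
    # which is what the question's byte-by-byte loop produces.
    expanded = access_token.encode("utf-16-le")
    # Same length prefix as struct.pack("=i", len(exptoken)) in the question.
    return struct.pack("=i", len(expanded)) + expanded


packed = pack_access_token("abc")
print(packed)
```

The result would be passed as the value of SQL_COPT_SS_ACCESS_TOKEN in attrs_before, exactly as tokenstruct is in the question.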

How to make Google Apps iDP for Office 365

I am trying to let our Google Apps users sign in to Office 365 with their Google credentials.
I am struggling with two things.
1. Setting up a federated domain with Azure AD. Can anyone match the required variables to the Google IdP metadata below?
Below are the variables Microsoft's help pages use to set up a federated domain.
$dom = "contoso.com"
$BrandName = "Sample SAML 2.0 IDP"
$LogOnUrl = "https://WS2012R2-0.contoso.com/passiveLogon"
$LogOffUrl = "https://WS2012R2-0.contoso.com/passiveLogOff"
$ecpUrl = "https://WS2012R2-0.contoso.com/PAOS"
$MyURI = "urn:uri:MySamlp2IDP"
$MySigningCert = #" MIIC7jCCAdag......NsLlnPQcX3dDg9A==" "#
$uri = "http://WS2012R2-0.contoso.com/adfs/services/trust"
$Protocol = "SAMLP"
Set-MsolDomainAuthentication `
  -DomainName $dom `
  -FederationBrandName $dom `
  -Authentication Federated `
  -PassiveLogOnUri $MyURI `
  -ActiveLogOnUri $ecpUrl `
  -SigningCertificate $MySigningCert `
  -IssuerUri $uri `
  -LogOffUri $url `
  -PreferredAuthenticationProtocol $Protocol
This is the Google IdP metadata, which is supposed to contain all the required information:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata" entityID="https://accounts.google.com/o/saml2?idpid=C01gs" validUntil="2021-08-31T11:57:42.000Z">
<md:IDPSSODescriptor WantAuthnRequestsSigned="false" protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
<md:KeyDescriptor use="signing">
<ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<ds:X509Data>
<ds:X509Certificate>MIIDdDCCAlygAwIBAgI
MTE1NzQyWhcNMjEwODM.....yVlPqeevZ6Ij
f7LcIuZHffg1JV6pOB3A7afVp7JBbzZZOeuhl5nUhr96</ds:X509Certificate>
</ds:X509Data>
</ds:KeyInfo>
</md:KeyDescriptor>
<md:NameIDFormat>urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress</md:NameIDFormat>
<md:SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect" Location="https://accounts.google.com/o/saml2/idp?idpid=C02gs"/>
<md:SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST" Location="https://accounts.google.com/o/saml2/idp?idpid=C03gs"/>
</md:IDPSSODescriptor>
</md:EntityDescriptor>
2. Adding users after federating the domain. (The federation command succeeded, but sign-in didn't work, so the variables I gave PowerShell must have been wrong.) The Office 365 admin portal does NOT allow adding users from a federated domain. So how do I add users?
Hope someone can help me with this puzzle.
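As a rough sketch of how the metadata fields might be matched to the PowerShell variables, the snippet below pulls the candidate values out of SAML metadata with Python's standard XML parser. The mapping (entityID to the issuer URI, SingleSignOnService Location to the passive log-on URI, X509Certificate to the signing certificate) is an assumption, not something confirmed in this thread. The function name extract_idp_settings, the dictionary keys, the idpid value and the certificate text are placeholders.

```python
import xml.etree.ElementTree as ET

NS = {
    "md": "urn:oasis:names:tc:SAML:2.0:metadata",
    "ds": "http://www.w3.org/2000/09/xmldsig#",
}

# Shortened stand-in for the metadata above; idpid and certificate are placeholders.
SAMPLE = """<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
    entityID="https://accounts.google.com/o/saml2?idpid=EXAMPLE">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:KeyDescriptor use="signing">
      <ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
        <ds:X509Data>
          <ds:X509Certificate>PLACEHOLDER
CERT</ds:X509Certificate>
        </ds:X509Data>
      </ds:KeyInfo>
    </md:KeyDescriptor>
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
        Location="https://accounts.google.com/o/saml2/idp?idpid=EXAMPLE"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>"""


def extract_idp_settings(metadata_xml):
    """Return candidate federation settings found in SAML IdP metadata."""
    root = ET.fromstring(metadata_xml)
    # Index the sign-on endpoints by the last segment of their binding URN.
    sso = {
        el.get("Binding").rsplit(":", 1)[-1]: el.get("Location")
        for el in root.iterfind(".//md:SingleSignOnService", NS)
    }
    cert = root.find(".//ds:X509Certificate", NS)
    return {
        "issuer_uri": root.get("entityID"),
        "passive_logon_uri": sso.get("HTTP-Redirect") or sso.get("HTTP-POST"),
        # Certificates are often wrapped across lines; join them into one string.
        "signing_certificate": "".join(cert.text.split()) if cert is not None else None,
    }


settings = extract_idp_settings(SAMPLE)
print(settings)
```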
Google now provides documentation on setting up SSO with Office 365.
Corresponding documentation from Microsoft on adding Google as an IdP is also available; however, I found it of no use and kept getting failed sign-ins after following Microsoft's instructions (SSO didn't work).
Instead, I followed instructions from James Winegar, which contain PowerShell commands (runnable in the Cloud Shell provided in Microsoft's admin portal, or locally) that let me get federation up and running in an hour or so, including time for user sync to start from Google's end.
This support from Microsoft seems relatively new—obviously it's less than 4 years old, based on the age of this question. But at least it now exists. By the time anyone reads this answer, there might be better tutorials available.
Microsoft Azure AD (the identity system behind Office 365) only supports federation with a handful of identity federation providers:
https://azure.microsoft.com/en-us/documentation/articles/active-directory-aadconnect-federation-compatibility/
Google Apps is not in this list and is not a supported federation system. You cannot use Google Apps users to sign in to Office 365 with their Google Apps credentials.
What you can do, however, is allow your Office 365 users to single sign-on to their Google Apps (which is the other way around):
https://azure.microsoft.com/en-us/documentation/articles/active-directory-saas-google-apps-tutorial/
EDIT
Following the provided link, it seems that it is possible to use Google Apps for Work IDs for SSO with Office 365, but only with Windows Server Active Directory and ADFS as an intermediary, which is quite an overhead.

How to manage customer's usage-based subscription programmatically?

Let me first describe a "manual" scenario. I log in to Partner Center as a partner and go to the customer list (https://partnercenter.microsoft.com/en-us/pcv/customers/list). For any customer, it is possible to manage all of its usage-based subscriptions in the Azure portal using the All resources (Azure portal) link.
In particular, I can add a co-admin to a subscription (i.e. add a user with the Owner role).
How can this management of customers' subscriptions be automated?
My efforts: I have some experience with the CREST API and the RBAC API. The docs describe this limitation of an Azure Active Directory (AAD) application:
You can only grant access to resource in your subscription for applications in the same directory as your subscription.
Because each customer's subscription exists in a separate customer AAD, it seems the RBAC API can't help:
1. It requires an AAD application-based token (i.e. based on TenantId, ClientId, ClientSecret), and there is no way to programmatically create an AAD application in the customer's directory.
2. An application located in the partner's AAD can't get access to the customer's subscription.
Is there any way to programmatically add an admin/co-admin/owner to a customer's subscription?
With Patrick Liang's help on the MSDN forums, I finally came up with a solution: enable the Pre-consent feature for a partner's AAD app to grant it access to customers' subscriptions. Let me describe it:
1. Partner Center Explorer project
https://github.com/Microsoft/Partner-Center-Explorer/
It's a web application similar to partnercenter.microsoft.com and a good example of how to use various Microsoft APIs. Most importantly, this project is a complete example of accessing a customer's subscription from a partner AAD app. However, it requires user interaction (OAuth authentication to login.live.com as a partner), and I ran into some issues when I tried to avoid it. Below I describe how to connect to a customer's subscription with all credentials already in code.
2. Partner AAD app
Create a native AAD app instead of a web AAD app, but configure its "Permissions to other applications" the same way. Skip the steps that do not apply to a native app (for example, obtaining the client_secret and updating the manifest).
3. PowerShell script
The last step of configuring the app is to run this script:
Connect-MsolService
$g = Get-MsolGroup | ? {$_.DisplayName -eq 'AdminAgents'}
$s = Get-MsolServicePrincipal | ? {$_.AppPrincipalId -eq 'INSERT-CLIENT-ID-HERE'}
Add-MsolGroupMember -GroupObjectId $g.ObjectId -GroupMemberType ServicePrincipal -GroupMemberObjectId $s.ObjectId
Several modules must be installed to run these cmdlets. If you get an error while installing "Microsoft Online Services Sign-In Assistant for IT Professionals", try installing the BETA module:
Microsoft Online Services Sign-In Assistant for IT Professionals BETA
And you will probably need this one:
Microsoft Online Services Module for Windows PowerShell 64-bit
4. Code
Finally we are ready to authenticate and create a role assignment:
public async void AssignRoleAsync()
{
    var token = await GetTokenAsync();
    var response = await AssignRoleAsync(token.AccessToken);
}

public async Task<AuthenticationResult> GetTokenAsync()
{
    // Authenticate against the customer's tenant using the partner's credentials.
    var authContext = new AuthenticationContext($"https://login.windows.net/{CustomerId}");
    return await authContext.AcquireTokenAsync(
        "https://management.core.windows.net/",
        ApplicationId,
        new UserCredential(PartnerUserName, PartnerPassword));
}

public async Task<HttpResponseMessage> AssignRoleAsync(string accessToken)
{
    // Each role assignment is identified by a new GUID in the request URI.
    string newAssignmentId = Guid.NewGuid().ToString();
    string subSegment = $"subscriptions/{CustomerSubscriptionId}/providers/Microsoft.Authorization";
    string requestUri = $"https://management.azure.com/{subSegment}/roleAssignments/{newAssignmentId}?api-version=2015-07-01";
    string roleDefinitionId = "INSERT_ROLE_GUID_HERE";
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("Authorization", "Bearer " + accessToken);
        var body = new AssignRoleRequestBody();
        body.properties.principalId = UserToAssignId;
        body.properties.roleDefinitionId = $"/{subSegment}/roleDefinitions/{roleDefinitionId}";
        var httpRequest = HttpHelper.CreateJsonRequest(body, HttpMethod.Put, requestUri);
        return await client.SendAsync(httpRequest);
    }
}
To obtain role definition IDs, just make a request to get all roles per subscription scope.
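As an illustration of the two requests involved, the sketch below builds (but does not send) the role definitions listing URL and the role assignment PUT request using only Python's standard library. The URL layout, API version and body shape mirror the C# code above. The function names and the sample argument values are hypothetical, and acquiring the access token is out of scope here.

```python
import json
import uuid
from urllib.request import Request

BASE = "https://management.azure.com"
API_VERSION = "2015-07-01"  # same API version as the C# code above


def role_definitions_url(subscription_id):
    """URL that lists the role definitions available at subscription scope."""
    scope = f"subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    return f"{BASE}/{scope}/roleDefinitions?api-version={API_VERSION}"


def build_role_assignment_request(subscription_id, role_definition_id,
                                  principal_id, access_token):
    """Build the PUT request that creates a role assignment (not sent here)."""
    scope = f"subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    # Every assignment gets a fresh GUID as its name, as in the C# code.
    url = f"{BASE}/{scope}/roleAssignments/{uuid.uuid4()}?api-version={API_VERSION}"
    body = {
        "properties": {
            "roleDefinitionId": f"/{scope}/roleDefinitions/{role_definition_id}",
            "principalId": principal_id,
        }
    }
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


request = build_role_assignment_request(
    "SUBSCRIPTION_ID", "ROLE_GUID", "USER_OBJECT_ID", "ACCESS_TOKEN")
print(role_definitions_url("SUBSCRIPTION_ID"))
print(request.get_method(), request.full_url)
```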
Useful links:
MSDN: How to manage customer's usage-based subscription programmatically?
MSDN: When will auto-stamping/implicit consent be available for CREST customers?
Managing Role-Based Access Control with the REST API
