Concatenate columns based on certain group - python-3.x

I have a dataframe df1 like this:
   Schema   table Name  temp
0  schema1  table1      col1 INT(1,2) NOT NULL
1  schema1  table1      col2 INT(3,2) NOT NULL
2  schema1  table1      col3 SMALLINT(6,2) NULL
3  schema1  table1      col4 SMALLINT(9,2) NULL
4  schema2  table2      col6 CHAR(20,2) NULL
5  schema2  table2      col7 CHAR(20,4) NULL
6  schema2  table2      col8 CHAR(6,5) NULL
7  schema2  table2      col9 CHAR(6,3) NULL
In this dataframe I have two different schemas and tables (table1 and table2). I want to build a create table statement out of this.
So, from the above dataframe I need a new dataframe with 2 rows (since there are 2 different tables in df1), and the values would be:
df2:
ddl_statement
0 create table schema1.table1 (col1 INT(1,2) NOT NULL, col2 INT(3,2) NOT NULL, col3 SMALLINT(6,2) NULL, col4 SMALLINT(9,2) NULL)
1 create table schema2.table2 (col6 CHAR(20,2) NULL, col7 CHAR(20,4) NULL, col8 CHAR(6,5) NULL, col9 CHAR(6,3) NULL)
How can I achieve this without using a loop?

Use groupby and f-strings:
df2 = df1.groupby(['Schema', 'table Name'])['temp'] \
         .apply(lambda x: f"create table {x.name[0]}.{x.name[1]} ({', '.join(x)})") \
         .reset_index(drop=True).to_frame('ddl_statement')
Output:
>>> df2
ddl_statement
0  create table schema1.table1 (col1 INT(1,2) NOT NULL, col2 INT(3,2) NOT NULL, col3 SMALLINT(6,2) NULL, col4 SMALLINT(9,2) NULL)
1  create table schema2.table2 (col6 CHAR(20,2) NULL, col7 CHAR(20,4) NULL, col8 CHAR(6,5) NULL, col9 CHAR(6,3) NULL)
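For reference, here is a minimal, self-contained sketch of the approach above (the column names 'Schema', 'table Name' and 'temp' are taken from the sample dataframe in the question and may need adjusting to your real data):
import pandas as pd

df1 = pd.DataFrame({
    'Schema': ['schema1'] * 4 + ['schema2'] * 4,
    'table Name': ['table1'] * 4 + ['table2'] * 4,
    'temp': ['col1 INT(1,2) NOT NULL', 'col2 INT(3,2) NOT NULL',
             'col3 SMALLINT(6,2) NULL', 'col4 SMALLINT(9,2) NULL',
             'col6 CHAR(20,2) NULL', 'col7 CHAR(20,4) NULL',
             'col8 CHAR(6,5) NULL', 'col9 CHAR(6,3) NULL'],
})

# Each group's name is the (schema, table) tuple, and the grouped 'temp'
# values are joined into the column list of the create table statement
df2 = (df1.groupby(['Schema', 'table Name'])['temp']
          .apply(lambda x: f"create table {x.name[0]}.{x.name[1]} ({', '.join(x)})")
          .reset_index(drop=True)
          .to_frame('ddl_statement'))
print(df2.loc[0, 'ddl_statement'])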

Related

Split corresponding column values in pyspark

The table below would be the input dataframe:
col1  col2      col3
1     12;34;56  Aus;SL;NZ
2     31;54;81  Ind;US;UK
3     null      Ban
4     Ned       null
Expected output dataframe (the values of col2 and col3 should be split on ; in corresponding positions):
col1  col2  col3
1     12    Aus
1     34    SL
1     56    NZ
2     31    Ind
2     54    US
2     81    UK
3     null  Ban
4     Ned   null
You can use the pyspark function split() to convert the column with multiple values into an array and then the function explode() to make multiple rows out of the different values.
It may look like this:
df = df.withColumn("<columnName>", explode(split(df.<columnName>, ";")))
If you want to keep NULL values you can use explode_outer().
If you want the values of multiple exploded arrays to stay matched up row by row, you can work with posexplode() and then filter() to keep only the rows where the positions correspond.
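To illustrate the difference between explode() and explode_outer(), here is a minimal sketch (it assumes an existing SparkSession named spark and a small hypothetical two-column frame):
from pyspark.sql.functions import explode_outer, split

demo = spark.createDataFrame([(1, "12;34;56"), (3, None)], ["Id", "Score"])

# explode_outer keeps the Id=3 row as a single row with a null Score;
# a plain explode() would drop it, because split(NULL, ';') is NULL
demo.withColumn("Score", explode_outer(split("Score", ";"))).show()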
The code below works fine:
from pyspark.sql.functions import posexplode_outer, split

data = [(1, '12;34;56', 'Aus;SL;NZ'),
        (2, '31;54;81', 'Ind;US;UK'),
        (3, None, 'Ban'),
        (4, 'Ned', None)]
columns = ['Id', 'Score', 'Countries']
df = spark.createDataFrame(data, columns)

# Explode each delimited column together with its position
df2 = df.select("*", posexplode_outer(split("Countries", ";")).alias("pos1", "value1"))
df3 = df2.select("*", posexplode_outer(split("Score", ";")).alias("pos2", "value2"))

# Keep only the rows where the positions line up (or where one side is null)
df4 = df3.filter((df3.pos1 == df3.pos2) | (df3.pos1.isNull() | df3.pos2.isNull()))
df4 = df4.select("Id", "value2", "value1")
df4.show()  # Final output

How to transform one row into multiple columns in ADF?

I have a table as a source in Azure Data Factory with just 1 row and 336 columns; simplified to 9 columns, it looks like this:
1       2       3       4       5       6       7       8       9
value1  value2  value3  value4  value5  value6  value7  value8  value9
And I want to combine every 3 columns into the first 3:
1       2       3
value1  value2  value3
value4  value5  value6
value7  value8  value9
What is the alternative to using Select on every 3 columns and then Join, as that is a long process with this many columns?
If your data source is Azure SQL DB, you could use conventional SQL to transform the row with a combination of UNPIVOT, PIVOT and some of the ranking functions to help group the data. A simple example:
DROP TABLE IF EXISTS #tmp;
CREATE TABLE #tmp (
col1 VARCHAR(10),
col2 VARCHAR(10),
col3 VARCHAR(10),
col4 VARCHAR(10),
col5 VARCHAR(10),
col6 VARCHAR(10),
col7 VARCHAR(10),
col8 VARCHAR(10),
col9 VARCHAR(10)
);
INSERT INTO #tmp
VALUES ( 'value1', 'value2', 'value3', 'value4', 'value5', 'value6', 'value7', 'value8', 'value9' )
-- Unpivot the 9 columns into 9 rows, bucket them into groups of 3 with NTILE,
-- then pivot each group back out into 3 columns
SELECT [1], [2], [0] AS [3]
FROM
(
    SELECT
        NTILE(3) OVER( ORDER BY ( SELECT NULL ) ) nt,
        ROW_NUMBER() OVER( ORDER BY ( SELECT NULL ) ) % 3 groupNumber,
        newCol
    FROM #tmp
    UNPIVOT ( newCol FOR sourceCol IN ( col1, col2, col3, col4, col5, col6, col7, col8, col9 ) ) uvpt
) x
PIVOT ( MAX(newCol) FOR groupNumber IN ( [1], [2], [0] ) ) pvt;
Tweak the NTILE value depending on the number of columns you have - it should be the total number of columns divided by 3. For example, if you have 300 columns the NTILE value should be 100; if you have 336 columns it should be 112. A bigger example with 336 columns is available here.
Present the data to Azure Data Factory (ADF) either as a view or use the Query option in the Copy activity for example.
If you are using Azure Synapse Analytics then another fun way to approach this would be using Synapse Notebooks. With just three lines of code, you can get the table from the dedicated SQL pool, unpivot all 336 columns using the stack function and write it back to the database. This simple example is in Scala:
val df = spark.read.synapsesql("someDb.dbo.pivotWorking")
val df2 = df.select( expr("stack(112, *)"))
// Write it back
df2.write.synapsesql("someDb.dbo.pivotWorking_after", Constants.INTERNAL)
I have to admire the simplicity of it.
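The same stack trick works from PySpark as well. Here is a minimal local sketch using the 9-column example from the question (it assumes an existing SparkSession named spark and hypothetical column names col1..col9, and leaves out the Synapse read/write):
wide = spark.createDataFrame(
    [tuple(f"value{i}" for i in range(1, 10))],
    [f"col{i}" for i in range(1, 10)])

# stack(3, *) folds the 9 columns into 3 rows of 3 columns (9 / 3 = 3);
# with 336 columns the first argument would be 112
narrow = wide.selectExpr("stack(3, *)").toDF("c1", "c2", "c3")
narrow.show()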

Identify the relationship between two columns and its respective value count in pandas

I have a dataframe as below:
Col1 Col2 Col3 Col4
1 111 a Test
2 111 b Test
3 111 c Test
4 222 d Prod
5 333 e Prod
6 333 f Prod
7 444 g Test
8 555 h Prod
9 555 i Prod
Expected output :
Column 1 Column 2 Relationship Count
Col2 Col3 One-to-One 2
Col2 Col3 One-to-Many 3
Explanation:
I need to identify the relationship between Col2 & Col3 and also the value counts.
For example, 111 (Col2) is repeated 3 times and has 3 different respective values a, b, c in Col3.
This means Col2 and Col3 have a one-to-many relationship - count_1: 1
222 (Col2) is not repeated and has only one respective value d in Col3.
This means Col2 and Col3 have a one-to-one relationship - count_2: 1
333 (Col2) is repeated twice and has 2 different respective values e, f in Col3.
This means Col2 and Col3 have a one-to-many relationship - count_1: 1+1 (increment this count for every one-to-many relationship)
Similarly, for the other column values, increment the respective counter and display the final result as the expected dataframe.
If you only need to check the relationship between col2 and col3, you can do:
(
    df.groupby(by='Col2').Col3
      .apply(lambda x: 'One-to-One' if len(x) == 1 else 'One-to-Many')
      .to_frame('Relationship')
      .groupby('Relationship').Relationship
      .count().to_frame('Count').reset_index()
      .assign(**{'Column 1': 'Col2', 'Column 2': 'Col3'})
      .reindex(columns=['Column 1', 'Column 2', 'Relationship', 'Count'])
)
Output:
Column 1 Column 2 Relationship Count
0 Col2 Col3 One-to-Many 3
1 Col2 Col3 One-to-One 2
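To make the snippet reproducible, here is a minimal sketch that builds the sample frame from the question and applies the same idea (the variable names here are my own):
import pandas as pd

df = pd.DataFrame({
    'Col1': range(1, 10),
    'Col2': [111, 111, 111, 222, 333, 333, 444, 555, 555],
    'Col3': list('abcdefghi'),
    'Col4': ['Test', 'Test', 'Test', 'Prod', 'Prod', 'Prod', 'Test', 'Prod', 'Prod'],
})

# Classify each Col2 key by how many Col3 values it maps to, then count the labels
relationship_counts = (
    df.groupby('Col2')['Col3']
      .apply(lambda g: 'One-to-One' if len(g) == 1 else 'One-to-Many')
      .value_counts()
)
print(relationship_counts)  # One-to-Many: 3, One-to-One: 2
Adding the constant 'Column 1'/'Column 2' columns on top of that works exactly as in the answer above.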

Filter a dataframe with NOT and AND condition

I know this question has been asked multiple times, but for some reason it is not working for my case.
So I want to filter the dataframe using the NOT and AND condition.
For example, my dataframe df looks like:
col1 col2
a 1
a 2
b 3
b 4
b 5
c 6
Now, I want to use a condition to remove rows where col1 is "a" AND col2 is 2.
My resulting dataframe should look like:
col1 col2
a 1
b 3
b 4
b 5
c 6
I tried the following, but even though I used &, it removes all the rows which have "a" in col1:
df = df[(df['col1'] != "a") & (df['col2'] != "2")]
To remove rows where col1 is "a" AND col2 is 2 means to keep rows where col1 isn't "a" OR col2 isn't 2 (the negation of A AND B is NOT(A) OR NOT(B)):
df = df[(df['col1'] != "a") | (df['col2'] != 2)] # or "2", depending on whether the `2` is an int or a str
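An equivalent way to express the same filter, if the direct negation reads more naturally to you, is to negate the whole condition with ~ (a small sketch using the sample frame from the question):
import pandas as pd

df = pd.DataFrame({'col1': list('aabbbc'), 'col2': [1, 2, 3, 4, 5, 6]})

# Keep every row except the ones where col1 is "a" AND col2 is 2
df = df[~((df['col1'] == 'a') & (df['col2'] == 2))]
print(df)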

Proper way to update pandas dataframe with values from another

What is the proper way to update multiple columns in one dataframe with values from another dataframe?
Say I have these two dataframes:
import pandas as pd
df1 = pd.DataFrame([['4', 'val1', 'val2.4', 'val3.4'],
                    ['5', 'val1', 'val2.5', 'val3.5'],
                    ['6', 'val1', 'val2.6', 'val3.6'],
                    ['7', 'val1', 'val2.7', 'val3.7']],
                   columns=['account_id', 'field1', 'field2', 'field3'])
df2 = pd.DataFrame([['6', 'VAL2.6', 'VAL3.6'],
                    ['5', 'VAL2.5', 'VAL3.5']],
                   columns=['account_id', 'field2', 'field3'])
Of note, df2 has only a subset of df1's rows (in some random order) and columns.
I'd like to replace values in df1 with values from df2 (where they exist, joining on account_id, like an SQL UPDATE).
One solution is something like
cols_to_update = ['field2', 'field3']
df1.loc[df1.account_id.isin(df2.account_id), cols_to_update] = df2[cols_to_update].values
But that doesn't handle the join and results in
account_id field1 field2 field3
0 4 val1 val2.4 val3.4
1 5 val1 VAL2.6 VAL3.6
2 6 val1 VAL2.5 VAL3.5
3 7 val1 val2.7 val3.7
where account_id 6 now has the wrong values.
My questions are:
How do I use indexes to make something like that work?
Is there a merge() or join() solution that isn't so tedious with combining duplicate columns?
Sort the values of df2 by account_id before assigning, i.e.
cols_to_update = ['field2', 'field3']
df1.loc[df1.account_id.isin(df2.account_id), cols_to_update] = df2.sort_values(['account_id'])[cols_to_update].values
account_id field1 field2 field3
0 4 val1 val2.4 val3.4
1 5 val1 VAL2.5 VAL3.5
2 6 val1 VAL2.6 VAL3.6
3 7 val1 val2.7 val3.7
I would suggest you use pandas' DataFrame.update function:
df = pd.DataFrame({'A': [1, 2, 3],'B': [400, 500, 600]})
new_df = pd.DataFrame({'B': [4, 5, 6],'C': [7, 8, 9]})
df.update(new_df)
df
A B
0 1 4
1 2 5
2 3 6
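Note that update() aligns on the index (and column names), not on account_id, so for the frames in the question you would key both sides by account_id first. A minimal sketch of that, reusing the df1/df2 defined above:
# update() aligns on the index, so key both frames by account_id first
df1_keyed = df1.set_index('account_id')
df1_keyed.update(df2.set_index('account_id'))  # in-place; only matching rows/columns change
df1 = df1_keyed.reset_index()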
