I have a TABLE as a source in Azure Data Factory with 336 columns and just 1 row, like this:
1       2       3       4       5       6       7       8       9
value1  value2  value3  value4  value5  value6  value7  value8  value9
And I want to combine every 3 columns into the first 3:
1       2       3
value1  value2  value3
value4  value5  value6
value7  value8  value9
What is the alternative to using Select on every 3 columns and then Join? That is a long process with this many columns.
If your data source is Azure SQL DB, you could use conventional SQL to transform the row with a combination of UNPIVOT, PIVOT and some of the ranking functions to help group the data. A simple example:
DROP TABLE IF EXISTS #tmp;
CREATE TABLE #tmp (
col1 VARCHAR(10),
col2 VARCHAR(10),
col3 VARCHAR(10),
col4 VARCHAR(10),
col5 VARCHAR(10),
col6 VARCHAR(10),
col7 VARCHAR(10),
col8 VARCHAR(10),
col9 VARCHAR(10)
);
INSERT INTO #tmp
VALUES ( 'value1', 'value2', 'value3', 'value4', 'value5', 'value6', 'value7', 'value8', 'value9' )
SELECT [1], [2], [0] AS [3]
FROM
(
    SELECT
        -- NTILE(3) assigns each unpivoted value to one of 3 output rows
        NTILE(3) OVER ( ORDER BY ( SELECT NULL ) ) AS nt,
        -- ROW_NUMBER() % 3 yields the target column: 1, 2, 0, 1, 2, 0, ...
        ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) ) % 3 AS groupNumber,
        newCol
    FROM #tmp
    UNPIVOT ( newCol FOR sourceCol IN ( col1, col2, col3, col4, col5, col6, col7, col8, col9 ) ) uvpt
) x
PIVOT ( MAX(newCol) FOR groupNumber IN ( [1], [2], [0] ) ) pvt;
Tweak the NTILE value depending on the number of columns you have - it should be the total number of columns divided by 3. For example, if you have 300 columns the NTILE value should be 100; if you have 336 columns it should be 112. A bigger example with 336 columns is available here.
Present the data to Azure Data Factory (ADF) either as a view, or via the Query option in the Copy activity, for example.
If you are using Azure Synapse Analytics, another fun way to approach this would be using Synapse Notebooks. With just three lines of code, you can read the table from the dedicated SQL pool, unpivot all 336 columns using the stack function, and write it back to the database. This simple example is in Scala:
val df = spark.read.synapsesql("someDb.dbo.pivotWorking")
val df2 = df.select( expr("stack(112, *)"))
// Write it back
df2.write.synapsesql("someDb.dbo.pivotWorking_after", Constants.INTERNAL)
I have to admire the simplicity of it.
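For comparison, here is a rough PySpark equivalent - a sketch only, assuming the same table names as above and that the Python synapsesql connector is available in your Synapse runtime:

from pyspark.sql.functions import expr

# stack(112, *) folds the 336 columns into 112 rows of 3 columns each;
# the first argument is always <total number of columns> / 3.
df = spark.read.synapsesql("someDb.dbo.pivotWorking")
df2 = df.select(expr("stack(112, *)"))
df2.write.synapsesql("someDb.dbo.pivotWorking_after")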
The table below would be the input dataframe:
col1  col2      col3
1     12;34;56  Aus;SL;NZ
2     31;54;81  Ind;US;UK
3     null      Ban
4     Ned       null
Expected output dataframe (values of col2 and col3 should be split by ; correspondingly):
col1  col2  col3
1     12    Aus
1     34    SL
1     56    NZ
2     31    Ind
2     54    US
2     81    UK
3     null  Ban
4     Ned   null
You can use the PySpark function split() to convert the column with multiple values into an array, and then the function explode() to make multiple rows out of the different values.
It may look like this (with split() and explode() imported from pyspark.sql.functions):
df = df.withColumn("<columnName>", explode(split(df.<columnName>, ";")))
If you want to keep NULL values you can use explode_outer().
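For example, a minimal sketch (the column name col2 is a placeholder):

from pyspark.sql.functions import explode_outer, split

# explode_outer keeps the row (with a null) when the source column is NULL
df = df.withColumn("col2", explode_outer(split(df.col2, ";")))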
If you want the values of multiple exploded arrays to match up row by row, you could work with posexplode() and then filter() to the rows where the positions correspond.
The code below works fine:
from pyspark.sql.functions import posexplode_outer, split

data = [(1, '12;34;56', 'Aus;SL;NZ'),
        (2, '31;54;81', 'Ind;US;UK'),
        (3, None, 'Ban'),
        (4, 'Ned', None)]
columns = ['Id', 'Score', 'Countries']
df = spark.createDataFrame(data, columns)

# Explode each delimited column together with its position.
df2 = df.select("*", posexplode_outer(split("Countries", ";")).alias("pos1", "value1"))
df3 = df2.select("*", posexplode_outer(split("Score", ";")).alias("pos2", "value2"))

# Keep only the rows where the positions match (or where one side is null).
df4 = df3.filter((df3.pos1 == df3.pos2) | (df3.pos1.isNull() | df3.pos2.isNull()))
df4 = df4.select("Id", "value2", "value1")
df4.show()  # final output
I have the following table:
col1  col2    col3    col4    col5
Key1  value1  value2  value3  value4
key2
I want to merge the rows so that it reads like a normal table:
col1  col2    col3    col4    col5
Key1  value1  value2  value3  value4
key2  value6  value7  value8  value9
How is that possible with pandas?
EDIT:
As mentioned in the comments, I have uploaded the dataset here.
The dataset is a set of multiple-choice questions and answers; some of the questions span multiple lines (rows), and I have to merge these rows first.
Thanks
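The exact input layout isn't shown, but here is a minimal pandas sketch assuming continuation rows leave col1 empty and carry the remaining values of the record started above them (the sample frame is made up to illustrate):

import pandas as pd

# Made-up input: key2's values arrive on a continuation row with no key.
df = pd.DataFrame(
    [["Key1", "value1", "value2", "value3", "value4"],
     ["key2", None, None, None, None],
     [None, "value6", "value7", "value8", "value9"]],
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# Forward-fill the key so continuation rows join their record, then take
# the first non-null value per column within each record.
merged = (df.assign(col1=df["col1"].ffill())
            .groupby("col1", as_index=False, sort=False)
            .first())
print(merged)

If the pieces of a record need to be concatenated rather than filled, aggregate the string columns with a join over the non-null values instead of taking .first().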
I have a dataframe df1 like this:
   Schema   table Name  temp
0  schema1  table1      col1 INT(1,2) NOT NULL
1  schema1  table1      col2 INT(3,2) NOT NULL
2  schema1  table1      col3 SMALLINT(6,2) NULL
3  schema1  table1      col4 SMALLINT(9,2) NULL
4  schema2  table2      col6 CHAR(20,2) NULL
5  schema2  table2      col7 CHAR(20,4) NULL
6  schema2  table2      col8 CHAR(6,5) NULL
7  schema2  table2      col9 CHAR(6,3) NULL
In this dataframe I have two different schemas and tables (table1 and table2). I want to build create table statements out of this.
So, from the above dataframe I need a new dataframe which will have 2 rows (since there are 2 different tables in df1), and the values would be:
df2:
ddl_statement
0 create table schema1.table1 (col1 INT(1,2) NOT NULL,col2 INT(3,2) NOT NULL,col3 SMALLINT(6,2) NULL,col4 SMALLINT(9,2) NULL)
1 create table schema2.table2 (col6 CHAR(20,2) NULL,col7 CHAR(20,4) NULL,col8 CHAR(6,5) NULL,col9 CHAR(6,3) NULL)
How can I achieve this without using a loop?
Use groupby and f-strings:
df2 = (df1.groupby(['Schema', 'table Name'])['temp']
          .apply(lambda x: f"create table {x.name[0]}.{x.name[1]} ({', '.join(x)})")
          .reset_index(drop=True)
          .to_frame('ddl_statement'))
Output:
>>> df2
   ddl_statement
0  create table schema1.table1 (col1 INT(1,2) NOT NULL, col2 INT(3,2) NOT NULL, col3 SMALLINT(6,2) NULL, col4 SMALLINT(9,2) NULL)
1  create table schema2.table2 (col6 CHAR(20,2) NULL, col7 CHAR(20,4) NULL, col8 CHAR(6,5) NULL, col9 CHAR(6,3) NULL)
I have a DataFrame as below:
Col1 Col2 Col3 Col4
1 111 a Test
2 111 b Test
3 111 c Test
4 222 d Prod
5 333 e Prod
6 333 f Prod
7 444 g Test
8 555 h Prod
9 555 i Prod
Expected output:
Column 1 Column 2 Relationship Count
Col2 Col3 One-to-One 2
Col2 Col3 One-to-Many 3
Explanation:
I need to identify the relationship between Col2 & Col3 and also the value counts.
For example, 111 (Col2) is repeated 3 times and has 3 different respective values a, b, c in Col3.
This means Col2 and Col3 have a one-to-many relationship - count_1: 1
222 (Col2) is not repeated and has only one respective value d in Col3.
This means Col2 and Col3 have a one-to-one relationship - count_2: 1
333 (Col2) is repeated twice and has 2 different respective values e, f in Col3.
This means Col2 and Col3 have a one-to-many relationship - count_1: 1 + 1 (increment this count for every one-to-many relationship)
Similarly, for the other column values, increment the respective counter and display the final results as the expected dataframe.
If you only need to check the relationship between col2 and col3, you can do:
(
df.groupby(by='Col2').Col3
.apply(lambda x: 'One-to-One' if len(x)==1 else 'One-to-Many')
.to_frame('Relationship')
.groupby('Relationship').Relationship
.count().to_frame('Count').reset_index()
.assign(**{'Column 1':'Col2', 'Column 2':'Col3'})
.reindex(columns=['Column 1', 'Column 2', 'Relationship', 'Count'])
)
Output:
Column 1 Column 2 Relationship Count
0 Col2 Col3 One-to-Many 3
1 Col2 Col3 One-to-One 2
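If you later need the same check for other column pairs, the logic above wraps naturally in a small helper (a sketch; the function name is made up):

def relationship_counts(df, a, b):
    # Classify each key of column `a` by how many rows it spans in `b`,
    # then count how often each relationship type occurs.
    rel = (df.groupby(a)[b]
             .apply(lambda x: 'One-to-One' if len(x) == 1 else 'One-to-Many'))
    return (rel.value_counts()
               .rename_axis('Relationship')
               .reset_index(name='Count')
               .assign(**{'Column 1': a, 'Column 2': b})
               [['Column 1', 'Column 2', 'Relationship', 'Count']])

relationship_counts(df, 'Col2', 'Col3')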
I want to remove duplicates from multiple cells of column 5, where the values are separated by the delimiter "|". The data I have looks like this:
Col1 Col2 Col3 Col4 Col5
1048563 93750984 5 0.499503476 HTR7|HTR7|HTR7
1048564 93751210 5 0.499503476 ABHD3|ABHD3|ABHD3|ABHD3|ABHD3|ABHD3
1048566 93751298 5 0.499503476 ADCYAP1|ADCYAP1|ADCYAP1|ADCYAP1
And I want the result to be:
Col1 Col2 Col3 Col4 Col5
1048563 93750984 5 0.499503476 HTR7
1048564 93751210 5 0.499503476 ABHD3
1048566 93751298 5 0.499503476 ADCYAP1
The number of rows and columns varies, and the length of the text in column 5 is not always the same.
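A minimal pandas sketch for this (assuming the frame is loaded with the column names shown above):

import pandas as pd

df = pd.DataFrame({
    "Col1": [1048563, 1048564, 1048566],
    "Col2": [93750984, 93751210, 93751298],
    "Col3": [5, 5, 5],
    "Col4": [0.499503476, 0.499503476, 0.499503476],
    "Col5": ["HTR7|HTR7|HTR7",
             "ABHD3|ABHD3|ABHD3|ABHD3|ABHD3|ABHD3",
             "ADCYAP1|ADCYAP1|ADCYAP1|ADCYAP1"],
})

# Split each cell on "|", drop duplicates while preserving order, re-join.
df["Col5"] = df["Col5"].map(lambda s: "|".join(dict.fromkeys(s.split("|"))))
print(df)

dict.fromkeys() preserves insertion order, so cells that mix several distinct values keep them in their original order.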