merge rows in pandas with empty values - python-3.x

I have the following table.
col1 col2 col3 col4 col5
Key1 value1 value2 value3 value4
key2
I want to merge the rows so that the result is a normal table, with each key and its values on a single row:
col1 col2 col3 col4 col5
Key1 value1 value2 value3 value4
key2 value6 value7 value8 value9
How is that possible with pandas?
EDIT:
As mentioned in the comments, I have uploaded the dataset here.
The dataset consists of multiple-choice questions and answers. Some of the questions span multiple lines (rows), and I have to merge these rows first.
Thanks
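One possible approach, as a minimal sketch only. The exact layout of the uploaded dataset is not shown here, so this assumes a hypothetical layout in which the key appears only on the first row of a record and the missing cells arrive on a following row whose key is empty. The sample frame is made up to match the tables above.

```python
import numpy as np
import pandas as pd

# Hypothetical layout (an assumption, not the real dataset): the key is present
# only on the first row of a record; its values sit on a continuation row.
df = pd.DataFrame(
    {
        "col1": ["Key1", "key2", np.nan],
        "col2": ["value1", np.nan, "value6"],
        "col3": ["value2", np.nan, "value7"],
        "col4": ["value3", np.nan, "value8"],
        "col5": ["value4", np.nan, "value9"],
    }
)

merged = (
    # Carry each key down onto its continuation rows.
    df.assign(col1=df["col1"].ffill())
    # Keep the original key order and keep col1 as a regular column.
    .groupby("col1", sort=False, as_index=False)
    # first() takes the first non-null value per column within each key.
    .first()
)
print(merged)
```

If the empty cells are read in as empty strings rather than NaN, they would need converting first, for example with df.replace("", np.nan).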

Related

How to transform one row into multiple columns in ADF?

I have a table as a source in Azure Data Factory with just 1 row and 336 columns, like this:
1 2 3 4 5 6 7 8 9
value1 value2 value3 value4 value5 value6 value7 value8 value9
And I want to combine every 3 columns into the first 3:
1 2 3
value1 value2 value3
value4 value5 value6
value7 value8 value9
What is the alternative to using Select on every 3 columns and then Join, which is a long process with this many columns?
If your data source is Azure SQL DB, you could use conventional SQL to transform the row with a combination of UNPIVOT, PIVOT and some of the ranking functions to help group the data. A simple example:
DROP TABLE IF EXISTS #tmp;
CREATE TABLE #tmp (
    col1 VARCHAR(10),
    col2 VARCHAR(10),
    col3 VARCHAR(10),
    col4 VARCHAR(10),
    col5 VARCHAR(10),
    col6 VARCHAR(10),
    col7 VARCHAR(10),
    col8 VARCHAR(10),
    col9 VARCHAR(10)
);
INSERT INTO #tmp
VALUES ( 'value1', 'value2', 'value3', 'value4', 'value5', 'value6', 'value7', 'value8', 'value9' );
SELECT [1], [2], [0] AS [3]
FROM
(
    SELECT
        NTILE(3) OVER( ORDER BY ( SELECT NULL ) ) nt,
        ROW_NUMBER() OVER( ORDER BY ( SELECT NULL ) ) % 3 groupNumber,
        newCol
    FROM #tmp
    UNPIVOT ( newCol FOR sourceCol IN ( col1, col2, col3, col4, col5, col6, col7, col8, col9 ) ) uvpt
) x
PIVOT ( MAX(newCol) FOR groupNumber IN ( [1], [2], [0] ) ) pvt;
Tweak the NTILE value depending on the number of columns you have - it should be the total number of columns you have divided by 3. For example if you have 300 columns, the NTILE value should be 100, if you have 336 columns it should be 112. A bigger example with 336 columns is available here.
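To see why the NTILE value is the column count divided by 3, the grouping arithmetic can be sketched in plain Python. This is only an illustration, not part of the ADF or SQL solution. The function name pivot_in_threes is made up, and the sketch assumes the number of values is divisible by 3 and that the values are processed in column order (which ORDER BY ( SELECT NULL ) does not strictly guarantee in SQL Server).

```python
def pivot_in_threes(values):
    """Group a flat list of values into rows of three, mimicking the SQL above."""
    rows = {}
    for row_number, value in enumerate(values, start=1):
        # Tile number: with N values and NTILE(N / 3), each tile holds 3 values.
        nt = (row_number - 1) // 3 + 1
        # Cycles 1, 2, 0 just like ROW_NUMBER() % 3.
        group_number = row_number % 3
        rows.setdefault(nt, {})[group_number] = value
    # Same column order as SELECT [1], [2], [0] AS [3].
    return [[cells[1], cells[2], cells[0]] for _, cells in sorted(rows.items())]


small = pivot_in_threes([f"value{i}" for i in range(1, 10)])
large = pivot_in_threes([f"value{i}" for i in range(1, 337)])
print(small)
print(len(large))  # number of output rows, i.e. the NTILE value for 336 columns
```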
Present the data to Azure Data Factory (ADF) either as a view or through the Query option in the Copy activity, for example.
If you are using Azure Synapse Analytics then another fun way to approach this would be using Synapse Notebooks. With just three lines of code, you can get the table from the dedicated SQL pool, unpivot all 336 columns using the stack function and write it back to the database. This simple example is in Scala:
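// Imports assumed for a Synapse Scala notebook; the session may already provide some of them
import org.apache.spark.sql.functions.expr
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._
val df = spark.read.synapsesql("someDb.dbo.pivotWorking")
val df2 = df.select( expr("stack(112, *)"))
// Write it back
df2.write.synapsesql("someDb.dbo.pivotWorking_after", Constants.INTERNAL)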
I have to admire the simplicity of it.

Partial and exact match of strings in postgres

I have the following table in Postgres 10.
col1         col2  col3                                         col4
NCT00000102  Drug  nifedipine                                   nifed
NCT00000102  Drug  nifedipine sulphate                          nifedipine
NCT00000103  Drug  phospho nifedipine                           ni
NCT00000103  Drug  phospho nifedipine sulphate                  phospho nifedipine
NCT00000105  Drug  fluticasone furoate (veramyst®) nasal spray  veramyst
NCT00000105  Drug  fluticasone furoate (veramyst®) nasal spray  vera
NCT00000106  Drug  veramyst                                     veramyst
I am looking for a way to filter the above table so that it keeps only rows where col4 is either an exact match for col3 or a partial match made up of complete words from col3.
I am expecting the output below.
col1         col2  col3                                         col4
NCT00000102  Drug  nifedipine sulphate                          nifedipine
NCT00000103  Drug  phospho nifedipine sulphate                  phospho nifedipine
NCT00000105  Drug  fluticasone furoate (veramyst®) nasal spray  veramyst
NCT00000106  Drug  veramyst                                     veramyst
I tried using the query below, but it does not give me the partial matches properly.
where col3 '%' || col4 || '%'
Any suggestion here will be really useful.
Try using regex matching:
SELECT col1, col2, col3, col4
FROM yourTable
WHERE col3 ~* CONCAT('\y', col4, '\y');
This will return every record where the word or words in col4 appear as proper word(s) in col3. If you wanted to use LIKE, you could use:
WHERE col3 LIKE '%' || col4 || '%'
but keep in mind this might return false positives for substring matches.
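The difference between whole-word matching and plain substring matching can be checked outside the database with a small Python sketch. This only illustrates the idea and does not replace the SQL above. Python's \b plays the role of the Postgres \y word boundary here. The helper name whole_word_match is made up. The re.escape call is an addition: the SQL answer does not escape col4, so regex metacharacters in col4 would be interpreted as part of the pattern there.

```python
import re

# (col3, col4) pairs taken from the table in the question.
rows = [
    ("nifedipine", "nifed"),
    ("nifedipine sulphate", "nifedipine"),
    ("phospho nifedipine", "ni"),
    ("phospho nifedipine sulphate", "phospho nifedipine"),
    ("fluticasone furoate (veramyst®) nasal spray", "veramyst"),
    ("fluticasone furoate (veramyst®) nasal spray", "vera"),
    ("veramyst", "veramyst"),
]


def whole_word_match(text, term):
    """True if term occurs in text delimited by word boundaries, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Rows kept by the word-boundary test.
kept = [(col3, col4) for col3, col4 in rows if whole_word_match(col3, col4)]
# Rows a plain substring test (the LIKE approach) would keep.
substring_hits = [(col3, col4) for col3, col4 in rows if col4 in col3]

for col3, col4 in kept:
    print(col3, "|", col4)
```

On this sample the substring test keeps all seven rows, including the "nifed", "ni" and "vera" rows, which are the false positives mentioned above.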

How to prevent df.pivot from inserting 'None' and blank rows during pivot?

I want to pivot a df that looks like this:
columns values
col1 test1
col2 test2
col3 test3
col4 test4
col1 test5
col2 test6
col3 test7
col4 test8
I am trying this:
df['index'] = df.index
df = df.pivot(index='index', columns='columns', values='values')
which results in a df that looks like this (roughly):
col1 col2 col3 col4
None None test1 None
test5 None None None
How do I pivot the df to look like this?:
col1 col2 col3 col4
test1 test2 test3 test4
test5 test6 test7 test8
I am creating an artificial index column because I don't have another column to use as an index. I only have 2 columns in the dataframe.
Use cumcount to create a new key, then pivot on it (recent pandas versions require keyword arguments for pivot):
df.assign(key=df.groupby('columns').cumcount()).pivot(index='key', columns='columns', values='values')
Out[54]:
columns col1 col2 col3 col4
key
0 test1 test2 test3 test4
1 test5 test6 test7 test8
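A self-contained sketch of why this works, rebuilding the frame from the question. The original index is unique for every row, so pivoting on it leaves one non-null cell per output row. cumcount instead numbers each repeat of a label (0 for the first occurrence of col1, 1 for the second, and so on), which gives rows that belong together the same key.

```python
import pandas as pd

# Rebuild the two-column frame from the question.
df = pd.DataFrame(
    {
        "columns": ["col1", "col2", "col3", "col4"] * 2,
        "values": [f"test{i}" for i in range(1, 9)],
    }
)

# Occurrence counter within each label: 0,0,0,0 for the first block, then 1,1,1,1.
key = df.groupby("columns").cumcount()

wide = df.assign(key=key).pivot(index="key", columns="columns", values="values")
print(key.tolist())
print(wide)
```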

Remove duplicates from multiple cells of a column separated by "|"

I want to remove duplicates from multiple cells of the column 5 with delimiter "|". The data I have looks like this:
Col1 Col2 Col3 Col4 Col5
1048563 93750984 5 0.499503476 HTR7|HTR7|HTR7
1048564 93751210 5 0.499503476 ABHD3|ABHD3|ABHD3|ABHD3|ABHD3|ABHD3
1048566 93751298 5 0.499503476 ADCYAP1|ADCYAP1|ADCYAP1|ADCYAP1
And I want the result to be:
Col1 Col2 Col3 Col4 Col5
1048563 93750984 5 0.499503476 HTR7
1048564 93751210 5 0.499503476 ABHD3
1048566 93751298 5 0.499503476 ADCYAP1
The number of rows and columns varies. The length of the text in column 5 is not always the same.
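No tool is named in the question, so the following is just one possible sketch in Python. The helper name dedupe_cell is made up. It splits a cell on the delimiter, drops repeated items while keeping the first occurrence of each, and joins the rest back together.

```python
def dedupe_cell(cell, sep="|"):
    """Remove repeated items from a delimited string, keeping first occurrences in order."""
    # dict.fromkeys keeps one entry per distinct item and preserves insertion order.
    return sep.join(dict.fromkeys(cell.split(sep)))


col5 = [
    "HTR7|HTR7|HTR7",
    "ABHD3|ABHD3|ABHD3|ABHD3|ABHD3|ABHD3",
    "ADCYAP1|ADCYAP1|ADCYAP1|ADCYAP1",
]
cleaned = [dedupe_cell(cell) for cell in col5]
print(cleaned)
```

If the data is already in a pandas DataFrame, the same function could be applied with df["Col5"].map(dedupe_cell), assuming the column is named Col5 and holds strings.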

How to flip string using linux in specific columns

I have a few columns as shown below:
col1 col2 col3 col4 a/t t/g g/t f/g
col3 col2 col4 col5 t/a g/t f/g g/t
I would need to flip the values in columns after 4, and the sample output is shown below:
col1 col2 col3 col4 t/a g/t t/g g/f
col3 col2 col4 col5 a/t t/g g/f t/g
I tried using rev in bash, but it prints the whole row in the inverted direction (mirror image). Is there an alternative solution that flips only the strings, as shown in the output? Thanks in advance.
You don't say what the first 4 columns may contain, so I assume this would be enough:
sed 's/\(\w\)\/\(\w\)/\2\/\1/g' <yourfile>
like:
$ cat test
col1 col2 col3 col4 t/a g/t t/g g/f
col3 col2 col4 col5 a/t t/g g/f t/g
$ sed 's/\(\w\)\/\(\w\)/\2\/\1/g' test
col1 col2 col3 col4 a/t t/g g/t f/g
col3 col2 col4 col5 t/a g/t f/g g/t
If you want to save the result to a file, redirect the sed output:
$ sed 's/\(\w\)\/\(\w\)/\2\/\1/g' test > newfile
Alternatively, with Perl, leaving the first 4 fields untouched and reversing the rest:
perl -lane 'print join " ", @F[0..3], map { scalar reverse $_ } @F[4..$#F]' <yourfile>
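For comparison, the same field-wise logic as the Perl one-liner can be sketched in Python. The function name flip_fields is made up. Like the Perl version, it splits on whitespace, so runs of spaces or tabs are collapsed to single spaces in the output.

```python
def flip_fields(line, keep=4):
    """Reverse the characters of every whitespace-separated field after the first `keep`."""
    fields = line.split()
    flipped = [field[::-1] for field in fields[keep:]]
    return " ".join(fields[:keep] + flipped)


sample = [
    "col1 col2 col3 col4 a/t t/g g/t f/g",
    "col3 col2 col4 col5 t/a g/t f/g g/t",
]
result = [flip_fields(line) for line in sample]
for line in result:
    print(line)
```

Unlike the sed command, this only touches fields after the fourth, so a slash-separated pair in the first 4 columns would be left alone.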
