+-------------+-------------------+--------------------+----------------------+
|serial_number| test_date| s3_path| table_csv_data |
+-------------+-------------------+--------------------+----------------------+
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1BE|2019-05-08 09:26:55|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 06:54:28|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 00:19:52|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-24 22:24:40|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-09-12 22:15:19|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:27:56|s3://test-bucket-...|col1,col2,col3,col4...|
+-------------+-------------------+--------------------+----------------------+
A sample table_csv_data cell contains:
timestamp,partition,offset,key,value
1625218801350,97,33009,2CKXTKAT_20210701193302_6400_UCMP,458969040
1625218801349,41,33018,3FGW9S6T_20210701193210_6400_UCMP,17569160
I am trying to achieve the final dataframe below; please help.
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
|serial_number| test_date| timestamp| partition | offset | key | value |
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
| 1050D1B0|2019-05-07 15:41:11| 1625218801350 | 97 | 33009 | 2CKXTKAT_20210701193302_6400_UCMP | 458969040 |
| 1050D1B0|2019-05-07 15:41:11| 1625218801349 | 41 | 33018 | 3FGW9S6T_20210701193210_6400_UCMP | 17569160 |
..
..
..
I cannot think of an approach; kindly help with some suggestions.
As an alternative, I converted the string CSV data into a list using csv_reader, as below, but after that I have been blocked:
[[timestamp, partition, offset, key, value],
 [1625218801350, 97, 33009, 2CKXTKAT_20210701193302_6400_UCMP, 458969040],
 [1625218801349, 41, 33018, 3FGW9S6T_20210701193210_6400_UCMP, 17569160]]
You just need to use split:
from pyspark.sql import functions as F

df = df.withColumn("table_csv_data", F.split("table_csv_data", ",")).select(
    "serial_number",
    "test_date",
    F.col("table_csv_data").getItem(0).alias("timestamp"),
    F.col("table_csv_data").getItem(1).alias("partition"),
    ...  # Do the same for all the columns you need
)
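Note that if each table_csv_data cell actually holds several newline-separated CSV records (as the sample above suggests), the lines need to be exploded first. A minimal sketch, assuming the header line shown in the sample and one record per line:

from pyspark.sql import functions as F

# Split each cell into lines, drop the repeated header line,
# and emit one CSV record per output row.
lines = (
    df.withColumn("csv_row", F.explode(F.split("table_csv_data", "\n")))
      .filter(F.col("csv_row") != "timestamp,partition,offset,key,value")
)

# Split each record on commas and map the pieces to named columns.
parts = F.split("csv_row", ",")
result = lines.select(
    "serial_number",
    "test_date",
    parts.getItem(0).alias("timestamp"),
    parts.getItem(1).alias("partition"),
    parts.getItem(2).alias("offset"),
    parts.getItem(3).alias("key"),
    parts.getItem(4).alias("value"),
)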
I am converting a Map column to multiple columns dynamically, based on the keys in the map. I am using the following code (taken mostly from here), and it works perfectly fine.
However, I would like to rename the columns that are programmatically generated.
Input df:
| map_col |
|:-------------------------------------------------------------------------------|
| {"customer_id":"c5","email":"abc#yahoo.com","mobile_number":"1234567890"} |
| null |
| {"customer_id":"c3","mobile_number":"2345678901","email":"xyz#gmail.com"} |
| {"email":"pqr#hotmail.com","customer_id":"c8","mobile_number":"3456789012"} |
| {"email":"mnk#GMAIL.COM"} |
Code to convert Map to Columns
keys_df = df.select(F.explode(F.map_keys(F.col("map_col")))).distinct()
keys = list(map(lambda row: row[0], keys_df.collect()))
key_cols = list(map(lambda f: F.col("map_col").getItem(f).alias(str(f)), keys))
final_cols = [F.col("*")] + key_cols
df = df.select(final_cols)
Output df:
| customer_id | mobile_number | email |
|:----------- |:--------------| :---------------|
| c5          | 1234567890    | abc@yahoo.com   |
| null        | null          | null            |
| c3          | 2345678901    | xyz@gmail.com   |
| c8          | 3456789012    | pqr@hotmail.com |
| null        | null          | mnk@GMAIL.COM   |
I already have the fields customer_id, mobile_number and email in the main dataframe, of which map_col is one of the columns. I get an error when I try to generate the output because the same column names already exist in the dataset. Therefore, I need to rename the generated columns to customer_id_2, mobile_number_2, and email_2 before they are added to the dataset. Note that map_col may have more keys and values than shown here.
Desired output:
| customer_id_2 | mobile_number_2 | email_2 |
|:------------- |:-----------------| :---------------|
| c5            | 1234567890       | abc@yahoo.com   |
| null          | null             | null            |
| c3            | 2345678901       | xyz@gmail.com   |
| c8            | 3456789012      | pqr@hotmail.com |
| null          | null             | mnk@GMAIL.COM   |
Add the following line just before the code which converts map to columns:
df = df.withColumn('map_col', F.expr("transform_keys(map_col, (k, v) -> concat(k, '_2'))"))
This uses transform_keys, which renames the keys by appending _2 to the original name, as you needed.
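For reference, transform_keys is available in Spark SQL from version 3.0. Putting it together with your map-to-columns code, a sketch might look like this:

from pyspark.sql import functions as F

# Append "_2" to every key in the map (transform_keys needs Spark 3.0+).
df = df.withColumn("map_col", F.expr("transform_keys(map_col, (k, v) -> concat(k, '_2'))"))

# The original map-to-columns code now emits customer_id_2, mobile_number_2, email_2.
keys = [row[0] for row in df.select(F.explode(F.map_keys("map_col"))).distinct().collect()]
key_cols = [F.col("map_col").getItem(k).alias(str(k)) for k in keys]
df = df.select([F.col("*")] + key_cols)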
The data frame I am working on has a column named "Phone", and I want to split it on "/" or "," so that I get the data shown below in separate columns. For example, the first row is 0674-2537100/101, and I want to split it on "/" into two columns with values 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------------------------+
| Phone | Phone1 |
+-----------------------------------+
| 0674-2537100 | 0674-2537101 |
| 0674-2725627 | |
| 0671 – 2647509 | |
| 2392229 | |
| 2586198 | 2583361 |
| 0663-2542855 | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579 | 0671-3200479 |
+-----------------------------------+
My idea for a solution: take the lengths of the strings on both sides of the separator ("/"), compute their difference, and copy the substring of the first column up to character position [:difference-1] into the second column.
So far my progress is:
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error, and the column has only NaN values after I run this. Please help me out here.
Assuming you're only going to have at most two values separated by '/' in the 'Phone' column, here's what you can do:
def split_phone_number(row):
    """
    This function takes in a row of the dataframe and returns the row with appropriate values.
    """
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row
import pandas as pd

# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349',
               '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, adding that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
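As a follow-up, note that the desired output in the question also restores the missing prefix on the second number (101 becomes 0674-2537101). A minimal sketch of one way to do that, assuming the shorter part should borrow its leading characters from the longer first number:

import pandas as pd

df = pd.DataFrame({'Phone': ['0674-2537100/101', '0674-2725627',
                             '2586198/2583361', '0663-2542855/2405168',
                             '0674 – 2563832/0674-2590796', '0671-6520579/3200479']})

# Normalise spaces and en dashes, then split on "/" into two columns.
phone = df['Phone'].str.replace(' ', '').str.replace('–', '-')
out = phone.str.split('/', expand=True).rename(columns={0: 'Phone', 1: 'Phone1'})

# Where the second number is shorter than the first, borrow the missing
# leading characters from the first number, e.g. "101" -> "0674-2537101".
def complete(row):
    first, second = row['Phone'], row['Phone1']
    if second and len(second) < len(first):
        return first[:len(first) - len(second)] + second
    return second

out['Phone1'] = out.apply(complete, axis=1)
print(out)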
For further reading:
dataframe.apply()
Hope this helps. Cheers!
I'm trying to think of a way to have a PowerShell script check an Excel or CSV file, read the last column, and then insert static data for each row that has data.
For example this is the original file:
| Id | Price | Description |
|-----|---------|---------------------------------|
| 33 | 878.35 | C2 Flush Casement |
| 52 | 1111.69 | DS2-L101-T102_L |
| 71 | 875.75 | C2 Flush Casement |
I would like to add the Quote number (Which I can get from the file name):
| Id | Price | Description | Quote |
|----|---------|-------------------|-------|
| 33 | 878.35 | C2 Flush Casement | Q1234 |
| 52 | 1111.69 | DS2-L101-T102_L | Q1234 |
| 71 | 875.75 | C2 Flush Casement | Q1234 |
Any pointers would be appreciated.
You can use Import-Csv to read and parse the data, then create the new column with Select-Object and write to a new CSV file with Export-Csv:
Import-Csv path\to\file.csv |
  Select-Object *, @{Name='Quote'; Expression={ if ($_.Description -ne '') { 'Q1234' } }} |
  Export-Csv path\to\output.csv -NoTypeInformation
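Since the quote number comes from the file name, you can derive it before building the calculated property, for example via the BaseName property returned by Get-Item on the CSV file, and reference that variable inside the expression instead of the hard-coded 'Q1234'.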
I have two dataframes to merge. When I run the program with the same input data and code, one of two situations occurs: either the merge succeeds, or the columns that come from 'annotate' in the merged data are all NaN.
raw_df2 = pd.merge(annotate,raw_df,on='gene',how='right').fillna("unkown")
Then I ran a test:
count = 10001
while count > 10000:
    raw_df2 = pd.merge(annotate, raw_df, on='gene', how='right').fillna("unkown")
    count = len(raw_df2[raw_df2["type"] == "unkown"])
    print(count)
If the merge fails, it keeps failing for the whole run. I must resubmit the script, and then the result may be successful.
[The first two columns are from 'annotate'; the others are from 'raw_df'.]
The failed result:
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| type   | gene          | locus                    | sample_1 | sample_2 | status | value_1 | value_2  |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| unkown | 0610040J01Rik | chr5:63812494-63899619   | Ctrl     | SPION10  | OK     | 2.02125 | 0.652688 |
| unkown | 1110008F13Rik | chr2:156863121-156887078 | Ctrl     | SPION10  | OK     | 87.7115 | 49.8795  |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
The successful result:
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| gene | type | locus | sample_1 | sample_2 | status | value_1 | value_2 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| St18 | misc_RNA | chr1:6487230-6860940 | Ctrl | SPION10 | OK | 1.90988 | 3.91643 |
| Arid5a | misc_RNA | chr1:36307732-36324029 | Ctrl | SPION10 | OK | 1.33796 | 2.21057 |
| Carf | misc_RNA | chr1:60076867-60153953 | Ctrl | SPION10 | OK | 0.846988 | 1.47619 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
I have a solution, but I still don't know what caused the original problem: set the column I want to merge on as the index in both dataframes, then merge on the index.
I have run the script more than 10 times, and the result is no longer wrong.
# the first dataframe
DataQiime = pd.read_csv(args.FileTranseq, header=None, sep=',')
DataQiime.columns = ['Feature.ID', 'Frequency']
DataQiime_index = DataQiime.set_index('Feature.ID', inplace=False, drop=True)
# the second dataframe
DataTranseq = pd.read_table(args.FileQiime, header=0, sep='\t', encoding='utf-8')
DataTranseq_index = DataTranseq.set_index('Feature.ID', inplace=False, drop=True)
# merge by index
DataMerge = pd.merge(DataQiime_index, DataTranseq_index, left_index=True, right_index=True, how="inner")
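Equivalently, once both frames are indexed by Feature.ID, the same inner merge can be expressed with pandas' join, which operates on the index by default:

# Same inner merge expressed with DataFrame.join (joins on the index by default).
DataMerge = DataQiime_index.join(DataTranseq_index, how="inner")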
df:
+-----------+
|       word|
+-----------+
|       1609|
|           |
|        the|
|    sonnets|
|           |
|         by|
|    william|
|shakespeare|
|           |
|         fg|
+-----------+
This is my dataframe. How do I remove the empty rows (the rows that contain '') using the 'where' clause?
code:
df.where(trim(df.word) == "").show()
output:
+----+
|word|
+----+
|    |
|    |
|    |
|    |
|    |
|    |
|    |
|    |
|    |
+----+
Any help is appreciated.
You can trim and keep only the rows where the trimmed result is not empty:
>>> from pyspark.sql.functions import trim
>>> df.where(trim(df.word) != "")
Apart from where, you can also use filter to achieve this.
from pyspark.sql.functions import trim
df.filter(trim(df.word) != "").show()
df.where(trim(df.word) != "").show()
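As a side note, where is an alias for filter on PySpark DataFrames, so the two calls above are interchangeable.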