+-------------+-------------------+--------------------+----------------------+
|serial_number| test_date| s3_path| table_csv_data |
+-------------+-------------------+--------------------+----------------------+
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1B0|2019-05-07 15:41:11|s3://test-bucket-...|col1,col2,col3,col4...|
| 1050D1BE|2019-05-08 09:26:55|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 06:54:28|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:07:21|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-25 00:19:52|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-24 22:24:40|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-09-12 22:15:19|s3://test-bucket-...|col1,col2,col3,col4...|
| A0123456|2019-07-22 21:27:56|s3://test-bucket-...|col1,col2,col3,col4...|
+-------------+-------------------+--------------------+----------------------+
A sample table_csv_data cell contains:
timestamp,partition,offset,key,value
1625218801350,97,33009,2CKXTKAT_20210701193302_6400_UCMP,458969040
1625218801349,41,33018,3FGW9S6T_20210701193210_6400_UCMP,17569160
I am trying to achieve the final dataframe below; please help.
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
|serial_number| test_date| timestamp| partition | offset | key | value |
+-------------+-------------------+--------------------+-----------------+-----------+-----------------------------------+--------------+
| 1050D1B0|2019-05-07 15:41:11| 1625218801350 | 97 | 33009 | 2CKXTKAT_20210701193302_6400_UCMP | 458969040 |
| 1050D1B0|2019-05-07 15:41:11| 1625218801349 | 41 | 33018 | 3FGW9S6T_20210701193210_6400_UCMP | 17569160 |
..
..
..
I cannot think of an approach; kindly help with some suggestions.
As an alternative, I converted the string CSV data into a list using csv_reader, as below, but after that I have been blocked:
[[timestamp, partition, offset, key, value],
 [1625218801350, 97, 33009, 2CKXTKAT_20210701193302_6400_UCMP, 458969040],
 [1625218801349, 41, 33018, 3FGW9S6T_20210701193210_6400_UCMP, 17569160]]
You just need to use split:
from pyspark.sql import functions as F

df = df.withColumn("table_csv_data", F.split("table_csv_data", ",")).select(
    "serial_number",
    "test_date",
    F.col("table_csv_data").getItem(0).alias("timestamp"),
    F.col("table_csv_data").getItem(1).alias("partition"),
    ...  # Do the same for all the columns you need
)
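Note that if each table_csv_data cell actually holds several newline-separated CSV records (as the sample above suggests), the lines need to be exploded first. A minimal sketch, assuming the header line shown in the sample and one record per line:

from pyspark.sql import functions as F

# Split each cell into lines, drop the repeated header line,
# and emit one CSV record per output row.
lines = (
    df.withColumn("csv_row", F.explode(F.split("table_csv_data", "\n")))
      .filter(F.col("csv_row") != "timestamp,partition,offset,key,value")
)

# Split each record on commas and map the pieces to named columns.
parts = F.split("csv_row", ",")
result = lines.select(
    "serial_number",
    "test_date",
    parts.getItem(0).alias("timestamp"),
    parts.getItem(1).alias("partition"),
    parts.getItem(2).alias("offset"),
    parts.getItem(3).alias("key"),
    parts.getItem(4).alias("value"),
)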
I am converting a Map column to multiple columns dynamically, based on the keys in the map. I am using the following code (taken mostly from here), and it works perfectly fine.
However, I would like to rename the columns that are programmatically generated.
Input df:
| map_col |
|:-------------------------------------------------------------------------------|
| {"customer_id":"c5","email":"abc#yahoo.com","mobile_number":"1234567890"} |
| null |
| {"customer_id":"c3","mobile_number":"2345678901","email":"xyz#gmail.com"} |
| {"email":"pqr#hotmail.com","customer_id":"c8","mobile_number":"3456789012"} |
| {"email":"mnk#GMAIL.COM"} |
Code to convert Map to Columns
keys_df = df.select(F.explode(F.map_keys(F.col("map_col")))).distinct()
keys = list(map(lambda row: row[0], keys_df.collect()))
key_cols = list(map(lambda f: F.col("map_col").getItem(f).alias(str(f)), keys))
final_cols = [F.col("*")] + key_cols
df = df.select(final_cols)
Output df:
| customer_id | mobile_number | email |
|:----------- |:--------------| :---------------|
| c5          | 1234567890    | abc@yahoo.com   |
| null        | null          | null            |
| c3          | 2345678901    | xyz@gmail.com   |
| c8          | 3456789012    | pqr@hotmail.com |
| null        | null          | mnk@GMAIL.COM   |
I already have the fields customer_id, mobile_number and email in the main dataframe, of which map_col is one of the columns. I get an error when I try to generate the output because the same column names already exist in the dataset. Therefore, I need to rename the generated columns to customer_id_2, mobile_number_2, and email_2 before they are added to the dataset. Note that map_col may have more keys and values than shown here.
Desired output:
| customer_id_2 | mobile_number_2 | email_2 |
|:------------- |:-----------------| :---------------|
| c5            | 1234567890       | abc@yahoo.com   |
| null          | null             | null            |
| c3            | 2345678901       | xyz@gmail.com   |
| c8            | 3456789012      | pqr@hotmail.com |
| null          | null             | mnk@GMAIL.COM   |
Add the following line just before the code which converts map to columns:
df = df.withColumn('map_col', F.expr("transform_keys(map_col, (k, v) -> concat(k, '_2'))"))
This uses transform_keys, which renames the keys by appending _2 to the original name, as you needed.
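For reference, transform_keys is available in Spark SQL from version 3.0. Putting it together with your map-to-columns code, a sketch might look like this:

from pyspark.sql import functions as F

# Append "_2" to every key in the map (transform_keys needs Spark 3.0+).
df = df.withColumn("map_col", F.expr("transform_keys(map_col, (k, v) -> concat(k, '_2'))"))

# The original map-to-columns code now emits customer_id_2, mobile_number_2, email_2.
keys = [row[0] for row in df.select(F.explode(F.map_keys("map_col"))).distinct().collect()]
key_cols = [F.col("map_col").getItem(k).alias(str(k)) for k in keys]
df = df.select([F.col("*")] + key_cols)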
The data frame I am working on has a column named "Phone", and I want to split it on "/" or "," so that I get the data shown below in separate columns. For example, the first row is 0674-2537100/101, and I want to split it on "/" into two columns with values 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------------------------+
| Phone | Phone1 |
+-----------------------------------+
| 0674-2537100 | 0674-2537101 |
| 0674-2725627 | |
| 0671 – 2647509 | |
| 2392229 | |
| 2586198 | 2583361 |
| 0663-2542855 | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579 | 0671-3200479 |
+-----------------------------------+
My idea for a solution: take the lengths of the strings on both sides of the separator ("/"), compute their difference, and copy the substring of the first column up to character position [:difference-1] into the second column.
So far my progress is:
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error, and the column has only NaN values after I run this. Please help me out here.
Assuming you're only going to have at most two values separated by '/' in the 'Phone' column, here's what you can do:
def split_phone_number(row):
    """
    This function takes in a row of the dataframe and returns the row with appropriate values.
    """
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row
import pandas as pd

# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349',
               '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, adding that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
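As a follow-up, note that the desired output in the question also restores the missing prefix on the second number (101 becomes 0674-2537101). A minimal sketch of one way to do that, assuming the shorter part should borrow its leading characters from the longer first number:

import pandas as pd

df = pd.DataFrame({'Phone': ['0674-2537100/101', '0674-2725627',
                             '2586198/2583361', '0663-2542855/2405168',
                             '0674 – 2563832/0674-2590796', '0671-6520579/3200479']})

# Normalise spaces and en dashes, then split on "/" into two columns.
phone = df['Phone'].str.replace(' ', '').str.replace('–', '-')
out = phone.str.split('/', expand=True).rename(columns={0: 'Phone', 1: 'Phone1'})

# Where the second number is shorter than the first, borrow the missing
# leading characters from the first number, e.g. "101" -> "0674-2537101".
def complete(row):
    first, second = row['Phone'], row['Phone1']
    if second and len(second) < len(first):
        return first[:len(first) - len(second)] + second
    return second

out['Phone1'] = out.apply(complete, axis=1)
print(out)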
For further reading:
dataframe.apply()
Hope this helps. Cheers!
I'm trying to think of a way to have a PowerShell script check an Excel or CSV file, read the last column, and then insert static data for each row that has data.
For example this is the original file:
| Id | Price | Description |
|-----|---------|---------------------------------|
| 33 | 878.35 | C2 Flush Casement |
| 52 | 1111.69 | DS2-L101-T102_L |
| 71 | 875.75 | C2 Flush Casement |
I would like to add the Quote number (Which I can get from the file name):
| Id | Price | Description | Quote |
|----|---------|-------------------|-------|
| 33 | 878.35 | C2 Flush Casement | Q1234 |
| 52 | 1111.69 | DS2-L101-T102_L | Q1234 |
| 71 | 875.75 | C2 Flush Casement | Q1234 |
Any pointers would be appreciated.
You can use Import-Csv to read and parse the data, then create the new column with Select-Object and write to a new CSV file with Export-Csv:
Import-Csv path\to\file.csv |
  Select-Object *, @{Name='Quote'; Expression={ if ($_.Description -ne '') { 'Q1234' } }} |
  Export-Csv path\to\output.csv -NoTypeInformation
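Since the quote number comes from the file name, you can derive it before building the calculated property, for example via the BaseName property returned by Get-Item on the CSV file, and reference that variable inside the expression instead of the hard-coded 'Q1234'.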
I have two dataframes to merge. When I run the program with the same input data and code, one of two situations occurs: either the merge succeeds, or the columns that come from 'annotate' in the merged data are all NaN.
raw_df2 = pd.merge(annotate,raw_df,on='gene',how='right').fillna("unkown")
Then I ran a test:
count = 10001
while count > 10000:
    raw_df2 = pd.merge(annotate, raw_df, on='gene', how='right').fillna("unkown")
    count = len(raw_df2[raw_df2["type"] == "unkown"])
    print(count)
If the merge fails, it keeps failing for the whole run. I must resubmit the script, and then the result may be successful.
[The first two columns are from 'annotate'; the others are from 'raw_df'.]
The failed result:
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| type   | gene          | locus                    | sample_1 | sample_2 | status | value_1 | value_2  |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| unkown | 0610040J01Rik | chr5:63812494-63899619   | Ctrl     | SPION10  | OK     | 2.02125 | 0.652688 |
| unkown | 1110008F13Rik | chr2:156863121-156887078 | Ctrl     | SPION10  | OK     | 87.7115 | 49.8795  |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
The successful result:
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| gene | type | locus | sample_1 | sample_2 | status | value_1 | value_2 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| St18 | misc_RNA | chr1:6487230-6860940 | Ctrl | SPION10 | OK | 1.90988 | 3.91643 |
| Arid5a | misc_RNA | chr1:36307732-36324029 | Ctrl | SPION10 | OK | 1.33796 | 2.21057 |
| Carf | misc_RNA | chr1:60076867-60153953 | Ctrl | SPION10 | OK | 0.846988 | 1.47619 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
I have a solution, but I still don't know what caused the original problem: set the column I want to merge on as the index in both dataframes, then merge on the index.
I have run the script more than 10 times, and the result is no longer wrong.
# the first dataframe
DataQiime = pd.read_csv(args.FileTranseq, header=None, sep=',')
DataQiime.columns = ['Feature.ID', 'Frequency']
DataQiime_index = DataQiime.set_index('Feature.ID', inplace=False, drop=True)
# the second dataframe
DataTranseq = pd.read_table(args.FileQiime, header=0, sep='\t', encoding='utf-8')
DataTranseq_index = DataTranseq.set_index('Feature.ID', inplace=False, drop=True)
# merge by index
DataMerge = pd.merge(DataQiime_index, DataTranseq_index, left_index=True, right_index=True, how="inner")
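Equivalently, once both frames are indexed by Feature.ID, the same inner merge can be expressed with pandas' join, which operates on the index by default:

# Same inner merge expressed with DataFrame.join (joins on the index by default).
DataMerge = DataQiime_index.join(DataTranseq_index, how="inner")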
df:
+-----------+
|       word|
+-----------+
|       1609|
|           |
|        the|
|    sonnets|
|           |
|         by|
|    william|
|shakespeare|
|           |
|         fg|
+-----------+
This is my dataframe. How do I remove the empty rows (the rows that contain '') using the 'where' clause?
code:
df.where(trim(df.word) == "").show()
output:
+----+
|word|
+----+
|    |
|    |
|    |
|    |
|    |
|    |
|    |
|    |
|    |
+----+
Any help is appreciated.
You can trim and keep only the rows where the trimmed result is not empty:
>>> from pyspark.sql.functions import trim
>>> df.where(trim(df.word) != "")
Apart from where, you can also use filter to achieve this.
from pyspark.sql.functions import trim
df.filter(trim(df.word) != "").show()
df.where(trim(df.word) != "").show()
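As a side note, where is an alias for filter on PySpark DataFrames, so the two calls above are interchangeable.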