I am converting a Map column to multiple columns dynamically, based on the keys in the map. I am using the following code (taken mostly from here), and it works perfectly fine.
However, I would like to rename the column names that are programmatically generated.
Input df:
| map_col |
|:-------------------------------------------------------------------------------|
| {"customer_id":"c5","email":"abc#yahoo.com","mobile_number":"1234567890"} |
| null |
| {"customer_id":"c3","mobile_number":"2345678901","email":"xyz#gmail.com"} |
| {"email":"pqr#hotmail.com","customer_id":"c8","mobile_number":"3456789012"} |
| {"email":"mnk#GMAIL.COM"} |
Code to convert Map to Columns
from pyspark.sql import functions as F

keys_df = df.select(F.explode(F.map_keys(F.col("map_col")))).distinct()
keys = list(map(lambda row: row[0], keys_df.collect()))
key_cols = list(map(lambda f: F.col("map_col").getItem(f).alias(str(f)), keys))
final_cols = [F.col("*")] + key_cols
df = df.select(final_cols)
Output df:
| customer_id | mobile_number | email |
|:----------- |:--------------| :---------------|
| c5          | 1234567890    | abc@yahoo.com   |
| null        | null          | null            |
| c3          | 2345678901    | xyz@gmail.com   |
| c8          | 3456789012    | pqr@hotmail.com |
| null        | null          | mnk@GMAIL.COM   |
I already have the fields customer_id, mobile_number and email in the main dataframe, of which map_col is one of the columns. I get an error when I try to generate the output because the same column names already exist in the dataset. Therefore, I need to rename the generated columns to customer_id_2, mobile_number_2, and email_2 before they are added to the dataset. The map_col column may have more keys and values than shown.
Desired output:
| customer_id_2 | mobile_number_2 | email_2 |
|:------------- |:-----------------| :---------------|
| c5            | 1234567890      | abc@yahoo.com   |
| null          | null            | null            |
| c3            | 2345678901      | xyz@gmail.com   |
| c8            | 3456789012      | pqr@hotmail.com |
| null          | null            | mnk@GMAIL.COM   |
Add the following line just before the code which converts map to columns:
df = df.withColumn('map_col', F.expr("transform_keys(map_col, (k, v) -> concat(k, '_2'))"))
This uses transform_keys, which renames the map keys by appending _2 to the original names, as you needed.
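A minimal end-to-end sketch of the whole flow on toy data (the SparkSession setup and the sample rows are assumptions for illustration; transform_keys is a Spark SQL higher-order function):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame standing in for the real one.
df = spark.createDataFrame(
    [({"customer_id": "c5", "email": "abc@yahoo.com"},), (None,)],
    "map_col map<string,string>")

# Append '_2' to every key so the generated columns cannot collide
# with the existing customer_id / mobile_number / email columns.
df = df.withColumn(
    "map_col",
    F.expr("transform_keys(map_col, (k, v) -> concat(k, '_2'))"))

# Same expansion as in the question, now yielding customer_id_2, email_2, ...
keys = [row[0] for row in
        df.select(F.explode(F.map_keys("map_col"))).distinct().collect()]
df = df.select("*", *[F.col("map_col").getItem(k).alias(k) for k in keys])
df.show(truncate=False)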
tl;dr In Google Sheets/Excel, how do I find the address of a cell with a specified value within a specified range where value may be in any row or column?
My best guess is
=CELL("address",LOOKUP("My search value", $search:$range))
but it doesn't work. When it finds a value at all, it returns the rightmost column every time, rather than the column of the cell it found.
I have a sheet of pretty, formatted tables that represent various concepts. Each table consists of
| Title |
+------+------+-------+------+------+-------+------+------+-------+
| Sub | Prop | Name | Sub | Prop | Name | Sub | Prop | Name |
+------+------+-------+------+------+-------+------+------+-------+
| Sub prop | value | Sub prop | value | Sub prop | value |
+------+------+-------+------+------+-------+------+------+-------+
| data | data | data | data | data | data | data | data | data |
| data | data | data | data | data | data | data | data | data |
⋮
I have 8 such tables of variable height arranged in a grid within the sheet, 3 tables wide and 3 tables tall, except the last column, which has only 2 tables (see image). These fill the range C2:AI78.
Now I have a table off to the right, in AK2:AO11, consisting of
| Table title | Table title address | ... |
+---------------+-----------------------+-----+
| Table 1 Title | | ... |
| Table 2 Title | | ... |
⋮
| Table 8 Title | | ... |
I want to fill out the Table title address column. (Would it be easier to do this manually for all of 8 values? Absolutely. Did I need to in order to write this question? Yes. But using static values is not the StackOverflow way, now, is it?)
Based on very limited Excel/Google Sheets experience, I believe I need to use CELL() and LOOKUP() for this.
=CELL("address",LOOKUP($AK4, $C$2:$AI$78))
This retrieves the wrong value. For AL4 (looking for value Death Wave), LOOKUP($AK4, $C$2:$AI$78) should retrieve cell C2 but it finds AI2 instead.
| Max Levels |
+------------------+---------------+----+--+----+
| UW | Table Address | | | |
+------------------+---------------+----+--+----+
| Death Wave | $AI$3 | 3 | | 15 |
| Poison Swamp | $AI$30 | | | |
| Smart Missiles | $AI$56 | | | |
| Black Hole | #N/A | 1 | | |
| Inner Land Mines | $AI$3 | | | |
| Chain Lightning | #N/A | | | |
| Golden Tower | $AI$3 | | | |
| Chrono Field | #N/A | 25 | | |
The error message for the #N/A cells is
Did not find value '<Table Title>' in LOOKUP evaluation.
My expected table is
| Max Levels |
+------------------+---------------+----+--+----+
| UW | Table Address | | | |
+------------------+---------------+----+--+----+
| Death Wave | $C$2 | 3 | | 15 |
| Poison Swamp | $C$28 | | | |
| Smart Missiles | $C$54 | | | |
| Black Hole | $O$2 | 1 | | |
| Inner Land Mines | $O$28 | | | |
| Chain Lightning | $O$54 | | | |
| Golden Tower | $AA$2 | | | |
| Chrono Field | $AA$39 | 25 | | |
try:
=INDEX(ADDRESS(
 VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"×"&ROW(D2:F4)), "×"), 2, ),
 VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"×"&COLUMN(D2:F4)), "×"), 2, ), 4))
or if you want to create jump links:
=INDEX(LAMBDA(x, HYPERLINK("#gid=1273961649&range="&x, x))(ADDRESS(
 VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"×"&ROW(D2:F4)), "×"), 2, ),
 VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"×"&COLUMN(D2:F4)), "×"), 2, ), 4)))
Try this:
=QUERY(
FLATTEN(
ARRAYFORMULA(
IF(
C:AI=$AK4,
ADDRESS(ROW(C:AI), COLUMN(C:AI)),
""
)
)
), "
SELECT
Col1
WHERE
Col1<>''
"
, 0)
Basically, cast every cell in the search range to its address if it equals the search term, otherwise to an empty string. Then flatten that 2D range and keep only the non-empty values.
The data frame I am working on has a column named "Phone", and I want to split it on "/" or "," so that I get the data frame shown below, with the numbers in separate columns. For example, the first row is 0674-2537100/101, and I want to split it on "/" into two columns holding 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------------------------+
| Phone | Phone1 |
+-----------------------------------+
| 0674-2537100 | 0674-2537101 |
| 0674-2725627 | |
| 0671 – 2647509 | |
| 2392229 | |
| 2586198 | 2583361 |
| 0663-2542855 | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579 | 0671-3200479 |
+-----------------------------------+
Here I came up with an approach: take the lengths of the strings on both sides of the separator (/), take their difference, and copy the substring of the first column up to character position [:difference-1] onto the front of the second column.
So far my progress is:
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error, and the column has only nan values after I run this. Please help me out here.
Considering you're only going to have at most two values separated by '/' in the 'Phone' column, here's what you can do:
import pandas as pd

def split_phone_number(row):
    '''
    This function takes a row of the dataframe as input and returns the row with appropriate values.
    '''
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with the appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row

# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349',
               '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)

# Considering you're only going to have one extra column, add it to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])

# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)

# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
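If you also need to carry the prefix over (e.g. 0674-2537100/101 → 0674-2537101, as in the question's desired output), here is a vectorized sketch of that idea on toy data; treat it as an assumption-laden illustration, not a drop-in answer:

import pandas as pd

df = pd.DataFrame({'Phone': ['0674-2537100/101', '0674-2725627',
                             '2586198/2583361', '0663-2542855/2405168']})

# Split on '/', keeping at most two parts.
parts = df['Phone'].str.split('/', n=1, expand=True)
df['Phone'], df['Phone1'] = parts[0], parts[1]

# Where the second number is shorter than the first, it is missing a
# prefix: borrow the leading characters of the first number.
short = df['Phone1'].notna() & (df['Phone1'].str.len() < df['Phone'].str.len())
df.loc[short, 'Phone1'] = [
    p[:len(p) - len(q)] + q
    for p, q in zip(df.loc[short, 'Phone'], df.loc[short, 'Phone1'])
]
print(df)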
For further reading:
dataframe.apply()
Hope this helps. Cheers!
I have two dataframes to merge. When I run the program with the same input data and code, one of two things happens: either the merge succeeds, or the columns that come from 'annotate' are NaN in the merged data.
raw_df2 = pd.merge(annotate,raw_df,on='gene',how='right').fillna("unkown")
Then I have a test:
count = 10001
while (count > 10000):
    raw_df2 = pd.merge(annotate, raw_df, on='gene', how='right').fillna("unkown")
    count = len(raw_df2[raw_df2["type"] == "unkown"])
    print(count)
If the merge fails, it stays failed for the rest of the run. I must resubmit the script, and then the result may be successful.
[The first two columns are from 'annotate'; the others are from 'raw_df'.]
The failed result:
| type | gene | locus | sample_1 | sample_2 | status | value_1 | value_2 |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| unkown | 0610040J01Rik | chr5:63812494-63899619   | Ctrl     | SPION10  | OK     | 2.02125 | 0.652688 |
| unkown | 1110008F13Rik | chr2:156863121-156887078 | Ctrl     | SPION10  | OK     | 87.7115 | 49.8795  |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
The successful result:
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| gene | type | locus | sample_1 | sample_2 | status | value_1 | value_2 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| St18 | misc_RNA | chr1:6487230-6860940 | Ctrl | SPION10 | OK | 1.90988 | 3.91643 |
| Arid5a | misc_RNA | chr1:36307732-36324029 | Ctrl | SPION10 | OK | 1.33796 | 2.21057 |
| Carf | misc_RNA | chr1:60076867-60153953 | Ctrl | SPION10 | OK | 0.846988 | 1.47619 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
I have a solution, but I still don't know what caused the original problem.
Set the column I want to merge on as the index in both dataframes, then merge on the index.
After running the script more than 10 times, the result is no longer wrong.
import pandas as pd

# 'args' comes from the script's command-line parsing.
# the first dataframe
DataQiime = pd.read_csv(args.FileTranseq, header=None, sep=',')
DataQiime.columns = ['Feature.ID', 'Frequency']
DataQiime_index = DataQiime.set_index('Feature.ID', inplace=False, drop=True)
# the second dataframe
DataTranseq = pd.read_table(args.FileQiime, header=0, sep='\t', encoding='utf-8')
DataTranseq_index = DataTranseq.set_index('Feature.ID', inplace=False, drop=True)
# merge by the 'Feature.ID' index (note: the indexed frames, not the raw ones)
DataMerge = pd.merge(DataQiime_index, DataTranseq_index,
                     left_index=True, right_index=True, how="inner")
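A minimal, self-contained sketch of the same index-merge pattern on toy data (the rows are invented; the column names follow the question):

import pandas as pd

annotate = pd.DataFrame({'gene': ['St18', 'Arid5a'],
                         'type': ['misc_RNA', 'misc_RNA']})
raw_df = pd.DataFrame({'gene': ['St18', 'Carf'],
                       'value_1': [1.90988, 0.846988]})

# Merge on the shared key via the index instead of the 'on=' column.
merged = pd.merge(annotate.set_index('gene'), raw_df.set_index('gene'),
                  left_index=True, right_index=True,
                  how='right').fillna('unkown')
print(merged.reset_index())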
I am trying to get a formula to check a set of data for a certain text. For example, assuming the table below starts in cell A1, I would like to search columns C, D, E, F, G for a string and return the entire contents of the cell that contains it. So for AltID 101020, I would like to search columns C-G for the string "Plan" and return the value "Plan11" in B2, "Plan88" in B5, and "Plan2d" in B7.
A B C D E F G
Data Column1 Column2 Column3 Column4 Column5 Column6
+--------+-------+-------+-------+-------+-------+-------+
1 | AltID |Plans | CovA | CovB |CovC | CovD | CovE |
+--------+-------+-------+-------+-------+-------+-------+
2 | 101020 | | Pol3 |Plan11 | | |Coord2e|
3 | 907030 | | Pol | | Sub5a | Alt24 | |
4 | 805050 | | | | | | |
5 | 778050 | | |Plan88 | Sub7d | |Coord2 |
6 | 232520 | | | | | | |
7 | 357031 | | |Plan2d | Sub7e | | |
8 | ... | ... | ... | ... | ... | ... | ... |
+--------+-------+-------+-------+-------+-------+-------+
Using this formula, you can see whether the word "Plan" is in a cell or not; if it is not, you get an empty string. You can concatenate all of those into one cell and use a MATCH function on it (note that FIND is case-sensitive, so search for "Plan" to match Plan11 and the like):
=IF(ISERROR(FIND("Plan";D2));"";D2)
I have data with a large number of custom columns whose content I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, a count of how often each occurs, and the name of the column it came from.
------------------------------------------------
| columnname | value | count |
|------------|-----------------------|---------|
| evar1 | en-GB | 7654321 |
| evar1 | en-US | 1234567 |
| evar2 | www.myclient.com | 123 |
| evar2 | app.myclient.com | 456 |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read the data once per column (there are actually about 400 such columns):
import pyspark.sql.functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars is not None:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1
display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with potentially a large number of values. I need a simple way to get 3 columns that show the source column, the value, and the count of that value in the source column.
The first of the responses only gives me an approximation of the number of distinct values, which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-----------------------
| evar1 | evar2 | ... |
|---------------|-----|
| A | A | ... |
| B | A | ... |
| B | B | ... |
| B | B | ... |
| ...
Should result in the output
--------------------------------
| columnname | value | count |
|------------|-------|---------|
| evar1 | A | 1 |
| evar1 | B | 3 |
| evar2 | A | 2 |
| evar2 | B | 2 |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col
melt(
df.select([col(c).cast("string") for c in df.columns]),
id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
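For completeness, a sketch of the melt helper that snippet relies on, modelled on the linked answer (treat the exact signature as an assumption):

from pyspark.sql import DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One struct per value column: (column name, column value).
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Explode to one row per (column, value) pair, keeping the id columns.
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)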