How do I batch rename columns in pyspark efficiently? - apache-spark

I am trying to batch rename my columns in PySpark from:
'collect_list(Target_Met_1)[1]' --> 'AB11'
'collect_list(Target_Met_1)[2]' --> 'AB12'
'collect_list(Target_Met_2)[1]' --> 'AB21'
'collect_list(Target_Met_1)[150]' --> 'AB150'
How do I go about it programmatically? Right now, I can manually change the names using:
df.withColumnRenamed('collect_list(Target_Met_1)[1]', 'AB11')
But if I have 500 columns, it's not efficient. I realize that another way to rename them would be to use something like a udf, but I cannot figure out the best possible approach.
I have split the columns and that's not the problem. The problem is renaming the columns.

Never mind, figured it out. Essentially I had to use a list comprehension to rename the columns (I was splitting the columns as mentioned in the link above). Here is what did the trick:
df = df.select(
    '1', '2', '3',
    *[df[col][i].alias("AB" + str(i + 1) + col)
      for col in columns
      for i in range(max_dict[col])]
)

To rename all columns you can use the method toDF:
import re
df.toDF(*['AB' + ''.join(re.findall(r'\d+', i)) for i in df.columns])
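As a quick sanity check (using two of the column names from the question), the comprehension maps names to the AB pattern like this:
import re
cols = ['collect_list(Target_Met_1)[1]', 'collect_list(Target_Met_2)[1]']
print(['AB' + ''.join(re.findall(r'\d+', c)) for c in cols])
# prints ['AB11', 'AB21']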

Something like this can help too. It's a rename function similar to the Pandas rename functionality.
def rename_cols(map_dict):
    """
    Rename a bunch of columns in a data frame
    :param map_dict: Dictionary of old column names to new column names
    :return: Function for use in transform
    """
    def _rename_cols(df):
        for old, new in map_dict.items():
            df = df.withColumnRenamed(old, new)
        return df
    return _rename_cols
And you can use it like
spark_df.transform(rename_cols(dict(old1='new1', old2='new2', old3='new3')))
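Note that DataFrame.transform was only added to PySpark in 3.0. On older versions, a functools.reduce over the same mapping is a rough equivalent (a sketch, assuming the same old/new names):
from functools import reduce

map_dict = {'old1': 'new1', 'old2': 'new2', 'old3': 'new3'}
# Fold withColumnRenamed over each (old, new) pair in the dictionary
spark_df = reduce(lambda df, kv: df.withColumnRenamed(kv[0], kv[1]),
                  map_dict.items(), spark_df)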

Related

Better way to read columns from excel file as variables in Python

I have an .xlsx file with 5 columns (X, Y, Z, Row_Cog, Col_Cog) which will be in the same order each time. I would like to have each column as a variable in Python. I am implementing the method below, but would like to know if there is a better way to do it.
Also, I am writing the range manually (in the for loop), while I would like a robust way to know the length of each column in Excel (the number of rows) and assign it.
# READ THE TEST DATA from Excel file
import xlrd
workbook = xlrd.open_workbook(r"C:\Desktop\SawToothCalib\TestData.xlsx")
worksheet = workbook.sheet_by_index(0)
X_Test = []
Y_Test = []
Row_Test = []
Col_Test = []
for i in range(1, 29):
    x_val = worksheet.cell_value(i, 0)
    X_Test.append(x_val)
    y_val = worksheet.cell_value(i, 2)
    Y_Test.append(y_val)
    row_val = worksheet.cell_value(i, 3)
    Row_Test.append(row_val)
    col_val = worksheet.cell_value(i, 4)
    Col_Test.append(col_val)
Do you really need this package? You can easily do this kind of operation with pandas.
You can read your file as a DataFrame with:
import pandas as pd
df = pd.read_excel(path + 'file.xlsx', sheet_name=the_sheet_you_want)
and access the list of columns with df.columns. You can access each column with df['column name']. If there are empty entries, they are stored as NaN; you can count how many you have with df['column_name'].isnull().sum().
If you are uncomfortable with DataFrames, you can then convert the columns to lists or arrays, like
df['my_col'].tolist()
or
df['my_col'].to_numpy()
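Putting it together for the file in the question (a sketch, assuming the header row holds the column names X, Y, Z, Row_Cog, Col_Cog):
import pandas as pd

# Read the first sheet; the header row supplies the column names
df = pd.read_excel(r"C:\Desktop\SawToothCalib\TestData.xlsx", sheet_name=0)

# The row count comes from the file itself, so no hard-coded range is needed
X_Test = df['X'].tolist()
Y_Test = df['Y'].tolist()
Row_Test = df['Row_Cog'].tolist()
Col_Test = df['Col_Cog'].tolist()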

Iterate over 4 pandas data frame columns and store them into 4 lists with one for loop instead of 4 for loops

I am currently working with pandas structures in Python. I wrote a function that extracts data from a pandas data frame and stores it in lists. The code is working, but I feel like there is a part that I could write in one for loop instead of four for loops. I will give you an example below. The idea of this part of the code is to extract four columns from a pandas data frame into four lists. I did it with 4 separate for loops, but I want one loop that does the same thing.
col1, col2, col3, col4 = [], [], [], []
for j in abc['col1']:
    col1.append(j)
for k in abc['col2']:
    col2.append(k)
for l in abc['col3']:
    col3.append(l)
for n in abc['col4']:
    col4.append(n)
And my idea is to write one for loop that does all the work. I tried to do something like this, but it doesn't work:
col1, col2, col3, col4 = [], [], [], []
for j, k, l, n in abc[['col1', 'col2', 'col3', 'col4']]:
    col1.append(j)
    col2.append(k)
    col3.append(l)
    col4.append(n)
Can you help me with this idea to wrap the four for loops into one? I would appreciate your help!
You don't need to use loops at all; you can just convert each column into a list directly.
list_1 = df["col"].to_list()
Have a look at this previous question.
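Applied to the four columns from the question, that is one short line (a sketch, assuming abc is the dataframe):
col1, col2, col3, col4 = (abc[c].to_list() for c in ['col1', 'col2', 'col3', 'col4'])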
Treating a pandas dataframe like a list usually works, but is very bad for performance. I'd consider using the iterrows() function instead.
This would work as in the following example:
col1, col2, col3, col4 = [], [], [], []
for index, row in df.iterrows():
    col1.append(row['col1'])
    col2.append(row['col2'])
    col3.append(row['col3'])
    col4.append(row['col4'])
It's probably easier to use pandas .values and then numpy.ndarray.tolist():
col = ['col1', 'col2', 'col3']
data = [None] * len(col)
for i in range(len(col)):
    data[i] = df[col[i]].values.tolist()

replacing a special character in a pandas dataframe

I have a dataset that has '?' instead of 'NaN' for missing values. I could have gone through each column using replace, but the only problem is I have 22 columns. I am trying to create a loop to do it effectively, but I am getting it wrong. Here is what I am doing:
for col in adult.columns:
    if adult[col] == '?':
        adult[col] = adult[col].str.replace('?', 'NaN')
The plan is to use the 'NaN' with the fillna function, or to drop those rows with dropna. The second problem is that not all the columns are categorical, so the str function is also wrong. How can I easily deal with this situation?
If you're reading the data from a .csv or .xlsx file you can use the na_values parameter:
adult = pd.read_csv('path/to/file.csv', na_values=['?'])
Otherwise do what #MasonCaiby said and use adult.replace('?', float('nan'))
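For data that is already in memory, a short sketch of that replace-then-handle route (assuming numpy is available):
import numpy as np

adult = adult.replace('?', np.nan)  # works on every column, categorical or not
adult = adult.dropna()              # or adult.fillna(...) per column instead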

How to extract several dataframes from dictionary

I am currently trying to extract several dataframes from a dictionary. The problem is that the number of dataframes will vary; sometimes I'll have two dataframes in there and sometimes 30.
At the beginning I create a dictionary (dict_of_exceptions) from a dataframe (exceptions_df). In this dictionary I'll have several dataframes, depending on how many different 'Source Wells' I have. With the current code I can extract the first dataframe from the dictionary, which is j:
dict_of_exceptions = {k: v for k, v in exceptions_df.groupby('Source Well')}
print(dict_of_exceptions)
for k in dict_of_exceptions.keys():
    j = dict_of_exceptions[k]
Could someone help me modify the last line to go through the dictionary and extract each dataframe (and name them like the corresponding key)?
I think I get your intention, but I could not really read it from your code. Currently, as #cyrilb38 stated in the comments, your loop keeps overriding j, so you only ever see the result of the last iteration. Also, I think (I may be wrong) that what you are calling a dataframe here is really a row; rather than transforming, use the dataframe itself. Replacing the groupby object with a dict is not what you wanted, or it just prolongs the process for nothing.
If you want to see the info of Well X only, for example, try this:
exceptions_df[exceptions_df['Source Well'] == 'Well X']
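If you really do want one dataframe per well, iterating over the groupby directly avoids the intermediate dict entirely (a sketch, assuming exceptions_df from the question):
for well, well_df in exceptions_df.groupby('Source Well'):
    # well_df is the sub-dataframe for this 'Source Well' value
    print(well, len(well_df))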

How to save tuple output from a for loop to a DataFrame in Python

I have some data: 33k rows x 57 columns.
In some columns there is data which I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back to my data set.
I have a problem with saving the tuple output from the for loop.
I am using tuples to create a good translation; .join and .append are not working in my case. I have tried many things without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're trying to write to is only a view of the dataframe, not the dataframe itself. Plus, you're missing square brackets for your list comprehension, if I'm understanding what you're doing. So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
I have managed it; below is the working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done?
Maybe there is a faster way to do it?
I am doing it for 12 columns in one for loop, as shown above.
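For what it's worth, one likely faster route is to skip iterrows entirely and map each cell with apply (a sketch, assuming slownik maps single characters as in the question):
data['1st_service'] = data['1st_service'].apply(
    lambda s: tuple(slownik.get(znak) for znak in s)
)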
