Transforming multiple data frame columns into one series - python-3.x

I have a DataFrame df of shape (250, 3): 250 rows and three columns. I want to write a loop that merges the content of each column in my dataframe into one single Series df_single of 250 rows. The manual operation is the following:
df_single = df['colour']+" "+df['model']+" "+df['size']
How can I create df_single with a for loop, or non-manually?
I tried the following code, but it raises a TypeError:
df_conc = []
for var in cols:
    cat_list = df_code_part[var]
    df_conc = df_conc + " " + cat_list

TypeError: can only concatenate list (not "str") to list
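The loop fails because df_conc starts as a list, so + attempts list concatenation with a string. A minimal sketch of a working loop version, using hypothetical sample data in place of the 250-row frame:

import pandas as pd

# hypothetical sample data standing in for the real 250-row frame
df = pd.DataFrame({'colour': ['red', 'blue'],
                   'model': ['A1', 'B2'],
                   'size': ['S', 'L']})
cols = ['colour', 'model', 'size']

df_conc = pd.Series('', index=df.index)   # start from an empty string Series, not a list
for var in cols:
    df_conc = df_conc + df[var].astype(str) + ' '
df_single = df_conc.str.rstrip()          # drop the trailing space
print(df_single.tolist())                 # ['red A1 S', 'blue B2 L']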

If you only need to join the 3 columns, your solution is really good:
df_single = df['colour']+" "+df['model']+" "+df['size']
If you need a general solution for many columns, use DataFrame.astype to convert to strings if necessary, DataFrame.add to append a whitespace, sum to concatenate, and finally Series.str.rstrip to remove the trailing whitespace:
cols = ['colour','model','size']
df_single = df[cols].astype(str).add(' ').sum(axis=1).str.rstrip()
Or:
df_single = df[cols].astype(str).apply(' '.join, axis=1)
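For instance, with a hypothetical numeric size column, the astype(str) step is what makes the concatenation work:

import pandas as pd

df = pd.DataFrame({'colour': ['red', 'blue'],
                   'model': ['A1', 'B2'],
                   'size': [10, 12]})   # numeric column
cols = ['colour', 'model', 'size']
print(df[cols].astype(str).apply(' '.join, axis=1).tolist())
# ['red A1 10', 'blue B2 12']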

If you want spaces between the columns, run:
df.apply(' '.join, axis=1)
A plain df.sum(axis=1) also concatenates all the columns, but without spaces between them.
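A quick illustration of the difference, on hypothetical string columns:

import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': ['1', '2']})
print(df.sum(axis=1).tolist())               # ['x1', 'y2'] - no spaces
print(df.apply(' '.join, axis=1).tolist())   # ['x 1', 'y 2'] - with spaces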

If you want to concatenate all the columns, use:
df_single = df.astype(str).add(' ').sum(axis=1).str.rstrip()
If you don't want to include every column, select the ones you need first:
columns=['colour','model','size']
df_single=df[columns].astype(str).add(' ').sum(axis=1).str.rstrip()

Related

Pandas: Problem in Concatenation of three Data Frames

I am loading three .csv files into three different Pandas data frames. The number of rows in each file should be the same, but it is not. Sometimes one file contains four to five additional rows compared to another. Sometimes two files include two to three extra rows compared to the third. I'm concatenating the files row by row, so I'm removing these additional rows to make them the same length. I wrote the following code to accomplish this.
df_ch = pd.read_csv("./file1.csv")
df_wr = pd.read_csv("./file2.csv")
df_an = pd.read_csv("./file3.csv")
# here df_wr has fewer rows than df_ch and df_an, so drop rows from the other two frames (df_ch and df_an)
df_ch.drop(df_ch.index[274299:274301], inplace=True)
df_an.drop(df_an.index[274299], inplace=True)
I did it manually in the above code, and now I want to do it automatically. One method is to use if-else to check the length of all frames and make it equal to the shortest length frame. But I'm curious whether Pandas has a faster technique to compare these frames and delete the additional rows.
There are a couple of ways you can do this.
In general:
# take the intersection of all the indexes
common = df1.index.intersection(df2.index).intersection(df3.index)
pd.concat([df1, df2, df3], axis=1).reindex(common)
Or in your case, maybe just
min_rows = min(map(len, [df1, df2, df3]))
pd.concat([df1, df2, df3], axis=1).iloc[:min_rows]
Update: the best option is probably join:
df1.join(df2, how='inner').join(df3, how='inner')
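As a sketch, with hypothetical frames of unequal length, the inner joins keep only the rows whose index labels appear in all three:

import pandas as pd

# hypothetical frames standing in for the three CSVs
df1 = pd.DataFrame({'a': range(5)})
df2 = pd.DataFrame({'b': range(3)})   # the shortest frame
df3 = pd.DataFrame({'c': range(4)})

result = df1.join(df2, how='inner').join(df3, how='inner')
print(len(result))                # 3 - trimmed to the shortest frame
print(result.columns.tolist())    # ['a', 'b', 'c']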

Is there any python-Dataframe function in which I can iterate over rows of certain columns?

I want to solve this kind of problem in Python:
tran_df['bad_debt']=train_df.frame_apply(lambda x: 1 if (x['second_mortgage']!=0 and x['home_equity']!=0) else x['debt'])
I want to be able to create a new column and iterate over the rows of specific columns.
In Excel it's really easy; I did:
if(AND(col_name1<>0,col_name2<>0),1,col_name5)
Any help will be much appreciated.
To iterate over rows only for certain columns:
for rowIndex, row in df[['col1','col2']].iterrows():
    print(rowIndex, row['col1'], row['col2'])   # each row is a Series
To create a new column:
df['new'] = 0   # initialise as 0
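A small runnable sketch tying both fragments together, with hypothetical data:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
df['new'] = 0   # initialise the new column
for rowIndex, row in df[['col1', 'col2']].iterrows():
    df.loc[rowIndex, 'new'] = row['col1'] + row['col2']
print(df['new'].tolist())   # [4, 6]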
As a rule, iterating over rows in pandas is wrong. Use the np.where function from NumPy to select the right values for the rows:
import numpy as np

tran_df['bad_debt'] = np.where(
    (tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0),
    1, tran_df['debt'])
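For example, on hypothetical data the expression picks 1 only where both columns are non-zero:

import numpy as np
import pandas as pd

tran_df = pd.DataFrame({'second_mortgage': [0, 5000, 2000],
                        'home_equity':     [0, 3000, 0],
                        'debt':            [100, 200, 300]})
tran_df['bad_debt'] = np.where(
    (tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0),
    1, tran_df['debt'])
print(tran_df['bad_debt'].tolist())   # [100, 1, 300]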
First create a new column with an initial value, then use .loc to locate the rows that match a certain condition and assign the new value:
tran_df['bad_debt'] = tran_df['debt']
tran_df.loc[(tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0), 'bad_debt'] = 1
Or:
tran_df['bad_debt'] = 1
tran_df.loc[(tran_df['second_mortgage'] == 0) | (tran_df['home_equity'] == 0), 'bad_debt'] = tran_df['debt']
Remember to put round brackets around each condition joined by the bitwise operators (& and |).

Using value_counts() and filtering elements based on the number of instances

I use the following code to create two arrays in a histogram, one for the counts (percentages) and the other for values.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
So, an output looks like
counts = 66.7, 8.3, 8.3, 8.3, 8.3
values = 1024, 356352, 73728, 16384, 4096
The problem is that some values occur only once, and I would like to ignore them. In the example above, only 1024 is repeated multiple times; the others appear just once. I can manually check the number of occurrences in the row, see whether they are repeated, and ignore them.
df = row.value_counts(normalize=True).mul(100).round(1)
counts = df # contains percentages
values = df.keys().tolist()
for v in values:
    # N = get_number_of_instances in row
    # if N == 1
    #     remove v in row
I would like to know if there are other ways for that using the built-in functions in Pandas.
Some clarification was requested on your question in the comments above.
If keys is a column and you want to retain only the non-duplicated values, please try:
values = df.loc[~df['keys'].duplicated(keep=False), 'keys'].to_list()
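Alternatively, since the question starts from value_counts(), one could filter the raw counts first and re-normalize. A minimal sketch, with hypothetical data mirroring the output above:

import pandas as pd

row = pd.Series([1024] * 8 + [356352, 73728, 16384, 4096])

counts_raw = row.value_counts()                            # absolute counts per value
repeated = counts_raw[counts_raw > 1]                      # drop values seen only once
counts = repeated.div(repeated.sum()).mul(100).round(1)    # re-normalized percentages
values = counts.index.tolist()
print(values)   # [1024]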

Can I use pandas.DataFrame.apply() to make more than one column at once?

I have been given a function that takes the values in a row in a dataframe and returns a bunch of additional values.
It looks a bit like this:
my_func(row) -> (value1, value2, value3... valueN)
I'd like each of these values to be assigned to new columns in my dataframe. Can I use DataFrame.apply() to add multiple columns in one go, or do I have to add the columns one at a time?
It's obvious how I can use apply to generate one column at a time:
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"] = df.apply(axis=1, func=lambda row:(row.A + row.B))
df["Y"] = df.apply(axis=1, func=lambda row:(row.A - row.B))
But what if the two columns I am adding are something that are more easily calculated together? In this case, I already have a function that gives me everything I need in one shot. I'd rather not have to call it multiple times or add a load of caching.
Is there a syntax I can use that would allow me to use apply to generate 2 columns at the same time? Something like this, but less broken:
# Broken Pseudo-code
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"], df["Y"] = df.apply(axis=1, func=lambda row:(row.A + row.B, row.B-row.A))
What is the correct way to do something like this?
You can assign a list of column names, as long as apply expands the returned tuples into columns with result_type='expand':
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(2, 2)), columns=list('AB'))
df[["X", "Y"]] = df.apply(axis=1, result_type='expand',
                          func=lambda row: (row.A + row.B, row.B - row.A))
print(df)
   A  B   X  Y
0  2  8  10  6
1  4  3   7 -1
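If you'd rather not rely on result_type, an equivalent sketch computes the tuples once and unpacks them with zip:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(2, 2)), columns=list('AB'))
# apply returns a Series of tuples; zip(*...) transposes it into two sequences
df['X'], df['Y'] = zip(*df.apply(axis=1, func=lambda row: (row.A + row.B, row.B - row.A)))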

how to remove rows with 3 letters or fewer?

I have a PySpark data frame with many rows; each row is a text, and there is just one column. I want to delete the rows with 3 letters or fewer. For example, of the following 4 rows I want to remove the second and the third ('pdf' and 'a'):
this is a text
pdf
a
No ways
You can filter using the length of the column:
df2 = df.filter('length(col) > 3')
If spaces matter, you can remove them first:
df2 = df.filter("length(replace(col, ' ', '')) > 3")
