Fill dataframe cell entries using dataframe column names and index - python-3.x

I am trying to fill a dataframe using the following approach:
1. generate an m x n dataframe;
2. column names for the dataframe are A to N and are read from a list passed to the method;
3. define the index for the dataframe;
4. fill the dataframe entries with column name + _ + index.
import numpy as np
import pandas as pd
from tabulate import tabulate

def generate_data(N_rows, N_cols, names_df=[]):
    if N_rows == 4:
        d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
        df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16, columns=names_df)
    else:
        print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
        df = None
    # df.loc[[], 'Z'] = 3
    return tabulate(df, headers='keys', tablefmt='psql')

a = generate_data(4, 2, ['A', 'B'])
print(a)
Out:
+---------+-----+-----+
| | A | B |
|---------+-----+-----|
| RU19-24 | 0 | 0 |
| RU13-18 | 0 | 0 |
| RU7-12 | 0 | 0 |
| RU1-6 | 0 | 0 |
+---------+-----+-----+
Is it possible to take the index and concatenate with the column names to get the following output ?
+---------+-------------+-------------+
| | A | B |
|---------+-------------+-------------|
| RU19-24 | A_RU19-24 | B_RU19-24 |
| RU13-18 | A_RU13-18 | B_RU13-18 |
| RU7-12 | A_RU7-12 | B_RU7-12 |
| RU1-6 | A_RU1-6 | B_RU1-6 |
+---------+-------------+-------------+

IIUC, you can use apply, which takes each column of the dataframe as a pd.Series with an index (the dataframe index) and a series name (the dataframe column header):
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df.apply(lambda x: x.name+'_'+x.index)
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
Or use np.add.outer:
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df_out = pd.DataFrame(np.add.outer(df.columns+'_',df.index).T, index=df.index, columns=df.columns)
df_out
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
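Either approach can be folded back into the question's generator; a minimal sketch (the helper name and signature below are my own, not from the question):

```python
import pandas as pd

# Hypothetical helper mirroring the question's generate_data
def generate_filled(index_labels, col_names):
    df = pd.DataFrame(index=index_labels, columns=col_names)
    # Each column passed to apply is a Series whose .name is the header and
    # whose .index is the frame index, so concatenating them fills every cell
    return df.apply(lambda col: col.name + '_' + col.index)

df = generate_filled(['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6'], ['A', 'B'])
```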

Related

How to get string value from object datatype column in python dataframe which have combination of datatype values in column

I want to get the rows from this dataframe which have a string value in the column "city".
| city | region
0 | 5 | 19
1 | Pakri | 37
2 | 9 | 26
3 | 452 | 59
4 | 66 | 2
5 | 226 | 19
The answer should look like the snippet below, which contains the row that has a string value in its first column:
| city | region
0 | Pakri | 37
You could use a boolean filter, using apply on the pandas DataFrame or Series.
Consider:
import pandas as pd
import numpy as np

# create a series
s = pd.Series([5, 100, 'string'], dtype=object)
# cast every value to string
s = s.astype(str)
# apply isdigit to flag purely numeric values
temp = s.apply(lambda x: x.isdigit())
# returns: (0, True) (1, True) (2, False)
# positions of the string values (take the array out of np.where's tuple)
index = np.where(temp == False)[0]
# select by index
s.loc[index]
# returns: (2, 'string')
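Applied to the question's dataframe, the same idea filters rows directly; a sketch with the sample data rebuilt from the question:

```python
import pandas as pd

# Sample frame mirroring the question's data
df = pd.DataFrame({'city': [5, 'Pakri', 9, 452, 66, 226],
                   'region': [19, 37, 26, 59, 2, 19]})

# Cast to str so isdigit works, then keep rows that are NOT purely digits
mask = ~df['city'].astype(str).str.isdigit()
result = df[mask]
```

Note that astype(str) makes float-like values such as "4.1" fail isdigit too, which is fine here since the numeric cities are integers.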

Merging series of pandas dataframe into single dataframe

I have a series of pandas data frames stored in the variable df, similar to below:
df
| 0 | 1 |
+-------+--------+
|ABCD | WXYZ |
| 0 | 1 |
+-------+--------+
|DEFJ | HJKL |
| 0 | 1 |
+-------+--------+
|ZXCT | WYOM |
| 0 | 1 |
+-------+--------+
|TYZX | NMEX |
I want to merge them to a single pandas data frame as below :
| 0 | 1 |
+-------+--------+
|ABCD | WXYZ |
|DEFJ | HJKL |
|ZXCT | WYOM |
|TYZX | NMEX |
So how can I merge a series of pandas dataframes into one single pandas dataframe?
As your code is now, you're only outputting one dataframe with one row (overwriting the others).
Try this:
# Copy the names to pandas dataframes and save them in a list
import pandas as pd

dfs = []
for j in range(0, 5):
    for i in divs[j].find_elements_by_tag_name('a'):
        i = i.get_attribute('text')
        i = parse_name(i)
        df = pd.DataFrame(i)
        df = df.transpose()
        dfs.append(df)

# Aggregate all dataframes into one
new_df = dfs[0]
for df in dfs[1:]:
    new_df = new_df.append(df)

# Update index
new_df = new_df.reset_index(drop=True)

# Print first five rows
new_df.head()
0 1
0 Lynn Batten Emeritus Professor
1 Andrzej Goscinski Emeritus Professor
2 Jemal Abawajy Professor
3 Maia Angelova Professor
4 Gleb Beliakov Professor
There are four ways to concat or merge dataframes; you may refer to this post.
These are the most common implementations:
import pandas as pd
df1 = pd.DataFrame({0: ['ABCD'], 1: ['WXYZ']})
df2 = pd.DataFrame({0:['DEFJ'], 1:['HJKL']})
df3 = pd.DataFrame({0:['ZXCT'], 1:['WYOM']})
...
df = pd.concat([df1, df2, df3], axis=0)
print(df.head())
or if you have a list of dataframes with the same headers you can try
dfs = [df1, df2, df3 ..]
df = pd.concat(dfs, axis=0)
and the simplest way is to just use df.append:
df = df.append(anotherdf)
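Be aware that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions pd.concat is the way to go; a sketch with the question's sample values:

```python
import pandas as pd

df1 = pd.DataFrame({0: ['ABCD'], 1: ['WXYZ']})
df2 = pd.DataFrame({0: ['DEFJ'], 1: ['HJKL']})
df3 = pd.DataFrame({0: ['ZXCT'], 1: ['WYOM']})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's 0
df = pd.concat([df1, df2, df3], ignore_index=True)
```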

Identify cells with characters other than numbers in dataframe

I have a dataframe with two columns named A and B. How can I fill column B so that a cell displays "text" if the corresponding cell in A contains anything other than a number (a comma-separated number counts as text too), and "number" when it is just a number?
(See the example below.)
You could do it using apply on column A:
import pandas as pd

data = [{'A': '4 tons'},
        {'A': '2.0*'},
        {'A': 4.1},
        {'A': 4.2},
        {'A': '4,2'},
        {'A': '6,3'}]
df = pd.DataFrame(data)

def checkType(x):
    try:
        # Try to convert the value into a float
        float(x)
        return 'number'
    except (TypeError, ValueError):
        # If conversion fails, it's text
        return 'text'

df['B'] = df.A.apply(checkType)
Output
| A | B |
|:-------|:-------|
| 4 tons | text |
| 2.0* | text |
| 4.1 | number |
| 4.2 | number |
| 4,2 | text |
| 6,3 | text |
What about this?
import pandas as pd

# making just column A
ls = ['4 tons', '4.0*', '4,0', '4.1', '5.2', '6,3']
df = pd.DataFrame()
df['A'] = ls
df['B'] = ls  # placeholder, overwritten below
# pivot column: drop the decimal point literally (regex=False), then test for digits only
df['C'] = df['A'].str.replace('.', '', regex=False)
df['C'] = df['C'].str.isdigit()
for i, elem in enumerate(df['C']):
    # use .loc instead of chained indexing, which is unreliable for assignment
    df.loc[i, 'B'] = 'number' if elem else 'text'
del df['C']  # delete pivot column
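The loop can also be avoided entirely; a loop-free sketch using pd.to_numeric, which with errors='coerce' turns anything non-numeric into NaN:

```python
import pandas as pd

df = pd.DataFrame({'A': ['4 tons', '2.0*', 4.1, 4.2, '4,2', '6,3']})

# Values that fail numeric conversion become NaN, so notna() marks the numbers
is_number = pd.to_numeric(df['A'], errors='coerce').notna()
df['B'] = is_number.map({True: 'number', False: 'text'})
```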

Groupby one dataframe based on tags in second dataframe

I have a dataframe (actually 105k dataframes) representing a SWOT table (it should be a 4x4 grid containing strings). The problem is that not all frames in the total set (105k) have the same shape, and the position of the SWOT elements also varies. My approach is to create a copy of each frame, then look for specific strings in the original frame and add tags to the copy, plus some forward filling.
Now I have the original frames and copies (same size, same indices and same column names), where the original frame has the strings I want to group and the copy is basically a mask (not boolean, but a mask of tags).
When looping over the whole set I merge the original frame with the copy based on the index and get a merged frame that has columns from both frames with _x and _y suffixes. How can I concatenate the strings in the _y columns that have the same tag in the corresponding _x column into a new dataframe?
+---+-------+-------------+-------+-------------+-----+-----+
| | 0_x | 0_y | 1_x | 1_y | n_x | n_y |
+---+-------+-------------+-------+-------------+-----+-----+
| 0 | tag_1 | some_string | tag_2 | some_string | | |
| 1 | tag_1 | some_string | tag_2 | some_string | | |
| 2 | tag_2 | some_string | tag_3 | some_string | | |
| n | tag_2 | some_string | tag_3 | some_string | | |
+---+-------+-------------+-------+-------------+-----+-----+
Here is my code so far. Basically it does what I want for pairs of columns (n_x, n_y), but it does not work if I have the same tag in more than one column (say in 0_x and 1_x). I could do a second iteration over columns a and b and do the same as in the first pass, but is there a more efficient way?
df_tmp (contains the tags) and df_orig (contains the strings):
df_merged = pd.merge(df_tmp, df_orig, left_index=True, right_index=True, validate="1:1")
df_merged = df_merged.applymap(str)
columns_list = df_merged.columns.to_list()
x_columns = [elem for elem in columns_list if elem.endswith("x")]
y_columns = [elem for elem in columns_list if elem.endswith("y")]
df_collect = pd.DataFrame()
for col_x, col_y in zip(x_columns, y_columns):
    df_final = df_merged.groupby([col_x])[col_y].apply(','.join).reset_index()
    df_final = df_final.rename(columns={col_x: 'a', col_y: 'b'})
    df_collect = df_collect.append(df_final, ignore_index=True)
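One way to avoid the per-column-pair loop (a sketch on toy stand-ins for df_tmp and df_orig; the sample tags and strings are mine): stacking both same-shaped frames flattens them into aligned 1-D Series, so a single groupby pools a tag's strings no matter which column the tag appears in.

```python
import pandas as pd

# Toy stand-ins for df_tmp (tags) and df_orig (strings); shapes and indices match
df_tmp = pd.DataFrame({0: ['tag_1', 'tag_1', 'tag_2'],
                       1: ['tag_2', 'tag_2', 'tag_3']})
df_orig = pd.DataFrame({0: ['s1', 's2', 's3'],
                        1: ['s4', 's5', 's6']})

# stack() turns each frame into a Series with a (row, column) MultiIndex,
# so tags and strings stay aligned cell by cell
tags = df_tmp.stack()
strings = df_orig.stack()

# Group the strings by their tag and concatenate within each group
result = strings.groupby(tags).apply(','.join)
```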

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame:
a | b | c |
1 | 2 | 4 |
0 | null | null|
null | 3 | 4 |
And I want to replace null values only in the first 2 columns - Column "a" and "b":
a | b | c |
1 | 2 | 4 |
0 | 0 | null|
0 | 3 | 4 |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
df.fillna(0, subset=['a', 'b'])
There is a parameter named subset to choose the columns (available unless your Spark version is lower than 1.3.1).
Use a dictionary to fill values of certain columns:
df.fillna( { 'a':0, 'b':0 } )
