I want to get the rows from this dataframe which have a string value in the column "city".
  | city  | region
0 | 5     | 19
1 | Pakri | 37
2 | 9     | 26
3 | 452   | 59
4 | 66    | 2
5 | 226   | 19
The answer should look like the snippet below, containing only the rows that have a string value in the "city" column:
  | city  | region
0 | Pakri | 37
You could build a boolean filter by using apply on the pandas DataFrame or Series.
Consider:
import pandas as pd
import numpy as np

# create a Series of mixed types
s = pd.Series([5, 100, 'string'], dtype=object)

# cast every element to string
s = s.astype(str)

# apply str.isdigit to flag purely numeric values
temp = s.apply(lambda x: x.isdigit())
# returns:
# 0     True
# 1     True
# 2    False

# positions of the non-numeric (string) values
index = np.where(temp == False)[0]

# select by index
s.loc[index]
# returns:
# 2    string
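Applied to the dataframe in the question, a minimal sketch (assuming the column is literally named "city" and may hold mixed types):
import pandas as pd

df = pd.DataFrame({'city': [5, 'Pakri', 9, 452, 66, 226],
                   'region': [19, 37, 26, 59, 2, 19]})

# keep only the rows whose "city" value is not purely numeric after casting to str
mask = df['city'].astype(str).apply(lambda x: not x.isdigit())
print(df[mask])
# returns:
#     city  region
# 1  Pakri      37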
I have a series of pandas dataframes stored in a variable df, similar to the below:
df
| 0 | 1 |
+-------+--------+
|ABCD | WXYZ |
| 0 | 1 |
+-------+--------+
|DEFJ | HJKL |
| 0 | 1 |
+-------+--------+
|ZXCT | WYOM |
| 0 | 1 |
+-------+--------+
|TYZX | NMEX |
I want to merge them into a single pandas dataframe as below:
| 0 | 1 |
+-------+--------+
|ABCD | WXYZ |
|DEFJ | HJKL |
|ZXCT | WYOM |
|TYZX | NMEX |
So how can I merge a series of pandas dataframes into one single pandas dataframe?
As your code is now, you're only outputting one dataframe with a single row (overwriting the others).
Try this:
# Copy the names to pandas dataframes and save them in a list
import pandas as pd

dfs = []
for j in range(0, 5):
    for i in divs[j].find_elements_by_tag_name('a'):
        i = i.get_attribute('text')
        i = parse_name(i)
        df = pd.DataFrame(i)
        df = df.transpose()
        dfs.append(df)

# Aggregate all dataframes into one
new_df = dfs[0]
for df in dfs[1:]:
    new_df = new_df.append(df)

# Update the index
new_df = new_df.reset_index(drop=True)

# Print the first five rows
new_df.head()
0 1
0 Lynn Batten Emeritus Professor
1 Andrzej Goscinski Emeritus Professor
2 Jemal Abawajy Professor
3 Maia Angelova Professor
4 Gleb Beliakov Professor
There are four ways to concat or merge dataframes; you may refer to this post.
These are the most common implementations:
import pandas as pd

df1 = pd.DataFrame({0: ['ABCD'], 1: ['WXYZ']})
df2 = pd.DataFrame({0: ['DEFJ'], 1: ['HJKL']})
df3 = pd.DataFrame({0: ['ZXCT'], 1: ['WYOM']})
...
df = pd.concat([df1, df2, df3], axis=0)
print(df.head())
Or, if you have a list of dataframes with the same headers, you can try:
dfs = [df1, df2, df3 ..]
df = pd.concat(dfs, axis=0)
And the simplest way is to just use df.append:
df = df.append(anotherdf)
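If you also want the continuous 0..n index shown in the desired output, you can pass ignore_index=True to pd.concat; a small sketch reusing the frames defined above:
df = pd.concat([df1, df2, df3], ignore_index=True)
print(df)
# returns:
#       0     1
# 0  ABCD  WXYZ
# 1  DEFJ  HJKL
# 2  ZXCT  WYOM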
I have a Spark dataframe that has an ID column and, along with other columns, an array column that contains the IDs of its related records as its value.
An example dataframe would be:
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam | [789,999]
789 | marc | [111]
555 | dan | [333]
From the above, I need to append all the related child IDs to the array column of the parent ID. The resultant DF should look like:
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456,789,999,111]
345 | alen | [789,111]
456 | sam | [789,999,111]
789 | marc | [111]
555 | dan | [333]
I need help on how to do it. Thanks.
One way to handle this task is to do a self left-join, update RELATED_IDLIST, and repeat for several iterations until some condition is satisfied (this works only when the max depth of the whole hierarchy is small). For Spark 2.3, we can convert the ArrayType column into a comma-delimited StringType column, use the SQL builtin function find_in_set, and add a new column PROCESSED_IDLIST to set up the join conditions; see below for the main logic:
Functions:
from pyspark.sql import functions as F
import pandas as pd

# define a function which takes a dataframe as input, does a self left-join and then returns another
# dataframe with exactly the same schema as the input dataframe; repeat until some condition is satisfied
def recursive_join(d, max_iter=10):
    # inner function to find direct child-IDs and merge them into RELATED_IDLIST
    def find_child_idlist(_df):
        return _df.alias('d1').join(
            _df.alias('d2'),
            F.expr("find_in_set(d2.ID,d1.RELATED_IDLIST)>0 AND find_in_set(d2.ID,d1.PROCESSED_IDLIST)<1"),
            "left"
        ).groupby("d1.ID", "d1.NAME").agg(
            F.expr("""
                /* combine d1.RELATED_IDLIST with all matched entries from collect_list(d2.RELATED_IDLIST)
                 * and remove the trailing comma left over when all d2.RELATED_IDLIST are NULL */
                trim(TRAILING ',' FROM
                    concat_ws(",", first(d1.RELATED_IDLIST), concat_ws(",", collect_list(d2.RELATED_IDLIST)))
                ) as RELATED_IDLIST"""),
            F.expr("first(d1.RELATED_IDLIST) as PROCESSED_IDLIST")
        )

    # below is the main code logic
    d = find_child_idlist(d).persist()
    if (d.filter("RELATED_IDLIST!=PROCESSED_IDLIST").count() > 0) & (max_iter > 1):
        d = recursive_join(d, max_iter-1)
    return d

# define a pandas_udf to remove duplicates from an ArrayType column
get_uniq = F.pandas_udf(lambda s: pd.Series([list(set(x)) for x in s]), "array<int>")
Where:
In the function find_child_idlist(), the left-join must satisfy the following two conditions:
d2.ID is in d1.RELATED_IDLIST: find_in_set(d2.ID,d1.RELATED_IDLIST)>0
d2.ID is not in d1.PROCESSED_IDLIST: find_in_set(d2.ID,d1.PROCESSED_IDLIST)<1
The recursion in recursive_join() stops when no row satisfies RELATED_IDLIST != PROCESSED_IDLIST, or when max_iter drops to 1.
Processing:
Set up the dataframe:
df = spark.createDataFrame([
(123, "mike", [345,456]), (345, "alen", [789]), (456, "sam", [789,999]),
(789, "marc", [111]), (555, "dan", [333])
],["ID", "NAME", "RELATED_IDLIST"])
Add a new column PROCESSED_IDLIST to save RELATED_IDLIST from the previous join, and run recursive_join():
df1 = df.withColumn('RELATED_IDLIST', F.concat_ws(',','RELATED_IDLIST')) \
.withColumn('PROCESSED_IDLIST', F.col('ID'))
df_new = recursive_join(df1, 5)
df_new.show(10,0)
+---+----+-----------------------+-----------------------+
|ID |NAME|RELATED_IDLIST |PROCESSED_IDLIST |
+---+----+-----------------------+-----------------------+
|555|dan |333 |333 |
|789|marc|111 |111 |
|345|alen|789,111 |789,111 |
|123|mike|345,456,789,789,999,111|345,456,789,789,999,111|
|456|sam |789,999,111 |789,999,111 |
+---+----+-----------------------+-----------------------+
Split RELATED_IDLIST into an array of integers and then use the pandas_udf function to drop duplicate array elements:
df_new.withColumn("RELATED_IDLIST", get_uniq(F.split('RELATED_IDLIST', ',').cast('array<int>'))).show(10,0)
+---+----+-------------------------+-----------------------+
|ID |NAME|RELATED_IDLIST |PROCESSED_IDLIST |
+---+----+-------------------------+-----------------------+
|555|dan |[333] |333 |
|789|marc|[111] |111 |
|345|alen|[789, 111] |789,111 |
|123|mike|[999, 456, 111, 789, 345]|345,456,789,789,999,111|
|456|sam |[111, 789, 999] |789,999,111 |
+---+----+-------------------------+-----------------------+
I have a dataframe (actually 105k dataframes) representing a SWOT table (it should be a 4x4 grid containing strings). The problem is that not all frames in the total set (105k) have the same shape, and the position of the SWOT elements also varies. My approach is to create a copy of each frame, then look for specific strings in the original frame and add tags to the copy, plus some forward filling.
Now I have the original frames and copies (same size, same indices and same column names), where the original frame has the strings I want to group and the copy is basically a mask (not boolean, but a mask of tags).
When looping over the whole set I merge the original frame with the copy based on the index and get a merged frame that has columns from both frames with _x and _y suffixes. How can I concatenate the strings in the _y columns that have the same tag in the _x columns into a new dataframe?
+---+-------+-------------+-------+-------------+-----+-----+
| | 0_x | 0_y | 1_x | 1_y | n_x | n_y |
+---+-------+-------------+-------+-------------+-----+-----+
| 0 | tag_1 | some_string | tag_2 | some_string | | |
| 1 | tag_1 | some_string | tag_2 | some_string | | |
| 2 | tag_2 | some_string | tag_3 | some_string | | |
| n | tag_2 | some_string | tag_3 | some_string | | |
+---+-------+-------------+-------+-------------+-----+-----+
Here is my code so far. Basically it does what I want for pairs of columns (n_x, n_y), but it does not work if I have the same tag in more than one column (say in 0_x and 1_x). I could do a second iteration over columns a and b and do the same as in the first case, but is there a more efficient way?
df_tmp (contains the tags) and df_orig (contains the strings):
df_merged = pd.merge(df_tmp, df_orig, left_index=True, right_index=True, validate="1:1")
df_merged = df_merged.applymap(str)
columns_list = df_merged.columns.to_list()
x_columns = [elem for elem in columns_list if elem.endswith("x")]
y_columns = [elem for elem in columns_list if elem.endswith("y")]
df_collect = pd.DataFrame()
for col_x, col_y in zip(x_columns, y_columns):
    df_final = df_merged.groupby([col_x])[col_y].apply(','.join).reset_index()
    df_final = df_final.rename(columns={col_x: 'a', col_y: 'b'})
    df_collect = df_collect.append(df_final, ignore_index=True)
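One possible direction (not from the original post) is to reshape every (n_x, n_y) pair into two long columns first and group only once, so a tag appearing in several _x columns still ends up in a single group. A rough sketch, assuming df_merged, x_columns and y_columns from the snippet above:
# stack every (tag, string) column pair into two long columns 'a' and 'b'
pairs = [
    df_merged[[col_x, col_y]].rename(columns={col_x: 'a', col_y: 'b'})
    for col_x, col_y in zip(x_columns, y_columns)
]
long_df = pd.concat(pairs, ignore_index=True)

# a single groupby now joins the strings sharing a tag across all columns
df_collect = long_df.groupby('a')['b'].apply(','.join).reset_index()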
I try to fill a dataframe using the following approach:
I generate an m x n size dataframe.
Column names for the dataframe are A to N and are read from a list passed to the method.
I define the index for the dataframe.
I fill the dataframe entries with column name + "_" + index.
import numpy as np
import pandas as pd
from tabulate import tabulate

def generate_data(N_rows, N_cols, names_df=[]):
    if N_rows == 4:
        d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
        df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16, columns=names_df)
    else:
        print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
        df = None
    # df.loc[[],'Z'] = 3
    return tabulate(df, headers='keys', tablefmt='psql')
a = generate_data(4,2, ['A', 'B'])
print(a)
Out:
+---------+-----+-----+
| | A | B |
|---------+-----+-----|
| RU19-24 | 0 | 0 |
| RU13-18 | 0 | 0 |
| RU7-12 | 0 | 0 |
| RU1-6 | 0 | 0 |
+---------+-----+-----+
Is it possible to take the index and concatenate it with the column names to get the following output?
+---------+-------------+-------------+
| | A | B |
|---------+-------------+-------------|
| RU19-24 | A_RU19-24 | B_RU19-24 |
| RU13-18 | A_RU13-18 | B_RU13-18 |
| RU7-12 | A_RU7-12 | B_RU7-12 |
| RU1-6 | A_RU1-6 | B_RU1-6 |
+---------+-------------+-------------+
IIUC, you can use apply, which takes each column of the dataframe as a pd.Series with an index (the dataframe index) and a series name (the dataframe column header):
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df.apply(lambda x: x.name+'_'+x.index)
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
Or use np.add.outer:
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df_out = pd.DataFrame(np.add.outer(df.columns+'_',df.index).T, index=df.index, columns=df.columns)
df_out
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
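For completeness, a minimal sketch (assuming the question's generate_data signature) of how the apply approach could slot in, replacing the zero-filled entries before tabulating:
import numpy as np
import pandas as pd
from tabulate import tabulate

def generate_data(N_rows, N_cols, names_df=[]):
    if N_rows != 4:
        print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
        return None
    d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
    df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16, columns=names_df)
    # overwrite each column with "<column name>_<index label>"
    df = df.apply(lambda x: x.name + '_' + x.index)
    return tabulate(df, headers='keys', tablefmt='psql')

print(generate_data(4, 2, ['A', 'B']))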