I have a dataframe with two columns, A and B; B is a column of lists and A is a string. I want to search for a value in column B and get the corresponding value in column A. For example:
category zones
0 category_1 [zn_1, zn_2]
1 category_2 [zn_3]
2 category_3 [zn_4]
3 category_4 [zn_5, zn_6]
If the input = 'zn_1', how can I get a response back as 'category_1'?
Use str.contains to filter the rows, then take the category values:
inputvalue = 'zn_1'
df[df.zones.str.contains(inputvalue)]['category']
# If you don't want an array back, take the first match:
inputvalue = 'zn_1'
df[df.zones.str.contains(inputvalue)]['category'].iloc[0]
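Note that str.contains assumes the zones values are strings. If the column holds actual Python lists, str.contains returns NaN rather than booleans, so a row-wise membership test is safer; a minimal sketch:
inputvalue = 'zn_1'
# Test list membership row by row instead of substring matching
mask = df['zones'].apply(lambda zs: inputvalue in zs)
df.loc[mask, 'category'].iloc[0]  # -> 'category_1'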
Hello
I have 2 dataframes, one old and one new. After comparing the 2 dataframes, I want to generate output with the column names for each id and only the values that changed, as shown below.
I could merge the 2 dataframes and find the differences for each column separately, like this:
a=df1.merge(df2, on='Ids')
a[a['ColA_x'] != a['ColA_y']]
But I have 80 columns and I want to get the difference with column names and values as shown in the output. Is there a way to do this?
Stack each dataframe to convert the column names into a row index (this assumes 'Ids' is the index of both frames). Then concatenate the stacked dataframes side by side:
combined = pd.concat([df1.stack(), df2.stack()], axis=1)
Now, extract the rows with the values that do not match:
combined[combined[0] != combined[1]]
# 0 1
#Ids
#123 ColA AH AB
#234 ColB GO MO
#456 ColA GH AB
#...
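Putting it together, a minimal end-to-end sketch (assuming both frames carry an Ids column; the old/new labels are my own):
import pandas as pd

# Index both frames by Ids so stack() lines up the same cells
old = df1.set_index('Ids').stack()
new = df2.set_index('Ids').stack()

combined = pd.concat([old, new], axis=1, keys=['old', 'new'])
diff = combined[combined['old'] != combined['new']]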
I have a pyspark dataframe with only one record. It contains an id field and a "value" field. The value field contains nested dicts like the example record shown in inputdf below. I would like to create a new dataframe like outputdf below, where the type column holds the keys from the nested dict in the value field of inputdf, and the value and active columns contain the corresponding values from the nested dicts. If it's easier, the dataframe could be converted to a pandas dataframe using .toPandas(). Does anyone have a slick way to do this?
inputdf:
id value
1 {"soda":{"value":2,"active":1},"jet":{"value":0,"active":1}}
outputdf:
type value active
soda 2 1
jet 0 1
Let us try the following; notice that I also include the id column:
yourdf = pd.DataFrame(df.value.tolist(), index=df.id).stack().apply(pd.Series).reset_index()
Out[13]:
id level_1 value active
0 1 soda 2 1
1 1 jet 0 1
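Since the result here is a pandas frame (after .toPandas() if you start from Spark), you can rename the stacked key column to match the requested type column:
yourdf = yourdf.rename(columns={'level_1': 'type'})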
I need to create a line plot for the state called 'Kerala' in the column 'state/unionterritory' against the 'confirmed' column.
So far I have written:
sns.lineplot(x=my_data['state/unionterritory'], y=my_data['confirmed'])
[https://www.kaggle.com/essentialguy/exercise-final-project]
This is the dataframe.head(); see the column name.
I assume you want to get a column state/unionterritory from your DataFrame and filter it so it contains only Kerala state:
my_data_kerala = my_data[my_data['state/unionterritory'] == 'Kerala']['state/unionterritory']
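To draw the actual line plot, you would typically filter the whole frame to the Kerala rows and then plot confirmed against a time column; a sketch, assuming the dataset has a date column (the name date is my guess, check your dataframe.head()):
import seaborn as sns

# Keep only the Kerala rows, then plot confirmed cases over time
kerala = my_data[my_data['state/unionterritory'] == 'Kerala']
sns.lineplot(x=kerala['date'], y=kerala['confirmed'])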
I am trying to split a copy off of a Pandas dataframe starting after a certain column by header name.
So far, I've been able to manipulate the column headers or indexes according to a set number of known columns, like below. However, the number of columns will change, and I want to still extract every column that happens after.
In the below example, say I want to grab all columns after 'Tail' even if the 'Body' columns goes to column X. So the below sample with X number of Body columns:
df = pd.DataFrame({'Intro1': ['blah'], 'Intro2': ['blah'], 'Intro3': ['blah'],
                   'Body1': ['blah'], 'Body2': ['blah'], 'Body3': ['blah'],
                   'Body4': ['blah'], ... 'BodyX': ['blah'],
                   'Tail': ['blah'], 'OtherTail': ['blah'], 'StillAnotherTail': ['blah']})
Should produce a copy of the dataframe as:
dftail = pd.DataFrame({'Tail': ['blah'],'OtherTail': ['blah'],'StillAnotherTail': ['blah'],})
Ideally I'd like to find a way to combine the two techniques below so that the selection starts at 'Tail' and goes to the end of the dataframe:
dftail = [col for col in df if col.startswith('Tail')]
dftail = df.iloc[:, 164:] # column number (164) will change based on 'Tail' index number
How about this:
df_tail = df.iloc[:, list(df.columns).index("Tail"):]
df_tail then prints out:
Tail OtherTail StillAnotherTail
0 blah blah blah
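Two equivalent spellings worth knowing (both assume a column literally named 'Tail'): label-based slicing with .loc, and Index.get_loc instead of converting the Index to a list:
# Label-based slice: .loc includes the start label and runs to the end
df_tail = df.loc[:, 'Tail':]
# Or positional, using the Index's own lookup
df_tail = df.iloc[:, df.columns.get_loc('Tail'):]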
I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
vals = [
Row(ID=1,VAL=None),
Row(ID=2,VAL=None),
Row(ID=3,VAL=None),
Row(ID=4,VAL=None),
Row(ID=5,VAL=None),
Row(ID=6,VAL=None),
Row(ID=7,VAL=None),
Row(ID=8,VAL=None),
Row(ID=9,VAL=None),
Row(ID=10,VAL=None)
]
df = spark.createDataFrame(vals)
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket", and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: the ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code that you can map onto your solution.
Using a window function over a single partition, we can generate a row_number() sequence for each row in the dataframe and store it in, say, a column row_num.
Next, your "rules" can be represented as another small dataframe: [min_row_num, max_row_num, label].
All you need then is to join those two datasets on the row number, adding the new column:
(df1.alias('df1')
 .join(df2.alias('df2'),
       on=col('df1.row_num').between(col('df2.min_row_num'), col('df2.max_row_num')))
 .select('df1.*', 'df2.label'))
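A fuller sketch of the same idea (the rule boundaries and labels below are my own illustration; note that a window over a single partition pulls all rows onto one executor, which is fine for small data):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number every row 1..N; ordering by ID keeps the numbering deterministic
w = Window.orderBy('ID')
numbered = df.withColumn('row_num', row_number().over(w))

# The "rules": rows 1-3 -> "lets", 4-6 -> "bucket", 7-10 -> "this"
rules = spark.createDataFrame(
    [(1, 3, 'lets'), (4, 6, 'bucket'), (7, 10, 'this')],
    ['min_row_num', 'max_row_num', 'label'])

result = (numbered
          .join(rules, on=col('row_num').between(col('min_row_num'),
                                                 col('max_row_num')))
          .select('ID', col('label').alias('VAL')))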