I have a dataframe with two columns, A and B; B is a column of lists and A is a string. I want to search for a value in column B and get the corresponding value in column A. For example:
category zones
0 category_1 [zn_1, zn_2]
1 category_2 [zn_3]
2 category_3 [zn_4]
3 category_4 [zn_5, zn_6]
If the input = 'zn_1', how can I get a response back as 'category_1'?
Use str.contains to filter the rows, then take the category values:
inputvalue = 'zn_1'
df[df.zones.str.contains(inputvalue)]['category']
# If you don't want an array back, take the first match:
inputvalue = 'zn_1'
df[df.zones.str.contains(inputvalue)]['category'].iloc[0]
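Note that str.contains assumes the zones values are strings. If the column holds actual Python lists, str.contains returns NaN rather than booleans, so a row-wise membership test is safer; a minimal sketch:
inputvalue = 'zn_1'
# Test list membership row by row instead of substring matching
mask = df['zones'].apply(lambda zs: inputvalue in zs)
df.loc[mask, 'category'].iloc[0]  # -> 'category_1'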
Hello
I have 2 dataframes, one old and one new. After comparing the 2 dataframes, I want to generate output with the column names for each id and only the values that changed, as shown below.
I could merge the 2 dataframes and find the differences for each column separately, like this:
a=df1.merge(df2, on='Ids')
a[a['ColA_x'] != a['ColA_y']]
But I have 80 columns and I want to get the difference with column names and values as shown in the output. Is there a way to do this?
Stack each dataframe to convert the column names into a row index (this assumes 'Ids' is the index of both frames). Then concatenate the stacked dataframes side by side:
combined = pd.concat([df1.stack(), df2.stack()], axis=1)
Now, extract the rows with the values that do not match:
combined[combined[0] != combined[1]]
# 0 1
#Ids
#123 ColA AH AB
#234 ColB GO MO
#456 ColA GH AB
#...
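Putting it together, a minimal end-to-end sketch (assuming both frames carry an Ids column; the old/new labels are my own):
import pandas as pd

# Index both frames by Ids so stack() lines up the same cells
old = df1.set_index('Ids').stack()
new = df2.set_index('Ids').stack()

combined = pd.concat([old, new], axis=1, keys=['old', 'new'])
diff = combined[combined['old'] != combined['new']]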
I have a pyspark dataframe with only one record. It contains an id field and a "value" field. The value field contains nested dicts like the example record shown in inputdf below. I would like to create a new dataframe like outputdf below, where the type column holds the keys from the nested dict in the value field of inputdf, and the value and active columns contain the corresponding values from the nested dicts. If it's easier, the dataframe could be converted to a pandas dataframe using .toPandas(). Does anyone have a slick way to do this?
inputdf:
id value
1 {"soda":{"value":2,"active":1},"jet":{"value":0,"active":1}}
outputdf:
type value active
soda 2 1
jet 0 1
Let us try the following; notice that I also include the id column:
yourdf = pd.DataFrame(df.value.tolist(), index=df.id).stack().apply(pd.Series).reset_index()
Out[13]:
id level_1 value active
0 1 soda 2 1
1 1 jet 0 1
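Since the result here is a pandas frame (after .toPandas() if you start from Spark), you can rename the stacked key column to match the requested type column:
yourdf = yourdf.rename(columns={'level_1': 'type'})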
I need to create a line plot for the state called 'Kerala' in the column 'state/unionterritory' against the 'confirmed' column.
So far I have written:
sns.lineplot(x=my_data['state/unionterritory'], y=my_data['confirmed'])
[https://www.kaggle.com/essentialguy/exercise-final-project]
This is the dataframe.head(); see the column name.
I assume you want to get a column state/unionterritory from your DataFrame and filter it so it contains only Kerala state:
my_data_kerala = my_data[my_data['state/unionterritory'] == 'Kerala']['state/unionterritory']
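To draw the actual line plot, you would typically filter the whole frame to the Kerala rows and then plot confirmed against a time column; a sketch, assuming the dataset has a date column (the name date is my guess, check your dataframe.head()):
import seaborn as sns

# Keep only the Kerala rows, then plot confirmed cases over time
kerala = my_data[my_data['state/unionterritory'] == 'Kerala']
sns.lineplot(x=kerala['date'], y=kerala['confirmed'])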
I am trying to split a copy off of a Pandas dataframe starting after a certain column by header name.
So far, I've been able to manipulate the column headers or indexes according to a set number of known columns, like below. However, the number of columns will change, and I want to still extract every column that happens after.
In the below example, say I want to grab all columns after 'Tail' even if the 'Body' columns goes to column X. So the below sample with X number of Body columns:
df = pd.DataFrame({'Intro1': ['blah'], 'Intro2': ['blah'], 'Intro3': ['blah'],
                   'Body1': ['blah'], 'Body2': ['blah'], 'Body3': ['blah'],
                   'Body4': ['blah'], ... 'BodyX': ['blah'],
                   'Tail': ['blah'], 'OtherTail': ['blah'], 'StillAnotherTail': ['blah']})
Should produce a copy of the dataframe as:
dftail = pd.DataFrame({'Tail': ['blah'],'OtherTail': ['blah'],'StillAnotherTail': ['blah'],})
Ideally I'd like to find a way to combine the two techniques below so that the selection starts at 'Tail' and goes to the end of the dataframe:
dftail = [col for col in df if col.startswith('Tail')]
dftail = df.iloc[:, 164:] # column number (164) will change based on 'Tail' index number
How about this:
df_tail = df.iloc[:, list(df.columns).index("Tail"):]
df_tail then prints out:
Tail OtherTail StillAnotherTail
0 blah blah blah
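Two equivalent spellings worth knowing (both assume a column literally named 'Tail'): label-based slicing with .loc, and Index.get_loc instead of converting the Index to a list:
# Label-based slice: .loc includes the start label and runs to the end
df_tail = df.loc[:, 'Tail':]
# Or positional, using the Index's own lookup
df_tail = df.iloc[:, df.columns.get_loc('Tail'):]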
I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
vals = [
Row(ID=1,VAL=None),
Row(ID=2,VAL=None),
Row(ID=3,VAL=None),
Row(ID=4,VAL=None),
Row(ID=5,VAL=None),
Row(ID=6,VAL=None),
Row(ID=7,VAL=None),
Row(ID=8,VAL=None),
Row(ID=9,VAL=None),
Row(ID=10,VAL=None)
]
df = spark.createDataFrame(vals)
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket", and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: the ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code that you can map onto your solution.
Using a window function over a single partition, we can generate a row_number() sequence for each row in the dataframe and store it in, say, a column row_num.
Next, your "rules" can be represented as another small dataframe: [min_row_num, max_row_num, label].
All you need then is to join those two datasets on the row number, adding the new column:
(df1.alias('df1')
 .join(df2.alias('df2'),
       on=col('df1.row_num').between(col('df2.min_row_num'), col('df2.max_row_num')))
 .select('df1.*', 'df2.label'))
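A fuller sketch of the same idea (the rule boundaries and labels below are my own illustration; note that a window over a single partition pulls all rows onto one executor, which is fine for small data):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number every row 1..N; ordering by ID keeps the numbering deterministic
w = Window.orderBy('ID')
numbered = df.withColumn('row_num', row_number().over(w))

# The "rules": rows 1-3 -> "lets", 4-6 -> "bucket", 7-10 -> "this"
rules = spark.createDataFrame(
    [(1, 3, 'lets'), (4, 6, 'bucket'), (7, 10, 'this')],
    ['min_row_num', 'max_row_num', 'label'])

result = (numbered
          .join(rules, on=col('row_num').between(col('min_row_num'),
                                                 col('max_row_num')))
          .select('ID', col('label').alias('VAL')))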