Finding common rows between two dataframes based on a column using pandas - python-3.x

I have two dataframes. I need to extract rows based on common values in column 'a'. However, instead of creating one data frame at the end I want to retain the two data frames.
For example:
###Consider the following input
df1 = pd.DataFrame({'a':[0,1,1,2,3,4], 'b':['q','r','s','t','u','v'],'c':['a','b','c','d','e','f']})
df2 = pd.DataFrame({'a':[1,4,5,6], 'b':['qq','rr','ss','tt'],'c':[1,2,3,4]})
The expected output is:
###df1:
a. b. c
0. 1. r. a
1. 1. s. c
2. 4. v. f
###df2:
a. b. c
0. 1. qq 1
1. 4. rr 2
How can I achieve the following result? Insights will be appreciated.

You can generalize it with numpy's intersect1d
import numpy as np
intersection_arr = np.intersect1d(df1['a'], df2['a'])
df1 = df1.loc[df1['a'].isin(intersection_arr),:]
df2 = df2.loc[df2['a'].isin(intersection_arr),:]
More than two dataframes:
import numpy as np
from functools import reduce
intersection_arr = reduce(np.intersect1d, (df1['a'], df2['a'], df3['a']))
df1 = df1.loc[df1['a'].isin(intersection_arr),:]
df2 = df2.loc[df2['a'].isin(intersection_arr),:]
df3 = df3.loc[df3['a'].isin(intersection_arr),:]

df1 = df1[df1['a'].isin(df2['a'])].reset_index(drop=True)
df2 = df2[df2['a'].isin(df1['a'])].reset_index(drop=True)

Related

How to selecting multiple rows and take mean value based on name of the row

From this data frame I like to select rows with same concentration and also almost same name. For example, first three rows has same concentration and also same name except at the end of the name Dig_I, Dig_II, Dig_III. This 3 rows same with same concentration. I like to somehow select this three rows and take mean value of each column. After that I want to create a new data frame.
here is the whole data frame:
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new_df = df.groupby('concentration').mean()
Note: This will only find the averages for columns with dtype float or int... this will drop the img_name column and will take the averages of all columns...
This may be faster...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js").groupby('concentration').mean()
If you would like to preserve the img_name...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new = df.groupby('concentration').mean()
pd.merge(df, new, left_on = 'concentration', right_on = 'concentration', how = 'inner')
Does that help?

From dict to pandas dataframe as rows

I am sure i must be missing something basic here. But as far as I know you can create a dataframe from a dict with pd.DataFrame.from_dict(). But I am not sure how it can be set that key-values pairs in a dict can be put it as rows in a dataframe.
For instance, given this example
d = {'a':1,'b':2}
the desired output would be:
col1 col2
0 a 1
1 b 2
I know that the index might be a problem but that can be handle it with a simple index = [0]
Duplicate of Convert Python dict into a dataframe.
Simple answer for python 3.
import pandas as pd
d = {'a':1,'b':2, 'c':3}
df = pd.DataFrame(list(d.items()), columns = ['cola','colb'])
This code should help you.
d = {k: [l] for k, l in d.items()}
pd.DataFrame(d).T.reset_index().rename(columns={'index': 'col1', 0: 'col2'})

Change index column text in pandas

I have the following spreadsheet that I am bringing in to pandas:
Excel Spreadsheet
I import it with:
import pandas as pd
df = pd.read_excel("sessions.xlsx")
Jupyter shows it like this:
Panda Dataframe 1
I then transpose the dataframe with
df = df.T
Which results in this
Transposed DataFrame
At this stage how can I now change the text in the leftmost index column? I want to change the word Day to the word Service, but I am not sure how to address that cell/header. I can't refer to column 0 and change the header for that.
Likewise how could i then go on to change the A, B, C, D text which is now the index column?
You could first assign to the columns attribute, and then apply the transposition.
import pandas as pd
df = pd.read_excel("sessions.xlsx")
df.columns = ['Service','AA', 'BB', 'CC', 'DD']
df = df.T
Renaming the columns before transposing would work. To do exactly what you want, you can use the the rename function. In the documentation it also has a helpful example on how to rename the index.
Your example in full:
import pandas as pd
df = pd.read_excel("sessions.xlsx")
df = df.T
dict_rename = {'Day': 'Service'}
df.rename(index = dict_rename)
To extend this to more index values, you merely need to adjust the dict_rename argument before renaming.
Full sample:
import pandas as pd
df = pd.read_excel("sessions.xlsx")
df = df.T
dict_rename = {'Day': 'Service','A':'AA','B':'BB','C':'CC','D':'DD'}
df.rename(index = dict_rename)

Python - Pandas Dataframe with Multiple Names per Column

Is there a way in pandas to give the same column of a pandas dataframe two names, so that I can index the column by only one of the two names? Here is a quick example illustrating my problem:
import pandas as pd
index=['a','b','c','d']
# The list of tuples here is really just to
# somehow visualize my problem below:
columns = [('A','B'), ('C','D'),('E','F')]
df = pd.DataFrame(index=index, columns=columns)
# I can index like that:
df[('A','B')]
# But I would like to be able to index like this:
df[('A',*)] #error
df[(*,'B')] #error
You can create a multi-index column:
df.columns = pd.MultiIndex.from_tuples(df.columns)
Then you can do:
df.loc[:, ("A", slice(None))]
Or: df.loc[:, (slice(None), "B")]
Here slice(None) is equivalent to selecting all indices at the level, so (slice(None), "B") selects columns whose second level is B regardless of the first level names. This is semantically the same as :. Or write in pandas index slice way. df.loc[:, pd.IndexSlice[:, "B"]] for the second case.

Pyspark dataframe split json column values into top-level multiple columns

I have a json column which can contain any no of key:value pairs. I want to create new top level columns for these key:value pairs.
For Eg if I have this data
A B
"{\"C\":\"c\" , \"D\":\"d\"...}" b
This is the output that i want
B C D ...
b c d
There are few questions similar to splitting the coulmns into multiple columns but none are working in this case. Can Anyone please help. Thanks in Advance!
You are looking for org.apache.spark.sql.functions.from_json: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$#from_json(e:org.apache.spark.sql.Column,schema:String,options:java.util.Map[String,String]):org.apache.spark.sql.Column
Here's the python code commit related to SPARK-17699: https://github.com/apache/spark/commit/fe33121a53384811a8e094ab6c05dc85b7c7ca87
Sample Usage from commit:
>>> from pyspark.sql.types import *
>>> data = [(1, '''{"a": 1}''')]
>>> schema = StructType([StructField("a", IntegerType())])
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(from_json(df.value, schema).alias("json")).collect()
[Row(json=Row(a=1))]

Resources