Iterate through rows and columns in Excel using pandas (Python 3)

I have an excel spreadsheet that I read with this code:
df=pd.ExcelFile('/Users/xxx/Documents/Python/table.xlsx')
ccg=df.parse("CCG")
The sheet that I want inside the spreadsheet is called CCG.
The sheet looks like this:
  col1  col2  col3
x    a     1     2
x    b     3     4
x    c     5     6
x    d     7     8
x    a     9    10
x    b    11    12
x    c    13    14
y    a    15    16
y    b    17    18
y    c    19    20
y    d    21    22
y    a    23    24
How would I write code that gets the values of col2 and col3 for rows that contain both a and x? So the proposed output for this table would be: col2 = [1, 9], col3 = [2, 10]

Try this:
df = pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', 'CCG', index_col=0, usecols=['col1','col2']) \
.query("index == 'x' and col1 == 'a'")
Demo:
Excel file:
In [243]: fn = r'C:\Temp\.data\41718085.xlsx'
In [244]: pd.read_excel(fn, 'CCG', index_col=0, usecols=['col1','col2']) \
.query("index == 'x' and col1 == 'a'")
Out[244]:
col1 col2
x a 1
x a 9

You can do:
df = pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', sheet_name='CCG', index_col=0)
filter = df[(df.index == 'x') & (df.col1 == 'a')]
Then, from here, you can pull the values for each column with:
filter['col2']
filter['col3']
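If you want plain lists or NumPy arrays rather than pandas Series, a small follow-up sketch (assuming the filter frame above):
col2_vals = filter['col2'].tolist()   # plain Python list
col3_vals = filter['col3'].values     # NumPy array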

Managed to create a count that iterates through the rows, adds 1 to the count when it finds an a, and only appends the row index to the list if it falls within the range of rows where x appears; once I have the indices, I search through col2 and col3 and pull out the values at those indices.
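A rough sketch of that approach, assuming the sheet was read with the x/y column as the index (e.g. index_col=0) and the columns are named as in the example:
# collect the positions of rows whose index is 'x' and whose col1 is 'a'
indices = []
for pos, (idx, row) in enumerate(ccg.iterrows()):
    if idx == 'x' and row['col1'] == 'a':
        indices.append(pos)
# pull col2/col3 for those positions -> [1, 9] and [2, 10] for the example data
col2_vals = [ccg['col2'].iloc[pos] for pos in indices]
col3_vals = [ccg['col3'].iloc[pos] for pos in indices]
The same result falls out of the boolean-mask answers above in one step: ccg[(ccg.index == 'x') & (ccg['col1'] == 'a')].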

Related

Python Pandas select rows in numpy array on first columns

I have a dataframe like this:
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6],'C':[7,8,9],'D':[10,11,12]})
and a list, here arr, that may vary in length like this:
arr = np.array([[1,4],[2,6]])
arr = np.array([[2,5,8], [1,5,8]])
And I would like to get all rows in df that match the first elements in arr, like the following:
for x in arr:
    df[df.iloc[:, :len(x)].eq(x).all(1)]
Thanks guys!
IIUC, you can convert the array to a DataFrame and use merge:
arr = np.array([[1,4],[2,6],[2,5]])
df.merge(pd.DataFrame(arr, columns = df.iloc[:,:arr.shape[1]].columns))
A B C D
0 1 4 7 10
1 2 5 8 11
This solution will handle arrays of different shapes (as long as shape[1] of arr <= shape[1] of df)
arr = np.array([[2,5,8], [1,5,8], [3,6,9]])
df.merge(pd.DataFrame(arr, columns = df.iloc[:,:arr.shape[1]].columns))
A B C D
0 2 5 8 11
1 3 6 9 12

Two new columns based on a function that returns two values in DataFrame apply

I have a DataFrame:
Num
1
2
3
def foo(x):
    return x**2, x**3
When I did df['sq','cube'] = df['num'].apply(foo)
It is making a single column like below:
num (sq,cub)
1 (1,1)
2 (4,8)
3 (9,27)
I want these columns separated, each with its own values:
num sq cub
1 1 1
2 4 8
3 9 27
How can I achieve this...?
obj = df['num'].apply(foo)
df['sq'] = obj.str[0]
df['cube'] = obj.str[1]
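An alternative sketch that assigns both columns in one step (assuming the same foo and 'num' column as above):
# unzip the Series of (square, cube) tuples into two columns
df['sq'], df['cube'] = zip(*df['num'].apply(foo))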

How to explode/split a nested list, inside a list inside a pandas dataframe column and make separate columns out of them?

I have a dataframe. I want to split the Options column into id, AUD, and ud.
id col1 col2 Options
1 A B [{'id':25,'X': {'AUD': None, 'ud':0}}]
2 C D [{'id':27,'X': {'AUD': None, 'ud':0}}]
3 E F [{'id':28,'X': {'AUD': None, 'ud':0}}]
4 G H [{'id':29,'X': {'AUD': None, 'ud':0}}]
Expected output dataframe:
id col1 col2 id Aud ud
1 A B 25 None 0
2 C D 27 None 0
3 E F 28 None 0
4 G H 29 None 0
How do you go about it using Python 3.6 and a pandas DataFrame?
Use a list comprehension with json_normalize to get DataFrames, then join them together with concat; DataFrame.add_prefix is added to avoid duplicate column names:
from pandas.io.json import json_normalize
import ast
L = [json_normalize(x) for x in df.pop('Options')]
#if strings instead dicts
#L = [json_normalize(ast.literal_eval(x)) for x in df.pop('Options')]
df = df.join(pd.concat(L, ignore_index=True, sort=False).add_prefix('opt_'))
print (df)
id col1 col2 opt_id opt_X.AUD opt_X.ud
0 1 A B 25 None 0
1 2 C D 27 None 0
2 3 E F 28 None 0
3 4 G H 29 None 0
Another solution that extracts the X values from the nested dictionaries:
L = [{k: v for y in ast.literal_eval(x) for k, v in {**y.pop('X'), **y}.items()}
for x in df.pop('Options')]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('opt_'))
print (df)
id col1 col2 opt_AUD opt_ud opt_id
0 1 A B None 0 25
1 2 C D None 0 27
2 3 E F None 0 28
3 4 G H None 0 29
Try this:
for dit in df['Options'].iteritems():
    df.loc[dit[0], 'id'] = dit[1][0]['id']
    df.loc[dit[0], 'Aud'] = dit[1][0]['X']['AUD']
    df.loc[dit[0], 'ud'] = dit[1][0]['X']['ud']
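A small follow-up on that loop (a hedged note, using the question's frame): the original Options column is still present afterwards, so to get closer to the expected output you would drop it, and note that df.loc[..., 'id'] writes into the already existing id column rather than adding a second one:
df = df.drop(columns=['Options'])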

Pandas Aggregate data other than a specific value in specific column

I have data like this in a pandas DataFrame (Python):
df = pd.DataFrame({
    'ID': range(1, 8),
    'Type': list('XXYYZZZ'),
    'Value': [2, 3, 2, 9, 6, 1, 4]
})
The output that I want to generate keeps each Y row of the Type column unaggregated while aggregating the rest. How can I generate these results using a pandas DataFrame?
First filter the values by boolean indexing, aggregate, append the filtered-out rows, and finally sort:
mask = df['Type'] == 'Y'
df1 = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID':'first', 'Value':'sum'})
                .append(df[mask])
                .sort_values('ID'))
print (df1)
ID Type Value
0 1 X 5
2 3 Y 2
3 4 Y 9
1 5 Z 11
If you want a range from 1 to the length of the data for the ID column:
mask = df['Type'] == 'Y'
df1 = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID':'first', 'Value':'sum'})
                .append(df[mask])
                .sort_values('ID')
                .assign(ID = lambda x: np.arange(1, len(x) + 1)))
print (df1)
ID Type Value
0 1 X 5
2 2 Y 2
3 3 Y 9
1 4 Z 11
Another idea is to create a helper column with unique values only for the Y rows and aggregate by both columns:
mask = df['Type'] == 'Y'
df['g'] = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type','g'], as_index=False)
         .agg({'ID':'first', 'Value':'sum'})
         .drop('g', axis=1)[['ID','Type','Value']])
print (df1)
ID Type Value
0 1 X 5
1 3 Y 2
2 4 Y 9
3 5 Z 11
A similar alternative with a helper g passed directly to groupby, so the drop is not necessary:
mask = df['Type'] == 'Y'
g = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type', g], as_index=False)
         .agg({'ID':'first', 'Value':'sum'})[['ID','Type','Value']])

select the first n largest groups from grouped data frames

Data frame (df) structure:
col1 col2
x 3131
y 9647
y 9648
z 9217
y 9652
x 23
grouping:
grouped = df.groupby('col1')
I want to select the first 2 largest groups, i.e.,
y 9647
y 9648
y 9652
and
x 3131
x 23
How can I do that using pandas? I've achieved it using lists, but that makes it clumsy again, as it becomes a list of tuples and I have to convert them back to DataFrames.
Use value_counts to rank the groups, take the top index values, and filter the rows with isin in boolean indexing:
df1 = df[df['col1'].isin(df['col1'].value_counts().index[:2])]
print (df1)
col1 col2
0 x 3131
1 y 9647
2 y 9648
4 y 9652
5 x 23
If you need separate DataFrames for the top groups, use a dictionary comprehension with enumerate:
dfs = {i: df[df['col1'].eq(x)] for i, x in enumerate(df['col1'].value_counts().index[:2], 1)}
print (dfs)
{1: col1 col2
1 y 9647
2 y 9648
4 y 9652, 2: col1 col2
0 x 3131
5 x 23}
print (dfs[1])
col1 col2
1 y 9647
2 y 9648
4 y 9652
print (dfs[2])
col1 col2
0 x 3131
5 x 23
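An essentially equivalent sketch using groupby directly (assuming the df above), which also gives one DataFrame per top group without repeated boolean indexing:
top2 = df['col1'].value_counts().nlargest(2).index
df1 = df[df['col1'].isin(top2)]
dfs = {name: g for name, g in df.groupby('col1') if name in top2}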
