Set 3 levels of column names in pandas DataFrame

I'm trying to build a frame with the following structure:
h/a                           totals
sub1          sub2            sub1          sub2
a b ... f     g .... m        a b ... f     g .... m
That is, 2 labels for the first level, again 2 labels for the second one, and then a third level of column names where sub1 and sub2 don't share the same column names.
To do so I did the following:
columnas=pd.MultiIndex.from_product([['h/a','totals'],['means','percentages'],
[('means','a'),('means','b'),....('percentage','g'),....],
names=['data level 1','data level 2','data level 3']])
data=[data,pata,......]
newframe=pd.DataFrame(data,columns=columnas)
What I get is this error:
>ValueError: Shape of passed values is (1, 21), indices imply (84, 21)
How can I fix this to have a multi leveled frame by column names?
Thank you

I think you need MultiIndex.from_tuples, with the tuples built by list comprehensions:
L1 = list('abc')
L2 = list('ghi')
tups = ([('h/a','means', x) for x in L1] +
[('h/a','percentage', x) for x in L2] +
[('totals','means', x) for x in L1] +
[('totals','percentage', x) for x in L2])
columnas=pd.MultiIndex.from_tuples(tups, names=['data level 1','data level 2','data level 3'])
print (columnas)
MultiIndex(levels=[['h/a', 'totals'],
['means', 'percentage'],
['a', 'b', 'c', 'g', 'h', 'i']],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]],
names=['data level 1', 'data level 2', 'data level 3'])
#some random data
np.random.seed(785)
data = np.random.randint(10, size=(3, 12))
print (data)
[[8 0 4 1 2 5 4 1 4 1 1 8]
[1 5 0 7 4 8 4 1 3 8 0 2]
[5 9 4 9 4 6 3 7 0 5 2 1]]
newframe=pd.DataFrame(data,columns=columnas)
print (newframe)
data level 1   h/a                        totals
data level 2 means       percentage        means       percentage
data level 3     a  b  c          g  h  i      a  b  c          g  h  i
0                8  0  4          1  2  5      4  1  4          1  1  8
1                1  5  0          7  4  8      4  1  3          8  0  2
2                5  9  4          9  4  6      3  7  0          5  2  1
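A likely reason the original attempt failed is that MultiIndex.from_product builds every combination of the lists it is given, so the number of columns is the product of the list lengths, while from_tuples contains only the combinations you list. The sketch below compares the two; the level values are the illustrative ones from this answer, not the asker's real column names.

```python
import pandas as pd

level1 = ['h/a', 'totals']
level2 = ['means', 'percentage']
level3 = list('abcghi')  # illustrative third-level names

# from_product: full Cartesian product, 2 * 2 * 6 combinations
full = pd.MultiIndex.from_product([level1, level2, level3])

# from_tuples: only the combinations that are wanted
# ('means' gets a, b, c and 'percentage' gets g, h, i)
subsets = {'means': 'abc', 'percentage': 'ghi'}
tups = [(l1, l2, x) for l1 in level1 for l2 in level2 for x in subsets[l2]]
partial = pd.MultiIndex.from_tuples(tups)

print(len(full), len(partial))
```

The data passed to pd.DataFrame must have exactly as many columns as the index has entries, which is why the product version tends to raise a shape error.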

Related

Print a groupby object for a specific group/groups only

I need to print the result of groupby object in Python for a specific group/groups only.
Below is the dataframe:
import pandas as pd
df = pd.DataFrame({'ID' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
'Entry' : [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6]})
print("\n df = \n",df)
To group the dataframe by ID and print the result, I used this code:
grouped_by_unit = df.groupby(by="ID")
print("\n", grouped_by_unit.apply(print))
Can somebody please let me know below two things:
How can I print the data frame grouped by 'ID=1' only?
I need to get the below output:
Likewise, how can I print the data frame grouped by 'ID=1' and 'ID=4' together?
I need to get the below output:
You can iterate over the groups, for example with a for loop:
grouped_by_unit = df.groupby(by="ID")
for id_, g in grouped_by_unit:
    if id_ == 1 or id_ == 4:
        print(g)
        print()
Prints:
   ID  Entry
0   1      1
1   1      2
2   1      3
3   1      4

    ID  Entry
12   4      1
13   4      2
14   4      3
15   4      4
16   4      5
17   4      6
You can use the get_group method:
df.groupby(by="ID").get_group(1)
which prints
   ID  Entry
0   1      1
1   1      2
2   1      3
3   1      4
You can use the same method to print the group for the key 4.
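To show groups 1 and 4 together, one possible approach is to call get_group once per key and concatenate the results. This is a sketch using the question's dataframe; the names wanted, selected and filtered are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
                   'Entry': [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6]})

grouped = df.groupby(by="ID")
wanted = [1, 4]  # hypothetical list of group keys to show

# One get_group call per key, stacked back into a single frame
selected = pd.concat([grouped.get_group(key) for key in wanted])
print(selected)

# If only the rows matter, a boolean filter gives the same frame without groupby
filtered = df[df["ID"].isin(wanted)]
```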

How to solve diagonally constraint sudoku?

Write a program to solve a Sudoku puzzle by filling in the empty cells, where 0 represents an empty cell.
Rules:
Every number must appear exactly once on the diagonal running from top-left to bottom-right.
Every number must appear exactly once on the diagonal running from top-right to bottom-left.
Every number must appear exactly once in each 3*3 sub-grid.
However, a number can repeat within a row or column.
Constraints:
N = 9; where N represent rows and column of grid.
What I have tried:
N = 9
def printing(arr):
    for i in range(N):
        for j in range(N):
            print(arr[i][j], end=" ")
        print()
def isSafe(grid, row, col, num):
    for x in range(9):
        if grid[x][x] == num:
            return False
    cl = 0
    for y in range(8, -1, -1):
        if grid[cl][y] == num:
            cl += 1
            return False
    startRow = row - row % 3
    startCol = col - col % 3
    for i in range(3):
        for j in range(3):
            if grid[i + startRow][j + startCol] == num:
                return False
    return True
def solveSudoku(grid, row, col):
    if (row == N - 1 and col == N):
        return True
    if col == N:
        row += 1
        col = 0
    if grid[row][col] > 0:
        return solveSudoku(grid, row, col + 1)
    for num in range(1, N + 1, 1):
        if isSafe(grid, row, col, num):
            grid[row][col] = num
            print(grid)
            if solveSudoku(grid, row, col + 1):
                return True
        grid[row][col] = 0
    return False
if (solveSudoku(grid, 0, 0)):
    printing(grid)
else:
    print("Solution does not exist")
Input:
grid =
[
[0, 3, 7, 0, 4, 2, 0, 2, 0],
[5, 0, 6, 1, 0, 0, 0, 0, 7],
[0, 0, 2, 0, 0, 0, 5, 0, 0],
[2, 8, 3, 0, 0, 0, 0, 0, 0],
[0, 5, 0, 0, 7, 1, 2, 0, 7],
[0, 0, 0, 3, 0, 0, 0, 0, 3],
[7, 0, 0, 0, 0, 6, 0, 5, 0],
[0, 2, 3, 0, 3, 0, 7, 4, 2],
[0, 5, 0, 0, 8, 0, 0, 0, 0]
]
Output:
8 3 7 0 4 2 1 2 4
5 0 6 1 8 6 3 6 7
4 1 2 7 3 5 5 0 8
2 8 3 5 4 8 6 0 5
6 5 0 0 7 1 2 1 7
1 4 7 3 2 6 8 4 3
7 8 1 4 1 6 3 5 0
6 2 3 7 3 5 7 4 2
0 5 4 2 8 0 6 8 1
Basically, I am stuck on checking that the numbers along each diagonal are distinct. So the question is: how can I make sure that no element is repeated on those diagonals or within a sub-grid?
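One possible direction, offered as a sketch and not a verified full solution: apply each diagonal check only when the cell being filled actually lies on that diagonal (row == col for the main one, row + col == N - 1 for the other), and walk the anti-diagonal with a single index. The function name is_safe and the small demo grid are illustrative assumptions.

```python
N = 9

def is_safe(grid, row, col, num):
    # Main diagonal: only constrains cells where row == col
    if row == col and any(grid[i][i] == num for i in range(N)):
        return False
    # Anti-diagonal: only constrains cells where row + col == N - 1
    if row + col == N - 1 and any(grid[i][N - 1 - i] == num for i in range(N)):
        return False
    # 3x3 sub-grid containing (row, col)
    start_row, start_col = row - row % 3, col - col % 3
    for i in range(3):
        for j in range(3):
            if grid[start_row + i][start_col + j] == num:
                return False
    return True

# Tiny demo on an otherwise empty grid
demo = [[0] * N for _ in range(N)]
demo[0][0] = 5   # on the main diagonal
demo[0][8] = 7   # on the anti-diagonal
print(is_safe(demo, 4, 4, 5), is_safe(demo, 4, 3, 5))
```

With this shape, a cell off both diagonals is only limited by its sub-grid, which matches the rule that numbers may repeat in rows and columns.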

Creating dataframe with multi level column index from four 2d numpy arrays

I have four 2d numpy arrays:
import numpy as np
import pandas as pd
x1 = np.array([[2, 4, 1],
[2, 2, 1],
[1, 3, 3],
[2, 2, 1],
[3, 3, 2]])
x2 = np.array([[1, 2, 2],
[4, 1, 4],
[1, 4, 4],
[3, 3, 2],
[2, 2, 4]])
x3 = np.array([[4, 3, 2],
[4, 3, 2],
[4, 3, 3],
[1, 2, 2],
[1, 4, 3]])
x4 = np.array([[3, 1, 1],
[3, 4, 3],
[2, 2, 1],
[2, 1, 1],
[1, 2, 4]])
And I would like to create a dataframe as following:
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])
df = pd.DataFrame(np.concatenate((x1,x2,x3,x4),axis=1), columns=header)
df.index.name = 'Time'
df
Data in this DataFrame is not in the desired form.
I want the four columns (x1, x2, x3, x4) under the first level-1 label (location1) to be built from the first column of each numpy array. The next four columns, under location2, should be built from the second column of each array, and so on. The number of level-1 labels, len(level_1_label), equals the number of columns in each of the four 2d arrays.
Desired DataFrame:
One option is to reverse the order in creating the MultiIndex column (since level_1_label corresponds to the columns and level_2_label corresponds to the arrays); then swaplevel + sort_index (to get it in the desired order) after building the DataFrame:
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_2_label, level_1_label], names=['Variable','Location'])
df = pd.DataFrame(np.concatenate((x1,x2,x3,x4),axis=1), columns=header).swaplevel(axis=1).sort_index(level=0, axis=1)
df.index.name = 'Time'
Output:
Location location1          location2          location3
Variable        x1 x2 x3 x4        x1 x2 x3 x4        x1 x2 x3 x4
Time
0                2  1  4  3         4  2  3  1         1  2  2  1
1                2  4  4  3         2  1  3  4         1  4  2  3
2                1  1  4  2         3  4  3  2         3  4  3  1
3                2  3  1  2         2  3  2  1         1  2  2  1
4                3  2  1  1         3  2  4  2         2  4  3  4
One option is to reshape the data in Fortran order, before creating the dataframe:
# reusing your code
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])
# for 2d arrays, np.vstack is a convenience wrapper around np.concatenate with axis=0
outcome = np.reshape(np.vstack([x1,x2,x3,x4]), (len(x1), -1), order = 'F')
df = pd.DataFrame(outcome, columns = header)
df.index.name = 'Time'
df
Location location1          location2          location3
Variable        x1 x2 x3 x4        x1 x2 x3 x4        x1 x2 x3 x4
Time
0                2  1  4  3         4  2  3  1         1  2  2  1
1                2  4  4  3         2  1  3  4         1  4  2  3
2                1  1  4  2         3  4  3  2         3  4  3  1
3                2  3  1  2         2  3  2  1         1  2  2  1
4                3  2  1  1         3  2  4  2         2  4  3  4
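Another way to get the same column order, shown here only as a sketch, is to stack the arrays along a new last axis so that each (time, location) cell holds the four variables side by side, and then flatten the last two axes. The arrays are the ones from the question; the name stacked is made up.

```python
import numpy as np
import pandas as pd

x1 = np.array([[2, 4, 1], [2, 2, 1], [1, 3, 3], [2, 2, 1], [3, 3, 2]])
x2 = np.array([[1, 2, 2], [4, 1, 4], [1, 4, 4], [3, 3, 2], [2, 2, 4]])
x3 = np.array([[4, 3, 2], [4, 3, 2], [4, 3, 3], [1, 2, 2], [1, 4, 3]])
x4 = np.array([[3, 1, 1], [3, 4, 3], [2, 2, 1], [2, 1, 1], [1, 2, 4]])

level_1_label = ['location1', 'location2', 'location3']
level_2_label = ['x1', 'x2', 'x3', 'x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label],
                                    names=['Location', 'Variable'])

# stacked[t, loc, var] == x_var[t, loc]; shape is (5, 3, 4)
stacked = np.stack([x1, x2, x3, x4], axis=2)

# Default C-order reshape keeps the variables of one location adjacent
outcome = stacked.reshape(len(x1), -1)

df = pd.DataFrame(outcome, columns=header)
df.index.name = 'Time'
print(df)
```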

vectorize groupby pandas

I have a dataframe like this:
day time category count
1 1 a 13
1 2 a 47
1 3 a 1
1 5 a 2
1 6 a 4
2 7 a 14
2 2 a 10
2 1 a 9
2 4 a 2
2 6 a 1
I want to group by day and category and get a vector of the counts per time, where time can be between 1 and 10. I have the max and min of time defined in two variables called max and min.
This is how I want the resulting dataframe to look:
day category count
1 a [13,47,1,0,2,4,0,0,0,0]
2 a [9,10,0,2,0,1,14,0,0,0]
Does anyone know how to turn this aggregation into a vector?
Use reindex with MultiIndex.from_product to add the missing time values, then groupby and aggregate with list:
df = df.set_index(['day','time', 'category'])
a = df.index.levels[0]
b = range(1,11)
c = df.index.levels[2]
df = df.reindex(pd.MultiIndex.from_product([a,b,c], names=df.index.names), fill_value=0)
df = df.groupby(['day','category'])['count'].apply(list).reset_index()
print (df)
   day category                             count
0    1        a  [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1    2        a  [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
EDIT:
df = (df.set_index(['day','time', 'category'])['count']
.unstack(1, fill_value=0)
.reindex(columns=range(1,11), fill_value=0))
print (df)
time          1   2   3   4   5   6   7   8   9   10
day category
1   a         13  47   1   0   2   4   0   0   0   0
2   a          9  10   0   2   0   1  14   0   0   0
df = df.apply(list, 1).reset_index(name='count')
print (df)
day ... count
0 1 ... [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 ... [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
[2 rows x 3 columns]
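For reference, here is a self-contained sketch of the unstack approach that rebuilds the question's data and takes the time range from two variables. The names t_min and t_max are stand-ins for the asker's min and max variables (renamed so they don't shadow the builtins), and the final step uses to_numpy().tolist() as one possible way to turn each row into a list.

```python
import pandas as pd

df = pd.DataFrame({
    "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "time": [1, 2, 3, 5, 6, 7, 2, 1, 4, 6],
    "category": ["a"] * 10,
    "count": [13, 47, 1, 2, 4, 14, 10, 9, 2, 1],
})
t_min, t_max = 1, 10  # assumed bounds of the time axis

# One row per (day, category), one column per time slot, zeros where missing
wide = (df.set_index(["day", "category", "time"])["count"]
          .unstack("time", fill_value=0)
          .reindex(columns=range(t_min, t_max + 1), fill_value=0))

# Collapse each row of the wide table into a single list
result = pd.DataFrame({"count": wide.to_numpy().tolist()},
                      index=wide.index).reset_index()
print(result)
```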

How to extract data using python from a text file

I have been having trouble reading, manipulating and extracting data from a txt file. The text file has a general header with various information, set up something like the example below:
~ECOLOGY
~LOCATION
LAT: 59
LONG: 23
~PARAMETERS
Area. 8
Distribution. 3
Diversity. 5
~DATA X Y CONF DECID PEREN
3 6 1 3 0
7 2 4 2 1
4 8 0 6 2
9 9 6 2 0
2 3 2 5 4
6 5 0 2 7
7 1 2 4 2
I want to extract the column headers and use them as an index or key, since the types of column data can change between files and the number of rows can fluctuate as well. I want to read the data in each column so that, depending on location, I can sum columns as shown below and export the result as a separate file:
~DATA X Y CONF DECID PEREN TOTAL
3 6 1 3 0 4
7 2 4 2 1 7
4 8 0 6 2 8
9 9 6 2 0 8
2 3 2 5 4 11
6 5 0 2 7 9
7 1 2 4 2 8
Any suggestions?
This is what I have so far:
with open("ECOLOGY.txt", "r") as E:
    for i, line in enumerate(E):
        sep_lines = line.rsplit()
        if "~DATA" in sep_lines:
            key = line.rsplit()
            key.remove('~DATA')
            for j, value in enumerate(key):
                print(j, value)
            print(key)
            # map each header name to its column position
            col_index = {L: v for v, L in enumerate(key)}
            print(col_index)
Life would be much easier for you if you learned a smidgen of Pandas. But you can do it without.
with open('ttl.txt') as ttl:
    for _ in range(10):
        next(ttl)
    first = True
    for line in ttl:
        line = line.rstrip()
        if first:
            first = False
            labels = line.split()+['TOTAL']
            fmt = 7*'{:<9s}'
            print (fmt.format(*labels))
        else:
            numbers = [int(_) for _ in line.split()]
            total = sum(numbers[-3:])
            other_items = numbers + [total]
            fmt = 6*'{:<9d}'
            fmt = '{:<9s}'+fmt
            print (fmt.format('', *other_items))
~DATA    X        Y        CONF     DECID    PEREN    TOTAL
         3        6        1        3        0        4
         7        2        4        2        1        7
         4        8        0        6        2        8
         9        9        6        2        0        8
         2        3        2        5        4        11
         6        5        0        2        7        9
         7        1        2        4        2        8
next skips lines in the input file. You can use split() to split input lines on whitespace, then use formatting to put the items back together as you want them.
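For comparison, the Pandas route mentioned above could look roughly like this. It is a sketch that assumes the layout from the question (a line starting with ~DATA holding the column names, followed by whitespace-separated integer rows) and reads from an in-memory string so it runs as-is; with a real file you would read its lines instead.

```python
import io
import pandas as pd

text = """~ECOLOGY
~LOCATION
LAT: 59
LONG: 23
~PARAMETERS
Area. 8
Distribution. 3
Diversity. 5
~DATA X Y CONF DECID PEREN
3 6 1 3 0
7 2 4 2 1
4 8 0 6 2
9 9 6 2 0
2 3 2 5 4
6 5 0 2 7
7 1 2 4 2
"""

lines = text.splitlines()
# Locate the header line instead of hard-coding how many lines to skip
start = next(i for i, line in enumerate(lines) if line.startswith("~DATA"))
names = lines[start].split()[1:]  # drop the '~DATA' marker

body = "\n".join(lines[start + 1:])
df = pd.read_csv(io.StringIO(body), sep=r"\s+", names=names)

# Columns are addressed by header name, so their order can vary between files
df["TOTAL"] = df[["CONF", "DECID", "PEREN"]].sum(axis=1)
print(df)
```

Writing the result out is then a single call such as df.to_csv with a path and separator of your choice.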
This is a very basic, fragile, format-dependent solution, but I hope it helps.
with open("test.txt") as f:
    columns = []
    data_part_reached = False
    for line in f:
        if "~DATA" in line:
            # one list per column, starting with its header
            columns = [[elem] for elem in line.split() if elem != "~DATA"]
            data_part_reached = True
        elif data_part_reached and line.strip():
            values = [int(elem) for elem in line.split()]
            for i in range(len(columns)):
                columns[i].append(values[i])
columns =
[['X', 3, 7, 4, 9, 2, 6, 7],
['Y', 6, 2, 8, 9, 3, 5, 1],
['CONF', 1, 4, 0, 6, 2, 0, 2],
['DECID', 3, 2, 6, 2, 5, 2, 4],
['PEREN', 0, 1, 2, 0, 4, 7, 2],
['TOTAL', 4, 7, 8, 8, 11, 9, 8]]
This will get you a list of lists where the first element of each list is the header and the rest are the values. I cast the values to int since you said you want to do arithmetic with them. If you want, you can turn this list into a dict where each key is a header and each value is the list of values in that column, like this:
d = {}
for column in columns:
    d[column.pop(0)] = column
d =
{'DECID': [3, 2, 6, 2, 5, 2, 4],
'PEREN': [0, 1, 2, 0, 4, 7, 2],
'CONF': [1, 4, 0, 6, 2, 0, 2],
'X': [3, 7, 4, 9, 2, 6, 7],
'TOTAL': [4, 7, 8, 8, 11, 9, 8],
'Y': [6, 2, 8, 9, 3, 5, 1]}
Create an empty dictionary to store all the needed data.
Read from the file object as E and loop until you reach a line starting with ~DATA.
Then split the header items, append TOTAL and break from the loop.
Create a list to store the remaining data.
Loop over the remaining lines, split each one and append its sum as the total.
Append each row's list to the data list.
After the loop ends, add the list of lists to the dictionary.
dic = {}
with open("ECOLOGY.txt") as E:
    for line in E:
        if line[:5] == '~DATA':
            dic['header'] = line.split()[1:] + ['TOTAL']
            break
    data = []
    for line in E:
        cols = line.split()
        cols.append(sum([int(num) for num in cols[2:]]))
        data.append(cols)
    dic['data'] = data
The resulting dictionary has the form {'header': [...], 'data': [[...], ...]}
edit: Added missing dic declaration at the beginning of code.
