Problems with creating a stacked bar plot in Matplotlib - python-3.x

I have 5 lists with 7 items each:
A=[0, 2, 0, 154018, 0, 0, 0]
B=[0, 22, 0, 153882, 0, 0, 0]
C=[0, 17, 0, 152901, 0, 0, 0]
D=[0, 14, 4, 154302, 0, 0, 0]
E=[0, 18, 8, 155052, 0, 0, 0]
I would like to get a stacked bar plot for each station.
I tried the code below but I am getting an error:
df = pd.DataFrame([['Stn1', A], ['Stn2', B], ['Stn3', C],
                   ['Stn4', D], ['Stn5', E]],
                  columns=['RU_26', 'RU_52', 'RU_106', 'RU_242',
                           'RU_484', 'RU_996', 'RU_2x996'])
Could someone help with this?
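For what it's worth, the likely cause of the error is that each row ['Stn1', A] has only two elements while seven column names are supplied. A minimal sketch (my own guess at the intent) of one way to build the frame, with the station names as the index and the seven values as the columns, and draw the stacked bars:
import pandas as pd
import matplotlib.pyplot as plt

# one row per station, one column per RU bucket
df = pd.DataFrame([A, B, C, D, E],
                  index=['Stn1', 'Stn2', 'Stn3', 'Stn4', 'Stn5'],
                  columns=['RU_26', 'RU_52', 'RU_106', 'RU_242',
                           'RU_484', 'RU_996', 'RU_2x996'])

# pandas draws one bar per index entry and stacks the columns on top of each other
df.plot(kind='bar', stacked=True)
plt.show()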

Related

How to get important words using LGBM feature importance and Tfidf vectorizer?

I am working on a Kaggle dataset where the task is to predict the price of an item using its description and other attributes. Here is the link to the competition. As part of an experiment, I am currently only using an item's description to predict its price. The description is free text, and I use sklearn's Tfidf vectorizer with bi-grams and max features set to 60000 as input to a LightGBM model.
After training, I would like to know the most influential tokens for predicting the price. I assumed LightGBM's feature_importance method would give me this. It returns a 60000-dim numpy array, whose indices I can use to retrieve the tokens from the Tfidf vectorizer's vocab dictionary.
Here is the code:
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer

# turn the free-text descriptions into uni-/bi-gram tf-idf features
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=60000)
x_train = vectorizer.fit_transform(train_df['text'].values.astype('U'))
x_valid = vectorizer.transform(valid_df['text'].values.astype('U'))

# map a column index back to the token it represents
idx2tok = {v: k for k, v in vectorizer.vocabulary_.items()}
features = [f'token_{i}' for i in range(len(vectorizer.vocabulary_))]
get_tok = lambda x, idxmap: idxmap[int(x[6:])]

lgb_train = lgb.Dataset(x_train, y_train)
lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train)
gbm = lgb.train(lgb_params, lgb_train, num_boost_round=10,
                valid_sets=[lgb_train, lgb_valid],
                early_stopping_rounds=10, verbose_eval=True)
The model trains; however, when I call gbm.feature_importance() afterwards, I get a mostly-zero array of integers that doesn't really make sense to me:
fi = gbm.feature_importance()
fi[:100]
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int32)
np.unique(fi)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 33, 34, 38, 45],
dtype=int32)
I'm not sure how to interpret this. I thought that earlier indices of the feature importance array would have higher values, and thus tokens corresponding to those indices in the vectorizer's vocab would be more important/influential than other tokens. Is this assumption wrong? How do I get the most influential/important terms that determine the model outcome? Any help is appreciated.
Thanks.
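For what it's worth, feature_importance() returns one value per column in column order; it is not sorted by importance, so the most important tokens have to be found by sorting it explicitly. A minimal sketch of my own, assuming the trained gbm and the idx2tok mapping from the question:
import numpy as np

fi = gbm.feature_importance()        # one importance value per tf-idf column
top = np.argsort(fi)[::-1][:20]      # column indices of the 20 largest values
for i in top:
    print(idx2tok[i], fi[i])         # token and how often the trees split on it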

Too many indexers with iloc

I have a data frame (df) with a three-level column MultiIndex (the full column index is shown at the end of this question).
I am trying to get Week1 to Week3 sales for the years 1950 to 1952. I am able to achieve this with loc using the code below:
idx = pd.IndexSlice
df.loc[:,idx[:,1950:1952,'Week1':'Week3']]
This gives the expected result, but when I try to do the same thing through iloc I get a "too many indexers" error. I am using this code:
df.iloc[:,idx[:,:4,:4]]
Why is it failing for iloc but working for loc?
Below are the columns of my data frame.
df.columns
MultiIndex(levels=[['Sale'], [1950, 1951, 1952, 1953], ['Week1', 'Week2', 'Week3', 'Week4']],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]],
names=[None, 'Year', 'Week'])
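One possible workaround, not part of the original post: IndexSlice builds label-based slices, which iloc cannot interpret (hence the "too many indexers" error), because iloc only accepts integer positions. The column MultiIndex can translate the label slices into positions:
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# let the MultiIndex convert the label slices into integer column positions
cols = df.columns.get_locs(idx[:, 1950:1952, 'Week1':'Week3'])
df.iloc[:, cols]

# or spell the positions out by hand: the first three columns of each
# 4-column year block, for 1950, 1951 and 1952
df.iloc[:, np.r_[0:3, 4:7, 8:11]]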

How to define a row and a column on N-queen program?

I have to write code for the N-queens chess board problem. I understand the theory behind it but don't understand how I should code it. (In this exercise 0's represent empty squares and 1's represent queens.)
so far I have only written:
import numpy as np
board=np.zeros((8,8))
board[0:,0]=1
Following this, I want to define what the rows and columns of this board are, so that I am able to detect collisions between the queens on the board.
Thank you.
I don't know how much I should be helping you (this sounds like homework), but my curiosity was piqued. So here's a preliminary exploration:
Representing a board as an 8x8 array of 0/1 is easy:
In [1783]: B=np.zeros((8,8),int)
But since a solution requires 1 queen per row, and only 1 per column, I can represent it as just a permutation of the column numbers. Looking online I found a solution, which I can enter as:
In [1784]: sol1=[2,5,1,6,0,3,7,4]
I can map that onto the board with:
In [1785]: B[np.arange(8),sol1]=1
In [1786]: B # easy display
Out[1786]:
array([[0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0]])
How about testing this? Row and column sums are easy with numpy. For a valid solution these must all be 1:
In [1787]: B.sum(axis=0)
Out[1787]: array([1, 1, 1, 1, 1, 1, 1, 1])
In [1788]: B.sum(axis=1)
Out[1788]: array([1, 1, 1, 1, 1, 1, 1, 1])
Diagonals differ in length, but can also be summed
In [1789]: np.diag(B,0)
Out[1789]: array([0, 0, 0, 0, 0, 0, 0, 0])
and to look at the other diagonals, 'flip' columns:
In [1790]: np.diag(B[:,::-1],1)
Out[1790]: array([0, 1, 0, 0, 0, 0, 0])
I can generate all diagonals with a list comprehension (not necessarily the fastest way, but easy to test):
In [1791]: [np.diag(B,i) for i in range(-7,8)]
Out[1791]:
[array([0]),
array([0, 0]),
array([0, 0, 0]),
array([1, 0, 0, 0]),
array([0, 0, 0, 0, 1]),
array([0, 0, 0, 1, 0, 0]),
array([0, 1, 0, 0, 0, 0, 0]),
array([0, 0, 0, 0, 0, 0, 0, 0]),
array([0, 0, 0, 0, 0, 0, 1]),
array([1, 0, 0, 0, 0, 0]),
array([0, 0, 0, 1, 0]),
array([0, 1, 0, 0]),
array([0, 0, 0]),
array([0, 0]),
array([0])]
and for the other direction, with sum:
In [1792]: [np.diag(B[:,::-1],i).sum() for i in range(-7,8)]
Out[1792]: [0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0]
No diagonal can have a sum >1, but some may be 0.
If the proposed solution is indeed a permutation of np.arange(8) then it is guaranteed to satisfy the row and column sum test. That just leaves the diagonal tests. The board mapping may be nice for display purposes, but it isn't required to represent the solution. And it might not be the best way to test the diagonals.
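For completeness, here is one way to test the diagonals directly from the permutation, without building the board; this is my own sketch, not part of the original answer. On a "\" diagonal row - col is constant, and on a "/" diagonal row + col is constant, so a valid placement needs all of those values to be distinct:
import numpy as np

def diagonals_ok(perm):
    # perm[i] is the column of the queen placed in row i
    perm = np.asarray(perm)
    rows = np.arange(len(perm))
    return (len(set(rows - perm)) == len(perm) and   # all "\" diagonals distinct
            len(set(rows + perm)) == len(perm))      # all "/" diagonals distinct

diagonals_ok([2, 5, 1, 6, 0, 3, 7, 4])   # True for the solution used above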
A brute force solution is to generate all permutations, and test each.
In [1796]: len(list(itertools.permutations(range(8))))
Out[1796]: 40320
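Using the diagonals_ok helper sketched above (again my own addition), the brute force search could look like this; rows and columns are satisfied by construction, so only the diagonals need checking:
import itertools

solutions = [p for p in itertools.permutations(range(8)) if diagonals_ok(p)]
len(solutions)   # 92 distinct solutions for 8 queens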
There are, of course, smarter ways of generating and testing solutions.
A few months ago I worked on a Sudoku puzzle question
Why is translated Sudoku solver slower than original?
The initial question was whether lists or arrays were a better representation, but I found, on an AI site, that an efficient, smart solver can be written with a dictionary.
There are quite a number of SO questions tagged Python and involving 8-queens. Fewer tagged with numpy as well.
==========
Your initial setup:
board[0:,0]=1
would pass the row sum test, fail the column sum test, and pass the diagonals tests.
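A quick check of that claim, as a sketch of my own:
import numpy as np

B2 = np.zeros((8, 8), int)
B2[0:, 0] = 1                 # every queen in column 0
B2.sum(axis=1)                # array([1, 1, 1, 1, 1, 1, 1, 1])  -> row test passes
B2.sum(axis=0)                # array([8, 0, 0, 0, 0, 0, 0, 0])  -> column test fails
max(np.diag(B2, i).sum() for i in range(-7, 8))            # 1 -> "\" diagonals pass
max(np.diag(B2[:, ::-1], i).sum() for i in range(-7, 8))   # 1 -> "/" diagonals pass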

how to assign non contiguous labels of one numpy array to another numpy array and add accordingly?

I have the following labels
>>> lab
array([3, 0, 3, 3, 1, 1, 2, 2, 3, 0, 1, 4])
I want to assign these labels to another numpy array, i.e.
>>> arr
array([[81, 1, 3, 87], # 3
[ 2, 0, 1, 0], # 0
[13, 6, 0, 0], # 3
[14, 0, 1, 30], # 3
[ 0, 0, 0, 0], # 1
[ 0, 0, 0, 0], # 1
[ 0, 0, 0, 0], # 2
[ 0, 0, 0, 0], # 2
[ 0, 0, 0, 0], # 3
[ 0, 0, 0, 0], # 0
[ 0, 0, 0, 0], # 1
[13, 2, 0, 11]]) # 4
and add up all rows that share the same label.
The output should be
array([[108, 7, 4, 117],   # 3
       [  2, 0, 1,   0],   # 0
       [  0, 0, 0,   0],   # 1
       [  0, 0, 0,   0],   # 2
       [ 13, 2, 0,  11]])  # 4
You could use groupby from pandas:
import pandas as pd

parr = pd.DataFrame(arr, index=lab)   # attach each row's label as its index
parr.groupby(parr.index).sum()        # sum the rows that share a label
0 1 2 3
0 2 0 1 0
1 0 0 0 0
2 0 0 0 0
3 108 7 4 117
4 13 2 0 11
numpy doesn't have a groupby function like pandas, but it does have a reduceat method that performs fast array operations on groups of elements (rows). Its application in this case is a bit messy, though.
Start with our 2 arrays:
In [39]: arr
Out[39]:
array([[81, 1, 3, 87],
[ 2, 0, 1, 0],
[13, 6, 0, 0],
[14, 0, 1, 30],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[13, 2, 0, 11]])
In [40]: lbls
Out[40]: array([3, 0, 3, 3, 1, 1, 2, 2, 3, 0, 1, 4])
Find the indices that will sort lbls (and rows of arr) into contiguous blocks:
In [41]: I=np.argsort(lbls)
In [42]: I
Out[42]: array([ 1, 9, 4, 5, 10, 6, 7, 0, 2, 3, 8, 11], dtype=int32)
In [43]: s_lbls=lbls[I]
In [44]: s_lbls
Out[44]: array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4])
In [45]: s_arr=arr[I,:]
In [46]: s_arr
Out[46]:
array([[ 2, 0, 1, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[81, 1, 3, 87],
[13, 6, 0, 0],
[14, 0, 1, 30],
[ 0, 0, 0, 0],
[13, 2, 0, 11]])
Find the boundaries of these blocks, i.e. where s_lbls jumps:
In [47]: J=np.where(np.diff(s_lbls))
In [48]: J
Out[48]: (array([ 1, 4, 6, 10], dtype=int32),)
Add the index of the start of the first block (see the reduceat docs)
In [49]: J1=[0]+J[0].tolist()
In [50]: J1
Out[50]: [0, 1, 4, 6, 10]
Apply add.reduceat:
In [51]: np.add.reduceat(s_arr,J1,axis=0)
Out[51]:
array([[ 2, 0, 1, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[108, 7, 4, 117],
[ 13, 2, 0, 11]], dtype=int32)
These are your numbers, sorted by lbls (for 0,1,2,3,4).
With reduceat you could apply other operations such as maximum or product.
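Pulling those steps together into one small helper (my own summary of the steps above, not part of the original answer):
import numpy as np

def sum_by_label(arr, lbls):
    # sort the rows so equal labels become contiguous, find where each block
    # starts, then sum every block of rows with add.reduceat
    I = np.argsort(lbls)
    s_lbls, s_arr = np.asarray(lbls)[I], np.asarray(arr)[I, :]
    starts = np.r_[0, 1 + np.where(np.diff(s_lbls))[0]]
    return s_lbls[starts], np.add.reduceat(s_arr, starts, axis=0)

labels, sums = sum_by_label(arr, lab)   # sums[i] is the summed row for labels[i]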

how to assign labels of one numpy array to another numpy array and group accordingly?

I have the following labels
>>> lab
array([2, 2, 2, 2, 2, 3, 3, 0, 0, 0, 0, 1])
I want to assign these labels to another numpy array, i.e.
>>> arr
array([[81, 1, 3, 87], # 2
[ 2, 0, 1, 0], # 2
[13, 6, 0, 0], # 2
[14, 0, 1, 30], # 2
[ 0, 0, 0, 0], # 2
[ 0, 0, 0, 0], # 3
[ 0, 0, 0, 0], # 3
[ 0, 0, 0, 0], # 0
[ 0, 0, 0, 0], # 0
[ 0, 0, 0, 0], # 0
[ 0, 0, 0, 0], # 0
[13, 2, 0, 11]]) # 1
and add up the elements of the 0th group, 1st group, 2nd group, and 3rd group?
If labels of equal value are contiguous, as in your example, then you may use np.add.reduceat:
>>> lab
array([2, 2, 2, 2, 2, 3, 3, 0, 0, 0, 0, 1])
>>> idx = np.r_[0, 1 + np.where(lab[1:] != lab[:-1])[0]]
>>> np.add.reduceat(arr, idx)
array([[110, 7, 5, 117], # 2
[ 0, 0, 0, 0], # 3
[ 0, 0, 0, 0], # 0
[ 13, 2, 0, 11]]) # 1
If they are not contiguous, then use np.argsort to align the array and labels so that labels of the same value are next to each other:
>>> i = np.argsort(lab)
>>> lab, arr = lab[i], arr[i, :] # aligns array and labels such that labels
>>> lab # are sorted and equal labels are contiguous
array([0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 3, 3])
>>> idx = np.r_[0, 1 + np.where(lab[1:] != lab[:-1])[0]]
>>> np.add.reduceat(arr, idx)
array([[ 0, 0, 0, 0], # 0
[ 13, 2, 0, 11], # 1
[110, 7, 5, 117], # 2
[ 0, 0, 0, 0]]) # 3
Or, alternatively, use groupby from the pandas library:
>>> pd.DataFrame(arr).groupby(lab).sum().values
array([[ 0, 0, 0, 0],
[ 13, 2, 0, 11],
[110, 7, 5, 117],
[ 0, 0, 0, 0]])
