Transpose Excel data on multiple columns

I have a dataset in Excel that I need to transpose. It is survey data: the first column is the month of the survey, the second is unique to each company, the third is a sector code for that company (which can change over time), the fourth is a Size variable, and then there are question number and answer columns. I want to be able to build pivot tables from this, but as I understand it I need to get each question into its own column to be able to cross-tabulate in the pivot table, e.g. what companies answered on question 2 depending on their answer to question 1. How can I transpose the data?
From this
Period Company Sector Size Question Answer
201601 101 Cons Small 1 2
201601 101 Cons Small 2 1
201601 101 Cons Small 3 2
201601 102 Int Small 1 3
201601 102 Int Small 2 1
201601 102 Int Small 3 1
201602 101 Cons Small 1 3
201602 101 Cons Small 2 2
201602 101 Cons Small 3 1
201602 102 Int Small 1 3
201602 102 Int Small 2 1
201602 102 Int Small 3 2
To this
Period Company Sector Size Question1 Question2 Question3
201601 101 Cons Small 2 1 2
201601 102 Int Small 3 1 1
201602 101 Cons Small 3 2 1
201602 102 Int Small 3 1 2
There can be up to about 30 questions in one file, about 1500-2000 companies, and in my first files I will have 4 months. The companies are grouped into at most 5 sectors and two different sizes.

Thanks to a comment from Doug Glancy I could figure out how to do this.
Create a Pivot Table with all columns in Row Labels except for Question and Answer. Then put Question in Column Labels and Answer in Values, and choose to sum the values.
To get the format correct, in the PivotTable Tools - Design menu choose Subtotals - Do Not Show Subtotals. Copy the resulting table into a new workbook without the totals column and row.
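For anyone doing the same long-to-wide reshape in pandas instead of Excel, here is a minimal sketch using a small DataFrame built from the example above (the column handling is my own choice, not part of the original answer):

import pandas as pd

# Long-format survey data as in the example above (first period only, for brevity)
df = pd.DataFrame({'Period':   [201601, 201601, 201601, 201601, 201601, 201601],
                   'Company':  [101, 101, 101, 102, 102, 102],
                   'Sector':   ['Cons', 'Cons', 'Cons', 'Int', 'Int', 'Int'],
                   'Size':     ['Small'] * 6,
                   'Question': [1, 2, 3, 1, 2, 3],
                   'Answer':   [2, 1, 2, 3, 1, 1]})

# One row per Period/Company/Sector/Size, one column per question
wide = (df.pivot_table(index=['Period', 'Company', 'Sector', 'Size'],
                       columns='Question', values='Answer', aggfunc='sum')
          .add_prefix('Question')
          .reset_index())
print(wide)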

Related

Perform unique row operation after a groupby

I am stuck on a problem where I have done all the groupby operations and got the resulting dataframe shown below, but the problem comes in the last step: calculating one additional column.
Current dataframe:
code industry category count duration
2 Retail Mobile 4 7
3 Retail Tab 2 33
3 Health Mobile 5 103
2 Food TV 1 88
The question: I want an additional column, operation, that calculates the ratio of the Retail count to the total count for that code.
For example: code 2 has two industry entries, Retail and Food, so the operation column should have the value 4/(4+1) = 0.8, and similarly for code 3, as shown below.
O/P:
code industry category count duration operation
2 Retail Mobile 4 7 0.8
3 Retail Tab 2 33 2/7 = 0.285
3 Health Mobile 5 103 -
2 Food TV 1 88 -
Help here as well: if I do just a groupby I will lose the category and duration information. Also, what would be a better way to represent the output df? There can be multiple industries, and operation is limited to just Retail.
I can't think of a single operation, but the approach via a dictionary should work. And, for the benefit of the other answerers, here is the code to create the example dataframe.
import pandas as pd

st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l, columns=['code', 'industry', 'category', 'count', 'duration'])
And now my attempt:
# Total count per code as a dict, then divide each row's count by its code's total
sums = df[['code', 'count']].groupby('code').sum().to_dict()['count']
df['operation'] = df.apply(lambda x: x['count'] / sums[x['code']], axis=1)
You can create a new column with the total count for each code using groupby.transform(), and then use loc to select only the rows whose industry is 'Retail' and perform your division:
df['total_per_code'] = df.groupby(['code'])['count'].transform('sum')
df.loc[df.industry.eq('Retail'), 'operation'] = df['count'].div(df.total_per_code)
df.drop('total_per_code',axis=1,inplace=True)
prints back:
code industry category count duration operation
0 2 Retail Mobile 4 7 0.800000
1 3 Retail Tab 2 33 0.285714
2 3 Health Mobile 5 103 NaN
3 2 Food TV 1 88 NaN
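If you would rather skip the temporary total_per_code column, the same idea can be written as a small sketch on the same example df with transform and where:

totals = df.groupby('code')['count'].transform('sum')
df['operation'] = (df['count'] / totals).where(df['industry'].eq('Retail'))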

Combining the respective columns from 2 separate DataFrames using pandas

I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here; there may be more in the actual data) into single values in the same columns (see required output below). I have a quick way of implementing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement this strictly with pandas. The criteria here are first readability of code, then time complexity.
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast -
import numpy as np

# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)

conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
The criteria here are first readability of code
Another simple way is using add and radd (shown here with '-' as the separator; swap in '_' to match the required output exactly):
df1.astype(str).add(df2.astype(str).radd('-'))
A B
0 1-10 5-50
1 2-20 4-40
2 3-30 3-30
3 4-40 2-20
4 5-50 1-10
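If you want the '_' separator from the required output, plain element-wise + on the string-cast frames is another readable, strictly-pandas option (a sketch, assuming df1 and df2 have the same shape and columns as above):

required = df1.astype(str) + '_' + df2.astype(str)
print(required)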

Group by name and count unique values

I have an Excel file like this, where columns A and B are given. I want to add columns C and D that represent days. D is pretty easy, because it is always one day. C is tricky, because I want to count only "unique" days, where each branch counts for at most one day per name, while D counts all days.
A B C D
Row Name Branch Unique Overall
1 Jack Health 1 1
2 Jack Health 0 1
3 Jack Food 1 1
4 Jolie Tech 1 1
5 Jolie Food 1 1
6 Jolie Tech 0 1
7 Jolie Health 1 1
I need column C and D for a pivot table like this:
Branch Unique Overall
Health 2 3
Food 2 2
Tech 1 2
I could also add names as a sub-position.
Branch Unique Overall
Health 2 3
-Jack 1 2
-Jolie 1 1
Food 2 2
-Jack 1 1
-Jolie 1 1
Tech 1 2
-Jolie 1 2
But that's something that can be done after preparing the data and comes with the program anyway. So how can I design a formula that counts only unique branches for a data set of hundreds of rows?
Thank you!
In C2 put:
=--(COUNTIFS($A$2:A2,A2,$B$2:B2,B2)=1)
Then copy down
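For comparison, the same flag-the-first-occurrence idea can be sketched in pandas (a sketch only, assuming the Name/Branch data above sits in a DataFrame df):

import pandas as pd

df = pd.DataFrame({'Name':   ['Jack', 'Jack', 'Jack', 'Jolie', 'Jolie', 'Jolie', 'Jolie'],
                   'Branch': ['Health', 'Health', 'Food', 'Tech', 'Food', 'Tech', 'Health']})

# Flag only the first occurrence of each Name/Branch pair; every row counts as one overall day
df['Unique']  = (~df.duplicated(['Name', 'Branch'])).astype(int)
df['Overall'] = 1

print(df.groupby('Branch')[['Unique', 'Overall']].sum())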

Allocate class based on school ranking

In Excel I am trying to allocate classes to pupils based on their ranking in school. The set of data I have looks like this:
S/N Name LevelPosition
1 Andrea 10
2 Bryan 25
3 Catty 5
4 Debbie 26
5 Ellie 30
6 Freddie 28
I would like to have a formula that could sort the pupils based on the LevelPosition and allocate the class in order of this sequence - A,B,C,C,B,A. Hence the result would be:
S/N Name LevelPosition AllocatedClass
3 Catty 5 A
1 Andrea 10 B
2 Bryan 25 C
4 Debbie 26 C
6 Freddie 28 B
5 Ellie 30 A
This was the sort of thing I had in mind.
Column D is just a ranking from bottom to top:-
=RANK(C2,C$2:C$7,1)
Column E adjusts the column D rank for any ties:-
=D2+COUNTIF(D$1:D1,D2)
Column F is based on the #pnuts formula:-
=CHOOSE(MOD(E2-1,6)+1,"A","B","C","C","B","A")
I've put some ties in to show what would happen. The last two students' allocations are reversed because the second-to-last has the higher mark.
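The same allocation can also be sketched in pandas by sorting on LevelPosition and repeating the A,B,C,C,B,A pattern down the sorted rows (a sketch on the example data; ties land wherever the sort puts them):

import pandas as pd

df = pd.DataFrame({'S/N': [1, 2, 3, 4, 5, 6],
                   'Name': ['Andrea', 'Bryan', 'Catty', 'Debbie', 'Ellie', 'Freddie'],
                   'LevelPosition': [10, 25, 5, 26, 30, 28]})

pattern = ['A', 'B', 'C', 'C', 'B', 'A']
out = df.sort_values('LevelPosition').reset_index(drop=True)
out['AllocatedClass'] = [pattern[i % len(pattern)] for i in range(len(out))]
print(out)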

Randomly select a percentage of columns and rows in a dataframe (Pandas, Python 3)

I am trying to randomly select a certain percentage of rows and columns in my dataframe and fit these features into a logistic regression over 10 iterations. My dependent variable is whether a team won (1) or lost (0).
If I have a df that looks something like this (data is made up):
Won Field Injuries Weather Fouls Players
1 2 3 1 2 8
0 3 2 0 1 5
1 4 5 3 2 6
1 3 2 1 4 5
0 2 3 0 1 6
1 4 2 0 2 8
...
For example, let's say I want to select 50% (but this could change). I want to randomly select 50% (or the closest amount to 50% if it's an odd number) of the columns (field, injuries, weather, fouls, players) and 50% of the rows in those columns to place in my model.
Here is my current code, which right now selects all of the columns and rows and fits them into my model, but I would like to specify a random percentage:
import numpy as np
import statsmodels.api as sm

z = []
for i in range(10):
    train_cols = df.columns[1:]
    logit = sm.Logit(df['Won'], df[train_cols])
    result = logit.fit()
    exp = np.exp(result.params)
    z.append([i, exp])
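One possible way to sketch the random 50% selection (hedged: frac, the column sampling with np.random.choice, and fit(disp=0) are my own choices, assuming df is the frame described above) is to draw the feature columns with numpy and the rows with DataFrame.sample on each iteration:

import numpy as np
import statsmodels.api as sm

frac = 0.5                                # fraction of rows and columns to keep
feature_cols = df.columns[1:]             # everything except 'Won'
n_cols = max(1, round(len(feature_cols) * frac))

z = []
for i in range(10):
    cols = np.random.choice(feature_cols, size=n_cols, replace=False)
    sample = df.sample(frac=frac)         # random subset of the rows
    logit = sm.Logit(sample['Won'], sample[list(cols)])
    result = logit.fit(disp=0)
    z.append([i, np.exp(result.params)])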
