How can I split a single dataframe column, containing multiple values, into a transposed table of 'flags' - python-3.x

I'm trying to separate a single column in a dataframe, that contains multiple comma separated values, into a transposed table that contains columns which flag every potential comma separated value.
e.g.
name purchased total_value
-----------------------------------------------
John Eggs, bread, milk 100
Steve Milk, cheese, wine 140
Susan Beer, cheese, milk 120
Needs to become:
name total_value eggs bread milk cheese wine beer
-----------------------------------------------------------------
John 100 1 1 1 0 0 0
Steve 140 0 0 1 1 1 0
Susan 120 0 0 1 1 0 1
As a small added complication, the purchased column has spaces after the commas, and some of the values are capitalized.
Can anyone help me get from A to B?
Thanks in advance

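First convert all values to lowercase with str.lower, then call str.get_dummies, and finally join the result to the original DataFrame.
To remove the original purchased column you can use either pop or drop:
# Option 1: pop removes the column and returns it in one step
df['purchased'] = df['purchased'].str.lower()
df = df.join(df.pop('purchased').str.get_dummies(', '))
# Option 2: drop (use instead of option 1, not after it)
df = df.drop(columns='purchased').join(df['purchased'].str.lower().str.get_dummies(', '))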
print (df)
name total_value beer bread cheese eggs milk wine
0 John 100 0 1 0 1 1 0
1 Steve 140 0 0 1 0 1 1
2 Susan 120 1 0 1 0 1 0
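The approach above can be sketched end to end as follows. This is a minimal example that rebuilds the sample data from the question; the extra regular-expression step that normalises whitespace around the commas is an assumption added here so that irregular spacing does not produce duplicate columns, and is not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "name": ["John", "Steve", "Susan"],
    "purchased": ["Eggs, bread, milk", "Milk, cheese, wine", "Beer, cheese, milk"],
    "total_value": [100, 140, 120],
})

# Lowercase, then collapse any whitespace around commas (assumed clean-up step).
cleaned = (
    df["purchased"]
    .str.lower()
    .str.strip()
    .str.replace(r"\s*,\s*", ",", regex=True)
)

# One 0/1 column per distinct item; columns come back in sorted order.
flags = cleaned.str.get_dummies(sep=",")

# Drop the original text column and attach the flags.
result = df.drop(columns="purchased").join(flags)
print(result)
```

Because the separator is normalised first, a value such as "Eggs,bread ,  milk" would map to the same three columns.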

Related

Pandas: how to calculate average ignoring 0 within groups?

My data looks like this:
It is grouped by "name"
name star atm food foodcp drink drinkcp clean cozy service
___Backyard Jr. (__Xinyi) 4 4 4 4 4 0 4 0 0
___Backyard Jr. (__Xinyi) 3 0 3 0 3 0 0 0 3
___Backyard Jr. (__Xinyi) 4 0 0 0 4 0 0 0 0
___Backyard Jr. (__Xinyi) 3 0 0 0 0 0 0 3 3
I want to calculate the mean of every column except name within each group, ignoring the zeros. How can I do it?
I've tried using
df.groupby('name',as_index=False).mean()
but it does include the zeros in the calculation.
Thank you for your help!!
You can first replace all the zeros with NaN:
import numpy as np
df = df.replace(0, np.nan)
These NaN values are then excluded from the mean, because pandas skips NaN by default when it aggregates.
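A small sketch of the replace-then-group idea, assuming a cut-down version of the table (the shortened name and the subset of columns are chosen here for illustration only):

```python
import numpy as np
import pandas as pd

# Cut-down, illustrative version of the data in the question.
df = pd.DataFrame({
    "name": ["Backyard Jr."] * 4,
    "star": [4, 3, 4, 3],
    "atm": [4, 0, 0, 0],
    "food": [4, 3, 0, 0],
    "drinkcp": [0, 0, 0, 0],
})

# Zeros become NaN, and mean() skips NaN by default.
means = df.replace(0, np.nan).groupby("name", as_index=False).mean()
print(means)
```

A column that contains only zeros within a group, such as drinkcp here, comes out as NaN rather than 0, since there is nothing left to average.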

Is it possible to create a formula that would sum multiple columns by looking up two parameters?

The question may sound very odd, so let me explain.
E.g. I have a table that looks like this:
1 2 3 3
Soda 1 2 1 0
Beer 2 0 0 0
Beer 2 1 2 2
Soda 0 2 0 2
I would like to sum each "drink number" for every category, like this:
1 2 3
Soda 1 4 3
Beer 4 1 4
I have used the IFS and SUMIF formula sequence for this (where B$1 = "Column number",$A2 = "Drink name", rest is just columns corresponding to value locations):
=IFS(B$1="1",SUMIF($A$2:$A$5,$A2,$B$2:$B$5),B$1="2",SUMIF($A$2:$A$5,$A2,$C$2:$C$5),B$1="3",SUM(SUMIF($A$2:$A$5,$A2,$D$2:$D$5),SUMIF($A$2:$A$5,$A2,$E$2:$E$5)))
But it is not very robust, and it needs to be changed manually whenever the initial table changes.
I would like to make my formula adaptable to changes. Say, it "looks up" the row and column values and sums all matching columns by them. For example, if one more column "2" is added, and "Beer" is changed to "Milk" (but the "column length" remains the same):
1 2 2 3 3
Soda 1 2 1 1 0
Milk 2 0 1 0 0
Milk 2 1 0 2 2
Soda 0 2 1 0 2
The formula would automatically adjust to this change and calculate the new sums:
1 2 3
Soda 1 6 3
Milk 4 2 4
Is it possible to create such formula? Because I'm not even sure if something like this can be done with basic Excel formulas. Thank you beforehand.
Yes. For example:
=SUMPRODUCT(($A$2:$A$5=$G2)*($B$1:$E$1=H$1),$B$2:$E$5)
where G2 has Soda and H1 has 1. Then copy across and down (assuming Beer in G3 and 2, 3 in I1:J1)
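For readers who want to see why the SUMPRODUCT formula works, here is a rough NumPy sketch of the same logic: a row mask (drink name matches) times a column mask (header matches) times the values, summed. The array names and the `sum_by` helper are made up for this illustration and are not part of Excel or of the original answer.

```python
import numpy as np

# First table from the question: header row, name column, and values.
headers = np.array([1, 2, 3, 3])
names = np.array(["Soda", "Beer", "Beer", "Soda"])
values = np.array([
    [1, 2, 1, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [0, 2, 0, 2],
])

def sum_by(name, header):
    """Sum every cell whose row name and column header both match."""
    row_mask = (names == name)[:, None]        # shape (4, 1), like $A$2:$A$5=$G2
    col_mask = (headers == header)[None, :]    # shape (1, 4), like $B$1:$E$1=H$1
    return int((row_mask * col_mask * values).sum())

print(sum_by("Soda", 2))
print(sum_by("Beer", 3))
```

Since the masks are computed from the data itself, adding another column with header 2 or renaming Beer to Milk needs no change to the function, which is the property the question asks for.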

Count rows where two values appear together

My data are in MS Excel:
Col A Col B Col C Col D
1 amy john bob andy
2 andy mel amy john
3 max andy jim bob
4 will steve andy amy
So, in the 4x4 table there are 9 different values.
I need to create a table showing how many times each PAIR occurs in the same ROW. Something like this:
amy andy bob jim john max mel steve will
amy 0
andy 3 0
bob 1 2 0
jim 0 1 1 0
john 2 2 1 0 0
max 0 1 1 1 0 0
mel 1 1 0 0 1 0 0
steve 1 1 0 0 0 0 0 0
will 1 1 0 0 0 0 0 1 0
And I have no clue how to do it.
To reiterate: there are no duplicated values within a row and each value is in a separate cell, but the same value can appear more than once within a column.
Any help will be much appreciated!
Assuming your data is in A5:D8, I proceeded like this:
Create a helper column with this formula (copied downwards):
=A5&"-"&B5&"-"&C5&"-"&D5
Name this helper column helper (a named range).
List the unique names across in H4:P4 and down in G5:G13.
Enter this formula in H5 and copy it both down and across to fill the whole 9x9 matrix:
=IF($G5=H$4,0,COUNTIFS(helper,"*"&$G5&"*",helper,"*"&H$4&"*"))
Your desired matrix is ready. Note that the wildcards match substrings, so a name that is contained inside another name would be over-counted.
A detailed blog is available on web for this.
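The same pair count can be sketched outside Excel. The following is an illustrative Python version, not the answer's method: it counts every unordered pair per row with itertools.combinations, which also avoids the substring issue of wildcard matching. The `together` helper is a name invented here, and the spelling "will" follows the result matrix in the question.

```python
from collections import Counter
from itertools import combinations

# Rows from the question ("will" spelled as in the result matrix).
rows = [
    ["amy", "john", "bob", "andy"],
    ["andy", "mel", "amy", "john"],
    ["max", "andy", "jim", "bob"],
    ["will", "steve", "andy", "amy"],
]

# Count each unordered pair once per row; sorting makes (a, b) canonical.
pair_counts = Counter()
for row in rows:
    for a, b in combinations(sorted(set(row)), 2):
        pair_counts[(a, b)] += 1

def together(a, b):
    """Number of rows in which both names appear (0 for a name with itself)."""
    if a == b:
        return 0
    return pair_counts[tuple(sorted((a, b)))]

print(together("amy", "andy"))
print(together("bob", "andy"))
```

Looping `together` over the sorted list of unique names fills the lower triangle of the matrix shown in the question.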

Excel how to check if a row has any other values and return the index

I have this big table where the first column is the person. The second column is something they all have. And then there are different columns for possible other things they could have.
For example :
Person apples pears bananas oranges
Luc 7 0 0 0
Julia 10 0 0 2
Maria 8 0 0 0
Lena 15 0 3 0
Tina 2 1 0 1
I know for a fact that everybody eats apples, but I would like to know whether people also eat other things, and which things those are.
The result should be
Person apples pears bananas oranges result
Luc 7 0 0 0 0
Julia 10 0 0 2 oranges
Maria 8 0 0 0 0
Lena 15 0 3 0 bananas
Tina 2 1 0 1 pears, oranges
The last column doesn't have to be the name of the fruit; I would be happy with the column number. I tried HLOOKUP, but it doesn't work. Or maybe I'm not using the right lookup_value? I use > 0 as the lookup_value.
Can somebody please help me?
One solution: Create "dummy variables" (formatted with text instead of {0,1} values) that are only populated for fruits actually consumed (N>0), and then concatenate in the results column:
ID     Apples  Pears  Bananas  Oranges  d_apples  d_pears  d_bananas  d_oranges  result
Luc    7       0      0        0        Apples
Julia  10      0      0        2        Apples                        Oranges    Oranges
Maria  8       0      0        0        Apples
Lena   15      0      3        0        Apples             Bananas               Bananas
Tina   2       1      0        1        Apples    Pears               Oranges    Pears Oranges
In the d_apples column, I've entered =IF(B2>0, B$1, "") and filled this across the d_ columns.
In the results column, I've entered =TRIM(CONCATENATE(G2, " ", H2, " ", I2)).
The TRIM() function removes the extra blank spaces between words for fruits not eaten. Note of course that this doesn't add commas between words, and adding commas to the CONCATENATE() function will yield something like , Bananas,.
If you want the column numbers, you can change B$1 in the IF() function to COLUMN(B$1).
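As a cross-check of the dummy-column idea, here is a short pandas sketch that produces the comma-separated result directly. It is an illustration in Python rather than an Excel solution; the `other_items` helper and the `others` list are names assumed for this example.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "Person": ["Luc", "Julia", "Maria", "Lena", "Tina"],
    "apples": [7, 10, 8, 15, 2],
    "pears": [0, 0, 0, 0, 1],
    "bananas": [0, 0, 3 * 0, 3, 0],
    "oranges": [0, 2, 0, 0, 1],
})

# Columns to inspect: everything except the one everybody has.
others = ["pears", "bananas", "oranges"]

def other_items(row):
    """Join the names of the non-apple columns whose count is above zero."""
    return ", ".join(col for col in others if row[col] > 0)

df["result"] = df.apply(other_items, axis=1)
print(df)
```

Rows with nothing besides apples get an empty string here instead of the 0 shown in the question; replacing `col` with `df.columns.get_loc(col)` inside the join (converted to str) would give column positions instead of names.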

Creating a Two-Mode Network

Using Python 3.2 I am trying to turn data from a CSV file into a two-mode network. For those who do not know what that means, the idea is simple:
This is a snippet of my dataset:
Project_ID Name_1 Name_2 Name_3 Name_4 ... Name_150
1 Jean Mike
2 Mike
3 Joe Sarah Mike Jean Nick
4 Sarah Mike
5 Sarah Jean Mike Joe
I want to create a CSV that puts the Project_IDs across the first row of the CSV and each unique name down the first column (with cell A1 blank) and then a 1 in the i,j cell if that person worked on a given project. NOTE: My data has full names (with middle initial), with no two people having the same name so there will not be any duplicates.
The final data output would look like this:
1 2 3 4 5
Jean 1 0 1 0 1
Mike 1 1 1 1 1
Joe 0 0 1 0 1
Sarah 0 0 1 1 1
... ... ... ... ... ...
Nick 0 0 1 0 0
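Start by using the CSV reader:
import csv
# In Python 3, open the file in text mode with newline='' (as the csv docs recommend)
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
Note that each row is read as a list of strings, one list per line of the file.
The output array should probably be created before you start. Following this question, here is how you could do that:
# 10 rows by 5 columns of zeros; adjust the sizes to your numbers of names and projects
buckets = [[0 for col in range(5)] for row in range(10)]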
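Putting those two pieces together, one possible way to build the whole name-by-project matrix is sketched below. It assumes a comma-separated file with a header row and empty cells for missing names; the in-memory sample, the five Name columns, and the `build_two_mode` function are all made up for illustration.

```python
import csv
import io

# Small in-memory stand-in for the real file (assumed layout).
sample = """Project_ID,Name_1,Name_2,Name_3,Name_4,Name_5
1,Jean,Mike,,,
2,Mike,,,,
3,Joe,Sarah,Mike,Jean,Nick
4,Sarah,Mike,,,
5,Sarah,Jean,Mike,Joe,
"""

def build_two_mode(csv_file):
    """Return rows of a names-by-projects 0/1 table, header row first."""
    reader = csv.reader(csv_file)
    next(reader)                      # skip the header line
    projects = []                     # project IDs in file order
    members = {}                      # name -> set of project IDs (first-seen order)
    for row in reader:
        if not row:
            continue
        project_id, names = row[0], row[1:]
        projects.append(project_id)
        for name in names:
            name = name.strip()
            if name:                  # ignore empty cells
                members.setdefault(name, set()).add(project_id)
    table = [[""] + projects]         # cell A1 stays blank
    for name, worked_on in members.items():
        table.append([name] + [1 if p in worked_on else 0 for p in projects])
    return table

table = build_two_mode(io.StringIO(sample))
for line in table:
    print(line)
```

To write the result out, the rows can be passed to csv.writer(out_file).writerows(table) on a file opened with newline=''. Collecting names in a dictionary avoids having to know the number of people in advance, which the fixed-size buckets list would require.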

Resources