Convert repetitive lists to rows in a pandas column - python-3.x

I have a dataframe in pandas, shown below, where the list in column info is identical for every row sharing the same id:
id text info
1 great ['son','daughter']
1 excellent ['son','daughter']
2 nice ['father','mother','brother']
2 good ['father','mother','brother']
2 bad ['father','mother','brother']
3 awesome nan
I want the list elements distributed one per row within each id, like:
id text info
1 great son
1 excellent daughter
2 nice father
2 good mother
2 bad brother
3 awesome nan

Let us try explode after drop_duplicates
df['info'] = df['info'].drop_duplicates().explode().values
df
Out[298]:
id text info
0 1 great son
1 1 excellent daughter
2 2 nice father
3 2 good mother
4 2 bad brother
5 3 awesome NaN
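A self-contained sketch of this idea (one assumption on my part: because Python lists are typically unhashable, this variant deduplicates on id rather than calling drop_duplicates on the list column itself, then explodes the surviving lists into one value per row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3],
    'text': ['great', 'excellent', 'nice', 'good', 'bad', 'awesome'],
    'info': [['son', 'daughter'], ['son', 'daughter'],
             ['father', 'mother', 'brother'], ['father', 'mother', 'brother'],
             ['father', 'mother', 'brother'], np.nan],
})

# Keep the first row of each id, explode its list (NaN passes through
# unchanged), and assign the result back positionally; the total number of
# exploded elements matches the number of rows by assumption.
df['info'] = df.loc[~df['id'].duplicated(), 'info'].explode().values
print(df)
```

This relies on the same assumption as the answer above: the exploded lists together must have exactly as many elements as the frame has rows.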

Related

comma separated values in columns as rows in pandas

I have a dataframe in pandas, shown below, where the comma-separated values in column info are identical for every row sharing the same id:
id text info
1 great boy,police
1 excellent boy,police
2 nice girl,mother,teacher
2 good girl,mother,teacher
2 bad girl,mother,teacher
3 awesome grandmother
4 superb grandson
I want the comma-separated values distributed one per row within each id, like:
id text info
1 great boy
1 excellent police
2 nice girl
2 good mother
2 bad teacher
3 awesome grandmother
4 superb grandson
Let us try
df['new'] = df.loc[~df.id.duplicated(),'info'].str.split(',').explode().values
df
id text info new
0 1 great boy,police boy
1 1 excellent boy,police police
2 2 nice girl,mother,teacher girl
3 2 good girl,mother,teacher mother
4 2 bad girl,mother,teacher teacher
5 3 awesome grandmother grandmother
6 4 superb grandson grandson
Take advantage of the fact that 'info' is duplicated.
df['info'] = df['info'].drop_duplicates().str.split(',').explode().to_numpy()
Output:
id text info
0 1 great boy
1 1 excellent police
2 2 nice girl
3 2 good mother
4 2 bad teacher
5 3 awesome grandmother
6 4 superb grandson
One way is using pandas.DataFrame.groupby.transform.
Note that this assumes:
after splitting by ',', each info value has exactly as many elements as there are rows for its id
info values are identical within the same id.
df["info"] = df.groupby("id")["info"].transform(lambda x: x.str.split(",").iloc[0])
print(df)
Output:
id text info
0 1 great boy
1 1 excellent police
2 2 nice girl
3 2 good mother
4 2 bad teacher
5 3 awesome grandmother
6 4 superb grandson
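A runnable version with the frame reconstructed from the question (the construction itself is my assumption, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3, 4],
    'text': ['great', 'excellent', 'nice', 'good', 'bad', 'awesome', 'superb'],
    'info': ['boy,police', 'boy,police',
             'girl,mother,teacher', 'girl,mother,teacher', 'girl,mother,teacher',
             'grandmother', 'grandson'],
})

# For each id, split the first info string once and spread the pieces across
# that id's rows; transform aligns the returned list with the group's rows.
df['info'] = df.groupby('id')['info'].transform(lambda x: x.str.split(',').iloc[0])
print(df)
```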
Create a temp variable holding each row's position within its info group:
temp = df.groupby('info').cumcount()
Use a list comprehension to pick, for each row, the element of the split info string at that position:
df['info'] = [ent.split(',')[pos] for ent, pos in zip(df['info'], temp)]
df
id text info
0 1 great boy
1 1 excellent police
2 2 nice girl
3 2 good mother
4 2 bad teacher
5 3 awesome grandmother
6 4 superb grandson
Or try apply:
df['info'] = pd.DataFrame({'info': df['info'].str.split(','), 'n': df.groupby('id').cumcount()}).apply(lambda x: x['info'][x['n']], axis=1)
Output:
>>> df
id text info
0 1 great boy
1 1 excellent police
2 2 nice girl
3 2 good mother
4 2 bad teacher
5 3 awesome grandmother
6 4 superb grandson
>>>

Text data massaging to conduct distance calculations in python

I am trying to get text data from dataframe "A" converted into columns and text data from dataframe "B" into rows of a new dataframe "C", in order to run distance calculations.
Data in dataframe "A" looks like this
Unique -> header
'Amy'
'little'
'sheep'
'dead'
Data in dataframe "B" looks like this
common_words -> header
'Amy'
'George'
'Barbara'
i want the output in dataframe C as
Amy George Barbara
Amy
little
sheep
dead
Can anyone help me with this?
What should be the actual content of dataframe C? Do you only want to initialise it to some value (e.g. 0) in the first step and then fill it with the distance calculations?
You could initialise C in the following way:
import pandas as pd
A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])
C = pd.DataFrame([[0] * len(B)] * len(A), index=A[0], columns=B[0])
C will then look like:
Amy George Barbara
0
Amy 0 0 0
little 0 0 0
sheep 0 0 0
dead 0 0 0
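The question does not say which distance metric is intended, so as a purely illustrative sketch, here is how the initialised frame could be filled using a simple string-similarity ratio from the standard library's difflib (the choice of metric is an assumption; swap in your own):

```python
import difflib

import pandas as pd

A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])

# A's words become the rows, B's words the columns; each cell holds a
# similarity score (1.0 = identical strings, 0.0 = no overlap at all).
C = pd.DataFrame(index=A[0], columns=B[0], dtype=float)
for a in A[0]:
    for b in B[0]:
        C.loc[a, b] = difflib.SequenceMatcher(None, a, b).ratio()
print(C)
```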
Use pd.DataFrame(index=[list], columns=[list]).
Extract the relevant lists using list(df.columnname.values)
Dummy data
print(dfA)
Header
0 Amy
1 little
2 sheep
3 dead
print(dfB)
Header
0 Amy
1 George
2 Barbara
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values))
Amy George Barbara
Amy NaN NaN NaN
little NaN NaN NaN
sheep NaN NaN NaN
dead NaN NaN NaN
If you are interested in dfC without NaNs, use:
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values)).fillna(' ')
Amy George Barbara
Amy
little
sheep
dead

How to select and subset rows based on string in pandas dataframe?

My dataset looks like the following. I am trying to subset my pandas dataframe so that only the responses given by all 3 people are selected. For example, in the data frame below the responses answered by all 3 people were "I like to eat" and "You have nice day", so only those should be kept. I am not sure how to achieve this in a Pandas dataframe.
Note: I am new to Python ,please provide explanation with your code.
DataFrame example
import pandas as pd
data = {'Person': ['1', '1', '1', '2', '2', '2', '2', '3', '3'],
        'Response': ['I like to eat', 'You have nice day', 'My name is ', 'I like to eat', 'You have nice day', 'My name is', 'This is it', 'I like to eat', 'You have nice day']}
df = pd.DataFrame(data)
print (df)
Output:
Person Response
0 1 I like to eat
1 1 You have nice day
2 1 My name is
3 2 I like to eat
4 2 You have nice day
5 2 My name is
6 2 This is it
7 3 I like to eat
8 3 You have nice day
IIUC I am using transform with nunique
yourdf=df[df.groupby('Response').Person.transform('nunique')==df.Person.nunique()]
yourdf
Out[463]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
Method 2
df.groupby('Response').filter(lambda x : pd.Series(df['Person'].unique()).isin(x['Person']).all())
Out[467]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
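An equivalent, perhaps more readable variant (my own sketch, not from the answers above): keep only the responses whose set of respondents covers everyone.

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['1', '1', '1', '2', '2', '2', '2', '3', '3'],
    'Response': ['I like to eat', 'You have nice day', 'My name is ',
                 'I like to eat', 'You have nice day', 'My name is',
                 'This is it', 'I like to eat', 'You have nice day'],
})

everyone = set(df['Person'].unique())
# For each response, check whether every person gave it, then keep the
# rows whose response passed that check.
full_coverage = df.groupby('Response')['Person'].apply(lambda s: set(s) == everyone)
yourdf = df[df['Response'].map(full_coverage)]
print(yourdf)
```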

sort pandas value_counts() primarily by descending counts and secondarily by ascending values

When applying value_counts() to a series in pandas, by default the counts are sorted in descending order, however the values are not sorted within each count.
How can I have the values within each identical count sorted in ascending order?
apples 5
peaches 5
bananas 3
carrots 3
apricots 1
The output of value_counts is a series itself (just like the input), so all of the standard series sorting options are available. For example:
df = pd.DataFrame({ 'fruit':['apples']*5 + ['peaches']*5 + ['bananas']*3 +
['carrots']*3 + ['apricots'] })
df.fruit.value_counts().reset_index().sort_values(by=[0, 'index'], ascending=[False, True])
index 0
0 apples 5
1 peaches 5
2 bananas 3
3 carrots 3
4 apricots 1
I'm actually getting the same results by default so here's a test with ascending=[False,False] to demonstrate that this is actually working as suggested.
df.fruit.value_counts().reset_index().sort_values(by=[0, 'index'], ascending=[False, False])
index 0
1 peaches 5
0 apples 5
3 carrots 3
2 bananas 3
4 apricots 1
I'm actually a bit confused about exactly what the desired output is here in terms of ascending vs descending, but regardless, there are 4 possible combos and you can get whichever you like by altering the ascending keyword argument.
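In current pandas, where Series.sort has been removed, the same two-key idea still works with sort_values; a sketch (assuming pandas >= 2.0, where reset_index on the value_counts result names the columns 'fruit' and 'count'):

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apples'] * 5 + ['peaches'] * 5 +
                            ['bananas'] * 3 + ['carrots'] * 3 + ['apricots']})

# Sort primarily by descending count, breaking ties alphabetically by value.
out = (df['fruit'].value_counts()
         .reset_index()
         .sort_values(by=['count', 'fruit'], ascending=[False, True])
         .reset_index(drop=True))
print(out)
```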

Creating a Two-Mode Network

Using Python 3.2 I am trying to turn data from a CSV file into a two-mode network. For those who do not know what that means, the idea is simple:
This is a snippet of my dataset:
Project_ID Name_1 Name_2 Name_3 Name_4 ... Name_150
1 Jean Mike
2 Mike
3 Joe Sarah Mike Jean Nick
4 Sarah Mike
5 Sarah Jean Mike Joe
I want to create a CSV that puts the Project_IDs across the first row of the CSV and each unique name down the first column (with cell A1 blank) and then a 1 in the i,j cell if that person worked on a given project. NOTE: My data has full names (with middle initial), with no two people having the same name so there will not be any duplicates.
The final data output would look like this:
1 2 3 4 5
Jean 1 0 1 0 1
Mike 1 1 1 1 1
Joe 0 0 1 0 1
Sarah 0 0 1 1 1
... ... ... ... ... ...
Nick 0 0 1 0 0
Start by using the CSV reader:
import csv
with open('some.csv', 'rb') as f:
reader = csv.reader(f)
for row in reader:
print row
Note that row will read as arrays for each line.
The output array should probably be created before you start; here is how you could do that:
buckets = [[0 for col in range(5)] for row in range(10)]
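Putting the pieces together, here is one sketch of the full transformation (the inline rows stand in for the parsed CSV, and the output file name two_mode.csv is my assumption):

```python
import csv

# Stand-in for the parsed input: first field is Project_ID, the rest are names.
rows = [
    ['1', 'Jean', 'Mike'],
    ['2', 'Mike'],
    ['3', 'Joe', 'Sarah', 'Mike', 'Jean', 'Nick'],
    ['4', 'Sarah', 'Mike'],
    ['5', 'Sarah', 'Jean', 'Mike', 'Joe'],
]

# Collect unique names in first-seen order.
projects = [r[0] for r in rows]
names = []
for r in rows:
    for name in r[1:]:
        if name and name not in names:
            names.append(name)

# One row per name, one column per project: 1 if that person worked on it.
matrix = [[1 if name in r[1:] else 0 for r in rows] for name in names]

# Write the two-mode network: projects across the top, names down the side.
with open('two_mode.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([''] + projects)
    for name, row in zip(names, matrix):
        writer.writerow([name] + row)
```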
