How to convert a column containing lists into separate columns in a pandas DataFrame? [duplicate]

This question already has answers here:
Split a Pandas column of lists into multiple columns
(11 answers)
How to convert string representation of list to a list
(19 answers)
Closed 3 years ago.
I've a data frame and one of its columns contains a list.
   A       B
0  5  [3, 4]
1  4  [1, 1]
2  1  [7, 7]
3  3  [0, 2]
4  5  [3, 3]
5  4  [2, 2]
The output should look like this:
   A  x  y
0  5  3  4
1  4  1  1
2  1  7  7
3  3  0  2
4  5  3  3
5  4  2  2
I have tried the options that I found here, but they are not working.

import pandas as pd

df = pd.DataFrame(data={"A": [0, 1],
                        "B": [[3, 4], [1, 1]]})
# take the first and second element of each list as new columns
df['x'] = df['B'].apply(lambda x: x[0])
df['y'] = df['B'].apply(lambda x: x[1])
df.drop(['B'], axis=1, inplace=True)
A x y
0 0 3 4
1 1 1 1
In case the list is stored as a string:
from ast import literal_eval
df = pd.DataFrame(data={"A": [0, 1],
                        "B": ['[3,4]', '[1,1]']})
# literal_eval turns the string '[3,4]' into the list [3, 4]
df['x'] = df['B'].apply(lambda x: literal_eval(x)[0])
df['y'] = df['B'].apply(lambda x: literal_eval(x)[1])
df.drop(['B'], axis=1, inplace=True)
A third way (credit goes to @anky_91):
df = pd.DataFrame(data={"A": [0, 1],
                        "B": ['[3,4]', '[1,1]']})
df["B"] = df["B"].apply(literal_eval)
# join returns a new frame, so assign the result; selecting only 'A' already leaves 'B' out
df = df[['A']].join(pd.DataFrame(df["B"].values.tolist(), columns=['x', 'y'], index=df.index))
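The approaches above can be folded into one small function. This is only a sketch: `split_list_column` is a hypothetical helper name, not a pandas API, and it assumes every list in the column has the same length as `new_names`.

```python
from ast import literal_eval

import pandas as pd


def split_list_column(df, column, new_names):
    """Hypothetical helper: expand a column of lists into separate columns."""
    # Parse string representations such as '[3, 4]'; leave real lists untouched.
    parsed = df[column].apply(lambda v: literal_eval(v) if isinstance(v, str) else v)
    # Build one new column per list position, aligned on the original index.
    expanded = pd.DataFrame(parsed.tolist(), columns=new_names, index=df.index)
    return df.drop(columns=[column]).join(expanded)


df = pd.DataFrame({"A": [5, 4, 1, 3, 5, 4],
                   "B": [[3, 4], [1, 1], [7, 7], [0, 2], [3, 3], [2, 2]]})
result = split_list_column(df, "B", ["x", "y"])
print(result)
```

The same call works when `B` holds strings like `'[3,4]'`, because each value is parsed before it is expanded.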

Related

python tuple compare with specific number

I have this piece of code
import itertools
values = [1, 2, 3, 4]
per = itertools.permutations(values, 2)
hyp = 3
for val in per:
    print(*val)
Output:
1 2
1 3
1 4
2 1
2 3
2 4
3 1
3 2
3 4
4 1
4 2
4 3
I want to compare each tuple with the value of hyp (e.g. 3). If the first element of a tuple is less than or equal to hyp, the tuple is kept; otherwise it is discarded.
In this case the tuples (4,1), (4,2), (4,3) should be removed.
In other words, the pairs are selected based on the value of hyp.
If hyp = 2, the output for the same values list should look like this:
1 2
1 3
1 4
2 1
2 3
2 4
I am not sure whether I explained my problem clearly. Let me know if it is unclear.
This will do it. You just need to take the first element (index 0) of each tuple and compare it to hyp:
import itertools
values = [1, 2, 3, 4]
per = itertools.permutations(values, 2)
hyp = 3
for tup in per:
    if tup[0] <= hyp:
        print(*tup)
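If the kept pairs are needed as data rather than printed, the same filter can be written as a list comprehension. A minimal sketch, where `pairs_up_to` is a hypothetical function name chosen for illustration:

```python
import itertools


def pairs_up_to(values, hyp):
    """Keep the 2-permutations whose first element is <= hyp."""
    return [pair for pair in itertools.permutations(values, 2) if pair[0] <= hyp]


values = [1, 2, 3, 4]
kept = pairs_up_to(values, 3)
for pair in kept:
    print(*pair)
```

Calling `pairs_up_to(values, 2)` keeps only the pairs that start with 1 or 2, which matches the second expected output in the question.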

Increasing iteration speed

Good afternoon,
I'm iterating through a huge DataFrame (104062 x 20) with the following code:
import pandas as pd
df_tot = pd.read_csv("C:\\Users\\XXXXX\\Desktop\\XXXXXXX\\LOGS\\DF_TOT.txt", header=None)
# raw strings avoid invalid-escape warnings for the regex patterns
df_tot = df_tot.replace(r"\[", "", regex=True)
df_tot = df_tot.replace(r"\]", "", regex=True)
df_tot = df_tot.replace(r"'", "", regex=True)
i = 0
while i < len(df_tot):
    to_compare = df_tot.iloc[i].tolist()
    for j in range(len(df_tot)):
        if to_compare == df_tot.iloc[j].tolist():
            if i == j:
                print('Matched itself.')
            else:
                print('MATCH FOUND - row: {} --- match row: {}'.format(i, j))
    i += 1
I am looking to reduce the time spent on each iteration as much as possible, since this code performs 104062^2 comparisons (roughly ten billion).
With my computing power, comparing to_compare against the whole DataFrame takes around 26 seconds.
To be clear, the whole code can be replaced with faster constructs if needed.
As usual, thanks in advance.
As far as I understand, you just want to find duplicated rows.
Sample data (the last 2 rows are duplicates of earlier rows):
In [1]: df = pd.DataFrame([[1,2], [3,4], [5,6], [7,8], [1,2], [5,6]], columns=['a', 'b'])
df
Out[1]:
a b
0 1 2
1 3 4
2 5 6
3 7 8
4 1 2
5 5 6
This will return all duplicated rows:
In [2]: df[df.duplicated(keep=False)]
Out[2]:
a b
0 1 2
2 5 6
4 1 2
5 5 6
And indexes, grouped by duplicated row:
In [3]: df[df.duplicated(keep=False)].reset_index().groupby(list(df.columns), as_index=False)['index'].apply(list)
Out[3]: a b
1 2 [0, 4]
5 6 [2, 5]
You can also just remove duplicates from dataframe:
In [4]: df.drop_duplicates()
Out[4]:
a b
0 1 2
1 3 4
2 5 6
3 7 8
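For comparison with the nested loop in the question, here is a sketch of the same idea in plain Python: hash every row once and collect the index labels that share identical contents. It is a single pass over the rows instead of a row-by-row comparison. The names `positions` and `matches` are illustrative only, and the sample frame is the one from this answer.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [1, 2], [5, 6]], columns=['a', 'b'])

# Map each row (as a tuple of its values) to the index labels where it occurs.
positions = defaultdict(list)
for idx, row in zip(df.index, df.itertuples(index=False, name=None)):
    positions[row].append(idx)

# Keep only the rows that occur more than once.
matches = {row: idxs for row, idxs in positions.items() if len(idxs) > 1}
for row, idxs in matches.items():
    print('MATCH FOUND - row values: {} --- rows: {}'.format(row, idxs))
```

In practice `df.duplicated(keep=False)` is usually the more convenient choice; the sketch only makes explicit why no pairwise comparison is needed.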

Dropping a column by id removes all columns with the same name in a dataframe

import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df = pd.concat([df1,df2],axis=1)
Let's look at the concatenated df: the first and third columns share the same column name A.
df
    A  B   A  C
0  14  1  14  5
1   4  2   4  6
2   5  3   5  7
3   4  4   4  8
I want to get the following format.
df
    A  B  C
0  14  1  5
1   4  2  6
2   5  3  7
3   4  4  8
Drop column by id.
result = df.drop(df.columns[2],axis=1)
result
B C
0 1 5
1 2 6
2 3 7
3 4 8
I can get what I expect this way:
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df2 = df2.drop(df2.columns[0],axis=1)
df = pd.concat([df1,df2],axis=1)
It is strange that both the first and third columns are removed when I drop a single column by id.
1. Please explain why the dataframe behaves this way.
2. How can I remove the third column while keeping the first one?
Here's a way using indexes:
index_to_drop = 2
# get indexes to keep
col_idxs = [en for en, _ in enumerate(df.columns) if en != index_to_drop]
# subset the df
df = df.iloc[:,col_idxs]
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
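On the first question: `df.columns[2]` evaluates to the label 'A', and `drop` removes every column carrying that label, so the call never sees the position. The sketch below shows this, plus an alternative that keeps the first occurrence of each column name using `Index.duplicated`. The variable names `dropped_by_label` and `deduped` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [14, 4, 5, 4], "B": [1, 2, 3, 4]})
df2 = pd.DataFrame({"A": [14, 4, 5, 4], "C": [5, 6, 7, 8]})
df = pd.concat([df1, df2], axis=1)

# df.columns[2] is just the label 'A'; drop() then removes every column named 'A'.
label = df.columns[2]
dropped_by_label = df.drop(label, axis=1)

# Keep only the first occurrence of each column name.
deduped = df.loc[:, ~df.columns.duplicated()]
print(deduped)
```

This alternative fits the example because the two A columns hold the same values; when they differ, selecting by position with `iloc` is the safer route.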

How can I calculate population in pandas?

I have a data set like this:
S.No.,Year of birth,year of death
1, 1, 5
2, 3, 6
3, 2, -
4, 5, 7
I need to calculate the population alive in each year, for example:
year,population
1 1
2 2
3 3
4 3
5 4
6 3
7 2
8 1
How can I solve this in pandas?
I am not very experienced with pandas.
Any help would be appreciated.
First, choose a final year to use where the year of death is missing; this solution uses 8.
Then convert the year of death values to numeric and replace the missing values with that year. The first solution repeats each row once per year lived (the difference between the death and birth columns, via Index.repeat), offsets the birth year with GroupBy.cumcount, and counts the years with Series.value_counts:
# if working with real calendar years
# today_year = pd.to_datetime('now').year
today_year = 8
df['year of death'] = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year)
df = df.loc[df.index.repeat(df['year of death'].add(1).sub(df['Year of birth']).astype(int))]
df['Year of birth'] += df.groupby(level=0).cumcount()
df1 = (df['Year of birth'].value_counts()
.sort_index()
.rename_axis('year')
.reset_index(name='population'))
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
Another solution uses a list comprehension with range to repeat the years:
# if working with real calendar years
# today_year = pd.to_datetime('now').year
today_year = 8
# cast to int: fillna leaves a float column, and range() needs integers
s = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year).astype(int)
L = [x for start, end in zip(df['Year of birth'], s) for x in range(start, end + 1)]
df1 = (pd.Series(L).value_counts()
.sort_index()
.rename_axis('year')
.reset_index(name='population'))
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
Similar to the previous solution, except that a Counter builds the dictionary for the final DataFrame:
from collections import Counter
# if working with real calendar years
# today_year = pd.to_datetime('now').year
today_year = 8
# cast to int: fillna leaves a float column, and range() needs integers
s = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year).astype(int)
d = Counter([x for start, end in zip(df['Year of birth'], s) for x in range(start, end + 1)])
print (d)
Counter({5: 4, 3: 3, 4: 3, 6: 3, 2: 2, 7: 2, 1: 1, 8: 1})
df1 = pd.DataFrame({'year':list(d.keys()),
'population':list(d.values())})
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
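The counts can also be cross-checked directly from the definition: a person is alive in year y when birth <= y <= death. The sketch below rebuilds the sample data from the question and assumes, as the answers above do, that a missing year of death means the person is still alive in the final year (8 here). The names `last_year`, `birth`, `death` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Year of birth": [1, 3, 2, 5],
                   "year of death": [5, 6, "-", 7]})

last_year = 8  # assumed final year, used where the year of death is missing
birth = df["Year of birth"].astype(int)
death = pd.to_numeric(df["year of death"], errors="coerce").fillna(last_year).astype(int)

# For every year, count the people whose lifespan covers that year.
years = np.arange(birth.min(), last_year + 1)
population = [int(((birth <= y) & (death >= y)).sum()) for y in years]

result = pd.DataFrame({"year": years, "population": population})
print(result)
```

This does one vectorised comparison per year, so it suits a modest range of years; for long ranges the value_counts approaches above do less work.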

Placing n rows of a pandas dataframe into their own dataframe

I have a large dataframe with many rows and columns.
An example of the structure is:
import numpy as np
import pandas as pd
a = np.random.rand(6,3)
df = pd.DataFrame(a)
I'd like to split the DataFrame into separate data frames, each consisting of 3 rows.
You can use groupby:
g = df.groupby(np.arange(len(df)) // 3)
for n, grp in g:
    print(grp)
0 1 2
0 0.278735 0.609862 0.085823
1 0.836997 0.739635 0.866059
2 0.691271 0.377185 0.225146
0 1 2
3 0.435280 0.700900 0.700946
4 0.796487 0.018688 0.700566
5 0.900749 0.764869 0.253200
To get it into a handy dictionary:
mydict = {k: v for k, v in g}
You can use the numpy.split() method:
In [8]: df = pd.DataFrame(np.random.rand(9, 3))
In [9]: df
Out[9]:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
In [10]: for x in np.split(df, len(df)//3):
...: print(x)
...:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
0 1 2
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
0 1 2
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
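A further option is plain positional slicing with `iloc`. It also copes with a row count that is not a multiple of 3 (the last piece is simply shorter), whereas `np.split` raises an error when the split is not even. A minimal sketch; `chunk_size` and `chunks` are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(9, 3))

chunk_size = 3
# Slice consecutive blocks of chunk_size rows; original index labels are kept.
chunks = [df.iloc[start:start + chunk_size] for start in range(0, len(df), chunk_size)]

for chunk in chunks:
    print(chunk)
```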
