Randomly select elements from string in a dataframe - python-3.x

I have a dataframe with 7 string columns:
bul; age; gender; hh; pn; freq_pn; rcrds_to_select
1; 2; 5; 1; ['35784905', '40666303', '47603805', '68229102']; 4; 3
2; 3; 3; 3; ['06299501', '07694901', '35070201']; 3; 2
In the last column I have the number of ids from the "pn" column that I need to select randomly. Example: in the first row I have 4 ids ['35784905', '40666303', '47603805', '68229102'] and I need to select 3 of them at random and remove the one that is not selected. There can be rows with only one id. I came to the conclusion that I need to turn the values into tuples and store them in another column ('pnTuple'). I don't know if this is the right way.
mass_grouped3['pnTuple'] = [tuple(x) for x in mass_grouped3['pn'].values]
I think random.shuffle will do the job, but I have no idea how to implement it in my script. I was thinking of something like this, but it is not working:
for row in mass_grouped3['pnTuple']:
    list = list(mass_grouped3['pnTuple'])
    whitelist = random.shuffle(list)
Any ideas on how to do this selection are appreciated.

You want to randomly select 1 from every row and make the rest 0. Here's one approach: sample the indices and, based on those indices, assign 1, i.e.
idx = pd.DataFrame(np.stack(np.where(df==1))).T.groupby(0).apply(lambda x: x.sample(1)).values
# array([[0, 2],
# [1, 1],
# [2, 0],
# [3, 3]])
ndf = pd.DataFrame(np.zeros(df.shape),columns=df.columns)
ndf.values[idx[:,0],idx[:,1]] = 1
W1 W2 W3 W4
0 0 0 1 0
1 1 0 0 0
2 1 0 0 0
3 0 1 0 0

Welcome to StackOverflow! Hope this helps
Let's go step by step.
First, let's construct our random function that can select 3 items:
>>> import random
>>> random.choices(['35784905', '40666303', '47603805', '68229102'], k=3)
['68229102', '40666303', '35784905']
I have a sample data frame, df, with columns containing the same data as yours:
>>> df
a b
0 12 [35784905, 40666303, 47603805, 68229102]
1 12 [06299501, 07694901, 35070201]
>>> df['b']
0 [35784905, 40666303, 47603805, 68229102]
1 [06299501, 07694901, 35070201]
Name: b, dtype: object
>>> df['b'].map(lambda alist: random.choices(alist, k=3) if len(alist) > 3 else alist)
0 [35784905, 68229102, 35784905]
1 [06299501, 07694901, 35070201]
Name: b, dtype: object
>>> df['b'] = df['b'].map(lambda alist: random.choices(alist, k=3) if len(alist) > 3 else alist)
We use the pandas map operation to apply this data transformation to the whole column.
Note: we are using a lambda function, lambda alist: random.choices(alist, k=3) if len(alist) > 3 else alist, to check that a list has more than 3 items and only then apply the selection; shorter lists are left unchanged.
It might be a little new, but this is a standard way of writing code in Python. It is worth spending some time learning more about Python, lambda functions and pandas.
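One thing to keep in mind is that random.choices samples with replacement, so the same id can appear more than once (as in the first output row above). If each id should be picked at most once and the number of picks should come from the rcrds_to_select column of the question, a minimal sketch along these lines may be closer to what was asked (the two-column frame below is rebuilt from the question's sample data):
import random
import pandas as pd

# rebuild the relevant part of the question's dataframe
mass_grouped3 = pd.DataFrame({
    'pn': [['35784905', '40666303', '47603805', '68229102'],
           ['06299501', '07694901', '35070201']],
    'rcrds_to_select': [3, 2],
})

# random.sample picks k distinct elements; cap k at the list length so
# rows with fewer ids than requested are kept whole
mass_grouped3['pn'] = [
    random.sample(ids, min(int(k), len(ids)))
    for ids, k in zip(mass_grouped3['pn'], mass_grouped3['rcrds_to_select'])
]
print(mass_grouped3)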

Related

Python - Pandas: perform column value based data grouping across separate dataframe chunks

I was handling a large csv file, and came across this problem. I am reading in the csv file in chunks and want to extract sub-dataframes based on values for a particular column.
To explain the problem, here is a minimal version:
The CSV (save it as test1.csv, for example)
1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24
Now, as you can see, if I read the csv in chunks of 5 rows, the first column's values will be distributed across the chunks. What I want to be able to do is load in memory only the rows for a particular value.
I achieved it using the following:
import pandas as pd

list_of_ids = dict()  # this will contain all "id"s and the start and end row index for each id

# read the csv in chunks of 5 rows
for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id', 'val'], iterator=True):
    #print(df_chunk)
    # In each chunk, get the unique id values and add to the list
    for i in df_chunk['id'].unique().tolist():
        if i not in list_of_ids:
            list_of_ids[i] = []  # initially new values do not have the start and end row index
    for i in list_of_ids.keys():  # ---------MARKER 1-----------
        idx = df_chunk[df_chunk['id'] == i].index  # get row index for particular value of id
        if len(idx) != 0:  # if id is in this chunk
            if len(list_of_ids[i]) == 0:  # if the id is new in the final dictionary
                list_of_ids[i].append(idx.tolist()[0])   # start
                list_of_ids[i].append(idx.tolist()[-1])  # end
            else:  # if the id was there in previous chunk
                list_of_ids[i] = [list_of_ids[i][0], idx.tolist()[-1]]  # keep old start, add new end
            #print(df_chunk.iloc[idx, :])
            #print(df_chunk.iloc[list_of_ids[i][0]:list_of_ids[i][-1], :])

print(list_of_ids)

skip = None
rows = None

# Now from the file, I will read only particular id group using following
# I can again use chunksize argument to read the particular group in pieces
for id, se in list_of_ids.items():
    print('Data for id: {}'.format(id))
    skip, rows = se[0], (se[-1] - se[0] + 1)
    for df_chunk in pd.read_csv('test1.csv', chunksize=2, nrows=rows, skiprows=skip, names=['id', 'val'], iterator=True):
        print(df_chunk)
Truncated output from my code:
{1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
Data for id: 1
id val
0 1 10
1 1 11
id val
2 1 12
Data for id: 2
id val
0 2 13
1 2 14
id val
2 2 15
3 2 16
Data for id: 3
id val
0 3 17
1 3 18
What I want to ask is: do we have a better way of doing this? If you consider MARKER 1 in the code, it is bound to be inefficient as the size grows. I did save memory usage, but time still remains a problem. Do we have some existing method for this?
(I am looking for complete code in the answer.)
I suggest you use itertools for this, as follows:
import pandas as pd
import csv
import io
from itertools import groupby, islice
from operator import itemgetter

def chunker(n, iterable):
    """
    From answer: https://stackoverflow.com/a/31185097/4001592
    >>> list(chunker(3, 'ABCDEFG'))
    [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    """
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])

chunk_size = 5

with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    for _, group in groupby(reader, itemgetter(0)):
        for chunk in chunker(chunk_size, group):
            g = [','.join(e) for e in chunk]
            df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
            print(df)
            print('---')
Output (partial)
0 1
0 1 10
1 1 11
2 1 12
---
0 1
0 2 13
1 2 14
2 2 15
3 2 16
---
0 1
0 3 17
1 3 18
2 3 19
3 3 20
---
...
This approach first reads the file in groups by column 1:
for _, group in groupby(reader, itemgetter(0)):
and each group will be read in chunks of 5 rows (this can be changed using chunk_size):
for chunk in chunker(chunk_size, group):
The last part:
g = [','.join(e) for e in chunk]
df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
print(df)
print('---')
creates a suitable string to pass to pandas. Note that itertools.groupby only groups consecutive rows that share the same key, so this assumes the CSV is already sorted (or at least grouped) by the first column, as in the example.
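A small possible simplification, offered here as a sketch under the same assumptions rather than as part of the original answer: since csv.reader already splits each row into fields, the DataFrame can be built directly from the parsed rows instead of re-joining them into CSV text:
import csv
from itertools import groupby, islice
from operator import itemgetter
import pandas as pd

def chunker(n, iterable):
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])

chunk_size = 5
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    for _, group in groupby(reader, itemgetter(0)):
        for chunk in chunker(chunk_size, group):
            # chunk is a list of [id, val] string pairs; cast to int if
            # numeric dtypes are needed downstream
            df = pd.DataFrame(chunk, columns=['id', 'val']).astype(int)
            print(df)
            print('---')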

How to append dataframes from different files, but having same structure?

I have different datasets in a json format, with each file containing the details of a different match but having the same column names. I've isolated the 'Shots' taken by one team in a single match. How should I modify my code to take only the shots of that particular team for different matches?
import json
import pandas as pd

def key_pass(filename):
    with open(filename) as f:
        comp = json.load(f)
    eng = pd.json_normalize(comp)
    for team in eng['possession_team.name'].unique():
        if team != 'Belgium':
            opp = team
    eng = pd.json_normalize(comp).assign(Oppn=opp)
    eng_pan = eng[['shot.statsbomb_xg', 'minute', 'player.name', 'shot.outcome.name', 'shot.key_pass_id', 'location', 'type.name', 'play_pattern.name', 'possession_team.name']]
    # rename returns a new frame by default, so assign the result back
    eng_pan = eng_pan.rename(columns={'shot.statsbomb_xg': 'Statsbomb_xG', 'shot.outcome.name': 'Outcome', 'shot.key_pass_id': 'Keypass_id'})
    total_attempts = eng_pan.loc[(eng_pan['type.name'] == 'Shot') & (eng_pan['possession_team.name'] == 'Belgium')]
    total_attempts.reset_index(drop=True, inplace=True)
    return total_attempts
When I call the function,
total_attempts = key_pass('7584.json')
total_attempts
The output I get is:
Now, if I have to call another file, I need the shots from that file to continue from where the previous file finished.
Should I pass the file names as a list and add a for loop in the function? But then again, how do I append the shots?
You can use the pandas DataFrame append method easily if both df's have the same structure:
(notice the ignore_index parameter)
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
A B
0 1 2
1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df2
A B
0 5 6
1 7 8
df.append(df2, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
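To apply this to the key_pass function from the question across several match files, one option is to collect the per-file results and concatenate them. This is only a sketch: the file names are placeholders, and pd.concat is used instead of append (append is deprecated in recent pandas versions):
import pandas as pd

# hypothetical list of match files; replace with the real file names
filenames = ['7584.json', '7585.json', '7586.json']

all_attempts = pd.concat(
    [key_pass(f) for f in filenames],  # key_pass as defined in the question
    ignore_index=True,                 # renumber rows across files
)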

Using non-zero values from columns in function - pandas

I have the below dataframe and would like to calculate the difference between columns 'animal1' and 'animal2' over their sum within a function, taking into consideration only the values that are bigger than 0 in each of the columns 'animal1' and 'animal2'.
How could I do this?
import pandas as pd

animal1 = pd.Series({'Cat': 4, 'Dog': 0, 'Mouse': 2, 'Cow': 0, 'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3, 'Mouse': 0, 'Cow': 1, 'Chicken': 2})
data = pd.DataFrame({'animal1': animal1, 'animal2': animal2})

def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values in each row are greater than 0 with DataFrame.gt, test this with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
#alternative for not equal 0
#df = data[data.ne(0).all(axis=1)].copy()

print (df)
         animal1  animal2
Cat            4        2
Chicken        3        2

print(animals(df))
Cat
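The same result can also be computed without a helper function; here is a short sketch along the lines of the approach above (boolean masking with gt/all, then the ratio on the filtered rows):
import pandas as pd

animal1 = pd.Series({'Cat': 4, 'Dog': 0, 'Mouse': 2, 'Cow': 0, 'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3, 'Mouse': 0, 'Cow': 1, 'Chicken': 2})
data = pd.DataFrame({'animal1': animal1, 'animal2': animal2})

# keep only rows where both counts are greater than 0
mask = data.gt(0).all(axis=1)
ratio = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
print(ratio[mask].abs().idxmax())  # Cat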

How to compare columns in a pandas dataframe

I have a pandas dataframe that looks like this with "Word" as the column header for all the columns:
Word Word Word Word
0 Nap Nap Nap Cat
1 Cat Cat Cat Flower
2 Peace Kick Kick Go
3 Phone Fin Fin Nap
How can I return only the words that appear in all 4 columns?
Expected Output:
Word
0 Nap
1 Cat
Use apply(set) to turn each column into a set of words
Use set.intersection to find all words in each column's set
Turn it into a list and then a series
pd.Series(list(set.intersection(*df.apply(set))))
0 Cat
1 Nap
dtype: object
We can accomplish the same task with some Python functional magic to get some performance benefit.
pd.Series(list(
set.intersection(*map(set, map(lambda c: df[c].values.tolist(), df)))
))
0 Cat
1 Nap
dtype: object
Timing
Code Below
# assumes numpy is imported as np, `timeit` is from the timeit module,
# and `words` is a predefined pool of candidate words
pir1 = lambda d: pd.Series(list(set.intersection(*d.apply(set))))
pir2 = lambda d: pd.Series(list(set.intersection(*map(set, map(lambda c: d[c].values.tolist(), d)))))

# I took some liberties with @Anton vBR's solution.
vbr = lambda d: pd.Series((lambda x: x.index[x.values == len(d.columns)])(pd.value_counts(d.values.ravel())))

results = pd.DataFrame(
    index=pd.Index([10, 30, 100, 300, 1000, 3000, 10000, 30000]),
    columns='pir1 pir2 vbr'.split()
)

for i in results.index:
    d = pd.concat(dict(enumerate(
        [pd.Series(np.random.choice(words[:i*2], i, False)) for _ in range(4)]
    )), axis=1)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setp, number=100))

results.plot(loglog=True)
Alternative solution (but this requires the values to be unique within each column).
tf = df.stack().value_counts()
df2 = pd.DataFrame(pd.Series(tf)).reset_index()
df2.columns = ["word", "count"]
word count
0 Nap 4
1 Cat 4
2 Fin 2
3 Kick 2
4 Go 1
5 Phone 1
6 Peace 1
7 Flower 1
This can be filtered with df2[df2["count"] == len(df.columns)]["word"]
0 Nap
1 Cat
Name: word, dtype: object
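Another equivalent formulation, offered as a sketch rather than part of the original answers, is to reduce the columns with np.intersect1d; unlike the value_counts approach it also tolerates duplicate words within a column (the column names are made distinct here for clarity, since selecting one of several identically named columns would return a DataFrame rather than a Series):
from functools import reduce
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Word1': ['Nap', 'Cat', 'Peace', 'Phone'],
    'Word2': ['Nap', 'Cat', 'Kick', 'Fin'],
    'Word3': ['Nap', 'Cat', 'Kick', 'Fin'],
    'Word4': ['Cat', 'Flower', 'Go', 'Nap'],
})

# intersect the columns pairwise; the result is sorted and de-duplicated
common = reduce(np.intersect1d, (df[c] for c in df.columns))
print(pd.Series(common))
# 0    Cat
# 1    Nap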

New column in pandas, based on another column's last value

In the dataframe I have this data:
                  Open        High         Low       Close   Volume
Date
2015-05-01  538.429993  539.539978  532.099976  537.900024  1768200
2015-05-04  538.530029  544.070007  535.059998  540.780029  1308000
2015-05-05  538.210022  539.739990  530.390991  530.799988  1383100
2015-05-06  531.239990  532.380005  521.085022  524.219971  1567000
My question is: how do I add a new column and give it a value of 0 if the last close was lower than the present close and 1 if it is higher?
How do I make this work throughout the dataframe?
df['increasing'] = (df['Open'].diff() > 0).astype(int)
or
df['increasing'] = (df['Open'] - df['Open'].shift() > 0).astype(int)
both work, but the former is quicker.
Take, for example,
In [41]: import pandas_datareader.data as pdata
In [42]: df = pdata.get_data_yahoo('AAPL', start='2009-01-02', end='2009-12-31')
In [43]: df.head()
Out[43]:
Open High Low Close Volume Adj Close
Date
2009-01-02 85.880003 91.040001 85.160000 90.750001 186503800 11.933430
2009-01-05 93.170003 96.179998 92.709999 94.580002 295402100 12.437067
2009-01-06 95.950000 97.170001 92.389998 93.020000 322327600 12.231930
2009-01-07 91.809999 92.500001 90.260003 91.010000 188262200 11.967619
2009-01-08 90.430000 93.150002 90.039998 92.699999 168375200 12.189851
diff() returns the difference between adjacent rows:
In [45]: df['Open'].diff().head()
Out[45]:
Date
2009-01-02 NaN
2009-01-05 7.290000
2009-01-06 2.779997
2009-01-07 -4.140001
2009-01-08 -1.379999
Name: Open, dtype: float64
(df['Open'].diff() > 0) returns a boolean-valued Series which is True when the difference is positive:
In [46]: (df['Open'].diff() > 0).head()
Out[46]:
Date
2009-01-02 False
2009-01-05 True
2009-01-06 True
2009-01-07 False
2009-01-08 False
Name: Open, dtype: bool
Calling .astype(int) converts False to 0 and True to 1:
In [47]: (df['Open'].diff() > 0).astype('int').head()
Out[47]:
Date
2009-01-02 0
2009-01-05 1
2009-01-06 1
2009-01-07 0
2009-01-08 0
Name: Open, dtype: int64
The code becomes a bit more complicated if you need to assign
a third possible value, 2, when the difference is 0:
import numpy as np
diff = df['Open'].diff()
conditions = [diff > 0, diff < 0]
choices = [1, 0]
df['increasing'] = np.select(conditions, choices, default=2)
np.select is a generalization of np.where: np.where handles a single condition, while np.select handles multiple conditions. Above, the conditions are diff > 0 and diff < 0, and we wish to assign the values 1 and 0, respectively:
conditions = [diff > 0, diff < 0]
choices = [1, 0]
When neither condition is True, np.select assigns the default value 2:
df['increasing'] = np.select(conditions, choices, default=2)
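Since the question actually mentions the Close column rather than Open, here is a minimal sketch of the same pattern applied to the Close prices from the question's sample data (same convention as above: 1 when the value rose versus the previous row, 0 when it fell, and 2 otherwise, including the first row whose diff is NaN):
import numpy as np
import pandas as pd

close = pd.Series(
    [537.900024, 540.780029, 530.799988, 524.219971],
    index=pd.to_datetime(['2015-05-01', '2015-05-04', '2015-05-05', '2015-05-06']),
    name='Close',
)

diff = close.diff()
increasing = pd.Series(np.select([diff > 0, diff < 0], [1, 0], default=2), index=close.index)
print(increasing)
# 2015-05-01    2   <- no previous value, diff is NaN, falls to the default
# 2015-05-04    1
# 2015-05-05    0
# 2015-05-06    0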
