pandas - restructuring data in a data frame - python-3.x

I have a data frame with data in the following format:
time | name | value
01/01/1970 | A | 1
02/01/1970 | A | 2
03/01/1970 | A | 1
01/01/1970 | B | 5
02/01/1970 | B | 3
I want to change this data to something like
time | A | B
01/01/1970 | 1 | 5
02/01/1970 | 2 | 3
03/01/1970 | 1 | NA
How can I achieve this in pandas? I have tried groupby on the dataframe and then joining, but it's not coming out right.
Thanks in advance.

Use DataFrame.pivot (doc):
import pandas as pd

df = pd.DataFrame(
    {'name': ['A', 'A', 'A', 'B', 'B'],
     'time': ['01/01/1970', '02/01/1970', '03/01/1970', '01/01/1970', '02/01/1970'],
     'value': [1, 2, 1, 5, 3]})
print(df.pivot(index='time', columns='name', values='value'))
yields
name          A    B
time
01/01/1970  1.0  5.0
02/01/1970  2.0  3.0
03/01/1970  1.0  NaN
Note that time is now the index. If you wish to make it a column, call reset_index():
df.pivot(index='time', columns='name', values='value').reset_index()
# name        time    A    B
# 0     01/01/1970  1.0  5.0
# 1     02/01/1970  2.0  3.0
# 2     03/01/1970  1.0  NaN
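Note that DataFrame.pivot raises a ValueError if any (time, name) pair appears more than once. If duplicates are possible in your data, DataFrame.pivot_table with an aggregation function is the usual workaround. A minimal sketch, assuming you want the mean of duplicate entries:
df.pivot_table(index='time', columns='name', values='value', aggfunc='mean')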

Use the .pivot function:
df = pd.DataFrame({'time': [0, 1, 2, 3],
                   'name': ['A', 'A', 'B', 'B'],
                   'value': [10, 20, 30, 40]})
df.pivot(index='time', columns='name', values='value')
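For this frame that yields the following (the NaN holes force the values to float):
name     A     B
time
0     10.0   NaN
1     20.0   NaN
2      NaN  30.0
3      NaN  40.0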

Related

How to change values in a df by indices contained in multiple lists, one list per column

I have lists with all the indices of values to be replaced. I have to change them in 8 different columns with 8 different lists. The replacement could be a simple string.
How can I do it?
I have more than 20 different columns in this df.
E.g.:
list1 = [0,1,2]
list2 =[2,4]
list8 = ...
sustitution = 'no data'
| Column A | Column B |
| -------- | -------- |
| marcos   | peter    |
| Julila   | mike     |
| Fran     | Ramon    |
| Pedri    | Gavi     |
| Olmo     | Torres   |
OUTPUT:
| Column A | Column B |
| -------- | -------- |
| no data | peter |
| no data | mike |
| no data | no data |
| Pedri | Gavi |
| Olmo | no data |
Use DataFrame.loc with zipped lists and column names:
list1 = [0, 1, 2]
list2 = [2, 4]
L = [list1, list2]
cols = ['Column A', 'Column B']
sustitution = 'no data'

for c, i in zip(cols, L):
    df.loc[i, c] = sustitution

print (df)
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
You can use the underlying numpy array:
import numpy as np

list1 = [0, 1, 2]
list2 = [2, 4]
lists = [list1, list2]

# one column position per index in its list
col = np.repeat(np.arange(len(lists)), list(map(len, lists)))
# array([0, 0, 0, 1, 1])
row = np.concatenate(lists)
# array([0, 1, 2, 2, 4])

# note: this relies on df.values returning a view of a single object-dtype
# block; under copy-on-write (pandas 2.x) the assignment may not propagate
df.values[row, col] = 'no data'
Output:
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
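If you have many columns (the question mentions 8 lists and more than 20 columns), a dict mapping each column name to its list of row indices keeps the loc loop readable. A sketch; the entries below are placeholders for your real lists:
replacements = {
    'Column A': [0, 1, 2],  # list1
    'Column B': [2, 4],     # list2
    # ... one entry per column to update
}
for col, idx in replacements.items():
    df.loc[idx, col] = 'no data'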

PySpark join on ID then on year and month from 'date' column

I have 2 PySpark dataframes and want to join on "ID", then on the year of the "date1" and "date2" columns, and then on the month of the same columns.
df1:
ID col1 date1
1 1 2018-01-05
1 2 2018-02-05
2 4 2018-04-05
2 1 2018-05-05
3 1 2019-01-05
3 4 2019-02-05
df2:
ID col2 date2
1 1 2018-01-08
1 1 2018-02-08
2 4 2018-04-08
2 3 2018-05-08
3 1 2019-01-08
3 4 2019-02-08
Expected output:
ID col1 date1 col2 date2
1 1 2018-01-05 1 2018-01-08
1 2 2018-02-05 1 2018-02-08
2 4 2018-04-05 4 2018-04-08
2 1 2018-05-05 3 2018-05-08
3 1 2019-01-05 1 2019-01-08
3 4 2019-02-05 4 2019-02-08
I tried something along the lines of:
df = df1.join(df2, (ID & (df1.F.year(date1) == df2.F.year(date2)) & (df1.F.month(date1) == df2.F.month(date2))
How to join on date's month and year?
You can do it like this:
join_on = (df1.ID == df2.ID) & \
          (F.year(df1.date1) == F.year(df2.date2)) & \
          (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on)
Full example:
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [(1, 1, '2018-01-05'),
     (1, 2, '2018-02-05'),
     (2, 4, '2018-04-05'),
     (2, 1, '2018-05-05'),
     (3, 1, '2019-01-05'),
     (3, 4, '2019-02-05')],
    ['ID', 'col1', 'date1'])
df2 = spark.createDataFrame(
    [(1, 1, '2018-01-08'),
     (1, 1, '2018-02-08'),
     (2, 4, '2018-04-08'),
     (2, 3, '2018-05-08'),
     (3, 1, '2019-01-08'),
     (3, 4, '2019-02-08')],
    ['ID', 'col2', 'date2'])

join_on = (df1.ID == df2.ID) & \
          (F.year(df1.date1) == F.year(df2.date2)) & \
          (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on).drop(df2.ID)
df.show()
# +---+----+----------+----+----------+
# | ID|col1| date1|col2| date2|
# +---+----+----------+----+----------+
# | 1| 1|2018-01-05| 1|2018-01-08|
# | 1| 2|2018-02-05| 1|2018-02-08|
# | 2| 4|2018-04-05| 4|2018-04-08|
# | 2| 1|2018-05-05| 3|2018-05-08|
# | 3| 1|2019-01-05| 1|2019-01-08|
# | 3| 4|2019-02-05| 4|2019-02-08|
# +---+----+----------+----+----------+
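Note that F.year and F.month work on the string columns here only because Spark implicitly casts 'yyyy-MM-dd' strings to dates. If your dates arrive in another string format, cast explicitly with F.to_date first. A sketch, assuming a hypothetical 'dd/MM/yyyy' input:
df1 = df1.withColumn('date1', F.to_date('date1', 'dd/MM/yyyy'))
df2 = df2.withColumn('date2', F.to_date('date2', 'dd/MM/yyyy'))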

Pandas find max column, subtract from another column and replace the value

I have a df like this:
A | B | C | D
14 | 5 | 10 | 5
4 | 7 | 15 | 6
100 | 220 | 6 | 7
For each row, I want to find the max value across columns A, B, C, subtract column D from it, and replace it.
Expected result:
A | B | C | D
9 | 5 | 10 | 5
4 | 7 | 9 | 6
100 | 213 | 6 | 7
So for the first row, it would select 14 (the max out of 14, 5, 10), subtract column D from it (14 - 5 = 9) and replace the initial value 14 with the result 9.
I know how to find the max value of A, B, C and subtract D from it, but I am stuck on the replacing part.
I thought of putting the result in another column called E, then finding the max of A, B, C again and replacing it with column E, but that makes no sense since I would be attempting to assign a value to a function call. Is there any other option to do this?
# Example df
list_columns = ['A', 'B', 'C', 'D']
list_data = [[14, 5, 10, 5], [4, 7, 15, 6], [100, 220, 6, 7]]
df = pd.DataFrame(columns=list_columns, data=list_data)

# Calculate the max and subtract
df['e'] = df[['A', 'B']].max(axis=1) - df['D']

# To replace, maybe something like this. But this line makes no sense since it's backwards
df[['A', 'B', 'C']].max(axis=1) = df['D']
Use DataFrame.mask to replace only the maximal values, matching them by comparing all values of the filtered columns against the row-wise maxima:
cols = ['A', 'B', 'C']
s = df[cols].max(axis=1)
df[cols] = df[cols].mask(df[cols].eq(s, axis=0), s - df['D'], axis=0)
print (df)
A B C D
0 9 5 10 5
1 4 7 9 6
2 100 213 6 7
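If you prefer positional indexing, an alternative is to take argmax over the underlying numpy array and subtract in place. Note this behaves differently from the mask solution on ties: it updates only the first maximum in each row, not all tied maxima. A sketch under that assumption:
import numpy as np

cols = ['A', 'B', 'C']
vals = df[cols].to_numpy()
rows = np.arange(len(df))
max_pos = vals.argmax(axis=1)  # position of the (first) row-wise max
vals[rows, max_pos] -= df['D'].to_numpy()
df[cols] = vals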

Calculate mean per few columns in Pandas Dataframe

I have a Pandas dataframe, Data:
ID | A1| A2| B1| B2
ID1| 2 | 1 | 3 | 7
ID2| 4 | 6 | 5 | 3
I want to calculate the mean of columns (A1 and A2), and (B1 and B2), separately and row-wise. My desired output:
ID | A1A2 mean | B1B2 mean
ID1| 1.5 | 5
ID2| 5 | 4
I can take the mean of all columns together, but cannot find any function to get my desired output.
Is there any built-in method in Python?
Use DataFrame.groupby with a lambda function to get the first letter of each column name for the mean; also, if the first column is not the index, use DataFrame.set_index:
df=df.set_index('ID').groupby(lambda x: x[0], axis=1).mean().add_suffix('_mean').reset_index()
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
Another solution is to extract the first letter of the column names by indexing with str[0]:
df = df.set_index('ID')
print (df.columns.str[0])
Index(['A', 'A', 'B', 'B'], dtype='object')
df = df.groupby(df.columns.str[0], axis=1).mean().add_suffix('_mean').reset_index()
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
Or:
df = (df.set_index('ID')
        .groupby(df.columns[1:].str[0], axis=1)
        .mean()
        .add_suffix('_mean')
        .reset_index())
To verify the solution:
a = df.filter(like='A').mean(axis=1)
b = df.filter(like='B').mean(axis=1)
df = df[['ID']].assign(A_mean=a, B_mean=b)
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
EDIT:
If the columns have different names, specify them in lists:
a = df[['A1','A2']].mean(axis=1)
b = df[['B1','B2']].mean(axis=1)
df = df[['ID']].assign(A_mean=a, B_mean=b)
print (df)
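    ID  A_mean  B_mean
0  ID1     1.5     5.0
1  ID2     5.0     4.0
Also note that groupby(..., axis=1) is deprecated in pandas 2.x. Transposing first is the forward-compatible equivalent; a sketch:
df = (df.set_index('ID')
        .T
        .groupby(lambda x: x[0])
        .mean()
        .T
        .add_suffix('_mean')
        .reset_index())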

Filtering columns in a pandas dataframe

I have a dataframe with the following column.
A
55B
<lhggkkk>
66c
dggfhhjjjj
I need to filter the records that start with a number (such as 55B and 66c) separately from the others. Can anyone please help?
Try:
import pandas as pd
df = pd.DataFrame()
df['A'] = ['55B','<lhggkkk>','66c','dggfhhjjjj']
df['B'] = df['A'].apply(lambda x: x[0].isdigit())
print(df)
A B
0 55B True
1 <lhggkkk> False
2 66c True
3 dggfhhjjjj False
Check whether the first character is a digit, then boolean-index, i.e.
mask = df['A'].str[0].str.isdigit()
one = df[mask]
two = df[~mask]
print(one,'\n',two)
A
0 55B
2 66c
A
1 <lhggkkk>
3 dggfhhjjjj
To check whether the first character of the string is a digit:
df['A'].str[0].str.isdigit()
So:
import pandas as pd
import numpy as np
df:
            A
0         55B
1   <lhggkkk>
2         66c
3  dggfhhjjjj
df['Result'] = np.where(df['A'].str[0].str.isdigit(), 'Numbers', 'Others')
df:
            A   Result
0         55B  Numbers
1   <lhggkkk>   Others
2         66c  Numbers
3  dggfhhjjjj   Others
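If the column can contain NaN, the plain .str[0].str.isdigit() chain propagates NaN into the mask and boolean indexing then raises. Series.str.match with na=False sidesteps that, since match anchors at the start of the string. A sketch, assuming the same df:
mask = df['A'].str.match(r'\d', na=False)
numbers = df[mask]
others = df[~mask]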
