pandas - restructuring data in a data frame - python-3.x

I have a data frame with data in the following format:
time | name | value
01/01/1970 | A | 1
02/01/1970 | A | 2
03/01/1970 | A | 1
01/01/1970 | B | 5
02/01/1970 | B | 3
I want to change this data to something like
time | A | B
01/01/1970 | 1 | 5
02/01/1970 | 2 | 3
03/01/1970 | 1 | NA
How can I achieve this in pandas? I have tried groupby on the dataframe and then joining, but it's not coming out right.
Thanks in advance.

Use DataFrame.pivot (doc):
import pandas as pd

df = pd.DataFrame(
    {'name': ['A', 'A', 'A', 'B', 'B'],
     'time': ['01/01/1970', '02/01/1970', '03/01/1970', '01/01/1970', '02/01/1970'],
     'value': [1, 2, 1, 5, 3]})
print(df.pivot(index='time', columns='name', values='value'))
yields
name          A    B
time
01/01/1970  1.0  5.0
02/01/1970  2.0  3.0
03/01/1970  1.0  NaN
Note that time is now the index. If you wish to make it a column, call reset_index():
df.pivot(index='time', columns='name', values='value').reset_index()
# name        time    A    B
# 0     01/01/1970  1.0  5.0
# 1     02/01/1970  2.0  3.0
# 2     03/01/1970  1.0  NaN
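Note that DataFrame.pivot raises a ValueError if any (time, name) pair appears more than once. If duplicates are possible in your data, DataFrame.pivot_table with an aggregation function is the usual workaround. A minimal sketch, assuming you want the mean of duplicate entries:
df.pivot_table(index='time', columns='name', values='value', aggfunc='mean')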

Use the .pivot function:
df = pd.DataFrame({'time': [0, 1, 2, 3],
                   'name': ['A', 'A', 'B', 'B'],
                   'value': [10, 20, 30, 40]})
df.pivot(index='time', columns='name', values='value')
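For this frame that yields the following (the NaN holes force the values to float):
name     A     B
time
0     10.0   NaN
1     20.0   NaN
2      NaN  30.0
3      NaN  40.0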

Related

How to change values in a df by indices contained in multiple lists, one list per column

I have lists with all the indices of values to be replaced. I have to change them in 8 different columns with 8 different lists. The replacement could be a simple string.
How can I do it?
I have more than 20 different columns in this df.
E.g.:
list1 = [0,1,2]
list2 =[2,4]
list8 = ...
sustitution = 'no data'
| Column A | Column B |
| -------- | -------- |
| marcos   | peter    |
| Julila   | mike     |
| Fran     | Ramon    |
| Pedri    | Gavi     |
| Olmo     | Torres   |
OUTPUT:
| Column A | Column B |
| -------- | -------- |
| no data | peter |
| no data | mike |
| no data | no data |
| Pedri | Gavi |
| Olmo | no data |
Use DataFrame.loc with zipped lists and column names:
list1 = [0, 1, 2]
list2 = [2, 4]
L = [list1, list2]
cols = ['Column A', 'Column B']
sustitution = 'no data'

for c, i in zip(cols, L):
    df.loc[i, c] = sustitution

print (df)
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
You can use the underlying numpy array:
import numpy as np

list1 = [0, 1, 2]
list2 = [2, 4]
lists = [list1, list2]

# one column position per index in its list
col = np.repeat(np.arange(len(lists)), list(map(len, lists)))
# array([0, 0, 0, 1, 1])
row = np.concatenate(lists)
# array([0, 1, 2, 2, 4])

# note: this relies on df.values returning a view of a single object-dtype
# block; under copy-on-write (pandas 2.x) the assignment may not propagate
df.values[row, col] = 'no data'
Output:
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
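If you have many columns (the question mentions 8 lists and more than 20 columns), a dict mapping each column name to its list of row indices keeps the loc loop readable. A sketch; the entries below are placeholders for your real lists:
replacements = {
    'Column A': [0, 1, 2],  # list1
    'Column B': [2, 4],     # list2
    # ... one entry per column to update
}
for col, idx in replacements.items():
    df.loc[idx, col] = 'no data'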

PySpark join on ID then on year and month from 'date' column

I have 2 PySpark dataframes and want to join on "ID", then on the year of the "date1" and "date2" columns, and then on the month of the same columns.
df1:
ID col1 date1
1 1 2018-01-05
1 2 2018-02-05
2 4 2018-04-05
2 1 2018-05-05
3 1 2019-01-05
3 4 2019-02-05
df2:
ID col2 date2
1 1 2018-01-08
1 1 2018-02-08
2 4 2018-04-08
2 3 2018-05-08
3 1 2019-01-08
3 4 2019-02-08
Expected output:
ID col1 date1 col2 date2
1 1 2018-01-05 1 2018-01-08
1 2 2018-02-05 1 2018-02-08
2 4 2018-04-05 4 2018-04-08
2 1 2018-05-05 3 2018-05-08
3 1 2019-01-05 1 2019-01-08
3 4 2019-02-05 4 2019-02-08
I tried something along the lines of:
df = df1.join(df2, (ID & (df1.F.year(date1) == df2.F.year(date2)) & (df1.F.month(date1) == df2.F.month(date2))
How to join on date's month and year?
You can do it like this:
join_on = (df1.ID == df2.ID) & \
          (F.year(df1.date1) == F.year(df2.date2)) & \
          (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on)
Full example:
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [(1, 1, '2018-01-05'),
     (1, 2, '2018-02-05'),
     (2, 4, '2018-04-05'),
     (2, 1, '2018-05-05'),
     (3, 1, '2019-01-05'),
     (3, 4, '2019-02-05')],
    ['ID', 'col1', 'date1'])
df2 = spark.createDataFrame(
    [(1, 1, '2018-01-08'),
     (1, 1, '2018-02-08'),
     (2, 4, '2018-04-08'),
     (2, 3, '2018-05-08'),
     (3, 1, '2019-01-08'),
     (3, 4, '2019-02-08')],
    ['ID', 'col2', 'date2'])

join_on = (df1.ID == df2.ID) & \
          (F.year(df1.date1) == F.year(df2.date2)) & \
          (F.month(df1.date1) == F.month(df2.date2))
df = df1.join(df2, join_on).drop(df2.ID)
df.show()
# +---+----+----------+----+----------+
# | ID|col1| date1|col2| date2|
# +---+----+----------+----+----------+
# | 1| 1|2018-01-05| 1|2018-01-08|
# | 1| 2|2018-02-05| 1|2018-02-08|
# | 2| 4|2018-04-05| 4|2018-04-08|
# | 2| 1|2018-05-05| 3|2018-05-08|
# | 3| 1|2019-01-05| 1|2019-01-08|
# | 3| 4|2019-02-05| 4|2019-02-08|
# +---+----+----------+----+----------+
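Note that F.year and F.month work on the string columns here only because Spark implicitly casts 'yyyy-MM-dd' strings to dates. If your dates arrive in another string format, cast explicitly with F.to_date first. A sketch, assuming a hypothetical 'dd/MM/yyyy' input:
df1 = df1.withColumn('date1', F.to_date('date1', 'dd/MM/yyyy'))
df2 = df2.withColumn('date2', F.to_date('date2', 'dd/MM/yyyy'))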

Pandas find max column, subtract from another column and replace the value

I have a df like this:
A | B | C | D
14 | 5 | 10 | 5
4 | 7 | 15 | 6
100 | 220 | 6 | 7
For each row, I want to find the max value across columns A, B, C, subtract column D from it, and replace it.
Expected result:
A | B | C | D
9 | 5 | 10 | 5
4 | 7 | 9 | 6
100 | 213 | 6 | 7
So for the first row, it would select 14 (the max out of 14, 5, 10), subtract column D from it (14 - 5 = 9) and replace the initial value 14 with the result 9.
I know how to find the max value of A, B, C and subtract D from it, but I am stuck on the replacing part.
I thought of putting the result in another column called E, then finding the max of A, B, C again and replacing it with column E, but that makes no sense since I would be attempting to assign a value to a function call. Is there any other option to do this?
# Example df
list_columns = ['A', 'B', 'C', 'D']
list_data = [[14, 5, 10, 5], [4, 7, 15, 6], [100, 220, 6, 7]]
df = pd.DataFrame(columns=list_columns, data=list_data)

# Calculate the max and subtract
df['e'] = df[['A', 'B']].max(axis=1) - df['D']

# To replace, maybe something like this. But this line makes no sense since it's backwards
df[['A', 'B', 'C']].max(axis=1) = df['D']
Use DataFrame.mask to replace only the maximal values, matching them by comparing all values of the filtered columns against the row-wise maxima:
cols = ['A', 'B', 'C']
s = df[cols].max(axis=1)
df[cols] = df[cols].mask(df[cols].eq(s, axis=0), s - df['D'], axis=0)
print (df)
A B C D
0 9 5 10 5
1 4 7 9 6
2 100 213 6 7
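If you prefer positional indexing, an alternative is to take argmax over the underlying numpy array and subtract in place. Note this behaves differently from the mask solution on ties: it updates only the first maximum in each row, not all tied maxima. A sketch under that assumption:
import numpy as np

cols = ['A', 'B', 'C']
vals = df[cols].to_numpy()
rows = np.arange(len(df))
max_pos = vals.argmax(axis=1)  # position of the (first) row-wise max
vals[rows, max_pos] -= df['D'].to_numpy()
df[cols] = vals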

Calculate mean per few columns in Pandas Dataframe

I have a Pandas dataframe, Data:
ID | A1| A2| B1| B2
ID1| 2 | 1 | 3 | 7
ID2| 4 | 6 | 5 | 3
I want to calculate the mean of columns (A1 and A2), and (B1 and B2), separately and row-wise. My desired output:
ID | A1A2 mean | B1B2 mean
ID1| 1.5 | 5
ID2| 5 | 4
I can take the mean of all columns together, but cannot find any function to get my desired output.
Is there any built-in method in Python?
Use DataFrame.groupby with a lambda function to get the first letter of each column name for the mean; also, if the first column is not the index, use DataFrame.set_index:
df=df.set_index('ID').groupby(lambda x: x[0], axis=1).mean().add_suffix('_mean').reset_index()
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
Another solution is to extract the first letter of the column names by indexing with str[0]:
df = df.set_index('ID')
print (df.columns.str[0])
Index(['A', 'A', 'B', 'B'], dtype='object')
df = df.groupby(df.columns.str[0], axis=1).mean().add_suffix('_mean').reset_index()
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
Or:
df = (df.set_index('ID')
        .groupby(df.columns[1:].str[0], axis=1)
        .mean()
        .add_suffix('_mean')
        .reset_index())
To verify the solution:
a = df.filter(like='A').mean(axis=1)
b = df.filter(like='B').mean(axis=1)
df = df[['ID']].assign(A_mean=a, B_mean=b)
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
EDIT:
If the columns have different names, specify them in lists:
a = df[['A1','A2']].mean(axis=1)
b = df[['B1','B2']].mean(axis=1)
df = df[['ID']].assign(A_mean=a, B_mean=b)
print (df)
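    ID  A_mean  B_mean
0  ID1     1.5     5.0
1  ID2     5.0     4.0
Also note that groupby(..., axis=1) is deprecated in pandas 2.x. Transposing first is the forward-compatible equivalent; a sketch:
df = (df.set_index('ID')
        .T
        .groupby(lambda x: x[0])
        .mean()
        .T
        .add_suffix('_mean')
        .reset_index())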

Filtering columns in a pandas dataframe

I have a dataframe with the following column.
A
55B
<lhggkkk>
66c
dggfhhjjjj
I need to filter the records that start with a number (such as 55B and 66c) separately from the others. Can anyone please help?
Try:
import pandas as pd
df = pd.DataFrame()
df['A'] = ['55B','<lhggkkk>','66c','dggfhhjjjj']
df['B'] = df['A'].apply(lambda x: x[0].isdigit())
print(df)
A B
0 55B True
1 <lhggkkk> False
2 66c True
3 dggfhhjjjj False
Check whether the first character is a digit, then boolean-index, i.e.
mask = df['A'].str[0].str.isdigit()
one = df[mask]
two = df[~mask]
print(one,'\n',two)
A
0 55B
2 66c
A
1 <lhggkkk>
3 dggfhhjjjj
To check whether the first character of the string is a digit:
df['A'].str[0].str.isdigit()
So:
import pandas as pd
import numpy as np
df:
            A
0         55B
1   <lhggkkk>
2         66c
3  dggfhhjjjj
df['Result'] = np.where(df['A'].str[0].str.isdigit(), 'Numbers', 'Others')
df:
            A   Result
0         55B  Numbers
1   <lhggkkk>   Others
2         66c  Numbers
3  dggfhhjjjj   Others
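If the column can contain NaN, the plain .str[0].str.isdigit() chain propagates NaN into the mask and boolean indexing then raises. Series.str.match with na=False sidesteps that, since match anchors at the start of the string. A sketch, assuming the same df:
mask = df['A'].str.match(r'\d', na=False)
numbers = df[mask]
others = df[~mask]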
