Can we separate data using Unique ID into the following format? - python-3.x

Current Format:
UNIQUE ID  NAME   AGE  DEP  RANK
001        John   10   4th  1
002        Priya  11   4th  2
003        Jack   15   5th  2
004        Jill   14   5th  1
Expected Format:
UNIQUE ID  NAME   COLUMN_NO
001        John   1
001        10     2
001        4th    3
001        1      4
002        Priya  1
002        11     2
002        4th    3
002        2      4

My starting point:
>>> df
UNIQUE ID NAME AGE DEP RANK
0 1 John 10 4th 1
1 2 Priya 11 4th 2
2 3 Jack 15 5th 2
3 4 Jill 14 5th 1
The basic transformation you need is provided by df.stack, which results in:
0 UNIQUE ID 1
NAME John
AGE 10
DEP 4th
RANK 1
1 UNIQUE ID 2
NAME Priya
[...]
However, you want column UNIQUE ID to be treated separately. This can be accomplished by making it the index:
>>> df.set_index('UNIQUE ID').stack()
UNIQUE ID
1 NAME John
AGE 10
DEP 4th
RANK 1
2 NAME Priya
AGE 11
DEP 4th
RANK 2
The last missing bit is the column names: you want them renamed to numbers. This can be accomplished in two different ways: a) by re-assigning df.columns (after having moved column UNIQUE ID to the index first):
df = df.set_index('UNIQUE ID')
df.columns = range(1, 5)
or b) by using df.rename on the columns:
df = df.set_index('UNIQUE ID')
df = df.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
And finally you can convert the resulting Series back to a DataFrame. The most elegant way to get COLUMN NO in the right place is to use df.rename_axis before stacking. All together as one expression (though it is possibly better to split it up):
>>> (df.set_index('UNIQUE ID')
.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
.rename_axis('COLUMN NO', axis=1)
.stack()
.to_frame('NAME')
.reset_index())
UNIQUE ID COLUMN NO NAME
0 1 1 John
1 1 2 10
2 1 3 4th
3 1 4 1
4 2 1 Priya
5 2 2 11
6 2 3 4th
7 2 4 2
8 3 1 Jack
9 3 2 15
10 3 3 5th
11 3 4 2
12 4 1 Jill
13 4 2 14
14 4 3 5th
15 4 4 1
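An equivalent route via df.melt (sketched here as an alternative, not part of the original answer): renaming the columns to numbers first avoids a clash, since melt refuses a value_name that matches an existing column; melt emits the rows column by column, hence the sort_values to restore row-major order:
>>> (df.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
.melt(id_vars='UNIQUE ID', var_name='COLUMN NO', value_name='NAME')
.sort_values(['UNIQUE ID', 'COLUMN NO'])
.reset_index(drop=True))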
Things left out: reading the data, and preserving the correct dtype: UNIQUE ID only looks numeric, but it has leading zeros that probably should be preserved, so parsing it as a string would be better.
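A minimal sketch of that parsing step (the file name is hypothetical):
import pandas as pd

# dtype=str keeps '001' from being read as the integer 1
df = pd.read_csv('data.csv', dtype={'UNIQUE ID': str})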

Related

Sum whenever another column changes

I have a df with VENDOR, INVOICE and AMOUNT. I want to create a column called ITEM, which starts at 1 and changes to 2 when the invoice number changes, and so on.
I tried using cumsum, but it isn't working as I hoped, and in hindsight that makes sense: the way I wrote the code, it counts up within the same invoice and starts over when the invoice changes.
data = pd.read_csv('data.csv')
data['ITEM_drop'] = 1
s = data['INVOICE'].ne(data['INVOICE'].shift()).cumsum()
data['ITEM'] = data.groupby(s)['ITEM_drop'].cumsum()
Output:
VENDOR INVOICE AMOUNT ITEM_drop ITEM
A 123 10 1 1
A 123 12 1 2
A 456 44 1 1
A 456 5 1 2
A 456 10 1 3
B 999 7 1 1
B 999 1 1 2
And what I want is:
VENDOR INVOICE AMOUNT ITEM_drop ITEM
A 123 10 1 1
A 123 12 1 1
A 456 44 1 2
A 456 5 1 2
A 456 10 1 2
B 999 7 1 3
B 999 1 1 3
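One possible fix, sketched here under the assumption that equal invoices are always contiguous: the helper series s is already the counter being asked for, because the cumulative sum of the invoice-change flags increments exactly when INVOICE changes, so the groupby step can be dropped:
import pandas as pd

data = pd.read_csv('data.csv')
# True on every row where INVOICE differs from the previous row;
# the running sum of those flags is the 1-based group counter.
data['ITEM'] = data['INVOICE'].ne(data['INVOICE'].shift()).cumsum()
If the same invoice could reappear later in the file, data.groupby('INVOICE', sort=False).ngroup() + 1 would number groups by first appearance instead.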

Groupby and create a new column by randomly assigning multiple strings to it in Pandas

Let's say I have student info with id, age and class as follows:
id age class
0 1 23 a
1 2 24 a
2 3 25 b
3 4 22 b
4 5 16 c
5 6 16 d
I want to group by class and create a new column named major by randomly assigning math, art, business or science to it, meaning that rows in the same class get the same major string.
We may need to use apply(lambda x: random.choice(...)) to achieve this, but I don't know how. Thanks for your help.
Output expected:
id age major class
0 1 23 art a
1 2 24 art a
2 3 25 science b
3 4 22 science b
4 5 16 business c
5 6 16 math d
Use numpy.random.choice with the number of values given by the length of the DataFrame:
df['major'] = np.random.choice(['math', 'art', 'business', 'science'], size=len(df))
print (df)
id age major
0 1 23 business
1 2 24 art
2 3 25 science
3 4 22 math
4 5 16 science
5 6 16 business
EDIT: for the same major value per group, use Series.map with a dictionary:
c = df['class'].unique()
vals = np.random.choice(['math', 'art', 'business', 'science'], size=len(c))
df['major'] = df['class'].map(dict(zip(c, vals)))
print (df)
id age class major
0 1 23 a business
1 2 24 a business
2 3 25 b art
3 4 22 b art
4 5 16 c science
5 6 16 d math
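A reproducible variant (a sketch; the seed is an assumption, added only so repeated runs give the same assignment) using numpy's Generator API:
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, purely for reproducibility

majors = ['math', 'art', 'business', 'science']
c = df['class'].unique()
# one independent draw per class, broadcast back to every row of that class
df['major'] = df['class'].map(dict(zip(c, rng.choice(majors, size=len(c)))))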

Pandas: Sort a dataframe based on multiple columns

I know that this question has been asked several times. But none of the answers match my case.
I have a pandas dataframe with the columns department and employee_count. I need to sort by the employee_count column in descending order. But if there is a tie between two employee_counts, they should be sorted alphabetically based on department.
Department Employee_Count
0 abc 10
1 adc 10
2 bca 11
3 cde 9
4 xyz 15
required output:
Department Employee_Count
0 xyz 15
1 bca 11
2 abc 10
3 adc 10
4 cde 9
This is what I've tried.
df = df.sort_values(['Department','Employee_Count'],ascending=[True,False])
But this just sorts the departments alphabetically.
I've also tried sorting by Department first and then by Employee_Count, like this:
df = df.sort_values(['Department'], ascending=[True])
df = df.sort_values(['Employee_Count'], ascending=[False])
This doesn't give me the correct output either:
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10
0 abc 10
3 cde 9
It gives 'adc' first and then 'abc'.
Kindly help me.
You can swap the columns in the list and also the values in the ascending parameter:
Explanation:
The order of the column names is the order of sorting: rows are first sorted descending by Employee_Count, and ties in Employee_Count are then sorted ascending by Department.
df1 = df.sort_values(['Employee_Count', 'Department'], ascending=[False, True])
print (df1)
Department Employee_Count
4 xyz 15
2 bca 11
0 abc 10 <-
1 adc 10 <-
3 cde 9
Or, to see the difference, if the second value is False, the tied rows are sorted descending instead:
df2 = df.sort_values(['Employee_Count', 'Department'], ascending=[False, False])
print (df2)
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10 <-
0 abc 10 <-
3 cde 9
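An equivalent two-pass approach (a sketch) relies on sort stability: sort by the tie-breaker first, then by the primary key with the stable kind='mergesort', so tied Employee_Count rows keep their alphabetical Department order:
df3 = (df.sort_values('Department')
         .sort_values('Employee_Count', ascending=False, kind='mergesort'))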

Match pandas column values and headers across dataframes

I have 3 files that I am reading into dataframes (https://pastebin.com/v7BnSH3s)
map_df: maps data_df headers to codes_df headers
Field Name Code Name
Gender gender_codes
Race race_codes
Ethnicity ethnicity_codes
code_df: Valid codes
gender_codes race_codes ethnicity_codes
1 1 1
2 2 2
3 3 3
4 4 4
NaN NaN 5
NaN NaN 6
NaN NaN 7
data_df: the actual data that needs to be checked against the codes
Name Gender Race Ethnicity
Alex 99 1 7
Cindy 2 4 5
Tom 1 99 1
Problem:
I need to confirm that each value in every column of data_df is a valid code. If not, I need to record the Name, the invalid value and the column header as a row in a new dataframe. So my example data_df would yield the following dataframe for the gender_codes check:
result_df:
Name Value Column
Alex 99 Gender
Background:
My actual data file has over 100 columns.
A code column can map to multiple columns in the data_df.
I am currently not using the map_df other than to know which columns map to which codes. However, if I can incorporate this into my script, that would be ideal.
What I've tried:
I am currently sending each code column to a list, removing the nan string, performing the lookup with loc and isin, then setting up the result_df...
# code column to list
gender_codes = codes_df["gender_codes"].tolist()
# remove the nan strings
gender_codes = [code for code in gender_codes if str(code) != "nan"]
# check each value against the code list
result_df = data_df.loc[~data_df.Gender.isin(gender_codes)]
result_df = result_df.filter(items=["Name", "Gender"])
result_df.rename(columns={"Gender": "Value"}, inplace=True)
result_df['Column'] = 'Gender'
This works but obviously is extremely primitive and won't scale with my dataset. I'm hoping to find an iterative and pythonic approach to this problem.
EDIT:
Modified Dataset with np.nan
https://pastebin.com/v7BnSH3s
Boolean indexing
I'd reformat your data into more convenient forms:
m = dict(map_df.itertuples(index=False))
c = code_df.T.stack().groupby(level=0).apply(set)
ddf = data_df.melt('Name', var_name='Column', value_name='Value')
ddf[[val not in c[col] for val, col in zip(ddf.Value, ddf.Column.map(m))]]
Name Column Value
0 Alex Gender 99
5 Tom Race 99
Details
m # Just a dictionary with the same content as `map_df`
{'Gender': 'gender_codes',
'Race': 'race_codes',
'Ethnicity': 'ethnicity_codes'}
c # Series of membership sets
ethnicity_codes {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0}
gender_codes {1.0, 2.0, 3.0, 4.0}
race_codes {1.0, 2.0, 3.0, 4.0}
dtype: object
ddf # Melted dataframe to help match the final output
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
You will need to preprocess your dataframes and define a validation function. Something like below:
1. Preprocessing
# call melt() to convert columns to rows
mcodes = codes_df.melt(
    value_vars=list(codes_df.columns),
    var_name='Code Name',
    value_name='Valid Code').dropna()
mdata = data_df.melt(
    id_vars='Name',
    value_vars=list(data_df.columns[1:]),
    var_name='Column',
    value_name='Value')
validation_df = mcodes.merge(map_df, on='Code Name')
Out:
mcodes:
Code Name Valid Code
0 gender_codes 1
1 gender_codes 2
7 race_codes 1
8 race_codes 2
9 race_codes 3
10 race_codes 4
14 ethnicity_codes 1
15 ethnicity_codes 2
16 ethnicity_codes 3
17 ethnicity_codes 4
18 ethnicity_codes 5
19 ethnicity_codes 6
20 ethnicity_codes 7
mdata:
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
validation_df:
Code Name Valid Code Field Name
0 gender_codes 1 Gender
1 gender_codes 2 Gender
2 race_codes 1 Race
3 race_codes 2 Race
4 race_codes 3 Race
5 race_codes 4 Race
6 ethnicity_codes 1 Ethnicity
7 ethnicity_codes 2 Ethnicity
8 ethnicity_codes 3 Ethnicity
9 ethnicity_codes 4 Ethnicity
10 ethnicity_codes 5 Ethnicity
11 ethnicity_codes 6 Ethnicity
12 ethnicity_codes 7 Ethnicity
2. Validation Function
def isValid(row):
    valid_list = validation_df[validation_df['Field Name'] == row.Column]['Valid Code'].tolist()
    return row.Value in valid_list
3. Validation
mdata['isValid'] = mdata.apply(isValid, axis=1)
result = mdata[mdata.isValid == False]
Out:
result:
Name Column Value isValid
0 Alex Gender 99 False
5 Tom Race 99 False
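A vectorized variant (a sketch; it assumes Value and Valid Code share a comparable dtype so the merge keys line up) that replaces the row-wise apply with a left merge plus the indicator flag:
checked = mdata.merge(validation_df,
                      left_on=['Column', 'Value'],
                      right_on=['Field Name', 'Valid Code'],
                      how='left', indicator=True)
result = checked.loc[checked['_merge'] == 'left_only', ['Name', 'Column', 'Value']]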
# Terse variant: index by Name, mask each data column against its valid codes,
# then stack() drops the valid (NaN-masked) cells, leaving only the invalid ones
m, df1 = dict(map_df.values), data_df.set_index('Name')
df1[df1.apply(lambda x: ~x.isin(code_df[m[x.name]]))].stack().reset_index()
Out:
Name level_1 0
0 Alex Gender 99.0
1 Tom Race 99.0

dataframe transformation python

I am new to pandas. I have a dataframe, df, with 3 columns: (date), (name) and (count).
For each day, is there an easy way to create a new dataframe from the original one whose columns are the unique names from the original name column, with their respective count values in the right cells?
date name count
0 2017-08-07 ABC 12
1 2017-08-08 ABC 5
2 2017-08-08 TTT 6
3 2017-08-09 TAC 5
4 2017-08-09 ABC 10
It should become:
date ABC TTT TAC
0 2017-08-07 12 0 0
1 2017-08-08 5 6 0
3 2017-08-09 10 0 5
df = pd.DataFrame({"date":["2017-08-07","2017-08-08","2017-08-08","2017-08-09","2017-08-09"],"name":["ABC","ABC","TTT","TAC","ABC"], "count":["12","5","6","5","10"]})
df = df.pivot(index='date', columns='name', values='count').reset_index().fillna(0)
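A variant sketch using pivot_table, with count cast to a number so the filled zeros stay numeric (aggfunc='sum' is an assumption about how duplicate (date, name) pairs should be combined):
df['count'] = pd.to_numeric(df['count'])
df = (df.pivot_table(index='date', columns='name', values='count',
                     aggfunc='sum', fill_value=0)
        .reset_index())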
