How to transfer a list of elements to a pandas dataframe by every three elements? - python-3.x

I have a list of people's info and I want to transfer it to a pandas dataframe.
My list:
My_lst = ['Name1','Title1','Company1','Name2','Title2','Company2','Name3','Title3','Company3',
'Name4','Title4','Company4','Name5','Title5','Company5','Name6','Title6','Company6'...]
Expected outputs:
NAME TITLE COMPANY
Name1 Title1 Company1
Name2 Title2 Company2
Name3 Title3 Company3
...
How do I do that in Python? Thank you for the help!

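If I understand correctly, you can reshape the list into rows of three with NumPy (this assumes import numpy as np and import pandas as pd):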
pd.DataFrame(np.array(My_lst).reshape((-1,3)),columns=['name','title','company'])
name title company
0 Name1 Title1 Company1
1 Name2 Title2 Company2
2 Name3 Title3 Company3
3 Name4 Title4 Company4
4 Name5 Title5 Company5
5 Name6 Title6 Company6
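The same result can be reached without NumPy by slicing the list in steps of three. This is a minimal sketch, not the answerer's method: the sample list is a shortened, hypothetical stand-in for the elided original, and the length check is an added assumption about how you want to handle a list whose length is not a multiple of three (reshape would raise a ValueError in that case too).

```python
import pandas as pd

# Hypothetical sample data; the original list is elided after 'Company6'.
my_lst = ['Name1', 'Title1', 'Company1',
          'Name2', 'Title2', 'Company2',
          'Name3', 'Title3', 'Company3']

width = 3
if len(my_lst) % width != 0:
    raise ValueError("list length is not a multiple of 3")

# Slice the flat list into consecutive rows of three elements.
rows = [my_lst[i:i + width] for i in range(0, len(my_lst), width)]

df = pd.DataFrame(rows, columns=['NAME', 'TITLE', 'COMPANY'])
print(df)
```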

Related

Want To Collect The Same string of header

I have a sheet with the following header:
'''
+--------------+------------------+----------------+--------------+---------------+
| usa_alaska | usa_california | france_paris | italy_roma | france_lyon |
|--------------+------------------+----------------+--------------+---------------|
+--------------+------------------+----------------+--------------+---------------+
'''
df = pd.DataFrame([], columns = 'usa_alaska usa_california france_paris italy_roma france_lyon'.split())
I want to separate the headers by country and region in a way so that when I call france, I should get paris and lyon as columns.
Create a MultiIndex from your column names:
Suppose this dataframe:
>>> df
usa_alaska usa_california france_paris italy_roma france_lyon
0 1 2 3 4 5
df.columns = df.columns.str.split('_', expand=True)
df = df.sort_index(axis=1)
Output
>>> df
france italy usa
lyon paris roma alaska california
0 5 3 4 1 2
>>> df['france']
lyon paris
0 5 3
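For reference, here is a self-contained sketch of the steps above, using the one-row frame from the answer. Naming the levels country and region is an assumption made for illustration; it lets you select by region as well as by country.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns='usa_alaska usa_california france_paris italy_roma france_lyon'.split(),
)

# Split 'country_region' into a two-level MultiIndex and sort it.
df.columns = df.columns.str.split('_', expand=True)
df = df.sort_index(axis=1)

# Level names are an assumption added for readability.
df.columns.names = ['country', 'region']

# Select every region of one country.
france = df['france']
print(france)

# Select one region across countries via a cross-section on level 'region'.
paris = df.xs('paris', axis=1, level='region')
print(paris)
```

After sort_index the regions within each country are in alphabetical order, so lyon comes before paris.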

Joining column of different rows in pandas

I have a dataframe and I want to merge the ID column based on the Name column without deleting any rows.
How would I do this?
Example:
Name ID
John ABC
John XYZ
Lucy MNO
I want to convert the above dataframe into the below one
Name ID
John ABC, XYZ
John ABC, XYZ
Lucy MNO
Use GroupBy.transform with join:
df['ID'] = df.groupby('Name')['ID'].transform(', '.join)
print (df)
Name ID
0 John ABC, XYZ
1 John ABC, XYZ
2 Lucy MNO
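Note that ', '.join only works when the ID values are strings; on numbers it raises a TypeError. A minimal sketch with hypothetical numeric IDs, casting to str before joining (the ID_joined column name is made up for this example):

```python
import pandas as pd

# Hypothetical data with numeric IDs instead of strings.
df = pd.DataFrame({'Name': ['John', 'John', 'Lucy'],
                   'ID': [101, 202, 303]})

# Cast each group's values to str, join them, and broadcast the
# joined string back to every row of the group.
df['ID_joined'] = (
    df.groupby('Name')['ID']
      .transform(lambda s: ', '.join(s.astype(str)))
)
print(df)
```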

How to gather and rename the same columns from multiple Pandas dataframes

I have many dataframes with the same structure - number of rows and names of columns.
How can I gather all the columns with the same name into a single new dataframe, renaming them so that each one gets a numbered name?
df1 = pd.DataFrame({'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]})
df2 = pd.DataFrame({'Name':['Wendy', 'Frank', 'krish', 'Lucy'], 'Age':[20, 21, 19, 18]})
print(df1)
print(df2)
I want:
df3 = pd.DataFrame({'Name1':['Wendy', 'Frank', 'krish', 'Lucy'], 'Name2':['Tom', 'nick', 'krish', 'jack']})
print(df3)
Output:
df1:
Name Age
0 Tom 20
1 nick 21
2 krish 19
3 jack 18
df2:
Name Age
0 Wendy 20
1 Frank 21
2 krish 19
3 Lucy 18
df3:
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack
What I tried:
df1 = df1.drop(columns='Age')
df2 = df2.drop(columns='Age')
df3 = df1.join(df2)
You can concat the two DataFrames together along axis=1 in a list comprehension. Use .add_suffix with enumerate to get the numbers appended to the column names.
pd.concat([df[['Name']].add_suffix(i+1) for i,df in enumerate([df2, df1])], axis=1)
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack
Or if you want to do this for many similar columns at once concat with keys to create a MultiIndex on the columns and then collapse the MultiIndex and join the column names in a list comprehension.
l = [df2, df1]
df3 = pd.concat(l, axis=1, keys=np.arange(len(l))+1)
df3.columns = [f'{y}{x}' for x,y in df3.columns]
# Name1 Age1 Name2 Age2
#0 Wendy 20 Tom 20
#1 Frank 21 nick 21
#2 krish 19 krish 19
#3 Lucy 18 jack 18
df3.filter(like='Name')
Name1 Name2
0 Wendy Tom
1 Frank nick
2 krish krish
3 Lucy jack
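If you need this for many dataframes, the idea can be wrapped in a small helper. This is a sketch only: gather_column is a hypothetical name, and it assumes every frame shares the same index and contains the requested column.

```python
import pandas as pd

def gather_column(frames, column):
    """Collect `column` from each frame, numbered from 1 in list order."""
    pieces = [
        frame[column].rename(f"{column}{i}")
        for i, frame in enumerate(frames, start=1)
    ]
    # Align on the shared index and place the pieces side by side.
    return pd.concat(pieces, axis=1)

df1 = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                    'Age': [20, 21, 19, 18]})
df2 = pd.DataFrame({'Name': ['Wendy', 'Frank', 'krish', 'Lucy'],
                    'Age': [20, 21, 19, 18]})

df3 = gather_column([df2, df1], 'Name')
print(df3)
```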

Appending new elements to a column in pandas dataframe

I have a pandas dataframe like this:
df1:
id name gender
1 Alice Male
2 Jenny Female
3 Bob Male
Now I want to add a new column, sport, which will contain values in the form of a list. Let's say I want to add Football to the rows where gender is Male, so df1 will look like:
df1:
id name gender sport
1 Alice Male [Football]
2 Jenny Female NA
3 Bob Male [Football]
Now suppose I want to add Badminton to rows where gender is Female and Tennis to rows where gender is Male, so that the final output is:
df1:
id name gender sport
1 Alice Male [Football,Tennis]
2 Jenny Female [Badminton]
3 Bob Male [Football,Tennis]
How can I write a general Python function that appends new values to this column based on the value of another column?
The below should work for you. Initialize the column with an empty list in every row, then append:
import numpy as np
df['sport'] = np.empty((len(df), 0)).tolist()
def append_sport(df, filter_df, sport):
    # list.append returns None, so "or x" hands back the mutated list
    df.loc[filter_df, 'sport'] = df.loc[filter_df, 'sport'].apply(lambda x: x.append(sport) or x)
    return df
filter_df = (df.gender == 'Male')
df = append_sport(df, filter_df, 'Football')
df = append_sport(df, filter_df, 'Cricket')
Output
id name gender sport
0 1 Alice Male [Football, Cricket]
1 2 Jenny Female []
2 3 Bob Male [Football, Cricket]
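As a variation, here is a sketch that reproduces the exact scenario from the question (Football, then Badminton and Tennis). It builds new lists instead of mutating the existing ones, which is a design choice rather than a requirement; append_where is a hypothetical helper name.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'name': ['Alice', 'Jenny', 'Bob'],
                    'gender': ['Male', 'Female', 'Male']})

# One distinct empty list per row.
df1['sport'] = [[] for _ in range(len(df1))]

def append_where(df, mask, value, column='sport'):
    """Append `value` to the list in `column` for rows where `mask` is True."""
    updated = [
        items + [value] if hit else items
        for items, hit in zip(df[column], mask)
    ]
    # Wrap in an object Series so each list stays a single cell value.
    df[column] = pd.Series(updated, index=df.index, dtype=object)
    return df

is_male = df1['gender'] == 'Male'
df1 = append_where(df1, is_male, 'Football')
df1 = append_where(df1, ~is_male, 'Badminton')
df1 = append_where(df1, is_male, 'Tennis')
print(df1)
```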

Keep rows if difference between two values in column 2 exceeds certain amount; do this for each category given in column 1

I want to subsample a file by keeping as many entries as possible whose difference in values in column 2 are at least 500 units, for each name in column 1. The full file is ~200,000 lines long, sorted by column 1 then column 2, tab-separated, and looks something like this:
name1 107
name1 110
name1 472
name1 509
name1 599
name1 679
name1 710
name2 36
name2 179
name2 391
name2 696
name2 1427
name2 1583
name2 1722
name2 2090
name2 2136
name2 2235
name3 687
name3 933
name4 43
name4 207
name4 384
name4 439
name4 447
name4 603
name4 774
name4 802
name4 876
name4 988
I would like an output that looks like this:
name1 107
name1 679
name2 36
name2 696
name2 1427
name2 2090
name3 687
name4 43
name4 603
I think one way to do it is to keep the first entry for each name and then keep the next entry for that name that is at least 500 units larger, and then the next entry that is at least 500 larger than that, etc. Then, repeat for each name. It would also be fine if it was in reverse starting with the last entry for each name, or it would be fine if it started elsewhere as long as it maximized the number of entries retained for each name that are at least 500 units apart.
However, I have no idea how to code this, as I am a novice! Thank you for your help!
I chose to do it in Python, which is turning into the lingua franca of bioinformatics.
(Learn enough Python for your biology needs here: http://learnpythonthehardway.org/book/)
Copy the following into a file and run it with python script_name.py input_textfile.txt
(If you do not know enough Python to do that, chapters 0 and 1 in the book referred to above will help you.)
import sys
name_column = 0
number_column = 1
last_name = "dummy variable"
last_number = -1
min_difference = 500
with open(sys.argv[1], 'r') as input_file:
    for line in input_file:
        fields = line.split()
        name = fields[name_column]
        number = int(fields[number_column])
        # Always keep the first entry of each new name.
        if name != last_name:
            print(line.strip())
            last_number = number
            last_name = name
            continue
        # Keep an entry only if it is far enough from the last kept one.
        if (number-last_number) >= min_difference:
            print(line.strip())
            last_number = number
Output using data above:
name1 107
name1 679
name2 36
name2 696
name2 1427
name2 2090
name3 687
name4 43
name4 603
If you want the output in a file, use python script_name.py input_textfile.txt > output_file
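The same greedy rule can also be written as a function, which makes it easier to test on a small sample before running it on the full file. This is a sketch: subsample is a hypothetical name, and it assumes the input is already sorted by name and then by number, as described in the question.

```python
def subsample(rows, min_difference=500):
    """Keep the first entry per name, then each entry at least
    `min_difference` above the last kept entry for that name."""
    kept = []
    last_name = None
    last_number = None
    for name, number in rows:
        if name != last_name or number - last_number >= min_difference:
            kept.append((name, number))
            last_name, last_number = name, number
    return kept

# A slice of the sample data from the question.
sample = [
    ('name1', 107), ('name1', 110), ('name1', 472), ('name1', 509),
    ('name1', 599), ('name1', 679), ('name1', 710),
    ('name3', 687), ('name3', 933),
]

for name, number in subsample(sample):
    print(f"{name}\t{number}")
```

Because each kept entry is the smallest value that satisfies the gap, starting from the first entry retains as many entries per name as any other starting point would.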
