Creating a matrix from a list of subjects with different genes present or absent with Python

I have a file with different subjects, each with a list of genes that are present (one line per gene). I would like to restructure the data into a matrix with the subjects as rows and a column for every gene (with 1 or 0 for present or absent). I have the original data as an Excel file that I have imported with pandas to try to do this in Python, but honestly I have no clue how to do it in a nice way.
[Image: how the data is structured and how it is supposed to be formatted.]
I really appreciate all the help I can get!
So many thanks already

If this is your original file:
Subject,Gene
subject1,gene1
subject1,gene2
subject1,gene3
subject2,gene1
subject2,gene4
subject3,gene2
subject3,gene4
subject3,gene5
Then you can do something like this with pd.crosstab:
>>> import pandas as pd
>>> df = pd.read_csv("genes.csv")
>>> pd.crosstab(df["Subject"], df["Gene"])
Gene      gene1  gene2  gene3  gene4  gene5
Subject
subject1      1      1      1      0      0
subject2      1      0      0      1      0
subject3      0      1      0      1      1
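One caveat: pd.crosstab counts occurrences, so if the same gene were ever listed twice for a subject you would get values above 1. A small sketch (assuming the same df as above) that forces a strict presence/absence matrix:

# crosstab counts occurrences; clip the counts at 1 so the
# result is a strict 0/1 matrix
matrix = pd.crosstab(df["Subject"], df["Gene"]).clip(upper=1)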

Use pivot()
df['count'] = 1
df.pivot(index='Subject', columns='Gene', values='count')
Gene      gene1  gene2  gene3  gene4  gene5
Subject
subject1    1.0    1.0    1.0    NaN    NaN
subject2    1.0    NaN    NaN    1.0    NaN
subject3    NaN    1.0    NaN    1.0    1.0
UPDATED -- Full example based on your comment
# import pandas module
import pandas as pd
import numpy as np
# read your excel file
df = pd.read_excel(r'path\to\your\file\myFile.xlsx')
# create a new column called 'count' and set it to a value of 1
df['count'] = 1
# use pivot and assign it to a new variable: df2
df2 = df.pivot(index='Subject', columns='Gene', values='count').replace(np.nan, 0)
# print your new dataframe
print(df2)
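One optional follow-up, assuming the df2 from the example above: the pivoted values come out as floats (1.0/0.0) because of the NaNs, so you can cast them to integers for a cleaner 0/1 matrix.

# after the NaNs are replaced with 0, the values can safely
# be cast from float to int
df2 = df2.astype(int)
print(df2)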

Related

How to unmerge cells and create a standard dataframe when reading excel file?

I would like to convert a dataframe read from an Excel file with merged cells (where the city appears only on the first row of each block) into a standard dataframe with every cell filled in.
So far, reading the Excel file the standard way gives me the following result:
df = pd.read_excel(folder + 'abcd.xlsx', sheet_name="Sheet1")
   Unnamed: 0   Unnamed: 1  T12006      T22006      T32006  \
0  Casablanca       Global     100    97.27252   93.464538
1         NaN  Résidentiel     100   95.883979   92.414063
2         NaN  Appartement     100   95.425152   91.674379
3         NaN       Maison     100  101.463607  104.039383
4         NaN        Villa     100   102.45132  101.996932
Thank you
You can try the .fillna() method with the parameter method='ffill'. According to the pandas documentation: ffill: propagate last valid observation forward to next valid.
So, your code would be like:
df.fillna(method='ffill', inplace=True)
And rename the first two columns with these lines:
df.columns.values[0] = "City"
df.columns.values[1] = "Type"
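For illustration, here is a minimal, self-contained sketch with made-up data (the 'City' and 'Type' column names are assumptions mirroring the rename above) showing what the forward fill does:

import numpy as np
import pandas as pd

# Mimic the merged-cell layout: the city only appears on the
# first row of its block, the rest are NaN.
df = pd.DataFrame({
    'City': ['Casablanca', np.nan, np.nan],
    'Type': ['Global', 'Résidentiel', 'Appartement'],
    'T12006': [100, 100, 100],
})

# Propagate the last valid city downward into the NaN rows.
df['City'] = df['City'].ffill()
print(df)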

How to check the correspondence of unique values between 2 tables?

I am fairly new to Python and I am trying to create a new function for my project.
The function aims to detect whether each unique value in a column of one table is present in the same column of another table.
First, the function keeps only the unique values of the two tables, then merges them into a new dataframe.
The rest is where it gets complicated, because I would like to return which value is missing and from which table.
If you have any other leads or approaches, I'm also interested.
Here is my code:
def correspondance_cle(df1, df2, col):
    df11 = pd.DataFrame(df1[col].unique())
    df11.columns = [col]
    df11['test1'] = 1
    df21 = pd.DataFrame(df2[col].unique())
    df21.columns = [col]
    df21['test2'] = 1
    df3 = pd.merge(df11, df21, on=col, how='outer')
    df3 = df3.loc[(df3['test1'].isna() == True) | (df3['test2'].isna() == True), :]
    df3.info()
    for row in df3[col]:
        if df3['test1'].isna() == True:
            print(row, "is not in df1")
        else:
            print(row, 'is not in df2')
Thanks to everyone who took the time to read the post.
First use an outer join, removing duplicates with Series.drop_duplicates and calling Series.reset_index to avoid losing the original indices:
df1 = pd.DataFrame({'a': [1, 2, 5, 5]})
df2 = pd.DataFrame({'a': [2, 20, 5, 8]})
col = 'a'

df = (df1[col].drop_duplicates().reset_index()
          .merge(df2[col].drop_duplicates().reset_index(),
                 indicator=True,
                 how='outer',
                 on=col))
print (df)
   index_x   a  index_y      _merge
0      0.0   1      NaN   left_only
1      1.0   2      0.0        both
2      2.0   5      2.0        both
3      NaN  20      1.0  right_only
4      NaN   8      3.0  right_only
Then filter rows by helper column _merge:
print (df[df['_merge'].eq('left_only')])
   index_x  a  index_y     _merge
0      0.0  1      NaN  left_only

print (df[df['_merge'].eq('right_only')])
   index_x   a  index_y      _merge
3      NaN  20      1.0  right_only
4      NaN   8      3.0  right_only
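Putting it together in the shape of the question's function (a sketch under the same assumptions, not a definitive implementation; the print wording mirrors the question's intent):

def correspondance_cle(df1, df2, col):
    # Outer-merge the deduplicated key columns; indicator=True tags
    # each value with the side(s) it came from.
    merged = (df1[col].drop_duplicates().reset_index()
                  .merge(df2[col].drop_duplicates().reset_index(),
                         indicator=True, how='outer', on=col))
    for _, row in merged.iterrows():
        if row['_merge'] == 'left_only':
            print(row[col], 'is not in df2')
        elif row['_merge'] == 'right_only':
            print(row[col], 'is not in df1')
    return merged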

How to change the format for values in a dataframe?

I need to change the format of values in a column of a dataframe. I have a dataframe in this format:
df =
sector  funding_total_usd
1                     NaN
2               10,00,000
3                3,90,000
4               34,06,159
5             2,17,50,000
6               20,00,000
How do I change it to this format:
df =
sector  funding_total_usd
1                     NaN
2                10000.00
3                 3900.00
4                34061.59
5               217500.00
6                20000.00
This is my code:
for row in df['funding_total_usd']:
    dt1 = row.replace(',', '')
    print(dt1)
This is the error that I got: "AttributeError: 'float' object has no attribute 'replace'"
I really need your help with this.
Here's the way to get the decimal places:
import pandas as pd
import numpy as np

df = pd.DataFrame({'funding_total_usd': [np.nan, 1000000, 390000, 3406159, 21750000, 2000000]})
print(df)
df['funding_total_usd'] /= 100
print(df)
   funding_total_usd
0                NaN
1          1000000.0
2           390000.0
3          3406159.0
4         21750000.0
5          2000000.0

   funding_total_usd
0                NaN
1           10000.00
2            3900.00
3           34061.59
4          217500.00
5           20000.00
To get the two-decimal display shown above, run this as your first command, before you print; it formats all float values with two decimal places:
pd.options.display.float_format = '{:.2f}'.format
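If the underlying column really holds comma-separated strings mixed with NaN floats (which is what made row.replace blow up), a sketch of the full cleanup, assuming the same column name, could be:

# Go through strings so NaN floats don't break .replace, strip the
# commas, then coerce back to numbers ('nan' strings become NaN again).
df['funding_total_usd'] = pd.to_numeric(
    df['funding_total_usd'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
) / 100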

Creating sqlite table from csv files with different column names

I have a large number of .csv files that I would like to put in a sqlite database. Most of the files contain the same column names, but some files have extra columns.
The code that I've tried is (altered to be generic):
import os
import pandas as pd
import sqlite3

conn = sqlite3.connect('test.db')
cur = conn.cursor()
os.chdir(dir)
for file in os.listdir(dir):
    df = pd.read_csv(file)
    df.to_sql('X', conn, if_exists='append')
When it encounters a file with a column that is not in table X, I get the error:
OperationalError: table X has no column named ColumnZ
How can I alter my code to append the table with the new column and fill previous rows with NaN?
If all DataFrames can fit into RAM, you can do this:
import glob

files = glob.glob(r'/path/to/csv_files/*.csv')
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
df.to_sql('X', conn, if_exists='replace')
Demo:
In [22]: d1
Out[22]:
   a  b
0  0  1
1  2  3

In [23]: d2
Out[23]:
   a  b  c
0  1  2  3
1  4  5  6

In [24]: d3
Out[24]:
    x   b
0  11  12
1  13  14

In [25]: pd.concat([d1, d2, d3], ignore_index=True)
Out[25]:
     a   b    c     x
0  0.0   1  NaN   NaN
1  2.0   3  NaN   NaN
2  1.0   2  3.0   NaN
3  4.0   5  6.0   NaN
4  NaN  12  NaN  11.0
5  NaN  14  NaN  13.0
Alternatively, you can keep track of the columns seen so far, check in a loop whether a new DataFrame has additional columns, and add those columns to the SQLite DB using the SQLite ALTER TABLE statement (sketched below):
ALTER TABLE tab_name ADD COLUMN ...
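A rough sketch of that route, assuming a fresh database and CSV files in the working directory (untyped columns are legal in SQLite):

import os
import sqlite3
import pandas as pd

conn = sqlite3.connect('test.db')
known_cols = set()  # columns already present in table X

for file in os.listdir('.'):
    if not file.endswith('.csv'):
        continue
    df = pd.read_csv(file)
    if known_cols:
        # table already exists: add any columns it hasn't seen yet
        for col in set(df.columns) - known_cols:
            conn.execute(f'ALTER TABLE X ADD COLUMN "{col}"')
    known_cols |= set(df.columns)
    # pandas only inserts the columns present in df, so files with
    # fewer columns leave the missing ones as NULL
    df.to_sql('X', conn, if_exists='append', index=False)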

Using relative positioning with Python 3.5 and pandas

I am formatting some csv files, and I need to add columns that use other columns for arithmetic. Like in Excel: B3 = sum(A1:A3)/3, then B4 = sum(A2:A4)/3. I've looked up relative indexing and haven't found what I'm trying to do.
def formula_columns(csv_list, dir_env):
    for file in csv_list:
        df = pd.read_csv(dir_env + file)
        avg_12(df)
        print(df[10:20])

# Create AVG(12) Column
def avg_12(df):
    df['AVG(12)'] = df['Price']
    # Right here I want to set each value of 'AVG(12)' to equal
    # the sum of the value of Price from its own index plus the
    # previous 11 indexes
    df.loc[:10, 'AVG(12)'] = 0
I would imagine this to be a common task, so I assume I'm looking in the wrong places. If anyone has some advice I would appreciate it. Thanks.
That can be done with the rolling method:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 5, 10), columns=['A'])
df
Out[151]:
   A
0  2
1  4
2  1
3  1
4  4
5  2
6  4
7  2
8  4
9  1
Take the averages of A1:A3, A2:A4 etc:
df.rolling(3).mean()
Out[152]:
          A
0       NaN
1       NaN
2  2.333333
3  2.000000
4  2.000000
5  2.333333
6  3.333333
7  2.666667
8  3.333333
9  2.333333
This requires pandas 0.18 or later. For earlier versions, use pd.rolling_mean():
pd.rolling_mean(df['A'], 3)
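Applied to the original question, the whole avg_12 helper reduces to one line (assuming the 'Price' column from the asker's code):

# trailing 12-row average: each row averages itself plus the
# previous 11 rows; the first 11 rows come out as NaN
df['AVG(12)'] = df['Price'].rolling(12).mean()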
