I have a data set which has two labels: label 1 = 0 (case), label 2 = 1 (control). I have already calculated the mean for the two labels. Next I need to calculate a two-sample t-test (dependent) and a two-sample rank-sum test. My data set looks like this:
SRA ID   ERR169499             ERR169500             ERR169501             mean_ctrl  mean_case
Label    1                     0                     1
TaxID    PRJEB3251_ERR169499   PRJEB3251_ERR169500   PRJEB3251_ERR169501
333046   0.05                  0                     0.4
1049     0.03                  0.9                   0
337090   0.01                  0.6                   0.7
I am new to statistics. The code I have so far is this:
import numpy as np

label = []
data = {}

# first pass: pull out the Label row
x = open('final_out_transposed.csv', 'rt')
for r in x:
    datas = r.split(',')
    if datas[0] == ' Label':
        label.append(r.split(",")[1:])

label = label[0]
label[-1] = label[-1].replace('\n', '')
counter = len(label)

# second pass: collect one row of values per TaxID
file1 = open('final_out_transposed.csv', 'rt')
for row in file1:
    content = row.split(',')
    if content[0] == 'SRA ID' or content[0] == 'TaxID' or content[0] == ' Label':
        pass
    else:
        dt = row.split(',')
        dt[-1] = dt[-1].replace('\n', '')
        data[dt[0]] = dt[1:]

keys = list(data)
sum_file = open('sum.csv', 'w')

for key in keys:
    sum_case = 0
    sum_ctrl = 0
    count_case = 0
    count_ctrl = 0
    mean_case = 0
    mean_ctrl = 0
    print(len(label))
    for i in range(counter):
        print(i)
        if label[i] == '0' or label[i] == 0:
            sum_case = np.float64(sum_case) + np.float64(data[key][i])
            count_case = count_case + 1
            mean_case = sum_case / count_case
        else:
            sum_ctrl = np.float64(sum_ctrl) + np.float64(data[key][i])
            count_ctrl = count_ctrl + 1
            mean_ctrl = sum_ctrl / count_ctrl
Any help will be highly appreciated.
Instead of using open to read your csv file, I would use Pandas. That will place it in a dataframe that will be easier to work with:
import pandas as pd
data_frame = pd.read_csv('final_out_transposed.csv')
For a two-sample dependent t-test you want to use ttest_rel.
Notice that ttest_ind is for independent groups; since you specifically asked for dependent groups, use ttest_rel.
It's hard to see from your example above where your two columns of sample data are, but imagine I had the following made-up data of 'case' and 'control'. I could calculate a dependent two-sample t-test using pandas as shown below:
import pandas as pd
from scipy.stats import ttest_rel
data_frame = pd.DataFrame({
    'case':    [55, 43, 51, 62, 35, 48, 58, 45, 48, 54, 56, 32],
    'control': [48, 38, 53, 58, 36, 42, 55, 40, 49, 50, 58, 25]})
(t_stat, p) = ttest_rel(data_frame['control'], data_frame['case'])
print (t_stat)
print (p)
p would be the p-value and t_stat the t-statistic. You can read more about this in the documentation.
In a similar manner, once you have your sample .csv data in a dataframe, you can perform a rank sum test:
from scipy.stats import ranksums
(t_stat, p) = ranksums(data_frame['control'], data_frame['case'])
documentation for ranksums
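Once your per-taxon values are in a dataframe (one row per TaxID, one column per sample) and you know which columns are case (label 0) and which are control (label 1), you can run a test row by row. The sketch below only illustrates the idea under that assumed layout, using the small sample values from the question; ttest_rel would be called the same way, but only makes sense for paired, equal-sized groups:

import pandas as pd
from scipy.stats import ranksums

# Assumed layout based on the sample data in the question:
# one row per TaxID, one column per sample.
abundances = pd.DataFrame(
    {'ERR169499': [0.05, 0.03, 0.01],
     'ERR169500': [0.0, 0.9, 0.6],
     'ERR169501': [0.4, 0.0, 0.7]},
    index=[333046, 1049, 337090])

case_cols = ['ERR169499']               # columns whose Label row is 0
ctrl_cols = ['ERR169500', 'ERR169501']  # columns whose Label row is 1

results = []
for taxid, row in abundances.iterrows():
    # ranksums accepts groups of different sizes; for ttest_rel the two
    # groups would have to be paired and equal in length
    stat, p = ranksums(row[ctrl_cols], row[case_cols])
    results.append({'TaxID': taxid, 'ranksum_stat': stat, 'ranksum_p': p})

print(pd.DataFrame(results))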
I have this weird Pandas problem: when I use the apply function with values pulled from a data frame, it only gets applied to the first row:
import pandas as pd

# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [10, 20]]
dfData = pd.DataFrame(valuesData, columns=headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)

# variable data frame - pull values from this to edit main data frame
headerVariables = [['varA', 'varB']]
valuesVariables = [[2, 10]]
dfVariables = pd.DataFrame(valuesVariables, columns=headerVariables)
dfVariables.to_csv('Variables.csv', index=False)
readVariablesCSV = pd.read_csv('Variables.csv')

readVarA = readVariablesCSV['varA']
readVarB = readVariablesCSV['varB']

def formula(x):
    return (x / readVarA) * readVarB

dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
dataA dataB
0 50.0 100.0
1 NaN NaN
But when I just use regular variables (not pulled from a data frame), it works just fine:
import pandas as pd

# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [20, 40]]
dfData = pd.DataFrame(valuesData, columns=headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)

# variables
readVarA = 2
readVarB = 10

def formula(x):
    return (x / readVarA) * readVarB

dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
dataA dataB
0 50.0 100.0
1 100.0 200.0
Help please I'm pulling my hair out.
If you take readVarA and readVarB from the dataframe by selecting the column, each one is a pandas Series with its own index, and that is what breaks the calculation: dividing a Series by another Series aligns on the index labels, so only the rows whose labels match (here, index 0) get a value and the rest become NaN.
You can take the first value from the Series, like this:
def formula(x):
    return (x / readVarA[0]) * readVarB[0]
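Another option is to pull plain scalars out of the one-row Series before applying the formula. A small sketch, assuming Variables.csv holds a single row as in the example above:

# pull plain scalars out of the one-row Series
varA = readVariablesCSV['varA'].iloc[0]   # positional lookup of the first value
varB = readVariablesCSV['varB'].item()    # .item() also works for a 1-element Series

def formula(x):
    return (x / varA) * varB

print(readMainDataCSV.apply(formula))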
I'm 63 and just started with Python (my first steps, via Udemy).
I'm Croatian, so the strings in the program are in Croatian, but you will understand it when you run it. I know it could be cleaner, shorter, more elegant, etc., but as I mentioned, I'm a beginner.
import random

jedan = random.sample(range(1, 99), 15)
dva = random.sample(range(1, 99), 15)

def raspaljot(jedan, dva, i):
    for x in jedan:
        for y in dva:
            if y == x:
                index1 = jedan.index(x)
                index1_str = str(index1)
                index2 = dva.index(y)
                index2_str = str(index2)
                i += 1
                x = str(x)
                # "Number <x> is a duplicate; it is at position <index1> in list one
                #  and at position <index2> in list two"
                print(" Broj \033[31m" + x + "\033[0m,je dupli i nalazi se u listi jedan: na poziciji: \033[34m"
                      + index1_str + "\033[0m a u listi dva na poziciji: \033[35m" + index2_str + "\033[0m")
    print()
    print(jedan)
    print(dva)
    if i != 0:
        # "There are <i> duplicate numbers in total."
        print("\n *** Ukupno ima ", i, 'duplih brojeva. ***')
    elif i == 0:
        # "No duplicates. :)"
        print("Nema duplih brojeva. :) ")

i = 0
raspaljot(jedan, dva, i)
What the program does is find duplicates in the two random lists, print the duplicates in color, and report each duplicate's position in list one and list two.
What I am trying to do now is print list one and list two themselves, but with the duplicates shown in color.
For example:
[14, 78, 85, 31, 5, 54, 13, 46, 83, 4, 35, 41, 52, 51, 32]
[72, 40, 67, 85, 54, 76, 77, 39, 51, 36, 91, 70, 71, 38, 55]
Here we have 3 duplicates (85, 54, 51). In the example above, the console output was printed in white, but I want those 3 numbers shown in red in the two lines above.
Is this possible? I couldn't find a solution.
PS. Wing Pro version 7 on Fedora 33 Workstation / In Wing IDE the colors are only displayed in an external console, not in the Debug I/O tool. :)
A simple solution would be something like this:
# Change the list to a string
jedan_str = str(jedan)

# Create a set with the numbers that need a new color
num_set = {"85", "54", "51"}

# Iterate over every number and wrap it with a color change
for i in num_set:
    # Note that I used an f-string to format the string,
    # but you can also do this as "\033[31m" + i + "\033[0m"
    jedan_str = jedan_str.replace(i, f"\033[31m{i}\033[0m")

# Print the string that represents the list
print(jedan_str)
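The hardcoded numbers can also be replaced by a duplicate set computed from the two lists themselves. A small sketch building on the snippet above (it assumes jedan and dva from the question are in scope):

# Build the duplicate set from the two lists instead of hardcoding the numbers
duplicates = set(jedan) & set(dva)

def colorize(lst, dupes):
    # wrap every duplicate in the red escape codes, leave the rest as plain text
    parts = [f"\033[31m{n}\033[0m" if n in dupes else str(n) for n in lst]
    return "[" + ", ".join(parts) + "]"

print(colorize(jedan, duplicates))
print(colorize(dva, duplicates))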
Following the idea of using a set to determine which elements are in both lists (as Cv4niak proposed in his answer), I created a function to print the output as you desire. There are numerous other ways of achieving it, but I think this is a simple yet effective one.
The idea is to use the cprint() function from the termcolor package. You can install it with pip install termcolor, and then print all elements normally, except the duplicates, which are printed with cprint(item, "red").
The "{:0>2d}" formatting in each item print serves only to pad the number with zeros (so 2 will be printed as 02, for example), so that the output of both lists stays aligned.
import random

from termcolor import cprint

def mark_duplicates(first, second):
    duplicates = list(set(first).intersection(second))
    if duplicates:
        for list_ in [first, second]:
            print("[", end="")
            for item in list_:
                if item in duplicates:
                    cprint("{:0>2d}".format(item), "red", end=",")
                else:
                    print("{:0>2d}".format(item), end=",")
            print("\b]")
    else:
        print("No duplicates.")

jedan = random.sample(range(1, 99), 15)
dva = random.sample(range(1, 99), 15)
mark_duplicates(jedan, dva)
With this, if there are no duplicates, the string No duplicates. will be printed. You can also change the color with little effort and use other nice functionality from the termcolor package.
I have a DataFrame and I want to group by Type and then Flag, and plot one graph for the count of ID and another graph grouped by Type and Flag for the sum of the Total column, in Bokeh.
p.hbar(df,
       plot_width=800,
       plot_height=800,
       label='Type',
       values='ID',
       bar_width=0.4,
       group=['Type', 'Flag'],
       legend='top_right')
(image: expected graph)
If it's not possible with Bokeh, what other package could I use to get a good-looking graph (vibrant colours on a white background)?
You can do this with the holoviews library, which uses bokeh as a backend.
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension("bokeh")
df = pd.DataFrame({
    "type": list("ABABCCAD"),
    "flag": list("YYNNNYNY"),
    "id": list("DEFGHIJK"),
    "total": [40, 100, 20, 60, 77, 300, 60, 50]
})
# Duplicate the dataframe
df = pd.concat([df] * 2)
print(df)
type flag id total
0 A Y 1 40
1 B Y 2 100
2 A N 3 20
3 B N 4 60
4 C N 5 77
5 C Y 6 300
6 A N 7 60
7 D Y 8 50
Now that we have our data, let's work on plotting it:
def mainplot_hook(plot, element):
    plot.state.text(
        y="xoffsets",
        x="total",
        text="total",
        source=plot.handles["source"],
        text_align="left",
        y_offset=9,
        x_offset=5
    )

def sideplot_hook(plot, element):
    plot.state.text(
        y="xoffsets",
        x="count",
        text="count",
        source=plot.handles["source"],
        text_align="left",
        y_offset=9,
        x_offset=5
    )
# Create single bar plot for sum of the total column
total_sum = df.groupby(["type", "flag"])["total"].sum().reset_index()
total_sum_bars = hv.Bars(total_sum, kdims=["type", "flag"], vdims="total")
# Create our multi-dimensional bar plot
all_ids = sorted(df["id"].unique())
counts = df.groupby(["type", "flag"])["id"].value_counts().rename("count").reset_index()
id_counts_hmap = hv.Bars(counts, kdims=["type", "flag", "id"], vdims="count").groupby("type")
main_plot = (total_sum_bars
             .opts(hooks=[mainplot_hook],
                   title="Total Sum",
                   invert_axes=True)
             )

side_plots = (
    id_counts_hmap
    .redim.values(id=all_ids, flag=["Y", "N"])
    .redim.range(count=(0, 3))
    .opts(
        opts.NdLayout(title="Counts of ID"),
        opts.Bars(color="#1F77B4", height=250, width=250, invert_axes=True, hooks=[sideplot_hook]))
    .layout("type")
    .cols(2)
)
final_plot = main_plot + side_plots
# Save combined output as html
hv.save(final_plot, "my_plot.html")
# Save just the main_plot as html
hv.save(main_plot, "main_plot.html")
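If you would rather open the result directly instead of saving it, one option (a small sketch, not part of the code above) is to render the holoviews object to a Bokeh model and show it:

# Render the holoviews object to a Bokeh model and open it in the browser
from bokeh.plotting import show
show(hv.render(final_plot))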
As you can see, the code to make plots in holoviews can be a little tricky, but it's definitely a tool I would recommend you pick up. Especially if you deal with high-dimensional data regularly, it makes plotting a breeze once you get the syntax down.
I found this thread: how to make a variable change from the text "1m" into "1000000" in python.
My string values are in a column within a pandas dataframe. The string/object values are like 18M, 345K, 12.9K, 0, etc.
values = df5['Values']

multipliers = {'k': 1e3,
               'm': 1e6,
               'b': 1e9,
               }

pattern = r'([0-9.]+)([bkm])'

for number, suffix in re.findall(pattern, values):
    number = float(number)
    print(number * multipliers[suffix])
Running the code gives this error:
Traceback (most recent call last):
File "c:/Users/thebu/Documents/Python Projects/trading/screen.py", line 19, in <module>
for number, suffix in re.findall(pattern, values):
File "C:\Users\thebu\Anaconda3\envs\trading\lib\re.py", line 223, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
Thanks
Here's another way using regex:
import re
import pandas as pd

# multipliers dict taken from the question
multipliers = {'k': 1e3, 'm': 1e6, 'b': 1e9}

def get_word(s):
    # find the suffix letter
    r = re.findall(r'[a-z]', s)
    # find the digits
    w = re.findall(r'[0-9]', s)
    if len(r) > 0 and len(w) > 0:
        r = r[0]
        v = multipliers.get(r, None)
        if v:
            w = int(''.join(w))
            w *= v
            return round(w)

df['col2'] = df['col'].apply(get_word)
print(df)
col col2
0 10k 10000
1 20m 20000000
Sample Data
df = pd.DataFrame({'col': ['10k', '20m']})
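Since the question's column holds values like 18M, 345K, 12.9K and plain 0, here is a hedged sketch of a variant that also copes with uppercase suffixes, decimals, and values with no suffix (the df5['Values'] column name is taken from the question; the sample values are made up from its description):

import pandas as pd

# handles '18M', '345K', '12.9K' and plain '0'
multipliers = {'k': 1e3, 'm': 1e6, 'b': 1e9}

def to_number(s):
    s = str(s).strip().lower()
    if s and s[-1] in multipliers:
        return float(s[:-1]) * multipliers[s[-1]]
    return float(s)

df5 = pd.DataFrame({'Values': ['18M', '345K', '12.9K', '0']})
df5['Numeric'] = df5['Values'].apply(to_number)
print(df5)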
I have a CSV file, something like:
Total Energy               Pump  Diswasher  Freezer
2015-04-15T13:00:00+0200   70    50         80
2015-04-15T14:00:00+0400   80    40         80
2015-04-15T15:00:00+0300   30    90         80
2015-04-15T16:00:00+0600   40    80         80
2015-04-15T17:00:00+0700   50    70         80
I want to take only the first 3 values of each column and put them into separate Python lists. For the Total Energy column, I want to take only the value after the "+" sign.
The expected result is something like:
Total_energy = [200, 400, 300]
Pump = [70, 80, 30]
Diswasher = [50, 40, 90]
Freezer = [80, 80, 80]
I have two issues:
1. I only want the first three values of each column, not the whole column.
2. For the first column (Total Energy) I only need the value after the "+" sign.
import pandas

colnames = ['Total_energy', 'Pump', 'Diswasher']
data = pandas.read_csv('Data_60min.csv', names=colnames)

names = data.name.tolist()
Total_energy = data.Total_energy.tolist()
Pump = data.Pump.tolist()
Diswasher = data.Diswasher.tolist()
I can do something like the above, but it takes the whole column, and I also don't know how to take only the value after the "+" sign.
First filter the top 3 rows with DataFrame.head or DataFrame.iloc:
data = data.head(3)
Or:
data = data.iloc[:3]
then select the values after + with Series.str.split or Series.str.extract:
data['Total_energy'] = data['Total_energy'].str.split('+').str[1].astype(int)
data['Total_energy'] = data['Total_energy'].str.extract(r'\+(.+)', expand=False).astype(int)
Last, it is possible to create a dictionary and select the lists by key:
d = data.to_dict('list')
print (d)
{'Total_energy': [200, 400, 300],
'Pump': [70, 80, 30],
'Diswasher': [50, 40, 90],
'Freezer': [80, 80, 80]}
print (d['Total_energy'])
[200, 400, 300]
print (d['Pump'])
[70, 80, 30]
Selecting each column separately, like in your solution, also works:
names = data.name.tolist()
Total_energy = data.Total_energy.tolist()
Pump = data.Pump.tolist()
Diswasher = data.Diswasher.tolist()
EDIT:
If you need to filter columns in read_csv by name, use the parameter usecols, not names:
colnames = ['cet_cest_timestamp', 'DE_KN_residential2_circulation_pump', 'DE_KN_residential2_dishwasher']
data = pd.read_csv('household_data_60min_singleindex.csv', usecols=colnames)
print (data.head())
cet_cest_timestamp DE_KN_residential2_circulation_pump \
0 2015-04-15T13:00:00+0200 19.776
1 2015-04-15T14:00:00+0200 19.780
2 2015-04-15T15:00:00+0200 19.784
3 2015-04-15T16:00:00+0200 19.788
4 2015-04-15T17:00:00+0200 19.854
DE_KN_residential2_dishwasher
0 2.765
1 2.765
2 2.765
3 2.765
4 2.765
You can get the result you want by using:
Total_energy = data.loc[:2, "Total Energy"].apply(lambda x: int(x.split("+")[1])).tolist()
Pump = data.loc[:2, "Pump"].tolist()
Diswasher = data.loc[:2, "Diswasher"].tolist()
Use loc to select the rows and column you want and split to get the value after the "+" sign for Total Energy.
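Putting the pieces together, here is a short end-to-end sketch. It assumes the file is comma-separated and uses the column names from the sample data shown in the question; adjust the names to match your real CSV:

import pandas as pd

# read the file and keep only the first three rows
data = pd.read_csv('Data_60min.csv')
first3 = data.head(3)

# timestamp column: keep only the digits after the '+' sign
Total_energy = first3['Total Energy'].str.split('+').str[1].astype(int).tolist()
Pump = first3['Pump'].tolist()
Diswasher = first3['Diswasher'].tolist()
Freezer = first3['Freezer'].tolist()

print(Total_energy)  # [200, 400, 300]
print(Pump)          # [70, 80, 30]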