P-value normal test for multiple rows - python-3.x

I have the following simple code to test normality over an array:
import pandas as pd
import numpy as np
import scipy.stats as stats

df = pd.read_excel(r"directory\file.xlsx")
x = df.iloc[:, 1:].values.flatten()
stats.normaltest(x, axis=None)
This nicely gives me a p-value and a statistic.
The only thing I want now is to add two columns to the file with this p-value and statistic, and, if there are multiple rows, to do it for all rows (calculate the p-value and statistic for each row and add two columns holding these values).
Can someone help?

If you want to calculate normaltest row-wise, you should not flatten your data into x; use axis=1 instead, such as:
df = pd.DataFrame(np.random.random(105).reshape(5, 21))  # to generate data
# calculate normaltest row-wise, skipping the first column as you did
df['stat'], df['p'] = stats.normaltest(df.iloc[:, 1:], axis=1)
Then df contains two columns 'stat' and 'p' with the values you are looking for, IIUC.
Note: to be able to perform normaltest you need at least 8 values per row (in my experience), so df.iloc[:, 1:] needs at least 8 columns or it will raise an error. Even then, it is better to have more than 20 values in each row.
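Tying this back to the original Excel workflow, a minimal sketch could look like the following (the file path, the skipped first column, and the output file name are assumptions based on the question):
import pandas as pd
from scipy import stats

df = pd.read_excel(r"directory\file.xlsx")  # assumed input file from the question
# Row-wise test over everything except the first column.
df['stat'], df['p'] = stats.normaltest(df.iloc[:, 1:], axis=1)
df.to_excel(r"directory\file_with_normaltest.xlsx", index=False)  # hypothetical output name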

Related

pandas: removing duplicate values in rows with same index in two columns

I have a dataframe as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'text':['she is good', 'she is bad'], 'label':['she is good', 'she is good']})
I would like to compare the columns row-wise and, where a row has the same value in both columns, replace the duplicate in the 'label' column with the word 'same'.
Desired output:
          text        label
0  she is good         same
1   she is bad  she is good
So far I have tried the following, but it returns an error:
df['label'] = np.where(df.query("text == label"), df['label'] == ' ', df['label'] == df['label'])
ValueError: Length of values (1) does not match length of index (2)
Your syntax is not correct; have a look at the documentation of numpy.where.
Check for equality between your two columns, and replace the values in your label column:
import numpy as np
df['label'] = np.where(df['text'].eq(df['label']), 'same', df['label'])
prints:
          text        label
0  she is good         same
1   she is bad  she is good
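If you prefer to stay within pandas, a roughly equivalent sketch uses Series.mask to replace the matching values (same toy frame as above):
import pandas as pd

df = pd.DataFrame({'text': ['she is good', 'she is bad'],
                   'label': ['she is good', 'she is good']})
# Series.mask replaces values where the condition is True.
df['label'] = df['label'].mask(df['text'].eq(df['label']), 'same')
print(df)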

How to create a scatter plot where values are across multiple columns?

I have a dataframe in Pandas in which the rows are observations at different times and each column is a size bin where the values represent the number of particles observed for that size bin. So it looks like the following:
       bin1  bin2  bin3  bin4  bin5
Time1    50   200    30    40     5
Time2    60    60    40   420   700
Time3    34   200    30    67    43
I would like to use plotly/cufflinks to create a scatterplot in which the x axis will be each size bin, and the y axis will be the values in each size bin. There will be three colors, one for each observation.
As I'm more experienced in Matlab, I tried indexing the values using iloc (note the example below is just trying to plot one observation):
df.iplot(kind="scatter",theme="white",x=df.columns, y=df.iloc[1,:])
But I just get a KeyError: 0 message.
Is it possible to use indexing when choosing x and y values in Pandas?
Rather than indexing, I think you need to better understand how pandas and matplotlib interact with each other.
Let's go step by step for your case:
As the pandas.DataFrame.plot documentation says, the plotted series are the columns. You have your series in the rows, so you need to transpose your dataframe.
To create a scatterplot, you need both x and y coordinates in different columns, but you are missing the x column, so you also need to create a column with the x values in the transposed dataframe.
Apparently pandas does not change color by default with consecutive calls to plot (matplotlib does it), so you need to pick a color map and pass a color argument, otherwise all points will have the same color.
Here is a working example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Here I copied your data into a data.txt text file and read it with pandas.
# You may have a different way to get your data.
df = pd.read_csv('data.txt', sep=r'\s+', engine='python')
# I assume there is a column named 'time' holding the observation labels, as in your post.
# Set it as the index and transpose, so each observation becomes a column.
tdf = df.set_index('time').transpose()
# Create the x values; I go for 1 to 5, but they can be different.
tdf['xval'] = np.arange(1, len(tdf) + 1)
# Choose a colormap and build a list of colors to be used.
colormap = plt.cm.rainbow
colors = [colormap(i) for i in np.linspace(0, 1, len(tdf))]
# Make an empty plot; the columns will be added to the axes in the loop.
fig, axes = plt.subplots(1, 1)
for i, cl in enumerate([datacol for datacol in tdf.columns if datacol != 'xval']):
    tdf.plot(x='xval', y=cl, kind="scatter", ax=axes, color=colors[i])
plt.show()
This plots each observation as a differently colored series of points.
Here is a tutorial on picking colors in matplotlib.
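Since the question mentions plotly/cufflinks, here is a rough plotly alternative as well (a sketch, assuming plotly is installed and the data is in the wide layout shown above): melt the frame to long form and let plotly.express color by observation.
import pandas as pd
import plotly.express as px

# Toy data in the same wide layout as the question.
df = pd.DataFrame(
    {'bin1': [50, 60, 34], 'bin2': [200, 60, 200], 'bin3': [30, 40, 30],
     'bin4': [40, 420, 67], 'bin5': [5, 700, 43]},
    index=['Time1', 'Time2', 'Time3'])

# Reshape to long form: one row per (observation, bin) pair.
long_df = df.rename_axis('time').reset_index().melt(
    id_vars='time', var_name='bin', value_name='count')

# One color per observation, bins on the x axis.
fig = px.scatter(long_df, x='bin', y='count', color='time')
fig.show()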

Generating DataFrame from Multiplication Table Printout - Python 3.x

In the following code I produce print-outs of randomly generated multiplication tables. I would like to turn each generated table into a DataFrame. How would I do this? (I am new to Python 3.x.)
This exercise was to generate a multiplication table. It expanded into a project that generates a set number of multiplication tables with randomly generated one- or two-digit column and row numbers. Currently it is set to run five tables, each with 8 columns and rows; however, these numbers can be changed. Jupyter Notebook can only print up to 12 columns nicely, so while the program will generate as many columns and rows as we want (of equal size, e.g. 6x6, 3x3, 9x9), limiting it to a 12x12 matrix or smaller is best for viewing.
import pandas as pd
import numpy as np
import random
%matplotlib notebook

# This sets up how many tables we will generate
for t in range(0, 5):
    # Make variable place holders for our columns and rows lists
    a = []
    b = []
    # To use randomly generated numbers, this sets up the random column numbers 'a' and random row numbers 'b'
    for x in range(12):
        a.append(random.randint(41, 99))  # We can adjust the range of the random selection of numbers here
        b.append(random.randint(1, 35))   # We can adjust the range of the random selection of numbers here
    # Add the column titles for each table - these are the random numbers 'a'
    print("C/R: ", end="\t ")
    for number in a:
        print(number, end='\t ')
    print()
    # The double for-loop to generate the table
    for row in b:
        print(row, end="\t")  # First column
        for number in a:
            print(round(row * number, 1), end='\t')  # Next columns
        print()
    # Add two blank cosmetic lines between tables for readability
    print('\n\n')
Thanks.
Based upon Josewails' assistance, this is the code for which I was looking. Josewails, thank you.
import pandas as pd
import numpy as np
import random
%matplotlib notebook

# This sets up how many tables we will generate
for t in range(0, 5):
    # Make variable place holders for our columns and rows lists
    a = []
    b = []
    dataframes = []
    # To use randomly generated numbers, this sets up the random column numbers 'a' and random row numbers 'b'
    for x in range(12):
        a.append(random.randint(41, 99))  # We can adjust the range of the random selection of numbers here
        b.append(random.randint(1, 35))   # We can adjust the range of the random selection of numbers here
    data = []
    for row in b:
        temp = []
        for number in a:
            temp.append(round(row * number, 1))
        data.append(temp)
    dataframe = pd.DataFrame(data=data, columns=a)
    dataframe.index = b
    dataframes.append(dataframe)
display(dataframes[0])
Refactoring your code a bit should give the desired output.
import pandas as pd
import numpy as np
import random
%matplotlib notebook

# This sets up how many tables we will generate
dataframes = []
for t in range(0, 5):
    # Make variable place holders for our columns and rows lists
    a = []
    b = []
    # To use randomly generated numbers, this sets up the random column numbers 'a' and random row numbers 'b'
    for x in range(12):
        a.append(random.randint(41, 99))  # We can adjust the range of the random selection of numbers here
        b.append(random.randint(1, 35))   # We can adjust the range of the random selection of numbers here
    data = []
    for row in b:
        temp = []
        for number in a:
            temp.append(round(row * number, 1))
        data.append(temp)
    dataframe = pd.DataFrame(data=data, columns=a)
    dataframe.index = b
    dataframes.append(dataframe)
dataframes[0]
This is the output: the first generated multiplication table displayed as a pandas DataFrame.
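If you want to see all five tables rather than just the first one, a quick sketch (assuming you are in a Jupyter notebook, where display is available):
from IPython.display import display

for frame in dataframes:
    display(frame)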

Re-indexing a pandas Series extracted from DataFrame

Problem
Let us say that I have a DataFrame df of n rows like this:
  Precipitation   Discharge
       12             16
       10             15
       12             16
       10             15
       ...
       12             16
       10             15
Rows are automatically indexed from 1 to n. When we extract one column for example:
series = df.loc[5:,['Precipitation']]
or
series = df.Precipitation[5:]
The extracted series then looks like:
Precipitation
5 16
6 17
7 18
...
n 15
So the question is: how can we change the index from running 5 to n back to running 0 to n-5?
Note that I have tried series.reindex() and series.reset_index(), but neither of them works...
Currently I use series.tolist() to work around the problem, but is there a more elegant and smarter way?
Many thanks in advance!
If I understand your question correctly, you want to select the first n-5 rows of your column.
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7], 'col2': [3,4,3,2,5,9,7]})
df.loc[:len(df)-5,['col1']]
You may want to read up on slicing with pandas: https://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges
EDIT:
I think I misunderstood, and what you instead want is to 'reset' your index. You can do that with the method reset_index(). Note that it returns a new object rather than modifying the frame in place, so assign the result back (or pass inplace=True):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7], 'col2': [3, 4, 3, 2, 5, 9, 7]})
df = df.loc[5:, ['col1']]
df = df.reset_index(drop=True)
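Since the question extracted a Series rather than a DataFrame, here is a small self-contained sketch doing the same thing directly on a Series:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7], 'col2': [3, 4, 3, 2, 5, 9, 7]})
series = df['col1'][5:]                  # a Series whose index starts at 5
series = series.reset_index(drop=True)   # index now starts at 0 again
print(series)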

Pandas sort not maintaining sort

What is the right way to multiply two sorted pandas Series?
When I run the following
import pandas as pd
x = pd.Series([1, 3, 2])
x = x.sort_values()   # Series.sort() was removed in newer pandas; sort_values() returns the sorted Series
print(x)
w = [1] * 3
print(w * x)
I get what I would expect - [1,2,3]
However, when I change it to a Series:
w = pd.Series(w)
print(w*x)
It appears to multiply based on the index of the two series, so it returns [1,3,2]
Your results are essentially the same, just sorted differently.
>>> w*x
0    1
2    2
1    3
>>> pd.Series(w)*x
0    1
1    3
2    2
>>> (w*x).sort_index()
0    1
1    3
2    2
The rule is basically this: Anytime you multiply a dataframe or series by a dataframe or series, it will be done by index. That's what makes it pandas and not numpy. As a result, any pre-sorting is necessarily ignored.
But if you multiply a dataframe or series by a list or numpy array of a conforming shape/size, then the list or array will be treated as having the exact same index as the dataframe or series. The pre-sorting of the series or dataframe can be preserved in this case because there can not be any conflict with the list or array (which don't have an index at all).
Both of these types of behavior can be very desirable depending on what you are trying to do. That's why you will often see answers here that do something like df1 * df2.values when the second type of behavior is desired.
In this example, it doesn't really matter because your list is [1,1,1] and gives the same answer either way, but if it was [1,2,3] you would get different answers, not just differently sorted answers.
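To make that difference concrete, here is a small sketch with w = [1, 2, 3] instead of [1, 1, 1]; label alignment and positional multiplication now give genuinely different numbers:
import pandas as pd

x = pd.Series([1, 3, 2]).sort_values()   # values 1, 2, 3 at index labels 0, 2, 1
w = pd.Series([1, 2, 3])                 # index labels 0, 1, 2

print(w * x)          # aligned by label: 0 -> 1*1, 1 -> 2*3, 2 -> 3*2, i.e. 1, 6, 6
print(x * w.values)   # aligned by position: 1*1, 2*2, 3*3, i.e. 1, 4, 9, keeping x's order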
