How to do a var.test on a two-level subset of a dataframe

I have a dataframe whose grouping factor has three levels, and I would like to do pairwise t-tests on the different groups. I first need to check for equal variance between the groups, but I obviously can't run var.test directly since there are three levels (not two). I get the error:
grouping factor must have exactly 2 levels
I'm sure there must be a simple way to define the two levels to perform the var.test on, but I can't figure out how. Any help would be hugely appreciated.
var.test(Rad51_foci~Genotype)
Error in var.test.formula(Rad51_foci ~ Genotype) :
grouping factor must have exactly 2 levels

Related

How to check Spark DataFrame difference?

I need to check my solution for idempotency and see how much it differs from the previous solution.
I tried the following:
spark.sql('''
select * from t1
except
select * from t2
''').count()
This gives me information about how much these tables differ (t1 is my solution, t2 is the original data). If a lot of the data is different, I want to check where it differs.
So I tried this:
diff = {}
columns = t1.columns
for col in columns:
    cntr = spark.sql(f'''
        select {col} from t1
        except
        select {col} from t2
    ''').count()
    diff[col] = cntr
print(diff)
This is not good for me, because it takes about 1-2 hours to run (both tables have 30 columns and 30 million rows of data).
Do you have any ideas on how to calculate this more quickly?
Except is essentially a join on all columns at the same time. Does your data have a primary key? It could even be composite, comprising multiple columns, but that is still much better than taking all 30 columns into account.
Once you figure out the primary key, you can do a FULL OUTER JOIN and:
check NULLs on the left
check NULLs on the right
check other columns of matching rows (it's much cheaper to compare the values after the join)
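For example, here is a minimal PySpark sketch of this approach. It assumes both tables are available as DataFrames t1 and t2 (e.g. via spark.table) and share a primary key column called id; the key name and the use of a single-column key are assumptions for illustration, not taken from the question.
from pyspark.sql import functions as F

# Hypothetical key column; replace 'id' with your real (possibly composite) key.
joined = t1.alias('a').join(
    t2.alias('b'),
    on=F.col('a.id') == F.col('b.id'),
    how='full_outer'
)

# Rows present only in one of the two tables.
only_in_t1 = joined.filter(F.col('b.id').isNull()).count()
only_in_t2 = joined.filter(F.col('a.id').isNull()).count()

# For matching rows, count per-column mismatches in a single pass.
# Note: '!=' yields NULL when either side is NULL, so such rows are not
# counted as differences; use ~eqNullSafe(...) if NULL handling matters.
compare_cols = [c for c in t1.columns if c != 'id']
matched = joined.filter(F.col('a.id').isNotNull() & F.col('b.id').isNotNull())
diff_counts = matched.agg(*[
    F.sum(F.when(F.col(f'a.{c}') != F.col(f'b.{c}'), 1).otherwise(0)).alias(c)
    for c in compare_cols
]).collect()[0].asDict()

print(only_in_t1, only_in_t2, diff_counts)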
Given that your resources remain unchanged, I think there are three ways you can optimize:
Join the two dataframes once instead of looping over except: I assume your dataset has a key / index; otherwise there is no ordering in either dataframe and you can't use except to check the difference. Unless you have very limited resources, just do the join once to combine the two dataframes instead of running multiple except queries.
Check your data partitioning: Whether you use point 1 or your current method, make sure the data is evenly distributed across an optimal number of partitions. Most of the time, data skew is one of the critical factors that lowers performance. If your key is a string, use repartition; if it is a sequence number, use repartitionByRange (see the sketch after this list).
Use a when-otherwise expression to check the difference: once you join the two dataframes, you can use a when-otherwise condition to compare the values, for example: df.select(func.sum(func.when(func.col('df1.col_a') != func.col('df2.col_a'), func.lit(1)).otherwise(func.lit(0))).alias('diff_in_col_a_count')). This way you calculate all the differences in a single action rather than many.
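A small sketch of point 2, assuming the join key is a column called id (an illustrative name) and 200 as an arbitrary target number of partitions to tune for your cluster:
# Hash partitioning on the key; a reasonable choice, especially for string keys.
t1_even = t1.repartition(200, 'id')
t2_even = t2.repartition(200, 'id')

# If 'id' is a monotonically increasing sequence number, range partitioning
# (available in newer Spark versions) usually spreads the rows more evenly:
# t1_even = t1.repartitionByRange(200, 'id')
# t2_even = t2.repartitionByRange(200, 'id')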

How to handle categorical features with many unique values in python/Scikit learn

In my situation, I would like to encode around 5 different columns in my dataset but the issue is that these 5 columns have many unique values.
If I encode them using a label encoder, I introduce an artificial ordering that is not right, whereas if I use OHE or pd.get_dummies, I end up with a lot of features that add too much sparseness to the data.
I am currently dealing with a supervised learning problem and the following are the unique values per column:
Job_Role : Unique categorical values = 29
Country : Unique categorical values = 12
State : Unique categorical values = 14
Segment : Unique categorical values = 12
Unit : Unique categorical values = 10
I have already looked into multiple references but am not sure about the best approach. What should I do in this situation to get the smallest number of features with the maximum positive impact on my model?
As far as I know, OneHotEncoder is usually used for these cases, but as you said, there are many unique values in your data. I looked for a solution for a project before and found the following approaches:
OneHotEncoder + PCA: I don't think this approach is quite right, because PCA is designed for continuous variables.
Entity Embeddings: I don't know this way very well, but you can check it from the link in the title.
BinaryEncoder: I think this is useful when you have a large number of categories, where one-hot encoding would increase the dimensionality and, in turn, the model complexity. Binary encoding is a good choice for encoding categorical variables into fewer dimensions.
There are some other solutions in the category_encoders library.
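For instance, a minimal sketch with category_encoders; the column names are the ones listed in the question, while the DataFrame name df is an assumption:
import category_encoders as ce

high_card_cols = ['Job_Role', 'Country', 'State', 'Segment', 'Unit']

# Binary encoding needs roughly ceil(log2(n_categories)) output columns per
# feature, e.g. 29 job roles -> about 5 columns instead of 29 one-hot columns.
encoder = ce.BinaryEncoder(cols=high_card_cols)
df_encoded = encoder.fit_transform(df)

print(df_encoded.shape)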

How can I get the ".describe()" statistics over all numerical columns, nested or not?

What is the best method to get simple descriptive statistics for any column in a dataframe (or list or array), be it nested or not? A sort of advanced df.describe() that also covers nested structures with numerical values.
In my case, I have a dataframe with many columns. Some columns have a numerical list in each row (in my case a time series structure), which is a nested structure.
By nested structures I mean:
list of arrays,
array of arrays,
series of lists,
dataframe with nested lists of numerical values in some columns (my case)
How to get the simple descriptive statistics from any level of the nested structure in one go?
Asking for
df.describe()
will give me just the statistics of the numerical columns, but not those of the columns that include a list with numerical values.
I cannot get the statistics just by applying
from scipy import stats
stats.describe(arr)
either, as that is the solution in How can I get descriptive statistics of a NumPy array? for a non-nested array.
My first approach would be to get the statistics of each numerical list first, and then take the statistics of those again; e.g. the mean of the means or the mean of the variances would then give me some information as well.
In my first approach here, I convert a specific column that has a nested list of numerical values to a series of nested lists first. Nested arrays or lists might need a small adjustment, not tested.
NESTEDSTRUCTURE = df['nestedColumn']
[stats.describe([r[i] for r in [stats.describe(x) for x in NESTEDSTRUCTURE]]) for i in range(6)]
gives you the stats of the stats for a nested structure column. If you want the mean of all means of a column, you can use
stats.describe([r[2] for r in [stats.describe(x) for x in NESTEDSTRUCTURE]])
as position 2 stands for "mean" in
DescribeResult(nobs=, minmax=(, ), mean=, variance=, skewness=,
kurtosis=)
I expect that there is a better descriptive statistics approach that automatically understands nested structures with numerical values; this is just a workaround.
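As a hedged, pandas-only alternative sketch, assuming the nested column holds a list or array of numbers in every row (Series.explode requires pandas 0.25+):
import pandas as pd

# Pool all nested values into one flat Series, then describe it.
flat = df['nestedColumn'].explode().astype(float)
print(flat.describe())

# Or compute a per-row statistic first and then describe its distribution,
# which mirrors the stats-of-stats idea above (here: per-row means).
per_row_mean = df['nestedColumn'].apply(lambda xs: pd.Series(xs, dtype=float).mean())
print(per_row_mean.describe())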

combining two dataframes sorted by common columns in pandas

I've been stuck on this for a while - I have tried merging, concatenating, joins, but can't get what I want.
Given the two dataframes:
I want to combine them to get:
First, I want them aligned by Start Location. If the names can be aligned, that is a bonus, but it is not critical. Although in this example the second table looks like it is aligning to the first, it could be either way, depending on the Start Location. The two dataframes are appropriately sorted beforehand. I can't even figure this out with two dataframes, but ultimately I want to be able to combine several together.
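Since the example tables are not reproduced here, this is just one plausible interpretation as a hedged sketch: aligning rows on the nearest Start Location with merge_asof, assuming both dataframes (df1, df2) have a numeric 'Start Location' column; no other column names are assumed.
import pandas as pd

# merge_asof requires both frames to be sorted on the key.
combined = pd.merge_asof(
    df1.sort_values('Start Location'),
    df2.sort_values('Start Location'),
    on='Start Location',
    direction='nearest',
    suffixes=('_1', '_2'),
)
To combine several dataframes, the same call can be applied repeatedly (for example with functools.reduce).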

PySpark 2.1.1 groupby + approx_count_distinct giving counts of 0

I'm using Spark 2.1.1 (pyspark), doing a groupby followed by an approx_count_distinct aggregation on a DataFrame with about 1.4 billion rows. The groupby operation results in about 6 million groups to perform the approx_count_distinct operation on. The expected distinct counts for the groups range from single-digits to the millions.
Here is the code snippet I'm using, with column 'item_id' containing the ID of items, and 'user_id' containing the ID of users. I want to count the distinct users associated with each item.
>>> distinct_counts_df = data_df.groupby(['item_id']).agg(approx_count_distinct(data_df.user_id).alias('distinct_count'))
In the resulting DataFrame, I'm getting about 16,000 items with a count of 0:
>>> distinct_counts_df.filter(distinct_counts_df.distinct_count == 0).count()
16032
When I checked the actual distinct count for a few of these items, I got numbers between 20 and 60. Is this a known issue with the accuracy of the HLL approximate counting algorithm or is this a bug?
I am not sure where the actual problem lies, but since approx_count_distinct relies on approximation (https://stackoverflow.com/a/40889920/7045987), HLL may well be the issue.
You can try this:
There is a parameter 'rsd' that you can pass to approx_count_distinct, which determines the error margin. If rsd = 0, it will give you accurate results, although the time increases significantly; in that case, countDistinct becomes a better option. Nevertheless, you can try decreasing rsd to, say, 0.008 at the cost of increased time. This may give slightly more accurate results.
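A short sketch of this suggestion, reusing the column names from the question; the default rsd is 0.05, and lowering it increases accuracy at the cost of time and memory:
from pyspark.sql.functions import approx_count_distinct, countDistinct

# Tighter error margin than the default rsd of 0.05.
distinct_counts_df = data_df.groupby('item_id').agg(
    approx_count_distinct(data_df.user_id, rsd=0.008).alias('distinct_count')
)

# Exact counts as a cross-check: slower, but without approximation error.
exact_counts_df = data_df.groupby('item_id').agg(
    countDistinct(data_df.user_id).alias('distinct_count')
)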

Resources