Determine which diseases cluster together [closed] - statistics

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 6 years ago.
How do I determine which diseases cluster together? I have a dataset with patients and their diseases. It's coded as HOHT = 1 if they have it, and HOHT = 0 if they do not have it.
Below is an example of the data. How would I go about determining which diseases occur most often with each other without writing a bunch of if-then statements? The goal is to create something like a Venn diagram or a dendrogram showing the overlap between diseases.
Moya Hypothyroid Hyperthyroid Celiac
1 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
1 1 0 0
1 0 1 0
1 1 0 0
1 1 0 0
0 0 1 1
0 0 1 1

The simplest approach I can think of would be to have a look at the correlation matrix via proc corr:
data diseases;
input Moya Hypothyroid Hyperthyroid Celiac;
cards;
1 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
1 1 0 0
1 0 1 0
1 1 0 0
1 1 0 0
0 0 1 1
0 0 1 1
;
run;
proc corr data = diseases out = disease_corr; run;
There are various other options, but I'm not sure whether this question is really the best fit for this site as it's very broad and more about statistics than programming. If you run into a more specific problem feel free to ask another question.
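As a complement to the correlation matrix, the pairwise overlaps can also be counted directly. The following is a minimal sketch in Python rather than SAS; the function name `cooccurrence_counts` and the variable names are illustrative assumptions, and the data is the sample from the question.

```python
from itertools import combinations

# Sample data from the question: one row per patient, one column per disease.
diseases = ["Moya", "Hypothyroid", "Hyperthyroid", "Celiac"]
rows = [
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
    (0, 0, 0, 0),
    (1, 1, 0, 0),
    (1, 0, 1, 0),
    (1, 1, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
    (0, 0, 1, 1),
]

def cooccurrence_counts(names, data):
    """Count, for every pair of columns, the rows where both are 1."""
    counts = {}
    for i, j in combinations(range(len(names)), 2):
        counts[(names[i], names[j])] = sum(1 for r in data if r[i] and r[j])
    return counts

counts = cooccurrence_counts(diseases, rows)
# Print the pairs from most to least frequent.
for pair, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(pair, n)
```

On this sample, Moya and Hypothyroid co-occur most often, followed by Hyperthyroid and Celiac. Counts like these could feed a Venn diagram, while a distance derived from them could feed a dendrogram.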

Related

Find minimum operations that make the array 0

Consider an array a of positive integers 1 <= a[i] <= 10^9
We can perform an operation on this array which involves taking any integer x and subtracting or adding it to a subarray of a
The goal is to make the entire array 0
Some examples:
2 1 2 1 => 1 0 1 0 => 0 0 1 0 => 0 0 0 0
This requires 3 operations. First subtract 1 from all the elements (subarray=a[0..3]), then individually subtract 1 from single element subarrays (a[0..0], a[2..2])
Alternative way could've been:
2 1 2 1 => 2 1 1 0 => 2 0 0 0 => 0 0 0 0
=> 1 0 0 0 => 0 0 0 0
Example 2:
4 6 2 => 0 2 2 => 0 0 0
Example 3:
10 3 2 9 => 10 10 9 9 => 0 0 9 9 => 0 0 0 0
I searched a lot, not able to find the exact problem statement. The solution that I came across though (https://stackoverflow.com/a/68789827/4014182) talks about a divide and conquer solution wherein you realize that every time there is a 0 in the array, you can "split" the array. But how to get that 0 in the first place is beyond my understanding.
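To make the operation concrete, here is a small Python sketch that replays the operation sequences from the three examples above. It only checks that each sequence reaches all zeros; it does not compute the minimum number of operations. The helper names `apply_op` and `replay` are assumptions made for illustration.

```python
def apply_op(arr, start, end, delta):
    """Return a copy of arr with delta added to arr[start..end] (inclusive)."""
    return [v + delta if start <= i <= end else v for i, v in enumerate(arr)]

def replay(arr, ops):
    """Apply (start, end, delta) operations in order and return the result."""
    for start, end, delta in ops:
        arr = apply_op(arr, start, end, delta)
    return arr

# Example 1: subtract 1 from a[0..3], then from a[0..0], then from a[2..2].
example1 = replay([2, 1, 2, 1], [(0, 3, -1), (0, 0, -1), (2, 2, -1)])
# Example 2: subtract 4 from a[0..1], then 2 from a[1..2].
example2 = replay([4, 6, 2], [(0, 1, -4), (1, 2, -2)])
# Example 3: add 7 to a[1..2], subtract 10 from a[0..1], subtract 9 from a[2..3].
example3 = replay([10, 3, 2, 9], [(1, 2, 7), (0, 1, -10), (2, 3, -9)])
print(example1, example2, example3)
```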

How to subtract each value in a column from same number in a .txt file in Linux? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have a .txt file (mydata.txt) in Linux and want to subtract each value in a column from the same number, saving the output values in a new column alongside the old one. For example, in the data below, I subtracted each value in column C1 from 2 and saved the result as a new column (C1_sub), the result for C2 in column C2_sub, and so on. How do I do this for one or multiple columns?
C1 C1_sub C2 C2_sub C3 C3_sub
1 1 2 0 2 0
0 2 1 1 2 0
2 0 0 2 0 2
0 2 2 0 1 1
0.008 1.992 0 2 2 0
1.999 0.001 1 1 0 2
0 2 2 0 1 1
0 2 0.001 1.999 2 0
1 1 1 1 0 2
2 0 2 0 0.013 1.987
0 2 0.999 1.001 0 2
0 2 2 0 1 1
1 1 1.999 0.001 2 0
1.99 0.01 1.999 0.001 2 0
0 2 1.999 0.001 1 1
awk 'NR==1{print "C1", "C1_sub", "C2", "C2_sub", "C3", "C3_sub"; next}
{ print $1, 2 - $1, $2, 2 - $2, $3, 2 - $3 }' OFS='\t' mydata.txt
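For an arbitrary number of columns, the same idea can be sketched in Python without hard-coding column positions. This is a minimal sketch under assumptions: the input is whitespace-separated with a header row holding only the original columns, and the function name `add_sub_columns` and its `minuend` parameter are illustrative, not part of any existing tool.

```python
def add_sub_columns(lines, minuend=2.0):
    """For each column in whitespace-separated lines, append a <name>_sub column."""
    header = lines[0].split()
    out = ["\t".join(part for name in header for part in (name, name + "_sub"))]
    for line in lines[1:]:
        cells = []
        for value in line.split():
            # Round to suppress floating-point noise such as 0.0009999999999998899.
            diff = round(minuend - float(value), 10)
            cells.extend([value, f"{diff:g}"])
        out.append("\t".join(cells))
    return out

# A few rows taken from the question's data, original columns only.
sample = ["C1 C2 C3", "1 2 2", "0.008 0 2", "1.999 1 0"]
result = add_sub_columns(sample)
print("\n".join(result))
```

To run it on a real file, the lines could be read with `open("mydata.txt").read().splitlines()` and the result written back out.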

large(over 1 million classes) multi-class classifier via Keras

I have data for around 2 million active customers and around 2-5 years worth of transaction data by customer. This data includes features such as what item that customer bought, what store they bought it from, the date they purchased that item, how much they bought, how much they paid, etc.
I need to predict which of our customers will shop in the next 2 weeks.
Right now my data is set up like this
item_a item_b item_c item_d customer_id visit
dates
6/01 1 0 0 0 cust_123 1
6/02 0 0 0 0 cust_123 0
6/03 0 1 0 0 cust_123 1
6/04 0 0 0 0 cust_123 0
6/05 1 0 0 0 cust_123 1
6/06 0 0 0 0 cust_123 0
6/07 0 0 0 0 cust_123 0
6/08 1 0 0 0 cust_123 1
6/01 0 0 0 0 cust_456 0
6/02 0 0 0 0 cust_456 0
6/03 0 0 0 0 cust_456 0
6/04 0 0 0 0 cust_456 0
6/05 1 0 0 0 cust_456 1
6/06 0 0 0 0 cust_456 0
6/07 0 0 0 0 cust_456 0
6/08 0 0 0 0 cust_456 0
6/01 0 0 0 0 cust_789 0
6/02 0 0 0 0 cust_789 0
6/03 0 0 0 0 cust_789 0
6/04 0 0 0 0 cust_789 0
6/05 0 0 0 0 cust_789 0
6/06 0 0 0 0 cust_789 0
6/07 0 0 0 0 cust_789 0
6/08 0 1 1 0 cust_789 1
should I make the target variable be something like
df['target_variable']='no_purchase'
for cust in list(set(df['customer_id'])):
    df['target_variable']=np.where(df['visit']>0,cust,df['target_variable'])
or have my visit feature be my target variable? If it's the latter, should I OHE all 2 million customers? If not, how should I set this up on Keras so that it classifies visits for all 2 million customers?
I think you should first understand your problem better. Modeling it correctly requires strong domain knowledge, and it can be modeled in many different ways; below are just some examples:
Regression problem: given a customer's purchase record containing only relative dates, predict the gap until the next purchase. For example,
construct a sequence like [date2-date1, date3-date2, date4-date3, ...] from your data.
[6, 7, 5, 13, ...] means a customer is likely to buy things on a weekly or biweekly basis.
[24, 30, 33, ...] means a customer is likely to buy things on a monthly basis.
If you organize the problem this way, all you need to do is predict the next number in a given sequence. You can easily generate such training data as follows:
randomly select a full sequence, say [a, b, c, d, e, f, ..., z]
randomly select a position to predict, say x
pick the K (say K=6) preceding values [r, s, t, u, v, w] as your network input, and x as your network target.
Once this model has been trained, your ultimate task can be resolved by checking whether the predicted number is greater than 60.
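The data preparation described above can be sketched in plain Python. This is only an illustration under assumptions: the helper names `gaps_in_days` and `sample_training_pair` are made up, the year in the dates is invented (the question only shows month and day), and K is set to 2 because the sample customer has so few visits.

```python
import random
from datetime import date

def gaps_in_days(visit_dates):
    """Turn sorted visit dates into [date2-date1, date3-date2, ...] in days."""
    return [(b - a).days for a, b in zip(visit_dates, visit_dates[1:])]

def sample_training_pair(gaps, k, rng):
    """Pick a random position and return (k preceding gaps, gap to predict)."""
    if len(gaps) <= k:
        raise ValueError("need more than k gaps to build one example")
    pos = rng.randrange(k, len(gaps))
    return gaps[pos - k:pos], gaps[pos]

# Visit dates for cust_123 in the question's table (the year is an assumption).
visits = [date(2018, 6, 1), date(2018, 6, 3), date(2018, 6, 5), date(2018, 6, 8)]
gaps = gaps_in_days(visits)
window, target = sample_training_pair(gaps, k=2, rng=random.Random(0))
print(gaps, window, target)
```

Repeating the sampling step over many customers and positions would yield the (input, target) pairs for whichever regression model is chosen.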
Classification problem: given a customer's purchase record of K months, predict how many purchases the customer will make in the next two months.
Again, you need to create training data from your raw data, but this time the target for a customer is how many items they purchased in months K+1 and K+2, and you may organize the K-month input record in your own way.
Please note that the number of items a customer purchases is a discrete number, but far below 1M. In fact, as in face-image-based age estimation, people often quantize the target into bins, e.g. 0-8, 9-16, 17-24, etc. You can do the same for your problem. Of course, you may also formulate this as a regression problem to directly predict the number of items.
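As a rough illustration of the binning idea, the following Python sketch maps a purchase count to a class index using the example bins 0-8, 9-16, 17-24, and so on. The function names and the fixed bin width of 8 are assumptions taken from that example, not a recommendation for real data.

```python
def count_to_bin(n_items, width=8):
    """Map a purchase count to a bin index: 0-8 -> 0, 9-16 -> 1, 17-24 -> 2, ..."""
    if n_items < 0:
        raise ValueError("count must be non-negative")
    # The first bin also contains 0, so clamp the index at 0.
    return max(0, (n_items - 1) // width)

def bin_label(index, width=8):
    """Human-readable range for a bin index, e.g. 1 -> '9-16'."""
    lo = 0 if index == 0 else index * width + 1
    hi = (index + 1) * width
    return f"{lo}-{hi}"

# Class labels for a handful of purchase counts.
labels = [bin_label(count_to_bin(n)) for n in (0, 8, 9, 16, 17, 24)]
print(labels)
```

The bin index can then serve as the class label, which keeps the number of classes small regardless of how many customers there are.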
Why do you need to know your problem better?
As you can see, you may come up with a number of problem formulations that all look reasonable at first glance, making it very difficult to say which one is best.
It is worth noting the dependence between a problem set-up and its hidden premises (you may not notice such things until you think about the problem carefully). For example, the regression set-up that predicts the gap until the next purchase implies that the number of items a customer purchased does not matter. This assumption may or may not be fair in your problem.
You may come up with a much simpler but more effective solution if you know your problem well.
In most problems like yours, you don't have to use deep learning, or at least not at first. Classic approaches may work better.

Generate data following specified pattern in J

I'm dabbling my feet with J and, to get the ball rolling, decided to write a function that:
gets integer N;
spits out a table that follows this pattern:
(example for N = 4)
1
0 1
0 0 1
0 0 0 1
i.e. the number of zeroes in each row increases from 0 up to N - 1.
However, being a newbie, I'm stuck. My current labored (and incorrect) solution for the N = 4 case looks like:
(4 # ,: 0 1) #~/"1 1 (1 ,.~/ i.4)
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
And the problem with it is twofold:
it's not general enough and looks kinda ugly (parens and " usage);
trailing zeroes - as I understand, all arrays in J are homogeneous, so in my case every row should be boxed.
Like that:
┌───────┐
│1 │
├───────┤
│0 1 │
├───────┤
│0 0 1 │
├───────┤
│0 0 0 1│
└───────┘
Or I should use strings (e.g. '0 0 1') which will be padded with spaces instead of zeroes.
So, what I'm kindly asking here is:
please provide an idiomatic J solution for this task with explanation;
criticize my attempt and point out how it could be finished.
Thanks in advance!
Like so many challenges in J, sometimes it is better to keep your focus on your result and find a different way to get there. In this case, what your initial approach is doing is creating an identity matrix. I would use
=/~@:i. 4
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
You have correctly identified the issue with the trailing 0's and the fact that J will pad out with 0's to avoid ragged arrays. Boxing avoids this padding since each row is self contained.
So create your lists first. I would use overtake to get the extra 0's
4{.1
1 0 0 0
The next line uses 1: to return 1 as a verb and boxes the overtakes from 1 to 4
(>:@:i. <@:{."0 1:) 4
+-+---+-----+-------+
|1|1 0|1 0 0|1 0 0 0|
+-+---+-----+-------+
Since we want each list reversed and then converted to a string, we add ":@:|.@: to the process.
(>:@:i. <@:":@:|.@:{."0 1:) 4
+-+---+-----+-------+
|1|0 1|0 0 1|0 0 0 1|
+-+---+-----+-------+
Then we unbox
>@:(>:@:i. <@:":@:|.@:{."0 1:) 4
1
0 1
0 0 1
0 0 0 1
I am not sure this is the way everyone would solve the problem, but it works.
An alternative solution avoids boxing and relies on the dyadic j. (Complex) and the fact that a complex left argument to # (Copy) pads the copies with zeros:
1j4 # 1
1 0 0 0 0
(1 j. 4) # 1
1 0 0 0 0
(1 #~ 1 j. ]) 4
1 0 0 0 0
So, I create a list for each integer in i. 4, then reverse them and make them into strings. Since they are now strings, the extra padding is done with blanks.
(1 ":@:|.@:#~ 1 j. ])"0@:i. 4
1
0 1
0 0 1
0 0 0 1
Taking this step by step as to hopefully explain a little better.
i.4
0 1 2 3
This is then applied to (1 ":@:|.@:#~ 1 j. ]) one atom at a time, hence the use of "0
Breaking down what is going on within the parentheses: I first take the right three verbs, which form a fork.
( 1 j. ])"0@:i.4
1 1j1 1j2 1j3
Now, effectively that gives me
1 ":@:|.@:#~ 1 1j1 1j2 1j3
The middle tine of the fork becomes the verb acting on the two noun arguments. The ~ swaps the arguments, so it becomes equivalent to
1 1j1 1j2 1j3 ":@:|.@:# 1
which, because of the way @: works, is the same as
": |. 1 1j1 1j2 1j3 # 1
I haven't shown the results of these components because using "0 on the fork changes how the arguments are sent to the middle tine and how the results are assembled afterwards. I'm hoping there is enough here that, with some hand waving, the explanation may suffice.
The jump from tacit to explicit can be a big one, so it may be a better exercise to write the same verb explicitly to see if it makes more sense.
lowerTriangle =: 3 : 0
  rightArg=. i. y
  complexCopy=. 1 j. rightArg
  1 (":@:|.@:#~)"0 complexCopy
)
lowerTriangle 4
1
0 1
0 0 1
0 0 0 1
lowerTriangle 5
1
0 1
0 0 1
0 0 0 1
0 0 0 0 1
See what happens when you 'get the ball rolling'? I guess the thing about J is that the ball goes down a pretty steep slope no matter where you begin. Exciting, eh?

How to make dummy variables with comma separated valued columns? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 5 years ago.
I am working on data preprocessing for machine learning and faced a problem.
Here is what I want to do.
table image:
The table is a pandas DataFrame.
My current table is the one on the left, and I want to transform it into the one on the right.
The numbers of movies and actors are not fixed.
EDIT :
Data input
df=pd.DataFrame({'name':['A','B','C'],'actors':['a,b','b,d','c,m']})
Expected output :
a b c d m
A 1 1 0 0 0
B 0 1 0 1 0
C 0 0 1 0 1
Try this? (BTW, if this is the Kaggle movie dataset, LabelEncoder may be a better choice.)
PS: I did not add the name column; you can simply do out['name']=df.name
Option 1 pd.crosstab
df.actors=df.actors.str.split(',')
df1=df.set_index('name').actors.apply(pd.Series).stack()
pd.crosstab(df1.index.get_level_values(0),df1).rename_axis(None).rename_axis(None,axis=1)
Out[246]:
a b c d m
A 1 1 0 0 0
B 0 1 0 1 0
C 0 0 1 0 1
Option 2
get_dummies
pd.get_dummies(df.actors.str.split(',').apply(pd.Series).stack()).groupby(level=0).sum()
Out[230]:
a b c d m
0 1 1 0 0 0
1 0 1 0 1 0
2 0 0 1 0 1
Option 3
MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.actors.str.split(',')),columns=mlb.classes_,index=df.name).reset_index()
Out[238]:
name a b c d m
0 A 1 1 0 0 0
1 B 0 1 0 1 0
2 C 0 0 1 0 1
Data Input
df=pd.DataFrame({'name':['A','B','C'],'actors':['a,b','b,d','c,m']})
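To make explicit what all three options compute, here is a dependency-free Python sketch of the same multi-hot encoding. It is not a replacement for the pandas or scikit-learn calls above; the function name `multi_hot` and its return shape are assumptions chosen for illustration.

```python
def multi_hot(names, actor_strings, sep=","):
    """Return (columns, table) where table[name] is a 0/1 list, one entry per actor."""
    # Every distinct actor becomes one column, in sorted order.
    columns = sorted({a for s in actor_strings for a in s.split(sep)})
    table = {}
    for name, s in zip(names, actor_strings):
        present = set(s.split(sep))
        table[name] = [int(col in present) for col in columns]
    return columns, table

# Same input as the DataFrame in the question.
columns, table = multi_hot(["A", "B", "C"], ["a,b", "b,d", "c,m"])
print(columns)
for name, row in table.items():
    print(name, row)
```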
