I have the following two dataframes. I want to plot a lineplot or scatterplot which shall indicate the value counts of "UserId" and "OwnerUserId" on x and y axes. Note that 'UserId' and 'OwnerUserId' indicate the same user, these are just different notations in different dataframes.
I actually want to find out if there is a relation between the total no. of badges a user has (indicated by value counts of 'UserId' to the total no. of posts he has made (indicated by value counts of 'OwnerUserId'.
I have written a code, but don't know if its right.
Dataframe 1 (badges)
rowId UserId Name Date Class TagBased
1 1 Teacher 2009-09-30T15:17:50.660 3 False
2 3 Teacher 2009-09-30T15:17:50.690 3 False
3 13 Teacher 2009-09-30T15:17:50.690 3 False
4 14 Teacher 2009-09-30T15:17:50.690 3 False
5 22 Teacher 2009-09-30T15:17:50.690 3 False
DataFrame 2 (posts)
OwnerUserId Post Title CreationDate
1 Is a quotient of a reductive group reductive? 2009-09-28T16:11:38.533
2 Learning about Lie groups 2009-09-28T17:42:23.207
1 Maths 2009-09-28T19:41:13.933
Code
x = badges['UserId'].value_counts()
y = posts['OwnerUserId'].value_counts()
sns.scatterplot(x,y)
Note
All users of first dataframe may not be present in second dataframe. Is there a way I can plot only for those users common to both dataframes?
If you are sure that the posts users are included in the badges users, you can filter the badges values with the posts index.
y = posts['OwnerUserId'].value_counts()
x = badges['UserId'].value_counts()[y.index]
sns.scatterplot(x,y)
Else, filter x and y with the common attributes of the 2 dataframes
common = list(set(badges.UserId.unique()).union(set(posts.OwnerUserId.unique()))
I am working on a simple time series linear regression using statsmodels.api.OLS, and am running these regressions on groups of data based on an identifier variable. I have been able to get the grouped regressions working, but am now looking to merge the results of the regressions back into the original dataframe and am getting index errors.
A simplified version of my original dataframe, which we'll call "df" looks like this:
id value time
a 1 1
a 1.5 2
a 2 3
a 2.5 4
b 1 1
b 1.5 2
b 2 3
b 2.5 4
My function to conduct the regressions is as follows:
def ols_reg(df, xcol, ycol):
x = df[xcol]
y = df[ycol]
x = sm.add_constant(x)
model = sm.OLS(y, x, missing='drop').fit()
predictions = model.predict()
return pd.Series(predictions)
I then define a variable that stores the results of conducting this function on my dataset, grouping by the id column. This code is as follows:
var = df.groupby('id').apply(ols_reg,
xcol='time',ycol='value')
This returns a Series of the predicted linear values that has the same length as the original dataset, and looks like the following:
id
a 0 0.5
1 1
2 2.5
3 3
b 0 0.5
1 1
2 2.5
3 3
The column starting with 0.5 (ignore the values; not the actual output) is the column with predicted values from the regression. As the return on the function shows, this is a pandas Series.
I now want to merge these results back into the original dataframe, to look like the following:
id value time results
a 1 1 0.5
a 1.5 2 1
a 2 3 2.5
a 2.5 4 3
b 1 1 0.5
b 1.5 2 1
b 2 3 2.5
b 2.5 4 3
I've tried a number of methods, such as setting a new column in the original dataset equal to the series, but get the following error:
TypeError: incompatible index of inserted column with frame index
Any help on getting these results back into the original dataframe would be greatly appreciated. There are a number of other posts that correspond to this topic, but none of the solutions worked for me in this instance.
UPDATE:
I've solved this with a relatively simple method, in which I converted the series to a list, and just set a new column in the dataframe equal to the list. However, I would be really curious to hear if others have better/different/unique solutions to this problem. Thanks!
To not loose the position when inserting prediction in the missing values you can use this approach, in example:
X_train: The train data is a pandas dataframe corresponding to the known real results (in y_train).
X_test: The test data is a pandas dataframe without corresponding known real results. Need to predict.
y_train: The train data is pandas serie with real known results
Prediction: The prediction is a pandas series object
To get the complete data merged in one pandas dataframe first get the known part together:
# merge train part of the data into a dataframe
X_train = X_train.sort_index()
y_train = y_train.sort_index()
result = pd.concat([X_train,X_test])
# if need to convert numpy array to pandas series:
# prediction = pd.Series(prediction)
# here is the magic
result['specie'][result['specie'].isnull()] = prediction.values
If there is no missing value would do the job.
I build a function for do a difference between two different pandas columns from different data set. First data set contain the predict value and the second data set contain observed. The problem is that the row of two data set are different and for do a difference I must use the ID of row.
The function is:
def difference(data1,data2):
for i in range(data1.shape[0]):
e_id=data2.iloc[i,0]
p_oss =data1.iloc[int(e_id),9]
diff= p_oss - data2.iloc[i,1]
return diff
difference(df,evaluation)
Where: data1: observed value data2:predict value
The error of functionis:
IndexError: single positional indexer is out-of-bounds
The observed data set is structured like this:
ID Attribute1 Attribute2 ... Prime
1 N 10 123
2 S 10 128
3 N 8 26
4 S 12 567
..
n N 15 5
The predict data set is structured like this:
ID Prime
4 566.89
1 123.03
2 127.95
3 26.01
...
The ID of predict data set change because I use a function (train_test_split) to split the originally df in train e test set.
I want an output like this:
ID difference
1 0.03
2 0.05
3 0.1
4 0.11
..
This is the output of pandas in excel format:
Id comments number
1 so bad 1
1 so far 2
2 always 3
2 very good 4
3 very bad 5
3 very nice 6
3 so far 7
4 very far 8
4 very close 9
4 busy 10
I want to use pandas to give a color (for example: gray color) to rows that their value for Id column is even. For example rows 3 and 4 have even Id numbers, but rows 5, 6 and 7 have odd Id numbers. Is there any possible way to use pandas to do it?
As explained in the documentation http://pandas.pydata.org/pandas-docs/stable/style.html what you basically want to do is write a style function and apply it to the style object.
def _color_if_even(s):
return ['background-color: grey' if val % 2 == 0 else '' for val in s]
and call it on my Styler object, i.e.,
df.style.apply(_color_if_even, subset=['id'])
I'm currently learning the fascinating J programming language, but one thing I have not been able to figure out is how to filter a list.
Suppose I have the arbitrary list 3 2 2 7 7 2 9 and I want to remove the 2s but leave everything else unchanged, i.e., my result would be 3 7 7 9. How on earth do I do this?
The short answer
2 (~: # ]) 3 2 2 7 7 2 9
3 7 7 9
The long answer
I have the answer for you, but before you should get familiar with some details. Here we go.
Monads, dyads
There are two types of verbs in J: monads and dyads. The former accept only one parameter, the latter accept two parameters.
For example passing a sole argument to a monadic verb #, called tally, counts the number of elements in the list:
# 3 2 2 7 7 2 9
7
A verb #, which accepts two arguments (left and right), is called copy, it is dyadic and is used to copy elements from the right list as many times as specified by the respective elements in the left list (there may be a sole element in the list also):
0 0 0 3 0 0 0 # 3 2 2 7 7 2 9
7 7 7
Fork
There's a notion of fork in J, which is a series of 3 verbs applied to their arguments, dyadically or monadically.
Here's the diagram of a kind of fork I used in the first snippet:
x (F G H) y
G
/ \
F H
/ \ / \
x y x y
It describes the order in which verbs are applied to their arguments. Thus these applications occur:
2 ~: 3 2 2 7 7 2 9
1 0 0 1 1 0 1
The ~: (not equal) is dyadic in this example and results in a list of boolean values which are true when an argument doesn't equal 2. This was the F application according to diagram.
The next application is H:
2 ] 3 2 2 7 7 2 9
3 2 2 7 7 2 9
] (identity) can be a monad or a dyad, but it always returns the right argument passed to a verb (there's an opposite verb, [ which returns.. Yes, the left argument! :)
So far, so good. F and H after application returned these values accordingly:
1 0 0 1 1 0 1
3 2 2 7 7 2 9
The only step to perform is the G verb application.
As I noted earlier, the verb #, which is dyadic (accepts two arguments), allows us to duplicate the items from the right argument as many times as specified in the respective positions in the left argument. Hence:
1 0 0 1 1 0 1 # 3 2 2 7 7 2 9
3 7 7 9
We've just got the list filtered out of 2s.
Reference
Slightly different kind of fork, hook and other primitves (including abovementioned ones) are described in these two documents:
A Brief J Reference (175 KiB)
Easy-J. An Introduction to the World's most Remarkable Programming Language (302 KiB)
Other useful sources of information are the Jsoftware site with their wiki and a few mail list archives in internets.
Just to be sure it's clear, the direct way - to answer the original question - is this:
3 2 2 7 7 2 9 -. 2
This returns
3 7 7 9
The more elaborate method - generating the boolean and using it to compress the vector - is more APLish.
To answer the other question in the very long post, to return the first element and the number of times it occurs, is simply this:
({. , {. +/ .= ]) 1 4 1 4 2 1 3 5
1 3
This is a fork using "{." to get the first item, "{. +/ . = ]" to add up the number of times the first item equals each element, and "," as the middle verb to concatenate these two parts.
Also:
2 ( -. ~ ]) 3 2 2 7 7 2 9
3 7 7 9
There are a million ways to do this - it bothers me, vaguely, that these these things don't evaluate strictly right to left, I'm an old APL programmer and I think of things as right to left even when they ain't.
If it were a thing that I was going to put into a program where I wanted to pull out some number and the number was a constant, I would do the following:
(#~ 2&~:) 1 3 2 4 2 5
1 3 4 5
This is a hook sort of thing, I think. The right half of the expression generates the truth vector regarding which are not 2, and then the octothorpe on the left has its arguments swapped so that the truth vector is the left argument to copy and the vector is the right argument. I am not sure that a hook is faster or slower than a fork with an argument copy.
+/3<+/"1(=2&{"1)/:~S:_1{;/5 6$1+i.6
156
This above program answers the question, "For all possible combinations of Yatzee dice, how many have 4 or 5 matching numbers in one roll?" It generates all the permutations, in boxes, sorts each box individually, unboxing them as a side effect, and extracts column 2, comparing the box against their own column 2, in the only successful fork or hook I've ever managed to write. The theory is that if there is a number that appears in a list of 5, three or more times, if you sort the list the middle number will be the number that appears with the greatest frequency. I have attempted a number of other hooks and/or forks and every one has failed because there is something I just do not get. Anyway that truth table is reduced to a vector, and now we know exactly how many times each group of 5 dice matched the median number. Finally, that number is compared to 3, and the number of successful compares (greater than 3, that is, 4 or 5) are counted.
This program answers the question, "For all possible 8 digit numbers made from the symbols 1 through 5, with repetition, how many are divisible by 4?"
I know that you need only determine how many within the first 25 are divisible by 4 and multiply, but the program runs more or less instantly. At one point I had a much more complex version of this program that generated the numbers in base 5 so that individual digits were between 0 and 4, added 1 to the numbers thus generated, and then put them into base 10. That was something like 1+(8$5)#:i.5^8
+/0=4|,(8$10)#. >{ ;/ 8 5$1+i.5
78125
As long as I have solely verb trains and selection, I don't have a problem. When I start having to repeat my argument within the verb so that I'm forced to use forks and hooks I start to get lost.
For example, here is something I can't get to work.
((1&{~+/)*./\(=1&{))1 1 1 3 2 4 1
I always get Index Error.
The point is to output two numbers, one that is the same as the first number in the list, the second which is the same as the number of times that number is repeated.
So this much works:
*./\(=1&{)1 1 1 3 2 4 1
1 1 1 0 0 0 0
I compare the first number against the rest of the list. Then I do an insertion of an and compression - and this gives me a 1 so long as I have an unbroken string of 1's, once it breaks the and fails and the zeros come forth.
I thought that I could then add another set of parens, get the lead element from the list again, and somehow record those numbers, the eventual idea would be to have another stage where I apply the inverse of the vector to the original list, and then use $: to get back for a recursive application of the same verb. Sort of like the quicksort example, which I thought I sort of understood, but I guess I don't.
But I can't even get close. I will ask this as a separate question so that people get proper credit for answering.