I am trying to pass a range of numbers as an index, from 0 to 100, or from 0 to n if I have n numbers. How do I do that? Could you please help me with sample code in Cucumber / Karate?
Examples:
| index | number | em_number |
| 0 | 1 | 10 |
| 1 | 1 | 10 |
| 2 | 1 | 10 |
| 3 | 1 | 10 |
| 4 | 1 | 10 |
I think you need to spend some time on fundamentals before trying to over-complicate your tests.
That said, Karate has a built-in function for exactly this: karate.range(). Try this:
* def nums = karate.range(5, 9)
* match nums == [5, 6, 7, 8, 9]
And then please read the docs on JSON transforms: https://github.com/karatelabs/karate#json-transforms
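For the 0 to n case asked about above, a minimal sketch along the same lines (taking n = 100 purely as an example) could look like this:
* def n = 100
* def indexes = karate.range(0, n)
* match indexes[0] == 0
* match indexes[100] == 100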
The dataframe has the following features:
+--------+--------+--------+------+-------+--------+-----+-------+
|        | id     | weight | type | value | export | tax | total |
+--------+--------+--------+------+-------+--------+-----+-------+
| 0      | 1      | 4      | 1    | 10    | 1      | 5   | 15    |
+--------+--------+--------+------+-------+--------+-----+-------+
| 1      | 2      | 3      | 1    | 12    | 1      | 6   | 18    |
+--------+--------+--------+------+-------+--------+-----+-------+
| 2      | 3      | 8      | 2    | 15    | 0      | 0   | 15    |
+--------+--------+--------+------+-------+--------+-----+-------+
| ...    | ...    | ...    | ...  | ...   | ...    | ... | ...   |
+--------+--------+--------+------+-------+--------+-----+-------+
| 123004 | 123005 | 5      | 2    | 12    | 0      | 0   | 12    |
+--------+--------+--------+------+-------+--------+-----+-------+
The tax column should be predicted. It is important to consider the relationship between tax and export: when export == 1 there is a tax, and when export == 0 the tax is 0.
The following code (a random forest, as an example) predicts the tax without considering this rule.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

y = df['tax']
X = df.drop(columns=['tax'])

# Split the data into training and testing sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestRegressor(max_depth=10, random_state=101, n_estimators=42)
rf.fit(train_X, train_y)
predictions = rf.predict(test_X)
Questions:
1- How can I tell the algorithm to consider the above rule?
2- The tax cannot be more than the value. How can I set limitations or a range for the prediction?
3- If there is another method that predicts the same result, please mention it (a random forest is not a must).
4- I am a beginner in this field, so good ideas for this sample are very welcome.
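One approach sometimes used for hard business rules like these is to post-process the predictions rather than the model itself. A rough sketch (an assumption, not part of the original code; it assumes the split above and that test_X still contains the export and value columns) could be:

import numpy as np

predictions = rf.predict(test_X)
# Assumed rule: no export (export == 0) means the tax must be 0
predictions = np.where(test_X['export'] == 0, 0, predictions)
# Assumed rule: the predicted tax can never exceed the value
predictions = np.clip(predictions, 0, test_X['value'])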
Trying to figure out how to randomly replace some of the values in a specific column with nulls in PySpark. So, changing a dataframe such as this:
| A | B |
|----|----|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |
| 7 | 8 |
| 9 | 10 |
| 11 | 12 |
and randomly change 25% of the values in column 'B' to null values:
| A | B |
|----|------|
| 1 | 2 |
| 3 | NULL |
| 5 | 6 |
| 7 | NULL |
| 9 | NULL |
| 11 | 12 |
Thanks to #pault I was able to answer my own question using the question he posted, which you can find here.
Essentially I ran something like this:
import pyspark.sql.functions as f
df1 = df.withColumn('Val', f.when(f.rand() > 0.25, df['Val']).otherwise(f.lit(None)))
This randomly replaces roughly 25% of the values in the column 'Val' with None.
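A more complete, self-contained version of the same idea (assuming a local SparkSession and the column 'B' from the example; the seed is only there to make the run reproducible) might look like this:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data from the question; 'B' is the column to be partially nulled
df = spark.createDataFrame([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)], ['A', 'B'])

# Keep the original value when rand() > 0.25 (roughly 75% of rows), otherwise set it to null
df = df.withColumn('B', f.when(f.rand(seed=42) > 0.25, f.col('B')).otherwise(f.lit(None)))
df.show()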
In this table, I want to find the average number of days between actions for each user.
What I mean is: I want to group by user_id, then for each user subtract each date from the date immediately before it (in days), and finally take the average of those gaps per user (the average number of no-action days per user).
+---------+-----------+----------------------+
| User_ID | Action_ID | Action_At            |
+---------+-----------+----------------------+
| 1       | 11        | 2019-01-31T23:00:37Z |
+---------+-----------+----------------------+
| 2       | 12        | 2019-01-31T23:11:12Z |
+---------+-----------+----------------------+
| 3       | 13        | 2019-01-31T23:14:53Z |
+---------+-----------+----------------------+
| 1       | 14        | 2019-02-01T00:00:30Z |
+---------+-----------+----------------------+
| 2       | 15        | 2019-02-01T00:01:03Z |
+---------+-----------+----------------------+
| 3       | 16        | 2019-02-01T00:02:32Z |
+---------+-----------+----------------------+
| 1       | 17        | 2019-02-06T11:30:28Z |
+---------+-----------+----------------------+
| 2       | 18        | 2019-02-06T11:30:28Z |
+---------+-----------+----------------------+
| 3       | 19        | 2019-02-07T09:09:16Z |
+---------+-----------+----------------------+
| 1       | 20        | 2019-02-11T15:37:24Z |
+---------+-----------+----------------------+
| 2       | 21        | 2019-02-18T10:02:07Z |
+---------+-----------+----------------------+
| 3       | 22        | 2019-02-26T12:01:31Z |
+---------+-----------+----------------------+
You can do it like this (and next time, please provide the data so that it is easy to help you; it took me much longer to enter the data than to get to the solution):
import pandas as pd

df = pd.DataFrame({'User_ID': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Action_ID': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
                   'Action_At': ['2019-01-31T23:00:37Z', '2019-01-31T23:11:12Z', '2019-01-31T23:14:53Z', '2019-02-01T00:00:30Z', '2019-02-01T00:01:03Z', '2019-02-01T00:02:32Z', '2019-02-06T11:30:28Z', '2019-02-06T11:30:28Z', '2019-02-07T09:09:16Z', '2019-02-11T15:37:24Z', '2019-02-18T10:02:07Z', '2019-02-26T12:01:31Z']})
df.Action_At = pd.to_datetime(df.Action_At)
df.groupby('User_ID').apply(lambda x: (x.Action_At - x.Action_At.shift()).mean())
## User_ID
## 1 3 days 13:32:15.666666
## 2 5 days 19:36:58.333333
## 3 8 days 12:15:32.666666
## dtype: timedelta64[ns]
Or, if you want the solution in days:
df.groupby('User_ID').apply(lambda x: (x.Action_At - x.Action_At.shift()).dt.days.mean())
## User_ID
## 1 3.333333
## 2 5.333333
## 3 8.333333
## dtype: float64
I have a csv file where I want to skip a random percentage of rows but only for rows where one of the columns has a specific entry. For example I might have a csv with contents below and I want to skip a certain percentage of all the apple entries:
|   | a | b  | c  | d  | e      |
|---|---|----|----|----|--------|
| 0 | 9 | 1  | 2  | 3  | apple  |
| 1 | 8 | 4  | 5  | 6  | apple  |
| 2 | 7 | 7  | 8  | 9  | apple  |
| 3 | 6 | 10 | 11 | 12 | orange |
| 4 | 5 | 13 | 14 | 15 | orange |
| 5 | 4 | 16 | 17 | 18 | orange |
| 6 | 3 | 19 | 20 | 21 | orange |
| 7 | 2 | 22 | 23 | 24 | banana |
| 8 | 1 | 25 | 26 | 27 | banana |
| 9 | 0 | 28 | 29 | 30 | banana |
I know I could skip rows across the entire file with something like
df = pd.read_csv('fruit.csv', skiprows = lambda i: i>0 and random.random() > probability_value)
I know I can also select just the apple entries from the dataframe with
df2 = df.loc[df['e'] == 'apple']
But is there a simple way to select these entries when importing the csv and apply the skip rows so all the non 'apple' entries aren't affected by the skip row?
You can do it as follows, but I would prefer doing it at a later stage, after the csv has been read:
df = pd.read_csv('fruit.csv').query("e != 'apple'")
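If the requirement really is to drop only a random fraction of the 'apple' rows while leaving every other row untouched, one post-read sketch (the file name and the fraction here are just assumptions) would be:

import pandas as pd

df = pd.read_csv('fruit.csv')

drop_fraction = 0.25  # assumed fraction of 'apple' rows to skip
# Randomly pick that fraction of the 'apple' rows and drop them by index;
# rows with other values in column 'e' are not touched
apple_index = df[df['e'] == 'apple'].sample(frac=drop_fraction, random_state=42).index
df = df.drop(index=apple_index)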