Pandas read_csv randomly skip rows with specific entries - python-3.x

I have a csv file where I want to skip a random percentage of rows but only for rows where one of the columns has a specific entry. For example I might have a csv with contents below and I want to skip a certain percentage of all the apple entries:
|   | a | b  | c  | d  | e      |
|---|---|----|----|----|--------|
| 0 | 9 | 1  | 2  | 3  | apple  |
| 1 | 8 | 4  | 5  | 6  | apple  |
| 2 | 7 | 7  | 8  | 9  | apple  |
| 3 | 6 | 10 | 11 | 12 | orange |
| 4 | 5 | 13 | 14 | 15 | orange |
| 5 | 4 | 16 | 17 | 18 | orange |
| 6 | 3 | 19 | 20 | 21 | orange |
| 7 | 2 | 22 | 23 | 24 | banana |
| 8 | 1 | 25 | 26 | 27 | banana |
| 9 | 0 | 28 | 29 | 30 | banana |
I know I could skip rows across the entire file with something like
df = pd.read_csv('fruit.csv', skiprows = lambda i: i>0 and random.random() > probability_value)
I know I can also select just the apple entries from the dataframe with
df2 = df.loc[df['e'] == 'apple']
But is there a simple way to apply the random skip only to the 'apple' rows while importing the csv, so that the non-'apple' rows aren't affected?

You can do it as follows, but I would prefer doing it at a later stage:
df = pd.read_csv('fruit.csv').query("e != 'apple'")
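If the goal is to drop only a random fraction of the 'apple' rows (rather than all of them), a minimal sketch is to read the file first and then filter with a random mask, reusing the question's probability_value:
import numpy as np
import pandas as pd

probability_value = 0.5  # chance of keeping each 'apple' row (name taken from the question)

df = pd.read_csv('fruit.csv')
# Keep every non-apple row; keep each apple row with probability `probability_value`.
is_apple = df['e'] == 'apple'
keep = ~is_apple | (np.random.rand(len(df)) < probability_value)
df = df[keep]
Filtering after the read keeps read_csv simple; doing it inside skiprows would mean re-parsing column e inside the lambda.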

Related

Efficiently finding all relevant sub-ranges for big data tables in Hive/Spark

Following this question, I would like to ask.
I have 2 tables:
The first table - MajorRange
row | From | To   | Group ...
----|------|------|----------
  1 | 1200 | 1500 | A
  2 | 2200 | 2700 | B
  3 | 1700 | 1900 | C
  4 | 2100 | 2150 | D
...
The second table - SubRange
row | From | To   | Group ...
----|------|------|----------
  1 | 1208 | 1300 | E
  2 | 1400 | 1600 | F
  3 | 1700 | 2100 | G
  4 | 2100 | 2500 | H
...
The output table should contain all the SubRange groups that overlap a MajorRange group. For the example above, the result table is:
row | Major | Sub
----|-------|----
  1 | A     | E
  2 | A     | F
  3 | B     | H
  4 | C     | G
  5 | D     | H
If there is no overlap between the ranges, the Major group will not appear. Both tables are big data tables. How can I do this using Hive/Spark in the most efficient way?
With Spark, maybe a non-equi join like this?
val join_expr = major_range("From") < sub_range("To") && major_range("To") > sub_range("From")

(major_range.join(sub_range, join_expr)
  .select(
    monotonically_increasing_id().as("row"),
    major_range("Group").as("Major"),
    sub_range("Group").as("Sub")
  )
).show
+---+-----+---+
|row|Major|Sub|
+---+-----+---+
| 0| A| E|
| 1| A| F|
| 2| B| H|
| 3| C| G|
| 4| D| H|
+---+-----+---+
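For reference, a rough PySpark version of the same non-equi join (assuming DataFrames named major_range and sub_range with the columns shown above) might look like:
from pyspark.sql import functions as F

# Overlap condition: a sub range overlaps a major range when it starts
# before the major range ends and ends after the major range starts.
join_expr = (major_range["From"] < sub_range["To"]) & (major_range["To"] > sub_range["From"])

(major_range.join(sub_range, join_expr)
    .select(
        F.monotonically_increasing_id().alias("row"),
        major_range["Group"].alias("Major"),
        sub_range["Group"].alias("Sub"))
    .show())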

Derive column value from date time difference in a data frame and input it in another column

I am new to Spark (with Python) and have searched all over for solutions to what I'm trying to do, but haven't found anything that relates to this.
I have two data frames, one called quantity and another called price.
Quantity
+----+------------+------+----------+
| ID | Price_perf | Size | Sourceid |
+----+------------+------+----------+
|  1 | NULL       |    3 |      223 |
|  1 | NULL       |    3 |      223 |
|  1 | NULL       |    3 |      220 |
|  2 | NULL       |    6 |      290 |
|  2 | NULL       |    6 |      270 |
+----+------------+------+----------+
Price
+----+-------+------+------------+----------+
| ID | Price | Size | Date       | Sourceid |
+----+-------+------+------------+----------+
|  1 |   7.5 |    3 | 2017-01-03 |      223 |
|  1 |    39 |    3 | 2012-01-06 |      223 |
|  1 |    12 |    3 | 2009-04-01 |      223 |
|  1 |    28 |    3 | 2011-11-08 |      223 |
|  1 |     9 |    3 | 2012-09-12 |      223 |
|  1 |    15 |    3 | 2017-07-03 |      220 |
|  1 |    10 |    3 | 2017-05-03 |      220 |
|  1 |    33 |    3 | 2012-03-08 |      220 |
+----+-------+------+------------+----------+
Firstly, I am trying to join the above two dataframes and return a data frame that contains only values that have the same ID and SourceID
I have tried to do that by doing the following:
c= quantity.join(price,price.id==quantity.id, price.souceid==quantity.sourceid "left")
c.show()
This is the result I want to get but I'm not getting:
+----+------------+-------+------------+------+----------+
| ID | Price_perf | Price | Date       | Size | Sourceid |
+----+------------+-------+------------+------+----------+
|  1 | NULL       |   7.5 | 2017-01-03 |    3 |      223 |
|  1 | NULL       |     9 | 2012-01-06 |    3 |      223 |
|  1 | NULL       |    12 | 2009-04-01 |    3 |      223 |
|  1 | NULL       |    28 | 2011-11-08 |    3 |      223 |
|  1 | NULL       |     9 | 2012-09-12 |    3 |      223 |
|  1 | NULL       |    15 | 2017-07-03 |    3 |      220 |
|  1 | NULL       |    10 | 2017-05-03 |    3 |      220 |
|  1 | NULL       |    33 | 2012-03-08 |    3 |      220 |
+----+------------+-------+------------+------+----------+
Secondly, after doing the join, I'm trying to get the difference in price between the earliest and latest dates in the joined data frame and fill it in as Price_perf.
This is what I've tried:
def modify_values(c):
    for x in c:
        if quantity.sourceid == price.sourceid:
            return price.price(min(Date)) - price.price(max(Date))
        else:
            return "Not found"

ol_val = udf(modify_values, StringType())
ol_val.show()
So the final output should look something like this:
+----+------------+-------+------------+------+----------+
| ID | Price_perf | Price | Date       | Size | Sourceid |
+----+------------+-------+------------+------+----------+
|  1 |        4.5 |   7.5 | 2017-01-03 |    3 |      223 |
|  1 |        4.5 |     9 | 2012-01-06 |    3 |      223 |
|  1 |        4.5 |    12 | 2009-04-01 |    3 |      223 |
|  1 |        4.5 |    28 | 2011-11-08 |    3 |      223 |
|  1 |        4.5 |     9 | 2012-09-12 |    3 |      223 |
|  1 |         18 |    15 | 2017-07-03 |    3 |      220 |
|  1 |         18 |    10 | 2017-05-03 |    3 |      220 |
|  1 |         18 |    33 | 2012-03-08 |    3 |      220 |
+----+------------+-------+------------+------+----------+
If you only want matches then you actually want an inner join, which is the default type. And since your column names are the same, you can just list them out so that the resulting join has only one column for each instead of two. (Also, multiple predicates in PySpark are combined with &, not separated by a comma.)
c = quantity.join(price,['id','sourceid'])
c.show()
As for your price_perf, I'm not sure what you really want. The min and max dates are constant within the same group, so your example doesn't make a lot of sense as written.
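If the example is read as "Price_perf = price at the earliest date minus price at the latest date within each (ID, Sourceid) group" (which does reproduce the 4.5 and 18 above), one possible sketch with window functions, applied to the joined frame c, would be:
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("id", "sourceid")

# min/max over a (date, price) struct pick the price at the earliest/latest date in the group.
c = (c
    .withColumn("first_price", F.min(F.struct("date", "price")).over(w)["price"])
    .withColumn("last_price", F.max(F.struct("date", "price")).over(w)["price"])
    .withColumn("price_perf", F.col("first_price") - F.col("last_price"))
    .drop("first_price", "last_price"))
c.show()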

How to align timestamps from two Datasets in Apache Spark

I ran into the following problem while developing an Apache Spark application. I have two Datasets (D1 and D2) from a PostgreSQL database that I would like to process using Apache Spark. Both contain a column (ts) with timestamps from the same period. I would like to join each row of D2 with the largest timestamp from D1 that is smaller than or equal to it. It might look like:
 D1      D2                  DJOIN
|ts|    |ts|             |D1.ts|D2.ts|
----    ----             -------------
| 1|    | 1|             |   1 |   1 |
| 3|    | 2|             |   1 |   2 |
| 5|    | 3|             |   3 |   3 |
| 7|    | 4|             |   3 |   4 |
|11|    | 5|             |   5 |   5 |
|13|    | 6|             |   5 |   6 |
        | 7|  = join =>  |   7 |   7 |
        | 8|             |   7 |   8 |
        | 9|             |   7 |   9 |
        |10|             |   7 |  10 |
        |11|             |  11 |  11 |
        |12|             |  11 |  12 |
        |13|             |  13 |  13 |
        |14|             |  13 |  14 |
In SQL I can simply write something like:
SELECT D1.ts, D2.ts
FROM D1, D2
WHERE D1.ts = (SELECT max(D1.ts)
               FROM D1
               WHERE D1.ts <= D2.ts);
There is the possibility of nested SELECT queries in Spark Datasets, but unfortunately they support only equality (=) and not <=. I am a beginner in Spark and am currently stuck here. Does someone more knowledgeable have a good idea on how to solve this issue?
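One possible way to express that correlated max in Spark is a non-equi join followed by a window function that keeps only the largest matching D1 timestamp per D2 row. A rough PySpark sketch, assuming DataFrames d1 and d2 that each have a ts column:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Pair every D2 row with all D1 timestamps that are <= its own timestamp...
joined = d2.alias("d2").join(d1.alias("d1"), F.col("d1.ts") <= F.col("d2.ts"))

# ...then keep only the largest such D1 timestamp per D2 row.
w = Window.partitionBy(F.col("d2.ts"))
djoin = (joined
    .withColumn("max_d1_ts", F.max(F.col("d1.ts")).over(w))
    .where(F.col("d1.ts") == F.col("max_d1_ts"))
    .select(F.col("d1.ts").alias("D1_ts"), F.col("d2.ts").alias("D2_ts")))
djoin.show()
If D1 is much smaller than D2, broadcasting it (F.broadcast(d1)) usually helps, since Spark can only run this kind of non-equi join as a nested-loop join.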

How to get and concatenate values from one column based on another column in Excel

Excel is beating me up for a day here.
I have this table:
+---+-----+-----+-----+-----+
|   | A   | B   | C   | D   |
+---+-----+-----+-----+-----+
| 1 | AGE | EX# | DG1 | DG2 |
+---+-----+-----+-----+-----+
| 2 | 19  | C01 | ASC |     |
+---+-----+-----+-----+-----+
| 3 | 45  | C02 | ATR |     |
+---+-----+-----+-----+-----+
| 4 | 27  | C03 | LSI |     |
+---+-----+-----+-----+-----+
| 5 | 15  | C04 | LSI |     |
+---+-----+-----+-----+-----+
| 6 | 49  | C05 | ASC | AGC |
+---+-----+-----+-----+-----+
| 7 | 76  | C06 | AGC |     |
+---+-----+-----+-----+-----+
| 8 | 33  | C07 | ASC |     |
+---+-----+-----+-----+-----+
| 9 | 17  | C08 | LSI |     |
+---+-----+-----+-----+-----+
Now I need to create a new table based on that data, with one row per DG; I'll fill in column A myself and need a formula to fill column B:
+----+-----+------------+
|    | A   | B          |
+----+-----+------------+
|    | DG  | AGE        |
+----+-----+------------+
| 10 | AGC | 49, 76     |
+----+-----+------------+
| 11 | ASC | 19, 33, 49 |
+----+-----+------------+
| 12 | ATR | 45         |
+----+-----+------------+
| 13 | LSI | 15, 17, 27 |
+----+-----+------------+
So I need a formula that checks the first table's columns C and D for each DG, looks up the age in column A for every match, and then concatenates all matching values into one cell with a comma as a separator.
Can anyone help me?
Thanks
On the great Excel website from Chip Pearson, I found the custom function StringConcat. Just copy-paste the code into a VBA module.
Something like the following formula (in cell B10, filled down) should work for you. It's an array formula (commit with Ctrl+Shift+Enter):
=StringConcat(", ",IF(Sheet1!$B$2:$C$100=A10,Sheet1!$A$2:$A$100,""))
You'll have to adjust the ranges, of course.

Excel: Sort one column into many columns

I would like to do the following data sorting/reshaping in Excel. Is there a way to do this?
From this
+--------+-------+
| Sample | Value |
+--------+-------+
|      1 |    30 |
|      1 |    10 |
|      2 |     6 |
|      2 |     5 |
|      3 |    62 |
|      3 |    20 |
+--------+-------+
To this
+---------+---------+---------+
| Sample1 | Sample2 | Sample3 |
+---------+---------+---------+
|      30 |       6 |      62 |
|      10 |       5 |      20 |
+---------+---------+---------+
edit: please excuse my ugly table.
If 30 is in B2, please try in C2:
=OFFSET($B2,2*(COLUMN()-3),0)
copied across and down to suit.
