How to create sketches for millions of rows using Spark? - apache-spark

I have a data frame something like this:
| UserID | Platform | Genre | Publisher  |
|--------|----------|-------|------------|
| 1      | PS2      | FPS   | Activision |
| 2      | PS1      | Race  | EA Sports  |
| 3      | PS2      | RTS   | Microsoft  |
| 4      | Xbox     | Race  | EA Sports  |
Now, from the above data frame, I want to build a Map whose keys combine a column name with a value, and whose values are the set of user IDs matching that pair.
For example:
Platform_PS2 = [1, 3]
Platform_Xbox = [4]
Platform_PS1 = [2]
Genre_Race = [2, 4]
Basically, for each of these sets, I want to build a sketch at the end.
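A minimal PySpark sketch of one way to get there, assuming the data frame shown above; collect_set stands in for a real sketch, which you would swap in (e.g. an Apache DataSketches aggregator) once the sets get large:

from pyspark.sql import SparkSession
import pyspark.sql.functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "PS2", "FPS", "Activision"),
     (2, "PS1", "Race", "EA Sports"),
     (3, "PS2", "RTS", "Microsoft"),
     (4, "Xbox", "Race", "EA Sports")],
    ["UserID", "Platform", "Genre", "Publisher"])

attr_cols = ["Platform", "Genre", "Publisher"]

# One row per (UserID, "Column_Value") pair.
pairs = df.select(
    "UserID",
    fn.explode(fn.array(*[
        fn.concat_ws("_", fn.lit(c), fn.col(c)) for c in attr_cols
    ])).alias("key"))

# Exact sets per key, e.g. Platform_PS2 -> [1, 3], Genre_Race -> [2, 4].
result = pairs.groupBy("key").agg(fn.collect_set("UserID").alias("userIds"))
result.show(truncate=False)

From result you can collect a Python map of key to user IDs, or feed each per-key set into the sketching library of your choice.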

Related

Find cell address of value found in range

tl;dr In Google Sheets/Excel, how do I find the address of a cell with a specified value within a specified range, where the value may be in any row or column?
My best guess is
=CELL("address",LOOKUP("My search value", $search:$range))
but it doesn't work. When it finds a value at all, it returns the rightmost column every time, rather than the column of the cell it found.
I have a sheet of pretty, formatted tables that represent various concepts. Each table consists of
| Title |
+------+------+-------+------+------+-------+------+------+-------+
| Sub | Prop | Name | Sub | Prop | Name | Sub | Prop | Name |
+------+------+-------+------+------+-------+------+------+-------+
| Sub prop | value | Sub prop | value | Sub prop | value |
+------+------+-------+------+------+-------+------+------+-------+
| data | data | data | data | data | data | data | data | data |
| data | data | data | data | data | data | data | data | data |
⋮
I have 8 such tables of variable height arranged in a grid within the sheet, 3 tables wide and 3 tables tall, except the last column, which has only 2 tables (see image). These fill the range C2:AI78.
Now I have a table off to the right, in AK2:AO11, consisting of
| Table title | Table title address | ... |
+---------------+-----------------------+-----+
| Table 1 Title | | ... |
| Table 2 Title | | ... |
⋮
| Table 8 Title | | ... |
I want to fill out the Table title address column. (Would it be easier to do this manually for all 8 values? Absolutely. Did I need to in order to write this question? Yes. But using static values is not the StackOverflow way, now, is it?)
Based on very limited Excel/Google Sheets experience, I believe I need to use CELL() and LOOKUP() for this.
=CELL("address",LOOKUP($AK4, $C$2:$AI$78))
This retrieves the wrong value. For AL4 (looking for value Death Wave), LOOKUP($AK4, $C$2:$AI$78) should retrieve cell C2 but it finds AI2 instead.
| Max Levels |
+------------------+---------------+----+--+----+
| UW | Table Address | | | |
+------------------+---------------+----+--+----+
| Death Wave | $AI$3 | 3 | | 15 |
| Poison Swamp | $AI$30 | | | |
| Smart Missiles | $AI$56 | | | |
| Black Hole | #N/A | 1 | | |
| Inner Land Mines | $AI$3 | | | |
| Chain Lightning | #N/A | | | |
| Golden Tower | $AI$3 | | | |
| Chrono Field | #N/A | 25 | | |
The error message for the #N/A rows is
Did not find value '<Table Title>' in LOOKUP evaluation.
My expected table is
| Max Levels |
+------------------+---------------+----+--+----+
| UW | Table Address | | | |
+------------------+---------------+----+--+----+
| Death Wave | $C$2 | 3 | | 15 |
| Poison Swamp | $C$28 | | | |
| Smart Missiles | $C$54 | | | |
| Black Hole | $O$2 | 1 | | |
| Inner Land Mines | $O$28 | | | |
| Chain Lightning | $O$54 | | | |
| Golden Tower | $AA$2 | | | |
| Chrono Field | $AA$39 | 25 | | |
try:
=INDEX(ADDRESS(
VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"​"&ROW(D2:F4)), "​"), 2, ),
VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"​"&COLUMN(D2:F4)), "​"), 2, ), 4))
or if you want to create jump links:
=INDEX(LAMBDA(x, HYPERLINK("#gid=1273961649&range="&x, x))(ADDRESS(
VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"​"&ROW(D2:F4)), "​"), 2, ),
VLOOKUP(A2:A3, SPLIT(FLATTEN(D2:F4&"​"&COLUMN(D2:F4)), "​"), 2, ), 4)))
Try this:
=QUERY(
  FLATTEN(
    ARRAYFORMULA(
      IF(
        C:AI=$AK4,
        ADDRESS(ROW(C:AI), COLUMN(C:AI)),
        ""
      )
    )
  ),
  "SELECT Col1 WHERE Col1 <> ''",
  0)
Basically, cast every cell in the search range to its address if it equals the search term. Then flatten that 2D range and filter out the empty strings.

Grouping a list of items into groups as equal in number as possible

Requirement: Split a list into 4 separate groups, based on a value for each row.
| Player | Skill |
| ------------- |:-------------:|
| Player 1 | 10000 |
| Player 2 | 50000 |
| Player 3 | 2000 |
| Player 4 | 11000 |
| Player 5 | 7525 |
| Player 6 | 100 |
| Player 7 | 999 |
| Player 8 | 14579 |
| Player 9 | 26700 |
So in the example above, these players would be split in to 4 groups:
| Group | # of players |
| ------------- |:-------------:|
| Group1 | 2 |
| Group2 | 2 |
| Group3 | 2 |
| Group4 | 3 |
The number of players in each group needs to be as close as possible; at the same time, each group's total Skill needs to come out roughly similar each time.
Before I go too far down the rabbit hole (wording a question like this as a simple Google search is not turning out very well): are there any built-in functions of Excel that can be leveraged to achieve this, or approaches in VBA worth exploring to achieve the required result?
This isn't an answer! But suppose you try a simple algorithm:
Calculate the target skill per group (ASL): the total skill of all 9 players divided by the number of groups.
Set TSG (total skill for group) to zero.
Loop: take the largest skill level (LSL) among the remaining players.
    If TSG + LSL > ASL, move on to the next group.
    Else, add LSL to this group's total (TSG) and remove the player from the list.
Repeat the loop until no players remain.
If you apply this by hand to your data you should get:
Average=30725.75
+---------+---------+---------+---------+
| Group 1 | Group 2 | Group 3 | Group 4 |
+---------+---------+---------+---------+
| 50000 | 26700 | 14579 | 10000 |
| | 2000 | 11000 | 7525 |
| | 999 | | |
| | 100 | | |
| | | | |
| 50000 | 29799 | 25579 | 17525 |
+---------+---------+---------+---------+
Clearly there are a couple of issues - you might not want a single group containing only the player with the highest skill level, and you might want to re-average over the remaining players after taking out the most skilful one. It should be a starting point though - it could be implemented fairly easily with formulas or VBA.
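As an illustration of how little code it takes, here is a Python sketch of a closely related greedy heuristic (essentially LPT scheduling), not the exact algorithm above: assign players in descending skill order to whichever group currently has the lowest total, with a size cap (my assumption, not part of the algorithm above) so player counts stay as even as possible. The data is from the question.

import math

players = {
    "Player 1": 10000, "Player 2": 50000, "Player 3": 2000,
    "Player 4": 11000, "Player 5": 7525,  "Player 6": 100,
    "Player 7": 999,   "Player 8": 14579, "Player 9": 26700,
}

NUM_GROUPS = 4
max_size = math.ceil(len(players) / NUM_GROUPS)  # at most 3 players per group here
groups = [{"members": [], "total": 0} for _ in range(NUM_GROUPS)]

# Descending skill order; always drop the player into the lightest open group.
for name, skill in sorted(players.items(), key=lambda kv: -kv[1]):
    open_groups = [g for g in groups if len(g["members"]) < max_size]
    target = min(open_groups, key=lambda g: g["total"])
    target["members"].append(name)
    target["total"] += skill

for i, g in enumerate(groups, 1):
    print(f"Group {i}: total={g['total']}, members={g['members']}")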

Pandas merge gives two different results with the same code and input data

I have two dataframes to merge. When I run the program with the same input data and code, one of two things happens: either the merge succeeds, or every column coming from 'annotate' in the merged data is NaN.
raw_df2 = pd.merge(annotate, raw_df, on='gene', how='right').fillna("unknown")
Then I ran a test:
count = 10001
while count > 10000:
    raw_df2 = pd.merge(annotate, raw_df, on='gene', how='right').fillna("unknown")
    count = len(raw_df2[raw_df2["type"] == "unknown"])
    print(count)
If the merge fails once, it keeps failing for the rest of the run; I must resubmit the script, and then the result may be successful.
[The first two columns are from 'annotate'; the others are from 'raw_df'.]
The failed result:
| type    | gene          | locus                    | sample_1 | sample_2 | status | value_1 | value_2  |
+---------+---------------+--------------------------+----------+----------+--------+---------+----------+
| unknown | 0610040J01Rik | chr5:63812494-63899619   | Ctrl     | SPION10  | OK     | 2.02125 | 0.652688 |
| unknown | 1110008F13Rik | chr2:156863121-156887078 | Ctrl     | SPION10  | OK     | 87.7115 | 49.8795  |
+---------+---------------+--------------------------+----------+----------+--------+---------+----------+
The successful result:
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| gene | type | locus | sample_1 | sample_2 | status | value_1 | value_2 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| St18 | misc_RNA | chr1:6487230-6860940 | Ctrl | SPION10 | OK | 1.90988 | 3.91643 |
| Arid5a | misc_RNA | chr1:36307732-36324029 | Ctrl | SPION10 | OK | 1.33796 | 2.21057 |
| Carf | misc_RNA | chr1:60076867-60153953 | Ctrl | SPION10 | OK | 0.846988 | 1.47619 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
I found a workaround, but I still don't know what caused the original problem:
Set the column you want to merge on as the index in both dataframes, then merge on the index.
I ran the script more than 10 times and the result was no longer wrong.
import pandas as pd

# the first dataframe
DataQiime = pd.read_csv(args.FileTranseq, header=None, sep=',')
DataQiime.columns = ['Feature.ID', 'Frequency']
DataQiime_index = DataQiime.set_index('Feature.ID', inplace=False, drop=True)
# the second dataframe
DataTranseq = pd.read_table(args.FileQiime, header=0, sep='\t', encoding='utf-8')
DataTranseq_index = DataTranseq.set_index('Feature.ID', inplace=False, drop=True)
# merge by index (note: merge the indexed frames, not the originals)
DataMerge = pd.merge(DataQiime_index, DataTranseq_index, left_index=True, right_index=True, how="inner")
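For illustration, here is a minimal, self-contained version of the index-based merge on toy data (the frame contents below are made up, not from the original files):

import pandas as pd

left = pd.DataFrame({'Feature.ID': ['f1', 'f2', 'f3'],
                     'Frequency': [10, 20, 30]}).set_index('Feature.ID')
right = pd.DataFrame({'Feature.ID': ['f2', 'f3', 'f4'],
                      'Length': [100, 200, 300]}).set_index('Feature.ID')

# Inner join on the shared index keeps only f2 and f3.
merged = pd.merge(left, right, left_index=True, right_index=True, how="inner")
print(merged)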

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, and a count how often these occur and the name of the column.
------------------------------------------------
| columnname | value | count |
|------------|-----------------------|---------|
| evar1 | en-GB | 7654321 |
| evar1 | en-US | 1234567 |
| evar2 | www.myclient.com | 123 |
| evar2 | app.myclient.com | 456 |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read the data once per column (there are actually about 400 such columns).
import pyspark.sql.functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows")) \
        .withColumn("colName", fn.lit(colname))
    if df_evars is not None:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1  # without this the loop never terminates

display(df_evars)  # display() is available in Databricks notebooks
Am I missing a better solution?
Update
This has been marked as a duplicate, but IMO the two responses only solve part of my question.
I am looking at potentially very wide tables with potentially a large number of values. I need a simple result (i.e. 3 columns that show the source column, the value, and the count of that value in the source column).
The first of the responses only gives me an approximation of the number of distinct values, which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-----------------------
| evar1 | evar2 | ... |
|---------------|-----|
| A | A | ... |
| B | A | ... |
| B | B | ... |
| B | B | ... |
| ...
Should result in the output
--------------------------------
| columnname | value | count |
|------------|-------|---------|
| evar1 | A | 1 |
| evar1 | B | 3 |
| evar2 | A | 2 |
| evar2 | B | 2 |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col

melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
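For completeness, here is a sketch of the melt helper along the lines of the answer linked above (the exact definition lives in the linked post; this version follows the same struct-and-explode approach):

from typing import Iterable

from pyspark.sql import DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df: DataFrame, id_vars: Iterable[str], value_vars: Iterable[str],
         var_name: str = "variable", value_name: str = "value") -> DataFrame:
    # One struct per unpivoted column: (column name, column value).
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in (var_name, value_name)]
    return tmp.select(*cols)

Each row of the long output then carries the source column name in variable and the cell content in value, which is exactly what the groupBy above counts.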

Microsoft Excel - Use a drop down list to extract data from another workbook

If I select an option from a drop-down list, how do I then display data extracted from another workbook? So far, I have seen examples of showing single-row data from another worksheet, but I have yet to see a method of extracting multiple rows and columns of data from a different workbook.
Week 1.xls:
A B C
-------------------------------------------------
| SKU | Description | Vendor Style |
-------------------------------------------------
| | | |
| 000001 | Description 1 | CA0080-03E |
| 000002 | Description 2 | EX1134-59D |
| 000003 | Description 3 | EM0132-59A |
| 000004 | Description 4 | EW8694-52D |
| 000005 | Description 5 | FC0003-18D |
| 000006 | Description 6 | EK2273-59E |
Master.xls:
A B C
------------------------------------------------- ________ _
| SKU | Description | Vendor Style | |________|>| <---Drop Down List
-------------------------------------------------
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
If I choose, e.g., Week 1 in the drop-down list, I want the data from Week 1.xls to fill in the table in Master.xls.
If I choose, e.g., Week 2 in the drop-down list, I want the data from Week 2.xls to fill in the table in Master.xls.
Is this possible? I'm thinking VLOOKUPs may be involved here in order to point to each workbook.
Thank you.
