Looping through columns to perform analysis on each - statistics

I have 338 columns, with a different drug name in each column. What I want to do is loop through all the columns with the code below, which currently handles one specific drug. The problem is I have 338 different drug names. The code is:
NPTESTS
/INDEPENDENT TEST (PLIN3 CYFIP2 IL2RA HSD3B1 IL2RB PYROXD1 ZBED4 MCTP1 LAMA3 CTSC EDEM1 LIF PIM3
PPARA SLC6A11 THNSL2 ZNF697) GROUP (drug_1) MANN_WHITNEY KRUSKAL_WALLIS(COMPARE=PAIRWISE)
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05 CILEVEL=95.
Is there any way I can loop through the columns and perform the test without running the code block over and over again?

One thing you can try is restructuring the file to long format so all the drugs are in one column; then you can run the test on all the drugs at once in parallel by splitting the file:
varstocases /make drugval from drug_1 to drug_338/index=drugname(drugval).
sort cases by drugname.
split file by drugname.
* now your code.
NPTESTS
/INDEPENDENT TEST (PLIN3 CYFIP2 IL2RA HSD3B1 IL2RB PYROXD1 ZBED4 MCTP1 LAMA3 CTSC EDEM1 LIF PIM3
PPARA SLC6A11 THNSL2 ZNF697) GROUP (drugval) MANN_WHITNEY KRUSKAL_WALLIS(COMPARE=PAIRWISE)
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05 CILEVEL=95.
split file off.
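If you'd rather have all 338 sets of results collected in one place instead of scattered through the Viewer, you could also wrap the run in OMS to route the test tables into a dataset. A rough sketch; the table subtype label and the output file name are assumptions, so check the actual identifiers via Utilities > OMS Identifiers:
oms /select tables
  /if commands=['Nonparametric Tests'] subtypes=['Hypothesis Test Summary']
  /destination format=sav outfile='alltests.sav'.
* ... run the NPTESTS block from above here ...
omsend.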
Alternatively, you can use an SPSS macro to loop through all the drugs and test them one by one:
define testdrugs ()
!do !drg=1 !to 338
NPTESTS
/INDEPENDENT TEST (PLIN3 CYFIP2 IL2RA HSD3B1 IL2RB PYROXD1 ZBED4 MCTP1 LAMA3 CTSC EDEM1 LIF PIM3
PPARA SLC6A11 THNSL2 ZNF697) GROUP !concat("(drug_", !drg, ")") MANN_WHITNEY KRUSKAL_WALLIS(COMPARE=PAIRWISE)
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05 CILEVEL=95.
!doend
!enddefine.
* the macro is defined, now we can call it.
testdrugs .
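If the drug variables don't follow a neat drug_1 to drug_338 naming pattern, the same idea works with an explicit name list passed to the macro and iterated with !DO ... !IN. A sketch; the drug variable names in the call are made up:
define testdrugs2 (druglist = !cmdend)
!do !drg !in (!druglist)
NPTESTS
/INDEPENDENT TEST (PLIN3 CYFIP2 IL2RA HSD3B1 IL2RB PYROXD1 ZBED4 MCTP1 LAMA3 CTSC EDEM1 LIF PIM3
PPARA SLC6A11 THNSL2 ZNF697) GROUP (!drg) MANN_WHITNEY KRUSKAL_WALLIS(COMPARE=PAIRWISE)
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05 CILEVEL=95.
!doend
!enddefine.
testdrugs2 druglist = aspirin_use statin_use warfarin_use .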


Why does my PySpark regular expression not give more than the first row?

Taking inspiration from this answer: https://stackoverflow.com/a/61444594/4367851 I have been able to split my .txt file into columns in a Spark DataFrame. However, it only gives me the first game - even though the sample .txt file contains many more.
My code:
basefile = spark.sparkContext.wholeTextFiles("example copy 2.txt").toDF().\
selectExpr("""split(replace(regexp_replace(_2, '\\\\n', ','), ""),",") as new""").\
withColumn("Event", col("new")[0]).\
withColumn("White", col("new")[2]).\
withColumn("Black", col("new")[3]).\
withColumn("Result", col("new")[4]).\
withColumn("UTCDate", col("new")[5]).\
withColumn("UTCTime", col("new")[6]).\
withColumn("WhiteElo", col("new")[7]).\
withColumn("BlackElo", col("new")[8]).\
withColumn("WhiteRatingDiff", col("new")[9]).\
withColumn("BlackRatingDiff", col("new")[10]).\
withColumn("ECO", col("new")[11]).\
withColumn("Opening", col("new")[12]).\
withColumn("TimeControl", col("new")[13]).\
withColumn("Termination", col("new")[14]).\
drop("new")
basefile.show()
Output:
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
| Event| White| Black| Result| UTCDate| UTCTime| WhiteElo| BlackElo| WhiteRatingDiff| BlackRatingDiff| ECO| Opening| TimeControl| Termination|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|[Event "Rated Cla...|[White "BFG9k"]|[Black "mamalak"]|[Result "1-0"]|[UTCDate "2012.12...|[UTCTime "23:01:03"]|[WhiteElo "1639"]|[BlackElo "1403"]|[WhiteRatingDiff ...|[BlackRatingDiff ...|[ECO "C00"]|[Opening "French ...|[TimeControl "600...|[Termination "Nor...|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
Input file:
[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]
1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
[Event "Rated Classical game"]
.
.
.
Each game starts with [Event, so I feel like it should be doable since the file has a repeating structure; alas, I can't get it to work.
Extra points:
I don't actually need the move list, so if it's easier it can be deleted.
I only want the content inside the " " on each line once it has been converted to a Spark DataFrame.
Many thanks.
wholeTextFiles reads each file into a single record. If you read only one file, the result will be an RDD with only one row, containing the whole text file. The regexp logic in the question returns only one result per row, and this will be the first entry in the file.
Probably the best solution would be to split the file at the OS level into one file per game (for example here) so that Spark can read the multiple games in parallel. But if a single file is not too big, splitting the games can also be done within PySpark:
Read the file(s):
basefile = spark.sparkContext.wholeTextFiles(<....>).toDF()
Create a list of columns and convert this list into a list of column expressions using regexp_extract:
from pyspark.sql import functions as F
cols = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination']
cols = [F.regexp_extract('game', rf'{col} \"(.*)\"',1).alias(col) for col in cols]
Extract the data:
split the whole file into an array of games
explode this array into single records
delete the line breaks within each record so that the regular expression works
use the column expressions defined above to extract the data
basefile.selectExpr("split(_2,'\\\\[Event ') as game") \
.selectExpr("explode(game) as game") \
.withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))")) \
.select(cols) \
.show(truncate=False)
Output (for an input file containing three copies of the game):
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Event |White|Black |Result|UTCDate |UTCTime |WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|Opening |TimeControl|Termination|
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Rated Classical game |BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game2|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game3|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
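One caveat, assuming the file begins with [Event: splitting on '\\[Event ' leaves an empty first element, and explode turns it into a record from which regexp_extract extracts only empty strings. If an all-blank row shows up, a filter inserted after the .select(cols) step should drop it:
.filter(F.col("Event") != "") \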

How to join records in Easytrieve internal SORT?

I have a requirement where I need to extract 2 types of records from a single input file and join them for EZT report processing.
Currently, I've written an ICETOOL step to perform the extraction followed by the join. The output of the ICETOOL step is fed to the Easytrieve report step.
Extraction card is as below -
SORT FIELDS=(14,07,PD,A)
OUTFILE FNAMES=FILE010,INCLUDE=(25,03,CH,EQ,C'010')
OUTFILE FNAMES=FILE011,INCLUDE=(25,04,CH,EQ,C'011')
OPTION DYNALLOC=(SYSDA,05)
Here is the join card -
SORT FIELDS=(14,07,PD,A)
JOINKEYS F1=FILE010,FIELDS=(14,07,A),SORTED,NOSEQCHK
JOINKEYS F2=FILE011,FIELDS=(14,07,A),SORTED,NOSEQCHK
REFORMAT FIELDS=(F1:14,07,
F2:25,10)
OUTREC BUILD=(1,17,80:X),VTOF
OPTION DYNALLOC=(SYSDA,05)
I'm wondering if it is possible to perform the above SORT/ICETOOL operations within Easytrieve. I've used the Easytrieve internal SORT, but only for simple extractions. Can the join operation be performed within Easytrieve?
Note - The idea is to have a single EZT step.
You can make use of the Synchronized File Processing (SFP) facility in Easytrieve to achieve this. Read more about it here.
FILE FILE010
KEY1 14 7 N
*
FILE FILE011
KEY2 14 7 N
FIELD1 25 10 A
*
FILE OUTFILE FB(80 0)
OKEY 1 7 N
OFIELD 8 10 A
*
WS-COUNT W 5 N VALUE 0
*
JOB INPUT FILE010 KEY KEY1 FILE011 KEY KEY2 FINISH(DIS)
*
IF EOF FILE010
STOP
END-IF
*
IF MATCHED
OKEY = KEY1
OFIELD = FIELD1
WS-COUNT = WS-COUNT + 1
PUT OUTFILE
END-IF
*
DIS. PROC
DISPLAY 'RECORDS WRITTEN: ' WS-COUNT
END-PROC
Please note:
The above code isn't tested; it's just a draft showing the idea of file matching with Easytrieve to achieve the task.
Data types for the data items are assumed. You may have to change them suitably.
You may have to define the actual input datasets in the FILE statements.
You may add more statements within the IF MATCHED condition for the creation of the report.
Hope this helps!

How do I search and replace in vi to eliminate a random number of random characters preceding a known string?

I have text files which look like this:
0 298047498 /directory1/app/20170417/file1.blob 0 f191
e 6569844 /directory1/app/20170417/file2.blob 0 f191
344 /directory1/app/20170417/file3.blob 0
8946 /directory1/app/20170417/file4.blob 0
196496 /directory1/app/20170417/file5.blob 0
9 182340752 /directory1/app/20170417/file6.blob 0 f191
68802 /directory1/app/20170417/file7.blob 0
I want to remove everything prior to the first / and everything after the file extension.
Results should look like this:
/directory1/app/20170417/file1.blob
/directory1/app/20170417/file2.blob
/directory1/app/20170417/file3.blob
Is there a way to do this using vi search and replace?
This type of question may be better placed here: https://vi.stackexchange.com/
But for now:
You can e.g. use a simple vim macro, in which you collect all the keystrokes you need to edit one line, and repeat this macro as many times as you need.
Here are the keystrokes for one line:
dt/WD
d = delete..
t = ..till the first "/"
W = [shift]+[w] jumps to the next Word (after the "file-location-string")
D = [shift]+[d] deletes till the end of the current line
If you want to record this as a macro, do the following, with the keystrokes from above in between - like this:
qmdt/WD[home][down]q
qm = start the recording of a macro in buffer "m"
... key-strokes from above
[home][down] = the [home] key followed by the [arrow down] key, to move to the next line (for convenience)
q = end the macro recording
Now execute that macro with:
@m
And if you added the [down] key, you can do something like:
7@m
which fires your macro 7 times, once for each of your 7 lines.
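Since the question asks specifically for search and replace, a single substitute command over all lines should also do the job. A sketch, assuming the path always starts at the first / and contains no spaces:
:%s#^[^/]*\(/\S*\).*$#\1#
%s = substitute on every line
# = alternative delimiter, so the / in the pattern needs no escaping
^[^/]* = everything before the first /
\(/\S*\) = capture the / and all following non-whitespace characters (the path)
.*$ = the rest of the line, which is discarded
\1 = replace the whole line with just the captured path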

If Cell Starts with String from a List Return Number

What I am looking for is either a formula or Macro/code that will allow me to do the following:
I have a list of the beginnings of postcodes, for example "AB31, AB32, AB33, AB34". I need something that will return a value of 313 if the start of a cell matches a value from this list, and a value of 72 if it does not.
Now, the list is quite long; there are probably 100+ values in there. If there weren't so many values in the list, I could use something like
=IF(LEFT(K1,4)="AB51","313","72")
The Postcode will be in the K column, and the full list I need it to search through is
AB31
AB32
AB33
AB34
AB35
AB36
AB37
AB38
AB40
AB41
AB42
AB43
AB44
AB45
AB46
AB47
AB48
AB49
AB50
AB51
AB52
AB53
AB54
AB55
AB56
BT
GY
HS
IM
IV1
IV2
IV3
IV4
IV5
IV6
IV7
IV8
IV9
IV10
IV11
IV12
IV13
IV14
IV15
IV16
IV17
IV18
IV19
IV20
IV21
IV22
IV23
IV24
IV25
IV26
IV27
IV28
IV30
IV31
IV32
IV36
IV40
IV41
IV42
IV43
IV44
IV45
IV46
IV47
IV48
IV49
IV51
IV52
IV53
IV54
IV55
IV56
IV63
J3
KA27
KA28
KW1
KW2
KW3
KW4
KW5
KW6
KW7
KW8
KW9
KW10
KW11
KW12
KW13
KW14
KW15
KW16
KW17
PA20
PA21
PA22
PA23
PA24
PA25
PA26
PA27
PA28
PA29
PA30
PA31
PA32
PA33
PA34
PA35
PA36
PA37
PA38
PA41
PA42
PA43
PA44
PA45
PA46
PA47
PA48
PA49
PA60
PA61
PA62
PA63
PA64
PA65
PA66
PA67
PA68
PA69
PA70
PA71
PA72
PA73
PA74
PA75
PA76
PA77
PA78
PH4
PH5
PH6
PH7
PH8
PH9
PH10
PH11
PH12
PH13
PH14
PH15
PH16
PH17
PH18
PH19
PH20
PH21
PH22
PH23
PH24
PH25
PH26
PH27
PH28
PH29
PH30
PH31
PH32
PH33
PH34
PH35
PH36
PH37
PH38
PH39
PH40
PH41
PH42
PH43
PH44
PH49
PH50
TR21
TR22
TR23
TR24
TR25
ZE
So if any postcode in column K begins with any of those values, I would like the number 313 returned; if not, the number 72.
Any help would be much appreciated.
Thanks!
You should give an example of how your sheet looks, or else it is very difficult to answer.
What I would do:
Put the list in a column.
Use a combination of IF and VLOOKUP to check each postcode against the list.
=IF(IFERROR(VLOOKUP(LEFT(K1,4),A:A,1,0),"")=LEFT(K1,4),313,72)
(The IFERROR wrapper stops VLOOKUP from returning #N/A when the prefix isn't in the list, so unmatched postcodes fall through to 72.)
Column E     Column F   Column G
Post Code    Result     Formula
AB31456      313        =IF(IFERROR(VLOOKUP(LEFT(E2,4),A:A,1,0),"")=LEFT(E2,4),313,72)
KZ12398      72         =IF(IFERROR(VLOOKUP(LEFT(E3,4),A:A,1,0),"")=LEFT(E3,4),313,72)
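One caveat: the list mixes prefix lengths (BT and ZE are two characters, IV1 is three, AB31 is four), so a fixed LEFT(K1,4) will miss the shorter entries. A sketch that tries all three lengths, assuming the list sits in A1:A194 (in older Excel versions you may need to confirm it with Ctrl+Shift+Enter):
=IF(COUNT(MATCH(LEFT(K1,{2,3,4}),$A$1:$A$194,0)),313,72)
MATCH returns a position for each prefix length found in the list and an error for the rest; COUNT only counts the numbers, so any hit yields 313.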

Use matlab to search excel data file for time range and copy data into variable

In my Excel file I have a time column in 12 hr clock time and a bunch of data columns. I have pasted a snippet of it in this post as code since I can't attach a file. I am trying to build a GUI that will take an input from the user like so:
start time: 7:29:32 AM
End time: 7:29:51 AM
Then do the following:
calculate the time that has passed in seconds (should be just a row count, since data is gathered once a second)
copy the data in the time range from the "Data 3" column into a variable, then perform other calculations on the copied data as needed
I am having some trouble figuring out how to search the time data and find its location, since it imports as text with xlsread. Any ideas?
The data looks like this:
Time Data 1 Data 2 Data 3 Data 4 Data 5
7:29:25 AM 0.878556385 0.388400561 0.076890401 0.93335277 0.884750618
7:29:26 AM 0.695838393 0.712762566 0.014814069 0.81264949 0.450303694
7:29:27 AM 0.250846937 0.508617941 0.24802015 0.722457624 0.47119616
7:29:28 AM 0.206189924 0.82970364 0.819163787 0.060932817 0.73455323
7:29:29 AM 0.161844331 0.768214077 0.154097877 0.988201094 0.951520263
7:29:30 AM 0.704242494 0.371877481 0.944482485 0.79207359 0.57390951
7:29:31 AM 0.072028024 0.120263127 0.577396985 0.694153791 0.341824004
7:29:32 AM 0.241817775 0.32573323 0.484644494 0.377938298 0.090122672
7:29:33 AM 0.500962945 0.540808907 0.582958676 0.043377373 0.041274613
7:29:34 AM 0.087742217 0.596508236 0.020250297 0.926901109 0.45960323
7:29:35 AM 0.268222071 0.291034947 0.598887588 0.575571111 0.136424853
7:29:36 AM 0.42880255 0.349597405 0.936733938 0.232128788 0.555528823
7:29:37 AM 0.380425154 0.162002488 0.208550466 0.776866494 0.79340504
7:29:38 AM 0.727940393 0.622546124 0.716007768 0.660480612 0.02463804
7:29:39 AM 0.582772435 0.713406643 0.306544291 0.225257421 0.043552277
7:29:40 AM 0.371156954 0.163821476 0.780515577 0.032460418 0.356949005
7:29:42 AM 0.484167263 0.377878242 0.044189636 0.718147456 0.603177625
7:29:43 AM 0.294017186 0.463360581 0.962296024 0.504029061 0.183131098
7:29:44 AM 0.95635086 0.367849494 0.362230918 0.984421096 0.41587606
7:29:45 AM 0.198645523 0.754955312 0.280338922 0.79706146 0.730373691
7:29:46 AM 0.058483961 0.46774544 0.86783339 0.147418954 0.941713252
7:29:47 AM 0.411193343 0.340857813 0.162066261 0.943124515 0.722124394
7:29:48 AM 0.389312994 0.129281042 0.732723258 0.803458815 0.045824426
7:29:49 AM 0.549633038 0.73956852 0.542532728 0.618321989 0.358525184
7:29:50 AM 0.269925317 0.501399748 0.938234302 0.997577871 0.318813506
7:29:51 AM 0.798825842 0.24038537 0.958224157 0.660124357 0.07469288
7:29:52 AM 0.963581196 0.390150081 0.077448543 0.294604314 0.903519943
7:29:53 AM 0.890540963 0.50284339 0.229976565 0.664538451 0.926438543
7:29:54 AM 0.46951573 0.192568637 0.506730373 0.060557482 0.922857391
7:29:55 AM 0.56552394 0.952136998 0.739438663 0.107518765 0.911045415
7:29:56 AM 0.433149875 0.957190309 0.475811126 0.855705733 0.942255155
and this is the code I am using:
[Data,Text] = xlsread('C:\Users\data.xlsx',2);
IndexStart=strmatch('7:29:29 AM',Text,'exact'); %start time
IndexEnd=strmatch('2:30:29 PM',Text,'exact'); %end time
seconds = IndexEnd-IndexStart;
TestData = Data([IndexStart: IndexEnd],:);
You probably need to:
Use strfind to find the relevant string in the data imported
Use datenum to convert the date to serial date numbers, to be able to calculate the elapsed time between the two points.
It would help if you posted your code so far though.
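For the elapsed time itself, the datenum arithmetic would look roughly like this (a sketch; serial date numbers are in days, so the difference is multiplied by 86400 to get seconds):
t1 = datenum('7:29:32 AM');                  % start time as a serial date number
t2 = datenum('7:29:51 AM');                  % end time
elapsed_seconds = round((t2 - t1) * 86400);  % days -> seconds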
EDIT based on comments:
Here's what I would do for cycling through the list of start and end times:
[Data,Text] = xlsread('C:\Users\data.xlsx',2);
start_times = {'7:29:29 AM','7:29:35 AM','7:29:44 AM','7:29:49 AM'}; % etc...
end_times = {'2:30:29 PM','2:30:59 PM','2:31:22 PM','2:32:49 PM'}; % etc...
elapsed_time = zeros(length(start_times),1);
TestData = cell(length(start_times),1); % need a cell array because data can/will be of unequal lengths
for k=1:length(start_times)
IndexStart=strmatch(start_times{k},Text,'exact'); %start time
IndexEnd=strmatch(end_times{k},Text,'exact'); %end time
elapsed_time(k) = IndexEnd-IndexStart;
TestData{k} = Data([IndexStart: IndexEnd],:);
end
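Note that strmatch is deprecated in newer MATLAB releases; if it's unavailable, the same exact-match lookups can be written with strcmp (a sketch, untested against your sheet):
IndexStart = find(strcmp(start_times{k}, Text)); % rows whose text equals the start time
IndexEnd = find(strcmp(end_times{k}, Text));     % rows whose text equals the end time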
Use the "Import Data" from the Variable Tag in the Home menu. There you can set how you want the data to be imported like. With or without heading and the format.
