Modifying a SAS data set after it was created using PROC IMPORT (Excel)

I have a dataset like this:
Obs MinNo EurNo MinLav EurLav
1 103 15.9 92 21.9
2 68 18.5 126 18.5
3 79 15.9 114 22.3
My goal is to create a data set like this from the dataset above:
Obs Min Eur Lav
1 103 15.9 No
2 92 21.9 Yes
3 68 18.5 No
4 126 18.5 Yes
5 79 15.9 No
6 114 22.3 Yes
Basically I'm taking the 4 columns and stacking them into 2 columns, plus a categorical variable indicating which pair of columns each row came from.
Here's what I have so far:
PROC IMPORT DATAFILE='f:\data\order_effect.xls' DBMS=XLS OUT=orderEffect;
RUN;
DATA temp;
    INFILE orderEffect;
    INPUT minutes euros @@;
    IF MOD(_N_,2)^=0 THEN lav='Yes';
    ELSE lav='No';
RUN;
My question, though, is how I can import an Excel sheet and then modify the SAS dataset it creates, so I can stack the second two columns below the first two and add a third column indicating which columns each row came from.
I know how to do this by splitting the dataset into two datasets and then appending one onto the other, but the MOD-function approach above would be a lot faster.

You were very close, but you're misunderstanding what PROC IMPORT does.
When PROC IMPORT completes, it will have created a SAS data set named orderEffect containing SAS variables from the columns in your worksheet. You just need a little DATA step program to get the result you want. Try this:
data want;
    /* Define the SAS variables you want to keep */
    format Min 8. Eur 8.1;
    length Lav $3;
    keep Min Eur Lav;

    set orderEffect;

    /* Write one row for the "No" pair... */
    Min = MinNo;
    Eur = EurNo;
    Lav = 'No';
    output;

    /* ...and a second row for the "Yes" pair */
    Min = MinLav;
    Eur = EurLav;
    Lav = 'Yes';
    output;
run;
This assumes that the PROC IMPORT step created a data set with those names. Run that step first to be sure and revise the program if necessary.
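For comparison, the split-and-append idea the asker mentions maps naturally to pandas if you ever need the same reshape outside SAS. A minimal sketch (the toy frame just mirrors the question's columns):
import pandas as pd

# Toy frame with the question's columns.
df = pd.DataFrame({"MinNo": [103, 68, 79], "EurNo": [15.9, 18.5, 15.9],
                   "MinLav": [92, 126, 114], "EurLav": [21.9, 18.5, 22.3]})

# Rename each column pair to the target names and tag it with the Lav flag.
no = df[["MinNo", "EurNo"]].set_axis(["Min", "Eur"], axis=1).assign(Lav="No")
yes = df[["MinLav", "EurLav"]].set_axis(["Min", "Eur"], axis=1).assign(Lav="Yes")

# A stable sort on the original row index interleaves the No/Yes rows
# exactly like the desired output above.
want = pd.concat([no, yes]).sort_index(kind="stable").reset_index(drop=True)
print(want)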

Related

VBA solution for VIF factors [Excel]

I have several multiple linear regressions to carry out, I am wondering if there is a VBA solution for getting the VIF of regression outputs for different equations.
My current data format:
i=1
Year DependantVariable Variable2 Variable3 Variable4 Variable5 ....
2009 100 10 20 -
2010 110 15 25 -
2011 115 20 30 -
2012 125 25 35 -
2013 130 25 40 -
I have the above table, with the value of i determining the values of the variables (essentially, a different regression input table for every value of i).
I am looking for a VBA macro that will check every value of i (stored in a column), calculate the VIFs for every value of i, and output something like the below:
ivalue variable1VIF variable2VIF ...
1 1.1 1.3
2 1.2 10.1
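For reference, the quantity being asked for is simple: the VIF of predictor j is 1/(1 - Rj^2), where Rj^2 comes from regressing predictor j on the other predictors. A minimal Python sketch (not VBA) using statsmodels, with made-up numbers shaped like the table above, shows the computation for a single value of i:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# One regression input table (i.e., one value of i); the numbers are made up.
df = pd.DataFrame({"Year": [2009, 2010, 2011, 2012, 2013],
                   "DependantVariable": [100, 110, 115, 125, 130],
                   "Variable2": [10, 15, 20, 25, 25],
                   "Variable3": [20, 25, 30, 35, 40]})

# VIFs are computed on the predictors only, with an intercept column added.
X = sm.add_constant(df[["Variable2", "Variable3"]])
vifs = {col: variance_inflation_factor(X.values, j)
        for j, col in enumerate(X.columns) if col != "const"}
print(vifs)   # e.g. {'Variable2': ..., 'Variable3': ...}

Looping this over every value of i and writing the results out row by row gives the output table in the question.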

How can you reshape data in SAS while keeping data formats for further Excel export?

I have a dataset in SAS that I want to export to Excel in a particular table. I want to do multiple operations on the data:
Reshape the data
Format data
Export to Excel
Here is a sample dataset.
data sample;
input Product $ Year Metric1 Metric2 Metric3;
datalines;
A 2017 74 222 28895
A 2018 45 235 15371
B 2017 88 14 813
B 2018 89 157 2304
;
What I ultimately want is each metric as a row, with one column per year and display formats (percents, dollar amounts, comma separators) kept intact.
Using proc transpose I can get the following, which is close but not perfect.
proc transpose data=Sample out=transposed name=Metrics;
    by Product;
    id Year;
    var Metric1 Metric2 Metric3;
run;
proc print data=transposed noobs; run;
The problem with this output is the formatting. I don't mind adding the formatting only at export time, I just don't know how. So I guess my question is in two parts:
Can I transpose my dataset sample while keeping the formats?
Or can I add formatting to the transposed dataset when exporting to Excel, to obtain the format illustrated above?
You can use CALL DEFINE() in PROC REPORT to set display properties. Use ODS EXCEL to route the report to an Excel file (wrap the PROC REPORT step between ODS EXCEL FILE='...' and ODS EXCEL CLOSE statements). I will use the FORMAT attribute in this example, but you probably want to use the STYLE attribute to affect how Excel will display the values.
data sample;
    input Product $ Year Metric1 Metric2 Metric3;
    metric1 = metric1/100; /* scale so the PERCENT. format shows 74 as 74% */
datalines;
A 2017 74 222 28895
A 2018 45 235 15371
B 2017 88 14 813
B 2018 89 157 2304
;
proc transpose data=sample out=want;
    by product year;
    var metric1-metric3;
run;
proc report data=want nofs headline;
    column product _name_ col1,year;
    define product / group;
    define _name_ / group 'Metric';
    define year / across ' ';
    define col1 / sum ' ';
    compute col1;
        if (_name_='Metric1') then call define(_col_,'format','percent.');
        if (_name_='Metric2') then call define(_col_,'format','dollar.');
        if (_name_='Metric3') then call define(_col_,'format','comma.');
    endcomp;
run;
Output:
Product  Metric      2017     2018
----------------------------------
A        Metric1      74%      45%
         Metric2     $222     $235
         Metric3   28,895   15,371
B        Metric1      88%      89%
         Metric2      $14     $157
         Metric3      813    2,304

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group.
The data looks like:
id time name
132 12 Lucy
132 10 John
132 15 Sam
78 11 Kate
78 7 Julia
78 2 Vivien
245 22 Tom
I would like to get this:
id time name
132 10 John
132 12 Lucy
132 15 Sam
78 2 Vivien
78 7 Julia
78 11 Kate
245 22 Tom
I tried
df.orderBy(['id', 'time'])
But I don't need to sort "id".
I have two questions:
Can I sort "time" within the same "id" only, and how?
Will it be more efficient to sort only "time" than to use orderBy() on both columns?
This is exactly what windowing is for.
You can create a window partitioned by the "id" column and sorted by the "time" column. Next you can apply any function on that window.
# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)
Now use this window over any function:
For example, let's say you want to create a column with the time delta between consecutive rows within the same group:
import pyspark.sql.functions as f
df = df.withColumn("timeDelta", df.time - f.lag(df.time,1).over(w))
I hope this gives you an idea. Effectively you have sorted your dataframe using the window and can now apply any function to it.
If you just want to view your result, you could find the row number and sort by that as well.
df.withColumn("order", f.row_number().over(w)).sort("order").show()
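Putting the pieces together, here is a minimal end-to-end sketch with the sample data from the question (it assumes a local SparkSession is acceptable):
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(132, 12, "Lucy"), (132, 10, "John"), (132, 15, "Sam"),
     (78, 11, "Kate"), (78, 7, "Julia"), (78, 2, "Vivien"),
     (245, 22, "Tom")],
    ["id", "time", "name"])

# Rank rows by "time" within each "id" group.
w = Window.partitionBy("id").orderBy("time")
ranked = df.withColumn("order", f.row_number().over(w))

# Sorting by id plus the within-group rank shows each id's rows in time
# order (the id groups themselves come out in ascending id order).
ranked.sort("id", "order").show()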

How to convert bytes data into a python pandas dataframe?

I would like to convert 'bytes' data into a Pandas dataframe.
The data looks like this (first few lines):
(b'#Settlement Date,Settlement Period,CCGT,OIL,COAL,NUCLEAR,WIND,PS,NPSHYD,OCGT'
b',OTHER,INTFR,INTIRL,INTNED,INTEW,BIOMASS\n2017-01-01,1,7727,0,3815,7404,3'
b'923,0,944,0,2123,948,296,856,238,\n2017-01-01,2,8338,0,3815,7403,3658,16,'
b'909,0,2124,998,298,874,288,\n2017-01-01,3,7927,0,3801,7408,3925,0,864,0,2'
b'122,998,298,816,286,\n2017-01-01,4,6996,0,3803,7407,4393,0,863,0,2122,998'
The column headers appear at the top; each subsequent line is a timestamp followed by numbers.
Is there a straightforward way to do this?
@Paula Livingstone: this seems to work:
import pandas as pd

s = str(bytes_data, 'utf-8')          # decode the bytes to text
with open("data.txt", "w") as file:   # write to an intermediate file
    file.write(s)
df = pd.read_csv('data.txt')
Maybe this can be done without using a file in between.
I had the same issue and found the StringIO module (https://docs.python.org/2/library/stringio.html) via the answer here: How to create a Pandas DataFrame from a string.
Try something like:
from io import StringIO
import pandas as pd

s = str(bytes_data, 'utf-8')
data = StringIO(s)
df = pd.read_csv(data)
You can also use BytesIO directly:
from io import BytesIO
import pandas as pd

df = pd.read_csv(BytesIO(bytes_data))
This will save you the step of transforming bytes_data to a string.
OK cool, your input formatting is quite awkward, but the following works:
import pandas as pd

with open('file.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')  # read the whole file in as a single string
df = pd.Series(" ".join(data.strip(' b\'').strip('\'').split('\' b\'')).split('\\n')).str.split(',', expand=True)
print(df)
this produces the following:
                  0                  1     2    3     4        5     6   7  \
0  #Settlement Date  Settlement Period  CCGT  OIL  COAL  NUCLEAR  WIND  PS
1        2017-01-01                  1  7727    0  3815     7404  3923   0
2        2017-01-01                  2  8338    0  3815     7403  3658  16
3        2017-01-01                  3  7927    0  3801     7408  3925   0

        8     9     10     11      12      13     14       15
0  NPSHYD  OCGT  OTHER  INTFR  INTIRL  INTNED  INTEW  BIOMASS
1     944     0   2123    948     296     856    238
2     909     0   2124    998     298     874    288
3     864     0   2122    998     298     816    286     None
In order for this to work you will need to ensure that your input file contains only a collection of complete rows. For this reason I removed the partial row for the purposes of the test.
As you have said that the data source is an HTTP GET request, the initial read can be done directly from the response: wrap the response bytes in BytesIO and pass them to pandas.read_csv, or, if the endpoint returns an HTML table rather than CSV, use pandas.read_html, whose io parameter accepts a string or a file-like object.
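For completeness, if the bytes really are the body of an HTTP GET response, the download and the parse can be combined in one step; a minimal sketch using the requests library (the URL is a placeholder):
import pandas as pd
import requests
from io import BytesIO

# Placeholder endpoint; substitute the real URL behind the GET request.
resp = requests.get("https://example.com/generation.csv")
df = pd.read_csv(BytesIO(resp.content))   # resp.content is the raw bytes
print(df.head())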

How can I swap numbers inside a data block of repeating format using Linux commands?

I have a huge data file, and I want to swap some numbers in the 2nd column only. The file holds 25,000,000 datasets in the following repeating format, with 8768 data lines each.
%% Edited: a shorter 10-line example; sorry for the inconvenience. This is one typical data block.
# Dataset 1
#
# Number of lines 10
#
# header lines
5 11 3 10 120 90 0 0.952 0.881 0.898 2.744 0.034 0.030
10 12 3 5 125 112 0 0.952 0.897 0.905 2.775 0.026 0.030
50 10 3 48 129 120 0 1.061 0.977 0.965 3.063 0.001 0.026
120 2 4 5 50 186 193 0 0.881 0.965 0.899 0.917 3.669 0.000 -0.005
125 3 4 10 43 186 183 0 0.897 0.945 0.910 0.883 3.641 0.000 0.003
186 5 4 120 125 249 280 0 0.899 0.910 0.931 0.961 3.727 0.000 -0.001
193 6 4 120 275 118 268 0 0.917 0.895 0.897 0.937 3.799 0.000 0.023
201 8 4 278 129 131 280 0 0.921 0.837 0.870 0.934 3.572 0.000 0.008
249 9 4 186 355 179 317 0 0.931 0.844 0.907 0.928 3.615 0.000 0.008
280 10 4 186 201 340 359 0 0.961 0.934 0.904 0.898 3.700 0.000 0.033
#
# Dataset 1
#
# Number of lines 10
...
As you can see, there are 7 repeating header lines at the top and 1 trailing line at the end of each dataset, all beginning with #. As a result, each data block has 7 header lines, 8768 data lines, and 1 trailing line, 8776 lines in total. The trailing line contains only a single '#'.
I want to swap some numbers in the 2nd column only. First, I want to replace
1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222
in the 2nd column, and then
666 => 6
333 => 3
222 => 2
in the 2nd column. I want to apply this replacement to every repeating dataset.
I tried this with Python, but the data is too big and I ran into a memory error. How can I perform this swap with Linux commands like sed or awk?
This might work for you, but you'd have to use GNU awk, as it uses the gensub() function and $0 reassignment.
Put the following into an executable awk file (like script.awk):
#!/usr/bin/awk -f
BEGIN {
    a[1] = a[9] = a[10] = a[11] = 6
    a[2] = a[6] = a[7] = a[8] = 3
    a[3] = a[4] = a[5] = 2
}

function swap( c2, val ) {
    val = a[c2]
    return( val=="" ? c2 : val )
}

/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }

47 # print the line
Here's the breakdown:
BEGIN - set up an array a with mappings of the new values.
create a user-defined function swap to provide values for the 2nd column from the a array, or the value itself when there is no mapping. The c2 element is passed in, while val is a local variable (because no 2nd argument is passed in).
when a line starts with a space followed by a number and a space (the pattern), use gensub to replace the first occurrence of the two-number pattern with the first field followed by a space and the return value of swap() (the action). gensub's replacement text preserves the first column's data, and the second column is passed to swap() via the field identifier $2. Using gensub preserves the formatting of the data lines.
47 - an expression that evaluates to true, providing the default action of printing $0, which for data lines may have been modified. Any line that wasn't "data" is printed here without modification.
The provided data doesn't show all the cases, so I made up my own test file:
# 2 skip me
9 2 not going to process me
 1 1 don't change the for matting
 2 2 4 23242.223 data
 3 3 data that's formatted
 4 4 7 that's formatted
 5 5 data that's formatted
 6 6 data that's formatted
 7 7 data that's formatted
 8 8 data that's formatted
 9 9 data that's formatted
 10 10 data that's formatted
 11 11 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
Running the executable awk (like ./script.awk data) gives the following output:
# 2 skip me
9 2 not going to process me
 1 6 don't change the for matting
 2 3 4 23242.223 data
 3 2 data that's formatted
 4 2 7 that's formatted
 5 2 data that's formatted
 6 3 data that's formatted
 7 3 data that's formatted
 8 3 data that's formatted
 9 6 data that's formatted
 10 6 data that's formatted
 11 6 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
which looks alright to me, but I'm not the one with 25 million datasets.
You'd also most definitely want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file, perhaps like:
head -n 26328 data | ./script.awk - > tempfile    # 3 blocks of 8776 lines
You can learn more about the elements used in this script here:
awk basics (the man page)
Arrays
User defined functions
String functions - gensub()
And of course, you should spend some quality time reviewing awk related questions and answers on Stack Overflow ;)
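As an aside, the memory error the asker hit in Python is avoidable: processing the file line by line never holds more than one line in memory, no matter how many datasets the file contains. A minimal sketch (the file names and the exact spacing rules are assumptions):
import re

# Direct old -> new mapping for the 2nd column. The 666/333/222 intermediate
# values are only needed when replacements run sequentially and could collide;
# a single dictionary lookup maps each old value straight to its target.
swap = {"1": "6", "9": "6", "10": "6", "11": "6",
        "2": "3", "6": "3", "7": "3", "8": "3",
        "3": "2", "4": "2", "5": "2"}

# Capture the 2nd whitespace-delimited field without disturbing the spacing.
field2 = re.compile(r"^(\s*\S+\s+)(\S+)")

with open("data.txt") as src, open("data_swapped.txt", "w") as dst:
    for line in src:                          # streams one line at a time
        if not line.lstrip().startswith("#"):  # skip header/trailer lines
            m = field2.match(line)
            if m and m.group(2) in swap:
                line = line[:m.start(2)] + swap[m.group(2)] + line[m.end(2):]
        dst.write(line)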
