I am trying to create a SAS table from an XLSX Excel file that looks like the one below. The SAS column names should come from the 3rd row of the Excel file, and the data should be read starting from the 5th row.
A B C D F ...
1
2
3 Date Period Rate Rate down Rate up ...
4
5 2015-04-30 1 0.25 0.23 0.27 ...
6 2015-05-31 2 0.21 0.19 0.23 ...
. .........................................
. .........................................
I am using proc import to gather the table as below:
proc import datafile = have out=want DBMS = excel;
GETNAMES=YES; MIXED=YES; SCANTEXT=YES; USEDATE=YES; DATAROW=5;
run;
The problem is that PROC IMPORT reads the column names in the 3rd row as numeric, like the rest of the Excel file, so SAS puts "." instead of column names such as Date or Rate because it cannot interpret them as numeric values.
I found PROC IMPORT options such as DATAROW=5 to read the data from the fifth row, MIXED=YES to indicate that the Excel table includes both numeric and character values, GETNAMES=YES to get column names from the table, and SCANTEXT=YES to scan text. However, even with those options I get the SAS table shown below. The whole SAS table is in numeric format, so it cannot resolve the names from Excel:
F1 F2 F3 F4 F5 ...
1 . . . . . ...
2 . . . . . ...
3 30APR2015 1 0.25 0.23 0.27 ...
4 31MAY2015 2 0.21 0.19 0.23 ...
. ...............................
. ...............................
Any idea how to import the 3rd row of the XLSX file as the column names of my SAS table?
OK, I found the solution. I just needed to add a simple option like RANGE='A3:G2000'. Strangely, I got an error with the option DATAROW=5, so I removed it. The code becomes:
proc import datafile = have out=want DBMS = excel;
GETNAMES=YES; MIXED=YES; SCANTEXT=YES; USEDATE=YES; RANGE='A3:G2000';
run;
Now it works. But the RANGE option is not documented on many of the pages I looked at, so it was difficult to find.
It is also strange that SAS could not work out that character values like "Date" should be in character format, yet it does once you use the RANGE option.
I have a data frame (called df) which is currently formatted like so:
1 2 3
1 1 0.26 0.02
2 0.26 1 0.61
3 0.02 0.61 1
The IDs are connected by a value and I would like to somehow extract all unique ID values in order to have a more efficient way to add them to my graph on networkx.
The output should look something like this:
ed_list = [(1,2,{'weight': 0.26}),(1,3,{'weight': 0.02}),(2,3,{'weight':0.61})]
At the moment I use the following method:
# Create matrix
new_ = df.values
A_d = np.matrix(new_)
G = nx.from_numpy_matrix(A_d)
I'm wondering if it would be easier/more efficient to create a List of tuples from my df that I could use to connect my nodes, where I could then add edges like so:
G.add_edges_from(ed_list)
EDIT: I have made a mistake in the previous version of my question - the column and row names are just integers
Can you try:
# this s is what you are looking for
s = df.where(df.index.values > df.columns.values[:,None]).stack().reset_index(name='weight')
# we can use dataframe directly
G = nx.from_pandas_edgelist(s,source='level_0',target='level_1', edge_attr='weight')
Or even simpler:
G = nx.from_pandas_adjacency(df)
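If the explicit list of tuples from the question is still wanted (for example to pass to G.add_edges_from), a minimal sketch could look like the following. It assumes the matrix is symmetric and that only the pairs above the diagonal are needed; the sample DataFrame is made up to mirror the one in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the 3x3 matrix in the question.
df = pd.DataFrame(
    [[1.0, 0.26, 0.02],
     [0.26, 1.0, 0.61],
     [0.02, 0.61, 1.0]],
    index=[1, 2, 3],
    columns=[1, 2, 3],
)

# Positions strictly above the diagonal, so each unordered pair appears once
# and the self-pairs on the diagonal are skipped.
rows, cols = np.triu_indices(len(df), k=1)

ed_list = [
    (int(df.index[r]), int(df.columns[c]), {"weight": float(df.iat[r, c])})
    for r, c in zip(rows, cols)
]
print(ed_list)
```

Unlike from_pandas_adjacency, this leaves out the diagonal, so no self-loops are created when the list is added to a graph.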
I have read an Excel spreadsheet into a DataFrame, with column names such as Gross, Fee, Net, etc. When I invoke the sum method on the resulting DataFrame, it does not sum the Fee column because several rows have string data in that column. So I first loop through each row, testing whether that column contains a string, and if it does, I replace it with a 0. The DataFrame sum method still does not sum the Fee column. Yet when I write the resulting DataFrame out to a new Excel spreadsheet, read it back in and apply the sum method to the new DataFrame, it does sum the Fee column. Can anyone explain this? Here is the code and the printed output:
import pandas as pd
pp = pd.read_excel('pp.xlsx')
# get rid of any strings in column 'Fee':
for i in range(pp.shape[0]):
    if isinstance(pp.loc[i, 'Fee'], str):
        pp.loc[i, 'Fee'] = 0
pd.to_numeric(pp['Fee']) #added this but it makes no difference
# the Fee column is still not summed:
print(pp.sum(numeric_only=True))
print('\nSecond Spreadsheet\n')
# write out the DataFrame to an Excel spreadsheet:
with pd.ExcelWriter('pp2.xlsx') as writer:
    pp.to_excel(writer, sheet_name='PP')
# now read the spreadsheet back into another DataFrame:
pp2 = pd.read_excel('pp2.xlsx')
# the Fee column is summed:
print(pp2.sum(numeric_only=True))
Prints:
Gross 8677.90
Net 8572.43
Address Status 0.00
Shipping and Handling Amount 0.00
Insurance Amount 0.00
Sales Tax 0.00
etc.
Second Spreadsheet
Unnamed: 0 277885.00
Gross 8677.90
Fee -105.47
Net 8572.43
Address Status 0.00
Shipping and Handling Amount 0.00
Insurance Amount 0.00
Sales Tax 0.00
etc.
Try using pd.to_numeric
Ex:
pp = pd.read_excel('pp.xlsx')
print(pd.to_numeric(pp['Fee'], errors='coerce').dropna().sum())
The problem here is that the Fee column isn't numeric. So you need to convert it to a numeric field, save that updated field in the existing dataframe, and then compute the sum.
So that would be:
df = df.assign(Fee=pd.to_numeric(df['Fee'], errors='coerce'))
print(df.sum())
From a quick look: the 'Fee' column was read with dtype object because it contained strings, and replacing those strings with 0 one cell at a time does not change the dtype of the column. When you call pp.sum(numeric_only=True), the object column is ignored because of the numeric_only condition. Convert your column to float64 with pp['Fee'] = pd.to_numeric(pp['Fee']) and it should work for you.
The reason it works the second time is that the values are written to Excel as numbers, so when you read the file back the column is inferred as a numeric data type.
Everyone who responded should get partial credit for telling me about pd.to_numeric. But they were all missing one piece. It is not sufficient to call pd.to_numeric(pp['Fee']). That returns the column converted to numeric but does not update the original DataFrame, so when I call pp.sum(), nothing in pp has been modified. You need:
pp['Fee'] = pd.to_numeric(pp['Fee'])
pp.sum()
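To see the difference on a small scale, here is a minimal sketch with a made-up two-row frame (the real spreadsheet is not available here). It uses errors='coerce' followed by fillna(0) as one possible way to mirror the "replace strings with 0" step; that combination is an assumption, not something the question used.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: 'Fee' holds a string and a number,
# so pandas stores the column with dtype object.
pp = pd.DataFrame({"Gross": [10.0, 20.0], "Fee": ["n/a", -1.5]})

# The object column is skipped by numeric_only=True.
before = pp.sum(numeric_only=True)

# Calling to_numeric without assigning the result leaves pp unchanged.
pd.to_numeric(pp["Fee"], errors="coerce")
still_object = pp["Fee"].dtype == object

# Assigning the converted column back is what actually changes pp.
pp["Fee"] = pd.to_numeric(pp["Fee"], errors="coerce").fillna(0)
after = pp.sum(numeric_only=True)

print(before)
print(after)
```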
I have Excel files (hundreds of them) that look like this (sensor output):
Column1 Column2 Column3
Serial Number:
10004
Ref. Temp:
25C
Ref. Pressure:
1KPa
Time Temp. Pres.
1 21 1
2 22 1.1
3 23 1.2
. . .
. . .
. . .
I want to split this into two parts, the information section (top part) and data section (the rest), something like this:
Information section
Column1 Column2 Column3
Serial Number:
10004
Ref. Temp:
25C
Ref. Pressure:
1KPa
Data section:
Column1 Column2 Column3
Time Temp. Pres.
1 21 1
2 22 1.1
3 23 1.2
. . .
. . .
. . .
If this is converted to a data frame, I don't want the first row and column to become the header and index of the data frame. I am using Python 2.7 and numpy.
Make two copies of the worksheet.
In copy A, start a loop going down the first column, looking for the word Time. Once it is found, delete everything before it.
Remember that row number in a variable.
In copy B, delete everything from the remembered row down to row number 2^20.
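The same idea can be sketched in plain Python, assuming the worksheet has already been read into a list of rows (each row a list of cell values). How the rows are read from the workbook is left out, and the function name split_sections and the sample rows are made up for illustration.

```python
def split_sections(rows, marker="Time"):
    """Split rows at the first row whose first cell equals marker.

    Returns (info_rows, data_rows); the marker row starts data_rows.
    """
    for i, row in enumerate(rows):
        if row and row[0] == marker:
            return rows[:i], rows[i:]
    # Marker not found: treat everything as the information section.
    return rows, []


# Hypothetical rows mirroring the sensor file in the question.
rows = [
    ["Serial Number:", "", ""],
    ["10004", "", ""],
    ["Ref. Temp:", "", ""],
    ["25C", "", ""],
    ["Ref. Pressure:", "", ""],
    ["1KPa", "", ""],
    ["Time", "Temp.", "Pres."],
    [1, 21, 1],
    [2, 22, 1.1],
    [3, 23, 1.2],
]

info, data = split_sections(rows)
print(len(info), len(data))
```

Because nothing is promoted to a header or index, both parts keep every original row as ordinary data.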
I am very new to machine learning in Python. I want to group rows, i.e. create clusters, based on specific text in each row. There are 3 columns: Sr no, Name and Summary. I want to create the clusters based on specific values in the summary text: if the summary contains the text "Veg", the row should be in one cluster, and if it contains "Non Veg", it should be in another. In the expected output, an additional column contains the cluster value: all Veg rows are grouped into cluster 0 and all Non Veg rows into cluster 1.
I think K-means could solve this for me, but how do I cluster based on the text in the summary? Kindly help. Thanks in advance.
I would go one further than suggestions in the comments and say that you don't need to use Python for this task. Why not just include the following formula in the cluster column:
=IF(ISNUMBER(SEARCH("non veg", D3)), 1, IF(ISNUMBER(SEARCH("veg", D3)), 0, -1))
This assumes the top-left corner of your table is B2 and that this is the formula in the first data row (i.e. in cell E3 of the table). It should give 1 for any cells containing "non veg", 0 for cells containing "veg" and -1 for any rows containing neither.
You can of course do something similar in Python, as suggested by @juanpa.arrivillaga, but if your input and desired output are in Excel, and there's an easy way to do it in Excel, I would suggest that's the easiest option.
You can use xlrd to read an Excel file.
You can also use pandas to read an Excel file.
The following demo uses pandas.
Steps
Read the Excel file and create a DataFrame from it with the pandas.read_excel method.
Write a function that returns the cluster number according to the Summary value in each row.
The input to this function is a row.
The output is 0 (Vegetarian), 1 (Non Vegetarian) or -1 (not defined).
Apply this function to each row of the DataFrame.
Write the final output back to an Excel file with the DataFrame.to_excel method.
code:
>>> import pandas as pd
>>> a = "43583564_input.xlsx"
>>> df = pd.read_excel(a)
>>> df
sr. no Name Summary
0 1 T1 I am Vegetarian
1 2 T2 I am Non Vegetarian
2 3 T3 I am Non Vegetarian
3 4 T4 I am Vegetarian
4 5 T5 I am Non Vegetarian
>>> def getCluster(row):
...     if row["Summary"]=="I am Non Vegetarian":
...         return 1
...     elif row["Summary"]=="I am Vegetarian":
...         return 0
...     else:
...         return -1
...
>>> df["Cluster"] = df.apply(getCluster, axis=1)
>>> df
sr. no Name Summary Cluster
0 1 T1 I am Vegetarian 0
1 2 T2 I am Non Vegetarian 1
2 3 T3 I am Non Vegetarian 1
3 4 T4 I am Vegetarian 0
4 5 T5 I am Non Vegetarian 1
>>> df.to_excel("43583564_output.xlsx")
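The exact-match comparison in getCluster only recognises the two literal sentences from the sample data. If the summaries vary, a substring check may be closer to what the question describes. The sketch below is one possible variant, not the answer's own method; the name get_cluster is made up, and it assumes a case-insensitive match in which "non veg" must be tested before "veg", since every "non veg" text also contains "veg".

```python
def get_cluster(summary):
    """Return 1 for non-veg text, 0 for veg text, -1 otherwise."""
    text = str(summary).lower()
    # Check the longer phrase first: "non veg" also contains "veg".
    if "non veg" in text:
        return 1
    if "veg" in text:
        return 0
    return -1


for s in ["I am Vegetarian", "I am Non Vegetarian", "Fruit only"]:
    print(s, get_cluster(s))
```

With pandas, this function could be applied to the Summary column, for example with df["Summary"].apply(get_cluster).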
From a text file I copy 10 lines of text in the format productName#qty.
The first time around, it could be in the following order. I paste this into Excel and split the data on #.
A#10 -> A 10
D#25 -> D 25
Second time around it could be in the following order. I do the same as before.
B#10 -> B 10
A#12 -> A 12
I want to merge the 2 sets of data, and I want the output to look something like this:
A 10 12
B 10
D 25
Any help on how to do this? I don't know programming or macros, so a detailed description would be greatly appreciated.
Add a column for 'time around' and create a PivotTable with that for COLUMNS, Product for ROWS and Sum of Qty for Sigma VALUES.
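If a short script is ever an option, the same merge can be sketched in plain Python. This is only an illustration under the assumption that each paste is available as a list of productName#qty strings; the names parse, first, second and merged are made up.

```python
def parse(lines):
    """Turn ['A#10', 'D#25'] into {'A': 10, 'D': 25}."""
    result = {}
    for line in lines:
        name, qty = line.split("#")
        result[name.strip()] = int(qty)
    return result


first = parse(["A#10", "D#25"])   # first time around
second = parse(["B#10", "A#12"])  # second time around

# One row per product, one column per paste; None marks a missing quantity.
merged = {
    name: (first.get(name), second.get(name))
    for name in sorted(set(first) | set(second))
}

for name, (q1, q2) in merged.items():
    print(name, q1, q2)
```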