I am working on a machine learning project and am using Excel to handle the dataset. I am new to both Excel and VBA.
I am using this dataset, and I copied and pasted the whole thing into an Excel spreadsheet, then ran Text to Columns. Here's a snapshot of some of the data:
Snapshot of data
I want to reformat the data in the spreadsheet so that all of the data goes into a single row, then starts a new row after the "name" keyword.
For example, I want this:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18
19 20 21 22 23 name
to become:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 name (all on one line)
without having to do it manually line by line.
I used the below VBA code to format the data how I want it:
Sub separateByName()
    Dim lRow As Long
    Dim lCol As Long
    Dim lCol2 As Long
    k = 1
    lRow = Cells(Rows.Count, 1).End(xlUp).Row
    For i = 1 To lRow
        lCol = Cells(i, Columns.Count).End(xlToLeft).Column
        For j = 1 To lCol
            lCol2 = Sheets("Sheet2").Cells(k, Columns.Count).End(xlToLeft).Column
            Sheets("Sheet2").Cells(k, lCol2 + 1).Value = Cells(i, j).Value
            If Cells(i, j).Value = "name" Then k = k + 1
        Next j
    Next i
End Sub
However, when I run it, the result comes out in a seemingly random pattern.
This:
1 0 63 1 -9 -9 -9
-9 1 145 1 233 -9 50 20
1 -9 1 2 2 3 81 0
0 0 0 0 1 10.5 6 13
150 60 190 90 145 85 0 0
2.3 3 -9 172 0 -9 -9 -9
-9 -9 -9 6 -9 -9 -9 2
16 81 0 1 1 1 -9 1
-9 1 -9 1 1 1 1 1
1 1 -9 -9 name
2 0 67 1 -9 -9 -9
-9 4 160 1 286 -9 40 40
0 -9 1 2 3 5 81 0
1 0 0 0 1 9.5 6 13
108 64 160 90 160 90 1 0
1.5 2 -9 185 3 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 2
5 81 2 1 2 2 -9 2
-9 1 -9 1 1 1 1 1
1 1 -9 -9 name
Became this:
1 0 63 1 -9 -9 -9 1 0 63 1 -9 -9 -9 -9 1 145 1 233 -9 50 20 1 -9 1 2 2 3 81 0 0 0 0 0 1 10.5 6 13 150 60 190 90 145 85 0 0 2.3 3 -9 172 0 -9 -9 -9 -9 -9 -9 6 -9 -9 -9 2 16 81 0 1 1 1 -9 1 -9 1 -9 1 1 1 1 1 1 1 -9 -9 name
-9 1 145 1 233 -9 50 20 2 0 67 1 -9 -9 -9 -9 4 160 1 286 -9 40 40 0 -9 1 2 3 5 81 0 1 0 0 0 1 9.5 6 13 108 64 160 90 160 90 1 0 1.5 2 -9 185 3 -9 -9 -9 -9 -9 -9 3 -9 -9 -9 2 5 81 2 1 2 2 -9 2 -9 1 -9 1 1 1 1 1 1 1 -9 -9 name
The "name" is correctly at the end, but the actual data is messed up.
Could anyone help me to fix this code for my dataset?
Thanks!
I tested your code with the data and it worked fine. The unqualified Cells(...) calls read from whichever sheet is active, so make sure the data is on Sheet1 and Sheet2 is empty, then run the macro while Sheet1 is the active sheet. The reformatted data then ends up in Sheet2.
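Since the data is headed for a machine learning project, the same regrouping can also be sketched outside Excel. The snippet below is a minimal Python sketch, not the macro from the question. `regroup_records` is a hypothetical helper, and it assumes the raw text is whitespace-separated with a literal "name" token ending each record.

```python
def regroup_records(text, terminator="name"):
    """Collect whitespace-separated tokens into one list per record.

    A record ends at each occurrence of `terminator` (assumed to be the
    literal token "name", as in the question).
    """
    records, current = [], []
    for token in text.split():
        current.append(token)
        if token == terminator:
            records.append(current)
            current = []
    if current:  # keep trailing tokens that never reached a terminator
        records.append(current)
    return records


raw = """1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18
19 20 21 22 23 name"""

rows = regroup_records(raw)
print(" ".join(rows[0]))
```

Each inner list is one record, so it can be written out as a single line or passed on to whatever loader the project uses.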
I am a Python beginner.
I have the following pandas DataFrame, with only two columns: "Time" and "Input".
I want to loop over the "Input" column with a window size of w = 3 (three consecutive values). For every window in which all the elements are 1's, I want to keep the first item as 1 and change the remaining values to 0's.
index Time Input
0 11 0
1 22 0
2 33 0
3 44 1
4 55 1
5 66 1
6 77 0
7 88 0
8 99 0
9 1010 0
10 1111 1
11 1212 1
12 1313 1
13 1414 0
14 1515 0
My intended output is as follows:
index Time Input What_I_got What_I_Want
0 11 0 0 0
1 22 0 0 0
2 33 0 0 0
3 44 1 1 1
4 55 1 1 0
5 66 1 1 0
6 77 1 1 1
7 88 1 0 0
8 99 1 0 0
9 1010 0 0 0
10 1111 1 1 1
11 1212 1 0 0
12 1313 1 0 0
13 1414 0 0 0
14 1515 0 0 0
What should I do to get the desired output? Am I missing something in my code?
import pandas as pd
import re
pd.Series(list(re.sub('111', '100', ''.join(df.Input.astype(str))))).astype(int)
Out[23]:
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 0
10 1
11 0
12 0
13 0
14 0
dtype: int32
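For reference, here is a self-contained sketch of the same regex idea with the window size as a parameter. It assumes the "Input" values are the ones shown in the intended-output table and that the column holds only single-digit 0/1 values; the column name `What_I_Want` is taken from the question.

```python
import re

import pandas as pd

# Input values as listed in the intended-output table of the question
df = pd.DataFrame({
    "Time": [11, 22, 33, 44, 55, 66, 77, 88, 99,
             1010, 1111, 1212, 1313, 1414, 1515],
    "Input": [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0],
})

w = 3  # window size

# Join the column into one string of digits, e.g. "000111111011100"
bits = "".join(df["Input"].astype(str))

# Replace each non-overlapping run of w ones with a 1 followed by w-1 zeros
collapsed = re.sub("1" * w, "1" + "0" * (w - 1), bits)

df["What_I_Want"] = [int(ch) for ch in collapsed]
print(df)
```

Because `re.sub` scans left to right and never reuses characters, the windows are non-overlapping: a run of six 1's is treated as two windows.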
I have a dataset that looks like this.
sample day
1 -10
1 -9
. .
. .
. .
1 10
2 -10
3 -10
. .
. .
. .
3 10
I only want to keep the samples that cover the whole period from -10 to 10. In this case, sample 2 must be deleted. The missing period differs between samples: some go from -10 to 0, some from -10 to -8 (the number of rows per sample varies). How can I delete the incomplete samples in pandas or Excel?
IIUC, you need a boolean mask. If the period is always -10 to 10, the days of a complete sample sum to 0, so you can keep only the groups whose sum is 0:
print(df)
sample day
0 1 -10
0 1 -9
0 1 -8
0 1 -7
0 1 -6
0 1 -5
0 1 -4
0 1 -3
0 1 10
.......
1 2 4
1 2 5
df1 = df[df.groupby(['sample'])['day'].transform('sum').eq(0)]
print(df1)
sample day
0 1 -10
0 1 -9
0 1 -8
0 1 -7
0 1 -6
0 1 -5
0 1 -4
0 1 -3
0 1 -2
0 1 -1
0 1 0
0 1 1
0 1 2
0 1 3
0 1 4
0 1 5
0 1 6
0 1 7
0 1 8
0 1 9
0 1 10
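One caveat with the sum check: a sum of 0 is necessary but not sufficient, since a sample that only covers -1, 0, 1 also sums to 0. A stricter variant, sketched here on made-up data (samples 1 to 4 are assumptions for illustration), compares each sample's set of days against the full range.

```python
import pandas as pd

full_period = set(range(-10, 11))  # every day from -10 to 10 inclusive

# Illustrative data: samples 1 and 3 are complete, sample 2 has one row,
# sample 4 is incomplete but its days still sum to 0
df = pd.DataFrame({
    "sample": [1] * 21 + [2] + [3] * 21 + [4] * 3,
    "day": list(range(-10, 11)) + [-10] + list(range(-10, 11)) + [-1, 0, 1],
})

# Sum-based check from the answer: keeps sample 4 by accident
by_sum = df[df.groupby("sample")["day"].transform("sum").eq(0)]

# Set-based check: keep a sample only if it contains every day in the period
complete = df.groupby("sample").filter(lambda g: set(g["day"]) == full_period)

print(sorted(by_sum["sample"].unique()))
print(sorted(complete["sample"].unique()))
```

If the data never contains symmetric partial periods, the sum check is enough and will be faster on large frames.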
I have the following dataframe:
car_id time(seconds) is_charging
1 1 65 1
2 1 70 1
3 1 67 1
4 1 71 1
5 1 120 0
6 1 124 0
7 1 117 0
8 1 80 1
9 1 74 1
10 1 62 1
11 1 130 0
12 1 124 0
I want to create a new column that enumerates the charging and discharging periods of the 'is_charging' column, so that later on I can group by that new column and compute the mean, max, min, etc. of each period.
The resulting dataframe should be like this:
car_id time(seconds) is_charging periods_id
1 1 65 1 1
2 1 70 1 1
3 1 67 1 1
4 1 71 1 1
5 1 120 0 2
6 1 124 0 2
7 1 117 0 2
8 1 80 1 3
9 1 74 1 3
10 1 62 1 3
11 1 130 0 4
12 1 124 0 4
I've done this using a for loop, like this:
df['periods_id'] = 0

def computePeriodIDs():
    # Walk the rows in order and bump the counter whenever the state flips
    period_id = 1
    previous_charging_state = df.at[df.index[0], 'is_charging']
    for index in df.index:
        if df.at[index, 'is_charging'] != previous_charging_state:
            previous_charging_state = df.at[index, 'is_charging']
            period_id = period_id + 1
        df.at[index, 'periods_id'] = period_id

computePeriodIDs()
This is way too slow for the number of rows that I have. I'm trying to use a vectorized function, especially apply(), but due to my lack of understanding I haven't had much success, and I cannot find a similar problem online.
Can someone help me optimize this?
Try this:
df.is_charging.diff().ne(0).cumsum()
Out[115]:
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 4
12 4
Name: is_charging, dtype: int32
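To show how this one-liner plugs into the grouping step the question asks about, here is a self-contained sketch using the rows from the question. `diff()` leaves NaN in the first row, and NaN is not equal to 0, so the counter starts at 1. The `stats` name is just for illustration.

```python
import pandas as pd

# Data copied from the question (index starts at 1 there)
df = pd.DataFrame(
    {
        "car_id": [1] * 12,
        "time(seconds)": [65, 70, 67, 71, 120, 124, 117, 80, 74, 62, 130, 124],
        "is_charging": [1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0],
    },
    index=range(1, 13),
)

# A new period starts wherever the state differs from the previous row
df["periods_id"] = df["is_charging"].diff().ne(0).cumsum()

# Per-period statistics, as described in the question
stats = df.groupby("periods_id")["time(seconds)"].agg(["mean", "max", "min"])
print(stats)
```

This sketch treats the frame as a single car's time series. With several cars in one frame, the ids would also need to change at each car boundary, which is not covered here.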
For the code:
dataset = pd.read_csv("/Users/Akshita/Desktop/EE660/donor_raw_data_medmean.csv", header=None, names=None)
# Separate data and label
X_label = dataset[1:19373][0]
X_data = dataset[1:19373]
print(X_data[X_label==1])
I get the output below (there are actually about 4,000 samples with label=1):
0 1 2 3 4 5 6 7 8 9 ... 51 52 53 54 55 56 57 58 \
16386 1 17 60 0 1 0 0 0 0 1 ... 0 20 20 20 5 10 15 15
16396 1 137 60 0 1 0 0 0 0 1 ... 15 25 10 15 6 14 16 120
16399 1 89 54 0 1 0 0 0 0 1 ... 10 15 5 15 6 14 16 79
16402 1 89 75 0 1 0 0 0 0 1 ... 25 35 10 35 6 13 15 79
..
..
19356 1 101 80 1 0 0 1 0 0 2 ... 25 30 5 28 7 16 18 101
19363 1 65 70 1 0 0 1 0 0 1 ... 7 12 5 10 4 8 20 63
19372 1 29 70 0 0 0 1 0 0 2 ... 0 25 25 25 4 9 24 24
..
[859 rows x 61 columns]
and for
print(X_data[X_label==0])
I get the output below (there are about 15,000 samples with label=0):
0 1 2 3 4 5 6 7 8 9 ... 51 52 53 54 55 56 57 58 \
16384 0 17 74 0 1 0 0 0 0 1 ... 0 15 15 15 4 10 17 17
16385 0 17 60 0 1 0 0 0 0 2 ... 0 15 15 15 4 11 17 17
16387 0 29 67 0 1 0 0 0 0 1 ... 0 20 20 20 5 11 23 28
16388 0 53 60 0 1 0 0 0 0 1 ... 5 30 25 30 5 11 26 52
16389 0 65 49 0 1 0 0 0 0 1 ... 30 35 5 27 6 13 16 56
..
..
19369 0 137 77 1 0 1 0 0 0 1 ... 9 10 1 10 6 13 21 130
19370 0 29 60 1 0 0 1 0 0 1 ... 0 15 15 15 3 9 23 23
19371 0 129 78 1 0 0 1 0 0 2 ... 20 25 5 25 7 24 8 129
What could I be doing wrong?
My data contain 0 values that I want to replace with -9, without touching data points that merely contain the digit 0, such as 220 or 120. How can I do it? For example, the data look like:
M1 M2 M3 M4
120 0 125 0
0 123 123 0
123 0 0 123
to
M1 M2 M3 M4
120 -9 125 -9
-9 123 123 -9
123 -9 -9 123
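You would search for " 0 " (with the surrounding spaces) and replace it with " -9 ". Note that a 0 at the very start or end of a line has a space on only one side, so it needs to be handled separately.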
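If the data can be loaded into pandas (an assumption, since the question does not say which tool is in use), matching on whole cell values avoids the spacing issue altogether. A minimal sketch with the example table:

```python
import pandas as pd

# Example table from the question
df = pd.DataFrame({
    "M1": [120, 0, 123],
    "M2": [0, 123, 0],
    "M3": [125, 123, 0],
    "M4": [0, 0, 123],
})

# replace() compares whole cell values, so 120 or 220 are left untouched
cleaned = df.replace(0, -9)
print(cleaned)
```

Because the comparison is on the numeric value rather than on text, zeros in the first and last columns are replaced like any other.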