I have a lot of csv files from which I have to drop the date column.
I have a J line that reads a csv file into a numeric array: rdtabfile =: (0&".;.2@:(TAB&,)@:}:);._2) @ ReadFile @<
If you know the column number of the date column, I would just apply a mask to each line of the array and use the dyadic Copy verb (#).
[ t =: i. 4 5
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
mask=: ~: [: i. # NB. x would be the column to be dropped, y is the numeric matrix
delcol=: (mask # ])"1
1 delcol t
0 2 3 4
5 7 8 9
10 12 13 14
15 17 18 19
delcola=: ((~: [: i. #) # ])"1 NB. can be done in one line
2 delcola t
0 1 3 4
5 6 8 9
10 11 13 14
15 16 18 19
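For readers who do not use J, the same mask-and-copy idea can be sketched in Python. This is only an illustration: the name delcol mirrors the J verb above, and plain nested lists stand in for the numeric array.

```python
def delcol(col, matrix):
    """Drop column `col` from every row of a list-of-lists matrix."""
    result = []
    for row in matrix:
        # Boolean mask: True everywhere except at the dropped column,
        # the analogue of J's  col ~: i. # row
        mask = [j != col for j in range(len(row))]
        # Keep only the elements whose mask entry is True (J's dyadic #).
        result.append([v for v, keep in zip(row, mask) if keep])
    return result


# Same 4-by-5 table as  i. 4 5
t = [list(range(r * 5, r * 5 + 5)) for r in range(4)]
for row in delcol(1, t):
    print(row)
```

As in the J version, the mask is rebuilt for each row, so the rows do not have to be the same length.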
How can I use regex to replace values in a DataFrame, here in the 5th column, according to a pattern in the 1st column? Column 5 consists only of ones for now. However, I would like to start changing this column when the pattern 34444 appears in the 1st column. The program is then supposed to replace the ones with 11111, 22222, 33333 etc., moving to the next value each time the pattern appears, until the end of the file.
Sample of the file:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 1 3 138.998480 12.596951 0.223780
22 12 1 4 138.333252 11.884713 -0.281429
23 13 1 4 139.498084 13.356891 -0.480091
24 14 1 4 139.710930 11.981460 0.697098
25 15 1 4 138.452807 13.136061 0.990663
Expected result:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 2 3 138.998480 12.596951 0.223780
22 12 2 4 138.333252 11.884713 -0.281429
23 13 2 4 139.498084 13.356891 -0.480091
24 14 2 4 139.710930 11.981460 0.697098
25 15 2 4 138.452807 13.136061 0.990663
Yeah, if you really want re, there is a way. But I doubt it would be really more efficient than a for-loop.
1. re.finditer
import pandas as pd
import numpy as np
import re
# present col1 as number-strings
arr1 = df['1'].values
str1 = "".join([str(i) for i in arr1])
ans = np.ones(len(str1), dtype=int)
# when a pattern is found, increase latter elements by 1
for match in re.finditer('34444', str1):
    e = match.end()
    ans[e:] += 1
# replace column 5
df['5'] = ans
# Output
df[['0', '5', '1']]
Out[50]:
0 5 1
11 1 1 1
12 2 1 1
13 3 1 1
14 4 1 1
15 5 1 1
16 6 1 3
17 7 1 4
18 8 1 4
19 9 1 4
20 10 1 4
21 11 2 3
22 12 2 4
23 13 2 4
24 14 2 4
25 15 2 4
2. naïve for-loop
Checks the array directly, element by element. Compared with re.finditer, no type casting is involved, but an explicit for-loop is needed. The same output is obtained. Please benchmark for yourself if efficiency becomes relevant, say, if tens of millions of rows are involved.
arr1 = df['1'].values
ans = np.ones(len(arr1), dtype=int)  # one label per row (str1 only exists in method 1)
n = len(arr1)
for i, el in enumerate(arr1):
    # termination: fewer than 5 elements left, so no full pattern can start here
    if i > n - 5:
        break
    # ignore non-3 elements
    if el != 3:
        continue
    # if found, increase latter elements by 1
    if np.all(arr1[i+1:i+5] == 4):
        ans[i+5:] += 1
df['5'] = ans
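A self-contained sketch of the loop approach, using plain lists so it can be run without the original file. The function name label_after_pattern and its parameters are made up for this illustration; the input is column '1' from the sample in the question.

```python
def label_after_pattern(values, pattern):
    """Return labels that start at 1 and increase by 1 after each
    complete occurrence of `pattern` in `values` (both plain lists)."""
    n, m = len(values), len(pattern)
    labels = [1] * n
    for i in range(n - m + 1):
        if values[i:i + m] == pattern:
            # Bump every label that comes after the end of this match.
            for j in range(i + m, n):
                labels[j] += 1
    return labels


col1 = [1, 1, 1, 1, 1, 3, 4, 4, 4, 4, 3, 4, 4, 4, 4]
print(label_after_pattern(col1, [3, 4, 4, 4, 4]))
```

Unlike re.finditer, this version would also count overlapping occurrences; for the pattern 34444 the two behave the same, because that pattern cannot overlap itself.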
Does anyone know how to find ranges of repeated categorical values in a column?
I mean, it's something like this:
[Floor] [Height]
1 A 10
2 A 11
3 A 12
4 B 13
5 B 14
6 C 15
7 C 16
8 A 17
9 A 18
10 C 19
11 C 20
12 B 21
13 B 22
14 B 23
What I'm trying to achieve is to determine the Height ranges for each Floor, as shown below:
Floor Height
A [10 - 12]
B [13 - 14]
C [15 - 16]
A [17 - 18]
C [19 - 20]
B [21 - 23]
I was trying with pandas.cut() but I can't find a way to set the intervals for repeated values.
Another way
(df.update((df.astype(str)).groupby((df.Floor!=df.Floor.shift())\
.cumsum())["Height"].transform(lambda x: x.iloc[0]+'-'+x.iloc[-1])))
df=df.drop_duplicates()
print(df)
Floor Height
1 A 10-12
4 B 13-14
6 C 15-16
8 A 17-18
10 C 19-20
12 B 21-23
How it works
(df.Floor!=df.Floor.shift())#Gives a boolean selection that is True wherever Floor is not equal to the value in the immediately preceding row
1 True
2 False
3 False
4 True
5 False
6 True
7 False
8 True
9 False
10 True
11 False
12 True
13 False
14 False
(df.Floor!=df.Floor.shift()).cumsum()#Gives a new group id by cumulatively summing the booleans. Remember True is 1 and False is 0, hence the id increases by 1 at every change
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 4
10 5
11 5
12 6
13 6
14 6
(df.astype(str)).groupby((df.Floor!=df.Floor.shift()).cumsum())#Instead of using Floor to classify, I use the group ids derived above. Notice I force the df to be of datatype string because I want to concatenate the heights, which cannot happen unless they are strings
(df.astype(str)).groupby((df.Floor!=df.Floor.shift())\
.cumsum())["Height"].transform(lambda x: x.iloc[0]+'-'+x.iloc[-1])#I use a lambda in transform to concatenate the first and last height of each group. Strings are concatenated with +, so the - between the heights is introduced with string + '-' + string
1 10-12
2 10-12
3 10-12
4 13-14
5 13-14
6 15-16
7 15-16
8 17-18
9 17-18
10 19-20
11 19-20
12 21-23
13 21-23
14 21-23
#Notice that transform returns the group value for every row, hence I have to drop duplicates later.
#Before dropping duplicates, I have to write the new frame above back into the original dataframe.
df.update(newframe above)# overwrites Height with the concatenated heights
df=df.drop_duplicates()# finally drop the duplicated rows
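The compare-with-previous-row and cumulative-sum steps can also be followed in plain Python. The sketch below is not the pandas code from the answer; it only imitates shift() and cumsum() with lists, using the data from the question, and all variable names are chosen for this illustration.

```python
from itertools import accumulate

floors = list("AAABBCCAACCBBB")
heights = list(range(10, 24))

# True where a row's Floor differs from the previous row. The first row
# always starts a new run (pandas compares it with the NaN from shift()).
changed = [i == 0 or floors[i] != floors[i - 1] for i in range(len(floors))]

# The cumulative sum of the flags gives one id per run of equal values.
run_ids = list(accumulate(int(c) for c in changed))

# Remember the first and the latest height seen for each run id.
ranges = {}
for floor, height, rid in zip(floors, heights, run_ids):
    first = ranges[rid][1] if rid in ranges else height
    ranges[rid] = (floor, first, height)

for floor, first, last in ranges.values():
    print(floor, f"{first}-{last}")
```

Because the ids come from changes between neighbouring rows, the second run of A gets its own id instead of being merged with the first one.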
Try:
(df.groupby(['Floor',
(df['Floor']!=df['Floor'].shift()).cumsum().rename('index')])['Height']
.agg(lambda x: f'{x.min()} - {x.max()}').reset_index(level=0).sort_index())
Output:
Floor Height
index
1 A 10 - 12
2 B 13 - 14
3 C 15 - 16
4 A 17 - 18
5 C 19 - 20
6 B 21 - 23
For those interested, based on @Scott Boston's and @wwnde's answers.
Just in case you need the ranges in the same row, you can add this to either of them:
df = df[['Floor','Height']]
pd.DataFrame(df.groupby('Floor')['Height'].unique())
The output will be:
Floor Height
A [10 - 12, 17 - 18]
B [13 - 14, 21 - 23]
C [15 - 16, 19 - 20]
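For comparison, the same per-floor collection can be sketched without pandas. This assumes the rows are available as (Floor, Height) pairs; itertools.groupby is used because it only merges consecutive equal keys, which is exactly the run behaviour needed here.

```python
from collections import defaultdict
from itertools import groupby

rows = list(zip("AAABBCCAACCBBB", range(10, 24)))

per_floor = defaultdict(list)
# Each group is one uninterrupted run of the same Floor value.
for floor, run in groupby(rows, key=lambda r: r[0]):
    run_heights = [h for _, h in run]
    per_floor[floor].append(f"{run_heights[0]} - {run_heights[-1]}")

for floor in sorted(per_floor):
    print(floor, per_floor[floor])
```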
Thank you for your help, and special thanks to @wwnde for that nice explanation.
I'm trying to do this program where given a number N, one has to print out the decimal, octal, hexadecimal and binary for all the numbers in range 1 to N. The trouble is that the platform requires the solution in a particular format.
Suppose the number is 17, so the output should be like :
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
8 10 8 1000
9 11 9 1001
10 12 A 1010
11 13 B 1011
12 14 C 1100
13 15 D 1101
14 16 E 1110
15 17 F 1111
16 20 10 10000
17 21 11 10001
For 7 it would be like :
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
If you notice, the above has to be printed so that the decimal, octal and hexadecimal numbers have a minimum of 2 spaces at their left, whereas the binary numbers have at least one space at their left. As the length of the numbers increases, the spacing has to be adjusted so that the minimum space is there even for the longest number. So, how do I print them using a variable space? So far I have tried this:
Code
def print_formatted(number):
    space=len(str(bin(number))[2:])
    for i in range(1,number+1):
        print('{:2d}'.format(i), end='')
        print('{:>3s}'.format(str(oct(i))[2:]), end='')
        print('{:>3s}'.format(str(hex(i))[2:]), end='')
        print('{:>'+str(space)+'s}'.format(str(bin(i))[2:]))
print_formatted(17)
Here, I just tried doing the required with just the binary numbers but it's giving me an error
print('{:>'+str(space)+'s}'.format(str(bin(i))[2:]))
ValueError: Single '}' encountered in format string
Is there any fix/alternative for this?
Your problem is operator precedence: the + for string concatenation binds more weakly than the method call in
'{:>' + str(space) + 's}'.format(str(bin(i))[2:])
That's why .format(...) is called only on 's}', not on the whole string, and that's where the
ValueError: Single '}' encountered in format string
comes from.
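A minimal demonstration of that precedence, with a fixed width of 5 and the string '101' chosen just for this example:

```python
space = 5

# Method calls bind tighter than +, so .format() is applied to the last
# literal 's}' only, and 's}' on its own is not a valid format string.
try:
    '{:>' + str(space) + 's}'.format('101')
except ValueError as exc:
    print('ValueError:', exc)

# Parenthesising the concatenation builds '{:>5s}' first, then formats it.
fixed = ('{:>' + str(space) + 's}').format('101')
print(repr(fixed))
```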
Putting the complete format string in parentheses before applying .format to it fixes that.
You also need 1 more space for binary and can skip some str() calls that are not needed:
def print_formatted(number):
    space=len(str(bin(number))[2:])+1 # fix here
    for i in range(1,number+1):
        print('{:2d}'.format(i), end='')
        print('{:>3s}'.format(oct(i)[2:]), end='')
        print('{:>3s}'.format(hex(i)[2:]), end='')
        print(('{:>'+str(space)+'s}').format(bin(i)[2:])) # fix here
print_formatted(17)
Output:
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
8 10 8 1000
9 11 9 1001
10 12 a 1010
11 13 b 1011
12 14 c 1100
13 15 d 1101
14 16 e 1110
15 17 f 1111
16 20 10 10000
17 21 11 10001
From your given output above you might need to prepend 2 spaces to this; I'm not sure if it's a formatting error in your output above or part of the restrictions.
You could also shorten this by using f-strings (and removing the superfluous str() around bin, oct, hex: they all return strings already).
Then you need to calculate the widths used to space out your values:
def print_formatted(number):
    de,bi,oc,he = len(str(number)), len(bin(number)), len(oct(number)), len(hex(number))
    for i in range(1,number+1):
        print(f' {i:{de}d}{oct(i)[2:]:>{oc}s}{hex(i)[2:]:>{he}s}{bin(i)[2:]:>{bi}s}')
print_formatted(26)
to accommodate values other than 17, e.g. 128:
1 1 1 1
2 2 2 10
3 3 3 11
...
8 10 8 1000
...
16 20 10 10000
...
32 40 20 100000
...
64 100 40 1000000
...
128 200 80 10000000
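The width can also be passed as a nested replacement field, which avoids building the format string by concatenation. The sketch below rests on assumptions the question does not confirm: that every column should be right-aligned to the width of the binary form of N with one space between columns, and that hex digits should be uppercase as in the expected output. The helper name format_rows is made up for this example.

```python
def format_rows(number):
    """Return one formatted line per value in 1..number."""
    width = len(bin(number)) - 2  # digits in the binary form of `number`
    rows = []
    for i in range(1, number + 1):
        # d, o, X and b select decimal, octal, uppercase hex and binary.
        # {w} is filled in from the keyword argument at format time.
        rows.append('{0:>{w}d} {0:>{w}o} {0:>{w}X} {0:>{w}b}'.format(i, w=width))
    return rows


print('\n'.join(format_rows(17)))
```

Using the integer presentation types also removes the need to slice the '0o', '0x' and '0b' prefixes off the strings returned by oct, hex and bin.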
In J:
a =: 2 3 $ 1 2 3 4 5 6
Gives:
1 2 3
4 5 6
Which is a 2 3 shaped array.
If I do:
0 1 { a
I (noting that 0 1 is a 2 shaped list) expected to have back:
1 2 3 4 5 6
But got the following instead:
1 2 3
4 5 6
Reading the documentation I was expecting the shape of the index to kinda govern the shape of the answer.
Can someone clarify what I am missing here?
Higher-dimensional arrays may help make this clear. An array with n dimensions has items with n-1 dimensions. When you select an item from ({) a three-dimensional array, your result is a two-dimensional array:
1 { i. 5 3 4
12 13 14 15
16 17 18 19
20 21 22 23
When you select multiple items from an array, the items are assembled into a new array, using each atom of x to select an item of y. This might be where you picked up the idea that the shape of x affects the shape of the result.
2 1 0 2 { 'set'
test
$ 2 1 0 2
4
$ 'test'
4
The number of dimensions of the result is equal to the number of dimensions of x plus the number of dimensions of the items of y. So, if you have a two-dimensional x taking two-dimensional items from a three-dimensional y, you will have a four-dimensional result:
(2 2 $ 1 1 0 1) { i. 5 3 4
12 13 14 15
16 17 18 19
20 21 22 23
12 13 14 15
16 17 18 19
20 21 22 23
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
20 21 22 23
$ (2 2 $ 1 1 0 1) { i. 5 3 4
2 2 3 4
One final note: the monadic Ravel (,) will reduce the result to a list (one-dimensional array).
, 0 1 { 2 3 $ 1 2 3 4 5 6
1 2 3 4 5 6
, i. 2 2 2 2
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
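The rule about dimensions can be checked outside J with nested lists. The sketch below is in Python, and the helpers shape, select and iota are hypothetical names invented here; they only imitate $, { and i. for non-empty rectangular lists.

```python
from itertools import count


def shape(a):
    """Shape of a regular nested list, e.g. [[1, 2, 3], [4, 5, 6]] -> (2, 3)."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        a = a[0]
    return tuple(dims)


def select(x, y):
    """Replace every integer atom of x by the corresponding item of y."""
    if isinstance(x, list):
        return [select(i, y) for i in x]
    return y[x]


def iota(*dims):
    """Nested list counting up from 0 in row-major order, like  i. dims"""
    counter = count()

    def build(ds):
        if not ds:
            return next(counter)
        return [build(ds[1:]) for _ in range(ds[0])]

    return build(dims)


a = [[1, 2, 3], [4, 5, 6]]
print(shape(select([0, 1], a)))                         # x has shape (2,), items have shape (3,)
print(shape(select([[1, 1], [0, 1]], iota(5, 3, 4))))   # x has shape (2, 2), items have shape (3, 4)
```

In both calls the shape of the result is the shape of x followed by the shape of one item of y, which is why 0 1 { a returns the whole 2 3 array rather than a flat list.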
From ({) selects the items of a noun. For 2 3 $ 1 2 3 4 5 6 the items are the two rows because items are the components that make up the noun.
[ a=. 2 3 $ 1 2 3 4 5
1 2 3
4 5 1
0 { a
1 2 3
If you just had 1 2 3 then the items would be the individual atoms.
[ b=. 1 2 3
1 2 3
0 { b
1
If you used 1 3 $ 1 2 3 then there is only one item and the result would be
[ c=. 1 3 $ 1 2 3
1 2 3
0 { c
1 2 3
The number of items can be found with Tally (#), and is the lead dimension of the Shape ($) of the noun.
$ a
2 3
$ b
3
$ c
1 3
# a
2
# b
3
# c
1
There are two parts of my query:
1) I have multiple .xlsx files stored in a folder, a total of 1 year's worth (~ 365 .xlsx files). They are named according to date: ' A_ddmmmyyyy.xlsx' (e.g. A_01Jan2016.xlsx). Each .xlsx has 5 columns of data: Date, Quantity, Latitude, Longitude, Measurement. The problem is, each .xlsx file consists of about 400,000 rows of data, and although I have scripts in Excel to merge them, the inherent row restriction in Excel prevents me from merging all the data together.
(i) Is there a way to read recursively the data from each .xlsx sheet into MATLAB, and specifying the variable name (i.e. Date, Quantity etc) for each column(variable) within MATLAB (there are no column headings in the .xlsx files)?
(ii) How can I merge the data for each column from each .xlsx together?
Thank you
Jefferson
Let's go part by part.
First, I do not recommend joining all your files' data into one column. There is no need to have all this information together; you can work with the files separately, for example using a datastore.
Working in MATLAB in my directory:
>> pwd
ans =
/home/anquegi/learn/matlab/stackoverflow
I have a folder containing a subfolder with two sample Excel files:
>> ls
20_hz.jpg big_data_store_analysis.m excel_files octave-workspace sample-file.log
40_hz.jpg chirp_signals.m NewCode.m sample.csv
>> ls excel_files/
A_01Jan2016.xlsx A_02Jan2016.xlsx
The content of each file is:
Date Quantity Latitude Longitude Measurement
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
15 15 15 15 15
16 16 16 16 16
17 17 17 17 17
18 18 18 18 18
19 19 19 19 19
20 20 20 20 20
21 21 21 21 21
22 22 22 22 22
This is only to show how it will work.
Reading the data:
>> ssds = spreadsheetDatastore('./excel_files')
ssds =
SpreadsheetDatastore with properties:
Files: {
'/home/anquegi/learn/matlab/stackoverflow/excel_files/A_01Jan2016.xlsx';
'/home/anquegi/learn/matlab/stackoverflow/excel_files/A_02Jan2016.xlsx'
}
Sheets: ''
Range: ''
Sheet Format Properties:
NumHeaderLines: 0
ReadVariableNames: true
VariableNames: {'Date', 'Quantity', 'Latitude' ... and 2 more}
VariableTypes: {'double', 'double', 'double' ... and 2 more}
Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'Date', 'Quantity', 'Latitude' ... and 2 more}
SelectedVariableTypes: {'double', 'double', 'double' ... and 2 more}
ReadSize: 'file'
Now you have all your data in tables. Let's see a preview:
>> data = preview(ssds)
data =
Date Quantity Latitude Longitude Measurement
____ ________ ________ _________ ___________
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
The preview is a good way to get sample data to work with.
You do not need to merge; you can work through all the elements:
>> ssds.VariableNames
ans =
'Date' 'Quantity' 'Latitude' 'Longitude' 'Measurement'
>> ssds.VariableTypes
ans =
'double' 'double' 'double' 'double' 'double'
% let's get all the Latitude elements that have Date equal to 1; in this case the two files are the same, so we will get two elements with value 1
>> reset(ssds)
accum = [];
while hasdata(ssds)
T = read(ssds);
accum(end +1) = T(T.Date == 1,:).Latitude;
end
>> accum
accum =
1 1
So you need to work with datastores and tables. It is a bit tricky but very useful. You may also want to control ReadSize and other properties of the datastore object. This is a good way to work with large data files in MATLAB.
For older versions of MATLAB you can use a more traditional approach:
folder='./excel_files';
filetype='*.xlsx';
f=fullfile(folder,filetype);
d=dir(f);
for k=1:numel(d);
data{k}=xlsread(fullfile(folder,d(k).name));
end
Now you have the data stored in data:
data
data =
[22x5 double] [22x5 double]
data{1}
ans =
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
11 11 11 11 11
12 12 12 12 12
13 13 13 13 13
14 14 14 14 14
15 15 15 15 15
16 16 16 16 16
17 17 17 17 17
18 18 18 18 18
19 19 19 19 19
20 20 20 20 20
21 21 21 21 21
22 22 22 22 22
But be careful when there are a lot of large files, because this loop keeps every file in memory at once.
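One detail to watch when looping over the files: with names like A_01Jan2016.xlsx, an alphabetical listing is not chronological (A_01Feb2016 sorts before A_01Jan2016). The sketch below shows the idea in Python purely for illustration; the helper name file_date and the sample names are assumptions, and %b expects English month abbreviations in the active locale.

```python
from datetime import datetime


def file_date(name):
    """Parse the date out of a name such as 'A_01Jan2016.xlsx'."""
    stem = name.rsplit('.', 1)[0]              # 'A_01Jan2016'
    return datetime.strptime(stem.split('_', 1)[1], '%d%b%Y')


names = ['A_01Feb2016.xlsx', 'A_02Jan2016.xlsx', 'A_01Jan2016.xlsx']
print(sorted(names))                  # alphabetical order
print(sorted(names, key=file_date))   # chronological order
```

Sorting the file list this way before reading keeps the rows in date order if the per-file results are later concatenated.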