Best Way to "tag" data for fast parsing through matlab? - excel

I collect data into an Excel sheet through a LabVIEW program. The data is collected continuously at a regular interval, and events are marked in one of the columns, with TaskA_0 representing the start of an event and TaskA_1 representing the end. This is a snippet of the data:
Time Data 1 Data 2 Data 3 Data 4 Event Name
13:38:41.888 0.719460527 0.701654664 0.221332969 0.012234448 Task A_0
13:38:41.947 0.437707516 0.588673334 0.524042112 0.309975646 Task A_1
13:38:42.021 0.186847503 0.589175696 0.393891242 0.917737946 Task B_0
13:38:42.115 0.44490411 0.073132298 0.897701096 0.633815257 Task B_1
13:38:42.214 0.833793601 0.004524633 0.40950937 0.808966844 Task C_0
13:38:42.314 0.953997375 0.055717025 0.914080619 0.166492915 Task C_1
13:38:42.414 0.245698313 0.066643778 0.515709814 0.606289696 Task D_0
13:38:42.514 0.248038367 0.862138045 0.025489223 0.352926629 Task D_1
Currently I load this into MATLAB using xlsread, and then run strfind to locate the row indices of the event markers so I can break my data up into tasks, where each task is the data in the adjacent columns between TaskA_0 and TaskA_1 (here there is no data between the markers, but normally there is, and there are normally blank cells between event names). Is this the best method for doing this? Once I have the tasks in separate variables I perform identical actions on each variable, usually basic statistics and some plotting. If I want to batch process my data I have to rewrite these lines over and over to break the data up by task, which even I know is wrong and horribly inefficient, but I don't know a better way to do it.
[Data,Text] = xlsread('C:\TestData.xlsx',2); % the time column and event name column end up in Text, as do the data headers, hence the +1 on the row indices
IndexTaskAStart = find(~cellfun(@isempty, strfind(Text(:,2),'TaskA_0')))+1;
IndexTaskAEnd   = find(~cellfun(@isempty, strfind(Text(:,2),'TaskA_1')))+1;
TaskAData = Data(IndexTaskAStart:IndexTaskAEnd, :);
Now I can perform analysis on columns in TaskAData, and repeat the process for the remaining tasks.

Presuming you cannot change the format of the files but do know which tasks you're searching for, you can still automate the search by creating a list of task names and appending _0 and _1 onto each name. Then, rather than creating individually named variables, store the results in a cell array for easier looping:
tasknames = {'Task A', 'Task B', 'Task C'};
for n = 1:numel(tasknames)
    first = find(~cellfun(@isempty, strfind(Text(:,2), [tasknames{n}, '_0'])))+1;
    last  = find(~cellfun(@isempty, strfind(Text(:,2), [tasknames{n}, '_1'])))+1;
    task_data{n} = Data(first:last, :);
    % whatever other analysis you require goes here
end
If there are a large number of tasknames but they follow some pattern, you might prefer to create them on the fly instead of preallocating a list in tasknames.

Related

Python Warning Pandas Dataframe "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

First post / total Python novice, so please be patient with my slow understanding!
I have a dataframe containing a list of transactions in order of transaction date.
I've appended a new field/column called ["DB/CR"] that, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', or with 'Credit' in the absence of "-".
Since the transactions are in date order, I've included another new field/column called [Top x]. I want to populate it with an incremental, independent number (starting at 1) for debits and credits on a segregated basis.
To do that, I have created a simple loop with an associated if/elif (it could probably use else, as the choice is binary) that walks the DataFrame from row 0 to the last row and, depending on whether the row is 1) "Debit" or 2) "Credit", increments a separate counter for each: integer 'i' for debits and integer 'ii' for credits.
The code works as expected in terms of the 'Top x' output; however, I always receive the warning "A value is trying to be set on a copy of a slice from a DataFrame".
Trying to perfect my script so it runs without any warnings, I've been trying to understand what I'm doing incorrectly, but I'm not getting it in terms of my use case.
I'd appreciate it if someone could kindly shed light on / propose how the code needs to be refactored to avoid this warning.
Code (the df source data is an imported csv):
# top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i + 1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii + 1
        df["Top x"][ind] = ii
Interpreter
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use .loc for the assignment, e.g. df.loc[ind, "Top x"] = i (and likewise df.loc[ind, "DB/CR"] == "Debit" for the comparison), instead of the chained df["Top x"][ind] indexing.
Use iterrows() to iterate over the DataFrame. However, updating a DataFrame while iterating over it is not recommended; see the iterrows() documentation:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
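Putting those two suggestions together, here is a minimal sketch (not from the answers above, and using a small made-up "Amount" column) of both a .loc-based version of the loop and a fully vectorised alternative:
import pandas as pd

# Hypothetical stand-in for the imported CSV described in the question.
df = pd.DataFrame({"Amount": ["-10.00", "25.00", "-3.50", "12.00"]})
df["DB/CR"] = df["Amount"].str.contains("-").map({True: "Debit", False: "Credit"})

# Option 1: keep the loop, but read and write through .loc so pandas sees a
# single indexing operation rather than a chained one (no SettingWithCopyWarning).
i = ii = 0
for ind in df.index:
    if df.loc[ind, "DB/CR"] == "Debit":
        i += 1
        df.loc[ind, "Top x"] = i
    else:
        ii += 1
        df.loc[ind, "Top x"] = ii

# Option 2: drop the loop entirely; cumcount() numbers the rows within each
# DB/CR group in order of appearance, starting at 0, hence the +1.
df["Top x"] = df.groupby("DB/CR").cumcount() + 1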

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'A' to 'C'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow, as it evaluates conditions over the whole data frame on every iteration; that is manageable once, but not inside a loop.
Note that these filters are not additive: as in this example, each successive iteration of the for loop increases, rather than decreases, the size of filtered (since the | rather than the & operator is used).
Note also that I am aware of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators; however, I only want this comparison to be inclusive at the lower end.
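(In pandas 1.3 and later, between can express exactly that half-open interval through its inclusive argument; a minimal sketch with toy data:)
import pandas as pd

df = pd.DataFrame({'C': [1.0, 2.0, 3.0, 4.0]})  # toy data
point = {'c': 2.0, 'd': 4.0}                     # one selected point, keys as in the code above

# Requires pandas >= 1.3, where `inclusive` accepts 'left', 'right', 'both' or 'neither';
# 'left' gives the half-open interval [c, d): inclusive at the lower end only.
in_range = df['C'].between(point['c'], point['d'], inclusive='left')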
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join([f'((A == {point["a"]}) & (B == {point["b"]})'
                    f' & (C >= {point["c"]}) & (C < {point["d"]}))'
                    for point in points])
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
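A sketch of one possible direction (not from the original post): since the conditions on A and B are plain equalities, they can be pushed into a single merge against a small DataFrame built from points, leaving only the range test on C to be evaluated once over the matches. Column names A, B, C and the point keys 'a', 'b', 'c', 'd' follow the code above, and a default (unnamed) integer index is assumed.
import pandas as pd

def filter_by_points(df, points, list_of_column_names):
    """Select the df rows that match any point on A and B, with C in [c, d)."""
    points_df = pd.DataFrame(points).rename(columns={'a': 'A', 'b': 'B'})

    # Inner join on the equality columns: each df row is paired with every
    # selected point that matches it on A and B (reset_index keeps the original
    # row label in an 'index' column).
    candidates = df.reset_index().merge(points_df, on=['A', 'B'], how='inner')

    # Apply the half-open range condition against the matched point.
    in_range = (candidates['C'] >= candidates['c']) & (candidates['C'] < candidates['d'])

    # Build a boolean mask in the original row order; isin also collapses the
    # duplicates created when a row matches more than one point.
    keep = df.index.isin(candidates.loc[in_range, 'index'])
    return df.loc[keep, list_of_column_names]
Whether this actually beats the boolean-mask loop depends on how many points are selected, so it would need timing against the real 680,000-row frame.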

Why is for-loop variable skipping every second row when iterating over sqlite database?

I have set up a sqlite database which records some world-trading data.
I am iterating over the database and want to extract some data, which then flows into another method of that class (that part works, no problem).
However, while iterating over the database I have two variables, "row" and "l":
for row in Database("..."):
    l = self.c.fetchone()
Strangely, half of the data ends up in the variable "row" and the other half in "l". It took me forever to figure this out, and now I really have no idea why it happens. If I iterate over a list/db with "row", shouldn't "row" hold all the data on each iteration?
I tried accessing the rows through "row" and "l" in different ways and from within a new loop; I rewrote and restructured the loops, but then I had too much data and over 2000 entry points??? I also used fetchmany() and made another (outer) loop to iterate over.
for row in self.c.execute("SELECT order_number,quotaStart,valid FROM volume"):
    l = self.c.fetchone()
    count += 1
    print(count, ">>", row)
    print(count, ">>", l)
I expect the data to be accessible through "row" or "l" - but not one half in one variable and the other half in the other?
You are mixing up two different ways of accessing the results of your query: the for loop and fetchone() both pull rows from the same cursor, so each pass through the loop consumes two rows, one going into "row" and the next into "l". The simplest way to do it is:
for row in self.c.execute("SELECT order_number,quotaStart,valid FROM volume"):
    print(count, ">>", row)
Or alternatively:
self.c.execute("SELECT order_number,quotaStart,valid FROM volume")
while (l := self.c.fetchone()) is not None:
    print(count, ">>", l)
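A third option, not mentioned in the answers above, is to read the whole result set once with fetchall() and then iterate over the resulting list; a sketch, assuming the table is small enough to hold in memory and reusing the question's self.c cursor:
self.c.execute("SELECT order_number, quotaStart, valid FROM volume")
rows = self.c.fetchall()  # list of tuples; the cursor is now exhausted

for count, row in enumerate(rows, start=1):
    print(count, ">>", row)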

PowerShell on CSV file - looking for string depending on string

I need your help regarding PowerShell programming on a CSV file.
I've done some searching but cannot find what I'm looking for (or perhaps I don't know the technical terms). Basically, I have an Excel workbook with a large amount of data (roughly 38 columns x 350,000 rows), and there are a couple of formulas that take hours to calculate.
I was first wondering if PowerShell could speed up the calculation a bit compared to Excel. The calculations taking most of my time are in fact not that complex (at least at first glance). My data is constructed more or less like this:
Ref Title
----- --------------------------
A/001 "free_text"
A/002 "free_text A/001 free_text"
... ...
A/005 "free_text A/004 free_text"
A/006 "free_text"
B/001 "free_text"
B/002 "free_text"
C/001 "free_text"
C/002 "free_text"
...
C/050 "free_text C/047 free_text"
... ...
C/103 "free_text"
D/001 "free_text"
D/002 "free_text D/001 free_text"
... ....
Basically the data is as follows:
the Ref field contains unique values, in {letter}/{incremental value} format.
In some rows, the Title field may call up one of the Ref data. For example, in line 2, the Title calls for the A/001 Ref. In the last row, the Title calls for the D/001 Ref, etc.
There is no logic pattern defining when this ref could be called up in a title. This is random.
However, what I'm 100% sure of is the following:
The Ref called up in a Title always belongs to the same {letter} block. For example, the string 'C/047' in the Title field can only be found in the block where the Ref {letter} is C.
A row whose Title calls up a Ref is always located 'after' (i.e., in a lower row than) the Ref it refers to. In other words, I cannot have a line with the following pattern:
Ref Title
------------ -----------------------------------------
{letter/i}   {free_text {letter/j} free_text} with j > i
→ This is not possible.
→ j is always < i
I've used these characteristics in Excel to minimize my lookup arrays. But it still takes an hour to calculate everything.
I've therefore looked into PowerShell and started to 'play' a bit with the CSV, looping with ForEach-Object in the hope of getting quicker results. So far I have basically ended up looping twice over my CSV file.
$CSV1 = Import-Csv myfile.csv
$CSV2 = Import-Csv myfile.csv
$CSV1 | ForEach-Object {
    # Ref value to look for in the Title column
    $TitSearch = $_.Ref
    $CSV2 | ForEach-Object {
        if ($_.Title -eq $TitSearch) {
            myinstructions
        }
    }
}
It works but it's really, really, really long. So I then tried the following instead of the inner $CSV2 | ForEach...:
$CSV | where { $_.Title -eq $TitSearch } | % { $_.Ref }
In either case, it's too long and not efficient at all. Additionally, with these two solutions I'm not using the above characteristics, which could reduce the lookup array, and as already stated I seem to end up looping over the CSV file twice from beginning to end.
Questions:
Is there a leaner way to do this?
Am I wasting my time with PowerShell?
I thought about creating one file per Ref {letter} block (one file for block A, one for B, etc.). However, I have about 50,000 blocks to create. Or create them one by one, carry out the analysis, put the results in a new file, and delete them. Would that be quicker?
Note: this is for work, to be used by other colleagues, and Excel and PowerShell are really the only software we may use. I know VBA, but OK... In the end I'm curious how, and whether, this can be solved in a simple manner using PowerShell.
As far as I can see, your base algorithm does N^2 iterations (~120 billion). There is a standard way to make that efficient: build a hashtable first. A hashtable is a key/value store, and lookup is pretty much instantaneous, so the algorithm's time complexity becomes ~N.
PowerShell has a built-in data type for that. In your case the key would be the ref, and the value an array of the cell data (assuming your table is something like: ref, title, col1, ..., colN):
$hash = @{}
foreach ($row in $table) { $hash.Add($row.ref, @($row.title, $row.col1, ...)) }
# it will take 350K steps to generate it
# then you can iterate over it again
foreach ($key in $hash.Keys) {
    $key                               # access the current ref
    $rowData = $hash.$key              # access the current row's elements (by index)
    $refRowData = $hash[$rowData[$j]]  # look up another row, assuming the lookup reference is in some column
}
So that is the general idea of how to solve the time issue. To be honest, I don't believe you need to reinvent the wheel and code it yourself. What you need is a relational database. Since you have Excel, you should have MS Access too. Just import your data there, make ref and title indexes, and then all you need is a self join. MS Access sucks, but I'm sure it will handle 350K rows just fine.
Ideally you'd want to get a database on some corporate MSSQL server (open a ticket, talk to your manager, etc.). It will calculate all of that in seconds, and then you can link the output to a spreadsheet as well.

Extracting data from series of excel files (MATLAB)

I'll begin by saying I'm really not good at programming, especially at extracting data, so please bear with me. I think my problem is simple, I just can't figure out how to do it.
My problem is that I want to extract part of the data in a series of Excel files stored in the same folder. To be specific, let's say I have 10 Excel files with 1000 data points in each (A1:A1000). I want to extract the first 100 data points (A1:A100) from each file and store them in a single variable of size 10x100 (each row representing one file).
I would really appreciate it if any of you could help me. This would make my data processing a lot faster.
EDIT: I have figured out the code, but my next problem is to create another loop so that it rereads the 10 files, this time extracting A101:A200, and so on up to A901:A1000.
here's the code i've written:
for k=1:1:10
    file = ['', int2str(k), '.xlsx'];
    data = (xlsread(file, 'A1:A100'))';
    z(k,:) = data(1,:);
end
I'm not sure how I should edit the data=(xlsread(file,'A1:A100'))' part to do the loop I want.
my next problem is to create another loop such that it will reread again the 10 files but this time extract A101:A200 until A901:A1000.
Why? Why not extract A1:A1000 in one block and then reshape or otherwise split up the data?
data(k,:)=(xlsread(file,'A1:A1000'))';
Then the A1:A100 data is in data(k,1:100), and so on. If you do this:
data = reshape(data, [10 100 10]);
Then data(:,:,1) should be your A1:A100 values as in your original loop, and so on until data(:,:,10).
This should do it:
for sec = 1:1:10
    for k = 1:1:10
        file = ['', int2str(k), '.xlsx'];
        section = ['A', num2str(1 + 100*(sec-1)), ':A', num2str(100*sec)];
        data = (xlsread(file, section))';
        z(k,:) = data(1,:);
    end
    output{sec} = z;
end
Here's a suggestion to loop through the different cells to read. Obviously, you can change how you arrange the collected data in z. I have done it as the first index representing the different cells to read (1 for 1:100, 2 for 101:200, etc...), the second index being the file number (as per your original code) and the third index the data (100 data points).
% pre-allocate data
z = zeros(10,10,100);
for kk = 1:10
    cells_to_read = ['A' num2str(kk*100-99) ':A' num2str(kk*100)];
    for k = 1:10
        file = ['', int2str(k), '.xlsx'];
        data = (xlsread(file, cells_to_read))';
        z(kk,k,:) = data(1,:);
    end
end
