Why is for-loop variable skipping every second row when iterating over sqlite database? - python-3.x

I have set up an SQLite database which records some world-trading data.
I am iterating over the database and want to extract some data, which will flow into another method of that class (works - no problem).
However, iterating over the database I have two variables "row" & "l":
for row in database("..."):
    l = self.c.fetchone()
Strangely, half of the data ends up in the variable "row" and the other half in "l". It took me forever to figure out, and I still have no idea why this happens. If I iterate over a list/db with "row", shouldn't "row" hold all the data, one row per iteration?
I tried accessing the rows through "row" and "l" in different ways, from within a new loop; I rewrote and restructured the loops, but then I had too much data and over 2000 entry points. I also used fetchmany() and added another (outer) loop to iterate over the results.
for row in self.c.execute("SELECT order_number,quotaStart,valid FROM volume"):
    l = self.c.fetchone()
    count += 1
    print(count, ">>", row)
    print(count, ">>", l)
I expect the data to be accessible through "row" or "l" - but not one half in one variable and the other half in the other?

You are mixing up two different ways of accessing the results of your query. The simplest way is:
for row in self.c.execute("SELECT order_number,quotaStart,valid FROM volume"):
    print(count, ">>", row)
Or alternatively:
self.c.execute("SELECT order_number,quotaStart,valid FROM volume")
l = self.c.fetchone()
while l is not None:
    print(count, ">>", l)
    l = self.c.fetchone()
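The skipped rows follow directly from the cursor being a single forward-only iterator: the for loop and fetchone() both consume from the same stream. A minimal sketch with an in-memory database (the table and column name here are just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE volume (order_number INTEGER)")
c.executemany("INSERT INTO volume VALUES (?)", [(n,) for n in range(6)])

# Mixing iteration with fetchone(): each fetchone() consumes the row
# that the for loop would have yielded next, so "row" and "l" each
# end up with half of the result set.
rows, fetched = [], []
for row in c.execute("SELECT order_number FROM volume"):
    fetched.append(c.fetchone())
    rows.append(row)

print(rows)     # rows 0, 2, 4
print(fetched)  # rows 1, 3, 5
```

Picking either style (iterate, or fetchone in a while loop) makes all six rows land in one variable.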

Related

Python Warning Panda Dataframe "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

first post / total Python novice so be patient with my slow understanding!
I have a dataframe containing a list of transactions by order of transaction date.
I've appended a new field/column called ["DB/CR"] which, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', else 'Credit'.
Since the transactions are in date order, I've included another new field/column called [Top x]. I want it to hold an incremental, independent number (starting at 1) for debits and credits on a segregated basis.
As such, I have created a simple loop with an associated 'if'/'elif' statement (it could probably use 'else', as it's binary) that walks the DataFrame from row 0 to the last row and, depending on whether the row is 1) "Debit" or 2) "Credit", increments the corresponding counter: integer 'i' for "Debit" and integer 'ii' for "Credit".
The code works as expected in terms of output of the 'Top x'; however, I always receive a warning "A value is trying to be set on a copy of a slice from a DataFrame".
Trying to perfect my script so it runs without warnings, I've been trying to understand what I'm doing incorrectly, but I'm not getting it in terms of my use case.
I'd appreciate it if someone could kindly shed light on this / propose how the code needs to be refactored to avoid the warning.
Code (the df source data is an imported csv):
# top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i + 1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii + 1
        df["Top x"][ind] = ii
Interpreter output:
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use df.loc[ind, "Top x"] = i (and likewise df.loc[ind, "Top x"] = ii) instead of the chained indexing df["Top x"][ind] = i.
Use iterrows() to iterate over the DataFrame. However, updating a DataFrame while iterating over it is not recommended; see the iterrows() documentation:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
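As an aside, this particular counter doesn't need a loop at all. A sketch of a vectorized alternative, assuming the column names from the question (the sample Amount values are made up):

```python
import pandas as pd

df = pd.DataFrame({"Amount": ["-10.00", "20.00", "-5.00", "15.00"]})
df["DB/CR"] = df["Amount"].str.contains("-").map({True: "Debit", False: "Credit"})

# cumcount() numbers the rows within each DB/CR group in order of
# appearance, starting at 0, so debits and credits count independently.
df["Top x"] = df.groupby("DB/CR").cumcount() + 1
print(df)
```

Because this assigns a whole column at once rather than writing cell by cell, it also sidesteps the SettingWithCopyWarning entirely.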

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'A' to 'C'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow, as it scans the whole DataFrame on every iteration; running it once is manageable, but not inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as | rather than & operator is used).
Note also that I am aware of the existence of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators, however, I only want this comparison to be inclusive at the lower end.
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join([f'((A == {point["a"]}) & (B == {point["b"]}) '
                    f'& (C >= {point["c"]}) & (C < {point["d"]}))' for point in points])
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
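One non-authoritative sketch (assuming the column names from the question, with tiny made-up data): the equality conditions can be handled with a single inner merge against a DataFrame built from points, so df is scanned once instead of once per point; the half-open range condition on C is then one vectorized pass over the merge result:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [10, 10, 20, 20],
    "C": [0.1, 0.5, 0.3, 0.9],
})
points = [
    {"a": 1, "b": 10, "c": 0.2, "d": 0.6},
    {"a": 2, "b": 20, "c": 0.0, "d": 0.4},
]

# One row per selected point; the merge matches the equality conditions
# (A == a, B == b) for all points in a single pass over df.
sel = pd.DataFrame(points)
merged = df.reset_index().merge(sel, left_on=["A", "B"], right_on=["a", "b"])

# Apply c <= C < d, then recover the original rows; unique() guards
# against the same row matching overlapping point ranges.
mask = (merged["C"] >= merged["c"]) & (merged["C"] < merged["d"])
filtered = df.loc[merged.loc[mask, "index"].unique()]
print(filtered)
```

Whether this beats the boolean-OR loop depends on how many points there are relative to len(df), so it would need profiling on the real 680,000-row data.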

Changes in a temporary variable are affecting the variable that feeds from

I'm designing a Mastermind game, which basically compares 2 lists and marks the similarities. When a colour is found at the right place, a flag marking the correct position is added and the item found in the reference list is marked off. The reference list feeds off an array from another function. The problem is at the mark-off: any change made to the reference list also changes the original array, which I don't want to happen.
tempCode = mCode  # mCode is the array combination randomly generated from another function
for i in range(len(uCode)):  # user input array
    for j in range(len(tempCode)):  # temp array
        if uCode[i] == tempCode[j]:  # compare individual chars
            if i == j:  # compare position
                flagMark = "*"
                tempCode.insert(j+1, "x")  # problem starts here
                tempCode.remove(tempCode[j])
                fCode.append(flagMark)
When the insert is reached, both tempCode and mCode change, which is not intended.
The code is written so that, should the user enter a combination of the same colours, it checks the chars (the colours are just letters) and the positions, and then marks them off with "x".
As it stands, when it gets to
tempCode.insert(j+1, "x")
the arrays will change to
mCode = ["B","R","x","G","Y"]
tempCode = ["B","R","x","G","Y"]
when I would just want
mCode = ["B","R","G","Y"]
tempCode = ["B","R","x","G","Y"]
See also this answer, which is a different presentation of the same problem.
Essentially, when you do tempCode = mCode, you're not making a copy of mCode; you're making another reference to the same list. Anything you do to tempCode thereafter affects the original as well, so at any given time tempCode is mCode will be true (they're the same object).
You probably want to make a copy of mCode, which could be done in either of the following ways:
tempCode = mCode.copy()
tempCode = mCode[:]
which produces a different list with the same elements, rather than the same list
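A minimal sketch of the difference between the two:

```python
mCode = ["B", "R", "G", "Y"]

alias = mCode          # same list object, second name for it
copy_ = mCode.copy()   # new list with the same elements

alias.insert(2, "x")   # mutates mCode too
copy_.append("x")      # leaves mCode alone

print(mCode)           # ['B', 'R', 'x', 'G', 'Y']
print(alias is mCode)  # True
print(copy_ is mCode)  # False
```

So in the game code, making tempCode a copy lets the "x" mark-off stay local to tempCode while mCode keeps the original combination.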

Best Way to "tag" data for fast parsing through matlab?

I collect data into an Excel sheet through a LabVIEW program. The data is collected continuously at a regular interval, and events are marked in the file in one of the columns, with TaskA_0 representing the start of an event and TaskA_1 the end. This is a snippet of the data:
Time Data 1 Data 2 Data 3 Data 4 Event Name
13:38:41.888 0.719460527 0.701654664 0.221332969 0.012234448 Task A_0
13:38:41.947 0.437707516 0.588673334 0.524042112 0.309975646 Task A_1
13:38:42.021 0.186847503 0.589175696 0.393891242 0.917737946 Task B_0
13:38:42.115 0.44490411 0.073132298 0.897701096 0.633815257 Task B_1
13:38:42.214 0.833793601 0.004524633 0.40950937 0.808966844 Task C_0
13:38:42.314 0.953997375 0.055717025 0.914080619 0.166492915 Task C_1
13:38:42.414 0.245698313 0.066643778 0.515709814 0.606289696 Task D_0
13:38:42.514 0.248038367 0.862138045 0.025489223 0.352926629 Task D_1
Currently I load this into MATLAB using xlsread, and then run strfind to locate the row indices of the event markers in order to break my data up into tasks, where each task is the data in the adjacent columns between TaskA_0 and TaskA_1 (here there is no data between them, but normally there is; there are also normally blank cells between event names). Is this the best method for doing this? Once I have the tasks in separate variables, I perform identical actions on each variable, usually basic statistics and some data plotting. If I want to batch-process my data, I have to rewrite these lines over and over to get the data broken up by task, which even I know is wrong and horribly inefficient, but I don't know how to do this better.
[Data, Text] = xlsread('C:\TestData.xlsx', 2); % time column and event name column end up in Text, as do the data headers, hence the +1 for the row indices
IndexTaskAStart = find(~cellfun(@isempty, strfind(Text(:,2), 'TaskA_0'))) + 1;
IndexTaskAEnd = find(~cellfun(@isempty, strfind(Text(:,2), 'TaskA_1'))) + 1;
TaskAData = Data(IndexTaskAStart:IndexTaskAEnd, :);
Now I can perform analysis on columns in TaskAData, and repeat the process for the remaining tasks.
Presuming you cannot change the format of the files, but do know which tasks you're searching for, you can still automate the search by creating a list of task names, just appending _0 and _1 onto the task names to search. Then do not create individual named variables but store in a cell array for easier looping:
tasknames = {'Task A', 'Task B', 'Task C'};
for n = 1:numel(tasknames)
    first = find(~cellfun(@isempty, strfind(Text(:,2), [tasknames{n}, '_0']))) + 1;
    last = find(~cellfun(@isempty, strfind(Text(:,2), [tasknames{n}, '_1']))) + 1;
    task_data{n} = Data(first:last, :);
    % whatever other analysis you require goes here
end
If there are a large number of tasknames but they follow some pattern, you might prefer to create them on the fly instead of preallocating a list in tasknames.
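For comparison, the same marker-based slicing can be sketched outside MATLAB. A minimal, hypothetical pandas version (column names assumed from the snippet above, data values made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Event Name": ["Task A_0", "", "Task A_1", "Task B_0", "", "Task B_1"],
    "Data 1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

task_data = {}
for name in ["Task A", "Task B"]:
    # Row positions of the start/end markers for this task.
    first = df.index[df["Event Name"] == name + "_0"][0]
    last = df.index[df["Event Name"] == name + "_1"][0]
    # .loc slicing is inclusive of both endpoints, matching first:last in MATLAB.
    task_data[name] = df.loc[first:last, "Data 1"]

print(task_data["Task A"].tolist())  # [0.1, 0.2, 0.3]
```

The structure is the same idea as the MATLAB loop: find the marker rows, then slice the data block between them, keyed by task name.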

Extracting data from series of excel files (MATLAB)

I'll begin by saying I am really not good at programming, especially at extracting data, so please bear with me. I think my problem is simple; I just can't figure out how to do it.
I want to extract part of the data in a series of Excel files stored in the same folder. To be specific, say I have 10 Excel files with 1000 data points each (A1:A1000). I want to extract the first 100 data points (A1:A100) from each file and store them in a single variable of size 10x100 (each row representing one file).
I would really appreciate if any of you can help me. This would make my data processing a lot faster.
EDIT: I have figured out the code, but my next problem is to create another loop such that it rereads the 10 files, this time extracting A101:A200, and so on until A901:A1000.
Here's the code I've written:
for k = 1:1:10
    file = ['', int2str(k), '.xlsx'];
    data = (xlsread(file, 'A1:A100'))';
    z(k,:) = data(1,:);
end
I'm not sure how to edit the part data=(xlsread(file,'A1:A100'))' to do the loop I want.
my next problem is to create another loop such that it will reread again the 10 files but this time extract A101:A200 until A901:A1000.
Why? Why not extract A1:A1000 in one block and then reshape or otherwise split up the data?
data(k,:)=(xlsread(file,'A1:A1000'))';
Then the A1:A100 data is in data(k,1:100), and so on. If you do this:
data = reshape(data, [10 100 10]);
Then data(:,:,1) should be your A1:A100 values as in your original loop, and so on until data(:,:,10).
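This relies on MATLAB's column-major ordering. A quick sanity check of the equivalent in NumPy, using order='F' to mimic MATLAB semantics:

```python
import numpy as np

# 10 files x 1000 values, as in data(k,:) above.
data = np.arange(10 * 1000).reshape(10, 1000)

# MATLAB's reshape(data, [10 100 10]) is column-major, i.e. Fortran order.
blocks = data.reshape((10, 100, 10), order='F')

# blocks[:, :, 0] is columns 1:100 of the original, i.e. the A1:A100 slice,
# and blocks[:, :, 9] is columns 901:1000.
print(np.array_equal(blocks[:, :, 0], data[:, :100]))  # True
```

So each slice along the third dimension is one contiguous 100-value block per file, exactly as the answer claims.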
This should do it:
for sec = 1:1:10
    for k = 1:1:10
        file = ['', int2str(k), '.xlsx'];
        section = ['A', num2str(1 + 100*(sec-1)), ':A', num2str(100*sec)];
        data = (xlsread(file, section))';
        z(k,:) = data(1,:);
    end
    output{sec} = z;
end
Here's a suggestion to loop through the different cells to read. Obviously, you can change how you arrange the collected data in z. I have done it as the first index representing the different cells to read (1 for 1:100, 2 for 101:200, etc...), the second index being the file number (as per your original code) and the third index the data (100 data points).
% pre-allocate data
z = zeros(10, 10, 100);
for kk = 1:10
    cells_to_read = ['A' num2str(kk*100-99) ':A' num2str(kk*100)];
    for k = 1:10
        file = ['', int2str(k), '.xlsx'];
        data = (xlsread(file, cells_to_read))';
        z(kk,k,:) = data(1,:);
    end
end
