I'm a new R user, and work requires that I use R on linux. I am running into a very strange problem, and hope some of you expert users can provide a solution. :)
I have a large dataset with >200,000 observations/participants and >300 variables, that involves subsetting from various baseline datasets to create the working dataset.
My issue is that an essential variable changes some times when I run the length command.
"Withdrawlevel" is the variable that changes. This is how this variable should be:
describe(tbl$Withdrawlevel)
tbl$Withdrawlevel
n missing unique Mean
2833 218988 3 1.474
I then run several length commands like the following because I'm interested in getting the number of participants that meet certain criteria.
For example:
length(which(tbl[,'Reg_age_dob']>=18 & as.Date(tbl[,'QuestionnaireEndDate'])>='2013-07-21' & as.Date(tbl[,'QuestionnaireEndDate'])< '2013-07-28' & (is.na(tbl$Withdrawlevel) | (tbl$Withdrawlevel!=3) & (tbl$WithdrawDate<'2013-07-28')) | ((tbl$Withdrawlevel=3) & (tbl$WithdrawDate>='2013-07-28')) ))
And, then Withdrawlevel variable changes:
describe(tbl$Withdrawlevel) tbl$Withdrawlevel
n missing unique Mean
221821 0 1 3
Is the length command described above doing something to this variable, because my understanding is that it shouldn't. And, I have run many similar commands with this data, and this issue doesn't occur after each one.
Any insight into what is going on and how I can resolve this issue?
tbl$Withdrawlevel=3 assigns the value 3 to all observations of tbl$Withdrawlevel. You meant tbl$Withdrawlevel==3.
(Joshua's answer is correct.) In the future you can protect yourself against this sort of error by using with:
with( tbl, length( which(Reg_age_dob >=18 &
as.Date(QuestionnaireEndDate) >='2013-07-21' &
as.Date(QuestionnaireEndDate) < '2013-07-28' &
( is.na(tbl$Withdrawlevel) | (Withdrawlevel!=3) & ( WithdrawDate <'2013-07-28') ) |
( (tbl$Withdrawlevel=3) & ( WithdrawDate >='2013-07-28') ) )
)
)
The point is that this does not have the danger of corrupting your data object and it's also much easier to understand.
You should be using booleans for all your expressions in your which function. Make sure to use == instead of = which returns a value of True or False rather than setting the variable to equal the value.
Related
When I implement code for one paricular value of state name (see Last_ residence in code)
andhrapradesh.query('Duration_of_residence=="All durations of residence" & Last_residence_R_or_U=="Urban" & Last_residence=="Jammu & Kashmir"',inplace=True)
print(andhrapradesh['Total_migrants'].sum())
It gives desired sum of outflow value for that state from pandas csv.
But when I tried to calculate for all possible names of state it's giving me error "UndefinedVariableError: name 'Jammu & Kashmir' is not defined"
states = ["Jammu & Kashmir","Punjab",'Himachal Pradesh']
for name in states:
andhrapradesh.query(f'Duration_of_residence=="All durations of residence" & Last_residence_R_or_U=="Urban" & Last_residence=={name}',inplace=True)
print(andhrapradesh['Total_migrants'].sum())
can you please figure out why it showing error and how can I do it for all values in list states.
You need quotes around name like you had in your manual code:
df.query(f"... Last_residence == \"{name}\"")
so that it is seen as, e.g., "Punjab" as a string to check against but not bare Punjab which is sought specially, e.g., as a column name, hence the error.
(can use single quotes too...)
However, there's a better way: you can prefix a variable with # symbol, so you don't even need the f-string:
df.query("... Last_residence == #name")
I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'A' to 'C'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
& (df['C'] >= point['c']) & (df['C'] < point['d']))
rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow as it is iterating over the whole data frame, it is manageable to run it once but not inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as | rather than & operator is used).
Note also that I am aware of the existence of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators, however, I only want this comparison to be inclusive at the lower end.
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join([f'((A == {point["a"]}) & (B == {point["b"]})
& (C >= {point["c"]}) & (C < {point["d"]}))' for point in points])
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
I need to create a string of the next type:
"<a=1 b=123 c=15 d=19 e=12345>" (rough example)
BUT if any of these variable doesn't exist, it shouldn't be printed at all. Hard to explain, but here is an example.
Desired output: <a=1 c=15 e=12345>
My current output: <a=1 b= c=15 d= e=12345>
I can do this by many case conditions, but is there a more elegant way to do this, ideally in one statement. May be something like (just what I want to find, it's not my expectation code)):
print "<[if a exists]a=" & a & ", [if b exists] b=" & b ...>"
Thanks!
Sounds like the easiest way to do this is using the iif() function. This is more or less what you ask for with [if a exists]. Not sure what condition to test because the information you have given does not make that clear. But I think you should be able to figure it out.
I am currently attempting to parse data that is sent from an outside source serially. An example is as such:
DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_
This data can come in many different lengths, but the first few pieces are all the same. Each "piece" originally comes in with CRLF after, so I've replaced them with string.gsub(input,"\r\n","|") so that is why my input looks the way it does.
The part I would like to parse is:
4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_
The "4" tells me that there will be four lines total to create this file. I'm using this as a means to set the amount of passes in the loop.
The 7x5 is the font height.
The 1 is the xpos.
The 25 is the ypos.
The variable data (172-24 in this case) is the text at these parameters.
As you can see, it should continue to loop this pattern throughout the input string received. Now the "4" can actually be any variable > 0; with each number equaling a set of four variables to capture.
Here is what I have so far. Please excuse the loop variable, start variable, and print commands. I'm using Linux to run this function to try to troubleshoot.
function loop_input(input)
var = tonumber(string.match(val, "DATA|0|(%d*).*"))
loop = string.match(val, "DATA|0|")
start = string.match(val, loop.."(%d*)|.*")
for obj = 1, var do
for i = 1, 4 do
if i == 1 then
i = "font" -- want the first group to be set to font
elseif i == 2 then
i = "xpos" -- want the second group to be set to xpos
elseif i == 3 then
i = "ypos" -- want the third group to be set to ypos
else
i = "txt" -- want the fourth group to be set to text
end
obj = font..xpos..ypos..txt
--print (i)
end
objects = objects..obj -- concatenate newly created obj variables with each pass
end
end
val = "DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_"
print(loop_input(val))
Ideally, I want to create a loop that, depending on the var variable, will plug in the captured variables between the pipe deliminators and then I can use them freely as I wish. When trying to troubleshoot with parenthesis around my four variables (like I have above), I receive the full list of four variables four times in a row. Now I'm having difficulty actually cycling through the input string and actually grabbing them out as the loop moves down the data string. I was thinking that using the pipes as a means to delineate variables from one another would help. Am I wrong? If it doesn't matter and I can keep the [/r/n]+ instead of each "|" then I am definitely all for that.
I've searched around and found some threads that I thought would help but I'm not sure if tables or splitting the inputs would be advisable. Like these threads:
Setting a variable in a for loop (with temporary variable) Lua
How do I make a dynamic variable name in Lua?
Most efficient way to parse a file in Lua
I'm fairly new to programming and trying to teach myself. So please excuse my beginner thread. I have both the "Lua Reference Manual" and "Programming in Lua" books in paperback which is how I've tried to mock my function(s) off of. But I'm having a problem making the connection.
I thank you all for any input or guidance you can offer!
Cheers.
Try this:
val = "DATA|0|4|7x5|1|25|174-24|7x5|1|17|TERW|7x5|1|9|08MN|7x5|1|1|_"
val = val .. "|"
data = val:match("DATA|0|%d+|(.*)$")
for fh,xpos,ypos,text in data:gmatch("(.-)|(.-)|(.-)|(.-)|") do
print(fh,xpos,ypos,text)
end
I have a variable that is created by a loop. The variable is large enough and in a complicated enough form that I want to save the variable each time it comes out of the loop with a different name.
PM25 is my variable. But I want to save it as PM25_year in which the year changes based on `str = fname(13:end)'
PM25 = permute(reshape(E',[c,r/nlay,nlay]),[2,1,3]); % Reshape and permute to achieve the right shape. Each face of the 3D should be one day
str = fname(13:end); % The year
% Third dimension is organized so that the data for each site is on a face
save('PM25_str', 'PM25_Daily_US.mat', '-append')
The str would be a year, like 2008. So the variable saved would be PM25_2008, then PM25_2009, etc. as it is created.
Defining new variables based on data isn't considered best practice, but you can store your data more efficiently using a cell array. You can store even a large, complicated variable like your PM25 variable within a single cell. Here's how you could go about doing it:
Place your PM25 data for each year into the cell array C using your loop:
for i = 1:numberOfYears
C{i} = PM25;
end
Resulting in something like this:
C = { PM25_2005, PM25_2006, PM25_2007 };
Now let's say you want to obtain your variable for the year 2006. This is easy (assuming you aren't skipping years). The first year of your data will correspond to position 1, the second year to position 2, etc. So to find the index of the year you want:
minYear = 2005;
yearDesired = 2006;
index = yearDesired - minYear + 1;
PM25_2006 = C{index};
You can do this using eval, but note that it's often not considered good practice. eval may be a security risk, as it allows user input to be executed as code. A better way to do this may be to use a cell array or an array of objects.
That said, I think this will do what you want:
for year = 2008:2014
eval(sprintf('PM25_%d = permute(reshape(E',[c,r/nlay,nlay]),[2,1,3]);',year));
save('PM25_Daily_US.mat',sprintf('PM25_%d',year),'-append');
end
I do not recommend to set variables like this since there is no way to track these variables and completely prevents all kind of error checking that MATLAB does beforehand. This kind of code is handled completely in runtime.
Anyway in case you have a really good reason for doing this I recommend that you use the function assignin for this.
assignin('caller', ['myvar',num2str(1)], 63);