GNU Octave/Matlab Matrices Manipulation - struct

I'm quite new to GNU Octave, so can anyone help me with two things:
(1) How can I filter that huge dataset in such a way that it will only contain [1x1 struct] persons?
(2) Inside that value of struct, I only want to retain combined_categories. How can I delete the others?
Basically, my end goal is to have a dataset with 2 columns only (filename and combined_categories of the filtered 1x1 structs). And if I can convert that to csv, that would be more awesome.

Regarding your first question, how to filter a struct: the first step is to create a logical vector that decides which elements to keep and which to delete:
%Get the data for the relevant field
persons={test.person}
%For each field, check if the size is 1
one_person=cellfun(@numel,persons)==1
%Select those you want
test=test(one_person)
About your second question, please check the documentation for rmfield.
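Putting both parts together, a minimal sketch (assuming the struct array is called test and using the field names filename and combined_categories from the question) might look like:

```matlab
% Keep only the filename and combined_categories fields
keep = {'filename', 'combined_categories'};
test = rmfield(test, setdiff(fieldnames(test), keep));

% Write the result to a CSV file, one row per struct element
fid = fopen('output.csv', 'w');
fprintf(fid, 'filename,combined_categories\n');
for k = 1:numel(test)
    fprintf(fid, '%s,%s\n', test(k).filename, test(k).combined_categories);
end
fclose(fid);
```

This assumes both remaining fields are plain strings; if combined_categories is itself a cell array or struct, you would need to flatten it first.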

Related

How to split a Pandas dataframe into multiple csvs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
(image: dataframe sample)
Basically, I need to group the data according to the color column (which takes values of 0, 0.5 and 1). However, I don't need an overall grouping (this is easy). I need it to create new dataframes every time the value changes. That is, I'd like a new dataframe for every set of rows that is preceded and followed by 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should understand the for loop well. Python makes it easy to combine library code with functions and loops. Say you have a dataset and you want to inspect column a: start the loop with "for i in dataset:", then on the line below test the criterion you want, e.g. "if i[a] > 0.5:". Whenever the value is greater than 0.5, you can write the code needed to add that row's data to a new dataset. For the sake of learning, I have deliberately not written ready-made code.
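The splitting the question describes — a new dataframe every time the value changes — can also be sketched with pandas using shift() and cumsum() (a minimal sketch with made-up data; extending it to require runs of five zeros would mean filtering the resulting chunks by length as well):

```python
import pandas as pd

# Minimal sketch: start a new group every time the value in "color" changes
df = pd.DataFrame({"color": [0, 0, 0.5, 0.5, 1, 1, 0, 0.5]})

# shift() compares each row with the previous one; cumsum() turns the
# change points into consecutive group ids
group_id = (df["color"] != df["color"].shift()).cumsum()
chunks = [g for _, g in df.groupby(group_id)]

# Drop the all-zero chunks, keeping only the 0.5 and 1 segments
segments = [c for c in chunks if (c["color"] != 0).any()]
print(len(chunks), len(segments))  # 5 chunks, 3 non-zero segments
```

Each element of segments is a dataframe that keeps the original row indices, so the clusters can be mapped back to the point cloud for visualization.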

Replacing numeric values in Excel sheet with text values from other sheet

I am using SurveyMonkey for a questionnaire. Most of my data has a regular scale from 0-6, plus an "Other" option that people can use if they choose not to answer the item. However, when I download the data, SurveyMonkey automatically assigns a value of 0 to that non-answer category, and it appears this can't be changed.
This means I can't tell whether a zero in my numeric dataset actually means zero or a participant choosing not to answer the question. I can only figure that out by looking at another file that contains the labels of participants' answers (all answers are given as the corresponding labels, so this data file misses all non-labeled answers...).
This leads me to my problem: I have two Excel files of the same size. I need a way to find certain values in one dataset (text values, scattered randomly over the dataset) and replace the corresponding numeric values in the other dataset (at the same positions) with those values.
I thought it would just be possible to find all values and copy paste in the same pattern, but I cannot seem to find a way to do that. I feel like I am missing an obvious solution, but after searching for quite a while I really could not find an answer to my specific question.
I have never worked with macros or more advanced excel programming before, but have a bit of knowledge about programming in itself. I hope I explained this well, I would be very thankful for any suggestions or scripts that could help me out here!
Thank you!
Alex
I don't know how your Excel file is organised, but if it's like the legacy Condensed format, all you should need to do is select the column corresponding to a given question (if that's what you have), and search-and-replace all 0s (matching the entire cell) with the text you want.
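If the spreadsheet approach gets unwieldy, the cell-by-cell replacement can also be sketched in Python with pandas (a minimal sketch with made-up data; the column names and the "Other" marker are assumptions, not taken from the actual files):

```python
import pandas as pd

# Minimal sketch with made-up data: "numeric" stands in for the numeric
# export, "labels" for the label export of the same shape. Wherever the
# label sheet says "Other", overwrite the numeric value with that text.
numeric = pd.DataFrame({"q1": [0, 3, 0], "q2": [6, 0, 2]})
labels = pd.DataFrame({"q1": ["0", "3", "Other"], "q2": ["6", "Other", "2"]})

fixed = numeric.astype(object).mask(labels == "Other", "Other")
print(fixed)
```

With the real files you would read both sheets with pd.read_excel (the identical shape is essential, since mask matches positions) and write the result back out with to_excel.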

Stata tab over entire dataset

In Stata is there any way to tabulate over the entire data set as opposed to just over one variable/column? This would give you the tabulation over all the columns.
Related - is there a way to find particular values in Stata if one does not know which column they occur in? The output would be which column and row they are located in or at least which column.
Stata does not use row and column terminology except with reference to matrices and vectors. It uses the terminology of observations and variables.
You could stack or reshape the entire dataset into one variable if and only if all variables are numeric or all are string. If that assumption is incorrect, then you would need to convert numeric variables to string, at least temporarily, before you could do that. I guess wildly that you are only interested in blocks of variables that are all either numeric or string.
When you say "tabulate" you may mean the tabulate command. That has limits on the number of rows and/or columns it can show that might bite, but with a small amount of work list could be used for a simple table with many more values.
tabm from tab_chi on SSC may be what you seek.
For searching across several variables, you could automate a loop.
I'd say that if this is a felt need, it is quite probable that you have the wrong data structure for at least some of what you want to do and should reshape. But further details might explode that.
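For the searching part, a loop over all variables might be sketched like this (an untested sketch; "target" is a placeholder for the value you are looking for, and this version only checks string variables):

```stata
* Report which string variables contain "target", and list the matches
foreach v of varlist _all {
    capture confirm string variable `v'
    if !_rc {
        quietly count if `v' == "target"
        if r(N) > 0 {
            display "`v'"
            list `v' if `v' == "target"
        }
    }
}
```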

SAS: Match single word within string values of a single variable then replace entire string value with a blank

I'm working in SAS 9.2, in an existing dataset. I need a simple way to match a single word within the string values of a single variable, and then replace the entire string value with a blank. I don't have experience with SQL, macros, etc., and I'm hoping for a way to do this (even if the code is less efficient) that will be clear to a novice.
Specifically, I need to remove the entire string containing the word "growth" in a variable "pathogen." Sample values include "No growth during two days", "no growth," "growth did not occur," etc. I cannot enter all possible strings since I don't yet know how they will vary (we have only entered a few observations so far).
TRANWRD and TRANSLATE will not work, as they will not allow me to replace an entire phrase when the target word is only part of the string.
Other methods I've looked at (for example, a SESUG paper using PRX at http://analytics.ncsu.edu/sesug/2007/CC06.pdf) appear to remove all instances of the target string in every variable in the dataset, instead of just in the variable of interest.
Obviously I could subset the dataset to a single variable before I perform one of these actions and then merge back, but I'm hoping for something less complicated. Although I will certainly give something more complicated a shot if someone can provide me with sample code to adapt (and it would be greatly appreciated).
Thanks in advance--Kim
Could you be a little more clear on how the data set is constructed? I think mjsqu's solution will work if your variable pathogen is stored sentence by sentence. If not, then I would say your best bet is to parse the blocks into sentences and then apply mjsqu's solution.
DATA dataset1;
    format Ref best1. pathogen $40.;
    input Ref pathogen $40.;
    datalines;
1 No growth during two days
2 no growth,
3 growth did not occur,
4 does not have the word
;
RUN;

* Blank out pathogen wherever it contains the word "growth" (case-insensitive);
DATA dataout;
    SET dataset1;
    IF index(lowcase(pathogen), "growth") THEN pathogen = "";
RUN;

Search String in Cell Efficient Way

It's my first post here, so please bear with me :-).
Problem Background:
I've multiple text files of the form:
<ticker>,<date>,<open>,<high>,<low>,<close>,<vol>
A,20120904 0926,37.14,37.14,37.14,37.14,693
.
.
.
ZZ,20120904 1602,1.6,1.6,1.6,1.6,11771
As you might have guessed it's stock ticks. When I load it to matlab, it creates a structure with an array (of the numerical values) and a cell (for the strings) which is fine at this point as I can work with it.
Problem:
I'd like to find the most efficient way to search the array for a specific symbol (~70K lines). While it's easy to do a naive or binary search, I don't think these approaches are very useful for multiple files and/or multiple searches to extract the beginning and end indices of a given symbol/string.
I've looked into past posts here and read about Rabin-Karp, Bitap and hash tables, but I'm not sure any of them fully answers my needs.
So far, I've been leaning towards running through the cell array once and creating a hash table for each letter (i.e. 'A', 'B', etc.) and then running a naive search, or anything else you might suggest :-). The reason for hashing is that I might use the same file to look up different stock symbols, so I think running through it once and labeling letters will reduce the complexity in the long run.
What are your thoughts on the matter? Am I in the right direction?
I'm using matlab btw.
Thank you
You can store all your tickers in a struct array, with each column stored as a field. Assuming you have non-empty values, you can do the following:
tickers = {S.tickers};
dates = [S.date];
You can then easily run queries to get the indices you want from your struct array S. You can go further and index the records by ticker name, by creating an index with ticker names as keys.
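That last idea — a one-pass index keyed by ticker name — might be sketched with containers.Map (an untested sketch; the variable name tickers is taken from the answer above, everything else is an assumption):

```matlab
% Build a one-pass index: ticker symbol -> row indices in the file
idx = containers.Map('KeyType', 'char', 'ValueType', 'any');
for k = 1:numel(tickers)
    t = tickers{k};
    if isKey(idx, t)
        idx(t) = [idx(t), k];    % append this row to the existing list
    else
        idx(t) = k;              % first occurrence of this ticker
    end
end

% All row indices for ticker 'A', in one average-O(1) lookup
rows = idx('A');
```

After the single pass, every subsequent symbol lookup on the same file is a hash-table access rather than a scan, which matches the "run through it once" plan in the question.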
