Managing data sets in SPSS where multiple cases appear in one row - reshape

I'm working with a data set which has details on multiple people on one row. How I've dealt with this is to have variables like this:
P1Name P1Age P1Gender P1Ethnicity P2Name P2Age P2Gender... etc
This makes analysis very difficult. I have used multiple response variables, which are good for frequencies, but it's unwieldy, it takes time to write out the syntax (there are a lot of 'P's), and you can't do other analyses with it.
First of all, is there a way to run analyses as if all the name, age, gender and so on variables were in the same columns (if that makes sense)? All I can think of doing is pasting the data into Excel, cutting and pasting to get them all into the same columns, and then pasting back into SPSS. Any other ideas?
Or is this just a matter of having two datasets, one for the case details and one for the people details?
Any advice would be greatly appreciated!

Write the data out using SAVE TRANSLATE and then read it back in removing the P's, like this:
FILE HANDLE MyFile /NAME='/Users/rick/tmp/test.csv'.
DATA LIST FREE /p1x1 p1x2 p1x3 p2x1 p2x2 p2x3 p3x1 p3x2 p3x3.
BEGIN DATA.
1 2 3 1 2 3 1 2 3
END DATA.
LIST.
SAVE TRANSLATE
/OUTFILE=MyFile
/TYPE=CSV /ENCODING='UTF8' /REPLACE
/CELLS=VALUES.
DATA LIST FREE FILE=MyFile /x1 x2 x3.
LIST.
That should do it.
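For comparison outside SPSS, the same wide-to-long idea can be sketched in Python. This is only a minimal sketch, not the method used in the answer above: it assumes the columns follow the P<n><Field> naming pattern from the question (P1Name, P1Age, ...), and the function name wide_to_long_rows, the "person" key and the "case_id" column in the usage note are hypothetical.

```python
import re

# Assumed naming pattern from the question: P1Name, P1Age, P2Name, ...
PERSON_COL = re.compile(r"^P(\d+)(.+)$")


def wide_to_long_rows(rows):
    """Turn one-dict-per-case rows into one-dict-per-person rows.

    Columns that do not match the P<n><Field> pattern are treated as
    case-level details and copied onto every person row.
    """
    long_rows = []
    for row in rows:
        case_cols = {}
        people = {}  # person number -> {field name: value}
        for col, value in row.items():
            match = PERSON_COL.match(col)
            if match:
                number, field = int(match.group(1)), match.group(2)
                people.setdefault(number, {})[field] = value
            else:
                case_cols[col] = value
        for number in sorted(people):
            fields = people[number]
            # Skip person slots that are entirely empty for this case.
            if all(v in ("", None) for v in fields.values()):
                continue
            long_rows.append({**case_cols, "person": number, **fields})
    return long_rows
```

Called on a row such as {"case_id": 1, "P1Name": "Ann", "P1Age": 34, "P2Name": "Bob", "P2Age": 40}, this yields one dict per person with the keys case_id, person, Name and Age, which is the shape most analyses expect.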

Related

How to grep csv documents to APPEND info and keep the old data intact?

I have huge_database.csv like this:
name,phone,email,check_result,favourite_fruit
sam,64654664,sam#example.com,,
sam2,64654664,sam2#example.com,,
sam3,64654664,sam3#example.com,,
[...]
===============================================
then I have 3 email lists:
good_emails.txt
bad_emails.txt
likes_banana.txt
the contents of which are:
good_emails.txt:
sam#example.com
sam3#example.com
bad_emails.txt:
sam2#example.com
likes_banana.txt:
sam#example.com
sam2#example.com
===============================================
I want to do some grep, so that at the end the output will be like this:
sam,64654664,sam#example.com,y,banana
sam2,64654664,sam2#example.com,n,banana
sam3,64654664,sam3#example.com,y,
I don't mind doing it in multiple steps manually, or even with some complex procedure such as copy-pasting into multiple files. What matters to me is reliability and, most importantly, the ability to process very LARGE csv files with more than 1M lines.
It should also be noted that the lists I will "grep" to add data to some of the columns will, most of the time, affect at most 20% of the csv file's rows, meaning the remaining 80% must stay intact and, if possible, not even be displaced from their current order.
I would also like to note that I will be using a program called EmEditor rather than spreadsheet software like Excel, because of its speed and the fact that Excel simply cannot process large csv files.
How can this be done?
Will appreciate any help.
Thanks.
Googling, trial and error, grabbing my head from frustration.
In EmEditor, filter for the good emails first: open Advanced Filter, click Add Linked File (next to the Add button), add the good_emails.txt file, set it to the email column, and click Filter. Now only records with good emails are shown.
Select column 4 and type y. Now do the same with bad_emails.txt and type n in column 4. Follow the same steps with likes_banana.txt and set the last column to the correct string (banana).
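If a scripted route is acceptable alongside EmEditor, the same filter-and-fill steps can be sketched in Python with the standard csv module. This is a hedged sketch under stated assumptions: the column names come from the header shown in the question, the function names load_emails and annotate are hypothetical, and every row is assumed to have the same number of fields as the header. The file is streamed row by row, so row order is kept and rows not in any list are written back unchanged.

```python
import csv


def load_emails(path):
    """Read one email per line into a set for fast membership tests."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def annotate(src, dst, good, bad, banana):
    """Copy src to dst, filling check_result and favourite_fruit.

    good, bad and banana are sets of email strings. Rows whose email is
    in none of them are copied as they are, in their original order.
    """
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                lineterminator="\n")
        writer.writeheader()
        for row in reader:
            email = row["email"]
            if email in good:
                row["check_result"] = "y"
            elif email in bad:
                row["check_result"] = "n"
            if email in banana:
                row["favourite_fruit"] = "banana"
            writer.writerow(row)
```

A call might look like annotate("huge_database.csv", "annotated.csv", load_emails("good_emails.txt"), load_emails("bad_emails.txt"), load_emails("likes_banana.txt")), where annotated.csv is a hypothetical output name. Writing to a new file rather than overwriting the original keeps the old data intact if something goes wrong.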

How to split a Pandas dataframe into multiple csvs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
dataframe sample
Basically, I need to group the data according to the color column (which takes values of 0,0.5 and 1). However, I don't need an overall grouping (this is easy). I need it to create new dataframes every time the value changes. That is, I'd like a new dataframe for every set of rows that are followed by and preceded by 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should be comfortable with the for loop. Python makes it easy to use code from any library inside functions and loops. Let's say you have a dataset and you want to walk through it and check column a. Start the loop with "for _, row in dataset.iterrows():" (iterating over a DataFrame directly with "for i in dataset:" yields the column names, not the rows). On the next line, state the criterion you want with "if row['a'] > 0.5:". Now, whenever the value is greater than 0.5, you can write the code needed to add all the data of the current row to a new dataset. I deliberately did not write ready-made code, so that you can work it out as an exercise.
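The segmentation the question describes (a new chunk whenever at least 5 consecutive zeros occur, while shorter runs of zeros stay inside the chunk) can be sketched as follows. This is a minimal sketch working on a plain list of color values rather than on the DataFrame itself; the function name split_on_zero_runs and the min_gap parameter are hypothetical.

```python
def split_on_zero_runs(colors, min_gap=5):
    """Return (start, end) index pairs, end exclusive, one per segment.

    Segments are separated by runs of at least min_gap zeros. Shorter
    runs of zeros are kept inside the surrounding segment, and leading
    or trailing zeros are trimmed from every segment.
    """
    segments = []
    start = None          # index where the current segment began
    last_nonzero = None   # index of the latest non-zero value seen
    zeros = 0             # length of the current run of zeros
    for i, color in enumerate(colors):
        if color != 0:
            if start is None:
                start = i
            last_nonzero = i
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= min_gap:
                # The gap is long enough: close the current segment.
                segments.append((start, last_nonzero + 1))
                start = None
                zeros = 0
    if start is not None:
        segments.append((start, last_nonzero + 1))
    return segments
```

With pandas, assuming the column is named Color as in the question, the pairs could be computed from df["Color"].tolist(), and each segment taken with df.iloc[start:end] and written out with to_csv.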

Fetching data from one sheet, to produce organised summary and display it in sections

I am trying to generate a summary page for a list of lessons from a different sheet.
I'm currently using the formula =UNIQUE(FILTER('Lessons NEW'!$E2:$E1009,(RIGHT(LEFT('Lessons NEW'!$E2:$E1009,5),1)="1")+(LEN('Lessons NEW'!$E2:$E1009)=3))) to do so.
This is displaying my list like so, with the code column being the only really important one, as the rest could be fetched from its result.
This works, but there are two features I want that I haven't found a way to implement:
1. Split the output into groups. I am after a title for each group/section, and a gap between them too, as in the screenshot here.
2. Arrange it to display in multiple columns (for example, half the results in column B and half in column G). In the process, I'd prefer that the resulting sections (as in point 1) aren't broken up, and are kept together instead of being split between columns.
I'm not sure if what I'm asking is too much, or very much doable, but keen for suggestions or ideas if there is a way.
Thanks in advance!
EDIT:
I've updated the formula (above) and added a title to the source column that it's fetching from. It's now producing this.
What I want it to do is break the list up further, for aesthetics and for easy separation when others are looking at it, and make the title row of each section bold. (I think I can work out the conditional formatting for the title row...)
This is what I want it to end up looking like.
Google drive link to demo sheet: https://docs.google.com/spreadsheets/d/1yx9LWeV7RHfmlldUpdUZjaU8eVdOsUVaeFrfoBypeDs/edit?usp=sharing

Excel design consideration - infrequent change and impact for VBA/formula/add-on

Nothing is stuck or broken; I was just inspired by a discussion with another Excel author.
His situation:
He reads from an existing Excel monster file (column FG) and has hard-coded the following:
Range("FF:FG").Copy
Potential issue:
Data in FF:FG will be pushed to GF:GG every couple of months because newer columns will be inserted in between. (It's a pivot-like design. Sorry, but the end users need this appearance; the categories keep increasing, and the summary needs to be at the right-hand end.)
He has 2 other choices (if he doesn't want to maintain the VBA code every few months):
A: Store "FF","FG" in a cell (fixed location!), then read the location parameter using VBA.
B: Read a second, dedicated CSV file (copy/pasted from the monster file and consumed by another user, so readily available); it has only the 2 columns required.
To me, none is obviously better than the other, just a matter of preference.
A similar but simpler scenario of mine
I produce the monster file with lots of VLOOKUPs from manual data sources (I inherited the design, and I have refactored it using another automation tool, but there is a licensing consideration at the moment).
In a column there is a formula doing something like
=IF(A1="SALES PERSON SICK","void result",IF(A1="MACHINE BROKEN",C2*0.8,""))
say 5000 rows with this formula
To reduce hard-coding I moved
"SALES PERSON SICK","MACHINE BROKEN" to a reference sheet cell A1,A2, and changed formula to:
=IF(A1=Ref!$A$1,"void result",IF(A1=Ref!$A$2,C2*0.8,""))
I feel it's a good practice.
Question: Is method A or B better? Considering that the column position will move every ~3-6 months, is it still worth choosing one of A/B?
"data in FF:FG will be pushed to GF:GG every couple of months because newer columns will be inserted in between"
Then you should use named ranges in what you call the "monster file" (see Define and use names in formulas) and use them in your VBA.
E.g. define a name such as CopySource for columns FF:FG (use a name that describes the data in those columns), and then use that name in your VBA code:
Range("CopySource").Copy
Whenever the range moves because new columns are inserted before it, the named range moves too, so it still points to the same data.

Extracting data from series of excel files

I have a series of Excel files with data for multiple cities and sensors in a single worksheet. What would be the fastest and most accurate way to extract the data for a given city X and sensor Y?
In my case, I need only the data from the sensor "umidade_solo_nivel1".
I could do it with a pivot table, but it would take a lot of time, since there are 66 cities and a different Excel worksheet for each month of the year.
Following is an image of the worksheet that helps you understand how the data is organized.
Thank you in advance for any help you can provide.
One solution is to set up your file to include "indicators" and the =INDIRECT() function. Using this, you can set up a series of indicators (in separate rows) which include the sheet name, row, and column of data you are looking for. This can be somewhat time-intensive to set up at first, but can definitely pay off in the long run, especially if your data are periodically updated or if more data is entered. All it takes is a little creativity.
If you are not familiar with the function, you could check out this website:
http://www.contextures.com/xlFunctions05.html
As David Lee mentioned in your comments, VBA or other languages like Python or C# could also be of use. As we cannot fully understand from your description what you need specifically or what all of your data look like, I would recommend learning the function I mentioned and evaluating whether it would be helpful in this context before learning another language like VBA (though I highly recommend learning VBA if you anticipate spending lots of time in Excel).

Resources