How to split a Pandas dataframe into multiple csvs according to when the value of a column changes - python-3.x

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
[image: dataframe sample]
Basically, I need to group the data according to the color column (which takes values of 0, 0.5 and 1). However, I don't need an overall grouping (this is easy). I need it to create new dataframes every time the value changes. That is, I'd like a new dataframe for every set of rows that is preceded and followed by a run of at least 5 zeros (because single zeros are sometimes erroneously present in the chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.

First of all, you should understand the for loop well. Python makes it easy to use the code of any library inside functions and loops. Let's say you have a dataset and you want to walk through and control column a. Start the loop with "for index, row in dataset.iterrows():", and on the next line specify the criterion you want for each iteration, for example "if row[a] > 0.5:". Now, whenever the value is greater than 0.5, you can write the code needed to add all the data of the current row to a new dataset. For the sake of your own learning, I have not written ready-made code.
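For reference, here is a minimal pandas sketch of the splitting described in the question, using vectorized run labelling instead of an explicit row loop; the X, Y, Z, Color column names and the 5-zero separator come from the question, while the sample values are invented:

import pandas as pd

# A small, made-up frame in the X, Y, Z, Color layout from the question.
df = pd.DataFrame({
    "X": range(14), "Y": range(14), "Z": range(14),
    "Color": [0, 0, 0, 0, 0, 1, 0, 1, 0.5, 1, 0, 0, 0, 0],
})

# Flag rows that sit inside a run of at least 5 consecutive zeros: these act as gaps.
is_zero = df["Color"].eq(0)
run_id = is_zero.ne(is_zero.shift()).cumsum()        # label consecutive runs of equal values
run_len = is_zero.groupby(run_id).transform("size")  # length of the run each row belongs to
gap = is_zero & (run_len >= 5)                       # only long zero runs separate clusters

# Rows between gaps share a cluster id; short zero runs stay inside their cluster.
cluster_id = gap.ne(gap.shift()).cumsum()
keep = ~gap

# One dataframe (and one CSV) per cluster of interest.
for i, (_, chunk) in enumerate(df[keep].groupby(cluster_id[keep])):
    chunk.to_csv(f"cluster_{i}.csv", index=False)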

Related

identify when values in a column change in spotfire

I am trying to create a calculated column that flags/counts the changes in values across rows in another column, in Spotfire. Below is an example of the data types I'm looking at and the desired results.
My hope is that, for each Location and ordered along Time, I can identify when the value of "Color" changes and keep a running count, so that each cluster of similar values between changes is given the same label (Cluster Desire 1) within each Location. It would be best if the running count of clusters could restart at each Location, but this is not crucial. Any help would be more than appreciated!
I thought of a way to do it, relying on one intermediate column (I used two just to make it a bit clearer).
First: for each row, the concatenation of the Color values so far within its Location, called [concatString]:
Concatenate(Concatenate([Color]) over (Intersect([Location],AllPrevious([Time]))),', ')
Spotfire defaults to comma followed by space as a separator: I could not find a way of changing that in this kind of expression.
Then within each [concatString] I remove repeated values. The complication is that the last one did not have the comma+space, and I did not manage to make the regular expression I am using understand that. So my workaround was to add a final comma+space to [concatString]. Hence the extra Concatenate(..).
The formula for the column without repetitions, [consolidatString] is:
RXReplace([concatString],"(\\w+\,\\s)\\1+","$1","g")
Then what we have achieved is an individual value for each line we want to group. We can then simply rank [consolidatString] to achieve the desired column:
DenseRank([consolidatString],[Location])
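For comparison, the same change-point labelling can be sketched in pandas rather than Spotfire; the Location/Time/Color column names follow the question and the sample rows are invented:

import pandas as pd

# Invented data in the Location / Time / Color shape described in the question.
df = pd.DataFrame({
    "Location": ["A"] * 6 + ["B"] * 5,
    "Time":     [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5],
    "Color":    ["red", "red", "blue", "blue", "red", "red",
                 "blue", "blue", "blue", "red", "red"],
})

df = df.sort_values(["Location", "Time"])

# Within each Location, flag rows where Color differs from the previous row,
# then take a running count of those change points to label the clusters.
changed = df.groupby("Location")["Color"].transform(lambda s: s.ne(s.shift()))
df["Cluster"] = changed.astype(int).groupby(df["Location"]).cumsum()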

how to add columns with 'filled data' after filling missing values in pandas or python using different techniques?

How can I add columns with 'filled data' after filling missing values in pandas or Python, using different techniques such as various statistical or machine learning methods?
What I want to do is this: after filling the data, say with mean, median or standard deviation values, or with machine learning algorithms like KNN or XGBoost or some other technique, I want to append that particular column (or columns) at the end of the CSV or Excel file, not below the actual data but towards the right-hand side of the file.
For instance, once I've filled the missing data of a particular column using statistical and other ML techniques, I want to add those 'filled values' alongside the original values in a new column, named with the original column's name plus an underscore and the technique used to fill it, and placed at the right-hand end of the data. For example, if the column or feature is 'phone', then after filling the missing values the file should contain the whole set of original values plus the values calculated by statistical or ML means, in columns named like "phone_Mean", "phone_interpolation", 'phone_KNN' or 'phone_XGBoost'.
What I've done so far:
I've tried the approaches from the pandas documentation page and from Stack Overflow as well, the ones that are generally ranked highly and appear in the top 7-10 links on Google or DuckDuckGo, but all of them were in vain.
I've been facing this issue for the last few days and it is holding me back from convincing my client. So it would be a great help if you could assist me with a code example, using pandas or core Python, to support your answer.
Here's the snippet of the dataset. Let's say I'm applying techniques on a feature/column named 'phone':
One way is by making use of pandas, like:
df_01["phone_mean"] = df_01["phone"].fillna(df_01["phone"].mean())
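A slightly fuller sketch of that pattern, assuming a frame named df_01 with a 'phone' column (the sample values are invented): each technique fills only the gaps into its own suffixed column, so the original values stay untouched and the new columns land on the right-hand side of the output file.

import pandas as pd

# Invented stand-in for the real data; 'phone' has missing entries.
df_01 = pd.DataFrame({"phone": [100.0, None, 120.0, None, 150.0]})

# Original values are kept; only the gaps are replaced, one new column per technique.
df_01["phone_mean"] = df_01["phone"].fillna(df_01["phone"].mean())
df_01["phone_median"] = df_01["phone"].fillna(df_01["phone"].median())
df_01["phone_interpolation"] = df_01["phone"].interpolate()

# The new columns sit to the right of the originals in the output file.
df_01.to_csv("filled.csv", index=False)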

Is there a way to replace cells of a particular value with multiple cells?

Let's say I need to replace any cell that has a value of "outgoing" with multiple cells such as (0), (1), (0), (0), (2), in Excel. Is there a way to actually make this happen? I am doing this for a research project. Every item in my data needs to be coded on five different scales. There are 30 or so items that make up almost half of the data. It would be enormously helpful to be able to simply replace the high-frequency items with the five values at once.
I am not sure I completely understand the result you are looking for but here goes:
How about using the Find and Replace functionality to replace all instances of "outgoing" with "(0),(1),(0),(0),(2)", and then using the Text to Columns functionality to split the single column containing "(0),(1),(0),(0),(2)" into five separate columns, so that each value ends up in its own cell.
You would need to split based on a delimiter (probably ",") and you should do all your replacing before you start splitting. Obviously you should test on some sample data first - Find and Replace is not your friend if you are not certain about your data set.
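If it helps to see the same two steps outside Excel, here is a rough pandas analogue; the column name 'item' and the sample rows are invented:

import pandas as pd

# Invented data: one coded column where "outgoing" must expand to five scale values.
df = pd.DataFrame({"item": ["outgoing", "quiet", "outgoing"]})

# Step 1: the Find-and-Replace step, swapping the item for its five comma-separated codes.
codes = df["item"].replace({"outgoing": "(0),(1),(0),(0),(2)"})

# Step 2: the Text-to-Columns step, splitting on the delimiter into separate columns.
scales = codes.str.split(",", expand=True)
scales.columns = [f"scale_{i + 1}" for i in range(scales.shape[1])]

result = pd.concat([df, scales], axis=1)
print(result)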

Dynamic Range with Categorical Variables

I'd like to sort a time series of exam performance by one of three categories:
Ideally, a function would sort the scores by "difficulty" while still preserving chronological order. I'd like to do this without filters etc. Something like this is very close, but not quite there. Do I need to use dynamic ranges? Or can I just define data ranges in the table dialog with VLOOKUP or INDEX/MATCH?
I'm thinking a bar graph would be the easiest way to illustrate the data, but I'm open to suggestions. New scores are added every day, with varying difficulties.
Here is the spreadsheet if anyone would like to look it over.
EDIT:
The output visualization could be, for example, a clustered bar graph, but with only one label per category. The idea is that I'd like to preserve chronological order without necessarily having to mark it on the graph.
Would there, for instance, be a quick, easy and formula-driven way to put these 14 and 17 values for "score" all together under one label? I feel like 17 bar graphs clustered too closely would be hard to read.
I realize this is more of a formatting than a formula issue, but I appreciate input with regards to both.
I would recommend you add a Table over the data in the workbook: one for verbal and one for math. The upside is that it will automatically grow with your data as you add new rows. This is very helpful because charts and other things will automatically refer to the new data. Add one with CTRL+T or Insert->Table on the Ribbon.
Once you have the Table, you can easily do the sorting bit by adding a two column sort onto the Table. This menu is accessible by right clicking in the Table and doing Sort->Custom Sort. Again, the Table is nice here because it will only sort the data within it (not the whole sheet) and will remember your settings. This lets you add new data and simply do Data->Reapply to get it to sort again. Your sort on Difficulty is going to be alphabetic unless you add a number at the front. Here is the sorting step:
With this done, you can create a quick chart based on that data. For the "implicit chronology" you can simply plot score vs. difficulty for all of them since they are sorted.
To get closer to that matrix style display, you can easily create a PivotTable based on this Table and let it do the organizing by date/difficulty. Here is the result of that. I am using Average as the aggregation function since it appears that no dates have more than 1 score. If they did, it would be a better choice than Sum.
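For what it's worth, the sort-then-summarise steps look roughly like this in pandas; the Date/Difficulty/Score column names and the sample rows are assumptions based on the description above:

import pandas as pd

# Invented scores table mirroring the workbook described above.
scores = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]),
    "Difficulty": ["1-Easy", "2-Medium", "1-Easy", "3-Hard"],
    "Score": [85, 70, 90, 60],
})

# Two-column sort: difficulty first, with chronology preserved inside each difficulty.
ordered = scores.sort_values(["Difficulty", "Date"])

# Matrix-style view by date and difficulty, averaging in case a date ever has two scores.
matrix = ordered.pivot_table(index="Date", columns="Difficulty",
                             values="Score", aggfunc="mean")
print(matrix)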

Stata tab over entire dataset

In Stata is there any way to tabulate over the entire data set as opposed to just over one variable/column? This would give you the tabulation over all the columns.
Related - is there a way to find particular values in Stata if one does not know which column they occur in? The output would be which column and row they are located in or at least which column.
Stata does not use row and column terminology except with reference to matrices and vectors. It uses the terminology of observations and variables.
You could stack or reshape the entire dataset into one variable if and only if all variables are numeric or all are string. If that assumption is incorrect, then you would need to convert numeric variables to string, at least temporarily, before you could do that. I guess wildly that you are only interested in blocks of variables that are all either numeric or string.
When you say "tabulate" you may mean the tabulate command. That has limits on the number of rows and/or columns it can show that might bite, but with a small amount of work list could be used for a simple table with many more values.
tabm from tab_chi on SSC may be what you seek.
For searching across several variables, you could automate a loop.
I'd say that if this is a felt need, it is quite probable that you have the wrong data structure for at least some of what you want to do and should reshape. But further details might explode that.
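Since the thread started from a pandas question, a rough pandas analogue of the stack-then-tabulate idea (and of searching across variables for a value) might look like this; the frame and the searched value are invented:

import pandas as pd

# Invented mixed-type data standing in for the Stata dataset (cast to string so
# numeric and string variables can be pooled, as discussed above).
df = pd.DataFrame({"a": [1, 2, 2], "b": [2, 3, 3], "c": ["x", "y", "2"]})

# "Tabulate" over the entire dataset: stack every column into one long series, then count.
counts = df.astype(str).stack().value_counts()

# Find which observations (rows) and variables (columns) hold a particular value.
mask = df.astype(str).eq("2")
hits = mask.stack()
hits = hits[hits].index.tolist()   # [(row, column), ...] pairs where "2" occurs

print(counts)
print(hits)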
