Giving custom variable to `hue` in sns.pairplot (Seaborn) - python-3.x

I have the air quality (link here) dataset, which contains missing values. I imputed them after creating a dummy dataframe (using df.isnull()) to keep track of which values were missing.
My goal is to generate a pairplot using seaborn (or any simpler method, if one exists) that gives a different color to the imputed values.
This is easy in matplotlib, where the parameter c of plt.scatter can be assigned a list of values and the points are colored accordingly (but then I can only plot one column against another, not a full pairplot). A possible solution is to iteratively create subplots for each pair of columns (which can make the code quite complicated!).
However, in Seaborn (which already has a built-in pairplot function) you are supposed to provide hue='column-name', which is not possible here because the missingness is stored in the dummy dataframe, and I would need to retrieve the corresponding columns from it for color coding.
Please let me know how I can accomplish this in the simplest manner possible.
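One possible approach, sketched here under assumptions rather than taken from an accepted answer, is to collapse the dummy dataframe into a single flag column and pass that column name to hue. The toy data, the mean imputation, and the names add_imputed_flag, plot_pairs and "imputed" are all hypothetical stand-ins for illustration.

```python
import numpy as np
import pandas as pd


def add_imputed_flag(df_imputed, missing_mask, flag_name="imputed"):
    """Return a copy of df_imputed with a column marking rows that had any missing value."""
    flagged = df_imputed.copy()
    row_had_missing = missing_mask.any(axis=1)
    flagged[flag_name] = row_had_missing.map({True: "imputed", False: "observed"})
    return flagged


def plot_pairs(flagged, flag_name="imputed"):
    """Draw the pairplot; seaborn is imported here so the rest runs without it."""
    import seaborn as sns
    return sns.pairplot(flagged, hue=flag_name)


# Toy stand-in for the air quality data (hypothetical values).
raw = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, 18.0],
    "Solar.R": [190.0, 118.0, np.nan, 313.0],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})
missing_mask = raw.isnull()        # the "dummy" dataframe from the question
imputed = raw.fillna(raw.mean())   # mean imputation, only for illustration
flagged = add_imputed_flag(imputed, missing_mask)
print(flagged)
```

Calling plot_pairs(flagged) would then color the points by the flag. Note that hue works per row, so a row is marked as imputed in every panel even if only one of its columns was filled in.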

Related

How to split a Pandas dataframe into multiple csvs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
dataframe sample
Basically, I need to group the data according to the color column (which takes the values 0, 0.5 and 1). However, I don't need an overall grouping (this is easy). I need to create a new dataframe every time the value changes. That is, I'd like a new dataframe for every set of rows that is preceded and followed by 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should understand the for loop well. Python lets you use code from any library inside functions and loops. Let's say you have a dataset and you want to walk through it and check column a. Start the loop with "for _, row in dataset.iterrows():" (a plain "for i in dataset:" on a pandas dataframe iterates over column names, not rows). On the next line, state the criterion you want, for example "if row[a] > 0.5:". If the value is greater than 0.5, you can then write the code needed to add all the data of the current row to a new dataset. I have deliberately not written ready-made code, so that you can work it out yourself.
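As an alternative to an explicit loop, the splitting rule from the question can be sketched with vectorized pandas operations. This is a minimal sketch under assumptions: the function name split_on_zero_runs, the column name "Color", and the sample values are hypothetical, and "5 zeros" is read as "a run of at least 5 consecutive zeros".

```python
import pandas as pd


def split_on_zero_runs(df, column="Color", min_zeros=5):
    """Split df into chunks separated by runs of at least min_zeros zeros."""
    is_zero = df[column].eq(0)
    # Give every run of consecutive equal flags its own id.
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = run_id.map(run_id.value_counts())
    # Only long zero runs act as separators; short ones stay inside a chunk.
    separator = is_zero & run_len.ge(min_zeros)
    # Consecutive rows with the same separator flag share a block id.
    block_id = separator.ne(separator.shift()).cumsum()
    kept = df[~separator]
    return [chunk for _, chunk in kept.groupby(block_id[~separator])]


# Hypothetical sample: two clusters, the first containing a stray single zero.
colors = [0] * 5 + [1, 1, 0, 0.5] + [0] * 5 + [0.5, 0.5] + [0] * 6
points = pd.DataFrame({
    "X": range(len(colors)),
    "Y": range(len(colors)),
    "Z": range(len(colors)),
    "Color": colors,
})
chunks = split_on_zero_runs(points)
print([len(chunk) for chunk in chunks])
```

Each chunk is an ordinary dataframe, so it could be written out with chunk.to_csv using whatever file naming scheme suits the project.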

Alternatives to interpolate three dimensional data

I have a table that shows me a chemical concentration value based on temperature, pH and ammonia. The way I measure these variables, the ammonia level is always one of six values (along the top of the table), so it works as a categorical variable.
I need a way to interpolate on this table, based on these 3 variables. I tried using a combination of INDEX and MATCH, but I was not able to achieve what I wanted. Then I thought of "dividing" the table in intervals to "reduce" one variable and use an IF function to select which interval to interpolate based on the third variable (I was thinking pH or Ammonia), but I can't figure out a way to change intervals dynamically like this.
Can anyone think of an alternative to accomplish what I'm trying to do? If possible I would like to avoid using VBA, but if there is no other way I have no problem using it.
Thank you for the help!
I'm attaching an example of the table below.
Assuming that pH is in column A:
=INDEX(A:H;MATCH(6,8;A:A;0)+MATCH(25;B:B;0)-2;MATCH(2;2:2;0))
Here the -2 needs to be changed to the number of rows BEFORE the first 22 in Temp.
This also assumes that the pattern of 22;25;28 in Temp is the same for every pH.

Paraview : grid interpolation and merging data

I have a paraview multiblock dataset containing blocks holding two different vtk UnstructuredGrids. I want to interpolate data from a grid to another and handle them simultaneously.
Here is what I do:
1. I use the Extract Block filter twice to separate the data from the two blocks (please note that the data are still of the "multiblock" type, as seen in the Information tab).
2. Using the Resample With Dataset filter, I'm able to interpolate the data held on block 2 (coarse grid) onto the grid of block 1 (finer grid).
My issue comes at step 3:
3. I'd like to use the Append Attributes filter to handle the data of block 1 and the data interpolated from block 2 simultaneously, but this filter is not available.
If the two datasets come from two separate UnstructuredGrids (no multi-block) structures, the Append Attributes is available and I can do what I want.
To circumvent this behavior, I have to apply the Merge Blocks filter after step 1. Note that the output of this last filter is no longer of "multiblock" type but is now of "UnstructuredGrid" type.
This is tricky and not intuitive. Could someone explain the rationale behind it?
You do not need Append Attributes to get both sets of data. Just check the "Pass Point Data" and "Pass Cell Data" checkboxes in the Resample With Dataset filter.
As for why the Append Attributes filter is not available in your case, there can be different reasons. If you are using ParaView 5.8.0, it can tell you why.
Just hover over the grayed-out filter in Filters -> Alphabetical; the reason will be written in the status bar.

Fails to display certain columns data in Matplotlib

Given a dataframe as follows:
date,unit_value,unit_value_cumulative,daily_growth_rate
2019/1/29,1.0139,1.0139,0.22
2019/1/30,1.0057,1.0057,-0.81
2019/1/31,1.0122,1.0122,0.65
2019/2/1,1.0286,1.0286,1.62
2019/2/11,1.0446,1.0446,1.56
2019/2/12,1.0511,1.0511,0.62
2019/2/13,1.0757,1.0757,2.34
2019/2/14,1.0763,1.0763,0.06
2019/2/15,1.0554,1.0554,-1.94
2019/2/18,1.0949,1.0949,3.74
2019/2/19,1.0958,1.0958,0.08
I have used the code below to plot them, but as you can see from the output image, one column doesn't display on the plot.
df.plot(x='date', y=['unit_value', 'unit_value_cumulative', 'daily_growth_rate'], kind="line")
Output:
To plot unit_value only, I use: df.plot(x='date', y=['unit_value'], kind="line")
Out:
Could anyone help figure out why it doesn't work when I plot three columns on the same plot? Thanks.
I just reproduced your results and it actually works fine. In your case the values of the columns "unit_value" and "unit_value_cumulative" are identical, which is why you only see the one drawn in front.
Besides this problem, your current data suggests that you made a mistake when calculating the cumulative values.
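The explanation above can be checked directly. The sketch below uses the first four rows of the data from the question; the helper names identical_column_pairs and plot_with_distinct_styles are hypothetical, and the line styles are just one way to keep overlapping lines visible.

```python
import pandas as pd

# First four rows of the data from the question.
df = pd.DataFrame({
    "date": ["2019/1/29", "2019/1/30", "2019/1/31", "2019/2/1"],
    "unit_value": [1.0139, 1.0057, 1.0122, 1.0286],
    "unit_value_cumulative": [1.0139, 1.0057, 1.0122, 1.0286],
    "daily_growth_rate": [0.22, -0.81, 0.65, 1.62],
})
value_columns = ["unit_value", "unit_value_cumulative", "daily_growth_rate"]


def identical_column_pairs(frame, columns):
    """Return the pairs of columns whose values are exactly equal."""
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            if frame[first].equals(frame[second]):
                pairs.append((first, second))
    return pairs


def plot_with_distinct_styles(frame, columns):
    """Plot all columns with different line styles so overlapping lines stay visible."""
    return frame.plot(x="date", y=columns, kind="line", style=["-", "--", ":"])


overlapping = identical_column_pairs(df, value_columns)
print(overlapping)
```

With a solid line for one column and a dashed line for the other, both identical series remain distinguishable when plot_with_distinct_styles(df, value_columns) is called.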

How to display data from two columns in chartify heatmap?

Using the example from the documentation, the heatmap is built and displays the total_price in each cell. I want to add data from another column, e.g. 'fruit' to be displayed below the total_price in each cell. How do I do that?
Adding screenshot of where, ideally, the data would be displayed:
import chartify

# Generate example data
data = chartify.examples.example_data()
average_price_by_fruit_and_country = (data.groupby(
    ['fruit', 'country'])['total_price'].mean().reset_index())

# Plot the data
(chartify.Chart(
    blank_labels=True,
    x_axis_type='categorical',
    y_axis_type='categorical')
 .plot.heatmap(
    data_frame=average_price_by_fruit_and_country,
    x_column='fruit',
    y_column='country',
    color_column='total_price',
    text_column='total_price',
    text_color='white')
 .axes.set_xaxis_label('Fruit')
 .axes.set_yaxis_label('Country')
 .set_title('Heatmap')
 .set_subtitle("Plot numeric value grouped by two categorical values")
 .show('png'))
Unfortunately there's no easy solution at the moment, but I'll add an issue to make this use case easier to solve in the future.
You can access the Bokeh figure from ch.figure, then use Bokeh's text plot to achieve what you're looking for. Take a look at the source code for an example here: https://github.com/spotify/chartify/blob/master/chartify/_core/plot.py#L26
