How would you create a dataframe using conditions? - subset

I am trying to create a subset data frame that will include only 1 column with the following conditions:
Trailing PE < 100

In Python:
have[have["<column used to filter>"] < 100]["<column you need>"]
Assuming the column you need is Trailing_PE and the filter is Trailing_PE < 100, you get the following:
have[have["Trailing_PE"] < 100]["Trailing_PE"]
Please read more about subsetting in the pandas documentation.
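For a self-contained illustration, here is a minimal sketch with a made-up have DataFrame (only the column name Trailing_PE comes from the question):
import pandas as pd

# Hypothetical data, for illustration only
have = pd.DataFrame({"Ticker": ["AAA", "BBB", "CCC"],
                     "Trailing_PE": [35.2, 150.7, 82.1]})

# Boolean mask keeps the rows with Trailing_PE < 100, then select the single column
want = have[have["Trailing_PE"] < 100]["Trailing_PE"]
print(want)  # only the rows for AAA and CCC remain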

Related

Python data source - first two columns disappear

I have started using PowerBI and am using Python as a data source with the code below. The source data can be downloaded from here (it's about 700 megabytes). The data is originally from here (contained in IOT_2019_pxp.zip).
import pandas as pd
import numpy as np
import os

path = '/path/to/file'
to_chunk = pd.read_csv(os.path.join(path, 'A.txt'), delimiter='\t', header=[0, 1], index_col=[0, 1],
                       iterator=True, chunksize=1000)

def chunker(to_chunk):
    to_concat = []
    for chunk in to_chunk:
        try:
            to_concat.append(chunk['BG'].loc['BG'])
        except:
            pass
    return to_concat

A = pd.concat(chunker(to_chunk))
I = np.identity(A.shape[0])
L = pd.DataFrame(np.linalg.inv(I - A), index=A.index, columns=A.columns)
The code simply:
Loads the file A.txt, which is a symmetrical matrix. This matrix has every sector in every region for both rows and columns. In pandas, these form a MultiIndex.
Filters just the region that I need, which is BG. Since it's a symmetrical matrix, both rows and columns are filtered.
Calculates the inverse of the matrix, giving us L, which is what I want to load into PowerBI. This matrix now has just a single regular Index for sector.
This is all well and good; however, when I load it into PowerBI, the first column (the sector names for each row, i.e. the DataFrame index) disappears. When the query gets processed, it is as if it were never there. This is true for both dataframes A and L, so it's not an issue of data processing. The column of row names (the DataFrame index) is still there in Python; PowerBI just drops it for some reason.
I need this column so that I can link these tables to other tables in my data model. Any ideas on how to keep it from disappearing at load time?
For what it's worth, calling reset_index() moved the index into regular columns, and those then loaded correctly. For whatever reason, PBI does not properly load pandas indices.
For a regular 1D index, I had to do S.reset_index().
For a MultiIndex, I had to do L.reset_index(inplace=True).
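As a rough sketch of that workaround (the frames below are made up for illustration; only the reset_index calls mirror the fix described above):
import pandas as pd

# Hypothetical stand-ins for the real S (regular 1D index) and L (MultiIndex)
S = pd.DataFrame({"value": [1.0, 2.0]},
                 index=pd.Index(["a", "b"], name="sector"))
L = pd.DataFrame({"value": [1.0, 2.0]},
                 index=pd.MultiIndex.from_tuples([("BG", "a"), ("BG", "b")],
                                                 names=["region", "sector"]))

# Turn the index into ordinary columns so PowerBI no longer drops it
S = S.reset_index()          # returns a new frame with 'sector' as a column
L.reset_index(inplace=True)  # modifies L in place, adding 'region' and 'sector' columns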

Python .iloc error trying to input values into dataframe

From my df, which has a huge number of rows, I am attempting to manually enter values for some of the NaNs. My code is below:
pamap2_df["heartrate"].iloc[0:4] = 100
It does the task; however, it also throws this back at me:
C:\Users\the-e\anaconda3\lib\site-packages\pandas\core\indexing.py:1637:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-
docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
df['some_col'].iloc[:4] is so-called chained indexing, and its behaviour is unpredictable. I recommend reading the link in the warning for the details.
For updating the data, it is recommended that you use a single .loc or .iloc for both row and column:
col_idx = pamap2_df.columns.get_loc('heartrate')
pamap2_df.iloc[:4, col_idx] = 100
or:
idx = pamap2_df.index[:4]
pamap2_df.loc[idx, 'heartrate'] = 100
Note that the warning might still appear if your pamap2_df is itself a slice of another dataframe. For example:
pamap2_df = df[df['Age'] < some_threshold]
idx = pamap2_df.index[:4]
# this will raise a warning / failure
pamap2_df.loc[idx, 'heartrate'] = 100
# this works
df.loc[idx, 'heartrate'] = 100
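Putting this together, here is a minimal self-contained sketch (the data is invented; only the column name heartrate comes from the question):
import numpy as np
import pandas as pd

# Toy frame with missing heart-rate values
pamap2_df = pd.DataFrame({"heartrate": [np.nan, np.nan, np.nan, np.nan, 90.0],
                          "activity": [1, 1, 2, 2, 2]})

# Single .loc call: no chained indexing, so no SettingWithCopyWarning
idx = pamap2_df.index[:4]
pamap2_df.loc[idx, "heartrate"] = 100

# Equivalent single .iloc call using the positional column index
col_idx = pamap2_df.columns.get_loc("heartrate")
pamap2_df.iloc[:4, col_idx] = 100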

Pandas: Use DataFrameGroupBy.filter() method to select DataFrame's rows with a value greater than the mean of the respective group

I am learning Python and Pandas and I am doing some exercises to understand how things work.
My question is the following: can I use the GroupBy.filter() method to select the DataFrame's rows that have a value (in a specific column) greater than the mean of the respective group?
For this exercise, I am using the "planets" dataset included in Seaborn: 1035 rows x 6 columns (column names: "method", "number", "orbital_period", "mass", "distance", "year").
In Python:
import pandas as pd
import seaborn as sns

#Load the "planets" dataset included in Seaborn
data = sns.load_dataset("planets")

#Remove rows with NaN in "orbital_period"
data = data.dropna(how="all", subset=["orbital_period"])

#Set display of DataFrames for seeing all the columns
pd.set_option("display.max_columns", 15)

#Group the DataFrame "data" by "method"
group1 = data.groupby("method")

#I obtain a DataFrameGroupBy object (group1) composed of 10 groups.
print(group1)

#Print the composition of the DataFrameGroupBy object "group1".
for lab, datafrm in group1:
    print(lab, "\n", datafrm, sep="", end="\n\n")

print()
print()
print()

#Define the filter function that will be used by the filter() method.
#I want a function that returns True whenever the "orbital_period" value of
#a row is greater than the mean of the corresponding group.
#This could also have been done directly with a lambda as the argument of filter().
def filter_funct(x):
    #print(type(x))
    #print(x)
    return x["orbital_period"] > x["orbital_period"].mean()

dataFiltered = group1.filter(filter_funct)
print("RESULT OF THE FILTER METHOD:")
print()
print(dataFiltered)
print()
print()
Unfortunately, I get the following error when I run the script:
TypeError: filter function returned a Series, but expected a scalar bool
It looks like x["orbital_period"] does not behave as a vector, meaning that it does not return the single values of the Series...
Weirdly enough, the transform() method does not suffer from this problem. Indeed, on the same dataset (prepared as above), if I run the following:
#Define the transform_function that will be used by the transform() method.
#I want this function to subtract from each value in "orbital_period" the mean
#of the corresponding group.
def transf_funct(x):
    #print(type(x))
    #print(x)
    return x - x.mean()

print("Transform method runs:")
print()

#I directly assign the transformed values to the "orbital_period" column of the DataFrame.
data["orbital_period"] = group1["orbital_period"].transform(transf_funct)
print("RESULT OF THE TRANSFORM METHOD:")
print()
print(data)
print()
print()
print()
I obtain the expected result...
Do DataFrameGroupBy.filter() and DataFrameGroupBy.transform() have different behavior?
I know I can achieve what I want in many other ways but my question is:
Is there a way to achieve what I want making use of the DataFrameGroupBy.filter() method?
Can I use DataFrameGroupBy.filter to exclude specific rows within a group?
The answer is no. DataFrameGroupBy.filter uses a single Boolean value to characterize an entire group; if a group is characterized as False, the whole group is removed from the result.
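For example, a function passed to filter must reduce each group to one Boolean; a minimal sketch using the question's group1 (the 365-day threshold is arbitrary):
# Keeps or drops whole groups: a group is kept only if its mean orbital_period
# exceeds 365 days; otherwise every row of that group is removed
kept = group1.filter(lambda g: g["orbital_period"].mean() > 365)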
DataFrameGroupBy.filter is very slow, so it's often advised to use transform to broadcast the single truth value to all rows within a group and then to subset the DataFrame¹. Here is an example of removing entire groups where the mean is <= 50; the filter method is about 100x slower.
import pandas as pd
import numpy as np
N = 10000
df = pd.DataFrame({'grp': np.arange(0, N, 1) // 10,
                   'value': np.arange(0, N, 1) % 100})
# With Filter
%timeit df.groupby('grp').filter(lambda x: x['value'].mean() > 50)
#327 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With Transform
%timeit df[df.groupby('grp')['value'].transform('mean') > 50]
#2.7 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Verify they are equivalent
(df.groupby('grp').filter(lambda x: x['value'].mean() > 50)
== df[df.groupby('grp')['value'].transform('mean') > 50]).all().all()
#True
¹The gain in performance comes from the fact that transform may allow you to use a GroupBy operation that is implemented in Cython, which is the case for mean. If that is not the case, filter may be just as performant, if not slightly better.
Finally, because DataFrameGroupBy.transform broadcasts a result to the entire group, it is the correct tool to use when needing to exclude specific rows within a group based on an overall group characteristic.
In the above example, if you want to keep the rows within a group that are above the group mean, it is:
df[df['value'] > df.groupby('grp')['value'].transform('mean')]
# Compare each row to the mean of the group it belongs to
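Applied back to the planets dataset from the question, a sketch of the transform-based approach (re-loading the data so the snippet stands alone):
import seaborn as sns

planets = sns.load_dataset("planets").dropna(subset=["orbital_period"])

# Broadcast each "method" group's mean to its rows, then keep the rows above that mean
group_mean = planets.groupby("method")["orbital_period"].transform("mean")
above_mean = planets[planets["orbital_period"] > group_mean]
print(above_mean)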

How to query attribute table in QGIS to return the sum of a subset

I am using a Jupyter notebook to query a dataset too big to be opened in Excel, and my PC is too slow to perform the calculations directly in QGIS.
My logic is as follows, after importing pandas:
x = df[(df.OBJECTID == 4440) & (df.Landuse == 'Grass - Urban')]
want_area = x['Area'].sum()  # returning the whole dataframe sum for that field!!
summed = x['Area'].sum()
ratio = round(want_area / summed, 2)
How can I tweak the code in order to obtain the sum of 'Area' for the above subset only, and not for the whole dataframe (800,000+ features)?
Hope my simple question makes sense and thank you very much!
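For what it's worth, a minimal sketch of the described logic on made-up data (the column names OBJECTID, Landuse and Area come from the question; the values are invented), contrasting the subset sum with the whole-table sum:
import pandas as pd

# Tiny made-up attribute table
df = pd.DataFrame({"OBJECTID": [4440, 4440, 17],
                   "Landuse": ["Grass - Urban", "Forest", "Grass - Urban"],
                   "Area": [10.0, 5.0, 20.0]})

# Boolean mask: only the rows matching both conditions
x = df[(df.OBJECTID == 4440) & (df.Landuse == 'Grass - Urban')]

subset_sum = x['Area'].sum()   # sums over the masked rows only (10.0 here)
total_sum = df['Area'].sum()   # sums over the whole attribute table (35.0 here)
ratio = round(subset_sum / total_sum, 2)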

What is the most effective way to get elements of an RDD in Spark

I need to get the values of two columns of a dataframe that has been converted to an RDD.
The first solution I thought of was:
First, convert the RDD to a List of Row with RDD.collect().
Then, for each element of the List, get the values using Row[i].getInt(column_index).
This solution works fine with small and medium-sized data, but with large data I run out of memory.
My temporary solution is to create a new RDD which contains only the two columns instead of all columns, and then apply the solution above; this may reduce most of the needed memory.
The current implementation looks like this:
Row[] rows = sparkDataFrame.collect();
for (int i = 0; i < rows.length; i++) { // about 50 million rows
    int yTrue = rows[i].getInt(0);
    int yPredict = rows[i].getInt(1);
}
Could you help me improve my solution, or suggest other solutions?
Thanks!
P.S.: I'm a new Spark user!
First convert your big RDD into a DataFrame, and then you can directly select whatever columns you require.
// Create the DataFrame
DataFrame df = sqlContext.jsonFile("examples/src/main/resources/people.json");
// Select only the "name" and "age" columns
df.select(df.col("name"), df.col("age")).show();
For more detail, you can follow this link.
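If Python is an option on your side, a roughly equivalent PySpark sketch (the JSON path and column names are just the ones from the example above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create the DataFrame from the example JSON file
df = spark.read.json("examples/src/main/resources/people.json")

# Select only the two needed columns before collecting, to keep memory down
rows = df.select("name", "age").collect()
for row in rows:  # each Row now carries just the two selected fields
    name, age = row["name"], row["age"]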
