I am attempting to aggregate and create an array of means as follows (this is a minimal working example):
from pyspark.sql.functions import array, col, mean

n = len(allele_freq_total.select("alleleFrequencies").first()[0])
allele_freq_by_site = allele_freq_total.groupBy("contigName", "start", "end", "referenceAllele").agg(
    array(*[mean(col("alleleFrequencies")[i]) for i in range(n)]).alias("mean_alleleFrequencies")
)
using a solution that I got from
Aggregate over column arrays in DataFrame in PySpark?
The problem is that n is variable. How do I alter
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)])
so that it takes the variable length into account?
With arrays of unequal size across the different groups (for you, a group is ("contigName", "start", "end", "referenceAllele"), which I'll simply call group), you could explode the array column (alleleFrequencies) while keeping track of the position each value had within its array. That gives you an extra column you can add to the grouping to compute the per-position average you have in mind. At that point you might already have enough for further computations (see df3.show() below).
If you really must have the result back as an array, that is harder, because you must keep track of the order. I believe the easiest way is with a map (a dictionary, if you like). To build one, I use the aggregation function collect_list on two columns. While collect_list isn't deterministic on its own (you don't know the order in which values will be returned in the list, because rows are shuffled), aggregating both columns in the same call keeps them aligned, as the rows get shuffled in their entirety (see df4.show() below). From there, you can create a mapping from position to average with map_from_arrays.
Example:
>>> from pyspark.sql.functions import mean, col, posexplode, collect_list, map_from_arrays
>>>
>>> df = spark.createDataFrame([
... ("A", [0, 1, 2]),
... ("A", [0, 3, 6]),
... ("B", [1, 2, 4, 5]),
... ("B", [1, 2, 6, 1])],
... schema=("group", "values"))
>>> df2 = df.select(df.group, posexplode(df.values)) # adds the "pos" and "col" columns
>>> df3 = (df2
... .groupBy("group", "pos")
... .agg(mean(col("col")).alias("avg_of_positions"))
... )
>>> df4 = (df3
... .groupBy("group")
... .agg(
... collect_list("pos").alias("pos"),
... collect_list("avg_of_positions").alias("avgs")
... )
... )
>>> df5 = df4.select(
... "group",
... map_from_arrays(col("pos"), col("avgs")).alias("positional_averages")
... )
>>> df5.show(truncate=False)
+-----+----------------------------------------+
|group|positional_averages |
+-----+----------------------------------------+
|B |[0 -> 1.0, 1 -> 2.0, 3 -> 3.0, 2 -> 5.0]|
|A |[0 -> 0.0, 1 -> 2.0, 2 -> 4.0] |
+-----+----------------------------------------+
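Applied to the DataFrame from the question, the same pattern would look roughly like the sketch below. This is only a sketch: it assumes the column names alleleFrequencies, contigName, start, end and referenceAllele from the question and is untested against the real schema.
from pyspark.sql.functions import col, collect_list, map_from_arrays, mean, posexplode

# Explode each alleleFrequencies array together with the position of each value
exploded = allele_freq_total.select(
    "contigName", "start", "end", "referenceAllele",
    posexplode(col("alleleFrequencies")).alias("pos", "freq"),
)

# Per-position mean within each group; arrays of different lengths are fine,
# shorter arrays simply contribute to fewer positions
per_position = (
    exploded
    .groupBy("contigName", "start", "end", "referenceAllele", "pos")
    .agg(mean("freq").alias("mean_freq"))
)

# Optional: fold the per-position means back into a position -> mean map;
# the two collect_list calls in the same agg stay aligned, as explained above
allele_freq_by_site = (
    per_position
    .groupBy("contigName", "start", "end", "referenceAllele")
    .agg(
        collect_list("pos").alias("pos"),
        collect_list("mean_freq").alias("means"),
    )
    .select(
        "contigName", "start", "end", "referenceAllele",
        map_from_arrays(col("pos"), col("means")).alias("mean_alleleFrequencies"),
    )
)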
When I call to_dict it returns a normal dictionary. However, normal dictionaries do not preserve order. The keys of the dictionary are the columns. Therefore, if I called to_dict on a DataFrame and later called from_dict to reconstruct it, would that not suggest that I could potentially lose the column order?
In Python 3.7+ (and in CPython 3.6 as an implementation detail), dictionaries preserve the order in which keys are inserted, so your assertion isn't true:
In [7]: pd.DataFrame.from_dict(pd.DataFrame({'c': [5], 'a': [2], 'b': [1]}).to_dict())
Out[7]:
c a b
0 5 2 1
Additionally, the pandas.DataFrame.to_dict docs provide a number of additional options for data structures such as OrderedDict:
>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
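As a quick round-trip check, a minimal sketch using the same throwaway column order as above:
import pandas as pd
from collections import OrderedDict

df = pd.DataFrame({'c': [5], 'a': [2], 'b': [1]})

# Round-trip through a plain dict and through an OrderedDict;
# the column order survives either way on Python 3.7+
print(list(pd.DataFrame.from_dict(df.to_dict()).columns))                   # ['c', 'a', 'b']
print(list(pd.DataFrame.from_dict(df.to_dict(into=OrderedDict)).columns))   # ['c', 'a', 'b']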
As I understand it, a copy by slicing copies the upper levels of a structure, but not the lower ones (I'm not sure exactly when).
However, in this case I make a copy by slicing and, when editing two columns of the copy, one column of the original is altered, but the other is not.
How is that possible? Why one column, and not both or neither?
Here is the code:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv'
data = pd.read_csv(url)
# Copy data
processed_data = data[:]
print(data[:10])
# Edit copy
processed_data['gre'] = processed_data['gre']/800.0
processed_data['gpa'] = processed_data['gpa']/4.0
# gpa column has changed
print(data[:10])
On the other hand, if I change processed_data = data[:] to processed_data = data.copy() it works fine.
Here is the original data printed after the edit: the gpa column has changed, while gre has not.
As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones.
This is valid for Python lists. Slicing creates shallow copies.
In [44]: lst = [[1, 2], 3, 4]
In [45]: lst2 = lst[:]
In [46]: lst2[1] = 100
In [47]: lst # unchanged
Out[47]: [[1, 2], 3, 4]
In [48]: lst2[0].append(3)
In [49]: lst # changed
Out[49]: [[1, 2, 3], 3, 4]
However, this is not the case for numpy/pandas. numpy, for the most part, returns a view when you slice an array.
In [50]: arr = np.array([1, 2, 3])
In [51]: arr2 = arr[:]
In [52]: arr2[0] = 100
In [53]: arr
Out[53]: array([100, 2, 3])
If you have a DataFrame with a single dtype, the behaviour you see is the same:
In [62]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
In [63]: df
Out[63]:
0 1 2
0 1 2 3
1 4 5 6
In [64]: df2 = df[:]
In [65]: df2.iloc[0, 0] = 100
In [66]: df
Out[66]:
0 1 2
0 100 2 3
1 4 5 6
But when you have mixed dtypes, the behavior is not predictable, which is the main source of the infamous SettingWithCopyWarning:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it’s very hard
to predict whether it will return a view or a copy (it depends on the
memory layout of the array, about which pandas makes no guarantees),
and therefore whether the __setitem__ will modify dfmi or a temporary
object that gets thrown out immediately afterward. That’s what
SettingWithCopy is warning you about!
In your case, my guess is that this is the result of how different dtypes are handled in pandas. Each dtype has its own block, and in the case of the gpa column the block is the column itself. This is not the case for gre -- there are other integer columns. When I add a string column to data and modify it in processed_data, I see the same behavior. When I increase the number of float columns in data to two, changing gre in processed_data no longer affects the original data.
In a nutshell, the behavior is the result of an implementation detail which you shouldn't rely on. If you want to copy DataFrames, you should explicitly use .copy() and if you want to modify parts of DataFrames you shouldn't assign those parts to other variables. You should directly modify them either with .loc or .iloc.
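To illustrate the recommended pattern, here is a minimal sketch; the data below is made up, and only the column names gre and gpa are taken from the question:
import pandas as pd

# Toy stand-in for the student admissions data from the question
data = pd.DataFrame({'admit': [0, 1], 'gre': [800.0, 640.0],
                     'gpa': [4.0, 3.2], 'rank': [1, 2]})

# Explicit copy: edits to the copy can never leak back into `data`
processed_data = data.copy()
processed_data['gre'] = processed_data['gre'] / 800.0
processed_data['gpa'] = processed_data['gpa'] / 4.0

# Modifying part of the original directly: use .loc rather than a sliced alias
data.loc[:, 'gre'] = data['gre'] / 800.0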
I am using Spark 2.2 with Python, with PCA from the ml.feature module and VectorAssembler to feed my features to PCA. To clarify, let's say I have a table with three columns col1, col2 and col3; then I am doing:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
df = assembler.transform(table).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
At this point I have run PCA with 2 components, and I can look at its values with:
m = model.pc.values.reshape(3, 2)
which corresponds to 3 rows (= the number of columns in my original table) and 2 columns (= the number of components in my PCA). My question is: are the three rows here in the same order in which I specified my input columns to the VectorAssembler above? To clarify further, does the above matrix correspond to:
      | PC1 | PC2 |
------+-----+-----+
 col1 |     |     |
 col2 |     |     |
 col3 |     |     |
Note that the example here is only for clarity; in my real problem I am dealing with ~1600 columns and a bunch of selections. I could not find any definitive answer to this in the Spark documentation. I want to do this to pick the best columns / features from my original table, based on the top principal components, to train my model. Or is there anything else / better in Spark ML PCA that I should be looking at to deduce such a result?
Or can I not use PCA for this at all, so that I have to use other techniques like Spearman ranking etc.?
are the (...) rows here in the same order in which I had specified my input columns
Yes, they are. Let's trace what is going on:
from pyspark.ml.feature import PCA, VectorAssembler
data = [
(0.0, 1.0, 0.0, 7.0, 0.0), (2.0, 0.0, 3.0, 4.0, 5.0),
(4.0, 0.0, 0.0, 6.0, 7.0)
]
df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])
VectorAssembler follows the order of the input columns:
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vectors = assembler.transform(df).select("features")
vectors.schema[0].metadata
# {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
# {'idx': 1, 'name': 'v'},
# {'idx': 2, 'name': 'x'},
# {'idx': 3, 'name': 'y'},
# {'idx': 4, 'name': 'z'}]},
# 'num_attrs': 5}}
So do the principal components:
model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)
?model.pc
# Type: property
# String form: <property object at 0x7feb5bdc1d68>
# Docstring:
# Returns a principal components Matrix.
# Each column is one principal component.
#
# .. versionadded:: 2.0.0
Finally, a sanity check:
import numpy as np
x = np.array(data)
y = model.pc.values.reshape(3, 5).transpose()
z = np.array(model.transform(vectors).rdd.map(lambda x: x.pc_features).collect())
np.linalg.norm(x.dot(y) - z)
# 8.881784197001252e-16
You can see the actual order of the columns here:
df.schema["features"].metadata["ml_attr"]["attrs"]
There will usually be two classes, "binary" and "numeric". Then
pd.DataFrame(
    df.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
    + df.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
).sort_values("idx")
should give the exact order of all the columns. You can verify that the order of the input and output columns remains the same.
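Combining the two answers, here is a hedged sketch of how one could relate the original column names to the component loadings, reusing model and vectors from the example above. Treating the absolute PC1 loading as a feature-importance score is only one possible heuristic, not something the Spark docs prescribe.
import numpy as np
import pandas as pd

# Recover the feature names in the exact order VectorAssembler used
attrs = vectors.schema["features"].metadata["ml_attr"]["attrs"]
fields = sorted((a for group in attrs.values() for a in group),
                key=lambda a: a["idx"])
names = [a["name"] for a in fields]

# model.pc is numFeatures x k; row i corresponds to names[i]
loadings = model.pc.toArray()

# Rank features by the magnitude of their loading on the first component
ranking = pd.Series(np.abs(loadings[:, 0]), index=names).sort_values(ascending=False)
print(ranking)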
I need to read a text file that contains comma-delimited values into a 2D numpy array. The first 2 values on each line contain the index values for the numpy array and the third values contains the value to be stored in the array. As a catch, the index values are 1-based and need to be converted to the 0-based index values used by numpy. I've reviewed documentation and examples using genfromtxt and loadtxt but it's still not clear to me how to go about it. I've also tried the following code with no success:
a = np.arange(6).reshape(2,3)
for line in infile:
    fields = line.split(',')  # split the comma-delimited fields into a list
    rindex = int(fields[0]) - 1
    cindex = int(fields[1]) - 1
    a[rindex, cindex] = float(fields[2])
Here is an example of the input file:
1,1,10.1
1,2,11.2
1,3,12.3
2,3,13.4
2,2,14.5
2,3,15.6
And here is my desired output array. Ideally I'd like it to work on any array size without having to predefine the size of the array.
10.1 11.2 12.3
13.4 14.5 15.6
Here's one way you can do it. numpy.genfromtxt() is used to read the data into a structured array with three fields. The row and column indices are pulled out of the structured array and used to figure out the shape of the desired array, and to assign the values to the new array using numpy's "fancy" indexing:
In [46]: !cat test_data.csv
1,1,10.1
1,2,11.2
1,3,12.3
2,3,13.4
2,2,14.5
2,3,15.6
In [47]: data = np.genfromtxt('test_data.csv', dtype=None, delimiter=',', names=['i', 'j', 'value'])
In [48]: data
Out[48]:
array([(1, 1, 10.1), (1, 2, 11.2), (1, 3, 12.3), (2, 3, 13.4),
(2, 2, 14.5), (2, 3, 15.6)],
dtype=[('i', '<i8'), ('j', '<i8'), ('value', '<f8')])
In [49]: rows = data['i']
In [50]: cols = data['j']
In [51]: nrows = rows.max()
In [52]: ncols = cols.max()
In [53]: a = np.zeros((nrows, ncols))
In [54]: a[rows-1, cols-1] = data['value']
In [55]: a
Out[55]:
array([[ 10.1, 11.2, 12.3],
[ 0. , 14.5, 15.6]])
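Since all three columns are numeric, a shorter alternative sketch with np.loadtxt (assuming the same test_data.csv as above) gives the same result:
import numpy as np

# Read all three comma-separated columns as floats
data = np.loadtxt('test_data.csv', delimiter=',')

# Convert the 1-based row/column indices to 0-based
rows = data[:, 0].astype(int) - 1
cols = data[:, 1].astype(int) - 1

# Allocate the target array from the largest indices and fill it
a = np.zeros((rows.max() + 1, cols.max() + 1))
a[rows, cols] = data[:, 2]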
I'm currently using np.loadtxt to load some mixed data into a structured numpy array. I do some calculations on a few of the columns to output later. For compatibility reasons I need to maintain a specific output format, so I'd like to insert those columns at specific points and use np.savetxt to export the array in one shot.
A simple setup:
import numpy as np
x = np.zeros((2,),dtype=('i4,f4,a10'))
x[:] = [(1,2.,'Hello'),(2,3.,'World')]
newcol = ['abc','def']
For this example I'd like to make newcol the 2nd column. I'm very new to Python (coming from MATLAB). From my searching all I've been able to find so far are ways to append newcol to the end of x to make it the last column, or x to newcol to make it the first column. I also turned up np.insert but it doesn't seem to work on a structured array because it's technically a 1D array (from my understanding).
What's the most efficient way to accomplish this?
EDIT1:
I investigated np.savetxt a little further and it seems like it can't be used with a structured array, so I'm assuming I would need to loop through and write each row with f.write. I could specify each column explicitly (by field name) with that approach and not have to worry about the order in my structured array, but that doesn't seem like a very generic solution.
For the above example my desired output would be:
1, abc, 2.0, Hello
2, def, 3.0, World
This is a way to add a field to the array, at the position you require:
from numpy import zeros, empty

def insert_dtype(x, position, new_dtype, new_column):
    # Build a new dtype with the extra field inserted at `position`,
    # copy the existing fields across, then fill in the new column.
    if x.dtype.fields is None:
        raise ValueError("`x` must be a structured numpy array")
    new_desc = x.dtype.descr
    new_desc.insert(position, new_dtype)
    y = empty(x.shape, dtype=new_desc)
    for name in x.dtype.names:
        y[name] = x[name]
    y[new_dtype[0]] = new_column
    return y
x = zeros((2,), dtype='i4,f4,a10')
x[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
new_dt = ('my_alphabet', '|S3')
new_col = ['abc', 'def']
x = insert_dtype(x, 1, new_dt, new_col)
Now x looks like
array([(1, 'abc', 2.0, 'Hello'), (2, 'def', 3.0, 'World')],
dtype=[('f0', '<i4'), ('my_alphabet', 'S3'), ('f1', '<f4'), ('f2', 'S10')])
The solution is adapted from here.
To print the recarray to file, you could use something like:
from matplotlib.mlab import rec2csv
rec2csv(x,'foo.txt')
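Note that matplotlib.mlab.rec2csv has been removed in recent matplotlib releases. As a plain-Python alternative, here is a sketch of the row-by-row writing approach mentioned in the question; the bytes handling assumes Python 3, where fixed-width string fields come back as bytes:
import csv

# Write the structured array row by row, decoding byte-string fields
with open('foo.txt', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in x.tolist():  # each structured row becomes a plain tuple
        writer.writerow([v.decode() if isinstance(v, bytes) else v for v in row])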