Can Pandas DataFrame to_dict and from_dict lose column order? - python-3.x

When I call to_dict it returns a normal dictionary keyed by column. However, normal dictionaries do not necessarily preserve order. Therefore, if I had called to_dict on a dataframe and later called from_dict to reconstruct the dataframe, would that not suggest that I could potentially lose column order?

In Python 3.7+, dictionaries preserve the order in which keys are inserted, so your assertion isn't true:
In [7]: pd.DataFrame.from_dict(pd.DataFrame({'c': [5], 'a': [2], 'b': [1]}).to_dict())
Out[7]:
   c  a  b
0  5  2  1
Additionally, the pandas.DataFrame.to_dict docs provide a number of additional options for data structures such as OrderedDict:
>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
             ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
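As a quick sanity check, here is a minimal round-trip sketch (using a toy frame df of my own, not one from the question) showing that column order survives to_dict followed by from_dict:
>>> import pandas as pd
>>> df = pd.DataFrame({'c': [5], 'a': [2], 'b': [1]})
>>> restored = pd.DataFrame.from_dict(df.to_dict())
>>> list(restored.columns) == list(df.columns)
True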

Related

Stacking up dataframes in a 3-dimensional numpy array

I have several pandas dataframes that I would like to stack up as a three-dimensional numpy array. I could manually do the job using the following code:
arr = np.array([df1.values, df2.values], dtype="object")
However, since I have many dataframes, I can neither write this line out for all of them nor easily automate it.
I tried to use the append function (np.append(df1.values, df2['1002'].values)) but it flattens the dataframes and ignores their structure. What I want is a three-dimensional numpy array where the first dimension is the number of dataframes that I have, the second one is the number of rows in each dataframe, and the third one is the number of columns. In the first example that I mentioned earlier, I get a three-dimensional numpy array: when I run arr.shape the result is (2,), and when I run arr[0].shape and arr[1].shape, I get (26, 7) and (24, 7), respectively, which are the shapes of their corresponding dataframes.
I even ran np.append(df1.values, df2['1002'].values, axis=0) but I received the error ValueError: all the input array dimensions for the concatenation axis must match exactly. Is there any way that I can fix this problem and stack up all my dataframes in a 3-dimensional numpy array?
Looks like you start with 2 frames with 7 columns, but different numbers of rows. The equivalent of:
In [1]: arr1 = np.ones((26,7)); arr2 = np.zeros((24,7))
...:
In [2]: arr = np.array([arr1, arr2], object)
In [3]: arr.shape
Out[3]: (2,)
In [4]: arr[0].shape
Out[4]: (26, 7)
You probably tried this without the object dtype and got a 'ragged array' warning. In any case, this is not a 3d array. It is a 1d array of shape (2,) holding two arrays. It's roughly the same as the list
[arr1, arr2]
The np.append docs should make it clear that it flattens the arguments when you don't specify an axis.
In [6]: np.append(arr1,arr2).shape
Out[6]: (350,)
You could specify an axis, and get a 2d array, where the 50 is the sum of 26 and 24.
In [7]: np.append(arr1,arr2,axis=0).shape
Out[7]: (50, 7)
This is the same as:
In [8]: np.concatenate((arr1,arr2), axis=0).shape
Out[8]: (50, 7)
np.append is a poorly named cover for np.concatenate. It is not a clone of list.append. Learn to use concatenate and its stack derivatives instead.
With different dataframe shapes, you cannot make a 3d array. Arrays cannot be 'ragged'.
As for working with more than 2 dataframes, if you can make a list of all the frames, you can use the initial syntax.
alist = []
for a in frame_list:
    alist.append(a.values)
arr = np.array(alist, object)
But making such an array doesn't do much for you.
If the frames are all the same size, then you can make a 3d array:
In [10]: np.array([arr1[:10,:],arr2[:10,:]]).shape
Out[10]: (2, 10, 7)
In [11]: np.stack([arr1[:10,:],arr2[:10,:]]).shape
Out[11]: (2, 10, 7)
But if they differ, stack will complain about that:
In [12]: np.stack([arr1, arr2])
Traceback (most recent call last):
File "<ipython-input-12-23d05d0422dc>", line 1, in <module>
np.stack([arr1, arr2])
File "<__array_function__ internals>", line 180, in stack
File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 426, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
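One workaround (my own addition, not part of the answer above) is to pad the shorter blocks with NaN rows up to a common row count, after which a true 3d array is possible; a minimal sketch continuing the session above:
In [13]: max_rows = max(b.shape[0] for b in [arr1, arr2])
In [14]: padded = [np.vstack([b, np.full((max_rows - b.shape[0], b.shape[1]), np.nan)]) for b in [arr1, arr2]]
In [15]: np.stack(padded).shape
Out[15]: (2, 26, 7)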

Why is a copy of a pandas object altering one column on the original object? (Slice copy)

As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones (I'm not sure when).
However, in this case I make a copy by slicing and, when editing two columns of the copy, one column of the original is altered, but the other is not.
How is it possible? Why one column, and not both or none of them?
Here is the code:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv'
data = pd.read_csv(url)
# Copy data
processed_data = data[:]
print(data[:10])
# Edit copy
processed_data['gre'] = processed_data['gre']/800.0
processed_data['gpa'] = processed_data['gpa']/4.0
# gpa column has changed
print(data[:10])
On the other hand, if I change processed_data = data[:] to processed_data = data.copy() it works fine.
Here is the original data after editing the copy: the gpa column has changed, but gre has not.
As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones.
This is valid for Python lists. Slicing creates shallow copies.
In [44]: lst = [[1, 2], 3, 4]
In [45]: lst2 = lst[:]
In [46]: lst2[1] = 100
In [47]: lst # unchanged
Out[47]: [[1, 2], 3, 4]
In [48]: lst2[0].append(3)
In [49]: lst # changed
Out[49]: [[1, 2, 3], 3, 4]
However, this is not the case for numpy/pandas. numpy, for the most part, returns a view when you slice an array.
In [50]: arr = np.array([1, 2, 3])
In [51]: arr2 = arr[:]
In [52]: arr2[0] = 100
In [53]: arr
Out[53]: array([100, 2, 3])
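A quick way to confirm that the slice is a view (a small check of my own, not from the original answer):
In [54]: np.shares_memory(arr, arr2)
Out[54]: True
In [55]: arr2.base is arr
Out[55]: True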
If you have a DataFrame with a single dtype, the behaviour you see is the same:
In [62]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
In [63]: df
Out[63]:
0 1 2
0 1 2 3
1 4 5 6
In [64]: df2 = df[:]
In [65]: df2.iloc[0, 0] = 100
In [66]: df
Out[66]:
0 1 2
0 100 2 3
1 4 5 6
But when you have mixed dtypes, the behavior is not predictable, which is the main source of the infamous SettingWithCopyWarning:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it’s very hard
to predict whether it will return a view or a copy (it depends on the
memory layout of the array, about which pandas makes no guarantees),
and therefore whether the __setitem__ will modify dfmi or a temporary
object that gets thrown out immediately afterward. That’s what
SettingWithCopy is warning you about!
In your case, my guess is that this was the result of how different dtypes are handled in pandas. Each dtype has its own block, and in the case of the gpa column the block is the column itself. This is not the case for gre -- you have other integer columns. When I add a string column to data and modify it in processed_data, I see the same behavior. When I increase the number of float columns to 2 in data, changing gre in processed_data no longer affects the original data.
In a nutshell, the behavior is the result of an implementation detail which you shouldn't rely on. If you want to copy DataFrames, you should explicitly use .copy() and if you want to modify parts of DataFrames you shouldn't assign those parts to other variables. You should directly modify them either with .loc or .iloc.
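A minimal sketch of the explicit approach, using a small toy frame of my own rather than the CSV from the question:
import pandas as pd

# Toy frame standing in for the student admissions data (an assumption for illustration).
data = pd.DataFrame({'admit': [0, 1], 'gre': [520, 660], 'gpa': [2.9, 3.6]})

# Explicit copy: edits to the copy never touch the original.
processed_data = data.copy()
processed_data['gre'] = processed_data['gre'] / 800.0
processed_data['gpa'] = processed_data['gpa'] / 4.0
print(data)  # unchanged

# To modify the original deliberately, do it directly with .loc.
data.loc[:, 'gre'] = data['gre'] / 800.0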

PySpark: how to aggregate over column arrays with variable width?

I am attempting to aggregate and create an array of means thus (this is a Minimal Working Example):
n = len(allele_freq_total.select("alleleFrequencies").first()[0])
allele_freq_by_site = allele_freq_total.groupBy("contigName", "start", "end", "referenceAllele").agg(
    array(*[mean(col("alleleFrequencies")[i]) for i in range(n)]).alias("mean_alleleFrequencies")
)
using a solution that I got from
Aggregate over column arrays in DataFrame in PySpark?
but the problem is that n is variable. How do I alter
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)])
so that it takes variable length into consideration?
With arrays of unequal size in the different groups (for you, a group is ("contigName", "start", "end", "referenceAllele"), which I'll simply rename to group), you could consider exploding the array column (the alleleFrequencies), introducing the position each value had within its array. That will give you an additional column you can use in the grouping to compute the average you had in mind. At this point you might actually have enough for further computations (see df3.show() below).
If you really must have it back as an array, that's harder and I don't have a clean way to do it. One must keep track of the order, and I believe that's easiest with a map (a dictionary, if you like). To do so, I use the aggregation function collect_list on two columns. While collect_list isn't deterministic (you don't know the order in which values will be returned in the list, because rows are shuffled), aggregating over both columns in the same call preserves their relative order, as the rows get shuffled in their entirety (see df4.show() below). From there, you can create a mapping of the position to the average with map_from_arrays.
Example:
>>> from pyspark.sql.functions import mean, col, posexplode, collect_list, map_from_arrays
>>>
>>> df = spark.createDataFrame([
... ("A", [0, 1, 2]),
... ("A", [0, 3, 6]),
... ("B", [1, 2, 4, 5]),
... ("B", [1, 2, 6, 1])],
... schema=("group", "values"))
>>> df2 = df.select(df.group, posexplode(df.values)) # adds the "pos" and "col" columns
>>> df3 = (df2
... .groupBy("group", "pos")
... .agg(mean(col("col")).alias("avg_of_positions"))
... )
>>> df4 = (df3
... .groupBy("group")
... .agg(
... collect_list("pos").alias("pos"),
... collect_list("avg_of_positions").alias("avgs")
... )
... )
>>> df5 = df4.select(
... "group",
... map_from_arrays(col("pos"), col("avgs")).alias("positional_averages")
... )
>>> df5.show(truncate=False)
+-----+----------------------------------------+
|group|positional_averages |
+-----+----------------------------------------+
|B |[0 -> 1.0, 1 -> 2.0, 3 -> 3.0, 2 -> 5.0]|
|A |[0 -> 0.0, 1 -> 2.0, 2 -> 4.0] |
+-----+----------------------------------------+
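If you would rather not rely on the two paired collect_list calls, one alternative sketch (my own, reusing df3 and the imports from the example above) is to collect (pos, avg) structs, sort them, and then split the pair back apart for map_from_arrays:
>>> from pyspark.sql.functions import struct, sort_array
>>> df4b = (df3
...     .groupBy("group")
...     .agg(sort_array(collect_list(struct("pos", "avg_of_positions"))).alias("pairs"))
... )
>>> df5b = df4b.select(
...     "group",
...     map_from_arrays(col("pairs.pos"), col("pairs.avg_of_positions")).alias("positional_averages")
... )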

How to filter this type of data?

If I have some numpy arrays like
a = np.array([1,2,3,4,5])
b = np.array([4,5,7,8])
c = np.array([4,5])
I need to combine these arrays without repeating a number. My expected output is [1,2,3,4,5,7,8].
How do I combine them? Which function should I use?
One more approach you can try is using reduce from functools and union1d from numpy.
For example:
from functools import reduce
reduce(np.union1d, (a, b, c))
Output:
array([1, 2, 3, 4, 5, 7, 8])
You can use numpy.concatenate with numpy.unique:
d = np.unique(np.concatenate((a,b,c)))
print(d)
Output:
[1 2 3 4 5 7 8]
Python has a datatype called set:
A set is an unordered collection with no duplicate elements
The easiest way to create a set out of your arrays would be to unpack them into a set literal:
>>> import numpy as np
>>> a=np.array([1,2,3,4,5])
>>> b=np.array([4,5,7,8])
>>> c=np.array([4,5])
>>> {*a, *b, *c}
{1, 2, 3, 4, 5, 7, 8}
Please note that a set is unordered. This is not the right answer for you if the order of the elements in your arrays is important.
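If the order of first appearance does matter, here is a small sketch of my own (not from the answers above) that deduplicates while keeping that order, relying on dict.fromkeys preserving insertion order:
>>> combined = np.concatenate((a, b, c))
>>> np.array(list(dict.fromkeys(combined.tolist())))
array([1, 2, 3, 4, 5, 7, 8])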

read the scipy.beta distribution parameters from a scipy.stats._continuous_distns.beta_gen object

Having an instance of the beta object, how do I get back the parameters a and b?
There are properties a and b, but it seems they mean something other than what I expected:
>>> import scipy
>>> scipy.__version__
'0.19.1'
>>> from scipy import stats
>>> my_beta = stats.beta(a=1, b=5)
>>> my_beta.a, my_beta.b
(0.0, 1.0)
Is there a way to get the parameters of the distribution? I could always fit a huge rvs sample but that seems silly :)
When you create a "frozen" distribution with a call such as my_beta = stats.beta(a=1, b=5), the positional and keyword arguments are saved as the attributes args and kwds, respectively, on the returned object. So in your case, you can access those values in the dictionary my_beta.kwds:
In [10]: from scipy import stats
In [11]: my_beta = stats.beta(a=1, b=5)
In [12]: my_beta.kwds
Out[12]: {'a': 1, 'b': 5}
The attributes my_beta.a and my_beta.b are, as you guessed, something different. They define the end points of the support of the probability distribution:
In [13]: my_beta.a
Out[13]: 0.0
In [14]: my_beta.b
Out[14]: 1.0
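For completeness (my addition, continuing the same session), if the shape parameters are passed positionally they end up in args rather than kwds:
In [15]: my_beta2 = stats.beta(1, 5)
In [16]: my_beta2.args
Out[16]: (1, 5)
In [17]: my_beta2.kwds
Out[17]: {}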
