Presto: How to map arrays of different lengths?

Say I have the arrays ['1', '2', '3'] and ['a', 'b', 'c', 'd'] and I want to map them:
select map(array ['1', '2', '3'], array ['a', 'b', 'c', 'd'])
This will return an error saying that the arrays need to be of the same length.
How can I replicate Python's zip(), which drops the elements without a pair? Or, failing that, pad the missing ones with NULLs?
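For reference, here is the Python behavior the question alludes to: zip() truncates to the shorter input, while itertools.zip_longest() pads with None.
from itertools import zip_longest

keys = ['1', '2', '3']
vals = ['a', 'b', 'c', 'd']

# the unpaired 'd' is dropped:
print(dict(zip(keys, vals)))          # {'1': 'a', '2': 'b', '3': 'c'}
# the missing key is padded with None:
print(dict(zip_longest(keys, vals)))  # {'1': 'a', '2': 'b', '3': 'c', None: 'd'}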

You can use slice and cardinality to "fix" sizes:
WITH dataset AS (
  SELECT *
  FROM (
    VALUES
      (ARRAY[1, 2, 3], ARRAY[1, 2, 3, 4])
  ) AS t (arr1, arr2)
)
SELECT
  map(
    slice(arr1, 1, m),
    slice(arr2, 1, m)
  )
FROM (
  SELECT *, LEAST(cardinality(arr1), cardinality(arr2)) AS m
  FROM dataset
)
Output:
_col0
{1=1, 2=2, 3=3}
Or just use zip and transform the resulting array of ROWs into a map. zip pads the shorter array with NULLs, so filtering out the entries whose first field is NULL drops the unpaired elements. Note that this relies on the default naming convention for ROW elements; as Martin Traverso points out in the comments, in Trino you can access row fields by index, so you can change the corresponding line to r -> r[1] IS NOT NULL:
WITH dataset AS (
  SELECT * FROM (
    VALUES
      (ARRAY[1, 2, 3], ARRAY[1, 2, 3, 4])
  ) AS t (arr1, arr2)
)
SELECT map_from_entries(filter(zip(arr1, arr2), r -> r.field0 IS NOT NULL))
FROM dataset
Output:
_col0
{1=1, 2=2, 3=3}

Related

How to find all combinations of partitioning a list without rearrangement?

I want to obtain all possible combinations (without rearrangement) of partitioning/breaking a string into k parts.
For example, I have a string "abcd" and k=3, and I want to achieve the following:
if [a,b,c,d] and k=3, then return
[ [ [a], [b], [c,d] ]
[ [a], [b,c], [d] ]
[ [a,b], [c], [d] ] ]
For example, I have a string "abcde" and k=3, and I want to achieve the following:
if [a,b,c,d,e] and k=3, then return
[ [ [a], [b], [c,d,e] ]
[ [a], [b,c,d], [e] ]
[ [a,b,c], [d], [e] ]
[ [a], [b,c], [d,e] ]
[ [a,b], [c], [d,e] ]
[ [a,b], [c,d], [e] ] ]
Note that in every combination, a-b-c-d(-e) stay in their original order, i.e. without rearrangement.
Let's say there's a function breakList which does the work: it takes the list, breaks it into k parts in every possible way (without rearrangement), and finally returns a three-dimensional array.
def breakList(l: list, k: int):
    ...
    return partitionedList
My Logic:
Let z = len(string), and let k be the integer number of parts the string is to be divided into.
By observation, the maximum size of an element of a combination is z - k + 1; let that be n.
Now start from the first index, go up to n, and fill the rest with parts of size j = 1, then save that in a 3D array.
The next iteration does the same with n decremented by 1, i.e. n - 1, and the rest with j = 2, then saves to the 3D array.
The next iteration decrements n by another 1, i.e. (n - 1) - 1, and the rest with j = 3, then saves to the 3D array.
j runs up to n, and n runs down to 1.
This gives all the combinations without rearrangement.
But this is not an efficient approach, and it also makes the task somewhat complex and time-consuming.
So, is there a better approach (I know there is...)? And can I simplify the code (in terms of the number of lines) using Python 3?
There's a better way... As mentioned in the referenced question, you just need to refocus your thinking on the slice points. If you want 3 segments, you need 2 slice points, and those slice points are all of the possible combinations of index positions in [1, end-1]. Sounds like a job for itertools.combinations!
This is only a couple lines of code, with the most complicated piece being the printout, and if you don't need to print, it gets easier.
from itertools import combinations as c

data = list('abcdefg')
k = 3
slice_point_sets = c(range(1, len(data)), k - 1)

# do the slicing
for point_set in slice_point_sets:
    start = 0
    for end in point_set:
        print(data[start:end], end=',')
        start = end
    print(data[end:])

# or pop it into a 3d array...
slice_point_sets = c(range(1, len(data)), k - 1)
result = []
for point_set in slice_point_sets:
    sublist = []
    start = 0
    for end in point_set:
        sublist.append(data[start:end])
        start = end
    sublist.append(data[end:])
    result.append(sublist)
Output:
['a'],['b'],['c', 'd', 'e', 'f', 'g']
['a'],['b', 'c'],['d', 'e', 'f', 'g']
['a'],['b', 'c', 'd'],['e', 'f', 'g']
['a'],['b', 'c', 'd', 'e'],['f', 'g']
['a'],['b', 'c', 'd', 'e', 'f'],['g']
['a', 'b'],['c'],['d', 'e', 'f', 'g']
['a', 'b'],['c', 'd'],['e', 'f', 'g']
['a', 'b'],['c', 'd', 'e'],['f', 'g']
['a', 'b'],['c', 'd', 'e', 'f'],['g']
['a', 'b', 'c'],['d'],['e', 'f', 'g']
['a', 'b', 'c'],['d', 'e'],['f', 'g']
['a', 'b', 'c'],['d', 'e', 'f'],['g']
['a', 'b', 'c', 'd'],['e'],['f', 'g']
['a', 'b', 'c', 'd'],['e', 'f'],['g']
['a', 'b', 'c', 'd', 'e'],['f'],['g']
I think this could work:
1. Get all possible sum partitions of len(your_list).
2. Filter them to those with len(partition) == k.
3. Get the itertools.permutations of k elements of all these partitions.
4. Now you have a list of tuples like this: [(1, 1, 2), (1, 2, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 1, 1)]
5. Clean this list of duplicates (make a set): {(1, 2, 1), (2, 1, 1), (1, 1, 2)}
6. For each permutation, pick that exact number of elements from your_list:
(1, 1, 2) -> [[a], [b], [c, d]],
(1, 2, 1) -> [[a], [b, c], [d]],
and so on.
Code
import itertools as it

start_list = ['a', 'b', 'c', 'd', 'e']
k = 3

def partitions(n, k=None):  # step 1 function
    if k is None:
        k = n
    if n == 0:
        return []
    return ([[n]] if n <= k else []) + [
        l + [i]
        for i in range(1, 1 + min(n, k))
        for l in partitions(n - i, i)]

final_list = []
partitions = filter(lambda x: len(x) == k, partitions(len(start_list)))  # steps 1-2
for partition in partitions:
    pickings = set(it.permutations(partition, k))  # steps 3-5
    for picking in pickings:
        temp = []
        i = 0
        for e in picking:  # step 6
            temp.append(start_list[i:i + e])
            i += e
        final_list.append(temp)

print(*final_list, sep='\n')
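To match the breakList signature from the question, the slice-points idea from the first answer can be wrapped up like this (a compact sketch, not code from either answer):
from itertools import combinations

def breakList(l: list, k: int):
    result = []
    # each way of choosing k-1 cut positions yields one partition into k parts
    for points in combinations(range(1, len(l)), k - 1):
        bounds = (0, *points, len(l))
        result.append([l[a:b] for a, b in zip(bounds, bounds[1:])])
    return result

print(breakList(list('abcd'), 3))
# [[['a'], ['b'], ['c', 'd']], [['a'], ['b', 'c'], ['d']], [['a', 'b'], ['c'], ['d']]]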

Return map values sorted by keys

Here is a simple example:
from pyspark.sql.functions import map_values
df = spark.sql("SELECT map('a', 1, 'c', 2, 'b', 3) as data")
df.show(20, False)
df.select(map_values("data").alias("values")).show()
What I want is the following, in the order of the keys 'a', 'b', 'c': [1, 3, 2].
How can I achieve this? In addition, does the result of the map_values function always maintain the order shown in df.show() above, i.e., [1, 2, 3]?
An option using map_keys:
from pyspark.sql import functions as F
df = spark.sql("SELECT map('a', 1, 'c', 2, 'b', 3) as data")
df = df.select(
F.transform(F.array_sort(F.map_keys("data")), lambda x: F.col("data")[x]).alias("values")
)
df.show()
# +---------+
# | values|
# +---------+
# |[1, 3, 2]|
# +---------+
A map's contract is that it delivers a value for a given key; the ordering of entries is not preserved. Keeping order is what arrays provide.
So what you can do is turn the map into an array with the map_entries function, sort the entries using array_sort, and then use transform to extract the values. A little convoluted, but it works:
with data as (SELECT map('a', 1, 'c', 2, 'b', 3) as m)
select
  transform(
    array_sort(
      map_entries(m),
      (left, right) -> case when left.key < right.key then -1 when left.key > right.key then 1 else 0 end
    ),
    e -> e.value
  )
from data;
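If you prefer the DataFrame API over SQL, the same entries-based idea can be sketched as below. This assumes Spark 3.1+ (Python lambdas in transform) and relies on array_sort's natural struct ordering, which, as I understand it, compares fields left to right, so the entries sort by key first:
from pyspark.sql import functions as F

df = spark.sql("SELECT map('a', 1, 'c', 2, 'b', 3) as data")

# map_entries gives an array of (key, value) structs; array_sort orders
# them by key (the first struct field), and transform extracts the values
df.select(
    F.transform(
        F.array_sort(F.map_entries("data")),
        lambda e: e["value"],
    ).alias("values")
).show()
# expected output row: [1, 3, 2]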

Python print/display only if the sum is not zero

I have a dataframe below:
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': ['-3', '3', '1', '5']})
I am using groupby and sum.
final = df.groupby(['Product'])['Volume'].sum().reset_index()
print(final)
This is ok.
But I want the print to carry only the rows where the sum != 0, like Product C and D.
Any idea how I can do that?
I tried to use:
if final != 0:
    print(final)
But this throws an error, and usually when I get this error the syntax is definitely wrong...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Your data frame has Volume as strings; is that intended? If you want to sum it like numbers, you have to convert it to numbers first, and then you can apply the filter.
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': ['-3', '3', '1', '5']})
# convert from strings to integers
df.Volume = df.Volume.map(lambda x: int(x))
final = df.groupby(['Product'])['Volume'].sum().reset_index()
# choose the rows with a non-zero sum
print(final[final.Volume != 0])
It will print only C & D.
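As an aside (not part of the answer above), astype is the more idiomatic pandas way to do that conversion; it is equivalent here:
df.Volume = df.Volume.astype(int)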
Given,
import pandas as pd
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': [-3, 3, 1, 5]})
final = df.groupby(['Product'])['Volume'].sum().reset_index()
Use boolean selection to select only the rows that match your criteria: df[some_series_of_booleans_based_on_condition]
print(final[final['Volume'] != 0])
#output:
Product Volume
1 C 1
2 D 5
The idea is that if [some series of booleans]: is ambiguous for Python to interpret (should it be true if any element is, or only if all are?), so pandas raises the ValueError you saw rather than guessing.
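If you ever do need a single truth value out of a Series, the error message itself points the way: state explicitly which reduction you mean.
mask = final['Volume'] != 0
print(mask.any())  # True if any sum is non-zero
print(mask.all())  # True only if every sum is non-zero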

How to retrieve lists of keys and values of a dictionary through a list comprehension?

This is an MWE that shows what I want to obtain, but using a for loop:
a = {'a':1, 'b':2, 'c':3, 'd':4}
b = []
c = []
for key, value in a.items():
    b.append(key)
    c.append(value)
print(b) # ['a', 'b', 'c', 'd']
print(c) # [1, 2, 3, 4]
I want to obtain the same result in one line using list comprehension.
b, c = [(key, value) for key, value in a.items()] results in an unpacking error, because it assigns the first and second items of a to b and c and then doesn't know where to unpack the remaining items. b, c = [key, value for key, value in a.items()] results in an error again, a syntax error this time.
b, c = map(list, zip(*a.items()))
print(b)
print(c)
This outputs:
['a', 'b', 'c', 'd']
[1, 2, 3, 4]
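Worth noting (an addition, not from the original answer): dicts preserve insertion order in Python 3.7+, so the two lists can also be taken directly, without zipping:
b = list(a)           # keys: ['a', 'b', 'c', 'd']
c = list(a.values())  # values: [1, 2, 3, 4]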

Calling a list of DataFrame index with index key value

df = pd.DataFrame([[3,3,3]]*4,index=['a','b','c','d'])
While we can extract a copy of a section of an index by specifying row numbers, like below:
i1=df.index[1:3].copy()
unfortunately we can't extract a copy of a section of an index by specifying keys (as with the df.loc method). When I try the below:
i2=df.index['a':'c'].copy()
I get the below error:
TypeError: slice indices must be integers or None or have an __index__ method
Is there an alternative way to get a subset of an index based on its keys? Thank you.
Simplest is loc with index:
i1 = df.loc['b':'c'].index
print (i1)
Index(['b', 'c'], dtype='object')
Or it is possible to use get_loc for positions:
i1 = df.index
i1 = i1[i1.get_loc('b') : i1.get_loc('d') + 1]
print (i1)
Index(['b', 'c', 'd'], dtype='object')
Alternative (note that searchsorted assumes a sorted index):
i1 = i1[i1.searchsorted('b') : i1.searchsorted('d') + 1]
print (i1)
Index(['b', 'c', 'd'], dtype='object')
Try using .loc (see the pandas documentation on indexing and selecting data):
i2 = df.loc['a':'c'].index
print(i2)
Output:
Index(['a', 'b', 'c'], dtype='object')
or
df.loc['a':'c'].index.tolist()
Output:
['a', 'b', 'c']
