I have two columns (such as):
from | to
   1 |  2
   1 |  3
   2 |  4
   4 |  2
   4 |  2
   4 |  3
   3 |  3
And I want to create a transition matrix (where the values in each column sum to 1):
      1.    2.    3.    4.
1.    0     0     0     0
2.    0.5*  0     0     2/3
3.    0.5   0.5   1     1/3
4.    0     0.5   0     0
where 1 -> 2 would be: (the number of times 1 (in 'from') appears next to 2 (in 'to')) / (the total number of times 1 points to any value).
You can create this kind of transition matrix using a window and pivot.
First some dummy data:
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.random.randint(1,5,100)
y = np.random.randint(1,5,100)
df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
| 3| 3|
| 4| 2|
| 1| 2|
...
To create the pct column, first group the data by unique combinations of from/to and get the counts. With that aggregated dataframe, create a new column, pct, that uses the Window to find the total number of records for each from group, which is used as the denominator.
Lastly, pivot the table to make the to values the columns and the pct data the values of the matrix.
from pyspark.sql import functions as F, Window
w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0.2| 0.2|0.25|0.35|
| 2|0.27|0.31|0.19|0.23|
| 3|0.46|0.17|0.21|0.17|
| 4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+
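If some from/to combinations never occur in the data, pivot leaves those cells as null rather than 0. A minimal follow-up sketch (assuming the res dataframe from above) fills them in so every cell of the matrix is populated:
# Replace the nulls produced by pivot for unseen from/to combinations with 0.0
res_dense = res.na.fill(0.0)
res_dense.show()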
I want to calculate the row-wise mean, subtract the mean from each value of the row, and get the maximum at the end.
Here is my dataframe:
col1 | col2 | col3
0 | 2 | 3
4 | 2 | 3
1 | 0 | 3
0 | 0 | 0
from pyspark.sql.functions import col

df = df.withColumn("mean_value", (sum(col(x) for x in df.columns[0:3]) / 3).alias("mean"))
I can calculate the row-wise mean with this line of code, but I want to subtract the mean value from each value of the row and get the maximum value of the row after the subtraction.
Required results:
col1 | col2 | col3 | mean_value | max_difference_value
   0 |    2 |    3 |       1.66 |                 1.34
   4 |    2 |    3 |        3.0 |                  1.0
   1 |    0 |    3 |       1.33 |                 1.67
   1 |    0 |    1 |       0.66 |                 0.66
Note, this is the main formula: abs(mean - column value).max()
Using greatest and a list comprehension:
from pyspark.sql import functions as func

# sample rows matching the output shown below
data_ls = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]

spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3']). \
    withColumn('mean_value', (sum(func.col(x) for x in ['col1', 'col2', 'col3']) / 3)). \
    withColumn('max_diff_val',
               func.greatest(*[func.abs(func.col(x) - func.col('mean_value')) for x in ['col1', 'col2', 'col3']])
               ). \
    show()
# +----+----+----+------------------+------------------+
# |col1|col2|col3| mean_value| max_diff_val|
# +----+----+----+------------------+------------------+
# | 0| 2| 3|1.6666666666666667|1.6666666666666667|
# | 4| 2| 3| 3.0| 1.0|
# | 1| 0| 3|1.3333333333333333|1.6666666666666667|
# | 0| 0| 0| 0.0| 0.0|
# +----+----+----+------------------+------------------+
Have you tried UDFs?
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf, col
import numpy as np

@udf(returnType=FloatType())
def udf_mean(col1, col2, col3):
    return float(np.mean([col1, col2, col3]))

df = df.withColumn("mean_value", udf_mean(col("col1"), col("col2"), col("col3")))
Similarly you can try for max difference value.
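For completeness, here is a sketch of such a UDF for the max difference (the name udf_max_diff is illustrative, not from the original answer):
@udf(returnType=FloatType())
def udf_max_diff(col1, col2, col3):
    vals = [col1, col2, col3]
    m = float(np.mean(vals))
    # maximum absolute deviation of the row values from the row mean
    return float(max(abs(v - m) for v in vals))

df = df.withColumn("max_diff_val", udf_max_diff(col("col1"), col("col2"), col("col3")))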
How can I pair the elements of each row with respect to a group?
id title comp
1 'A' 45
1 'B' 32
1 'C' 1
2 'D' 5
2 'F' 6
I want to pair rows if they have the same 'id'.
Output:
id title comp
1 'A','B' 45,32
1 'B','C' 32,1
2 'D','F' 5,6
Use a window function. Collect a list of the immediately consecutive elements in the target columns. Remove the array brackets by converting the resulting arrays into strings using array_join. The last row will have fewer elements, so filter out rows where the joined list contains only one element.
from pyspark.sql.functions import *
from pyspark.sql import Window

win = Window.partitionBy().orderBy(asc('title')).rowsBetween(0, 1)
df.select("id", *[array_join(collect_list(c).over(win), ',').alias(c) for c in df.drop('id').columns]) \
  .filter(length(col('comp')) > 1).show()
+---+-----+-----+
| id|title| comp|
+---+-----+-----+
| 1| A,B|45,32|
| 1| B,C| 32,1|
| 1| C,D| 1,5|
| 2| D,F| 5,6|
+---+-----+-----+
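Note that with an empty partitionBy() the window runs over the whole dataframe, which is why the C,D pair above crosses from id 1 into id 2. If you only want pairs within the same id, one variation (a sketch, not part of the original answer) is to partition the window by 'id' as well:
# Pairs are now built only inside each id group; the cross-group C,D row disappears.
win_id = Window.partitionBy('id').orderBy(asc('title')).rowsBetween(0, 1)
df.select("id", *[array_join(collect_list(c).over(win_id), ',').alias(c) for c in df.drop('id').columns]) \
  .filter(length(col('comp')) > 1).show()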
I have a spark dataframe like the input column below. It has a date column "dates" and a int column "qty". I would like to create a new column "daysout" that has the difference in days between the current date value and the first consecutive date where qty=0. I've provided example input and output below. Any tips are greatly appreciated.
input df:
dates qty
2020-04-01 1
2020-04-02 0
2020-04-03 0
2020-04-04 3
2020-04-05 0
2020-04-06 7
output:
dates qty daysout
2020-04-01 1 0
2020-04-02 0 0
2020-04-03 0 1
2020-04-04 3 2
2020-04-05 0 0
2020-04-06 7 1
Here is a possible approach: flag rows where the current qty is 0 and the lagged qty is not 0, take a running sum of that flag over an ordered window to form groups, then assign a row number within each group and subtract 1 to get your desired result:
import pyspark.sql.functions as F
w = Window().partitionBy().orderBy(F.col("dates"))
w1 = F.sum(F.when((F.col("qty")==0)&(F.lag("qty").over(w)!=0),1).otherwise(0)).over(w)
w2 = Window.partitionBy(w1).orderBy('dates')
df.withColumn("daysout",F.row_number().over(w2) - 1).show()
+----------+---+-------+
| dates|qty|daysout|
+----------+---+-------+
|2020-04-01| 1| 0|
|2020-04-02| 0| 0|
|2020-04-03| 0| 1|
|2020-04-04| 3| 2|
|2020-04-05| 0| 0|
|2020-04-06| 7| 1|
+----------+---+-------+
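To see why this works, it can help to materialize the running sum as its own column (the name grp here is just for illustration): each stretch of rows that starts where qty drops to 0 gets its own group value, and daysout is simply the 0-based position inside that group.
# For the sample data the grp column comes out as 0, 1, 1, 1, 2, 2:
# a new group starts whenever qty becomes 0 right after a non-zero value.
df.withColumn("grp", w1).withColumn("daysout", F.row_number().over(w2) - 1).show()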
I have a data frame with different data types in it.
I would like to determine the precision of the float types.
I can select only float64 with this code:
df_float64 = df.loc[:, df.dtypes == np.float64]
(not sure why columns with only 'Nan' values are also selected but this is just side note)
Now, to determine the precision, I am thinking about an approach like this:
precision = len(cell.split(".")[1])
if cell were a string. I would like the output in the form of a CSV with the maximum precision for each column.
So having data frame like this:
| A| B| C| D|
| 0.01|0.0923| 1.0| 1.2|
| 100.1| 203.3| 1.093| 1.9|
| 0.0| 0.23| 1.03| 1.0|
I would like to have this:
| A| B| C| D|
| 2| 4| 3| 1|
Is this possible using Pandas?
Thanks
You can use:
fillna first to remove NaNs
cast to str by astype
loop over the columns by apply or a list comprehension with a lambda function
for each column split, get the second value of the list by str[1] and get its len
get the max values - the output is a Series
convert the Series to a one-row DataFrame if necessary
a = df.fillna(0).astype(str).apply(lambda x: x.str.split('.').str[1].str.len()).max()
print (a)
A 2
B 4
C 3
D 1
dtype: int64
df = a.to_frame().T
print (df)
A B C D
0 2 4 3 1
Another solution:
df = df.fillna(0).astype(str)
a = [df[x].str.split('.').str[1].str.len().max() for x in df]
df = pd.DataFrame([a], columns=df.columns)
print (df)
A B C D
0 2 4 3 1
I think you are looking for applymap, i.e. if you have a dataframe df:
A B C D
0 0.01 0.0923 1.000 1.2
1 100.10 203.3000 1.093 1.9
2 0.00 0.2300 1.030 1.0
ndf = pd.DataFrame(df.astype(str).applymap(lambda x: len(x.split(".")[-1])).max()).T
If you have NaN you can use if/else, i.e.
ndf = pd.DataFrame(df.astype(str).applymap(lambda x: len(x.split(".")[-1]) if x != 'nan' else 0 ).max()).T
Output:
A B C D
0 2 4 3 1
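As a side note, on newer pandas versions (2.1+) applymap is deprecated in favour of DataFrame.map; the same idea written with map would look like this (a sketch, not from the original answer):
# elementwise map over the string-cast frame, then take the per-column max
ndf = pd.DataFrame(df.astype(str).map(lambda x: len(x.split(".")[-1]) if x != 'nan' else 0).max()).T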
The VectorIndexer in Spark indexes categorical features according to the frequency of the values. But I want to index the categorical features in a different way.
For example, with the dataset below, "a", "b", "c" will be indexed as 0, 1, 2 if I use the VectorIndexer in Spark. But I want to index them according to the label.
There are 4 rows indexed as 1, and among them 3 rows have feature 'a' and 1 row has feature 'c'. So here I will index 'a' as 0, 'c' as 1 and 'b' as 2.
Is there any convenient way to implement this?
label|feature
-----------------
1 | a
1 | c
0 | a
0 | b
1 | a
0 | b
0 | b
0 | c
1 | a
If I understand your question correctly, you are looking to replicate the behaviour of StringIndexer() on grouped data. To do so (in PySpark), we first define a UDF that will operate on a list column containing all the values per group. Note that elements with equal counts will be ordered arbitrarily.
from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def encoder(col):
    # Generate count per letter
    x = Counter(col)
    # Create a dictionary, mapping each letter to its rank
    ranking = {pair[0]: rank
               for rank, pair in enumerate(x.most_common())}
    # Use dictionary to replace letters by rank
    new_list = [ranking[i] for i in col]
    return(new_list)

encoder_udf = udf(encoder, ArrayType(IntegerType()))
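Since encoder is plain Python, you can sanity-check the ranking logic outside Spark before wrapping it in a UDF (the example list mirrors the label=1 group above):
# 'a' appears three times and 'c' once, so 'a' gets rank 0 and 'c' gets rank 1
print(encoder(['a', 'c', 'a', 'a']))   # [0, 1, 0, 0]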
Now we can aggregate the feature column into a list grouped by the column label using collect_list(), and apply our UDF row-wise:
from pyspark.sql.functions import collect_list, explode
df1 = (df.groupBy("label")
         .agg(collect_list("feature")
              .alias("features"))
         .withColumn("index",
                     encoder_udf("features")))
Finally, you can explode the index column to get the encoded values instead of the letters:
df1.select("label", explode(df1.index).alias("index")).show()
+-----+-----+
|label|index|
+-----+-----+
| 0| 1|
| 0| 0|
| 0| 0|
| 0| 0|
| 0| 2|
| 1| 0|
| 1| 1|
| 1| 0|
| 1| 0|
+-----+-----+
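If you also want to keep the original letters next to their encoded values rather than exploding the index column on its own, one option (a sketch using arrays_zip, available in Spark 2.4+; not part of the original answer) is to zip the two arrays before exploding:
from pyspark.sql.functions import arrays_zip
# Each exploded struct carries the letter and its rank side by side
df2 = df1.select("label", explode(arrays_zip("features", "index")).alias("z"))
df2.select("label", "z.features", "z.index").show()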