Question: Draw the diagram for a DFA that accepts all strings over the alphabet {t,u,v,w} such that all occurrences of symbols in {t,u} happen before any occurrences of the symbols in {v,w}, and the numbers of v’s and w’s are each even.
Hint: First consider the case of detecting {v,w} in various ways, making sure they're each even. Then add in {t,u}, making sure that any occurrences come before the {v,w}. Finally, remember that the empty string must be considered.
+-----+
| q0 |
+-----+
|
|t,u
v
+-----+
| q1 |
+-----+
|
|t,u
v
+-----+
| q2 |
+-----+
|
|t,u
v
+-----+
| q3 |
+-----+
|
|v
v
+-----+
| q4 |
+-----+
|
|v
v
+-----+
| q5 |
+-----+
|
|w
v
+-----+
| q6 |
+-----+
|
|w
v
+-----+
| q7 |
+-----+
Explanation:
The transitions from q0 to q3 are all labeled with t and u to handle all occurrences of the symbols in {t,u} before any occurrences of the symbols in {v,w}. The transitions from q3 to q7 are all labeled with v and w to handle the even number of v and w occurrences. The final state q7 is the accepting state, and all other states are non-accepting states.
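To sanity-check a candidate diagram against the language definition, a brute-force membership test can be used (plain Python, only a sketch):
def in_language(s):
    # All t/u must appear before the first v/w, and v and w must each occur an even
    # number of times; the empty string satisfies both conditions.
    seen_vw = False
    for ch in s:
        if ch not in "tuvw":
            return False
        if ch in "vw":
            seen_vw = True
        elif seen_vw:
            return False
    return s.count("v") % 2 == 0 and s.count("w") % 2 == 0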
Please let me know if there's an error here. Thanks!
I'd like to transpose data that is in two columns (A:B). For example, in the input below (input 1), the values A, B, C and D in column A each appear 3 times.
With this input (input 1) I use this formula and get the transposition correctly, as in the image below:
=INDEX($B$1:$B$12,4*ROWS(D$2:D2)+COLUMNS($D2:D2)-4)
INPUT1
+---+----+
| A | 2 |
+---+----+
| B | 3 |
+---+----+
| C | 4 |
+---+----+
| D | 1 |
+---+----+
| A | 6 |
+---+----+
| B | 12 |
+---+----+
| C | 4 |
+---+----+
| D | 76 |
+---+----+
| A | 1 |
+---+----+
| B | 2 |
+---+----+
| C | 37 |
+---+----+
| D | 9 |
+---+----+
But if the input changes (input 2) in such a way that A and B appear fewer times than C and D, my output is incorrect.
INPUT2
+---+----+
| A | 2 |
+---+----+
| B | 3 |
+---+----+
| C | 4 |
+---+----+
| D | 1 |
+---+----+
| C | 4 |
+---+----+
| D | 76 |
+---+----+
| C | 37 |
+---+----+
| D | 9 |
+---+----+
| A | 47 |
+---+----+
| B | 2 |
+---+----+
| C | 37 |
+---+----+
| D | 9 |
+---+----+
I show the incorrect output and the expected output in the image below.
Thanks in advance for any help.
Here is a fairly simple approach based on counting how many cells have been filled in so far:
=IF(INDEX($A:$A,COUNT($D$1:$G1)+COUNT($C2:C2)+1)=D$1,
INDEX($B:$B,COUNT($D$1:$G1)+COUNT($C2:C2)+1),"")
copied down and across starting from D2. Assumes that a blank column is available in column C.
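To make the counting idea concrete, here is a small Python sketch of the same fill rule (illustration only, not part of the worksheet), using the headers A–D and the data from input 2:
headers = ["A", "B", "C", "D"]
source = [("A", 2), ("B", 3), ("C", 4), ("D", 1), ("C", 4), ("D", 76),
          ("C", 37), ("D", 9), ("A", 47), ("B", 2), ("C", 37), ("D", 9)]

rows, i = [], 0
while i < len(source):
    row = []
    for header in headers:
        # A cell is filled only if the next unconsumed source row matches this header;
        # COUNT() of the cells filled so far plays the role of i in the formula above.
        if i < len(source) and source[i][0] == header:
            row.append(source[i][1])
            i += 1
        else:
            row.append("")
    rows.append(row)
print(rows)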
If you want to make it more dynamic but also want it to work for versions of Excel earlier than Microsoft 365, it unfortunately gets a bit ugly. You can use a conventional way of listing out the unique values in column A in alphabetical order to get the headers:
=IFERROR(INDEX($A:$A, MATCH(SMALL(IF((COUNTIF(C$1:$C1, $A$1:INDEX($A:$A,COUNTA($A:$A)))=0), COUNTIF($A$1:INDEX($A:$A,COUNTA($A:$A)), "<"&$A$1:INDEX($A:$A,COUNTA($A:$A))), ""), 1), COUNTIF($A$1:INDEX($A:$A,COUNTA($A:$A)), "<"&$A$1:INDEX($A:$A,COUNTA($A:$A))), 0)),"")
adapted from this, pulled across as required (say, to column Z).
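In spirit, that header formula just lists the distinct column A labels in alphabetical order; in plain Python terms (illustration only, reusing the source list from the sketch above):
headers = sorted({label for label, _ in source})   # -> ['A', 'B', 'C', 'D']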
Then a slight modification to the main formula to avoid zeroes appearing under the blank headers:
=IF(AND(INDEX($A:$A,COUNT($D$1:$Z1)+COUNT($C2:C2)+1)=D$1,D$1<>""),
INDEX($B:$B,COUNT($D$1:$Z1)+COUNT($C2:C2)+1),"")
Copied down and across as far as column Z.
I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example, I have the following table:
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" is present in Col1. So the result would look like:
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zippering but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
Spark does not provide any default function to achieve this kind of functionality. I would most probably do it this way:
// inputDF contains Col1 | Col2
import spark.implicits._  // needed for toDF on an RDD of tuples
val df = inputDF.select("Col1").distinct.rdd.zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }
  .toDF("Col1", "Col3")
val finalDF = inputDF.join(df, Seq("Col1"), "left")
but the problem I can see here is the join, which will result in a shuffle.
You can also check other auto-increment APIs here.
Use a window and sum the value 1 over the window whenever Col1 = A.
import pyspark.sql.functions as f
from pyspark.sql import Window

# Single global window with a running frame from the first row to the current row.
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Running sum of 1 for every row where Col1 == 'A', 0 otherwise.
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
| B| 543| 0|
| A| 1231| 1|
| B|14234| 1|
| B|34234| 1|
| B| 3434| 1|
| A|43242| 2|
| B|43242| 2|
| B|56453| 2|
+----+-----+----+
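If you would rather not rely on the DataFrame's current row order (the question notes that the initial order matters), one possible variant is to capture an explicit id first; this is only a sketch, the row_id column name is made up, and it assumes the DataFrame still reflects ingestion order when the id is added:
import pyspark.sql.functions as f
from pyspark.sql import Window

# Freeze the current row order in an id column, then order the running window by it.
df_with_id = df.withColumn('row_id', f.monotonically_increasing_id())
w = Window.orderBy('row_id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_with_id.withColumn('Col3',
                      f.sum(f.when(f.col('Col1') == 'A', 1).otherwise(0)).over(w)) \
          .drop('row_id').show()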
I have the following Apache Spark DataFrame (DF1):
function_name | param1 | param2 | param3 | result
---------------------------------------------------
f1 | a | b | c | 1
f1 | b | d | m | 0
f2 | a | b | c | 0
f2 | b | d | m | 0
f3 | a | b | c | 1
f3 | b | d | m | 1
f4 | a | b | c | 0
f4 | b | d | m | 0
First of all, I'd like to group the DataFrame by function_name, collect the results into an ArrayType column, and receive the new DataFrame (DF2):
function_name | result_list
--------------------------------
f1 | [1,0]
f2 | [0,0]
f3 | [1,1]
f4 | [0,0]
Right after that, I need to collect function_name values into an ArrayType column by grouping on result_list, and I'll receive a new DataFrame like the following (DF3):
result_list | function_name_lists
------------------------------------
[1,0] | [f1]
[0,0] | [f2,f4]
[1,1] | [f3]
So, I have a question: first of all, can I group by an ArrayType column in Apache Spark? If so, a single result_list ArrayType field can potentially hold tens of millions of values. Will Apache Spark be able to group by the result_list column in this case?
Yes you can do that.
Creating your data frame:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.types import *
data = [['f1', 'a', 'b', 'c', 1],
        ['f1', 'b', 'd', 'm', 0],
        ['f2', 'a', 'b', 'c', 0],
        ['f2', 'b', 'd', 'm', 0],
        ['f3', 'a', 'b', 'c', 1],
        ['f3', 'b', 'd', 'm', 1],
        ['f4', 'a', 'b', 'c', 0],
        ['f4', 'b', 'd', 'm', 0]]
df = spark.createDataFrame(data, ['function_name', 'param1', 'param2', 'param3', 'result'])
df.show()
+-------------+------+------+------+------+
|function_name|param1|param2|param3|result|
+-------------+------+------+------+------+
| f1| a| b| c| 1|
| f1| b| d| m| 0|
| f2| a| b| c| 0|
| f2| b| d| m| 0|
| f3| a| b| c| 1|
| f3| b| d| m| 1|
| f4| a| b| c| 0|
| f4| b| d| m| 0|
+-------------+------+------+------+------+
Grouping by function_name and collecting result into result_list (using collect_list over a window ordered by param1, param2, param3), then grouping by result_list:
w = Window.partitionBy("function_name").orderBy("param1", "param2", "param3")
w1 = Window.partitionBy("function_name")

# Build the running list of results per function_name, keep only the last row of each
# group (where the running list is complete), then drop the helper columns.
df1 = (df.withColumn("result_list", F.collect_list("result").over(w))
         .withColumn("result2", F.row_number().over(w))
         .withColumn("result3", F.max("result2").over(w1))
         .filter(F.col("result2") == F.col("result3"))
         .drop("param1", "param2", "param3", "result", "result2", "result3"))
df1.groupBy("result_list")\
.agg(F.collect_list("function_name").alias("function_name_list")).show()
+-----------+------------------+
|result_list|function_name_list|
+-----------+------------------+
| [1, 0]| [f1]|
| [1, 1]| [f3]|
| [0, 0]| [f2, f4]|
+-----------+------------------+
For doing further analysis, transformation or cleaning on array type columns I would recommend you check out the new higher-order functions in Spark 2.4 and above.
(collect_list will work for Spark 1.6 and above)
Higher-order functions in open-source Spark:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.collect_list
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_contains onwards
Databricks releases:
Link: https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
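As a small illustration of those higher-order functions (a sketch only; assumes Spark 2.4+ and the df1 built above), a lambda can be applied to the array column through expr:
# Flip each 0/1 in result_list, purely as an example of transform().
df1.withColumn("flipped", F.expr("transform(result_list, x -> 1 - x)")).show()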
I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
|ID | Name1 | Name2 |
+------+---------+---------+
| 1 | A | B |
| 2 | C | D |
| 3 | E | F |
+------+---------+---------+
df2:
+------+-------+
|ID | key |
+------+-------+
| 1 | w |
| 1 | x |
| 2 | y |
| 3 | z |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
|ID | Name1 | Name2 | keys |
+------+---------+---------+---------+
| 1 | A | B | w,x |
| 2 | C | D | y |
| 3 | E | F | z |
+------+---------+---------+---------+
Ultimately, I want to find a solution for an arbitrary number of keys.
My attempt in PySpark:
def get_keys(id):
    x = df2.where(df2.ID == id).select('key')
    return x

df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a DataFrame. Since the second argument of the .withColumn function needs to be a Column, I am not sure how to convert x correctly.
You are looking for the collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2|  keys|
#+---+-----+-----+------+
#|  1|    A|    B|[w, x]|
#|  2|    C|    D|   [y]|
#|  3|    E|    F|   [z]|
#+---+-----+-----+------+
If you want only unique keys, you can use collect_set.
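For example, the same grouping with unique keys only:
from pyspark.sql.functions import collect_set
df3.groupBy('ID','Name1','Name2').agg(collect_set('key').alias('keys')).show()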
I have this big dataframe, 7 million lines long, and I need to add a column that counts how many times a certain person (identified by an Integer) has come up before, like:
| Reg | randomdata |
| 123 | yadayadayada |
| 246 | yedayedayeda |
| 123 | yadeyadeyade |
| 369 | adayeadayead |
| 123 | yadyadyadyad |
to ->
| Reg | randomdata | count
| 123 | yadayadayada | 1
| 246 | yedayedayeda | 1
| 123 | yadeyadeyade | 2
| 369 | adayeadayead | 1
| 123 | yadyadyadyad | 3
I have already done a groupBy to find out how many times each got repeated, but I need this running count for a machine learning exercise, to get the probability of repetition according to how many times it has happened before.
The following assumes that "randomness" can mean the same random value occurring more than once. It uses Spark SQL with a temp view, but the same thing can also be done with the DataFrame API and select:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window._

case class xyz(k: Int, v: String)

val ds = Seq(
  xyz(1, "917799423934"),
  xyz(2, "019331224595"),
  xyz(3, "8981251522"),
  xyz(3, "8981251522"),
  xyz(4, "8981251522"),
  xyz(1, "8981251522"),
  xyz(1, "uuu4553")).toDS()

ds.createOrReplaceTempView("XYZ")

// Number every row with a global sequence ordered by k, then dense_rank within each k
// turns that sequence into a per-key occurrence counter.
spark.sql("""select z.k, z.v, dense_rank() over (partition by z.k order by z.seq) as seq
             from (select k, v, row_number() over (order by k) as seq from XYZ) z""").show
returning:
+---+------------+---+
| k| v|seq|
+---+------------+---+
| 1|917799423934| 1|
| 1| 8981251522| 2|
| 1| uuu4553| 3|
| 2|019331224595| 1|
| 3| 8981251522| 1|
| 3| 8981251522| 2|
| 4| 8981251522| 1|
+---+------------+---+
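As noted above, the same thing can also be done with the DataFrame API; a rough PySpark sketch for the original Reg/randomdata columns (df and the row_id helper are placeholder names; row_id is only there to pin down the current row order):
import pyspark.sql.functions as f
from pyspark.sql import Window

# Running occurrence number of each Reg value, in the captured row order.
df_ordered = df.withColumn('row_id', f.monotonically_increasing_id())
w = Window.partitionBy('Reg').orderBy('row_id')
df_ordered.withColumn('count', f.row_number().over(w)).drop('row_id').show()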
You can do something like this:
import org.apache.spark.sql.functions.{col, collect_list, udf}

// Collect each Reg's randomdata values into a list, then count them with a UDF.
val countrds = udf((rds: Seq[String]) => rds.length)
val df2 = df1.groupBy(col("Reg"))
  .agg(collect_list(col("randomdata")).alias("rds"))
  .withColumn("count", countrds(col("rds")))
df2.select("Reg", "rds", "count").show()
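For reference, the same count-per-Reg step can skip the UDF in PySpark by using the built-in size function (a sketch; df1 here is assumed to be a PySpark DataFrame with the same columns):
import pyspark.sql.functions as F

# Total number of collected randomdata values per Reg.
df1.groupBy('Reg').agg(F.size(F.collect_list('randomdata')).alias('count')).show()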