AnalysisException in Spark NLP - apache-spark

Below is my code block:
conll_data.select(F.explode(F.arrays_zip('token.result', 'label.result')).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth")) \
    .groupBy('ground_truth') \
    .count() \
    .orderBy('count', ascending=False) \
    .show(100, truncate=False)
and I am getting the error below:
AnalysisException: No such struct field 0 in result, result
and my requirements.txt is as follows:
jupyterlab
SQLAlchemy==0.7.1
spark-nlp==3.4.4
pyspark==3.1.2
numpy== 1.19.2
pandas==1.3.2
openpyxl==3.0.9
jupyter_contrib_nbextensions
spark-nlp-display
pyarrow==3.0.0
streamlit==1.1.0
scipy==1.7.3
Tensorflow==2.5.0
tensorflow-addons
python==3.7.4

I ran into this issue the other day. arrays_zip produces a struct whose fields are given numeric names such as 0 and 1. That is a problem because bare numbers are reserved in SQL, so you have to escape the field name for Spark SQL to treat it as an identifier. This is done with backticks, so the fields can be referred to as `0` and `1`. You need to refer to them as struct fields, not with array-style indexing like cols['0'].
conll_data.select(F.explode(F.arrays_zip('token.result', 'label.result')).alias("cols")) \
    .select(F.col("cols.`0`").alias("token"),
            F.col("cols.`1`").alias("ground_truth")) \
    .groupBy('ground_truth') \
    .count() \
    .orderBy('count', ascending=False) \
    .show(100, truncate=False)
# Minimal example to show this works (df can be any DataFrame):
from pyspark.sql.functions import arrays_zip, array_repeat, lit
df.select(arrays_zip(array_repeat(lit(0), 10), array_repeat(lit(1), 10)).alias("cols")) \
    .select("cols.`1`").show()
+--------------------+
| 1|
+--------------------+
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
|[1, 1, 1, 1, 1, 1...|
+--------------------+
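If you are unsure what the zipped struct's fields are actually called in your Spark version, printing the schema makes it obvious (the error message "No such struct field 0 in result, result" is telling you which field names it did find). A minimal sketch, assuming the same conll_data DataFrame and the F alias from the question:
import pyspark.sql.functions as F

# Show the element type of the zipped array; the struct field names printed
# here are the ones you need to reference (with backticks if they are numeric).
conll_data.select(F.arrays_zip('token.result', 'label.result').alias("cols")) \
    .printSchema()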

Related

Calculate cumulative sum of pyspark array column

I have a spark dataframe with an array column that looks like this:
+--------------+
| x |
+--------------+
| [1, 1, 0, 1] |
| [0, 0, 0, 0] |
| [0, 0, 1, 1] |
| [0, 0, 0, 1] |
| [1, 0, 1] |
+--------------+
I want to add a new column with another array that contains the cumulative sum of x at each index. The result should look like this:
+--------------+---------------+
| x | x_running_sum |
+--------------+---------------+
| [1, 1, 0, 1] | [1, 2, 2, 3] |
| [0, 0, 0, 0] | [0, 0, 0, 0] |
| [0, 0, 1, 1] | [0, 0, 1, 2] |
| [0, 0, 0, 1] | [0, 0, 0, 1] |
| [1, 0, 1] | [1, 1, 2] |
+--------------+---------------+
How can I create the x_running_sum column? I've tried using some of the higher order functions like transform, aggregate, and zip_with, but I haven't found a solution yet.
To perform a cumulative sum, I sliced the array by index position and reduced the values in each slice:
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(x=[1, 1, 0, 1]),
    Row(x=[0, 0, 0, 0]),
    Row(x=[0, 0, 1, 1]),
    Row(x=[0, 0, 0, 1]),
    Row(x=[1, 0, 1])
])

(df
 .selectExpr('x', "TRANSFORM(sequence(1, size(x)), index -> REDUCE(slice(x, 1, index), CAST(0 AS BIGINT), (acc, el) -> acc + el)) AS x_running_sum")
 .show(truncate=False))
Output
+------------+-------------+
|x |x_running_sum|
+------------+-------------+
|[1, 1, 0, 1]|[1, 2, 2, 3] |
|[0, 0, 0, 0]|[0, 0, 0, 0] |
|[0, 0, 1, 1]|[0, 0, 1, 2] |
|[0, 0, 0, 1]|[0, 0, 0, 1] |
|[1, 0, 1] |[1, 1, 2] |
+------------+-------------+
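If you would rather avoid the SQL-expression string, the same result can be computed with a window function over the exploded array positions. A sketch, assuming the df defined above (the row_id, pos, val, running and pairs names are just illustrative helpers):
import pyspark.sql.functions as F
from pyspark.sql import Window

# Give each row a stable id, explode the array with positions,
# take a running sum over the positions, then fold it back into an array.
w = Window.partitionBy("row_id").orderBy("pos")

result = (
    df.withColumn("row_id", F.monotonically_increasing_id())
      .select("row_id", "x", F.posexplode("x").alias("pos", "val"))
      .withColumn("running", F.sum("val").over(w))
      .groupBy("row_id", "x")
      .agg(F.sort_array(F.collect_list(F.struct("pos", "running"))).alias("pairs"))
      .select("x", F.col("pairs.running").alias("x_running_sum"))
)
result.show(truncate=False)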

Why do other values change in an ndarray when I try to change a specific cell value?

For example, I have a 3D ndarray of shape (10, 10, 10), and whenever I try to change all the cells in the section [5, :, 9] to a specific single value I end up changing values in the section [4, :, 9] too, which makes no sense to me. I do not get this behavior when I convert to a list of lists.
I use a simple for loop:
for i in range(0, 10):
    matrix[5, i, 9] = matrix[5, 9, 9]
Is there any way to avoid this? I do not get this behavior when using a list of lists, but I don't want to convert back and forth between the two as it takes too much processing time.
Doesn't happen that way for me:
In [232]: arr = np.ones((10,10,10),int)
In [233]: arr[5,9,9] = 10
In [234]: for i in range(10): arr[5,i,9]=arr[5,9,9]
In [235]: arr[5,:,9]
Out[235]: array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
In [236]: arr[4,:,9]
Out[236]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
or assigning a whole "column" at once:
In [237]: arr[5,:,9] = np.arange(10)
In [239]: arr[5]
Out[239]:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 3],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 4],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 5],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 6],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 7],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 8],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 9]])
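As a side note, the loop in the question can be replaced by a single slice assignment, which is faster and easier to reason about. A small sketch (the name matrix is taken from the question):
import numpy as np

matrix = np.ones((10, 10, 10), int)
matrix[5, 9, 9] = 10

# Assign the whole "column" in one step instead of looping over i
matrix[5, :, 9] = matrix[5, 9, 9]

print(matrix[5, :, 9])   # all 10s
print(matrix[4, :, 9])   # untouched, still all 1s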

How to get the topic using pyspark LDA

I have used LDA to find topics:
from pyspark.ml.clustering import LDA
lda = LDA(k=30, seed=123, optimizer="em", maxIter=10, featuresCol="features")
ldamodel = lda.fit(rescaledData)
When I run the code below I get the result with topic, termIndices and termWeights:
ldatopics = ldamodel.describeTopics()
+-----+-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices |termWeights |
+-----+-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13643518321703474, 0.09550636070102715, 0.0948398694152729, 0.05766480874922468, 0.04482014536392716, 0.04435413761288223, 0.04390248330822808, 0.042949351875241376, 0.039792489008412854, 0.03959557452696915] |
|1 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13650714786977475, 0.09549302392463813, 0.09473573233391298, 0.05765172919949817, 0.04483412655497465, 0.04437734133645192, 0.043917280973977395, 0.042966507550942196, 0.03978157454461109, 0.039593095881148545] |
|2 |[18, 14, 1, 2, 6, 15, 19, 0, 5, 11]|[0.13628285951099658, 0.0960680469091329, 0.0957882659160997, 0.05753830851677705, 0.04476298485932895, 0.04426297294164529, 0.043901044582340995, 0.04291495373567377, 0.039780501358024425, 0.03946491774918109] |
|3 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13674345908386384, 0.09541103313663375, 0.09430192089972703, 0.05770255869404108, 0.044821207138198024, 0.04441671155466873, 0.043902024784365994, 0.04294093494951478, 0.03981953622791824, 0.039614320304130965] |
|4 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.1364122091134319, 0.0953545543783122, 0.09429906593265366, 0.05772444907583193, 0.044862863343866806, 0.04442734201228477, 0.04389557512474934, 0.04296443267889805, 0.03982659626276644, 0.03962640467091713] |
|5 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13670139028438008, 0.09553267874876908, 0.0946605951853061, 0.05768305061621247, 0.04480375989378822, 0.04435320773212808, 0.043914956101645836, 0.0429208816887896, 0.03981783866996495, 0.0395571526370012] |
|6 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13655841155238757, 0.09554895256001286, 0.0949549658561804, 0.05762248195720487, 0.04480233168639564, 0.04433230827791344, 0.04390710584467472, 0.042930742282966325, 0.039814043132483344, 0.039566695745809] |
|7 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13663380500078914, 0.09550193655114965, 0.09478283807049456, 0.05766262514700366, 0.04480957336655386, 0.044349558779903736, 0.04392217503495675, 0.04293742117414056, 0.03978830895805316, 0.03953804442817271] |
|8 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13657941125100242, 0.09553782026082008, 0.09488224957791296, 0.0576562888735838, 0.04478320710416449, 0.044348589637858433, 0.043920567136734125, 0.04291002130418712, 0.03979768943659999, 0.039567161368513126] |
|9 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13679375890331896, 0.09550217013809542, 0.09456405056150366, 0.057658739173097585, 0.04482551122651224, 0.0443527568707439, 0.04392317149752475, 0.04290757658934661, 0.03979803663250257, 0.03956130101562635] |
|10 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.1363532425412472, 0.09567087909028642, 0.09540557719048867, 0.0576159180949091, 0.044781629751243876, 0.04429913174899199, 0.04391116942509204, 0.042932143462292065, 0.03978668768070522, 0.03951443679027362] |
|11 |[18, 1, 14, 2, 6, 15, 19, 0, 5, 11]|[0.13652547151289018, 0.09563212101150799, 0.09559109602844051, 0.057554258392101834, 0.044746350635605996, 0.044288429293621624, 0.04393701569930993, 0.04291245197810508, 0.03976607824588965, 0.039502323331991364]|
+-----+-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Then I ran the code below for transforming, which gives my whole input data with a "topicDistribution" column. I want to know how I can map each row to its corresponding topic from above.
transformed = ldamodel.transform(rescaledData)
transformed.show(1000,truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |topicDistribution |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(20,[0,1,2,5,6,7,9,10,11,12,14,15,17,18],[1.104871083111864,2.992600400740947,2.6744520065699304,1.3684833461617696,2.2505718610977468,1.667281425641495,0.4887254650789885,1.6871692149804491,1.2395584596083011,1.1654793679713804,1.1856289804763125,0.8839314023462759,1.3072768360965874,1.8484873888968552]) |[0.033335920691559356,0.03333070077028474,0.033333697225675085,0.03333561564482505,0.03333642468043084,0.033329348877568624,0.033332822489008374,0.0333260618235691,0.0333303226504357,0.03333271017792987,0.03332747782859632,0.03332947925144302,0.03332588375760626,0.03333450406000825,0.03333113684458691,0.03333479529731098,0.03333650038104409,0.03333779171890448,0.033330895429371024,0.03333346818374588,0.03333587374669536,0.03333749064143864,0.033334190553966234,0.03333150654339392,0.03333448663832666,0.03333989460983954,0.03333245526240933,0.03333885169495637,0.03333318824494455,0.03333650428012542] |
|(20,[1,3,9,10,11,18],[0.997533466913649,1.2726157969776226,1.4661763952369655,1.6871692149804491,1.2395584596083011,1.2323249259312368]) |[0.03333282248535624,0.033332445609854544,0.033328001364314915,0.03333556680774651,0.03333511480351705,0.03333350870876004,0.03333187395000147,0.03333445718728295,0.03333444971729848,0.03333408945721941,0.033330403594416406,0.03333043254420633,0.03333157466407796,0.033332770049699215,0.033328359180981044,0.03333228442030213,0.03333371386494732,0.03333681533414124,0.0333317544337193,0.0333325923806738,0.03333507835375717,0.0333340926283776,0.033334289011977415,0.03333710235559251,0.033333843031028085,0.03333431818041727,0.03333321848222645,0.03333712581455898,0.03333342489994699,0.033334476683601115] |
|(20,[0,1,2,5,6,7,9,10,11,12,14,15,17,18],[1.104871083111864,2.992600400740947,2.6744520065699304,1.3684833461617696,2.2505718610977468,1.667281425641495,0.4887254650789885,1.6871692149804491,1.2395584596083011,1.1654793679713804,1.1856289804763125,0.8839314023462759,1.3072768360965874,1.8484873888968552]) |[0.03333351262221663,0.03334603652469889,0.03332161942705381,0.03333027102774255,0.03333390930010211,0.03333191764996061,0.033329934469446876,0.033333299309223546,0.033341041908943575,0.0333327792244044,0.0333317683142701,0.03332955596634091,0.033328132222973345,0.033337885062247365,0.03332933802183438,0.033340558825094374,0.03332932381968907,0.033330915221236455,0.03333188695334856,0.033327196223241304,0.0333344223760765,0.03332799716207421,0.033338186672894565,0.033336730538183736,0.03333440038338333,0.033337949318794795,0.033333179017769575,0.03333720003330217,0.0333312634707931,0.033337788932659276] |
|(20,[7,10,14,15,18],[1.667281425641495,1.6871692149804491,1.1856289804763125,0.8839314023462759,1.2323249259312368]) |[0.03333299162768077,0.03333267864965697,0.03333419860453984,0.0333335178997543,0.03333360724598342,0.03333242559458429,0.03333347136144222,0.03333355459499807,0.03333358147457144,0.03333288221459248,0.033333183195193086,0.03333412789444317,0.033334780263427975,0.033333456529848454,0.033336348770946406,0.03333413810342763,0.03333364116227056,0.03333271328980694,0.033333953427323364,0.03333358082109339,0.03333219641721605,0.03333309878777325,0.03333357447436248,0.03333149527336071,0.03333364840046119,0.03333405705432205,0.03333193039244976,0.033332142999523764,0.0333329476480237,0.033332075826922346] |
|(20,[7,10,14,15,18],[1.667281425641495,1.6871692149804491,1.1856289804763125,0.8839314023462759,1.2323249259312368]) |[0.033333498405871936,0.033333280285774286,0.03333426144981626,0.033332860785526656,0.033333582231251914,0.03333268142499894,0.0333323342942777,0.033333434542768936,0.03333306424165755,0.0333326212718143,0.03333359821538673,0.03333408970522017,0.03333440903364261,0.033333628480934824,0.033335988240581156,0.03333411840979216,0.03333341827050789,0.033332367335435646,0.033334058739684515,0.03333355988453005,0.033332524988062315,0.03333411721155432,0.033333323503042835,0.03333212784312619,0.0333335375674124,0.03333359055906317,0.033332183680577915,0.033332671344198254,0.03333288800754154,0.03333218004594688] |
|(20,[2,5,9,16,17,18],[0.8914840021899768,1.3684833461617696,0.4887254650789885,1.63668714555745,2.614553672193175,0.6161624629656184]) |[0.033334335349062126,0.033335078498013884,0.03332587515068904,0.03333663657766274,0.03333680954467168,0.03333423911759284,0.03333228558123402,0.03333307701307248,0.03333230070449398,0.03333357555321742,0.03333013898196302,0.033329558017855816,0.033331357742536066,0.03333281031681836,0.03332620861451679,0.03333164736607878,0.033333411422713344,0.03333811924641809,0.03333032668669555,0.03333155312673253,0.03333601811939769,0.03333261877594549,0.03333490102064452,0.033338814952549214,0.03333297326837463,0.03333518474628696,0.03333659535650031,0.033334882391957275,0.03333262407795885,0.03333604267834657] |
|(20,[9,14,17],[0.4887254650789885,1.1856289804763125,1.3072768360965874]) |[0.03333373548732593,0.03333366047427678,0.0333388921917903,0.0333324500674453,0.03333165231618069,0.033333396263220766,0.03333392152736817,0.03333280784507405,0.03333341964532808,0.033332027450640234,0.03333590751407632,0.03333625709954709,0.033335027917460715,0.033332984179205334,0.0333395676850832,0.03333465555831355,0.03333317988602309,0.03332999573625669,0.033335686662720146,0.03333444862817325,0.033331118794333245,0.03333227594267816,0.03333337815403729,0.03332815318734183,0.03333329306551438,0.033332463398394074,0.03333211180252395,0.03332955750997137,0.03333267853415129,0.033331295475544864] |
|(20,[0,1,2,5,6,8,9,12,14,19],[4.419484332447456,8.97780120222284,0.8914840021899768,1.3684833461617696,1.1252859305488734,1.233681992250468,0.4887254650789885,1.1654793679713804,42.68264329714725,3.685472474556911]) |[0.03333847676223903,0.03333217319316553,0.03361190962478921,0.03322296565909072,0.033224152351515865,0.033305780550557426,0.0333623247307499,0.03333530792027193,0.03335048739584223,0.03328160880463601,0.0334726823486143,0.03350523767287266,0.0334249096308398,0.0333474750843313,0.033626678634369765,0.033399676341965154,0.033312708717161896,0.03316394279048787,0.033455734324711994,0.033388088146836026,0.03321934579333965,0.03331319196962562,0.03332496308901043,0.03308256004388311,0.033355283259340625,0.03328280644324275,0.0332436575330529,0.03319398288438216,0.03331402807784297,0.03320786022123095] |
|(20,[2,8,10,13,15,16],[0.8914840021899768,3.701045976751404,1.6871692149804491,1.4594938792746623,0.8839314023462759,1.63668714555745]) |[0.0333356286958193,0.033332968808115676,0.03332196061288262,0.03333752852282297,0.03334171827141663,0.033333866506025614,0.03332902614535797,0.03332999721261129,0.03333228759033127,0.03333419089119785,0.033326295405433394,0.03332727748411373,0.03332970492260472,0.03333218324896459,0.033323419320752445,0.03333484772134479,0.033331408271507906,0.03333886892339302,0.033330223503670105,0.03333357612919576,0.033337166312023804,0.03333359583540509,0.033335058229509947,0.03334147190850166,0.03333378789809434,0.033336369234199595,0.033338417163238966,0.03333500103939193,0.03333456204893585,0.03333759214313734] |
|(20,[0,1,5,9,11,14,18],[2.209742166223728,1.995066933827298,1.3684833461617696,0.4887254650789885,1.2395584596083011,1.1856289804763125,4.313137240759328]) |[0.033331568372787335,0.03333199604802443,0.03333616626240888,0.03333311434679797,0.03332871558684149,0.03333377685027234,0.03333412369772984,0.03333326223955321,0.03333302350234804,0.033334011390192264,0.033332998734646985,0.03333572338531835,0.03333519102733107,0.033333669722909055,0.03333367298003543,0.03333316052548548,0.03333434503087792,0.03333204101430685,0.03333390212404242,0.03333317825522851,0.03333468470990933,0.0333342373715466,0.03333384912615554,0.03333262239313073,0.033332130871257,0.03333157173270879,0.033331620213171494,0.03333491236454045,0.03333364390007425,0.03333308622036798] |
|(20,[0,1,2,7,9,12,14,16,17],[1.104871083111864,1.995066933827298,1.7829680043799536,1.667281425641495,0.4887254650789885,3.4964381039141412,1.1856289804763125,1.63668714555745,1.3072768360965874]) |[0.033335136368759725,0.033335270064671074,0.033333243143688636,0.03333135191450168,0.03333521337181425,0.03333250351809707,0.03333535436589356,0.03333322757251989,0.033333747503130604,0.03333235416690325,0.03333273826809218,0.03333297626192852,0.033333297925505816,0.03333285267847805,0.033334741557111344,0.03333582976624498,0.03333192986853772,0.0333314389004771,0.03333186692894785,0.033332539165631093,0.03333280721245933,0.03333326073074502,0.03333359623114709,0.03333261337071022,0.03333404937430939,0.03333536871228382,0.033334559939835924,0.03333095099987788,0.0333326510660671,0.033332529051629894] |
|(20,[2,5],[0.8914840021899768,1.3684833461617696]) |[0.033333008832987246,0.03333320773569408,0.03333212285409669,0.03333370183210235,0.03333408881228382,0.03333369017540667,0.03333315896362525,0.03333316321736793,0.033333273291611384,0.033332955144148683,0.03333299410668819,0.03333254807040678,0.03333302471306508,0.03333324003641127,0.033332032586187325,0.0333328866035473,0.033333366369446636,0.03333432146865401,0.033333080986559745,0.033332919796171534,0.03333365180032741,0.03333319727206391,0.0333336947744764,0.033334184463059975,0.033333428924387516,0.033333998899679654,0.033333803141837516,0.03333374053474702,0.03333339151988759,0.03333412307307108] |
In order to remap the termIndices to words you have to access the vocabulary of the CountVectorizer model. Please have a look at the pseudocode below:
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Reading your data...
cv = CountVectorizer(inputCol="some", outputCol="features", vocabSize=2000)
cvmodel = cv.fit(df)
# other stuff

# Keep the vocabulary around so executors can translate term ids back to words
vocab = cvmodel.vocabulary
vocab_broadcast = sc.broadcast(vocab)

# creating the LDA model ...
ldatopics = ldamodel.describeTopics()

def map_termID_to_Word(termIndices):
    words = []
    for termID in termIndices:
        words.append(vocab_broadcast.value[termID])
    return words

udf_map_termID_to_Word = udf(map_termID_to_Word, ArrayType(StringType()))
ldatopics_mapped = ldatopics.withColumn("topic_desc", udf_map_termID_to_Word(ldatopics.termIndices))
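To attach the dominant topic (and its mapped words) to each row of the transformed DataFrame, you can take the argmax of topicDistribution and join it against the mapped topics. A sketch, assuming the transformed and ldatopics_mapped DataFrames from above (the dominant_topic helper and the "topic" column name used for the join are illustrative):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# Index of the highest probability in the topicDistribution vector
def dominant_topic(dist):
    return int(dist.toArray().argmax())

udf_dominant_topic = udf(dominant_topic, IntegerType())

with_topic = transformed.withColumn("topic", udf_dominant_topic(col("topicDistribution")))

# Bring in the human-readable topic description from describeTopics()
with_topic.join(ldatopics_mapped.select("topic", "topic_desc"), on="topic", how="left") \
          .select("features", "topic", "topic_desc") \
          .show(truncate=False)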

PySpark Count Distinct By Group In A RDD

I have an RDD of (datetime, hostname) tuples and I want to count the unique hostnames by date.
RDD:
X = [(datetime.datetime(1995, 8, 1, 0, 0, 1), u'in24.inetnebr.com'),
(datetime.datetime(1995, 8, 1, 0, 0, 7), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 1, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 9), u'ix-esc-ca2-07.ix.netcom.com'),
(datetime.datetime(1995, 8, 3, 0, 0, 10), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 3, 0, 0, 10), u'slppp6.intermind.net'),
(datetime.datetime(1995, 8, 4, 0, 0, 10), u'piweba4y.prodigy.com'),
(datetime.datetime(1995, 8, 5, 0, 0, 11), u'slppp6.intermind.net')]
DESIRED OUTPUT:
[(datetime.datetime(1995, 8, 1, 0, 0, 1), 2),
(datetime.datetime(1995, 8, 2, 0, 0, 8), 2),
(datetime.datetime(1995, 8, 3, 0, 0, 10), 2),
(datetime.datetime(1995, 8, 4, 0, 0, 10), 1),
(datetime.datetime(1995, 8, 5, 0, 0, 11), 1)]
MY ATTEMPT:
dayGroupedHosts = X.groupBy(lambda x: x[0]).distinct()
dayHostCount = dayGroupedHosts.count()
I am getting an error while performing the count operation. I am new to Spark and I would like to know the correct and efficient transformation to achieve such tasks.
Thanks a lot in advance.
You need to first convert the keys into dates. Then group by the key, and count the distinct values:
X.map(lambda x: (x[0].date(), x[1]))\
.groupByKey()\
.mapValues(lambda vals: len(set(vals)))\
.sortByKey()\
.collect()
#[(datetime.date(1995, 8, 1), 2),
# (datetime.date(1995, 8, 2), 2),
# (datetime.date(1995, 8, 3), 2),
# (datetime.date(1995, 8, 4), 1),
# (datetime.date(1995, 8, 5), 1)]
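Since groupByKey materialises all hostnames for a date in memory, a slightly leaner variant is to deduplicate the (date, hostname) pairs first and then count per key. A sketch, assuming the same X RDD (this is just an alternative, not part of the original answer):
(X.map(lambda x: (x[0].date(), x[1]))
  .distinct()                        # unique (date, hostname) pairs
  .map(lambda pair: (pair[0], 1))
  .reduceByKey(lambda a, b: a + b)   # count distinct hostnames per date
  .sortByKey()
  .collect())
# [(datetime.date(1995, 8, 1), 2), (datetime.date(1995, 8, 2), 2), ...]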
Or convert to a DataFrame and use countDistinct method:
import pyspark.sql.functions as f
df = spark.createDataFrame(X, ["dt", "hostname"])
df.show()
+-------------------+--------------------+
| dt| hostname|
+-------------------+--------------------+
|1995-08-01 00:00:01| in24.inetnebr.com|
|1995-08-01 00:00:07| uplherc.upl.com|
|1995-08-01 00:00:08| uplherc.upl.com|
|1995-08-02 00:00:08| uplherc.upl.com|
|1995-08-02 00:00:08| uplherc.upl.com|
|1995-08-02 00:00:09|ix-esc-ca2-07.ix....|
|1995-08-03 00:00:10| uplherc.upl.com|
|1995-08-03 00:00:10|slppp6.intermind.net|
|1995-08-04 00:00:10|piweba4y.prodigy.com|
|1995-08-05 00:00:11|slppp6.intermind.net|
+-------------------+--------------------+
df.groupBy(f.to_date('dt').alias('date')).agg(
f.countDistinct('hostname').alias('hostname')
).show()
+----------+--------+
| date|hostname|
+----------+--------+
|1995-08-02| 2|
|1995-08-03| 2|
|1995-08-01| 2|
|1995-08-04| 1|
|1995-08-05| 1|
+----------+--------+

Strange Behaviour when Updating Cassandra row

I am using pyspark and pyspark-cassandra.
I have noticed this behaviour on multiple versions of Cassandra (3.0.x and 3.6.x) using COPY, sstableloader, and now saveToCassandra in pyspark.
I have the following schema
CREATE TABLE test (
id int,
time timestamp,
a int,
b int,
c int,
PRIMARY KEY ((id), time)
) WITH CLUSTERING ORDER BY (time DESC);
and the following data
(1, datetime.datetime(2015, 3, 1, 0, 18, 18, tzinfo=<UTC>), 1, 0, 0)
(1, datetime.datetime(2015, 3, 1, 0, 19, 12, tzinfo=<UTC>), 0, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 22, 59, tzinfo=<UTC>), 1, 0, 0)
(1, datetime.datetime(2015, 3, 1, 0, 23, 52, tzinfo=<UTC>), 0, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 32, 2, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 32, 8, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 43, 30, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 44, 12, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 48, 49, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 49, 7, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 50, 5, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 50, 53, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 51, 53, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 51, 59, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 54, 35, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 55, 28, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 55, 55, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 56, 24, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 11, 14, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 11, 17, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 12, 8, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 12, 10, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 17, 43, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 17, 49, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 12, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 24, tzinfo=<UTC>), 2, 1, 0)
Towards the end of the data, there are two rows which have the same timestamp.
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 1, 2, 0)
It is my understanding that when I save to Cassandra, one of these will "win" - there will only be one row.
After writing to Cassandra using
rdd.saveToCassandra(keyspace, table, ['id', 'time', 'a', 'b', 'c'])
Neither row appears to have won. Rather, the rows seem to have "merged".
1 | 2015-03-01 01:17:43+0000 | 1 | 2 | 0
1 | 2015-03-01 01:17:49+0000 | 0 | 3 | 0
1 | 2015-03-01 01:24:12+0000 | 1 | 2 | 0
1 | 2015-03-01 01:24:18+0000 | 2 | 2 | 0
1 | 2015-03-01 01:24:24+0000 | 2 | 1 | 0
Rather than the 2015-03-01 01:24:18+0000 containing (1, 2, 0) or (2, 1, 0), it contains (2, 2, 0).
What is happening here? I can't for the life of me figure out what is causing this behaviour.
This is a little-known effect that comes from batching data together. Batched writes assign the same timestamp to all inserts in the batch. Then, if two writes carry exactly the same timestamp, a special merge rule applies, since there is no "last" write. The Spark Cassandra Connector uses intra-partition batches by default, so this kind of clobbering of values is very likely to happen.
The behaviour with two identical write timestamps is a merge based on the greater value.
Given Table (key, a, b)
Batch
Insert "foo", 2, 1
Insert "foo", 1, 2
End batch
The batch gives both mutations the same timestamp. Cassandra cannot choose a "last-written" value since they both happened at the same time; instead it just chooses the greater value of the two. The merged result will be
"foo", 2, 2
