Filter non-equal values in PySpark using condition.where(array_contains()) - apache-spark

I have this PySpark code:
from pyspark.sql.functions import array_contains, to_date

condition_no_hypertension = (
    condition
    .where(array_contains('clinicalStatus.coding.code', 'active'))
    .where(array_contains('verificationStatus.coding.code', 'confirmed'))
    .where(array_contains('code.coding.code', '38341003'))
    .where(condition.onsetDateTime > '1900-01-01')
    .withColumn('condition_status', condition['clinicalStatus.coding.code'].getItem(0))
    .withColumn('verification_status', condition['verificationStatus.coding.code'].getItem(0))
    .withColumn('snomed_code', condition['code.coding.code'].getItem(0))
    .withColumn('snomed_name', condition['code.coding.display'].getItem(0))
    .select(
        condition['subject.reference'].substr(10, 40).alias('patient_id'),
        'condition_status',
        'verification_status',
        'snomed_code',
        'snomed_name',
        to_date(condition['onsetDateTime']).alias('first_observation_date'),
    )
)
How do I change this code to pick up everything except that code, i.e. keep only the rows where code.coding.code does not contain '38341003'?
I tried
where(array_contains('code.coding.code', !='38341003')).\
but it does not work.

You can use ~ (not):
where(~array_contains('code.coding.code', '38341003'))
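In context, a minimal sketch of the negated filter (only the where on code.coding.code changes; the rest of the chain is as in the question):
from pyspark.sql.functions import array_contains

# Keep rows whose code.coding.code array does NOT contain the SNOMED code.
condition_no_hypertension = (
    condition
    .where(array_contains('clinicalStatus.coding.code', 'active'))
    .where(array_contains('verificationStatus.coding.code', 'confirmed'))
    .where(~array_contains('code.coding.code', '38341003'))
    .where(condition.onsetDateTime > '1900-01-01')
)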

Related

How do I output MAW head values for all timesteps in FloPy MODFLOW 6?

I am creating a MAW well and want to use it as an observation well to compare later to field data; it should be screened over multiple layers. However, I am only getting the head value in the well for the very last timestep in my output file. Any ideas on how to get all timesteps in the output?
The FloPy manual says something about it needing to be in Output Control, but I can't figure out how to do that:
print_head (boolean) – keyword to indicate that the list of multi-aquifer well heads will be printed to the listing file for every stress period in which "HEAD PRINT" is specified in Output Control. If there is no Output Control option and PRINT_HEAD is specified, then heads are printed for the last time step of each stress period.
In the MODFLOW6 manual I see that it is possible to make a continuous output:
modflow6
My MAW definition looks like this:
maw = flopy.mf6.ModflowGwfmaw(gwf,
                              nmawwells=1,
                              packagedata=[0, Rwell, minbot, wellhead, 'MEAN', OBS1welllayers],
                              connectiondata=OBS1connectiondata,
                              perioddata=[(0, 'STATUS', 'ACTIVE')],
                              flowing_wells=False,
                              save_flows=True,
                              mover=True,
                              flow_correction=True,
                              budget_filerecord='OBS1wellbudget',
                              print_flows=True,
                              print_head=True,
                              head_filerecord='OBS1wellhead',
                              )
My output control looks like this:
oc = flopy.mf6.ModflowGwfoc(gwf,
                            budget_filerecord=budget_file,
                            head_filerecord=head_file,
                            saverecord=[('HEAD', 'ALL'), ('BUDGET', 'ALL')],
                            )
Hope this is all clear and someone can help me, thanks!
You need to initialise the MAW observations file... it's not done in the OC package.
You can find the scripts for the three MAW examples in the MF6 documentation here:
https://github.com/MODFLOW-USGS/modflow6-examples/tree/master/notebooks
It looks something like this:
obs_file = "{}.maw.obs".format(name)
csv_file = obs_file + ".csv"
obs_dict = {
    csv_file: [
        ("head", "head", (0,)),
        ("Q1", "maw", (0,), (0,)),
        ("Q2", "maw", (0,), (1,)),
        ("Q3", "maw", (0,), (2,)),
    ]
}
maw.obs.initialize(filename=obs_file, digits=10, print_input=True, continuous=obs_dict)
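Once the model has run, the continuous observations land in the CSV named above, one row per time step. A quick way to inspect it (pandas here is my choice, not something the question requires; any CSV reader works):
import pandas as pd

# The first column is simulation time; the remaining columns are the
# observations named in obs_dict above ("head", "Q1", "Q2", "Q3").
obs = pd.read_csv(csv_file)
print(obs.head())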

How to add two maps of key-to-values (array) together in Presto?

I have a Presto query that gives me one result, but what I am looking for is this:
{"accountnumber":"A00000065","invoice":{"ids":["2c92c09a693316310169384472126a0d"], "numbers":["INV00000270"]}}
I have tried using map_concat with no luck.
https://prestodb.io/docs/current/functions/map.html
UPDATE: If I do the following it works.
map_concat(multimap_agg('number', invoice.invoicenumber), multimap_agg('id', invoice.id))
but if I change it to
map_concat(multimap_agg('number', invoice.invoicenumber), multimap_agg('id', invoice.balance))
I get this error
line 1:23: Unexpected parameters (map(varchar(6),array(varchar)), map(varchar(2),array(decimal(18,6)))) for function map_concat. Expected: map_concat(map(K,V)) K, V
This should go in a comment, but it's too long; if it doesn't work, I'll delete the answer.
Try this:
SELECT map_concat(multimap_agg('ids', invoice.id), multimap_agg('numbers', invoice.invoicenumber))
FROM ...
As for the error in your update: map_concat requires all of its arguments to be maps of the same type. multimap_agg('number', invoice.invoicenumber) produces map(varchar, array(varchar)), while multimap_agg('id', invoice.balance) produces map(varchar, array(decimal)), so the two cannot be concatenated; cast the values to a common type first (for example CAST(invoice.balance AS varchar)).

Write a pyspark dataframe to text without changing its structure

I have a pyspark dataframe as shown below
+--------------------+
| speed|
+--------------------+
|[5.59239, 2.51329...|
|[0.0191166, 0.169...|
|[0.561913, 0.4098...|
|[0.393343, 0.3580...|
|[0.118315, 0.1183...|
|[0.831407, 0.4470...|
|[1.49012e-08, 0.1...|
|[0.0411047, 0.152...|
|[0.620069, 0.8262...|
|[0.20373, 0.20373...|
+--------------------+
How can I write this dataframe to CSV so that it is saved exactly as shown above? Currently I tried coalesce, but it saved as below:
"[5.59239, 2.51329, 0.141536, 1.27485, 2.35138, 12.9668, 12.9668, 2.52421, 0.330804, 0.459188, 0.459188, 0.651573, 3.15373, 6.11923, 8.8445, 8.0871, 0.855173, 1.43534, 1.43534, 1.05988, 1.05988, 0.778344, 1.20522, 1.70414, 1.70414, 0.0795492, 1.10385, 1.4759, 1.64844, 0.82941, 1.11321, 1.37977, 0.849902, 1.24436, 1.24436, 0.698651, 0.791467, 0.636781, 0.666729, 0.666729, 0.45688, 0.45688, 0.158829, 2.12693, 29.8682, 29.8682, 9.62536, 3.40384, 2.51002, 1.55077, 1.01774, 0.922753, 0.922753, 0.0438924, 0.530669, 0.879573, 0.627267, 0.0532846, 0.0890066, 0.0884833, 0.140008, 0.147534, 0.0180038, 0.0132851, 0.112785, 0.112785, 0.22997, 0.22997, 0.0524423, 0.141886, 0.328422,............]"
But I want to save it in a format that opens as a proper Excel file, with speed as the column name and its values as a list of lists.
I don't want to use toPandas() as it is memory intensive.
If I have over-emphasised/under-emphasised something, please let me know in the comments.
df.coalesce(1).write.option("header","true").csv("file:///s/tesing")
I resolved this!
df_Welding_amp.rdd.coalesce(1).saveAsTextFile('home/ram/file.csv')
Though I didn't get it exactly as a list of lists, I was able to successfully get it in row format, as below:
Row(speed='[5.59239, 2.51329, 0.141536, 1.27485, 2.35138, 12.9668, 12.9668, 2.52421, 0.330804, 0.459188, 0.459188, 0.651573, 3.15373, 6.11923, 8.8445, 8.0871, 0.855173, 1.43534, 1.43534, 1.05988, 1.05988, 0.778344, 1.20522, 1.70414, 1.70414, 0.0795492, 1.10385, 1.4759, 1.64844, 0.82941........
.....]
Row(speed='[0.0191166, 0.169978, 0.226254, 0.149923, 0.149923, 0.505102, 0.505102, 0.369975, 0.305384, 0.154693, 0.224818, 0.875909, 0.875909, 2.5506, 6.06761, 5.0829, 4.46667, 2.16333, 3.74257, 3.74257, 2.33873, 1.39336, 1.56772, 0.889895, 0.249284, 0.249284, 0.132409, 0.177825, 0.270215, 0.398466, 2.3726, 4.87186, 4.05198, 2.23753, 0.266356, 0.513157, 0.78962, 0.523164, 0.138469, 0.315834, 0.315834]
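If you want a real one-column CSV instead of Row(...) text, here is a minimal sketch (assuming the dataframe from the question is df and speed is an array column): casting the array column to a string keeps the bracketed layout from df.show() while still avoiding toPandas().
from pyspark.sql.functions import col

# Cast the array column to its string form '[a, b, c, ...]',
# then write a one-column CSV with a header.
df_out = df.withColumn('speed', col('speed').cast('string'))
df_out.coalesce(1).write.option('header', 'true').csv('file:///tmp/speed_csv')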

pd.to_datetime to parse '2010/1/1' rather than '2010/01/01'

I have a dataframe which contains a column 'trade_dt' like this:
2009/12/1
2009/12/2
2009/12/3
2009/12/4
I get this error:
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y-&m-%d')
ValueError: time data '2009/12/1' does not match format '%Y-&m-%d' (match)
how to solve it? Thanks~
You need to change the format so it matches - replace & with % and - with /:
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y/%m/%d')
Removing format also works with the sample data (but I'm not sure about real data):
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'])
print (benchmark)
trade_dt
0 2009-12-01
1 2009-12-02
2 2009-12-03
3 2009-12-04
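For reference, a self-contained version of the fix with the sample data from the question:
import pandas as pd

# %m and %d also match non-zero-padded values, so '2009/12/1' parses fine.
benchmark = pd.DataFrame({'trade_dt': ['2009/12/1', '2009/12/2', '2009/12/3', '2009/12/4']})
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y/%m/%d')
print(benchmark)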

Using SVM-light for binary classification

I am using SVM-light for binary classification, in learning mode.
I have my train.dat file ready, but when I run this command, instead of creating the model file, it writes the following in the terminal:
my command:
./svm_learn example1/train.dat example1/model
output:
Scanning examples...done
Reading examples into memory...Feature numbers must be larger or equal to 1!!!
: Success
LINE: -1 0:1.0 6:1.0 16:1.0 18:1.0 28:1.0 29:1.0 31:1.0 48:1.0 58:1.0 73:1.0 82:1.0 93:1.0 95:1.0 106:1.0 108:1.0 118:1.0 121:1.0 122:1.0151:1.0 164:1.0 167:1.0 169:1.0 170:1.0 179:1.0 190:1.0 193:1.0 220:1.0 237:1.0250:1.0 252:1.0 267:1.0 268:1.0 269:1.0 278:1.0 283:1.0 291:1.0 300:1.0 305:1.0320:1.0 332:1.0 336:1.0 342:1.0 345:1.0 348:1.0 349:1.0 350:1.0 368:1.0 370:1.0384:1.0 390:1.0 394:1.0 395:1.0 396:1.0 397:1.0 400:1.0 401:1.0 408:1.0 416:1.0427:1.0 433:1.0 435:1.0 438:1.0 441:1.0 446:1.0 456:1.0 471:1.0 485:1.0 510:1.0523:1.0 525:1.0 526:1.0 532:1.0 540:1.0 553:1.0 567:1.0 568:1.0 581:1.0 583:1.0604:1.0 611:1.0 615:1.0 616:1.0 618:1.0 623:1.0 624:1.0 626:1.0 651:1.0 659:1.0677:1.0 678:1.0 683:1.0 690:1.0 694:1.0 699:1.0 713:1.0 714:1.0 720:1.0 722:1.0731:1.0 738:1.0 755:1.0 761:1.0 763:1.0 768:1.0 776:1.0 782:1.0 792:1.0 817:1.0823:1.0 827:1.0 833:1.0 834:1.0 838:1.0 842:1.0 848:1.0 851:1.0 863:1.0 867:1.0890:1.0 900:1.0 903:1.0 923:1.0 935:1.0 942:1.0 946:1.0 947:1.0 949:1.0 956:1.0962:1.0 965:1.0 968:1.0 983:1.0 986:1.0 987:1.0 990:1.0 998:1.0 1007:1.0 1014:1.0 1019:1.0 1022:1.0 1024:1.0 1029:1.0 1030:1.01032:1.0 1047:1.0 1054:1.0 1063:1.0 1069:1.0 1076:1.0 1085:1.0 1093:1.0 1098:1.0 1108:1.0 1109:1.01116:1.0 1120:1.0 1133:1.0 1134:1.0 1135:1.0 1138:1.0 1139:1.0 1144:1.0 1146:1.0 1148:1.0 1149:1.01161:1.0 1165:1.0 1169:1.0 1170:1.0 1177:1.0 1187:1.0 1194:1.0 1212:1.0 1214:1.0 1239:1.0 1243:1.01251:1.0 1257:1.0 1274:1.0 1278:1.0 1292:1.0 1297:1.0 1304:1.0 1319:1.0 1324:1.0 1325:1.0 1353:1.01357:1.0 1366:1.0 1374:1.0 1379:1.0 1392:1.0 1394:1.0 1407:1.0 1412:1.0 1414:1.0 1419:1.0 1433:1.01435:1.0 1437:1.0 1453:1.0 1463:1.0 1464:1.0 1469:1.0 1477:1.0 1481:1.0 1487:1.0 1506:1.0 1514:1.01519:1.0 1526:1.0 1536:1.0 1549:1.0 1551:1.0 1553:1.0 1561:1.0 1569:1.0 1578:1.0 1603:1.0 1610:1.01615:1.0 1617:1.0 1625:1.0 1638:1.0 1646:1.0 1663:1.0 1666:1.0 1672:1.0 1681:1.0 1690:1.0 1697:1.01699:1.0 1706:1.0 1708:1.0 1717:1.0 1719:1.0 1732:1.0 1737:1.0 1756:1.0 1766:1.0 1771:1.0 1789:1.01804:1.0 1805:1.0 1808:1.0 1814:1.0 1815:1.0 1820:1.0 1824:1.0 1832:1.0 1841:1.0 1844:1.0 1852:1.01861:1.0 1875:1.0 1899:1.0 1902:1.0 1904:1.0 1905:1.0 1917:1.0 1918:1.0 1919:1.0 1921:1.0 1926:1.01934:1.0 1937:1.0 1942:1.0 1956:1.0 1965:1.0 1966:1.0 1970:1.0 1971:1.0 1980:1.0 1995:1.0 2000:1.02009:1.0 2010:1.0 2012:1.0 2015:1.0 2018:1.0 2022:1.0 2047:1.0 2076:1.0 2082:1.0 2095:1.0 2108:1.02114:1.0 2123:1.0 2130:1.0 2133:1.0 2141:1.0 2142:1.0 2143:1.0 2148:1.0 2157:1.0 2160:1.0 2162:1.02170:1.0 2195:1.0 2199:1.0 2201:1.0 2202:1.0 2205:1.0 2211:1.0 2218:1.0
I don't know what to do.
P.S. When I make my train.dat much shorter, everything works fine!
Thank you
From what I could interpret from the log, your training set has an issue.
The first few characters of the training row that has issue are
-1 0:1.0 6:1.0
The issue is not with the size but with feature indexing. You are starting your feature index at 0 (0:1.0), whereas SVM-light requires that all feature indices be greater than or equal to 1.
Change the indexing to start at 1 and it should work fine.
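If regenerating the data is awkward, here is a small sketch that shifts every feature index up by one (the filenames are assumptions, and it assumes plain 'label index:value ...' lines with no comment lines):
# Rewrite an SVM-light data file so 0-based feature indices become 1-based.
def shift_indices(src='train.dat', dst='train_1based.dat'):
    with open(src) as fin, open(dst, 'w') as fout:
        for line in fin:
            parts = line.split()
            if not parts:
                continue
            label, feats = parts[0], parts[1:]
            shifted = ['{}:{}'.format(int(f.split(':')[0]) + 1, f.split(':')[1])
                       for f in feats]
            fout.write(' '.join([label] + shifted) + '\n')

shift_indices()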
