PySpark: How to make `from_json` work with unicode?

I have a dataframe like this
data_frame = spark.createDataFrame(
    [("[{\"Julia is awesome δ; 🐔 or 🥚\": 1, \"ok2\": 2}, {\"ok\":1 , \"ok3\": 3}]",),
     ("[{\"ok\": 1, \"ok2\": 2}, {\"ok\":1 , \"ok3\": 3}, {\"ok2\":3}]",)], ["tmp"])
so it looks like this
+--------------------------------------------------------------------+
|tmp |
+--------------------------------------------------------------------+
|[{"Julia is awesome δ; 🐔 or 🥚": 1, "ok2": 2}, {"ok":1 , "ok3": 3}]|
|[{"ok": 1, "ok2": 2}, {"ok":1 , "ok3": 3}, {"ok2":3}] |
+--------------------------------------------------------------------+
As you can see, the strings contain non-ASCII unicode characters (δ and emoji).
I want to read the column as a JSON array and then do some further processing:
from pyspark.sql import functions as F
data_frame.withColumn("tmp1", F.from_json(F.col("tmp"), "array<string>").\
withColumn("tmp2", F.col("tmp1").getItem(0))
But I get
{"Julia is awesome δ; \uD83D\uDC14 or \uD83E\uDD5A":1,...
in return, so the emoji come out as escaped surrogate pairs. So I tried F.decode(), like so:
data_frame.withColumn("tmp1", F.from_json(F.col("tmp"), "array<string>")) \
    .withColumn("tmp2", F.decode(F.col("tmp1").getItem(0), "utf-8"))
but it gives the same output.
The issue is that
data_frame.withColumn("tmp1", F.from_json(F.col("tmp"), "array<string>")).select("tmp1").collect()
already shows the escaped unicode, but there is no "use_unicode=True" option in `from_json`.
So how do I handle unicode correctly with `from_json`?
The documentation here is not very clear: http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.functions.from_json
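One thing worth noting (a plain-Python check, independent of Spark): the \uXXXX sequences are standard JSON surrogate-pair escapes, so the data itself is not lost, only displayed in escaped form:
import json

# the escaped form returned by from_json/to_json is still valid JSON
s = '{"Julia is awesome δ; \\uD83D\\uDC14 or \\uD83E\\uDD5A": 1}'
print(json.loads(s))  # {'Julia is awesome δ; 🐔 or 🥚': 1}
So the question is really about presentation, not data loss.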

Related

How to read in pandas column as column of lists?

Probably a simple solution, but I couldn't find a fix scrolling through previous questions, so I thought I would ask.
I'm reading in a CSV using pd.read_csv(). One column is giving me issues:
0 ['Bupa', 'O2', 'EE', 'Thomas Cook', 'YO! Sushi...
1 ['Marriott', 'Evans']
2 ['Toni & Guy', 'Holland & Barrett']
3 []
4 ['Royal Mail', 'Royal Mail']
It looks fine here, but when I reference the first value in the column I get:
df['brand_list'][0]
Out : '[\'Bupa\', \'O2\', \'EE\', \'Thomas Cook\', \'YO! Sushi\', \'Costa\', \'Starbucks\', \'Apple Store\', \'HMV\', \'Marks & Spencer\', "Sainsbury\'s", \'Superdrug\', \'HSBC UK\', \'Boots\', \'3 Store\', \'Vodafone\', \'Marks & Spencer\', \'Clarks\', \'Carphone Warehouse\', \'Lloyds Bank\', \'Pret A Manger\', \'Sports Direct\', \'Currys PC World\', \'Warrens Bakery\', \'Primark\', "McDonald\'s", \'HSBC UK\', \'Aldi\', \'Premier Inn\', \'Starbucks\', \'Pizza Hut\', \'Ladbrokes\', \'Metro Bank\', \'Cotswold Outdoor\', \'Pret A Manger\', \'Wetherspoon\', \'Halfords\', \'John Lewis\', \'Waitrose\', \'Jessops\', \'Costa\', \'Lush\', \'Holland & Barrett\']'
Which is obviously a string, not a list as expected. How can I retain the list type when I read in this data?
I've tried the import ast method I've seen in other posts, df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x)), which didn't work.
I've also tried to replicate with dummy dataframes:
import pandas as pd
df1 = pd.DataFrame({'a' : [['test','test1','test3'], ['test59'], ['test'], ['rhg','wreg']],
                    'b' : [['erg','retbn','ert','eb'], ['g','eg','egr'], ['erg'], 'eg']})
df1['a'][0]
Out: ['test', 'test1', 'test3']
Which works as I would expect; this suggests to me that the solution lies in how I am importing the data.
Apologies, I was being stupid. The following should work:
import ast
df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x))
df['brand_list_new'][0]
Out: ['Bupa','O2','EE','Thomas Cook','YO! Sushi',...]
As desired
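For future readers, the same conversion can be done at import time with read_csv's converters argument, so the intermediate string column never exists. A sketch, assuming the column is named brand_list as above (data.csv is a placeholder filename):
import ast
import pandas as pd

# parse the stringified lists while reading instead of fixing them afterwards
df = pd.read_csv("data.csv", converters={"brand_list": ast.literal_eval})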

How do I show all values in an aggregated tooltip?

I would like myval to show the name of each car for each aggregated year, ex. "chevrolet chevelle malibu".
The [object Object] thing appears to be JavaScript related.
import altair as alt
from vega_datasets import data

df = data.cars()
alt.renderers.enable("altair_viewer")
mychart = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", myval="values(Name)", groupby=["Year"])
    .mark_bar()
    .encode(
        x=alt.X(
            "Year",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
        tooltip=alt.Tooltip(["myval:N"]),
    )
)
mychart.show()
This is a great question, and I'm not sure there's a satisfactory answer. The reason this is displayed as [object Object], [object Object], etc. is because the values aggregate returns a list of the entire row for each value. So the full representation would be something like this:
[{'Name': 'chevrolet chevelle malibu', 'Miles_per_Gallon': 18.0, 'Cylinders': 8, 'Displacement': 307.0, 'Horsepower': 130.0, 'Weight_in_lbs': 3504, 'Acceleration': 12.0, 'Year': 1970, 'Origin': 'USA'}, {'Name': 'buick skylark 320', 'Miles_per_Gallon': 15.0, 'Cylinders': 8, 'Displacement': 350.0, 'Horsepower': 165.0, 'Weight_in_lbs': 3693, 'Acceleration': 11.5, 'Year': 1970, 'Origin': 'USA'}, ...]
and those are just the first two entries! So clearly it won't really fit in a tooltip. For what it's worth, newer versions of Vega improve on this (which you can see by viewing the equivalent chart in the vega editor) but it's still not what you're looking for.
What you need is a way to extract just the name from each value in the list... and I'm not sure that Vega-Lite transforms provide any good way to do that (the vega expression language does not have anything that resembles list comprehensions or function mapping).
The best I can think of is something like this, to display, say, the first 4 values:
mychart = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", myval="values(Name)", groupby=["Year"])
    .transform_calculate(
        first_val="datum.myval[0].Name",
        second_val="datum.myval[1].Name",
        third_val="datum.myval[2].Name",
        fourth_val="datum.myval[3].Name",
    )
    .mark_bar()
    .encode(
        x=alt.X(
            "Year",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
        tooltip=alt.Tooltip(["first_val:N", "second_val:N", "third_val:N", "fourth_val:N"]),
    )
)
Another option would be, instead of using a tooltip, to use a second chart that updates on mouseover:
base = alt.Chart(df).transform_joinaggregate(
    count="count(*)", values="values(Name)", groupby=["Year"]
)
selection = alt.selection_single(fields=['Year'], on='mouseover', empty='none')
bars = (
    base.mark_bar()
    .encode(
        x=alt.X(
            "Year",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
    )
    .add_selection(selection)
)
text = (
    base.transform_filter(selection)
    .transform_flatten(['values'])
    .transform_calculate(Name="datum.values.Name")
    .mark_text()
    .encode(y=alt.Y('Name:N', axis=None), text='Name:N')
).properties(width=300)
chart2 = bars | text
I'd be interested to see if anyone knows of a more complete solution.
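One possible refinement, assuming the Vega runtime behind your renderer exposes the pluck and join expression functions (both are documented Vega expression functions, but support depends on the installed version): flatten the whole list into a single string in a transform_calculate. A sketch, not a tested solution:
mychart = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", myval="values(Name)", groupby=["Year"])
    # pluck() extracts the Name field from each row object; join() concatenates
    .transform_calculate(names="join(pluck(datum.myval, 'Name'), ', ')")
    .mark_bar()
    .encode(
        x=alt.X(
            "Year",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
        tooltip=alt.Tooltip(["names:N"]),
    )
)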

Assigning values to imported variables from excel

I need to import an Excel document into Mathematica which has 2000 compounds in it, each with 6 numerical constants assigned to it. The end goal is to type a compound name into Mathematica and have the 6 numerical constants outputted. So far my code is:
t = Import["Titles.txt.", {"Text", "Lines"}] (imports compound names)
n = Import["NA.txt.", "List"] (imports the 6 values for each compound)
n[[2]] (outputs the second compounds 6 values)
Instead of n[[#]] i would like to know how to type in a compound from the imported compound names and have the 6 values be outputted .
I'm not sure I understand your question - you have two text files rather than an Excel file, for example, and it's not clear what the data looks like. But there are probably plenty of ways to do this. Here's a suggestion (it might not be the best way):
Let's assume that you've got all your data into a table (a list of lists):
pt = {
{"Hydrogen", "H", 1, 1.0079, -259, -253, 0.09, 0.14, 1776, 1, 13.5984},
{"Helium", "He", 2, 4.0026, -272, -269, 0, 0, 1895, 18, 24.5874},
{"Lithium" , "Li", 3, 6.941, 180, 1347, 0.53, 0, 1817, 1, 5.3917}
}
To find the information associated with a particular string:
Cases[pt, {"Helium", rest__} -> rest]
{"He", 2, 4.0026, -272, -269, 0, 0, 1895, 18, 24.5874}
where the pattern rest__ holds everything that was found after "Helium".
To match on the second item instead:
Cases[pt, {_, "Li", rest__} -> rest]
{3, 6.941, 180, 1347, 0.53, 0, 1817, 1, 5.3917}
If you add more information to the patterns, you have more flexibility in how you choose elements from the table:
Cases[pt, {name_, symbol_, aNumber_, aWeight_, mp_, bp_, density_,
crust_, discovered_, rest__}
/; discovered > 1850 -> {name, symbol, discovered}]
{{"Helium", "He", 1895}}
For something interactive, you could knock up a Manipulate:
elements = pt[[All, 1]];
headings = {"symbol", "aNumber", "aWeight", "mp", "bp", "density", "crust", "discovered", "group", "ion"};
Manipulate[
Column[{
elements[[x]],
TableForm[{
headings, Cases[pt, {elements[[x]], rest__} -> rest]}]}],
{x, 1, Length[elements], 1}]

CouchDB historical view snapshots

I have a database with documents that are roughly of the form:
{"created_at": some_datetime, "deleted_at": another_datetime, "foo": "bar"}
It is trivial to get a count of non-deleted documents in the DB, assuming that we don't need to handle "deleted_at" in the future. It's also trivial to create a view that reduces to something like the following (using UTC):
[
    {"key": ["created_at", 2012, 7, 30], "value": 39},
    {"key": ["deleted_at", 2012, 7, 31], "value": 12},
    {"key": ["created_at", 2012, 8, 2], "value": 6}
]
...which means that 39 documents were marked as created on 2012-07-30, 12 were marked as deleted on 2012-07-31, and so on. What I want is an efficient mechanism for getting the snapshot of how many documents "existed" on 2012-08-01 (0+39-12 == 27). Ideally, I'd like to be able to query a view or a DB (e.g. something that's been precomputed and saved to disk) with the date as the key or index, and get the count as the value or document. e.g.:
[
{"key": [2012, 7, 30], "value": 39},
{"key": [2012, 7, 31], "value": 27},
{"key": [2012, 8, 1], "value": 27},
{"key": [2012, 8, 2], "value": 33}
]
This can be computed easily enough by iterating through all of the rows in the view, keeping a running counter and summing up each day as I go, but that approach slows down as the data set grows larger, unless I'm smart about caching or storing the results. Is there a smarter way to tackle this?
Just for the sake of comparison (I'm hoping someone has a better solution), here's (more or less) how I'm currently solving it (in untested ruby pseudocode):
require 'date'
def date_snapshots(rows)
current_date = nil
current_count = 0
rows.inject({}) {|hash, reduced_row|
type, *ymd = reduced_row["key"]
this_date = Date.new(*ymd)
if current_date
# deal with the days where nothing changed
(current_date.succ ... this_date).each do |date|
key = date.strftime("%Y-%m-%d")
hash[key] = current_count
end
end
# update the counter and deal with the current day
current_date = this_date
current_count += reduced_row["value"] if type == "created_at"
current_count -= reduced_row["value"] if type == "deleted_at"
key = current_date.strftime("%Y-%m-%d")
hash[key] = current_count
hash
}
end
Which can then be used like so:
rows = couch_server.db(foo).design(bar).view(baz).reduce.group_level(3).rows
date_snapshots(rows)["2012-08-01"]
An obvious small improvement would be to add a caching layer, although it isn't quite as trivial to make that caching layer play nicely with incremental updates (e.g. the changes feed).
I found an approach that seems much better than my original one, assuming that you only care about a single date:
def size_at(date=Time.now.to_date)
ymd = [date.year, date.month, date.day]
added = view.reduce.
startkey(["created_at"]).
endkey( ["created_at", *ymd, {}]).rows.first || {}
deleted = view.reduce.
startkey(["deleted_at"]).
endkey( ["deleted_at", *ymd, {}]).rows.first || {}
added.fetch("value", 0) - deleted.fetch("value", 0)
end
Basically, let CouchDB do the reduction for you. I didn't originally realize that you could mix and match reduce with startkey/endkey.
Unfortunately, this approach requires two hits to the DB (although those could be parallelized or pipelined). And it doesn't work as well when you want to get a lot of these sizes at once (e.g. view the whole history, rather than just look at one date).
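For completeness, here is the same two-query idea against CouchDB's plain HTTP view API in Python; the design document and view names (counts, by_type_and_date) are placeholders for whatever your view is actually called:
import json
import requests

def size_at(base_url, db, year, month, day):
    # hypothetical design doc / view names; the view's map emits
    # ["created_at", y, m, d] and ["deleted_at", y, m, d] keys with a sum reduce
    view_url = f"{base_url}/{db}/_design/counts/_view/by_type_and_date"
    def range_total(kind):
        params = {
            "startkey": json.dumps([kind]),
            "endkey": json.dumps([kind, year, month, day, {}]),
        }
        rows = requests.get(view_url, params=params).json()["rows"]
        return rows[0]["value"] if rows else 0
    return range_total("created_at") - range_total("deleted_at")

print(size_at("http://localhost:5984", "mydb", 2012, 8, 1))  # e.g. 27
Same caveat as above: this still costs two round-trips per date.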

How to select based on a partial string match in Mathematica

Say I have a matrix that looks something like this:
{{"foobar", 77}, {"faabar", 81}, {"foobur", 22}, {"faabaa", 8},
 {"faabian", 88}, {"foobar", 27}, {"fiijii", 52}}
and a list like this:
{"foo", "faa"}
Now I would like to add up the numbers for each line in the matrix based on the partial match of the strings in the list so that I end up with this:
{{"foo", 126}, {"faa", 177}}
I assume I need to map a Select command, but I am not quite sure how to do that and match only the partial string. Can anybody help me? My real matrix is around 1.5 million lines, so something that isn't too slow would be of added value.
Here is a starting point:
data={{"foobar",77},{"faabar",81},{"foobur",22},{"faabaa",8},{"faabian",88},{"foobar",27},{"fiijii",52}};
{str,vals}=Transpose[data];
vals=Developer`ToPackedArray[vals];
findValPos[str_List, strPat_String] :=
  Flatten[Developer`ToPackedArray[
    Position[StringPosition[str, strPat], Except[{}], {1}, Heads -> False]]]
Total[vals[[findValPos[str,"faa"]]]]
Here is yet another approach. It is reasonably fast, and also concise.
data =
{{"foobar", 77},
{"faabar", 81},
{"foobur", 22},
{"faabaa", 8},
{"faabian", 88},
{"foobar", 27},
{"fiijii", 52}};
match = {"foo", "faa"};
f = {#2, Tr @ Pick[#[[All, 2]], StringMatchQ[#[[All, 1]], #2 <> "*"]]} &;
f[data, #] & /@ match
{{"foo", 126}, {"faa", 177}}
You can use ruebenko's pre-processing for greater speed.
This is about twice as fast as his method on my system:
{str, vals} = Transpose[data];
vals = Developer`ToPackedArray[vals];
f2 = {#, Tr @ Pick[vals, StringMatchQ[str, "*" <> # <> "*"]]} &;
f2 /@ match
Notice that in this version I test substrings that are not at the beginning, to match ruebenko's output. If you want to only match at the beginning of strings, which is what I assumed in the first function, it will be faster still.
make data
mat = {{"foobar", 77},
{"faabar", 81},
{"foobur", 22},
{"faabaa", 8},
{"faabian", 88},
{"foobar", 27},
{"fiijii", 52}};
lst = {"foo", "faa"};
now select
r1 = Select[mat, StringMatchQ[lst[[1]], StringTake[#[[1]], 3]] &];
r2 = Select[mat, StringMatchQ[lst[[2]], StringTake[#[[1]], 3]] &];
{{lst[[1]], Total @ r1[[All, 2]]}, {lst[[2]], Total @ r2[[All, 2]]}}
gives
{{"foo", 126}, {"faa", 177}}
I'll try to make it more functional/general if I can...
edit(1)
This below makes it more general. (using same data as above):
foo[mat_, lst_] := Select[mat, StringMatchQ[lst, StringTake[#[[1]], 3]] &]
r = Map[foo[mat, #] &, lst];
MapThread[ {#1, Total[#2[[All, 2]]]} &, {lst, r}]
gives
{{"foo", 126}, {"faa", 177}}
So now same code above will work if lst was changed to 3 items instead of 2:
lst = {"foo", "faa", "fii"};
How about:
list = {{"foobar", 77}, {"faabar", 81}, {"foobur", 22}, {"faabaa",
8}, {"faabian", 88}, {"foobar", 27}, {"fiijii", 52}};
t = StringTake[#[[1]], 3] &;
{t[#[[1]]], Total[#[[All, 2]]]} & /@ SplitBy[SortBy[list, t], t]
{{"faa", 177}, {"fii", 52}, {"foo", 126}}
I am sure I have read a post, maybe here, in which someone described a function that effectively combined sorting and splitting but I cannot remember it. Maybe someone else can add a comment if they know of it.
Edit
OK, must be bedtime -- how could I forget GatherBy:
{t[#[[1]]], Total[#[[All, 2]]]} & /@ GatherBy[list, t]
{{"foo", 126}, {"faa", 177}, {"fii", 52}}
Note that for a dummy list of 1.4 million pairs this took a couple of seconds, so it's not exactly a super fast method.
