Spark dataframe from dictionary - apache-spark

I'm trying to create a spark dataframe from a dictionary which has data in the format
{'33_45677': 0, '45_3233': 25, '56_4599': 43524} .. etc.
dict_pairs={'33_45677': 0, '45_3233': 25, '56_4599': 43524}
df=spark.createDataFrame(data=dict_pairs)
It throws:
TypeError: can not infer schema for type: <class 'str'>
Is it because of the underscore in the keys of the dictionary?

Enclose the dict in square brackets '[]'. It's not because of the _ in your keys.
dict_pairs={'33_45677': 0, '45_3233': 25, '56_4599': 43524}
df=spark.createDataFrame(data=[dict_pairs])
df.show()
or
dict_pairs=[{'33_45677': 0, '45_3233': 25, '56_4599': 43524}]
df=spark.createDataFrame(data=dict_pairs)
df.show()
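If you would rather have the keys and values as two columns instead of a single wide row, one possible variation is to pass the dictionary's items (the column names 'key' and 'value' here are just placeholders):
dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}
df = spark.createDataFrame(list(dict_pairs.items()), ['key', 'value'])
df.show()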

Related

How to get the first column (index) in the dictionary output with Pandas?

I have not used pandas before, but it looks like it could be a really nice tool for data manipulation. I am using Python 3.7 and pandas 1.2.3.
I am passing a list of dictionaries, each with two fields, to the dataframe. A sample of the data looks like this:
data = [
{"Knowledge_Base_Link__c": null, "ClosedDate": "2021-01-06T19:02:14.000+0000"},
{"Knowledge_Base_Link__c": "http://someurl.com", "ClosedDate": "2021-01-08T21:26:49.000+0000"},
{"Knowledge_Base_Link__c": "http://someotherurl.com", "ClosedDate": "2021-01-09T20:35:58.000+0000"}
]
df = pd.DataFrame(data)
# Then I format the ClosedDate like so
df['ClosedDate'] = pd.to_datetime(df['ClosedDate'], format="%Y-%m-%d", exact=False)
# Next I get a count of the data
articles = df.resample('M', on='ClosedDate').count()
# print the results to the console
print(articles)
These are the results, and exactly what I want.
However, if I convert that to a list, or when I push it to a dictionary to use the data like below, the first column (the index, I presume) is missing from the output.
articles_by_month = articles.to_dict('records')
This final output is almost what I want, but it is missing the index column.
This is what I am getting:
[{'ClosedDate': 15, 'Knowledge_Base_Link__c': 5}, {'ClosedDate': 18, 'Knowledge_Base_Link__c': 11}, {'ClosedDate': 12, 'Knowledge_Base_Link__c': 6}]
This is what I want:
[{'Date': '2021-01-31', 'ClosedDate': 15, 'Knowledge_Base_Link__c': 5}, {'Date': '2021-02-28', 'ClosedDate': 18, 'Knowledge_Base_Link__c': 11}, {'Date': '2021-03-31', 'ClosedDate': 12, 'Knowledge_Base_Link__c': 6}]
A couple of things I have tried:
df.reset_index(level=0, inplace=True)
# This just takes the sum and puts it in a column called index, not sure how to get date like it is displayed in the first column of the screenshot
# I also tried this
df['ClosedDate'] = df.index
# However, this gives an "Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'" error.
I thought this would be simple and checked the pandas docs and many other Stack Overflow posts, but I cannot find a way to do this. Any thoughts on this would be appreciated.
Thanks
You can get an additional key in the dict with
articles.reset_index().to_dict('records')
But BEFORE that you have to rename your index, since ClosedDate (the index's name) is already a column:
articles.index = articles.index.rename('Date')
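Putting the two steps together (a sketch, assuming articles is the resampled frame from the question):
articles.index = articles.index.rename('Date')
articles_by_month = articles.reset_index().to_dict('records')
print(articles_by_month)
The 'Date' values will be pandas Timestamps; if you need plain strings like '2021-01-31', format them afterwards, e.g. with .strftime('%Y-%m-%d').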

Assign array of datetime type into panda dataframe

I have these arrays to assign into a pandas dataframe.
date_quote = []
price1 = []
price2 = []
The arrays have been filled with values: price1[] and price2[] contain floating-point values, while date_quote[] contains datetime values.
This is how I assign the arrays into the pandas dataframe.
df = pd.DataFrame({'price1': price1,
                   'price2': price2,
                   'date': date_quote})
I get the following error;
File "pandas\_libs\tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 537, in pandas._libs.tslib.array_to_datetime
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
File "pandas\_libs\tslibs\conversion.pyx", line 178, in pandas._libs.tslibs.conversion.datetime_to_datetime64
File "pandas\_libs\tslibs\conversion.pyx", line 387, in pandas._libs.tslibs.conversion.convert_datetime_to_tsobject
AttributeError: 'pywintypes.datetime' object has no attribute 'nanosecond'
The problem comes from assigning date_quote[], which holds datetime values. The code runs successfully if I do not assign date_quote[] into the dataframe.
The contents of date_quote[1] look like 2018-07-26 00:00:00+00:00. I only need the date and do not need the time information in date_quote[]. Do I need to do any extra conversion to store this datetime-typed date_quote[] array in the dataframe?
The output of print (date_quote[:3]) is
[pywintypes.datetime(2018, 7, 26, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), pywintypes.datetime(2018, 7, 27, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), pywintypes.datetime(2018, 7, 30, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))]
I am using python v3.6
I found the answer to my own question. The key lies in removing the time information from date_quote[], leaving behind only the date information.
for x in range(0, int(num_elements)):
    date_quote[x] = date_quote[x].date()
The assignment works without error after the time information is removed.
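The same conversion can also be written as a list comprehension before building the dataframe (a sketch, assuming pandas is imported as pd and the arrays come from the question):
date_quote = [d.date() for d in date_quote]
df = pd.DataFrame({'price1': price1, 'price2': price2, 'date': date_quote})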
You can also use the dateutil module to extract the date and time from the string representation of the pywintypes.datetime object. This way you can keep the time part too. Code below tested with Python 3.6.
import datetime, dateutil.parser, pywintypes
today = datetime.datetime.now() # get today's date as an example
mypydate = pywintypes.Time(today) # create a demo pywintypes object
mydate = dateutil.parser.parse(str(mypydate)) # convert it to a python datetime
print(mydate)
print(type(mydate))
Output:
2019-05-11 12:44:03.320533
<class 'datetime.datetime'>
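Applied to the question's array, the same conversion might look like this (a sketch, assuming date_quote holds the pywintypes datetimes from the question; the time zone information is preserved in the parsed datetimes):
import dateutil.parser
date_quote = [dateutil.parser.parse(str(d)) for d in date_quote]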

How to infer a schema for a pyspark dataframe?

There are many questions on this site regarding how to convert a pyspark rdd to a dataframe. But none of them answer the question of how to convert a SQL-table-style rdd to a dataframe while preserving type.
I have an rdd that is exactly a list of dicts in python:
>>> rdd.take(1)
[{'se_error': 0, 'se_subjective_count': 0, 'se_word_count': 10, 'se_entity_summary_topic_phrases': {}, 'se_entity_hits': 1, 'se_entity_summary': 'rt #mercuryinrx: disgusting. cut it out FOCALENTITY twitter.com/anons4cetacean', 'se_query_with_hits': 0, 'id': 180034992495.0, 'se_objective_count': 2, 'se_category': {}, 'se_sentence_count': 2, 'se_entity_sentiment': 0.0, 'se_document_sentiment': -0.49000000953674316, 'se_entity_themes': {}, 'se_query_hits': 0, 'se_named_entities': {}}]
>>> rdd.take(1)[0].keys()
dict_keys(['se_error', 'se_subjective_count', 'se_word_count', 'se_entity_summary_topic_phrases', 'se_entity_hits', 'se_entity_summary', 'se_query_with_hits', 'id', 'se_objective_count', 'se_category', 'se_sentence_count', 'se_entity_sentiment', 'se_document_sentiment', 'se_entity_themes', 'se_query_hits', 'se_named_entities'])
All rows have the same columns. All columns have the same datatype. This is trivial to turn into a dataframe in pandas.
out = rdd.take(rdd.count())
outdf = pd.DataFrame(out)
This of course defeats the purpose of using spark! I can demonstrate that the columns are all the same datatype as well.
>>> typemap = [{key: type(val) for key, val in row.items()} for row in out]
>>> typedf = pd.DataFrame(typemap)
>>> for col in list(typedf):
...     print(typedf[col].value_counts())
<class 'float'> 1016
Name: id, dtype: int64
<class 'dict'> 1010
Name: se_category, dtype: int64
<class 'float'> 1010
Name: se_document_sentiment, dtype: int64
<class 'int'> 1010
Name: se_entity_hits, dtype: int64
...
It goes on further, but the values in each column are all one type, or else they are Nones.
How do I do this in Spark? Here are some attempts that don't work:
>>> outputDf = rdd.toDF()
...
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
>>> outputDf = rdd.toDF(sampleRatio=0.1)
...
File "/usr/hdp/current/spark-client/python/pyspark/sql/types.py", line 905, in <lambda>
return lambda row: dict((kconv(k), vconv(v)) for k, v in row.items())
AttributeError: 'NoneType' object has no attribute 'items'
What is the issue here? Why is it so hard to figure out the datatype in a column that only has one python datatype?
The solution here is in these lines:
<class 'float'> 1016
Name: id, dtype: int64
<class 'dict'> 1010
Name: se_category, dtype: int64
There are 1016 rows total in this rdd, but in 6 of those rows the column se_category is absent. That is why you only see 1010 dict objects. This is no problem for pandas, which simply infers the type from the rest of the column and uses an empty object of the appropriate type (list -> []; dict -> {}; float or int -> NaN) to fill in the blanks.
Spark doesn't do that. If you think about it from the perspective of Java, which is the language underlying the rdd objects, this makes complete sense. Since I have been programming mostly python, a dynamically-typed language, for some time, it didn't occur to me immediately that this was a problem. But in a statically-typed language, it would be expected that something has a defined type at compile time.
The solution is to 'declare' each row to be returned to the rdd as a set of objects with types, thus imitating static typing. So I declare
{"int_field": 0, "list_field": [], "float_field": 0.0, "string_field": ""}
before I fill in any of the values. This way, if a value is not updated by my function that generates the rdd, the row still has all the correct types in place, and
outputDf = rdd.map(lambda x: Row(**x)).toDF()
successfully converts this rdd to a dataframe.
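A minimal sketch of that pattern (the field names and the rdd are placeholders standing in for the real record structure):
from pyspark.sql import Row

defaults = {"int_field": 0, "list_field": [], "float_field": 0.0, "string_field": ""}

def with_defaults(record):
    # Overlay the actual record on a fully-typed template, so every row
    # carries every field with a value of the right type.
    return {**defaults, **record}

outputDf = rdd.map(lambda x: Row(**with_defaults(x))).toDF()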

Python 3 convert int to dictionary

I want to convert this
(554, 334, 24, 15)
to
['554', ' 334', ' 24', ' 15']
If there is a similar question, then sorry, I didn't find one.
print(list(map(str, list((554, 334, 24, 15)))))
result = ['554', '334', '24', '15']
The tuple is converted into a list(), then map() applies the same function to each element of the list, in this case converting the int-type elements to strings; the resulting list is then printed.
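If you prefer, a list comprehension gives the same result (just an alternative to the map() approach above):
t = (554, 334, 24, 15)
result = [str(x) for x in t]
print(result)  # ['554', '334', '24', '15']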

Convert map object into numpy ndarray in python3

The following code works well in Python 2, but after migration to Python 3, it does not work.
How do I change this code for Python 3?
for i, idx in enumerate(indices):
    user_id, item_id = idx
    feature_seq = np.array(map(lambda x: user_id, item_id))
    X[i, :len(item_id), :] = feature_seq  # ---- error here ----
error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'map'
Thank you.
In Python 3, map returns an iterator, not a list.
You can also try numpy.fromiter to get an array from a map object, but only for 1-D data.
Example:
a = map(lambda x: x, range(10))
b = np.fromiter(a, dtype=int)
b
Output:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
For multidimensional arrays, refer to Reconcile np.fromiter and multidimensional arrays in Python
In Python 3, map is like a generator. You need to wrap it in list() to produce a list that np.array can use, e.g.
np.array(list(map(...)))
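For example (illustrative values only):
import numpy as np
a = np.array(list(map(lambda x: x * 2, range(5))))
print(a)  # [0 2 4 6 8]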
