How to add 3 dataframes to a dataframe dict - python-3.x

I have 3 dataframes: train, validation, test.
I want to create a dictionary with these 3 dataframes to get the output below,
where features are the dataframes' column names.
How can I build this dictionary of dataframes?
DatasetDict({
    train: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 6838
    })
    test: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 3259
    })
    validation: Dataset({
        features: ['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 886
    })
})
I am trying this:
DatasetDict = {}
dataframes = [train, validation, test]
for grp in dataframes:
    DatasetDict[grp] = df
But it's not working.

The loop above fails because a DataFrame is not hashable (so it cannot be a dict key), and df is never defined inside it. Key the dictionary by a name instead:
train.name = 'train'
test.name = 'test'
validation.name = 'validation'
datasetdict = {}
dataframes = [train, validation, test]
for df in dataframes:
    datasetdict[df.name] = {'features': df.columns.to_list(), 'num_rows': len(df)}
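If the goal is a real Hugging Face DatasetDict (which the expected output resembles) rather than a plain dict, a minimal sketch assuming the datasets library is installed:
from datasets import Dataset, DatasetDict

# Wrap each pandas DataFrame in a Dataset and key the DatasetDict by split name.
dataset_dict = DatasetDict({
    'train': Dataset.from_pandas(train),
    'validation': Dataset.from_pandas(validation),
    'test': Dataset.from_pandas(test),
})
print(dataset_dict)  # prints features and num_rows per split, as in the desired output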

Related

Unique nested dictionary from the 'for' loop Python3

I've got a host that executes commands via 'subprocess' and returns an output list of several parameters. The problem is that the output cannot be cleanly converted to a dictionary, whether via yaml or json. After the list is received, a regexp matches the valuable information and performs grouping. I want to build a unique dictionary where repeated keys are put into nested dictionaries.
Here is the code and the example of output list:
from re import compile, match

# Output can differ from request to request: the "keys" in
# list_of_values can duplicate or appear more than two times.
# The values mapped to the keys can differ too.
list_of_values = [
    "paramId: '11'", "valueId*: '11'",
    "elementId: '010_541'", 'mappingType: Both',
    "startRng: ''", "finishRng: ''",
    'DbType: sql', "activeSt: 'false'",
    'profile: TestPr1', "specificHost: ''",
    'hostGroup: tstGroup10', 'balance: all',
    "paramId: '194'", "valueId*: '194'",
    "elementId: '010_541'", 'mappingType: Both',
    "startRng: '1020304050'", "finishRng: '1020304050'",
    'DbType: sql', "activeSt: 'true'",
    'profile: TestPr1', "specificHost: ''",
    'hostGroup: tstGroup10', 'balance: all']
re_compile_valueId = compile(
    r"valueId\*:\s.(?P<valueId>\d{1,5})"
    r"|elementId:\s.(?P<elementId>\d{3}_\d{2,3})"
    r"|startRng:\s.(?P<startRng>\d{1,10})"
    r"|finishRng:\s.(?P<finishRng>\d{1,10})"
    r"|DbType:\s(?P<DbType>nosql|sql)"
    r"|activeSt:\s.(?P<activeSt>true|false)"
    r"|profile:\s(?P<profile>[A-z0-9]+)"
    r"|hostGroup:\s(?P<hostGroup>[A-z0-9]+)"
    r"|balance:\s(?P<balance>none|all|priority group)"
)
iterator_loop = 0
uniq_dict = dict()
next_dict = dict()
for element in list_of_values:
    match_result = match(re_compile_valueId, element)
    if match_result:
        temp_dict = match_result.groupdict()
        for key, value in temp_dict.items():
            if value:
                if key == 'valueId':
                    uniq_dict['valueId' + str(iterator_loop)] = ''
                    iterator_loop += 1
                    next_dict.update({key: value})
                else:
                    next_dict.update({key: value})
                    uniq_dict['valueId' + str(iterator_loop - 1)] = next_dict
print(uniq_dict)
This code responds with:
{
    'valueId0': {
        'valueId': '194',
        'elementId': '010_541',
        'DbType': 'sql',
        'activeSt': 'true',
        'profile': 'TestPr1',
        'hostGroup': 'tstGroup10',
        'balance': 'all',
        'startRng': '1020304050',
        'finishRng': '1020304050'
    },
    'valueId1': {
        'valueId': '194',
        'elementId': '010_541',
        'DbType': 'sql',
        'activeSt': 'true',
        'profile': 'TestPr1',
        'hostGroup': 'tstGroup10',
        'balance': 'all',
        'startRng': '1020304050',
        'finishRng': '1020304050'
    }
}
But I was expecting something like:
{
    'valueId0': {
        'valueId': '11',
        'elementId': '010_541',
        'DbType': 'sql',
        'activeSt': 'false',
        'profile': 'TestPr1',
        'hostGroup': 'tstGroup10',
        'balance': 'all',
        'startRng': '',
        'finishRng': ''
    },
    'valueId1': {
        'valueId': '194',
        'elementId': '010_541',
        'DbType': 'sql',
        'activeSt': 'true',
        'profile': 'TestPr1',
        'hostGroup': 'tstGroup10',
        'balance': 'all',
        'startRng': '1020304050',
        'finishRng': '1020304050'
    }
}
I've also got another version below, which runs and stores the values as expected. But its structure defeats the purpose of the loop, because every result key carries its own order number. The example is below; list_of_values and re_compile_valueId are the same as in the previous example.
# Assumes list_of_values and re_compile_valueId from the previous example;
# iterator_loop starts at 1 here so the keys match the output shown below.
iterator_loop = 1
uniq_dict = dict()
for element in list_of_values:
    match_result = match(re_compile_valueId, element)
    if match_result:
        temp_dict = match_result.groupdict()
        for key, value in temp_dict.items():
            if value:
                if key == 'balance':
                    key = key + str(iterator_loop)
                    uniq_dict.update({key: value})
                    iterator_loop += 1
                else:
                    key = key + str(iterator_loop)
                    uniq_dict.update({key: value})
print(uniq_dict)
The output will look like:
{
    'valueId1': '11', 'elementId1': '010_541',
    'DbType1': 'sql', 'activeSt1': 'false',
    'profile1': 'TestPr1', 'hostGroup1': 'tstGroup10',
    'balance1': 'all', 'valueId2': '194',
    'elementId2': '010_541', 'startRng2': '1020304050',
    'finishRng2': '1020304050', 'DbType2': 'sql',
    'activeSt2': 'true', 'profile2': 'TestPr1',
    'hostGroup2': 'tstGroup10', 'balance2': 'all'
}
Would appreciate any help! Thanks!
It appeared that some documentation reading needed to be performed :D
The fix is to apply copy() to next_dict in the else branch, so each record stores its own snapshot instead of a reference to the one shared dict. Thanks to this thread:
Why does updating one dictionary object affect other?
Many thanks to the answer's author @thefourtheye (https://stackoverflow.com/users/1903116/thefourtheye).
The final code:
iterator_loop = 0
uniq_dict = dict()
next_dict = dict()
for element in list_of_values:
    match_result = match(re_compile_valueId, element)
    if match_result:
        temp_dict = match_result.groupdict()
        for key, value in temp_dict.items():
            if value:
                if key == 'valueId':
                    uniq_dict['valueId' + str(iterator_loop)] = ''
                    iterator_loop += 1
                    next_dict.update({key: value})
                else:
                    next_dict.update({key: value})
                    uniq_dict['valueId' + str(iterator_loop - 1)] = next_dict.copy()
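For reference, a minimal sketch (with illustrative variable names) of the aliasing behaviour that copy() avoids:
outer = {}
inner = {'a': 1}
outer['first'] = inner           # stores a reference, not a snapshot
inner['a'] = 2
print(outer['first']['a'])       # 2 -- both names point at the same dict
outer['second'] = inner.copy()   # shallow copy: a snapshot of inner right now
inner['a'] = 3
print(outer['second']['a'])      # still 2 -- later changes to inner don't leak in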
Thanks to everyone who got involved.

Write DataFrame to parquet on HDFS partitioned by multiple columns with dynamic partitionOverwriteMode

I have a dataframe that I want to save in parquet format to HDFS, partitioned by multiple columns.
When I write the data to HDFS, the directory itself is created with only a _SUCCESS file in it, but no data. I use partitionOverwriteMode=dynamic with overwrite as the save mode, and the target path does not exist at the time I run the code. If I change the save mode to append, it works fine.
I also tried writing to the local file system; in that case both modes work correctly.
If only 1 partition column is specified, it also works fine.
Any ideas on how I can make overwrite work with multi-column partitioning? Any tips appreciated. Thanks!
Code sample:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

data = [
    {'country': 'DE', 'fk_imported_at': '20191212', 'user_id': 15},
    {'country': 'DE', 'fk_imported_at': '20191212', 'user_id': 14},
    {'country': 'US', 'fk_imported_at': '20191212', 'user_id': 12},
    {'country': 'US', 'fk_imported_at': '20191212', 'user_id': 13},
    {'country': 'DE', 'fk_imported_at': '20191213', 'user_id': 4},
    {'country': 'DE', 'fk_imported_at': '20191213', 'user_id': 2},
    {'country': 'US', 'fk_imported_at': '20191213', 'user_id': 1},
]

if __name__ == '__main__':
    conf = SparkConf()
    conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
    spark = (
        SparkSession
        .builder
        .config(conf=conf)
        .appName('test partitioning')
        .enableHiveSupport()
        .getOrCreate()
    )
    df = spark.createDataFrame(data)
    df.show()
    df.repartition(1).write.parquet('/tmp/spark_save_mode', 'overwrite', ['fk_imported_at', 'country'])
    spark.stop()
I'm submitting the application in client mode. Spark version is 2.3.0; Hadoop version is 2.6.0.
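For comparison, a sketch of the same write using explicit keyword-style calls. It does not change the overwrite semantics, but it makes the save mode and partition columns easier to audit; note that from Spark 2.4 onward the overwrite mode can also be set per write via a DataFrameWriter option:
(df.repartition(1)
   .write
   .mode('overwrite')                         # save mode, spelled out
   .partitionBy('fk_imported_at', 'country')  # partition columns, spelled out
   # .option('partitionOverwriteMode', 'dynamic')  # per-write form, Spark 2.4+ only
   .parquet('/tmp/spark_save_mode'))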

How do I know which topic this word comes under?

This code works, but I want to print the topic name instead of Topic: 0 and Topic: 1. How do I know which topic a given word comes under?
for index, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(index, [w[0] for w in topic]))
This is the output:
Topic: 0
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']
I want "Topic Name" instead of Topic : 0
This might work (untested):
for index, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(lda_model.print_topic(index), [w[0] for w in topic]))
Try changing the formatted parameter to True, like this (with formatted=True each topic comes back as an already-formatted string, so print it directly):
for index, topic in lda_model.show_topics(formatted=True, num_words=30):
    print('Topic: {} \nWords: {}'.format(index, topic))
You can also check out the documentation for more information:
https://radimrehurek.com/gensim/models/ldamodel.html
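If named topics are what you're after: LDA topics have no intrinsic names, so a common approach is to keep your own mapping from topic id to a label you choose after inspecting the words. A minimal sketch (the label strings below are hypothetical):
# Hypothetical, hand-chosen labels for each topic id.
topic_names = {0: 'injury reports', 1: 'first aid incidents'}

for index, topic in lda_model.show_topics(formatted=False, num_words=30):
    name = topic_names.get(index, 'Topic {}'.format(index))  # fall back to the id
    print('Topic: {}\nWords: {}'.format(name, [w[0] for w in topic]))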

Python nested json to csv

I can't convert this JSON to CSV. I have been trying different solutions posted here using pandas and other parsers, but none of them solved it.
This is a small extract of the big JSON:
{'data': {'items': [{'category': 'cat',
                     'coupon_code': 'cupon 1',
                     'coupon_name': '$829.99/€705.79 ',
                     'coupon_url': 'link3',
                     'end_time': '2017-12-31 00:00:00',
                     'language': 'sp',
                     'start_time': '2017-12-19 00:00:00'},
                    {'category': 'LED ',
                     'coupon_code': 'code',
                     'coupon_name': 'text',
                     'coupon_url': 'link',
                     'end_time': '2018-01-31 00:00:00',
                     'language': 'sp',
                     'start_time': '2017-10-07 00:00:00'}],
          'total_pages': 1,
          'total_results': 137},
 'error_no': 0,
 'msg': '',
 'request': 'GET api/ #2017-12-26 04:50:02'}
I'd like to get an output like this with the columns:
category, coupon_code, coupon_name, coupon_url, end_time, language, start_time
I'm running Python 3.6 with no restrictions.
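A minimal sketch, assuming data holds the parsed dict shown above and 'coupons.csv' is a hypothetical output path; pandas.json_normalize flattens the list of records (on older pandas it lives at pandas.io.json.json_normalize):
import pandas as pd

# data = the parsed JSON dict from above
df = pd.json_normalize(data['data']['items'])  # one row per item
df.to_csv('coupons.csv', index=False,
          columns=['category', 'coupon_code', 'coupon_name', 'coupon_url',
                   'end_time', 'language', 'start_time'])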

Export nested dictionary to csv

I need to export a nested dictionary to CSV. Here's what each entry looks like (each needs to become one line in the CSV later):
{'createdTime': '2017-10-30T12:33:02.000Z',
 'fields': {'Date': '2017-10-30T12:32:56.000Z',
            'field1': 'example@gmail.com',
            'field2': 1474538185964188,
            'field3': 6337,
            ....},
 'id': 'reca7LBr64XM1ClWy'}
I think I need to iterate through the dictionary and build a list of lists(?) to create the CSV from with the csv module.
['Date', 'field1', 'field2', 'field3', ...],
['2017-10-30T12:32:56.000Z', 'example#gmail.com', 1474538185964188, 6337 ...]
My problem is to find a smart way to iterate through the dict to get to a list like this.
You can get the values this way:
def process_data():
    csv_data = [{'createdTime': '2017-10-30T12:33:02.000Z',
                 'fields': {'Date': '2017-10-30T12:32:56.000Z',
                            'field1': 'example@gmail.com',
                            'field2': 1474538185964188,
                            'field3': 6337},
                 'id': 'reca7LBr64XM1ClWy'},
                {'createdTime': '2017-10-30T12:33:02.000Z',
                 'fields': {'Date': '2017-10-30T12:32:56.000Z',
                            'field1': 'example@gmail.com',
                            'field2': 1474538185964188,
                            'field3': 6337},
                 'id': 'reca7LBr64XM1ClWy'}]
    headers = [key for key in csv_data[0]['fields'].keys()]
    body = []
    for row in csv_data:
        body_row = []
        for column_header in headers:
            body_row.append(row['fields'][column_header])
        body.append(body_row)
    return headers, body  # return the results so the caller can use them

headers, body = process_data()
# headers -- ['Date', 'field1', 'field2', 'field3']
# body    -- [['2017-10-30T12:32:56.000Z', 'example@gmail.com', 1474538185964188, 6337],
#            ['2017-10-30T12:32:56.000Z', 'example@gmail.com', 1474538185964188, 6337]]
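To finish the export, a short sketch with the csv module, assuming process_data() returns headers and body as above ('out.csv' is a hypothetical output path):
import csv

headers, body = process_data()
with open('out.csv', 'w', newline='') as f:  # newline='' avoids blank lines on Windows
    writer = csv.writer(f)
    writer.writerow(headers)  # header row
    writer.writerows(body)    # one row per entry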
