NodeJS, MongoDB data modelling for CSV data

Problem
I have a CSV file with some data like below.
PK,title,year,length,budget,rating,votes,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,mpaa,Action,Animation,Comedy,Drama,Documentary,Romance,Short
1,$,1971,121,NA,6.4,348,4.5,4.5,4.5,4.5,14.5,24.5,24.5,14.5,4.5,4.5,,0,0,1,1,0,0,0
2,$1000 a Touchdown,1939,71,NA,6,20,0,14.5,4.5,24.5,14.5,14.5,14.5,4.5,4.5,14.5,,0,0,1,0,0,0,0
The CSV file is around 5 MB in size and has 58,000+ lines like the sample above.
Current Scenario
Currently I am parsing the above data, converting it to objects, and saving the resulting array of objects to MongoDB. Something like below:
{ PK: '1',
title: '$',
year: '1971',
length: '121',
budget: 'NA',
rating: '6.4',
votes: '348',
r1: '4.5',
r2: '4.5',
r3: '4.5',
r4: '4.5',
r5: '14.5',
r6: '24.5',
r7: '24.5',
r8: '14.5',
r9: '4.5',
r10: '4.5',
mpaa: '',
Action: '0',
Animation: '0',
Comedy: '1',
Drama: '1',
Documentary: '0',
Romance: '0',
Short: '0' }
{ PK: '2',
title: '$1000 a Touchdown',
year: '1939',
length: '71',
budget: 'NA',
rating: '6',
votes: '20',
r1: '0',
r2: '14.5',
r3: '4.5',
r4: '24.5',
r5: '14.5',
r6: '14.5',
r7: '14.5',
r8: '4.5',
r9: '4.5',
r10: '14.5',
mpaa: '',
Action: '0',
Animation: '0',
Comedy: '1',
Drama: '0',
Documentary: '0',
Romance: '0',
Short: '0' }
However, when I upload the file I receive the error Document exceeds maximum allowed bson size of 16777216 bytes.
I tried using GridFS. The file uploads and the chunks are prepared fine as well, but I am not sure I can retrieve the data back as an array of objects.
I need to retrieve all the data to crunch it and create some analysis.
The CSV file is one source of data. Another source would be getting data from a web service on a proprietary system where the same process is followed.
Question
I think there is a flaw in the data model and in the way I am saving the data to MongoDB. If so, what is the optimal way to handle this amount of data?
Would really appreciate any help.
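No answer is recorded here, but the error itself points at the likely fix: the 16 MB BSON limit applies per document, so storing all 58,000 rows inside a single document's array is what overflows it, whereas one document per CSV row stays far below the limit and can still be read back as an array of objects with a single find(). Below is a minimal sketch of that model in Python with pymongo (the file, database, and collection names are placeholders); the Node.js driver supports the same batched pattern via insertMany().
import csv
from pymongo import MongoClient

# Placeholder connection string, database, and collection names.
client = MongoClient("mongodb://localhost:27017")
movies = client["moviedb"]["movies"]

BATCH_SIZE = 1000                       # insert in batches, not one huge array
batch = []
with open("movies.csv", newline="") as f:
    for row in csv.DictReader(f):       # one dict per CSV line
        batch.append(row)               # each row becomes its own document
        if len(batch) == BATCH_SIZE:
            movies.insert_many(batch)
            batch = []
    if batch:                           # flush the final partial batch
        movies.insert_many(batch)

# All rows can still be pulled back as a list of objects for analysis:
all_rows = list(movies.find({}, {"_id": 0}))
With this model, the web-service source can feed the same collection through the same batched inserts.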

Related

Extract a specific field from a list in python

I have a list in Python. I would like to extract the field countryRegion only. How am I supposed to do that? The code lst['countryRegion'] produces the error: TypeError: list indices must be integers or slices, not str.
Help me please:
[{'__type': 'Location:http://schemas.microsoft.com/search/local/ws/rest/v1',
'address': {'adminDistrict': 'Tamil Nadu',
'adminDistrict2': 'Chennai',
'countryRegion': 'India',
'formattedAddress': 'Chennai, Tamil Nadu',
'locality': 'Chennai'},
'bbox': [12.85071, 79.97689, 13.23403, 80.33292],
'confidence': 'High',
'entityType': 'PopulatedPlace',
'geocodePoints': [{'calculationMethod': 'None',
'coordinates': [13.07209, 80.20186],
'type': 'Point',
'usageTypes': ['Display']}],
'matchCodes': ['Good'],
'name': 'Chennai, Tamil Nadu',
'point': {'coordinates': [13.07209, 80.20186], 'type': 'Point'}}]
Given your input
lst = [{'__type': 'Location:http://schemas.microsoft.com/search/local/ws/rest/v1',
'address': {'adminDistrict': 'Tamil Nadu',
'adminDistrict2': 'Chennai',
'countryRegion': 'India',
'formattedAddress': 'Chennai, Tamil Nadu',
'locality': 'Chennai'},
'bbox': [12.85071, 79.97689, 13.23403, 80.33292],
'confidence': 'High',
'entityType': 'PopulatedPlace',
'geocodePoints': [{'calculationMethod': 'None',
'coordinates': [13.07209, 80.20186],
'type': 'Point',
'usageTypes': ['Display']}],
'matchCodes': ['Good'],
'name': 'Chennai, Tamil Nadu',
'point': {'coordinates': [13.07209, 80.20186], 'type': 'Point'}}]
you get your data from
lst[0]['address']['countryRegion']
First, the TypeError is due to lst being a list type and not a dictionary type.
Second, countryRegion is not a key of your dictionary; it is inside the dictionary value stored under the key address.
It is producing that error because the list contains a single element, and that element is a dictionary. So to access the desired value you can just use:
list_dic = lst  # the same input shown above
list_dic[0]['address']['countryRegion']
That's because it is a list containing a dictionary.
I don't know how you obtained the JSON, but here's what I was able to get:
js = lst  # the same list of dictionaries shown above
for i in range(len(js)):
    print(js[i]["address"]["countryRegion"])
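For completeness, a variant not in the original answers: if some results might lack an address key, dict.get avoids a KeyError and yields None instead.
regions = [item.get('address', {}).get('countryRegion') for item in lst]
print(regions)  # ['India']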

AWK + gsub - how to round a floating-point number

Do you have an idea how I can round floating-point numbers after multiplying?
I have the following SQL dump:
INSERT INTO
`honzavolfcz_product` (`product_id`, `feed_product_id`, `import_id`,
`import_active_product`, `model`, `sku`, `upc`, `ean`, `jan`, `isbn`, `mpn`,
`location`, `quantity`, `stock_status_id`, `product_status_id`, `image`,
`manufacturer_id`, `shipping`, `price`, `points`, `tax_class_id`,
`date_available`, `weight`, `weight_class_id`, `length`, `width`, `height`,
`length_class_id`, `subtract`, `minimum`, `sort_order`, `status`, `date_added`,
`date_modified`, `viewed`)
VALUES ('10', '0', '1',
'1', 'model', '', '', '', '', '', '',
'', '1', '1', '0', 'catalog/zbozi/bozi_laska_obal.jpg',
'0', '1', '**112.50**', '0', '1',
'2019-01-15', '0.00', '1', '0.00', '0.00', '0.00',
'1', '0', '1', '0', '1', '2019-02-15 16:16:29',
'2019-02-15 16:16:29', '293');
And I want to multiply the price value (112.50) by 1.21 (taxes) and then round up or down. I wrote the following command, which does the multiplication, but I do not know how to round the result:
awk '{a=substr($58,2,length($58)-3);gsub(a,a*1.21);print}' a > b
The result:
INSERT INTO
`honzavolfcz_product` (`product_id`, `feed_product_id`, `import_id`,
`import_active_product`, `model`, `sku`, `upc`, `ean`, `jan`, `isbn`, `mpn`,
`location`, `quantity`, `stock_status_id`, `product_status_id`, `image`,
`manufacturer_id`, `shipping`, `price`, `points`, `tax_class_id`,
`date_available`, `weight`, `weight_class_id`, `length`, `width`, `height`,
`length_class_id`, `subtract`, `minimum`, `sort_order`, `status`, `date_added`,
`date_modified`, `viewed`)
VALUES ('10', '0', '1',
'1', 'model', '', '', '', '', '', '',
'', '1', '1', '0', 'catalog/zbozi/bozi_laska_obal.jpg',
'0', '1', '**136.125**', '0', '1',
'2019-01-15', '0.00', '1', '0.00', '0.00', '0.00',
'1', '0', '1', '0', '1', '2019-02-15 16:16:29',
'2019-02-15 16:16:29', '293');
I would like to have 136 there instead of 136.125. Of course, 137 if it were 136.555.
Thank you in advance.
This may be what you want:
$ awk '{a=substr($58,2); $58=sprintf("\047%d\047,",a*1.21)} 1' file
INSERT INTO honzavolfcz_product (product_id, feed_product_id, import_id, import_active_product, model, sku, upc, ean, jan, isbn, mpn, location, quantity, stock_status_id, product_status_id, image, manufacturer_id, shipping, price, points, tax_class_id, date_available, weight, weight_class_id, length, width, height, length_class_id, subtract, minimum, sort_order, status, date_added, date_modified, viewed) VALUES ('10', '0', '1', '1', 'model', '', '', '', '', '', '', '', '1', '1', '0', 'catalog/zbozi/bozi_laska_obal.jpg', '0', '1', '136', '0', '1', '2019-01-15', '0.00', '1', '0.00', '0.00', '0.00', '1', '0', '1', '0', '1', '2019-02-15 16:16:29', '2019-02-15 16:16:29', '293');
but the rounding probably won't go quite as you'd like by default: %d truncates the fractional part toward zero, so 136.9 would become 136. Using sprintf("\047%.0f\047,", a*1.21) instead rounds to the nearest integer (typically with ties rounding to even). See https://www.gnu.org/software/gawk/manual/gawk.html#Round-Function and https://www.gnu.org/software/gawk/manual/gawk.html#Setting-the-rounding-mode for how to control the rounding mode with GNU awk.

Pulling item properties from Microsoft Sharepoint document library with Microsoft Graph API

I'm able to successfully pull file metadata from my SharePoint library with the Microsoft Graph API, but am having trouble pulling the properties of an item:
I can get a partial list of properties using this endpoint:
https://graph.microsoft.com/v1.0/sites/{site-id}/drives/{}/items/{}/children?$expand=listItem($expand=fields)
But the list that comes back from this endpoint doesn't match the list of properties that exist on the item.
For example, below is the list of fields that come from that endpoint; you can see that '.Push Too Salsify.' (one of the fields I need) is not present, and there are also fields here that don't appear in the item properties:
{'ParentLeafNameLookupId': '466', 'CLIPPING_x0020_STATUS': 'Not Started', 'Edit': '0', 'EditorLookupId': '67', '_ComplianceTagWrittenTime': '', 'RequiredField': 'teams/WORKFLOWDEMO/Shared Documents/1062CQP6.Phase4/1062CQP-Phase4-Size.tif', 'PM_x0020_SIGN_x0020_OFF': 'No', 'QA_x0020_APPROVED': 'No', 'ImageWidth': 3648, 'PM_x0020_Approval_x0020_Status': '-', 'AuthorLookupId': '6', 'SelectedFlag': '0', 'NameOrTitle': '1062CQP-Phase4-Size.tif', 'ItemChildCount': '0', 'FolderChildCount': '0', 'LinkFilename': '1062CQP-Phase4-Size.tif', 'ParentVersionStringLookupId': '466', 'PHOTOSTATUS': 'Not Started', '#odata.etag': '"c4b7516e-64df-46d2-b916-a1ee6f29d24a,8"', 'Thumbnail': '3648', '_x002e_Approval_x0020_Status_x002e_': 'Approved', 'Date_x0020_Created': '2019-10-09T04:25:40Z', '_CommentCount': '', 'Created': '2019-10-09T04:25:33Z', 'PreviewOnForm': '0', '_ComplianceTag': '', 'FileLeafRef': '1062CQP-Phase4-Size.tif', 'ImageHeight': 3648, 'LinkFilenameNoMenu': '1062CQP-Phase4-Size.tif', '_ComplianceFlags': '', 'ContentType': 'Document', 'Preview': '3648', 'ImageSize': '3648', 'Product_x0020_Category': 'Baseball', 'DATE_x0020_ASSIGNED': '2019-10-09T04:25:40Z', 'DateCreated': '2019-10-09T04:25:40Z', 'WORKFLOW_x0020_SELECTION': ['Select'], 'Predecessors': [], 'FileType': 'tif', 'LEGAL_x0020_APPROVED': 'No', 'PUSH_x0020_READY': False, 'FileSizeDisplay': '74966432', 'id': '466', '_LikeCount': '', '_ComplianceTagUserId': '', 'Modified': '2019-10-09T14:41:25Z', 'DocIcon': 'tif', '_UIVersionString': '0.7', '_CheckinComment': ''}
Any help would be greatly appreciated. I've scoured the documentation and can't seem to find the correct endpoint to pull item properties from a SharePoint DriveItem.
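No answer is recorded in this thread, but as a hedged starting point: a DriveItem's SharePoint columns live on its associated listItem, which Graph can expand directly. A minimal Python sketch, where site_id, drive_id, item_id, and access_token are placeholders and requests is assumed to be installed:
import requests

GRAPH = "https://graph.microsoft.com/v1.0"
site_id, drive_id, item_id = "{site-id}", "{drive-id}", "{item-id}"   # placeholders
access_token = "..."                                                  # acquired elsewhere

# Expand the driveItem's underlying listItem together with its fields.
url = f"{GRAPH}/sites/{site_id}/drives/{drive_id}/items/{item_id}/listItem?$expand=fields"
resp = requests.get(url, headers={"Authorization": f"Bearer {access_token}"})
resp.raise_for_status()
fields = resp.json()["fields"]
print(sorted(fields))
Note that SharePoint stores columns under internal names (spaces become _x0020_, as in the dump above), so a display name like '.Push Too Salsify.' may sit under a different key; the list's /columns endpoint maps display names to internal names.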

cx_Oracle version check

I'm using PyCharm 2019.1 Professional and am able to connect to an Oracle database over JDBC using a thin driver (jdbc:oracle:thin:@host:PORT:SID). I'm trying to use the cx_Oracle library (version 1.1.9) with Anaconda 3.6, but the library does not seem to have the .connect or .makedsn functions. I find this unusual, and am at a loss.
Do I just have the wrong cx_Oracle version even though I installed using pip?
Does the 1.1.9 version that works with Anaconda 3.6 just not have these functions?
Or is there a different/easier library I can use to connect with jdbc:oracle:thin:@host:PORT:SID?
dir(cx_Oracle)
Outputs:
['ARRAY', 'BIGINT', 'BINARY', 'BLANK_SCHEMA', 'BLOB', 'BOOLEAN',
'BigInteger', 'Binary', 'Boolean', 'CHAR', 'CLOB',
'CheckConstraint', 'Column', 'ColumnDefault', 'Constraint',
'DATE', 'DATETIME', 'DDL', 'DECIMAL', 'Date', 'DateTime',
'DefaultClause', 'Enum', 'FLOAT', 'FetchedValue', 'Float',
'ForeignKey', 'ForeignKeyConstraint', 'INT', 'INTEGER', 'Index',
'Integer', 'Interval', 'JSON', 'LargeBinary', 'MetaData',
'NCHAR', 'NUMERIC', 'NVARCHAR', 'Numeric', 'PassiveDefault',
'PickleType', 'PrimaryKeyConstraint', 'REAL', 'SMALLINT',
'Sequence', 'SmallInteger', 'String', 'TEXT', 'TIME',
'TIMESTAMP', 'Table', 'Text', 'ThreadLocalMetaData', 'Time',
'TypeDecorator', 'Unicode', 'UnicodeText', 'UniqueConstraint',
'VARBINARY', 'VARCHAR', '__all__', '__builtins__', '__cached__',
'__doc__', '__file__', '__go', '__loader__', '__name__',
'__package__', '__path__', '__spec__', '__version__', 'alias',
'all_', 'and_', 'any_', 'asc', 'between', 'bindparam', 'case',
'cast', 'collate', 'column', 'create_engine', 'delete', 'desc',
'distinct', 'engine', 'engine_from_config', 'event', 'events',
'exc', 'except_', 'except_all', 'exists', 'extract', 'false',
'func', 'funcfilter', 'insert', 'inspect', 'inspection',
'interfaces', 'intersect', 'intersect_all', 'join', 'lateral',
'literal', 'literal_column', 'log', 'modifier', 'not_', 'null',
'or_', 'outerjoin', 'outparam', 'over', 'pool', 'processors',
'schema', 'select', 'sql', 'subquery', 'table', 'tablesample',
'text', 'true', 'tuple_', 'type_coerce', 'types', 'union',
'union_all', 'update', 'util', 'within_group']
Print out the value of cx_Oracle.version. The version number 1.1.9 is not a valid cx_Oracle version! The latest version is 7.2.1 and has a much different set of values than the ones you printed! Take a look at the cx_Oracle installation documentation and the top-level module cx_Oracle documentation to get an idea of what I am talking about. If you have further questions, adjust your question above, add a comment below, and I'll see if I can help further.
To check the version of cx_Oracle you need Python installed. From your Python shell (command prompt, or your code/text editor), run the following:
C:\Users>python
Python 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cx_Oracle
>>> print(cx_Oracle.version)
8.3.0
>>> exit()
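As a quick sanity check, a genuine cx_Oracle install exposes both functions the question reports missing; a minimal sketch with placeholder credentials:
import cx_Oracle

print(cx_Oracle.version)                         # e.g. '8.3.0'
dsn = cx_Oracle.makedsn("host", 1521, sid="SID")
conn = cx_Oracle.connect("user", "password", dsn)
print(conn.version)                              # Oracle server version
conn.close()
If these names are missing, whatever module is actually being imported is not the real cx_Oracle.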

Select first element for one sublist in a 2d list python

QUAKE_DATA = [
['2017-11-16T18:42:11.676Z', '61.7647', '-153.9615', '0.8', '2.1', 'ml',
'', '', '', '0.64', 'ak', 'ak17253456',
'2017-11-16T18:58:24.707Z', '156km NNW of Redoubt Volcano, Alaska', 'earthquake',
'', '0.2', '', '', 'automatic', 'ak', 'ak'],
['2017-11-16T18:35:00.940Z', '34.1638333', '-116.4253333', '10.17', '1.76', 'ml',
'58', '33', '0.03663', '0.17', 'ci', 'ci37812975',
'2017-11-16T19:14:13.440Z', '6km N of Yucca Valley, CA', 'earthquake',
'0.14', '0.32', '0.18', '50', 'reviewed', 'ci', 'ci'],
['2017-11-16T18:06:15.460Z', '34.0181667', '-116.862', '17.3', '0.9', 'ml',
'23', '108', '0.04811', '0.12', 'ci', 'ci37812967',
'2017-11-16T19:23:12.335Z', '10km N of Banning, CA', 'earthquake',
'0.23', '0.61', '0.068', '13', 'reviewed', 'ci', 'ci'],
['2017-11-16T17:59:31.810Z', '34.1671667', '-116.4225', '10.6', '1.08', 'ml',
'33', '61', '0.03261', '0.17', 'ci', 'ci37812951',
'2017-11-16T18:57:01.554Z', '6km N of Yucca Valley, CA', 'earthquake',
'0.25', '0.37', '0.169', '13', 'reviewed', 'ci', 'ci'],
['2017-11-16T17:47:50.270Z', '37.7361679', '-122.1466675', '4.09', '1.52', 'md',
'12', '126', '0.0248', '0.04', 'nc', 'nc72925680',
'2017-11-16T18:34:02.533Z', '1km NNE of San Leandro, California', 'earthquake',
'0.25', '0.29', '0.13', '8', 'automatic', 'nc', 'nc'],
['2017-11-16T17:44:51.030Z', '37.5636673', '-118.8346634', '1.8', '1.66', 'md',
'16', '196', '0.02668', '0.04', 'nc', 'nc72925675',
'2017-11-16T18:23:03.511Z', '15km SE of Mammoth Lakes, California', 'earthquake',
'0.63', '0.43', '0.25', '13', 'automatic', 'nc', 'nc'],
['2017-11-16T17:34:22.310Z', '33.9796667', '-118.782', '14.78', '2.47', 'ml',
'41', '97', '0.06482', '0.25', 'ci', 'ci37812839',
'2017-11-16T19:11:53.824Z', '4km SE of Malibu, CA', 'earthquake',
'0.36', '0.68', '0.13', '94', 'reviewed', 'ci', 'ci']
]
for data in QUAKE_DATA:
    print(data[0])
The result I am getting:
2017-11-16T18:42:11.676Z
2017-11-16T18:35:00.940Z
2017-11-16T18:06:15.460Z
2017-11-16T17:59:31.810Z
2017-11-16T17:47:50.270Z
2017-11-16T17:44:51.030Z
2017-11-16T17:34:22.310Z
If you want the first element from the first sub-list, just take the first sub-list with:
QUAKE_DATA[0]
and then take the first element from that sub-list by indexing again:
QUAKE_DATA[0][0]
Simple as that, giving:
'2017-11-16T18:42:11.676Z'
There is no need for a for-loop as you just want to get one element which you can index directly. As it is, you are looping through every list in QUAKE_DATA and printing the first item from that list.
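And if the goal were the first element of every sub-list (which is what the posted loop prints), a list comprehension collects them in one expression:
first_fields = [row[0] for row in QUAKE_DATA]
print(first_fields[0])   # '2017-11-16T18:42:11.676Z'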
