How to scrape data from a website using Node.js

I am new to Node.js. Since this morning I have been trying to scrape data from the website https://www.pmkisan.gov.in/StateDist_Beneficiery.aspx
I also want to store that date in a DB table. Currently I am hard-coding the date. How can I scrape that date?
My code:
(async () => {
    await performDbActions(webData);
})();

async function performDbActions(data) {
    let dataToBeInsert = {};
    // console.log(data);
    for (const d of data) {
        if (Object.keys(d).length) {
            // console.log(d);
            let district = await db.sequelize.query("select * from abc where name=:districtName or other_names like :districtLike", {
                replacements: {districtName: d['name'], districtLike: '%' + d['name'] + '%'},
                raw: true,
                type: db.sequelize.QueryTypes.SELECT
            });
            delete d['sno'];
            delete d['name'];
            d['as_on'] = '2020-02-06'; // hard-coded date that should come from the page
        }
    }
}

According to the page's source code, the date you're looking for is inside a <span> with the id ContentPlaceHolder1_lbldate. So you can use cheerio to get its text content and pass the result to performDbActions as an additional parameter:
//...
const date = $('#ContentPlaceHolder1_lbldate').text();
//...
await performDbActions(webData, date);
// ...

async function performDbActions(data, date) {
    // ...
    // it would be safer to use an external date library like moment.js,
    // but here's a way to convert the date in plain JS
    const dateParts = date.split('/');
    const dateObj = new Date(dateParts[2], dateParts[1] - 1, dateParts[0]);
    d['created_at'] = dateObj;
}
Note that the date is in format dd/mm/yyyy, so you may have to convert it to your desired format.
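For reference, here is a minimal end-to-end sketch of that scraping step, assuming the page is fetched with axios and parsed with cheerio; the code that builds webData from the table rows is not shown in the question, so it is only indicated by a placeholder:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeAndStore() {
    // Fetch the page and load it into cheerio
    const response = await axios.get('https://www.pmkisan.gov.in/StateDist_Beneficiery.aspx');
    const $ = cheerio.load(response.data);

    // The "as on" date shown on the page (format dd/mm/yyyy)
    const date = $('#ContentPlaceHolder1_lbldate').text().trim();

    // Build webData from the table rows here (omitted, as in the original code)
    const webData = [];

    await performDbActions(webData, date);
}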

Related

Express/MongoDB date formatting to match req.params to date in mongo document

I have a MongoDB collection (SlopeDay) that has dates stored.
In my express routing, I'm looking to format the date to MM-DD-YYYY so that I can use that for the URL. That URL will find all documents with matching dates AND matching resortNames.
dateRouter.get("/:formattedDate", (req, res) => {
    const formattedDate = req.params.formattedDate;
    SlopeDay.find({}) // isolate dates
        .then((dateObj) => {
            dateObj.forEach((date, i) => {
                let dateStr =
                    // MM-DD-YYYY reformatting to string
                    ("0" + (date.date.getMonth() + 1)).slice(-2) +
                    "-" +
                    ("0" + date.date.getDate()).slice(-2) +
                    "-" +
                    date.date.getFullYear();
                // map below doesn't seem to be doing much
                const objWithFormattedDate = dateObj.map((obj) => {
                    return { ...obj, formattedDate: dateStr, isNew: true };
                });
                // console.log(objWithFormattedDate);
            });
        });
});
I'm at a loss for how to do this properly. I need the get route to access all SlopeDay documents matching dates to the MM-DD-YYYY parameter URL.
I'm able to get it working by breaking the strings up and querying that way:
dateRouter.get("/:formattedDate", (req, res) => {
    const formattedDate = req.params.formattedDate;
    // break up the date
    const targetChars = formattedDate.substring(3, 5);
    const beforeTargetChar = formattedDate.substring(0, 3);
    const afterTargetChar = formattedDate.substring(5);
    // create lower and upper boundaries that straddle the formatted date
    const lowerbound = beforeTargetChar + (targetChars - 1) + afterTargetChar;
    const upperbound =
        beforeTargetChar + (Number(targetChars) + 1) + afterTargetChar;
    SlopeDay.find({
        date: {
            // find docs with dates between the boundaries (THIS SHOULD EQUAL req.params.formattedDate)
            $gte: new Date(lowerbound),
            $lt: new Date(upperbound),
        }, // add 2nd query here
    }).then((dateData) => res.send(dateData));
});
If you want to do it with plain JavaScript, I can recommend this post, which might help.
Otherwise, there are many libraries you could use to do this as well. Personally I like using Day.js. With its format function it would look something like this, and it should fit your needs and more if you want to take that route.
dayjs(yourDateHere).format('MM-DD-YYYY')
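If you take the Day.js route, a rough sketch of the whole handler could look like the following. It uses the customParseFormat plugin (needed to parse MM-DD-YYYY reliably) and builds a one-day range for the query; the model and field names are taken from your snippets, everything else is an assumption:

const dayjs = require("dayjs");
const customParseFormat = require("dayjs/plugin/customParseFormat");
dayjs.extend(customParseFormat);

dateRouter.get("/:formattedDate", (req, res) => {
    // e.g. "02-06-2021" -> [start of day, end of day]
    const day = dayjs(req.params.formattedDate, "MM-DD-YYYY");
    const start = day.startOf("day").toDate();
    const end = day.endOf("day").toDate();

    SlopeDay.find({
        date: { $gte: start, $lte: end },
        // a second condition such as resortName could be added here
    })
        .then((dateData) => res.send(dateData))
        .catch((err) => res.status(500).send(err));
});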
cheers!

Get map of map field from Firestore

I have a field of type map that contains maps of data in Firestore.
I am trying to retrieve this data using a Cloud Function in Node.js. I can get the document and the data from the field, but I can't get it in a usable way. I have tried every solution I can find on SO and Google, but the code below is the only one that gives me access to the data. I obviously need to be able to access each field within the map individually. In Swift I build an array of [String: Any], but I can't get that to work in Node.js.
const docRef = dbConst.collection('Comps').doc('XEVDk6e4AXZPkNprQRn5Imfcsah11598092006.724980');
return docRef.get().then(docSnap => {
    const tagets = docSnap.get('targets');
    console.log(tagets);
}).catch(result => { console.log(result) });
This is what I am getting back in the console.
In Swift I do the following, and so far I have not been able to find an equivalent in TypeScript. (I don't need to build the custom object, just the ability to access the keys and values.)
let obj1 = doc.get("targets") as! [String: Any]
for objs in obj1 {
    let obs = objs.value as! [String: Any]
    let targObj = compUserDetails(IDString: objs.key, activTarg: obs["ActivTarget"] as! Double, stepTarg: obs["StepTarget"] as! Double, name: obs["FullName"] as! String)
}
UPDATE
After spending a whole day working on it, I thought I had a solution using the below:
const docRef = dbConst.collection('Comps').doc('XEVDk6e4AXZPkNprQRn5Imfcsah11598092006.724980');
return docRef.get().then(docSnap => {
    const tagets = docSnap.get('targets') as [[string, any]];
    const newDataMap = [];
    for (let [key, value] of Object.entries(tagets)) {
        const tempMap = new Map<String, any>();
        console.log(key);
        const newreWorked = value;
        tempMap.set('uid', key);
        for (let [key1, value1] of Object.entries(newreWorked)) {
            tempMap.set(key1, value1);
            newDataMap.push(tempMap); // pushed inside the inner loop
        }
    }
    newDataMap.forEach(element => {
        const name = element.get('FullName');
        console.log(name);
    });
});
However, the new data map has 6 separate mapped objects, 3 of each of the original objects from the cloud. I can now iterate through and get the data for a given key, but I have 3 times as many objects.
So after two days of searching and getting very close, I finally worked out a solution. It is very similar to the code above, but this works. It may not be the "correct" way, but it works. Feel free to make other suggestions.
return docRef.get().then(docSnap => {
    const tagets = docSnap.get('targets') as [[string, any]];
    const newDatarray = [];
    for (let [key, value] of Object.entries(tagets)) {
        const tempMap = new Map<String, any>();
        const newreWorked = value;
        tempMap.set('uid', key);
        for (let [key1, value1] of Object.entries(newreWorked)) {
            tempMap.set(key1, value1);
        }
        newDatarray.push(tempMap); // push once per entry, after the inner loop
    }
    newDatarray.forEach(element => {
        const name = element.get('FullName');
        const steps = element.get('StepTarget');
        const avtiv = element.get('ActivTarget');
        const UID = element.get('uid');
        console.log(name);
        console.log(steps);
        console.log(avtiv);
        console.log(UID);
    });
}).catch(result => { console.log(result) });
I made this into a little function that gets the underlying object from a map:
function getMappedValues(map) {
    var tempMap = {};
    for (const [key, value] of Object.entries(map)) {
        tempMap[key] = value;
    }
    return tempMap;
}
For an object with an array of maps in firestore, you can get the value of the first of those maps like so:
let doc = { // Example firestore document data
    items: {
        0: {
            id: "1",
            sr: "A",
        },
        1: {
            id: "2",
            sr: "B",
        },
        2: {
            id: "3",
            sr: "B",
        },
    },
};
console.log(getMappedValues(doc.items[0]));
which would read { id: '1', sr: 'A' }
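Applied to the targets field from the question, a hedged sketch could look like this; the field names ActivTarget, StepTarget and FullName are taken from the Swift snippet, everything else is an assumption:

return docRef.get().then(docSnap => {
    const targets = docSnap.get('targets');
    // Each entry is [uid, { ActivTarget, StepTarget, FullName }]
    const users = Object.entries(targets).map(([uid, value]) => {
        return { uid, ...getMappedValues(value) };
    });
    users.forEach(u => console.log(u.uid, u.FullName, u.StepTarget, u.ActivTarget));
}).catch(err => console.log(err));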

Filtering Out unnecessary portions of scraped data

I am trying to make a scraper that scrapes the post ID and the poster's ID from a Facebook public post link, using Puppeteer and Node.js.
const puppeteer = require('puppeteer');

(async () => {
    let url = 'https://m.facebook.com/photo/?fbid=1168301430177531&set=gm.1386874671702414'; // demo link
    let brw = await puppeteer.launch();
    let page = await brw.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    let data = await page.evaluate(() => {
        let ids = document.querySelector('div[class="_57-p"] > a[class="_57-s touchable"]').search; // for image post
        return {
            ids
        };
    });
    console.log(data);
})();
and I get output like:
{
    ids: '?fbid=1168301430177531&id=100009930549147&set=gm.1386874671702414&refid=13&__tn__=%2B%3E'
}
How can I filter out the unnecessary portions? (I just want the fbid and id values.)
Thanks in advance
It seems this is the most reliable and simple way:
const href = document.querySelector('div[class="_57-p"] > a[class="_57-s touchable"]').href;
const searchParams = new URL(href).searchParams;
return {
    fbid: searchParams.get('fbid'),
    id: searchParams.get('id'),
};
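Plugged into the page.evaluate call from the question, that would look roughly like this (the selector is copied from your snippet; URL and searchParams are available in the browser context):

let data = await page.evaluate(() => {
    const href = document.querySelector('div[class="_57-p"] > a[class="_57-s touchable"]').href;
    const searchParams = new URL(href).searchParams;
    return {
        fbid: searchParams.get('fbid'),
        id: searchParams.get('id'),
    };
});
console.log(data); // { fbid: '1168301430177531', id: '100009930549147' }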
Try using query-string.
It will help you parse query strings.
let search = '?foo=bar'
const parsed = queryString.parse(search);
console.log(parsed);
//=> {foo: 'bar'}
This is just a simple example of what you should do.
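Applied to the string from your output, a minimal sketch might look like this (note that query-string v7 and below can be required from CommonJS; newer versions are ESM-only):

const queryString = require('query-string');

const ids = '?fbid=1168301430177531&id=100009930549147&set=gm.1386874671702414&refid=13&__tn__=%2B%3E';
const parsed = queryString.parse(ids);
console.log(parsed.fbid); //=> '1168301430177531'
console.log(parsed.id);   //=> '100009930549147'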
You can use the match() method with a regex like /\Wfbid=(\w+)(\W|$)/; index 1 of the result (the first capturing group) will contain the desired parameter value.
let ids = '?fbid=1168301430177531&id=100009930549147&set=gm.1386874671702414&refid=13&__tn__=%2B%3E'
const fbid = ids.match(/\Wfbid=(\w+)(\W|$)/)[1] // 1168301430177531
const id = ids.match(/\Wid=(\w+)(\W|$)/)[1] // 100009930549147
Without [1] you'd get the whole match result, e.g.:
ids.match(/\Wid=(\w+)(\W|$)/)
=>
["&id=100009930549147&", "100009930549147", "&", index: 22, input: "?fbid=1168301430177531&id=100009930549147&set=gm.1386874671702414&refid=13&__tn__=%2B%3E", groups: undefined]
And you need the string between the surrounding & characters, i.e. the second element of the array (so [1]).

Dynamo DB Query Filter Node.js

Running a Node.js serverless backend through AWS.
Main objective: to filter and list all LOCAL jobs (table items) that include the available services and zip codes provided to the filter.
I'm passing in multiple zip codes and multiple available services.
data.radius would be an array of zip codes, equal to something like this: [ '93901', '93902', '93905', '93906', '93907', '93912', '93933', '93942', '93944', '93950', '95377', '95378', '95385', '95387', '95391' ]
data.availableServices would also be an array, equal to something like this: ['Snow removal', 'Ice Removal', 'Salting', 'Same Day Response']
I am trying to make an API call that returns only items that have a zipCode matching one of the zip codes provided in data.radius, and whose packageSelected matches one of the services provided in data.availableServices.
API CALL
import * as dynamoDbLib from "./libs/dynamodb-lib";
import { success, failure } from "./libs/response-lib";

export async function main(event, context) {
    const data = JSON.parse(event.body);
    const params = {
        TableName: "jobs",
        FilterExpression: "zipCode = :radius, packageSelected = :availableServices",
        ExpressionAttributeValues: {
            ":radius": data.radius,
            ":availableServices": data.availableServices
        }
    };
    try {
        const result = await dynamoDbLib.call("query", params);
        // Return the matching list of items in response body
        return success(result.Items);
    } catch (e) {
        return failure({ status: false });
    }
}
Do I need to map the array of zip codes and available services first for this to work?
Should I be using comparison operators?
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LegacyConditionalParameters.QueryFilter.html
Is a sort key value or partition key required to query and filter? (The table has a sort key and partition key, but I would like to avoid using them in this call.)
I'm not 100% sure how to go about this, so if anyone could point me in the right direction that would be wonderful and greatly appreciated!
I'm not sure what your dynamodb-lib refers to but here's an example of how you can scan for attribute1 in a given set of values and attribute2 in a different set of values. This uses the standard AWS JavaScript SDK, and specifically the high-level document client.
Note that you cannot use an equality (=) test here; you have to use an inclusion (IN) test. And you cannot use query; you must use scan.
const AWS = require('aws-sdk');

let dc = new AWS.DynamoDB.DocumentClient({'region': 'us-east-1'});

const data = {
    radius: [ '93901', '93902', '93905', '93906', '93907', '93912', '93933', '93942', '93944', '93950', '95377', '95378', '95385', '95387', '95391' ],
    availableServices: ['Snow removal', 'Ice Removal', 'Salting', 'Same Day Response'],
};

// These hold ExpressionAttributeValues
const zipcodes = {};
const services = {};

data.radius.forEach((zipcode, i) => {
    zipcodes[`:zipcode${i}`] = zipcode;
})

data.availableServices.forEach((service, i) => {
    services[`:services${i}`] = service;
})

// These hold FilterExpression attribute aliases
const zipcodex = Object.keys(zipcodes).toString();
const servicex = Object.keys(services).toString();

const params = {
    TableName: "jobs",
    FilterExpression: `zipCode IN (${zipcodex}) AND packageSelected IN (${servicex})`,
    ExpressionAttributeValues : {...zipcodes, ...services},
};

dc.scan(params, (err, data) => {
    if (err) {
        console.log('Error', err);
    } else {
        for (const item of data.Items) {
            console.log('item:', item);
        }
    }
});
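If you want to keep the shape of your original handler, the generated expression and values can be dropped into the same params object; presumably your wrapper just needs the operation switched from "query" to "scan" (hedged, since I can't see what dynamodb-lib does internally):

const params = {
    TableName: "jobs",
    FilterExpression: `zipCode IN (${zipcodex}) AND packageSelected IN (${servicex})`,
    ExpressionAttributeValues: { ...zipcodes, ...services },
};

try {
    // scan, not query, because no key condition is used
    const result = await dynamoDbLib.call("scan", params);
    return success(result.Items);
} catch (e) {
    return failure({ status: false });
}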

Copying data from one DB to another with node-sqlite - formatting the 'insert' statement

I'm writing a small utility to copy data from one sqlite database file to another. Both files have the same table structure - this is entirely about moving rows from one db to another.
My code right now:
let tables: Array<string> = [
    "OneTable", "AnotherTable", "DataStoredHere", "Video"
]

tables.forEach((table) => {
    console.log(`Copying ${table} table`);
    sourceDB.each(`select * from ${table}`, (error, row) => {
        console.log(row);
        destDB.run(`insert into ${table} values (?)`, ...row) // this is the problem
    })
})
row here is a js object, with all the keyed data from each table. I'm certain that there's a simple way to do this that doesn't involve escaping stringified data.
If your database driver has not blocked ATTACH, you can simply tell the database to copy everything:
ATTACH '/some/where/source.db' AS src;
INSERT INTO main.MyTable SELECT * FROM src.MyTable;
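A hedged sketch of driving that from node-sqlite3, assuming destDB is an open sqlite3.Database and tables is the array from the question (serialize just ensures the statements run in order):

destDB.serialize(() => {
    destDB.run(`ATTACH '/some/where/source.db' AS src`);
    tables.forEach((table) => {
        destDB.run(`INSERT INTO main.${table} SELECT * FROM src.${table}`);
    });
    destDB.run(`DETACH src`);
});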
You could iterate over the row and set up the query with dynamically generated parameters and references.
let tables: Array<string> = [
    "OneTable", "AnotherTable", "DataStoredHere", "Video"
]

tables.forEach((table) => {
    console.log(`Copying ${table} table`);
    sourceDB.each(`select * from ${table}`, (error, row) => {
        console.log(row);
        const keys = Object.keys(row); // ['column1', 'column2']
        const columns = keys.toString(); // 'column1,column2'
        let parameters = {};
        let values = '';
        // Generate values and named parameters
        keys.forEach((r) => {
            var key = '$' + r;
            // Generates '$column1,$column2' (without a leading comma)
            values = values ? values + ',' + key : key;
            // Generates { $column1: 'foo', $column2: 'bar' }
            parameters[key] = row[r];
        });
        // SQL: insert into OneTable (column1,column2) values ($column1,$column2)
        // Parameters: { $column1: 'foo', $column2: 'bar' }
        destDB.run(`insert into ${table} (${columns}) values (${values})`, parameters);
    })
})
I tried editing the answer by CL., but it was rejected. So, adding on to that answer, here's the JS code to achieve the same:
let sqlite3 = require('sqlite3-promise').verbose();

let sourceDBPath = '/source/db/path/logic.db';
let tables = ["OneTable", "AnotherTable", "DataStoredHere", "Video"];
let destDB = new sqlite3.Database('/your/dest/logic.db');

await destDB.runAsync(`ATTACH '${sourceDBPath}' AS sourceDB`);
await Promise.all(tables.map(table => {
    return new Promise(async (res, rej) => {
        await destDB.runAsync(`
            CREATE TABLE ${table} AS
            SELECT * FROM sourceDB.${table}`
        ).catch(e => {
            console.error(e);
            rej(e);
        });
        res('');
    })
}));
