In this blog post we are going to do the following -
- Write a Lambda function in Node.js/TypeScript to extract the following data from a website
  - Title of the page
  - All images on the page
- Store the extracted data in AWS S3
We will use the following node packages for this project -
- serverless (must be installed globally): This will help us write & deploy Lambda functions
- cheerio: This will help us parse the content of a webpage into a jQuery-like object
- axios: Promise-based HTTP client for the browser and Node.js
- exceljs: To read, manipulate and write spreadsheets
- aws-sdk: To upload the extracted data to S3 and generate signed URLs
- serverless-offline: To run Lambda functions locally
Step 1: Install serverless globally
npm install -g serverless
Step 2: Create a new TypeScript-based project from the serverless template library like this
sls create --template aws-nodejs-typescript
Step 3: Install the required node packages for this Lambda project
npm install axios exceljs cheerio aws-sdk
npm install --save-dev serverless-offline
Step 4: Add serverless-offline to the plugins list in serverless.ts
plugins: ['serverless-webpack', 'serverless-offline']
Step 5: Add the S3 bucket name as an environment variable in serverless.ts like this
environment: {
  AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
  AWS_BUCKET_NAME: 'YOUR BUCKET NAME'
}
Step 6: Define your function in the serverless.ts file like this
import type { AWS } from '@serverless/typescript';

const serverlessConfiguration: AWS = {
  service: 'scrapeContent',
  frameworkVersion: '2',
  custom: {
    webpack: {
      webpackConfig: './webpack.config.js',
      includeModules: true
    }
  },
  // Add the serverless-webpack and serverless-offline plugins
  plugins: ['serverless-webpack', 'serverless-offline'],
  provider: {
    name: 'aws',
    runtime: 'nodejs14.x',
    apiGateway: {
      minimumCompressionSize: 1024,
    },
    environment: {
      AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
      AWS_BUCKET_NAME: 'scrape-data-at-56'
    },
  },
  functions: {
    scrapeContent: {
      handler: 'handler.scrapeContent',
      events: [
        {
          http: {
            method: 'get',
            path: 'scrapeContent',
          }
        }
      ]
    }
  }
};

module.exports = serverlessConfiguration;
Step 7: In your handler.ts file, define your function to do the following
- Receive the URL to scrape from the query string
- Make a GET request to that URL using axios
- Parse the response data using cheerio
- Extract the data from the parsed response, storing the page details in a JSON file and all the image URLs in an Excel file
- Upload the extracted data to S3
import { APIGatewayEvent } from "aws-lambda";
import "source-map-support/register";
import axios from "axios";
import * as cheerio from "cheerio";
import { badRequest, okResponse, errorResponse } from "./src/utils/responses";
import { scrape } from "./src/interface/scrape";
import { excel } from "./src/utils/excel";
import { getS3SignedUrl, uploadToS3 } from "./src/utils/awsWrapper";

export const scrapeContent = async (event: APIGatewayEvent, _context) => {
  try {
    const url = event.queryStringParameters?.url;
    if (!url) {
      return badRequest;
    }
    //load the page
    const $ = cheerio.load((await axios.get(url)).data);
    //extract the title and all images on the page
    const scrapeData = {} as scrape;
    scrapeData.images = [];
    scrapeData.url = url;
    scrapeData.dateOfExtraction = new Date();
    scrapeData.title = $("title").text();
    $("img").each((_i, image) => {
      scrapeData.images.push({
        url: $(image).attr("src"),
        alt: $(image).attr("alt"),
      });
    });
    //add this data to an excel sheet and upload it to S3
    const excelSheet = await saveDataAsExcel(scrapeData);
    const objectKey = `${scrapeData.title.toLocaleLowerCase().replace(/ /g, '_')}_${new Date().getTime()}`;
    await uploadToS3({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.xlsx`,
      ContentType:
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
      Body: await excelSheet.workbook.xlsx.writeBuffer()
    });
    //get a signed url (with an expiry) to download the result as xlsx
    scrapeData.xlsxUrl = await getS3SignedUrl({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.xlsx`,
      Expires: 3600 //this is 60 minutes, change as per your requirements
    });
    //upload the scraped data as JSON to S3 as well
    await uploadToS3({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.json`,
      ContentType: 'application/json',
      Body: JSON.stringify(scrapeData)
    });
    return okResponse(scrapeData);
  } catch (error) {
    return errorResponse(error);
  }
};
/**
 * Saves the scraped data into an excel workbook
 * @param scrapeData
 * @returns excel workbook wrapper
 */
async function saveDataAsExcel(scrapeData: scrape) {
  const workbook: excel = new excel({ headerRowFillColor: '046917', defaultFillColor: 'FFFFFF' });
  const worksheet = await workbook.addWorkSheet({ title: 'Scraped data' });
  workbook.addHeaderRow(worksheet, [
    "Title",
    "URL",
    "Date of extraction",
    "Image URL",
    "Image ALT Text"
  ]);
  workbook.addRow(
    worksheet,
    [
      scrapeData.title,
      scrapeData.url,
      scrapeData.dateOfExtraction.toDateString()
    ],
    { bold: false, fillColor: "ffffff" }
  );
  for (const image of scrapeData.images) {
    workbook.addRow(
      worksheet,
      [
        '', '', '',
        image.url,
        image.alt
      ],
      { bold: false, fillColor: "ffffff" }
    );
  }
  return workbook;
}
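The handler above imports a few small helper modules (src/interface/scrape.ts, src/utils/responses.ts and src/utils/excel.ts) that are not listed in this post. If you are following along, here is a minimal sketch of what the interface and response helpers could look like, inferred purely from how they are used above; the exact shapes are assumptions, not the original code.

// src/interface/scrape.ts - a hypothetical shape inferred from how scrapeData is used in handler.ts
export interface scrape {
  url: string;
  title: string;
  dateOfExtraction: Date;
  images: { url?: string; alt?: string }[];
  xlsxUrl?: string;
}

// src/utils/responses.ts - minimal API Gateway response helpers (assumed implementation)
export const badRequest = {
  statusCode: 400,
  body: JSON.stringify({ message: 'Missing required query string parameter: url' })
};

export const okResponse = (data: unknown) => ({
  statusCode: 200,
  body: JSON.stringify(data)
});

export const errorResponse = (error: Error) => ({
  statusCode: 500,
  body: JSON.stringify({ message: error.message })
});

Similarly, the excel wrapper around exceljs could be as simple as the sketch below; addWorkSheet, addHeaderRow and addRow here are assumptions matching the calls made in saveDataAsExcel. Note that exceljs expects ARGB hex colours (e.g. 'FF046917'), so you may need to prefix the 6-character values used above.

// src/utils/excel.ts - a hypothetical thin wrapper over exceljs, matching the calls in saveDataAsExcel
import * as ExcelJS from 'exceljs';

interface ExcelOptions { headerRowFillColor: string; defaultFillColor: string; }
interface RowOptions { bold: boolean; fillColor: string; }

export class excel {
  workbook: ExcelJS.Workbook;

  constructor(private options: ExcelOptions) {
    this.workbook = new ExcelJS.Workbook();
  }

  // creates a worksheet with the given title
  async addWorkSheet({ title }: { title: string }): Promise<ExcelJS.Worksheet> {
    return this.workbook.addWorksheet(title);
  }

  // adds a bold header row filled with the configured header colour
  addHeaderRow(worksheet: ExcelJS.Worksheet, headers: string[]): void {
    this.styleRow(worksheet.addRow(headers), true, this.options.headerRowFillColor);
  }

  // adds a data row with the requested style
  addRow(worksheet: ExcelJS.Worksheet, values: (string | undefined)[], options: RowOptions): void {
    this.styleRow(worksheet.addRow(values), options.bold, options.fillColor);
  }

  private styleRow(row: ExcelJS.Row, bold: boolean, argb: string): void {
    row.eachCell((cell) => {
      cell.font = { bold };
      cell.fill = { type: 'pattern', pattern: 'solid', fgColor: { argb } };
    });
  }
}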
Step 8: Set your AWS access key and AWS secret key in your environment like this
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
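The last helper, src/utils/awsWrapper.ts, is also not shown in the post. A minimal sketch using the aws-sdk v2 S3 client might look like this; the S3 client automatically picks up the credentials you just exported, and the function signatures below are assumptions based on how uploadToS3 and getS3SignedUrl are called in handler.ts.

// src/utils/awsWrapper.ts - a hypothetical implementation of the two helpers used by the handler
import * as AWS from 'aws-sdk';

// the S3 client reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
const s3 = new AWS.S3();

// upload an object (the xlsx buffer or the JSON string) to the bucket
export async function uploadToS3(params: AWS.S3.PutObjectRequest) {
  return s3.upload(params).promise();
}

// generate a time-limited download URL for an object in the bucket
export async function getS3SignedUrl(params: { Bucket: string; Key: string; Expires: number }): Promise<string> {
  return s3.getSignedUrlPromise('getObject', params);
}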
Step 9: You are now ready to run this function on your machine like this
sls offline --stage local
Now you should be able to access your function from your machine at http://localhost:3000/local/scrapeContent?url=ANY_URL_YOU_WISH_TO_SCRAPE
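For example, you can hit the local endpoint with curl (example.com below is just a placeholder, use any URL you like):
curl "http://localhost:3000/local/scrapeContent?url=https://example.com"
The response is the scrapeData JSON, including the signed xlsxUrl you can use to download the generated spreadsheet.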
Step 10: If you wish to deploy this Lambda function to your AWS account, you can do it like this -
sls deploy
You can check out this Lambda function here.