site diff tool with puppeteer



We recently had a requirement to build a site diff tool so that we can compare our pages in QA and prod to make sure the changes that go in are as expected.

Existing Stack

There used to be a version that worked, but it was a bit complicated. In the existing stack, the user provides a list of URL paths and 2 hosts as input to an entry lambda function, which spawns additional lambda functions (via SQS) for each path. Each of those connects to Sauce Labs, accesses the 2 hosts + path, takes screenshots with Selenium, and uploads them to an S3 bucket. It then writes a new SQS message to another queue, which spawns other lambda functions to do the image diff, upload the diff image to S3, and write more SQS messages, eventually triggering a reduce lambda function that generates summary JSON data for the UI.


To be honest, that is a lot of moving parts to reason about, troubleshoot, and maintain. Logs are scattered all around CloudWatch since so many functions are created. So I decided to make a slightly easier-to-use version.

Basic flow of new stack

  1. get all the paths from the CMS via API, so that the user does not need to collect them manually and copy-paste them in multiple places, which is error-prone
  2. use puppeteer in headless mode to access the site, manipulate the DOM, and then take a screenshot
  3. use JIMP to create a diff image based on a configured threshold, then create the summary JSON locally
  4. upload the diff images and summary to the S3 bucket which populates the UI
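The threshold check in step 3 can be sketched as a plain pixel-difference ratio. This is an illustrative sketch, not the tool's actual code; the function names and tolerance values are assumptions, and the RGBA byte layout matches what JIMP exposes via image.bitmap.data:

```typescript
// Fraction of pixels whose RGB channels differ by more than `tolerance`.
// Inputs are same-sized RGBA byte arrays (the layout JIMP exposes via
// image.bitmap.data); names and defaults here are illustrative only.
function diffRatio(a: Uint8Array, b: Uint8Array, tolerance = 16): number {
  if (a.length !== b.length) throw new Error("images must be the same size");
  let changed = 0;
  for (let i = 0; i < a.length; i += 4) { // 4 bytes (R, G, B, A) per pixel
    if (
      Math.abs(a[i] - b[i]) > tolerance ||
      Math.abs(a[i + 1] - b[i + 1]) > tolerance ||
      Math.abs(a[i + 2] - b[i + 2]) > tolerance
    ) {
      changed++;
    }
  }
  return changed / (a.length / 4);
}

// Step 3's rule: only create/upload a diff image when the page changed
// more than the configured threshold.
const needsDiffImage = (ratio: number, threshold = 0.01): boolean =>
  ratio > threshold;
```

Identical or near-identical pages fall below the threshold and never produce a diff image, which is what saves the S3 space mentioned below.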


With the above flow,

  • No complex stack on AWS; everything is done in one place, so debugging/logging is really easy.
  • A threshold is introduced, so we do not need to create diff images for paths that are similar enough or identical.
  • Previously there were always some diffs between prod and QA due to an extra prod-only feedback banner, which created big noise for screenshot comparison as it would sometimes cause pixel shift. Now we can easily remove it via puppeteer’s DOM API before taking the screenshot, so the result is much more accurate and concise.
  • Saves a lot of space on S3, as we only need to keep paths that are above the threshold. And since we are doing everything in one shot, we also do not have to upload the original screenshots to S3, which was required previously due to the multi-stage process.
  • If needed, we can integrate with Jenkins and run from there, ideally on a slave that has puppeteer installed.

lambda@edge prototype

Recently I was doing an MVP for replacing an ELB/EC2/Docker-based static site preview stack with a CloudFront/Lambda/S3-based one.


The purpose of this is to

  1. reduce the maintenance we have to do on the EC2 stack, like regular AMI updates.
  2. reduce the complexity of the stack, as the previous one involves building a custom image, storing the image, CloudFormation to bring up the stack, and EC2 user data to init the system (pull image, run docker compose, etc.).
  3. reduce the cost, as the ELB and EC2 instances have to run 24/7.
  4. increase the stability, as lambda does not rely on any specific runtime host, whereas our docker containers still have to run on some instance, even though docker has done a pretty good job on isolation.

EC2/Docker Stack

The existing stack is like below. On init, the docker containers pull code from GitHub, install the node dependencies, and run the preview command, which starts a browser-sync server that pulls data from the CMS and returns the combined HTML to the client browser.

When the code in GitHub is updated, we have to restart the EC2 instance to pick it up.


CloudFront/Lambda@edge Stack

In the new stack, we build the bundle from the GitHub code via Jenkins and push it to S3, which is fronted by a CloudFront distribution that notifies the lambda@edge function on request. When a user requests a page entry point (/bank/xxx), since it has no extension, CloudFront gets a cache miss and forwards the request to the origin. At this point, the lambda function we registered on the origin-request lifecycle receives the request before it goes to the origin, which is the perfect time to do manipulation. So in the lambda function, we request the HTML file from the origin by adding the .html extension, then request the dynamic data from the CMS, combine them in the function, and return the result to the user directly. Next, the browser parses the HTML and sends requests for the resources to CloudFront, where we can either serve from the CDN cache or fetch from the S3 origin.
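The entry-point handling can be sketched roughly like this. It is a simplified illustration of an origin-request handler, not the actual function; the real one goes further and returns a generated response combining the fetched HTML with CMS data:

```typescript
// Map an extensionless entry-point URI (e.g. /bank/xxx) to the built HTML
// object in S3; asset URIs that already have an extension pass through.
function rewriteEntryPointUri(uri: string): string {
  const lastSegment = uri.split("/").pop() || "";
  return lastSegment.includes(".") ? uri : uri + ".html";
}

// Minimal lambda@edge origin-request handler shape using the helper above.
// Returning the (possibly modified) request lets CloudFront continue to origin.
const handler = async (event: any) => {
  const request = event.Records[0].cf.request;
  request.uri = rewriteEntryPointUri(request.uri);
  return request;
};
```

The asset requests that follow keep their extensions, so they bypass this rewrite and can be served straight from the CDN cache.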

When the code in GitHub updates, we just need a hook to trigger a Jenkins build that pushes the new artifacts to S3. One thing to note is that we need to set the entry HTML file's TTL to 0 on CloudFront so that we do not have to invalidate it explicitly when deploying new code. It is a trade-off.



I was having a hard time with lambda@edge logging in CloudWatch. The function triggered from the lambda test console logs fine; however, when the function is triggered via CloudFront, nothing appears under the /aws/lambda/Function_Name log path. I had to open an enterprise AWS support ticket for it. It turns out that logs from functions triggered by CloudFront have a region prefix, like: /aws/lambda/us-east-1.Function_Name

CloudFront Trigger Selection

There are currently (as of 09/15/2018) 4 triggers we can choose from:

  1. the time a viewer request is received
  2. the time of cache miss and send request to origin
  3. the time it receives response from origin and before it caches the object
  4. the time it returns the content to the viewer.

Types 1 and 4 are the expensive, heavy hooks that are triggered on every request no matter what! Be careful when selecting them, as they may increase latency as well as cost. The origin request is the perfect lifecycle hook in this use case, as we only want the entry point to be manipulated. The subsequent requests for real assets can still be handled by CloudFront and leverage its caching capability.

nginx reverse proxy S3 files

China access issue

Recently some of our church site users reported that the sermon audio/video download feature no longer works. We had recently moved our large files from the file system to S3. After some research, it looks like AWS S3 is blocked by the famous Chinese Great Firewall (GFW).

Possible Solutions

Moving the files back to the file system (EBS) is one option, but that may be too much, as we already decided to store files in S3, which is much cheaper and easier to maintain.

We tried the 2nd way, which is to use our existing web server nginx as a reverse proxy. The directives in nginx are quite tricky and carry a lot of conventions, which are not easy to document or debug.

The config is quite simple though

location ^~ /s3/ {
    rewrite /s3/(.*) /$1 break;

    proxy_set_header Host '';
    proxy_set_header Authorization '';
    proxy_hide_header x-amz-id-2;
    proxy_hide_header x-amz-storage-class;
    proxy_hide_header x-amz-request-id;
    proxy_hide_header Set-Cookie;
    proxy_ignore_headers "Set-Cookie";

    # the bucket endpoint is a placeholder -- use your own bucket URL
    proxy_pass https://your-bucket.s3.amazonaws.com;
}

Basically, we match any URL that starts with /s3 and rewrite it with a regex to get rid of the /s3 prefix, leaving the real key in S3. The break keyword makes nginx continue processing within the current location, whereas last would search for another matching handler. Then proxy_pass proxies the request to S3 with the processed real key. The other settings just hide all the Amazon response headers, etc.
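The effect of that rewrite on the URI can be mimicked in isolation (an illustration of the regex only, not nginx itself):

```typescript
// What `rewrite /s3/(.*) /$1 break;` does to the request URI:
// strip the /s3 prefix so the remaining path is the real object key.
function stripS3Prefix(uri: string): string {
  return uri.replace(/^\/s3\/(.*)$/, "/$1");
}
```

For example, /s3/sermons/talk.mp3 becomes /sermons/talk.mp3, which is the key nginx then proxies to the bucket.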

Nginx location order

From the HttpCoreModule docs:

  1. Directives with the “=” prefix that match the query exactly. If found, searching stops.
  2. All remaining directives with conventional strings. If this match used the “^~” prefix, searching stops.
  3. Regular expressions, in the order they are defined in the configuration file.
  4. If #3 yielded a match, that result is used. Otherwise, the match from #2 is used.

Example from the documentation:

location  = / {
  # matches the query / only.
  [ configuration A ]
}
location  / {
  # matches any query, since all queries begin with /, but regular
  # expressions and any longer conventional blocks will be
  # matched first.
  [ configuration B ]
}
location /documents/ {
  # matches any query beginning with /documents/ and continues searching,
  # so regular expressions will be checked. This will be matched only if
  # regular expressions don't find a match.
  [ configuration C ]
}
location ^~ /images/ {
  # matches any query beginning with /images/ and halts searching,
  # so regular expressions will not be checked.
  [ configuration D ]
}
location ~* \.(gif|jpg|jpeg)$ {
  # matches any request ending in gif, jpg, or jpeg. However, all
  # requests to the /images/ directory will be handled by
  # Configuration D.
  [ configuration E ]
}

pipe operator in rxjs

The pipe function is introduced so that we can combine any number of operators.

const source$ = Observable.range(0, 10)
  .filter(x => x % 2)
  .reduce((acc, next) => acc + next, 0)
  .map(value => value * 2)
  .subscribe(x => console.log(x));

Above can be converted to:

const source$ = Observable.range(0, 10).pipe(
  filter(x => x % 2),
  reduce((acc, next) => acc + next, 0),
  map(value => value * 2)
).subscribe(x => console.log(x));

Pros are:

“Problems with the patched operators for dot-chaining are:

  1. Any library that imports a patch operator will augment the Observable.prototype for all consumers of that library, creating blind dependencies. If the library removes their usage, they unknowingly break everyone else. With pipeables, you have to import the operators you need into each file you use them in.
  2. Operators patched directly onto the prototype are not “tree-shakeable” by tools like rollup or webpack. Pipeable operators will be as they are just functions pulled in from modules directly.
  3. Unused operators that are being imported in apps cannot be detected reliably by any sort of build tooling or lint rule. That means that you might import scan, but stop using it, and it’s still being added to your output bundle. With pipeable operators, if you’re not using it, a lint rule can pick it up for you.
  4. Functional composition is awesome. Building your own custom operators becomes much, much easier, and now they work and look just like all other operators from rxjs. You don’t need to extend Observable or override lift anymore.”
We can also compose several operators and reuse them as a single operator:

import { Observable, pipe } from 'rxjs/Rx';
import { filter, map, reduce } from 'rxjs/operators';

const filterOutEvens = filter(x => x % 2);
const sum = reduce((acc, next) => acc + next, 0);
const doubleBy = x => map(value => value * x);

const complicatedLogic = pipe(
  filterOutEvens,
  doubleBy(2),
  sum
);

const source$ = Observable.range(0, 10);

source$.let(complicatedLogic).subscribe(x => console.log(x)); // 50
With the tap operator, we can basically perform operations/logic with side effects, and it returns the original observable, unaffected by any of the modifications.

unsubscribe in rxjs(angular 2+)


In the reactive world (rxjs/ng2+), it is common and convenient to just create some subject/observable and subscribe to it for event handling, etc. It is like the GoF observer pattern out of the box.


One caveat we recently hit is: we call subscribe() on some subjects from our service in our ngOnInit or ngAfterViewInit functions, but forget to unsubscribe in the component. The consequence is that each time the component is recreated during a route change, one more subscription is added to the subject. This is pretty bad if we are doing something heavy in the callback, or even worse, making some http call.

solution 1 – unsubscribe in ngOnDestroy

One solution is to keep a reference to the subscription, which is returned by the subscribe function, and then call its unsubscribe() method in angular's ngOnDestroy() lifecycle hook. It works and is fine if there are only a few of them. If there are many and this needs to be done in each related component, it becomes quite tedious.

Solution 2 – custom decorator calling ngOnDestroy

Another solution is to write a custom decorator which provides the logic for ngOnDestroy. The component itself still needs to keep a list of subscriptions.

Solution 3 – use takeUntil operator

This way is to use a subject to tell all subscriptions to stop taking values once it emits a value. It is more declarative IMHO.

import { OnDestroy } from '@angular/core';
import { Subject } from 'rxjs/Subject';

/**
 * Extend this class if a component has subscriptions that need to be
 * unsubscribed on destroy.
 * example: myObservable.takeUntil(this.destroyed$).subscribe(...);
 */
export abstract class UnsubscribableComponent implements OnDestroy {
  // the subject used to notify the end of subscriptions (usually with the `takeUntil` operator).
  protected destroyed$: Subject<boolean> = new Subject();

  protected constructor() {}

  ngOnDestroy(): void {
    this.destroyed$.next(true);
    this.destroyed$.complete();
  }
}
So in the component it can be something like:

export class MyOwnComponent extends UnsubscribableComponent implements OnInit {
  ngOnInit() {
    this.someObservable$ // any observable, e.g. one exposed by a service
      .takeUntil(this.destroyed$)
      .subscribe(result => {
        if (result) {
          // handle the result
        }
      });
  }
}


sessionStorage/localStorage scope

Firstly, localStorage and sessionStorage are 2 objects on the window object. They are tied to the origin of the current window.

As a result they are bound to:

  1. protocol, http/https are different
  2. domain
    1. a subdomain can share with its parent by manually setting document.domain.
    2. different domains cannot share storage.
  3. port
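In other words, storage is keyed by the page's origin. As a small illustration (using the standard URL parser here, not any storage API), two URLs share Web Storage only when all three parts match:

```typescript
import { URL } from "url";

// Two pages can see the same localStorage/sessionStorage only when their
// origins match: same protocol, same host, and same port.
function sameStorageScope(a: string, b: string): boolean {
  const ua = new URL(a);
  const ub = new URL(b);
  return (
    ua.protocol === ub.protocol &&
    ua.hostname === ub.hostname &&
    ua.port === ub.port
  );
}
```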

The same thing applies to a 302 redirect. Session/local storage values set on a page are not available on the page after the redirect if the two are different origins, even if they are in the SAME tab/window.

It can also be understood as per-application, as the values can be viewed in the dev tools' Application tab.



MDN link

debug typescript mocha and server in vscode

We are currently developing a graphql api using apollo-server and TypeORM on top of AWS Lambda. Code-wise it is kind of straightforward: schema definitions, then resolvers, then a service layer, then a dao layer, then models defined in typeorm with its annotations/decorators. However, there are 2 issues related to debugging: unit tests and running graphql locally.

unit test

For unit tests, our ci/cd pipeline uses nyc/mocha as the runner. Those are good for running all test suites and generating coverage reports etc. However, when it comes to debugging, we need to go to the IDE. And as we are using typescript, there is one more layer of transpilation than with vanilla es5/6, which makes this a bit more complicated.

The good news is that vscode comes with a powerful built-in node debugger. With the below config, we can just open a ts file with mocha tests, set a breakpoint, and start debugging:

  "name": "TS Mocha Tests File",
  "type": "node",
  "request": "launch",
  "program": "${workspaceRoot}/node_modules/mocha/bin/_mocha",
  "args": ["-r", "ts-node/register", "${relativeFile}"],
  "cwd": "${workspaceRoot}",
  "protocol": "inspector",
  "env": { "TS_NODE_PROJECT": "${workspaceRoot}/tsconfig.json"}
  • Sets up a node task that launches mocha
  • Passes a -r argument, which tells mocha to require ts-node
  • Passes in the currently open file – ${relativeFile}
  • Sets the working directory to the project root – ${workspaceRoot}
  • Sets the node debug protocol to V8 Inspector mode
  • The last one, TS_NODE_PROJECT, I had to set because I am using TypeORM, whose annotations/decorators require emitDecoratorMetadata to be set to true, which is not the default.

Local Run with nodemon

Another issue is that, as we are using aws lambda, it is not easy to run our graphql server locally.
We need to set up a local Koa server with the same schema the Apollo lambda uses. This way we can access the graphiql service at localhost:8080/graphiql.

import 'reflect-metadata';
import * as Koa from 'koa';
import { initDatabase } from '../../dao/data-source';
import * as Router from 'koa-router';
import * as koaBody from 'koa-bodyparser';
import { graphqlKoa, graphiqlKoa } from 'apollo-server-koa';
import { schema } from '../../gq-schema';
import { localConf } from '../../config/config';

export const routes = new Router();

// API entrypoint
const apiEntrypointPath = '/graphql';
const graphQlOpts = graphqlKoa({
    schema,
    context: { msg: 'hello context' }
});

// routes.get(apiEntrypointPath, graphQlOpts);
routes.post(apiEntrypointPath, koaBody(), graphQlOpts);

// GraphiQL entrypoint
routes.get('/graphiql', graphiqlKoa({ endpointURL: apiEntrypointPath }));

(async () => {
  await initDatabase();
  const app = new Koa();
  app.use(routes.routes()).use(routes.allowedMethods());
  app.listen(8080); // matches the localhost:8080/graphiql URL above
})();
Now we can have nodemon run this server, so that every time we make a code change, the server reloads with the new content. Put the below content in nodemon.json in the project root.

  "watch": ["./src"],
  "ext": "ts",
  "exec": "ts-node --inspect= ./path/to/above/server.ts"

Notice we run ts-node with the --inspect=9229 flag, node's default inspector port, so that we can later debug in chrome's built-in node debugger (the green cube icon in chrome's dev tools console).

Now we can run the local server by adding a command to package.json:

"local": "npm run build && nodemon"

Then run npm run local or yarn local.

Option 2 – debug server with vscode

To debug the above server with vscode, we need to add some config into launch.json:

      "name": "Local Graphql Server",
      "type": "node",
      "request": "launch",
      "args": [
      "runtimeArgs": [
      "sourceMaps": true,
      "cwd": "${workspaceRoot}",
      "protocol": "inspector",

  • Sets up a node task that starts the currently open file in VS Code (the ${relativeFile} variable contains the currently open file)
  • Passes in the --nolazy arg for node, which tells v8 to compile your code ahead of time, so that breakpoints work correctly
  • Passes in -r ts-node/register for node, which ensures that ts-node is loaded before it tries to execute your code
  • Sets the working directory to the project root – ${workspaceRoot}
  • Sets the node debug protocol to V8 Inspector mode (see above)

Now we can set break point in vscode and start debugging.

PS: No Enum in x.d.ts

One thing I noticed today: in the xxx.d.ts module definition file, never define things like Enums, as this file is used for type/interface definitions only and its content will NOT be compiled to js, hence not available at run time. Anything like an enum defined here will compile fine, but when you run the application, as soon as you use these enums, you get a runtime error.

One alternative solution is to use custom type and define the list of strings:

export type MessageLevel = "Unknown" | "Fatal" | "Critical" | "Error";
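Since a plain const array does survive compilation (unlike anything declared in a d.ts), it can back a runtime check. The guard below is an illustrative addition, not from the original module:

```typescript
type MessageLevel = "Unknown" | "Fatal" | "Critical" | "Error";

// Lives in a regular .ts file, so it exists in the emitted js at run time,
// unlike an enum declared in a d.ts file.
const MESSAGE_LEVELS: MessageLevel[] = ["Unknown", "Fatal", "Critical", "Error"];

// Type guard narrowing an arbitrary string to the MessageLevel union.
function isMessageLevel(value: string): value is MessageLevel {
  return (MESSAGE_LEVELS as string[]).indexOf(value) !== -1;
}
```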