Using AWS Serverless to Mine Global Data for Analytics, Research and Marketing

Overview

Industry Data originally designed and built a proprietary data mining solution on AWS serverless services to create an Australia-wide business directory containing all known businesses, their contacts, and the additional information required for analysis and marketing.

To stay ahead of the competition, the business database had to be updated daily by locating all new or changed data sources across the web. A serverless architecture was designed to handle multiple updates from unpredictable workloads without the need to predict scaling requirements in advance.

Since go-live, this technology has been reused repeatedly to mine information from sources all over the world. The solution scales on demand, and because all data is landed in S3 it is easily consumed by downstream services.

Take a look at the original product website here: Industry Prospects

Business Challenges

  • Ability to mine data daily to apply the latest updates and changes from across the web
  • Unpredictable and potentially large workloads must run in minimal time to ensure all data is collected and processed
  • Multiple workloads must mine concurrently from global locations
  • Data outputs must be easily integrated with downstream processes for collation and analysis

“Industry Prospects is Australia's best maintained B2B prospect database. Our proprietary web bot scours the web and collects new and changing business data. It is the work horse that needs no rest.”

--industry-prospects.com.au


Project Outcomes

AWS Lambda, SQS and S3 were used in combination to create a serverless solution with the following business outcomes:

  • Service scales quickly with demand, processing all required workloads within expected timeframes
  • Input and output data stored in S3 for easy loading into SQL Server databases for offshore and front-end development
  • Service has been extended to mine data from global markets including the US, UK, and EU without disruption to current processing
  • Ability to run service in near real-time where required
  • Low execution costs of approximately $350 USD to maintain all data on over 800K Australian businesses

The Industry Data Solution

The solution includes several independent AWS Lambda functions coupled with Amazon SQS and S3. Data is initially placed in an S3 bucket, which triggers the service by invoking a controller Lambda function. Further Lambda functions are then invoked using a combination of Amazon CloudWatch triggers and SQS data.
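The first step of that flow, an S3 upload invoking a controller function that hands work to SQS, might look roughly like the Python sketch below. It is an illustration under assumptions rather than Industry Data's actual code: the queue URL, the message layout and the function names are all hypothetical, and only the S3 event notification structure and the boto3 `send_message_batch` call are standard AWS behaviour.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical queue URL, used only for illustration.
QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/mining-jobs"


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def build_batches(pairs, batch_size=10):
    """Group work items into SQS batch entries (SQS accepts at most 10 per call)."""
    entries = [
        {"Id": str(i), "MessageBody": json.dumps({"bucket": b, "key": k})}
        for i, (b, k) in enumerate(pairs)
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


def handler(event, context, sqs=None):
    """Controller entry point: queue one job per uploaded object."""
    if sqs is None:
        import boto3  # imported lazily so the helpers run without AWS access
        sqs = boto3.client("sqs")
    batches = build_batches(extract_objects(event))
    for entries in batches:
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
    return {"queued": sum(len(b) for b in batches)}
```

Passing the SQS client in as an optional argument keeps the parsing and batching logic testable locally. A production version would also inspect the `Failed` list in each `send_message_batch` response and retry those entries.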

Data is stored in S3 at every interval for easy tracking and problem solving, with CloudWatch logs used as triggers to invoke further Lambda functions. The solution is fully scalable at the component level to avoid bottlenecks, and the service is integral to all data mining jobs managed by Industry Data.
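As a rough illustration of landing intermediate data in S3 for tracking, the sketch below shows a worker function that reads jobs from an SQS-triggered event and writes each result under a stage-named prefix. The bucket name, the key layout, the job format and the `process` placeholder are assumptions made for this example; the source does not describe the real mining step.

```python
import json

# Hypothetical output bucket, used only for illustration.
OUTPUT_BUCKET = "example-mining-output"


def stage_key(stage, source_key):
    """Build an S3 key that records which stage produced the object."""
    return f"{stage}/{source_key}"


def process(job):
    """Placeholder for the real mining step, which is not described in the source."""
    return {"source": job["key"], "status": "processed"}


def handler(event, context, s3=None):
    """Worker entry point: process each SQS message and store the result in S3."""
    if s3 is None:
        import boto3  # imported lazily so the function can be tested without AWS
        s3 = boto3.client("s3")
    failures = []
    for record in event.get("Records", []):
        try:
            job = json.loads(record["body"])
            result = process(job)
            s3.put_object(
                Bucket=OUTPUT_BUCKET,
                Key=stage_key("processed", job["key"]),
                Body=json.dumps(result).encode("utf-8"),
            )
        except Exception:
            # Report only the failed message so the rest of the batch is not retried.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Writing one object per stage means a failed job can be traced by checking which prefixes contain its key. The `batchItemFailures` return value only has an effect when `ReportBatchItemFailures` is enabled on the SQS event source mapping; without it, Lambda ignores the response and treats the whole batch as succeeded.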

The AWS services utilised within this solution were: AWS Lambda / Amazon S3 / Amazon SQS

The serverless architecture developed for this solution is documented below. For more information on serverless computing by AWS please visit: AWS Serverless Architecture

Architecture diagram for an AWS serverless record matching solution