How It Works

In a nutshell, SpookyStuff scales up data collection by distributing web clients across many machines. Each machine receives a portion of the heterogeneous tasks and runs them independently. The results can then either be transformed and reused to dig deeper into the web by visiting more dynamic pages, or be exported to one of many data stores, including local HDD, HDFS, Amazon S3, or simply a memory block in the JVM.
SpookyStuff stays extremely lightweight by offloading most of the task scheduling and data transformation work to Apache Spark. It doesn't depend on any file system (even HDFS is optional), backend database, message queue, or SOA. Your query speed is bounded only by your bandwidth and CPU power.
SpookyStuff uses PhantomJS/GhostDriver to access dynamic pages and mimic human interactions with them, but it doesn't render them, nor does it download any images embedded in them by default (unless you take a screenshot), which makes it considerably faster, even on a single machine.
SpookyStuff's query language is an extension of the Spark API, so it can be mixed freely with the APIs of other Spark components and ecosystems, particularly Spark SQL, GraphX, and MLlib.
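
The distribute, run, and reuse cycle described at the start of this section can be illustrated with a small sketch. This is not SpookyStuff's API and does not use Spark: it is a plain-Python stand-in that uses a thread pool in place of a cluster and a hypothetical in-memory `FAKE_WEB` dictionary in place of real pages. The names `partition`, `run_portion`, and `crawl` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the web: each "page" maps to the links it contains.
FAKE_WEB = {
    "seed/a": ["detail/1", "detail/2"],
    "seed/b": ["detail/3"],
    "detail/1": [],
    "detail/2": [],
    "detail/3": [],
}

def partition(tasks, n):
    """Split tasks into n portions, assigning them round-robin."""
    return [tasks[i::n] for i in range(n)]

def run_portion(portion):
    """What one worker does: visit each assigned page, independent of other workers."""
    return [(url, FAKE_WEB.get(url, [])) for url in portion]

def crawl(urls, workers=2):
    """Fan a list of URLs out to workers and collect (url, links) pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        portions = list(pool.map(run_portion, partition(urls, workers)))
    return [row for portion in portions for row in portion]

# First pass over the seed pages.
first = crawl(["seed/a", "seed/b"])

# Reuse the results to dig deeper: every discovered link becomes a new task.
second = crawl([link for _, links in first for link in links])

# "Export" step: an in-memory dict here; a real job might write to HDFS or S3.
store = dict(first + second)
```

In the real system, Spark takes over the roles played here by `partition` and the thread pool, which is why no separate scheduler, database, or message queue is needed.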

Published under the ASF License; see LICENSE.
