Crawling Ajax-based Web Applications

Crawljax Gets a Face Lift. Web UI in 3.5

We just released Crawljax 3.5, and with it comes a web interface for Crawljax. It supports all the major functionality of Crawljax, such as adding and editing crawl configurations, installing plugins, viewing crawl logs in real time, and displaying plugin output. Kudos to the developers, @jeremyhewett and @ryansolid. To run it, unzip the web distribution archive and run the command java -jar crawljax-web-3.5-SNAPSHOT.jar. You can customize the port and output directory using the -p and -o arguments. Here are some screenshots of the web interface in action. More usage tutorials will follow.

Crawljax Web UI Home: you can start by adding a ‘New Configuration’ as shown

‘Edit Configuration’ page

‘Add Plugin’ page

Crawl status log with real-time updates

Output of the Crawl Overview plugin

Also in release 3.5:

  • Deprecated the malfunctioning DomChangeNotifierPlugin. Introduced StateVertexFactory. #347
  • Better PhantomJS support. Tests run on PhantomJS by default.
  • Switched from URLs to URIs for better performance. #322
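
The release note does not say why URIs are faster, so the following is an assumption rather than the project's stated reason. One commonly cited difference in Java is that java.net.URL's equals and hashCode may resolve host names, which can block on the network, while java.net.URI compares its parsed components in memory. A minimal sketch of de-duplicating addresses with URI, where the class and method names are hypothetical:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: not taken from the Crawljax code base.
class UriDedupSketch {

    // Collects the distinct addresses. URI.equals and URI.hashCode work on
    // the parsed components and do not perform any name resolution.
    static Set<URI> distinct(List<String> addresses) {
        Set<URI> seen = new LinkedHashSet<>();
        for (String address : addresses) {
            seen.add(URI.create(address));
        }
        return seen;
    }

    public static void main(String[] args) {
        Set<URI> seen = distinct(Arrays.asList(
                "http://example.com/a",
                "http://example.com/a",
                "http://example.com/b"));
        System.out.println(seen.size() + " distinct URIs: " + seen);
    }
}
```

Storing visited addresses in a hash-based set like this is where the cost of equals and hashCode tends to show up during a crawl.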

You can download the release from our GitHub project site.

Release 3.4 Is Out

In this release:

  • Crawljax no longer accidentally navigates to other URLs during a crawl. #339
  • Fixed StateVertexImpl.getUsedEventables(), which always returned an empty list. #350
  • Fixed some Findbugs errors (thanks to @keheliya)

All binaries are available at Maven Central and GitHub.

Release 3.3 Is Out

In this release, Crawljax offers support for PhantomJS (if installed).

Other fixes include:

  • External URLs are not opened by default #328
  • Updated Selenium
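
To illustrate what "external" can mean here, the sketch below treats a link as external when its host differs from the host of the site being crawled. This is a hypothetical helper written for this post's readers, not Crawljax's actual check, which may use different rules.

```java
import java.net.URI;

// Hypothetical helper for illustration; Crawljax's real check may differ.
class ExternalUrlSketch {

    // Resolves the link against the site address, then compares hosts.
    // Note: URI.resolve(String) throws IllegalArgumentException for
    // malformed links, which a real crawler would need to handle.
    static boolean isExternal(URI site, String href) {
        URI target = site.resolve(href);
        String siteHost = site.getHost();
        String targetHost = target.getHost();
        if (siteHost == null || targetHost == null) {
            return true; // no comparable host (e.g. mailto:), so do not follow
        }
        return !siteHost.equalsIgnoreCase(targetHost);
    }

    public static void main(String[] args) {
        URI site = URI.create("http://example.com/app/");
        System.out.println(isExternal(site, "details.html"));
        System.out.println(isExternal(site, "http://other.example.org/"));
    }
}
```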

All binaries are available at Maven Central and GitHub.

Release 3.2 Is Out

Release 3.2 is out. It’s a small update with one API change and several bug fixes:

Metrics #314

Implements statistics for Crawljax using Codahale’s Metrics library.

We have preloaded some metrics in core, and you can extend the functionality yourself. We have created an example that shows how you can print the preloaded metrics and add one of your own.

The default counters that are included right now are:

  • Crawler lost count
  • Unfired actions count
  • Invocations per plugin count.

The output of the example is the standard counters plus a histogram of the DOM-size:

[main] INFO  - type=COUNTER, name=com.crawljax.crawlevents.crawler_lost, count=0
[main] INFO  - type=COUNTER, name=com.crawljax.crawlevents.unfired_actions, count=0
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.DomChangeNotifierPlugin.invocations, count=19
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.OnBrowserCreatedPlugin.invocations, count=1
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.OnFireEventFailedPlugin.invocations, count=0
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.OnInvariantViolationPlugin.invocations, count=0
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.OnNewStatePlugin.invocations, count=17
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.OnRevisitStatePlugin.invocations, count=1
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.OnUrlLoadPlugin.invocations, count=15
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.PostCrawlingPlugin.invocations, count=1
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.PreCrawlingPlugin.invocations, count=0
[main] INFO  - type=COUNTER, name=com.crawljax.crawlplugins.PreStateCrawlingPlugin.invocations, count=15
[main] INFO  - type=HISTOGRAM, name=com.crawljax.examples.MetricPluginExample.domsize, count=17, min=2, max=5, mean=3.4705882352941178, stddev=0.7998161553463027, median=4.0, p75=4.0, p95=5.0, p98=5.0, p99=5.0, p999=5.0
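
The HISTOGRAM line reports summary fields such as count, min, max and mean over the recorded DOM sizes. As a rough illustration of what those four fields mean, here is a plain-Java stand-in that deliberately does not use the Metrics library's API. The class name and the sample values are hypothetical, and the library's stddev and percentile fields are not reproduced.

```java
import java.util.IntSummaryStatistics;
import java.util.stream.IntStream;

// Plain-Java stand-in; this does not use the Metrics library.
class DomSizeSummarySketch {

    // Summarizes recorded values into count, min, max and mean.
    static IntSummaryStatistics summarize(int... domSizes) {
        return IntStream.of(domSizes).summaryStatistics();
    }

    public static void main(String[] args) {
        // Hypothetical measurements, one per newly discovered state.
        IntSummaryStatistics stats = summarize(2, 3, 4, 5);
        System.out.printf("count=%d, min=%d, max=%d, mean=%s%n",
                stats.getCount(), stats.getMin(), stats.getMax(), stats.getAverage());
    }
}
```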

Furthermore:

  • Crawl configuration now has an option to set the output folder #316. This is useful for plugins that require an output folder.
  • Browser.getDom() is deprecated. You can now choose between getStrippedDom() and getUnstrippedDom(). #305
  • API Change: Proxy plugin has been removed. It didn’t work in 3.1 and 3.0 and is now replaced by the PreCrawlingPlugin. #286

Release 3.1 Is Out

Release 3.1 is out. It’s a small update with one API change and several bug fixes:

  • Added the ability to stop Crawljax, either from the runner or by calling stop from any plugin. #270
  • Fixed a bug in the positioning of elements in the crawl overview. #237
  • Fixed a HashCode/Equals bug in Conditions. #276
  • Fixed a bug where edges would disappear from the StateFlowGraph. #272
  • Updated to the new version of Selenium

Release 3.0 Is Out

Release 3.0 is out. This is a major release, which contains many key updates and renovations. The release contains several bug fixes and loads of enhancements. The code base has been split up into modules, the API has changed a little, and the crawl overview plugin has been completely renovated.

New overview plug-in

Most notable in the new release is the new overview plugin. The plugin shows an interactive state graph of the crawl and some statistics. Make sure you check out a demo or try one of our examples yourself.

"The new overview plugin"

New command line interface

This release brings more support for command line configuration. Once you have downloaded the zip, you can run Crawljax with the command:

java -jar crawljax-cli-version.jar outputfolder

Crawljax will crawl that site with the new crawl overview plugin enabled. You can run java -jar crawljax-cli-version.jar to see a list of possible configurations for the crawl.

The zip is downloadable from the central Maven repository.

Other important updates

  • Crawljax is now configured using a configuration builder. You start your configuration using CrawljaxConfiguration.builderFor("");.
  • The project has been split into three modules: core, cli and examples. The cli module contains the command line interface. The core module can be included in any project as a jar to run Crawljax programmatically. The examples module is the easiest way to try out several configurations of Crawljax in your favorite IDE. Check out our updated documentation for more details.
  • You can configure the crawler to crawl all href attributes it finds, even when the elements are not visible because they only appear when the crawler hovers over another element.
  • You can now configure the crawler so that it does not click any children of a certain element, using a short syntax like dontClickChildrenOf("LI").withId("dontClickMe");
  • Major performance and stability improvements.

You can view all closed issues or the full diff on GitHub.

Release 2.1 Is Out

Release 2.1 contains bug fixes and browser updates. You can find the solved issues here or look at the full diff here.

We’re already working on release 2.2. If you need anything fixed in Crawljax, make sure you file your issue!

We’ve Moved to GitHub!

Crawljax is alive and kicking again! We’ve started off by moving the Crawljax project to GitHub. This makes contributing to the project much easier using Git’s decentralized approach and GitHub’s pull requests.

All issues from Google Code have been imported into GitHub’s issue tracker, and the wiki has moved as well. With this new website we will keep everyone informed about Crawljax’s development.

Feel free to file an issue or fork our repo and generate a pull request!

The Crawljax Team