Getting Started
Requirements
- Java (1.6)
- IE or Firefox (or any other browser supported by WebDriver)
- Crawljax jar (and all of its dependencies)
Running Crawljax through a properties file
Download the zip file and unpack it into a directory named “crawljax”. From that directory run:
java -jar crawljax-versionnr.jar
where versionnr is the version of Crawljax you downloaded.
In this case, Crawljax reads its basic settings, such as the URL and the click tags, from the crawljax.properties file.
Using Crawljax Programmatically
The preferred way of configuring Crawljax is using a simple “runner class” written in Java. This class consists of a main method that configures Crawljax using its configuration API. The class can then start Crawljax and catch any exceptions it throws.
The first step is to create a CrawlSpecification in which you configure what and how Crawljax should crawl. In this example we will crawl www.google.com.
CrawlSpecification crawler = new CrawlSpecification("http://www.google.com");
Clicking Elements
The general method to define which elements Crawljax should click and which it should not click is:
- Select a (large) set of elements to click to get good coverage. For example: often all anchor tags should be clicked.
- Exclude elements from the set which should NOT be clicked.
We want Crawljax to do a full exploration of the website. Crawljax should therefore click on all the anchor elements.
// click all anchor elements
crawler.click("a");
When you inspect the Google page you also see buttons (Google Search and I’m Feeling Lucky), which also should be clicked:
// click the submit buttons
crawler.click("input").withAttribute("type", "submit");
If we ran Crawljax with this specification, it would click all the anchor elements and submit buttons. There are often some links we don’t want to click, so we exclude those elements.
For example, we are not interested in Language Tools, so we specify that Crawljax should not click this link:
// don't click elements with the text Language Tools
crawler.dontClick("a").withText("Language Tools");
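The way click and dontClick rules combine can be pictured with a small stand-alone model. This is only a sketch of the include-then-exclude idea: ClickRuleSketch and its Rule class are hypothetical names made up for illustration, they are not part of Crawljax, and Crawljax's real matching may well work differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not Crawljax classes.
final class ClickRuleSketch {

    // A rule matches a tag name and, optionally, an exact element text.
    static final class Rule {
        final String tag;
        final String text; // null means "any text"

        Rule(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        boolean matches(String elementTag, String elementText) {
            return tag.equalsIgnoreCase(elementTag)
                    && (text == null || text.equals(elementText));
        }
    }

    private final List<Rule> include = new ArrayList<Rule>();
    private final List<Rule> exclude = new ArrayList<Rule>();

    // Step 1: add a (large) set of elements to click.
    void click(String tag) {
        include.add(new Rule(tag, null));
    }

    // Step 2: exclude specific elements from that set.
    void dontClick(String tag, String text) {
        exclude.add(new Rule(tag, text));
    }

    // An element is clicked when it is included and not excluded.
    boolean shouldClick(String elementTag, String elementText) {
        return matchesAny(include, elementTag, elementText)
                && !matchesAny(exclude, elementTag, elementText);
    }

    private static boolean matchesAny(List<Rule> rules, String tag, String text) {
        for (Rule rule : rules) {
            if (rule.matches(tag, text)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ClickRuleSketch rules = new ClickRuleSketch();
        rules.click("a");
        rules.dontClick("a", "Language Tools");
        System.out.println(rules.shouldClick("a", "Images"));
        System.out.println(rules.shouldClick("a", "Language Tools"));
    }
}
```

In this model an exclusion always wins over an inclusion, which is why it is safe to start with a broad rule such as "all anchors" and narrow it down afterwards.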
Often you want Crawljax not to click links in a certain area. In this example we do not want to click on any links in the top bar (e.g. Sign out and iGoogle).
With the underXPath() function you can select the elements under a certain element. In this example we do not wish to click the links that are under div elements with id=guser:
// don't click elements in the top bar
crawler.dontClick("a").underXPath("//DIV[@id='guser']");
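To see which elements an expression such as //DIV[@id='guser'] covers, it can help to try it against a small document with the XPath support in the Java standard library. The sketch below is independent of Crawljax: UnderXPathSketch, countAnchorsUnder, and the sample markup are hypothetical, and the sample uses upper-case tag names only so that it matches the expression above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this does not use or reproduce Crawljax code.
final class UnderXPathSketch {

    // Counts the A elements located anywhere below the nodes selected by basePath.
    static int countAnchorsUnder(String xml, String basePath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList anchors = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(basePath + "//A", doc, XPathConstants.NODESET);
            return anchors.getLength();
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped to keep the sketch short.
            throw new IllegalArgumentException("Could not evaluate " + basePath, e);
        }
    }

    public static void main(String[] args) {
        String page = "<HTML><BODY>"
                + "<DIV id='guser'><A>iGoogle</A><A>Sign out</A></DIV>"
                + "<DIV id='main'><A>Images</A></DIV>"
                + "</BODY></HTML>";
        // Only the two links inside the guser div are selected.
        System.out.println(countAnchorsUnder(page, "//DIV[@id='guser']"));
    }
}
```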
We have now specified that Crawljax should click all the anchor tags and submit buttons, except the Language Tools link and the anchor elements in the top bar. Note: External links to other websites are automatically ignored by Crawljax.
Specifying Input Data
By default Crawljax enters random values into input fields. Some fields may need specific values to enable the user to proceed (e.g. a valid phone number or URL).
To specify manual values you can use the InputSpecification class:
InputSpecification input = new InputSpecification();
To specify that Crawljax should enter manual values in form fields you need to know the id or name of the input elements. Whenever Crawljax encounters a form field it checks if there is manual input specified for this field by checking its id and name (in this order).
On www.google.com the input field for the search value has the name “q”, and we want to give it a specific value:
// when Crawljax encounters a form element with the id or name "q", enter "Crawljax"
input.field("q").setValue("Crawljax");
Once you have specified all the form input values, add the InputSpecification to the CrawlSpecification:
crawler.setInputSpecification(input);
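The id-then-name lookup described in this section can be pictured with a small stand-alone model. This is a sketch under assumptions, not Crawljax's implementation: ManualInputSketch, setValue, and lookup are hypothetical names, and returning null stands in for "fall back to a random value".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not a Crawljax class.
final class ManualInputSketch {

    private final Map<String, String> values = new HashMap<String, String>();

    // Registers a manual value under a key that may match a field's id or name.
    void setValue(String idOrName, String value) {
        values.put(idOrName, value);
    }

    // Checks the id first and the name second.
    // Returns null when no manual value exists for the field.
    String lookup(String id, String name) {
        if (id != null && values.containsKey(id)) {
            return values.get(id);
        }
        if (name != null && values.containsKey(name)) {
            return values.get(name);
        }
        return null;
    }

    public static void main(String[] args) {
        ManualInputSketch input = new ManualInputSketch();
        input.setValue("q", "Crawljax");
        System.out.println(input.lookup(null, "q"));    // field matched by name
        System.out.println(input.lookup("lang", "hl")); // no manual value defined
    }
}
```

Because the id is checked first, a value registered under a field's id takes precedence over one registered under its name in this model.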
Limiting the Crawling scope
When you are testing your crawl specification, or want to limit the crawling scope or time, you can cap the number of states and the crawl depth:
// limit the crawling scope
crawler.setMaximumStates(5);
crawler.setDepth(2);
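How a state limit and a depth limit bound an exploration can be shown on a toy state graph. The sketch below is an assumption-laden model and not Crawljax's crawler: BoundedCrawlSketch, crawl, and sampleGraph are hypothetical, the traversal is breadth-first purely for simplicity, and counting the start state toward the maximum is a choice made for this example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; not Crawljax's crawling strategy.
final class BoundedCrawlSketch {

    // Explores the graph from start, discovering at most maxStates states
    // and never following clicks from a state that is already at maxDepth.
    static Set<String> crawl(Map<String, List<String>> graph, String start,
                             int maxStates, int maxDepth) {
        Set<String> visited = new LinkedHashSet<String>();
        Map<String, Integer> depth = new HashMap<String, Integer>();
        Queue<String> queue = new ArrayDeque<String>();
        visited.add(start);
        depth.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String state = queue.remove();
            int d = depth.get(state);
            List<String> targets = graph.get(state);
            if (d >= maxDepth || targets == null) {
                continue; // depth limit reached or nothing to click
            }
            for (String target : targets) {
                if (visited.size() >= maxStates) {
                    return visited; // state limit reached
                }
                if (visited.add(target)) {
                    depth.put(target, d + 1);
                    queue.add(target);
                }
            }
        }
        return visited;
    }

    // A small made-up site: index links to a and b, and so on.
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> graph = new HashMap<String, List<String>>();
        graph.put("index", Arrays.asList("a", "b"));
        graph.put("a", Arrays.asList("c"));
        graph.put("b", Arrays.asList("d"));
        graph.put("c", Arrays.asList("e"));
        return graph;
    }

    public static void main(String[] args) {
        // Same limits as in the text: at most 5 states, depth 2.
        System.out.println(crawl(sampleGraph(), "index", 5, 2));
    }
}
```

With these limits the state "e" is never reached in the sample graph, because it sits three clicks away from the start state.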
Running Crawljax
Crawljax needs a CrawljaxConfiguration object, which holds its general settings:
CrawljaxConfiguration config = new CrawljaxConfiguration();
Add the CrawlSpecification you created:
config.setCrawlSpecification(crawler);
By default Crawljax crawls in Firefox using the FirefoxDriver from WebDriver. To run Crawljax in a different browser, set it explicitly:
config.setBrowser(new WebDriverIE());
Set the output path for generated files:
config.setOutputFolder("/tmp/");
Now we are ready to run Crawljax:
CrawljaxController crawljax = new CrawljaxController(config);
try {
    crawljax.run();
} catch (ConfigurationException e) {
    e.printStackTrace();
    System.exit(1);
} catch (CrawljaxException e) {
    e.printStackTrace();
    System.exit(1);
}