An external-content connector that retrieves content by scraping a public website.
This module allows you to extract content from another website by crawling it and parsing
its DOM structure. It transforms that content directly into native SilverStripe objects, then
imports those objects into SilverStripe's database as though they had been created
via the CMS.
Although this approach cannot extract any information or structure that isn't
represented in the site's markup, it requires no special access to, or reliance on,
particular back-end systems. This makes the module suited to legacy and experimental
site imports, as well as connections to websites generated by obscure CMSs.
Importing a site is a two- or three-step process, depending on user selection.
A list of URLs is fetched and extracted from the site via PHPCrawl,
and cached in a text file under the assets directory.
Each cached URL corresponds to a page or asset (CSS, image, PDF etc.) that the module
will attempt to import into native SilverStripe objects, e.g. SiteTree and File.
Page content is imported page-by-page using cURL, and the desired DOM elements
are extracted with phpQuery using configurable CSS selectors.
See the included migration documentation for detailed
instructions on migrating a legacy site into SilverStripe using the module.
This module requires the PHP semaphore functions to work. These are installed by
default on Debian and some OS X PHP distributions, but if you're using MacPorts
you'll need to add the +ipc flag when installing php5.
If compiling PHP from source you need to pass three additional flags to PHP's
configure script:
./configure <usual flags> '--enable-sysvsem' '--enable-sysvshm' '--enable-sysvmsg'
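Before going further, it may be worth confirming that the semaphore functions are actually present. The following is a minimal sketch, not part of the module: the helper name `check_php_semaphores` is hypothetical, and it assumes the `php` CLI binary on your PATH is the same build your web server uses (which is not always the case). It relies only on PHP's `function_exists()` and the `sem_get()` function provided by the sysvsem extension.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the module): reports whether the
# System V semaphore functions are available to the PHP CLI binary.
check_php_semaphores() {
    # Bail out early if no php binary can be found.
    if ! command -v php >/dev/null 2>&1; then
        echo "php not found on PATH"
        return 2
    fi
    # sem_get() only exists when PHP was built with --enable-sysvsem.
    if php -r 'exit(function_exists("sem_get") ? 0 : 1);' >/dev/null 2>&1; then
        echo "semaphore functions available"
        return 0
    fi
    echo "semaphore functions missing"
    return 1
}
```

Calling `check_php_semaphores` after sourcing the snippet prints a one-line result and returns 0 when the functions are available. Running `php -m` and looking for `sysvsem` in the output is a quicker manual alternative.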
Once that's done, you can use Composer to add the module
to your SilverStripe project:
#> composer require phptek/staticsiteconnector
Please see the included Migration document, which describes
exactly how to configure the tool to perform a site scrape or migration.
There is also an example database-dump (MySQL/MariaDB only)
provided which you can import into your DB to get you up and running quickly.
This code is available under the BSD license, with the exception of the bundled
PHPCrawl library, which is licensed under GPL version 2.