Xtreamer Support System



How to create a scraper for jukebox?



This tutorial is based on the Fdb scraper. Any of the other scrapers in the Jukebox can also be used for reference, since they apply different methods.

Note also that the xVOD project uses some scraper methods as well. Once you download the package, you can learn how it works from the relevant scraper PHP file in that project.


How to make a scraper - Tutorial





1. Before we start

First we need to check whether the site we have chosen has an API.

If the answer is yes, the base file is xJukebox/lib/Xjb/Scraper/Online/Tmdb.php; if no, choose xJukebox/lib/Xjb/Scraper/Online/Fdb.php.


2. Preparation

Depending on which base scraper we have chosen, we copy it within the same folder and rename the copy to the name of the portal we want to use.

Naming convention: in this example our portal is fdb.pl, so the scraper file is named Fdb.php. The class inside must be named Xjb_Scraper_Online_XXX, where XXX is the name of the PHP file.

Basic methods: there are 2 basic methods we need to implement. The first is searchMovie, with one parameter, $movieName. The second is refreshMovieVO, with 2 parameters: a reference to $movieVO and the movie ID.

If you chose a portal with an API, go to point 3; otherwise go to point 4.
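The naming convention and the two basic methods described above can be sketched as a bare class skeleton. This is only an illustration: the name "Example", the returned ArrayObject, and the boolean return value of refreshMovieVO are assumptions inferred from the snippets later in this tutorial, and any base class or interface the real scrapers use is left out. Copy the real structure from Tmdb.php or Fdb.php.

```php
<?php
// Hypothetical skeleton. "Example" stands in for the XXX part of the class
// name, so under the naming convention this file would be Example.php.
class Xjb_Scraper_Online_Example
{
    /**
     * Search the portal and return candidate movies.
     * Assumption: the result maps displayed name => unique id.
     */
    public function searchMovie($movieName)
    {
        $return = new ArrayObject();
        // Real code would query the portal here; this stub returns one fixed entry.
        $return->offsetSet($movieName, 'example-id-1');
        return $return;
    }

    /**
     * Fill $movieVO with details for $movieId.
     * Assumption: returns true when at least one field was changed.
     */
    public function refreshMovieVO(&$movieVO, $movieId)
    {
        $return = false;
        // Real code would fetch the API record or movie page for $movieId
        // and copy its fields into $movieVO, setting $return = true on change.
        return $return;
    }
}
```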



3. Making scraper with API

First of all, you should register with the portal and get a unique API key. Now we deal with searchMovie.

Go to line 51:


$xml = @simplexml_load_file ( $search_url. $lang . "/xml/" . $apiKey . "/" . urlencode ( $movieName));

This PHP function loads the XML file from the API. The URL and parameter syntax depend on each API.

Then we start scraping :)

At line 54:

foreach ( $xml->xpath ("/OpenSearchDescription/movies/movie" ) as $item ) {

we have a loop over the movie nodes. In each iteration we need to add one entry to the result:

$return->offsetSet ( $name, $id );

where $name is the displayed movie name and $id is its unique ID.

That's all. If you now check the online scanner in the web interface, you will see your first result list.
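As a rough, self-contained illustration of the search loop, the sketch below parses an inline XML string instead of calling a real API. The function name parseSearchXml and the name and id child elements are assumptions made for this example; the actual element names depend on the API you use.

```php
<?php
// Parse an API search response into name => id pairs.
// The <name> and <id> child elements are assumed for illustration only.
function parseSearchXml($xmlString)
{
    $return = new ArrayObject();
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        return $return; // unreadable response: empty result list
    }
    foreach ($xml->xpath('/OpenSearchDescription/movies/movie') as $item) {
        $name = (string) $item->name;
        $id = (string) $item->id;
        if ($name !== '' && $id !== '') {
            $return->offsetSet($name, $id);
        }
    }
    return $return;
}

$sample = <<<XML
<OpenSearchDescription>
  <movies>
    <movie><name>First Movie</name><id>101</id></movie>
    <movie><name>Second Movie</name><id>102</id></movie>
  </movies>
</OpenSearchDescription>
XML;

$results = parseSearchXml($sample);
foreach ($results as $name => $id) {
    echo $name . ' => ' . $id . PHP_EOL;
}
```

Checking the return value of the loader before looping keeps a failed download from turning into an error inside the foreach.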

Time to scrape the movie's data with the refreshMovieVO method.

Like the first method, this one also has to start at the movie root of the XML:

$xml->xpath ( "/OpenSearchDescription/movies/movie") as $item

In Tmdb.php this is a loop, but the loop is not necessary because we scrape only one movie per call.

At line 119:

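$data = ( string ) $item->alternative_name;

if ($data != '') {

    // duplicate check; in_array() is used because array_search() returns 0 (falsy) for a match in the first slot

    if (! in_array ( $data, $movieVO->titleAlternateArr )) {

        $movieVO->titleAlternateArr[] = $data;

        $return = true;

    }

}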


This is an example of how to add alternative titles to our database.

Knowledge of XPath is strongly recommended, because it lets us select and vary many things.

Now we can go through the code and adjust the remaining fields, such as category and country.

A full list of attributes is in:

/xJukebox/lib/Coretis/VO/Movie.php
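The alternative-title mapping can be tried outside the Jukebox with a stand-in value object. In this sketch, ExampleMovieVO and addAlternativeTitle are hypothetical names; only the titleAlternateArr field and the alternative_name element come from the tutorial. The real value object has many more attributes.

```php
<?php
// Stand-in for the real movie value object; only the one field used here.
class ExampleMovieVO
{
    public $titleAlternateArr = array();
}

// Copy the alternative title from one <movie> node into the value object.
// Returns true when the object was changed.
function addAlternativeTitle(SimpleXMLElement $item, $movieVO)
{
    $data = (string) $item->alternative_name;
    // in_array() gives a plain true/false answer. array_search() returns the
    // matching index, and index 0 would be read as "not found" by a "!" test.
    if ($data !== '' && !in_array($data, $movieVO->titleAlternateArr, true)) {
        $movieVO->titleAlternateArr[] = $data;
        return true;
    }
    return false;
}

$item = simplexml_load_string(
    '<movie><alternative_name>Other Title</alternative_name></movie>'
);
$vo = new ExampleMovieVO();
var_dump(addAlternativeTitle($item, $vo)); // first call stores the title
var_dump(addAlternativeTitle($item, $vo)); // second call finds it already stored
```

The same pattern (read a value, skip it if empty or already present, otherwise store it and report a change) applies to every list-like field.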



4. Making scraper without API

To make a scraper without an API, I use Firefox with the add-ons 'Firebug' and 'XPath Checker'.

Now it is time for the search function.

First we need the URL with the search query. In this example: $search_url = "http://fdb.pl/szukaj?query=";

Next we need to get the content of the search results: $fil = implode ('', file ($search_url . urlencode( $movieName )));

and load it as HTML into a DOMDocument: $html = new DOMDocument( "1.0", "UTF-8" ); $html->loadHTML('<?xml encoding="UTF-8">'.$fil);

Now we are ready to scrape the search results. We create the XPath object: $xpath = new DOMXPath( $html );

To find the nodes of interest, right-click an element or group of elements in the browser and select 'View XPath'.

A subwindow will appear in which we can navigate between nodes. Once we find the correct element, we can copy its XPath and use it as below.

Get our nodes with links: $nodes = $xpath->query ( ".//*[@id='search']/ol/li/div[2]/p[1]/a");

For each of them ($item), we take the href attribute as the movie ID and the node value as the movie name:

$data = ( string ) $item->getAttribute('href');

if ($data != '')

$id = $data;

$data = ( string ) $item->nodeValue;

if ($data != '')

$name = $data;


and return them:

$return->offsetSet ( $name, $id );
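Put together, the search steps above might look like the following sketch. It runs against an inline HTML sample rather than a live page, and the function name parseSearchHtml, the sample markup, and the shortened XPath expression are assumptions for illustration. On a real portal you would pass in the downloaded page and the XPath copied from the browser.

```php
<?php
// Extract name => href pairs from a search results page.
// The XPath expression is a parameter because it differs for every portal.
function parseSearchHtml($htmlString, $linkXPath)
{
    $return = new ArrayObject();
    $html = new DOMDocument('1.0', 'UTF-8');
    // Real pages are rarely valid HTML; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $html->loadHTML('<?xml encoding="UTF-8">' . $htmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($html);
    $nodes = $xpath->query($linkXPath);
    if ($nodes === false) {
        return $return; // invalid XPath expression
    }
    foreach ($nodes as $item) {
        $id = (string) $item->getAttribute('href');
        $name = trim((string) $item->nodeValue);
        if ($id !== '' && $name !== '') {
            $return->offsetSet($name, $id);
        }
    }
    return $return;
}

$sample = '<html><body><div id="search"><ol>'
    . '<li><a href="/film/1">First Movie</a></li>'
    . '<li><a href="/film/2">Second Movie</a></li>'
    . '</ol></div></body></html>';

$results = parseSearchHtml($sample, ".//*[@id='search']/ol/li/a");
foreach ($results as $name => $id) {
    echo $name . ' => ' . $id . PHP_EOL;
}
```

Keeping the parsing separate from the download makes it easy to test the XPath against a saved copy of the page when the portal changes its layout.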

Now it is time for the function that scrapes the actual movie data.

As in the search function, we need to get the HTML page we want to scrape:

$fil = implode ('', file ($movieID));

$xml = new DOMDocument( "1.0", "UTF-8" );

$xml->loadHTML('<?xml encoding="UTF-8">'.$fil);

$xml->normalizeDocument();

$xpath1 = new DOMXPath( $xml );


As before, we need to map each data field to a certain node:

$data = ( string ) $xpath1->query ( ".//*[@class='title']/h2" )->item(0)->nodeValue;

// for safety: skip empty values

if ($data != '') {

    // in_array() is used because array_search() returns 0 (falsy) for a match in the first slot

    if (! in_array ( $data, $movieVO->titleAlternateArr )) {

        // map our data from XPath to the alternate title

        $movieVO->titleAlternateArr [] = $data;

        $return = true;

    }

}

We must repeat the above procedure for every field we want.

That's it.
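Because the same steps repeat for every field, it can help to wrap them in two small helpers. The names firstNodeText and addUnique are hypothetical and not part of the Jukebox code; the sketch also guards against an XPath that matches nothing, in which case item(0) returns null.

```php
<?php
// Return the trimmed text of the first node matching $expression, or '' if none.
function firstNodeText(DOMXPath $xpath, $expression)
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim((string) $nodes->item(0)->nodeValue);
}

// Append $value to $list unless it is empty or already present.
// Returns true when the list was changed.
function addUnique(array &$list, $value)
{
    if ($value === '' || in_array($value, $list, true)) {
        return false;
    }
    $list[] = $value;
    return true;
}

$html = new DOMDocument('1.0', 'UTF-8');
$html->loadHTML('<?xml encoding="UTF-8">'
    . '<html><body><div class="title"><h2>Sample Title</h2></div></body></html>');
$xpath = new DOMXPath($html);

$titles = array();
// A matching expression adds a title; a non-matching one is skipped quietly.
$changed = addUnique($titles, firstNodeText($xpath, ".//*[@class='title']/h2"));
$changed = addUnique($titles, firstNodeText($xpath, ".//*[@class='missing']/h2")) || $changed;
var_dump($changed, $titles);
```

Inside refreshMovieVO, a call such as addUnique($movieVO->titleAlternateArr, ...) would then replace each hand-written block, assuming the field is a public array as in the snippets above.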


5. Finishing our scraper

We are almost ready. This is what we must do before we finish:

1. In your language file (e.g. pl.inc.php) we have to add the key:

webinterface_scrapername_Xjb_Scraper_Online_XXX

where XXX is our scraper name, and then translate it.


2. In conf/xjbDefaults.xml we must add:

<Scraper lang="pl" apiKey="no" rescanDelay="1209600" name="Xjb_Scraper_Online_XXX" />

where XXX is our scraper name; change lang to the correct language. If you use an API, change apiKey to your key.

That's all.


6. Some hints and objects descriptions

* Coretis_VO_Person - object which represents a person in movie. It has a few roles:

static $JOB_DIRECTOR = 'director';

static $JOB_PRODUCER = 'producer';

static $JOB_EXECUTIVE_PRODUCER = 'executive producer';

static $JOB_SCREENPLAY = 'screenplay';

static $JOB_ACTOR = 'actor';

static $JOB_AUTHOR = 'author';

static $JOB_ORIGINAL_MUSIC_COMPOSER = 'original music composer';

static $JOB_DIRECTOR_OF_PHOTOGRAPHY = 'director of photography';

static $JOB_EDITOR = 'editor';

static $JOB_CASTING = 'casting';

static $JOB_OTHER = 'other';

* Coretis_VO_Picture - represents an image (e.g. poster or fanart):

static $TYPE_POSTER = 'poster'; # The DVD / Blu-ray cover art.

static $TYPE_FANART = 'fanart'; # The background picture. Called backdrop on some scrapers

static $TYPE_SCREEN = 'screen'; # The background picture. Called backdrop on some scrapers


* REMEMBER to check in lib/xjb/mapper/Genre.php whether there is a genre mapping for your language; otherwise your correctly grabbed genres won't show on the edit page.

http://en.wikipedia....mming_interface



Thanks to duch


Last Updated
19th of December, 2010
