
Parser


 

  Any parser template can contain a loop. Another feature you may want to use is a regular expression validator for the text of a web page.
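  As a rough illustration of what such a validator does, here is a minimal sketch in Python. It is not the program's own template syntax; the function name `page_text_is_valid` and the sample pattern `PRICE_PATTERN` are assumptions made for the example.

```python
import re


def page_text_is_valid(page_text: str, pattern: str) -> bool:
    """Return True if the page text contains at least one match for the pattern."""
    return re.search(pattern, page_text) is not None


# Hypothetical check: only parse pages whose text actually shows a price.
PRICE_PATTERN = r"Price:\s*\d+(\.\d{2})?"
```

  A check like this can run before any extraction step, so that error pages or empty results are skipped instead of being parsed.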

 

First nesting level (search engine parsing)

  This level applies when the information you are looking for can be reached simply by moving from one page to the next. This is how search engines are usually parsed: you open a page for a preset search key, collect all the links and jump to the next page.

  When we say 'jump to the next page', we always mean the next page of results for the same key. The process continues until there are no more search results or the specified number of links has been reached (e.g. you need to collect only the first N links).

  The loop for this case is therefore organized with an exit condition, tied either to a certain number of processed pages or to the fact that there are no more pages available (e.g. the 'Next' link disappears).
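  The loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's template format: the two regular expressions, the `fetch` callable and the function name `collect_links` are all hypothetical, and a real page needs its own patterns.

```python
import re
from typing import Callable, List, Optional

# Hypothetical patterns; adjust them to the markup of the real results page.
RESULT_LINK = re.compile(r'<a class="result" href="([^"]+)"')
NEXT_LINK = re.compile(r'<a class="next" href="([^"]+)"')


def collect_links(fetch: Callable[[str], str], start_url: str,
                  max_links: int, max_pages: int = 50) -> List[str]:
    """Collect result links page by page for one search key.

    The loop exits when there is no 'Next' link, when max_pages pages
    have been processed, or when max_links links have been gathered.
    """
    links: List[str] = []
    url: Optional[str] = start_url
    pages = 0
    while url and pages < max_pages and len(links) < max_links:
        html = fetch(url)          # fetch is supplied by the caller
        pages += 1
        links.extend(RESULT_LINK.findall(html))
        nxt = NEXT_LINK.search(html)
        url = nxt.group(1) if nxt else None  # no 'Next' link: stop
    return links[:max_links]
```

  Passing `fetch` in as a parameter keeps the sketch independent of any particular way of downloading pages.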

 

 

 

 

 

 

Multi-layer nesting (parsing websites with multi-layer pricing lists)

  This level of nesting applies when a web page contains links to pages with further links which, in turn, lead to the pages that contain the required content. Nesting types can differ. The scheme just described is characteristic of on-line shops with catalogues, of informational sites on films, games, etc., and of any other site with catalogued information.

  Do not try to approach this task in a rush. In other words, do not try to create a heap of nested loops - otherwise you may end up with half-parsed data.

  Here's another solution - you divide your parsing into nesting levels and create a separate template for each level. The first template, first_parse.xml, goes to the specified web resource and collects the links to the second level into the file level_2.txt. Then the second template, parse_level_2.xml, is activated for the second level. It takes a link from the file level_2.txt, follows it and parses the links to the third level, saving them into the file level_3.txt. Please note that these links are not parsed within a loop: one link is processed per template activation, and you can determine the number of activations as soon as the first template has been executed and it is clear how many links it has gathered.

  Of course, you can opt for a loop, but this is more difficult: you need a clear understanding of the various errors you will have to handle so that parsing does not stop in the middle of the process, for example when an error occurs before half of the file with links from the previous template has been processed.

  This continues until a file is formed comprising the links to the last level, where the needed content can be found. Finally, the last template, last_parse.xml, follows all the links gathered in the last file, level_N.txt (where N is the last nesting level), and parses the content on these pages, saving it into the right file or files.

  So we have to deal with three types of templates:

1. The first type of template parses links from the first page with links (first_parse.xml).

2. The second type of template parses links to pages with other links (parse_level_2.xml, .... parse_level_N.xml).

3. The third type of template parses content on the final pages at the very bottom of the nesting (last_parse.xml).

 

  You will have to activate these templates one by one, starting with first_parse.xml, then parse_level_2.xml, .... parse_level_N.xml, and finally last_parse.xml.
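  The hand-off between levels through files can be sketched in Python as shown below. This is a minimal illustration and not the program's template format: the function `run_level` and the `process` callable (which stands in for whatever one template activation does with one link) are hypothetical names. The sketch uses the loop variant and therefore catches errors per link, so that one failing page does not stop the whole level.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_level(in_file: Path, out_file: Path,
              process: Callable[[str], Iterable[str]]) -> int:
    """Process every link in in_file and append the results to out_file.

    in_file holds one link per line (e.g. level_2.txt); out_file receives
    one result per line (e.g. level_3.txt). Returns the number of links
    that were processed without an error.
    """
    done = 0
    with out_file.open("a", encoding="utf-8") as out:
        for line in in_file.read_text(encoding="utf-8").splitlines():
            link = line.strip()
            if not link:
                continue
            try:
                results = list(process(link))
            except Exception as exc:
                # Report and skip the link instead of aborting the level.
                print(f"skipped {link}: {exc}")
                continue
            for item in results:
                out.write(item + "\n")
            done += 1
    return done
```

  Running the same function once per level, each time with the previous level's output file as input, mirrors the order of template activations given above.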

 

   

 

 

Modifications of multi-layer nesting

1. Sometimes the required content can be found on all levels and not necessarily on the last one. In this case your second type templates should also allow for parsing and saving the content. Please remember that you can parse one and the same page as many times as you like, but first extract the text and then parse it with regular expressions with the help of the corresponding macros, rather than fetching a new text each time through the Get--WebBrowser branch, as this overloads your computer's processor.
2. Sometimes information is found under different categories. In this case you will have to create not just level-oriented files but, perhaps, category folders as well. For instance, the first type template parses the names of the categories (e.g. action films, comedies, drama), and the links from every category are stored in the file level_2.txt inside the folder bearing that category's name. There will be several files of this type, depending on the number of categories. In this case you will probably have to create templates for each category on each level. Alternatively, you may take a category name from a file (deleting the line afterwards) and insert it into the path to the file with links (e.g. level_2.txt). By doing so you will process every category. This category file needs to be restored for each level, because once the processing is over all the lines with categories will have been deleted.
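  The category handling in item 2 can be sketched in Python as follows. This is an illustration only, not the program's template format: the file name categories_queue.txt and the functions `pop_category`, `links_file_for` and `category_link_files` are assumptions made for the example. The working list of categories is copied from a master file before each level, because taking a name removes its line.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def pop_category(queue_file: Path) -> Optional[str]:
    """Take the first category name from the file and delete its line."""
    lines = [ln.strip() for ln in queue_file.read_text(encoding="utf-8").splitlines()]
    lines = [ln for ln in lines if ln]
    if not lines:
        return None
    queue_file.write_text("".join(ln + "\n" for ln in lines[1:]), encoding="utf-8")
    return lines[0]


def links_file_for(root: Path, category: str, level: int) -> Path:
    """Build the path <root>/<category>/level_<level>.txt."""
    return root / category / f"level_{level}.txt"


def category_link_files(root: Path, master_file: Path, level: int) -> List[Path]:
    """Return the links file of every category for one level."""
    queue = root / "categories_queue.txt"   # hypothetical working copy
    shutil.copyfile(master_file, queue)     # restore the list for this level
    paths: List[Path] = []
    while (category := pop_category(queue)) is not None:
        paths.append(links_file_for(root, category, level))
    return paths
```

  Keeping an untouched master list and consuming only a copy is one way to satisfy the requirement that the category file be restored for each level.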