Loop Item is one of the most frequently used actions in Octoparse, and it comes in handy when dealing with a pagination button or a "Load more" button. However, many users have questions about how a loop ends: does it end by itself when the website reaches the last page, or is there a way to end it manually?
There are two basic ways to end a loop in Octoparse. One is simply to use the "End loop" option under Advanced Options. The other requires modifying the XPath of the loop.
1. End Loop Option
"End Loop" is an advanced option that lets users specify how many times (once, twice, ...) a loop is executed. It is ideal when you know exactly how many times you want to repeat the loop. A good example is a user who wants to paginate only 5 times even though the site has more than 5 pages. To use this option, click on the loop item and, under Advanced Options, open "End loop when", tick "Execution time reach", pick a number, then click "Save". The loop will end after it has been repeated the designated number of times.
2. Modify XPath
Setting "End loop" is quick and easy if you already know how many times to execute the loop (for example, how many times you need to click the next page button). If you have no idea when the loop should end, you will need to modify the XPath of the loop manually.
The logic here is to use an element on the page as an indicator of whether there is more to loop through. The loop should end by itself when that element can no longer be located on the current page. For example, a pagination loop should end when the XPath of the 'Next' button no longer matches anything on the current page. So the trick is to find something (most likely the "Next" icon) that is present on every page you need to visit but cannot be located on the last one, and then write an XPath for it.
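The indicator idea can be sketched in plain Python using the standard library's xml.etree.ElementTree, which supports a subset of XPath. This is only an illustration of the logic, not how Octoparse works internally: the three-page markup, the class name 'next', and the function run_loop are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-page site; the markup is invented for illustration.
PAGES = [
    "<div><p>page 1</p><a class='next' href='#'>Next</a></div>",
    "<div><p>page 2</p><a class='next' href='#'>Next</a></div>",
    "<div><p>page 3</p></div>",  # last page: the indicator element is gone
]


def run_loop(pages, indicator_xpath):
    """Visit pages in order until the indicator can no longer be located."""
    visited = []
    index = 0
    while index < len(pages):
        root = ET.fromstring(pages[index])
        visited.append(index + 1)  # stand-in for extracting data
        if root.find(indicator_xpath) is None:
            break                  # indicator not found: the loop ends itself
        index += 1                 # stand-in for clicking the 'Next' button
    return visited


visited = run_loop(PAGES, ".//a[@class='next']")
print(visited)
```

Note that ElementTree expects a relative path such as ".//a", whereas a browser tool would typically use "//a".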
The following example walks through this approach (Example URL).
On the example page, the Next Page button is still visible on the last page, and the XPath auto-generated by Firebug locates it on both the first page and the last page.
If we used this auto-generated XPath, the loop would never end: Octoparse would extract data from the last page repeatedly, leading to endless scraping and duplicates.
But if we look at the HTML of the buttons on the two pages, the difference is easy to spot: the "class" attribute of the "a" tag is different.
On the first page the class is "gspr next":
On the last page the class is "gspr next-d":
So we can use this difference and change the XPath to //a[@class='gspr next'], which locates every 'Next Page' button except the one on the very last page (learn more about XPath here).
Now, let's do a quick check on both pages with the modified XPath.
On the first page, the 'Next Page' button is correctly located.
On the last page, no 'Next Page' button is found, which is exactly what we want.
With this XPath, the loop would end when Octoparse comes to the last page.
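The same check can be reproduced outside Octoparse with a small Python sketch. The class values 'gspr next' and 'gspr next-d' come from the example above, while the surrounding markup and the helper name has_next are assumptions made for illustration, since the real page's HTML is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed markup; only the class values come from the example.
FIRST_PAGE = '<div><a class="gspr next" href="#">Next</a></div>'
LAST_PAGE = '<div><a class="gspr next-d" href="#">Next</a></div>'

# ElementTree needs the relative form ".//a" instead of "//a".
XPATH = ".//a[@class='gspr next']"


def has_next(markup):
    """Return True if the modified XPath locates a 'Next Page' button."""
    return ET.fromstring(markup).find(XPATH) is not None


first_has_next = has_next(FIRST_PAGE)
last_has_next = has_next(LAST_PAGE)
print(first_has_next, last_has_next)
```

The predicate @class='gspr next' compares the whole attribute value, so a button whose class is 'gspr next-d' is not matched.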
To learn more about XPath and pagination, you can refer to these tutorials:
Pagination Scraping: Configure “Loop click next page” When It Can’t Be Detected
Author: The Octoparse Team