Written 2010 by Will Johnson for Fast Forward Technologies
Creative Commons Attribution 3.0 License
Knol uses the "class" attribute extensively, and gives its classes easily recognizable names. That means we can write code that parses those class names, which lets a program know what each value "means". This is similar to screen-scraping, except that we are simply scraping the output HTML file to extract the meaning and values, and discarding the rest of the display elements.
We will then build an array holding just the components of each result. One use for something like this might be the following scenario: search on "Dog", collect all the articles, sort them alphabetically by author's name, then post them on your own HTML page elsewhere as links back into Knol, with your own AdSense banners around the content.
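To make that scenario concrete, here is a minimal sketch of the "sort by author, then post as links" step. It is written in Python rather than PHP purely for illustration, and the record fields ("title", "author", "url"), the sample data and the function name are all assumptions of mine, not anything Knol defines.

```python
from html import escape

# Hypothetical sample records. In practice these would be filled in
# from the parsed search results; the field names are assumptions.
sample_results = [
    {"title": "Dog Training Basics", "author": "Smith", "url": "http://example.com/knol/1"},
    {"title": "Choosing a Dog Breed", "author": "adams", "url": "http://example.com/knol/2"},
]


def links_sorted_by_author(records):
    """Return one HTML link per record, ordered alphabetically by author."""
    # Compare lowercased names so "adams" sorts before "Smith".
    ordered = sorted(records, key=lambda r: r["author"].lower())
    return [
        '<a href="{}">{}</a> by {}'.format(
            escape(r["url"], quote=True), escape(r["title"]), escape(r["author"])
        )
        for r in ordered
    ]


if __name__ == "__main__":
    for line in links_sorted_by_author(sample_results):
        print(line)
```

Escaping the values before writing them into the page is a design choice on my part; titles that contain characters like "&" or "<" would otherwise break the generated HTML.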
The manual labor of extracting and updating the content is removed. It can all be done through automatic PHP scripting. First we have to discover the complete layout of the file.
Knol's Search Toolkit has many options. After you have selected the options you want and clicked Submit, you will see that those options are passed using the GET method; that is, they appear embedded in the URL in your address bar. So your short URL is suddenly 300 characters long or so once you've filled in some complex options. Try it: search on "Cat" or something, click Submit, and when the results page comes up, check the URL. Scroll to the right and you should see "Cat" embedded somewhere in the URL string.
Knowing this, we can bookmark results, but more importantly for our current purposes, we can write scripts that execute the exact same search every week or every day and return the results, with any updates of course, into our program. The script doesn't need to know what the screen layout of the Search Toolkit page looks like; it simply passes the long URL to Knol intact and waits for the results to come back.
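As a rough sketch of what such a script could look like, the Python below builds a GET URL from a set of options and fetches the page. The base address and the option names ("q", "sort") are placeholders I made up for illustration; the real ones are whatever you see in your own address bar after clicking Submit.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder address, NOT the real Search Toolkit URL. Copy the real
# one from your browser's address bar.
SEARCH_BASE = "http://example.com/search"


def build_search_url(base, options):
    """Append the options to the base URL as a GET query string."""
    return base + "?" + urlencode(options)


def fetch_results(url, timeout=30):
    """Request the URL and return the results page as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical option names; use the ones from your own search URL.
    url = build_search_url(SEARCH_BASE, {"q": "Cat", "sort": "date"})
    print(url)
    # html_text = fetch_results(url)  # run this on whatever schedule you like
```

If you already have the long URL copied from the browser, you can skip `build_search_url` and pass that string straight to `fetch_results`.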
On the results page, if you are using the Firefox browser, open the View menu and select Page Source. Other browsers probably use a different menu name, but the point here is that you can see the underlying HTML code. Examine that code. Look for the Knol title that appears as the first result on your results page.
Now just a bit ahead of that title you will see the following Div id defined:
We are going to tell our script that this particular string is where the "results begin". But what is the format of the results paragraph? I have gone through the results paragraph and stripped out the background noise and the values, so we can see just the format. Each returned "result" has the following format:
<li class="knol-search-bullet">
<div class="knol-search-knol">
<div class="knol-search-left">
<div class="knol-search-knol-image-c">
<img class="knol-search-knol-image knol-search-knol-image-sized">
<div class="knol-search-mid">
<div class="knol-title-wrapper">
<a class="knol-search-knol-title">
<div class="knol-search-knol-author">
<div class="knol-search-knol-snippet">
<div class="knol-search-right">
<div class="knol-search-knol-author">
<div class="knol-search-knol-info knol-search-knol-info-pageviews">
<span class="knol-search-knol-info-details">
<span class="knol-search-knol-info knol-search-knol-info-version">
<span class="knol-search-knol-info-details">
<span class="knol-search-knol-info knol-search-knol-info-edited">
<span class="knol-search-knol-info-details">
<div class="knol-search-knol-info">
<span class="knol-search-knol-zipit">
<span class="zzAggregateRating">
<span class="knol-zipit-count-display">
<a class="knol-search-knol-info knol-search-knol-info-badges">
<class="knol-badge-small knol-sprite-main-top_viewed_badge">
What does it all mean? We'll see in our next lesson. But I'll tell you this right now: the class "knol-search-bullet" is a tally mark for each results paragraph. So a simple test would be to count how many times this class appears. That tells your program how many separate results are on this page. Of course, it will not tell you how many pages of results exist.
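Here is one way that counting test might look, sketched in Python with the standard library's HTML parser. The class and function names are my own, and I am assuming that each result is an `li` element carrying the "knol-search-bullet" class, as in the format shown above.

```python
from html.parser import HTMLParser


class BulletCounter(HTMLParser):
    """Counts <li> start tags whose class list includes knol-search-bullet."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "knol-search-bullet" in classes:
            self.count += 1


def count_results(html_text):
    """Return how many separate results appear on one results page."""
    counter = BulletCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


if __name__ == "__main__":
    # Tiny made-up page fragment, just to exercise the counter.
    sample = (
        '<ul><li class="knol-search-bullet"><div class="knol-search-knol"></div></li>'
        '<li class="knol-search-bullet"></li></ul>'
    )
    print(count_results(sample))
```

Splitting the class attribute on whitespace, rather than searching the raw text for the string, avoids counting longer class names that merely contain "knol-search-bullet".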
--> Forward to Parsing Knol's Search Toolkit Results (Page 2)