Parsing Knol's Search Toolkit Results

Using the class attribute in Knol to code a PHP parsing routine

This article will show you how to parse the Knol Search Toolkit output into a PHP program, by understanding its use of classes, so you can do whatever other kind of manipulation you want with it.

Authors

Written 2010 by Will Johnson for Fast Forward Technologies
Email Fast Forward Technologies at fft2001@aol.com
or post your comments for public view far below.
Creative Commons Attribution 3.0 License

Knol uses the "class" attribute extensively, giving its classes easily recognizable names.  This means we can write code that parses those attributes, so that a program knows what each value "means".  This is similar to screen-scraping, except that we simply scrape the output HTML file to extract the meaning and values, and discard the rest of the display elements.

We will then build an array holding just the components of each result.  One use for something like this might be the following scenario: search on "Dog", collect all the articles, sort them alphabetically by author's name, then post them on your own HTML page elsewhere as links back into Knol, with your own AdSense banners around the content.
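The sort-and-link step of that scenario could be sketched roughly as follows. This is only an illustration: the array keys 'title', 'author' and 'url', the function names, and the example URLs are all assumptions, not something Knol provides.

```php
<?php
// Sort a list of parsed results alphabetically by author, ignoring case.
// The 'author' key is a hypothetical field name chosen for this sketch.
function sortByAuthor(array $results): array
{
    usort($results, function ($a, $b) {
        return strcasecmp($a['author'], $b['author']);
    });
    return $results; // the caller's array is left untouched
}

// Render the results as an HTML list of links back to the articles.
function renderLinks(array $results): string
{
    $items = '';
    foreach ($results as $r) {
        $items .= '<li><a href="' . htmlspecialchars($r['url']) . '">'
            . htmlspecialchars($r['title']) . '</a> by '
            . htmlspecialchars($r['author']) . "</li>\n";
    }
    return "<ul>\n" . $items . "</ul>\n";
}

// Made-up sample data standing in for real parsed search results.
$articles = [
    ['title' => 'Dog Breeds',   'author' => 'Smith', 'url' => 'http://knol.example/k/1'],
    ['title' => 'Dog Training', 'author' => 'adams', 'url' => 'http://knol.example/k/2'],
];

$sorted = sortByAuthor($articles);
echo renderLinks($sorted);
```

The banners and the rest of the page would wrap around whatever renderLinks returns.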

The manual labor of extracting and updating the content is removed.  It can all be done through automatic PHP scripting.  First we have to discover the complete layout of the file.

Knol's Search Toolkit has many options.  After you have selected the options you want and clicked Submit, you will see that those options are passed using the GET method.  That is, they appear embedded in the URL in your address bar, so your short URL may suddenly be 300 characters long once you've filled in some complex options.  Try it: search on "Cat", click Submit, and when the results page comes up, check the URL.  Scroll to the right and you should see "Cat" embedded somewhere in the URL string.

Knowing this, we can bookmark results.  More importantly for our current purposes, we can write scripts that execute the exact same search every week or every day and return the results, including any updates, into our program.  The script doesn't need to know what the screen layout of the Search Toolkit page looks like; it simply passes the long URL to Knol intact and waits for the results to come back.
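As a minimal sketch of that idea, a script can rebuild the GET URL from a list of options and then fetch the page. The base URL and the parameter names 'q' and 'sort' below are placeholders, not Knol's real ones; copy the real ones from your own address bar after submitting a search.

```php
<?php
// Build a GET URL from a base address and an array of search options.
// http_build_query URL-encodes each value and joins the pairs with "&".
function buildSearchUrl(string $baseUrl, array $options): string
{
    return $baseUrl . '?' . http_build_query($options);
}

// Fetch the results page as a string, or null on failure.
// file_get_contents can read URLs when allow_url_fopen is enabled.
function fetchSearchResults(string $url): ?string
{
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}

// Hypothetical base URL and parameter names, for illustration only.
$url = buildSearchUrl('http://knol.example/search', ['q' => 'Cat', 'sort' => 'title']);
echo $url, "\n";

// In a real script you would then call:
// $html = fetchSearchResults($url);
```

Run from a scheduler such as cron, a script like this repeats the same search as often as you like.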

On the results page, if you are using the Firefox browser, open the View menu and select Page Source.  Other browsers probably use a different menu name, but the point is that you can see the underlying HTML code.  Examine that code and look for the Knol title that appears as the first result on your results page.

Now just a bit ahead of that title you will see the following Div id defined:
<div id="knol-searchresults-id">

We are going to tell our script that this particular string is where the "results begin".  But what is the format of each results paragraph?  I have gone through one and stripped out the background noise and the values, so we can see just the format.  Each returned "result" has the following format:

<li class="knol-search-bullet">
<div class="knol-search-knol">
<div class="knol-search-left">
<div class="knol-search-knol-image-c">
<img class="knol-search-knol-image knol-search-knol-image-sized">
<div class="knol-search-mid">
<div class="knol-title-wrapper">
<a class="knol-search-knol-title">
<div class="knol-search-knol-author">
<div class="knol-search-knol-snippet">
<div class="knol-search-right">
<div class="knol-search-knol-author">
<div class="knol-search-knol-info knol-search-knol-info-pageviews">
<span class="knol-search-knol-info-details">
<span class="knol-search-knol-info knol-search-knol-info-version">
<span class="knol-search-knol-info-details">
<span class="knol-search-knol-info knol-search-knol-info-edited">
<span class="knol-search-knol-info-details">
<div class="knol-search-knol-info">
<span class="knol-search-knol-zipit">
<span class="zzAggregateRating">
<span class="knol-zipit-count-display">
<a class="knol-search-knol-info knol-search-knol-info-badges">
<class="knol-badge-small knol-sprite-main-top_viewed_badge">
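One possible way to turn that layout into an array, sketched with PHP's DOM extension, is shown here. The function names and the sample page are made up for illustration, and only three of the classes are read. Real Knol pages may need extra handling, for example for character encoding.

```php
<?php
// Return the page from the "results begin" marker onward, or null if absent.
function sliceFromResults(string $html): ?string
{
    $pos = strpos($html, '<div id="knol-searchresults-id">');
    return $pos === false ? null : substr($html, $pos);
}

// Text of the first element under $node that carries the given class, or ''.
function firstTextByClass(DOMXPath $xpath, DOMNode $node, string $class): string
{
    $query = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    $hits = $xpath->query($query, $node);
    return $hits->length > 0 ? trim($hits->item(0)->textContent) : '';
}

// Build one array entry per <li class="knol-search-bullet"> result.
function parseResults(string $html): array
{
    if ($html === '') {
        return [];
    }
    libxml_use_internal_errors(true); // scraped pages are rarely valid markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $bullets = "//li[contains(concat(' ', normalize-space(@class), ' '), ' knol-search-bullet ')]";
    $results = [];
    foreach ($xpath->query($bullets) as $li) {
        $results[] = [
            'title'   => firstTextByClass($xpath, $li, 'knol-search-knol-title'),
            'author'  => firstTextByClass($xpath, $li, 'knol-search-knol-author'),
            'snippet' => firstTextByClass($xpath, $li, 'knol-search-knol-snippet'),
        ];
    }
    return $results;
}

// Made-up sample page that follows the class layout listed above.
$sample = '<html><body><p>header</p><div id="knol-searchresults-id"><ul>'
    . '<li class="knol-search-bullet"><a class="knol-search-knol-title">Cats</a>'
    . '<div class="knol-search-knol-author">A. Writer</div>'
    . '<div class="knol-search-knol-snippet">All about cats.</div></li>'
    . '</ul></div></body></html>';

$results = parseResults(sliceFromResults($sample) ?? '');
print_r($results);
```

The author class appears twice in each result, so this sketch takes only the first occurrence.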



What does it all mean?  We'll see in our next lesson.  But I'll tell you this right now: the class "knol-search-bullet" is a tally mark for each results paragraph.  So a simple test would be to count how many times this class appears.  That tells your program how many separate results are on this page.  Of course, it will not tell you how many pages of results exist.
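That counting test might look like this in PHP. It is a minimal sketch that assumes the class is always written exactly as class="knol-search-bullet"; the function name and sample page are made up.

```php
<?php
// Count the results on one page by counting the per-result marker class.
function countResults(string $html): int
{
    return substr_count($html, 'class="knol-search-bullet"');
}

// Made-up two-result page for illustration.
$page = '<ul><li class="knol-search-bullet">one</li>'
      . '<li class="knol-search-bullet">two</li></ul>';

echo countResults($page), "\n";
```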



--> Forward to Parsing Knol's Search Toolkit Results (Page 2)

Comments

Interesting

Can you observe how many knols are published so far and how many authors have registered on knol so far?

Narayana Rao - 27 Sep 2010

Thank you for the reply.
Today, Google celebrates its birthday. Of course, the doodle has now been changed back. Earlier it was a cake.

http://knol.google.com/k/-/-/2utb2lsm2k7a/3018

I am happy to report that today my monthly visitors number will touch 33,000.
The monthly page view number will cross 48,000. For the last three months there has been a continuous increase in page numbers. I hope you are also seeing an increase in page views.

What is lacking on Knol is systematic promotion of their articles by Knol authors.
There is a benefit to using Twitter, Google Buzz, social bookmarking sites, and blogs for promoting Knol articles. Knol will certainly have many more visitors, and there will also be positive sentiment in social media.

Regards.

Narayana Rao - 27 Sep 2010

I don't know how to see that. When I do searches, it seems to truncate the list of authors at a certain number. I'm not sure how to tell the total number.

Will Johnson - 27 Sep 2010