The internet is a big place, and it keeps growing, along with the volume of data traffic it carries. Often, only part of a page is what we actually need. Links, paragraphs, and keywords are three examples of data we care about: the metadata. LXML is a great library that makes parsing HTML documents from within Python pretty easy, so I decided to write some code examples for those who are interested.
Scraping the Reddit front page as an example
Reddit's front page is easily parsable. In fact, it has a straightforward CSS structure that actually makes sense:
Each link to a post is contained inside a div tag with the thing class. Chromium, the web browser in the screenshot above, actually supports searching by XPath from the developer console. Very neat; cheers to the developers who made this possible!
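You can try the same class-based XPath offline before pointing it at the live site. The snippet below runs it against a minimal, made-up HTML fragment (a simplified stand-in for Reddit's real markup, not its actual source):

```python
import lxml.html

# Hypothetical, simplified stand-in for Reddit's listing markup
sample = """
<div id="siteTable">
  <div class="thing link"><p class="title">
    <a class="title" href="/r/example/1">First post</a></p></div>
  <div class="thing link"><p class="title">
    <a class="title" href="/r/example/2">Second post</a></p></div>
</div>
"""

doc = lxml.html.fromstring(sample)
# Same predicate you would type in the browser console:
# match elements whose class attribute contains 'thing'
posts = doc.xpath("//div[contains(@class, 'thing')]")
print(len(posts))  # 2
```

The contains() predicate is what makes this robust: Reddit's divs carry several classes at once (thing, link, and so on), so an exact @class='thing' comparison would match nothing.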
The same thing can be done programmatically, using Python and LXML. Here's an example that should work:
#!/usr/bin/env python3
import lxml.html
from pycurl import Curl
from io import BytesIO

userAgent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0'
redditMainPage = 'https://www.reddit.com/new/'

def fetchUri(uriToFetch):
    # Fetch the page into an in-memory buffer with pycurl
    buffer = BytesIO()
    c = Curl()
    c.setopt(c.URL, uriToFetch)
    c.setopt(c.WRITEDATA, buffer)
    c.setopt(c.USERAGENT, userAgent)
    c.perform()
    c.close()
    # Reddit serves its pages as UTF-8, so decode accordingly
    return buffer.getvalue().decode('utf-8')

requestResult = fetchUri(redditMainPage)
requestLxmlDocument = lxml.html.document_fromstring(requestResult)

# Select the title link of every post: a div with the 'thing' class
# wraps an 'entry' div, which holds a p.title containing the a.title link
requestLxmlRoot = requestLxmlDocument.xpath(
    "//div[contains(@class, 'thing')]"
    "//div[contains(@class, 'entry')]"
    "//p[contains(@class, 'title')]"
    "//a[contains(@class, 'title')]"
)

for rootObject in requestLxmlRoot:
    print(rootObject.text_content() + "\n")
This code iterates over each reddit post found on the main page and prints its title, followed by a newline character. The snippet should work fine on both Python 2 and 3, with PycURL and LXML installed. Good luck experimenting with LXML!
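Titles are only part of the metadata; the same anchors also carry each post's URL in their href attribute. Here is a small variation on the loop above that grabs both. It runs against a local HTML string so it works without a network request (the markup is a simplified, hypothetical stand-in for Reddit's page):

```python
import lxml.html

# Hypothetical fragment mimicking one post on the listing page
sample = """
<div class="thing"><div class="entry"><p class="title"><a class="title"
  href="https://example.com/article">An example post</a></p></div></div>
"""

doc = lxml.html.fromstring(sample)
anchors = doc.xpath(
    "//div[contains(@class, 'thing')]"
    "//div[contains(@class, 'entry')]"
    "//p[contains(@class, 'title')]"
    "//a[contains(@class, 'title')]"
)
for a in anchors:
    # text_content() yields the title; get('href') yields the link
    print(a.text_content(), '->', a.get('href'))
```

Swapping the sample string for the fetchUri() result from the main example gives you title/URL pairs for the whole front page.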