IndiaTimes EPG grabber
The Times of India group used to host an Indian EPG site at http://tvguide.indiatimes.com/. The Dabba1 grabber is based on pytvgrab.
PyTvGrab-lib is an XMLTV grabber library for Python. It extracts information from a web page and outputs it in the XMLTV format (version 0.5.15). The library provides helpers to generate the XML and to work with dates and times, an abstract grabber model, a powerful regular expression tool, and an easy-to-customize HTML parser.
I adapted pytvgrab to support EPG download from this site. The site's download speed was comparable to dial-up, which is a serious drawback for EPG downloads, so I made the following changes to the pytvgrab implementation to deal with it.
Customized parser
- To parse the India Times web pages, added a regular expression to ignore <script> ... </script> sequences.
- Also added several regular expressions to ignore certain Unicode characters.
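The script-stripping step can be sketched roughly as follows. This is a minimal illustration in modern Python, not the grabber's actual code: the names SCRIPT_RE, UNWANTED_CHARS_RE and clean_page are hypothetical, and the set of characters removed is only an assumed example, since the original expressions are not listed here.

```python
import re

# Hypothetical pattern: matches a whole <script ...> ... </script> block,
# across newlines and regardless of tag case.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)

# Assumed example set: non-breaking space, zero-width space, byte-order mark.
UNWANTED_CHARS_RE = re.compile("[\u00a0\u200b\ufeff]")


def clean_page(html):
    """Drop script blocks, then replace the unwanted characters with spaces."""
    html = SCRIPT_RE.sub("", html)
    return UNWANTED_CHARS_RE.sub(" ", html)
```

Removing the script blocks before the HTML parser sees the page keeps stray markup inside JavaScript strings from confusing the parser.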
Checkpointing the grab-data
Reasons for checkpointing
- power failure
- program crash
- a stalled network connection that forces an abort (this is really an issue with the pytvgrab library)
How to checkpoint?
The EPG has three levels of data:
- Main page: the updated .conf file serves as the checkpoint
- Program list (p1, p2, p3, p4): these files can be cached
- Program descriptions [1, 2, 3, ...]: checkpoint by caching and saving the list of program items
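One way to checkpoint the program-item level is sketched below. It is an assumption-laden illustration, not the grabber's code: the JSON file format and the names save_checkpoint, load_checkpoint and grab_items are all hypothetical, and fetch stands in for whatever routine downloads one program description.

```python
import json
import os


def save_checkpoint(path, done_ids):
    """Write the finished item IDs to a temp file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done_ids), f)
    os.replace(tmp, path)  # avoids leaving a half-written checkpoint


def load_checkpoint(path):
    """Return the set of item IDs already grabbed (empty if no checkpoint)."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def grab_items(item_ids, fetch, path):
    """Fetch each item not yet in the checkpoint, saving after every item."""
    done = load_checkpoint(path)
    for item_id in item_ids:
        if item_id in done:
            continue  # already grabbed in an earlier, interrupted run
        fetch(item_id)
        done.add(item_id)
        save_checkpoint(path, done)
    return done
```

After a power failure, crash, or abort, rerunning grab_items with the same checkpoint path skips the items that were already fetched.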
The channel configuration was modified to indicate that some channels are "favorite" and others are not.
- channel configuration:
y : want EPG but not a favorite
v : want EPG and is a favorite
For channels that are not favorites, no program items are grabbed, only the list of programs.
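How the two flags drive the grab can be sketched as below. The flag letters y and v come from the configuration above; the function name grab_plan and the choice to treat any other flag as "skip this channel" are assumptions made for illustration.

```python
def grab_plan(flag):
    """Return (grab_program_list, grab_program_items) for a channel flag."""
    want_list = flag in ("y", "v")  # both flags request EPG data
    want_items = flag == "v"        # only favorites get per-program items
    return want_list, want_items
```

Skipping the per-program pages for non-favorite channels removes most of the requests, which matters on a slow site.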
Adding the option --parse
--parse xxx.html : parses a downloaded or local HTML file. Useful for debugging.
Adding the option --cachefile
--cachefile URL : first shows the hashed cache file name for the URL; if that file is present, parses the cached file and displays the result.
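The cache lookup behind --cachefile might look something like this. The hash function is not stated in these notes, so MD5 of the URL is purely an assumption, as are the names cache_path and read_cached and the default cache directory.

```python
import hashlib
import os


def cache_path(url, cache_dir="cache"):
    """Map a URL to a cache file name (assumed: MD5 hex digest of the URL)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def read_cached(url, cache_dir="cache"):
    """Print the cache file name; return its contents, or None if absent."""
    path = cache_path(url, cache_dir)
    print(path)
    if not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()
```

The returned text would then be handed to the same parser that --parse uses for local files.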