python, lxml and xpath - html table parsing
I am new to lxml, fairly new to Python, and could not find a solution to the following:
I need to import some tables with 3 columns and an undefined number of rows, starting at row 3. When the second column of any row is empty, that row is discarded and processing of the table is aborted.
The following code prints the table's data fine (but I cannot reuse the data afterwards):
    from lxml.html import parse

    def process_row(row):
        # Yield the text of each cell in the row.
        for cell in row.xpath('./td'):
            print(cell.text_content())
            yield cell.text_content()

    def process_table(table):
        return [process_row(row) for row in table.xpath('./tr')]

    # url is the address of the page to parse, defined elsewhere
    doc = parse(url).getroot()
    tbl = doc.xpath("/html//table[2]")[0]
    data = process_table(tbl)

This prints the first column only :(

    for i in data:
        print(next(i))

The following only imports the third row, and not the subsequent ones:

    tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

Does anyone know a fancy solution to get all the data from row 3 onward into tbl and copy it into an array, so it can be processed in a module with no lxml dependency?
Thanks in advance for your help, Alex

This is a generator:

    def process_row(row):
        for cell in row.xpath('./td'):
            print(cell.text_content())
            yield cell.text_content()

You are calling it as though you thought it returns a list. It doesn't. There are contexts in which it behaves like a list:

    print([r for r in process_row(row)])

But that is only because a generator and a list both expose the same interface to for loops. Using it in a context where it is never iterated, such as:

    return [process_row(row) for row in table.xpath('./tr')]

only creates a new instance of the generator for each new value of row. The result is a list of generators, and calling next() once on each of them returns only the first value yielded, which is the first cell of each row.
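To make the difference concrete, here is a minimal sketch that leaves lxml out entirely. The rows are hypothetical stand-in lists of strings rather than real `tr` elements, and this `process_row` is a simplified version of the one in the question. It compares a list of unconsumed generators with a list of fully materialized rows.

```python
def process_row(row):
    # Generator: nothing in this body runs until the generator is iterated.
    for cell in row:
        yield cell.strip()

# Hypothetical stand-in for table rows (plain strings instead of <td> elements).
rows = [[" a ", " b ", " c "], [" d ", " e ", " f "]]

# A list of generator objects; no cell has been read yet.
lazy = [process_row(row) for row in rows]

# One next() call per generator returns only the first cell of each row.
first_cells = [next(gen) for gen in lazy]

# Wrapping the call in list() consumes each generator completely.
materialized = [list(process_row(row)) for row in rows]

print(first_cells)
print(materialized)
```

Wrapping each call in `list()` is probably the smallest change to the original `process_table` that yields reusable data.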
So that is your first problem. Your second one is that you are expecting:

    tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

to give you all the rows from the third onward, and it is only setting tbl to the third row. Well, the call to xpath is returning the third and all subsequent rows. It is the [0] at the end that is messing you up.
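Putting both fixes together, one possible sketch looks like the following. The inline `PAGE` markup and the `extract_rows` helper are made up for illustration, and the page is assumed to have the data in the second table directly under `body`. The result is a plain list of lists of strings, so whatever consumes it needs no lxml import.

```python
from lxml import html

# Hypothetical sample page: the second table holds two header rows, then data.
PAGE = """
<html><body>
<table><tr><td>ignored</td></tr></table>
<table>
  <tr><td>title</td><td></td><td></td></tr>
  <tr><td>col a</td><td>col b</td><td>col c</td></tr>
  <tr><td>1</td><td>x</td><td>10</td></tr>
  <tr><td>2</td><td>y</td><td>20</td></tr>
  <tr><td>3</td><td></td><td>30</td></tr>
  <tr><td>4</td><td>z</td><td>40</td></tr>
</table>
</body></html>
"""

def extract_rows(doc):
    # No trailing [0]: keep every row from the third one onward.
    rows = doc.xpath("//body/table[2]//tr[position()>2]")
    data = []
    for row in rows:
        # Build a real list of strings for this row, not a generator.
        cells = [td.text_content().strip() for td in row.xpath("./td")]
        # An empty second column discards the row and aborts the table.
        if len(cells) < 2 or not cells[1]:
            break
        data.append(cells)
    return data

doc = html.fromstring(PAGE)
table_data = extract_rows(doc)
print(table_data)
```

With a real page, `parse(url).getroot()` from the question can take the place of `html.fromstring(PAGE)`.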