multithreading - How to check if a task is already in a Python Queue?


I am writing a simple crawler in Python using the threading and Queue modules. I fetch a page, check its links and put them into a queue. When a thread finishes processing a page, it grabs the next one from the queue. I use an array of pages I have already visited to filter the links I add to the queue, but if there are multiple threads and they find the same link on different pages, they put duplicate links into the queue. How can I find out whether a URL is already in the queue, to avoid putting it there again?

If you do not care about the order in which the items are processed, I would try a subclass of Queue that uses a set internally:

    from queue import Queue  # Python 2: from Queue import Queue

    class SetQueue(Queue):
        # Pending items live in a set, so duplicates collapse
        # and items come out in arbitrary order.
        def _init(self, maxsize):
            self.maxsize = maxsize
            self.queue = set()

        def _put(self, item):
            self.queue.add(item)

        def _get(self):
            return self.queue.pop()

As Paul McGuire pointed out, this still allows a duplicate item to be added after it has been removed from the "to-be-processed" set but not yet added to the "processed" set. To solve this, you can store both sets in the Queue instance, but since you are then checking the larger set to see whether an item has been processed, you can just as well go back to a regular queue, which will order the requests properly.
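
The gap described above can be reproduced in a few lines. This is a minimal single-threaded sketch, assuming Python 3 (where the module is named queue); the URL and the variable names are made up for illustration.

```python
from queue import Queue


class SetQueue(Queue):
    """Queue whose pending items are kept in a set (arbitrary order)."""

    def _init(self, maxsize):
        self.maxsize = maxsize
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()


url = "http://example.com/a"  # hypothetical link

q = SetQueue()
q.put(url)
q.put(url)  # collapses into the existing entry while it is still pending
pending_after_duplicate_put = q.qsize()

taken = q.get()  # a worker takes the link; the set forgets it
q.put(taken)     # the same link is found on another page and accepted again
pending_after_requeue = q.qsize()
```

After the duplicate put only one item is pending, but once the link has been taken out, the set no longer knows about it and the same link is queued a second time. A queue that remembers every item it has ever accepted avoids this: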

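    from queue import Queue  # Python 2: from Queue import Queue

    class SetQueue(Queue):
        def _init(self, maxsize):
            Queue._init(self, maxsize)
            # Every item ever queued, whether or not it is still waiting.
            self.all_items = set()

        def _put(self, item):
            # Silently skip items that have been queued before.
            if item not in self.all_items:
                Queue._put(self, item)
                self.all_items.add(item)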

The advantage of this, as opposed to using a separate set, is that the Queue methods are thread-safe, so you do not need extra locking to check the other set.
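
As a rough illustration of that point, the sketch below lets three threads put overlapping links into one such queue without any locking of their own. It assumes Python 3; the producer function, the pages list and the example.com URLs are hypothetical stand-ins for crawler threads and the links they find.

```python
import threading
from queue import Queue


class SetQueue(Queue):
    """FIFO queue that silently drops items it has already accepted."""

    def _init(self, maxsize):
        super()._init(maxsize)
        self.all_items = set()

    def _put(self, item):
        # Called by put() while the queue's internal lock is held,
        # so the membership test and the add happen atomically.
        if item not in self.all_items:
            super()._put(item)
            self.all_items.add(item)


def producer(q, links):
    # Stand-in for a crawler thread that found these links on a page.
    for link in links:
        q.put(link)


pages = [
    ["http://example.com/a", "http://example.com/b"],
    ["http://example.com/b", "http://example.com/c"],
    ["http://example.com/a", "http://example.com/c"],
]

q = SetQueue()
threads = [threading.Thread(target=producer, args=(q, links)) for links in pages]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the queue; each distinct link appears exactly once.
queued = []
while not q.empty():
    queued.append(q.get())
```

Six puts leave three items in the queue. One caveat worth checking if you rely on task_done() and join(): in CPython, put() counts every call as an unfinished task, including calls whose item was dropped by _put, so join() would wait for more task_done() calls than there are items to process.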

