Web Scraping and Automation With Playwright

COMMENTS • 6

  • @guyteigh3375
    @guyteigh3375 14 days ago

    For a project that needs to scrape *most* of the pages it is given (a failure rate of around 30% is perfectly tolerable), how much code is required to populate fields like Title, Description, H1, H2, Category, the first 4K of page text, and so on?
    I am happy to use Python, Java, or another language, and I do not mind which headless browser to use, but the priority is speed (bandwidth to the internet will not be an issue, even though we need this to run on multiple cores at once).
    I realise this is perhaps not a simple question; I am just wondering how difficult it is to create a script that scrapes well over half the sites it is asked to, with a priority on speed. An hour or so of work, an afternoon's work, a week's work?
    Most of the tutorials I have seen from others explain how to tune the system for specific sites, which is awesome if you (for example) want to scrape a huge site with a consistent page format.
    I am after a guide that lets me provide a file of (say) 1000 pages, and it will "do its best" to scrape each one, regardless of layout, and populate fields like TITLE, DESCRIPTION, H1 content, H2 content, CATEGORY and so on, very much like a little search engine might want to do.
    If you know of any tutorials that might be worth a look, I would appreciate a link please. Google, for once, has not been massively helpful!
    Many thanks.
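A "best effort" generic scraper along the lines the comment describes can be fairly compact. The sketch below is illustrative, not from the video: Playwright only renders the page, and a small stdlib `HTMLParser` subclass then pulls out title, meta description, H1/H2 text, and the first 4K of visible text. All function and class names here are hypothetical.

```python
# Hedged sketch of a layout-agnostic field extractor. Playwright renders
# the page; the stdlib parser below extracts the generic fields.
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not go on the tag stack.
VOID = {"meta", "br", "img", "link", "input", "hr", "area", "base",
        "col", "embed", "source", "track", "wbr"}

class FieldExtractor(HTMLParser):
    """Collects <title>, meta description, h1/h2 text, and visible text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "h1": [], "h2": []}
        self._stack = []   # currently open tags
        self._text = []    # visible text chunks

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            if tag == "meta":
                a = dict(attrs)
                if a.get("name", "").lower() == "description":
                    self.fields["description"] = a.get("content", "")
            return
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop until the matching opener; tolerates sloppy real-world HTML.
        while self._stack and self._stack.pop() != tag:
            pass

    def handle_data(self, data):
        if not data.strip() or not self._stack:
            return
        tag = self._stack[-1]
        if tag == "title":
            self.fields["title"] += data
        elif "h1" in self._stack:
            self.fields["h1"].append(data.strip())
        elif "h2" in self._stack:
            self.fields["h2"].append(data.strip())
        if tag not in ("script", "style", "title"):
            self._text.append(data)

    def page_text(self, limit=4096):
        # Whitespace-normalised visible text, truncated to `limit` chars.
        return " ".join(" ".join(self._text).split())[:limit]

def extract_fields(html: str) -> dict:
    parser = FieldExtractor()
    parser.feed(html)
    out = dict(parser.fields)
    out["text"] = parser.page_text()
    return out

def scrape(url: str) -> dict:
    # Playwright is only needed for rendering; imported lazily so the
    # parsing helpers above stay usable without it installed.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=15000)
        html = page.content()
        browser.close()
    return extract_fields(html)
```

For throughput across many URLs, the same `scrape` function could be fanned out with a process pool, one browser per worker; a "category" field, however, would need a classifier on top of the extracted text, which is beyond a parser like this.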

  • @sansnomnull2799
    @sansnomnull2799 1 year ago

    Will Playwright be able to handle the anti-robot pop-ups that jump in the middle of scraping? For example, Walmart has that 'click to verify that you are a human' check.

    • @oxylabs
      @oxylabs 1 year ago +1

      Playwright has all the tools necessary to imitate user behaviour, including mouse control, which would be helpful in this particular case. Here's some more info on that: bit.ly/3jmN6yA
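The mouse control the reply mentions can be sketched as below. This is not the linked article's code; the selector and function names are placeholders, and no claim is made that this defeats any particular challenge (many verification widgets also sit inside iframes and fingerprint the browser itself).

```python
# Hedged sketch: moving the cursor in steps before clicking, using
# Playwright's Mouse API, rather than an instant locator.click().

def linear_path(x0, y0, x1, y1, steps=25):
    """Evenly spaced intermediate cursor positions. Playwright's
    mouse.move(..., steps=N) generates these events internally; this
    helper only illustrates what the `steps` argument does."""
    return [(x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps)
            for i in range(1, steps + 1)]

def human_like_click(page, selector, steps=25):
    """Move the mouse to the element's centre in small steps,
    then press and release the button."""
    box = page.locator(selector).bounding_box()
    if box is None:
        raise RuntimeError(f"element {selector!r} is not visible")
    x = box["x"] + box["width"] / 2
    y = box["y"] + box["height"] / 2
    page.mouse.move(x, y, steps=steps)  # emits intermediate mousemove events
    page.mouse.down()
    page.mouse.up()

def demo(url, selector):
    # Playwright import kept local so the helpers above can be
    # exercised without a browser installed.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url)
        human_like_click(page, selector)
        browser.close()
```

Running headed (`headless=False`) and pacing the movement with `steps` makes the interaction look less scripted; whether that satisfies a given verification page is site-specific.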