For a project that needs to scrape *most* pages it is given (a failure rate of around 30% is perfectly tolerable), how much code is required to populate fields like Title, Description, H1, H2, Category, the first 4K of page text, and so on?
I am happy to use Python, Java or another language, and I do not mind which headless browser, but the priority is speed (bandwidth to the internet will not be an issue, even though we need to be able to run this on multiple cores at once).
I realise this is perhaps not a simple question, but I am just wondering how difficult it is to create a script that will scrape well over half the sites it is asked to, with a priority on speed. An hour or so of work, an afternoon's work, a week's work?
Most of the tutorials I have seen explain how to tune the system to scrape specific sites, which is awesome if you (for example) want to scrape a huge site with a consistent page format.
I am after a guide that lets me provide a file of (say) 1000 pages, and it will "do its best" to scrape each one, regardless of layout, and populate fields like TITLE, DESCRIPTION, H1 content, H2 content, CATEGORY and so on, very much like a little search engine might want to do.
If you know of any tutorials that might be worth a look, I would appreciate a link. For once, Google has not been massively helpful!
Many thanks.
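To give a feel for the scale of the task: layout-agnostic extraction of Title, Description, H1/H2 and the first 4K of visible text is maybe an afternoon's work, because those fields live in standard HTML elements rather than site-specific markup. Below is a minimal, standard-library-only sketch of the parsing side (all class and function names are my own; a real crawler would likely use lxml or BeautifulSoup for robustness, and a headless browser for JavaScript-heavy pages):

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID = {"meta", "br", "img", "link", "input", "hr", "area", "base",
        "col", "embed", "source", "track", "wbr"}

class PageFieldExtractor(HTMLParser):
    """Best-effort extraction of title / description / h1 / h2 / body text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "h1": [], "h2": [], "text": []}
        self._stack = []  # open-tag stack, so we know which element text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "description":
            self.fields["description"] = attrs.get("content", "")
        if tag not in VOID:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop until the matching tag; tolerates sloppy, unclosed HTML.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        tag = self._stack[-1]
        if tag == "title":
            self.fields["title"] += text
        elif tag in ("h1", "h2"):
            self.fields[tag].append(text)
        elif tag not in ("script", "style"):
            self.fields["text"].append(text)

def extract_fields(html, text_limit=4096):
    """Return a dict of best-effort fields, with visible text capped at ~4K."""
    parser = PageFieldExtractor()
    parser.feed(html)
    fields = parser.fields
    fields["text"] = " ".join(fields["text"])[:text_limit]
    return fields
```

The genuinely hard parts are elsewhere: fetching 1000 pages quickly (async I/O or a process pool across cores), timeouts, retries, and deciding a CATEGORY, which no tag provides and which usually needs a classifier or keyword heuristics on the extracted text.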
Will Playwright be able to get past those anti-robot pop-ups that jump in in the middle of scraping? For example, Walmart has that 'click to verify that you are a human' check.
Playwright has all the tools necessary to imitate user behaviour, including mouse control, which would be helpful in this particular case. Here's some more info on that: bit.ly/3jmN6yA
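To illustrate the mouse-control point: Playwright's `page.mouse` API can move the pointer in interpolated steps and click at coordinates, which looks less robotic than an instant element click. Here is a hedged sketch; the `#verify-checkbox` selector and the function name are hypothetical (real challenge widgets differ per site), and there is no guarantee this alone defeats modern bot detection, which fingerprints far more than mouse movement:

```python
def click_challenge(page, selector="#verify-checkbox"):
    """Move the mouse in small steps to the centre of a (hypothetical)
    verification widget and click it. `page` is a Playwright sync-API Page."""
    box = page.locator(selector).bounding_box()
    if box is None:
        return False  # widget not present or not visible
    cx = box["x"] + box["width"] / 2
    cy = box["y"] + box["height"] / 2
    page.mouse.move(cx, cy, steps=25)  # steps interpolates intermediate moves
    page.mouse.click(cx, cy)
    return True
```

Usage would be `click_challenge(page)` after `page.goto(...)` inside a normal `sync_playwright()` session.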