Web scraping: Reliably and efficiently pull data from pages that don't expect it
- Published Mar 7, 2012
- Asheesh Laroia
Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable.
We'll cover parallel downloading.
started watching this to help me fall asleep, ended up watching the whole damn thing
Thanks for doing a great job making this recording and putting it up so fast.
"String hackery" is how it all started. I personally think it depends on the task at hand.
If tags are added or the order changes, the task of reissuing DOM selectors still remains.
I agree it's more efficient and reliable to use DOM selectors over regular expressions.
But don't misunderstand the power of regular expressions. Regular expressions are heavily used for parser development and string manipulation in general.
It's also considered fundamental enough to cover in computer science curricula.
Thanks a lot for this good job!
Local business extractor
Great info on scraping. Thanks a lot.
I love scraping sites to build a massive content site - pisses the existing competition off no end. I only use it for "fantastic evil"
Learned a lot thanks!
Thanks for the great video. I loved it. :)
This is a really cool talk. Nice to know that an average scraper can live 1 to 10 years. These would be great tools if I were smart enough to use them, but I'm not a programmer, so I have to use Mozenda for scraping. Python and your browser tools look powerful, though.
Nice tutorial; is there an update on how to add SSL certificates with Mechanize yet?
That's pretty cool!
Thanks a lot for this good job!
Greetings, from the world of tomorrow!!
It's 2019 now, and it's absolutely *mind-blowing* that I'm 26 minutes in, in the middle of an interesting segment about web standards, average page sizes and so forth... and there's been no discussion of JS/jQuery or any of the hundreds of client-side JS libraries that are really important parts of the web nowadays.
It prompted me to take a quick look at my code library (which goes back to 2004 for PHP and SQL).
Sure enough, the earliest JS includes I have are from _2007_ (and I was not an 'early adopter' by any means).
By 2009 there were at least two dozen things I included **all the time** - jquery (DUH!); jqplot; tablesorter; chainedselects, wz_tooltips, plus about a dozen things that I wrote myself.
Interesting that the whole of jquery was only 118KB un-minified (as of 9:23 pm on the 6th of August 2009 hahahahaha).
And yes: my 2008Header.php (and 2009Header.php etc) had a DOCTYPE declaration (XHTML 1.0 Strict//EN), an HTML tag, HEAD and BODY tags, and the resulting pages were always compliant... although they did have JS includes in the HEAD tag (LOL).
Because until 2010 the Google web crawler did not understand JS, indexed text had to be static.
I miss the old days.
I can't see the POST and GET requests in the latest version of Chrome. Does anybody know how to see them?
I used Scrapy in our open source subject last year. It was fun scraping websites hahaha
Cool video! I've been using HTTP WebRequests with VB.NET to perform tasks of the same concept. I should learn Python; it seems a lot simpler.
I didn't finish the video yet, but he said that his favorite tool for HTML parsing is lxml, so I'm assuming he doesn't use a jQuery-like library for HTML parsing. I've done a great deal of HTML parsing and I prefer jQuery-like parsing libraries because:
a) The syntax is cleaner. It uses CSS selectors and contains syntactic sugar.
b) It contains familiar (for most people involved in web development) powerful helper methods. The similarity to jQuery is very important, because it's often useful to select elements in the web browser's console, which is much more interactive. Then you can easily port JS jQuery selector code to your app. There are extensions for Chrome and Firefox which automatically inject jQuery into the page, so the console will always have the power of jQuery available to you.
Selecting whatever you need in the console first saves a lot of time, at least it does for me.
In Python there's PyQuery (which uses lxml under the hood); in C# there is CsQuery, which I used a lot, and it was the best tool for the job.
Sergey Shepelev
Yeah, I now know that lxml supports CSS select syntax, but it doesn't support all of jQuery's syntax ("jQuery selector extensions" etc.). PyQuery still uses fast lxml underneath, but it has all the necessary extensions in place to make life easier.
Also instead of capturing and viewing requests in the browser itself, I prefer to use a debugging proxy. ZaProxy is great and it has the best websockets support.
Also, Selenium has some nifty selection syntax like driver.find_element(By.ID, "idvalue")
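For readers without lxml or PyQuery installed, the element-selection workflow this thread describes can be sketched with just the standard library. This is a minimal sketch using `xml.etree.ElementTree`'s limited XPath subset (it requires well-formed markup, unlike lxml's forgiving HTML parser); the markup here is invented for illustration:

```python
import xml.etree.ElementTree as ET

# A well-formed snippet standing in for a fetched page.
html = (
    "<html><body><ul>"
    "<li class='item'>one</li>"
    "<li class='item'>two</li>"
    "<li>noise</li>"
    "</ul></body></html>"
)

root = ET.fromstring(html)
# ElementTree supports a small XPath subset, including attribute filters.
items = [li.text for li in root.findall(".//li[@class='item']")]
print(items)  # ['one', 'two']
```

Real pages are rarely well-formed XML, which is exactly why the thread reaches for lxml/PyQuery; this only shows the selector idea.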
My favorite tool for HTML Parsing in Python is HTLMParsing.
Kappa.
Quite good information.
So in the first Matrix movie when we first see Neo, is he web scraping the entire net for Morpheus?
LOL
It works only with "Massive Attack - Dissolved Girl" music
1337 101 type ‘ish weaved all up in this ‘ma. good presentation.
Which GitHub repository is it?
Does anyone know how to scrape YT sub counts from the Social Blade site with Python? I just want the number, not any other content.
Reminds me of the time that I built a scraper into my website that siphoned millions of products from a competitor.
Will this still work with the new redesign of Craigslist's housing section?
Hi... Hope you are alive... Resilient spiders are difficult to write but yeah it's possible
@@tejaszarekar9145 hey, I hope you are alive too!
I also use requests and ipython, but I prefer BeautifulSoup (which uses html5lib and lxml underneath)
TOP ASHEESH ON THE WEB
lol. I just checked... he's still on top
Hey, can you return HTML that uses POST data?
thanks for sharing NextDayVideo
Greaaat
Hey does that text to speech API still work?
Do you know any scrapers for NodeJS? It would be really cool :)
cool
How can I log into a website to do scraping? Or maybe use a browser like Chrome to scrape, since Chrome automatically logs in to the site I want to scrape without the login prompt. Or Mozilla.
Easily. Just use WebDriver; I believe the library is called Selenium. My recommendation is to use Firefox. WebDriver supports several browsers, and Firefox is the best browser to scrape with IMO. WebDriver is a library that controls a real browser: it allows you to get the HTML of the page, run JS there, and simulate all possible user interactions with a browser. With it you can either:
a) login by filling input field and sending enter key/pressing submit button
b) login yourself and run a script you want
I like to separate code that gets pages from code that parses them as much as I can. That way I can easily change implementation for how I get data. I can develop a quick prototype using webdriver which will use a real browser and then if I need to scale it I simulate the browser-server interaction in the code and remove the webdriver dependency.
I did most of my web scraping in C# and I'm still learning Python, and I didn't finish this video yet, so I'm not sure if it talked about cookies or not. But basically, to log in you just need to get cookies from a site and send them with each subsequent request. If you want to be really creative, you can log in using a real browser (through WebDriver), which will bypass some tricks that sites sometimes use to make your life harder; then you can ask WebDriver to give you the cookies for that site, store them, and use more efficient scraping tools with those cookies.
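The cookie-reuse idea described above can be sketched with only the Python standard library: a cookie-aware opener stores whatever cookies the login response sets and re-sends them automatically. The URLs and credentials are hypothetical, so the network calls are left commented out:

```python
import http.cookiejar
import urllib.request

# All cookies set by responses fetched through `opener` land in `jar`
# and are attached automatically to later requests to the same site.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical usage (real network calls, so commented out here):
# opener.open("https://example.com/login", data=b"user=me&pass=secret")
# page = opener.open("https://example.com/members").read()
```

The same jar can later be handed to faster tooling, which is the "log in with a browser, scrape with something lighter" approach the comment describes.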
I want to shift my website from vinafix.com to bioslaptop.com
but I have a little issue: the previous website contains a huge amount of data (BIN, ZIP, PDF)
that cannot be downloaded manually, so I need a web scraper that can do the job for me.
Help me.
I want to create a web scraping tool that will get the most recent information about soccer in my country, like soccer matches, events, etc. Can anyone please help me get started?
Well, first of all, you should tell us what you mean by a web scraping tool.
What does it do? It scrapes... OK, but does it crawl? By crawling, I mean scraping URLs to feed the scraper and automate the whole scraping process.
Do you want only data, or do you want images and files too?
Is the data in pure HTML or in JavaScript?
Those are the questions you should ask yourself.
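The crawl-vs-scrape distinction above boils down to extracting URLs and feeding them back into the fetch queue. A minimal link extractor using only the standard library's `html.parser`; the anchor markup is invented for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values so a crawler can feed them back into its queue."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a href="/match/1">Match 1</a> <a href="/news">News</a>')
print(parser.links)  # ['/match/1', '/news']
```

A crawler would resolve these against the page's base URL and push the ones it hasn't visited onto its queue.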
i leave you my email: eric5037@hotmail.com
Why Python and not PHP?
Are there any Python libraries similar to "PHP Simple HTML DOM Parser"?
It's just like jQuery, but in PHP.
"The Power Of The Dream"........ Is it Still Alive?
Could you help me with listing groups of friends on Facebook? I was able to log in without a token, but we can't get as far as listing groups of friends... Thank you in advance
Please try fscraper.com/ for Facebook scraping.
+App Hero Thank you
The structure of the Cepstral page has changed. This makes the code provided unusable.
at 23:30 the Quirks v. Standards mode question comes up - from 90% (Quirks) in 2012, we're way down to below 25% now. That's a really good thing!
try.powermapper.com/Stats/QuirksMode
Great content, not-so-great speaker.
He invented the facebook!
Trying to get a 200 response from sites is the hardest part :(
Send a User-Agent header like Mozilla/5.0
Even with user agents like Mozilla or Chrome or IE, you will still get blocked. Try proxy rotation to build a pretty robust scraper.
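Setting a browser-like User-Agent with the standard library looks like this; the URL and User-Agent string are placeholders, and proxy rotation (swapping the outgoing proxy per request) is omitted:

```python
import urllib.request

# Many sites reject urllib's default User-Agent, so impersonate a browser.
req = urllib.request.Request(
    "https://example.com/page",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
# urllib normalizes stored header names to Capitalized form.
print(req.get_header("User-agent"))  # Mozilla/5.0 (X11; Linux x86_64)
```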
Y U NO USE SCRAPY?
Cepstral page doesn't exist anymore
Use Perl.
16:05
He doesn't know about backreferences in regular expressions? \1 will fix his problem at 20 min. But I agree, regexes are pretty useless for scraping.
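The backreference trick mentioned here: capture the opening quote in a group, then require the same character again with \1, so single- and double-quoted attribute values both match. A small sketch (the href pattern is just for illustration):

```python
import re

# Group 1 captures whichever quote opened the value; \1 demands the
# same quote closes it.
pat = re.compile(r"""href=(['"])(.*?)\1""")

print(pat.search('<a href="page.html">').group(2))  # page.html
print(pat.search("<a href='page.html'>").group(2))  # page.html
```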
hotkeys
"hmm"
0:06 LMFAO look how he waves... that's funny as fawk!
RABBIT SEASON
DUCK SEASON
RABBIT SEASON
Nice sleepy video presentation. But when you say regular expressions are not preferred, that's simply not true. You claim the single and double quotes cause problems.
Well, the fact is, if you are a bot or scraper, you have to pre-tokenize the HTML string before scraping it, to make minor adjustments and clean up mistakes and/or quotes, much the same as a browser engine does, so it's ready for your scraper beforehand. That is why we use st = re.sub("\'s", 's', st) to filter out word->'s
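The normalization step the comment describes, run on an invented sample string:

```python
import re

st = "John's book and Mary's pen"
# Strip the apostrophe from possessives before further tokenizing.
st = re.sub(r"'s", "s", st)
print(st)  # Johns book and Marys pen
```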
Slides: www.asheesh.org/pub/pycon-2010/scraping-slides.pdf
Or you could just learn XSLT.
++
WTF did he just say???
The page doesn't exist anymore :P Maybe it's because they've had an unusual amount of scraper traffic going to them LOL
Cut the crap and come to the point.
Regular expressions are useless for page scraping? Seriously?
Regex will cause the world to end if used to parse HTML.
There are very good frameworks; no need for regex in web scraping...
cool