Web scraping: Reliably and efficiently pull data from pages that don't expect it
- Published Mar 7, 2012
- Asheesh Laroia
Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable.
We'll cover parallel downloading.
started watching this to help me fall asleep, ended up watching the whole damn thing
Thanks for doing a great job making this recording and putting it up so fast.
"String hackery" is how it all started. I personally think it depends on the task at hand.
If tags are added or the order changes, the task of reissuing DOM selectors still remains.
I agree it's more efficient and reliable to use DOM selectors over regular expressions.
But don't misunderstand the power of regular expressions. Regular expressions are heavily used for parser development and string manipulation in general.
It's also considered fundamental enough to cover in computer science curricula.
Thanks a lot for this good job!
Local business extractor
Great info on scraping. Thanks a lot.
I love scraping sites to build a massive content site - pisses the existing competition off no end. I only use it for "fantastic evil"
Learned a lot thanks!
Thanks for the great video. I loved it. :)
This is a really cool talk. Nice to know that an average scraper can live 1 to 10 years. These would be great tools if I were smart enough to use them, but I'm not a programmer, so I have to use Mozenda for scraping. Python and your browser tools look powerful, though.
Nice tutorial; is there an update on how to add SSL certificates with Mechanize yet?
That's pretty cool!
Thanks a lot for this good job!
Greetings, from the world of tomorrow!!
It's 2019 now, and it's absolutely *mind-blowing* that I'm 26 minutes in, in the middle of an interesting segment about web standards, average page sizes and so forth... and there's been no discussion of JS/jQuery or any of the hundreds of client-side JS libraries that are really important parts of the web nowadays.
It prompted me to take a quick look at my code library (which goes back to 2004 for PHP and SQL).
Sure enough, the earliest JS includes I have are from _2007_ (and I was not an 'early adopter' by any means).
By 2009 there were at least two dozen things I included **all the time** - jquery (DUH!); jqplot; tablesorter; chainedselects, wz_tooltips, plus about a dozen things that I wrote myself.
Interesting that the whole of jquery was only 118KB un-minified (as of 9:23 pm on the 6th of August 2009 hahahahaha).
And yes: my 2008Header.php (and 2009Header.php etc) had a DOCTYPE declaration (XHTML 1.0 Strict//EN), an HTML tag, HEAD and BODY tags, and the resulting pages were always compliant... although they did have JS includes in the HEAD tag (LOL).
Because until 2010 the Google web crawler did not understand JS, indexed text had to be static.
I miss the old days.
I can't see the POST and GET requests in the latest version of Chrome. Does anybody know how to see them?
I used Scrapy in our open source subject last year. It was fun scraping websites hahaha
Cool video! I've been using HTTP WebRequests with VB.NET to perform tasks of the same concept. I should learn Python; it seems a lot simpler.
I didn't finish the video yet, but he said that his favorite tool for HTML parsing is lxml, so I'm assuming he doesn't use a jQuery-like library for HTML parsing. I've done a great deal of HTML parsing and I prefer jQuery-like parsing libraries because:
a) The syntax is cleaner. It uses CSS selectors and contains syntactic sugar.
b) It contains familiar (for most people involved in web development) powerful helper methods. The similarity to jQuery is very important, because it's often useful to select elements in the web browser's console, which is much more interactive. Then you can easily port JS jQuery selector code to your app. There are extensions for Chrome and Firefox which automatically inject jQuery into the page, so the console will always have the power of jQuery available to you.
Selecting whatever you need in the console first saves a lot of time, at least it does for me.
In Python there's PyQuery (which uses lxml under the hood); in C# there is CsQuery, which I used a lot, and it was the best tool for the job.
Sergey Shepelev
Yeah, I now know that lxml supports CSS select syntax, but it doesn't support all of jQuery's syntax ("jQuery selector extensions" etc.). PyQuery still uses fast lxml underneath, but it has all the necessary extensions in place to make life easier.
Also instead of capturing and viewing requests in the browser itself, I prefer to use a debugging proxy. ZaProxy is great and it has the best websockets support.
Also, Selenium has some nifty selection syntax like driver.find_element(By.ID, "idvalue")
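For readers without lxml or PyQuery installed, the element-selection workflow this thread describes can be sketched with just the standard library. This is a minimal sketch using `xml.etree.ElementTree`'s limited XPath subset (it requires well-formed markup, unlike lxml's forgiving HTML parser); the markup here is invented for illustration:

```python
import xml.etree.ElementTree as ET

# A well-formed snippet standing in for a fetched page.
html = (
    "<html><body><ul>"
    "<li class='item'>one</li>"
    "<li class='item'>two</li>"
    "<li>noise</li>"
    "</ul></body></html>"
)

root = ET.fromstring(html)
# ElementTree supports a small XPath subset, including attribute filters.
items = [li.text for li in root.findall(".//li[@class='item']")]
print(items)  # ['one', 'two']
```

Real pages are rarely well-formed XML, which is exactly why the thread reaches for lxml/PyQuery; this only shows the selector idea.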
My favorite tool for HTML Parsing in Python is HTLMParsing.
Kappa.
Quite good information.
So in the first Matrix movie when we first see Neo, is he web scraping the entire net for Morpheus?
LOL
It works only with "Massive Attack - Dissolved Girl" music
1337 101 type ‘ish weaved all up in this ‘ma. good presentation.
Which GitHub repository is it?
Does anyone know how to scrape YT sub counts from the Social Blade site with Python? I just want the number, not any other content.
Reminds me of the time that I built a scraper into my website that siphoned millions of products from a competitor.
Will this still work with the new redesign of Craigslist's housing section?
Hi... Hope you are alive... Resilient spiders are difficult to write but yeah it's possible
@@tejaszarekar9145 hey, I hope you are alive too!
I also use requests and ipython, but I prefer BeautifulSoup (which uses html5lib and lxml underneath)
TOP ASHEESH ON THE WEB
lol. I just checked... he's still on top
Hey, can you return HTML that uses POST data?
thanks for sharing NextDayVideo
Greaaat
Hey does that text to speech API still work?
Do you know any scrapers for NodeJS? It would be really cool :)
cool
How can I log into a website to do scraping? Or maybe use a browser like Chrome to scrape, since Chrome automatically logs in to the site I want to scrape without the login prompt. Or Mozilla.
Easily. Just use WebDriver; I believe the library is called Selenium. My recommendation is to use Firefox. WebDriver supports several browsers, and Firefox is the best browser to scrape with IMO. WebDriver is a library that controls a real browser: it allows you to get the HTML of the page, run JS there, and simulate all possible user interactions with a browser. With it you can either:
a) login by filling input field and sending enter key/pressing submit button
b) login yourself and run a script you want
I like to separate code that gets pages from code that parses them as much as I can. That way I can easily change implementation for how I get data. I can develop a quick prototype using webdriver which will use a real browser and then if I need to scale it I simulate the browser-server interaction in the code and remove the webdriver dependency.
I did most of my web scraping in C# and I'm still learning Python, and I didn't finish this video yet, so I'm not sure if it talked about cookies or not. But basically, to log in you just need to get cookies from a site and send them with each subsequent request. If you want to be really creative, you can log in using a real browser (through WebDriver), which will bypass some tricks that sites sometimes use to make your life harder; then you can ask WebDriver to give you the cookies for that site, store them, and use more efficient scraping tools with those cookies.
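The cookie-reuse idea described above can be sketched with only the Python standard library: a cookie-aware opener stores whatever cookies the login response sets and re-sends them automatically. The URLs and credentials are hypothetical, so the network calls are left commented out:

```python
import http.cookiejar
import urllib.request

# All cookies set by responses fetched through `opener` land in `jar`
# and are attached automatically to later requests to the same site.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical usage (real network calls, so commented out here):
# opener.open("https://example.com/login", data=b"user=me&pass=secret")
# page = opener.open("https://example.com/members").read()
```

The same jar can later be handed to faster tooling, which is the "log in with a browser, scrape with something lighter" approach the comment describes.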
I want to shift my website from vinafix.com to bioslaptop.com
but I have a little issue: the previous website contains a huge amount of data (BIN, ZIP, PDF)
that cannot be downloaded manually, so I need a web scraper that can do the job for me.
Help me.
I want to create a web scraping tool that will get the most recent information about soccer in my country, like soccer matches, events, etc. Can anyone please help me get started?
Well, first of all, you should tell us what you mean by a web scraping tool.
What does it do? It scrapes... OK, but does it crawl? By crawling, I mean scraping URLs to feed the scraper and automate the whole scraping process.
Do you want only data, or do you want images and files too?
Is the data in pure HTML or in JavaScript?
Those are the questions you should ask yourself.
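The crawl-vs-scrape distinction above boils down to extracting URLs and feeding them back into the fetch queue. A minimal link extractor using only the standard library's `html.parser`; the anchor markup is invented for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values so a crawler can feed them back into its queue."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a href="/match/1">Match 1</a> <a href="/news">News</a>')
print(parser.links)  # ['/match/1', '/news']
```

A crawler would resolve these against the page's base URL and push the ones it hasn't visited onto its queue.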
i leave you my email: eric5037@hotmail.com
Why Python and not PHP?
Are there any Python libraries similar to "PHP Simple HTML DOM Parser"?
It's just like jQuery, but in PHP.
"The Power Of The Dream"........ Is it Still Alive?
Could you help me with listing groups of friends on Facebook? I was able to log in without a token, but we can't get as far as listing groups of friends... Thank you in advance
Please try fscraper.com/ for Facebook scraping.
+App Hero Thank you
The structure of the Cepstral page has changed. This makes the code provided unusable.
at 23:30 the Quirks v. Standards mode question comes up - from 90% (Quirks) in 2012, we're way down to below 25% now. That's a really good thing!
try.powermapper.com/Stats/QuirksMode
Great content, not-so-great speaker.
He invented the facebook!
Trying to get a 200 response from sites is the hardest part :(
Send a User-Agent header like Mozilla/5.0
Even with user agents like Mozilla or Chrome or IE, you will still get blocked. Try proxy rotation to build a pretty robust scraper.
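Setting a browser-like User-Agent with the standard library looks like this; the URL and User-Agent string are placeholders, and proxy rotation (swapping the outgoing proxy per request) is omitted:

```python
import urllib.request

# Many sites reject urllib's default User-Agent, so impersonate a browser.
req = urllib.request.Request(
    "https://example.com/page",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
# urllib normalizes stored header names to Capitalized form.
print(req.get_header("User-agent"))  # Mozilla/5.0 (X11; Linux x86_64)
```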
Y U NO USE SCRAPY?
Cepstral page doesn't exist anymore
Use Perl.
16:05
He doesn't know about backreferences in regular expressions? \1 will fix his problem at 20 min. But I agree, regexes are pretty useless for scraping.
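The backreference trick mentioned here: capture the opening quote in a group, then require the same character again with \1, so single- and double-quoted attribute values both match. A small sketch (the href pattern is just for illustration):

```python
import re

# Group 1 captures whichever quote opened the value; \1 demands the
# same quote closes it.
pat = re.compile(r"""href=(['"])(.*?)\1""")

print(pat.search('<a href="page.html">').group(2))  # page.html
print(pat.search("<a href='page.html'>").group(2))  # page.html
```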
hotkeys
"hmm"
0:06 LMFAO look how he waves... that's funny as fawk!
RABBIT SEASON
DUCK SEASON
RABBIT SEASON
Nice sleepy video presentation. But when you say regular expressions are not preferred, that's simply not true. You claim the single and double quotes cause problems.
Well, the fact is, if you are a bot or scraper, you have to pre-tokenize the HTML string before scraping it, to make minor adjustments and clean up mistakes and/or quotes, much the same as a browser engine does, so it's ready for your scraper beforehand. That is why we use st = re.sub("\'s", 's', st) to filter out word->'s
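The normalization step the comment describes, run on an invented sample string:

```python
import re

st = "John's book and Mary's pen"
# Strip the apostrophe from possessives before further tokenizing.
st = re.sub(r"'s", "s", st)
print(st)  # Johns book and Marys pen
```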
Slides: www.asheesh.org/pub/pycon-2010/scraping-slides.pdf
Or you could just learn XSLT.
++
WTF did he just say???
The page doesn't exist anymore :P Maybe it's because they've had an unusual amount of scraper traffic going to them LOL
Cut the crap and come to the point.
Regular expressions are useless for page scraping? Seriously?
Regex will cause the world to end if used to parse HTML.
There are very good frameworks; no need for regex in web scraping...
cool