Types of PDF - Computerphile

  • Published 17 Jun 2021
  • "Just send me a PDF!" - but what kind of PDF? As Professor Brailsford explains, PDF is simply a wrapper which can contain a variety of joys!
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

COMMENTS • 396

  • @isaac10231 3 years ago +798

    Life goal - finding something to be as passionate in life as this man is about crispy text.

    • @skuzzbunny 3 years ago +15

      crispy text is the best!!!!!D

    • @unlokia 3 years ago +21

      CRISP, *_not_* "crispy". This is a silly error that seems to be propagating net-wide +as usual we can blame the yanks!!+
      A brand of creme donuts' products are named "crispy", images and text are *CRISP!!*

    • @CJT3X 3 years ago +8

      @@unlokia no need to be so crispy ‘bout it

    • @DryPaperHammerBro 3 years ago +1

      @@skuzzbunny {o{obi,l. K.l k I 98xd

    • @kokoinmars 3 years ago

      Crispy text is nothing to scoff about.

  • @martinbean 3 years ago +455

    Imagine saying something as innocuous as “I’ll send you a PDF” to this guy and then getting a 2-hour lecture in response…

    • @FriedEgg101 3 years ago +20

      Maybe you could cut the lecture short by following up with "it'll be PDF Normal".

    • @erwinmulder1338 3 years ago +17

      Professor Brailsford can lecture me all day.

    • @michaeldamolsen 3 years ago +7

      That would be the best day of the month for sure!

    • @swiftfox3461 3 years ago +4

      I'd listen closely and turn off my phone to make sure I didn't miss anything.

    • @amicaaranearum 3 years ago +6

      Professor Brailsford definitely made this video in response to receiving a low-quality PDF scanned from a photocopy.

  • @StraightOuttaJarhois 3 years ago +662

    What PDF says to me isn't quality, but uniformity, as in it'll look the same no matter what device or software you're using to view it, even if it's a sheet of paper instead of a screen. (I know this isn't actually the case, but as I understand it, it's how it _should_ work.) So when I get a PDF, I trust that each line and character is exactly where it's supposed to be, and not shifted due to text reflow or different fonts or whatever. From that perspective it doesn't matter if it's using razor sharp vectors or blocky bitmaps.

    • @max15half 3 years ago +54

      Well, you could be reasonably sure that a bitmap will not misplace your lines and characters.

    • @StraightOuttaJarhois 3 years ago +18

      @@max15half Sure, but there are other qualities of bitmaps that make them less than ideal for text. PDF has the same advantages as other document formats while feeling more trustworthy than, say, a .doc or a .html, even if they're not always used to the fullest.

    • @Platoqp 3 years ago +6

      I think that is how it started too. That said, if a professor asks for a PDF, it is a decent implication for some layout

    • @hirmuolio 3 years ago +11

      @@max15half But how are those bitmaps viewed by the receiver?
      Numerically ordered images, but the reader tries to open them in alphabetical order, size order or age order (whatever is the default in their image viewer).
      Varying image sizes, and the image viewer scales them in stupid ways.
      PDF is still a good system even if the content is just bitmaps. It keeps them all in the correct scale and order.

    • @ccreutzig 3 years ago +8

      @@hammerhals These days, not everything in PDF is "statically linked." Many PDF viewers, including Acrobat, have a JavaScript engine, and for the modern type of PDF forms, where you may be able to add table rows etc., you kind of need that.
      That in turn means some people embed code in their PDF to, say, render animations etc.

  • @sedawk 3 years ago +280

    “I asked someone to send me a PDF and all I got was this lousy bit map” - would make a great t-shirt.

    • @SomethingUnreal 3 years ago +30

      Complete with blocky JPEG artifacts all around the text, of course!

    • @frankharr9466 3 years ago +5

      Don't tempt me.

    • @naughtiusmaximus789 2 years ago

      Grand Theft Auto : Vice City 100% completion reward

  • @StevenSeiller 3 years ago +86

    🤓me before video: "Finally time to learn the differences between PDF/X, PDF/E, and PDF/A!"
    🤷‍♂️me after video: "Where is PDF(FTG), PDF(I), or PDF(I+HT) in my Adobe Save As...???"

  • @greatquux 3 years ago +182

    Brailsford’s eyesight is better than mine, he can use xterm at the default font size!

  • @mastertacosmith 3 years ago +85

    This man needs a 40” ultrawide so he can truly enjoy a good typeface at scale

  • @ToSMaster12345 3 years ago +48

    I was smiling in total bliss throughout the video! Finally I feel understood!
    This is the reason why I write all my documents in LaTeX and use vector images for figures that have embedded text! So that even the scalebar and axis labels in my plots can be selected or searched via text!
    Reject Bitmap! Embrace PDF-FTG! :D

    • @carlosmspk 1 year ago +2

      I mean, anyone with an academic background would understand you

  • @thuokagiri5550 3 years ago +89

    How much we missed prof Brailsford

  • @mikefochtman7164 3 years ago +15

    Reminded of a similar issue we had with old mechanical, piping, and electrical drawings, the kind that were literally 'blueprints'. They had been photographed onto microfiche and the originals worn out/lost. We took the microfiche cards and had them scanned (causing even more loss of quality).
    Then a team of graphics artists would import the scanned image as 'background' into a modern drafting tool and literally 'trace' over each marking on the original. This basically re-drew the drawings using the scanned background image as the template. The final step was to 'hide' the background and voila! A modern, vector drawing that was searchable and could be manipulated with modern tools. If anyone suspected a mistake in the redrawing, we would 'unhide' the background to look at the scanned image, or even go back to the microfiche (we kept a 30-year-old viewer on hand).
    I forget how much that cost, but it was about 3 graphics artists working over a year to do several hundred drawings. :(

  • @1337Unlucky 3 years ago +64

    He clearly has strong views on PDFs. It's funny because it reminds me of myself explaining formats for photography and how to preserve quality. God, I hate it when people send photos via social media without using .zip or .rar and all the photos get ultra compressed.
    It's not only about photos and not only about PDFs, I understand the man, it's about PRESERVATION. The world needs to understand better formats and ways to preserve content. I just love this man.

    • @ZaneDaMagicPufferDragon 3 years ago +3

      💯 Preservation!!! I’m a Preservationist At Heart ❤️😉

    • @LordMegatherium 3 years ago +6

      If it's about preservation then rar should be out of the picture because it's a closed format. It's unlikely that we won't be able to open them in 50+ years especially since we have a libre decompression implementation but the point still stands.

    • @Entertainment- 1 year ago

      That's why I love Telegram: it does the compression too, but it also allows you to send pictures, or any file for that matter, in its original size.

  • @nikolayrayanov2895 3 years ago +9

    This is gold. I've tried to explain to people at work about different types of PDFs for years.

  • @IIARROWS 3 years ago +245

    I got worse: an Excel sheet with a picture pasted inside it.
    And not a picture of a table, a screenshot of the application I was working on.

    • @olik136 3 years ago +16

      my architectural software has a library folder with a drawing file that contains a screenshot of that library folder telling you that certain files are hidden and can only be found with windows explorer...

    • @recklessroges 3 years ago +2

      I'll send you a screen-shot of that in an HTML email ;-) /s

    • @david.mcmahan 3 years ago +12

      I once had a client take a screenshot of their full desktop (with an opened PDF among many windows), paste it into a Word doc., crop it down to just a signature graphic, and then scale it back up because the signature was too small. This was their method of "extracting" the signature image from a PDF.
      Fair enough, but it was because they wanted the version of the signature we had already cleaned up to look better in print.

    • @JNCressey 3 years ago +2

      @@david.mcmahan, can whoever they give the Word document to tell Word to show the full image to see everything they had open in the screenshot?

    • @david.mcmahan 3 years ago +5

      @@JNCressey Yes, I could see everything they had opened on the screen. There was nothing bad, but it could have been a security incident.

  • @jlivewell 3 years ago +17

    Every time I watch a video by Dr. Brailsford, PhD, I add a new life regret... that I didn’t meet him when I was 17 and learn everything from him.

    • @jackkraken3888 2 years ago

      With someone like him you can never learn everything.

  • @drskelebone 3 years ago +8

    I'm in a completely different field, and when the Professor states "if you want a straight line, you just say Line()" he is 100% talking to my soul and speaking the truth I have wanted to shout into so many faces.
    ty!

  • @noferblatz 3 years ago +4

    This professor is positively the best you feature. His enthusiasm and his ability to explain complex technical concepts in a simple way are unmatched.

  • @kasamikona 2 years ago +3

    Prof Brailsford you're a very brave man pronouncing PNG as "ping" around these parts...

  • @Sam-th4jl 3 years ago +1

    i think i could listen to him talk about literally anything and find it interesting just because of his delivery

  • @m47h4r 2 years ago +1

    This was a joy to watch! I respect people like him very much. Being genuinely interested in something and actually putting the time in to learn about its ins and outs. Never mind the fact that he uses Linux with a bunch of open terminals, that's just the cherry on top!

  • @balmar3 3 years ago +10

    Yesss! Professor is using Alpine, one of the best emailers out there. You should make some videos on the awesome power of terminal-based utilities.

  • @RhinoBlindado 3 years ago +3

    Prof B looking quite dapper today. Loved the video!

  • @deansundquist9601 3 years ago

    The strive for excellence in typesetting is very noble. As always, thanks for the wonderful content Prof. Brailsford.

  • @TheAstronomyDude 3 years ago +31

    How does post office OCR work? Sorting centers read the address off an envelope in a fraction of a second and they've been doing it for decades; long before Adobe.

    • @666Tomato666 3 years ago +32

      fundamentally the same technology, but they have the benefit that the address is highly redundant; can't read the full postcode? check the city and street name

    • @bluedeath996 3 years ago +15

      Combined with a very standardised way to format addresses. There is also a "lost letter" centre where a person decodes things the OCR can't read, but newer tech is better at the job.

    • @the_lenny1 3 years ago +2

      @@666Tomato666 yeah, and on top of that the most important information is the postcode, which is only numbers.

  • @YingwuUsagiri 3 years ago +16

    As someone in an administrative job when someone says send me a PDF they mean "any quality yet not easily edited". Invoices for example are never allowed to be easily editable like Word or Excel (and yes that happens often enough). If they want infinitely scalable they'll ask for a Vector and if they want something that's super sharp made in InDesign etc. they'll ask for an INDD. In my almost decade of working in administrations PDF just means can't be edited (easily, because I am very well aware that you still can somehow).

    • @Starguy256 3 years ago +1

      I edit PDFs every day in my work. Sometimes our software prints the wrong thing and instead of going in and trying to fix it, just edit it on the PDF before you send it. As long as it's FTG (as anything not produced by a photocopier should be) you just hit "Edit PDF" in Acrobat.

    • @lawrencedoliveiro9104 2 years ago

      The irony is that using vector graphics and actual text objects makes it easier to edit the PDF file. The hardest type to edit is the one where every page is a bitmap.

  • @harshjinger 3 years ago +7

    Thanks... I rely on open source information to learn about computer based things that occurred even before I was born.
    Recently, I was looking into this exact question for a project of my own, and this is a perfect resource.
    I have never used Adobe's official software; being a novice undergrad student, besides being broke, this serves as a great reference.
    Thanks a lot again...

  • @Richardincancale 3 years ago +12

    Do you remember desk-top search engines? I used to test them by hiding the word ‘marmalade’ in a PowerPoint in a zip file to test their ability to find and index text :-)

    • @ShankarSivarajan 3 years ago

      Did that work?

    • @CJT3X 3 years ago +1

      You mean like an early version of Spotlight/Alfred?

    • @Richardincancale 3 years ago

      @@CJT3X I recall that both AltaVista and Google had desktop indexing tools. Yes, it worked and found my hidden marmalade!

    • @Richardincancale 3 years ago

      @@ShankarSivarajan Yup

  • @mickjames73 3 years ago +3

    PDF variability is very frustrating for blind or low-vision people. You would often receive a document or instruction manual which was rendered as an image only, and we used to have to print, rescan and OCR them (often quite tricky with complex page layouts). Luckily there is now a fairly accurate built-in OCR engine in things like Acrobat Reader. The other issue with PDF variation is that many PDFs don't conform to accessibility standards and thus become unusable, or difficult to use, when viewed with accessibility features turned on.

    • @Jebusankel 3 years ago

      I was frustrated recently that my auto insurance documents are all in bad bitmap PDF format. But if I complain to them and claim to be blind, I think they'll have some follow up questions. 😜

  • @squishmastah4682 3 years ago +12

    "[PDF] covers a multitude of sins."
    Yes. Especially at Hustler Magazine.

  • @okusa7750 2 years ago +2

    Feel like David Attenborough just lectured me about the types of PDF. Amazing passionate storyteller

  • @magacacciari3565 3 years ago

    Huge fan of Professor B and his computer lores.

  • @geirtwo 11 months ago

    I wish this channel had more satisfying visuals.

  • @PhilReynoldsLondonGeek 3 years ago +55

    The only real *problem* with PDF is that many organisations provide you with their forms as images. If they could be done as proper forms it would be far easier to actually use them.

    • @turpialito 3 years ago +14

      But isn't it that it's not actually a PDF problem, but rather people not using the proper PDF generator; in this case Adobe Forms (which AFAIR is bundled with Acrobat)?

    • @ophello 2 years ago +2

      This isn’t a problem with PDF. It’s a problem with organizations.

  • @JNCressey 3 years ago +19

    Some interesting weird things I've encountered with PDFs:
    1. I remember some time last year I copied a JPEG out of a PDF container and found it had a slightly different format than regular JPEGs. I think normal JPEGs have the word "JFIF" at the beginning of the file, but I think this had something else, maybe "ADOBE", though I don't exactly remember; it could have been a different word.
    2. Just today I found out there are two options to save a PDF from Microsoft Edge: "Save as PDF" vs "Microsoft print to PDF". The "Microsoft print to PDF" option produced a file that was significantly larger and slower to load when viewing.
    3. Some PDFs I've seen allow you to search and select text, but don't let you copy or print. I think it's called "secured PDF". I'm not sure why PDF viewers from companies other than Adobe would respect those restrictions. Is there something in the file that fundamentally makes these actions impossible, or does it just ask the program to disallow them?
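
A minimal sketch for point 1 above, for anyone who wants to check an extracted image. A JPEG announces its flavour in APPn segments near the start of the file: JFIF files normally start with an APP0 segment tagged `JFIF`, while JPEGs written by Adobe software commonly carry an APP14 segment tagged `Adobe`. The function name `jpeg_app_markers` is made up for illustration, and the parser ignores edge cases such as fill bytes and truncated files, so treat it as a quick diagnostic and not a full JPEG reader.

```python
import struct

# Identifiers that commonly appear at the start of APPn payloads.
KNOWN_IDS = {b"JFIF\x00": "JFIF", b"Adobe": "Adobe", b"Exif\x00": "Exif"}


def jpeg_app_markers(data: bytes) -> list[tuple[str, str]]:
    """List (segment name, identifier) for each APPn segment before the scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    found = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync with the segment structure; stop scanning
        marker = data[pos + 1]
        if marker == 0xDA:
            break  # start of scan: compressed image data follows
        # The two-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if 0xE0 <= marker <= 0xEF:
            ident = "other"
            for prefix, name in KNOWN_IDS.items():
                if payload.startswith(prefix):
                    ident = name
                    break
            found.append((f"APP{marker - 0xE0}", ident))
        pos += 2 + length
    return found


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(jpeg_app_markers(handle.read()))
```

Running it on an image pulled out of a PDF and on a camera JPEG should show whether the two really differ in their application markers.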

    • @neumdeneuer1890 3 years ago +12

      Response to point 3:
      Yes, the PDF just asks nicely to not allow copying. There are no technical restrictions, and there are more than enough programs which ignore such requests.

    • @hanelyp1 2 years ago +1

      And a fair selection of the software you could use to read the open format PDF is open source. If such software did pay attention to a "no copy" flag it would be possible to alter the software to ignore it.
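
To make the "asks nicely" point concrete: in a PDF that uses the standard security handler, the requested restrictions are stored as an integer `/P` in the encryption dictionary, and a viewer decides whether to honour each bit. The sketch below decodes such an integer. The bit positions are given as I recall them from the PDF 1.7 specification and should be checked against it; `decode_permissions` and `PERMISSION_BITS` are illustrative names, not part of any library.

```python
# Bit positions (1 = least significant bit), as recalled from the PDF 1.7
# standard security handler. This is an assumption to verify against the spec.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations",
    9: "fill in form fields",
    10: "extract for accessibility",
    11: "assemble document",
    12: "print at high resolution",
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Decode the /P integer of an encryption dictionary into named flags."""
    unsigned = p & 0xFFFFFFFF  # /P is a signed 32-bit value in the file
    return {
        name: bool((unsigned >> (bit - 1)) & 1)
        for bit, name in PERMISSION_BITS.items()
    }


if __name__ == "__main__":
    for name, allowed in decode_permissions(-3904).items():
        print(f"{name}: {'allowed' if allowed else 'denied'}")
```

Nothing in this decoding enforces anything. The flags only describe what the author requested, which is why software that ignores them can still copy or print.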

  • @TheFakeVIP 3 years ago +3

    I feel it also bears pointing out that correctly typeset text in PDF files that is reproduced from a font, not a bitmap, significantly increases the accessibility of such documents for people who use assistive technologies such as screen readers. PDF files are often ripped to shreds by the blind community for this exact reason. Even correctly produced PDFs that are, for instance, produced from a word processor, often cause problems for screen readers depending on how the text is drawn, and the competency of the software to add accessibility hints where appropriate. A common example of this is text in columns: quite often assistive technologies don't expect this, and so read it linearly (i.e. they read both columns at once). Properly tagging important landmarks such as headings can also be a great help, as screen reader users frequently navigate (or even summarise) a document simply by jumping between headings.

    • @williamchamberlain2263 3 years ago

      Yes

    • @lawrencedoliveiro9104 2 years ago

      DJVU format deals with this by storing searchable text objects which are not rendered, separate from the actual page rendering.
      I think PDF allows this also.

  • @Yupppi 3 years ago +6

    I see new computerphile with prof. Brailsford's face and my week is immediately better. I even got to walk inside his home a little bit this time!
    After seeing bad photocopies of '80s device manuals, I too can get behind their obsession with PDF quality. Even the manufacturer's archives have that poor photocopy, and the original print could've been subpar.

  • @ajayrangishetti5515 3 years ago +7

    Please do a video on explaining Pentium processor architecture, and about how multi-core processor perform out-of-order execution.

  • @Baxtexx 3 years ago +1

    Urg this reminds me of a software I was working on that was consuming pdfs and rebranding them. There were so many edge cases all the time!

  • @henke37 2 years ago +2

    Fun fact: the pdf format is so complex that it literally includes functionality for executing arbitrary shell commands. As a feature.

  • @johnno4127 3 years ago

    The searchable nature of an image with hidden text (or an image with the text replaced by an actual font) is fantastic!
    The vast quantity of extra spaces and line returns can get frustrating when trying to use that OCR text, though. It's also a pain when Adobe puts a random space in the middle of a word or between EACH LETTER and now you can't find what you're looking for.

  • @tjarko72 3 years ago +14

    I always thought that PDF(FTG) was closely related to PostScript; I would have expected a mention of PostScript. More modern, also PDF/A.

    • @ZedaZ80 3 years ago +1

      PostScript is lovely

    • @nezZario 3 years ago

      It is.

  • @jorisschellekens4630 3 years ago

    The way most PDF libraries or programs handle OCR is by something the spec calls "optional content groups".
    Optional content groups allow you to mark any content in the pdf content stream with a particular tag (typically the layer name).
    Programs like Adobe will then show you a listing of all the layers. So you could imagine being able to toggle OCR on and off.

  • @lablnet 3 years ago +1

    Nice, love to see more videos like these

  • @superfluidity 3 years ago +3

    If you can, don't just aim for the highest quality that your audience demands - aim for quality far beyond that. That will give you more freedom to rework the document later if you want to.

  • @Gnsdtc 2 years ago +1

    This is beautiful. The OCR version is PDF I+HT!

  • @jashaswimalyaacharjee9585 3 years ago +1

    I am totally convinced that Prof. Brailsford uses this machine 9:58 as his occasional-use Computer. What Peeping Toms like me can observe, there's Alpine 2.21 (fairly latest software compared to the system)

  • @UncleKennysPlace 3 years ago +2

    My day job is assembling documents in PDF format for aviation certification. It's shocking how many engineers send everything as PDF, even bitmaps, when I know they had to convert them, despite instructions saying we can work with any format that their native applications produce.

    • @bhargavk1515 11 months ago

      Sir, how do I learn PDF format encoding? Any guide?

  • @HugoOneYT 3 years ago +2

    To me PDF is about compatibility, there's a reason why all invoices are PDF, everything can open it

  • @bartas9693 3 years ago +6

    It's ok I'll send you a PDF.

    • @SimGunther 3 years ago

      Yeah, but what? Image, full, text?

  • @MartinOmander 3 years ago

    Excellent video! I have a request for future videos: please consider keeping the camera still if the subject is stationary. The shakycam effect unfortunately made me seasick and distracted from the professor's excellent performance.

  • @delhatton 3 years ago +1

    OCR for pure text. Maybe OK. It will still require editing. OCR for numerical data, like some Excel sheets, by the time you've verified all the numbers, you might as well have retyped it.

  • @b391i 3 years ago

    Awesome as usual 😇

  • @DaimlerSleeveValve 3 years ago +4

    It surprised me that for the last couple of years, Google has been running OCR on the contents of PDFs which contain only images. I've located names mentioned only on signs visible in the backgrounds of pictures of something else.

  • @anarchist 3 years ago +3

    8:40 4:3 monitor because nothing can throttle Brailsford's brain power.
    Not PDF, but something that tickled me when working with TIFFs was the joke that it stands for "Thousands of Incompatible File Formats"

  • @trollhunter200 3 years ago

    You are just awesome Professor.
    👍👍👍

  • @adrianalexandrov7730 1 year ago

    That's kind of how DjVu worked: saving text as a highly detailed foreground and compressing the background. It was a miracle how a scanned book of hundreds of pages could fit into just a few MB

  • @ZaneDaMagicPufferDragon 3 years ago

    PDF FTG FTW 🙌🏻 I LOVE ❤️ PDF AND ITS PROGRESS IS AMAZING 🤩 GREAT VIDEO PROFESSOR 👨🏻‍🏫 BRAILSFORD!!!

  • @Graham_Rule 3 years ago

    The photocopier/scanner at work can scan to PDF/A which generates searchable text by doing OCR. Being internet enabled it can then send a copy by email (possibly bcc'd to Xerox or other third parties without our knowledge).

  • @unlokia 3 years ago

    Prof Brailsford: The font of all PDF knowledge.

  • @AleksyGrabovski 3 years ago +2

    Can you also do a video on DJVU format?

  • @saranchance5650 3 years ago +1

    Pdf has additional accessibility features that the variants you described make possible

  • @jeromethiel4323 3 years ago +1

    I worked for a company, and we had electrical prints that were paper only. We paid a company to generate CAD files of the prints. What they did is insert scans of the paper copy into the CAD software, which isn't what we wanted. They basically screwed us over big time.
    The whole point of having them in CAD format was so that we could edit the bloody things!

  • @Rubrickety 2 years ago

    Fascinating video with perhaps the least clickbaity title in history.

  • @iabervon 3 years ago

    Midway through the video, I was distracted by recognizing that Professor Brailsford uses the same program for email that I do.
    I often solve crossword puzzles that I get as PDFs, and it's interesting to see whether the program that made the PDF put the text of the clues in the logical order that you'd read them, or if it went top to bottom, left to right, ignoring columns.

  • @zombiegeorge749 3 years ago +5

    2:42 what's up with the edges of the screen?

    • @Computerphile 3 years ago +4

      if you read the small text on the "newspaper" it helps explain it a little :) -Sean (basically I rotated it a little to fix my wonky camerawork and missed zooming it in)

  • @SeanBZA 3 years ago

    Also, different types of PDF creator give different file size outputs. Firefox's PDF is massive, often bigger than the original, as it is a PDF of the page as it would be sent to the printer, but the PDF output from Debian is a lot smaller, just a file with the fonts and text, as the original document had.

  • @danielmnet 3 years ago

    If Prof. Brailsford is explaining, I am interested; the subject doesn't matter

  • @MrBoubource 3 years ago +13

    My internship topic is to find the paragraphs containing some keywords in a pdf with 4 different formatting depending on its provider.
    I am beginning to hate it.

    • @DT-dc4br 3 years ago +4

      Might be a job for a Linux shell script with awk / grep & sed

    • @MrBoubource 3 years ago +3

      @@DT-dc4br I went with Python (and regexes) because I'm most familiar with it... But holy, what a mess it is to convert PDF to HTML and plain text..

    • @etziowingeler3173 3 years ago

      Hahaha I can imagine
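
For the keyword task discussed in this thread, a minimal Python sketch of the easy half might look like the following. It assumes the text has already been pulled out of the PDF by some other tool (the hard part the thread complains about) and that paragraphs are separated by blank lines, which will not hold for every provider's layout. The function name `paragraphs_with_keywords` is made up for illustration.

```python
import re


def paragraphs_with_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return the paragraphs of text that contain any keyword as a whole word."""
    if not keywords:
        return []
    # Assumption: a blank line (possibly holding stray spaces) ends a paragraph.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # One case-insensitive alternation; escaping keeps keywords literal.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [p for p in paragraphs if pattern.search(p)]


if __name__ == "__main__":
    sample = "Invoice total is due.\n\nNothing here.\n\nThe TOTAL was paid."
    for paragraph in paragraphs_with_keywords(sample, ["total"]):
        print(paragraph)
```

Each differently formatted source would likely need its own paragraph-splitting rule in front of this, while the matching step can stay the same.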

  • @marsgal42 3 years ago

    In a past life I did a lot of work with PostScript and one product we developed was a PostScript sanitizer that would take any deranged PostScript you threw at it and output well-behaved well-structured PostScript suitable for further processing. We got the idea from generating PDF then printing it to a file with Adobe's PostScript printer driver.

  • @TimothyWhiteheadzm 3 years ago +16

    Expecting a certain quality of content from the pdf format is as ridiculous as expecting quality content on a web page. A container is just that. It can contain flowers, or manure. As for the OCR feature, that is great, but one wonders if that is part of 'pdf' or part of the tool that creates the pdf?

    • @harshjinger 3 years ago

      Idk... About this... I would love to know more... Commenting for any followups

    • @majorgnu 3 years ago +1

      It's a feature of the software that produced the PDF, obviously.
      Even if the format was extended at some point with features that facilitate this kind of use, the file itself still only contains the *result* of the OCR process, which was performed by whatever applications were used to produce it.

    • @drawapretzel6003 3 years ago +1

      Well, it's not in the free version of Adobe Reader, that's for sure.
      There's lots of free OCR software that can OCR a PDF for you, but yes, it's included in the tools of actual PDF creation software too.

    • @HetareKing 3 years ago

      The actual OCRing happens in the creation tool, but this whole notion of having a bitmap overlay invisible text has to be encoded into the file and so the format has to support it. And since this functionality only really makes sense in the context of the OCR feature, I think it's fair to say it's part of "PDF".

    • @JNCressey 3 years ago

      I suppose if the creator of the pdf has a bitmap with text that is obviously unOCRable (maybe stylised text) they would manually add the hidden text, getting the same effect but without OCR.
      Styles that come to mind that OCR wouldn't work well on could be extra objects between the letters (google doodles), people posing in letter shapes (it's fun to stay at the YMCA), drawing just the negative space, bubble text or drawing just the shadows of the text, leaving out lines (E as 3 horizontal lines, A without the horizontal part), or using characters of other alphabets that look similar (like in r/grssk).
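
The bitmap-plus-hidden-text arrangement discussed in this thread can be sketched at the page content level. One common way to encode it, as far as I know, is to paint the page image and then draw the recognised words in text rendering mode 3 (`3 Tr`), which neither fills nor strokes the glyphs, so they are searchable but invisible. The Python below only builds such a content stream as a string; it does not write a complete PDF. The resource names `Im0` and `F1` and the function names are assumptions for illustration and would have to match the page's resource dictionary in a real file.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_overlay(
    image_name: str,
    width: float,
    height: float,
    words: list[tuple[float, float, float, str]],
) -> str:
    """Build a page content stream: full-page image plus invisible text.

    Each entry in words is (x, y, font size, text) in page units.
    """
    lines = [
        "q",                                  # save graphics state
        f"{width:g} 0 0 {height:g} 0 0 cm",   # scale the unit square to the page
        f"/{image_name} Do",                  # paint the scanned bitmap
        "Q",                                  # restore graphics state
        "BT",                                 # begin text object
        "3 Tr",                               # rendering mode 3: invisible text
    ]
    for x, y, size, text in words:
        lines.append(f"/F1 {size:g} Tf")          # hypothetical font resource name
        lines.append(f"1 0 0 1 {x:g} {y:g} Tm")   # absolute position of the word
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")                            # end text object
    return "\n".join(lines)


if __name__ == "__main__":
    print(invisible_text_overlay("Im0", 612, 792, [(72, 720, 12, "Hello (world)")]))
```

Whether the word list comes from an OCR engine or is typed in by hand, as suggested above for stylised lettering, makes no difference to the stream.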

  • @SteveMacSticky 2 years ago

    Very well explained

  • @soccerox817 3 years ago +32

    Exactly why I can't stand it when people just ask for a PDF or send a poorly rendered PDF. Gotta write documents in LaTeX and export a quality PDF

    • @peterwhitey4992 3 years ago +2

      LaTeX is overrated.

    • @miran248 3 years ago +14

      @@peterwhitey4992 Wouldn't say overrated, but maybe overkill in most cases. Something like Markdown should be more than enough for simple stuff (w/o math equations, ..)

    • @peterwhitey4992 3 years ago

      @@miran248 - I know it's practical to write in, but it's the result that I find overrated. You can always tell when a paper/book is written in LaTeX. They all look the same. Especially textbooks written in LaTeX are generally not very good.

    • @Platoqp 3 years ago +1

      @@peterwhitey4992 It is excellent for writings that include mathematics and other scientific formulas

    • @michaelb2047 3 years ago +4

      @@peterwhitey4992 I would say most natural science textbooks are written in LaTeX. You can change everything so you won’t notice that it was actually written with LaTeX. You notice it only if they use the default template / font. Also they are often much cleaner / more consistent than "Word" books, for example.

  • @PhilipStorry
    @PhilipStorry 3 years ago +2

    How do I subscribe to Vague Magazine? If it has high quality reminiscing from Professor Brailsford, then I need a subscription! 😉

  • @Smogshaik
    @Smogshaik 2 years ago

    I would love a video about the PDF/A format!

  • @davidgillies620
    @davidgillies620 2 years ago

    I primarily generate PDFs with pdflatex, using EPS or PNG for embedded graphics, so I get searchable, arbitrary-resolution output. It looks very nice.

  • @PswACC
    @PswACC 3 years ago

    What software on Linux are you using to make the scanned PDF searchable via OCR?

  • @No0utlet
    @No0utlet 3 years ago

    At 2:30, it appears that the video of Prof. Brailsford is overlaying a video of the paper on his table and is rotated a very slight amount. Are there any video editors out there that could explain how that might happen by accident?

  • @LoesserOf2Evils
    @LoesserOf2Evils 3 years ago

    If you can decompose the PDF into the text and the graphics and then recreate them as a word processing document, that can help. Then drop the document into Adobe InDesign for a better and tighter layout. I admit that's a lot of effort, but sometimes it's worth it; and if the PDF standard changes in the future and it's important to produce a version in the new standard, it'll be far easier.

  • @pierreabbat6157
    @pierreabbat6157 3 years ago +1

    Many of my programs output PostScript, which can be converted to PDF. I've seen many PS files get bigger when converted to PDF; I just checked one which is 4.5 times as big in PDF as in PS. I also once wrote a PS file using the random number generator and converted it to PDF. The converted file lost the randomness.
    I'm a surveyor and download maps in PDF from register of deeds sites. The old ones are scanned, of course. But the ones drawn with CAD are, I think, also scanned. They should be taken from the PDF output of the CAD program, except that the signature is written on paper (or clear plastic sheet), which poses a problem. Digitizing the numbers from a printed copy of the plat can result in illegible numbers (is that a 6, an 8, or a 9?).

  • @John_Fx
    @John_Fx 3 years ago +4

    He barely scratched the surface of the complexity of PDF formats. Didn't even cover PDF/A or why you should never redact a PDF and send out that original file.

    • @Jebusankel
      @Jebusankel 3 years ago

      There is a true Redact function in Adobe Acrobat. You just have to use that instead of drawing a box on top.
      Ditto on PDF/A though.

  • @lawrencedoliveiro9104
    @lawrencedoliveiro9104 2 years ago

    12:03 It looks like a scan that has been quantized into a bilevel (black and white only, no greys) bitmap. Those little hairy extensions on the edges are characteristic of that.
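A minimal sketch of the bilevel quantization described in the comment above, assuming a simple fixed threshold. The function name `to_bilevel` and the threshold of 128 are illustrative assumptions; real scanner software may use adaptive or dithered methods instead. It shows how grey edge pixels are forced to pure black or pure white, which is the kind of step that can leave ragged outlines on scanned text.

```python
def to_bilevel(gray, threshold=128):
    """Quantize a grayscale bitmap (rows of 0..255 values) to black/white.

    Each pixel becomes 0 (black) if it is darker than the threshold,
    otherwise 255 (white). No greys survive.
    """
    return [[0 if px < threshold else 255 for px in row] for row in gray]


# A soft, anti-aliased edge going from black to white across one row.
soft_edge = [[0, 60, 130, 200, 255]]

# After thresholding, the gradual ramp collapses to a hard step.
hard_edge = to_bilevel(soft_edge)
```

Pixels close to the threshold can land on either side depending on scan noise, so neighbouring rows of the same edge may step at different columns.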

  • @turpialito
    @turpialito 3 years ago

    Brailsfordphile, Brady. I think it's high time ;)

  • @samuelworsnop9983
    @samuelworsnop9983 3 years ago +3

    I really want to know what Professor Brailsford's favourite font is!

  • @ieperlingetje
    @ieperlingetje 3 years ago

    4:24 Sean often gets camera settings wrong and things come out blurry, so here's an animation to hide that.

  • @Chobungus
    @Chobungus 3 years ago +1

    Can someone clarify for me: when he is going over the "hideously complex mathematical equations" at 9:19, he says that you do not want to have to type that out character by character. Yet he then demonstrates that he is able to zoom in greatly while preserving quality. So how did he translate the bitmap image into that high-quality typeset text?

    • @Computerphile
      @Computerphile  3 years ago +3

      In this case that's exactly what the Prof is working on: recreating this important document page by page, using software similar to what Dennis would have had available. Professor Brailsford talks about it in a recent video, but it has been an almost full-time job for him for a while now! -Sean. P.S. If you look at the two pictures early in this video, you'll see that the version of the thesis Dennis held was damaged, but one his friend had reviewed is OK. The damaged one has amendments, so this is a difficult task!

    • @Chobungus
      @Chobungus 3 years ago

      @@Computerphile Thanks for the reply! Great video!

  • @gedavids84
    @gedavids84 2 years ago

    I just want to say that I'm really glad Professor Brailsford survived covid.

  • @bhargavk1515
    @bhargavk1515 11 months ago

    Can you make a tutorial (or is there a tutorial) on how Prof. Brailsford restored the bitmap PDF into a text-encoded PDF?

  • @xelaxander
    @xelaxander 2 years ago

    What's the software Prof. Brailsford is using? I'd really love to be able to search through some older mathematical books.

  • @Ice_Karma
    @Ice_Karma 2 years ago +1

    Prof. Brailsford, do you still use PINE, or Alpine? =D
    (PINE user since 3.87...)

  • @trollhunter200
    @trollhunter200 3 years ago +2

    Debian with KDE Plasma is the best.

    • @gug1970
      @gug1970 3 years ago +1

      It was oddly satisfying to watch him using KDE on Debian on my Debian box running KDE.

  • @camadams9149
    @camadams9149 2 years ago

    Sounds like people don't know what each file format does.
    1) PDF - I use it exclusively for pages I want bundled together in a single document that always looks the same regardless of the device it is viewed on, OR for a fillable document
    2) PNG - I use it exclusively for a single image that I want to be static in quality and size
    3) JPG - I don't use it
    4) SVG - Like a PNG, but for an image that may need to be resized while retaining quality
    Then again, I don't pay for file editors. So my approach is very much: I want you to be able to use the files natively.

  • @Fre1maurer
    @Fre1maurer 3 years ago

    My first PDF was the manual of the flight simulator game TFX back in 1994; it was the budget re-release without a printed manual. There was Adobe Acrobat Reader for MS-DOS on the game CD, and holy crap, was the quality of the document bad (and the clumsy Reader itself was not much better). They obviously just scanned a real printed manual and saved it as images with something like 4-bit grayscale, and the text sections looked like plain 1-bit black-or-white without any anti-aliasing. I never thought this text for the poor called PDF could be a thing in the future.

  • @tubbdoose
    @tubbdoose 3 years ago

    He has so much passion about PDFs XD

  • @volodyadykun6490
    @volodyadykun6490 3 years ago +4

    4:18 great newspaper

    • @miran248
      @miran248 3 years ago

      .5btc - that's one expensive newspaper :)

    • @klaxoncow
      @klaxoncow 3 years ago

      @@miran248 Or maybe not. Depends how well Bitcoin's doing at the time.
      Virtual currency, yes. Anchored currency, no.

  • @Amonimus
    @Amonimus 3 years ago +1

    To me a PDF is like an archive of multiple images or docs that you can page through.

  • @rudiklein
    @rudiklein 3 years ago

    A great talk, scrolling printer paper and a flashy shirt. What else does a video need?

  • @andrewjc13
    @andrewjc13 2 years ago +1

    I've found PDF-I to be very useful when professors have ridiculous requirements for their assignment format but just say "give me a PDF." Why yes, I'll happily do this assignment in Word and then convert it to the biggest bitmap PDF possible. Here's your 200MB non-searchable PDF, enjoy grading!
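A rough back-of-envelope sketch of how a bitmap-only PDF can reach the size mentioned above. The helper `raw_page_bytes` is hypothetical, and the page size (US Letter), resolution (300 dpi), and colour depth (24-bit) are assumptions chosen for illustration; the figures are for uncompressed pixel data, so a real PDF with compressed images would normally be smaller.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one page stored as a bitmap."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed: US Letter (8.5 x 11 in) scanned at 300 dpi.
colour_page = raw_page_bytes(8.5, 11, 300, 24)   # 24-bit colour
bilevel_page = raw_page_bytes(8.5, 11, 300, 1)   # 1-bit black and white

# Eight uncompressed colour pages under these assumptions.
eight_pages = 8 * colour_page
```

Under these assumptions one colour page is about 25 MB of raw pixels, so eight pages already come to roughly 200 MB, while the same page as a 1-bit bitmap is about 1 MB.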

  • @johnholland7497
    @johnholland7497 3 years ago +1

    I'd love to know which software you used to convert the PDF with just bitmaps into one with searchable text. Is it open source?

    • @igorthelight
      @igorthelight 3 years ago

      I know about "ABBYY FineReader PDF", which is neither open source nor free.
      Maybe there are others.

    • @beakmann
      @beakmann 3 years ago

      There is Tesseract.

  • @UnOrigionalOne
    @UnOrigionalOne 3 years ago +1

    One could argue similar points for video.

  • @ahmetardaedogan6697
    @ahmetardaedogan6697 3 years ago

    Could you explain harris corner detection?

  • @Ziphoroc
    @Ziphoroc 2 years ago

    You missed the most common reason people choose to put things into a PDF. You can put multiple things into the PDF and have it be one single file, allowing you to send all the documents in one neatly organized PDF rather than sending multiple separate files that won't have any order. It's much more convenient to be able to scroll back and forth than to have to switch between multiple windows to get the same information. I wouldn't have finished college if the online textbooks I paid for had come as 350 separate JPEG files in a folder, rather than as a PDF of the entire book that I can scroll through. I'd take the PDF even if, for whatever reason, its images were of even lower quality than the individual pages in JPEG.