I have run into problems with Jupyter notebooks. To avoid them I do gather function definitions and imports within .py modules. Also, it is always a good idea to restart the kernel and run up to the current cell to ensure the context is as intended within the sequence of the actions taken in cells. It is particularly risky to scroll up to a cell and re-run it.
Just want to say that I'm loving using jupyter notebooks within vscode. You can easily connect them to a kernel without running a jupyter server and you can run copilot and other vscode plugins within jupyter, but still get the benefits of data exploration and running the code piece by piece. It appears you are using it already within vscode, but I bet many people are unaware and running it within an anaconda jupyter host.
It's only available for Julia, but Pluto notebooks solve a lot of the problems mentioned in this video - code blocks get reloaded when one of their dependencies changes, making this sort of mistake considerably harder
We build data pipelines using Jupyter Notebooks and we do try to put much of the complex code into Python files that we import. One “gotcha” that still gets me sometimes is that making changes to those Python files does not affect the “global state” even after running the imports again. The kernel needs to be restarted as well.
I found some specific behaviour when you import from your script files into Jupyter. If you change a function in the script file after you start working in the notebook, your change won't have any effect in the notebook. Repeating the import doesn't help; only restarting the Jupyter kernel does.
I encountered the exact same problems you talk about in your video with Jupyter Notebook. Note I am not a data scientist. Same as you, I can use it for deep data exploration that needs visualisation. For simple data exploration I just import data into an SQLite DB, which is more convenient for me since I already know SQL. I still sometimes use jupyter notebook to try some pieces of code since it's more practical and convenient than a simple python console. But I still sometimes run into a problem: I start by just trying some pieces of code, get hooked by the game, and end up with a quite extensive script that is quite messy and that I need to take time to clean if I want to create a reusable script from it (outside of Jupyter Notebook). So I would say Jupyter Notebook can be absolutely great for data science, because of deep data exploration and because some code involving huge datasets or complex processing can take really long to execute in data science. Outside of this it can sometimes be an interesting tool to use to test some short code. But as I am not a data scientist I keep it more as a secondary tool; VS Code is far more convenient for 90% of my work.
With the right extensions, Jupyter can show an image or play audio (it's very good for people dealing with audio processing, since you don't have to create an audio file), etc. It's very good for explanations and analysis.
What's with the "I love god slash design guide" at 0:32? Not sure if that's what you're actually saying (could it be "I love code"?) but the subtitles say "I love god..." 😂
Interesting topic! I view JN as totally useless. But it reminds me of the SQL scripts that I keep in a txt document. I use these typically for analysis of data and they are a little complex to rewrite or remember. So I can see that Jupyter Notebooks would be useful for data analysis. 😊
One of the biggest problems I have with jupyter is its interaction with git. Since it records not only the code but also the metadata and outputs, simply re-running a cell will lead to git detecting a change as the metadata has changed, and even worse, changing the code which generates an output will make git detect the change in the output, which can be hundreds of lines long. Worst case is if two people commit those types of “changes” in the same notebook in two different branches, it will lead to merge conflicts which are horrible to resolve, especially taking into account that standard conflict resolution aid tools will not work (try finding the damn separator “=“s among hundreds of gibberish lines which represent the output). If you could make a video on ways to work around this, I’d be immensely grateful
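One way to picture a workaround for the git noise described above: an .ipynb file is JSON, so outputs and execution counts can be cleared before committing. This is only a minimal sketch using the standard library. `strip_outputs` and the tiny in-memory notebook are hypothetical names made up for illustration, and dedicated tools such as nbstripout handle this more thoroughly.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop stored results
            cell["execution_count"] = None  # drop the run counter
    return cleaned


# Hypothetical two-cell notebook, built in memory for the example.
example_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["print('hi')"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
        },
    ],
}

cleaned = strip_outputs(example_notebook)
print(json.dumps(cleaned["cells"][1], indent=1))
```

With a real file you would `json.load` it, pass the dict through the function, and `json.dump` it back before `git add`. The code cells then only change in git when the code itself changes.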
Experimenting with data pipeline components in notebooks, then migrating to scripts or modules is super common on my teams. A couple major pain points: copying an instance method from a class in some module to a notebook, or back, is terrible due to the `self` or `cls` argument, or lack of it, breaking the expected functionality. And ensuring the ipykernel handles your custom module imports in the notebook the same way as Python does when running a script is often awkward. Modifying a custom module you imported in the notebook earlier means you must restart your kernel to ensure it uses the new version. And depending on where you store your notebooks relative to those modules, and where you run these from, you get varying import behavior. I'm shocked at how often non-trivial data projects are built on popsicle sticks like this...and yet it (mostly) works.
@Zaltan1 "Modifying a custom module you imported in the notebook earlier means you must restart your kernel to ensure it uses the new version." If you don't want to restart the jupyter kernel, you can run this instead after you have modified your module:
> import importlib
> importlib.reload(modulename)
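To make the difference between re-importing and reloading concrete, here is a small self-contained sketch. The module name `reload_demo_helpers` and its `greet` function are hypothetical, written to a temporary directory just for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip writing .pyc files so this quick demo always reads the source file.
sys.dont_write_bytecode = True

# Create a throwaway module on disk (hypothetical name and content).
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "reload_demo_helpers.py"
module_file.write_text("def greet():\n    return 'v1'\n")
sys.path.insert(0, str(module_dir))

helpers = importlib.import_module("reload_demo_helpers")
first = helpers.greet()

# Edit the module on disk, as you would in your editor.
module_file.write_text("def greet():\n    return 'version 2'\n")

# Importing again only returns the module already cached in sys.modules.
helpers = importlib.import_module("reload_demo_helpers")
after_reimport = helpers.greet()

# reload() re-executes the source in the existing module object.
importlib.reload(helpers)
after_reload = helpers.greet()

print(first, after_reimport, after_reload)
```

One caveat: names pulled in with `from module import func` keep pointing at the old function object after a reload, so calling through the module (`helpers.greet()`) is the safer habit.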
Totally, I made a set of notebooks for someone to train their own language model and it was a nightmare managing the state and getting everything to return the same value all the time. Chasing ghosts in the machine.
Jupyter is good for reports or design, but for pipelines, scripts and the VS Code outline view are my preferred solution. But the most important thing I noticed is the custom modules you build up over time, which increase productivity. With Jupyter, using your custom modules can get annoying because you have to reload the kernel each time after changing/importing your modules.
a simple workaround for this is to use the importlib library, so when you modify your modules, just execute a cell with importlib.reload(your_module_name) with this you update the modifications made in your module into your notebook without having to reset the kernel.
based on my experience using .ipynb in vs code, I need to put these 2 lines at the top of the jupyter notebook file so that updates in the .py file are reflected in the .ipynb
```
%load_ext autoreload
%autoreload 2
```
ipython (what jupyter notebook is built on top of) is also really nice. you can use it in place of the standard python repl. you get autocomplete and even some basic syntax highlighting
That’s 98% of what I use jupyter for. ptpython, for those interested, allows vim keybindings and a few more notebook-like features directly in the terminal… that was the game changer that convinced me to leave actual notebooks.
Notebooks empower the mess! I try to avoid them as much as I can. I ended up discovering that, for me, using small scripts to do EDA is more productive, because you always start by investigating some hypothesis about your data, with the plus that you keep best practices for coding. For data viz, I'm experimenting with streamlit apps, because they are very easy to build and will also be useful if you need to show your analysis to someone else.
You have to keep putting common functions in modules as you go. This leads to old notebooks being refactored or function signatures changing without being re-run. There are tools to re-run them to check compatibility but some things like training DL cannot be re-run easily. All global parameters should be defined once at the top. You have to get comfortable with the notebook violating single object responsibility. The notebooks are good for documentation but due to the fact they tend to get copied and pasted, you get out-of-date comments. The notebooks do not play nice with Git and review solutions but there are some solutions for this. Despite these drawbacks, they are good for running experiments and documenting them.
Notebooks seem super odd to me. Not really software in the traditional sense. I cannot imagine a scenario where I'd ever use one. Seems only useful for munging data interactively. Which, obviously, (as evidenced by several of the comments below), is something many people do. That's just not something which would ever come up for me. But at least I now understand at a vague, conceptual level what they are and can happily ignore them.
I suspect that the majority of comments will touch on the issues of data scientists only working in notebooks and not having any knowledge of software engineering principles. The unfortunate part of this is all of the junior level data scientists who are really just a jack of all trades and master of none. Many of the data science teams are poorly managed and offer no upskilling or mentorship, which will likely lead to more of these scenarios where competent engineers have to compensate for lack of ability within overgrown data science teams. If you are in a decision making role, please ensure that your data science team either trains people to code or produces cutting-edge work. The latter is extremely rare outside of world class research groups.
I have had juniors ask me why unit testing is important, why we would move code into python scripts, and why we should use type hinting. These questions arise even when some notebooks they develop start to push 500+ lines of code. I am currently the only person in a team of 15 data scientists with industry experience in software engineering teams.
The saying is: “a jack of all trades is a master of none, but oftentimes better than a master of one.” - so things get better. We limp along, and they get better.
I'm going down the Arjan Rabbit Hole now. 🐇🕳 I watch one video, then you mention something I don't know, like partial functions, which leads me to watching another video about something I don't know... 🤣🤣
I have very mixed feelings about Jupyter notebooks. Having worked with them for at least 5 years, I definitely enjoy that they are more interactive and easier to share than scripts. However, it’s an issue when you are transitioning from EDA to development, and this does not always happen at a well-defined moment :) and eventually you end up with quite a bad piece of code. So my rule is to run only high-level functions with very well defined interfaces in the notebook and put everything else into files of user-defined functions and classes.
💡 Get my FREE 7-step guide to help you consistently design great software: arjancodes.com/designguide.
VSCode has this mode called “Python Interactive Window” where you have Jupyter-esque code blocks. The blocks are separated by a special comment (“# %%”) so the end result is still a script you can version control, unit test, debug, etc. It’s available through the Python extension.
I use the same. Although I leave out the "#%%" and just highlight the section I need to run while building and testing so that I have a clean script when I am finished.
this also works in neovim. i use nvim-repl for this in case anyone is looking to try it
I've only "recently" switched to VSCode from PyCharm Community, and I found this feature to be quite useful for my exploratory stuff! It's a shame that they had put it under a paywall in PyCharm there... I usually create functions and scripts, but if I just wanna check some logic quickly, and perhaps fix some syntax to regex my data correctly, I'd use Jupyter :)
I use this at work! I stopped using notebooks and moved to scripts with these blocks so I could run sections easily, and dive into the interactive window. But then I often re-run as a script as a whole.
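For anyone who has not seen the "# %%" style discussed in this thread, a minimal sketch of such a file is below. The data and the `summarize` helper are hypothetical. Run as a plain script, the markers are just comments, while editors that understand them offer to run each block separately.

```python
# %%
import statistics

# Hypothetical sample data for the example.
measurements = [2.5, 3.1, 2.8, 3.4, 2.9]


# %%
def summarize(values):
    """Return a few summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


# %%
summary = summarize(measurements)
print(summary)
```

Because the file stays an ordinary .py file, `summarize` can be imported and unit tested like any other function, and git only sees plain-text changes.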
Currently trying to refactor a Jupyter notebook I got from our data science team - complete nightmare.
If your DS team learned to code like I did (taking a DS certification to transition from analytical background with only basic SD knowledge), they likely only ever coded in notebooks. Comparing what I've learned from Arjan versus my (not cheap) certification courses, I expect it's very common that many new data science professionals have a false sense of best coding practices. I've tried building complete data analysis applications in Jupyter notebooks and while it's a great learning space, it can quickly become a hellscape for testing.
@mattd7828 Most barely transitioned from Excel formula chains to python. A stretch to call it data science (empirical test and validation) instead of data mushing. And that is the company's fault… nobody ever told them what software engineering is.
@mattd7828 I'm curious about the certifications you are talking about. Would it be possible for you to share their name ?
Thanks a lot
@EmileAI Nearly all major accredited colleges and universities in the US offer "continuing education" programs for professionals. They use the term "certification" more as a marketing tool because it's really only as good as the reputation of the school, and often they outsource instruction to third party professionals.
I took my DS cert online through UC Irvine. I don't regret it, but my point remains that they really only focus on getting data, cleaning it, and doing the analysis - which is fine because many newcomers struggle with even basic python and pandas. But when you have no SD background, you can easily come out thinking you are an amazing programmer, clueless to the massive breadth and depth of software development.
Dude I got kicked off of stackoverflow twice when I went from software to working in physics and had to become familiar with Jupyter, and now I’m in love with it. I have a masters degree in astrophysics but wound up working as a developer for 8 years before getting a job in my actual field. I absolutely *_hate_* jupyter books… but jupyter _notebooks_ are awesome for exploratory type stuff.
I'm a computational chemist, so I use jupyter notebooks daily for explorative data analysis. Especially when analysing convergences, some parameters have to be adjusted on the fly every time, and here notebooks are awesome to see the immediate effect of your choices without painfully loading in the huge amount of data again
For the Jupyter issue with regard to imports being present or missing due to editing "errors". I have a rule that I reload and run all features after finishing changes in a cell. This helps to ensure that I did not create a side effect, add, or remove something needed elsewhere.
I can still run into issues, but the restart typically shows me the errors of my ways.
I've used Jupyter notebook a lot working in earth science modeling. I also manage the frontend (Java) and backend (Python) code for a website that processes data based on a user's request. My thought process when using a notebook is so different than when I'm working on code that is part of the site backend. For instance, "testing" when working on the backend becomes more like "validation" when using notebooks for earth science modeling. With the backend code I may be using unit tests, while when I'm using notebooks I may be generating a plot or map to ensure that the data are being modified how I expect.
To address the issues that can arise from running cells out of order, I am a stickler about using the "Restart and run all" command.
Working in earth science modeling, Jupyter has been a big boon in regards to repeatability, reusability, and transparency.
Thanks for another great video!
VScode actually supports doing annotated jupyter-style blocks in .py files. The advantage of this is that instead of a blob of json (jupyter is json with a lot of the output saved in there as literals) you're working in plain text and can therefore version control your file.
Do you use some VSCode extension for this?
@hojaelee1562 It's part of the Python extension. IIRC it also adds the option to convert a jupyter notebook to a .py file to the command palette.
I just wish there was something similar in neovim. That and in-browser debugging is the only thing I miss from vsCode.
This is the code centric approach. Jupyter is more documentation/output centric.
Great video! Coming from data science, I definitely see the value of exploratory data analysis with Jupyter notebooks. For your question, one annoying difficulty with Jupyter notebook files is version control. If you write a .py file and a coworker runs the file to see what it does, then there is no change to the .py file. Hence the version control software will not note the file as changed. But if the same scenario happens with a Jupyter notebook file, then the file changes! This is pretty annoying, especially if your coworkers are used to simply writing `git add .`
I've been using Jupyter for EDA and building pipelines. However, transitioning that pipeline into a standalone script has always been a bit of a journey for me. I would absolutely love to see a video on how to effectively make that transition from a Jupyter notebook to a full-fledged Python script, especially when it comes to keeping checks (maybe asserts to ensure data looks as expected?) for the exploratory nature of Jupyter while ensuring the robustness and maintainability of a script. Thanks for all the content you produce, and keep up the great work!
take a look at my comment in the main session
Why not use papermill at that point and run the notebooks in the background without having to rewrite the pipeline?
@arrozesss you don't have any other comments in this video
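On the idea raised at the top of this thread of keeping checks when a notebook pipeline becomes a script, one possible shape is a small validation function with asserts that runs between pipeline steps. This is a sketch only. `check_rows`, the column names, and the sample rows are all hypothetical, and plain dicts stand in for whatever table type the real pipeline uses.

```python
REQUIRED_COLUMNS = ("station", "temperature_c")  # hypothetical schema


def check_rows(rows, required_columns):
    """Fail fast if the data does not look the way the pipeline expects."""
    assert rows, "expected at least one row"
    for index, row in enumerate(rows):
        missing = set(required_columns) - set(row)
        assert not missing, f"row {index} is missing columns: {sorted(missing)}"
        empty = [name for name in required_columns if row[name] is None]
        assert not empty, f"row {index} has empty values in: {empty}"
    return rows


# Hypothetical input that passes the checks.
good_rows = [
    {"station": "A", "temperature_c": 11.5},
    {"station": "B", "temperature_c": 9.0},
]

checked = check_rows(good_rows, REQUIRED_COLUMNS)
print(f"{len(checked)} rows passed the checks")
```

Note that `assert` statements are skipped when Python runs with the `-O` flag, so checks that must always run in production are better written as explicit `if ... raise` statements.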
I use notebooks for reports. Of course I run into the same issues you mentioned. This is why I try to define functions in a separate module. However, in many cases I use notebooks in an IPython shell which is convenient to explore code snippets.
This was a nice video. Too often I see the more formal programmers, who don't have any experience with exploratory data analysis, dismiss notebooks upfront, without any nuance. Yeah, sure, I'll just run a script again and again, redoing the calculations and plots again and again, super efficient. Notebooks are a great way of combining text, images, code and output, and have their downsides, of course, like everything on this planet. I've faced all the problems you mentioned in the video, and I'm now aware of the code smells. One golden rule I found was that, before "checking in" any notebook, or giving it to someone else, always restart the Kernel and run all cells. If it doesn't run to the end, except in some very specific cases, you have a problem that needs fixing. Ideally, restart and run everything every once in a while, like every 1-2 hours. In the end, I consolidate some useful behavior into functions or classes and move them to a module that I can import in future notebooks, and which is properly unit tested and documented.
Combining scripts with notebook is very useful for me in some situations!
I used to use Jupyter notebooks for data exploration and especially if there were intermediate results that took a long time to calculate. Eventually I gave up on them because it was too easy to save data off to csv, xlsx, or even into a sqlite database (usually via diskcache) and then read them back in each time I re-ran.
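The save-and-read-back pattern from this comment can be sketched with the standard library alone. Note the swap: this uses `pickle` files instead of the csv, xlsx, or diskcache storage the commenter mentions, and `load_or_compute` and `slow_step` are hypothetical names for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached result if present, else compute and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


calls = {"count": 0}  # tracks how often the expensive step really runs


def slow_step():
    """Stand-in for a long-running calculation."""
    calls["count"] += 1
    return [x * x for x in range(5)]


cache_file = Path(tempfile.mkdtemp()) / "squares.pkl"
first = load_or_compute(cache_file, slow_step)
second = load_or_compute(cache_file, slow_step)  # served from disk
print(first, second, calls["count"])
```

The cache has no invalidation: if the calculation changes, the file has to be deleted by hand. Only unpickle files you created yourself, since loading untrusted pickle data can execute arbitrary code.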
I usually build pipelines on notebooks and transition that to regular python scripts. You just have to be aware of any changes to predefined variables. I do find notebooks to be much slower, on the order of tens of minutes
I would have expected that you’d touch on two other aspects:
- version control challenges with Jupyter notebooks; and related
- breaking the code-data separation paradigm in these notebooks, which can also be a security/privacy risk.
It's possible to use the "papermill" lib to create a python script that runs a specific jupyter notebook with parameters set in that python script, with a kernel of your choice. Very interesting lib for using notebooks as parametrized functions with auto-saved state after each run.
Just a note - a square latlon will vary from an area perspective, with the largest at the equator and smallest at the poles. There are (lossy) projections to translate latlon to local distance in meters using the Azimuthal equidistant projection
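To put rough numbers on how much a lat/lon cell shrinks towards the poles, here is a small sketch. It assumes a perfectly spherical Earth with a mean radius of 6371 km, which is a simplification, and the function name is made up for the example. On a sphere the area of a lat/lon rectangle is R² · Δλ · (sin φ_north − sin φ_south), with Δλ in radians.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius of a spherical Earth


def cell_area_km2(lat_south, lat_north, lon_west, lon_east):
    """Area in km^2 of a lat/lon rectangle on a sphere (inputs in degrees)."""
    delta_lon = math.radians(lon_east - lon_west)
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return EARTH_RADIUS_KM ** 2 * delta_lon * band


# The same 1 degree by 1 degree cell at two different latitudes.
area_equator = cell_area_km2(0, 1, 0, 1)
area_high_lat = cell_area_km2(80, 81, 0, 1)

print(f"near the equator: {area_equator:.0f} km^2")
print(f"between 80 and 81 degrees north: {area_high_lat:.0f} km^2")
print(f"ratio: {area_equator / area_high_lat:.1f}")
```

For real work a geodesy library and an ellipsoidal Earth model would be more appropriate. The sketch only illustrates the size of the effect.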
I love Jupyter notebooks and use them almost every day, biggest issues I've found are version control and debugging. Although VScode has some limited debugging features for Jupyter it's definitely not as smooth as .py files.
These problems would also occur if the Python was being written as a script though.
Don’t change the definition of functions and assignment of variables randomly throughout your code.
Or if you must work that way, define each block with its own variables and definitions.
Agreed, and I had the same reaction as you. My hunch is that this video is geared more towards people using Jupyter notebooks to learn Python. Then, alternating between blocks/cells containing 4 lines of markdown and 4 lines of Python code is prone to buggy, inefficient, or unexpected code behavior.
I find jupyter useful for the reasons you have outlined.
Plus they are a good way to prepare presentations where you want to show graphs and the like. Not to actually run the code, just to be able to see it with markdown providing reasonable headings, comments etc.
Being able to access the presentation through a browser is also useful - you can demo through e.g. an iPad. The alternative of exporting the graphs/tables etc. and then importing them into PowerPoint is a pain. In the past I might have used Excel to do something similar.
I also find them useful for developing new code (where it’s not obvious what data manipulations are required up front) then once happy with the results, re-write the algorithm as a script.
On the downside, version control is a pain in the arse. Merging always seems to go wrong with git. Also, they do seem to glitch in strange ways sometimes losing code or requiring a re-write.
Notebooks also crash in a very weird way at times. I don't really like them either, because it is also hard to use them for debugging sometimes. To move code into production, you have to redesign the program you have written. You can have the best of both worlds by using Python Interactive #%%.
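For anyone who hasn't seen it, this is roughly what such a file looks like: an ordinary .py script in which `# %%` comments mark the cells. The data here is a made-up inline sample, not anything from the video:

```python
# %% Load data (hypothetical inline sample instead of a real file)
readings = [3.2, 4.1, 5.0, 4.4]

# %% Compute a summary
mean_reading = sum(readings) / len(readings)

# %% Inspect the result
print(f"mean = {mean_reading:.3f}")
```

Because the markers are plain comments, the same file still runs top to bottom with the normal Python interpreter.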
One pitfall I hit when working with Jupyter notebooks and normal scripts together is that if you add a new function to your script it will not be imported into the notebook, even if you run the code block with the import command again.
I believe the first time it runs the kernel caches the imports in some way to list what it's expecting, so a reimport doesn't load the new functions. I found restarting the kernel was necessary to allow me to import the new functions
There is an autoreload feature.
@@RatafakRatafak please elaborate… I too run into this often and I’m bad about remembering to restart the kernel.
This happens a lot when dealing with researchers. Supposedly novel research with sota results in a notebook, but the state is broken because they've rerun cells in different orders with cell changes in between. So it becomes completely not reproducible.
The function at 5:14 looks simple, but why did it take 4.5 minutes? How big was the UFO data?
I have run into problems with Jupyter notebooks. To avoid them, I gather function definitions and imports within .py modules. Also, it is always a good idea to restart the kernel and run up to the current cell to ensure the context is as intended within the sequence of the actions taken in cells. It is particularly risky to scroll up to a cell and re-run it.
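A toy illustration of why re-running a cell is risky. This is only a simulation under stated assumptions: the "cells" are plain strings executed with `exec` in one shared namespace, which mimics how a kernel keeps state between runs. The cell names and prices are made up:

```python
# Each entry stands in for one notebook cell.
cells = {
    "load": "prices = [10.0, 20.0, 30.0]",
    "scale": "prices = [p * 1.2 for p in prices]",  # mutates shared state
    "report": "total = sum(prices)",
}


def run(order):
    """Execute cell sources in the given order in one shared namespace."""
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return namespace["total"]


# Restart-and-run-all: every cell runs exactly once, top to bottom.
clean_total = run(["load", "scale", "report"])

# Same code, but the 'scale' cell was re-run by hand along the way.
stale_total = run(["load", "scale", "scale", "report"])

print(clean_total, stale_total)
```

The cell sources are identical in both runs; only the execution history differs, which is exactly what a restart and run-all removes.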
Just want to say that I'm loving using Jupyter notebooks within VS Code. You can easily connect them to a kernel without running a Jupyter server and you can run Copilot and other VS Code plugins within Jupyter, but still get the benefits of data exploration and running the code piece by piece. It appears you are using it already within VS Code, but I bet many people are unaware and running it within an Anaconda Jupyter host.
It's only available for Julia, but Pluto notebooks solve a lot of the problems mentioned in this video - code blocks get reloaded when one of their dependencies changes, making this sort of mistake considerably harder.
Missing the obvious use case:
Showcasing your code in a tutorial kind of style
I use it in vscode when needed. Usually when I'm doing something with a 20min datalake query.
I see the run button above the 'Square' Brackets... How is that done?
We build data pipelines using Jupyter Notebooks and we do try to put much of the complex code into Python files that we import. One “gotcha” that still gets me sometimes is that changes to those Python files do not affect the “global state”, even after running the imports again. The kernel needs to be restarted as well.
I found some specific behaviour when you import from your script files into Jupyter. If you change a function in a script file after you start working in the notebook, your change won't have any effect in the notebook. Repeating the import doesn't help; only restarting the Jupyter kernel does.
I encountered the exact same problems you talk about in your video with Jupyter Notebook. Note that I am not a data scientist. Same as you, I can use it for deep data exploration that needs visualisation. For simple data exploration I just import the data into an SQLite DB, which is more convenient for me since I already know SQL. I still sometimes use Jupyter Notebook to try out pieces of code, since it's more practical and convenient than a plain Python console. But I still run into a problem there: sometimes I only mean to try a few pieces of code, but I get hooked and end up with a quite extensive, messy script that I need to take time to clean up if I want to turn it into a reusable script (outside of Jupyter Notebook). So I would say Jupyter Notebook can be absolutely great for data science, because of the deep data exploration and because code involving huge datasets or complex processing can take really long to execute. Outside of this, it can sometimes be an interesting tool for testing short pieces of code. But as I am not a data scientist, I keep it as a secondary tool; VS Code is far more convenient for 90% of my work.
With the right extensions, Jupyter can show an image or play audio (it's very good for people dealing with audio processing, since you don't have to create an audio file), etc. It's very good for explanations and analysis.
Does anyone know how to add the runtimer at the lower left corner when running a cell in the notebook?
Thanks in advance!
What's with the "I love god slash design guide" at 0:32? Not sure if that's what you're actually saying (could it be "I love code"?) but the subtitles say "I love god..." 😂
arjan.codes/designguide :D Unfortunately, the AI transcribing the video made a small mistake.
The problem you described for Jupyter can also be an issue for REPL-centered programming e.g. in Lisp/Emacs. :D
Fantastic!! Thank you.
Glad you liked it!
Glad you enjoyed it!
My biggest issue with them is that they are not reusable and the code in them becomes very cluttered and hard to work with very quickly.
Good video, I wanted to write about Jupyter notebooks myself, it's a very good tool
Interesting topic! I view JN as totally useless. But it reminds me of the SQL scripts that I keep in a txt document. I use these typically for analysis of data and they are a little complex to rewrite or remember.
So I can see that Jupyter Notebooks would be useful for data analysis. 😊
One of the biggest problems I have with jupyter is its interaction with git. Since it records not only the code but also the metadata and outputs, simply re-running a cell will lead to git detecting a change as the metadata has changed, and even worse, changing the code which generates an output will make git detect the change in the output, which can be hundreds of lines long.
Worst case is if two people commit those types of “changes” in the same notebook in two different branches; it will lead to merge conflicts which are horrible to resolve, especially taking into account that standard conflict resolution tools will not work (try finding the damn separator “=”s among hundreds of gibberish lines which represent the output).
If you could make a video on ways to work around this, I’d be immensely grateful.
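One common workaround is to strip outputs before committing. The sketch below shows the idea with only the standard library, assuming the nbformat 4 layout in which a notebook is JSON with a `cells` list and each code cell carries `outputs` and `execution_count`. `strip_outputs` is a hypothetical helper, and the notebook here is a tiny hand-written stand-in for a real .ipynb file:

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with outputs and counters removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Tiny hand-written notebook, standing in for a real .ipynb file.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print(1 + 1)"],
            "execution_count": 7,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

clean = strip_outputs(notebook)
print(json.dumps(clean, indent=1))
```

With outputs and execution counts gone, re-running a cell no longer changes the file, so git only sees real code edits. Dedicated tools such as nbstripout apply the same idea as a git filter, so you don't have to remember to run it by hand.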
Experimenting with data pipeline components in notebooks, then migrating to scripts or modules is super common on my teams.
A couple major pain points: copying an instance method from a class in some module to a notebook, or back, is terrible due to the `self` or `cls` argument, or lack of it, breaking the expected functionality.
And ensuring the ipykernel handles your custom module imports in the notebook the same way as Python does when running a script is often awkward. Modifying a custom module you imported in the notebook earlier means you must restart your kernel to ensure it uses the new version. And depending on where you store your notebooks relative to those modules, and where you run these from results in varying import behavior.
I'm shocked at how often non-trivial data projects are built on popsicle sticks like this...and yet it (mostly) works.
You can use %autoreload (google it) to automatically reload your changing imported scripts
@Zaltan1 "Modifying a custom module you imported in the notebook earlier means you must restart your kernel to ensure it uses the new version."
If you don't want to restart the jupyter kernel, you can run this instead after you have modified your module:
> import importlib
> importlib.reload(modulename)
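A self-contained sketch of that reload, with one caveat worth knowing: names pulled in with `from module import name` keep pointing at the old object after the reload. The module `helpers_demo` is a hypothetical throwaway file written to a temp directory so the example can run anywhere:

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # keep the demo free of stale .pyc files

with tempfile.TemporaryDirectory() as tmp:
    module_file = Path(tmp) / "helpers_demo.py"  # hypothetical module name
    module_file.write_text("def greet():\n    return 'v1'\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()

    import helpers_demo
    from helpers_demo import greet  # binds the *current* function object

    # Edit the module on disk, as you would in your editor.
    module_file.write_text(
        "def greet():\n    return 'v2'\n\n"
        "def farewell():\n    return 'bye'\n"
    )
    importlib.reload(helpers_demo)

    via_module = helpers_demo.greet()       # sees the edited code
    via_old_name = greet()                  # still the pre-reload function
    new_function = helpers_demo.farewell()  # newly added function is available

    # Clean up so the demo leaves no trace in this interpreter.
    sys.path.remove(tmp)
    del sys.modules["helpers_demo"]

print(via_module, via_old_name, new_function)
```

So after a reload, either call things as `module.name` or repeat the `from module import name` line to rebind the name.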
Totally, I made a set of notebooks for someone to train their own language model and it was a nightmare managing the state and getting everything to return the same value all the time. Chasing ghosts in the machine.
Jupyter is good for reports or design, but for pipelines, scripts and the VS Code outline view are my preferred solution. The most important thing I've noticed, though, is the custom modules you build up over time, which increase productivity. With Jupyter, using your custom modules can get annoying because you have to reload the kernel each time after changing/importing your modules.
A simple workaround for this is to use the importlib library: when you modify your modules, just execute a cell with importlib.reload(your_module_name). This brings the modifications made in your module into your notebook without having to reset the kernel.
@@pabloskewes2184 I will look into it. Thanks.
What browser do YOU use with jupyter notebook?
Based on my experience using .ipynb in VS Code, I need to put these 2 lines at the top of the Jupyter notebook file so that updates in the .py file are reflected in the .ipynb
```
%load_ext autoreload
%autoreload 2
```
ipython (what jupyter notebook is built on top of) is also really nice. you can use it in place of the standard python repl. you get autocomplete and even some basic syntax highlighting
That’s 98% of what I use jupyter for. ptpython, for those interested, allows vim keybindings and a few more notebook-like features directly in the terminal… that was the game changer that convinced me to leave actual notebooks.
@@andrewiglinski148 wow ptpython looks incredible. django shell support too with django extensions! looking forward to trying it out
A load of the actual, real scientists (not just data scientists) I've talked to have been very keen on Jupyter.
This was extremely helpful, thank you!
Happy to help! Thank you for watching :)
End-to-End PyMC project, please, yes. Thank you, bye! :D
People who run jupyter cells by clicking on the play button with the mouse are a big red flag xd
Refactoring and debugging is a nightmare
Notebooks empower the mess! I try to avoid them as much as I can. I ended up discovering that, for me, using small scripts to do EDA is more productive, because you always start by investigating some hypothesis about your data, with the plus that you keep to best practices for coding.
For data viz, I'm experimenting with Streamlit apps, because they are very easy to build and will also be useful if you need to show your analysis to someone else.
You have to keep putting common functions in modules as you go. This leads to old notebooks being refactored or function signatures changing without being re-run. There are tools to re-run them to check compatibility but some things like training DL cannot be re-run easily.
All global parameters should be defined once at the top. You have to get comfortable with the notebook violating single object responsibility.
The notebooks are good for documentation but due to the fact they tend to get copied and pasted, you get out-of-date comments. The notebooks do not play nice with Git and review solutions but there are some solutions for this.
Despite these drawbacks, they are good for running experiments and documenting them.
You can do the UFO sightings in GIS much easier and map the data properly.
Notebooks: never.
Python files: always
Notebooks seem super odd to me. Not really software in the traditional sense. I cannot imagine a scenario where I'd ever use one. Seems only useful for munging data interactively. Which, obviously, (as evidenced by several of the comments below), is something many people do. That's just not something which would ever come up for me. But at least I now understand at a vague, conceptual level what they are and can happily ignore them.
6:25 is that a semicolon there, after calling sns.histplot?! Sacrilegious!
I suspect that the majority of comments will touch on the issues of data scientists only working in notebooks and not having any knowledge of software engineering principles.
The unfortunate part of this is all of the junior level data scientists who are really just a jack of all trades and master of none. Many of the data science teams are poorly managed and offer no upskilling or mentorship, which will likely lead to more of these scenarios where competent engineers have to compensate for lack of ability within overgrown data science teams.
If you are in a decision making role, please ensure that your data science team either trains people to code or produces cutting-edge work. The latter is extremely rare outside of world class research groups.
I have had juniors ask me why unit testing is important, why we would move code into python scripts, and why we should use type hinting.
These questions arise even when some notebooks they develop start to push 500+ lines of code.
I am currently the only person in a team of 15 data scientists with industry experience in software engineering teams.
The saying is: “a jack of all trades is a master of none, but oftentimes better than a master of one.” - so things get better. We limp along, and they get better.
I'm going down the Arjan Rabbit Hole now. 🐇🕳
I watch one video, then you mention something I don't know, like partial functions, which leads me to watching another video about something I don't know... 🤣🤣
Great video....make more videos....we are new and confused on many points.
Glad you enjoyed it!
I have very mixed feelings about Jupyter notebooks. Having worked with them for at least 5 years, I definitely enjoy how they are more interactive and easier to share than scripts. However, it’s an issue when you are transitioning from EDA to development, and this doesn't always happen at a well-defined moment :) so eventually you end up with quite a bad piece of code. So my rule is to run only high-level functions with very well-defined interfaces in the notebook and put everything else into files of user-defined functions and classes.
USS Cerritos
"... we can clearly see that the US is the most popular country" - lol, that's gotta be a giant red flag of some kind 😆
:)
Simple. Never use Jupyter notebooks. Unless you really hate your customer. And yourself.
What can I train my BIE, DS, MLE, and DE to use instead? I am totally open to some suggestions. Excel?