The further you go in data analysis, the more you understand that the most suitable tool for coding and visualizing is not pure code, a SQL IDE, or even simplified data manipulation diagrams (aka workflows or jobs). At some point you realize that you need a mix of all of these, and that is what “notebook” platforms are. I have tried the two most powerful of them in production with 20+ analytic users. This article describes my experience.
First, let me describe the situation. As a big data warehouse, we have many data storage systems in our infrastructure:
- Greenplum database (~50 TB)
- Hadoop cluster with Hive tables and flat text files on HDFS (~600 TB of “raw” data)
- Flat CSV files on several machines (tens of GB)
- SAS datafiles (hundreds of GB)
- Some other databases (hundreds of GB)
One group of users of that data is our internal analysts, who use it to provide data support for business developers. The good news is that we love our users. Really, they are not the kind of users who say “Oh, dear, I didn’t press anything, it broke by itself”. Our analysts are comfortable with SQL, Python and its analytical and visualization libraries (NumPy, plotly, etc.), know what machine learning is, and so on.
These users use different software to analyze different kinds of data: Aginity SQL IDE for Greenplum, SAS for CSV and SAS files, Pig for Hadoop data, etc. The main problem is that you cannot join data from different sources for further use that way; you cannot even visualize data from multiple sources in one report.
That is where the notebook approach can help. A notebook is an interactive report (mostly web-based) that contains:
- Code in one of the supported languages (Python, SQL, R, etc.)
- The result of the code that was executed above
- The visualization of the result (charts, diagrams)
- Any other stuff like HTML content, plain text, pictures of kittens, etc.
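To make the idea concrete, here is a minimal sketch of what a single notebook cell can do with two sources at once. It is an illustration only: the in-memory SQLite database stands in for a warehouse such as Greenplum, the inline CSV text stands in for a flat file, and all table, column and variable names are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a warehouse table (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 120.0), (2, 75.5), (1, 30.0)],
)

# Stand-in for a flat CSV file living on another machine.
csv_text = "customer_id,segment\n1,retail\n2,corporate\n"
segments = {
    int(row["customer_id"]): row["segment"]
    for row in csv.DictReader(io.StringIO(csv_text))
}

# Aggregate in SQL, then enrich each row with the CSV attribute in Python.
totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
report = [(cid, segments.get(cid, "unknown"), total) for cid, total in totals]

for row in report:
    print(row)
```

In a real notebook the two sources would be reached through their own drivers or interpreters, but the pattern of pulling both into one place and combining them stays the same.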
Two of the most popular notebook applications are Jupyter Notebook and Zeppelin.
Jupyter Notebook is well-known, widespread software that has long been used at giants such as Google and NASA. It is an evolution of IPython Notebook, similar software that supported only Python as a notebook engine; Project Jupyter was spun off from IPython in 2014. It is open-source and has a big community and a lot of additional software and integrations.
Apache Zeppelin is a newer player. Started by the Apache Foundation (what a surprise) in 2013, it is also open-source, but its community is still about 1/10 the size of Jupyter’s (based on the number of GitHub contributors).
Installation is quite simple for both systems. For Zeppelin, it is just a matter of decompressing the tarball and running the server; for Jupyter, installing the pip package and running the binary.
But since Zeppelin is a new, fast-changing system, it is better to build it from source; that way you get many more new features:
wget https://github.com/apache/zeppelin/archive/master.zip
unzip master
cd zeppelin-master/
./dev/change_scala_version.sh 2.11
sed -i 's/60000/600000/g' spark-dependencies/pom.xml  # default download timeout is too small for our network
mvn clean package -DskipTests -Pspark-2.0 -Phadoop-2.4 -Pyarn -Ppyspark -Psparkr -Pr -Pscala-2.11
Externally, notebooks in both systems look alike. There is a control panel at the top of the page for notebook selection, Save/Delete buttons, etc. The notebook itself consists of paragraphs, each of which contains code and its output. Here is Jupyter’s notebook:
Zeppelin’s biggest advantage here is that it allows you to place multiple paragraphs side by side in one row:
Also, Zeppelin has a simple built-in data visualization tool for some interpreters (that is what engines are called in Zeppelin), for example, for SQL:
You get traditional charts, pie charts, diagrams and some extra stuff. Also, table output in Zeppelin lets you apply sorting out of the box.
On the other hand, Jupyter’s code editor and paragraph navigator seem to be much more effective: it has command and edit modes, and the Esc key switches to command mode. Edit mode lets you modify the selected paragraph, while in command mode you can quickly operate on paragraphs, run them, restart kernels (engines in Jupyter), toggle output on and off, and so on (vim users will love this approach). Also, Jupyter has more hot keys (aka shortcuts) than Zeppelin (there is no single doc page for Zeppelin shortcuts, so try starting with these and these). By the way, not all of Zeppelin’s hot keys work on every platform/browser.
Another big advantage of Jupyter is the large number of Python visualization libraries that can render pictures and other interactive content directly in a paragraph’s output. For example, the plotly library will output the chart in a Jupyter notebook, while in Zeppelin it will just save an HTML file. Zeppelin supports only matplotlib’s content.
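Part of what makes inline output possible is Jupyter’s rich display protocol: when the last value of a cell is an object whose class defines a `_repr_html_` method, the notebook renders the returned HTML instead of plain text. The sketch below shows that hook with a toy class; `BarChart` and its data are hypothetical, and real libraries such as plotly plug into the same display machinery with far more involved implementations.

```python
import html


class BarChart:
    """Toy chart object that renders itself as HTML (hypothetical class)."""

    def __init__(self, data):
        self.data = dict(data)

    def _repr_html_(self):
        # Jupyter calls this hook, when present, to render a cell's last value.
        peak = max(self.data.values()) or 1
        rows = []
        for label, value in self.data.items():
            width = int(200 * value / peak)  # scale bars to at most 200px
            rows.append(
                f"<div>{html.escape(str(label))}: "
                f"<span style='display:inline-block;background:#69c;"
                f"width:{width}px'>&nbsp;</span> {value}</div>"
            )
        return "\n".join(rows)


chart = BarChart({"Greenplum": 50, "Hadoop": 600})
markup = chart._repr_html_()
print(markup)
```

In a Jupyter cell you would simply end the cell with `chart` and see the bars drawn in the output area; outside a notebook the same call just returns an HTML string.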
And the final knockout for Zeppelin here is that Jupyter has a great autocompletion feature: it completes Python methods and objects, SQL tables and schemas, and so on.
At first glance, the winner in this category is Jupyter because of its huge list of supported engines (80+) against only 19 interpreter types in Zeppelin. As Jupyter’s community is bigger and older, it is no surprise that Jupyter supports many more external systems.
But there are two nuances that need to be mentioned:
- Zeppelin’s community is growing faster than Jupyter’s. In fact, a lot of things in Zeppelin are borrowed from Jupyter; IMHO, that is why Zeppelin’s development is going faster;
- Small at first glance, this feature makes Zeppelin much more suitable for some cases: every Zeppelin notebook supports an unlimited number of engines out of the box. That means a user can combine different data sources and their outputs in one notebook, creating wide, cross-system reports (look at picture #2, where I used three different interpreters in one notebook: R, Python and JDBC to Hive2). I think we can definitely call this a killer feature.
Also, Zeppelin allows you to choose how you want to run your interpreter:
- shared: one process for all users, which means that everyone uses the same Python session and can use each other’s dynamic objects;
- isolated, which means that everyone has a separate interpreter process.
Jupyter does not need such a choice because it only ever runs a separate notebook server.
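The practical difference between the two modes can be sketched with plain Python namespaces. This is a toy model and not Zeppelin’s actual code: `InterpreterPool` and the user names are hypothetical, and real isolation happens at the process level rather than in a dictionary.

```python
class InterpreterPool:
    """Toy model of per-user interpreter binding (not Zeppelin's actual code)."""

    def __init__(self, mode):
        if mode not in ("shared", "isolated"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self._namespaces = {}

    def _namespace_for(self, user):
        # Shared mode maps every user to one namespace;
        # isolated mode gives each user their own.
        key = "all-users" if self.mode == "shared" else user
        return self._namespaces.setdefault(key, {})

    def run(self, user, code):
        exec(code, self._namespace_for(user))

    def has_name(self, user, name):
        return name in self._namespace_for(user)


shared = InterpreterPool("shared")
shared.run("alice", "df = [1, 2, 3]")
print(shared.has_name("bob", "df"))      # bob sees alice's object

isolated = InterpreterPool("isolated")
isolated.run("alice", "df = [1, 2, 3]")
print(isolated.has_name("bob", "df"))    # bob's session has no such object
```

The shared mode saves resources but lets one user overwrite another user’s variables by accident, which is worth keeping in mind when choosing between the two.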
Any software that wants to be called production-ready must integrate well with corporate security systems. In other words, you need to authenticate users against an external system, Microsoft Active Directory in our case.
Jupyter doesn’t support multi-user configuration by default, but you can add it by installing JupyterHub, an additional service that accepts a client’s connection, authenticates the user and starts a separate Jupyter server for them. That is not a good solution when you have a large number of users: starting a separate server for each of them puts extra load on your physical server.
Zeppelin supports multi-user configuration well. Like most other software, it uses a single server process and authenticates the user against the configured system (a flat-file user:password list, LDAP or Active Directory) before allowing them to go further. All you need to do is set up the LDAP/AD connection in the shiro.ini file.
Now, when you have a large number of users accessing your notebook server and authenticating via LDAP or Active Directory, you may want to restrict some user groups from viewing the code of other users’ notebooks. The next step is a traditional user-group permission system, which is quite flexible.
The short answer to whether you can do this in Jupyter is “You can’t”. On the other hand, there is an open GitHub issue for this, so at some point we may see this work done.
In Zeppelin you can create flexible security configurations: a user may belong to a group, and a group may or may not have read, write and execute access to individual notebooks. The only fly in the ointment is that you can assign permissions only to a notebook itself, not to a directory. Zeppelin:Jupyter is 1:0 here.
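The user, group and notebook relationship described above can be modeled in a few lines. This is only a sketch of the idea under simplifying assumptions, not Zeppelin’s implementation or configuration format; the `Notebook` class, the `is_allowed` helper and all user and group names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Notebook:
    name: str
    # Maps a group name to the set of actions it may perform on this notebook.
    grants: dict = field(default_factory=dict)


def is_allowed(user_groups, notebook, action):
    """Return True if any of the user's groups holds `action` on the notebook."""
    return any(action in notebook.grants.get(group, set()) for group in user_groups)


# Hypothetical users, groups, and notebook.
memberships = {"alice": {"risk"}, "bob": {"marketing"}}
scoring = Notebook(
    "scoring",
    grants={"risk": {"read", "write", "execute"}, "marketing": {"read"}},
)

alice_can_write = is_allowed(memberships["alice"], scoring, "write")
bob_can_write = is_allowed(memberships["bob"], scoring, "write")
bob_can_read = is_allowed(memberships["bob"], scoring, "read")
print(alice_can_write, bob_can_write, bob_can_read)
```

Note that grants hang off each notebook individually, which mirrors the limitation mentioned above: there is no directory-level object to attach a permission to.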
Because of its huge community, Jupyter Notebook has many more extensions available. You can find a list of almost all existing extensions in this repository.
As for Zeppelin, the truth is that there are no extensions available for it. However, any new feature that someone implements goes into the official repository, so just check for updates.
For now, Zeppelin is not the winner here. We found the following problems in the current Zeppelin version (0.7):
- The PySpark interpreter is still not stable when it is used by a large number of users in parallel. Sometimes it just hangs, sometimes it stops working with meaningless errors. I hope we will solve this problem in the near future.
- In isolated interpreter mode, the lifetime of the interpreter process after the last code execution is unpredictable. That means you cannot tell whether your dynamic objects in the interpreter’s context are still alive after some period of inactivity.
- Some bugs in the UI (like a notebook’s cron job continuing to run after the notebook was moved to trash).
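One way to live with the unpredictable interpreter lifetime is to never assume that an object from an earlier paragraph still exists. The sketch below shows such a guard; it is a workaround pattern, not a Zeppelin feature, and `load_dataset` and `get_or_reload` are hypothetical helpers.

```python
import time


def load_dataset():
    """Placeholder for an expensive load (hypothetical; replace with a real query)."""
    time.sleep(0.01)
    return {"rows": 3, "loaded_at": time.time()}


def get_or_reload(namespace, name, loader):
    """Return namespace[name], rebuilding it with loader() if the session lost it."""
    if name not in namespace:
        namespace[name] = loader()
    return namespace[name]


session = {}                      # stands in for the interpreter's globals()
first = get_or_reload(session, "dataset", load_dataset)
second = get_or_reload(session, "dataset", load_dataset)   # reused, not reloaded

session.clear()                   # simulates the interpreter process being restarted
third = get_or_reload(session, "dataset", load_dataset)    # rebuilt from scratch
print(first is second, first is third)
```

In a notebook paragraph you would pass `globals()` as the namespace, so that the expensive load runs only when the interpreter has actually been restarted.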
I hope it is only a matter of time before Zeppelin becomes as stable as Jupyter.
As far as I can see, for now Zeppelin doesn’t cover all of Jupyter’s features and possibilities, and it is not as stable or as popular among analytic users. But Zeppelin already shows that it is designed for enterprise users: it has great LDAP integration, a permissions management system and so on.
So, if you are planning to use a notebook app just for yourself or for a limited number of analysts, Jupyter is still your choice. However, if you are designing notebook usage for a large number of users in an enterprise, take a look at Zeppelin: at its current pace of development, it will not take long for it to overtake Jupyter.