Solving Data Science tasks with Greenplum DB

Until 2016, the terms "data science" and "data mining" usually implied the Hadoop ecosystem. But the game changed about two years ago. Many enterprises have found that the full Hadoop stack is too heavy to adopt for their tasks. Can we use an MPP RDBMS for the typical Hadoop use cases, data science and data mining? Let's find out.

Parallel access to external data sources from Greenplum DB using PXF

Greenplum 5.0.0 brought us a lot of new features. Most of them were planned a long time ago but could not be implemented without breaking backward binary compatibility, which was not an option within the 4.X major branch. One such feature is the new PXF framework. It lets you integrate a Greenplum cluster with other systems: databases, in-memory grids, Hadoop components, and so on. Moreover, it does this in parallel: every Greenplum segment retrieves its own shard of the data.
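
To make the idea of "each segment retrieves its own shard" more concrete, here is a minimal conceptual sketch in Python. It is not PXF code and does not reflect how PXF actually distributes work: the round-robin rule and the names `assign_fragments`, `fetch`, and `parallel_read` are assumptions made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_fragments(fragments, num_segments):
    """Map each data fragment to a segment.

    Assumption: a simple round-robin rule (fragment i goes to
    segment i % num_segments). The real assignment may differ.
    """
    plan = {seg: [] for seg in range(num_segments)}
    for i, frag in enumerate(fragments):
        plan[i % num_segments].append(frag)
    return plan


def fetch(segment_id, fragments):
    """Hypothetical stand-in for a segment reading its fragments
    from the external system; it only tags each fragment with the
    segment that would read it."""
    return [(segment_id, frag) for frag in fragments]


def parallel_read(fragments, num_segments):
    """Let every segment fetch its own fragments concurrently."""
    plan = assign_fragments(fragments, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        futures = [pool.submit(fetch, seg, frags) for seg, frags in plan.items()]
        rows = []
        for future in futures:
            rows.extend(future.result())
    return rows
```

The point of the sketch is only that no single node pulls the whole data set: the work is split up front, and each worker reads just its part.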

Apache Zeppelin vs Jupyter Notebook: comparison and experience

The deeper you go into data analysis, the more you understand that the most suitable tool for coding and visualizing is not pure code, a SQL IDE, or even simplified data manipulation diagrams (aka workflows or jobs). At some point you realize that you need a mix of all of these, and that’s what “notebook” platforms are. I have tried the two most powerful of them in production with 20+ analytics users. My experience is described in this article.

Monitoring MPP systems

There are a lot of monitoring systems nowadays, but working with Massively Parallel Processing (MPP) databases showed me that none of them alone is enough to monitor a complex data processing system from both sides, data and hardware. For that purpose, I found a solution in combining several metric collection, visualization, and alerting systems.