Until 2016, the terms "data science" and "data mining" usually implied the Hadoop ecosystem. But the game changed about two years ago. Many enterprises have realized that the full Hadoop stack is too heavy for their everyday tasks. Can we use an MPP RDBMS for the typical Hadoop use cases: data science and data mining? Let's find out.
Greenplum 5.0.0 brought us a lot of new features. Most of them were planned a long time ago but couldn't be implemented without breaking backward binary compatibility, which was not an option within the 4.x major branch. One such feature is the new PXF framework. It lets you integrate a Greenplum cluster with other systems: databases, in-memory grids, Hadoop components, and so on. Moreover, it can do so in parallel: each Greenplum segment can retrieve its own shard of the data.
The deeper you go into data analysis, the more you understand that the most suitable tool for coding and visualizing is not pure code, or a SQL IDE, or even simplified data manipulation diagrams (aka workflows or jobs). At some point you realize that you need a mix of all of these, and that’s what “notebook” platforms are. I have tried two of the most powerful of them in production with 20+ analytic users. My experience is described in this article.
There are a lot of monitoring systems nowadays, but working with Massively Parallel Processing (MPP) databases showed me that no single one of them is enough to monitor a complex data processing system from both sides: data and hardware. For that purpose, I found a solution in combining multiple metric collection, visualization, and alerting systems.
Most Unix administrators sooner or later run into the problem of maintaining multiple Python versions on one system. Usually the reason is the users: they want to use different versions of Python and switch between them easily. How can we help them with that?