Yet another IT-blog

Solving Data Science tasks with Greenplum DB

06.05.2018 / 14.05.2018 | Dmitriy Pavlov | 3 Comments

Until 2016, the terms "data science" and "data mining" usually implied the Hadoop ecosystem. But the game changed about two years ago. Many enterprises have found that the full Hadoop stack is too heavy for typical enterprise tasks. Can we use an MPP RDBMS for the typical Hadoop use cases, data science and data mining? Let's find out.

Parallel access to external data sources from Greenplum DB using PXF

10.01.2018 / 10.01.2018 | Dmitriy Pavlov

Greenplum 5.0.0 brought us a lot of new features. Most of them were planned a long time ago but couldn't be implemented without breaking backward binary compatibility, which was not an option in the 4.X major branch. One of these features is the new PXF framework. It lets you integrate a Greenplum cluster with other systems: databases, in-memory grids, Hadoop components, etc. Moreover, it does so in parallel: each Greenplum segment can retrieve its own shard of the data.

Apache Zeppelin vs Jupyter Notebook: comparison and experience

25.03.2017 / 26.09.2017 | Dmitriy Pavlov | 14 Comments

The deeper you go into data analysis, the more you understand that the most suitable tool for coding and visualizing is not pure code, or an SQL IDE, or even simplified data manipulation diagrams (aka workflows or jobs). At some point you realize that you need a mix of all of these, and that’s what “notebook” platforms are. I have tried the two most powerful of them in production use with 20+ analytics users. My experience is described in this article.

Monitoring MPP systems

19.02.2017 / 19.02.2017 | Dmitriy Pavlov | 1 Comment

There are a lot of monitoring systems nowadays, but working with Massively Parallel Processing (MPP) databases showed me that none of them alone is enough to monitor a complex data processing system from both sides, data and hardware. For that purpose, I found a solution in combining multiple metric collection, visualization, and alerting systems.

Working with multiple Python versions

23.01.2017 / 22.06.2023 | Dmitriy Pavlov

Python 2.7, Python 3, Python 3.5... Most Unix administrators sooner or later face the problem of running multiple Python versions on one system. The reason is usually the users: they want to use different versions of Python and switch between them easily. How can we help them with that?