What is the H2O program

Introduction to H2O.ai

Machine learning, one of the most powerful engines of the modern IT industry, offers a huge number of algorithms, frameworks, libraries, and tools that let users analyze and process data, train many kinds of models, and deploy and scale ML solutions on cloud platforms. It is easy to get lost in this volume of tools and algorithms, not only for a beginner who is just learning to launch Jupyter Notebook from the command line, but sometimes even for an experienced ML engineer choosing a technology stack for the MVP of a new breakthrough startup. H2O.ai can be a good choice in both cases.

What is H2O.ai?

H2O.ai is a company and the software platform of the same name, which streamlines the machine learning process through the open-source H2O package and AutoML (a mechanism that partially automates the routine work of training ML models). H2O.ai began as a Silicon Valley startup back in 2012; at the time of writing it is a company with more than 150 employees and total annual revenue of $21M. In January 2019 the company even appeared in the Gartner Magic Quadrant for Data Science and Machine Learning Platforms in the Visionaries category, alongside giants such as Google, IBM, and Microsoft.

It is worth noting that most of H2O.ai's products and features are available to users completely free of charge, and the source code is publicly available on the company's page on github.com.

Why H2O.ai?

As a company that tries to simplify the development of ML solutions, H2O.ai offers a broad product line, including products that let users without deep knowledge of data science or programming skills extract insights from data and build predictive models.

H2O: the company's main product. A platform for machine learning and predictive analytics that lets you build machine learning models and generate production code for them in Java and Python at the push of a button. It includes implementations of supervised and unsupervised algorithms such as GLM and K-Means, as well as an easy-to-use web interface called Flow. H2O.ai claims that the platform trains models faster than popular machine learning libraries such as scikit-learn.

Deep Water: an integration of H2O with the deep learning libraries TensorFlow, MXNet, and Caffe. The H2O.ai team has not supported the project since December 2017.

Sparkling Water: an integration of H2O and Spark that lets users of an existing Spark ecosystem work with H2O's machine learning.

H2O4GPU: a product that provides GPU hardware acceleration for training ML models through APIs in Python and R.

Steam: a proprietary (paid) product for building and deploying ML models. Engineers can build and deploy models on Steam, making them available to other users for integration through an API.

Driverless AI: a proprietary user-facing product that lets enterprise employees without technical skills process data, tune model parameters, and select the best algorithms for a specific task. All of this is available through a convenient web interface.

This article focuses on the company's main product, H2O. All examples below are given in Python, although H2O can also be used from R.
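
To give a feel for the Python workflow, here is a minimal sketch that starts a local H2O instance and fits a GLM. It assumes the `h2o` package and a Java runtime are already installed. The helper names `predictor_columns` and `train_glm`, the CSV path, and the response column are hypothetical and do not come from the original article.

```python
def predictor_columns(columns, response):
    """Return every column name except the response column."""
    return [name for name in columns if name != response]


def train_glm(csv_path, response):
    """Fit a binomial GLM on a CSV file using a local H2O instance.

    Assumption: the `h2o` package and a Java runtime are available,
    and `response` names a two-class column in the file.
    """
    # Imported lazily so the helper above works without h2o installed.
    import h2o
    from h2o.estimators import H2OGeneralizedLinearEstimator

    h2o.init()  # start, or connect to, a local H2O instance
    frame = h2o.import_file(csv_path)

    # Treat the response as categorical so the GLM does classification.
    frame[response] = frame[response].asfactor()
    train, valid = frame.split_frame(ratios=[0.8], seed=42)

    model = H2OGeneralizedLinearEstimator(family="binomial")
    model.train(
        x=predictor_columns(frame.columns, response),
        y=response,
        training_frame=train,
        validation_frame=valid,
    )
    return model
```

A call such as `train_glm("data.csv", "churn")` (both arguments are placeholders) would return the trained model, which can then be inspected in Python or in Flow.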

Installation

H2O and all required components are installed from the command line or terminal of your operating system.

Install the Homebrew package manager:


/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Overview

Welcome to the H2O documentation site! Depending on your area of interest, select a learning path from the sidebar, or look at the full content outline below.

We’re glad you’re interested in learning more about H2O. If you have questions or ideas to share, please post them to the H2O community site on Stack Overflow.

See how our customers are using H2O at https://www.h2o.ai/customers/.
Keep up to date with the latest H2O blogs at https://www.h2o.ai/blog/.
Review projects, applications, research papers, tutorials, courses, and books that use H2O at https://github.com/h2oai/awesome-h2o.
Learn about securing your installation by following our security guidelines.

- API-Related Changes
  - From 3.32.0.1
  - From 3.30.1.2
  - From 3.30.1.1
  - From 3.30.0.5
  - From 3.30.0.4
  - From 3.28 or Below to 3.30
  - From 3.26 or Below to 3.28
  - From 3.22 or Below to 3.24
- Quick Start Videos
  - H2O Quick Start with Flow
  - H2O Quick Start with Python
  - H2O Quick Start on Hadoop
  - H2O Quick Start with Sparkling Water
  - H2O Quick Start with R
- Cloud Integration
  - EC2 Instances & S3 Storage
  - Using H2O with Amazon AWS
  - Using H2O with Microsoft Azure Linux Data Science VM
  - Using H2O with IBM Data Science Experience
  - Using H2O with Nimbix
  - Using H2O with Google Cloud Storage
  - Using H2O with Kubernetes
  - Install H2O on the Google Cloud Platform Offering
- Downloading & Installing H2O
  - Download and Run from the Command Line
  - Install in R
  - Install in Python
  - Install on Hadoop
- Starting H2O
  - From R
  - From Python
  - From the Command Line
  - On Spark
  - Connecting to an H2O Cluster by Name
  - Best Practices
- Getting Data into Your H2O Cluster
  - Supported File Formats
  - Data Sources
  - Default Data Sources
  - Local File System
  - Remote File
  - HDFS-like Data Sources
  - Alluxio FS
  - IBM Swift Object Storage
  - Google Cloud Storage Connector for Hadoop & Spark
  - Requirements
  - Limitations
  - Importing Examples
  - import_sql_table
  - import_sql_select
  - Using the Hive 2 JDBC Driver
- Data Manipulation
  - Uploading a File
  - Importing a File
  - Importing Multiple Files
  - Downloading data
  - Changing the Column Type
  - Combining Columns from Two Datasets
  - Combining Rows from Two Datasets
  - Fill NAs
  - Group By
  - Imputing Data
  - Merging Two Datasets
  - Pivoting Tables
  - Replacing Values in a Frame
  - Slicing Columns
  - Slicing Rows
  - Sorting Columns
  - Splitting Datasets into Training/Testing/Validating
  - Tokenize Strings
- Feature Engineering
- Training Models
  - Supervised Learning
  - Unsupervised Learning
  - Training Segments
- Cross-Validation
  - How Cross-Validation is Calculated
  - Using Cross-Validated Predictions
  - Combining Holdout Predictions
  - Cross-Validation Cleanup
- Variable Importance
  - Training Data Terminology
  - Decision Tree Terminology
  - Decision Tree Representations
  - Building a Decision Tree
  - Splitting Based on Squared Error
  - Feature Importance (aka Variable Importance) Plots
  - Variable Importance Calculation (GBM & DRF)
  - Tree-Based Algorithms
  - Non-Tree-Based Algorithms
  - Retrieving Variable Importance in H2O-3
  - References
- Grid (Hyperparameter) Search
  - Quick Links
  - Grid Search in R and Python
  - Grid Search Java API
  - REST API
  - Supported Grid Search Hyperparameters
  - Grid Testing
  - Caveats/In Progress
  - Additional Documentation
- Checkpointing Models
  - Checkpoint with Deep Learning
  - Checkpoint with DRF
  - Checkpoint with GBM
  - Checkpoint with XGBoost
- Model Performance
- Prediction
- H2O AutoML: Automatic Machine Learning
  - AutoML Interface
  - Explainability
  - Code Examples
  - AutoML Output
  - Web UI via H2O Wave
  - Experimental Features
  - FAQ
  - Resources
  - Citation
  - Random Grid Search Parameters
  - Additional Information
- Model Explainability
  - Model Explainability Interface
  - Code Examples
  - Output: Explanations
  - Model Metrics
  - Explanation Plotting Functions
  - Learning Curve Plot
  - Pareto Front Plot
  - Additional Information
- Admissible Machine Learning
  - Infogram
  - Infogram Interface
  - Infogram Output
  - Code Examples
  - Glossary
  - References
- Saving, Loading, Downloading, and Uploading Models
  - Binary Models
  - MOJO Models
- Productionizing H2O
  - About POJOs and MOJOs
  - MOJO Quick Start
  - POJO Quick Start
  - Example Design Patterns
  - Additional Resources
- MOJO Capabilities
  - About H2O Generated MOJO Models
  - Predicting Values with MOJO
  - Metadata Contained in the MOJO Model
- Using Flow – H2O’s Web UI
  - About
  - Download Flow
  - Launch Flow
  - How to Use the Interface
  - Using Flows
  - Understanding Cell Modes
  - Data
  - Viewing Jobs
  - Models
  - Partial Dependence Plots
  - Predictions
  - Frames
  - Troubleshooting Flow
- Downloading Logs
  - Regular Users
  - Hadoop Users
  - Accessing YARN
  - Log File Structure
- H2O Architecture
  - H2O Software Stack
  - REST API Clients
  - How R (and Python) Interacts with H2O
- Security
  - Security Model
  - File Security in H2O
  - Embedded Web Port (by default port 54321) Security
  - SSL Internode Security
        
H2O-3

Open Source, Distributed Machine Learning for Everyone

H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and more. H2O also has industry-leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R and Python communities.
        
Key Features of H2O

Leading Algorithms

Algorithms developed from the ground up for distributed computing and for both supervised and unsupervised approaches, including Random Forest, GLM, GBM, XGBoost, GLRM, Word2Vec, and many more.

Access from R, Python, Flow and more

Use the programming language you already know, such as R or Python, to build models in H2O, or use H2O Flow, a graphical notebook-based interactive user interface that does not require any coding.

AutoML

H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit. Stacked Ensembles are automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, are the top-performing models in the AutoML Leaderboard.

Distributed, In-Memory Processing

In-memory processing with fast serialization between nodes and clusters supports massive datasets. Distributed processing on big data delivers speeds up to 100x faster with fine-grained parallelism, enabling optimal efficiency without degrading computational accuracy.

Simple Deployment

Easy-to-deploy POJOs and MOJOs allow fast and accurate scoring in any environment, including for very large models.
        
How it Works

Enterprise Data

H2O works on existing big data infrastructure, on bare metal or on top of existing Hadoop, Spark, or Kubernetes clusters. It can ingest data directly from HDFS, Spark, S3, Azure Data Lake, or any other data source into its in-memory distributed key-value store.

Distributed, In-Memory Machine Learning

H2O takes advantage of the computing power of distributed systems and in-memory computing to accelerate machine learning, using parallelized algorithms that take advantage of fine-grained in-memory MapReduce.
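
The map/reduce pattern mentioned above can be illustrated with a toy sketch in plain Python. This is not H2O's implementation: the function names, the chunk size, and the use of a thread pool in place of cluster nodes are all assumptions made for illustration. Each chunk is summarized independently (map), and the partial summaries are then merged (reduce).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_chunk(chunk):
    """Map step: summarize one chunk as a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two partial (sum, count) summaries."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(values, chunk_size=4, workers=2):
    """Compute a mean by mapping over chunks in parallel, then reducing."""
    if not values:
        raise ValueError("cannot compute the mean of an empty list")
    chunks = chunked(values, chunk_size)
    # Threads stand in for the worker nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total, count = reduce(combine, partials, (0, 0))
    return total / count
```

Because each chunk only needs to report a small summary, the data itself never has to be gathered in one place, which is the property that lets this style of computation scale across machines.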
        
Seamless Deployment

Quickly and easily deploy models into production as Java code (POJO) or in a binary format (MOJO). H2O models can also be productionized in a number of other ways.
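
As a rough sketch of the MOJO route, the helper below exports a trained model to a directory. It assumes `model` is a trained H2O estimator exposing `download_mojo`; the helper name `export_for_scoring` and the output directory are hypothetical.

```python
from pathlib import Path


def export_for_scoring(model, out_dir):
    """Save a trained H2O model as a MOJO and return the artifact path.

    Assumption: `model` is a trained H2O estimator, so it provides
    a `download_mojo(path=...)` method.
    """
    target = Path(out_dir)
    # Create the output directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    # download_mojo writes the MOJO archive and returns its location.
    return model.download_mojo(path=str(target))
```

The resulting archive can then be loaded for scoring outside the cluster where the model was trained.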
        
Enterprise Support

When AI becomes mission critical for enterprise success, H2O.ai is there to help. H2O Enterprise Support provides the services you need to optimize your investments in people and technology to deliver on your AI vision. H2O Enterprise Support includes training, a dedicated account manager, 24/7 support, accelerated issue resolution, and direct enhancement requests. Enterprise Support also gives you access to H2O experts in data science, the H2O platform, and DevOps/production deployment to accelerate and expand your adoption of AI. In addition, Enterprise Support customers have access to Enterprise Steam or H2O Sparkling Water to deploy and manage models in their Hadoop, Spark, or Kubernetes clusters.
        
H2O AutoML

AutoML, or Automatic Machine Learning, is the process of automating algorithm selection, feature generation, hyperparameter tuning, iterative modeling, and model assessment. AutoML tools make it easy to train and evaluate machine learning models. Automating the repetitive data science tasks allows people to focus on the data and the business problems they are trying to solve.
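
A minimal sketch of an AutoML run from Python might look like the following. It assumes the `h2o` package and a Java runtime are installed and that `train_frame` is an H2OFrame already loaded into a running cluster; the function name `run_automl` and the 60-second default budget are illustrative choices, not recommendations from H2O.

```python
def run_automl(train_frame, response, time_limit_secs=60):
    """Train many models within a time budget and return the results.

    Assumption: `train_frame` is an H2OFrame in a running H2O cluster
    (for example, one loaded with h2o.import_file).
    """
    # Imported lazily: requires the `h2o` package and a Java runtime.
    from h2o.automl import H2OAutoML

    predictors = [col for col in train_frame.columns if col != response]

    # max_runtime_secs caps the total time AutoML spends training models.
    aml = H2OAutoML(max_runtime_secs=time_limit_secs, seed=1)
    aml.train(x=predictors, y=response, training_frame=train_frame)

    # The leaderboard ranks the trained models; leader is the top entry.
    return aml.leaderboard, aml.leader
```

The returned leaderboard is the ranked list of models, and the leader is the model at the top of that list.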
        
        
Featured Use Cases

Predict Out-of-Stock Risk

Out-of-stock issues have been around for decades. An ML-based stock-out risk solution can help avoid running out of stock and manage inventory.

Fraud Detection

Detecting fraud even before it happens can prevent significant losses for financial institutions and spare customers headaches that can damage relationships.

Claims Management

Finding ways to improve the claims process can save money and also ensures that customers and patients with legitimate issues are taken care of.

Hospital Capacity Simulator

The hospital capacity simulation solution aims to predict census, hospital capacity, and patient stay.
        
Sources used in preparing this material:
https://deeplearningrussia.wordpress.com/2019/10/12/intro-to-h2o-ai/
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
https://h2o.ai/platform/ai-cloud/make/h2o/