Running Apache Flink on Alluxio

This guide describes how to get Alluxio running with Apache Flink, so that you can easily work with files stored in Alluxio.

Prerequisites

Java 8 Update 161 or higher (8u161+), 64-bit, has been installed.
Alluxio has been set up and is running.
Flink has been installed and set up.

Configuration

Apache Flink can use Alluxio through a generic file system wrapper for the Hadoop file system. Therefore, Alluxio is configured mostly in the Hadoop configuration files.

Set property in `core-site.xml`

If you have a Hadoop setup next to the Flink installation, add the following property to the core-site.xml configuration file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

In case you don’t have a Hadoop setup, you have to create a file called core-site.xml with the following contents:

<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>

Specify path to `core-site.xml` in `conf/flink-conf.yaml`

Next, you have to specify the path to the Hadoop configuration in Flink. Open the conf/flink-conf.yaml file in the Flink root directory and set the fs.hdfs.hadoopconf configuration value to the directory containing the core-site.xml. (For newer Hadoop versions, the directory usually ends with etc/hadoop.)
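As a minimal sketch of this step, the helper below appends the fs.hdfs.hadoopconf entry to a Flink configuration file. The function name set_hadoop_conf_dir and the paths in the usage line are hypothetical, and the sketch assumes the file ends with a newline. It refuses to touch a file that already defines the key.

```shell
# Point Flink at the directory that holds core-site.xml.
# Usage: set_hadoop_conf_dir <path-to-flink-conf.yaml> <hadoop-conf-dir>
# (hypothetical helper, not part of Flink or Alluxio)
set_hadoop_conf_dir() {
  conf_file="$1"
  hadoop_conf_dir="$2"
  # Do not add a second entry if the key is already configured
  if grep -q '^fs\.hdfs\.hadoopconf:' "$conf_file" 2>/dev/null; then
    echo "fs.hdfs.hadoopconf is already set in $conf_file; edit it by hand" >&2
    return 1
  fi
  echo "fs.hdfs.hadoopconf: $hadoop_conf_dir" >> "$conf_file"
}
```

It could be called as `set_hadoop_conf_dir "$FLINK_HOME/conf/flink-conf.yaml" /opt/hadoop/etc/hadoop`, where both paths are placeholders for your own installation.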

Distribute the Alluxio Client Jar

To communicate with Alluxio, Flink programs need the Alluxio Core Client jar. We recommend downloading the tarball from the Alluxio download page. Alternatively, advanced users can compile the client jar from source by following the Alluxio build instructions. The Alluxio client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar.

We need to make the Alluxio jar file available to Flink, because it contains the configured alluxio.hadoop.FileSystem class.

There are different ways to achieve that:

Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar file into the lib directory of Flink (for local and standalone cluster setups)
Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar file into the ship directory for Flink on YARN.
Specify the location of the jar file in the HADOOP_CLASSPATH environment variable (make sure it is available on all cluster nodes as well). For example:

$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar
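For the first option (local and standalone cluster setups), copying the jar into Flink's lib directory could be sketched as follows. The function name install_client_jar is hypothetical, and the Flink home directory is passed in as an argument rather than assumed.

```shell
# Copy the Alluxio client jar into <flink-home>/lib.
# Usage: install_client_jar <path-to-client-jar> <flink-home>
# (hypothetical helper, not part of Flink or Alluxio)
install_client_jar() {
  jar="$1"
  flink_home="$2"
  # Fail early with a clear message if the jar path is wrong
  if [ ! -f "$jar" ]; then
    echo "client jar not found: $jar" >&2
    return 1
  fi
  mkdir -p "$flink_home/lib"
  cp "$jar" "$flink_home/lib/"
}
```

In a standalone cluster the same copy would have to be repeated on every node, since each Flink process loads jars from its local lib directory.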

Translate additional Alluxio site properties to Flink

In addition, if conf/alluxio-site.properties specifies any client-related properties, translate them to env.java.opts in {FLINK_HOME}/conf/flink-conf.yaml so that Flink picks up the Alluxio configuration. For example, to configure the Alluxio client to use CACHE_THROUGH as the write type, add the following to {FLINK_HOME}/conf/flink-conf.yaml:

env.java.opts: -Dalluxio.user.file.writetype.default=CACHE_THROUGH

Note: If any Flink clusters are running, stop and restart them so that the configuration changes take effect.
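The translation described above can be sketched as a small script that turns properties into -D flags. This is only an illustration: the function name alluxio_props_to_java_opts is hypothetical, treating keys that start with alluxio.user. as the client-related ones is an assumption, and the sketch only handles plain key=value lines without spaces around the equals sign or inside the value.

```shell
# Print an env.java.opts line built from the alluxio.user.* entries
# of a properties file. Selecting keys by the "alluxio.user." prefix
# is an assumption of this sketch.
# Usage: alluxio_props_to_java_opts <path-to-alluxio-site.properties>
alluxio_props_to_java_opts() {
  awk '
    /^alluxio\.user\.[^=]+=/ {
      sub(/[ \t\r]+$/, "")      # trim trailing whitespace
      opts = opts (opts == "" ? "" : " ") "-D" $0
    }
    END { if (opts != "") print "env.java.opts: " opts }
  ' "$1"
}
```

The printed line is meant to be reviewed and then pasted into {FLINK_HOME}/conf/flink-conf.yaml. If the file already contains an env.java.opts entry, merge the flags into that entry by hand instead of adding a second one.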

Using Alluxio with Flink

To use Alluxio with Flink, just specify paths with the alluxio:// scheme.

If Alluxio is installed locally, a valid path would look like this: alluxio://localhost:19998/user/hduser/gutenberg.
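To make the structure of such a path explicit, the sketch below assembles it from the Alluxio master host, the master port (19998 in the example above), and the file path inside Alluxio. The function name alluxio_uri is hypothetical.

```shell
# Build an alluxio:// URI from master host, master port and file path.
# Usage: alluxio_uri <host> <port> <path>
# (hypothetical helper for illustration only)
alluxio_uri() {
  host="$1"
  port="$2"
  path="$3"
  # Strip one leading slash so the result has a single "/" after the port
  printf 'alluxio://%s:%s/%s\n' "$host" "$port" "${path#/}"
}
```

For example, `alluxio_uri localhost 19998 /user/hduser/gutenberg` prints the path shown above.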

Wordcount Example

This example assumes you have set up Alluxio and Flink as previously described.

Put the LICENSE file into Alluxio, assuming you are in the top-level Alluxio project directory:

$ bin/alluxio fs copyFromLocal LICENSE alluxio://localhost:19998/LICENSE

Run the following command from the top-level Flink project directory:

$ bin/flink run examples/batch/WordCount.jar \
  --input alluxio://localhost:19998/LICENSE \
  --output alluxio://localhost:19998/output

Open your browser and check http://localhost:19999/browse. There should be an output file named output that contains the word counts of the LICENSE file.
