010-82449668

EN 中文

在 Alluxio 上运行 Apache Flink

在 Alluxio 上运行 Apache Flink

Slack Docker Pulls GitHub edit source

This guide describes how to get Alluxio running with Apache Flink, so that you can easily work with files stored in Alluxio.

Prerequisites

  • Setup Java for Java 8 Update 161 or higher (8u161+), 64-bit.
  • Alluxio has been set up and is running.
  • Flink has been installed and set up.

Configuration

Apache Flink allows to use Alluxio through a generic file system wrapper for the Hadoop file system. Therefore, the configuration of Alluxio is done mostly in Hadoop configuration files.

Set property in core-site.xml

If you have a Hadoop setup next to the Flink installation, add the following property to the core-site.xml configuration file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

In case you don’t have a Hadoop setup, you have to create a file called core-site.xml with the following contents:

<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>

Next, you have to specify the path to the Hadoop configuration in Flink. Open the conf/flink-conf.yaml file in the Flink root directory and set the fs.hdfs.hadoopconf configuration value to the directory containing the core-site.xml. (For newer Hadoop versions, the directory usually ends with etc/hadoop.)

Distribute the Alluxio Client Jar

In order to communicate with Alluxio, we need to provide Flink programs with the Alluxio Core Client jar. We recommend you to download the tarball from Alluxio download page. Alternatively, advanced users can choose to compile this client jar from the source code by following the instructions here. The Alluxio client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar.

We need to make the Alluxio jar file available to Flink, because it contains the configured alluxio.hadoop.FileSystem class.

There are different ways to achieve that:

  • Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar file into the lib directory of Flink (for local and standalone cluster setups)
  • Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar file into the ship directory for Flink on YARN.
  • Specify the location of the jar file in the HADOOP_CLASSPATH environment variable (make sure its available on all cluster nodes as well). For example like this:
$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar

In addition, if there are any client-related properties specified in conf/alluxio-site.properties, translate those to env.java.opts in {FLINK_HOME}/conf/flink-conf.yaml for Flink to pick up Alluxio configuration. For example, if you want to configure Alluxio client to use CACHE_THROUGH as the write type, you should add the following to {FLINK_HOME}/conf/flink-conf.yaml.

env.java.opts: -Dalluxio.user.file.writetype.default=CACHE_THROUGH

Note: If there are running flink clusters, stop the flink clusters and restart them to apply the changes to the configuration.

To use Alluxio with Flink, just specify paths with the alluxio:// scheme.

If Alluxio is installed locally, a valid path would look like this alluxio://localhost:19998/user/hduser/gutenberg.

Wordcount Example

This example assumes you have set up Alluxio and Flink as previously described.

Put the file LICENSE into Alluxio, assuming you are in the top level Alluxio project directory:

$ bin/alluxio fs copyFromLocal LICENSE alluxio://localhost:19998/LICENSE

Run the following command from the top level Flink project directory:

$ bin/flink run examples/batch/WordCount.jar \
  --input alluxio://localhost:19998/LICENSE \
  --output alluxio://localhost:19998/output

Open your browser and check http://localhost:19999/browse. There should be an output file output which contains the word counts of the file LICENSE.

MLPerf基准测试冲出黑马,Alluxio新范式引爆AI存储

为了较好地展示 Alluxio 的缓存性能,我们采用了全球首个且唯一的 AI/ML 存储基准测试——MLPerf® Storage 进行验证。MLPerf™ 是影响力最广的国际 AI 性能基准评测,由图灵奖得主大卫•帕特森(David Patterson)联合顶尖学术机构发起成立,并于2023年推出 MLPerf™ Storage 基准性能测试,旨在以架构中立、具有代表性和可重复的方式衡量 AI 工作负载的存储系统性能。

Alluxio在数据索引和模型分发中的核心价值与应用

在当前的技术环境下,搜索、推荐、广告、大模型、自动驾驶等领域的业务依赖于海量数据的处理和复杂模型的训练。这些任务通常涉及从用户行为数据和社交网络数据中提取大量信息,进行模型训练和推理。这一过程需要强大的数据分发能力,尤其是在多个服务器同时拉取同一份数据时,更是考验基础设施的性能。

南方科技大学分享:大数据技术如何赋能大模型训练及开发

南方科技大学是深圳在中国高等教育改革发展的时代背景下创建的一所高起点、高定位的公办新型研究型大学。2022年2月14日,教育部等三部委公布第二轮“双一流”建设高校及建设学科名单,南方科技大学及数学学科入选“双一流”建设高校及建设学科名单。