spark

Alibaba Big Data Practices on Cloud-Native – EMR Spark on ACK

This article discusses the practices and challenges of EMR Spark on Alibaba Cloud Kubernetes.

Application of Delta Lake in Soul

This article explains the background of Delta Lake along with practices, problems, and solutions.

Fluid Helps Improve Data Elasticity with Customized Auto Scaling

This article gives step-by-step instructions for auto scaling with Fluid.

Flink + Iceberg: How to Construct a Whole-scenario Real-time Data Warehouse

In this article, the author explains building a real-time data warehouse using Apache Flink and Apache Iceberg.

Apache Iceberg 0.11.0: Features and Deep Integration with Flink

In this article, the author discusses how Apache Flink and Apache Iceberg have opened a new chapter in building a data lake architecture featuring stream-batch unification.

Integrating Apache Hudi and Apache Flink for New Data Lake Solutions

This article introduces Apache Hudi and Apache Flink and explains the benefits of integrating them.

Cloud-Native Compute Engine: Challenges and Solutions

This article explains some of the challenges in cloud-native compute engines, and discusses some solutions and future directions.

Data Lake: How to Explore the Value of Data Using Multi-engine Integration

This article briefly discusses the metadata service and multi-engine support capabilities of the Alibaba Cloud Data Lake Formation (DLF) service.

EMR Remote Shuffle Service: A Powerful Elastic Tool of Serverless Spark

This article discusses Alibaba Cloud's EMR Remote Shuffle Service and explains how it solves the shuffle stability problems in compute-storage separation architectures.

Efficient Data Lake Formation Based on JindoFS and OSS

This article explains the process of data lake formation based on Alibaba Cloud OSS and the JindoFS big data cache acceleration service.

Integrating Real-time Search with SaaS-based Cloud Data Warehouses

This article discusses the integration of SaaS-based cloud data warehouses and real-time search, as shared by Meng Shuo, product manager of MaxCompute.

Setting Up PySpark on Alibaba Cloud CentOS Instance

This tutorial shows, step by step, how to set up PySpark on an Alibaba Cloud ECS instance running CentOS 7.x.

Building a Cloud-Native Feed Streaming System with Apache Kafka and Spark on Alibaba Cloud – Part A: Service Setup

In this 3-part blog series, we'll show you how to build a simple, intelligent, cloud-native feed streaming system with Apache Kafka and Spark on Alibaba Cloud.

Building a Cloud-Native Feed Streaming System with Apache Kafka and Spark on Alibaba Cloud – Part B: Streaming Processing

In this 3-part blog series, we'll show you how to build a simple, intelligent, cloud-native feed streaming system with Apache Kafka and Spark on Alibaba Cloud.

Use Spark on MaxCompute to Access Alibaba Cloud HBase

This article describes how to add configuration items in HBase Standard Edition and HBase Enhanced Edition.

The Run-In Period for Flink and Hive

Jason addresses the bugs and compatibility issues with Flink-Hive by operating on a Hive database using Flink SQL to demonstrate some of the features provided.

Hive Finally Has Flink!

Jason introduces the architecture of Hive integration in Flink, discusses its problems, and explains how to solve them.

Setting up Spark on MaxCompute

This post provides a walkthrough on how to set up Spark on MaxCompute on Alibaba Cloud.

Using Data Preorganization for Faster Queries in Spark on EMR

This article looks into how you can accelerate query speeds by using the Spark Relational Cache of Alibaba Cloud E-MapReduce.

Use Apache Arrow to Assist PySpark in Data Processing

This article looks at Apache Arrow and its usage in Spark and how you can use Apache Arrow to assist PySpark in data processing operations.