spark

Practices of Simulating IDC Spark Read and Write MaxCompute

This article uses EMR (Cloud Hadoop) to simulate a local Hadoop cluster accessing MaxCompute data.

Big Data Q&A - Friday Blog, Week 65

Friday Q&A is back! Let's take a look at some of the many very interesting questions I was asked during Alibaba Cloud training sessions this week!

The Spark and Delta Lake Engine Enterprise Edition of Databricks Helps Efficiently Access Lake Houses

This article describes how the performance optimizations and product features of the Enterprise Edition help you efficiently access lake houses.

Zuoyebang's Best Practices for Building Data Lakes Based on Delta Lake

This article aims to solve the performance problems of offline data warehouses (daily and hourly) during production and usage.

Best Practices for Big Data Processing in Spark

This article is an overview of the best practices for big data processing in Spark taken from a lecture.

DLF + DDI Best Practices for One-Stop Data Lake Formation and Analysis

This article aims to give readers a deeper understanding of Alibaba Cloud Data Lake Formation (DLF) and Databricks DataInsight (DDI).

Use Flink Hudi to Build a Streaming Data Lake

This article introduces the optimization and evolution of Flink Hudi's original mini-batch-based incremental computing model through stream computing.

Alibaba Big Data Practices on Cloud-Native – EMR Spark on ACK

This article discusses the practices and challenges of EMR Spark on Alibaba Cloud Kubernetes.

Application of Delta Lake in Soul

This article explains the background of Delta Lake along with practices, problems, and solutions.

Fluid Helps Improve Data Elasticity with Customized Auto Scaling

This article gives step-by-step instructions for auto scaling with Fluid.

Flink + Iceberg: How to Construct a Whole-scenario Real-time Data Warehouse

In this article, the author explains building a real-time data warehouse using Apache Flink and Apache Iceberg.

Apache Iceberg 0.11.0: Features and Deep Integration with Flink

In this article, the author discusses how Apache Flink and Apache Iceberg have opened a new chapter in building a data lake architecture featuring stream-batch unification.

Integrating Apache Hudi and Apache Flink for New Data Lake Solutions

This article introduces Apache Hudi and Apache Flink and explains the benefits of integrating them.

Cloud-Native Compute Engine: Challenges and Solutions

This article explains some of the challenges in cloud-native compute engines, and discusses some solutions and future directions.

Data Lake: How to Explore the Value of Data Using Multi-engine Integration

This article briefly discusses the metadata service and multi-engine support capabilities of the Alibaba Cloud Data Lake Formation (DLF) service.

EMR Remote Shuffle Service: A Powerful Elastic Tool of Serverless Spark

This article discusses Alibaba Cloud's EMR Remote Shuffle Service and explains how it solves the shuffle stability problems in compute-storage separation architectures.

Efficient Data Lake Formation Based on JindoFS and OSS

This article explains the process of data lake formation based on Alibaba Cloud OSS and the JindoFS big data cache acceleration service.

Integrating Real-time Search with SaaS-based Cloud Data Warehouses

This article discusses the integration of SaaS-based cloud data warehouses and real-time search, as shared by Meng Shuo, product manager of MaxCompute.

Setting Up PySpark on Alibaba Cloud CentOS Instance

This tutorial shows, step by step, how to set up PySpark on an Alibaba Cloud ECS instance running CentOS 7.x.

Building a Cloud-Native Feed Streaming System with Apache Kafka and Spark on Alibaba Cloud – Part A: Service Setup

In this 3-part blog series, we'll show you how to build a simple, intelligent, cloud-native feed streaming system with Apache Kafka and Spark on Alibaba Cloud.