Public datasets in MaxCompute - MaxCompute - Alibaba Cloud Documentation Center

After MaxCompute is activated, you can use MaxCompute SQL to query and analyze data in public datasets. This helps you quickly get started with MaxCompute. This topic describes the public datasets of MaxCompute and how to use MaxCompute SQL to query and analyze data in the public datasets.

Introduction

MaxCompute provides public datasets based on data categories such as GitHub public event data, national statistics data, TPC performance test data, digital business data, life service data, and financial stock data. All data is stored in different schemas in the public project BIGDATA_PUBLIC_DATASET in MaxCompute.

Category		Description	Dataset name	Schema name
GitHub public event data		A large number of developers develop open source projects on GitHub and generate a large number of events during the development process. GitHub records the information about each event, including the event type, event details, developer, and code repository. GitHub also exposes public events, such as the events of starring repositories and submitting code.	GitHub public event dataset	github_events
National statistics data		Includes annual gross domestic product (GDP) data for countries around the world and all provinces in the Chinese mainland.	National statistics dataset	national_data
TPC performance data	TPC-DS	TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. TPC-DS enables emerging technologies, such as big data systems, to perform benchmark tests.	TPC-DS 10-GB performance test dataset TPC-DS 100-GB performance test dataset TPC-DS 1-TB performance test dataset TPC-DS 10-TB performance test dataset	tpcds_10g tpcds_100g tpcds_1t tpcds_10t
	TPC-H	TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. TPC-H illustrates decision support systems that perform highly complex queries on large amounts of data and provide answers to critical business questions.	TPC-H 10-GB performance test dataset TPC-H 100-GB performance test dataset TPC-H 1-TB performance test dataset TPC-H 10-TB performance test dataset	tpch_10g tpch_100g tpch_1t tpch_10t
	TPCx-BB	TPCx-BB is a TPC Express benchmark, which is designed to measure the performance of Hadoop-based big data systems. TPCx-BB measures the performance of both hardware and software components by executing 30 frequently performed analytical queries.	TPCx-BB 10-GB performance test dataset TPCx-BB 100-GB performance test dataset TPCx-BB 1-TB performance test dataset TPCx-BB 10-TB performance test dataset	tpcbb_10g tpcbb_100g tpcbb_1t tpcbb_10t
Digital business data		Includes Taobao advertising data, Taobao shopping data, and e-commerce data of Alibaba Group.	Digital business dataset	commerce
Life service data		Includes data of pre-owned houses, movies and box office, mobile phone number attribution, and administrative, urban, and rural division code information.	Life service dataset	life_service
Financial stock data		Includes stock information.	Financial stock dataset	finance

Disclaimer

Data in the public datasets of MaxCompute is only for product testing. The data is not periodically updated and its accuracy is not ensured. Do not use the data in the production process.
TPC data in MaxCompute public datasets is generated and analyzed based on the TPC benchmark test. The test results differ from the released TPC benchmark test results. This is because the test that is performed based on MaxCompute public datasets does not meet all the requirements of the TPC benchmark test.
The TPC performance test data provided by MaxCompute is obtained from TPC. You can also generate TPC performance test data. For more information about how to generate TPC performance test data, see TPC documentation.

Precautions

Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:

All data of public datasets is stored in the BIGDATA_PUBLIC_DATASET project in MaxCompute. However, no users are added to this project as members. In this case, you must access the data across projects. When you write an SQL script, specify the project name and schema name before the table name. If you do not enable the tenant-level schema syntax, enable the session-level schema syntax before you execute a statement. Sample statements:
```
-- Enable the session-level schema syntax.
set odps.namespace.schema=true; 
-- Query 100 data records from the dwd_github_events_odps table.
select * from bigdata_public_dataset.github_events.dwd_github_events_odps where ds='2024-05-10' limit 100;
```
Important
You are not charged for the storage of the data in the public datasets. However, you are charged computing fees if you execute query statements. For more information, see Computing pricing.
You cannot find the tables in the public datasets on the Data Map page of DataWorks because cross-project access is required.
Public datasets are stored by schema. If you do not enable the tenant-level schema syntax, you cannot view the public datasets in DataWorks DataAnalysis. In this case, you can query the public datasets only by executing SQL statements.

Table details

The following content describes the details of tables in each schema in the public project BIGDATA_PUBLIC_DATASET.

GitHub public event data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	github_events
Supported regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)
Table name and description	A large number of developers develop open source projects on GitHub and generate a large number of events during the development process. GitHub records the information about each event, including the event type, event details, developer, and code repository. GitHub also exposes public events, such as the events of starring repositories and submitting code. For more information about event types, see GitHub event types. MaxCompute batch processes and develops large amounts of public event data that is provided by GH Archive and generates the following tables: dwd_github_events_odps: the fact table that stores GitHub public event data. dws_overview_by_repo_month: the aggregate table that stores monthly metric data of GitHub public events. Note Data in the tables is obtained from GH Archive.
Update cycle	dwd_github_events_odps: updated every T+1 hours. dws_overview_by_repo_month: updated every T+1 days.
Schema query	`-- Enable session-level schema syntax. set odps.namespace.schema=true; -- Query the schema of the dwd_github_events_odps table. If you want to query the schema of another table, replace the schema name and table name in the following statement. desc bigdata_public_dataset.github_events.dwd_github_events_odps;`
Query example	`-- Enable session-level schema syntax. SET odps.namespace.schema=true; -- Obtain the rankings of starred repositories in the previous year. Note that the situations such as unstarring are not considered in this example. SELECT repo_id, repo_name, COUNT(actor_login) total FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds>=date_add(getdate(), -365) AND type = 'WatchEvent' GROUP BY repo_id, repo_name ORDER BY total DESC LIMIT 10;`
For more information about data and query examples, see GitHub public event data.

National statistics data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	national_data
Supported regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)
Table name and description	annual_gdp_by_province: stores the annual GDP data of each province in the Chinese mainland. annual_gdp_by_country: stores the annual GDP data of each country around the world. Note Data in the annual_gdp_by_province table is obtained from National Bureau of Statistics of China and data in the annual_gdp_by_country table is obtained from International Monetary Fund (IMF).
Update cycle	Fixed data is provided and is not updated.
Schema query	`-- Enable session-level schema syntax. set odps.namespace.schema=true; -- Query the schema of the annual_gdp_by_province table. If you want to query the schema of another table, replace the schema name and table name in the following statement. desc bigdata_public_dataset.national_data.annual_gdp_by_province;`
Query example	`-- Enable session-level schema syntax. SET odps.namespace.schema=true; -- Query the GDP change trend of Beijing over the past 20 years. SELECT region, gdp, year FROM bigdata_public_dataset.national_data.annual_gdp_by_province WHERE region='Beijing' ORDER BY year ASC LIMIT 20;`

TPC-DS data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	tpcds_10g, tpcds_100g, tpcds_1t, and tpcds_10t
Supported regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Singapore, Germany (Frankfurt), Indonesia (Jakarta), US (Silicon Valley), US (Virginia), Malaysia (Kuala Lumpur), India (Mumbai), UK (London), UAE (Dubai), Australia (Sydney), and Japan (Tokyo)
Table name and description	The TPC-DS model simulates the sales system of a large-scale national chain retailer. The sales system involves three sales channels: store (brick-and-mortar store), web (online store), and catalog (telephone order). Each channel uses one table to store sales records, another table to store return records, and multiple dimension tables to store information such as product information, promotion information, and user information. Table details: call_center: stores the information about the customer service center. catalog_page: stores the information about catalogs. catalog_returns: stores the return records of products sold by using telephone orders. catalog_sales: stores the sales records of products sold by using telephone orders. customer: stores the customer information. customer_address: stores the customer address information. customer_demographics: stores the basic credit information about customers. date_dim: stores the time dimension information. household_demographics: stores the basic household credit information. income_band: stores the revenue information. inventory: stores the inventory information. item: stores the product information. promotion: stores the product promotion information. reason: stores the reasons for sales returns. ship_mode: stores the product shipment information. store: stores the merchant information. store_returns: stores the product return records for store sales. store_sales: stores the product sales records for store sales. time_dim: stores the time dimension information. warehouse: stores the warehouse information. web_page: stores the product web page information. web_returns: stores the product return records for web sales. web_sales: stores the product sales records for web sales. web_site: stores the basic information about the website. Note The data in the tables is obtained from TPC.
Update cycle	Fixed data is provided and is not updated.
Schema query	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the schema of the call_center table in tpcds_10g. If you want to query a table schema from another dataset, replace the schema name and table name in the following statement. DESC bigdata_public_dataset.tpcds_10g.call_center;`
Query example	SET odps.namespace.schema=TRUE; SELECT dt.d_year , item.i_brand_id brand_id , item.i_brand brand , SUM(ss_sales_price) sum_agg FROM bigdata_public_dataset.tpcds_10g.date_dim dt , bigdata_public_dataset.tpcds_10g.store_sales , bigdata_public_dataset.tpcds_10g.item WHERE dt.d_date_sk = store_sales.ss_sold_date_sk AND store_sales.ss_item_sk = item.i_item_sk AND item.i_manufact_id = 190 AND dt.d_moy = 12 GROUP BY dt.d_year , item.i_brand , item.i_brand_id ORDER BY dt.d_year, sum_agg DESC, brand_id LIMIT 100;
For more query sample files of different data specifications, see TPC-DS data. For more information about data, see TPC Benchmark DS Standard Specification.

TPC-H data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	tpch_10g, tpch_100g, tpch_1t, and tpch_10t
Supported regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Singapore, Germany (Frankfurt), Indonesia (Jakarta), US (Silicon Valley), US (Virginia), Malaysia (Kuala Lumpur), India (Mumbai), UK (London), UAE (Dubai), Australia (Sydney), and Japan (Tokyo)
Table name and description	TPC-H is a benchmark that is used to evaluate online analysis and processing. TPC-H data simulates business behavior between a provider and a buyer. TPC-H data contains information such as order information, product information, and user information. Table details: customer: stores the consumer information. lineitem: stores the online product information. nation: stores the country information. orders: stores the order information. part: stores the part information. partsupp: stores the supplier part information. region: stores the region information. supplier: stores the supplier information. Note The data in the tables is obtained from TPC.
Update cycle	Fixed data is provided and is not updated.
Schema query	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the schema of the lineitem table in tpch_10g. If you want to query a table schema from another dataset, replace the schema name and table name in the following statement. DESC bigdata_public_dataset.tpch_10g.lineitem;`
Query example	SET odps.namespace.schema=TRUE; SET odps.sql.validate.orderby.limit=FALSE; SET odps.sql.hive.compatible=TRUE; SELECT l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty, sum(l_extendedprice) AS sum_base_price, sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, avg(l_quantity) AS avg_qty, avg(l_extendedprice) AS avg_price, avg(l_discount) AS avg_disc, count(*) AS count_order FROM bigdata_public_dataset.tpch_10g.lineitem WHERE l_shipdate <= date'1998-12-01' - interval '90' DAY GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;
For more information about data and sample queries, see TPC Benchmark H Standard Specification.

TPCx-BB data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, and tpcxbb_10t
Supported regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Singapore, Germany (Frankfurt), Indonesia (Jakarta), US (Silicon Valley), US (Virginia), Malaysia (Kuala Lumpur), India (Mumbai), UK (London), UAE (Dubai), Australia (Sydney), and Japan (Tokyo)
Table name and description	TPCx-BB is a big data benchmark test tool that simulates an online retail scenario. TPCx-BB data includes sales records, return records, product information, and promotion information. Table details: customer: stores the customer information. customer_address: stores the customer address information. customer_demographics: stores the basic credit information about customers. date_dim: stores the time dimension information. household_demographics: stores the basic household credit information. income_band: stores the revenue information. inventory: stores the inventory information. item: stores the product information. item_marketprices: stores the product price information of competitors. product_reviews: stores the product review information. promotion: stores the product promotion information. reason: stores the reasons for sales returns. ship_mode: stores the product shipment information. store: stores the store information. store_returns: stores the product return records for store sales. store_sales: stores the product sales records for store sales. time_dim: stores the time dimension information. warehouse: stores the warehouse information. web_clickstreams: stores the web page click information. web_page: stores the product web page information. web_returns: stores the product return records for web sales. web_sales: stores the product sales records for web sales. web_site: stores the product website information. Note The data in the tables is obtained from TPC.
Update cycle	Fixed data is provided and is not updated.
Schema query	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the schema of the web_sales table in tpcxbb_10g. If you want to query a table schema from another dataset, replace the schema name and table name in the following statement. DESC bigdata_public_dataset.tpcxbb_10g.web_sales;`
Query example	`SET odps.namespace.schema=TRUE; select * FROM bigdata_public_dataset.tpcxbb_10g.web_sales limit 100;`
For more information about data and query examples, see TPCx-BB Standard Specification.

Digital business data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	commerce
Supported regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)
Table name and description	adv_raw_sample: stores the raw data that is randomly sampled from the logs for advertisement clicks of more than 1 million users within 8 days. adv_ad_feature: stores the basic information about some advertisements in the raw_sample table. user_profile: stores the basic information about all users in the raw_sample table. behavior_log: stores the shopping behavior of all users in the raw_sample table within 22 days. The behavior includes browsing, adding to the shopping cart, adding to favorites, and purchasing. Note The data in the tables is obtained from Tianchi Lab - Ad Display/Click Data on Taobao.com.
Update cycle	Fixed data is provided and is no longer incrementally updated.
Schema query	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the schema of the behavior_log table. If you want to query the schema of another table, replace the table name in the following statement. DESC bigdata_public_dataset.commerce.behavior_log;`
Query example	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the IDs of the top 3 product categories in the behavior_log table within 22 days based on product sales. SELECT cate, count(btag) sales FROM behavior_log WHERE btag='buy' GROUP BY cate ORDER BY sales DESC LIMIT 3;`

Life service data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	life_service
Supported regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)
Table name and description	movie_basic_info: stores the basic information about movies. movie_box: stores the basic information about box office. areacode_basic_info_2020: stores the basic information about administrative, urban, and rural division codes in 2020. phoneno_basic_info_2020: stores the basic information about mobile phone number attribution in 2020.
Update cycle	movie_basic_info and movie_box: Data in date-specific partitions is provided and is no longer incrementally updated. areacode_basic_info_2020 and phoneno_basic_info_2020: Fixed data is provided and is no longer incrementally updated.
Schema query	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the schema of the movie_box table. If you want to query the schema of another table, replace the table name in the following statement. DESC bigdata_public_dataset.life_service.movie_box;`
Query example	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the names of the top 10 most popular movies on January 14, 2017 based on the box office. SELECT moviename FROM bigdata_public_dataset.life_service.movie_box WHERE ds ='20170114' ORDER BY rank ASC LIMIT 10;`

Financial stock data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	finance
Supported regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), and China (Chengdu)
Table name and description	ods_enterprise_share_basic: stores the basic stock information. ods_enterprise_share_quarter_cashflow: stores the quarterly cash flow information. ods_enterprise_share_quarter_growth: stores the quarterly business growth data. ods_enterprise_share_quarter_operation: stores the quarterly financial turnover data. ods_enterprise_share_quarter_profit: stores the quarterly profit information. ods_enterprise_share_quarter_report: stores the quarterly report information. ods_enterprise_share_trade_h: stores the stock price information.
Update cycle	Data in date-specific partitions is provided and is no longer incrementally updated.
Schema query	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the schema of the ods_enterprise_share_basic table. If you want to query the schema of another table, replace the table name in the following statement. DESC bigdata_public_dataset.finance.ods_enterprise_share_basic;`
Query example	`-- Enable session-level schema syntax. SET odps.namespace.schema=TRUE; -- Query the basic stock information on January 14, 2017. SELECT * FROM bigdata_public_dataset.finance.ods_enterprise_share_basic WHERE ds ='20170114' LIMIT 10;`

Use public datasets

Prerequisites

MaxCompute is activated, and a MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.

Supported tools or platforms

Procedure (use a DataWorks ODPS SQL node)

Log on to the DataWorks console and create a workspace. For more information about how to create a workspace, see Create a workspace.
Associate a MaxCompute compute engine with the workspace. For more information, see Add a data source or register a cluster to a workspace.

Create an ODPS SQL node and enter the following SQL statements. For more information, see Create an ODPS SQL node.

-- Query the GDP change trend of each province in the Chinese mainland over the past 20 years. 
SET odps.namespace.schema=true; 
SET odps.sql.validate.orderby.limit = false;
SELECT
    region,
    gdp,
    year
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province
ORDER BY
    year ASC;

Click , and view the output result.

Note

The MAXCOMPUTE_PUBLIC_DATA project described in Public dataset reference is no longer maintained or updated. You can continue to use the public datasets of the project based on your business requirements.