What is the best way to learn big data technologies?

Answer by Brent Bai:

I have to say, my big data career started with "small" data.
You need real big data in hand to understand why these technologies were designed.
Most big data frameworks are slower than a centralised solution when the data is only hundreds of gigabytes.

Big data is an expensive toy.


What aspect of Chinese characteristics contributed to its huge population throughout history?

Interesting question and answers

Answer by Andy Lee Chaisiri:

Chinese technology was 1,000+ years ahead of everyone else

Like this, but with horses and rice.

Imagine if today's crops suddenly became 30x more productive, that would cause a population boom, right? Agriculture is how human populations exploded in size compared to hunter-gatherer civilizations. So let's talk about some of those tools of agriculture and how population booms were achieved in an era of horse and plow:

Seed Drill: "What if we planted the seeds under the soil?"

Seed drills are tools that bury seeds at a correct depth in a timely manner. Planting seeds at a good depth increases the chances of an individual seed sprouting, without being eaten by birds. The use of seed drills also allows for planting in nice orderly rows with good spacing so the sprouting plants have enough room to draw nutrients from the soil without mutually starving each other. Not every grain will germinate, but using seed drills to plant crops in rows increases the chances of any individual grain germinating. This allows you to eat more grains because you know only a small quantity is needed to replant fields.

Chinese were using metal multi-tubed seed drills as early as 200BC. Seed drills make an appearance in Europe in 1566AD, about 1700 years after their appearance in China. As for how they were planting seeds before that…

Limbourg Brothers for the Duc de Berry (ca. 1415), 'Les Très Riches Heures'

You had a guy with a bag of seeds planting them by hand, then another guy raking over the earth to cover them. That method leaves a lot of seeds exposed to be eaten by birds, or planted too shallow to germinate. The crops that do germinate will be competing with other plants that are growing too close to them, and weeding the fields becomes very difficult, if not a waste of time. Out of the grains you wind up harvesting, a larger amount has to be set aside for future planting, thus fewer are eaten.

Compared to this hand planting method, using a seed drill to plant crops in rows is 10x-30x more efficient in terms of how much grain you can harvest vs needing to save them for the next planting.

Iron Mouldboard Plough: "Metal cuts better than wood?"

Imagine a plough. You'd probably think of something made out of metal (perhaps with a wedge), right? Well, ploughs weren't always like that. The earliest ploughs in human history were basically a plank of wood that you knifed into the ground. Around 300BC, Chinese started using ploughs that were shaped so that they simultaneously cut into the earth and turned it. By 100CE, they were made entirely out of iron. Turning the earth is important for getting more nutrients out of your land, and can even turn 'barren' land fertile.

Around 400AD, a similar mouldboard plow appears in the Roman empire, but widespread adoption is delayed with the fall of the empire. In 1700AD Dutch traders brought Chinese iron mouldboard plows back to Europe, and an agricultural revolution soon followed. Now, what was plowing like without an iron mouldboard plough?

A painting from the 16th century showing a farmer at work, by Pieter Bruegel

That is a piece of wood being used to slice into the ground. Because that wooden plough doesn't have a mouldboard, the cut soil needs to be tilled through further labor. Iron was expensive and labor intensive to produce, so at best you would have a thin sheet of iron covering the edge of your mostly wooden plough.

So, why did Chinese have all of these iron agricultural tools centuries earlier than Europeans? Because their methods of iron (& steel) production were also centuries ahead.

Blast Furnace: "Like baking a sponge cake made of iron"

The Iron Age is considered to have begun around 1700-1500BC. To extract iron from an ore of iron oxide, the iron has to be separated from oxygen and other impurities in a high temperature process which takes carbon to extract the oxygen out of the ore as carbon dioxide. This is called 'smelting'.

The earliest smelting of iron ore was done at temperatures below the melting point of iron. This left a spongy mass of iron that needed to be shaped by hammering, a very work intensive process.

But some time around 600BC, Chinese developed a furnace that could create a heat intense enough to melt iron, the blast furnace. Once liquified, iron could be poured into casts already in the shape of tools that were needed. The iron casting industry was officially supported by dynastic governments, leading to widespread adoption of iron tools made to a standard.

Now a special note about the difference between iron and steel. Cast iron is very high in carbon content, making it hard but brittle. Steel is iron that has a perfect balance of carbon to retain an edge but also maintain just enough flexibility to avoid brittleness. Around 200BC, Chinese learned that if air was blown over iron as it was being cast the carbon content could be reduced and what you wound up with was steel. Around 600AD steel tools began to widely replace iron ones.

The earliest evidence of blast furnaces in Europe is 1100AD, with widespread adoption occurring in 1400AD. The process of creating steel I described above first appears in the western world in 1855, and there's some contention that the 'inventor' may have actually gotten the idea from Chinese workers in the US.

As another illustration of the difference in iron production, by 1078AD the foundries of northern China could produce 114,000 tons of iron a year. In 1788AD, England produced about 50,000 tons of iron.

Horse Collar: "Over 1,000 years of choking horses"

Imagine a horse pulling a plough. Now, how did you imagine that plough being attached to the horse? With a horse collar, right? Unfortunately for horses, before the collar was invented there was the throat-girth harness, which is as awful as it sounds. A plough (or any other load) attached by a throat-girth harness means that a horse is basically pulling with a noose around its trachea. Around 300BC, someone in China thought "What if the horse pulled with its chest instead of its throat?" and so the breast-strap harness was born and horses across China breathed a sigh of relief. This was improved on in 500AD with the horse collar as we know it.

The breast strap harness appears in Russia in 700AD, and shows up further west in Norway around 800AD. The horse collar appears a bit later in 900AD, with widespread adoption by 1200AD.

The difference between China and Europe's population levels throughout history is the difference between their agricultural technology. China had time-saving, force-multiplying tools (that didn't strangle horses) for centuries, even millennia, before their adoption in Europe.









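The service back then was called ActiveLog. It recorded one line per page view (PV), in a format very similar to the Apache Combined Log, and we collected the web servers' logs centrally on a single server (yes, half a year before Facebook open-sourced Scribe [2]). To solve the storage space problem we introduced Hadoop and spread the data across several hundred servers. With that same layout, a MapReduce job that used too much switch bandwidth would choke the production cluster.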







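How much data dares to call itself big data? Wikipedia has this sentence: As of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.[3]

In 2008, Renren's log data came to 6 TB a month, so let's call it half a dozen. Occasionally we also had to process a whole year of data.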




Big data applications involve three things: 1) Distributed/Parallel computing 2) Data mining 3) Business Intelligence










[1] http://wiki.apache.org/hadoop/Books
[2] http://en.wikipedia.org/wiki/Scribe_(log_server)
[3] http://en.wikipedia.org/wiki/Big_data

Practices of using MySQL and DBPool



Managing hundreds of MySQL databases in a large cluster environment is always a challenge:

  • Planned downtime for database server maintenance
  • Scalability
  • Change of data structures

Ideally, these are all handled by a proper development process, with enough software engineers to support it.
But in the real world, solving the problem at hand is more urgent.

Here are the 9 best practices we use to operate thousands of app servers and databases without any “user impact” from database maintenance.

Database design:

1 Global design

We design our database schema like a one-node cluster, so we can run all the services on one box or on 1000 boxes.
This is transparent to software engineers.

2 Simple only

Use only basic MySQL features: table, primary key, index, replication.

Application design:

3 Use the app server’s CPU

App servers are scalable, but MySQL is a bottleneck. Use as much application CPU as possible, which means:
Do not query without an index
Do not sort in the database
Do not use any query that causes temporary tables
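
As a minimal sketch of sorting on the app server rather than in MySQL: the rows below stand in for the result of an indexed lookup, and the table, the column names and the newest_first helper are all hypothetical, not part of DBPool.

```python
from operator import itemgetter


def newest_first(rows, limit=10):
    """Sort already-fetched rows on the app server and keep the newest ones."""
    return sorted(rows, key=itemgetter("created_at"), reverse=True)[:limit]


# Rows as they might come back from an indexed query without ORDER BY, e.g.
#   SELECT id, created_at FROM post WHERE user_id = %s
# (hypothetical table and columns).
rows = [
    {"id": 1, "created_at": 100},
    {"id": 2, "created_at": 300},
    {"id": 3, "created_at": 200},
]

top = newest_first(rows, limit=2)
print([row["id"] for row in top])
```

The database only does the cheap indexed lookup, and the sort runs on a tier that can be scaled by adding boxes.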

4 Use an abstraction layer over tables

We have used DBPool for a long time.

Ops workflow:

5 Vertical partitioning

When you want to move a few tables to a different master:
a. Set up the new master (B) as a slave of the old master (A).
b. Change the DBPool configuration to point the master to (B).
c. During the switchover, (B) will still receive some updates from old clients, through replication from (A).
d. Make sure no queries are still reaching those tables on (A).
e. Stop the replication and optionally drop the old tables on (A).

6 Horizontal sharding

When you want to distribute the data of one table across more physical servers:
a. Estimate how many shards are needed and find a proper sharding key. After sharding, a request should need to visit only one instance.
b. A good shard count may be 10 or 100. It is human friendly when debugging.
c. You don’t need 100 physical servers to deploy all the tables; DBPool is able to route.
d. Use the same method as shown in (5) to move tables to a new master.
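
The routing idea in (a) to (c) can be sketched as follows. This is only an illustration under assumptions: the host names, the table naming scheme and the helper functions are hypothetical, and DBPool's real configuration is not shown in this post.

```python
SHARD_COUNT = 100  # a human-friendly shard count, as suggested in (b)

# Hypothetical physical hosts; far fewer than the number of logical shards.
PHYSICAL_HOSTS = ["db-a.example", "db-b.example", "db-c.example", "db-d.example"]


def shard_of(user_id):
    """Map the sharding key to a logical shard number in 0..SHARD_COUNT-1."""
    return user_id % SHARD_COUNT


def table_name(base, user_id):
    """Logical table for this key, e.g. feed_67 for user 1234567."""
    return f"{base}_{shard_of(user_id):02d}"


def host_of(user_id):
    """Spread the logical shards evenly over the physical hosts."""
    return PHYSICAL_HOSTS[shard_of(user_id) * len(PHYSICAL_HOSTS) // SHARD_COUNT]


print(table_name("feed", 1234567), host_of(1234567))
```

With a decimal shard count, the shard can be read straight off the last digits of the key, which is what makes it easy to debug by hand. Moving a shard to another box only changes the mapping in host_of, not the table names.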

7 Change of data structures

a. We only add columns; we never drop columns.
b. Application-level compatibility is required. Make sure the new code works with both old and new data. (If that is impossible, see (8).)
c. Make the changes.
d. Update the application to use the new column for the new feature.
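
Step (b) can be illustrated with a small sketch. The nickname column and the display_name helper are hypothetical; the point is only that the new code tolerates rows read both before and after the column is added.

```python
def display_name(row):
    """Prefer the new column when it is present and filled, else fall back."""
    nickname = row.get("nickname")  # absent before the ALTER, NULL right after it
    return nickname if nickname else row["name"]


old_row = {"id": 1, "name": "Brent"}                       # schema before the change
unfilled_row = {"id": 2, "name": "Andy", "nickname": None}  # column added, not yet set
new_row = {"id": 3, "name": "Andy", "nickname": "al"}      # column in use

print(display_name(old_row), display_name(unfilled_row), display_name(new_row))
```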

8 Data migration

This situation always involves a big change to the logic, so you need to redesign the structure.
a. Create a new master (B) for the tables using (5).
b. Create the new data structures on (B).
c. Create a trigger on (B) to update the new structure whenever the old data changes.
d. Migrate your old data into the new structure, keeping in mind that (c) has already moved some recent data.
e. Create a new abstract instance in DBPool for the new structures.
f. Update the application to use the new structure for reading.
g. In the meantime, old and new clients see the same data because of (c).
h. Update the application to use the new structure for writing.
i. Stop the replication from (a), remove the trigger from (c), and drop the old tables.
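
Steps (b) to (d) can be tried out in miniature. The sketch below uses Python's built-in sqlite3 module as a stand-in, so the trigger syntax is SQLite's rather than MySQL's, and the user_old and user_new tables are hypothetical. It only shows the ordering: install the trigger first, then backfill without overwriting what the trigger has already moved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Old structure, with one row that exists before the trigger is installed.
cur.execute("CREATE TABLE user_old (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO user_old (id, name) VALUES (1, 'alice')")

# Step (b): the new structure.
cur.execute("CREATE TABLE user_new (id INTEGER PRIMARY KEY, display_name TEXT)")

# Step (c): triggers that mirror every later change into the new structure.
cur.executescript("""
CREATE TRIGGER user_old_ai AFTER INSERT ON user_old BEGIN
    INSERT OR REPLACE INTO user_new (id, display_name) VALUES (NEW.id, NEW.name);
END;
CREATE TRIGGER user_old_au AFTER UPDATE ON user_old BEGIN
    INSERT OR REPLACE INTO user_new (id, display_name) VALUES (NEW.id, NEW.name);
END;
""")

# Live traffic from old clients keeps arriving while the migration runs.
cur.execute("INSERT INTO user_old (id, name) VALUES (2, 'bob')")
cur.execute("UPDATE user_old SET name = 'bobby' WHERE id = 2")

# Step (d): backfill the rest; OR IGNORE skips rows the trigger already moved.
cur.execute(
    "INSERT OR IGNORE INTO user_new (id, display_name) "
    "SELECT id, name FROM user_old"
)
conn.commit()

migrated = cur.execute(
    "SELECT id, display_name FROM user_new ORDER BY id"
).fetchall()
print(migrated)
```

Row 1 arrives through the backfill and row 2 through the trigger, so readers of the new structure see the same data as writers of the old one.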

9 Planned maintenance

a. DBPool alone is enough to move MySQL slave servers.
b. Use (5) to make a new master, or promote one slave to master.

Free Website Hosting (PaaS)

Update 2015-Mar-06: The Appfog environment ships with an outdated DBCP package, which is a real headache.


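  • Supports 3 nodes of 512 MB each
    • You can create 2 apps and 1 MySQL instance;
    • Both apps can auto scale up
  • SSH login is available;
  • Load balancing uses HAProxy, which is not billed;
  • Custom domains are supported;
  • The default domain rhcloud.com supports SSL;
  • I registered a Bronze account with a credit card. If you keep usage under control it can still be free, and it adds SSL support for custom domains.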


OpenSSL and cURL for iOS


I have updated the dependency libraries of my iPhone app, OpenSSL and cURL.
Added support for the new 64-bit arm64 CPU of the iPhone 5s.
Upgraded to the latest versions.



The key to cross compiling is two parameters: -isysroot and -miphoneos-version-min
cURL is configured using parameters.
OpenSSL has a built-in target called iphoneos-cross, and I added 64-bit support based on it.


The code for both projects is on GitHub, tested with iOS SDK 7.1 on Mac OS X 10.8.


There is no need to git clone the scripts. Download the OpenSSL and cURL sources from their official websites and unpack them.
Then run the following scripts straight from GitHub; the compiled results will be placed on the desktop.

curl -O http://www.openssl.org/source/openssl-1.0.1f.tar.gz
tar xf openssl-1.0.1f.tar.gz
cd openssl-1.0.1f
curl https://raw.githubusercontent.com/sinofool/build-openssl-ios/master/build_openssl_dist.sh |bash


curl -O http://curl.haxx.se/download/curl-7.35.0.tar.gz
tar xf curl-7.35.0.tar.gz
cd curl-7.35.0
curl https://raw.githubusercontent.com/sinofool/build-libcurl-ios/master/build_libcurl_dist.sh |bash

new Life(location)