See who is using Hadoop, and what they are using it for!
[ 2011/1/26 20:58:00 | By: 梦翔儿 ] |
A
B
-
BabaCar
- 4-node cluster (32 cores, 1TB).
- We use Hadoop for searching and analysis of millions of rental bookings.
-
backdocsearch.com - search engine for chiropractic information, local chiropractors, products and schools
-
Baidu - the leading Chinese language search engine
- Hadoop is used to analyze search logs and to do some mining work on the web page database
- We handle about 3000TB per week
- Our clusters vary from 10 to 500 nodes
- Hypertable is also supported by Baidu
-
Beebler
- 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
- We use hadoop for matching dating profiles
-
Benipal Technologies - Outsourcing, Consulting, Innovation
-
Bixo Labs - Elastic web mining
-
BrainPad - Data mining and analysis
- We use Hadoop to summarize users' tracking data and for analysis.
C
-
Contextweb - Ad Exchange
- We use Hadoop to store ad serving logs and use it as a source for ad optimizations, analytics, reporting and machine learning.
- Currently we have a 50 machine cluster with 400 cores and about 140TB raw storage. Each (commodity) node has 8 cores and 16GB of RAM.
-
Cooliris - Cooliris transforms your browser into a lightning fast, cinematic way to browse photos and videos, both online and on your hard drive.
- We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB ram, and 3-4 TB of storage.
- We use Hadoop for all of our analytics, and we use Pig to allow PMs and non-engineers the freedom to query the data in an ad-hoc manner.
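The ad-hoc analytics pattern Cooliris describes (Pig queries written by PMs and non-engineers) is a common Hadoop use case. Purely as an illustrative sketch, and not Cooliris's actual jobs, the Python snippet below expresses a comparable aggregation (views per asset per day) with the mrjob library; the log format, field positions, and job name are all assumptions.

```python
# Hypothetical ad-hoc aggregation of the kind often written in Pig:
# count "view" events per (asset, day). Input is assumed to be
# tab-separated lines of the form: timestamp<TAB>asset_id<TAB>event_type
from mrjob.job import MRJob


class ViewsPerAssetPerDay(MRJob):
    def mapper(self, _, line):
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "view":
            day = fields[0][:10]              # e.g. "2011-01-26"
            yield (fields[1], day), 1

    def reducer(self, key, counts):
        yield key, sum(counts)


if __name__ == "__main__":
    ViewsPerAssetPerDay.run()
```

Run locally with "python views_per_asset.py access_log.tsv", or against a Hadoop cluster by adding the -r hadoop switch.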
-
Cornell University Web Lab
- Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)
-
CRS4
D
E
-
EBay
-
Enet, 'Eleftherotypia' newspaper, Greece
- Experimental installation - storage for logs and digital assets
- Currently a 5-node cluster
- Using hadoop for log analysis/data mining/machine learning
-
Enormo
- 4-node cluster (32 cores, 1TB).
- We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
- We plan to use Pig very shortly to produce statistics.
-
ESPOL University (Escuela Superior Politécnica del Litoral) in Guayaquil, Ecuador
- 4-node proof-of-concept cluster.
- We use Hadoop in a Data-Intensive Computing capstone course. The course projects cover topics like information retrieval, machine learning, social network analysis, business intelligence, and network security.
- The students use on-demand clusters launched using Amazon's EC2 and EMR services, thanks to the AWS in Education program.
-
ETH Zurich Systems Group
-
Eyealike - Visual Media Search Platform
- Facial similarity and recognition across large datasets.
- Image content based advertising and auto-tagging for social media.
- Image based video copyright protection.
F
G
H
I
J
K
L
-
Last.fm
- 44 nodes
- Dual quad-core Xeon L5520 (Nehalem) @ 2.27GHz, 16GB RAM, 4TB/node storage.
- Used for charts calculation, log analysis, A/B testing
-
Legolas Media
-
Lineberger Comprehensive Cancer Center - Bioinformatics Group
- This is the cancer center at UNC Chapel Hill. We are using Hadoop/HBase for databasing and analyzing Next Generation Sequencing (NGS) data produced for the Cancer Genome Atlas (TCGA) project and other groups. This development is based on the SeqWare open source project, which includes SeqWare Query Engine, a database and web service built on top of HBase that stores sequence data types.
- Our prototype cluster includes:
-
LinkedIn
- We have multiple grids divided up based upon purpose. They are composed of the following types of hardware:
- 100 Nehalem-based nodes, with 2x4 cores, 24GB RAM, 8x1TB storage using ZFS in a JBOD configuration on Solaris.
- 120 Westmere-based nodes, with 2x4 cores, 24GB RAM, 6x2TB storage using ext4 in a JBOD configuration on CentOS 5.5
- We use Hadoop and Pig for discovering People You May Know and other fun facts.
-
Lookery
- We use Hadoop to process clickstream and demographic data in order to create web analytic reports.
- Our cluster runs across Amazon's EC2 webservice and makes use of the streaming module to use Python for most operations.
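As a minimal sketch of the Hadoop Streaming + Python setup described above (the tab-separated clickstream layout and script name are assumptions, not Lookery's actual code), a single script can serve as both mapper and reducer:

```python
#!/usr/bin/env python
# clickcount.py - counts clickstream hits per URL via Hadoop Streaming.
# Assumed input layout: timestamp<TAB>url<TAB>...  (one record per line)
import sys


def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print("%s\t1" % fields[1])        # emit (url, 1)


def reducer():
    # Streaming delivers keys sorted, so identical URLs arrive contiguously.
    current, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t", 1)
        if url != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = url, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A typical invocation passes the script twice to the streaming jar, e.g. hadoop jar hadoop-streaming.jar -input clicks/ -output counts/ -mapper "clickcount.py map" -reducer "clickcount.py reduce" -file clickcount.py.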
-
Lotame
- Using Hadoop and HBase for storage, log analysis, and pattern discovery/analysis.
M
N
-
NAVTEQ Media Solutions
- We use Hadoop/Mahout to process user interactions with advertisements to optimize ad selection.
-
Neptune
- Another Bigtable clone project using Hadoop to store large structured data sets.
- 200 nodes (each node has: 2 dual-core CPUs, 2TB storage, 4GB RAM)
-
NetSeer
-
The New York Times
-
Ning
- We use Hadoop to store and process our log files
- We rely on Apache Pig for reporting and analytics, Cascading for machine learning, and a proprietary JavaScript API for ad-hoc queries
- We use commodity hardware, with 8 cores and 16 GB of RAM per machine
O
P
-
PARC - Used Hadoop to analyze Wikipedia conflicts (paper).
-
Performable - Web Analytics Software
- We use Hadoop to process web clickstream, marketing, CRM, & email data in order to create multi-channel analytic reports.
- Our cluster runs on Amazon's EC2 webservice and makes use of Python for most of our codebase.
-
Pharm2Phork Project - Agricultural Traceability
- Using Hadoop on EC2 to process observation messages generated by RFID/Barcode readers as items move through supply chain.
- Analysis of BPEL generated log files for monitoring and tuning of workflow processes.
-
Powerset / Microsoft - Natural Language Search
-
Pressflip - Personalized Persistent Search
- Using Hadoop on EC2 to process documents from a continuous web crawl and distributed training of support vector machines
- Using HDFS for large archival data storage
-
Pronux
- 4-node cluster (32 cores, 1TB).
- We use Hadoop for searching and analysis of millions of bookkeeping postings
- Also used as a proof of concept cluster for a cloud based ERP system
-
PSG Tech, Coimbatore, India
- Multiple alignment of protein sequences helps to determine evolutionary linkages and to predict molecular structures. The dynamic nature of the algorithm, coupled with the data and compute parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. Parallelism at the sequence and block level reduces the time complexity of MSA problems. The scalable nature of Hadoop makes it apt for solving large-scale alignment problems.
- Our cluster size varies from 5 to 10 nodes. Cluster nodes range from 2950 quad-core rack servers with 2x6MB cache and 4x500GB SATA hard drives to E7200/E7400 processors with 4GB RAM and 160GB HDD.
Q
-
Quantcast
- 3000 cores, 3500TB. 1PB+ processing each day.
- Hadoop scheduler with fully custom data path / sorter
- Significant contributions to KFS filesystem
R
S
-
SARA, Netherlands
- SARA has initiated a Proof-of-Concept project to evaluate the Hadoop software stack for scientific use.
-
Search Wikia
- A project to help develop open source social search tools. We run a 125-node Hadoop cluster.
-
SEDNS - Security Enhanced DNS Group
-
SLC Security Services LLC
-
Sling Media
-
Socialmedia.com
- 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
- We use hadoop to process log data and perform on-demand analytics
-
Spadac.com
- We are developing the MrGeo (Map/Reduce Geospatial) application to allow our users to bring cloud computing to geospatial processing.
- We use HDFS and MapReduce to store, process, and index geospatial imagery and vector data.
- MrGeo is soon to be open sourced as well.
-
Stampede Data Solutions (Stampedehost.com)
- Hosted Hadoop data warehouse solution provider
T
-
Taragana - Web 2.0 Product development and outsourcing services
- We are using 16 consumer-grade computers to create the cluster, connected by a 100 Mbps network.
- Used for testing ideas for blog and other data mining.
-
The Lydia News Analysis Project - Stony Brook University
- We are using Hadoop on 17-node and 103-node clusters of dual-core nodes to process and extract statistics from over 1000 U.S. daily newspapers as well as historical archives of the New York Times and other sources.
-
Tailsweep - Ad network for blogs and social media
- 8 node cluster (Xeon Quad Core 2.4GHz, 8GB RAM, 500GB/node Raid 1 storage)
- Used as a proof of concept cluster
- Handles tasks such as data mining and blog crawling
-
Technical analysis and Stock Research
- Generating stock analysis on 23 nodes (dual 2.4GHz Xeon, 2 GB RAM, 36GB Hard Drive)
-
Telefonica Research
- We use Hadoop in our data mining and user modeling, multimedia, and internet research groups.
- 6 node cluster with 96 total cores, 8GB RAM and 2 TB storage per machine.
-
Tianya
- We use Hadoop for log analysis.
-
Twitter
- We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera's CDH2 distribution of Hadoop, and store all data as compressed LZO files.
- We use both Scala and Java to access Hadoop's MapReduce APIs
- We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
- We employ committers on Pig, Avro, Hive, and Cassandra, and contribute much of our internal Hadoop work to open source (see hadoop-lzo)
- For more on our use of Hadoop, see the following presentations: Hadoop and Pig at Twitter and Protocol Buffers and Hadoop at Twitter
-
Tynt
- We use Hadoop to assemble web publishers' summaries of what users are copying from their websites, and to analyze user engagement on the web.
- We use Pig and custom Java map-reduce code, as well as Chukwa.
- We have 94 nodes (752 cores) in our clusters, as of July 2010, but the number grows regularly.
U
-
Universidad Distrital Francisco Jose de Caldas (Grupo GICOGE / Grupo Linux UD GLUD / Grupo GIGA)
- 5 node low-profile cluster. We use Hadoop to support the research project: Territorial Intelligence System of Bogota City.
-
University of Glasgow - Terrier Team
- 30-node cluster (Xeon Quad Core 2.4GHz, 4GB RAM, 1TB/node storage).
- We use Hadoop to facilitate information retrieval research & experimentation, particularly for TREC, using the Terrier IR platform. The open source release of Terrier includes large-scale distributed indexing using Hadoop MapReduce.
-
University of Maryland
- We are one of six universities participating in IBM/Google's academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.
-
University of Nebraska Lincoln, Research Computing Facility
- We currently run one medium-sized Hadoop cluster (200TB) to store and serve up physics data for the computing portion of the Compact Muon Solenoid (CMS) experiment. This requires a filesystem which can download data at multiple Gbps and process data at an even higher rate locally. Additionally, several of our students are involved in research projects on Hadoop.
V
-
Veoh
- We use a small Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data.
-
Visible Measures Corporation uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.
-
VK Solutions
W
-
Web Alliance
- We use Hadoop for our internal search engine optimization (SEO) tools. It allows us to store, index, and search data much faster.
- We also use it for logs analysis and trends prediction.
-
WorldLingo
- Hardware: 44 servers (each server has: 2 dual core CPUs, 2TB storage, 8GB RAM)
- Each server runs Xen with one Hadoop/HBase instance and another instance with web or application servers, giving us 88 usable virtual machines.
- We run two separate Hadoop/HBase clusters with 22 nodes each.
- Hadoop is primarily used to run HBase and Map/Reduce jobs scanning over the HBase tables to perform specific tasks.
- HBase is used as a scalable and fast storage back end for millions of documents.
- Currently we store 12 million documents, with a target of 450 million in the near future.
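As a rough sketch of the HBase document-store pattern described above (not WorldLingo's actual schema; the table name, column families, and use of the happybase Thrift client are assumptions):

```python
# Hypothetical example: storing and fetching a document in HBase through
# the Thrift gateway using the happybase client. All names are made up.
import happybase

connection = happybase.Connection("hbase-thrift-host")   # assumed host
table = connection.table("documents")

# Store one document row, keyed by a document id.
table.put(b"doc-000042", {
    b"meta:source_lang": b"en",
    b"meta:target_lang": b"de",
    b"content:body": b"Original document text ...",
})

# Fetch it back.
row = table.row(b"doc-000042")
print(row[b"content:body"])
```

This sketch assumes an HBase Thrift server is reachable and that the column families already exist on the "documents" table.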
X
Y
-
Yahoo!
- More than 100,000 CPUs in >36,000 computers running Hadoop
- Our biggest cluster: 4000 nodes (2x4-core boxes with 4x1TB disks and 16GB RAM)
- Used to support research for Ad Systems and Web Search
- Also used to do scaling tests to support development of Hadoop on larger clusters
- Our Blog - Learn more about how we use Hadoop.
- >60% of Hadoop jobs within Yahoo are Pig jobs.
Z
-
Zvents
- 10 node cluster (Dual-Core AMD Opteron 2210, 4GB RAM, 1TB/node storage)
- Run Naive Bayes classifiers in parallel over crawl data to discover event information
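As a self-contained illustration of the Naive Bayes classification step mentioned above (a toy sketch using scikit-learn, unrelated to Zvents' actual pipeline and training data):

```python
# Toy Naive Bayes text classifier that flags crawled pages which look like
# event listings. Training snippets and labels are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_pages = [
    "live concert tickets friday doors open 8pm",
    "annual street festival parade schedule and venue map",
    "privacy policy and terms of service for this website",
    "contact our sales team for enterprise pricing",
]
labels = ["event", "event", "not_event", "not_event"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_pages)

classifier = MultinomialNB()
classifier.fit(X, labels)

new_page = ["symphony concert saturday night, doors open at 8pm downtown"]
print(classifier.predict(vectorizer.transform(new_page)))  # expected: ['event']
```

In a Hadoop setting, the vocabulary and per-class word counts would typically be aggregated in a map-reduce pass, and the trained model would then be applied to each crawled page in parallel.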
When applicable, please include details about your cluster hardware and size.
From: http://wiki.apache.org/hadoop/PoweredBy