java - Is Hadoop right for running my simulations?
I have written a stochastic simulation in Java, which loads data from a few CSV files on disk (about 100MB total) and writes its results to another output file (not much data, just a boolean and a few numbers). There is also a parameter file, and for different parameters the distribution of the simulation outputs would be expected to change. To determine the correct/best input parameters, I need to run multiple simulations across multiple input parameter configurations and look at the distribution of the outputs in each group. Each simulation takes 0.1-10 minutes, depending on the parameters and randomness.
I have been reading about Hadoop and wondering whether it could help me run lots of my simulations; in the near future I may have access to about 8 networked desktop machines. If I understand correctly, the map function could run my simulation and emit the result, and the reducer could be the identity.
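Concretely, I imagine something like this minimal sketch, where each input line is one parameter configuration and the reducer is simply left unset so Hadoop falls back to its default identity Reducer (Simulation.run is a hypothetical stand-in for my existing Java simulation code):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text paramLine, Context context)
                throws IOException, InterruptedException {
            // Each input line holds one parameter configuration.
            String params = paramLine.toString();
            // Run one simulation for this configuration; Simulation.run is a
            // hypothetical wrapper around my existing Java simulation code.
            String result = Simulation.run(params); // e.g. "true,3.14,42"
            // Key the output by the configuration so results group naturally.
            context.write(new Text(params), new Text(result));
        }
    }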
The thing I am worried about is HDFS, which seems meant for huge files, not a smattering of small CSV files (none of which comes anywhere near the recommended minimum file size of 64MB, even combined). Furthermore, each simulation would only need an identical copy of each of the CSV files.
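If my reading of the docs is right, the driver could push an identical, read-only copy of each CSV to every node through Hadoop's distributed cache, so the only map input HDFS ever sees is the tiny parameter file. A sketch of what I have in mind, assuming the newer org.apache.hadoop.mapreduce API (all paths are hypothetical):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SimulationDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "parameter-sweep");
            job.setJarByClass(SimulationDriver.class);
            job.setMapperClass(SimulationMapper.class);
            // No reducer class set: Hadoop's default identity Reducer runs.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Distribute identical copies of the CSVs to every node; the
            // "#name" fragment symlinks each file into the task's working
            // directory, so the simulation can open it as a local file.
            job.addCacheFile(new URI("hdfs:///sim/data/input1.csv#input1.csv"));
            job.addCacheFile(new URI("hdfs:///sim/data/input2.csv#input2.csv"));

            // The map input is just the small parameter file, one
            // configuration per line.
            FileInputFormat.addInputPath(job, new Path("hdfs:///sim/params.txt"));
            FileOutputFormat.setOutputPath(job, new Path("hdfs:///sim/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Since a single run takes 0.1-10 minutes, I would presumably also want NLineInputFormat, so that each map task gets one parameter line and the configurations spread evenly across the 8 machines.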
Is Hadoop the wrong tool for me?
I see a number of answers here that basically say, "No, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short-sighted view, akin to someone saying in 1985, "You can't use a PC for word processing; PCs are for spreadsheets!"
Hadoop is a fantastic framework for building a simulation engine. I have been using it for this purpose for months and have had great success with small-data / large-compute problems. Here are the top 5 reasons I migrated to Hadoop for simulation (using R as my simulation language, BTW):
- Access: I can lease Hadoop clusters through Amazon Elastic MapReduce, so I don't have to invest any time and energy in administering a cluster. This meant I could actually start doing simulations on a distributed framework without needing administrative approval in my organization!
- Administration: Hadoop handles job-control issues, such as node failure, invisibly; I don't have to code for these conditions. If a node fails, Hadoop makes sure the simulations scheduled for that node are run on another node.
- Upgradable: Being a rather generic map-reduce engine with a great distributed file system, if you later have problems that involve large data and you are used to Hadoop, you don't have to migrate to a new solution. Hadoop gives you a simulation platform that will also scale (nearly for free) into a large-data platform!
- Support: Being open source and used by so many companies, the resources for Hadoop, both online and off, are numerous. Most of them are written with "big data" in mind, but they are still useful for learning to think in a map-reduce way.
- Portability: I built analyses on top of proprietary engines using proprietary tools, which took considerable learning to get working. When I later changed jobs and found myself at a firm without the same proprietary stack, I had to learn a new set of tools and a new simulation stack. Never again. I traded in SAS for R and our old grid framework for Hadoop. Both are open source, and I know I can land at any job in the future and immediately have the tools at my fingertips to start kicking ass.
If you are interested in hearing more about my experiences using Hadoop with R, here is a presentation I gave in May 2010: