python - How should I optimize this filesystem I/O bound program?
I have a Python program that does something like:
1. Read a line from the input file.
2. Make some changes to it.
3. Break it up into the actual rows as they will be written to the database.
4. Write those rows to individual CSV files.
5. Go back to step 1 until the file has been completely read.
6. Run SQL*Loader and load those files into the database.
Step 6 is not actually taking much time. Step 4 seems to take up most of the time. For the most part, I want to optimize this to handle a set of records in the millions, running on a quad-core server with a RAID setup of some kind.
I have a few ideas for solving this:
1. Read the whole file from step 1 at once (or at least read it in very large chunks) and write the output to disk as a whole or in very large chunks. The idea is that the hard disk would spend less time seeking back and forth between files. Would this do anything that buffering would not?
2. Parallelize steps 1, 2 and 3, and 4 into separate processes, so that steps 1, 2 and 3 do not have to wait for step 4 to complete.
3. Break the load file into separate chunks and process them in parallel. The rows do not need to be handled in any sequential order. This would likely need to be combined with idea 2 somehow.
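The first idea above could be sketched roughly as follows. This is only an illustration under assumptions: `transform`, `flush`, `process_in_batches`, the even/odd routing rule, and the `batch_size` default are all hypothetical stand-ins, not the actual program. Rows are collected in memory per output file and appended in large batches, so each output file is touched once per batch rather than once per row. Whether this beats ordinary buffering would still have to be measured.

```python
import csv
import os
from collections import defaultdict


def transform(row):
    # Hypothetical stand-in for steps 2 and 3: returns a list of
    # (target_name, output_row) pairs for one input row.
    target = "even" if int(row[0]) % 2 == 0 else "odd"
    return [(target, row)]


def flush(pending, output_dir):
    # Append every buffered batch to its own CSV file, then empty the buffers.
    for target, rows in pending.items():
        path = os.path.join(output_dir, target + ".csv")
        with open(path, "a", newline="") as dst:
            csv.writer(dst).writerows(rows)
    pending.clear()


def process_in_batches(input_path, output_dir, batch_size=100_000):
    # batch_size is an assumed tuning knob: the number of output rows
    # held in memory before they are written out.
    pending = defaultdict(list)
    buffered = 0
    with open(input_path, newline="") as src:
        for row in csv.reader(src):
            for target, out_row in transform(row):
                pending[target].append(out_row)
                buffered += 1
            if buffered >= batch_size:
                flush(pending, output_dir)
                buffered = 0
    # Write whatever is left after the input is exhausted.
    flush(pending, output_dir)
```

Because the rows do not need to stay in order, the same function could in principle be run on separate chunks of the input, as long as each run writes to its own output directory.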
Of course, the correct answer to this question is "do whatever you find to be fastest by testing." However, I am mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?
Python already buffers its I/O, and the OS should take care of deferring writes until it needs the RAM for something else or becomes uneasy about holding dirty data in RAM for too long. That holds unless you force the OS to write immediately, for example by closing the file after each write or by opening the file in O_SYNC mode.
If the OS is not doing the right thing, you can try raising the buffer size (the third parameter to open()). For some guidance on reasonable values: on an I/O system with 100 MB/s of bandwidth and 10 ms of latency, a 1 MB I/O size results in roughly 50% latency overhead, while a 10 MB I/O size results in about 9% overhead. If the program is still I/O bound, you probably just need more bandwidth.
It is also useful to check whether step 4 is spending its time executing or waiting on I/O. If it is executing, you will need to spend more time finding which part is the culprit and optimize that, or split the work into separate processes.
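The overhead figures above follow from comparing the fixed latency with the time spent actually transferring data. Here is a minimal sketch of that arithmetic, together with one way of requesting a larger buffer. The helper names `latency_overhead` and `open_with_large_buffer` and the 1 MB default are illustrative assumptions, not part of any library.

```python
def latency_overhead(io_size_bytes, bandwidth_bytes_per_s=100e6, latency_s=0.010):
    # Fraction of each I/O operation spent on latency rather than on
    # transferring data: latency / (latency + transfer time).
    transfer_s = io_size_bytes / bandwidth_bytes_per_s
    return latency_s / (latency_s + transfer_s)


print(f"{latency_overhead(1e6):.0%}")   # 1 MB I/O size  -> 50%
print(f"{latency_overhead(10e6):.0%}")  # 10 MB I/O size -> 9%


def open_with_large_buffer(path, buffer_size=1_000_000):
    # The third parameter of open() is the buffer size in bytes
    # (assumed 1 MB here; tune it by measurement).
    return open(path, "w", buffering=buffer_size)
```

With a larger buffer, Python hands data to the OS in fewer and bigger writes. The OS may still cache and reorder them, so the effect on real throughput has to be measured.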