Home > Uncategorized > weekly report 8 – Barış Akşanlı

weekly report 8 – Barış Akşanlı

After finishing the implementation of the linked list version of our simulation algorithm, we also finished the parallel implementation of this version. Then we performed some tests on AIGER benchmarks, and we saw performance increase in the parallel version. However, after performing on some specific test cases, we realized that our simulation logic dow not always give the correct answer. So, we started to investigate the reason that causes this problem. And we saw that there some cases in which our simulation logic does not behave correct.

After this observation, we began to investigate the simulation code of AIGER itself and achieved to understand the specific structure of AIGER itself. Then we rewrote our simulation algorithm in correspondence with this simulation algorithm of AIGER itself. After finishing the implementation, we saw that the new sequential code works really fast. So this has explained the reason why AIGER simulation works quite fast even for big input data.

After seeing that the old global level-local level logic is wrong, we constructed a new parallelism method for the newly generated sequential algorithm. This method again includes a level based simulation. The base for level is required to construct the necessary independence for parallelism. Then we generated two distinct kernel functions, one for simulating the and gates and the other for simulating the latches.

We planned to perform the and gate simulation in one kernel. And obtained good timing results at first. However, when we wanted to validate the answers, we saw that there are some inconsistency between the AIGER answers and ours. So, we debugged our code and realized that the technique we used for block synchronization is not working as we want. This made us to make a detailed search about the block synchronization in CUDA and the fact that we learned is that there is no way to synchronize the blocks in CUDA. The only way to synchronize the blocks is to divide the kernel into multiple parts so that the block synchronization can be handled implicitly by CUDA architecture.

Then we divided our kernel into different parts and we were able to validate the correctness of the answers in parallel execution. However, in this time the time efficiency decreased dramatically, such that we obtained 4 times worse results. We tried to use some properties of CUDA architecture in order to increase the performance, however the time efficiency did not improve as we want.

So, this situation made us to develop another algorithm for parallel simulation. Since we did not want to divide the kernel into multiple parts, we should construct the independence between the blocks(as we can not synchronize them manually). This method includes the duplication of the gates that shall be used by more than one block. In this method, we will generate more blocks and use them efficiently.

Next Goal: We will start to implement the new parallel algorithm and test it in terms of correctness and time efficiency.

Barış Akşanlı

Advertisement
Categories: Uncategorized
  1. No comments yet.
  1. No trackbacks yet.
You must be logged in to post a comment.
Follow

Get every new post delivered to your Inbox.