Processing 3D Data Using the Python Multiprocessing Library

Large amounts of data reveal problems that require creative approaches. Fortunately, Python and its extensive set of libraries can help.

Today we'll cover tools that are very handy when working with large amounts of data. I'm not going to repeat general information you can find in the manuals, but rather share some small tricks I've discovered, such as using tqdm with multiprocessing's imap, working with archives in parallel, plotting and processing 3D data, and searching for a similar object among meshes when you have a point cloud. So why should we resort to parallel computing? Nowadays, if you work with any kind of data, you may face problems related to "big data". Whenever the data doesn't fit in RAM, we need to process it piece by piece. Fortunately, modern programming languages allow us to spawn multiple processes (or even threads) that work perfectly on multi-core processors. (NB: that doesn't mean single-core processors cannot handle multiprocessing; there is a Stack Overflow thread on that topic.)
Today we'll try our hand at a frequently occurring 3D computer vision task: computing distances between a mesh and a point cloud. You might face this problem, for example, when you need to find, among all available meshes, the one that defines the same 3D object as a given point cloud.

Our data consists of .obj files stored in a .7z archive, which is great in terms of storage efficiency, but accessing an exact portion of it takes some effort. Here I define a class that wraps the 7-zip archive and provides an interface to the underlying data. The class relies heavily on the py7zlib package, which lets us decompress data each time we call the get method and tells us the number of files inside the archive. We also define __iter__, which will let us start a multiprocessing map on the object as an iterable.

As you may know, it is possible to write a Python class from which one can instantiate iterable objects. Such a class must meet the following condition: it overrides __iter__ to return self and __next__ to return the following element. And we definitely follow this rule here.

The definition above gives us the possibility to iterate over the archive, but does it allow random access to the contents in parallel? It's an interesting question to which I haven't found an answer online, but we can study the source code of py7zlib and try to answer it ourselves. Here I provide reduced snippets of the code from pylzma. In the code you can see the methods that are called while reading the next object from the archive, and I believe it is clear from them that there is no reason for the archive to be blocked when it is read multiple times simultaneously.

Next, let's quickly introduce meshes and point clouds. First, meshes are sets of vertices, edges, and faces. Vertices are defined by (x, y, z) coordinates in space and assigned unique numbers.
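The iterator protocol described above can be sketched with a minimal, self-contained class. Here a plain list of byte blobs stands in for the py7zlib-backed archive (the class name MeshArchive and the get method are illustrative, not the article's exact code):

```python
class MeshArchive:
    """Stand-in for a 7z-backed archive: iterable plus random access.

    In the article the data would come from py7zlib.Archive7z; here a
    plain in-memory list of byte blobs keeps the sketch self-contained.
    """

    def __init__(self, blobs):
        self._blobs = list(blobs)
        self._pos = 0

    def __len__(self):
        # Number of files inside the archive.
        return len(self._blobs)

    def get(self, index):
        # Random access; with py7zlib this call would decompress one member.
        return self._blobs[index]

    def __iter__(self):
        # Returning self makes the object its own iterator, which is
        # exactly what Pool.imap needs to consume it lazily.
        self._pos = 0
        return self

    def __next__(self):
        if self._pos >= len(self._blobs):
            raise StopIteration
        blob = self._blobs[self._pos]
        self._pos += 1
        return blob


archive = MeshArchive([b"bunny", b"dragon", b"buddha"])
print(list(archive))     # iterates all members in order
print(archive.get(1))    # random access to a single member
```

Because __iter__ resets the position and returns self, the same object can be handed straight to a multiprocessing map without first being expanded into a list.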
Edges and faces are groups of point pairs and triplets, respectively, defined by those unique point ids. Commonly, when we say "mesh" we mean a "triangular mesh", i.e. a surface consisting of triangles. Working with meshes in Python is much easier with the trimesh library; for example, it provides an interface to load .obj files into memory. To display and interact with 3D objects in a Jupyter notebook, one can use the k3d library. So, with the following code snippet, I answer the question: "How do you plot a trimesh object in Jupyter with k3d?"

Stanford Bunny mesh displayed by k3d

Second, point clouds are arrays of 3D points that represent objects in space. Many 3D scanners produce point clouds as a representation of the scanned object. For demonstration purposes, we can read the same mesh and display its vertices as a point cloud.

Point cloud drawn by k3d

As mentioned above, a 3D scanner gives us a point cloud. Let's assume we have a database of meshes and want to find the mesh in our database that is aligned with the scanned object, i.e. the point cloud. To address this problem, we can suggest a naïve approach: search for the largest distance between the points of the given point cloud and each mesh from our archive, and if that distance is less than 1e-4 for some mesh, consider that mesh aligned with the point cloud.

Finally, we've come to the multiprocessing section. Remember that our archive holds plenty of files that might not fit in memory together, which is why we prefer to process them in parallel. To achieve that we'll use a multiprocessing Pool, which handles multiple calls of a user-defined function via its map or imap/imap_unordered methods. The difference between map and imap that affects us is that map converts the iterable to a list before sending it to the worker processes. If an archive is too big to fit in RAM, it shouldn't be unpacked into a Python list.
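The naïve alignment test can be sketched in a few lines of NumPy. The article uses igl to measure true point-to-triangle distances; the sketch below approximates that with vertex-to-vertex distances, which is cruder but shows the normalize-then-threshold logic. The function names and the synthetic data are illustrative assumptions, not the article's code:

```python
import numpy as np


def normalize_unit_cube(pts):
    """Scale and translate points to fit in the unit cube, as the article
    does before comparing a cloud with a mesh."""
    pts = np.asarray(pts, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return (pts - lo) / (hi - lo).max()


def max_nearest_distance(cloud, verts):
    """Largest distance from any cloud point to its nearest mesh vertex.
    (igl would measure exact point-to-triangle distance instead.)"""
    # Pairwise distance matrix of shape (n_cloud, n_verts).
    d = np.linalg.norm(cloud[:, None, :] - verts[None, :, :], axis=-1)
    return d.min(axis=1).max()


# Synthetic "mesh vertices" and a simulated scan: same points plus tiny noise.
verts = normalize_unit_cube(np.random.default_rng(0).random((200, 3)))
cloud = verts + 1e-5
print(max_nearest_distance(cloud, verts) < 1e-4)  # aligned under the 1e-4 threshold
```

A mesh of a different object would produce a much larger maximum distance and fail the 1e-4 test, which is what lets the naïve approach pick the matching mesh out of the archive.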
In other words, the execution speed of both is similar. Above you see the results of simply reading from an archive of meshes that fit in memory. Moving further with imap: let's discuss how to accomplish our goal of finding the mesh closest to the point cloud. Here is the data: we have 5 different meshes from the Stanford models. We'll simulate a 3D scan by adding noise to the vertices of the Stanford bunny mesh. Of course, we first normalize the point cloud and the mesh vertices to scale them into a 3D unit cube. To compute distances between a point cloud and a mesh we'll use igl. To finish, we need to write the function that will be called in each process, along with its dependencies. Let's sum up with the following snippet, where read_meshes_get_distances_pool_imap is the central function in which the following is done. Note how we pass arguments to imap by creating a new iterable from archive and point_cloud using zip(archive, itertools.repeat(point_cloud)). That allows us to attach the point-cloud array to each entry of the archive while avoiding converting the archive to a list. The result of the execution looks like this: we can eyeball that the Stanford bunny is the closest mesh to the given point cloud. We do not use a large amount of data here, but we've shown that this solution would work even with an extensive number of meshes inside an archive.

Multiprocessing allows data scientists to achieve great performance not only in 3D computer vision but also in other fields of machine learning. It is very important to understand that parallel execution is much faster than execution within a loop, and the difference becomes significant when an algorithm is written correctly. Large amounts of data reveal problems that can't be addressed without creative approaches to using limited resources. Fortunately, the Python language and its extensive set of libraries help us data scientists solve such problems.
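The zip-with-repeat argument-passing pattern can be shown with a toy worker. Here a length-difference "distance" stands in for the real igl point-to-mesh computation, and the function names are illustrative, not the article's read_meshes_get_distances_pool_imap:

```python
import itertools
from multiprocessing import Pool


def distance_to_cloud(args):
    """Worker: unpack one (mesh, cloud) pair. The toy length difference
    stands in for the real igl point-to-mesh distance."""
    mesh, cloud = args
    return abs(len(mesh) - len(cloud))


def distances_to_cloud(archive, point_cloud, processes=2):
    # zip + itertools.repeat lazily attaches the same point cloud to every
    # archive entry, so the archive is never materialized as a list
    # (which Pool.map would do before dispatching work).
    with Pool(processes) as pool:
        return list(pool.imap(distance_to_cloud,
                              zip(archive, itertools.repeat(point_cloud))))


if __name__ == "__main__":
    meshes = ["bunny!!", "dragon", "teapot####"]
    print(distances_to_cloud(meshes, "scan"))  # [3, 2, 6]
```

Because both zip and itertools.repeat are lazy, each worker receives one (mesh, cloud) tuple at a time, and imap preserves the input order, so the smallest result can be traced back to the matching archive entry.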
Published at DZone with permission of Emil Bogomolov. See the original article here. Opinions expressed by DZone contributors are their own.